High energy physics (HEP) faces serious challenges in the coming decades because the CPU and storage resources it will need are projected to exceed what our anticipated budgets can provide. In the past, HEP has not made extensive use of high-performance computing (HPC) systems. However, the U.S. has made a long-term investment in HPCs, and they are the platform of choice for many simulation workloads and, more recently, for data processing in projects such as LIGO, the light sources, and sky surveys, as well as for many AI and ML tasks. By mid to late decade, we expect on the order of 10 exaflops of peak computing power to be available in HPCs, and an order of magnitude more in the following decade. This is at least two orders of magnitude more than HEP requires, but it would be a significant challenge for HEP experiments to use, especially since most of the cycles will be provided by accelerators such as GPUs. Can the HEP community leverage these resources to address our computational shortfalls?
The High Energy Physics Center for Computational Excellence (HEP-CCE), a three-year pilot project that started in 2020, was formed to investigate this challenge and to provide strategies for HEP experiments to make use of HPC and other massively parallel resources. HEP-CCE functions in close cooperation with the stakeholder experiments and is split into four parts. The first investigates Portable Parallelization Strategies, to exploit the massive parallelism available in GPU-enabled HPCs and to engineer portable coding solutions that allow single-source software to run on all architectures. The second tackles fine-grained I/O and the related storage issues on HPCs. This work includes enhancing the existing Darshan HPC I/O monitoring tool to handle HEP workflows and characterize those of ATLAS, CMS, and DUNE; developing an I/O mimicking framework that allows scalability studies of different I/O implementations (including ROOT and HDF5) in regimes not yet accessible to HEP production jobs; using HDF5 via ROOT serialization with parallel I/O; and investigating new data models with more performant I/O and offloading to GPU resources. The third looks at Event Generators, such as MadGraph and Sherpa, with the aim of converting them to run efficiently on GPUs. The last seeks to understand how we can map our Complex Workflows, which are very different from typical HPC workflows, onto HPC resources.
In this submission we present the results of our three-year investigation across all four domains and give an outlook, with recommendations for current and future HEP experiments, on how best to use the U.S. HPC environment.