ALICE is one of the four large experiments at the CERN LHC. It is designed to study the structure and origins of matter in collisions of heavy ions (and protons) at ultra-relativistic energies. The experiment measures the particles produced in collisions at its center so that it can reconstruct and study the evolution of the system created in these collisions. To perform these measurements, many different sub-detectors are combined to form the experimental apparatus, each providing specific information. The ALICE Collaboration is composed of 2,000 members from over 175 physics institutes in 39 countries. In addition, numerous computing resources are available to researchers to collect, process, and analyze the data gathered by the experiment.
The ALICE experiment at CERN started its LHC Run 3 in 2022 with an upgraded detector and an entirely new data acquisition (DAQ) system, capable of collecting 100 times more events than the previous setup. One of the key elements of the new DAQ is the Event Processing Nodes (EPN) farm, which currently comprises 250 servers, each equipped with 8 AMD MI50 GPU accelerators. The role of the EPN farm is to perform lossy compression of the detector data, from approximately 600 GB/s to 100 GB/s, during the heavy-ion data taking period. The 100 GB/s stream is written to an 80 PB EOS disk buffer for further offline processing. The EPNs handle data streams of 10 ms duration from the detector, called Time Frames, independently from each other and write the output, called Compressed Time Frames (CTFs), to a local disk. The CTFs must be removed from this local buffer as soon as the compression is completed, to free the local disk for the next data. In addition to the CTFs, the EPNs process calibration data, which is also written to the local node storage and must be transferred to persistent storage rapidly. The data transfer is handled by the EPN2EOS system, which, in addition to copying the data, registers the CTFs and calibration data in the ALICE Grid catalogue. EPN2EOS is highly optimized to perform the copy and registration functions outside of the EPN data compression times. It is also capable of redirecting data streams to an alternative storage system in case of network interruptions or unavailability of the primary storage, and it has an extensive monitoring and messaging system that presents ALICE operations with real-time alerts in case of problems. The system has been in production since November 2021. In this paper, we describe its architecture and implementation, and analyze its first year of utilization.
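The rates quoted above imply a few derived figures that help put the transfer requirements in perspective. The following is a rough back-of-the-envelope sketch, not part of EPN2EOS. It assumes decimal units (1 PB = 10^6 GB), contiguous Time Frames with no gaps, an even load across all 250 nodes, and a buffer that fills at the full sustained rate with nothing deleted. The function and variable names are illustrative only.

```python
# Figures quoted in the abstract (decimal units assumed: 1 PB = 1e6 GB).
INPUT_RATE_GB_S = 600.0    # approximate detector data rate into the EPN farm
OUTPUT_RATE_GB_S = 100.0   # compressed rate written to the EOS buffer
BUFFER_PB = 80.0           # EOS disk buffer capacity
TIME_FRAME_S = 0.010       # duration of one Time Frame (10 ms)
N_NODES = 250              # number of EPN servers


def derived_figures():
    """Return rough estimates that follow from the quoted numbers.

    Assumptions (not stated in the source): Time Frames are contiguous,
    the load is spread evenly over all nodes, and the buffer is filled at
    the full output rate without any data being removed.
    """
    time_frames_per_s = 1.0 / TIME_FRAME_S
    buffer_fill_s = BUFFER_PB * 1e6 / OUTPUT_RATE_GB_S
    return {
        "compression_factor": INPUT_RATE_GB_S / OUTPUT_RATE_GB_S,
        "time_frames_per_s": time_frames_per_s,
        "mean_ctf_size_gb": OUTPUT_RATE_GB_S / time_frames_per_s,
        "per_node_output_gb_s": OUTPUT_RATE_GB_S / N_NODES,
        "buffer_fill_days": buffer_fill_s / 86400.0,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the compression factor is about 6, each node has to move roughly 0.4 GB/s off its local disk, and the 80 PB buffer would hold a little over nine days of continuous heavy-ion data.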