The increased footprint foreseen for Run-3 and HL-LHC data will soon expose
the limits of currently available storage and CPU resources. Data formats
are already optimized according to the processing chain for which they are
designed. ATLAS events are stored in ROOT-based reconstruction output files
called Analysis Object Data (AOD), which are then processed within the
derivation framework to produce Derived AOD (DAOD) files.
Numerous DAOD formats, tailored for specific physics and performance groups,
have been in use throughout the ATLAS Run-2 phase. In view of Run-3, ATLAS
has revised its Analysis Model, significantly reducing the number of
existing DAOD flavors. Two new formats, both unfiltered and skimmable on
read, have been proposed as replacements: DAOD_PHYS, designed to meet the
requirements of the majority of analysis workflows, and DAOD_PHYSLITE, a
smaller format containing already-calibrated physics objects. As ROOT-based
formats, they natively support four lossless compression algorithms: Lzma,
Lz4, Zlib and Zstd.
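
As an illustration of how a choice among these four algorithms is typically expressed, the sketch below encodes an algorithm and a level into a single integer of the form algorithm * 100 + level, which is the convention ROOT uses for its compression settings as far as we are aware. The numeric algorithm codes and the helper names (`ROOT_ALGORITHMS`, `compression_setting`, `decode_setting`) are assumptions made for this example and should be checked against the ROOT version in use; the sketch does not reflect the configuration used in this study.

```python
# Assumed ROOT algorithm codes (verify against your ROOT release).
ROOT_ALGORITHMS = {"zlib": 1, "lzma": 2, "lz4": 4, "zstd": 5}


def compression_setting(algorithm: str, level: int) -> int:
    """Encode an algorithm name and a level (0-9) as algorithm * 100 + level."""
    if not 0 <= level <= 9:
        raise ValueError("compression level must be between 0 and 9")
    try:
        code = ROOT_ALGORITHMS[algorithm.lower()]
    except KeyError:
        raise ValueError(f"unknown algorithm: {algorithm!r}") from None
    return code * 100 + level


def decode_setting(setting: int) -> tuple:
    """Split a combined setting back into (algorithm name, level)."""
    code, level = divmod(setting, 100)
    for name, value in ROOT_ALGORITHMS.items():
        if value == code:
            return name, level
    raise ValueError(f"unknown algorithm code: {code}")


if __name__ == "__main__":
    for algo, lvl in [("zlib", 1), ("lzma", 7), ("lz4", 4), ("zstd", 5)]:
        print(algo, lvl, compression_setting(algo, lvl))
```

An integer produced this way is the kind of value that would be passed as the compression argument when creating a ROOT output file.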
In this study, the effects of different compression settings on file size,
compression time, compression factor and reading speed are investigated
for both the DAOD_PHYS and DAOD_PHYSLITE formats. Both full and
partial event reading strategies have been tested. Moreover, the impact of
AutoFlush and SplitLevel, two parameters controlling how in-memory data
structures are serialized to ROOT files, has been evaluated.
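
The quantities listed above can be illustrated with a minimal, self-contained sketch. It is not the benchmark used in this study: it compresses a synthetic in-memory buffer rather than DAOD files, and it uses only the zlib and lzma codecs from the Python standard library as stand-ins (Lz4 and Zstd are omitted here because they would require third-party packages). The payload generator, the function names and the chosen compression levels are all assumptions made for the example.

```python
import lzma
import random
import struct
import time
import zlib


def make_payload(n_values: int = 50_000, seed: int = 0) -> bytes:
    """Build a synthetic, deterministic buffer of 32-bit integers.

    This is a hypothetical stand-in for event data, not a DAOD layout.
    """
    rng = random.Random(seed)
    values = [int(rng.gauss(50_000, 15_000)) for _ in range(n_values)]
    return struct.pack(f"<{n_values}i", *values)


def benchmark(name, compress, decompress, payload: bytes) -> dict:
    """Measure compressed size, compression factor and (de)compression time."""
    t0 = time.perf_counter()
    packed = compress(payload)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    if restored != payload:
        raise RuntimeError(f"{name}: round trip is not lossless")
    return {
        "codec": name,
        "compressed_bytes": len(packed),
        "compression_factor": len(payload) / len(packed),
        "compress_seconds": t1 - t0,
        "decompress_seconds": t2 - t1,
    }


def run_all(payload: bytes) -> list:
    """Run the benchmark for each stand-in codec at an arbitrary level."""
    codecs = [
        ("zlib-6", lambda b: zlib.compress(b, 6), zlib.decompress),
        ("lzma-6", lambda b: lzma.compress(b, preset=6), lzma.decompress),
    ]
    return [benchmark(n, c, d, payload) for n, c, d in codecs]


if __name__ == "__main__":
    for row in run_all(make_payload()):
        print(row)
```

Timings from a toy buffer like this one depend heavily on the machine and on the data, so they should not be read as indicative of the DAOD results; the sketch only shows how the compression factor and the timing quantities relate to one another.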
This study yields quantitative results that can serve as a guide for
making compression decisions for different ATLAS use cases. As an example,
for both DAOD_PHYS and DAOD_PHYSLITE, the Lz4 library exhibits the fastest
reading speed, but results in the largest files, whereas the Lzma algorithm
provides larger compression factors at the cost of significantly slower
reading speeds. In addition, guidelines for setting appropriate AutoFlush
and SplitLevel values are outlined.