Speaker
Description
The CernVM File System (CVMFS) provides the software distribution backbone for High Energy and Nuclear Physics experiments and many other scientific communities in the form of a globally available shared software area. It has been designed for the software distribution problem of experiment software for LHC Runs 1 and 2. For LHC Run 3 and even more so for HL-LHC (Runs 4-6), the complexity of the experiment software stacks and their build pipelines is substantially larger. For instance, software is being distributed for several CPU architectures, often in the form of containers which includes base and operating system libraries, the number of external packages such as machine learning libraries has multiplied, and there is a shift from C++ to more Python-heavy software stacks that results in more and smaller files needing to be distributed. For CVMFS, the new software landscape means an order of magnitude increase of scale in key metrics such as number of files, number of system calls, and number of concurrent processes accessing the file system client. In this contribution, we report on the performance and reliability engineering on the file system client to sustain current and expected future software access load. Concretely, we show the impact of the newly designed file system cache management, including upstreamed improvements to the fuse kernel module itself, improved utilization of network links and caches (such as line optimization, prefetching, and proxy sharding), and operational improvements on network failure handling, error reporting, and integration with container runtimes. Overall, the new CVMFS client is designed to sustain applications with more than one million file lookups during startup, nodes with hundreds of cores, and thousands of concurrent processes accessing software from the file system client.
Consider for long presentation | No |
---|