Speaker
Description
Effective analysis computing requires rapid turnaround times in order to enable frequent iteration, adjustment, and exploration, leading to discovery. An informal goal of reducing 10TB of experimental data in about ten minutes using campus-scale computing infrastructure is an achievable goal, just considering raw hardware capability. However, compared to production computing, which seeks to maximize throughput at a massive scale over the timescale of weeks and months, analysis computing requires different optimizations in terms of startup latency, data locality, scalability limits, and long-tail behavior. At Notre Dame, we have developed substantial experience with running scalable analysis codes on campus infrastructure on a daily basis. Using the TopEFT application, based on the Coffea data analysis framework and the Work Queue distributed executor, we reliably process 2TB of data, 375 CPU-hours analysis codes to completion in about one hour on hundreds of nodes, albeit with a high variability due to competing system loads. The python environment needed in the compute nodes is setup and cached on the fly if needed (300MB as tarball sent to worker nodes, 1GB unpacked). In this talk, we present our analysis of the performance limits of the current system, taking into account software dependencies, data access, result generation, and fault tolerance. We present our plans for attacking the ten minute goal through a combination of hardware evolution, improved storage management, and application scheduling.
Consider for long presentation | No |
---|