For LHC Run3 the ALICE experiment software stack has been completely refactored, incorporating support for multicore job execution. The new multicore jobs spawn multiple processes and threads within the payload. Given that some of the deployed processes may be short-lived, accounting for their resource consumption presents a challenge. This article presents the newly developed methodology for payload execution monitoring, which correctly accounts for the resources used by all processes within the payload.
We also present a black box analysis of the new multicore experiment software framework tracing the used resources and system function calls issued by MonteCarlo simulation jobs. Multiple sources of overhead in the processes and threads lifecycle have thus been identified. This paper describes the tracing techniques and what solutions were implemented to address them. The analysis and subsequent improvements of the code have positively impacted the resource consumption and the overall turnaround time of the payloads with a notable 35% reduction in execution time for a reference production job. We also introduce how this methodology will be used to further improve the efficiency of our experiment software and what other optimization venues are currently being pursued.
|Consider for long presentation||No|