May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Dynamic scheduling using CPU oversubscription in the ALICE Grid

May 11, 2023, 3:15 PM
15m
Marriott Ballroom II-III (Norfolk Waterside Marriott)

235 East Main Street Norfolk, VA 23510
Oral, Track 4 - Distributed Computing

Speaker

Bertran Ferrer, Marta (CERN)

Description

The ALICE Grid is designed to perform comprehensive real-time monitoring of both jobs and execution nodes in order to maintain a continuous and consistent view of the status of the Grid infrastructure. An extensive database of historical data is available and is periodically analyzed to tune workflow and data management to optimal performance levels. When evaluated in real time, this data can trigger decisions for efficient resource management of the currently running payloads, for example to enable the execution of a higher volume of work per unit of time. In this article, we consider scenarios in which running workflows are adapted dynamically through constant interaction with the monitoring agents. The target resources are memory and CPU, with the objective of using them in their entirety while ensuring fair utilization among executing jobs.

Grid resources are heterogeneous and span different hardware generations, which means that some of them have hardware characteristics exceeding the minimum required to execute ALICE jobs. Our middleware, JAliEn, allocates 2 GB of RAM per job (allowing up to 8 GB when swap is included). Many worker nodes have higher memory-per-core ratios than these basic limits, so in terms of available memory they have free resources to accommodate extra jobs. Running jobs behave differently and use resources unequally depending on their nature: for example, analysis tasks are I/O bound while Monte Carlo tasks are CPU intensive. Running additional jobs with complementary resource usage patterns on a worker node has great potential to increase its total efficiency. This paper presents the methodology to exploit the different resource usage profiles by oversubscribing the executing nodes with extra jobs, taking into account their CPU usage levels and memory capacity.
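
As a rough illustration of the idea, the following Python sketch estimates how many extra jobs a worker node could accept from its memory and CPU headroom. It is not the JAliEn implementation: the `NodeStatus` fields, the `extra_job_slots` function, and the 85% CPU threshold are hypothetical names and values chosen for this example. Only the 2 GB per-job RAM figure comes from the description above, and each extra job is assumed to need one core.

```python
from dataclasses import dataclass

MEM_PER_JOB_GB = 2.0  # RAM allocated per job, as stated in the abstract


@dataclass
class NodeStatus:
    """Hypothetical snapshot of a worker node as reported by monitoring."""
    cores: int
    total_mem_gb: float
    running_jobs: int
    cpu_utilization: float  # fraction of total CPU capacity in use, 0.0 to 1.0


def extra_job_slots(node: NodeStatus, cpu_threshold: float = 0.85) -> int:
    """Estimate how many extra single-core jobs the node could take.

    The result is limited by whichever resource runs out first:
    memory not yet reserved by running jobs, or CPU capacity left
    below the (assumed) utilization threshold.
    """
    # Memory headroom: RAM not reserved by the jobs already running.
    free_mem_gb = node.total_mem_gb - node.running_jobs * MEM_PER_JOB_GB
    mem_slots = int(free_mem_gb // MEM_PER_JOB_GB) if free_mem_gb > 0 else 0

    # CPU headroom: whole cores still available below the threshold.
    idle_cores = (cpu_threshold - node.cpu_utilization) * node.cores
    cpu_slots = int(idle_cores) if idle_cores > 0 else 0

    return min(mem_slots, cpu_slots)


if __name__ == "__main__":
    # A node running I/O-bound jobs: memory to spare and half of the CPU idle.
    node = NodeStatus(cores=8, total_mem_gb=32.0, running_jobs=8, cpu_utilization=0.5)
    print(extra_job_slots(node))
```

In this sketch a node full of I/O-bound analysis jobs reports low CPU utilization and would be offered extra jobs, while a node saturated by CPU-intensive Monte Carlo jobs would get none, whatever its free memory.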
