26TH INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY & NUCLEAR PHYSICS (CHEP2023)

Name: 26TH INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY & NUCLEAR PHYSICS (CHEP2023)
Start: 2023-05-08T08:00:00-04:00
End: 2023-05-12T16:00:00-04:00
Location: Norfolk Waterside Marriott

US/Eastern timezone

Conference Secretariat

chep2023-secretariat@jlab.org

Dynamic scheduling using CPU oversubscription in the ALICE Grid

May 11, 2023, 3:15 PM

15m

Marriott Ballroom II-III (Norfolk Waterside Marriott)

235 East Main Street Norfolk, VA 23510

Oral, Track 4 - Distributed Computing

Speaker: Bertran Ferrer, Marta (CERN)

The ALICE Grid is designed to perform comprehensive real-time monitoring of both jobs and execution nodes in order to maintain a continuous and consistent view of the status of the Grid infrastructure. An extensive database of historical data is available and is analyzed periodically to tune workflow and data management for optimal performance. When evaluated in real time, this data can trigger decisions for efficient resource management of the currently running payloads, for example to enable the execution of a higher volume of work per unit of time. In this article, we consider scenarios in which running workflows are adapted dynamically through constant interaction with the monitoring agents. The target resources are memory and CPU, with the objective of using them fully while ensuring fair utilization among executing jobs.
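As a rough illustration of how real-time monitoring data could feed a scheduling decision, the sketch below estimates the number of idle cores on a worker node from a short window of CPU utilization samples. It is not taken from JAliEn or the ALICE monitoring agents: the function name `idle_cpu_cores`, the sample format, and the 10% safety `headroom` are all assumptions made for this example.

```python
from statistics import mean


def idle_cpu_cores(samples, total_cores, headroom=0.10):
    """Estimate how many cores sit idle over a window of monitoring samples.

    samples     -- recent node-wide CPU utilization readings, each in [0.0, 1.0]
                   (hypothetical format, not the real monitoring schema)
    total_cores -- number of cores on the worker node
    headroom    -- fraction of the node kept free as a safety margin
                   (assumed value, not an ALICE setting)
    """
    if not samples:
        return 0  # no monitoring data: make no scheduling decision
    average_used = mean(samples)
    free_fraction = max(0.0, 1.0 - average_used - headroom)
    # Round down so that the estimate never overstates the idle capacity.
    return int(free_fraction * total_cores)


# A 32-core node that has averaged 50% CPU utilization.
print(idle_cpu_cores([0.5, 0.5], total_cores=32))
```

Averaging over a window rather than reacting to a single reading is one simple way to avoid oversubscribing a node because of a momentary dip in load.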

Grid resources are heterogeneous and span different hardware generations, which means that some of them have hardware characteristics superior to the minimum required to execute ALICE jobs. Our middleware, JAliEn, allocates 2 GB of RAM per job (allowing up to 8 GB when swap is included). Many worker nodes have higher memory-per-core ratios than these basic limits, so in terms of available memory they have free resources to accommodate extra jobs. Running jobs may behave differently and use resources unequally depending on their nature. For example, analysis tasks are I/O bound, while Monte Carlo tasks are CPU intensive. Running additional jobs with complementary resource usage patterns on a worker node has great potential to increase the node's overall efficiency. This paper presents a methodology that exploits the different resource usage profiles by oversubscribing the executing nodes with extra jobs, taking into account their CPU usage levels and memory capacity.
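The memory argument above can be made concrete with a small calculation. The following is a minimal sketch, not the algorithm used by JAliEn: it takes only the 2 GB per-job RAM figure from the abstract, and the function name `extra_job_slots`, its parameters, and the rule of capping extra jobs by idle cores are assumptions made for illustration.

```python
BASE_RAM_GB = 2  # per-job RAM allocation stated in the abstract


def extra_job_slots(node_ram_gb, running_jobs, idle_cores):
    """Number of extra jobs that fit in both spare memory and idle CPU.

    node_ram_gb  -- total RAM of the worker node, in GB
    running_jobs -- jobs already running, each assumed to hold BASE_RAM_GB
    idle_cores   -- cores currently idle according to monitoring
                    (hypothetical input, e.g. from a utilization average)
    """
    spare_ram_gb = node_ram_gb - running_jobs * BASE_RAM_GB
    if spare_ram_gb <= 0 or idle_cores <= 0:
        return 0  # no headroom in at least one of the two resources
    slots_by_memory = spare_ram_gb // BASE_RAM_GB
    # The scarcer resource decides how far the node can be oversubscribed.
    return int(min(slots_by_memory, idle_cores))


# A 32-core node with 4 GB per core (128 GB), 32 running jobs and 6 idle cores:
# memory alone would allow 32 more jobs, but idle CPU limits this to 6.
print(extra_job_slots(node_ram_gb=128, running_jobs=32, idle_cores=6))
```

On a node with exactly 2 GB per core the same call returns 0, which matches the observation that only nodes with a higher memory-per-core ratio have room for extra jobs.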

Consider for long presentation: No

Author: Bertran Ferrer, Marta (CERN)

Presentation materials: DynamicSchedulingOversubscription_MartaBertran.pdf
