May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Managing remote cloud resources for multiple HEP VO’s with cloudscheduler

May 11, 2023, 12:30 PM
15m
Marriott Ballroom II-III (Norfolk Waterside Marriott)
235 East Main Street, Norfolk, VA 23510
Oral, Track 4 - Distributed Computing

Speaker

Ebert, Marcus (University of Victoria)

Description

Cloudscheduler is a system that manages resources on local and remote compute clouds and makes them available to HTCondor pools. It examines the resource needs of idle jobs, then starts virtual machines (VMs) sized to match those needs on allowed clouds with available resources. Using yaml files, cloudscheduler provisions the VMs during the boot process with the tools needed to register with HTCondor and run the experiment's jobs.

Although we have run the first version of cloudscheduler successfully for ATLAS and Belle-II workloads for more than ten years, we developed cloudscheduler version 2 (CSV2), a complete overhaul and modernization of the system. We published the technical design of CSV2 in 2019; many features have been added since then, and the system is now used successfully in production for Belle-II, ATLAS, DUNE, and BaBar. In addition to using CSV2 as a WLCG grid site, we also run it as a service for other WLCG grid sites, and the Canadian Advanced Network for Astronomical Research (CANFAR) group uses its own instance of CSV2 for its astronomy workload.

In this talk, we report on our experience in operating CSV2 for the different experiments' jobs from both a user's and an administrator's point of view, running on up to 10,000 cores across all experiments and clouds in North America, Australia, and Europe. We will also report on how to correctly account for the resource usage in the APEL system. CSV2 can be used with its own HTCondor system, but it can also extend an existing HTCondor system with cloud resources, for example in times of high demand for batch computing resources. We will detail how projects can be created and integrated with an existing or new HTCondor system, and how the monitoring works. We will also report on the integration of different clouds, as well as on the integrated opportunistic system. CSV2's integrated opportunistic system allows the same cloud to be used by different experiments, giving one experiment preferred use while letting others make temporary use of idle resources. In addition, we report on how we worked with different cloud administrators to allow opportunistic use of idle cloud resources, managed by the cloud administrators through cloud metadata.
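To make the job-to-VM matching step concrete, the following is a minimal Python sketch of how an idle job's resource request could be matched to the smallest suitable VM flavor on an allowed cloud. The names Flavor, IdleJob, and pick_flavor are hypothetical and do not reflect CSV2's actual code or APIs; this is only an illustration of the idea described above, not the implementation.

    # Minimal, illustrative sketch only -- not CSV2 code. The names Flavor,
    # IdleJob, and pick_flavor are hypothetical.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Flavor:            # a VM size offered by a cloud
        name: str
        cores: int
        ram_mb: int

    @dataclass
    class IdleJob:           # resource request taken from an idle HTCondor job
        cores: int
        ram_mb: int

    def pick_flavor(job: IdleJob, flavors: List[Flavor]) -> Optional[Flavor]:
        """Return the smallest flavor that satisfies the job's request, if any."""
        fitting = [f for f in flavors
                   if f.cores >= job.cores and f.ram_mb >= job.ram_mb]
        return min(fitting, key=lambda f: (f.cores, f.ram_mb), default=None)

    # Example: a 4-core / 8 GB job is matched to the smallest suitable flavor.
    flavors = [Flavor("c2-small", 2, 4096), Flavor("c8-large", 8, 16384)]
    chosen = pick_flavor(IdleJob(cores=4, ram_mb=8192), flavors)
    print(chosen.name if chosen else "no suitable flavor")   # -> c8-large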

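As an illustration of the opportunistic idea, the sketch below gives one experiment preferred use of a cloud's cores and lets other experiments fill whatever remains idle. The allocate function and its parameters are assumptions made for this example only and are not part of CSV2.

    # Minimal, illustrative sketch only -- not CSV2 code. The allocate function
    # and its arguments are hypothetical.
    from typing import Dict

    def allocate(total_cores: int, requests: Dict[str, int],
                 preferred: str) -> Dict[str, int]:
        """Grant the preferred experiment its full request first, then let the
        other experiments opportunistically fill the cores that remain idle."""
        grants = {vo: 0 for vo in requests}
        grants[preferred] = min(requests.get(preferred, 0), total_cores)
        free = total_cores - grants[preferred]
        for vo, wanted in requests.items():
            if vo == preferred or free <= 0:
                continue
            grants[vo] = min(wanted, free)
            free -= grants[vo]
        return grants

    # Example: the preferred experiment gets 800 of 1000 cores; the other
    # experiment opportunistically uses the remaining 200.
    print(allocate(1000, {"atlas": 800, "belle2": 500}, preferred="atlas"))
    # -> {'atlas': 800, 'belle2': 200}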
Consider for long presentation: Yes

Primary authors

Mr Driemel, Colson (University of Victoria)
Ebert, Marcus (University of Victoria)
Dr Sobie, Randall (University of Victoria)
Dr Sullivan, Tristan (University of Victoria)
