
May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Managing remote cloud resources for multiple HEP VO’s with cloudscheduler

May 11, 2023, 12:30 PM
15m
Marriott Ballroom II-III (Norfolk Waterside Marriott)

235 East Main Street Norfolk, VA 23510
Oral, Track 4 - Distributed Computing

Speaker

Ebert, Marcus (University of Victoria)

Description

Cloudscheduler is a system that manages the resources of local and remote compute clouds and makes those resources available to HTCondor pools. It examines the resource needs of idle jobs, then starts virtual machines (VMs) sized to suit those needs on allowed clouds with available resources. Using YAML files, cloudscheduler then provisions the VMs during the boot process with all the tools needed to register with HTCondor and run the experiment's jobs. Although we have run the first version of cloudscheduler successfully for ATLAS and Belle-II workloads for more than 10 years, we developed cloudscheduler version 2 (CSV2), a complete overhaul and modernization of cloudscheduler. We published the technical design of CSV2 in 2019; many features have been added since then, and the system is used successfully in production for Belle-II, ATLAS, DUNE, and BaBar. In addition to using CSV2 as a WLCG grid site, we also run it as a service for other WLCG grid sites, and the Canadian Advanced Network for Astronomical Research (CANFAR) group uses its own instance of CSV2 for its astronomy workload.
In this talk, we report on our experience operating CSV2 for the different experiments' jobs, from both a user's and an administrator's point of view, running on up to 10,000 cores across all experiments and clouds in North America, Australia, and Europe. We will also report on how to correctly account for the resource usage in the APEL system. CSV2 can be used with its own HTCondor system, but it can also extend an existing HTCondor system with cloud resources, for example in times of high demand for batch computing resources. We will detail how projects can be created and integrated with an existing or new HTCondor system, and how the monitoring works. We will also report on the integration of different clouds, as well as on using the integrated opportunistic system.
CSV2’s integrated opportunistic system allows the same cloud to be used for different experiments, giving one experiment preferred usage and the others an opportunity to make temporary use of idle resources. In addition, we report on how we worked with different cloud administrators to allow opportunistic use of idle cloud resources, which the cloud administrators manage through cloud metadata.
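
The matching described in the abstract (idle jobs drive the start of suitably sized VMs, with one experiment preferred on a cloud and others allowed to borrow idle capacity) can be pictured with a small sketch. This is not CSV2 code and does not use its API: the names `IdleJob`, `Cloud`, `pick_cloud`, and `plan_vms`, the core and RAM fields, and the "preferred clouds first, then opportunistic clouds" order are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IdleJob:
    # Hypothetical view of an idle job's resource request.
    experiment: str
    cores: int
    ram_gb: int


@dataclass
class Cloud:
    # Hypothetical view of a cloud's currently free capacity.
    name: str
    free_cores: int
    free_ram_gb: int
    preferred_experiment: str          # experiment with preferred usage
    allow_opportunistic: bool = False  # may other experiments borrow idle capacity?


def fits(job: IdleJob, cloud: Cloud) -> bool:
    """True if a VM sized for the job fits in the cloud's free capacity."""
    return cloud.free_cores >= job.cores and cloud.free_ram_gb >= job.ram_gb


def pick_cloud(job: IdleJob, clouds: List[Cloud]) -> Optional[Cloud]:
    """Choose a cloud for the job: preferred clouds first, then opportunistic ones."""
    for cloud in clouds:
        if cloud.preferred_experiment == job.experiment and fits(job, cloud):
            return cloud
    for cloud in clouds:
        if (cloud.allow_opportunistic
                and cloud.preferred_experiment != job.experiment
                and fits(job, cloud)):
            return cloud
    return None


def plan_vms(jobs: List[IdleJob], clouds: List[Cloud]) -> List[Tuple[IdleJob, Optional[str]]]:
    """Plan one VM per idle job and reserve its capacity on the chosen cloud."""
    plan = []
    for job in jobs:
        cloud = pick_cloud(job, clouds)
        if cloud is not None:
            # Reserve the capacity so later jobs see the reduced free resources.
            cloud.free_cores -= job.cores
            cloud.free_ram_gb -= job.ram_gb
        plan.append((job, cloud.name if cloud is not None else None))
    return plan
```

In this sketch, a job whose preferred cloud is full falls through to any cloud that permits opportunistic use, and a job that fits nowhere is left unplaced (`None`). A real system would also need to reclaim opportunistic VMs when the preferred experiment needs the capacity back, which is omitted here.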

Consider for long presentation: Yes

Primary authors

Mr Driemel, Colson (University of Victoria)
Ebert, Marcus (University of Victoria)
Dr Sobie, Randall (University of Victoria)
Dr Sullivan, Tristan (University of Victoria)
