Indico is back online after maintenance on Tuesday, April 30, 2024.
Please visit Jefferson Lab Event Policies and Guidance before planning your next event: https://www.jlab.org/conference_planning.

May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Using CRIU for checkpointing batch jobs

May 9, 2023, 2:15 PM
15m
Marriott Ballroom IV (Norfolk Waterside Marriott)

Marriott Ballroom IV

Norfolk Waterside Marriott

235 East Main Street Norfolk, VA 23510
Oral Track 7 - Facilities and Virtualization Track 7 - Facilities and Virtualization

Speaker

Andrijauskas, Fabio

Description

Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation, and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing in a programmatic way, it would be preferable if the batch scheduling system could do that independently. In this paper, we evaluate the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool available for the GNU/Linux environment, with an emphasis on the OSG’s OSPool HTCondor setup. CRIU allows for checkpointing of the process state into a disk image, and is able to seamlessly deal with both open files and established network connections. Furthermore, it can be used for checkpointing of both traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. although there are some limitations that prevent it from being usable in all circumstances.

Consider for long presentation No

Primary author

Co-authors

Sfiligoi, Igor Davila, Diego (University of California at San Diego) Guiang, Jonathan (UCSD) Bockelman, Brian (Morgridge Institute for Research) Thain, Greg (University of Wisconsin - Madison) Arora, Aashay (UCSD) Prof. Würthwein, Frank

Presentation materials

Peer reviewing

Paper