Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation, and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing in a programmatic way, it would be preferable if the batch scheduling system could do that independently. In this paper, we evaluate the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool available for the GNU/Linux environment, with an emphasis on the OSG’s OSPool HTCondor setup. CRIU allows for checkpointing of the process state into a disk image, and is able to seamlessly deal with both open files and established network connections. Furthermore, it can be used for checkpointing of both traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. although there are some limitations that prevent it from being usable in all circumstances.
|Consider for long presentation||No|