
May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Management of Batch worker node lifecycle and maintenance at CERN with BrainSlug (our in-house state-manager daemon), Rundeck and StackStorm

May 11, 2023, 3:00 PM
Marriott Ballroom IV (Norfolk Waterside Marriott)

235 East Main Street Norfolk, VA 23510
Oral
Track 7 - Facilities and Virtualization


Mr Jones, Ben (CERN)


The Batch@CERN team manages over 250k CPU cores for batch processing of LHC data with an HTCondor cluster comprising ~5k nodes. We will present a lifecycle management solution for these systems built around our in-house state-manager daemon, BrainSlug, and show how it handles draining, rebooting, interventions and other actions on the worker nodes. Rundeck serves as our human-interaction endpoint, and StackStorm handles automated procedures for minor-alarm remediation and health checks, enabling an overall self-healing infrastructure. We will demonstrate how these processes, with StackStorm-enabled workflows, reduced the manual overhead of handling daily operations by a factor of 10, and how operators can schedule and manage interventions while we keep granular control over the actions exposed to them.
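
As an illustration only, the worker-node lifecycle described above (draining, rebooting, interventions) can be pictured as a small state machine. The sketch below is a hypothetical model, not BrainSlug's actual design: the state names, the transition table, and the `WorkerNode` and `drain_and_reboot` helpers are all assumptions made for this example, and the HTCondor, Rundeck and StackStorm integrations are left out entirely.

```python
from enum import Enum


class NodeState(Enum):
    # Hypothetical lifecycle states; the real daemon may use different ones.
    IN_PRODUCTION = "in_production"
    DRAINING = "draining"
    DRAINED = "drained"
    REBOOTING = "rebooting"
    INTERVENTION = "intervention"


# Assumed transition table: which states may follow which.
ALLOWED = {
    NodeState.IN_PRODUCTION: {NodeState.DRAINING},
    NodeState.DRAINING: {NodeState.DRAINED, NodeState.IN_PRODUCTION},
    NodeState.DRAINED: {
        NodeState.REBOOTING,
        NodeState.INTERVENTION,
        NodeState.IN_PRODUCTION,
    },
    NodeState.REBOOTING: {NodeState.IN_PRODUCTION, NodeState.INTERVENTION},
    NodeState.INTERVENTION: {NodeState.REBOOTING, NodeState.IN_PRODUCTION},
}


class WorkerNode:
    """Tracks the lifecycle state of one (hypothetical) worker node."""

    def __init__(self, name):
        self.name = name
        self.state = NodeState.IN_PRODUCTION
        self.history = [self.state]

    def transition(self, target):
        # Reject any move that the transition table does not list.
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.value} -> {target.value} not allowed"
            )
        self.state = target
        self.history.append(target)


def drain_and_reboot(node, running_jobs):
    """Advance a node through drain and reboot.

    running_jobs is an assumed input: the number of jobs still on the node.
    While jobs remain, the node stays in DRAINING and the caller retries later.
    """
    if node.state is NodeState.IN_PRODUCTION:
        node.transition(NodeState.DRAINING)
    if running_jobs > 0:
        return node.state
    node.transition(NodeState.DRAINED)
    node.transition(NodeState.REBOOTING)
    node.transition(NodeState.IN_PRODUCTION)
    return node.state
```

In this sketch, calling `drain_and_reboot(node, running_jobs=3)` leaves the node in `DRAINING`, and a later call with `running_jobs=0` walks it through `DRAINED` and `REBOOTING` back to `IN_PRODUCTION`. Keeping the allowed transitions in one table is one simple way to limit which actions an operator-facing tool can trigger from a given state.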

Consider for long presentation: Yes
