Speaker
Description
Batch@CERN team manages over 250k CPU cores for Batch processing of LHC data with our HTCondor cluster comprising ~5k nodes. We will present a lifecycle management solution of our systems with our in-house developed state-manager daemon BrainSlug and how it handles draining, rebooting, interventions and other actions on the worker nodes, with Rundeck as our human-interaction endpoint and using StackStorm for automated procedures of remediating minor alarms, health-checks and enabling an overall self-healing infrastructure. We will demonstrate how these processes and reduced the manual overhead of handling daily operations by a 10x factor with StackStorm enabled workflows, and how we can enable operators to schedule and manage interventions while having granular control on the actions exposed to such operators.
| Consider for long presentation | Yes |
|---|