Speaker
Description
The Batch@CERN team manages over 250k CPU cores for batch processing of LHC data in an HTCondor cluster of ~5k nodes. We will present our lifecycle management solution for these systems, built around BrainSlug, our in-house state-manager daemon, and show how it handles draining, rebooting, interventions and other actions on the worker nodes. Rundeck serves as our human-interaction endpoint, while StackStorm handles automated procedures such as remediation of minor alarms and health checks, enabling an overall self-healing infrastructure. We will demonstrate how StackStorm-enabled workflows reduced the manual overhead of daily operations by a factor of 10, and how we enable operators to schedule and manage interventions while retaining granular control over the actions exposed to them.
Consider for long presentation: Yes