Speaker
Description
DIRAC is a widely used “framework for distributed computing”. It builds a layer between users and resources, offering a common interface to a number of heterogeneous providers. DIRAC, like many other workload management systems, uses pilot jobs to check and configure the worker-node environment before fetching a user payload. The pilot also records a number of parameters of the jobs it runs, such as memory usage and efficiency. The log messages generated by these pilot jobs ("pilot logs") are crucial for diagnosing problems in the infrastructure; however, the retention policies for these logs vary by technology and resource provider. Transient (cloud) resources often have no space suitable for archiving these logs at all, and logs can be lost completely when the payload job crashes. Retaining pilot logs in a reliable, resource-independent manner was identified as a high-priority issue by LHCb and other communities using the DIRAC workload manager.
We implemented a configurable remote pilot logging system which can be enabled in parallel with the existing logging based on computing resources. To facilitate logging to different back-end storage systems, we use a web server which receives pilot log messages and dispatches them to configurable plugins for external services, for example Grid Storage Elements or message brokers. Since extensive remote logging for all pilots could easily overload the receiving server, the system can be configured selectively, for example by enabling remote logging only for specific Virtual Organisations and/or sites.
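The dispatch pattern described above can be illustrated with a minimal Python sketch. It is not DIRAC's actual implementation or API: every name here (`LogPlugin`, `InMemoryPlugin`, `RemoteLoggerConfig`, `LogDispatcher`, and the VO and site strings) is hypothetical, and the rule that an empty filter set means "no restriction" is an assumption made for the example. The sketch shows only the server-side logic: check whether a pilot's Virtual Organisation and site are enabled, then forward the log lines to every configured back-end plugin.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


class LogPlugin(Protocol):
    """Hypothetical back-end interface; not DIRAC's real plugin API."""

    def send(self, pilot_id: str, lines: list[str]) -> None: ...


@dataclass
class InMemoryPlugin:
    """Stand-in for a real back end such as a Storage Element or message broker."""

    stored: dict[str, list[str]] = field(default_factory=dict)

    def send(self, pilot_id: str, lines: list[str]) -> None:
        self.stored.setdefault(pilot_id, []).extend(lines)


@dataclass
class RemoteLoggerConfig:
    """Assumed configuration shape: an empty set places no restriction."""

    enabled_vos: set[str] = field(default_factory=set)
    enabled_sites: set[str] = field(default_factory=set)

    def accepts(self, vo: str, site: str) -> bool:
        vo_ok = not self.enabled_vos or vo in self.enabled_vos
        site_ok = not self.enabled_sites or site in self.enabled_sites
        return vo_ok and site_ok


@dataclass
class LogDispatcher:
    """Receives pilot log messages and fans them out to all plugins."""

    config: RemoteLoggerConfig
    plugins: list[LogPlugin]

    def handle(self, vo: str, site: str, pilot_id: str, lines: list[str]) -> bool:
        # Drop messages from pilots outside the targeted VOs/sites to limit load.
        if not self.config.accepts(vo, site):
            return False
        for plugin in self.plugins:
            plugin.send(pilot_id, lines)
        return True


# Example: remote logging enabled for a single (illustrative) VO only.
backend = InMemoryPlugin()
dispatcher = LogDispatcher(RemoteLoggerConfig(enabled_vos={"lhcb"}), [backend])
accepted = dispatcher.handle("lhcb", "SITE.A", "pilot-1", ["environment check ok"])
rejected = dispatcher.handle("othervo", "SITE.A", "pilot-2", ["not forwarded"])
```

Keeping the filter in the dispatcher rather than in each plugin means a rejected message costs only a set lookup, which matches the stated goal of protecting the receiving server from overload. A real deployment would also need authentication, batching and error handling, which this sketch leaves out.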
In this talk we describe the design of this system and the results of its initial deployment.
Consider for long presentation: No