May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Improved Pilot Logging in DIRAC

Not scheduled
1h
Hampton Roads Ballroom and Foyer Area (Norfolk Waterside Marriott)

235 East Main Street Norfolk, VA 23510
Poster (Poster Session)

Speaker

Martyniak, Janusz (Imperial College London, UK)

Description

DIRAC is a widely used “framework for distributed computing”. It works by building a layer between the users and the resources, offering a common interface to a number of heterogeneous providers. DIRAC, like many other workload management systems, uses pilot jobs to check and configure the worker-node environment before fetching a user payload. The pilot also records a number of parameters of the jobs it runs, e.g. memory usage and efficiency. The log messages generated by these pilot jobs (“pilot logs”) are crucial for diagnosing problems in the infrastructure; however, the retention policies for these logs vary by technology and resource provider. Transient (cloud) resources often have no space suitable for archiving these logs at all, and logs can be completely lost when the payload job crashes. Retaining pilot logs in a reliable, resource-independent manner was identified as a high-priority issue by LHCb and other communities using the DIRAC workload manager.
We implemented a configurable remote pilot logging system which can be enabled in parallel with the existing logging based on computing resources. To facilitate logging to different back-end storage systems, we use a web server which receives pilot log messages and dispatches them to configurable plugins for external services, for example Grid Storage Elements or message brokers. Since extensive remote pilot logging for all pilots could easily overload the receiving server, we allow for a targeted configuration of the system, such as enabling remote logging only for specific Virtual Organisations and/or sites.
In this talk we describe the design of this system and the results of its initial deployment.
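The receive-and-dispatch design described above can be illustrated with a minimal Python sketch. This is not DIRAC code: the names RemoteLogDispatcher, InMemoryPlugin, enabled_vos and enabled_sites, as well as the example site name, are hypothetical and chosen only for illustration. The web server front end is omitted, and the rule that an empty filter set means "no restriction" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class InMemoryPlugin:
    """Hypothetical stand-in for a real back end (storage element, message broker)."""
    stored: Dict[str, List[str]] = field(default_factory=dict)

    def send(self, pilot_id: str, lines: List[str]) -> None:
        # Append the received lines to whatever is already held for this pilot.
        self.stored.setdefault(pilot_id, []).extend(lines)


@dataclass
class RemoteLogDispatcher:
    """Hypothetical receiver: filters by VO/site, then fans out to all plugins."""
    plugins: List[InMemoryPlugin]
    enabled_vos: Set[str] = field(default_factory=set)
    enabled_sites: Set[str] = field(default_factory=set)

    def is_enabled(self, vo: str, site: str) -> bool:
        # Assumption: an empty set places no restriction on that dimension.
        vo_ok = not self.enabled_vos or vo in self.enabled_vos
        site_ok = not self.enabled_sites or site in self.enabled_sites
        return vo_ok and site_ok

    def receive(self, pilot_id: str, vo: str, site: str, lines: List[str]) -> bool:
        # Drop messages from pilots outside the targeted configuration.
        if not self.is_enabled(vo, site):
            return False
        for plugin in self.plugins:
            plugin.send(pilot_id, lines)
        return True


# Usage: remote logging enabled for one VO at one (made-up) cloud site.
store = InMemoryPlugin()
dispatcher = RemoteLogDispatcher(
    plugins=[store],
    enabled_vos={"lhcb"},
    enabled_sites={"CLOUD.Example.uk"},
)
dispatcher.receive("pilot-1", "lhcb", "CLOUD.Example.uk", ["pilot started"])
dispatcher.receive("pilot-2", "othervo", "CLOUD.Example.uk", ["ignored"])
```

Keeping the filter in the receiver and the storage logic in the plugins means a new back end can be added without changing how messages are accepted or rejected.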

Consider for long presentation: No

Primary authors

Martyniak, Janusz (Imperial College London, UK); Fayer, Simon (Imperial College); Stagni, Federico (CERN)
