Most analyses in the LHCb experiment start by filtering data and simulation stored on the Worldwide LHC Computing Grid (WLCG). Traditionally, this has been achieved by submitting user jobs that each process a small fraction of the total dataset. While this model has worked well, it has become increasingly complex as the LHCb datasets have grown, and it requires every analyst to understand the intricacies of the grid. It also burdens individuals with documenting how each file was processed.
Here we present a more robust and efficient approach, known within LHCb as Analysis Productions. To filter LHCb datasets into ntuples, an analyst opens a merge request in GitLab, which is then tested automatically on a small subset of the data using Continuous Integration. The results of these tests are exposed via a dedicated website that aggregates the most important details. Once the merge request has been reviewed and accepted, productions are submitted and run automatically using the DIRAC transformation system. The output data is stored on grid storage, and tools are provided to make it easily accessible for analysis.
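The test-before-submit workflow described above can be illustrated with a minimal sketch. This is not the Analysis Productions implementation: every name here (`SubsetReport`, `run_subset_test`, `submit_if_approved`, the acceptance rule, and the event format) is hypothetical and chosen only to show the pattern of running a selection on a small subset of files and gating the full submission on both the test result and review approval.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class SubsetReport:
    """Summary of a test run on a small subset (hypothetical structure)."""
    files_tested: int
    events_in: int
    events_out: int

    @property
    def passed(self) -> bool:
        # Assumed acceptance rule: the selection must keep at least one event.
        return self.events_out > 0


def run_subset_test(
    files: Sequence[str],
    selection: Callable[[dict], bool],
    read_events: Callable[[str], Iterable[dict]],
    max_files: int = 1,
) -> SubsetReport:
    """Apply the selection to the first `max_files` files only."""
    subset = list(files[:max_files])
    events_in = events_out = 0
    for path in subset:
        for event in read_events(path):
            events_in += 1
            if selection(event):
                events_out += 1
    return SubsetReport(len(subset), events_in, events_out)


def submit_if_approved(
    report: SubsetReport, reviewed: bool, submit: Callable[[], None]
) -> bool:
    """Trigger the full production only if the test passed and review is done."""
    if report.passed and reviewed:
        submit()
        return True
    return False
```

In this sketch the `submit` callable stands in for whatever hands the request to the production system; keeping it as a parameter means the gate itself can be exercised without any grid access.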
This new approach has the advantage of being faster and simpler for analysts while also ensuring that the full processing chain is preserved and reproducible. Using GitLab to manage submissions encourages code review and the sharing of derived datasets between analyses.
The Analysis Productions system has been stress-tested with legacy data for a couple of years and is becoming the de facto standard by which data, whether legacy or Run 3, is prepared for physics analysis. It has been scaled to analyses that process thousands of datasets, and the approach of testing prior to submission is now being extended to other production types in LHCb.