May 8 – 12, 2023
US/Eastern timezone

Automatising open data publishing workflows: experience with CMS open data curation

May 9, 2023, 2:45 PM
Track 8 - Collaboration, Reinterpretation, Outreach and Education


Dr Simko, Tibor (CERN)


In this paper we discuss the CMS open data publishing workflows, summarising experience with eight releases of CMS open data on the CERN Open Data portal since its initial launch in 2014. We present recent enhancements to the data curation procedures, including (i) mining information about collision and simulated datasets together with the accompanying generation parameters and processing configuration files, (ii) building an API service covering luminosity, run number ranges and other contextual dataset information, and (iii) configuring the CERN Open Data storage area as a Rucio endpoint that manages over four petabytes of released CMS open data and serves as a WLCG Tier 3 site to simplify data transfers. Finally, we discuss the latest CMS content released as open data (completed Run 1 data, first samples from Run 2 data) and the associated runnable analysis examples demonstrating its use in containerised data analysis workflows. We conclude with a short list of lessons learnt and general recommendations to facilitate upcoming releases of Run 2 data.

Primary authors

Dr Lassila-Perini, Kati (Helsinki Institute of Physics)
Dr Simko, Tibor (CERN)

