Conveners
Track 6 - Physics Analysis Tools: Statistical Inference and Fitting
- Alexander Held (University of Wisconsin–Madison)
- Stephan Hageboeck (CERN)
Track 6 - Physics Analysis Tools: I/O and Data Formats
- Dave Heddle (CNU)
- Stephan Hageboeck (CERN)
Track 6 - Physics Analysis Tools: Machine Learning in Analysis
- Dave Heddle (CNU)
- Alexander Held (University of Wisconsin–Madison)
Track 6 - Physics Analysis Tools: Reconstruction and Amplitude Fitting
- Nicole Skidmore (University of Manchester)
- Dave Heddle (CNU)
Track 6 - Physics Analysis Tools: Physics Analysis Workflows
- Stephan Hageboeck (CERN)
- Nicole Skidmore (University of Manchester)
Track 6 - Physics Analysis Tools: AM Parallel
- Dave Heddle (CNU)
- Stephan Hageboeck (CERN)
Track 6 - Physics Analysis Tools: PM Parallel
- Alexander Held (University of Wisconsin–Madison)
- Nicole Skidmore (University of Manchester)
RooFit is a library for building and fitting statistical models that is part of ROOT. It is used by most experiments in particle physics, in particular the LHC experiments. Recently, the backend that evaluates the RooFit likelihood functions was rewritten to support performant computation of model components on different hardware. This new backend is referred to as the "batch mode". So far,...
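As a minimal illustration of switching on the new backend (a PyROOT sketch; the toy Gaussian model is ours, not from the contribution):

```python
import ROOT

# Toy model: a Gaussian in one observable.
x = ROOT.RooRealVar("x", "x", -10, 10)
mean = ROOT.RooRealVar("mean", "mean", 0, -5, 5)
sigma = ROOT.RooRealVar("sigma", "sigma", 1, 0.1, 5)
model = ROOT.RooGaussian("model", "model", x, mean, sigma)
data = model.generate(ROOT.RooArgSet(x), 100000)

# BatchMode(True) selects the rewritten, vectorized likelihood backend.
result = model.fitTo(data, ROOT.RooFit.BatchMode(True), ROOT.RooFit.Save())
result.Print()
```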
With the growing datasets of current and next-generation High-Energy and Nuclear Physics (HEP/NP) experiments, statistical analysis has become more computationally demanding. These increasing demands motivate improvements and modernization of existing statistical analysis software. One way to address these issues is to improve parameter estimation performance and numeric stability using...
RooFit is a toolkit for statistical modeling and fitting, first presented at CHEP 2003; together with RooStats, it is used for measurements and statistical tests by most experiments in particle physics, particularly the LHC experiments.
As the LHC program progresses, physics analyses become more ambitious and computationally more demanding, with fits of hundreds of data samples to joint...
Minuit is a program implementing a function minimisation algorithm, written at CERN more than 50 years ago. It is still used by almost all statistical analyses in High Energy Physics to find optimal likelihoods and best parameter values. A new version, Minuit2, re-implemented the original algorithm in C++ a few years ago; it is provided as a ROOT library or a standalone C++...
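For a flavor of how Minuit2 is driven from Python today, here is a minimal sketch using iminuit, which wraps the standalone Minuit2 C++ library (the objective is a standard test function, not from the contribution):

```python
from iminuit import Minuit

# Rosenbrock function: a classic minimisation test case.
def rosenbrock(x, y):
    return (1 - x) ** 2 + 100 * (y - x**2) ** 2

m = Minuit(rosenbrock, x=0.0, y=0.0)
m.migrad()   # run the MIGRAD minimiser
m.hesse()    # compute parameter uncertainties
print(m.values["x"], m.values["y"])
```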
Collider physics analyses have historically favored Frequentist statistical methodologies, with some exceptions of Bayesian inference in LHC analyses through use of the Bayesian Analysis Toolkit (BAT). We demonstrate work towards an approach for performing Bayesian inference for LHC physics analyses that builds upon the existing APIs and model building technology of the pyhf and PyMC Python...
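As a sketch of the model-building starting point (the pyhf side only; the PyMC bridging layer is not shown and its API is not assumed here):

```python
import pyhf

# Single-channel counting model: signal plus background with an
# uncorrelated background uncertainty.
model = pyhf.simplemodels.uncorrelated_background(
    signal=[5.0], bkg=[50.0], bkg_uncertainty=[7.0]
)
observations = [55.0] + model.config.auxdata

# Frequentist MLE on the same model, for comparison with the
# Bayesian posterior one would obtain by sampling via PyMC.
best_fit = pyhf.infer.mle.fit(observations, model)
print(best_fit)
```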
Many current analyses in nuclear and particle physics search for signals that are immersed in irreducible background events. These background events, which entirely surround the signal of interest, lead to inaccurate results when extracting physical observables from the data, because no selection criteria can improve the signal-to-background ratio. By...
The Deep Underground Neutrino Experiment (DUNE) has historically represented data using a combination of custom data formats and those based on ROOT I/O. Recently, DUNE has begun using the Hierarchical Data Format (HDF5) for some of its data storage applications. HDF5 provides high-performance, low-overhead I/O in DUNE’s data acquisition (DAQ) environment. DUNE will use HDF5 to record raw...
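A minimal sketch of the kind of HDF5 usage involved (written with h5py; the group layout and attribute names are illustrative, not DUNE's actual raw-data schema):

```python
import h5py
import numpy as np

# Write one "event" of raw ADC data into an HDF5 file.
with h5py.File("raw_event.h5", "w") as f:
    event = f.create_group("event_000001")          # hypothetical layout
    adc = np.random.randint(0, 4096, size=(256, 6000), dtype=np.uint16)
    event.create_dataset("adc", data=adc, compression="gzip")
    event.attrs["trigger_timestamp"] = 1234567890   # hypothetical attribute
```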
ROOT's TTree data structure has been highly successful and useful for HEP; nevertheless, alternative file formats now exist which may offer broader software tool support and more stable in-memory interfacing. We present a data serialization library that produces a similar data structure within the HDF5 format, supporting C++ standard collections, user-defined data types, and schema...
The RNTuple I/O subsystem is ROOT's future event data file format and access API. It is driven by the expected data volume increase at upcoming HEP experiments, e.g. at the HL-LHC, and by recent opportunities in the storage hardware and software landscape such as NVMe drives and distributed object stores. RNTuple is a redesign of the TTree binary format and API and has been shown to deliver...
After using ROOT TTree for over two decades and storing more than an exabyte of compressed data, advances in technology have motivated a complete redesign, RNTuple, which breaks backward compatibility to take better advantage of these storage options. The RNTuple I/O subsystem has been designed to address performance bottlenecks and shortcomings of ROOT's current state-of-the-art TTree I/O...
Analysis performance has a significant impact on the productivity of physicists. The vast majority of analyses use ROOT (https://root.cern). For a few years now, ROOT has offered an analysis interface called RDataFrame which helps achieve the best performance for analyses, ideally making them I/O limited, i.e. with their performance limited by the throughput of reading the input data.
The...
RDataFrame is ROOT's high-level interface for Python and C++ data analysis. Since it first became available, RDataFrame adoption has grown steadily and it is now poised to be a major component of analysis software pipelines for LHC Run 3 and beyond. Thanks to its design inspired by declarative programming principles, RDataFrame enables the development of high-performance, highly parallel...
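A minimal sketch of the declarative style (file, tree, and column names are placeholders):

```python
import ROOT

ROOT.EnableImplicitMT()  # process the dataset on all available cores

df = ROOT.RDataFrame("Events", "events.root")
result = (df.Filter("nMuon == 2")                         # select dimuon events
            .Define("pt_sum", "Muon_pt[0] + Muon_pt[1]")  # derived column
            .Histo1D("pt_sum"))                           # booked lazily

hist = result.GetValue()  # the single parallel event loop runs here
print(hist.GetEntries())
```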
The usage of Deep Neural Networks (DNNs) as multi-classifiers is widespread in modern HEP analyses. In standard categorisation methods, the high-dimensional output of the DNN is often reduced to a one-dimensional distribution by passing only the information about the highest class score to the statistical inference method. Correlations with other classes are thereby omitted.
Moreover, in...
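The reduction the abstract criticises can be stated in a few lines of NumPy (the scores here are made up for illustration):

```python
import numpy as np

# Per-event DNN softmax outputs for three classes (illustrative values).
scores = np.array([[0.70, 0.20, 0.10],
                   [0.40, 0.35, 0.25]])

category = scores.argmax(axis=1)  # standard method: keep only the winning class
max_score = scores.max(axis=1)    # ...and its score, a 1D distribution
# The remaining class scores, and their correlations, are discarded
# unless the full vector is passed to the statistical inference step.
```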
The search for the dimuon decay of the Standard Model (SM) Higgs boson looks for a tiny peak on top of a smoothly falling SM background in the dimuon invariant mass spectrum m(μμ). Due to the very small signal-to-background ratio, which is at the level of 0.2% in the region m(μμ) = 120–130 GeV for an inclusive selection, an accurate determination of the background is of paramount importance....
We present New Physics Learning Machine (NPLM), a machine learning-based strategy to detect data departures from a Reference model, with no prior bias on the source of discrepancy. The main idea behind the method is to approximate the optimal log-likelihood-ratio hypothesis test by parametrising the data distribution with a universal approximating function and solving its maximum-likelihood fit...
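In the notation of the NPLM papers (a paraphrase; $n(x|w)$ is the data distribution parametrised by the universal approximator with trainable parameters $w$, $N(w)$ its integral over the measured region, and $R$ the reference model), the test statistic is the extended maximum-likelihood ratio

$$t(\mathcal{D}) = 2\,\max_{w}\Big[\,N(R) - N(w) + \sum_{x\in\mathcal{D}}\log\frac{n(x\,|\,w)}{n(x\,|\,R)}\Big].$$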
Data-driven methods are widely used to overcome shortcomings of Monte Carlo (MC) simulations (lack of statistics, mismodeling of processes, etc.) in experimental High Energy Physics. A precise description of background processes is crucial to reach the optimal sensitivity for a measurement. However, the selection of the control region used to describe the background process in a region of...
Many theories of Beyond Standard Model (BSM) physics feature multiple BSM particles. Generally, these theories live in higher-dimensional phase spaces that are spanned by multiple independent BSM parameters such as BSM particle masses, widths, and coupling constants. Fully probing these phase spaces to extract comprehensive exclusion regions in the high-dimensional space is challenging....
The matrix element method (MEM) is a powerful technique that can be used for the analysis of particle collider data utilizing an ab initio calculation of the approximate probability density function for a collision event to be due to a physics process of interest. The most serious difficulty with the MEM, which has limited its applicability to searches for beyond-the-SM physics and...
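Schematically (generic MEM notation, not specific to this contribution), the per-event probability density for an observed event $x$ under process hypothesis $\alpha$ is

$$P(x \mid \alpha) = \frac{1}{\sigma_\alpha}\int d\Phi(y)\, f_1(q_1)\, f_2(q_2)\, \big|\mathcal{M}_\alpha(y)\big|^2\, W(x \mid y),$$

where $\mathcal{M}_\alpha$ is the matrix element for the parton-level configuration $y$, $f_{1,2}$ are the parton distribution functions, $W(x \mid y)$ is the detector transfer function, and $\sigma_\alpha$ normalises the density. The phase-space integration over $d\Phi(y)$ for every event is what makes the method computationally demanding.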
The primary physics goal of the Mu2e experiment requires reconstructing an isolated 105 MeV electron with better than 500 keV/c momentum resolution. Mu2e uses a low-mass straw-tube tracker and a CsI crystal calorimeter to reconstruct tracks.
In this paper, we present the design and performance of a track reconstruction algorithm optimized for Mu2e’s unusual requirements. The algorithm is...
Among the biggest computational challenges for High Energy Physics (HEP) experiments are the increasingly large datasets being collected, which often require correspondingly complex data analyses. In particular, the PDFs used for modeling the experimental data can have hundreds of free parameters. The optimization of such models involves a significant computational effort and a...
To accurately describe data, tuning the parameters of MC event generators is essential. At first, experts performed tunings manually, based on their sense of the physics and goodness of fit. The Professor software made tuning more objective by employing polynomial surrogate functions to model the relationship between generator parameters and experimental observables (inner-loop optimization),...
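A toy sketch of the two-loop structure (one parameter, one observable; all numbers here are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Inner loop: fit a polynomial surrogate of the simulated observable
# as a function of the generator parameter.
params = np.linspace(0.5, 1.5, 7)           # sampled generator parameter values
mc_obs = 2.0 * params**2 - params + 0.3     # stand-in for MC-predicted observable
coeffs = np.polyfit(params, mc_obs, deg=2)  # polynomial surrogate

# Outer loop: tune the parameter by minimising a chi^2 against data,
# evaluating the cheap surrogate instead of rerunning the generator.
data_obs, data_err = 1.4, 0.1

def chi2(p):
    prediction = np.polyval(coeffs, p[0])
    return ((prediction - data_obs) / data_err) ** 2

best = minimize(chi2, x0=[1.0])
print(best.x)
```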
Performing a physics analysis of data from simulations of a high energy experiment requires the application of several common procedures, from obtaining and reading the data to producing detailed plots for interpretation. Implementing common procedures in a general analysis framework allows the analyzer to focus on the unique parts of their analysis. Over the past few years, EIC simulations...
Apache Spark is a distributed computing framework which can process very large datasets using large clusters of servers. Laurelin is a Java-based implementation of ROOT I/O which allows Spark to read and write ROOT files from common HEP storage systems without a dependency on the C++ implementation of ROOT. We discuss improvements due to the migration to an Arrow-based in-memory representation...
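A minimal sketch of reading a ROOT file into a Spark DataFrame through Laurelin (package coordinates and version are indicative; tree and file names are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("laurelin-example")
         # Pull the Laurelin data source from Maven (version indicative).
         .config("spark.jars.packages", "edu.vanderbilt.accre:laurelin:1.1.1")
         .getOrCreate())

df = (spark.read.format("root")     # Laurelin registers the "root" format
      .option("tree", "Events")
      .load("events.root"))
df.printSchema()
```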
The challenges expected for the HL-LHC era are pushing LHC experiments to re-think their computing models at many levels. The evolution toward solutions that allow an effortless interactive analysis experience is, among others, one of the topics followed closely by the CMS experiment. In this context, ROOT RDataFrame offers a high-level, lazy programming model which makes it a flexible and...
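One setup explored in this context is distributed RDataFrame, which scales the same programming model out to a cluster. A minimal sketch, with a local Dask cluster standing in for real resources (the interface lives under ROOT.RDF.Experimental.Distributed at the time of writing, so details may change):

```python
import ROOT
from dask.distributed import Client, LocalCluster

# A local cluster stands in for batch or cloud resources.
client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

DaskRDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
df = DaskRDataFrame("Events", "events.root", daskclient=client)

# Same lazy, declarative API; the work is partitioned over the workers.
h = df.Histo1D(("h_pt", "Muon pT", 100, 0.0, 100.0), "Muon_pt")
print(h.GetValue().GetEntries())
```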
ALICE, one of the four large experiments at the CERN LHC, is a detector for the physics of heavy ions. In a high-interaction-rate environment, the pile-up of multiple events creates conditions that require advanced multidimensional data analysis methods.
Machine learning (ML) has become very popular in multidimensional data analysis in recent years. Compared to the simple, low-dimensional...
Realistic environments for prototyping, studying and improving analysis workflows are a crucial element on the way towards user-friendly physics analysis at HL-LHC scale. The IRIS-HEP Analysis Grand Challenge (AGC) provides such an environment. It defines a scalable and modular analysis task that captures relevant workflow aspects, ranging from large-scale data processing and handling of...
The growing amount of data generated by the LHC requires a shift in how HEP analysis tasks are approached. Efforts to address this computational challenge have led to the rise of a middle-man software layer, a mixture of simple, effective APIs and fast execution engines underneath. Having common, open and reproducible analysis benchmarks proves beneficial in the development of these modern...
PyPWA is a toolkit designed to fit (regression) parametric models to data and to generate distributions (simulation) according to a given model (function). The PyPWA software is written in the Python ecosystem with the goal of performing Amplitude or Partial Wave Analysis (PWA) in nuclear and particle physics experiments. The aim of spectroscopy experiments is often the...
Most analyses in the LHCb experiment start by filtering data and simulation stored on the WLCG. Traditionally this has been achieved by submitting user jobs that each process a small fraction of the total dataset. While this has worked well, it has become increasingly complex as the LHCb datasets have grown and this model requires all analysts to understand the intricacies of the grid. This...
We present tools for high-performance analysis written in pure Julia, a just-in-time (JIT) compiled dynamic programming language with high-level syntax and high performance. The packages we present center around UnROOT.jl, a pure-Julia ROOT file I/O package that is optimized for speed, lazy reading, flexibility, and thread safety.
We discuss what affects performance in Julia, the challenges,...
We will describe how ServiceX, an IRIS-HEP project, generates C++ or Python code from user queries and orchestrates thousands of experiment-provided Docker containers to filter and select event data. The source data files are identified using Rucio. We will show how the service encapsulates best practice for using Rucio and helps inexperienced analysers get up to speed quickly. The data is...
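A sketch of the query side, using the func_adl frontend for ServiceX (class and method names follow the func_adl_servicex Python package as we understand it, so treat them as assumptions; the dataset identifier is a placeholder):

```python
from func_adl_servicex import ServiceXSourceUpROOT

# Dataset DID is a placeholder; ServiceX resolves it through Rucio.
src = ServiceXSourceUpROOT("user.example:placeholder.dataset", "Events")

# The query is translated to code running in experiment-provided containers;
# only the selected column is delivered back to the user.
muon_pt = (src.Select(lambda e: e["Muon_pt"])
              .AsAwkwardArray()
              .value())
```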
Awkward Array is a library for performing NumPy-like computations on nested, variable-sized data, enabling array-oriented programming on arbitrary data structures in Python. However, imperative (procedural) solutions can sometimes be easier to write or faster to run. Performant imperative programming requires compilation; JIT-compilation makes it convenient to compile in an interactive Python...
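A small sketch of the imperative route: Awkward Arrays can be passed directly into Numba-compiled functions (the data here is made up):

```python
import awkward as ak
import numba as nb

@nb.njit
def leading_pt_sum(events):
    # Imperative loop over a jagged structure, JIT-compiled by Numba.
    total = 0.0
    for event in events:
        if len(event.pt) > 0:
            total += event.pt[0]
    return total

events = ak.Array([{"pt": [10.0, 5.0]}, {"pt": []}, {"pt": [7.5]}])
print(leading_pt_sum(events))  # 17.5
```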
In particle physics, data analysis frequently needs variable-length, nested data structures such as arbitrary numbers of particles per event and combinatorial operations to search for particle decay. Arrays of these data types are provided by the Awkward Array library.
The previous version of this library was implemented in C++, but this impeded its ability to grow. Thus, driven by this...
Recent developments of HEP software allow novel approaches to physics analysis workflows. The novel data delivery system, ServiceX, can be very effective when accessing a fraction of large datasets at remote grid sites. ServiceX can deliver user-selected columns with filtering and run at scale. We will introduce the ServiceX data management package, ServiceX DataBinder, for easy manipulations...
Cling is a clang/LLVM-based, high-performance C++ interpreter originating from HEP. In ROOT, cling serves as the basis for language interoperability; it provides reflection data to ROOT's I/O system and enables RDataFrame's dynamic type-safe interfaces.
Cling regularly moves to more recent LLVM versions to bring in new features and support for new language standards. The recent LLVM 13...
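The interoperability that cling enables is visible in everyday ROOT usage, for example JIT-compiling C++ from Python (a minimal sketch; the function is ours):

```python
import ROOT

# cling JIT-compiles this C++ at runtime and makes it callable from Python.
ROOT.gInterpreter.Declare("""
double invariant_sum(double a, double b) { return a + b; }
""")
print(ROOT.invariant_sum(1.0, 2.0))  # 3.0
```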
During Run 2 the ATLAS experiment employed a large number of different user frameworks to perform the final corrections of its event data. For Run 3 a common framework was developed that incorporates the lessons learned from existing frameworks. Besides providing analysis standardization it also incorporates optimizations that lead to a substantial reduction in computing needs during analysis.
ATLAS is one of the main experiments at the Large Hadron Collider, with a diverse physics program covering precision measurements as well as new physics searches in countless final states, carried out by more than 2600 active authors. The High Luminosity LHC (HL-LHC) era brings unprecedented computing challenges that call for novel approaches to reduce the amount of data and MC that is stored,...
Current analyses at ATLAS always involve a step of producing an analysis-specific n-tuple that incorporates the final step of calibrations as well as systematic variations. The goal for Run 4 is to make these analysis-specific corrections fast enough that they can be applied "on-the-fly", without the need for an intermediate n-tuple. The main complications are that some of these...
A performant and easy-to-use event data model (EDM) is a key component of any HEP software stack. The podio EDM toolkit provides a user-friendly way of generating such a performant implementation in C++ from a high-level description in YAML format. Finalizing a few important developments, we release v1.0 of podio, a stable release with backward compatibility for data files written with podio...
In the last two decades, there have been major breakthroughs in Quantum Chromodynamics (QCD), the theory of the strong interaction of quarks and gluons, as well as major advances in the accelerator and detector technologies that allow us to map the spatial distribution and motion of the quarks and gluons in terms of quantum correlation functions (QCF). This field of research, broadly known as...
In an ideal world, we would describe our models with recognizable mathematical expressions and directly fit those models to large data samples with high performance. It turns out that this can be done with a Computer Algebra System (CAS), using its symbolic expression trees as templates for computational back-ends like JAX. The CAS can in fact further simplify the expression tree, which can result in speed-ups in the numerical...
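A minimal sketch of the idea with SymPy as the CAS (lambdify's "jax" backend requires a recent SymPy; the Gaussian expression is a toy stand-in for a physics model):

```python
import jax
import jax.numpy as jnp
import sympy as sp

# Build and simplify a symbolic model expression.
x, mu, sigma = sp.symbols("x mu sigma")
expr = sp.simplify(sp.exp(-((x - mu) ** 2) / (2 * sigma**2)))

# Translate the expression tree into a JAX function and JIT-compile it.
f = sp.lambdify((x, mu, sigma), expr, modules="jax")
f_fast = jax.jit(f)

print(f_fast(jnp.linspace(-1.0, 1.0, 5), 0.0, 1.0))
```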