Conveners
Track 4 - Distributed Computing: Analysis Workflows, Modeling and Optimisation
- Fernando Barreiro Megino (Unive)
- Rohini Joshi (SKAO)
Track 4 - Distributed Computing: Computing Strategies and Evolution
- Hideki Miyake (KEK/IPNS)
- Fernando Barreiro Megino (Unive)
Track 4 - Distributed Computing: Infrastructure and Services
- Katy Ellis (STFC-RAL)
- Fernando Barreiro Megino (Unive)
Track 4 - Distributed Computing: Monitoring, Testing and Analytics
- Hideki Miyake (KEK/IPNS)
- Katy Ellis (STFC-RAL)
Track 4 - Distributed Computing: Security and Tokens
- Katy Ellis (STFC-RAL)
- Fernando Barreiro Megino (Unive)
Track 4 - Distributed Computing: Distributed Storage and Computing Resources
- Rohini Joshi (SKAO)
- Hideki Miyake (KEK/IPNS)
Track 4 - Distributed Computing: Workload Management
- Rohini Joshi (SKAO)
- Katy Ellis (STFC-RAL)
Machine learning has become one of the most important tools for High Energy Physics analysis. As dataset sizes at the Large Hadron Collider (LHC) grow and search spaces expand to fully exploit the physics potential, more and more computing resources are required to process these machine learning tasks. In addition, complex advanced...
We present a new implementation of simulation-based inference using data collected by the ATLAS experiment at the LHC. The method relies on large ensembles of deep neural networks to approximate the exact likelihood. Additional neural networks are introduced to model systematic uncertainties in the measurement. Training of the large number of deep neural networks is automated using a...
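The core idea above, approximating a likelihood with an ensemble of neural networks, can be illustrated with a minimal sketch based on the classifier likelihood-ratio trick. The toy Gaussian "simulator", network sizes and parameter values below are assumptions for illustration only, not the ATLAS implementation.

```python
# Minimal sketch: an ensemble of classifiers approximates the log likelihood
# ratio between two parameter points via log s/(1-s). Toy data only.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def sample(theta, n=5000):
    """Toy 'simulator': 1D Gaussian whose mean depends on the parameter."""
    return rng.normal(loc=theta, scale=1.0, size=(n, 1))

def train_ensemble(theta0, theta1, n_members=10):
    """Train several classifiers to separate samples drawn at theta0 and theta1."""
    members = []
    for _ in range(n_members):
        x0, x1 = sample(theta0), sample(theta1)
        X = np.vstack([x0, x1])
        y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
        clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500)
        clf.fit(X, y)
        members.append(clf)
    return members

def log_likelihood_ratio(members, x):
    """Average the per-member estimates of log p(x|theta1)/p(x|theta0)."""
    ratios = []
    for clf in members:
        s = np.clip(clf.predict_proba(x)[:, 1], 1e-6, 1 - 1e-6)
        ratios.append(np.log(s / (1.0 - s)))
    return np.mean(ratios, axis=0)

ensemble = train_ensemble(theta0=0.0, theta1=0.5)
observed = sample(0.5, n=100)
print("summed log-LR on observed data:", log_likelihood_ratio(ensemble, observed).sum())
```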
Predicting the performance of various infrastructure design options in complex federated infrastructures with computing sites distributed over a wide area that support a plethora of users and workflows, such as the Worldwide LHC Computing Grid (WLCG), is not trivial. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scales...
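As a toy illustration of the kind of modelling referred to above, the sketch below runs a tiny discrete-event simulation and compares the predicted makespan of a fixed job mix for different numbers of worker slots. The slot counts and job-duration distribution are invented for illustration.

```python
# Toy discrete-event simulation: jobs occupy the earliest available slot,
# and the predicted makespan is compared across design options.
import heapq, random

def simulate(slots, n_jobs=1000, seed=1):
    random.seed(seed)
    durations = [random.expovariate(1 / 3600.0) for _ in range(n_jobs)]  # seconds
    free_at = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish / 3600.0               # makespan in hours

for slots in (100, 200, 400):
    print(f"{slots} slots -> predicted makespan {simulate(slots):.1f} h")
```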
InterTwin is an EU-funded project that started on 1 September 2022. The project will work with experts from different scientific domains to build a technology supporting digital twins within scientific research. Digital twins are models for predicting the behaviour and evolution of real-world systems and applications.
InterTwin will focus on employing machine-learning...
The IceCube Neutrino Observatory is a cubic kilometer neutrino telescope located at the geographic South Pole. To accurately and promptly reconstruct the arrival direction of candidate neutrino events for Multi-Messenger Astrophysics use cases, IceCube employs Skymap Scanner workflows managed by the SkyDriver service. The Skymap Scanner performs maximum-likelihood tests on individual pixels...
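The pixel-by-pixel maximum-likelihood scan described above can be sketched schematically as below. The coarse pixel grid, the "observed" direction and the Gaussian stand-in for the per-pixel test statistic are placeholder assumptions; the process pool merely stands in for the distributed workers orchestrated by SkyDriver.

```python
# Schematic pixel-wise likelihood scan over a toy sky grid.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Toy "pixels": a coarse grid of candidate source directions (RA, Dec in degrees).
PIXELS = [(ra, dec) for ra in range(0, 360, 10) for dec in range(-80, 90, 10)]
TRUE_DIR = np.array([120.0, -30.0])      # pretend reconstructed event direction

def neg_log_likelihood(pixel):
    """Toy per-pixel test statistic: quadratic distance to the observed direction."""
    d = np.linalg.norm(np.array(pixel) - TRUE_DIR)
    return pixel, 0.5 * d ** 2

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:   # stand-in for distributed scan workers
        results = list(pool.map(neg_log_likelihood, PIXELS))
    best_pixel, best_nll = min(results, key=lambda r: r[1])
    print("best-fit pixel:", best_pixel, "with -logL =", best_nll)
```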
A fast turn-around time and ease of use are important factors for systems supporting the analysis of large HEP data samples. We study and compare multiple technical approaches.
This presentation covers the setup and benchmarking of the Analysis Grand Challenge (AGC) [1] using CMS Open Data. The AGC is an effort to provide a realistic physics analysis with the intent of showcasing the...
The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment whose primary physics goal is the determination of the neutrino mass hierarchy. JUNO will start taking data in 2024, producing 2 PB of raw data each year, and will use a distributed computing infrastructure for simulation, reconstruction and analysis tasks. The JUNO distributed computing system has been built up based on...
The discovery of gravitational waves, first observed in September 2015 following the merger of a binary black hole system, has already revolutionised our understanding of the Universe. This was further enhanced in August 2017, when the coalescence of a binary neutron star system was observed both with gravitational waves and a variety of electromagnetic counterparts; this joint observation...
The LIGO, VIRGO and KAGRA Gravitational-wave (GW) observatories are getting ready for their fourth observational period, O4, scheduled to begin in March 2023, with improved sensitivities and thus higher event rates.
GW-related computing has both large commonalities with HEP computing, particularly in the domain of offline data processing and analysis, and important differences, for example in...
The HL-LHC run is anticipated to start at the end of this decade and will pose a significant challenge for the scale of the HEP software and computing infrastructure. The mission of the U.S. CMS Software & Computing Operations Program is to develop and operate the software and computing resources necessary to process CMS data expeditiously and to enable U.S. physicists to fully participate in...
The computing challenges at HL-LHC require fundamental changes to the distributed computing models that have served experiments well throughout LHC. ATLAS planning for HL-LHC computing started back in 2020 with a Conceptual Design Report outlining various challenges to explore. This was followed in 2022 by a roadmap defining concrete milestones and associated effort required. Today, ATLAS is...
In this talk, we discuss the evolution of the computing model of the ATLAS experiment at the LHC. After LHC Run 1, it became obvious that the available computing resources at the WLCG were fully used. The processing queue could reach millions of jobs during peak loads, for example before major scientific conferences and during large scale data processing. The unprecedented performance of the...
We present a collection of tools and processes that facilitate onboarding a new science collaboration onto the OSG Fabric of Services. Such collaborations typically rely on computational workflows for simulations and analysis that are ideal for executing on OSG's distributed High Throughput Computing environment (dHTC). The produced output can be accumulated and aggregated at available...
There is no lack of approaches for managing the deployment of distributed services – in the last 15 years of running distributed infrastructure, the OSG Consortium has seen many of them. One persistent problem has been that each physical site has its own style of configuration management and service operations, leading to a partitioning of staff knowledge and inflexibility in migrating services...
The CernVM File System (CVMFS) provides the software distribution backbone for High Energy and Nuclear Physics experiments and many other scientific communities in the form of a globally available shared software area. It has been designed for the software distribution problem of experiment software for LHC Runs 1 and 2. For LHC Run 3 and even more so for HL-LHC (Runs 4-6), the complexity of...
The increasing computational demand in High Energy Physics (HEP), as well as increasing concerns about energy efficiency in high-performance/high-throughput computing, are driving forces in the search for more efficient ways to utilize available resources. Since avoiding idle resources is key to achieving high efficiency, an appropriate measure is to share the idle resources of under-utilized sites...
The JIRIAF project aims to combine geographically diverse computing facilities into an integrated science infrastructure. The project starts by dynamically evaluating temporarily unallocated or idle compute resources from multiple providers. These resources are integrated to handle additional workloads without affecting local running jobs. This paper describes our approach to launch...
The Worldwide LHC Computing Grid (WLCG) is a large-scale collaboration which gathers the computing resources of around 170 computing centres from more than 40 countries. The grid paradigm, unique to the realm of high energy physics, has successfully supported a broad variety of scientific achievements. To fulfil the requirements of new applications and to improve the long-term sustainability...
Data taking at the Large Hadron Collider (LHC) at CERN restarted in 2022. The CMS experiment relies on a distributed computing infrastructure based on WLCG (Worldwide LHC Computing Grid) to support the LHC Run 3 physics program. The CMS computing infrastructure is highly heterogeneous and relies on a set of centrally provided services, such as distributed workload management and data...
Monitoring services play a crucial role in the day-to-day operation of distributed computing systems. The ATLAS experiment at LHC uses the production and distributed analysis workload management system (PanDA WMS), which allows a million computational jobs to run daily at over 170 computing centers of the WLCG and other opportunistic resources, utilizing 600k cores simultaneously on average....
The ALICE experiment at the CERN Large Hadron Collider relies on a massive, distributed Computing Grid for its data processing. The ALICE Computing Grid is built by combining a large number of individual computing sites distributed globally. These Grid sites are maintained by different institutions across the world and contribute thousands of worker nodes possessing different capabilities and...
HammerCloud (HC) is a testing service and framework for continuous functional tests, on-demand large-scale stress tests, and performance benchmarks. It checks the computing resources and various components of distributed systems with realistic full-chain experiment workflows.
The HammerCloud software was initially developed in Python 2. After support for Python 2 was discontinued in 2020,...
Operational analytics is a research direction concerned with analysing the current state of computing processes and predicting their future behaviour in order to anticipate imbalances and take timely measures to stabilize a complex system. There are two relevant areas in ATLAS Distributed Computing that are currently the focus of studies: end-user physics analysis, including the forecast...
For LHC Run 3 the ALICE experiment software stack has been completely refactored, incorporating support for multicore job execution. The new multicore jobs spawn multiple processes and threads within the payload. Given that some of the deployed processes may be short-lived, accounting for their resource consumption presents a challenge. This article presents the newly developed methodology for...
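The accounting problem outlined above can be illustrated with a small process-tree sampler: the tree under the payload PID is polled periodically and the last observed CPU time of every child is retained, so that processes which exit between samples still contribute what was seen before they died. This is only an illustration of the difficulty (processes shorter than one sampling interval are still missed), not the ALICE methodology; the interval and duration are arbitrary.

```python
# Sketch of process-tree CPU accounting for a multi-process payload.
import time
import psutil

def account(payload_pid, interval=5.0, duration=60.0):
    seen_cpu = {}                               # pid -> last observed CPU seconds
    root = psutil.Process(payload_pid)
    deadline = time.time() + duration
    while time.time() < deadline:
        for proc in [root] + root.children(recursive=True):
            try:
                t = proc.cpu_times()
                seen_cpu[proc.pid] = t.user + t.system
            except psutil.NoSuchProcess:
                pass                            # exited between listing and sampling
        time.sleep(interval)
    return sum(seen_cpu.values())               # total CPU seconds attributed to the payload
```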
No single organisation has the resources to defend its services alone against most modern malicious actors and so we must protect ourselves as a community. In the face of determined and well-resourced attackers, we must actively collaborate in this effort across HEP and more broadly across Research and Education (R&E).
Parallel efforts are necessary to appropriately respond to this...
In 2022, CERN ran its annual phishing campaign in which 2000 users gave away their passwords (Note: this number is in line with results of campaigns at other organisations). In a real phishing incident this would have meant 2000 compromised accounts... unless they were protected by Two-Factor Authentication (2FA)! In the same year, CERN introduced 2FA for accounts with access to critical...
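To make the mechanism concrete, the sketch below shows a time-based one-time password (TOTP), a common second factor: even a phished password is useless without the current code from the user's device. It uses the pyotp library purely as an example; the secret is generated on the spot and nothing here reflects CERN's actual 2FA deployment.

```python
# Minimal TOTP illustration with pyotp.
import pyotp

secret = pyotp.random_base32()        # enrolled once, stored in the authenticator app
totp = pyotp.TOTP(secret)

code = totp.now()                     # what the user reads off their device
print("valid:", totp.verify(code))    # server-side check of the submitted code
```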
Since 2017, the Worldwide LHC Computing Grid (WLCG) has been working towards enabling token-based authentication and authorization throughout its entire middleware stack. Following the initial publication of the WLCG v1.0 Token Schema in 2019, work has been done to integrate OAuth2.0 token flows across the Grid middleware. There are many complex challenges to be addressed before the WLCG can...
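A minimal sketch of one of the OAuth 2.0 flows involved, the client-credentials grant, is shown below: a service obtains a bearer token from an issuer and attaches it to requests, taking the place of an X.509 proxy in the GSI-based model. The issuer URL, client credentials and scope string are placeholders, not real WLCG endpoints or the WLCG token schema.

```python
# Hedged sketch of an OAuth 2.0 client-credentials token request.
import requests

TOKEN_ENDPOINT = "https://issuer.example.org/token"   # hypothetical token issuer

def fetch_token(client_id, client_secret, scope="storage.read:/"):
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={"grant_type": "client_credentials", "scope": scope},
        auth=(client_id, client_secret),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def authorized_get(url, token):
    # The bearer token authorizes the request instead of a certificate proxy.
    return requests.get(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
```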
GlideinWMS is a distributed workload manager that has been used in production for many years to provision resources for experiments like CERN's CMS, many Neutrino experiments, and the OSG. Its security model was based mainly on GSI (Grid Security Infrastructure), using x509 certificate proxies and VOMS (Virtual Organization Membership Service) extensions. Even if other credentials, like ssh...
The CMS Submission Infrastructure (SI) is the main computing resource provisioning system for CMS workloads. A number of HTCondor pools are employed to manage this infrastructure, which aggregates geographically distributed resources from the WLCG and other providers. Historically, the model of authentication among the diverse components of this infrastructure has relied on the Grid Security...
DIRAC is the interware for building and operating large scale distributed computing systems. It is adopted by multiple collaborations from various scientific domains for implementing their computing models.
DIRAC provides a framework and a rich set of ready-to-use services for Workload, Data and Production Management tasks of small, medium and large scientific communities having different...
The Electron Ion Collider (EIC) collaboration and future experiment is a unique scientific ecosystem within Nuclear Physics, as the experiment starts right off as a cross-collaboration between Brookhaven National Lab (BNL) and Jefferson Lab (JLab). As a result, this multi-lab computing model does its best to provide services accessible from anywhere by anyone who is part of the collaboration. While...
Rucio, the data management software initially developed for ATLAS, has been in use at Belle II since January 2021. After the transition to Rucio, new features and functionality were implemented in Belle II grid tools based on Rucio, to improve the experience of grid users. The container structure in the Rucio File Catalog enabled us to define collections of arbitrary datasets, allowing the...
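The container structure mentioned above can be sketched with the Rucio client API: a container groups arbitrary datasets so they can be handled as one collection. The scope and DID names are invented examples, and the snippet assumes a configured Rucio client environment rather than the actual Belle II setup.

```python
# Sketch: build a container of datasets in the Rucio File Catalog.
from rucio.client import Client

client = Client()                     # assumes a configured rucio.cfg / proxy

client.add_container(scope="user.alice", name="analysis_2023_collection")
datasets = ("data_proc_v1", "mc_signal_v1")
for ds in datasets:
    client.add_dataset(scope="user.alice", name=ds)

# Attach the datasets to the container so they are resolved as one collection.
client.attach_dids(
    scope="user.alice",
    name="analysis_2023_collection",
    dids=[{"scope": "user.alice", "name": ds} for ds in datasets],
)
```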
A critical challenge of performing data transfers or remote reads is being as fast and efficient as possible while keeping the usage of system resources as low as possible. Ideally, the software that manages these data transfers should be able to organize them so that they run up to the hardware limits. Significant portions of LHC analysis use the same datasets,...
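Because many analyses read the same datasets, a local cache can absorb repeated accesses. The toy wrapper below fetches a remote object once and serves subsequent reads from disk; the cache directory and the use of a plain HTTP fetch are illustrative assumptions, not the tool described in the contribution.

```python
# Toy dataset cache: fetch each remote object at most once.
import hashlib
import pathlib
import shutil
import urllib.request

CACHE_DIR = pathlib.Path("/tmp/dataset_cache")      # assumed local cache location

def cached_fetch(url):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if not local.exists():                          # only hit the network once per object
        with urllib.request.urlopen(url) as src, open(local, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return local                                    # path to the locally cached copy
```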
In recent years, advanced and complex analysis workflows have gained increasing importance in the ATLAS experiment at CERN, one of the large scientific experiments at the Large Hadron Collider (LHC). Support for such workflows has allowed users to exploit remote computing resources and service providers distributed worldwide, overcoming limitations on local resources and services. The spectrum...
The computing resources supporting the LHC experiments' research programmes are still dominated by x86 processors deployed at WLCG sites. This will, however, evolve in the coming years, as a growing number of HPC and Cloud facilities will be employed by the collaborations to process the vast amounts of data to be collected in LHC Run 3 and into the HL-LHC phase. Compute power in...
Cloudscheduler is a system that manages the resources of local and remote compute clouds and makes them available to HTCondor pools. It examines the resource needs of idle jobs, then starts virtual machines (VMs) sized to suit those needs on allowed clouds with available resources. Using YAML files, cloudscheduler then provisions the VMs during the boot process with all necessary...
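The sizing step, picking a VM flavour that satisfies the requirements of idle jobs, can be reduced to a simple matching rule as sketched below. The flavour list and job records are invented examples, not cloudscheduler's actual data model.

```python
# Simplified VM-flavour matching for idle jobs.
FLAVOURS = [                                   # (name, cores, memory in GB), smallest first
    ("small", 2, 4),
    ("medium", 4, 8),
    ("large", 8, 16),
]

def pick_flavour(job):
    """Return the smallest flavour able to run the job, or None if none fits."""
    for name, cores, mem_gb in FLAVOURS:
        if cores >= job["cpus"] and mem_gb >= job["memory_gb"]:
            return name
    return None

idle_jobs = [{"cpus": 1, "memory_gb": 2}, {"cpus": 8, "memory_gb": 12}]
for job in idle_jobs:
    print(job, "->", pick_flavour(job))
```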
The Vera C. Rubin Observatory will produce an unprecedented astronomical data set for studies of the deep and dynamic universe. Its Legacy Survey of Space and Time (LSST) will image the entire southern sky every three days and produce tens of petabytes of raw image data and associated calibration data. More than 20 terabytes of data must be processed and stored every night for ten...
The Vera C. Rubin Observatory, currently under construction in Chile, will start the Legacy Survey of Space and Time (LSST) in late 2024 and run it for 10 years. Its 8.4-meter telescope will survey the southern sky in less than 4 nights in six optical bands, repeatedly generating about 2000 exposures per night, corresponding to a data volume of about 20 TB every night. Three data facilities are...
The Large High Altitude Air Shower Observatory (LHAASO) is a large-scale astrophysics experiment led by China. Its offline data processing has been highly dependent on the Institute of High Energy Physics (IHEP) local cluster and the local file system.
As the resources of the LHAASO collaborating groups are geographically distributed and most of them are limited in scale, with low...
The Cherenkov Telescope Array Observatory (CTAO) is the next generation ground-based observatory for gamma-ray astronomy at very high energies. It will consist of tens of Cherenkov telescopes, spread between two array sites: one in the Northern hemisphere in La Palma (Spain), and one in the Southern hemisphere in Paranal (Chile). Currently under construction, CTAO will start scientific...
In preparation for LHC Run 3 and 4, the ALICE Collaboration has moved to a new Grid middleware, JAliEn, and workflow management system. The migration was dictated by the substantially higher requirements on the Grid infrastructure in terms of payload complexity, increased number of jobs and managed data volume, all of which required a complete rewrite of the middleware using modern software...
The ALICE Grid is designed to perform comprehensive real-time monitoring of both jobs and execution nodes in order to maintain a continuous and consistent view of the status of the Grid infrastructure. An extensive database of historical data is available and is periodically analyzed to tune the workflow and data management to optimal performance levels. This data, when evaluated in real time, has the...
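One simple form of such real-time evaluation is flagging sites whose current behaviour deviates strongly from their recent history. The rolling-window check below illustrates the idea on per-site job failure rates; the window size, thresholds and data feed are illustrative assumptions, not the ALICE monitoring implementation.

```python
# Rolling-window anomaly flag on per-site failure rates.
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW = 30                                     # number of recent samples kept per site
history = defaultdict(lambda: deque(maxlen=WINDOW))

def update(site, failure_rate, n_sigma=3.0):
    """Record a new failure-rate sample and report whether it looks anomalous."""
    past = history[site]
    anomalous = False
    if len(past) >= 10:                         # require some history before judging
        mu, sigma = mean(past), pstdev(past)
        anomalous = sigma > 0 and abs(failure_rate - mu) > n_sigma * sigma
    past.append(failure_rate)
    return anomalous
```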