Conveners
Track 4 - Distributed Computing: Analysis Workflows, Modeling and Optimisation
- Fernando Barreiro Megino (Unive)
- Rohini Joshi (SKAO)
Track 4 - Distributed Computing: Computing Strategies and Evolution
- Hideki Miyake (KEK/IPNS)
- Fernando Barreiro Megino (Unive)
Track 4 - Distributed Computing: Infrastructure and Services
- Katy Ellis (STFC-RAL)
- Fernando Barreiro Megino (Unive)
Track 4 - Distributed Computing: Monitoring, Testing and Analytics
- Hideki Miyake (KEK/IPNS)
- Katy Ellis (STFC-RAL)
Track 4 - Distributed Computing: Security and Tokens
- Katy Ellis (STFC-RAL)
- Fernando Barreiro Megino (Unive)
Track 4 - Distributed Computing: Distributed Storage and Computing Resources
- Rohini Joshi (SKAO)
- Hideki Miyake (KEK/IPNS)
Track 4 - Distributed Computing: Workload Management
- Rohini Joshi (SKAO)
- Katy Ellis (STFC-RAL)
Machine learning has become one of the most important tools for High Energy Physics analysis. As dataset sizes at the Large Hadron Collider (LHC) grow and search spaces expand to fully exploit the physics potential, more and more computing resources are required to process these machine learning tasks. In addition, complex advanced...
We present a new implementation of simulation-based inference using data collected by the ATLAS experiment at the LHC. The method relies on large ensembles of deep neural networks to approximate the exact likelihood. Additional neural networks are introduced to model systematic uncertainties in the measurement. Training of the large number of deep neural networks is automated using a...
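The core idea above, approximating a likelihood with an ensemble of neural networks, can be illustrated with a minimal sketch based on the classifier likelihood-ratio trick. The toy Gaussian "simulator", network sizes and parameter values below are assumptions for illustration only, not the ATLAS implementation.

```python
# Minimal sketch: an ensemble of classifiers approximates the log likelihood
# ratio between two parameter points via log s/(1-s). Toy data only.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def sample(theta, n=5000):
    """Toy 'simulator': 1D Gaussian whose mean depends on the parameter."""
    return rng.normal(loc=theta, scale=1.0, size=(n, 1))

def train_ensemble(theta0, theta1, n_members=10):
    """Train several classifiers to separate samples drawn at theta0 and theta1."""
    members = []
    for _ in range(n_members):
        x0, x1 = sample(theta0), sample(theta1)
        X = np.vstack([x0, x1])
        y = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
        clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500)
        clf.fit(X, y)
        members.append(clf)
    return members

def log_likelihood_ratio(members, x):
    """Average the per-member estimates of log p(x|theta1)/p(x|theta0)."""
    ratios = []
    for clf in members:
        s = np.clip(clf.predict_proba(x)[:, 1], 1e-6, 1 - 1e-6)
        ratios.append(np.log(s / (1.0 - s)))
    return np.mean(ratios, axis=0)

ensemble = train_ensemble(theta0=0.0, theta1=0.5)
observed = sample(0.5, n=100)
print("summed log-LR on observed data:", log_likelihood_ratio(ensemble, observed).sum())
```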
Predicting the performance of various infrastructure design options in complex federated infrastructures with computing sites distributed over a wide area that support a plethora of users and workflows, such as the Worldwide LHC Computing Grid (WLCG), is not trivial. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scales...
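As a toy illustration of the kind of modelling referred to above, the sketch below runs a tiny discrete-event simulation and compares the predicted makespan of a fixed job mix for different numbers of worker slots. The slot counts and job-duration distribution are invented for illustration.

```python
# Toy discrete-event simulation: jobs occupy the earliest available slot,
# and the predicted makespan is compared across design options.
import heapq, random

def simulate(slots, n_jobs=1000, seed=1):
    random.seed(seed)
    durations = [random.expovariate(1 / 3600.0) for _ in range(n_jobs)]  # seconds
    free_at = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish / 3600.0               # makespan in hours

for slots in (100, 200, 400):
    print(f"{slots} slots -> predicted makespan {simulate(slots):.1f} h")
```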
InterTwin is an EU-funded project that started on 1 September 2022. The project will work with experts from different scientific domains to build a technology supporting digital twins within scientific research. Digital twins are models for predicting the behaviour and evolution of real-world systems and applications.
InterTwin will focus on employing machine-learning...
The IceCube Neutrino Observatory is a cubic kilometer neutrino telescope located at the geographic South Pole. To accurately and promptly reconstruct the arrival direction of candidate neutrino events for Multi-Messenger Astrophysics use cases, IceCube employs Skymap Scanner workflows managed by the SkyDriver service. The Skymap Scanner performs maximum-likelihood tests on individual pixels...
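The pixel-by-pixel maximum-likelihood scan described above can be sketched schematically as below. The coarse pixel grid, the "observed" direction and the Gaussian stand-in for the per-pixel test statistic are placeholder assumptions; the process pool merely stands in for the distributed workers orchestrated by SkyDriver.

```python
# Schematic pixel-wise likelihood scan over a toy sky grid.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

# Toy "pixels": a coarse grid of candidate source directions (RA, Dec in degrees).
PIXELS = [(ra, dec) for ra in range(0, 360, 10) for dec in range(-80, 90, 10)]
TRUE_DIR = np.array([120.0, -30.0])      # pretend reconstructed event direction

def neg_log_likelihood(pixel):
    """Toy per-pixel test statistic: quadratic distance to the observed direction."""
    d = np.linalg.norm(np.array(pixel) - TRUE_DIR)
    return pixel, 0.5 * d ** 2

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:   # stand-in for distributed scan workers
        results = list(pool.map(neg_log_likelihood, PIXELS))
    best_pixel, best_nll = min(results, key=lambda r: r[1])
    print("best-fit pixel:", best_pixel, "with -logL =", best_nll)
```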
A fast turn-around time and ease of use are important factors for systems supporting the analysis of large HEP data samples. We study and compare multiple technical approaches.
This presentation covers the setup and benchmarking of the Analysis Grand Challenge (AGC) [1] using CMS Open Data. The AGC is an effort to provide a realistic physics analysis with the intent of showcasing the...
The Jiangmen Underground Neutrino Observatory (JUNO) is a multipurpose neutrino experiment whose primary physics goal is the determination of the neutrino mass hierarchy. JUNO will start taking data in 2024, producing 2 PB of raw data each year, and will use a distributed computing infrastructure for simulation, reconstruction and analysis tasks. The JUNO distributed computing system has been built up based on...
The discovery of gravitational waves, first observed in September 2015 following the merger of a binary black hole system, has already revolutionised our understanding of the Universe. This was further enhanced in August 2017, when the coalescence of a binary neutron star system was observed both with gravitational waves and a variety of electromagnetic counterparts; this joint observation...
The LIGO, VIRGO and KAGRA Gravitational-wave (GW) observatories are getting ready for their fourth observational period, O4, scheduled to begin in March 2023, with improved sensitivities and thus higher event rates.
GW-related computing has both large commonalities with HEP computing, particularly in the domain of offline data processing and analysis, and important differences, for example in...
The HL-LHC run is anticipated to start at the end of this decade and will pose a significant challenge for the scale of the HEP software and computing infrastructure. The mission of the U.S. CMS Software & Computing Operations Program is to develop and operate the software and computing resources necessary to process CMS data expeditiously and to enable U.S. physicists to fully participate in...
The computing challenges at HL-LHC require fundamental changes to the distributed computing models that have served experiments well throughout LHC. ATLAS planning for HL-LHC computing started back in 2020 with a Conceptual Design Report outlining various challenges to explore. This was followed in 2022 by a roadmap defining concrete milestones and associated effort required. Today, ATLAS is...
In this talk, we discuss the evolution of the computing model of the ATLAS experiment at the LHC. After LHC Run 1, it became obvious that the available computing resources at the WLCG were fully used. The processing queue could reach millions of jobs during peak loads, for example before major scientific conferences and during large scale data processing. The unprecedented performance of the...
We present a collection of tools and processes that facilitate onboarding a new science collaboration onto the OSG Fabric of Services. Such collaborations typically rely on computational workflows for simulations and analysis that are ideal for executing on OSG's distributed High Throughput Computing environment (dHTC). The produced output can be accumulated and aggregated at available...
There is no lack of approaches for managing the deployment of distributed services – in the last 15 years of running distributed infrastructure, the OSG Consortium has seen many of them. One persistent problem has been that each physical site has its own style of configuration management and service operations, leading to a partitioning of staff knowledge and inflexibility in migrating services...
The CernVM File System (CVMFS) provides the software distribution backbone for High Energy and Nuclear Physics experiments and many other scientific communities in the form of a globally available shared software area. It has been designed for the software distribution problem of experiment software for LHC Runs 1 and 2. For LHC Run 3 and even more so for HL-LHC (Runs 4-6), the complexity of...
The increasing computational demand in High Energy Physics (HEP), as well as increasing concerns about energy efficiency in high-performance/high-throughput computing, are driving forces in the search for more efficient ways to utilize available resources. Since avoiding idle resources is key to achieving high efficiency, an appropriate measure is to share the idle resources of under-utilized sites...
The JIRIAF project aims to combine geographically diverse computing facilities into an integrated science infrastructure. The project starts by dynamically evaluating temporarily unallocated or idle compute resources from multiple providers. These resources are integrated to handle additional workloads without affecting local running jobs. This paper describes our approach to launch...
The Worldwide LHC Computing Grid (WLCG) is a large-scale collaboration which gathers the computing resources of around 170 computing centres from more than 40 countries. The grid paradigm, unique to the realm of high energy physics, has successfully supported a broad variety of scientific achievements. To fulfil the requirements of new applications and to improve the long-term sustainability...
Data taking at the Large Hadron Collider (LHC) at CERN restarted in 2022. The CMS experiment relies on a distributed computing infrastructure based on WLCG (Worldwide LHC Computing Grid) to support the LHC Run 3 physics program. The CMS computing infrastructure is highly heterogeneous and relies on a set of centrally provided services, such as distributed workload management and data...
Monitoring services play a crucial role in the day-to-day operation of distributed computing systems. The ATLAS experiment at LHC uses the production and distributed analysis workload management system (PanDA WMS), which allows a million computational jobs to run daily at over 170 computing centers of the WLCG and other opportunistic resources, utilizing 600k cores simultaneously on average....
The ALICE experiment at the CERN Large Hadron Collider relies on a massive, distributed Computing Grid for its data processing. The ALICE Computing Grid is built by combining a large number of individual computing sites distributed globally. These Grid sites are maintained by different institutions across the world and contribute thousands of worker nodes possessing different capabilities and...
HammerCloud (HC) is a testing service and framework for continuous functional tests, on-demand large-scale stress tests, and performance benchmarks. It checks the computing resources and various components of distributed systems with realistic full-chain experiment workflows.
The HammerCloud software was initially developed in Python 2. After support for Python 2 was discontinued in 2020,...
Operational analytics is a research direction concerned with analysing the current state of computing processes and predicting their future behaviour in order to anticipate imbalances and take timely measures to stabilize a complex system. There are two relevant areas in ATLAS Distributed Computing that are currently the focus of studies: end-user physics analysis, including the forecast...
For LHC Run 3 the ALICE experiment software stack has been completely refactored, incorporating support for multicore job execution. The new multicore jobs spawn multiple processes and threads within the payload. Given that some of the deployed processes may be short-lived, accounting for their resource consumption presents a challenge. This article presents the newly developed methodology for...
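The accounting problem outlined above can be illustrated with a small process-tree sampler: the tree under the payload PID is polled periodically and the last observed CPU time of every child is retained, so that processes which exit between samples still contribute what was seen before they died. This is only an illustration of the difficulty (processes shorter than one sampling interval are still missed), not the ALICE methodology; the interval and duration are arbitrary.

```python
# Sketch of process-tree CPU accounting for a multi-process payload.
import time
import psutil

def account(payload_pid, interval=5.0, duration=60.0):
    seen_cpu = {}                               # pid -> last observed CPU seconds
    root = psutil.Process(payload_pid)
    deadline = time.time() + duration
    while time.time() < deadline:
        for proc in [root] + root.children(recursive=True):
            try:
                t = proc.cpu_times()
                seen_cpu[proc.pid] = t.user + t.system
            except psutil.NoSuchProcess:
                pass                            # exited between listing and sampling
        time.sleep(interval)
    return sum(seen_cpu.values())               # total CPU seconds attributed to the payload
```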
No single organisation has the resources to defend its services alone against most modern malicious actors and so we must protect ourselves as a community. In the face of determined and well-resourced attackers, we must actively collaborate in this effort across HEP and more broadly across Research and Education (R&E).
Parallel efforts are necessary to appropriately respond to this...
In 2022, CERN ran its annual phishing campaign in which 2000 users gave away their passwords (Note: this number is in line with results of campaigns at other organisations). In a real phishing incident this would have meant 2000 compromised accounts... unless they were protected by Two-Factor Authentication (2FA)! In the same year, CERN introduced 2FA for accounts with access to critical...
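To make the mechanism concrete, the sketch below shows a time-based one-time password (TOTP), a common second factor: even a phished password is useless without the current code from the user's device. It uses the pyotp library purely as an example; the secret is generated on the spot and nothing here reflects CERN's actual 2FA deployment.

```python
# Minimal TOTP illustration with pyotp.
import pyotp

secret = pyotp.random_base32()        # enrolled once, stored in the authenticator app
totp = pyotp.TOTP(secret)

code = totp.now()                     # what the user reads off their device
print("valid:", totp.verify(code))    # server-side check of the submitted code
```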
Since 2017, the Worldwide LHC Computing Grid (WLCG) has been working towards enabling token-based authentication and authorization throughout its entire middleware stack. Following the initial publication of the WLCG v1.0 Token Schema in 2019, work has been done to integrate OAuth2.0 token flows across the Grid middleware. There are many complex challenges to be addressed before the WLCG can...
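A minimal sketch of one of the OAuth 2.0 flows involved, the client-credentials grant, is shown below: a service obtains a bearer token from an issuer and attaches it to requests, taking the place of an X.509 proxy in the GSI-based model. The issuer URL, client credentials and scope string are placeholders, not real WLCG endpoints or the WLCG token schema.

```python
# Hedged sketch of an OAuth 2.0 client-credentials token request.
import requests

TOKEN_ENDPOINT = "https://issuer.example.org/token"   # hypothetical token issuer

def fetch_token(client_id, client_secret, scope="storage.read:/"):
    resp = requests.post(
        TOKEN_ENDPOINT,
        data={"grant_type": "client_credentials", "scope": scope},
        auth=(client_id, client_secret),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def authorized_get(url, token):
    # The bearer token authorizes the request instead of a certificate proxy.
    return requests.get(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
```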
GlideinWMS is a distributed workload manager that has been used in production for many years to provision resources for experiments like CERN's CMS, many Neutrino experiments, and the OSG. Its security model was based mainly on GSI (Grid Security Infrastructure), using x509 certificate proxies and VOMS (Virtual Organization Membership Service) extensions. Even if other credentials, like ssh...
The CMS Submission Infrastructure (SI) is the main computing resource provisioning system for CMS workloads. A number of HTCondor pools are employed to manage this infrastructure, which aggregates geographically distributed resources from the WLCG and other providers. Historically, the model of authentication among the diverse components of this infrastructure has relied on the Grid Security...
DIRAC is the interware for building and operating large scale distributed computing systems. It is adopted by multiple collaborations from various scientific domains for implementing their computing models.
DIRAC provides a framework and a rich set of ready-to-use services for Workload, Data and Production Management tasks of small, medium and large scientific communities having different...
The Electron Ion Collider (EIC) collaboration and future experiment is a unique scientific ecosystem within Nuclear Physics, as the experiment starts right off as a cross-collaboration between Brookhaven National Lab (BNL) and Jefferson Lab (JLab). As a result, this multi-lab computing model does its best to provide services accessible from anywhere by anyone who is part of the collaboration. While...
Rucio, the data management software initially developed for ATLAS, has been in use at Belle II since January 2021. After the transition to Rucio, new features and functionality were implemented in Belle II grid tools based on Rucio, to improve the experience of grid users. The container structure in the Rucio File Catalog enabled us to define collections of arbitrary datasets, allowing the...
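The container structure mentioned above can be sketched with the Rucio client API: a container groups arbitrary datasets so they can be handled as one collection. The scope and DID names are invented examples, and the snippet assumes a configured Rucio client environment rather than the actual Belle II setup.

```python
# Sketch: build a container of datasets in the Rucio File Catalog.
from rucio.client import Client

client = Client()                     # assumes a configured rucio.cfg / proxy

client.add_container(scope="user.alice", name="analysis_2023_collection")
datasets = ("data_proc_v1", "mc_signal_v1")
for ds in datasets:
    client.add_dataset(scope="user.alice", name=ds)

# Attach the datasets to the container so they are resolved as one collection.
client.attach_dids(
    scope="user.alice",
    name="analysis_2023_collection",
    dids=[{"scope": "user.alice", "name": ds} for ds in datasets],
)
```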
A critical challenge of performing data transfers or remote reads is being as fast and efficient as possible while keeping the usage of system resources as low as possible. Ideally, the software that manages these data transfers should be able to organize them so that they run up to the hardware limits. Significant portions of LHC analysis use the same datasets,...
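Because many analyses read the same datasets, a local cache can absorb repeated accesses. The toy wrapper below fetches a remote object once and serves subsequent reads from disk; the cache directory and the use of a plain HTTP fetch are illustrative assumptions, not the tool described in the contribution.

```python
# Toy dataset cache: fetch each remote object at most once.
import hashlib
import pathlib
import shutil
import urllib.request

CACHE_DIR = pathlib.Path("/tmp/dataset_cache")      # assumed local cache location

def cached_fetch(url):
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if not local.exists():                          # only hit the network once per object
        with urllib.request.urlopen(url) as src, open(local, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return local                                    # path to the locally cached copy
```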
In recent years, advanced and complex analysis workflows have gained increasing importance in the ATLAS experiment at CERN, one of the large scientific experiments at the Large Hadron Collider (LHC). Support for such workflows has allowed users to exploit remote computing resources and service providers distributed worldwide, overcoming limitations on local resources and services. The spectrum...
The computing resources supporting the LHC experiments' research programmes are still dominated by x86 processors deployed at WLCG sites. This will, however, evolve in the coming years, as a growing number of HPC and Cloud facilities will be employed by the collaborations to process the vast amounts of data to be collected in LHC Run 3 and into the HL-LHC phase. Compute power in...
Cloudscheduler is a system that manages the resources of local and remote compute clouds and makes them available to HTCondor pools. It examines the resource needs of idle jobs, then starts virtual machines (VMs) sized to suit those needs on allowed clouds with available resources. Using YAML files, cloudscheduler then provisions the VMs during the boot process with all necessary...
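The sizing step, picking a VM flavour that satisfies the requirements of idle jobs, can be reduced to a simple matching rule as sketched below. The flavour list and job records are invented examples, not cloudscheduler's actual data model.

```python
# Simplified VM-flavour matching for idle jobs.
FLAVOURS = [                                   # (name, cores, memory in GB), smallest first
    ("small", 2, 4),
    ("medium", 4, 8),
    ("large", 8, 16),
]

def pick_flavour(job):
    """Return the smallest flavour able to run the job, or None if none fits."""
    for name, cores, mem_gb in FLAVOURS:
        if cores >= job["cpus"] and mem_gb >= job["memory_gb"]:
            return name
    return None

idle_jobs = [{"cpus": 1, "memory_gb": 2}, {"cpus": 8, "memory_gb": 12}]
for job in idle_jobs:
    print(job, "->", pick_flavour(job))
```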
The Vera C. Rubin Observatory will produce an unprecedented astronomical data set for studies of the deep and dynamic universe. Its Legacy Survey of Space and Time (LSST) will image the entire southern sky every three days and produce tens of petabytes of raw image data and associated calibration data. More than 20 terabytes of data must be processed and stored every night for ten...
The Vera C. Rubin Observatory, currently under construction in Chile, will start the Legacy Survey of Space and Time (LSST) in late 2024 and run it for 10 years. Its 8.4-meter telescope will survey the southern sky in less than 4 nights in six optical bands, repeatedly generating about 2000 exposures per night, corresponding to a data volume of about 20 TB every night. Three data facilities are...
The Large High Altitude Air Shower Observatory (LHAASO) is a large-scale astrophysics experiment led by China. Its offline data processing has been highly dependent on the Institute of High Energy Physics (IHEP) local cluster and the local file system.
As the resources of the LHAASO collaborating groups are geographically distributed and most of them are limited in scale, with low...
The Cherenkov Telescope Array Observatory (CTAO) is the next generation ground-based observatory for gamma-ray astronomy at very high energies. It will consist of tens of Cherenkov telescopes, spread between two array sites: one in the Northern hemisphere in La Palma (Spain), and one in the Southern hemisphere in Paranal (Chile). Currently under construction, CTAO will start scientific...
In preparation for LHC Run 3 and 4, the ALICE Collaboration has moved to a new Grid middleware, JAliEn, and workflow management system. The migration was dictated by the substantially higher requirements on the Grid infrastructure in terms of payload complexity, increased number of jobs and managed data volume, all of which required a complete rewrite of the middleware using modern software...
The ALICE Grid is designed to perform comprehensive real-time monitoring of both jobs and execution nodes in order to maintain a continuous and consistent view of the status of the Grid infrastructure. An extensive database of historical data is available and is periodically analyzed to tune the workflow and data management to optimal performance levels. This data, when evaluated in real time, has the...
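One simple form of such real-time evaluation is flagging sites whose current behaviour deviates strongly from their recent history. The rolling-window check below illustrates the idea on per-site job failure rates; the window size, thresholds and data feed are illustrative assumptions, not the ALICE monitoring implementation.

```python
# Rolling-window anomaly flag on per-site failure rates.
from collections import defaultdict, deque
from statistics import mean, pstdev

WINDOW = 30                                     # number of recent samples kept per site
history = defaultdict(lambda: deque(maxlen=WINDOW))

def update(site, failure_rate, n_sigma=3.0):
    """Record a new failure-rate sample and report whether it looks anomalous."""
    past = history[site]
    anomalous = False
    if len(past) >= 10:                         # require some history before judging
        mu, sigma = mean(past), pstdev(past)
        anomalous = sigma > 0 and abs(failure_rate - mu) > n_sigma * sigma
    past.append(failure_rate)
    return anomalous
```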