Conveners
Track 7 - Facilities and Virtualization: Dynamic Provisioning and Anything-As-A-Service
- Tomoe Kishimoto (KEK)
- Derek Weitzel (University of Nebraska-Lincoln)
Track 7 - Facilities and Virtualization: Analysis Facilities
- Derek Weitzel (University of Nebraska-Lincoln)
- Tomoe Kishimoto (KEK)
Track 7 - Facilities and Virtualization: Computing Centre Infrastructure
- Arne Wiebalck (CERN)
- Derek Weitzel (University of Nebraska-Lincoln)
Track 7 - Facilities and Virtualization: Computing Centre Infrastructure and Cloud
- Derek Weitzel (University of Nebraska-Lincoln)
- Arne Wiebalck (CERN)
Track 7 - Facilities and Virtualization: Networking
- Tomoe Kishimoto (KEK)
- Derek Weitzel (University of Nebraska-Lincoln)
Track 7 - Facilities and Virtualization: HPC and Deployment
- Arne Wiebalck (CERN)
- Tomoe Kishimoto (KEK)
Track 7 - Facilities and Virtualization: Deployment, Management and Monitoring
- Tomoe Kishimoto (KEK)
- Arne Wiebalck (CERN)
Managing a secure software environment is essential to a trustworthy cyberinfrastructure. Software supply chain attacks may be a top concern for IT departments, but they are also a concern for scientific computing. The threat to scientific reputation caused by problematic software can be just as dangerous as an environment contaminated with malware. The issue of managing environments affects...
New particle and nuclear physics experiments require a massive amount of computing power that can only be achieved by using high-performance clusters directly connected to the data acquisition systems and integrated into the online systems of the experiments. However, integrating an HPC cluster into the online system of an experiment means managing and synchronizing thousands of processes that handle...
PUNCH4NFDI, funded by the German Research Foundation initially for five years, is a diverse consortium of particle, astro-, astroparticle, hadron and nuclear physics communities embedded in the National Research Data Infrastructure initiative.
In order to provide seamless and federated access to the huge variety of compute and storage systems provided by the participating communities covering their...
Nowadays Machine Learning (ML) techniques are successfully used in many areas of High-Energy Physics (HEP), and they will also play a significant role in the upcoming High-Luminosity LHC upgrade foreseen at CERN, when a huge amount of data will be produced by the LHC and collected by the experiments, posing challenges at the exascale. To favor the usage of ML in HEP analyses, it would be useful to have...
The OSG-operated Open Science Pool is an HTCondor-based virtual cluster that aggregates resources from compute clusters provided by several organizations. A user can submit batch jobs to the OSG-maintained scheduler, and they will eventually run on a combination of supported compute clusters without any further user action. Most of the resources are not owned by, or even dedicated to, OSG, so...
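To make the submission model concrete, here is a minimal sketch using the HTCondor Python bindings; the executable, arguments and resource requests are placeholder assumptions, not OSG-specific settings.

```python
# Minimal sketch: describe and queue one batch job via the HTCondor Python bindings.
# Executable, arguments and resource requests are illustrative placeholders.
import htcondor

job = htcondor.Submit({
    "executable": "analyze.sh",        # hypothetical user payload
    "arguments": "input_0001.root",
    "request_cpus": "1",
    "request_memory": "2GB",
    "request_disk": "4GB",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()             # the access point the user submits to
result = schedd.submit(job, count=1)   # queue a single instance
print("submitted cluster", result.cluster())
```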
The Vera C. Rubin Observatory is preparing for the execution of the most ambitious astronomical survey ever attempted, the Legacy Survey of Space and Time (LSST). Currently in its final phase of construction in the Andes mountains in Chile and due to start operations in late 2024 for 10 years, its 8.4-meter telescope will scan the southern sky nightly and collect images of the entire visible...
The increasingly larger data volumes that the LHC experiments will accumulate in the coming years, especially in the High-Luminosity LHC era, call for a paradigm shift in the way experimental datasets are accessed and analyzed. The current model, based on data reduction on the Grid infrastructure, followed by interactive data analysis of manageable size samples on the physicists’ individual...
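As a toy illustration of the interactive, short-turnaround style this paradigm shift points towards (a sketch only; the dataset, tree and branch names are invented, and this is not any experiment's actual framework):

```python
# Toy interactive analysis with ROOT RDataFrame from Python.
# "Events", "sample.root" and the branch names are invented placeholders.
import ROOT

df = ROOT.RDataFrame("Events", "sample.root")        # lazily open the dataset
sel = df.Filter("nMuon >= 2", "at least two muons")  # event selection
hist = sel.Histo1D(("mass", "Dimuon mass;m [GeV];Events", 100, 0.0, 150.0),
                   "Dimuon_mass")
print("selected events:", sel.Count().GetValue())    # triggers the event loop
hist.Draw()
```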
Prior to the start of the LHC Run 3, the US ATLAS Software and Computing operations program established three shared Tier 3 Analysis Facilities (AFs). The newest AF was established at the University of Chicago in the past year, joining the existing AFs at Brookhaven National Lab and SLAC National Accelerator Lab. In this paper, we will describe both the common and unique aspects of these three...
Effective analysis computing requires rapid turnaround times in order to enable frequent iteration, adjustment, and exploration, leading to discovery. An informal goal of reducing 10TB of experimental data in about ten minutes using campus-scale computing infrastructure is achievable, considering raw hardware capability alone. However, compared to production computing, which seeks to...
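The raw-hardware argument can be made concrete with a back-of-the-envelope calculation; the node count and derived rates below are our own assumed, illustrative numbers, not figures from the paper.

```python
# Back-of-the-envelope check: reducing 10 TB in about ten minutes.
# The node count is an assumed, illustrative value.
data_bytes = 10e12
seconds = 10 * 60

aggregate_gbps = data_bytes / seconds / 1e9          # ~16.7 GB/s overall
print(f"aggregate throughput needed: {aggregate_gbps:.1f} GB/s")

nodes = 200                                          # assumed campus-scale cluster
per_node_mbps = aggregate_gbps / nodes * 1000        # ~83 MB/s per node
print(f"per-node rate with {nodes} nodes: {per_node_mbps:.0f} MB/s")
```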
The INFN Cloud project was launched at the beginning of 2020, aiming to build a distributed Cloud infrastructure and provide advanced services for the INFN scientific communities. A Platform as a Service (PaaS) was created inside INFN Cloud that allows the experiments to develop and access resources as a Software as a Service (SaaS), and CYGNO is the beta-tester of this system. The aim of the...
The recent evolutions of the analysis frameworks and physics data formats of the LHC experiments provide the opportunity of using central analysis facilities with a strong focus on interactivity and short turnaround times, to complement the more common distributed analysis on the Grid. In order to plan for such facilities, it is essential to know in detail the performance of the combination of...
Computational science, data management and analysis have been key factors in the success of Brookhaven National Laboratory's scientific programs at the Relativistic Heavy Ion Collider (RHIC), the National Synchrotron Light Source (NSLS-II), the Center for Functional Nanomaterials (CFN), and in biological, atmospheric, and energy systems science, Lattice Quantum Chromodynamics (LQCD) and...
Moving towards Net-Zero requires robust information to enable good decision making at all levels: covering hardware procurement, workload management and operations, as well as higher level aspects encompassing grant funding processes and policy framework development.
The IRISCAST project is a proof-of-concept study funded as part of the UKRI Net-Zero Scoping Project. We have performed an...
LUX-ZEPLIN (LZ) is a direct detection dark matter experiment currently operating at the Sanford Underground Research Facility (SURF) in Lead, South Dakota. The core component is a liquid xenon time projection chamber with an active mass of 7 tonnes.
To meet the performance, availability, and security requirements for the LZ DAQ, Online, Slow Control and data transfer systems located at SURF,...
Recent years have seen an increasing interest in the environmental impact, especially the carbon footprint, generated by the often large scale computing facilities used by the communities represented at CHEP. As this is a fairly new requirement, this information is not always readily available, especially at universities and similar institutions which do not necessarily see large scale...
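Where measured numbers are missing, a first-order estimate can be assembled from quantities most sites do track; the sketch below uses assumed, illustrative values for IT load, PUE and grid carbon intensity.

```python
# First-order carbon-footprint estimate for a computing facility.
# All inputs are assumed placeholders; real PUE and grid intensity vary by site.
it_power_kw = 250          # average IT load
pue = 1.5                  # power usage effectiveness (cooling, losses, ...)
grid_gco2_per_kwh = 230    # grid carbon intensity in gCO2e/kWh

energy_kwh = it_power_kw * pue * 24 * 365
co2_tonnes = energy_kwh * grid_gco2_per_kwh / 1e6

print(f"energy: {energy_kwh / 1e6:.2f} GWh/year, footprint: {co2_tonnes:.0f} tCO2e/year")
```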
The INFN Tier1 data center is currently hosted on the premises of the Physics Department of the University of Bologna, where CNAF is also located. During 2023 it will be moved to the “Tecnopolo”, the new facility for research, innovation, and technological development in the same city area; the same location also hosts Leonardo, the pre-exascale supercomputing machine managed by CINECA,...
Queen Mary University of London (QMUL), as part of the refurbishment of one of its data centres, has installed water-to-water heat pumps to use the heat produced by the computing servers to provide heat for the university via a district heating system. This will enable us to reduce the use of high-carbon-intensity natural gas heating boilers, replacing them with electricity, which has a lower...
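A rough comparison of the two heating routes illustrates the potential saving; the COP, boiler efficiency and carbon intensities below are assumed values for illustration, not QMUL measurements.

```python
# Rough emissions comparison: gas boiler vs. heat pump recovering server heat.
# All numbers are assumed, illustrative values.
heat_demand_kwh = 1_000_000        # annual heat delivered to the district system

boiler_efficiency = 0.90
gas_gco2_per_kwh = 200             # natural gas, approximate
boiler_tco2 = heat_demand_kwh / boiler_efficiency * gas_gco2_per_kwh / 1e6

cop = 3.5                          # heat-pump coefficient of performance
grid_gco2_per_kwh = 150            # electricity grid intensity, approximate
heat_pump_tco2 = heat_demand_kwh / cop * grid_gco2_per_kwh / 1e6

print(f"gas boiler: {boiler_tco2:.0f} tCO2e/year, heat pump: {heat_pump_tco2:.0f} tCO2e/year")
```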
HEPscore is a CPU benchmark, based on HEP applications, that the HEPiX Working Group is proposing as a replacement of the HEPSpec06 benchmark (HS06), which is currently used by the WLCG for procurement, computing resource requests and pledges, accounting and performance studies. At the CHEP 2019 conference, we presented the reasons for building a benchmark for the HEP community that is based...
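As a sketch of how a composite benchmark of this kind can aggregate per-workload results (assuming a weighted geometric mean; the workload names, scores and weights below are invented, not actual HEPscore workloads or results):

```python
# Aggregate per-workload scores into one number via a weighted geometric mean.
# Workload names, scores and weights are invented placeholders.
import math

scores = {                      # score relative to a reference machine
    "gen-sim-wl": 1.12,
    "digi-reco-wl": 0.97,
    "analysis-wl": 1.05,
}
weights = {name: 1.0 for name in scores}   # equal weights assumed

log_sum = sum(w * math.log(scores[n]) for n, w in weights.items())
composite = math.exp(log_sum / sum(weights.values()))
print(f"composite score: {composite:.3f}")
```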
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation, and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any...
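One generic way to make a long-running job stoppable at arbitrary points is periodic application-level checkpointing; the sketch below is a minimal pattern of our own, not the mechanism the abstract describes.

```python
# Minimal checkpoint/restart pattern for a long-running loop.
import os
import pickle

CHECKPOINT = "state.pkl"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "accumulator": 0.0}

def save_state(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)            # atomic swap avoids torn checkpoints

state = load_state()
for step in range(state["step"], 1_000_000):
    state["accumulator"] += step * 1e-6    # stand-in for real work
    state["step"] = step + 1
    if step % 10_000 == 0:
        save_state(state)                  # safe point to stop and later resume
save_state(state)
```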
CERN IT has consolidated all life-cycle management of its physical server fleet on the Ironic bare-metal API. From the initial registration upon first boot, through inventory checking, burn-in and acceptance benchmarking, provisioning to the end users and repairs during service, up to retirement at the end of a server’s life, all stages can be managed within...
The University of Victoria (UVic) operates an Infrastructure-as-a-Service science cloud for Canadian researchers, and a WLCG T2 grid site for the ATLAS experiment at CERN. At first, these were two distinctly separate systems, but over time we have taken steps to migrate the T2 grid services to the cloud. This process has been significantly facilitated by basing our approach on Kubernetes, a...
The ATLAS experiment at CERN is one of the largest scientific machines built to date and will have ever growing computing needs as the Large Hadron Collider collects an increasingly larger volume of data over the next 20 years. ATLAS is conducting R&D projects on Amazon and Google clouds as complementary resources for distributed computing, focusing on some of the key features of commercial...
An all-inclusive analysis of costs for on-premises and public cloud-based solutions to handle the bulk of HEP computing requirements shows that dedicated on-premises deployments are still the most cost-effective. Since the advent of public cloud services, the HEP community has engaged in multiple proofs of concept to study the technical viability of using cloud resources; however, the...
We will present the rapid progress, vision and outlook across multiple state-of-the-art development lines within the Global Network Advancement Group and its Data Intensive Sciences and SENSE/AutoGOLE working groups, which are designed to meet the present and future needs and address the challenges of the Large Hadron Collider and other science programs with global reach. Since it was founded...
The NOTED (Network Optimised Transfer of Experimental Data) project has successfully demonstrated the ability to dynamically reconfigure network links to increase the effective bandwidth available for FTS-driven transfers between endpoints such as WLCG sites, by inspecting ongoing data transfers and identifying those that are bandwidth-limited for long periods of time. Recently, the...
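The core idea of spotting transfers that stay bandwidth-limited for a sustained period can be sketched as a simple filter over monitoring records; the record structure and thresholds below are invented for illustration and are not the NOTED implementation.

```python
# Toy filter for transfers that saturate a link for a sustained period.
from dataclasses import dataclass

@dataclass
class Transfer:
    src: str
    dst: str
    throughput_gbps: float       # currently achieved rate
    link_capacity_gbps: float    # capacity of the shared link
    limited_minutes: int         # time spent near saturation

def is_candidate(t: Transfer, util_threshold: float = 0.9, min_minutes: int = 30) -> bool:
    """Flag transfers that have saturated their link for long enough."""
    return (t.throughput_gbps / t.link_capacity_gbps >= util_threshold
            and t.limited_minutes >= min_minutes)

transfers = [
    Transfer("SITE-A", "SITE-B", 92.0, 100.0, 45),
    Transfer("SITE-A", "SITE-C", 20.0, 100.0, 120),
]
for t in transfers:
    if is_candidate(t):
        print(f"consider provisioning extra capacity for {t.src} -> {t.dst}")
```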
Data caches of various forms have been widely deployed in the context of commercial and research and education networks, but their common positioning at the Edge limits their utility from a network operator perspective. When deployed outside the network core, providers lack visibility to make decisions or apply traffic engineering based on data access patterns and caching node location.
As...
The Caltech team, in collaboration with network, computer science, and HEP partners at the DOE laboratories and universities, is building intelligent network services ("The Software-defined network for End-to-end Networked Science at Exascale (SENSE) research project") to accelerate scientific discovery.
The overarching goal of SENSE is to enable National Labs and universities to request and...
A comprehensive analysis of the HEP (High Energy Physics) experiment traffic across LHCONE (Large Hadron Collider Open Network Environment) and other networks is essential for immediate network optimisation (for example by the NOTED project) and highly desirable for long-term network planning. Such an analysis requires two steps: tagging of network packets to indicate the type and owner of...
Apptainer (formerly known as Singularity) has, since its beginning, implemented many of its container features with the assistance of a setuid-root program. It still supports that mode, but as of version 1.1.0 it no longer uses setuid by default. This is feasible because it can now mount squash filesystems, mount ext2/3/4 filesystems, and use overlayfs using unprivileged user namespaces and FUSE. It...
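A quick way to exercise the non-setuid mode from a workflow is simply to shell out to the apptainer binary; in the sketch below the image name and command are placeholders, and relying on the default unprivileged behaviour of version 1.1.0+ is our assumption.

```python
# Run a command inside an Apptainer container, relying on unprivileged
# user namespaces (the default as of 1.1.0). Image and command are placeholders.
import subprocess

image = "my_analysis.sif"            # hypothetical container image
cmd = ["apptainer", "exec", image, "python3", "--version"]

result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())
```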
High-energy physics experiments are pushing forward precision measurements and searches for new physics beyond the Standard Model. Simulating and generating large amounts of data is urgently needed to meet these physics requirements, and it is an area particularly well suited to exploiting the existing power of supercomputers for high-energy physics computing. Taking the BESIII experiment as an illustration, we...
The CMS experiment is working to integrate an increasing number of High Performance Computing (HPC) resources into its distributed computing infrastructure. The case of the Barcelona Supercomputing Center (BSC) is particularly challenging as severe network restrictions prevent the use of CMS standard computing solutions. The CIEMAT CMS group has performed significant work in order to overcome...
With advances in the CMS CMSSW framework to support GPUs, culminating in the deployment of GPUs in the Run-3 HLT, CMS is also starting to look at integrating GPU resources into CMS Offline Computing. At US HPC facilities, a number of very large GPU-equipped HPC systems have either just become available or will become available soon, offering opportunities for CMS if we are able to make use of...
The NSF-funded Scalable CyberInfrastructure for Artificial Intelligence and Likelihood Free Inference (SCAILFIN) project has developed and deployed artificial intelligence (AI) and likelihood-free inference (LFI) techniques and software using scalable cyberinfrastructure (CI) built on top of existing CI elements. Specifically, the project has extended the CERN-based REANA framework, a...
LHCb (Large Hadron Collider beauty) is one of the four large particle physics experiments aimed at studying differences between particles and anti-particles and very rare decays in the charm and beauty sector of the standard model at the LHC. The Experiment Control System (ECS) is in charge of the configuration, control, and monitoring of the various subdetectors as well as all areas of the...
The ATLAS Trigger and Data Acquisition (TDAQ) High Level Trigger (HLT) computing farm contains 120,000 cores. These resources are critical for the online selection and collection of collision data in the ATLAS experiment during LHC operation. Since 2013, during longer periods of LHC inactivity these resources have been used for offline event simulation via the "Simulation at Point One" project...
The CMSWEB cluster is pivotal to the activities of the Compact Muon Solenoid (CMS) experiment, as it hosts critical services required for the operational needs of the CMS experiment. The security of these services and the corresponding data is crucial to CMS. Any malicious attack can compromise the availability of our services. Therefore, it is important to construct a robust security...
Helmholtz Federated IT Services (HIFIS) provides shared IT services across all fields and centres in the Helmholtz Association. HIFIS is a joint platform in which most of the research centres in the Helmholtz Association collaborate and offer cloud and fundamental backbone services free of charge to scientists in Helmholtz and their partners. Furthermore, HIFIS provides a federated...
The centralised Elasticsearch service has already been running at CERN for over 6 years, providing the search and analytics engine for numerous CERN users. The service has been based on the open-source version of Elasticsearch, surrounded by a set of external open-source plugins offering security, multi-tenancy, extra visualization types and more. The evaluation of OpenDistro for...
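For context, the client-facing side of such a service looks roughly like the generic example below; the endpoint, index and credentials are placeholders, not CERN's actual configuration.

```python
# Generic query against an Elasticsearch-style endpoint using the Python client.
# Endpoint, index name and credentials are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.org:9200", basic_auth=("reader", "secret"))

resp = es.search(
    index="syslog-2024",
    query={"match": {"message": "error"}},
    size=5,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("message"))
```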
The LCAPE project develops artificial intelligence to improve operations in the FNAL control room by reducing the time needed to identify the cause of an outage, improving the reproducibility of its labeling, predicting outage durations and forecasting their occurrence.
We present our solution for incorporating information from ~2.5k monitored devices to distinguish between dozens of different causes...
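A stripped-down stand-in for the classification task might look like the sketch below; the synthetic features, labels and model choice are our own and bear no relation to the production LCAPE models.

```python
# Toy outage-cause classifier over synthetic per-device features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_devices = 2500                        # ~2.5k monitored devices, one feature each
X = rng.normal(size=(400, n_devices))   # 400 synthetic historical outages
y = rng.integers(0, 12, size=400)       # a dozen synthetic outage causes

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X[:300], y[:300])
# On purely random data the score is near chance; this only shows the API shape.
print("held-out accuracy:", clf.score(X[300:], y[300:]))
```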
The Batch@CERN team manages over 250k CPU cores for batch processing of LHC data with our HTCondor cluster comprising ~5k nodes. We will present a lifecycle-management solution for our systems built around our in-house developed state-manager daemon, BrainSlug, and describe how it handles draining, rebooting, interventions and other actions on the worker nodes, with Rundeck as our human-interaction endpoint and using...
The transition of WLCG storage services to dual-stack IPv4/IPv6 is nearing completion after more than 5 years, thus enabling the use of IPv6-only CPU resources as agreed by the WLCG Management Board and presented by us at earlier CHEP conferences. Much of the data is transferred by the LHC experiments over IPv6. All Tier-1 storage and over 90% of Tier-2 storage is now IPv6-enabled, yet we...
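A simple way to check whether a given storage endpoint is published over both address families is to inspect its DNS records; the hostname below is a placeholder.

```python
# Check whether an endpoint publishes both IPv4 (A) and IPv6 (AAAA) addresses.
# The hostname is a placeholder.
import socket

host = "storage.example.org"
infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
families = {info[0] for info in infos}

print("IPv4:", socket.AF_INET in families)
print("IPv6:", socket.AF_INET6 in families)
```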