
May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Anomaly Detection in Data Center IT and physical Infrastructure

Not scheduled
1h
Hampton Roads Ballroom and Foyer Area (Norfolk Waterside Marriott)


235 East Main Street Norfolk, VA 23510
Poster Session

Speaker

Dr Giommi, Luca (INFN CNAF)

Description

Anomaly detection in data center IT and physical infrastructures is challenging because of the volume of heterogeneous data to be analyzed. These data include, among others, CPU and memory consumption, network traffic, and cooling and electrical states. A solution that identifies unexpected anomalies early is particularly important for preventing data loss, system breakdowns, and any other event considered critical for the activity of the data center. Such a solution might also help system managers and engineers design the redundancy of the IT apparatus properly and monitor the whole data center. In the context of the data center of the Italian Institute for Nuclear Physics (INFN), which serves more than 40 international scientific collaborations in multiple scientific domains, including high-energy physics experiments running at the Large Hadron Collider in Geneva, we have performed a set of studies based on monitored cooling, electrical, and IT hardware and software metrics.

In previous work, we focused on detecting anomaly patterns in service log files and IT monitoring measurements, combining natural language processing (NLP) solutions with multivariate time series anomaly detection techniques [1]. That study revealed thousands of anomalies, which were verified by comparison with the corresponding log messages from the different services considered in the analysis. It also computed anomaly scores on monitoring data to identify the time frames in which service anomalies and monitoring-data anomalies overlap, with a view to predictive maintenance analysis.
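As an illustration of what overlapping service anomalies with monitoring-data anomalies can mean in practice, the following is a minimal sketch and not the method of [1]. It assumes each anomaly is reduced to a single timestamp and that two anomalies "overlap" when they fall within a chosen tolerance of each other; the function name `overlapping_anomalies`, the tolerance, and the sample timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def overlapping_anomalies(log_times, metric_times, tolerance):
    """Pair each log anomaly with every metric anomaly within `tolerance`.

    Hypothetical helper: a brute-force comparison, adequate for a sketch.
    """
    pairs = []
    for log_time in log_times:
        for metric_time in metric_times:
            if abs(log_time - metric_time) <= tolerance:
                pairs.append((log_time, metric_time))
    return pairs


# Made-up timestamps, for illustration only.
log_times = [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 12, 30)]
metric_times = [datetime(2023, 1, 1, 10, 3), datetime(2023, 1, 1, 15, 0)]

matches = overlapping_anomalies(log_times, metric_times, timedelta(minutes=5))
for log_time, metric_time in matches:
    print(f"log anomaly at {log_time} matches metric anomaly at {metric_time}")
```

The tolerance is the main design choice here: a wider window finds more candidate overlaps but also more coincidental ones.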

This contribution extends that work, exploring statistical approaches and machine learning solutions for anomaly detection in numerical time series metrics from IT sensors. The paper describes a model defined by considering critical scenarios and a wide range of monitoring data types, such as network load, cooling, and electrical states. Our study will also take advantage of the threshold-based alarming system, configured for each monitored metric of the IT and physical infrastructures, to label the recorded events and use them in semi-supervised machine learning techniques. The relationship between the anomaly scores and the threshold-risk values will be assessed for use in predictive maintenance management. The theoretical studies, based on real monitored data, and the related results are being used to improve the existing monitoring platform so that it can recognise and prevent anomalies within the national data center of INFN.
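The idea of using threshold alarms as labels can be sketched as follows. This is a toy example under stated assumptions, not the model described in the paper: it assumes a single metric, treats samples below a hypothetical alarm threshold as "normal", fits a mean and standard deviation on those samples only, and scores every sample by its distance from that baseline. The function names, the threshold of 27.0, and the temperature readings are invented for illustration.

```python
from statistics import mean, stdev


def label_by_threshold(values, alarm_threshold):
    """Label a sample True when it reaches the (assumed) alarm threshold."""
    return [value >= alarm_threshold for value in values]


def anomaly_scores(values, alarms):
    """Score each sample by its distance from the non-alarm baseline.

    The baseline is fitted only on samples labelled normal, which is the
    semi-supervised part of this sketch. Needs at least two normal samples
    that are not all identical.
    """
    normal = [value for value, alarm in zip(values, alarms) if not alarm]
    baseline_mean = mean(normal)
    baseline_std = stdev(normal)
    return [abs(value - baseline_mean) / baseline_std for value in values]


# Made-up cooling temperatures in degrees Celsius, for illustration only.
temperatures = [21.0, 21.5, 20.8, 21.2, 24.9, 27.5, 21.1]
alarms = label_by_threshold(temperatures, alarm_threshold=27.0)
scores = anomaly_scores(temperatures, alarms)

for temperature, alarm, score in zip(temperatures, alarms, scores):
    print(f"{temperature:5.1f}  alarm={alarm!s:5}  score={score:.2f}")
```

In this toy data the reading of 24.9 stays below the alarm threshold yet receives a clearly higher score than the other normal readings, which is the kind of relationship between anomaly scores and threshold values that could inform predictive maintenance.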

References
[1] Viola, L; Ronchieri, E; Cavallaro, C. Combining log files and monitoring data to detect anomaly patterns in a data center. Computers, 11(8):117, 2022. doi: https://doi.org/10.3390/computers11080117

Consider for long presentation: Yes

Primary authors

Ronchieri, Elisabetta (INFN CNAF)
Dr Giommi, Luca (INFN CNAF)
Dr Scarponi, Luigi (INFN CNAF)
Dr Doina, Duma Cristina (INFN CNAF)
Dr Salomoni, Davide (INFN CNAF)
Dr Costantini, Alessandro (INFN CNAF)
