Description
Anomaly detection in data center IT and physical infrastructures is challenging because of the volume of heterogeneous data to be analyzed, which includes, among others, CPU and memory consumption, network traffic, and cooling and electrical states. A solution that identifies unexpected anomalies early is particularly important for preventing data loss, system breakdowns, and any other event considered critical for the activity of the data center. Such a solution can also help system managers and engineers design the redundancy of IT equipment properly and monitor the whole data center. In the context of the data center of the Italian Institute for Nuclear Physics (INFN), which serves more than 40 international scientific collaborations in multiple scientific domains, including high-energy physics experiments running at the Large Hadron Collider in Geneva, we have performed a set of studies based on monitored cooling, electrical, and IT hardware and software metrics.
In previous work, we focused on detecting anomaly patterns in service log files and IT monitoring measurements, combining natural language processing (NLP) solutions with multivariate time series anomaly detection techniques [1]. That study revealed thousands of anomalies, which were verified by comparison with the corresponding log messages from the different services considered in the analysis. It also computed anomaly scores on monitoring data to identify the time frames in which service anomalies and monitoring-data anomalies overlap, as a basis for predictive maintenance analysis.
This contribution extends that work by exploring statistical approaches and machine learning solutions for anomaly detection in numerical time series metrics from IT sensors. The paper describes a model defined by considering critical scenarios and a wide range of monitoring data types, such as network load, cooling, and electrical states. Our study will also take advantage of the threshold-based alarm system, configured for each monitored metric of the IT and physical infrastructures, to label the recorded events and use them in semi-supervised machine learning techniques. The relationship between the anomaly scores and the threshold-risk values will be assessed for use in predictive maintenance management. The theoretical studies, based on real monitored data, and the related results are used to improve the existing monitoring platform so that it can recognise and prevent anomalies within the national data center of INFN.
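As a rough illustration of how threshold alarms could supply labels for a semi-supervised detector, the following minimal Python sketch fits a "normal" profile only on samples that the alarm system did not flag, then scores every sample against that profile. It is not the model used in this study: the robust z-score (median and MAD), the function names, the temperature values, and the alarm threshold are all assumptions chosen for illustration, as is the simple definition of threshold risk as the fraction of the alarm threshold reached.

```python
from statistics import median


def label_by_threshold(values, threshold):
    """Return 1 where the (hypothetical) alarm threshold is exceeded, else 0."""
    return [1 if v > threshold else 0 for v in values]


def fit_normal_profile(values, labels):
    """Estimate median and MAD from samples the alarm system labelled normal."""
    normal = [v for v, y in zip(values, labels) if y == 0]
    center = median(normal)
    mad = median(abs(v - center) for v in normal)
    return center, mad or 1e-9  # guard against a zero MAD


def anomaly_scores(values, center, mad):
    """Robust z-score: distance from the normal profile in scaled-MAD units."""
    return [abs(v - center) / (1.4826 * mad) for v in values]


def threshold_risk(values, threshold):
    """Assumed risk measure: fraction of the alarm threshold reached."""
    return [v / threshold for v in values]


# Made-up inlet temperatures in degrees Celsius, for illustration only.
temps = [21.0, 21.5, 20.8, 21.2, 24.9, 27.5, 21.1]
ALARM_C = 27.0  # hypothetical alarm threshold

labels = label_by_threshold(temps, ALARM_C)
center, mad = fit_normal_profile(temps, labels)
scores = anomaly_scores(temps, center, mad)
risk = threshold_risk(temps, ALARM_C)

for t, y, s, r in zip(temps, labels, scores, risk):
    print(f"temp={t:5.1f}  alarm={y}  score={s:6.2f}  risk={r:4.2f}")
```

In this toy series, the 24.9 °C sample stays below the alarm threshold yet receives a high anomaly score, which is the kind of relationship between scores and threshold risk that could be examined for early warning.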
References
[1] Viola, L; Ronchieri, E; Cavallaro, C. Combining log files and monitoring data to detect anomaly patterns in a data center. Computers, 11(8):117, 2022. doi: https://doi.org/10.3390/computers11080117
Consider for long presentation: Yes