In the HEP community, the prediction of Data Popularity is a topic that has been studied for many years. Nonetheless, as data storage challenges grow, especially in the HL-LHC era, we are still in need of better predictive models to answer the question of whether particular data should be kept, replicated, or deleted.
Caching has proved to be a convenient technique that partially automates storage management and sidesteps some of these questions. While even simple caching algorithms such as LRU already bring benefits, we show that incorporating knowledge about future access patterns can greatly improve cache performance.
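For reference, the LRU policy mentioned above can be sketched in a few lines: the cache evicts the least recently used file once it reaches capacity. This is a minimal illustration, not the implementation evaluated in the paper; the `LRUCache` class and file identifiers are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used file when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion order tracks recency of use
        self.hits = 0
        self.misses = 0

    def access(self, file_id):
        if file_id in self.store:
            self.hits += 1
            self.store.move_to_end(file_id)  # mark as most recently used
        else:
            self.misses += 1
            self.store[file_id] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(capacity=2)
for f in ["a", "b", "a", "c", "b"]:
    cache.access(f)
# accesses "a","b" miss; "a" hits; "c" misses and evicts "b"; "b" misses again
```

Replaying the short trace above gives 1 hit and 4 misses, illustrating how a tight capacity hurts even a recency-aware policy.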
In this paper, we study data popularity at the file level, where the special relation between files belonging to the same dataset can be exploited in addition to the standard attributes. We start by analyzing individual features and relating them to the target variable: the reuse distance of the files. We then turn to Machine Learning algorithms, such as Random Forest, which is well suited to Big Data: it can be parallelized, and it is more lightweight and easier to interpret than Deep Neural Networks. Finally, we compare the results with standard cache retention algorithms and with the theoretical optimum.
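To make the target variable concrete, one common definition of reuse distance is the number of distinct other files accessed between two successive accesses to the same file. The sketch below, under that assumed definition (the paper may use a variant, e.g. time until next access), computes it from an access trace; the function name and trace are illustrative.

```python
def reuse_distances(trace):
    """For each access, return the number of distinct other files accessed
    since the previous access to the same file (None for a first access)."""
    last_seen = {}  # file id -> index of its most recent access
    out = []
    for i, f in enumerate(trace):
        if f in last_seen:
            between = set(trace[last_seen[f] + 1 : i])  # files accessed in between
            between.discard(f)
            out.append(len(between))
        else:
            out.append(None)  # no previous access, distance undefined
        last_seen[f] = i
    return out

print(reuse_distances(["a", "b", "a", "c", "b", "a"]))
# → [None, None, 1, None, 2, 2]
```

Files with small predicted reuse distances are the ones worth keeping in the cache, which is what links this target variable to the retention decisions discussed above.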
Consider for long presentation: Yes