Speaker
Description
Imperial College London hosts a large Tier-2 WLCG grid site based around an HTCondor batch system; in addition, it provides OpenStack cloud computing facilities for non-WLCG activities. These cloud resources are open to opportunistic usage, provided the impact on the primary cloud users remains low.
In common with most Tier-2 sites, we see constant job pressure from the WLCG VOs, while usage of our cloud is typically much more intermittent. To allocate grid jobs to the available opportunistic cloud resources, we implemented a lightweight backfill system based on the OpenStack Python API. We execute these grid jobs in a dedicated OpenStack project, the quota of which is adjusted according to the amount of unused resources in the other projects. A second process takes care of starting new instances and deleting completed or stalled ones. We build a dedicated image for the virtual worker node instances once a day using the Jenkins continuous integration tool. This ensures that the image is always up to date and that security updates are applied promptly. The image includes a client for the HTCondor batch system and a script that shuts down the worker node when no work is underway, so that empty worker nodes do not take up resources. For monitoring, we collate the key parameters of the backfill system using Ganglia.
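The two processes described above can be illustrated with a minimal Python sketch. It is not the production code: it deliberately leaves out all OpenStack API calls and models only the decision logic. The names `ProjectUsage`, `Instance`, `backfill_quota` and `plan_actions`, the `reserve_cores` safety margin, the build timeout, and the rule that a self-terminated worker appears in the `SHUTOFF` state are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectUsage:
    """Cores currently used by one primary (non-backfill) project. Hypothetical type."""
    name: str
    cores_used: int


@dataclass(frozen=True)
class Instance:
    """Minimal view of one backfill worker instance. Hypothetical type."""
    name: str
    status: str        # assumed values: "ACTIVE", "BUILD", "SHUTOFF", "ERROR"
    cores: int
    age_minutes: int


def backfill_quota(total_cores, primary_usage, reserve_cores=0):
    """Cores the backfill project may use: whatever the primary
    projects leave idle, minus an (assumed) safety reserve."""
    used = sum(p.cores_used for p in primary_usage)
    return max(0, total_cores - used - reserve_cores)


def plan_actions(instances, quota_cores, cores_per_instance,
                 build_timeout_minutes=30):
    """Return (names_to_delete, number_of_instances_to_start).

    Assumption: a worker that shut itself down when idle shows up as
    SHUTOFF (completed); ERROR, or BUILD past the timeout, counts as stalled.
    """
    to_delete = [
        i.name for i in instances
        if i.status in ("SHUTOFF", "ERROR")
        or (i.status == "BUILD" and i.age_minutes > build_timeout_minutes)
    ]
    doomed = set(to_delete)
    cores_kept = sum(i.cores for i in instances if i.name not in doomed)
    free_cores = quota_cores - cores_kept
    # Floor division of a negative number is negative, so clamp at zero.
    to_start = max(0, free_cores // cores_per_instance)
    return to_delete, to_start


if __name__ == "__main__":
    usage = [ProjectUsage("astro", 40), ProjectUsage("bio", 30)]
    quota = backfill_quota(total_cores=100, primary_usage=usage, reserve_cores=10)
    workers = [
        Instance("wn1", "ACTIVE", 8, 100),
        Instance("wn2", "SHUTOFF", 8, 200),
    ]
    print(quota, plan_actions(workers, quota, cores_per_instance=8))
```

Keeping the decision logic free of API calls makes it easy to test in isolation; in a real deployment, the computed quota and the delete/start plan would then be applied through the OpenStack Python API.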
Here we present our implementation of this lightweight backfill system and we report on its first year in production.
| Consider for long presentation | No |
|---|---|