May 8 – 12, 2023
Norfolk Waterside Marriott
US/Eastern timezone

Managing the OSG Fabric of Services the GitOps Way

May 9, 2023, 11:15 AM
15m
Marriott Ballroom II-III (Norfolk Waterside Marriott)

235 East Main Street, Norfolk, VA 23510
Oral, Track 4 - Distributed Computing

Speaker

Bockelman, Brian (Morgridge Institute for Research)

Description

There is no lack of approaches for managing the deployment of distributed services; in the last 15 years of running distributed infrastructure, the OSG Consortium has seen many of them. One persistent problem has been that each physical site has its own style of configuration management and service operations, which partitions staff knowledge and makes it hard to migrate services between sites.

Recently, the team has been migrating the OSG Fabric of Services to Kubernetes, which provides a common service orchestration fabric across all sites. However, this leaves open a question: how does the team interact with Kubernetes? To coordinate this new style of deployment among geographically distributed clusters and team members, the team has adopted "GitOps", an operational model that uses Git version control repositories to drive service updates. Git-driven operations provide all the benefits of version control, such as recording the who, what, when, and why of any given change. More powerfully, automated agents synchronize the current state of the Git repository with the current state of the Kubernetes clusters, which makes it simpler to redeploy services from scratch or transfer them between clusters. In this paper, we describe the setup that enables GitOps deployments of central OSG services and the lessons learned along the way, including rebuilding a suite of services after a critical failure and our experiences with providing high-availability services across multiple Kubernetes clusters.
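
The synchronization step described above can be illustrated with a minimal sketch. It is not the OSG setup: the abstract does not name the agents in use, so plain Python dictionaries stand in for the Git repository and the cluster, and every name here (`plan_sync`, `apply_sync`, the service names and images) is hypothetical.

```python
import copy

# Hypothetical sketch of one GitOps reconciliation pass. Dictionaries stand
# in for the Git repository (desired state) and the cluster (live state);
# no real Git or Kubernetes API is called.


def plan_sync(desired, live):
    """Return the actions needed to make `live` match `desired`.

    Both arguments map a resource name to its manifest.
    """
    actions = []
    for name, manifest in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != manifest:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_sync(desired, live):
    """Apply the plan to `live` in place and return the plan."""
    actions = plan_sync(desired, live)
    for verb, name in actions:
        if verb == "delete":
            del live[name]
        else:
            live[name] = copy.deepcopy(desired[name])
    return actions


# Desired state as it might be read from a Git commit (made-up services).
git_state = {
    "frontend": {"image": "web:2", "replicas": 2},
    "collector": {"image": "collector:1", "replicas": 1},
}

# An empty cluster models a redeployment from scratch.
new_cluster = {}
first_run = apply_sync(git_state, new_cluster)
second_run = apply_sync(git_state, new_cluster)
```

Under these assumptions, the first pass creates every service on the empty cluster and the second pass has nothing to do. Pointing the same desired state at a different cluster dictionary is the toy equivalent of transferring services between clusters.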

Consider for long presentation: No

Primary authors

Bockelman, Brian (Morgridge Institute for Research)
Mr Lin, Brian
Thiltges, John (University of Nebraska-Lincoln)
Hu, Fengping (University of Chicago)
