Managing Self-Hosted CI Runners at Scale with EC2 Spot Instances

Chapter 1: Introduction to Self-Hosted Runners

As engineering teams expand, the demand for continuous integration (CI) naturally increases to accommodate more systems and developers. However, once a certain threshold is reached, relying on managed hosted runners provided by CI services can become prohibitively expensive.

A viable alternative is to deploy self-hosted runners on AWS EC2 Spot instances, which offer significant cost savings while enabling scalable solutions. Nevertheless, managing a large fleet of CI runners can become cumbersome without proper automation.

At VTS Engineering, we encountered similar hurdles. To tackle these challenges, we implemented a specialized Terraform module that oversees the entire lifecycle of our runners, alleviating much of the operational burden. In this piece, we will delve into this Terraform module and address the obstacles associated with deploying self-hosted CI runners at scale using GitHub Actions.

Section 1.1: Our Unique Use Case

At VTS Engineering, our scenario is distinct, as we operate multiple workflows that necessitate around 50 runners per execution. These workflows are integral to our monolithic application deployment, meaning they are executed frequently throughout the day.

We required a scalable and resilient solution to accommodate approximately 300 CI runners during peak hours, catering to over 150 active repositories and more than 200 engineers.

Subsection 1.1.1: Overview of the Terraform Module

We use the terraform-aws-github-runner Terraform module to run scalable GitHub Actions runners on AWS. The high-level architecture diagram below illustrates how the module scales runners on demand or on a schedule to make the best use of Spot instances.

High-level architecture of the Terraform module for GitHub Actions

Section 1.2: Our Solution

By leveraging the terraform-aws-github-runner module, we successfully deployed a fleet of 300 Spot runners to meet our CI requirements. This setup automatically scales down after work hours according to a cron schedule.
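The schedule-based scale-down can be pictured as a lookup from the current time to a target pool size. The following is a minimal sketch, not the module's implementation: the working-hours window (08:00 to 20:00 on weekdays), the off-hours size of 0, and the names WEEKDAY_WINDOWS, OFF_HOURS_SIZE, and desired_pool_size are all assumptions made for illustration. Only the peak size of 300 comes from the article.

```python
from datetime import datetime

# Hypothetical schedule: (first_hour, last_hour_exclusive, pool_size) on weekdays.
WEEKDAY_WINDOWS = [(8, 20, 300)]
# Hypothetical pool size outside working hours and on weekends.
OFF_HOURS_SIZE = 0


def desired_pool_size(now: datetime) -> int:
    """Return the target number of pooled runners for the given local time."""
    if now.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return OFF_HOURS_SIZE
    for start_hour, end_hour, size in WEEKDAY_WINDOWS:
        if start_hour <= now.hour < end_hour:
            return size
    return OFF_HOURS_SIZE
```

In a real deployment the schedule would live in the module's pool configuration as cron expressions rather than in application code.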

To mitigate the risk of a single point of failure within our runner pool, we also retained the capability to create additional runners on demand if the existing pool lacks idle resources or experiences outages.

Despite the daily need for numerous Spot runners, this self-hosted solution proves to be far more economical compared to previous managed CI solutions or GitHub-hosted runners.

Chapter 2: Pricing Comparison and Challenges

This video titled "How To Set Up Self-hosted GitHub Runners on AWS EC2 Instance Auto Scaling Group" provides a comprehensive guide on configuring self-hosted runners using AWS EC2.

The video covers the necessary steps to streamline the setup process, ensuring that teams can effectively manage their CI infrastructure.

The second video, "Provisioning 100 GitLab Spot Runners on AWS in Less Than 10 Mins via Less Than 10 Clicks For $5/hr," demonstrates how to quickly and efficiently provision GitLab runners on AWS.

This video outlines the steps to save costs while managing CI runners effectively, making it a valuable resource for teams looking to optimize their CI workflows.

Challenges We Faced

Transitioning from long-lived runners to ephemeral ones posed several difficulties: maintaining a consistent pool of around 300 runners, throttling that limited how many runners we could spin up simultaneously, unreliable Spot instance availability, and increased costs for the supporting AWS services.

Challenge #1 — Ephemeral Runners:

In CI, it is crucial for runners to operate in an ephemeral state, ensuring they are clean before executing new jobs. The module we adopted facilitates the creation of ephemeral runners, but we encountered the limitation of requiring one EC2 instance per job, which isn't ideal.

Using ephemeral runners with GitHub Actions necessitates the use of the workflow_job event. Since these runners are stateless, each must authenticate with Docker Hub to avoid rate limits, and we must utilize specific pool configurations.
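To illustrate how a workflow_job event can drive the creation of ephemeral runners, here is a minimal sketch of the decision a webhook handler might make. It assumes the payload shape GitHub sends for workflow_job events (an action field and a workflow_job.labels list). The function name should_start_runner and the label-matching rule are illustrative assumptions, not the module's actual logic.

```python
def should_start_runner(event: dict, runner_labels: set) -> bool:
    """Return True if a workflow_job payload calls for a new ephemeral runner.

    A runner is started only for newly queued jobs whose requested labels
    are all offered by this runner fleet.
    """
    if event.get("action") != "queued":
        return False  # in_progress and completed events need no new runner
    job_labels = set(event.get("workflow_job", {}).get("labels", []))
    # Jobs without labels, or asking for labels we do not offer, are ignored.
    return bool(job_labels) and job_labels.issubset(runner_labels)
```

For example, a queued job labeled self-hosted and linux would match a fleet offering self-hosted, linux, and x64, while a job labeled ubuntu-latest would be left for GitHub-hosted runners.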

Challenge #2 — Scaling:

While we established ephemeral runners, we needed to ensure a consistent pool of approximately 300 runners. A Lambda function manages this pool, checking every few minutes to maintain the expected size and adjusting as needed.

The configuration facilitates scaling down surplus runners throughout the day, based on API calls to GitHub.
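The periodic pool check boils down to comparing the target size with the number of idle runners. Below is a minimal sketch of that comparison, assuming the idle count has already been fetched from the GitHub API. The function name pool_adjustment and the max_batch cap are hypothetical; the cap is included only to show one way of avoiding a burst of simultaneous launches.

```python
def pool_adjustment(target_size: int, idle_runners: int, max_batch: int = 50) -> int:
    """Return how the pool should change on this check.

    Positive: number of runners to launch now (capped at max_batch).
    Negative: number of surplus idle runners that can be retired.
    Zero: the pool already matches the target.
    """
    delta = target_size - idle_runners
    if delta > 0:
        return min(delta, max_batch)
    return delta
```

With a target of 300 and 280 idle runners, the check would request 20 launches; with 310 idle runners it would report 10 as surplus.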

Challenge #3 — Throttling:

As we scaled up to meet our CI workload, we hit rate limits on AWS SSM Parameter Store, which supports only a limited number of requests per second. We worked around this by forking the open-source module and addressing the throttling in our fork.
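The article does not describe what the fork changed. A common general-purpose technique for rate-limited APIs is retrying with exponential backoff and jitter, sketched below. This is an illustration of that technique, not the authors' fix: ThrottledError is a stand-in exception, and call_with_backoff with its parameters is hypothetical.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for a throttling error returned by a rate-limited API."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on ThrottledError with full-jitter backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential delay (base * 2^attempt), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the delay so that many
            # runners retrying at once do not hit the API in lockstep.
            sleep(rng() * delay)
```

The sleep and rng parameters are injectable so that the retry behavior can be tested without real waiting.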

Challenge #4 — Spot Instance Availability:

Using Spot instances at scale requires careful testing for availability. We frequently encountered interruption notices, which led to premature terminations of our runners. To improve stability, we diversified our instance types and families, reducing interruptions significantly.
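Diversification means giving EC2 many interchangeable capacity pools to choose from. The sketch below builds one launch override per combination of instance type and subnet, using the InstanceType and SubnetId key names found in EC2 fleet launch template overrides. The function name build_overrides and the example instance types and subnet IDs are assumptions; the article does not list the types the team chose.

```python
from itertools import product


def build_overrides(instance_types, subnet_ids):
    """Return one launch override per (instance type, subnet) combination.

    Each combination is a separate Spot capacity pool, so more combinations
    give the allocator more places to find spare capacity.
    """
    return [
        {"InstanceType": instance_type, "SubnetId": subnet_id}
        for instance_type, subnet_id in product(instance_types, subnet_ids)
    ]


# Hypothetical example: three instance types across two subnets.
overrides = build_overrides(
    ["m5.large", "m5a.large", "c5.xlarge"],
    ["subnet-aaa", "subnet-bbb"],
)
```

Three types across two subnets yield six candidate pools instead of one.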

Challenge #5 — Cost:

Although Spot instances are cost-effective, the frequent recycling of these instances increased our overall expenses for AWS services related to high EC2 usage. We implemented several measures to mitigate costs, including garbage collection for SSM parameters and utilizing VPC endpoints.
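Garbage collection for SSM parameters amounts to finding parameters that belong to runners and have outlived their usefulness. Here is a minimal sketch of the selection step, assuming parameter records shaped like those returned when listing SSM parameters (Name and LastModifiedDate fields). The prefix /runners/tokens/, the one-hour age limit, and the function name stale_parameter_names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_parameter_names(parameters, now, max_age=timedelta(hours=1),
                          prefix="/runners/tokens/"):
    """Return names of runner parameters last modified more than max_age ago."""
    return [
        p["Name"]
        for p in parameters
        if p["Name"].startswith(prefix) and now - p["LastModifiedDate"] > max_age
    ]
```

The returned names could then be deleted in batches, which keeps the parameter count, and the requests made against it, from growing with every recycled instance.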

Learnings and Conclusion

In our quest to fulfill our scalability needs, we explored various approaches before settling on a solution involving a pool of self-hosted CI runners.

Although it is technically feasible to run multiple GitHub Actions runners on a single host, that approach is better suited to smaller workloads and can lead to maintenance challenges. Vertically scaling EC2 instances was not a viable option for us either, because of its cost and the poor utilization we observed.

We appreciate your time in reading this article, and we hope you find our solution beneficial for scaling your CI processes!

Acknowledgments

Special thanks to Pavel Susloparov and Shruti Venkatesh for their invaluable feedback and suggestions for this article. Dev works as a Senior SRE on the platform infrastructure team at VTS, with a passion for building developer tools and reliable systems while driving continuous improvements.
