Optimizing CI/CD workflows for trunk-based development

Almost all of us have some experience with continuous integration and continuous delivery (CI/CD) systems. Whether you use GitHub Actions, Jenkins, or are one of the five people in the world still using Concourse, CI/CD systems provide the backbone of how most organizations build and ship software. However, one question we hear a lot from Graphite users is: how do we optimize our CI/CD pipelines to balance speed, cost, and efficiency?

In this post we’ll cover common challenges in CI/CD workflows and share insights and best practices that our team here at Graphite has gathered over collective decades of experience working at organizations of all shapes and sizes.

Challenges in CI/CD

The two biggest potential problems in CI/CD flows are time and financial cost, which, if not managed properly, can quickly spiral out of control, grinding your eng team to a halt while running up a big compute bill in the process.

Time cost: Productive eng teams need fast CI processes. A slow CI pipeline can quickly become a bottleneck, delaying review feedback, slowing down merge queues, and hampering the overall pace of development. This is particularly painful for teams using trunk-based development, where short-lived branches and frequent merges are standard practice, meaning that CI disruptions can easily snowball as engineers add more and more PRs to the queue.

Financial cost: The typical response to slow CI processes is increased parallelization, but this often comes with increased computational and hosting costs. An initially simple setup can, over time, morph into a complex array of parallel jobs. Because each of these jobs runs on every revision of every PR, the associated costs can quickly balloon into tens or even hundreds of thousands of dollars in compute spend a month.

With these challenges in mind, fast-moving eng teams need to strike a delicate balance in their CI setups. Your CI process should ensure comprehensive test coverage, while at the same time balancing speed and resource requirements to keep costs in check. The second half of this post delves into strategies and best practices for creating a time- and cost-efficient CI/CD pipeline.

Strategies for optimizing CI/CD

Here are a few strategies that we use at Graphite to optimize our CI/CD pipelines:

Parallelization

It’s important to keep your CI runtime under control. Aim for a total CI time of under 10 minutes. Fast CI runtimes accelerate the time-to-merge and keep development momentum high.

Techniques for parallelizing unit tests:

Utilize testing libraries that support parallel execution. Match the parallelization level to the number of cores available on your test runner.
For example, in GitHub Actions you can define matrices of different job configurations:

yaml

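jobs:
  parallelization_matrix:
    strategy:
      matrix:
        version: [1, 3, 5]
        # Labels of GitHub-hosted runners
        os: [ubuntu-latest, windows-latest]
    # Each matrix combination gets its own runner
    runs-on: ${{ matrix.os }}
    steps:
      - run: echo "version ${{ matrix.version }} on ${{ matrix.os }}"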

Each possible combination of variables will trigger its own job run, so for this example a total of 6 jobs will run. Jobs will all run in parallel, depending on runner availability.
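
Within a single job, the earlier advice about matching parallelism to the runner's core count might look like the following. This is a minimal sketch that assumes the project's tests run with Jest, which the setup described here may or may not use. On Linux runners, nproc reports the number of available cores, and Jest's --maxWorkers flag caps the worker pool at that number. The job name is illustrative.

```yaml
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # Assumption: tests run with Jest; size the worker pool to the runner's core count
      - name: Run unit tests across all cores
        run: npx jest --maxWorkers="$(nproc)"
```

Other test runners expose similar settings, so the same idea applies even if the flag name differs.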

Strategies for dividing tests into sub-groups:

Split unit tests by module, sub-project, or hash of test file names. This allows for running multiple jobs or workflows in parallel.

Example of a test splitting configuration:

yaml

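jobs:
  test-module-A:
    runs-on: ubuntu-latest
    steps:
      # Check out the repo and install dependencies before running tests
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run module A tests
        run: npm run test:module-A
  test-module-B:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run module B tests
        run: npm run test:module-B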
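
When a suite does not divide cleanly by module, matrices and test splitting can be combined to shard by test file instead. The sketch below assumes Jest 28 or later, whose --shard flag assigns each test file to exactly one of N shards. The job name and the shard count of 4 are illustrative, not taken from any particular setup.

```yaml
jobs:
  test-shard:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Illustrative shard count; keep it in sync with the "/4" below
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      # Each job runs only the test files assigned to its shard
      - name: Run shard ${{ matrix.shard }} of 4
        run: npx jest --shard=${{ matrix.shard }}/4
```

Sharding by file keeps the workflow definition small as the codebase grows, at the cost of less readable job names than a per-module split.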

Avoiding redundant work

Repetition in CI/CD flows is expensive. Avoid having each parallel job reinstall dependencies or rebuild the project. This can add significant time and cost to each CI run.

Best practices for caching and artifact management:

Use caching strategies to store and reuse build artifacts across jobs.
Recommended tools:
- GitHub Cache Action: GitHub Actions' first-party caching action
- Turborepo: efficient artifact management in monorepos

Example of efficient caching:

Cache your dependencies and build outputs to be reused in subsequent jobs.
Example GitHub Actions cache configuration:

yaml

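steps:
  - uses: actions/cache@v4
    with:
      path: |
        node_modules
        build-output
      # Key on the lockfile hash (package-lock.json for npm; substitute your package manager's lockfile)
      key: ${{ runner.os }}-build-${{ hashFiles('**/package-lock.json') }}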
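
A cache only saves time if later steps actually skip the work it replaces. One way to do that, sketched here under the assumption of an npm project with a package-lock.json, is to give the cache step an id and run the install step only when the cache action's cache-hit output is not true. The step id deps-cache is a made-up name.

```yaml
steps:
  - uses: actions/checkout@v4
  - id: deps-cache
    uses: actions/cache@v4
    with:
      path: node_modules
      key: ${{ runner.os }}-deps-${{ hashFiles('**/package-lock.json') }}
  # Skip the install entirely when node_modules was restored from the cache
  - name: Install dependencies on a cache miss
    if: steps.deps-cache.outputs.cache-hit != 'true'
    run: npm ci
```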

Selective testing

Avoid re-testing everything on every code change. This is inefficient, especially in large projects and monorepos.

Approaches for building and testing based on selective changes:

Implement CI logic to detect and test only the changed components.
Use tools like Vercel’s Turbo to detect changes and trigger the necessary builds and tests.
Example of selective testing using GitHub Actions’ paths filter on push triggers:

yaml

on:
  push:
    paths:
      - 'module-A/**'
      - 'module-B/**'
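
Path filters decide whether a workflow runs at all. To narrow what runs inside a workflow, a change-aware build tool can select only the affected packages. The following is a rough sketch that assumes a Turborepo workspace with a test task defined in turbo.json and a trunk branch named main; the filter asks Turborepo for packages changed relative to origin/main plus the packages that depend on them.

```yaml
on:
  pull_request:
    branches: [main]

jobs:
  test-changed:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Full history so the branch can be compared against origin/main
          fetch-depth: 0
      - run: npm ci
      # Assumption: Turborepo workspace with a "test" task in turbo.json
      - run: npx turbo run test --filter='...[origin/main]'
```

Note that this trades away the shallow clone: change detection needs enough git history to find the merge base.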

Fail-fast principle

Halting CI workflows on failure:

Configure your CI to stop all subsequent jobs upon the first failure, saving time and resources.
Example of a fail-fast setup in GitHub Actions using the if conditional and a custom action (andymckay/cancel-action) that uses GitHub’s API to cancel the current workflow run:

yaml

steps:
  # ... earlier steps, including one with id: lint_all ...
  - name: cancel_workflow
    if: failure() || steps.lint_all.outcome == 'failure'
    uses: andymckay/cancel-action@0.2
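
GitHub Actions also has built-in mechanisms that cover part of this without a custom action. In the sketch below, which uses illustrative job names and assumes lint and test scripts exist in package.json, the needs keyword makes the test job wait for lint and skips it if lint fails, while the matrix strategy's fail-fast setting cancels the remaining matrix jobs once one of them fails.

```yaml
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
  test:
    # Skipped automatically when the lint job fails
    needs: lint
    runs-on: ${{ matrix.os }}
    strategy:
      # Cancel in-progress and queued matrix jobs when any one fails (this is the default)
      fail-fast: true
      matrix:
        os: [ubuntu-latest, windows-latest]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test
```

The tradeoff is that serializing jobs with needs lengthens the run when everything passes, so it fits best when the gating job is quick.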

Configuring CI to cancel redundant executions:

Avoid running multiple CI executions for the same PR. Cancel any in-progress runs when a new push is made, since those runs are now testing an outdated version of the branch.
Example GitHub Actions configuration for canceling in-progress jobs using GitHub’s cancel-in-progress keyword:

yaml

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
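
In a trunk-based setup you may want every commit on trunk to finish its run even when a newer commit lands. A possible variation, assuming the trunk branch is named main, makes cancel-in-progress an expression so that only runs on other branches are canceled:

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  # Cancel superseded runs on PR branches, but let every run on main finish
  cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}
```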

Set timeouts to prevent hanging jobs:

Implement a maximum timeout for CI jobs to avoid long-running or stalled jobs.
Example timeout configuration:

yaml

jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 20
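
Timeouts can also be set on individual steps, which helps when one step is the usual culprit. This sketch assumes a hypothetical test:integration script and illustrative limits; without any job-level setting, GitHub Actions lets a job run for up to 360 minutes.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    # Job-level ceiling; the GitHub Actions default is 360 minutes
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run integration tests
        # Step-level limit for a step that is prone to hanging
        timeout-minutes: 10
        run: npm run test:integration
```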

Through these strategies and practices, you can significantly optimize your CI/CD pipelines, ensuring that they are not only robust and reliable but also cost-effective and time-efficient.

Additional considerations

When optimizing your CI/CD pipeline, consider your team’s development workflows and processes. For example, your team’s repo setup will greatly affect how you approach CI:

Monorepo vs. polyrepo:

Monorepo advantages: In a monorepo, all of your projects and libraries reside in a single repository, simplifying dependency management and ensuring consistency across the codebase. This setup can significantly benefit CI/CD as changes across multiple projects can be tested together, ensuring that integration points are always in sync.
Polyrepo considerations: Polyrepos, where each project or library has its own repository, naturally shard code, which introduces complexity in managing atomic changes across repositories. While it may make testing siloed feature sets easier, it ultimately complicates integrating these features.

Graphite's recommendation: Use a monorepo, especially for larger teams. It simplifies CI/CD pipeline management and fosters a more integrated development environment.

Key takeaways

Prioritize parallelization:
- Assess opportunities: Regularly review your test suites and other CI jobs to identify opportunities for parallel execution.
- Optimize resource allocation: Match the number of parallel jobs to your available computational resources to maximize efficiency without overloading the system.
- Use parallel-friendly tools: Employ tools and frameworks that inherently support parallel execution of tasks.

Avoid redundant work:
- Implement efficient caching: Cache dependencies and build artifacts to reuse them across jobs, reducing the time spent in setup.
- Optimize Docker layers: If using Docker, structure your Dockerfiles to take advantage of layer caching.
- Use smart build systems: Employ build systems that can intelligently skip unchanged parts of the codebase.

Embrace selective testing:
- Implement change detection: Use tools like Vercel’s Turbo to isolate testing on only the parts of the codebase that have changed.
- Configure path-based triggers: Set up your CI to trigger different workflows based on the paths of changed files.
- Shallow clone: Only check out the code you need for testing, cutting out time spent pulling down extraneous data.

Implement the fail-fast principle:
- Early exit on failure: Configure your CI pipeline to stop subsequent steps immediately after a failure is detected.
- Prioritize fast feedback loops: Run the quickest tests (like linters and static analysis) early in the CI process.

Optimize dependency management:
- Use dependency locking: Utilize lock files to ensure consistent dependency installation across all CI runs.
- Optimize dependency retrieval: Use a package manager that supports efficient retrieval and caching of dependencies.
- Prune unnecessary dependencies: Regularly audit your dependencies to remove or update those that are no longer needed or are outdated.
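
Several of these takeaways (shallow clones, dependency locking, and cached dependency retrieval) fit in the first few steps of a job. The following is a minimal sketch assuming an npm project with a committed package-lock.json; the Node version and job name are placeholders.

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Shallow clone: fetch only the latest commit (also the action's default)
          fetch-depth: 1
      - uses: actions/setup-node@v4
        with:
          # Placeholder version; use the one your project targets
          node-version: 20
          # Reuse npm's download cache between runs, keyed on the lockfile
          cache: npm
      # Install exactly what package-lock.json specifies
      - run: npm ci
      - run: npm test
```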

Optimizing your CI/CD pipeline means balancing efficiency, cost, and performance. By implementing strategies such as parallelization, avoiding redundant work, selective testing, and failing fast, you can achieve significant improvements in your CI/CD processes. Make sure to consider the structure of your codebase (monorepo vs. polyrepo) and the limitations of your CI/CD tools as you think about these optimizations.

These approaches have worked well for us at Graphite, but every company is different. We encourage you to experiment with these strategies, adapt them to your specific context, and continuously refine your CI/CD practices.

Let us know how these strategies work for you!
