The Eng Review Episode #2: Can we stop running these tests?

Chaitali Narla
May 01, 2025

About this series

There are a lot of excellent system design newsletters out there. Most of them cover understanding an existing system in depth or designing a new system from scratch. Both are extremely valuable in their own right, but this series is different.

One of your roles as a senior leader is to make decisions. Sometimes you have to make design decisions weighing tradeoffs when there is no clear right answer. Sometimes you have to make decisions navigating the technical and human elements in the equation. You also have to own the consequences of your decisions as an engineering leader.

In case you are thinking these are manager/director/VP problems, think again! Staff+ ICs need to make these decisions too!

In this series I cover real engineering projects. I go into the details of the tradeoffs we encountered and the decisions we made. Some are my stories and some are ones I’ve collected from colleagues.

Each episode has 2 sections:

  • The Choice: In this section I lay out the background and context for you, along with the decision question. You can put yourself in the decision maker’s shoes and think about what questions you’d ask, what decision you’d make, and how you’d explain it to the team. This section is for all ChaiTime readers.

  • The Outcome: In this section I talk about what actually happened. Some are success stories, some are failures, but many are in between. I share my own lessons learned, both technical and leadership-related, so you can learn from my experiences! This section is for my paid subscribers only since we will also be discussing it further in the Substack chat.

The Choice

A decade ago I was a senior software engineer on Google Chrome. We had an automated suite of system tests that we ran on every revision of every changelist (CL = Google-speak for a PR). These tests installed and started a full instance of Chrome on a specified OS (Windows, Mac, Linux initially; later Android, iOS), then performed various operations to ensure that critical functionality in the browser worked per user expectations. This suite covered all the critical user journeys, and any breakage was treated as a P0 issue.
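For a flavor of what one such test looks like: Chrome’s suite ran on its own internal harness, but here is a minimal sketch using pytest and Selenium purely as stand-ins (the fixture and the journey tested are illustrative, not Chrome’s actual code).

```python
# Illustrative sketch only: Chrome's real system tests ran on an internal
# harness, not pytest/Selenium. This just shows the shape of such a test.
import pytest
from selenium import webdriver


@pytest.fixture
def browser():
    # Launch a full browser instance for this test (assumes Chrome and a
    # matching chromedriver are already installed on the test machine).
    driver = webdriver.Chrome()
    yield driver
    driver.quit()  # Teardown: shut the browser down after the test.


def test_page_load_journey(browser):
    # One "critical user journey": the browser can navigate to a page
    # and render it with the expected title.
    browser.get("https://example.com")
    assert "Example Domain" in browser.title
```

Note the per-test launch and teardown in the fixture; multiplied across hundreds of tests and thousands of suite runs, that is where the cost lives.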

The problem was that we had many engineers submitting changelists multiple times a day, which meant many runs of the test suite. That demanded more computing capacity than our CI/CD system had, leading to backed-up test runs, blocked submits, and delayed releases.

Computing power and CI/CD systems a decade ago were more limited than today, but the speed and scale of changes to codebases have also gone up with AI-powered coding, so I’m sure this problem still exists today, just at a different scale.
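To see why per-revision runs add up, here is a back-of-envelope calculation. Every number below is hypothetical (the post gives no real figures); only the multiplication matters.

```python
# Hypothetical figures, for illustration only.
engineers = 500                 # engineers actively submitting changelists
cls_per_engineer_per_day = 2    # changelists landed per engineer per day
revisions_per_cl = 4            # each revision re-triggers the full suite
machine_hours_per_run = 3       # cost of one system-test suite run

runs_per_day = engineers * cls_per_engineer_per_day * revisions_per_cl
machine_hours_per_day = runs_per_day * machine_hours_per_run
print(runs_per_day, machine_hours_per_day)  # 4000 runs, 12000 machine-hours/day
```

A nightly-only schedule (Option 2 below) collapses that multiplier to a handful of runs per day per platform, which is where its compute savings come from.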

A few of us senior engineers got together to analyze this system and came up with the following options with associated pros and cons.

Option 1 - Speed up the test suite

Recommendation:

  • First focus on the common build, setup, and teardown steps of the test suite and optimize them as much as possible (one concrete form of this is sketched at the end of this option).

  • Do a deeper analysis of each test and optimize for speed.

Pros:

  • We’ll be able to extract more juice out of existing compute capacity if this succeeds.

Cons:

  • The speedup from the common steps may be marginal and possibly not worth the engineering time investment for the overall goal.

  • The payoff from per-test analysis and optimization is similarly unknown and possibly not worth the engineering time investment for the overall goal.

  • This might need to be a recurring project as new tests are added or existing tests are updated in the test suite.
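For a concrete flavor of what “optimize the common steps” can mean, one common trick is hoisting expensive setup from per-test to per-suite scope and replacing it with a cheaper per-test reset. This sketch assumes tests can safely share a browser instance, which is itself something you would have to verify.

```python
# Sketch of hoisting expensive setup to session scope (pytest/Selenium again
# as stand-ins). Only safe if tests don't depend on a pristine browser.
import pytest
from selenium import webdriver


@pytest.fixture(scope="session")
def browser():
    # Paid once per suite run instead of once per test.
    driver = webdriver.Chrome()
    yield driver
    driver.quit()


@pytest.fixture(autouse=True)
def reset_browser_state(browser):
    # Cheap per-test reset replaces a full launch/teardown cycle.
    browser.delete_all_cookies()
    browser.get("about:blank")
```

The cons above still apply: how much this saves depends on how much time the common steps actually take, and the analysis has to be redone as the suite grows.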

Option 2 - Move the tests downstream

Recommendation:

  • Instead of running these on every revision of every changelist, run them only on our nightly builds.

Pros:

  • We’ll need less compute capacity due to the reduced runs.

  • We will create headroom for new tests that may get added to the suite in the future.

Cons:

  • If the test run fails, we will need more compute capacity to run a bisection algorithm on all the changelists in the build until we find the culprit (sketched after this list).

  • Failure detection changes from near-instantaneous to much later, which could slow down development velocity, canceling out or even reversing the effect of the compute savings on overall release velocity.
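The bisection mentioned above is just a binary search over the changelists that landed between the last green nightly and the failing one. A rough sketch, where suite_passes_at is a hypothetical helper that syncs to a changelist, builds, and runs the suite (so each call is itself a full, expensive suite run):

```python
def find_culprit(changelists, suite_passes_at):
    """Return the first changelist at which the suite starts failing.

    Assumes `changelists` is ordered by landing time, the build before
    changelists[0] was green, the build at changelists[-1] is red, and
    there is a single culprit.
    """
    lo, hi = 0, len(changelists) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if suite_passes_at(changelists[mid]):
            lo = mid + 1   # breakage was introduced after mid
        else:
            hi = mid       # breakage is at mid or earlier
    return changelists[lo]
```

With N changelists in a nightly build, that is roughly log2(N) extra full suite runs, executed serially, which is exactly the extra compute and latency this option trades for.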

Option 3 - Stop running the tests

Recommendation:

  • There are robust small-to-large unit tests and a reasonable number of integration tests that check the important component-level integrations in the system. We also have an excellent setup of Chrome release channels across Stable, Beta, and Dev to catch problems from real user interactions early enough. Perhaps we don’t need system tests.

Pros:

  • Permanent and instant speedup in our release velocity since we eliminate an entire release stage.

  • Large savings in compute capacity.

Cons:

  • Risk of a major bug affecting users and causing an outage.

You are the engineering director in charge of Chrome release infra, and you are tasked with making this decision. Which option will you choose, and why? What other choices would you consider?

The Outcome

This post is for paid subscribers
