The Eng Review Episode #2: Can we stop running these tests?

Chaitali Narla
May 01, 2025

About this series

There are a lot of excellent system design newsletters out there. Most of them cover understanding an existing system in depth or designing a new system from scratch. Both are extremely valuable in their own right, but this series is different.

One of your roles as a senior leader is to make decisions. Sometimes you have to make design decisions weighing tradeoffs when there is no clear right answer. Sometimes you have to make decisions navigating the technical and human elements in the equation. You also have to own the consequences of your decisions as an engineering leader.

In case you are thinking these are manager/director/VP problems, think again! Staff+ ICs need to make these decisions too!

In this series I cover real engineering projects. I go into the details of the tradeoffs we encountered and the decisions we made. Some are my stories and some are ones I’ve collected from colleagues.

Each episode has 2 sections:

  • The Choice: In this section I lay out the background and context for you, along with the decision question. You can put yourself in the decision maker’s shoes and think about what questions you’d ask, what decision you’d make, and how you’d explain it to the team. This section is for all ChaiTime readers.

  • The Outcome: In this section I talk about what actually happened. Some are success stories, some are failures, but many are in between. I share my own lessons learned, both technical and leadership-related, so you can learn from my experiences! This section is for my paid subscribers only since we will also be discussing it further in the Substack chat.

The Choice

A decade ago I was a senior software engineer on Google Chrome. We had an automated suite of system tests that we ran on every revision of every changelist (CL = Google-speak for a PR). These tests installed and started a full instance of Chrome on a specified OS (Windows, Mac, Linux initially; later Android, iOS), then performed various operations to ensure that critical functionality in the browser worked per user expectations. This suite covered all the critical user journeys, and any breakage was treated as a P0 issue.
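For a flavor of what one such test looks like: Chrome’s suite ran on its own internal harness, but here is a minimal sketch using pytest and Selenium purely as stand-ins (the fixture and the journey tested are illustrative, not Chrome’s actual code).

```python
# Illustrative sketch only: Chrome's real system tests ran on an internal
# harness, not pytest/Selenium. This just shows the shape of such a test.
import pytest
from selenium import webdriver


@pytest.fixture
def browser():
    # Launch a full browser instance for this test (assumes Chrome and a
    # matching chromedriver are already installed on the test machine).
    driver = webdriver.Chrome()
    yield driver
    driver.quit()  # Teardown: shut the browser down after the test.


def test_page_load_journey(browser):
    # One "critical user journey": the browser can navigate to a page
    # and render it with the expected title.
    browser.get("https://example.com")
    assert "Example Domain" in browser.title
```

Note the per-test launch and teardown in the fixture; multiplied across hundreds of tests and thousands of suite runs, that is where the cost lives.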

The problem was that we had many engineers submitting changelists multiple times a day, which meant many runs of the test suite. That demanded more computing capacity than our CI/CD system had, leading to backed-up test runs, blocked submits, and delayed releases.

Computing power and CI/CD systems a decade ago were more limited than today, but the speed and scale of changes to codebases have also gone up with AI-powered coding, so I’m sure this problem still exists today, just at a different scale.
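To see why per-revision runs add up, here is a back-of-envelope calculation. Every number below is hypothetical (the post gives no real figures); only the multiplication matters.

```python
# Hypothetical figures, for illustration only.
engineers = 500                 # engineers actively submitting changelists
cls_per_engineer_per_day = 2    # changelists landed per engineer per day
revisions_per_cl = 4            # each revision re-triggers the full suite
machine_hours_per_run = 3       # cost of one system-test suite run

runs_per_day = engineers * cls_per_engineer_per_day * revisions_per_cl
machine_hours_per_day = runs_per_day * machine_hours_per_run
print(runs_per_day, machine_hours_per_day)  # 4000 runs, 12000 machine-hours/day
```

A nightly-only schedule (Option 2 below) collapses that multiplier to a handful of runs per day per platform, which is where its compute savings come from.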

A few of us senior engineers got together to analyze this system and came up with the following options with associated pros and cons.

Option 1 - Speed up the test suite

Recommendation:

  • First focus on the common build, setup, and teardown steps of the test suite and optimize them as much as possible (one concrete form of this is sketched at the end of this option).

  • Do a deeper analysis of each test and optimize for speed.

Pros:

  • We’ll be able to extract more juice out of existing compute capacity if this succeeds.

Cons:

  • The speedup from the common steps may be marginal and possibly not worth the engineering time investment for the overall goal.

  • The payoff from per-test analysis and optimization is similarly unknown and possibly not worth the engineering time investment for the overall goal.

  • This might need to be a recurring project as new tests are added or existing tests are updated in the test suite.
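For a concrete flavor of what “optimize the common steps” can mean, one common trick is hoisting expensive setup from per-test to per-suite scope and replacing it with a cheaper per-test reset. This sketch assumes tests can safely share a browser instance, which is itself something you would have to verify.

```python
# Sketch of hoisting expensive setup to session scope (pytest/Selenium again
# as stand-ins). Only safe if tests don't depend on a pristine browser.
import pytest
from selenium import webdriver


@pytest.fixture(scope="session")
def browser():
    # Paid once per suite run instead of once per test.
    driver = webdriver.Chrome()
    yield driver
    driver.quit()


@pytest.fixture(autouse=True)
def reset_browser_state(browser):
    # Cheap per-test reset replaces a full launch/teardown cycle.
    browser.delete_all_cookies()
    browser.get("about:blank")
```

The cons above still apply: how much this saves depends on how much time the common steps actually take, and the analysis has to be redone as the suite grows.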

Option 2 - Move the tests downstream

Recommendation:

  • Instead of running these on every revision of every changelist, run them only on our nightly builds.

Pros:

  • We’ll need less compute capacity due to the reduced runs.

  • We will create headroom for new tests that may get added to the suite in the future.

Cons:

  • If the test run fails, we will need more compute capacity to run a bisection algorithm on all the changelists in the build until we find the culprit (sketched after this list).

  • Failure detection changes from near-instantaneous to much later, which could slow down development velocity, canceling out or even reversing the effect of the compute savings on overall release velocity.
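The bisection mentioned above is just a binary search over the changelists that landed between the last green nightly and the failing one. A rough sketch, where suite_passes_at is a hypothetical helper that syncs to a changelist, builds, and runs the suite (so each call is itself a full, expensive suite run):

```python
def find_culprit(changelists, suite_passes_at):
    """Return the first changelist at which the suite starts failing.

    Assumes `changelists` is ordered by landing time, the build before
    changelists[0] was green, the build at changelists[-1] is red, and
    there is a single culprit.
    """
    lo, hi = 0, len(changelists) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if suite_passes_at(changelists[mid]):
            lo = mid + 1   # breakage was introduced after mid
        else:
            hi = mid       # breakage is at mid or earlier
    return changelists[lo]
```

With N changelists in a nightly build, that is roughly log2(N) extra full suite runs, executed serially, which is exactly the extra compute and latency this option trades for.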

Option 3 - Stop running the tests

Recommendation:

  • There are robust small-to-large unit tests and a reasonable number of integration tests that check the important component-level integrations in the system. We also have an excellent setup of Chrome release channels across Stable, Beta, and Dev to catch problems from real user interactions early enough. Perhaps we don’t need system tests.

Pros:

  • Permanent and instant speedup in our release velocity since we eliminate an entire release stage.

  • Large savings in compute capacity.

Cons:

  • Risk of a major bug affecting users and causing an outage.

You are the engineering director in charge of Chrome release infra, and you are tasked with making this decision. Which option will you choose, and why? What other choices would you consider?

The Outcome

This post is for paid subscribers
