Perforce
In 2011, Dan Bloch, the tech lead on the Perforce admin team at Google, published “Still All On One Server: Perforce at Scale.” The paper described how Google’s source control system, servicing “over twelve thousand users” a day, still ran off of a single Perforce server, tucked away under a stairwell in Building 43 on its main campus.
As of 2011, that single server had been in operation for eleven years of Google history. It had served Google the two-year-old startup, and had now scaled to support Google the public company. In fact, around that time, a lucky Google engineer had just snagged changelist #20,000,000. Still chugging along, the server was now executing “11-12 million commands” a day.
In the paper, Bloch described some of the successful efforts that had gone into scaling the server. Behind the scenes, though, this picture was less rosy.
Google, no longer the startup it had once been, had resources now, resources it had spent on some of the finest hardware money could buy. Even so, the server was stretched thin. From time to time, there would be TCP connection failures as Google maxed out the CPU of the box. At all times, a hot standby was kept running and a team of eight admins kept watch, performing the routine heroics needed to keep Google’s source control server alive.
For years, Google had known internally that this was a risk. Engineers sought alternatives. But at Google’s scale — “the busiest single Perforce server on the planet, and one of the largest repositories in any source control system” — there was no clear alternative.
We’ve previously written about how Linus created Git in 2005 when he couldn’t find any solution that could scale performantly to the immense size of the Linux kernel repository. A few years after the “Still All On One Server: Perforce at Scale” paper, Google shared some stats about the change volume in its monolith. As of 2014, “approximately 15 million lines of code were changed in approximately 250,000 files,” every week. To put this into perspective, that volume is equivalent to rewriting the 2014 Linux kernel from scratch… weekly, nine years after Linus had already had to grapple with the complexity of its size.
Piper
Google engineers had been considering alternatives to this single server since 2008.
Breaking up the monolith was briefly entertained, but rejected — a significant, industry-changing decision in hindsight, as it would set the standard for handling code complexity at scale for decades to come.
In the years since, Google has invented and pioneered much of the industry’s large-monorepo tooling and significantly influenced pro-monorepo culture. (For example, see its 2016 paper, “Why Google Stores Billions of Lines of Code in a Single Repository.”)
At the time, however, the decision to continue committing to the monorepo was a non-obvious one. Up until this point, Google’s monorepo had developed relatively naturally; this was the first decision that would force the organization to explicitly commit to the architecture. It stood in stark contrast to the prevailing wisdom of the Git community at the time: that folks should have “more and smaller repositories”, in part due to the sheer cost of cloning a massive monolith. (Google would later build tooling, first SourceFS and then Clients in the Cloud, CitC for short, to address this.) Had Google chosen to adopt conventional wisdom — and not to explicitly affirm its commitment to a monorepo — things would likely look quite different today.
Google also briefly considered a migration from Perforce to SVN, believing that SVN might be able to stretch to the scale it needed. That option fell through, however, when engineers couldn’t find a clear migration path.
In the end, like Linus in 2005, the only path forward seemed to be to invent something new.
The new system was called Piper (based on some engineers’ love of planes and short for “Piper is Piper expanded recursively”) and is still in use today.
The Piper logo, from “Why Google Stores Billions of Lines of Code in a Single Repository”
Migration
Once engineers had decided on the shape of an alternative (Piper would be distributed and “implemented on top of standard Google infrastructure, originally Bigtable”), the next step was to actually build it. Once Piper, the proposed Perforce alternative, was deployed, they would then have to cut over all of the traffic and migrate Google’s entire monorepo.
The effort took over four years.
At first blush, this seems shockingly long, but over the course of its eleven years, Perforce had embedded itself deeply into Google’s software ecosystem, touching almost every engineering surface. When the migration started, there were already 300 tools that relied on the Perforce API. More surprisingly — and critically — production dependencies on Perforce kept cropping up. In an ideal world, a version control system should be strictly internal-facing and able to fall over without impacting live traffic; however, the Piper team kept discovering that this wasn’t the case. All in all, the engineers conducting the migration needed to be extremely careful so as not to disrupt Google’s end-user experience.
Complicating matters, in 2010, Oracle sued Google over its usage of licensed Java API interfaces in the Android operating system. The case wouldn’t be fully decided until 2021, by which point it had been escalated all the way to the Supreme Court. In the meantime, however, it stoked fears among Google engineers about the migration off of the Perforce API — specifically, how they could migrate seamlessly without completely copying its interface. In the end, Google engineers used a famous industry workaround, clean room design, in which technical writers wrote out a specification that was then implemented from scratch by a separate, “clean” team of engineers with no knowledge of the original API.
As the years passed, the tone of the project shifted. At the beginning, Piper had been a fresh, exciting idea, one that might be the key to solving Google’s Perforce problems. As time wore on, though, the work took on new urgency.
Development at Google had continued while the migration was underway, and in the years that passed, the load on Perforce had only continued to rise, increasing the stakes. There had also been substantial new development atop the Perforce APIs, including now-prominent systems like Blaze (Google’s build system, later open-sourced as Bazel) and TAP (Google’s internal testing platform).
What made this an especially bold bet was that in addition to the significant investment — a team of engineers allocated years of roadmap time — the outcome of all of this work was all-or-nothing; there were no interim benefits to be had. Because it was crucial to maintain a single source of truth for source code, Google wouldn’t be able to reap any benefits from Piper until it could cut over entirely. Either Piper would succeed and be able to entirely take over the role of source control server or it would fail, sending Google back to the drawing board, with years lost in development hell, and an even more overloaded Perforce server.
So when the Piper team needed to implement Paxos, they were permitted to steal an expert from the Google Spanner team, even before Spanner had implemented Paxos itself.
By the late stages of the migration, commits were successfully being double-written to both Perforce and Piper. The team tested out partial rollouts to limited regions. All 25,000 Google engineers had been corralled into migrating their workspaces to Piper, a painstaking effort for the Piper team, which involved personally going to the desks of holdouts — some extremely senior, tenured engineers — and convincing them to migrate to unblock the overall project.
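Google hasn’t published the internals of that double-write machinery, but the general shape of the pattern is familiar from other storage migrations: every commit lands in the legacy system, which remains the source of truth, and is mirrored to the new system, with any divergence logged for investigation rather than surfaced to users. The sketch below illustrates the idea in Python; the class and method names are hypothetical, not Google’s.

```python
# A minimal sketch of a double-write shim (hypothetical names; Google's
# actual implementation was never published). The legacy system remains
# the source of truth while every commit is mirrored to the new backend
# and the two are compared for drift.

class DoubleWriteRepo:
    def __init__(self, legacy, new, mismatch_log):
        self.legacy = legacy            # legacy backend, e.g. the Perforce server
        self.new = new                  # new backend under test, e.g. Piper
        self.mismatch_log = mismatch_log

    def submit(self, change):
        # 1. Commit to the legacy system first. If this fails, the user sees
        #    the same error they always would, and nothing is half-applied.
        legacy_rev = self.legacy.submit(change)

        # 2. Mirror the commit to the new system. A failure here must never
        #    break the user's workflow, so it is recorded rather than raised.
        try:
            new_rev = self.new.submit(change)
            if self.new.content_hash(new_rev) != self.legacy.content_hash(legacy_rev):
                self.mismatch_log.record(change, legacy_rev, new_rev=new_rev)
        except Exception as err:
            self.mismatch_log.record(change, legacy_rev, error=err)

        # 3. Users keep seeing legacy revision numbers until the final cutover
        #    flips the source of truth to the new system.
        return legacy_rev
```

A mismatch log that stays empty over weeks of real traffic is what gives a team the confidence to attempt the kind of cutover described below.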
On a Saturday, the Piper team, by then ten engineers, assembled in a conference room on campus. In classic engineering fashion, someone ordered pizza. Jeff Dean, now Alphabet’s chief scientist, personally stopped by the room that day to check in on the progress, helping boost morale and further emphasizing the critical nature of the project.
Though the team had practiced, had scripts on hand, had folks looking over other folks’ shoulders, and had detailed runbooks for the incidents they might encounter, it was hard to ignore the elephant in the room: there was a possibility that those ten engineers were about to take Google down.
For a few minutes while the migration was underway, the source control system became read-only as state was frozen and migrated from Perforce to Piper. The room held its breath.
Then… the migration was complete.
No loss of state; Google’s production instances unaffected.
Just like that, the years-long all-or-nothing bet had paid off.
Afterwards
The cutover to Piper had the immediate effect of reducing Google’s operational risk by removing its dependency on the single, overloaded Perforce server. But, in time, the migration also unblocked a number of new systems by supporting volumes of traffic the old server never could have (notably Tricorder, Google’s static analysis tool).
After the migration in 2012, the number of automated commits took off
Today, when we think about internal tooling at Google, it’s easy to picture state-of-the-art systems created by a giant, faceless corporation with significant time and resources at its disposal. But the migration to Piper shows a different side of things: 2012, eight years after the IPO, was still a time of scrappiness and daring engineering.