Git was built in 5 days

“Any program is only as good as it is useful.” - Linus Torvalds, creator of Linux and Git, in a 2007 interview.

If you’re a developer who has been putting off understanding how Git works under the hood, then this guide is for you.

For those of us who get by with only knowing a few commands, Git can be a little mystifying.

A deeper understanding of Git, however, will help you feel more confident in your workflow and let you get things done faster. In this guide, we’ll show you the fascinating history of Git, Git internals, and a guide to commands for common Git workflows that you can show off later.

Where did Git come from?

Git was built in roughly 5 days and released in April 2005 by Linus Torvalds, the creator of the Linux kernel. What motivated him to build what became the world’s most used version control system? Two words: licensing dispute.

Torvalds was fed up when the company behind BitKeeper, the source code management system for the Linux kernel and many other open-source projects, revoked the Linux developers’ free license. BitKeeper’s parent company did this because Andrew Tridgell, an open-source developer, attempted to create an open-source version of the BitKeeper client without accepting its proprietary license. Torvalds called Tridgell’s work a “bad project”.

Needing a new source code management system to keep Linux alive, Torvalds built Git in less than a week to attempt to match the utility of BitKeeper. His inspiration becomes clear when you compare some examples of Git commands that we all use today with their BitKeeper equivalents:

Initialize a repository

BitKeeper: bk setup .

Git: git init

Clone a repository

BitKeeper: bk clone <repository>

Git: git clone <repository>

Commit changes

BitKeeper: bk citool (or bk commit)

Git: git commit -m "<message>"

Torvalds didn’t just copy and paste, however; he also designed Git for speed. Like BitKeeper, Git runs on a distributed model rather than a centralized one, so it doesn’t need a remote server to record changes. Every developer keeps a full local copy of the source code and its history, which makes changes fast and efficient for developer teams working on the same project.

"Git, to some degree, was designed on the principle that everything you ever do on a daily basis should take less than a second," said Torvalds in a 2005 InfoWorld interview.

Torvalds wasn’t alone in wanting a more efficient BitKeeper alternative, and Git wasn’t the only distributed VCS of its era. Darcs was released 2 years before Git, Bazaar 13 days before, Mercurial 12 days after, and Fossil a year after. Despite all of these competitors, Git quickly became the most widely used VCS according to Stack Overflow surveys, with estimated adoption growing from 69% in 2017 to 94% in 2021.

By introducing SHA-1 hashing, trees, and other improvements, Torvalds built Git to enable functionality like branch merging and commit tagging. Compared to BitKeeper, these features made the software development workflow much more efficient and easier to manage at scale.

Understanding Git internals

In order to get a better understanding of Git under the hood, let’s look at some practical examples. To follow along with this article you’ll need:

Git installed
Your terminal opened into an empty directory
A command-line text editor (e.g. nano, vim), or an IDE (e.g. VSCode)
Optional: The tree command installed on either Mac, Windows, or Linux; this will visualize what trees look like under the hood in Git.

First, run git init in an empty directory, and we can get started.

Porcelain and plumbing in Git

As we explore Git internals, one of the first things to recognize is that Git is made up of two different command sets: porcelain and plumbing. Porcelain commands are used to interface with Git in the terminal, like the familiar git init, git pull, git add, git commit, and git push. Plumbing commands are lower-level and reveal Git internals. Many of these plumbing commands aren’t meant to be used manually on the command line, but rather to be used as building blocks for new tools and custom scripts.

Let’s walk through how Git is structured and explore that structure with some plumbing commands to help us get a better understanding of what’s happening under the hood.

Storage

To understand how Git prevents collisions between different pieces of code, we’re first going to explore Git’s storage schema. There are only four types of Git objects: blobs, trees, commits, and tags.

Blobs (binary large object)

Git stores and represents file content as a blob, short for “binary large object”. A blob contains a file’s contents but doesn’t contain any metadata.

Without metadata like a filename, Git needs another way to tell files apart, so it identifies each blob by its contents. SHA-1, a cryptographic hash function, turns those contents into a 40-character hexadecimal hash that serves as the blob’s identifier. We can see this in action by looking at how Git stores the string hello world.

First, let’s create a text file in our empty directory called foobar.txt that contains hello world:

Terminal

% echo "hello world" > foobar.txt
% cat foobar.txt
hello world

Now, let’s calculate the SHA-1 hash for hello world using the plumbing command git hash-object:

Terminal

% echo 'hello world' | git hash-object -w --stdin

3b18e512dba79e4c8300dd08aeb37f8e728b8dad

This command takes the string hello world, turns it into a Git object, writes that object to Git’s database (that’s what the -w flag does), and then returns its SHA-1 hash.

Even without the filename, you can use that very same hash to find the object in Git’s database. If you run git cat-file with the -p flag to “pretty print”, you’ll be able to see the string representation of the binary contents of our object.
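It helps to know that the hash covers slightly more than the raw text: before hashing, Git prepends a short header made of the object type, the content size in bytes, and a NUL byte. Here’s a minimal sketch that reproduces the hash without Git at all. It assumes a sha1sum binary is available (on macOS, shasum is the usual substitute), and the variable names are purely illustrative.

```shell
# Reproduce Git's blob hash by hand: SHA-1 over "blob <size>\0<content>".
content='hello world'

# echo appends a newline, so the stored content is 12 bytes, not 11.
size=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Assumes sha1sum is installed; on macOS, shasum is the usual substitute.
manual_hash=$(printf 'blob %s\0%s\n' "$size" "$content" | sha1sum | cut -d' ' -f1)

echo "$manual_hash"
```

The output should match the hash that git hash-object returned above, which is why two files with identical contents always end up as the same blob.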

Terminal

% git cat-file -p "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"

hello world
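Besides -p, git cat-file can also report an object’s type (-t) and its size in bytes (-s). The sketch below runs in a throwaway directory created with mktemp so it doesn’t touch your working repository; the variable names are just for illustration.

```shell
# Work in a throwaway repository so nothing else is touched.
cd "$(mktemp -d)" && git init -q

# Store the same content as above and keep the returned hash.
blob_hash=$(echo 'hello world' | git hash-object -w --stdin)

blob_type=$(git cat-file -t "$blob_hash")   # object type
blob_size=$(git cat-file -s "$blob_hash")   # content size in bytes

echo "$blob_type $blob_size"
```

This should print blob 12: the object is a blob, and its content is the 11 characters of hello world plus the trailing newline that echo adds.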

These “blobs” are how file contents are stored in Git. However, remembering the SHA-1 hash for every piece of code in your database isn’t practical, which brings us to the other data types that compose the Git schema.

Trees

A tree is an object that represents one level of directory information, recording blob identifiers, pathnames, and bits of metadata for all the files within that directory. Trees build the complete hierarchy of files and subdirectories in Git. They use file names and their identifiers to reference other Git objects: blobs and even other trees (a.k.a. subtrees).

We can see how Git lays out its data on disk by looking at the .git directory. If you have the tree command installed on Mac, Windows, or Linux, run tree .git (this is the operating system’s directory-listing utility, not a Git tree object):
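To make this concrete, here’s a small sketch that builds a tree object by hand with the plumbing commands git update-index and git write-tree, then prints it. It runs in a throwaway directory, and the variable names are illustrative only.

```shell
# Work in a throwaway repository so nothing else is touched.
cd "$(mktemp -d)" && git init -q

blob_hash=$(echo 'hello world' | git hash-object -w --stdin)

# Register the blob in the index under a file name, with mode 100644 (regular file).
git update-index --add --cacheinfo 100644 "$blob_hash" foobar.txt

# Write the index out as a tree object, then print that tree's entries.
tree_hash=$(git write-tree)
tree_listing=$(git cat-file -p "$tree_hash")

echo "$tree_listing"
```

The listing should contain a single entry with the mode, the object type, the blob’s hash, and the file name. That is how a tree attaches a name to an otherwise anonymous blob.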

Terminal

% tree .git
.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── 3b
│   │   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

You can see in the objects subdirectory where our hello world was stored: the 3b is the first two characters of our hash. The remaining 38 characters, 18e512…, are the name of the file within that subdirectory. The contents of that file are the object itself, stored in a compressed binary format.
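If you’d rather not install tree, the same path can be derived from the hash directly. This is a minimal sketch in a throwaway directory; object_dir, object_file, and object_path are names made up for the example.

```shell
# Work in a throwaway repository so nothing else is touched.
cd "$(mktemp -d)" && git init -q
blob_hash=$(echo 'hello world' | git hash-object -w --stdin)

# The first two characters name the subdirectory; the remaining 38 name the file.
object_dir=$(printf '%s' "$blob_hash" | cut -c1-2)
object_file=$(printf '%s' "$blob_hash" | cut -c3-)
object_path=".git/objects/$object_dir/$object_file"

ls "$object_path"
```

Because the file is compressed, opening it in a text editor shows unreadable bytes; git cat-file -p remains the easiest way to read it back.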

That’s great for file storage, but where does the version control component come in? Let’s make some more Git objects: commits and tags.

Commits

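A commit is an object that stores metadata for each change in the repository, such as commit date, log message, author (the person who originally wrote the code), and committer (the person who last applied the change). Each commit also points to the tree that captures the state of the project at that moment, which is how Git ties file storage to version history.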

Now that we have the file hashed and stored, let’s add a commit for our previously created hello world in foobar.txt:

Terminal

% git add foobar.txt
% git commit -m "Adding hello world to foobar.txt"
[master (root-commit) e7625cf] Adding hello world to foobar.txt
 1 file changed, 1 insertion(+)
 create mode 100644 foobar.txt

We’ve now created our first commit object, which, just like our blob, has an identifying hash: e7625cf (the abbreviated form of the full 40-character hash).

Next we’ll dive into how Git processes your code changes and the different stages where your objects actually live, as they make their way into the Git database. Before we move on, let’s open a new empty directory, run git init, and make a new foobar.txt with echo "hello world" > foobar.txt.
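A commit can be inspected with the same plumbing command we used for blobs. The sketch below recreates the commit in a throwaway directory and prints the raw commit object. The name and email are placeholders passed with -c so the example works without any global Git configuration, and your commit hash will differ from the one above because it depends on the author and timestamp.

```shell
# Work in a throwaway repository so nothing else is touched.
cd "$(mktemp -d)" && git init -q
echo 'hello world' > foobar.txt
git add foobar.txt

# The name and email are placeholders so the commit works without global config.
git -c user.name='Example Author' -c user.email='author@example.com' \
    commit -q -m 'Adding hello world to foobar.txt'

full_hash=$(git rev-parse HEAD)               # full 40-character commit hash
commit_body=$(git cat-file -p "$full_hash")   # raw contents of the commit object

echo "$commit_body"
```

The output should start with a tree line pointing at the snapshot, followed by author and committer lines and then the commit message.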

The three spaces where Git processes and stores your code

Working area

Occasionally referred to as the working tree, the working area is the actual directory on your filesystem that reflects the changes you’re making — including files not handled by Git (i.e. untracked files). You can make changes without worrying about losing your work if it's already stored in the repository. It serves as the space where developers make their immediate changes, and it is separate from the other parts of the project managed by Git.

Staging area

Also referred to as the cache or index, the staging area is where files are prepared to be a part of the next commit. It’s how Git understands what’s going to change between the current commit and the next one. A “clean” staging area (not to be mistaken for an empty one) is essentially a copy of the latest commit: it contains a list of files and the SHA-1 hashes of those files as of that commit. When you add (git add <file>), remove (git rm <file>), or rename (git mv <source> <destination>) files, Git updates the staging area, and the differing SHA-1 hashes show which files have changed relative to the repository.
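You can look inside the staging area with the plumbing command git ls-files --stage. Here’s a short sketch, again in a throwaway directory, with index_entry as an illustrative variable name.

```shell
# Work in a throwaway repository so nothing else is touched.
cd "$(mktemp -d)" && git init -q
echo 'hello world' > foobar.txt
git add foobar.txt

# Each index entry lists: mode, blob hash, stage number, then the path.
index_entry=$(git ls-files --stage)

echo "$index_entry"
```

The entry should show mode 100644, the same 3b18e5… blob hash we computed earlier, stage number 0, and foobar.txt. Staging a file writes its blob to the object database and records the hash in the index.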

Repository

The repository in Git contains all of your commits, representing snapshots of what the working and staging areas looked like at the time of each commit. These files are stored safely in the .git directory, allowing you to continue making changes without fear of losing previous versions — you can always check out a fresh copy. It's the core part of Git that holds the entire history of the project, providing traceability and flexibility.
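To watch a file travel through all three spaces, we can check git status at each step. This sketch uses the script-friendly --porcelain output format in a throwaway directory; the variable names and the placeholder identity passed with -c are assumptions made for the example.

```shell
# Work in a throwaway repository so nothing else is touched.
cd "$(mktemp -d)" && git init -q

# 1. Working area: the file exists on disk, but Git isn't tracking it yet.
echo 'hello world' > foobar.txt
in_working_area=$(git status --porcelain)

# 2. Staging area: the file is queued for the next commit.
git add foobar.txt
in_staging_area=$(git status --porcelain)

# 3. Repository: the snapshot is committed (placeholder identity for the example).
git -c user.name='Example Author' -c user.email='author@example.com' \
    commit -q -m 'Adding hello world to foobar.txt'
in_repository=$(git status --porcelain)

echo "working:    $in_working_area"
echo "staged:     $in_staging_area"
echo "committed:  $in_repository"
```

The first status should read ?? foobar.txt (untracked), the second A  foobar.txt (added to the index), and the third should be empty because the working area, staging area, and repository all agree.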

Summary

Hopefully this has given you a better understanding of Git internals, so you can be more intentional and confident when wrestling with Git commands.

Here’s a quick wrap-up:

History of Git:

Linus Torvalds built Git in 2005, motivated by a licensing dispute with BitKeeper.
Git uses a distributed model of version control and enables functionality like branch merging and commit tagging.
Git is the most widely used VCS: adoption grew from 69% in 2017 to 94% in 2021.

Git internals:

Git stores everything in four object types: blobs, trees, commits, and tags.
Git has three processing spaces: the working area, the staging area, and the repository.
Branches: dynamic references to a series of commits, automatically updating to point to the most recent commit on the branch (a.k.a the HEAD of the branch) whenever a new commit is made.
