
“Any program is only as good as it is useful.” - Linus Torvalds, creator of Linux and Git, in a 2007 interview.

If you’re a developer who has been putting off understanding how Git works under the hood, then this guide is for you.

For those of us who get by with only knowing a few commands, Git can be a little mystifying.

A deeper understanding of Git, however, will help you feel more confident in your workflow and allow you to get things done faster. In this guide, we’ll cover the fascinating history of Git, Git internals, and a set of commands for common Git workflows that you can show off later.

Git was built in roughly 5 days and released in April 2005 by Linus Torvalds, the creator of the Linux kernel. What motivated him to build what became the world’s most used version control system? Two words: licensing dispute.

Torvalds was fed up when the company behind BitKeeper, the source code management system for the Linux kernel and many other open-source projects, revoked the Linux developers’ free license. BitKeeper’s parent company did this because Andrew Tridgell, an open-source developer, attempted to create an open-source version of the BitKeeper client without accepting its proprietary license. Torvalds called Tridgell’s work a “bad project”.

Needing a new source code management system to keep Linux alive, Torvalds built Git in less than a week to attempt to match the utility of BitKeeper. His inspiration becomes clear when you compare some examples of Git commands that we all use today with their BitKeeper equivalents:

Initialize a repository

BitKeeper: bk setup .

Git: git init

Clone a repository

BitKeeper: bk clone <repository>

Git: git clone <repository>

Commit changes

BitKeeper: bk citool (or bk commit)

Git: git commit -m "<message>"

Torvalds didn’t just copy and paste, however; he designed Git for speed and built it around a distributed model rather than a centralized one. Git doesn’t need a remote server to make changes. Every clone stores a full local copy of the source code and its history, so making changes is fast and efficient for developer teams working on the same project.

"Git, to some degree, was designed on the principle that everything you ever do on a daily basis should take less than a second," said Torvalds in a 2005 InfoWorld interview.

Torvalds wasn’t alone in wanting a more efficient BitKeeper alternative, and Git wasn’t the only distributed VCS of its era. Darcs was released 2 years before Git, Bazaar was released 13 days before Git, Mercurial was released 12 days after Git, and Fossil was released a year after Git. Despite all of these competitors, Git has fast become the most widely used VCS according to Stack Overflow surveys, with estimated adoption growing from 69% in 2017 to 94% in 2021.

By introducing SHA-1 hashing, trees, and other improvements, Torvalds built Git to enable functionality like branch merging and commit tagging. Compared to BitKeeper, these features made the software development workflow much more efficient and easier to manage at scale.

In order to get a better understanding of Git under the hood, let’s look at some practical examples. To follow along with this article you’ll need:

  • Git installed

  • Your terminal opened into an empty directory

  • A command-line text editor (e.g. nano, vim), or an IDE (e.g. VSCode)

  • Optional: The tree command installed on either Mac, Windows, or Linux; this will visualize what trees look like under the hood in Git.

First, run git init in an empty directory, and we can get started.

As we explore Git internals, one of the first things to recognize is that Git is made up of two different command sets: porcelain and plumbing. Porcelain commands are used to interface with Git in the terminal, like the familiar git init, git pull, git add, git commit, and git push. Plumbing commands are lower-level and reveal Git internals. Many of these plumbing commands aren’t meant to be used manually on the command line, but rather to be used as building blocks for new tools and custom scripts.
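As a rough sketch of the difference, the snippet below stages a file using only plumbing commands (git hash-object and git update-index) and then checks the result with the porcelain git status. The temporary directory and the variable names repo, blob, and status are made up for illustration; in day-to-day work you would simply run git add.

```shell
# Work in a throwaway repository (hypothetical location created by mktemp).
repo=$(mktemp -d)
cd "$repo"
git init -q

echo 'hello world' > foobar.txt

# Plumbing: write the file's contents into the object database...
blob=$(git hash-object -w foobar.txt)
# ...and record that blob in the index under the path foobar.txt.
git update-index --add --cacheinfo 100644 "$blob" foobar.txt

# Porcelain: the file now shows up as staged ("A"), just as after `git add`.
status=$(git status --short)
echo "$status"
```

The two plumbing calls together do roughly what git add foobar.txt does in one step, which is why porcelain is what you normally type.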

Let’s walk through how Git is structured and explore that structure with some plumbing commands to help us get a better understanding of what’s happening under the hood.

To understand how Git prevents collisions between different pieces of code we’re first going to explore Git’s storage schema. There are only four types of Git objects: blobs, trees, commits, and tags.

Git stores and represents file content as a blob, short for “binary large object”. A blob contains a file’s contents but doesn’t contain any metadata.

Without metadata like a filename, Git needs another way to identify a blob. It runs the blob’s contents through SHA-1, a cryptographic hash function, which produces a 40-digit hexadecimal number that identifies those contents uniquely. We can see this in action by looking at how Git stores the string hello world.

First, let’s create a text file in our empty directory called foobar.txt that contains hello world:

Terminal
% echo "hello world" > foobar.txt
% cat foobar.txt
hello world

Now, let’s calculate the SHA-1 hash for hello world using the plumbing command git hash-object:

Terminal
% echo 'hello world' | git hash-object -w --stdin
3b18e512dba79e4c8300dd08aeb37f8e728b8dad

This command takes the string hello world, creates it as an object in Git, writes the object to Git’s database, then returns the SHA-1 hash.
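To make the hashing step less magical: for a blob, Git hashes a short header (the word blob, the content length in bytes, and a NUL byte) followed by the content itself. The sketch below reproduces the hash by hand, assuming a system where the sha1sum tool is available (on macOS, shasum is the usual substitute) and a repository using the default SHA-1 object format. The variable names are illustrative.

```shell
# 'hello world' plus the trailing newline that echo adds is 12 bytes.
manual=$(printf 'blob 12\0hello world\n' | sha1sum | cut -d' ' -f1)

# Ask Git for the hash of the same content, without writing it (-w omitted).
from_git=$(echo 'hello world' | git hash-object --stdin)

# Both lines print 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
echo "$manual"
echo "$from_git"
```

Because the filename is not part of the input, any file with these exact contents gets this same hash.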

Even without the filename, you can use that very same hash to find the object in Git’s database. If we run git cat-file with the -p flag to “pretty print”, we’ll be able to see the string representation of the binary contents of our object.

Terminal
% git cat-file -p "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
hello world
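git cat-file can also report an object's type and size, which is a quick way to confirm that what we stored is a blob with no filename attached. A small sketch, run in a throwaway repository so that it does not depend on the steps above (the file name some-other-name.txt and the variable names are hypothetical):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

hash=$(echo 'hello world' | git hash-object -w --stdin)

obj_type=$(git cat-file -t "$hash")   # object type
obj_size=$(git cat-file -s "$hash")   # content size in bytes
echo "$obj_type $obj_size"            # prints: blob 12

# Identical content maps to the same object, whatever the file is called.
echo 'hello world' > some-other-name.txt
same=$(git hash-object some-other-name.txt)
echo "$same"
```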

These “blobs” are how file contents are stored in Git. However, remembering the SHA for every piece of code in your database isn’t practical, which brings us to the other data types that compose the Git schema.

A tree is an object that represents one level of directory information, recording blob identifiers, pathnames, and bits of metadata for all the files within that directory. Trees build the complete hierarchy of files and subdirectories in Git. They use file names and their identifiers to reference other Git objects: blobs and even other trees (a.k.a. subtrees).
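To see what a tree records, here is a minimal sketch that stages one file and one subdirectory, asks Git to write the corresponding tree objects with the plumbing command git write-tree, and lists the root tree with git ls-tree. The docs directory and notes.txt file are hypothetical names added only to produce a subtree.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

echo 'hello world' > foobar.txt
mkdir docs
echo 'notes' > docs/notes.txt
git add foobar.txt docs/notes.txt

# write-tree turns the current index into tree objects and prints the root tree's hash.
tree=$(git write-tree)

# Each entry lists: mode, object type, hash, and name.
# Expect a "040000 tree ..." line for docs and a "100644 blob ..." line for foobar.txt.
git ls-tree "$tree"
```

The blob line carries the same 3b18e5… hash we computed earlier, now paired with a filename and a file mode.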

We can see where all of this lives in the .git directory. If you have the tree command installed on either Mac, Windows, or Linux, run tree .git:

Terminal
% tree .git
.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── 3b
│   │   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

You can see in the objects subdirectory where our hello world was stored: the 3b is the first two characters of our hash. The remaining 38 characters, 18e512…, form the name of the file within that subdirectory. The contents of the file are the object itself, stored in a compressed binary format.
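The mapping from hash to file path can be sketched in a few lines of shell. This assumes a freshly written ("loose") object in a throwaway repository; the variable names dir, file, and path are illustrative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

hash=$(echo 'hello world' | git hash-object -w --stdin)

# Split the hash into the directory name (first 2 characters)
# and the file name (remaining 38 characters).
dir=$(echo "$hash" | cut -c1-2)
file=$(echo "$hash" | cut -c3-)
path=".git/objects/$dir/$file"

ls "$path"    # prints: .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```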

That’s great for file storage, but where does the version control component come in? Let’s make some more Git objects: commits and tags.

A commit is an object that points to a tree (the snapshot of the project at that moment) and stores metadata for each change in the repository, such as commit date, log message, author (the person who originally wrote the change), and committer (the person who applied the change to the repository).

Now that we have the file hashed and stored, let’s add a commit for our previously created hello world in foobar.txt:

Terminal
% git add foobar.txt
% git commit -m "Adding hello world to foobar.txt"
[master (root-commit) e7625cf] Adding hello world to foobar.txt
1 file changed, 1 insertion(+)
create mode 100644 foobar.txt

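We’ve now created our first commit object, which, just like our blob, has an identifying hash: e7625cf. Your hash will differ, because it is computed from the author, committer, and timestamps as well as the content.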
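To look inside a commit object, git cat-file works just as it did for the blob. The sketch below repeats the commit in a throwaway repository and prints the raw commit; the identity (Jane Doe, jane@example.com) is a placeholder set only for this temporary repository.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set only for this throwaway repository.
git config user.name 'Jane Doe'
git config user.email 'jane@example.com'

echo 'hello world' > foobar.txt
git add foobar.txt
git commit -q -m 'Adding hello world to foobar.txt'

# Pretty-print the commit object: a tree line, author, committer, then the message.
git cat-file -p HEAD

commit_type=$(git cat-file -t HEAD)
root_tree=$(git rev-parse 'HEAD^{tree}')
```

The first line of the output names the tree that the commit points to, which is how a commit ties its metadata to a snapshot of the files.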

Next we’ll dive into how Git processes your code changes and the different stages where your objects actually live as they make their way into the Git database. Before we move on, let’s open a new empty directory, run git init, and make a new foobar.txt with echo "hello world" > foobar.txt.

Occasionally referred to as the working tree, the working area is the actual directory on your filesystem that reflects the changes you’re making — including files not handled by Git (i.e. untracked files). You can make changes without worrying about losing your work if it's already stored in the repository. It serves as the space where developers make their immediate changes, and it is separate from the other parts of the project managed by Git.

Also referred to as the cache or index, the staging area is where files are prepared to be a part of the next commit. It's how Git understands what's going to change between the current commit and the next one. A “clean” staging area (not to be mistaken for an empty one) is essentially a copy of the latest commit: it contains a list of files and the SHA-1 hash of each file as of its last commit. When you add (git add <file>), remove (git rm <file>), or rename (git mv <old> <new>) files, Git updates the staging area, and the changed entries no longer match the SHA-1 hashes recorded in the latest commit.
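The contents of the staging area can be inspected with the plumbing command git ls-files --stage. A minimal sketch in a throwaway repository (the variable names before and after are illustrative):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

echo 'hello world' > foobar.txt
git add foobar.txt

# ls-files --stage lists each index entry: mode, blob hash, stage number, path.
before=$(git ls-files --stage)
echo "$before"

# Change the file and stage it again: the index now records a different blob hash.
echo 'hello again' > foobar.txt
git add foobar.txt
after=$(git ls-files --stage)
echo "$after"
```

Note that nothing has been committed here; git add alone is enough to write blobs and update the index.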

The repository in Git contains all of your commits, representing snapshots of what the working and staging areas looked like at the time of each commit. These files are stored safely in the .git directory, allowing you to continue making changes without fear of losing previous versions — you can always check out a fresh copy. It's the core part of Git that holds the entire history of the project, providing traceability and flexibility.
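Here is a small sketch of how the three spaces interact: a committed file is overwritten in the working area, and the committed version is then restored from the repository. It assumes a throwaway repository and a placeholder identity (Jane Doe, jane@example.com).

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set only for this throwaway repository.
git config user.name 'Jane Doe'
git config user.email 'jane@example.com'

echo 'hello world' > foobar.txt
git add foobar.txt
git commit -q -m 'Adding hello world to foobar.txt'

# Working area: overwrite the file. Nothing is staged or committed yet.
echo 'oops' > foobar.txt
git status --short            # prints " M foobar.txt" (modified, not staged)

# Repository: the committed version is still intact...
committed=$(git show HEAD:foobar.txt)

# ...so the working copy can be restored from it.
git checkout -- foobar.txt
restored=$(cat foobar.txt)
echo "$restored"              # prints: hello world
```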

Now that we have commits, blobs, and trees, there is one more object that we haven’t discussed yet: the tag!

Tags are a lot like commit objects, but instead of pointing to a tree, tags point to a commit. A tag is similar to a branch reference, but it’s static: it always points to the same commit and gives it a more human-readable name. For example, it’s easier to read v1.0.0 than the commit hash 9ca102… Annotated tags, the most commonly used tags, include the tagger’s name, email, date, and message, capturing a significant point (e.g. a release version) in a repo’s history.

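Let’s make an annotated tag that marks the current commit of foobar.txt as a release version (if you started over in a fresh directory, commit foobar.txt first, since a tag needs a commit to point to):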

Terminal
% git tag -a v1.0.0 -m "Release version 1.0.0 to the world"

Verify the tag was created by running:

Terminal
% git show v1.0.0
Tagger: Jane Doe <info@graphite.com>
Date: Mon Aug 7 15:57:30 2023 -0400
Release version 1.0.0 to the world

Great, we can see our commit, without having to remember that pesky hash.
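To check that an annotated tag really is its own object, we can compare it with a lightweight tag, which is only a name for a commit. This is a sketch in a throwaway repository; the tag name v1.0.0-light and the identity are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set only for this throwaway repository.
git config user.name 'Jane Doe'
git config user.email 'jane@example.com'

echo 'hello world' > foobar.txt
git add foobar.txt
git commit -q -m 'Adding hello world to foobar.txt'

git tag -a v1.0.0 -m 'Release version 1.0.0 to the world'   # annotated tag
git tag v1.0.0-light                                         # lightweight tag

# An annotated tag is a tag object; a lightweight tag resolves straight to the commit.
annotated_type=$(git cat-file -t v1.0.0)
light_type=$(git cat-file -t v1.0.0-light)
echo "$annotated_type $light_type"    # prints: tag commit

# Both names ultimately resolve to the same commit.
tagged_commit=$(git rev-parse 'v1.0.0^{commit}')
head_commit=$(git rev-parse HEAD)
```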

Branches, unlike tags, are dynamic references to a series of commits. Whenever a new commit is made on a branch, the branch pointer moves to it, so the branch name always references the most recent commit on the branch (a.k.a. the head, or tip, of the branch). HEAD, in capitals, points to the branch you currently have checked out. If you git checkout a different branch, HEAD moves to point to that branch.
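The moving branch pointer can be observed directly. The sketch below makes two commits in a throwaway repository and records what the branch name resolves to before and after the second one. It reads the branch name from HEAD rather than assuming master or main; the identity is a placeholder.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set only for this throwaway repository.
git config user.name 'Jane Doe'
git config user.email 'jane@example.com'

echo 'hello world' > foobar.txt
git add foobar.txt
git commit -q -m 'This is my first change'

# HEAD is a symbolic reference to the current branch.
branch=$(git symbolic-ref --short HEAD)
tip_before=$(git rev-parse "$branch")

echo 'hello again' >> foobar.txt
git commit -q -am 'This is my second change'

# The branch name now resolves to the new commit, and HEAD follows the branch.
tip_after=$(git rev-parse "$branch")
head_now=$(git rev-parse HEAD)
echo "$branch: $tip_before -> $tip_after"
```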

HEAD can also become a “detached HEAD” if you point it to a specific commit rather than a branch. You may recognize the spooky “You are in 'detached HEAD' state” message. Let’s see what this looks like by first grabbing a commit hash from the history of your foobar.txt:

Terminal
% git log --oneline
0a8e526 This is my second change
fcc7845 This is my first change

Now we’ll check out the latest commit by its hash (instead of a branch):

Terminal
% git checkout 0a8e526
Note: switching to '0a8e526'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch

This can be dangerous: if you make commits in the detached HEAD state and then switch away without creating a branch or tag for them, nothing references them, and Git will eventually delete them through garbage collection. Such unreferenced commits are known as dangling commits.

To save your dangling commit and escape the detached HEAD state, you can create a new branch that points to this commit (git checkout -b <new-branch-name>):

Terminal
% git checkout -b feature_1
Switched to a new branch 'feature_1'
% git log --oneline
0a8e526 (HEAD -> feature_1) This is my second change
fcc7845 This is my first change

Now your head is reattached, and you’re ready to start pushing more changes.
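If you have already switched away from a detached HEAD, the commit is not lost right away. The sketch below, run in a throwaway repository, makes a commit while detached, switches back to the branch, and then rescues the commit by hash. The branch name rescued and the identity are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, set only for this throwaway repository.
git config user.name 'Jane Doe'
git config user.email 'jane@example.com'

echo 'hello world' > foobar.txt
git add foobar.txt
git commit -q -m 'This is my first change'
branch=$(git symbolic-ref --short HEAD)

# Detach HEAD at the current commit and make a commit that no branch points to.
git checkout -q --detach
echo 'experiment' >> foobar.txt
git commit -q -am 'Experimental change'
orphan=$(git rev-parse HEAD)

# Switch back to the branch: the experimental commit is now unreferenced.
git checkout -q "$branch"

# Until garbage collection removes it, the commit can still be reached by hash
# (git reflog lists recent HEAD positions if you did not note the hash).
git branch rescued "$orphan"
git log --oneline rescued
```

Once the rescued branch exists, the commit is referenced again and garbage collection will leave it alone.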

Hopefully this has given you a better understanding of Git internals, so you can be more intentional and confident when wrestling with Git commands.

Here’s a quick wrap-up:

  • Linus Torvalds built Git in 2005, motivated by a licensing dispute with BitKeeper.

  • Git is built on a distributed model of version control and added functionality like branch merging and commit tagging.

  • Git is the most widely used VCS: adoption grew from 69% in 2017 to 94% in 2021.

  • Git stores everything in four object types: blobs, trees, commits, and tags.

  • Git has three processing spaces: the working area, the staging area, and the repository.

  • Branches: dynamic references to a series of commits, automatically updating to point to the most recent commit on the branch (a.k.a. the head, or tip, of the branch) whenever a new commit is made.
