“Any program is only as good as it is useful.” - Linus Torvalds, creator of Linux and Git, in a 2007 interview.
If you’re a developer who has been putting off understanding how Git works under the hood, then this guide is for you.
For those of us who get by with only knowing a few commands, Git can be a little mystifying.
A deeper understanding of Git, however, will help you feel more confident in your workflow and allow you to get things done faster. In this guide, we’ll cover the fascinating history of Git, Git internals, and commands for common Git workflows that you can show off later.
Git was built in roughly 5 days and released in April 2005 by Linus Torvalds, the creator of the Linux kernel. What motivated him to build what became the world’s most used version control system? Two words: licensing dispute.
Torvalds was fed up when the company behind BitKeeper, the source code management system for the Linux kernel and many other open-source projects, revoked the Linux developers’ free license. BitKeeper’s parent company did this because Andrew Tridgell, an open-source developer, attempted to create an open-source version of the BitKeeper client without accepting its proprietary license. Torvalds called Tridgell’s work a “bad project”.
Needing a new source code management system to keep Linux alive, Torvalds built Git in less than a week to attempt to match the utility of BitKeeper. His inspiration becomes clear when you compare some of the Git commands we all use today with their BitKeeper equivalents:
Initialize a repository
bk setup .
git init
Clone a repository
bk clone <repository>
git clone <repository>
Commit changes
bk citool (or bk commit)
git commit -m "<message>"
Torvalds didn’t just copy and paste, however; he also focused on efficiency, building Git on a distributed model rather than a centralized one. Git doesn’t need a remote server to record changes. Every clone stores a full local copy of the repository, including its history, so everyday operations are fast for developer teams working on the same project.
"Git, to some degree, was designed on the principle that everything you ever do on a daily basis should take less than a second," said Torvalds in a 2005 InfoWorld interview.
Torvalds wasn’t alone in wanting a more efficient BitKeeper alternative, and Git wasn’t the only distributed VCS of its era. Darcs was released 2 years before Git, Bazaar 13 days before, Mercurial 12 days after, and Fossil a year after. Despite all of these competitors, Git quickly became the most widely used VCS according to Stack Overflow surveys, with estimated adoption growing from 69% in 2017 to 94% in 2021.
By introducing SHA-1 hashing, trees, and other improvements, Torvalds built Git to enable functionality like branch merging and commit tagging. Compared to BitKeeper, these features made the software development workflow much more efficient and easier to manage at scale.
In order to get a better understanding of Git under the hood, let’s look at some practical examples. To follow along with this article you’ll need:
Your terminal opened into an empty directory
A command-line text editor (e.g. nano, vim), or an IDE (e.g. VSCode)
Run git init in the empty directory, and we can get started.
As we explore Git internals, one of the first things to recognize is that Git is made up of two different command sets: porcelain and plumbing. Porcelain commands are used to interface with Git in the terminal, like the familiar git commit and git push. Plumbing commands are lower-level and reveal Git internals. Many of these plumbing commands aren’t meant to be used manually on the command line, but rather to be used as building blocks for new tools and custom scripts.
Let’s walk through how Git is structured and explore that structure with some plumbing commands to get a better understanding of what’s happening under the hood.
To understand how Git prevents collisions between different pieces of code we’re first going to explore Git’s storage schema. There are only four types of Git objects: blobs, trees, commits, and tags.
Git stores and represents file content as a blob, short for “binary large object”. A blob contains a file’s contents but doesn’t contain any metadata.
To identify files without metadata like a filename, Git runs the blob’s contents through SHA-1, a cryptographic hash function, which produces a 40-digit hexadecimal number that serves as the blob’s unique identifier. We can see this in action by looking at how Git stores the string hello world.
First, let’s create a text file in our empty directory called foobar.txt that contains hello world:
% echo "hello world" > foobar.txt
% cat foobar.txt
hello world
Now, let’s calculate the SHA-1 hash for hello world using the plumbing command git hash-object:
% echo 'hello world' | git hash-object -w --stdin
3b18e512dba79e4c8300dd08aeb37f8e728b8dad
This command takes the string hello world, creates a Git object from it, writes the object to Git’s database (that’s what the -w flag does), and then returns the SHA-1 hash.
Even without the filename, you can use that very same hash to find the object in Git’s database. If we run git cat-file with the -p flag to “pretty print”, we’ll be able to see the string representation of the binary contents of our object:
% git cat-file -p "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
hello world
These “blobs” are how file contents are stored in Git. However, remembering the SHA-1 hash for every piece of code in your database isn’t practical, which brings us to the other data types that compose the Git schema.
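The hash above isn’t computed over the raw text alone. Git first prepends a short header (the object type, the content length in bytes, and a NUL byte) and hashes the result. Here is a minimal Python sketch of that calculation for a default SHA-1 repository; git_blob_hash is a hypothetical helper name for illustration, not part of Git:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes the header "blob <size>" + NUL byte, followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a trailing newline, so the stored content is 12 bytes long.
print(git_blob_hash(b"hello world\n"))
# → 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Because the filename never enters the calculation, two files with identical contents map to the same blob.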
A tree is an object that represents one level of directory information, recording blob identifiers, pathnames, and bits of metadata for all the files within that directory. Trees build the complete hierarchy of files and subdirectories in Git. They use file names and their identifiers to reference other Git objects: blobs, and even other trees (a.k.a. subtrees).
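To make that description concrete, here is a rough Python sketch of how a tree object’s contents can be assembled: each entry is a file mode, a name, a NUL byte, and the 20 raw bytes of the referenced object’s hash. The helper names build_tree and git_object_hash are made up for this example, and the sketch only handles plain files, not subtrees:

```python
import hashlib


def git_object_hash(obj_type: str, body: bytes) -> str:
    """Hash an object as Git does: type, size, NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_tree(entries):
    """Serialize (mode, name, hex_id) entries into a tree object body."""
    body = b""
    # Git keeps entries sorted by name; this sketch covers files only
    # (real Git compares directory names as if they ended with '/').
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(hex_id)
    return body


blob_id = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # blob for "hello world\n"
tree_body = build_tree([("100644", "foobar.txt", blob_id)])
print(git_object_hash("tree", tree_body))
```

Note that the filename lives in the tree, not in the blob, which is why blobs can stay free of metadata.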
We can see where Git keeps its objects on disk by listing the .git directory:
% tree .git
.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── 3b
│   │   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags
You can see in the objects subdirectory where our hello world was stored: the directory name 3b is the first two characters of our hash. The remaining 38 characters, 18e512…, form the name of the file within that subdirectory. The contents of the file are the object itself, stored in a compressed binary format.
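The layout described above can be imitated in a few lines of Python. This is a sketch under the assumption of a SHA-1 repository with loose (unpacked) objects, which are zlib-compressed on disk; write_loose_object and read_loose_object are hypothetical names, and a temporary directory stands in for .git/objects:

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store an object under objects_dir using the loose-object layout."""
    store = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    object_id = hashlib.sha1(store).hexdigest()
    # First two hex characters name the directory, the other 38 name the file.
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir: str, object_id: str):
    """Return (type, body) for a stored object, like a bare-bones cat-file."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, body


objects_dir = tempfile.mkdtemp()  # stand-in for .git/objects
oid = write_loose_object(objects_dir, "blob", b"hello world\n")
print(oid[:2], oid[2:])
print(read_loose_object(objects_dir, oid))
```

Reading the object back is essentially what git cat-file does for us: decompress the file, strip the header, and show the body.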
That’s great for file storage, but where does the version control component come in? Let’s make some more Git objects: commits and tags.
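A commit is an object that stores metadata for each change in the repository, such as commit date, log message, author (the person who originally wrote the change), and committer (the person who last applied it). Each commit also points to the top-level tree that captures the project’s files at that moment, and to its parent commit(s).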
Now that we have the file hashed and stored, let’s add a commit for our previously created hello world in foobar.txt:
% git add foobar.txt
% git commit -m "Adding hello world to foobar.txt"
[master (root-commit) e7625cf] Adding hello world to foobar.txt
 1 file changed, 1 insertion(+)
 create mode 100644 foobar.txt
We’ve now created our first commit object, which, just like our blob, has an identifying hash, abbreviated in the output above to e7625cf.
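Under the hood, a commit object is a small piece of text that is hashed the same way as a blob. The sketch below builds one in Python; build_commit is a hypothetical helper, and the author, timestamp, and tree ID are placeholder values chosen for illustration:

```python
import hashlib


def git_object_hash(obj_type: str, body: bytes) -> str:
    """Hash an object as Git does: type, size, NUL byte, then the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit(tree_id, message, author, timestamp, tz="+0000", parents=()):
    """Serialize a commit body; author and committer are the same person here."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has no parents
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


# All values below are placeholders for illustration.
body = build_commit(
    tree_id="4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # ID of the empty tree
    message="Initial commit",
    author="Jane Doe <jane@example.com>",
    timestamp=1691438250,
)
print(body.decode("utf-8"))
print(git_object_hash("commit", body))
```

Since the author, committer, and timestamps are part of the hashed content, the commit hash you get when following along will differ from the one shown in this article, even though the blob hash is identical.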
Next we’ll dive into how Git processes your code changes and the different stages where your objects actually live as they make their way into the Git database. Before we move on, let’s open a new empty directory, run git init, and make a new file with echo "hello world" > foobar.txt.
Occasionally referred to as the working tree, the working area is the actual directory on your filesystem that reflects the changes you’re making — including files not handled by Git (i.e. untracked files). You can make changes without worrying about losing your work if it's already stored in the repository. It serves as the space where developers make their immediate changes, and it is separate from the other parts of the project managed by Git.
Also referred to as the cache or index, the staging area is where files are prepared to be a part of the next commit. It's how Git understands what's going to change between the current commit and the next one. A “clean” staging area (not to be mistaken for an empty one) is essentially a copy of the latest commit: it contains a list of files and the SHA-1 hash each file had in that commit. When you add (git add <file>), remove (git rm <file>), or rename (git mv <source> <destination>) files, Git recognizes the differing SHA-1 hashes between the changed files and the ones from the repository.
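One way to picture the staging area is as a table that maps each path to a blob hash. The Python sketch below is a toy model of that idea, not Git’s real index file format; staged_changes and blob_id are hypothetical names:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the blob hash Git would assign to this content."""
    header = f"blob {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def staged_changes(last_commit: dict, index: dict) -> dict:
    """Compare two path -> blob ID maps and classify each differing path."""
    changes = {}
    for path in sorted(set(last_commit) | set(index)):
        old, new = last_commit.get(path), index.get(path)
        if old == new:
            continue  # identical hash means identical content
        if old is None:
            changes[path] = "added"
        elif new is None:
            changes[path] = "deleted"
        else:
            changes[path] = "modified"
    return changes


last_commit = {"foobar.txt": blob_id(b"hello world\n")}
index = dict(last_commit)  # a "clean" index mirrors the last commit
print(staged_changes(last_commit, index))
# → {}

index["foobar.txt"] = blob_id(b"hello again\n")  # like: git add foobar.txt
index["notes.txt"] = blob_id(b"todo\n")          # like: git add notes.txt
print(staged_changes(last_commit, index))
# → {'foobar.txt': 'modified', 'notes.txt': 'added'}
```

Comparing hashes rather than file contents is what keeps this check cheap, even in large repositories.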
The repository in Git contains all of your commits, representing snapshots of what the working and staging areas looked like at the time of each commit. These files are stored safely in the .git directory, allowing you to continue making changes without fear of losing previous versions — you can always check out a fresh copy. It's the core part of Git that holds the entire history of the project, providing traceability and flexibility.
Now that we’ve covered commits, blobs, and trees, there is one more object we haven’t discussed yet: the tag!
Tags are a lot like commit objects, but instead of pointing to a tree, tags point to a commit. A tag is similar to a branch reference, but it’s static: it always points to the same commit and gives it a more human-readable name. For example, v1.0.0 is easier to read than the commit hash 9ca102… Annotated tags, the most commonly used tags, include the tagger’s name, email, date, and message, capturing a significant point (e.g. release version) in a repo’s history after a commit.
Let’s make an annotated tag by turning your foobar.txt into a release version:
$ git tag -a v1.0.0 -m "Release version 1.0.0 to the world"
Verify the tag was created by running:
$ git show v1.0.0
Tagger: Jane Doe <firstname.lastname@example.org>
Date: Mon Aug 7 15:57:30 2023 -0400

Release version 1.0.0 to the world
Great, we can see our commit without having to remember that pesky hash.
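For comparison with the commit object, here is a Python sketch of the text an annotated tag object contains. The helper name build_tag, the all-zero commit ID, and the tagger details are placeholders for illustration, not values from a real repository:

```python
import hashlib


def build_tag(commit_id, name, tagger, timestamp, message, tz="+0000"):
    """Serialize the body of an annotated tag that points at a commit."""
    return (
        f"object {commit_id}\n"
        "type commit\n"
        f"tag {name}\n"
        f"tagger {tagger} {timestamp} {tz}\n"
        "\n"
        f"{message}\n"
    ).encode("utf-8")


# The commit ID and tagger below are placeholders, not real values.
body = build_tag(
    "0" * 40,
    "v1.0.0",
    "Jane Doe <jane@example.com>",
    1691438250,
    "Release version 1.0.0 to the world",
)
header = f"tag {len(body)}".encode("ascii") + b"\0"
print(body.decode("utf-8"))
print(hashlib.sha1(header + body).hexdigest())
```

The first line is the key difference from a commit: a tag names the object it points to (here a commit) instead of a tree.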
Branches, unlike tags, are dynamic references to a series of commits. Whenever a new commit is made, the current branch moves to point to it, so the branch name always references the most recent commit on the branch (a.k.a. the head, or tip, of the branch). HEAD, in turn, points to the branch you currently have checked out. If you git checkout a different branch, HEAD moves to point to that branch.
HEAD can also become a “detached HEAD” if you point it to a specific commit rather than a branch. You may recognize the spooky “You are in 'detached HEAD' state” message. Let’s see what this looks like by first grabbing the hash of a previous commit from your git log:
% git log --oneline
0a8e526 This is my second change
fcc7845 This is my first change
Now we’ll check out the latest commit hash (instead of a branch):
% git checkout 0a8e526
Note: switching to '0a8e526'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.
This is dangerous because commits made in the detached HEAD state don’t belong to any branch. If you switch away without doing anything, nothing references them anymore, and Git will eventually delete them through garbage collection. Those commits are known as dangling commits.
To save your dangling commit and escape the detached HEAD state, you can create a new branch that points to this commit (git checkout -b <new-branch-name>):
% git checkout -b feature_1
Switched to a new branch 'feature_1'
% git log --oneline
0a8e526 (HEAD -> feature_1) This is my second change
fcc7845 This is my first change
Now your head is reattached, and you’re ready to start pushing more changes.
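If it helps to see the moving parts, the relationship between branches and HEAD can be modeled in a few lines of Python. This is a deliberately simplified toy, not how Git is implemented; the ToyRepo class and the commit ID 1111111 are invented for the example:

```python
class ToyRepo:
    """Toy model of refs: branch names map to commit IDs, HEAD picks one."""

    def __init__(self):
        self.branches = {}                # branch name -> commit ID
        self.head = ("branch", "master")  # ("branch", name) or ("detached", id)

    def current_commit(self):
        kind, value = self.head
        return self.branches.get(value) if kind == "branch" else value

    def commit(self, commit_id):
        kind, value = self.head
        if kind == "branch":
            self.branches[value] = commit_id     # the branch moves with the commit
        else:
            self.head = ("detached", commit_id)  # only HEAD moves

    def checkout(self, target):
        # A branch name attaches HEAD; anything else is treated as a commit ID.
        kind = "branch" if target in self.branches else "detached"
        self.head = (kind, target)

    def checkout_new_branch(self, name):
        self.branches[name] = self.current_commit()
        self.head = ("branch", name)


repo = ToyRepo()
repo.commit("fcc7845")
repo.commit("0a8e526")
repo.checkout("fcc7845")            # detached HEAD
repo.commit("1111111")              # hypothetical commit that is on no branch
print(repo.branches)
# → {'master': '0a8e526'}
repo.checkout_new_branch("rescue")  # now a branch keeps 1111111 reachable
print(repo.branches)
# → {'master': '0a8e526', 'rescue': '1111111'}
```

In the detached state the new commit is recorded only in HEAD, so moving HEAD elsewhere would leave nothing pointing at it. Creating a branch first is what makes it safe.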
Hopefully this has given you a better understanding of the Git internals, so you can be more intentional and confident when wrestling with Git commands.
Here’s a quick wrap-up:
Linus Torvalds built Git in 2005, motivated by a licensing dispute with BitKeeper.
Git uses a distributed model of version control and supports functionality like branch merging and commit tagging.
Git is the most widely used VCS: adoption grew from 69% in 2017 to 94% in 2021.
Git stores everything in four object types: blobs, trees, commits, and tags.
Git has three processing spaces: the working area, the staging area, and the repository.
Branches are dynamic references to a series of commits: the current branch automatically updates to point to the most recent commit (a.k.a. the head, or tip, of the branch) whenever a new commit is made.