
How garbage collection works in Git

Greg Foster
Graphite software engineer


Note

This guide explains this concept in vanilla Git. For Graphite documentation, see our CLI docs.


Garbage collection in Git is the process by which Git cleans up unnecessary files and optimizes the local repository's data store. When you make changes to a repository—such as committing new changes, rebasing branches, or deleting references—Git creates and updates various objects (blobs, trees, commits, etc.). Over time, some of these objects become obsolete or redundant. Git GC helps in removing these objects to free up space and keep the repository efficient.

Git runs garbage collection automatically under certain conditions so that the repository does not consume excessive disk space or degrade in performance. Commands that can create many loose objects, such as git commit, git merge, and git rebase, implicitly call git gc --auto when they finish. That command checks whether a cleanup is needed based on the number of loose objects (the gc.auto setting, 6700 by default) and the number of pack files (the gc.autoPackLimit setting, 50 by default), and does nothing if both are below their thresholds.
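To see the inputs that git gc --auto looks at, you can inspect the loose object count and the configured thresholds yourself. The following is a minimal sketch that works in a throwaway repository created with mktemp; the fallback values echoed for unset settings are Git's documented defaults, and the variable names are only illustrative.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create one loose object by writing a blob directly into the object store.
echo "hello" | git hash-object -w --stdin

# "count" is the number of loose objects; "packs" is the number of pack files.
git count-objects -v

# Show the thresholds, falling back to the documented defaults when unset.
echo "gc.auto=$(git config --get gc.auto || echo 6700)"
echo "gc.autoPackLimit=$(git config --get gc.autoPackLimit || echo 50)"

# Let Git decide for itself; far below the thresholds, this should do nothing.
git gc --auto
loose=$(git count-objects -v | awk '/^count:/ {print $2}')
echo "loose objects after git gc --auto: $loose"
```

With a single loose object and default settings, the automatic run should leave the repository unchanged.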

Repacking, part of the Git garbage collection process, involves combining multiple pack files into a single pack file and compressing the contents to save space and enhance performance. A pack file in Git is a binary file containing compressed object data, allowing Git to reduce the filesystem footprint and improve the speed of operations like cloning, fetching, and pushing.

During the repacking process, Git performs the following steps:

  1. Identifying loose objects: These are individual files in the .git/objects directory. Initially, each new object (like a commit, tree, or blob) is stored as a separate file, or a loose object.

  2. Combining objects into packs: Git combines these loose objects into pack files. If there are existing pack files, Git can repack them into fewer, more compressed pack files.

  3. Delta compression: During repacking, Git applies delta compression, where objects similar to each other are stored as deltas. This means that only the differences between the objects are stored, which significantly reduces the amount of space needed.

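  4. Removing redundant data: As part of garbage collection, Git also removes data that is no longer necessary, such as objects that are no longer reachable from any branch, tag, or reflog entry. By default, unreachable objects are only deleted once they are older than a grace period (two weeks, controlled by gc.pruneExpire).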
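The steps above can be observed in a small experiment. This sketch assumes a scratch repository created with mktemp and a placeholder author identity; it makes a few commits, then compares git count-objects -v before and after a manual git gc, and finally lists the pack contents with git verify-pack.

```shell
# Scratch repository; the author identity below is a placeholder.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Each commit writes a new blob, tree, and commit as loose objects.
for i in 1 2 3; do
  echo "version $i" > file.txt
  git add file.txt
  git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "commit $i"
done

echo "Before gc:"
git count-objects -v

# Pack the loose objects and remove the now-redundant loose files.
git gc --quiet

echo "After gc:"
git count-objects -v

# List every object stored in the resulting pack, including any deltas.
git verify-pack -v .git/objects/pack/*.idx

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs_after=$(git count-objects -v | awk '/^packs:/ {print $2}')
```

In the count-objects output, "count" is the number of loose objects and "packs" is the number of pack files, so the first should drop and the second should rise after the run.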

Although Git runs garbage collection automatically, you can manually invoke the process using the git gc command. This can be particularly useful in a few scenarios:

  • Repository cleanup before backups: Before backing up a repository, running git gc ensures that the backup is compact and devoid of unnecessary data.
  • After a large number of refs are deleted: If you've recently deleted a large number of branches or tags, running git gc will help clean up the related objects and references.
  • Recovery from corrupted repository data: Sometimes, problems in the object store can be mitigated by repacking and cleaning up the objects. Running git fsck first helps identify which objects are actually damaged or missing.
Terminal
git gc

This command runs the default garbage collection function, cleaning up loose objects, compressing them into packs, and removing unnecessary files from .git/objects.

  • Aggressive cleanup:

    Terminal
    git gc --aggressive

    This option is more thorough than the default run. It discards existing deltas and recomputes them for all packed objects, which can lead to better compression at the cost of considerably more CPU time. It is rarely needed for routine maintenance.

  • Pruning old objects:

    Terminal
    git gc --prune=now

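    This command prunes every unreachable object immediately, regardless of its age, instead of waiting for the default two-week grace period. Avoid running it while another process is writing to the repository, since objects that are still being written may not be referenced yet.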

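  • Viewing the log: When an automatic run (git gc --auto) is detached into the background and reports a warning or error, Git writes the message to .git/gc.log. While that file exists and is newer than gc.logExpiry (one day by default), later automatic runs print its contents instead of running again, so check this file if automatic garbage collection appears to have stopped.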
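To illustrate the difference between a default run and --prune=now, the sketch below writes a blob that nothing references and checks whether it survives each kind of run. It assumes a scratch repository and default gc settings; the variable names are only illustrative.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Write a blob that no commit, tree, or ref points to: an unreachable object.
orphan=$(echo "orphaned content" | git hash-object -w --stdin)

# A default run keeps it, because it is newer than the grace period.
git gc --quiet
if git cat-file -e "$orphan" 2>/dev/null; then
  kept_by_default=yes
else
  kept_by_default=no
fi

# --prune=now removes unreachable objects regardless of their age.
git gc --quiet --prune=now
if git cat-file -e "$orphan" 2>/dev/null; then
  kept_after_prune=yes
else
  kept_after_prune=no
fi

echo "kept after default gc: $kept_by_default"
echo "kept after --prune=now: $kept_after_prune"
```

Under default settings, the object should still exist after the first run and be gone after the second.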

While git gc orchestrates a variety of housekeeping tasks, including calling git prune, the git prune command specifically deals with the removal of inaccessible object files. The distinction is that git prune is a part of what git gc does, focusing solely on object deletion, whereas git gc involves a broader range of repository optimization actions.
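If you want to see what git prune would delete without removing anything, its -n (dry run) option lists the candidates. A minimal sketch, again in a scratch repository with an unreferenced blob created only for illustration:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Create an unreachable loose object to give git prune something to find.
orphan=$(echo "orphaned content" | git hash-object -w --stdin)

# -n reports the objects that would be removed, but deletes nothing.
would_prune=$(git prune -n)
echo "$would_prune"

# The object is still in the object store after the dry run.
git cat-file -e "$orphan" && echo "object still present after dry run"
```

Dropping -n performs the deletion; unlike git gc, git prune on its own applies no grace period unless you pass --expire.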

git gc is mostly safe, as it is designed to only remove objects that are not reachable from any refs and not present in the staging area. By default it also keeps unreachable objects that are newer than the two-week grace period, as well as commits still referenced by the reflog. However, pruning aggressively, for example with git gc --prune=now or after expiring the reflog, can make recovery of accidentally deleted data much harder, because objects that were not explicitly linked to current branches or tags are removed immediately rather than lingering for a while. The --aggressive option on its own mainly changes how hard Git works at compression, but combining it with immediate pruning leaves the fewest opportunities for recovery.
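As an illustration of why a default run is usually recoverable, the sketch below drops a commit with git reset, runs git gc, and then restores the commit through its hash. It assumes a scratch repository; the commit helper function, the placeholder identity, and the branch name recovered are hypothetical.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Helper that commits with a placeholder identity.
commit() {
  git -c user.name="Example" -c user.email="example@example.com" \
    commit -q "$@"
}

echo "one" > file.txt && git add file.txt && commit -m "first"
echo "two" > file.txt && git add file.txt && commit -m "second"
lost=$(git rev-parse HEAD)

# Simulate an accidental deletion by moving the branch back one commit.
git reset -q --hard HEAD~1

# The reflog still references the dropped commit, so a default gc keeps it.
git gc --quiet
git reflog -n 3

# Recover the commit by pointing a new branch at its hash.
git branch recovered "$lost"
git log --oneline recovered
recovered_tip=$(git rev-parse recovered)
```

Had the reflog been expired and the repository pruned with --prune=now before the recovery step, the commit would no longer be available.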

For further reading on garbage collection in Git, refer to the official Git documentation.
