How garbage collection works in Git

Garbage collection in Git is the process by which Git cleans up unnecessary files and optimizes the local repository's data store. When you make changes to a repository—such as committing new changes, rebasing branches, or deleting references—Git creates and updates various objects (blobs, trees, commits, etc.). Over time, some of these objects become obsolete or redundant. Git GC helps in removing these objects to free up space and keep the repository efficient.

Automatic garbage collection

Git runs garbage collection automatically under certain conditions so that the repository does not consume excessive disk space or degrade in performance. Commands such as git commit, git merge, and git rebase often produce many loose objects, so they implicitly call git gc --auto when they finish. That command checks whether a cleanup is needed: by default it acts only when the repository holds more than roughly 6,700 loose objects (the gc.auto setting) or more than 50 pack files (the gc.autoPackLimit setting), and otherwise exits without doing anything.
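The two inputs that git gc --auto weighs can be inspected directly. The following is a minimal sketch, assuming a throwaway repository created in a temporary directory; the fallback values 6700 and 50 are the defaults given in the Git configuration documentation and apply only when nothing is configured.

```shell
# Scratch repository so that no real repository is touched (the path is arbitrary).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

# "count" is the number of loose objects, "packs" the number of pack files.
git count-objects -v

loose=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')

# Thresholds: fall back to the documented defaults when nothing is configured.
auto_limit=$(git config --get gc.auto || echo 6700)
pack_limit=$(git config --get gc.autoPackLimit || echo 50)

echo "loose objects: $loose (threshold $auto_limit)"
echo "pack files:    $packs (threshold $pack_limit)"

# With this few objects, the command exits without repacking anything.
git gc --auto
```

In a repository this small, both numbers sit far below their thresholds, which is why everyday commits rarely trigger a visible cleanup.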

What does repacking mean in Git?

Repacking, part of the Git garbage collection process, involves combining loose objects and existing pack files into fewer pack files (often a single one) and compressing the contents to save space and improve performance. A pack file in Git is a binary file containing compressed object data, accompanied by an index file that lets Git locate individual objects quickly. Packing reduces the filesystem footprint and speeds up operations like cloning, fetching, and pushing.

During the repacking process, Git performs the following steps:

Identifying loose objects: These are individual files in the .git/objects directory. Initially, each new object (like a commit, tree, or blob) is stored as a separate file, or a loose object.
Combining objects into packs: Git combines these loose objects into pack files. If there are existing pack files, Git can repack them into fewer, more compressed pack files.
Delta compression: During repacking, Git applies delta compression, where objects similar to each other are stored as deltas. This means that only the differences between the objects are stored, which significantly reduces the amount of space needed.
Removing redundant data: Once objects are stored in a pack, their loose copies and any superseded pack files are deleted. Objects that are no longer reachable from any branch, tag, or reflog entry are deleted once they are older than the grace period (two weeks by default).
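The steps above can be observed in a scratch repository. This is an illustrative sketch rather than a benchmark: the repository, file name, and commit messages are made up for the example, and the exact file names inside .git/objects/pack will differ on every run.

```shell
# Scratch repository (the path is arbitrary).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Five commits, each adding one line to the same file.
# Every commit writes a new blob, tree, and commit object as loose files.
for i in 1 2 3 4 5; do
  echo "line $i" >> notes.txt
  git add notes.txt
  git commit -q -m "add line $i"
done

loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')
echo "loose objects before gc: $loose_before"

git gc --quiet

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "loose objects after gc:  $loose_after"
echo "objects in $packs pack(s): $packed"

# The pack file and its index now hold what used to be separate files.
ls .git/objects/pack
```

The loose-object count drops to zero while the in-pack count rises by the same amount: nothing reachable is lost, it is only stored differently.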

Using `git gc` manually

Although Git runs garbage collection automatically, you can manually invoke the process using the git gc command. This can be particularly useful in a few scenarios:

Repository cleanup before backups: Before backing up a repository, running git gc ensures that the backup is compact and devoid of unnecessary data.
After a large number of refs are deleted: If you've recently deleted a large number of branches or tags, running git gc will help clean up the related objects and references.
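Recovery from corrupted repository data: Repacking rewrites the object store and can sometimes help in mild cases, but git gc is not a repair tool and typically fails if a required object is missing or damaged. Running git fsck first shows what is actually broken.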

Basic usage

Terminal

git gc

This command runs the default garbage collection function, cleaning up loose objects, compressing them into packs, and removing unnecessary files from .git/objects.

Advanced options

Aggressive cleanup:
Terminal
```
git gc --aggressive
```
This option is more thorough than the default run. It discards existing deltas and recomputes them with a larger search window, which can yield better compression at the cost of much more CPU time. It changes how objects are compressed, not which objects are removed.
Pruning old objects:
Terminal
```
git gc --prune=now
```
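This command prunes every unreachable loose object regardless of age, instead of only those older than the default two-week grace period. Avoid running it while other Git commands are writing to the repository, since it may delete objects they have just created.
Viewing the log: When automatic garbage collection runs in the background and hits an error, Git records the message in .git/gc.log and skips further automatic runs until that file is removed or becomes older than gc.logExpiry (one day by default). Git has no gc.verbose setting; to see what a run is doing, run git gc in the foreground, where it reports progress to the terminal unless --quiet is given.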
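The effect of --prune=now can be seen with a deliberately orphaned object. The sketch below assumes a scratch repository and uses git hash-object -w to write a blob that nothing references; the variable names are illustrative.

```shell
# Scratch repository (the path is arbitrary).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "tracked" > file.txt
git add file.txt
git commit -q -m "first commit"

# Write a blob that no commit, index entry, or reflog refers to.
orphan=$(echo "never committed" | git hash-object -w --stdin)

# A default run keeps it: the object is younger than the prune grace period.
git gc --quiet
if git cat-file -e "$orphan" 2>/dev/null; then after_default=present; else after_default=gone; fi
echo "after git gc:             $after_default"

# --prune=now drops the grace period, so the unreachable blob is deleted.
git gc --quiet --prune=now
if git cat-file -e "$orphan" 2>/dev/null; then after_prune_now=present; else after_prune_now=gone; fi
echo "after git gc --prune=now: $after_prune_now"
```

The grace period is what normally protects recently written objects, which is why --prune=now is best reserved for repositories where nothing else is running.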

git gc vs. git prune

git gc orchestrates a variety of housekeeping tasks, one of which is calling git prune. The git prune command itself does only one thing: it deletes loose objects that are unreachable from any ref. In other words, git prune is one step of what git gc does, focused solely on object deletion, whereas git gc also repacks objects, packs refs, and expires old reflog entries. In most cases git gc is the command to run, and git prune is rarely invoked directly.
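To see the narrower scope of git prune, its dry-run flag (-n) lists what would be deleted without touching anything. This is a small sketch in a scratch repository; the orphaned blob is created artificially for the example.

```shell
# Scratch repository (the path is arbitrary).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "tracked" > file.txt
git add file.txt
git commit -q -m "first commit"

# An unreachable blob, written directly to the object store.
orphan=$(echo "scratch data" | git hash-object -w --stdin)

# -n (dry run) lists what would be deleted without deleting anything.
would_prune=$(git prune -n --expire=now)
echo "$would_prune"

# The real run deletes the blob, but unlike git gc it creates no pack.
git prune --expire=now
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "packs after git prune: $packs"
```

The committed objects remain loose afterwards, because compressing them into a pack is part of git gc and not of git prune.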

Is `git gc` safe?

git gc is generally safe, as it is designed to remove only objects that are not reachable from any ref, reflog entry, or the staging area, and by default only once they are more than two weeks old. The risk comes from pruning aggressively: options such as --prune=now delete unreachable objects immediately, which removes the window in which accidentally discarded data (for example, file contents that were staged but never committed) could still be recovered by hash. The --aggressive flag by itself only affects compression; it does not change which objects are removed.
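One reason a routine git gc rarely loses work is that the reflog also counts as a reference. The following sketch, again in a scratch repository with made-up file contents, drops a commit from its branch and shows that it survives garbage collection.

```shell
# Scratch repository (the path is arbitrary).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "first commit"
echo "two" >> file.txt
git commit -q -am "second commit"

# Remember the commit, then drop it from the branch.
dropped=$(git rev-parse HEAD)
git reset -q --hard HEAD~1

# The reflog still references the dropped commit, so gc keeps it.
git gc --quiet
if git cat-file -e "$dropped" 2>/dev/null; then kept=yes; else kept=no; fi
echo "dropped commit still present after git gc: $kept"

# It can be restored by hash for as long as the reflog entry survives.
git reset -q --hard "$dropped"
git log --oneline
```

Reflog entries expire eventually, so this safety net is temporary; the point is that an ordinary git gc run shortly after a mistake does not remove the commit.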

For further reading on garbage collection in Git, refer to the official Git documentation.
