What is Git garbage collection?
Garbage collection in Git is the process by which Git cleans up unnecessary files and optimizes the local repository's data store. When you make changes to a repository—such as committing new changes, rebasing branches, or deleting references—Git creates and updates various objects (blobs, trees, commits, etc.). Over time, some of these objects become obsolete or redundant. Git GC helps in removing these objects to free up space and keep the repository efficient.
Automatic garbage collection
Git runs garbage collection automatically under certain conditions so that the repository does not consume excessive disk space or degrade in performance. Automatic GC is triggered by commands like git commit, git merge, and git rebase. These commands often produce loose objects that can be repacked or removed. They implicitly call git gc --auto, which decides whether a cleanup is necessary based on the number of loose objects and the number of pack files (the gc.auto and gc.autoPackLimit settings, which default to 6700 and 50).
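To see where a repository stands relative to those triggers, the counts reported by git count-objects can be compared with the configured thresholds. The following is a minimal sketch, assuming a throwaway repository created with mktemp; the fallback values 6700 and 50 are Git's documented defaults for gc.auto and gc.autoPackLimit, and the blob contents are made up for illustration.

```shell
#!/bin/sh
set -e

# Scratch repository (hypothetical location) so no real work is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Write three loose blob objects without committing anything.
for i in 1 2 3; do
    echo "content $i" | git hash-object -w --stdin >/dev/null
done

# Use the configured thresholds if set, otherwise the documented defaults.
loose_limit=$(git config --get gc.auto || echo 6700)
pack_limit=$(git config --get gc.autoPackLimit || echo 50)

# `git count-objects -v` reports "count:" (loose objects) and "packs:".
loose=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')

echo "loose objects: $loose (gc.auto = $loose_limit)"
echo "pack files:    $packs (gc.autoPackLimit = $pack_limit)"
```

In a real repository, only the last two sections are needed; run them from inside the working tree.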
What does repacking mean in Git?
Repacking, part of the Git garbage collection process, involves combining multiple pack files into a single pack file and compressing the contents to save space and enhance performance. A pack file in Git is a binary file containing compressed object data, allowing Git to reduce the filesystem footprint and improve the speed of operations like cloning, fetching, and pushing.
During the repacking process, Git performs the following steps:
Identifying loose objects: These are individual files in the .git/objects directory. Initially, each new object (like a commit, tree, or blob) is stored as a separate file, or a loose object.
Combining objects into packs: Git combines these loose objects into pack files. If there are existing pack files, Git can repack them into fewer, more compressed pack files.
Delta compression: During repacking, Git applies delta compression, where objects similar to each other are stored as deltas. This means that only the differences between the objects are stored, which significantly reduces the amount of space needed.
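Removing redundant data: As part of repacking, Git also removes data that is no longer necessary, such as loose objects that are now stored in a pack. During a full git gc run, objects that are no longer reachable from any branch, tag, staging-area entry, or reflog entry are also pruned once they are older than a grace period (two weeks by default).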
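The steps above can be observed in a scratch repository. This sketch assumes a temporary directory from mktemp, a placeholder identity ("Example"), and a made-up file name (notes.txt). It calls git repack directly, which is the command git gc relies on for this part of its work.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Three commits that each rewrite the same file. Every commit adds a
# loose blob, a loose tree, and a loose commit object.
for i in 1 2 3; do
    echo "revision $i" > notes.txt
    git add notes.txt
    git -c user.name="Example" -c user.email="example@example.com" \
        commit -q -m "revision $i"
done

before=$(git count-objects -v | awk '/^count:/ {print $2}')

# -a: put all reachable objects into a single pack
# -d: delete loose objects and old packs made redundant by the new pack
git repack -a -d -q

after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')

echo "loose objects before: $before, after: $after, pack files: $packs"
```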
Using git gc manually
Although Git runs garbage collection automatically, you can manually invoke the process using the git gc command. This can be particularly useful in a few scenarios:
- Repository cleanup before backups: Before backing up a repository, running git gc ensures that the backup is compact and devoid of unnecessary data.
- After a large number of refs are deleted: If you've recently deleted a large number of branches or tags, running git gc will help clean up the related objects and references.
- Recovery from corrupted repository data: Sometimes, corruption in repository data can be mitigated by repacking and cleaning up the objects.
Basic Usage
git gc
This command runs the default garbage collection function, cleaning up loose objects, compressing them into packs, and removing unnecessary files from .git/objects.
Advanced options
Aggressive cleanup:
git gc --aggressive
This option is more thorough than the default run. It discards existing deltas and recomputes them for all objects, which can lead to better compression at the cost of considerably more CPU time.
Pruning old objects:
git gc --prune=now
By default, git gc only prunes unreachable objects that are older than two weeks (the gc.pruneExpire setting). This command prunes all unreachable objects regardless of their age.
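Viewing the log: When an automatic run (git gc --auto) works in the background and reports errors or warnings, Git writes them to .git/gc.log. While that file exists and is less than a day old (the gc.logExpiry setting), later automatic runs print its contents and exit instead of collecting again, so inspect the file and remove it once the underlying problem is fixed.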
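The difference between a default run and --prune=now can be checked in a scratch repository. This is a sketch under stated assumptions: the repository lives in a temporary directory, the identity and file names are placeholders, and the "orphan" blob is created with git hash-object purely so that an unreachable object exists.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

echo "tracked" > file.txt
git add file.txt
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "initial commit"

# Write a blob that no commit, staging-area entry, or reflog refers to.
orphan=$(echo "never committed" | git hash-object -w --stdin)

# A default run keeps the object because it is younger than the grace period.
git gc -q
if git cat-file -e "$orphan" 2>/dev/null; then
    kept_by_default=yes
else
    kept_by_default=no
fi

# --prune=now drops the grace period, so the object is deleted.
git gc -q --prune=now
if git cat-file -e "$orphan" 2>/dev/null; then
    kept_after_prune=yes
else
    kept_after_prune=no
fi

echo "after default gc: $kept_by_default, after --prune=now: $kept_after_prune"
```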
git gc vs. git prune
While git gc orchestrates a variety of housekeeping tasks, including calling git prune, the git prune command deals specifically with removing unreachable loose objects. In other words, git prune is one part of what git gc does and focuses solely on object deletion, whereas git gc covers a broader range of repository optimization actions.
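To preview what git prune would delete without removing anything, it can be run with -n (--dry-run). The sketch below assumes a temporary scratch repository with placeholder identity and file names. Note that git prune, when invoked directly without --expire, applies no grace period.

```shell
#!/bin/sh
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .

echo "tracked" > file.txt
git add file.txt
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "initial commit"

# Create one unreachable loose object.
orphan=$(echo "dangling data" | git hash-object -w --stdin)

# -n lists the objects that would be removed, one "<id> <type>" per line.
would_prune=$(git prune -n)
echo "would prune: $would_prune"

# The real run deletes the object but performs no repacking, reflog
# expiry, or other housekeeping.
git prune
if git cat-file -e "$orphan" 2>/dev/null; then
    still_present=yes
else
    still_present=no
fi
echo "orphan still present: $still_present"
```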
Is git gc safe?
git gc is mostly safe: it only removes objects that are not reachable from any ref, the staging area, or the reflog, and by default it keeps unreachable objects that are less than two weeks old. The --aggressive option changes only how thoroughly objects are recompressed, not which objects are deleted. Pruning aggressively is different: git gc --prune=now removes the grace period, so data that was never linked to a branch or tag, or that has already expired from the reflog, is deleted immediately and becomes much harder to recover.
For further reading on garbage collection in Git, refer to the official Git documentation.