
How to delete sensitive data from Git

Greg Foster
Graphite software engineer


Note

This guide explains this concept in vanilla Git. For Graphite documentation, see our CLI docs.


If you accidentally commit and push a credential or any other piece of sensitive data to a Git repo (private or otherwise), consider the credential compromised. The first step before deleting the commit containing the sensitive data should be to immediately rotate it out of use.

As per the GitHub documentation:

Once the credential has been rotated out of use, follow these steps to cleanse the credential from your Git repository.

First, decide whether to use git filter-repo or BFG Repo-Cleaner. Both tools rewrite your repository's history, which changes the SHA hashes for altered commits and any dependent commits. This could affect open pull requests, so it's wise to merge or close these before proceeding.

BFG Repo-Cleaner is an open-source, community-maintained tool written in Java. It offers a simpler, more user-friendly way to rewrite your repository's history and clean out credentials or other sensitive data that may have been committed.

The git filter-repo command is more flexible and offers finer-grained control. Use it if you are a more advanced user and need a more precise technique.
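Whichever tool you choose, it can help to first confirm which commits actually touch the leaked value. The following is a minimal sketch, not part of either tool: find_secret_commits is a hypothetical helper name, hunter2 and config.env are made-up placeholders, and the throwaway repository exists only to make the example runnable. It relies on Git's -S ("pickaxe") search, which lists commits that change the number of occurrences of a string.

```shell
#!/bin/sh
# Hypothetical helper: list commits on any ref whose changes add or remove
# an occurrence of the given string.
find_secret_commits() {
    git log --all --format='%h %s' -S"$1"
}

# Demo in a throwaway repository (all names below are placeholders).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "API_KEY=hunter2" > config.env
git add config.env && git commit -q -m "Add config"

echo "API_KEY=" > config.env
git commit -q -am "Remove key"

# Both commits are reported: one introduces the string, one removes it.
found=$(find_secret_commits "hunter2")
echo "$found"
```

Against a real repository you would call only the function, for example find_secret_commits "YOUR-SECRET". Note that deleting the value in a later commit, as the demo does, leaves it in history, which is why a rewrite is needed.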

  1. Download and install BFG Repo-Cleaner: Follow the instructions on the official BFG Repo-Cleaner website to download and install the tool. Note that it requires Java to be installed on your machine.

  2. To remove a specific file containing sensitive data without affecting your latest commit, execute:

    bfg --delete-files YOUR-FILE-WITH-SENSITIVE-DATA

  3. To instead replace sensitive text across your repository's history, use:

    bfg --replace-text passwords.txt

    This command replaces every string listed in passwords.txt, wherever it appears in your repository's history, with ***REMOVED***.

  4. After removal, force push your changes to GitHub with git push --force.
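The passwords.txt file used in step 3 holds one expression per line. Below is a minimal sketch of how such a file might be built, based on the format described in BFG's own documentation (a bare literal, a literal==>replacement pair, or a regex: entry); double-check it against the version you install. The secrets, the REDACTED_KEY replacement, and the workdir variable are all made-up placeholders.

```shell
#!/bin/sh
# Write the replacement list to a scratch directory, outside the repository,
# so the file holding the secrets is never committed by accident.
workdir=$(mktemp -d)

# Quoted heredoc delimiter: keeps \w and other characters literal.
cat > "$workdir/passwords.txt" <<'EOF'
hunter2
EXAMPLE_API_KEY_123==>REDACTED_KEY
regex:password=\w+==>password=
EOF

# Line 1: literal string, replaced with the default ***REMOVED***
# Line 2: literal string, replaced with the text after ==>
# Line 3: regular expression, replaced with the text after ==>
cat "$workdir/passwords.txt"

# Then run BFG as in step 3, pointing it at this file:
#   bfg --replace-text "$workdir/passwords.txt"
```

Delete the file once the rewrite is done, since it contains the very strings you are trying to remove.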

  1. Install the git filter-repo tool: The filter-repo tool is not included in Git by default and must be installed before use.

If using Homebrew, the command is brew install git-filter-repo.

You can also install the command manually from the official git filter-repo repository.

  1. Navigate to your repository directory:

    cd YOUR-REPOSITORY

  2. Execute the following command, replacing PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA with the entire path to the file that you want to delete:

    git filter-repo --invert-paths --path PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA

  3. Add the file to .gitignore to prevent future commits:

    echo "YOUR-FILE-WITH-SENSITIVE-DATA" >> .gitignore
    git add .gitignore
    git commit -m "Add YOUR-FILE-WITH-SENSITIVE-DATA to .gitignore"

    This will configure Git to automatically ignore this file in any future commits.

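  4. Force-push the changes to GitHub to overwrite the history. Note that git filter-repo removes the origin remote by default when it rewrites a clone, so you may first need to re-add it with git remote add origin followed by your repository's URL: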

    git push origin --force --all

    To remove the sensitive file from all your tagged releases you also need to run:

    git push origin --force --tags
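Before or after force-pushing, it is worth confirming that the file is really gone from every branch and tag. The sketch below uses only plain Git; path_in_history is a hypothetical helper name, and the throwaway repository and file names exist only to make the example runnable.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any commit reachable from any ref
# touches the given path.
path_in_history() {
    [ -n "$(git log --all --format=%H -- "$1")" ]
}

# Demo in a throwaway repository (all names below are placeholders).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "DB_PASSWORD=example" > secrets.env
git add secrets.env && git commit -q -m "Add secrets file"

if path_in_history "secrets.env"; then leaked=yes; else leaked=no; fi
if path_in_history "never-committed.txt"; then other=yes; else other=no; fi
echo "secrets.env in history: $leaked; never-committed.txt in history: $other"
```

After a successful rewrite, path_in_history PATH-TO-YOUR-FILE-WITH-SENSITIVE-DATA should fail in the cleaned repository. This checks reachable history only; it says nothing about copies cached on GitHub or in other clones.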

Even after these steps, some data might remain cached or referenced in pull requests. If the leaked data was not a credential that could be rotated, such as personal user data, you may need to take additional steps to ensure the data has been removed properly:

  1. Contact GitHub support to request the removal of cached views and references to the sensitive data in pull requests.

  2. Tell all collaborators to rebase, not merge, any branches they created from the old repository history. A single merge commit could reintroduce the removed data.

After ensuring that the sensitive data is completely removed, clean your local repository and take preventive measures:

After a substantial history rewrite, such as removing sensitive data, you can force your local Git repository to drop its references to the old objects and garbage collect them, which also shrinks the repository. Follow these steps:

  1. Dereference original references: First, remove any backup references to the original branches and tags left behind by the rewrite. Older tools such as git filter-branch store these under refs/original/; git filter-repo does not create them, so in that case the command below simply has nothing to delete. This step ensures that the rewritten history is the only one recognized by Git, which lets garbage collection do its job. Execute the following command:

    git for-each-ref --format="delete %(refname)" refs/original/ | git update-ref --stdin

    This command lists all references under refs/original/ and deletes them by feeding the list to git update-ref --stdin, which processes these deletion commands from standard input.

  2. Expire reflog entries: Next, expire all entries in the reflog. The reflog records the history of the tips of branches and other references within the local repository, and expiring these entries helps in removing any pointers to the old (now unwanted) objects. Run:

    git reflog expire --expire=now --all

    This command tells Git to immediately expire all reflog entries, effectively removing any references to objects that are no longer in the current history.

  3. Garbage collect: Finally, perform a manual garbage collection to clean up and optimize the repository. This step removes objects that are no longer reachable from any references, compacts the repository, and optimizes its performance. Use the following command:

    git gc --prune=now

    Here, --prune=now forces Git to immediately prune (delete) objects that are no longer needed, instead of waiting for the default period (typically two weeks).

These steps clean up your repository by dereferencing the original objects from before the rewrite and performing a thorough garbage collection. This matters after using tools like git filter-repo or BFG Repo-Cleaner, because otherwise your repository may retain objects from the old history, potentially including the sensitive data you sought to remove. The cleanup also reduces the repository's size and can improve its performance.
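The three cleanup commands above can be run back to back. The sketch below does so in a throwaway repository and checks that an orphaned object really disappears. The rewrite is only simulated with git commit --amend, which stands in for git filter-repo or BFG, and every file name and value is a placeholder.

```shell
#!/bin/sh
# Throwaway repository with a committed "secret" (placeholder values).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "API_KEY=hunter2" > secret.env
git add secret.env && git commit -q -m "Add secret"
old_blob=$(git rev-parse HEAD:secret.env)   # object ID of the secret's contents

# Simulate a history rewrite: replace the commit with one lacking the file.
git rm -q --cached secret.env
echo "placeholder" > README
git add README
git commit -q --amend -m "Initial commit"

# The old blob is unreferenced by any branch but still stored (reflog keeps it).
if git cat-file -e "$old_blob" 2>/dev/null; then before=present; else before=gone; fi

# The three cleanup steps from this guide.
git for-each-ref --format="delete %(refname)" refs/original/ | git update-ref --stdin
git reflog expire --expire=now --all
git gc --quiet --prune=now

if git cat-file -e "$old_blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before; after cleanup: $after"
```

The same check (git cat-file -e on a known object ID from the old history) can be used in a real repository to confirm that the cleanup took effect locally.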

In the future it’s important to stop these leaks from happening in the first place.

Employ best practices to avoid accidental commits of sensitive data, such as using visual tools for staging changes, avoiding catch-all git add commands, and enabling push protection in your repository settings.
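As one illustration of a preventive measure, a local pre-commit hook can scan staged changes for strings that look like credentials. This is a rough sketch only: scan_for_secrets is a hypothetical helper, and the pattern is a deliberately simple assumption that will miss many secrets and flag some harmless lines, so it complements push protection rather than replacing it.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin and succeeds (exit 0) if any line
# looks like a hard-coded credential, e.g. API_KEY=..., password: ...
scan_for_secrets() {
    grep -Ei '(api[_-]?key|secret|password|token)[[:space:]]*[:=][[:space:]]*[^[:space:]]+' >/dev/null
}

# Two sample "added lines" as they would appear in a diff.
if printf '%s\n' '+API_KEY=hunter2' | scan_for_secrets; then risky=flagged; else risky=clean; fi
if printf '%s\n' '+echo "hello world"' | scan_for_secrets; then safe=flagged; else safe=clean; fi
echo "risky line: $risky; harmless line: $safe"

# In .git/hooks/pre-commit (sketch), abort when an added line is flagged:
#   if git diff --cached -U0 | grep '^+' | scan_for_secrets; then
#       echo "Possible secret in staged changes; commit aborted." >&2
#       exit 1
#   fi
```

Because hooks live in each clone's .git directory and are not shared by pushing, every collaborator would need to install such a hook themselves.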

By carefully following these steps, you can effectively remove sensitive data from your Git repository and take measures to prevent similar incidents in the future.

For more information on removing sensitive data from your Git repository, see the official GitHub documentation.
