Some people don’t seem to understand how git gc fits in with the git object database or even how the git object database works. Here’s a description of the git object database and why running git gc helps performance.

The object database is where much of git’s power comes from. Because every object is named by a hash of its contents, features like cheap branching and fast diffs fall out of the layout almost for free.

There are four types of objects in the database: blobs, trees, commits, and annotated tags. This post covers the first three. Each object is identified by a hash derived from its contents, so identical contents always produce exactly the same hash.
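To make “derived from its contents” concrete, here is a minimal sketch of how a blob’s name can be computed outside of git. It assumes a repository using SHA-1 (the long-standing default); the function name git_object_id is made up for this post.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way a SHA-1 git repository names it."""
    # git hashes a small header, "<type> <size in bytes>\0", followed by the body
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

first = git_object_id("blob", b"hello world\n")
second = git_object_id("blob", b"hello world\n")
changed = git_object_id("blob", b"hello world!\n")

print(first)             # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(first == second)   # True: same bytes, same name
print(first == changed)  # False: one extra character, a different name
```

The first value should match what git hash-object prints for a file containing “hello world” followed by a newline.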

Blobs are file contents. When you hear people say, “git only keeps track of file contents”, this is what they mean: a blob holds a file’s bytes and nothing else, not even its name. The smallest unit git stores is a whole file, so a single-line change to a file creates a new blob holding the full new contents. This is what people mean when they say, “git stores the entire contents of the repo every commit”. Now you ask, if blobs store the contents of files and blob names are hashes, how does git know the filename of a blob?

Trees relate blobs in a file hierarchy. A tree is a list of filenames, each paired with a file mode and the hash of a blob (or the hash of another tree, for a subdirectory). If a file doesn’t change between commits, no new blob is created: the exact same blob is referenced by both tree objects. Diffing trees is really efficient because the hashes of the file contents are already computed; it comes down to comparing hash strings, and an entire subdirectory can be skipped when its tree hash is unchanged.
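Here is a rough sketch of that idea, assuming a SHA-1 repository and a single flat directory of regular files (no subdirectories, no executables or symlinks). The helper names git_object_id and tree_id are made up for this post.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    # header is "<type> <size in bytes>\0", followed by the body
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_id(entries: dict[str, str]) -> str:
    """entries maps a filename to the hex id of its blob (regular files only)."""
    body = b""
    for name in sorted(entries):
        # each entry is "<mode> <name>\0" followed by the 20 raw bytes of the id
        body += b"100644 " + name.encode() + b"\0" + bytes.fromhex(entries[name])
    return git_object_id("tree", body)

readme = git_object_id("blob", b"hello\n")
old_main = git_object_id("blob", b"print('v1')\n")
new_main = git_object_id("blob", b"print('v2')\n")

before = {"README": readme, "main.py": old_main}
after = {"README": readme, "main.py": new_main}  # the README blob is reused as-is

# "Diffing" two trees is just comparing the hashes stored under each name.
names = sorted(before.keys() | after.keys())
changed = [name for name in names if before.get(name) != after.get(name)]

print(changed)                            # ['main.py']
print(tree_id(before) == tree_id(after))  # False
```

No file contents are read during the comparison; the hashes already say which entries differ.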

Commit objects relate tree objects with other commit objects. Without commit objects, git would just be a way to keep track of directory snapshots. Commits let you know who did something, when they did it, and what things looked like before and after. A commit contains the hashes of its parent commits and of its tree, and trees contain blob hashes, so any change to a blob, tree, or parent commit changes the commit hash. This means if your branch and my branch have the same commit hash, we can trust that they share the same history*.
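A small sketch shows how the parent hash feeds into the commit hash. It assumes a SHA-1 repository and unsigned commits with no extra headers; commit_id, the identity string, and the timestamp are all invented for illustration.

```python
import hashlib

def git_object_id(obj_type: str, body: bytes) -> str:
    # header is "<type> <size in bytes>\0", followed by the body
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def commit_id(tree: str, parents: list[str], who: str, message: str) -> str:
    """Build the text of a simple commit object and hash it."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries
who = "A U Thor <author@example.com> 1700000000 +0000"   # made-up identity and time

root = commit_id(EMPTY_TREE, [], who, "first")
child = commit_id(EMPTY_TREE, [root], who, "second")

# Same tree, author, and message, but sitting on a different parent.
moved = commit_id(EMPTY_TREE, ["0" * 40], who, "second")

print(child == moved)  # False: a different parent means a different commit
```

Because each commit hash covers its parents’ hashes, matching tips imply matching ancestors all the way down.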

To understand git gc, we need to understand one more piece, which is how branches work. A branch is just a pointer to a commit; when you create a new commit on a branch, the pointer moves to that commit.
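As a toy model (not how git is implemented, and with made-up commit ids), a branch can be pictured as a dictionary entry. In a real repository the pointer is typically a small file under .git/refs/heads holding a commit hash.

```python
# Toy model: a branch name maps to the id of the commit it points at.
branches = {"main": "c1"}   # commit ids here are made-up placeholders
parent_of = {"c1": None}

def commit(branch: str, new_id: str) -> None:
    """Record a new commit on top of the branch tip, then move the branch."""
    parent_of[new_id] = branches[branch]
    branches[branch] = new_id

branches["feature"] = branches["main"]  # creating a branch copies a pointer
commit("feature", "c2")                 # committing moves only that pointer

print(branches)  # {'main': 'c1', 'feature': 'c2'}
```

Creating the branch copied nothing but a name and an id, which is why branching in git is so cheap.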

Because commits are named based on their contents, changing the contents of a commit changes its name. Any time you run git commit --amend or git rebase, one or more new commit objects are created. “But what happens to the old ones?”, you ask.

Old commit objects (all old objects for that matter) stick around in the database. You don’t see them because no pointers (branches or tags) lead to them anymore. When I talk about git reflog, we’ll find out how to get them back.
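To make “no pointers to them” concrete, here is a toy sketch of the reachability walk. The history and ids are invented, and real git starts from more places than branches and tags (the reflog, for one), so treat this as the idea only.

```python
# Toy history: commit id -> list of parent ids (all ids are made up).
parents = {
    "a": [],
    "b": ["a"],
    "c": ["b"],    # the original commit
    "c2": ["b"],   # its replacement after an amend
}
refs = {"main": "c2"}  # the branch now points at the amended commit

def reachable(parents: dict[str, list[str]], refs: dict[str, str]) -> set[str]:
    """Collect every commit you can get to by following parents from a ref."""
    seen = set()
    stack = list(refs.values())
    while stack:
        cid = stack.pop()
        if cid not in seen:
            seen.add(cid)
            stack.extend(parents[cid])
    return seen

unreachable = set(parents) - reachable(parents, refs)
print(unreachable)  # {'c'}
```

The old commit c is still in the database; nothing leads to it, so nothing shows it.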

git gc’s purpose is twofold: deleting unreachable objects and packing the remaining objects to use disk space more efficiently. A git pack is just a file of compressed, combined objects (similar objects are often stored as deltas against each other) plus an index into that file.
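A crude way to see why combining objects saves space: compress two nearly identical files separately, then together. This is only an analogy using plain zlib, not git’s pack format or its delta compression, and the file contents are made up.

```python
import zlib

# Two versions of a made-up file that differ by a single line.
v1 = "".join(f"line {i}\n" for i in range(200)).encode()
v2 = v1.replace(b"line 100\n", b"line one hundred\n")

# Stored separately, each version is compressed on its own.
separate = len(zlib.compress(v1)) + len(zlib.compress(v2))

# Stored together, the second version can lean on the first.
combined = len(zlib.compress(v1 + v2))

print(separate, combined)
print(combined < separate)  # True
```

The more near-duplicate versions of a file a repository holds, the more there is to gain from storing them together.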

In current versions of git, running git gc by itself packs the object database and also deletes unreachable loose objects, but only those older than a grace period (two weeks by default, configurable with gc.pruneExpire). git gc --prune=now skips the grace period and deletes unreachable objects immediately. Be aware that once objects are pruned, they’re totally gone. If you’ve just done something complicated that you might want to roll back, you might want to hold off on running git gc --prune=now until you’re sure you won’t want to go back to what you had.

*: This is why git needs so little machinery for signing: a signature over one hash vouches for everything that hash points to (a signed tag, git tag -s, works this way). All I need is a trusted way to tell you my git commit hash. A signed email that says, “the HEAD of my repo is 82386bd2bd3baca6d424dcdfc5cf46ce14e3a644”, is enough to trust a git clone from someone.
