What are Git Snapshots
The more I use Git, the more I like it, but my understanding of some of its internals isn’t systematic or complete, for example what a snapshot is and how Git stores snapshots. I’m documenting those mechanisms here.
Snapshot Recording
Each commit records a snapshot of every tracked file. When a file’s content has changed, Git stores a new blob object holding the complete content of that file (the blob is written when the file is staged with git add). When a file hasn’t changed, the new snapshot simply points to the blob that is already stored.
Each commit points to a tree object that lists file names together with the IDs of their blobs, and Git uses this listing to locate both changed and unchanged files.
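The recording strategy above can be sketched in a few lines of Python. This is only an illustration: TinyStore and its methods are hypothetical names, not part of Git, and the ID is simplified to a plain SHA-1 of the content (real Git also hashes a short header).

```python
import hashlib


class TinyStore:
    """Hypothetical in-memory stand-in for .git/objects (not Git's real API)."""

    def __init__(self):
        self.blobs = {}      # object ID -> full file content
        self.snapshots = []  # one {path: object ID} mapping per commit

    def commit(self, files: dict) -> dict:
        snapshot = {}
        for path, content in files.items():
            # Simplification: real Git hashes a short header plus the content.
            oid = hashlib.sha1(content).hexdigest()
            # Identical content hashes to the same ID, so an unchanged file
            # adds no new blob; the snapshot just points at the existing one.
            self.blobs.setdefault(oid, content)
            snapshot[path] = oid
        self.snapshots.append(snapshot)
        return snapshot


store = TinyStore()
v1 = store.commit({"README.md": b"just a demo\n", "a.txt": b"one\n"})
v2 = store.commit({"README.md": b"just a demo\n", "a.txt": b"two\n"})

print(v1["README.md"] == v2["README.md"])  # True: unchanged file, same blob
print(v1["a.txt"] == v2["a.txt"])          # False: changed file, new blob
print(len(store.blobs))                    # 3 distinct blobs for 2 snapshots
```

Two snapshots of two files end up with three stored blobs rather than four, because the unchanged README.md is shared.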
The diagram below shows the repository state at each snapshot (version).
Snapshot Storage
Knowing the snapshot recording strategy, where are the snapshots stored? In the hidden .git folder.
This folder contains many items. Here we only care about the locations that store historical snapshots, namely index and objects. For the other parts, Pro Git is the recommended reference.
To understand how Git specifically stores data, let’s initialize an empty project
mkdir git-demo && cd git-demo && git init && echo "just a demo" > README.md
Start executing git operations
git add
Running git add . records the files to be committed in the index file. To view the index, use the low-level command git ls-files -s:
$ git ls-files -s
100644 a730a28e53d8defdda8fe953829afdfc906e463a 0 README.md
Note: the index is a binary file, so you can’t read it directly as text; git ls-files -s prints its entries as shown above. Each entry records the file mode 100644, the ID of the blob object a730a28e53d8defdda8fe953829afdfc906e463a (a 40-character hexadecimal SHA-1 hash that identifies the blob in Git’s object database), the stage number 0, and the file name README.md.
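As a small illustration of what one line of git ls-files -s output contains, the sketch below splits it into named fields. IndexEntry and parse_ls_files_line are hypothetical helper names introduced here, not Git commands or library functions.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str   # file mode, e.g. 100644 for a regular non-executable file
    oid: str    # 40-character hex ID of the blob
    stage: int  # 0 for a normal entry; 1-3 appear during merge conflicts
    path: str


def parse_ls_files_line(line: str) -> IndexEntry:
    # Split on the first three runs of whitespace only, so a path that
    # contains spaces stays in one piece.
    mode, oid, stage, path = line.rstrip("\n").split(None, 3)
    return IndexEntry(mode, oid, int(stage), path)


entry = parse_ls_files_line(
    "100644 a730a28e53d8defdda8fe953829afdfc906e463a 0 README.md"
)
print(entry.path, len(entry.oid), entry.stage)  # README.md 40 0
```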
The blob itself is stored under .git/objects. The first two characters of the ID, a7, are the folder name, and the remaining 38 characters are the file name.
At this point, you can use git cat-file -p a730a2 to view the complete content of the staged file.
$ git cat-file -p a730a2
just a demo
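What git cat-file does for a loose object can be approximated as follows: the file on disk is the zlib-compressed header plus content. The sketch writes an object into a throwaway directory standing in for .git/objects and reads it back. write_loose_object and read_loose_object are hypothetical helpers, and packed objects are not handled.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, content: bytes) -> str:
    """Store content the way Git stores a loose object; return its ID."""
    store = f"{obj_type} {len(content)}".encode() + b"\x00" + content
    oid = hashlib.sha1(store).hexdigest()
    folder = os.path.join(objects_dir, oid[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, oid[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return oid


def read_loose_object(objects_dir: str, oid: str):
    """Return (type, content), roughly what cat-file -t and -p report."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        raw = zlib.decompress(f.read())
    # The header ends at the first NUL byte; everything after is content.
    header, _, content = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(content)
    return obj_type, content


# A throwaway directory stands in for .git/objects.
objects_dir = tempfile.mkdtemp()
oid = write_loose_object(objects_dir, "blob", b"just a demo\n")
obj_type, content = read_loose_object(objects_dir, oid)
print(obj_type, content)  # blob b'just a demo\n'
```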
git commit
Running git commit -m 'init readme' commits the staged snapshot to the local repository:
$ git commit -m 'init readme'
[master (root-commit) 79821c6] init readme
1 file changed, 1 insertion(+)
create mode 100644 README.md
Looking again at the .git/objects directory, you’ll find two additional folders:
$ ll .git/objects/
total 0
drwxr-xr-x 3 qhe staff 96B Dec 20 22:22 4e
drwxr-xr-x 3 qhe staff 96B Dec 20 22:22 79
drwxr-xr-x 3 qhe staff 96B Dec 20 16:24 a7
drwxr-xr-x 2 qhe staff 64B Dec 20 16:24 info
drwxr-xr-x 2 qhe staff 64B Dec 20 16:24 pack
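Among these, 79 holds the commit object (79821c6 in the output above), which records the author, the commit message, and a pointer to the tree. 4e holds the tree object, which records the file names, file modes, and blob IDs that make up the snapshot.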
$ git cat-file -t 4edb6d
tree
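The relationship between the three object types can be sketched by building their contents by hand. This is a simplified illustration assuming SHA-1 and a tree that contains only plain files. The helper names and the author line (Demo User, the e-mail address, and the timestamp) are made up for the example.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """entries: list of (mode, name, blob_hex_id) for plain files."""
    body = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>\0" followed by the 20 raw ID bytes.
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(oid)
    return body


def commit_body(tree_id, author, message, parents=()) -> bytes:
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has none
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


blob = object_id("blob", b"just a demo\n")
tree = object_id("tree", tree_body([("100644", "README.md", blob)]))
commit = object_id(
    "commit",
    # Hypothetical identity and timestamp, for illustration only.
    commit_body(tree, "Demo User <demo@example.com> 1700000000 +0000", "init readme"),
)

print(object_id("tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (empty tree)
print(tree)
print(commit)
```

The commit ID printed here will not match 79821c6, because the author and timestamp are invented. The tree ID, by contrast, depends only on the file names, modes, and blob IDs.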
At this point, we roughly understand how daily Git operations record and store these individual node snapshots.
Storage Usage
As mentioned above, every changed file is stored in its entirety as a new blob. In the long run, this could consume significant disk space, so Git has optimizations for this.
Each loose object is already zlib-compressed. In addition, Git periodically packs objects into packfiles (for example when git gc runs), where similar objects are stored as deltas against one another. Typically the most recent version of a file is kept whole and older versions are stored as deltas, which balances storage space against the speed of reading the versions you are most likely to need.
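The space saving can be illustrated with a rough sketch. It uses a unified diff from Python's difflib as a stand-in for a delta; Git's real storage uses its own binary delta format, so the numbers only show the general effect.

```python
import difflib
import hashlib
import zlib

# Build an older version with 200 hard-to-compress lines, then edit one
# line to get the latest version.
older_lines = [hashlib.sha1(str(i).encode()).hexdigest() + "\n" for i in range(200)]
latest_lines = list(older_lines)
latest_lines[100] = "this line was edited\n"

# Option 1: store the older version in full (compressed).
older_full = "".join(older_lines).encode()
full_size = len(zlib.compress(older_full))

# Option 2: keep the latest version whole and store the older one as a
# delta against it. A unified diff is only a stand-in for Git's delta.
delta = "".join(difflib.unified_diff(latest_lines, older_lines)).encode()
delta_size = len(zlib.compress(delta))

print(full_size, delta_size)
```

The delta is far smaller than the full copy, at the cost of having to apply it to the latest version whenever the older one is needed.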
Summary
Summarizing our basic daily Git operations:
- During git add, the file content is written as a blob to .git/objects and the entry is recorded in the staging area, i.e., the .git/index file
- During git commit, the tree and commit objects are written to the local repository, i.e., .git/objects
Of course, during git push, these objects are sent to the upstream server. Remember that Git is distributed, so after a push the upstream repository holds the same objects for the pushed branches as our local one.
Final Thoughts
- Git feels simple yet is very powerful, which is a hallmark of excellent software design.
- Understanding these underlying principles helps you use Git more efficiently, and ideas such as the storage strategy described above can also serve as a reference for problems encountered in daily development.