What is Git made of?
Git can be confusing. Git can be scary. Git CLI may be the least intuitive tool you have to use on a daily basis.
But also Git is a wonderfully simple and cleverly designed version control system that definitely deserves its popularity.
To prove this point I invite you to implement your own tiny Git that would be able to create a local repository, commit a single file to it, view commit logs and checkout a certain revision of that file.
It won’t be more than a couple hundred lines of code, we’ll try to keep things as simple as we can. Code examples would be in Go, but any other language is suitable for this tutorial, too.
git init
What turns an empty directory into an empty Git repository? You probably have noticed that Git stores all its internal data in a hidden directory .git
. In fact, there are only a few special files/folder there that have to be created to let Git CLI treat it as a perfectly valid, empty repository:
$ mkdir -p .git/objects/info .git/objects/pack .git/refs/heads .git/refs/tags
$ echo "ref: refs/heads/main" > .git/HEAD
$ tree .git
.git
├── HEAD
├── objects
│ ├── info
│ └── pack
└── refs
├── heads
└── tags
$ git symbolic-ref --short HEAD
main
$ git log
fatal: your current branch 'main' does not have any commits yet
By using a couple of shell commands we’ve tricked Git into recognising our empty repository with a single main
branch and no commits. But what is stored in these directories we’ve created?
objects
Almost everything in Git is stored as an object: every source file that you commit becomes a blob object, every commit itself is an object, tags are objects, too.
For example, we have committed a file.txt
with the contents hello\n
(6 bytes). This would create 3 objects: a blob (actual file contents), a tree (a list of file names and permissions), and a commit (a reference to the committed tree with some information about the committer, timestamp etc).
For every object Git stores its object type (“blob”, “tree” or “commit”) and a length in bytes. So our hello\n
content would actually become blob 6\0hello\n
object data. Additionally, Git uses compression to save disk space, so our object data will be compressed using zlib algorithm before being written to disk as a special file inside .git/objects
.
hashes
Before we dive into the details of writing objects let’s talk about Git hashes.
Every object is uniquely identified inside a Git repo by the SHA hash of its contents. Originally Git was using SHA-1 hashing algorithm, but recent versions of switched to SHA-256 to reduce hash collisions. However, SHA-1 is still widely used in many modern Git setups, and we’ll be using it here as well.
Let’s get back to our file.txt
with hello\n
content. After being compressed the contents of that blob object could look like this (using a simple python one-liner for zlib compression):
$ python3 -c 'import sys,zlib; sys.stdout.buffer.write(zlib.compress(b"blob 6\0hello\n",6))' | hexdump -C
00000000 78 9c 4b ca c9 4f 52 30 63 c8 48 cd c9 c9 e7 02 |x.K..OR0c.H.....|
00000010 00 1d c5 04 14 |.....|
00000015
In practice various zlib implementations may use different compression levels and settings, so resulting encoded content may look different. However the SHA-1 hash is calculated from the uncompressed raw data of an object and would always be the same:
$ printf "blob 6\0hello\n" | sha1sum
ce013625030ba8dba906f756967f9e9ca394464a -
Now let’s compare that with Git CLI results in some dummy repo:
$ mkdir hello
$ cd hello
$ git init
$ echo "hello" > file.txt
$ git ci -m 'initial commit' file.txt
$ git cat-file blob ce013625
hello
$ hexdump -C .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
00000000 78 01 4b ca c9 4f 52 30 63 c8 48 cd c9 c9 e7 02 |x.K..OR0c.H.....|
00000010 00 1d c5 04 14 |.....|
00000015
Git uses a little optimisation when it stores objects: the first two digits of a hash become the subdirectory name and the rest becomes the file name where the compressed object data is stored. Let’s reproduce this behaviour:
$ mkdir -p .git/objects/3a # first two digits: "3a"
$ printf "\x78\x9c\x4b\xca\xc9\x4f\x52\x30\x63\xc8\x48\xcd\xc9\xc9\xe7\x02\x00\x1d\xc5\x04\x14" \
> .git/objects/3a/3cca74450ee8a0245e7c564ac9e68f8233b1e8 # rest of the hash
# Now, can Git CLI read our blob?
$ git cat-file blob 3a3cca
hello
Writing objects
First of all let’s introduce a Git
“class” that would be a main entry point to work with out repo. We will also need a Hash
type that would handle hash encoding/decoding:
type Git struct {
Dir string // where `.git` is located
Branch string // current branch, i.e. "main"
...
}
type Hash []byte // hashes in Git are presented in hexadecimal form
func NewHash(b []byte) (Hash, error) {
dec, err := hex.DecodeString(strings.TrimSpace(string(b)))
if err != nil {
return nil, err
}
return Hash(dec), nil
}
func (h Hash) String() string { return hex.EncodeToString(h) }
Since all of the content in Git should be compressed – we can start implementing two utility functions that perform compression and decompression:
func zip(content []byte) ([]byte, error) {
b := &bytes.Buffer{}
zw := zlib.NewWriter(b)
if _, err := zw.Write(content); err != nil {
return nil, err
}
if err := zw.Close(); err != nil {
return nil, err
}
return b.Bytes(), nil
}
func unzip(content []byte) ([]byte, error) {
zw, err := zlib.NewReader(bytes.NewBuffer(content))
if err != nil {
return nil, err
}
defer zw.Close()
return io.ReadAll(zw)
}
Now we can write a helper method that writes an object into the Git repo:
// this would be shorter than using fmt.Sprintf all the time, and we would need it a lot
func (g *Git) fmt(format string, args ...any) []byte { return []byte(fmt.Sprintf(format, args...)) }
// this writes an object of the given type with the given raw content into .git
// g := &Git{Dir: ".git", Branch: "main"}
// hash, err := g.write("blob", []byte("hello\n"))
func (g *Git) write(objType string, b []byte) (Hash, error) {
b = append(g.fmt("%s %d\x00", objType, len(b)), b...)
bz, err := zip(b)
if err != nil {
return nil, err
}
sum := sha1.Sum(b)
hash := hex.EncodeToString(sum[:])
dir := filepath.Join(g.Dir, "objects", hash[:2])
obj := filepath.Join(dir, hash[2:])
if err := os.MkdirAll(dir, 0755); err != nil {
return nil, err
}
return sum[:], os.WriteFile(obj, bz, 0644)
}
If we call g.write("blob", []byte("hello\n"))
- it would create a blob object with a checksum we’ve calculated before, that we could later read by using git cat-file blob <hash>
.
Initial commit
Time to go further and make our first commit into the new repository. We know that commits refer to a tree object, and tree objects refer to the blob objects they contain. To create a blob object seems to be fairly obvious:
func (g *Git) AddBlob(data []byte) (Hash, error) {
return g.write("blob", data)
}
Tree data consists of the file permissions (like 100644
for normal files), file names and hashes of their content blob objects. This makes writing a tree object with a single file in it quite easy, too:
func (g *Git) AddTree(filename string, filedata []byte) (Hash, error) {
hash, err := g.AddBlob(filedata)
if err != nil {
return nil, err
}
content := append(g.fmt("100644 %s\x00", filename), hash...)
return g.write("tree", content)
}
Finally the last piece of a puzzle – the commit object. Commits usually have a tree reference (tree <sha-1 hash>
), some author and committer information (author John <john@example.com> <time> +0000
) and a commit message:
func (g *Git) AddCommit(filename string, data []byte, parentHash Hash, msg string) (Hash, error) {
hash, err := g.AddTree(filename, data)
if err != nil {
return nil, err
}
parent := ""
if parentHash != nil {
parent = fmt.Sprintf("parent %s\n", parentHash.String())
}
t := time.Now().Unix()
content := g.fmt("tree %s\n%sauthor %s <%s> %d +0000\ncommitter %s <%s> %d +0000\n\n%s\n",
hash, parent, g.User, g.Email, t, g.User, g.Email, t, msg)
b, err := g.write("commit", content)
if err != nil {
return nil, err
}
return b, g.SetHead(b)
}
func (g *Git) SetHead(h Hash) error { return os.WriteFile(filepath.Join(g.Dir, "refs", "heads", g.Branch), []byte(h.String()), 0644) }
At the end of the AddCommit
method we set the current branch head to the resulting commit hash.
Now if we try making a commit using this code - Git CLI would be able to show it in git log
. But without a proper parentHash
the next AddCommit
call would overwrite the previous commit and there will always be only one commit in the history. Let’s fix it.
History
Commits are chained. To create the second commit we should read the contents of .git/refs/heads/main
and use that hash as a parentHash
of the new commit:
func (g *Git) Head() (Hash, error) {
b, err := os.ReadFile(filepath.Join(g.Dir, "refs", "heads", g.Branch))
if err != nil {
return nil, err
}
return NewHash(b)
}
Now this enables us to read the objects back and iterate from the most latest commit (tip of the branch) to the initial commit (with no parent). Of course, in a “real” Git repo commits may have more than one parent (i.e. after a merge) but we only consider here a very simple, single-branch single-file repository.
To read an object by the given hash we shall implement a reverse procedure to our write()
method:
func (g *Git) read(objType string, hash Hash) ([]byte, error) {
h := hash.String()
dir := filepath.Join(g.Dir, "objects", h[:2])
obj := filepath.Join(dir, h[2:])
b, err := os.ReadFile(obj)
if err != nil {
return nil, err
}
b, err = unzip(b)
if err != nil {
return nil, err
}
if !bytes.HasPrefix(b, []byte(objType+" ")) {
return nil, fmt.Errorf("not a %s object", objType)
}
n := bytes.IndexByte(b, 0)
if n < 0 {
return nil, fmt.Errorf("invalid %s", objType)
}
return b[n+1:], nil
}
We read a file from .git/objects/<first two chars>/<rest of the hash>
, check the object type, skip object length and return the remaining object content. What’s left is to parse different content types to handle blobs, trees and commits.
Reading blobs is trivial, as there’s nothing to parse:
func (g *Git) Blob(hash []byte) ([]byte, error) { return g.read("blob", hash) }
Reading trees is slightly more complex, if we want to consider multiple files per tree:
type Tree struct {
Blobs []Blob
Hash Hash
}
type Blob struct {
Name string
Hash Hash
}
func (g *Git) Tree(hash []byte) (tree *Tree, err error) {
b, err := g.read("tree", hash)
if err != nil {
return nil, err
}
tree = &Tree{Hash: hash}
for {
parts := bytes.SplitN(b, []byte{0}, 2)
fields := bytes.SplitN(parts[0], []byte{' '}, 2)
tree.Blobs = append(tree.Blobs, Blob{Name: string(fields[1]), Hash: parts[1][0:20]})
b = parts[1][20:]
if len(parts[1]) == 20 {
break
}
}
return tree, nil
}
Here we parse tree content in a loop and create blob records. We don’t read blobs themselves, but only store their filenames and hashes.
What’s left is to read the commit object by its hash and we would be able to implement git log
!
Commit parser is very similar to the tree parser, it iterates over the content line by line, and depending on the line prefix fulfils the information about the commit:
type Commit struct {
Msg string
Parent Hash
Tree Hash
Hash Hash
}
func (g *Git) Commit(hash []byte) (ci Commit, err error) {
ci = Commit{Hash: hash}
b, err := g.read("commit", hash)
if err != nil {
return ci, err
}
lines := bytes.Split(b, []byte{'\n'})
for i, line := range lines {
if len(line) == 0 {
ci.Msg = string(bytes.Join(append(lines[i+1:]), []byte{'\n'}))
return ci, nil
}
parts := bytes.SplitN(line, []byte{' '}, 2)
switch string(parts[0]) {
case "tree":
ci.Tree, err = hex.DecodeString(string(parts[1]))
if err != nil {
return ci, err
}
case "parent":
ci.Parent, err = hex.DecodeString(string(parts[1]))
if err != nil {
return ci, err
}
}
}
return ci, nil
}
Despite the simplicity this should work even with complex Git repos, but it will only follow a single branch and will ignore merges. To support multiple parent commits one would have to append parent hashes into the slice.
Another exercise to try is to implement tags. Tags are similar to heads
(branches) – they are text files inside .git/refs/tags
and contain hashes of the tag objects.
The storage mechanism we’ve just implemented is known as “loose objects”. But there is an alternative, more efficient (and more complex) storage known as “packfiles”. Pack file is an archive of objects, similar to a tarball, where some objects can be stored as deltas (differences) from another object in the pack. Parsing packfiles is well documented and is not that difficult, but it’s time to wrap up our Git story.
A full example of nanogit.go
that implements git init
, git commit
, git checkout
, and git log
is available as a gist. In under ~300LOC it covers most of the fundamental Git concepts: references, objects, hashes and storage system. With a bit of effort it can be improved to support tags, multiple files, multiple parent commits and eventually end up being a small Git library for the programming language of your interest.
I hope you’ve enjoyed this article. You can follow – and contribute to – on Github, Mastodon, Twitter or subscribe via rss.
Dec 04, 2022
See also: Post-apocalyptic programming and more.