Two-step push: accelerating Git LFS migration for big repositories

Not interested in explanations? Fast travel to the tl;dr 🚀

I recently had to migrate several Git repositories from one GitLab instance to another. The new instance had stricter requirements, notably a max size of 1 GB, and one repo in particular was not making the cut, clocking in at 1.5 GB bare total size (du -sh .) for 700K Git objects (git rev-list --objects --all | wc -l) on a fresh git clone --mirror.
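
For reference, here is how those numbers were obtained, starting from the fresh mirror clone (remote URL redacted, as elsewhere in this post):

git clone --mirror <redacted>.git && cd <redacted>.git
du -sh .                              # bare total size: 1.5 GB
git rev-list --objects --all | wc -l  # Git object count: ~700K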

There are two main ways to reduce repository size:

  • Filter Git history and prune undesirable objects (files committed by mistake, old objects that will never be needed again, etc.).
  • Migrate big blobs to Git LFS (they will no longer count toward the Git objects size, since they live on external storage).

At the time of writing, the best tools for the job are:

  • git-filter-repo for history filtering.
  • git lfs migrate for LFS migration.

Note: if you look around, you'll probably find mentions of BFG Repo-Cleaner. I recommend against using it; see below for more details.

I used both methods to reduce the size of this specific repo. I'll quickly go over each step, then focus on an issue where the first git push after an LFS migration on a big repository can take several hours or days to complete, seemingly stuck while uploading LFS objects.

History filtering

git-filter-repo --analyze # identify filtering candidates
git-filter-repo --strip-blobs-bigger-than 5M --path some-existing-dir/ --path some-deleted-dir/ --invert-paths # drop big blobs and the listed paths from history

After analysis, I determined all blobs bigger than 5 MB could be safely filtered, as well as a number of directories (both existing and deleted ones). The repo was now 900 MB. Not bad! But still a bit too close to 1 GB for my taste.
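
Re-running the earlier size check confirms the gain:

du -sh . # now ~900 MB, down from 1.5 GB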

Git LFS migration

git lfs migrate info --top=25 # identify LFS candidates
git lfs migrate import --everything --include="*.ext1,*.ext2" # rewrite all refs, converting matching files to LFS pointers

After analysis, I selected a number of binary file extensions to be converted to LFS. Around 2600 objects were converted (find lfs/objects/ -type f | wc -l) for a total of 800 MB. Great! That meant the regular Git objects now accounted for only around 100 MB of the repo size, with the rest of the bulk stored as LFS objects.
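
Note: the object count command is the one quoted above; sizing the store with du is my assumption for how the 800 MB figure was measured:

find lfs/objects/ -type f | wc -l # ~2600 converted LFS objects
du -sh lfs/                       # ~800 MB of LFS content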

Note: if this is your first LFS migration, you may want to read below about Git filters case sensitivity and cleaning up after an LFS migration.

The issue

The next step was to push the filtered, LFS-migrated history to a remote. At first it looked fine:

$ git push
Uploading LFS objects: 100% (2312/2312), 765 MB | 4.6 MB/s

…but then it stayed stuck there, at “100%”.

Note: this behavior is independent of the tool used: I experienced it with both BFG (deprecated) and git lfs migrate. The issue lies with the LFS blobs themselves, not with the tools used to produce them.

Analysis

ps aux showed that git-lfs was working at full throttle, and watching closely, I could indeed see the object count ramping up from time to time (e.g. it would go from 2312/2315 to 2315/2315 after a few seconds). At this point, my hypothesis was that the operation was not stuck, but actually working its way through the LFS blobs, albeit in a seemingly very slow manner.
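
Note: if you want to keep an eye on the process yourself, a minimal way to watch git-lfs from another terminal (assuming a Linux userland with watch available) is:

watch -n 5 'ps aux | grep [g]it-lfs' # refresh every 5 s; the [g] trick excludes the grep itself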

Googling around did not help much: people with issues at the Uploading step mostly seem to be suffering from network bottlenecks, due either to bandwidth constraints or to huge file sizes, but this was not my case.

I then enabled debug output and observed interesting tidbits:

$ GIT_TRACE=1 git push
Uploading LFS objects: 100% (2312/2312), 765 MB | 4.6 MB/s
16:41:13.318312 trace git-lfs: tq: running as batched queue, batch size of 100
16:41:13.318312 trace git-lfs: run_command: git rev-list --stdin --objects --not --remotes=<redacted> --
... [repeating]

A quick code inspection reveals that this log line is emitted while processing the transfer queue, and Googling this specific log entry yields more substantial results (#3707, #3915), which seem to confirm the hypothesis:

[…] we’re traversing the entire history to see what objects need to be pushed. If you have a large repository with a large history, that can be expensive. It’s unfortunately unavoidable, since we need to know which objects have to be pushed, and the only way to know that is to read all of the pointer files from all of the unpushed history.

When we’re pushing a repository that’s newly rewritten, we’re going to traverse every blob that exists that’s less than 1024 bytes so that we can determine if an LFS object is in that object.

This explains why git push takes a very long time processing LFS objects on this repo: it has to go through all 700K Git objects, 100 at a time, to determine whether or not they are LFS pointers.

Note: the exact same issue happens when trying to push all LFS objects with git lfs push origin --all, which is consistent, as that is basically what git push does via the pre-push hook installed by git lfs install. EDIT: this may change in a future Git LFS version, see below.
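
For context, the pre-push hook that git lfs install drops into .git/hooks/pre-push essentially just delegates to git-lfs (shown trimmed here; exact wording varies by version):

#!/bin/sh
# .git/hooks/pre-push: bail out if git-lfs is missing, otherwise delegate
command -v git-lfs >/dev/null 2>&1 || { echo >&2 "git-lfs was not found on your path"; exit 2; }
git lfs pre-push "$@"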

Improvising a two-step push

I tried letting git push do its thing, but after 24 hours I decided it was unsuitable for us, especially since it was clear we were nowhere near the end: the object count was around 2500, whereas it had started around 2300, and I knew there were 2600 LFS objects in total to upload (find lfs/objects/ -type f | wc -l).

This last bit was key to a solution: since I could list all LFS objects, I wondered if I could split the push operation into two steps, one for regular Git objects and the other for LFS objects. This way, we would avoid the need to scan through all blobs.

My initial idea looked like this:

GIT_LFS_SKIP_PUSH=1 git push # push regular Git objects
git lfs push origin --object-id `find lfs/objects/ -type f -printf "%f "` # push LFS objects

It leans on:

  • GIT_LFS_SKIP_PUSH to prevent sending LFS objects by selectively disabling the LFS pre-push hook (documentation). Note: this variable was added in Git LFS 2.13.0; on earlier versions, git push --no-verify can be used to disable all pre-push hooks.
  • --object-id to specify which objects should be sent (documentation).

However, I immediately hit a wall: the git push operation was actively refused by GitLab's pre-receive hook because referenced LFS objects were missing:

$ git push
remote: GitLab: LFS objects are missing. Ensure LFS is properly set up or try a manual "git lfs push --all".
To <redacted>.git
 ! [remote rejected] master -> master (pre-receive hook declined)
 ... [repeat for all branches]
error: failed to push some refs to '<redacted>.git'

This behavior is actually expected, as per the documentation. Fortunately, as the hook message suggests, we can simply do it the other way around and push the LFS objects first:

git lfs push origin --object-id `find lfs/objects/ -type f -printf "%f "` # push LFS objects
GIT_LFS_SKIP_PUSH=1 git push # push regular Git objects

This goes as fast as network speed allows. No history traversal involved 🎉
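
Note: a caveat I did not hit myself, but worth flagging: with enough LFS objects, the backtick expansion above may bump into the shell's argument length limit (ARG_MAX). Piping the OIDs through xargs in batches sidesteps that:

find lfs/objects/ -type f -printf "%f\n" | xargs -n 100 git lfs push origin --object-id # push OIDs 100 at a time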

We can then try a regular git push again, and receive Everything up-to-date as confirmation that everything worked:
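
$ git push
Everything up-to-date

And indeed, subsequent clones and regular Git operations on the migrated repo work perfectly fine.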

Follow-up

As for why this is not the default behavior, I have no idea. I opened an issue on GitHub to enquire.

EDIT: this is not the default behavior, as it would be inefficient in most cases outside of a full migration, e.g. when we only need to rewrite a few objects in recent history due to a mistaken push:

The reason that your solution is efficient in your case is because you know that the remote repository has no LFS objects and you’ll need to upload them all. However, when we use the pre-push hook to do an upload, we have no way of knowing what refs may already exist on the remote side. All we know is that none of the refs being pushed exist yet. This case, generally, is indistinguishable from pushing a small number of new branches with only a few commits.

In the case of an LFS migration into an existing repository, you just look like you’re updating many refs at once, and that could be just a single commit on each one. We don’t know until we start to look at history.

In the common case, walking history is much, much faster than pushing all objects, especially on slow connections, because we’ll traverse a relatively small number of commits.

An interesting note: in a future version, git lfs push origin --all may evolve to avoid history traversal and behave like the improvised workaround.

tl;dr

The first git push after an LFS migration on a big repository may take several hours or days to complete, seemingly stuck while uploading LFS objects.

It is possible to manually split the push operation in two steps, one for regular Git objects and the other for LFS objects:

git lfs push origin --object-id `find lfs/objects/ -type f -printf "%f "` # push LFS objects
GIT_LFS_SKIP_PUSH=1 git push # push regular Git objects

Note: the GIT_LFS_SKIP_PUSH variable was added in Git LFS 2.13.0; on earlier versions, git push --no-verify can be used to disable all pre-push hooks.

Depending on your Git hosting provider and its hook checks, you may need to reverse the push order (the order above is GitLab-compatible).


Annexes

BFG Repo-Cleaner

You may find BFG Repo-Cleaner mentioned in various resources, both for history filtering and for LFS migration. At the time of writing, it is notably recommended by the GitLab documentation for LFS migration. However, I advise against using it, as it seems unsuitable for modern use:

  • Soft-deprecated for history filtering when the historic git filter-branch command was deprecated in favor of git-filter-repo, the latter being a reimplementation of BFG (9df53c5d) and a strict upgrade in terms of features (one notable example being refs/replace support).
  • Soft-deprecated for LFS migration due to git lfs having its own migration process with git lfs migrate.
  • Does not properly convert deeply nested files to LFS (#349).
  • Handles LFS .gitattributes in an impractical manner (#116).

Anecdotally, I can attest to having experienced both of the last two issues while following the GitLab documentation and testing LFS migration with BFG on my repo.

Git filters case sensitivity

You probably already know this if you've fiddled with .gitattributes before, but Git filters are case-sensitive. When tracking file extensions with LFS, it is a good idea to use glob patterns that match in a case-insensitive manner:

git lfs track "*.png"          # will only track files in ".png"
git lfs track "*.[pP][nN][gG]" # track files in ".png" and ".PNG", as well as any mix

Ditto for git lfs migrate:

git lfs migrate import --everything --include="*.[pP][nN][gG]"

Cleaning up after an LFS migration

git-filter-repo automatically cleans up after itself; git lfs migrate, however, does not. It is a good idea to force a cleanup before proceeding with post-migration push operations:

git reflog expire --expire=now --all # drop reflog entries that keep pre-migration objects alive
git gc --prune=now --aggressive      # repack and prune the now-unreferenced objects
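
To check the result, git count-objects gives a quick human-readable summary of what remains in the object database:

git count-objects -vH # object count and pack sizes after the cleanup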
