git-annex/doc/tips/splitting_a_repository.mdwn
Joey Hess 0df94132d9
add a warning and a related todo
arising from a conversation at FOSSY
2023-07-13 19:58:12 -04:00


[[!meta title="Splitting a git-annex repository"]]

I have a [git annex](https://git-annex.branchable.com/) repo for all my media
that has grown to 57866 files and git operations are getting slow, especially
on external spinning hard drives, so I decided to split it into separate
repositories.

Here is how to split out a repository that contains a subset of the files
in the larger repository. The larger repository is left as-is, but similar
methods can be used to remove the files from it. Or, it can be deleted
once it gets split up into several smaller repositories.

(This is the reverse of [[migrating two seperate disconnected directories
to git annex]].)

Suppose the old big repo is at `~/oldrepo`, and you want to split out
photos from it, and those are all located inside `~/oldrepo/photos`.

First, let's create a new empty repo.

    mkdir ~/photos
    cd ~/photos
    git init

Now populate the new repo with the files we want from the old repo. We
can use `git filter-branch` to create a git branch that contains only the
history of the files in `photos`. That command has a *lot* of options and
ways to use it, but here is one simple way:

    cd ~/oldrepo
    # filter a branch so it contains only the files wanted by the new repository
    git branch split-master master
    git filter-branch --prune-empty --subdirectory-filter photos split-master
    # replace the new repo's master branch with the filtered branch
    git push ~/photos split-master
    git branch -D split-master
    cd ~/photos
    git reset --hard split-master
    git branch -d split-master

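
To see what `--subdirectory-filter` does before running it on real data, here is a small sketch that runs the same filter in a throwaway repository. Everything in it is made up for illustration (the scratch directory, the `demo` identity, the files `photos/a.jpg` and `docs/b.txt`), and it assumes a git that still ships `filter-branch`. The `FILTER_BRANCH_SQUELCH_WARNING` variable only skips the warning and pause that newer git versions add.

```shell
# Scratch demonstration only; all paths and file names are hypothetical.
set -e
scratch=$(mktemp -d)
cd "$scratch"

# Identity used for the throwaway commit.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Skip the warning and delay that newer git prints for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

# A tiny stand-in for the big repository.
git init -q oldrepo
cd oldrepo
mkdir photos docs
echo one > photos/a.jpg
echo two > docs/b.txt
git add .
git commit -q -m "add a photo and a document"

# Same filter as in the tip, applied to a copy of the current branch.
git branch split-master HEAD
git filter-branch --prune-empty --subdirectory-filter photos split-master >/dev/null

# The filtered branch has photos/ as its root and nothing else.
filtered_files=$(git ls-tree -r --name-only split-master)
echo "$filtered_files"
```

The listing should contain only `a.jpg`, with the `photos/` prefix gone, because the subdirectory becomes the root of the filtered branch.
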
Next, the git-annex branch needs to be filtered to include only
the files in `photos`, and that filtered branch sent to the new repository.
That can be done with the [[git-annex-filter-branch]](1) command.

    cd ~/oldrepo
    annexrev=$(git annex filter-branch photos --include-all-key-information --include-all-repo-config --include-global-config)
    git push ~/photos $annexrev:refs/heads/git-annex

Next, initialize git-annex on the new repository. This uses
the same annex.uuid as was in the old repository. That's ok, because
the repository that's been split off will never have the old repository
as a remote.

    cd ~/photos
    git annex reinit $(git config --file ~/oldrepo/.git/config annex.uuid)

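
As a sanity check after `reinit`, the two repositories should report the same `annex.uuid`. The helpers below are a hypothetical sketch, not part of git-annex; they only read each repository's `.git/config` with plain `git config`.

```shell
# Hypothetical helpers, not part of git-annex.

# Print the annex.uuid stored in a repository's .git/config.
annex_uuid() {
    git config --file "$1/.git/config" annex.uuid
}

# Succeed only when both repositories record the same, non-empty UUID.
same_annex_uuid() {
    uuid_a=$(annex_uuid "$1") || return 1
    uuid_b=$(annex_uuid "$2") || return 1
    [ -n "$uuid_a" ] && [ "$uuid_a" = "$uuid_b" ]
}
```

With the layout used in this tip, `same_annex_uuid ~/oldrepo ~/photos && echo "uuid matches"` would confirm the reinit took effect.
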
Finally, the annexed file contents need to be copied to the new repository:

    cd ~/photos
    # Hardlink all the annexed data from the old repo
    cp -rl ~/oldrepo/.git/annex/objects .git/annex/
    # Remove unneeded hard links
    git annex unused --quiet
    git annex drop --unused --force
    # Fix up annex links to content and make sure it's all ok.
    git annex fsck

Warning: This method of copying the annexed file contents and dropping
the unused ones causes the git-annex branch to log information.
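
The `cp -rl` step relies on hard links, which only work when the old and new repositories live on the same filesystem. One way to check beforehand is to try a single hard link between the two directories. The `can_hardlink` helper below is a hypothetical sketch; it creates a temporary probe file in the first directory and removes it again.

```shell
# Hypothetical helper: succeeds if a file in directory $1 can be
# hard-linked into directory $2 (i.e. both are on one filesystem
# and both are writable).
can_hardlink() {
    src_dir=$1
    dst_dir=$2
    probe=$(mktemp "$src_dir/.hardlink-probe.XXXXXX") || return 1
    link="$dst_dir/$(basename "$probe")"
    if ln "$probe" "$link" 2>/dev/null; then
        rm -f "$link" "$probe"
        return 0
    fi
    rm -f "$probe"
    return 1
}
```

For example, `can_hardlink ~/oldrepo/.git/annex ~/photos/.git || echo "hard links will not work here"` would flag the problem before any data is touched.
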
# alternative older method
Here is another way to do it. Suppose the old big repo is at `~/oldrepo`:
```
# Create a new repo for photos only
mkdir ~/photos
cd ~/photos
git init
git annex init laptop
# Hardlink all the annexed data from the old repo
cp -rl ~/oldrepo/.git/annex/objects .git/annex/
# Regenerate the git annex metadata
git annex fsck --fast
# Also split the repo on the usb key
cd /media/usbkey
git clone ~/photos
cd photos
git annex init usbkey
cp -rl ../oldrepo/.git/annex/objects .git/annex/
git annex fsck --fast
# Connect the annexes as remotes of each other
git remote add laptop ~/photos
cd ~/photos
git remote add usbkey /media/usbkey/photos
```
At this point, I went through all repos doing standard cleanup:
```
# Remove unneeded hard links
git annex unused
git annex dropunused --force 1-12345
# Sync
git annex sync
```
To make sure nothing is missing, I used `git annex find --not --in=here`
to check whether a repository that should have everything, such as the usbkey,
was missing anything.

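
That check can be turned into a pass/fail test by counting the lines the command prints. The `report_missing` function below is a hypothetical wrapper, not a git-annex command; it reads a file list on stdin, one path per line.

```shell
# Hypothetical helper: reads a list of files on stdin (one per line),
# prints a summary, and fails if the list is not empty.
report_missing() {
    missing=$(wc -l | tr -d ' ')
    if [ "$missing" -eq 0 ]; then
        echo "all annexed files are present here"
        return 0
    fi
    echo "$missing annexed file(s) have no content here" >&2
    return 1
}
```

Inside a repository it would be used as `git annex find --not --in=here | report_missing`.
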
Update: Antoine Beaupré pointed me to
[this tip about Repositories with large number of files](http://git-annex.branchable.com/tips/Repositories_with_large_number_of_files/)
which I will try next time one of my repositories grows enough to hit a performance issue.
> This document was originally written by [Enrico Zini](http://www.enricozini.org/blog/2017/debian/splitting-a-git-annex-repository/) and added to this wiki by [[anarcat]].