I was very disappointed to see GitLab [drop support](https://gitlab.com/gitlab-org/gitlab-ee/issues/1648) for git-annex in their 9.0 release in 2017. This makes it impossible to use GitLab to store our blobs. But there may be a way out of this.

GitLab, GitHub, Gogs and many other hosting providers actually support the Git LFS API for large file storage. If git-annex supported that API through (say) a special (or builtin) remote, it would be possible to transparently access those contents.

I'm not talking about supporting the *client*-side LFS data structures. Everything would stay the same from git-annex's point of view. The idea would be simply to allow pushing/pulling files from Git LFS repositories.

Could that work? Would it be possible to just make this into a separate remote without having to hack at git-annex's core?

Thanks for your great work! :) -- [[anarcat]]

> git lfs has some fairly complicated endpoint guessing and discovery;
> to find the lfs http endpoint for a ssh remote it sshes to the server,
> runs git-lfs-authenticate there, and parses the resulting json. The
> authentication generates a http basic auth header, which is valid for a
> few hours or so.
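>
> For illustration, a minimal sketch (in Python, with a made-up host and
> repository path) of what that discovery step amounts to:
>
> ```python
> # Sketch of git-lfs endpoint discovery for a ssh remote: run
> # git-lfs-authenticate on the server and parse the json it prints.
> import json
> import subprocess
>
> def discover_lfs_endpoint(host, repo, operation="download"):
>     # e.g. host="git@example.com", repo="someuser/somerepo.git"
>     out = subprocess.check_output(
>         ["ssh", host, "git-lfs-authenticate", repo, operation])
>     info = json.loads(out)
>     # info["href"] is the lfs endpoint; info["header"] carries the
>     # time-limited basic auth header to use on batch API requests.
>     return info["href"], info.get("header", {})
> ```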
>
> One consequence is that the endpoint can change over time to some other
> server. It may also be possible for the authentication to return more
> than one endpoint; not sure. Anyway, I guess that git-annex would need to
> treat a given lfs remote as a single copy, irrespective of what
> endpoint discovery finds. So a lfs special remote will get a uuid
> assigned like any other special remote.
>
> When a git-lfs repo is forked, the fork shares the lfs endpoint of its
> parent. (And github's lfs bandwidth and storage quotas do too, so it
> seems it may be possible to fork someone's repo, push big objects to it,
> and eat up *their* quota!) If a special remote is initialized for the
> parent and another for the fork, git-annex would see two different
> uuids, and so think there were two copies of objects in them, while
> there's really just effectively 1 copy.
>
> In the git-lfs protocol, the upload action has an optional "ref"
> parameter, which is a git ref that the object is associated with.
> In some cases, a user may only be able to upload objects if the right ref
> is provided (see
> <https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md#ref-property>).
> This could be problematic because, from git-annex's perspective, there's
> no particular git ref associated with an annex object. I suppose it could
> always send the current ref. It will need to handle the case where the
> lfs endpoint rejects a request due to the wrong ref, and communicate this
> as an error to git-annex, especially in the `checkPresent`
> implementation.
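>
> For illustration, a minimal sketch (in Python; the endpoint url, auth
> header, oid and size would have to come from elsewhere) of a batch
> "upload" request carrying that "ref" property:
>
> ```python
> # Sketch of a git-lfs batch API "upload" request with the "ref" property.
> import json
> import urllib.request
>
> def batch_upload_request(endpoint, auth_header, oid, size, ref):
>     payload = {
>         "operation": "upload",
>         "transfers": ["basic"],
>         "ref": {"name": ref},  # e.g. "refs/heads/master"
>         "objects": [{"oid": oid, "size": size}],
>     }
>     req = urllib.request.Request(
>         endpoint + "/objects/batch",
>         data=json.dumps(payload).encode(),
>         headers={
>             "Accept": "application/vnd.git-lfs+json",
>             "Content-Type": "application/vnd.git-lfs+json",
>             **auth_header,
>         },
>     )
>     with urllib.request.urlopen(req) as resp:
>         return json.load(resp)
> ```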
>
> To implement `checkPresent`, git-annex will need to send a "download"
> request. The response contains a url to use to download the blob;
> git-annex could either HEAD it to verify it's present, or assume that the
> lfs endpoint has verified enough that it's present in order to send that
> response. Since lfs has no way to delete objects, all that `checkPresent`
> will detect is when the server has lost an object for some reason.
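>
> A minimal sketch of that presence check, operating on an already-parsed
> batch "download" response (the helper name is made up; the response
> shape follows the lfs batch API):
>
> ```python
> # Decide whether the lfs server has an object, given a parsed batch
> # "download" response.  Per-object failures come back as an "error"
> # entry on the object rather than an http error for the whole request.
> def object_is_present(batch_response, oid):
>     for obj in batch_response.get("objects", []):
>         if obj.get("oid") != oid:
>             continue
>         if "error" in obj:
>             # e.g. {"code": 404, "message": "Object does not exist"}
>             return False
>         # A download action with an href suggests the server has it;
>         # a HEAD request on that href would be the stricter check.
>         return "download" in obj.get("actions", {})
>     return False
> ```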
>
> git-lfs has "transfer adapters", but the only important one currently is
> the "basic" adapter, which uses the standard lfs API. The "custom"
> adapter is equivalent to a git-annex external special remote.
>
> The lfs API is intended to batch together several uploads or downloads
> into a single request to the endpoint. But git-annex doesn't have a good
> capacity for batching; for example, when git-annex is downloading all
> the files in a directory, it goes through them sequentially and expects
> each download to complete before starting the next.
> (This limitation also makes Amazon Glacier's
> batch downloads suboptimal, so perhaps git-annex should be improved in
> some way to support batch requests.) The simple implementation would be
> to make API requests with a single object in each. Besides being somewhat
> slower, that risks running into whatever API rate limit the endpoint
> might have.
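>
> To illustrate the difference, the batch API's "objects" list can carry
> several objects in one request, while the simple implementation described
> above would send just one at a time (the oids and sizes below are made up):
>
> ```python
> # A single batch "download" request covering several objects at once.
> batched = {
>     "operation": "download",
>     "transfers": ["basic"],
>     "objects": [
>         {"oid": "11aa...", "size": 123},
>         {"oid": "22bb...", "size": 456},
>         {"oid": "33cc...", "size": 789},
>     ],
> }
> ```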
>
> > I probed the github lfs endpoint for rate limiting by forcing git-lfs
> > to re-download a small object repeatedly (deleting the object and running
> > `git lfs pull`). I tried this with both a http remote with no
> > authentication (about 1 request per second) and a ssh remote
> > (one request per 4 seconds). Both successfully got through 1000
> > requests without hitting a rate limit.
> >
> > But github's rate limiting probably changes dynamically, and a google
> > search turns up git lfs hitting rate limits when github has had problems
> > in the past, so this is only a rough idea of the current picture.
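> >
> > Roughly, that probe might look like the following sketch (paths and
> > details are guesses, not the exact commands used):
> >
> > ```python
> > # Re-download a small lfs object in a loop and time each fetch.
> > import shutil, subprocess, time
> >
> > for i in range(1000):
> >     # Drop the locally cached lfs objects so `git lfs pull` has to
> >     # hit the endpoint again.
> >     shutil.rmtree(".git/lfs/objects", ignore_errors=True)
> >     start = time.time()
> >     subprocess.run(["git", "lfs", "pull"], check=True)
> >     print(i, round(time.time() - start, 2), "seconds")
> > ```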
>
> That seems to be all the complications involved in implementing this,
> aside from git-annex needing to know the sha256 and size of an object.
> --[[Joey]]

> Started some initial work in the `git-lfs` branch. --[[Joey]]

[[done]]! --[[Joey]]

---

## related ideas

A couple ideas for possible things that could also be done with git-lfs
integration. Just to keep in mind while implementing the above.

* git-annex could support git-lfs pointer files. This would let a lfs
  repo be cloned and git-annex used to manage the files in it with the
  finer control it allows compared to git-lfs. (The pointer file format
  is sketched below.)

* A lfs API endpoint could serve out of a git-annex repository.
  One neat thing this would allow is that, when git-annex knows an url
  where an object is located (from the web special remote or another
  special remote that registers an url), it could direct the lfs
  client to that url when it requests to download the object.
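
The pointer-file idea above relies on git-lfs's small, text-based pointer
format. A minimal sketch (purely illustrative) of reading it:

```python
# Parse a git-lfs pointer file, which is a few "key value" lines, e.g.:
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:4d7a2146...
#   size 12345
def parse_lfs_pointer(path):
    fields = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            if key:
                fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {"hash": algo, "oid": digest, "size": int(fields["size"])}
```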