git-annex/doc/todo/LFS_API_support.mdwn

I was very disappointed to see GitLab [drop support](https://gitlab.com/gitlab-org/gitlab-ee/issues/1648) for git-annex in their 9.0 release in 2017. This makes it impossible to use GitLab to store our blobs. But there may be a way out of this.
GitLab, GitHub, Gogs and many other hosting providers support the Git LFS API for large file storage. If git-annex supported that API through (say) a special (or builtin) remote, it would be possible to transparently access those contents.
I'm not talking about supporting the *client*-side LFS data structures. Everything would stay the same from git-annex's point of view. Quite simply, the idea would be to allow pushing and pulling files to/from Git LFS repositories.
Could that work? Would it be possible to just make this into a separate remote without having to hack at git-annex's core?
Thanks for your great work! :) -- [[anarcat]]
> git lfs has some fairly complicated endpoint guessing and discovery;
> to find the lfs http endpoint for a ssh remote it sshes to the server,
> runs git-lfs-authenticate there and parses the resulting json. The
> authentication generates a http basic auth header, which is valid for a
> few hours or so.
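>
> A minimal sketch of that discovery step (the hostname and repo path are
> placeholders; the command and JSON fields follow the git-lfs server
> discovery docs, and nothing here is an actual implementation):
>
>     import json
>     import subprocess
>
>     def discover_lfs_endpoint(host, path, operation="download"):
>         # git-lfs-authenticate prints JSON containing the http endpoint
>         # and a short-lived Authorization header.
>         out = subprocess.check_output(
>             ["ssh", host, "git-lfs-authenticate", path, operation])
>         auth = json.loads(out)
>         # e.g. {"href": "https://lfs.example.com/user/repo",
>         #       "header": {"Authorization": "RemoteAuth ..."},
>         #       "expires_in": 86400}
>         return auth["href"], auth.get("header", {})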
>
> One consequence is that the endpoint can change over time to some other
> server. It may also be possible for the authentication to return more
> than one endpoint, not sure. Anyway, I guess that git-annex would need to
> treat a given lfs remote as a single copy, irrespective of what
> endpoints discovery finds. So a lfs special remote will get a uuid
> assigned like any other special remote.
>
> When a git-lfs repo is forked, the fork shares the lfs endpoint of its
> parent. (And github's lfs bandwidth and storage quotas do too, so it
> seems it may be possible to fork someone's repo, push big objects to it
> and eat up *their* quota!) If a special remote is initialized for the
> parent and another for the clone, git-annex would see two different
> uuids, and so think there were two copies of objects in them, while
> there's really just effectively 1 copy.
>
> In the git-lfs protocol, the upload action has an optional "ref"
> parameter, which is a git ref that the object is associated with.
> In some cases, a user may only be able to upload objects if the right ref
> is provided
> <https://github.com/git-lfs/git-lfs/blob/master/docs/api/batch.md#ref-property>.
> This could be problematic because from git-annex's perspective, there's
> no particular git ref associated with an annex object. I suppose it could
> always send the current ref. It will need to handle the case where the
> lfs endpoint rejects a request due to the wrong ref, and communicate this
> as an error to git-annex, especially in the `checkPresent`
> implementation.
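>
> For illustration, a batch upload request carrying the "ref" property
> might look like this (a sketch following the batch API document linked
> above; the endpoint, oid, and refname are placeholders):
>
>     import json
>     import urllib.request
>
>     def batch_upload_request(endpoint, oid, size, refname, headers=None):
>         body = {
>             "operation": "upload",
>             "transfers": ["basic"],
>             "ref": {"name": refname},  # e.g. "refs/heads/master"
>             "objects": [{"oid": oid, "size": size}],
>         }
>         req = urllib.request.Request(
>             endpoint + "/objects/batch",
>             data=json.dumps(body).encode(),
>             headers={**(headers or {}),
>                      "Accept": "application/vnd.git-lfs+json",
>                      "Content-Type": "application/vnd.git-lfs+json"})
>         # A 403 response here could indicate the wrong ref, which would
>         # need to be surfaced as an error to git-annex.
>         with urllib.request.urlopen(req) as resp:
>             return json.loads(resp.read())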
>
> To implement `checkPresent`, git-annex will need to send a "download"
> request. The response contains a url to use to download the blob;
> git-annex could either HEAD it to verify it's present, or assume that the
> lfs endpoint has verified enough that it's present in order to send that
> response. Since lfs has no way to delete objects, all that `checkPresent`
> will detect is when the server has lost an object for some reason.
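>
> A sketch of what such a `checkPresent` could look like, choosing to
> HEAD the url rather than trust the endpoint (again, an illustration of
> the protocol, not the planned implementation):
>
>     import json
>     import urllib.error
>     import urllib.request
>
>     def check_present(endpoint, oid, size, headers=None):
>         body = {"operation": "download",
>                 "transfers": ["basic"],
>                 "objects": [{"oid": oid, "size": size}]}
>         req = urllib.request.Request(
>             endpoint + "/objects/batch",
>             data=json.dumps(body).encode(),
>             headers={**(headers or {}),
>                      "Accept": "application/vnd.git-lfs+json",
>                      "Content-Type": "application/vnd.git-lfs+json"})
>         with urllib.request.urlopen(req) as resp:
>             obj = json.loads(resp.read())["objects"][0]
>         if "error" in obj:  # per-object error, e.g. 404 when absent
>             return False
>         # Trusting the endpoint would stop here; HEAD to double-check.
>         dl = obj["actions"]["download"]
>         head = urllib.request.Request(
>             dl["href"], headers=dl.get("header", {}), method="HEAD")
>         try:
>             urllib.request.urlopen(head).close()
>             return True
>         except urllib.error.HTTPError:
>             return False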
>
> git-lfs has "transfer adapters", but the only important one currently is
> the "basic" adaptor, which uses the standard lfs API. The "custom"
> adapter is equivilant to a git-annex external special remote.
>
> The lfs API is intended to batch together several uploads or downloads
> into a single request to the endpoint. But git-annex doesn't have a good
> capacity for batching; for example when git-annex is downloading all
> the files in a directory, it goes through them sequentially and expects
> each download to complete before starting the next.
> (This limitation also makes Amazon Glacier's
> batch downloads suboptimal so perhaps git-annex should be improved in
> some way to support batch requests.) The simple implementation would be
> to make API requests with a single object in each. Besides being somewhat
> slower, that risks running into whatever API rate limit the endpoint
> might have.
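>
> To illustrate, only the request body differs between the two
> approaches (a sketch; the helper name is made up):
>
>     def batch_body(operation, objects):
>         # objects: a list of (oid, size) pairs. The batch API allows
>         # the whole list in one request; the simple implementation
>         # would send a list of length one per request.
>         return {"operation": operation,
>                 "transfers": ["basic"],
>                 "objects": [{"oid": o, "size": s} for (o, s) in objects]}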
>
> > I probed the github lfs endpoint for rate limiting by forcing git-lfs
> > to re-download a small object repeatedly (deleting the object and running
> > `git lfs pull`). I tried this with both a http remote with no
> > authentication (about 1 request per second) and a ssh remote
> > (one request per 4 seconds). Both successfully got through 1000
> > requests w/o hitting a rate limit.
> >
> > But, github's rate limiting probably changes dynamically, and a web
> > search turns up reports of git lfs hitting rate limits when github
> > has had problems in the past, so this is only a rough idea of the
> > current picture.
>
> That seems to be all the complications involved in implementing this,
> aside from git-annex needing to know the sha256 and size of an object.
> --[[Joey]]
> Started some initial work in the `git-lfs` branch. --[[Joey]]
[[done]]! --[[Joey]]
---
## related ideas
A couple of ideas for things that could also be done with git-lfs
integration, just to keep in mind while implementing the above.
* git-annex could support git-lfs pointer files (the format is shown
  below). This would let a lfs repo be cloned and git-annex used to
  manage the files in it, with the finer control it allows compared to
  git-lfs.
* A lfs API endpoint could serve out of a git-annex repository.
One neat thing this would allow is, when git-annex knows an url
where an object is located (from the web special remote or another
special remote that registers an url), it could direct the lfs
client to that url when it requests to download the object.
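
For reference, a git-lfs pointer file is a small text file in this
format (the oid and size here are made up):

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345

The oid is the sha256 of the content and the size is its length in
bytes, which lines up with the sha256 and size git-annex is noted
above to need for each object.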