When the specified number of copies is > 1, and some repositories are
too full, it can be better to move content from them to other less full
repositories, in order to make space for new content.
annex.fullybalancedthreshhold is documented, but not implemented yet
This is not tested very well yet, and is known to sometimes take several
runs to stabalize.
Walking a tightrope between security and convenience here, because
git-annex-shell needs to only proxy for things when there has been
an explicit, local action to configure them.
In this case, the user has to have run `git-annex extendcluster`,
which now sets annex-cluster-gateway on the remote.
Note that any repositories that the gateway is recorded to
proxy for will be proxied onward. This is not limited to cluster nodes,
because checking the node log would not add any security; someone could
add any uuid to it. The gateway of course then does its own
checking to determine if it will allow proxying for the remote.
This makes git-annex sync and similar not treat proxied remotes as git
syncable remotes.
Also, display in git-annex info remote when the remote is proxied.
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.
Also it gets the cluster description set and is consistent with
initremote.
An incremental push that gets converted to a full push due to this
config results in the inManifest having just one bundle in it, and the
outManifest listing every other bundle. So it actually takes up more
space on the special remote. But, it speeds up clone and fetch to not
have to download a long series of bundles for incremental pushes.
This avoids some apparently otherwise unsolveable problems involving
races that resulted in the manifest listing bundles that were deleted.
Removed the annex-max-git-bundles config because it can't actually
result in deleting old bundles. It would still be possible to have a
config that controls how often to do a full push, which would avoid
needing to download too many bundles on clone, as well as needing to
checkpresent too many bundles in verifyManifest. But it would need a
different name and description.
I hope to support importtree=yes eventually, but it does not currently
work.
Added remote.<name>.allow-encrypted-gitrepo that needs to be set to
allow using it with encrypted git repos.
Note that even encryption=pubkey uses a cipher stored in the git repo
to encrypt the keys stored in the remote. While it would be possible to
not encrypt the GITBUNDLE and GITMANIFEST keys, and then allow using
encryption=pubkey, it doesn't currently work, and that would be a
complication that I doubt is worth it.
And document remote.<name>.git-remote-annex-max-bundles which will
configure it.
datalad-annex uses a similar url format, but with some enhancements.
See https://github.com/datalad/datalad-next/blob/main/datalad_next/gitremotes/datalad_annex.py
I added the UUID to the URL, because it is needed in order to pick out which
manifest file to use. The design allows for a single key/value store to have
several special remotes all stored in it, and so the manifest includes
the UUID in its name.
While datalad-annex allows datalad-annex::<url>?, and allows referencing
peices of the url in the parameters, needing the UUID prevents
git-remote-annex from supporting that syntax. And anyway, it is a
complication and I want to keep things simple for now.
Sponsored-by: unqueued on Patreon
Added rclone special remote, which can be used without needing to install
the git-annex-remote-rclone program. This needs a new version of rclone,
which supports "rclone gitannex".
This is implemented as a variant of an external special remote, that
runs "rclone gitannex" instead of the usual git-annex-remote- command.
Parameterized Remote.External to support that.
Sponsored-by: Luke T. Shumaker on Patreon
What this can currently be used for is only to change an url from being
used by a special remote to being used by the web remote.
This could have been a --move-from option to registerurl. But, that would
have complicated its option and --batch processing, and also would have
complicated unregisterurl, which is implemented on top of
Command.Registerurl. So, a separate command was actually less complicated
to implement.
The generic description of the command is because I want to make this
command a catch-all for other url updating kind of things, if there are
ever any more. Also because it was hard to come up with a good name for the
specific action. I considered `git-annex moveurl`, but that seems to
indicate data is perhaps actually being moved, and seems to sit at the same
level as addurl and rmurl, and this command is at the plumbing
level of registerurl and unregisterurl.
Sponsored-by: Dartmouth College's DANDI project
The getSocket comment that mentioned using ":port"
in the hostname seems to have been incorrect or be out of date.
After all, the bug report came when the user first tried doing that,
and it didn't work.
Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
Refactored to allow offline experimentation, and ended up changing the
allowedvariation (aka fudge factor) to 3. 10 seems too high, and 1.5 too low.
Scale earlier, so even if the first chunk takes less than the configured
time period, allowance is made that later chunks might transfer slower.
Decided to use the same allowedvariation to decide when to start
scaling.
Smoothed the scaling out.
Some examples:
ghci> upscale (BwRate 10 (Duration 60)) 25
BwRate 13 (Duration {durationSeconds = 75})
-- A small scaling upwards after 1/3rd the time. Not noticable.
ghci> upscale (BwRate 10 (Duration 60)) 60
BwRate 30 (Duration {durationSeconds = 180})
-- At the configured time, 3x scaling.
ghci> upscale (BwRate 10 (Duration 60)) 120
BwRate 60 (Duration {durationSeconds = 360})
-- A typical upscaling, here a 1 minute duration became 6 minutes
-- due to the first chunk taking 2 minutes to transfer.
ghci> upscale (BwRate 10 (Duration 60)) 600
BwRate 300 (Duration {durationSeconds = 1800})
-- Here the first chunk took 10 minutes to transfer, so it will
-- take 30 minutes to detect a stall.
Sponsored-by: Dartmouth College's DANDI project
Improve annex.stalldetection to handle remotes that update progress less
frequently than the configured time period.
In particular, this makes remotes that don't report progress but are
chunked work when transferring a single chunk takes longer than the
specified time period.
Any remotes that just have very low update granulatity would also be
handled by this.
The change to Remote.Helper.Chunked avoids an extra progress update when
resuming an interrupted upload. In that case, the code saw first Nothing
and then Just the already transferred number of bytes, which defeated this
new heuristic. This change will mean that, when resuming an interrupted
upload to a chunked remote that does not do its own progress reporting, the
progress display does not start out displaying the amount sent so far,
until after the first chunk is sent. This behavior change does not seem
like a major problem.
About the scalefudgefactor, it seems reasonable to expect subsequent chunks
to take no more than 1.5 times as long as the first chunk to transfer.
Could set it to 1, but then any chunk taking a little longer would be
treated as a stall. 2 also seems a likely value. Even 10 might be fine?
Sponsored-by: Dartmouth College's DANDI project
Test a specified Stateless OpenPGP command with eg:
git-annex test --test-git-config annex.shared-sop-command=sqop
Also documented that config and another one, but so far only the test suite
uses the configs, have not yet implemented using it for actual symmetric
encryption.
Sponsored-by: Joshua Antonishen on Patreon
pull, sync: When operating on content, automatically hard link objects
that have been migrated.
Added annex.syncmigrations config that can be set to false to prevent
pull and sync from migrating object content.
I think that true is a good default for this config, because it avoids
users having to re-download migrated content or learning about migration.
But, some users will surely not like it, whether because it does take some
time (especially for the first git-annex branch scan when there is a long
history), or because they want to deal with it manually, or because their
filesystem doesn't support hard links and they don't want it to copy
objects.
Sponsored-by: k0ld on Patreon
Make git-annex get/copy/move --from foo override configuration of
remote.foo.annex-ignore, as documented.
This already worked for remotes supporting hasKeyCheap. For others though,
git-annex copy --from foo would silently not do anything, while
git-annex copy --to foo would use the annex-ignored remote.
Also improved the annex-ignore docs, to reflect that `git-annex get`
without --from will skip using annex-ignored remotes, for example.
Sponsored-by: Dartmouth College's DANDI project
git-annex get with no parameters and annex.skipunknown = false
in a directory with no files tracked by git results in the same
failure as with a "." parameter.
It may be that git ls-files --error-unmatch changed behavior? Or this
was just wrong.
The tricky thing about this turned out to be handling renames and reverts.
For that, it has to make two passes over the git log, and to avoid
buffering a possibly huge amount of logs in memory (ie the whole git log of
an entire repository!), runs git log twice.
(It might be possible to speed this up by asking git log to show a diff,
and so avoid needing to use catKey.)
Sponsored-By: Brock Spratlen on Patreon
Avoid using curl when annex.security.allowed-ip-addresses is set but
neither annex.web-options nor annex.security.allowed-url-schemes is set to
a value that needs curl.
Bug introduced in 840bd50390
Sponsored-By: Brock Spratlen on Patreon
This is groundwork for making special remotes like borg be skipped by
sync when on an offline drive.
Added AVAILABILITY UNAVAILABLE reponse and the UNAVAILABLERESPONSE extension
to the external special remote protocol. The extension is needed because
old git-annex, if it sees that response, will display a warning
message. (It does continue as if the remote is globally available, which
is acceptable, and the warning is only displayed at initremote due to
remote.name.annex-availability caching, but still it seemed best to make
this a protocol extension.)
The remote.name.annex-availability git config is no longer used any
more, and is documented as such. It was only used by external special
remotes to cache the availability, to avoid needing to start the
external process every time. Now that availability is queried as an
Annex action, the external is only started by sync (and the assistant),
when they actually check availability.
Sponsored-by: Nicholas Golder-Manning on Patreon
This ended up having an interface like sync, rather than like get/copy/drop.
That let it be implemented in terms of sync, which took a lot less code.
Also, it lets it handle many of the edge cases that sync does, such as
getting files that are not visible in a --hide-missing branch, and sending
files to exporttree remotes.
As well as being easier to implement, `git-annex satisfy myremote` makes
sense as it satisfies the preferred content settings of the remote.
`git-annex satisfy somefile` does not form a sentence that makes sense. So
while -C can be a little bit annoying, it still makes sense to have this
syntax.
Note that, while I initially thought this would also satisfy numcopies, it
does not. Arguably it ought to. But, sync does not send files in order to
satisfy numcopies, it only sends files to satisfy preferred content. And
it's important that this transfer the same files as sync does, because
it will probably be used in a workflow where the user sometimes syncs and
sometimes satisfies, and does not expect satisfy to do things that sync
would not do.
(Also opened a new bug that also affects sync et all, not only this command.)
Sponsored-by: Nicholas Golder-Manning on Patreon
This was already possible, but it was rather hard to come up with the
complex shell command needed.
Note that the diff output starts with "diff a/... b/...".
I left off the "--git" because it's not a git format diff.
I noticed git-annex was using a lot of CPU when downloading from youtube,
and was not displaying progress. Turns out that yt-dlp (and I think also
youtube-dl) sometimes only knows an estimated size, not the actual size,
and displays the progress output slightly differently for that. That broke
the parser. And, the parser was feeding chunks that failed to parse back
as a remainder, which caused it to try to re-parse the entire output each
time, so it got slower and slower.
Using --progress-template like this should avoid parsing problems as well
as future proof against output changes. But it will work with only yt-dlp.
So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.
git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in debian unstable having been replaced by
yt-dlp), made git-annex no longer try to parse youtube-dl's progress.
Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.
Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
particularly the estimated total size can be a float. That also does not
seem to be documented but I assume is a python thing?
Sponsored-by: Joshua Antonishen on Patreon
assist: New command, which is the same as git-annex sync but with
new files added and content transferred by default.
(Also this fixes another reversion in git-annex sync,
--commit --no-commit, and --message were not enabled, oops.)
See added comment for why git-annex assist does commit staged
changes elsewhere in the work tree, but only adds files under
the cwd.
Note that it does not support --no-commit, --no-push, --no-pull
like sync does. My thinking is, why should it? If you want that
level of control, use git commit, git annex push, git annex pull.
Sync only got those options because pull and push were not split
out.
Sponsored-by: k0ld on Patreon
Split out two new commands, git-annex pull and git-annex push. Those plus a
git commit are equivilant to git-annex sync.
In a sense, git-annex sync conflates 3 things, and it would have been
better to have push and pull from the beginning and not sync. Although
note that git-annex sync --content is faster than a pull followed by a
push, because it only has to walk the tree once, look at preferred
content once, etc. So there is some value in git-annex sync in speed, as
well as user convenience.
And it would be hard to split out pull and push from sync, as far as the
implementaton goes. The implementation inside sync was easy, just adjust
SyncOptions so it does the right thing.
Note that the new commands default to syncing content, unless
annex.synccontent is explicitly set to false. I'd like sync to also do
that, but that's a hard transition to make. As a start to that
transition, I added a note to git-annex-sync.mdwn that it may start to
do so in a future version of git-annex. But a real transition would
necessarily involve displaying warnings when sync is used without
--content, and time.
Sponsored-by: Kevin Mueller on Patreon