Handled limitCopies, as well as everything using fromNumCopies and
fromMinCopies.
This should be everything, probably.
Note that, git-annex info displays a count of repositories, which still
includes clusters. I think that's ok. It would be possible to filter out
clusters there, but to the user they're pretty much just another
repository. The numcopies displayed by eg `git-annex info .` does not
include clusters.
This is to avoid inserting a cluster uuid into the location log when
only dead nodes in the cluster contain the content of a key.
One reason why this is necessary is Remote.keyLocations, which excludes
dead repositories from the list. But there are probably many more.
Implementing this was challenging, because Logs.Location importing
Logs.Cluster which imports Logs.Trust which imports Remote.List resulted
in an import cycle through several other modules.
Resorted to making Logs.Location not import Logs.Cluster, and instead
it assumes that Annex.clusters gets populated when necessary before it's
called.
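
Roughly what Logs.Location does with that state, as a minimal sketch
with stand-in types (not the real modules; it relies on dead nodes
being left out when Annex.clusters is populated):

    import qualified Data.Map as M
    import qualified Data.Set as S

    type UUID = String
    newtype ClusterUUID = ClusterUUID UUID deriving (Eq, Ord)

    -- Annex.clusters, populated before Logs.Location runs: cluster
    -- uuid to the uuids of its nodes, with dead nodes left out.
    type Clusters = M.Map ClusterUUID (S.Set UUID)

    -- Add each cluster's uuid to the location log's uuids only when
    -- some live node of the cluster contains the key.
    addClusterUUIDs :: Clusters -> S.Set UUID -> S.Set UUID
    addClusterUUIDs clusters locs = S.union locs $ S.fromList
        [ u
        | (ClusterUUID u, nodes) <- M.toList clusters
        , not (S.null (S.intersection nodes locs))
        ]
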
Populating it is done in Annex.Startup, which is run by the git-annex
command (but not other commands) at early startup in initialized repos,
or otherwise just after initialization.
Note that Remote.Git is unable to import Annex.Startup, because
Remote.Git importing Logs.Cluster leads to the same import cycle.
So ensureInitialized is not passed annexStartup there.
Other commands, like git-annex-shell currently don't run annexStartup
either.
So there are cases where Logs.Location will not see clusters, and so won't
add any cluster UUIDs when loading the log. That's ok, the only reason to do
that is to make display of where objects are located include clusters,
and to make commands like git-annex get --from treat keys as being located
in a cluster. git-annex-shell certainly does not do anything like that,
and I'm pretty sure Remote.Git (and callers to Remote.Git.onLocalRepo)
don't either.
One benefit of this is that a typo in annex-cluster-node config won't
init a new cluster.
Also it gets the cluster description set and is consistent with
initremote.
Not used yet. (Or tested.)
I did consider making the log start with the uuid of the node, followed
by the cluster uuid (or uuids). That would perhaps mean a smaller write
to the git-annex branch when adding a node, but overall the log file
would be larger, and it will be read and cached near to startup on most
git-annex runs.
These remotes have no url configured, so git pull and push will fail.
git-annex sync --content etc can still sync with them otherwise.
Also, avoid git syncing twice with the same url. This is for cases where
a proxied remote has been manually configured and so does have a url.
Or perhaps proxied remotes will get configured like that automatically
later.
This does mean a redundant write to the git-annex branch. But,
it means that two clients can be using the same proxy, and after
one sends a file to a proxied remote, the other only has to pull from
the proxy to learn about that. It does not need to pull from every
remote behind the proxy (which it couldn't do anyway as git repo
access is not currently proxied).
Anyway, the overhead of this in git-annex branch writes is no worse
than eg, sending a file to a repository where git-annex assistant
is running, which then sends the file on to a remote, and updates
the git-annex branch then. Indeed, when the assistant also drops
the local copy, that results in more writes to the git-annex branch.
CONNECT is not supported by git-annex-shell p2pstdio, but for proxying
to tor-annex remotes, it will be supported, and will make a git pull/push
to a proxied remote work the same with that as it does over ssh,
eg it accesses the proxy's git repo not the proxied remote's git repo.
The p2p protocol docs say that NOTIFYCHANGES is not always supported,
and it looked annoying to implement it for this, and it also seems
pretty useless, so make it be a protocol error. git-annex remotedaemon
will already be getting change notifications from the proxy's git repo,
so there's no need to get additional redundant change notifications for
proxied remotes that would be for changes to the same git repo.
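
On the proxy side that amounts to something like this sketch, with
made-up message and response types rather than the actual P2P protocol
module:

    data Message = NOTIFYCHANGES | CONNECT String | Other String

    data Response = ProtocolError String | Relay Message

    -- NOTIFYCHANGES is refused rather than relayed: remotedaemon is
    -- already watching the proxy's git repo, and the proxied remotes
    -- share that same git repo, so relaying would only duplicate
    -- notifications.
    proxyMessage :: Message -> Response
    proxyMessage NOTIFYCHANGES =
        ProtocolError "NOTIFYCHANGES not supported by proxy"
    proxyMessage m = Relay m
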
Prevent listProxied from listing anything when the proxy remote's
url is a local directory. Proxying does not work in that situation,
because the proxied remotes have the same url, and so git-annex-shell
is not run when accessing them, instead the proxy remote is accessed
directly.
I don't think there is any good way to support this. Even if the instantiated
git repos for the proxied remotes somehow used an url that caused it to use
git-annex-shell to access them, planned features like `git-annex copy --to
proxy` accepting a key and sending it on to nodes behind the proxy would not
work, since git-annex-shell is not used to access the proxy.
So it would need to use something to access the proxy that causes
git-annex-shell to be run and speaks P2P protocol over it. And we have that.
It's a ssh connection to localhost. Of course, it would be possible to
take ssh out of that mix, and swap in something that does not have
encryption overhead and authentication complications, but otherwise
behaves the same as ssh. And if the user wants to do that, GIT_SSH
does exist.
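
The guard itself is essentially this sketch, where isLocalDirectoryUrl
is a crude stand-in for git's own url classification:

    import Data.List (isPrefixOf)

    -- When the proxy's url is a local directory, git accesses the
    -- repo directly and git-annex-shell never runs, so no proxied
    -- remotes can be listed.
    listProxied :: String -> [a] -> [a]
    listProxied proxyurl proxied
        | isLocalDirectoryUrl proxyurl = []
        | otherwise = proxied

    isLocalDirectoryUrl :: String -> Bool
    isLocalDirectoryUrl u =
        not $ any (`isPrefixOf` u) ["ssh://", "http://", "https://"]
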
This just happened to work correctly. Rather surprisingly. It turns out
that openP2PSshConnection actually also supports local git remotes,
by just running git-annex-shell with the path to the remote.
Renamed "P2PSsh" to "P2PShell" to make this clear.
The almost identical code duplication between relayDATA and relayDATA'
is very annoying. I tried quite a few things to parameterize them, but
the type checker is having fits when I try it.
Memory use is small and constant; receiveBytes returns a lazy bytestring
and it does stream.
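
The relay is then essentially just this (sketch; from and to stand in
for the two protocol handles), and lazy ByteString IO is what keeps
memory constant:

    import qualified Data.ByteString.Lazy as L
    import System.IO (Handle)

    -- Chunks are read from one handle, written to the other, and
    -- then garbage collected; nothing accumulates.
    relay :: Handle -> Handle -> IO ()
    relay from to = L.hGetContents from >>= L.hPut to
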
Comparing speed of a get of a 500 mb file over proxy from origin-origin,
vs from the same remote over a direct ssh:
joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from origin-origin
get bigfile (from origin-origin...)
ok
(recording state in git...)
1.89user 0.67system 0:10.79elapsed 23%CPU (0avgtext+0avgdata 68716maxresident)k
0inputs+984320outputs (0major+10779minor)pagefaults 0swaps
joey@darkstar:~/tmp/bench/client>/usr/bin/time git-annex get bigfile --from direct-ssh
get bigfile (from direct-ssh...)
ok
1.79user 0.63system 0:10.49elapsed 23%CPU (0avgtext+0avgdata 65776maxresident)k
0inputs+1024312outputs (0major+9773minor)pagefaults 0swaps
So the proxy doesn't add much overhead even when run on the same machine as
the client and remote.
Still, piping receiveBytes into sendBytes like this does suggest that the proxy
could be made to use less CPU resources by using `sendfile()`.
getRepoUUID looks at that, and was seeing the annex.uuid of the proxy.
Which caused it to unnecessarily set the git config. Probably also would
have led to other problems.
For NotifyChanges and also for the fallthrough case where
git-annex-shell passes a command off to git-shell, proxying is currently
ignored. So every remote that is accessed via a proxy will be treated as
the same git repository.
Every other command listed in cmdsMap will need to check if
Annex.proxyremote is set, and if so handle the proxying appropriately.
Probably only P2PStdio will need to support proxying. For now,
everything else refuses to work when proxying.
The part of that I don't like is that there's the possibility a command
later gets added to the list that doesn't check proxying.
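
One way to remove that possibility would be to route every entry in the
map through a wrapper like this (sketch; ProxyPolicy and checkProxy are
made-up names, and the Maybe stands in for Annex.proxyremote):

    data ProxyPolicy = ProxyAllowed | ProxyRefused

    -- Each command declares its policy, so one cannot be added to
    -- the map without deciding how it handles proxying.
    checkProxy :: ProxyPolicy -> Maybe remotename -> IO () -> IO ()
    checkProxy ProxyAllowed _ a = a
    checkProxy ProxyRefused Nothing a = a
    checkProxy ProxyRefused (Just _) _ =
        ioError (userError "command does not support proxying")
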
When proxying is not enabled, it's important that git-annex-shell not
leak information that it would not have exposed before. Such as the
names or uuids of remotes.
I decided that, in the case where a repository used to have proxying
enabled, but no longer supports any proxies, it's ok to give the user a
clear error message indicating that proxying is not configured, rather
than a confusing uuid mismatch message.
Similarly, if a repository has proxying enabled, but not for the
requested repository, give a clear error message.
A tricky thing here is how to handle the case where there is more than
one remote, with proxying enabled, with the specified uuid. One way to
handle that would be to plumb the proxyRemoteName all the way through
from the remote git-annex to git-annex-shell, eg as a field, and use
only a remote with the same name. That would be very intrusive though.
Instead, I decided to let the proxy pick which remote it uses to access
a given Remote. And so it picks the least expensive one.
The client after all doesn't necessarily know any details about the
proxy's configuration. This does mean though, that if the least
expensive remote is not accessible, but another remote would have
worked, an access via the proxy will fail.
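
So the pick is simply by cost, as in this sketch with a pared-down
remote type:

    import Data.List (minimumBy)
    import Data.Ord (comparing)

    data Remote = Remote { remoteUUID :: String, remoteCost :: Int }

    -- Among the proxy's remotes with the requested uuid, use the
    -- least expensive one.
    pickRemote :: String -> [Remote] -> Maybe Remote
    pickRemote u rs = case filter ((== u) . remoteUUID) rs of
        [] -> Nothing
        matches -> Just (minimumBy (comparing remoteCost) matches)
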
They could be missing due to an interrupted git-annex at just the wrong
time during a prior graft, after which the tree objects got garbage
collected.
Or they could be missing because of manual messing with the git-annex
branch, eg resetting it to back before the graft commit.
Sponsored-by: Dartmouth College's OpenNeuro project
Fix a bug where interrupting git-annex while it is updating the git-annex
branch could lead to git fsck complaining about missing tree objects.
Interrupting git-annex while regraftexports is running in a transition
that is forgetting git-annex branch history would leave the
repository with a git-annex branch that did not contain the tree shas
listed in export.log. That lets those trees be garbage collected.
A subsequent run of the same transition then regrafts the trees listed
in export.log into the git-annex branch. But those trees have been lost.
Note that both sides of `if neednewlocalbranch` are atomic now. I had
thought only the True side needed to be, but I do think there may be
cases where the False side needs to be as well.
Sponsored-by: Dartmouth College's OpenNeuro project
Using the usual url download machinery even allows these urls to require
http basic auth, which is prompted for with git-credential. Which opens
the possibility for urls that contain a secret to be used, eg the cipher
for encryption=shared. Although the user is currently on their own
constructing such an url, I do think it would work.
Limited to httpalso for now, for security reasons. Since both httpalso
and the retrieval of this very url are limited by the usual
annex.security.allowed-ip-addresses configs, it's not possible for an
attacker to make one of these urls that sets up a httpalso url that
opens the garage door. Which is one class of attacks to keep in mind
with this thing.
It seems that there could be either a git-config that allows other types
of special remotes to be set up this way, or special remotes could
indicate when they are safe. I do worry that the git-config would
encourage users to set it without thinking through the security
implications. One remote config might be safe to access this way, but
another config, for one with the same type, might not be. This will need
further thought, and real-world examples to decide what to do.
And use it to set annex-config-uuid in git config. This makes
using the origin special remote work after cloning.
Without the added Logs.Remote.configSet, instantiating the remote will
look at the annex-config-uuid's config in the remote log, which will be
empty, and so it will fail to find a special remote.
The added deletion of files in the alternatejournaldir is just to make
100% sure they don't get committed to the git-annex branch, now that
they contain things that definitely should not be committed.
An incremental push that gets converted to a full push due to this
config results in the inManifest having just one bundle in it, and the
outManifest listing every other bundle. So it actually takes up more
space on the special remote. But, it speeds up clone and fetch to not
have to download a long series of bundles for incremental pushes.
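
In manifest terms, what this config does on a push is roughly this
(sketch; the field names are illustrative):

    data Manifest = Manifest
        { inManifest :: [String]   -- bundle keys making up the push
        , outManifest :: [String]  -- bundle keys queued for deletion
        }

    -- A full push supersedes every bundle of the old manifest, so
    -- the new manifest lists only the one new bundle, and everything
    -- else moves to the outManifest.
    fullPush :: String -> Manifest -> Manifest
    fullPush newbundle old = Manifest
        { inManifest = [newbundle]
        , outManifest = inManifest old ++ outManifest old
        }
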
When building an adjusted unlocked branch, make pointer files executable
when the annex object file is executable.
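
A sketch of the mode handling, unix-only and with an illustrative
name; the real code applies this while building the adjusted tree:

    import System.Posix.Files
    import System.Posix.Types (FileMode)

    -- The pointer file gets execute bits when the annex object file
    -- has the owner execute bit set.
    pointerFileMode :: FilePath -> IO FileMode
    pointerFileMode objectfile = do
        mode <- fileMode <$> getFileStatus objectfile
        return $ if mode `intersectFileModes` ownerExecuteMode /= nullFileMode
            then foldl unionFileModes stdFileMode
                [ownerExecuteMode, groupExecuteMode, otherExecuteMode]
            else stdFileMode
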
This slows down git-annex adjust --unlock/--unlock-present by needing to
stat all annex object files in the tree. Probably not a significant
slowdown compared to other work they do, but I have not benchmarked.
I chose to leave git-annex adjust --unlock marked as stable, even though
get or drop of an object file can change whether it would make the pointer
file executable. Partly because making it unstable would slow down
re-adjustment, and partly for symmetry with the handling of an unlocked
pointer file that is executable when the content is dropped, which does not
remove its execute bit.
cleanupInitialization gets run when an exception is thrown, so needs to
avoid throwing exceptions itself, as that would hide the error message
that the user needs to see.
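
So it's wrapped along these lines (sketch; removeInitFiles stands in
for the real work, and the real code would also want to avoid catching
async exceptions):

    import Control.Exception (SomeException, try)
    import Control.Monad (void)

    -- Swallow any exception from the cleanup, so the exception that
    -- triggered it remains the one the user sees.
    cleanupInitialization :: IO ()
    cleanupInitialization =
        void (try removeInitFiles :: IO (Either SomeException ()))
      where
        removeInitFiles = return () -- stand-in for the real cleanup
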
When exporttree=yes is also set. Probably it would also be possible to
support ones with only importtree=yes, by enabling exporttree=yes for
the remote only when using git-remote-annex, but let's keep this
simple... I'm not sure what gets recorded in .git/annex/ state
differently in the two cases that might cause a problem when doing that.
Note that the full annex:: urls generated and displayed for such a
remote omit the importtree=yes. Which is ok; cloning from such an url
uses an exporttree=yes remote, but the git-annex branch doesn't get written
by this program, so once the real config is available from the git-annex
branch, it will still function as an importtree=yes remote.
This git bug also broke git-lfs, and I am confident it will be reverted
in the next release.
For now, cloning from an annex:: url wastes some bandwidth on the next
pull by not caching bundles locally.
If git doesn't fix this in the next version, I'd be tempted to rethink
whether bundle objects need to be cached locally. It would be possible to
instead remember which bundles have been seen and their heads, and
respond to the list command with the heads, and avoid unbundling them
again in fetch. This might even be a useful performance improvement in
the latter case. It would be quite a complication to a currently simple
implementation though.
This fixes pushing a new ref that is the same as something already
pushed. In findotherprereq, it compares two shas, which didn't work when
one is actually not a sha but a ref.
This is one of those cases where Sha being an alias for Ref makes it
hard to catch mistakes. One of these days those need to be
differentiated at the type level, but not today.
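
Sketched, the differentiation would be:

    -- Currently type Sha = Ref, so any Ref typechecks where a
    -- resolved Sha is expected. Distinct newtypes would make the
    -- findotherprereq mixup a compile error instead.
    newtype Ref = Ref String
    newtype Sha = Sha String

    -- A comparison like findotherprereq's could then only be given
    -- values that really are shas.
    sameSha :: Sha -> Sha -> Bool
    sameSha (Sha a) (Sha b) = a == b
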
Check explicitly for an annex:: url, not just any url. While no built-in
special remotes set an url, except ones that can be synced with, it
seems possible that some external special remote sets an url for its own
use, but did not expect it to be used by git-annex sync et al.
The assistant also syncs with them.
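
Ie, the check is this narrow (sketch):

    import Data.List (isPrefixOf)

    -- Only urls with the annex:: scheme belong to git-remote-annex.
    isAnnexUrl :: String -> Bool
    isAnnexUrl = ("annex::" `isPrefixOf`)
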
Locally record the manifest before uploading it or any bundles,
and read it on the next push. Any bundles from the push that are
not included in the currently being pushed manifest will get added
to the outManifest, and so eventually get deleted.
This deals with an interrupted push that is not resumed and instead
something else is pushed. And it deals with a push race that overwrites
the manifest.
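
So the push merges the locally recorded manifest into the one being
uploaded, roughly like this (sketch; field names are illustrative):

    data Manifest = Manifest
        { inManifest :: [String]
        , outManifest :: [String]
        }

    -- Any bundle the previously recorded push uploaded that this
    -- push does not include gets queued for deletion.
    mergePreviousPush :: Manifest -> Manifest -> Manifest
    mergePreviousPush prev new = new
        { outManifest = outManifest new ++
            filter (`notElem` inManifest new) (inManifest prev)
        }
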
Of course, this can't help if one of those situations is followed by
the local repo being deleted. But that's equivalent to doing a git-annex
copy of a new annexed file to a special remote and then deleting the
special repo w/o pushing. In either case the special remote ends up with
an object in it that git-annex doesn't know about.
This avoids some apparently otherwise unsolvable problems involving
races that resulted in the manifest listing bundles that were deleted.
Removed the annex-max-git-bundles config because it can't actually
result in deleting old bundles. It would still be possible to have a
config that controls how often to do a full push, which would avoid
needing to download too many bundles on clone, as well as needing to
checkpresent too many bundles in verifyManifest. But it would need a
different name and description.
Added a backup manifest key, which is used if the main manifest key is
not present. When uploading a new Manifest, it makes sure that it never
drops one key except when the other key is present.
It's entirely possible for the two manifest keys to get out of sync, due
to races. The main one wins when it's present, it is possible for the
main one being dropped to expose the backup one, which has a different
push recorded.
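
Sketch of the discipline (the function names are stand-ins, and the
write ordering shown is just one way to maintain the invariant):

    -- Reading: the main manifest key wins when present; the backup
    -- is only consulted when the main one is missing.
    readManifest :: IO (Maybe a) -> IO (Maybe a) -> IO (Maybe a)
    readManifest getmain getbackup =
        getmain >>= maybe getbackup (return . Just)

    -- Writing: upload the backup copy first and only then replace
    -- the main key, so at every point at least one of the two
    -- manifest keys is present.
    writeManifest :: IO () -> IO () -> IO ()
    writeManifest putbackup putmain = putbackup >> putmain
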
On push, first try to drop all outManifest keys listed in the current
manifest file, which resumes from an interrupted push that didn't
get a chance to delete those keys.
The new manifest gets its outManifest populated with the keys that were
in the old manifest, plus any of the keys that were unable to be
dropped.
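
Putting that together, the sequence is roughly this sketch, shown for
the full-push case where the old bundles are all superseded; dropkey
reports whether a drop succeeded:

    import Control.Monad (filterM)

    data Manifest = Manifest
        { inManifest :: [String]
        , outManifest :: [String]
        }

    pushUpdate :: (String -> IO Bool) -> Manifest -> [String] -> IO Manifest
    pushUpdate dropkey old newbundles = do
        -- Retry the drops queued by the current manifest; drops that
        -- fail stay queued.
        undropped <- filterM (fmap not . dropkey) (outManifest old)
        return $ Manifest
            { inManifest = newbundles
            , outManifest = inManifest old ++ undropped
            }
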
Note that it would be possible for uploadManifest to skip dropping old
keys at all. The old keys would get dropped on the next push. But it
seems better to delete stuff immediately rather than waiting. And the
extra work is limited to push and typically is small.
A remote where dropKey always fails will result in an outManifest that
grows longer and longer. It would be possible to check if the remote
has appendonly = True and avoid populating the outManifest. Of course,
an appendonly remote will grow with every git push anyway. And currently
only Remote.GitLFS sets that, which can't be used as a git-remote-annex
remote anyway.
Implemented alternateJournal, which git-remote-annex
uses to avoid any writes to the git-annex branch while setting up
a special remote from an annex:: url.
That prevents the remote.log from being overwritten with the special
remote configuration from the url, which might not be 100% the same as
the existing special remote configuration.
And it prevents an overwrite from deleting other stuff that was
already in the remote.log.
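
Conceptually the redirection is just this (sketch; the paths are
illustrative):

    -- While git-remote-annex runs, journal writes land in a
    -- directory that is never committed to the git-annex branch.
    data JournalDir = GitAnnexJournal | AlternateJournal FilePath

    journalPath :: JournalDir -> FilePath -> FilePath
    journalPath GitAnnexJournal f = ".git/annex/journal/" ++ f
    journalPath (AlternateJournal d) f = d ++ "/" ++ f
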
Also, when the branch was created by git-remote-annex, only delete it
at the end if nothing else has been written to it by another command.
This fixes the race condition described in
797f27ab05, where git-remote-annex
set up the branch and git-annex init and other commands were
run at the same time and their writes to the branch were lost.