At this point the RepoSize database is getting populated, and it
all seems to be working correctly. Incremental updates still need to be
done to make it performant.
Including locking on creation, handling of permissions errors, and
setting repo sizes.
I'm confident that locking is not needed while using this database.
Since writes happen in a single transaction. When there are two writers
that are recording sizes based on different git-annex branch commits,
one will overwrite what the other one recorded. Which is fine, it's only
necessary that the database stays consistent with the content of a
git-annex branch commit.
Plan is to run this when populating Annex.reposizes on demand.
So Annex.reposizes will be up-to-date with the journal, including
crucially journal entries for private repositories. But also
anything that has been written to the journal by another process,
especially if the process was ran with annex.alwayscommit=false.
From there, Annex.reposizes can be kept up to date with changes made
by the running process.
This is very innefficient, it will need to be optimised not to
calculate the sizes of repos every time.
Also, fixed a bug in balancedPicker that caused it to pick a too high
index when some repos were excluded due to being full.
This deals with the possible security problem that someone could make an
unusually low UUID and generate keys that are all constructed to hash to
a number that, mod the number of repositories in the group, == 0.
So balanced preferred content would always put those keys in the
repository with the low UUID as long as the group contains the
number of repositories that the attacker anticipated.
Presumably the attacker than holds the data for ransom? Dunno.
Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs
combined together as the "secret", and the key as the "message". Now any
change in the set of UUIDs in a group will invalidate the attacker's
constructed keys from hashing to anything in particular.
Given that there are plenty of other things someone can do if they can
write to the repository -- including modifying preferred content so only
their repository wants files, and numcopies so other repositories drom
them -- this seems like safeguard enough.
Note that, in balancedPicker, combineduuids is memoized.
This all works fine. But it doesn't check repository sizes yet, and
without repository size checking, once a repository gets full, there
will be no other repository that will want its files.
Use of sha2 seems unncessary, probably alder2 or md5 or crc would have
been enough. Possibly just summing up the bytes of the key mod the number
of repositories would have sufficed. But sha2 is there, and probably
hardware accellerated. I doubt very much there is any security benefit
to using it though. If someone wants to construct a key that will be
balanced onto a given repository, sha2 is certianly not going to stop
them.
This removes versionedExport, which was only used by the S3 special
remote. Instead, versionedexport=yes is a common way for remotes to
indicate that they are versioned.
This is not perfect because it does not handle versioned special
remotes, which should not be untrustworthy, but now are when proxied.
The implementation turned out to be easy, because the exporttree field
is a default field, so is available in RemoteConfig even for git
remotes.
This handles cases where a single key is used by multiple files in the
exported tree. When using `git-annex push`, the key's content gets
stored in the annexobjects location, and then when the branch is pushed,
it gets renamed from the annexobjects location to the first exported
file. For subsequent exported files, a copy of the content needs to be
made. This causes it to download the key from the remote in order to
upload another copy to it.
This is not needed when using `git push` followed by `git-annex copy --to`
the proxied remote, because the received key is stored at all export
locations then.
Also, fixed handling of the synced branch push, it was exporting master
when synced/master was pushed.
Note that currently, the first push to the remote does not see that it
is able to get a key from it in order to upload it back. It displays
"(not available)". The second push is able to. Since git-annex push
pushes first the synced branch and then the branch, this does end up
with a full export being made, but it is not quite right.
This is similar to git-annex copy --from --to, in that it downloads a
local copy, locks it for removal, uploads it, and drops it. Removal of
the temporary local copy is done without verifying numcopies for the
same reason as that command.
I do wonder, looking at this, if there's a race where the local copy
gets used as a copy to allow some other drop in the narrow window after
it is downloaded and before it gets locked for removal. That would need
some other repository to have an out of date location log that says the
repository contains a copy of the key, in order for it to try to use it
as a copy. If there is such a race, git-annex copy/move would also be
vulnerable to it. It would be better to lock it for removal before
starting to download it! That is possible in v10 repositories, which do
use a separate content lock file.
Note that, when the exported tree contains several files that use the
same key, it will be downloaded repeatedly, once per time needed to
upload it. It would be possible to avoid that extra work, but it would
complicate this since the local copy would need to be preserved, locked
for removal, until the end. Also, that would mean that interrupting the
export would leave possibly a lot of temporarily downloaded keys in the
local repository, while currently it can only leave one.
Since git-annex sync sends the sync branch first, and only displays the
output of the push to the sync branch, this makes git-annex
post-retrieve's output when updating the exported tree be visible when
syncing.
This also makes syncing with a non-bare repository still update the
exported tree, even when the checked out branch is not able to be
updated. The sync branch gets sent regardless.
This handles the workflow where the branch is first pushed to the proxy,
and then files in the exported tree are later are copied to the proxied remote.
Turns out that the way the export log is structured, nothing needs
to be done to finalize the export once the last key is sent to it. Which
is great because that would have been a lot of complication. On
receiving the push, Command.Export runs and calls recordExportBeginning,
does as much as it can to update the export with the files currently
on it, and then calls recordExportUnderway. At that point, the
export.log records the export as "complete", but it's not really. And
that's fine. The same happens when using `git-annex export` when some
files are not available to send. Other repositories that have
access to the special remote can already retrieve files from it. As
the missing files get copied to the exported remote, all that needs
to be done is record each in the export db.
At this point, proxying to exporttree=yes annexobjects=yes special remotes
is fully working. Except for in the case where multiple files in the
tree use the same key, and the files are sent to the proxied remote
before pushing the tree.
It seems that even special remotes without annexobjects=yes will work if
used with the workflow where the git-annex branch is pushed before
copying files. But not with the `git-annex push` workflow.
It works when using git-annex sync/push/assist, or when manually sending
all content to the proxied remote before pushing to the proxy remote.
But when the push comes before the content is sent, sending content does
not update the exported tree.
(When possible, of course it may not be there, or it may get renamed from
there for another exported file first. Or the remote may not support
renames.)
This will avoids redundant uploads.
An example case where this is important: Proxying to a exporttree remote,
a file is uploaded to it but is not yet in an exported tree. When the
exported tree is pushed, the remote needs to be updated by exporting to
it. In this case, the proxy doesn't have a copy of the file, so it would
need to download it from annexobjects before uploading it to the final
location. With this optimisation, it can just rename it.
However: If a key is used twice in an exported tree, it seems a proxy
will need to download and reupload anyway. Unless a copy operation is
added to exporttree remotes..
This avoids needing to re-upload the file again to get it to the
annexobjects location, which git-annex sync was doing when it was
preferred content.
If the file is not preferred content, sync will drop it from the
annexobjects location.
If the file has been deleted from the tree, it will remain in the
annexobjects location until an unused/dropunused pass is done.
Decided not to use the annexobjects location for exportTempName.
There doesn't seem to be any actual benefit to doing that, because an
export that renames to exportTempName always renames it back from that
to another location.
Also the annexobjects directory won't actually help with the paired
rename issue.
The file in the annexobjects location may have been renamed from a
previously exported file that got deleted in a subsequent export.
Or it may be renamed to annexobjects temporarily before being renamed to
another name (to handle eg pairwise renames).
But, an exported file is not guaranteed to contain the content of the
key that the local repository last exported there. Another tree could
have been exported from elsewhere in the meantime.
So, files in annexobjects do not necessarily have the content of their
key. And so have to be strongly verified when retrieving. The same as
is done when retrieving exported files.
While usually uploading to a special remote does not verify the content,
the content in a repository is assumed to be valid, and there is no trust
boundary. But with a proxied special remote, there may be users who are
allowed to store objects, but are not really trusted.
Another way to look at this is it's the equivilant of git-annex-shell
checking the hash of received data, which it does (see StoreContent
implementation).