This way, the temp files that might be left due to failure will be
cleaned up next time.
Also, nub the list of incomplete exports to avoid repeatedly adding the
same tree to it when running export repeatedly when it's failing.
This commit was supported by the NSF-funded DataLad project.
This basically works, but there's a bug when renaming a file that leaves
a .git-annex-temp-content-key file in the webdav store, that never gets
cleaned up.
Also, exporting files with spaces to box.com seems to fail; perhaps it
does not support it?
This commit was supported by the NSF-funded DataLad project.
When an export was interrupted, the sqlite database won't have been
committed necessarily. Also, the interrupted export might have been
run in an entirely different repository. There's not a significant speed
benefit in checking getExportLocation in this case anyway, so avoid it.
Also, remove the old filename from the export database.
Recovery from interrupted exports is now tested working.
This commit was supported by the NSF-funded DataLad project.
Only rename when actually ncessary.
The diff gets buffered in memory. Probably git has to buffer a diff in
memory when generating it as well, so this memory usage should not be a
problem, even when the diff is very large. I hope.
This commit was supported by the NSF-funded DataLad project.
Don't allow "exporttree=yes" to be set when the special remote
does not support exports. That would be confusing since the user would
set up a special remote for exports, but `git annex export` to it would
later fail.
This commit was supported by the NSF-funded DataLad project.
This is seriously super hairy. It has to handle interrupted exports,
which may be resumed with the same or a different tree. It also has to
recover from export conflicts, which could cause the wrong content
to be renamed to a file.
I think this works, or is close to working. See the update to the design
for how it works.
This is definitely not optimal, in that it does more renames than are
necessary. It would probably be worth finding the keys that are really
renamed and only renaming those. But let's get the "simple" approach to
work first..
This commit was supported by the NSF-funded DataLad project.
Not yet used, but essential for resuming cleanly.
Note that, in normmal operation, only one commit is made to export.log
during an export; the incomplete version only gets to the journal and
is then overwritten.
This commit was supported by the NSF-funded DataLad project.
Removed uncorrect UniqueKey key in db schema; a key can appear multiple
times with different files.
The database has to be flushed after each removal. But when adding files
to the export, lots of changes are able to be queued up w/o flushing.
So it's still fairly efficient.
If large removals of files from exports are too slow, an alternative
would be to make two passes over the diff, one pass queueing deletions
from the database, then a flush and the a second pass updating the
location log. But that would use more memory, and need to look up
exportKey twice per removed file, so I've avoided such optimisation yet.
This commit was supported by the NSF-funded DataLad project.
Went with a separate db per export remote, rather than a single export
database. Mostly because there will probably not be a lot of separate
export remotes, and it might be convenient to be able to delete a given
remote's export database.
This commit was supported by the NSF-funded DataLad project.
* Only export to remotes that were initialized to support it.
* Prevent storing key/value on export remotes.
* Prevent enabling exporttree=yes and encryption in the same remote.
SetupStage Enable was changed to take the old RemoteConfig.
This allowed only setting exporttree when initially setting up a
remote, and not configuring it later after stuff might already be stored
in the remote.
Went with =yes rather than =true for consistency with other parts of
git-annex. Changed docs accordingly.
This commit was supported by the NSF-funded DataLad project.
This will allow disabling exports for remotes that are not configured to
allow them. Also, exportSupported will be useful for the external
special remote to probe.
This commit was supported by the NSF-funded DataLad project
So it will be available later and elsewhere, even after GC.
I first though to use git update-index to do this, but feeding it a line
with a tree object seems to always cause it to generate a git subtree
merge. So, fell back to using the Git.Tree interface to maniupulate the
trees, and not involving the git-annex branch index file at all.
This commit was sponsored by Andreas Karlsson.
Make a pass over the whole exported tree, and upload anything that has
not yet reached the export. Update location log when exporting.
Note that the synthesized keys for non-annexed files are stored in the
location log too.
Some cases involving files in the tree with the same content are not
handled correctly yet.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Added remote configuration settings annex-ignore-command and
annex-sync-command, which are dynamic equivilants of the annex-ignore
and annex-sync configurations.
For this I needed a new DynamicConfig infrastructure. Its implementation
should be as fast as before when there is no dynamic config, and it caches
so shell commands are only run once.
Note that annex-ignore-command exits nonzero when the remote should be ignored.
While that may seem backwards, it allows using the same command for it as
for annex-sync-command when you want to disable both.
This commit was sponsored by Trenton Cronholm on Patreon.
Can be used to override the default timestamps used in log files in the
git-annex branch. This is a dangerous environment variable; use with
caution.
Note that this only affects writing to the logs on the git-annex branch.
It is not used for metadata in git commits (other env vars can be set for
that).
There are many other places where timestamps are still used, that don't
get committed to git, but do touch disk. Including regular timestamps
of files, and timestamps embedded in some files in .git/annex/, including
the last fsck timestamp and timestamps in transfer log files.
A good way to find such things in git-annex is to get for getPOSIXTime and
getCurrentTime, although some of the results are of course false positives
that never hit disk (unless git-annex gets swapped out..)
So this commit does NOT necessarily make git-annex comply with some HIPPA
privacy regulations; it's up to the user to determine if they can use it in
a way compliant with such regulations.
Benchmarking: It takes 0.00114 milliseconds to call getEnv
"GIT_ANNEX_VECTOR_CLOCK" when that env var is not set. So, 100 thousand log
files can be written with an added overhead of only 0.114 seconds. That
should be by far swamped by the actual overhead of writing the log files
and making the commit containing them.
This commit was supported by the NSF-funded DataLad project.
* Added annex.resolvemerge configuration, which can be set to false to
disable the usual automatic merge conflict resolution done by git-annex
sync and the assistant.
* sync: Added --no-resolvemerge option.
Note that disabling merge conflict resolution is probably not a good idea
in a direct mode repo or adjusted branch. Since updates to both are done
outside the usual work tree, if it fails the tree is not left in a
conflicted state, and it would be hard to manually resolve the conflict.
Still, made annex.resolvemerge be supported in those cases for consistency.
This commit was sponsored by Riku Voipio.
* move --to=here moves from all reachable remotes to the local repository.
The output of move --from remote is changed slightly, when the remote and
local both have the content. It used to say:
move foo ok
Now:
move foo (from theremote...) ok
That was done so that, when move --to=here is used and the content is
locally present and also in several remotes, it's clear which remotes the
content gets dropped from.
Note that move --to=here will report an error if a non-reachable remote
contains the file, even if the local repository also contains the file. I
think that's reasonable; the user may be intending to move all other copies
of the file from remotes.
OTOH, if a copy of the file is believed to be present in some repository
that is not a configured remote, move --to=here does not report an error.
So a little bit inconsistent, but erroring in this case feels wrong.
copy --to=here came along for free, but it's basically the same behavior as
git-annex get, and probably with not as good messages in edge cases
(especially on failure), so I've not documented it.
This commit was sponsored by Anthony DeRobertis on Patreon.
Reworked remote name parsing to allow things like that. Command.Move
uses it for --to=here, although there's not yet an implementation of
that option.
This commit was sponsored by Ignacio on Patreon.
Removed dependency on MissingH, instead depending on the split
library.
After laying groundwork for this since 2015, it
was mostly straightforward. Added Utility.Tuple and
Utility.Split. Eyeballed System.Path.WildMatch while implementing
the same thing.
Since MissingH's progress meter display was being used, I re-implemented
my own. Bonus: Now progress is displayed for transfers of files of
unknown size.
This commit was sponsored by Shane-o on Patreon.
It takes a single key-value backend, rather than the unncessary and confusing list.
The old option still works if set.
Simplified some old old code too.
This commit was sponsored by Thomas Hochstein on Patreon.
Unlike git add -u, git annex add -u does not update the index for files
removed from the working tree. But then, "git add ." stages removals,
and "git annex add ." does not, so that's an existing divergence.
Seems that --update --batch would need to run git ls-files once per line of
batch input, which would surely be too slow, so just throw an error for
that.
This commit was supported by the NSF-funded DataLad project.
They were silently ignored, a reversion introduced in 6.20160527.
I don't like this regular git remote special case in enableremote, but I
can't see a way to get rid of it. So, check if the existing remote is
a Remote.Git
This commit was sponsored by Trenton Cronholm on Patreon.
The former can be useful to make remotes that don't get fully synced with
local changes, which comes up in a lot of situations.
The latter was mostly added for symmetry, but could be useful (though less
likely to be).
Implementing `remote.<name>.annex-pull` was a bit tricky, as there's no one
place where git-annex pulls/fetches from remotes. I audited all
instances of "fetch" and "pull". A few cases were left not checking this
config:
* Git.Repair can try to pull missing refs from a remote, and if the local
repo is corrupted, that seems a reasonable thing to do even though
the config would normally prevent it.
* Assistant.WebApp.Gpg and Remote.Gcrypt and Remote.Git do fetches
as part of the setup process of a remote. The config would probably not
be set then, and having the setup fail seems worse than honoring it if it
is already set.
I have not prevented all the code that does a "merge" from merging branches
from remotes with remote.<name>.annex-pull=false. That could perhaps
be done, but it would need a way to map from branch name to remote name,
and the way refspecs work makes that hard to get really correct. So if the
user fetches manually, the git-annex branch will get merged, for example.
Anther way of looking at/justifying this is that the setting is called
"annex-pull", not "annex-merge".
This commit was supported by the NSF-funded DataLad project.
They are handled close the same as they are by git. However, unlike git,
git-annex sometimes needs to pass the -n parameter when using these.
So, this has the potential for breaking some setup, and perhaps there ought
to be a ANNEX_USE_GIT_SSH=1 needed to use these. But I'd rather avoid that
if possible, so let's see if anyone complains.
Almost all places where "ssh" was run have been changed to support the env
vars. Anything still calling sshOptions does not support them. In
particular, rsync special remotes don't. Seems that annex-rsync-transport
already gives sufficient control there.
(Fixed in passing: Remote.Helper.Ssh.toRepo used to extract
remoteAnnexSshOptions and pass them to sshOptions, which was redundant
since sshOptions also extracts those.)
This commit was sponsored by Jeff Goeke-Smith on Patreon.
* init: When annex.securehashesonly has been set with git-annex config,
copy that value to the annex.securehashesonly git config.
* config --set: As well as setting value in git-annex branch,
set local gitconfig. This is needed especially for
annex.securehashesonly, which is read only from local gitconfig and not
the git-annex branch.
doc/todo/sha1_collision_embedding_in_git-annex_keys.mdwn has the
rationalle for doing it this way. There's no perfect solution; this
seems to be the least-bad one.
This commit was supported by the NSF-funded DataLad project.
Cryptographically secure hashes can be forced to be used in a repository,
by setting annex.securehashesonly. This does not prevent the git repository
from containing files with insecure hashes, but it does prevent the content
of such files from being pulled into .git/annex/objects from another
repository.
We want to make sure that at no point does git-annex accept content into
.git/annex/objects that is hashed with an insecure key. Here's how it
was done:
* .git/annex/objects/xx/yy/KEY/ is kept frozen, so nothing can be
written to it normally
* So every place that writes content must call, thawContent or modifyContent.
We can audit for these, and be sure we've considered all cases.
* The main functions are moveAnnex, and linkToAnnex; these were made to
check annex.securehashesonly, and are the main security boundary
for annex.securehashesonly.
* Most other calls to modifyContent deal with other files in the KEY
directory (inode cache etc). The other ones that mess with the content
are:
- Annex.Direct.toDirectGen, in which content already in the
annex directory is moved to the direct mode file, so not relevant.
- fix and lock, which don't add new content
- Command.ReKey.linkKey, which manually unlocks it to make a
copy.
* All other calls to thawContent appear safe.
Made moveAnnex return a Bool, so checked all callsites and made them
deal with a failure in appropriate ways.
linkToAnnex simply returns LinkAnnexFailed; all callsites already deal
with it failing in appropriate ways.
This commit was sponsored by Riku Voipio.
9c4650358c changed the Read instance for
Key.
I've checked all uses of that instance (by removing it and seeing what
breaks), and they're all limited to the webapp, except one.
That is GitAnnexDistribution's Read instance.
So, 9c4650358c would have broken upgrades
of git-annex from downloads.kitenet.net. Once the .info files there got
updated for a new release, old releases would have failed to parse them
and never upgraded.
To fix this, I found a way to make the .info files that contain
GitAnnexDistribution values be readable by the old version of git-annex.
This commit was sponsored by Ewen McNeill.
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.
This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.
Benchmarks ran in my big repo:
old git-annex info:
real 0m3.338s
user 0m3.124s
sys 0m0.244s
new git-annex info:
real 0m3.216s
user 0m3.024s
sys 0m0.220s
new git-annex find:
real 0m7.138s
user 0m6.924s
sys 0m0.252s
old git-annex find:
real 0m7.433s
user 0m7.240s
sys 0m0.232s
Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.
This commit was supported by the NSF-funded DataLad project.
This was noticed because it broke the datalad test suite, which pushed
to the remote and then fetched to check if it had received the expected
branches. Auto-init caused the git-annex branch on the remote to
diverge, breaking that test.
https://github.com/datalad/datalad/issues/1319#issuecomment-281649518
The auto-init still happens, it's staged in the journal, and will be
commited by some later git-annex command when it runs. Which is fine,
it's the same as that later command doing the auto-init.
This commit was supported by the NSF-funded DataLad project
sync: When syncing with a local repository located on a crippled
filesystem, run the post-receive hook there, since it wouldn't get run
otherwise. This makes pushing to repos on FAT-formatted removable drives
update them when receive.denyCurrentBranch=updateInstead.
Made Remote.Git export onLocal, which was cleaned up to not have so many
caveats about its use.
This commit was sponsored by Jeff Goeke-Smith on Patreon.
* Added post-recieve hook, which makes updateInstead work with direct
mode and adjusted branches.
* init: Set up the post-receive hook.
This commit was sponsored by Fernando Jimenez on Patreon.
Recent changes had a side effect of displaying errors in the fairly
common case when this push fails. Since the synced/git-annex push
is always forced, those errors are noise, so hide again.
This means 3 separate pushes are done now, where before it only made 2.
A bit more expensive, but ssh connection caching eliminates most of
the costs.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
config group groupwanted numcopies schedule wanted required: Avoid
displaying extraneous messages about repository auto-init, git-annex branch
merging, etc, when being used to get information.
By displaying error messages from the remote then it fails to update
its checked out branch.
Error messages in the default receive.denyCurrentBranch are still
suppressed, which matches user expectations.
This commit was sponsored by Nick Daly on Patreon.
... to avoid it consuming stdin that it shouldn't.
This fixes git-annex-checkpresentkey --batch remote, which didn't output
results for all keys passed into it.
Other git-annex commands that communicate with a remote over ssh may also
have been consuming stdin that they shouldn't have, which could have
impacted using them in eg, shell scripts. For example, a shell script
reading files from stdin and passing them to git annex drop would be
impacted by this bug, whenever git annex drop ran git-annex-shell
checkpresent, it would consume part/all of the stdin that the shell script
was supposed to consume.
Fixed by adding a ConsumeStdin parameter to Annex.Ssh.sshOptions, which
is used throughout git-annex to run ssh (in order for ssh connection
caching to work). Every call site was checked to see if it used
CreatePipe for stdin, and if not was marked NoConsumeStdin.
At first I wanted to make it go ahead and merge into the newborn branch,
so made it use Git.Branch.currentUnsafe to get the current branch. But that
failed:
fatal: ambiguous argument 'refs/heads/master..refs/heads/synced/master':
unknown revision or path not in the working tree.
A whole nother code path to handle merging into newborn branches seemed
excessive, so went with displaying a warning and propigating failure
status.
This commit was sponsored by Brock Spratlen on Patreon.
This is the same as running git annex reinject --known, followed by
git-annex import. The advantage to having it in one command is that it
only has to hash each file once; the two commands have to
hash the imported files a second time.
This commit was sponsored by Shane-o on Patreon.
import: --deduplicate and --skip-duplicates were implemented inneficiently;
they unncessarily hashed each file twice. They have been improved to only
hash once.
The new approach is to lock down (minimally) and hash files, and then
reuse that information when importing them.
This was rather tricky, especially in detecting changes to files while
they are being imported.
The output of import changed slightly. While before it silently skipped
over files with eg --skip-duplicates, now it shows each file as it starts
to act on it. Since every file is hashed first thing, it would otherwise
not be clear what file import is chewing on. (Actually, it wasn't clear
before when any of the duplicates switches were used.)
This commit was sponsored by Alexander Thompson on Patreon.
Before, only content known to be present somewhere was considered a
duplicate. Now, any content that has been annexed before will be considered
a duplicate, even if all annexed copies of the data have been lost.
Note that --clean-duplicates and --deduplicate still check numcopies,
so won't delete duplicate files unless there's an annexed copy.
This makes import use the same method as reinject --known.
The man page already said that duplicate meant "its content is either
present in the local repository already, or git-annex knows of another
repository that contains it, or it was present in the annex before but has
been removed now". So, this is really only bringing the implementation into
line with the man page.
This commit was sponsored by Jochen Bartl on Patreon.
Most remotes have an idempotent setup that can be reused for
enableremote, but in a few cases, it needs to tell which, and whether
a UUID was provided to setup was used.
This is groundwork for making initremote be able to provide a UUID.
It should not change any behavior.
Note that it would be nice to make the UUID always be provided to setup,
and make setup not need to generate and return a UUID. What prevented
this simplification is Remote.Git.gitSetup, which needs to reuse the
UUID of the git remote when setting it up, and so has to return that
UUID.
This commit was sponsored by Thom May on Patreon.
Wormhole pairing will start to provide an appid to wormhole on 2021-12-31.
An appid can't be provided now because Debian stable is going to ship a
older version of git-annex that does not provide an appid. Assumption is
that by 2021-12-31, this version of git-annex will be shipped in a Debian
stable release. If that turns out to not be the case, this change will need
to be cherry-picked into the git-annex in Debian stable, or its wormhole
pairing will break.
This commit was sponsored by Thomas Hochstein on Patreon.
.. which can be set to true to make git annex sync default to --content.
This may become the default at some point in the future.
As well as being configuable by git config, it can be configured by
git-annex config to control the default behavior in all clones of a
repository.
Had to add a separate --no-content switch to we can tell if it's been
explicitly set, and should override annex.synccontent. If --content was the
default, this complication would not be necessary.
This commit was sponsored by Jake Vosloo on Patreon.
... to control the default behavior in all clones of a repository.
This includes a new Configurable data type, so the GitConfig type indicates
which values can be configured this way.
The implementation should be quite efficient; the config log is only read
once, and only when a Configurable value has not already been set by
git-config.
Indeed, it would be nice in the future to extend this, so that git-config
is itself only read on demand. Some commands may not need to look at the
git configuration at all.
This commit was sponsored by Trenton Cronholm on Patreon.
This makes it a little bit slower since it has to check file size,
but worth it to fix a potential memory use problem.
This commit was sponsored by Fernando Jimenez on Patreon.
Turns out that Data.List.Utils.split is slow and makes a lot of
allocations. Here's a much simpler single character splitter that behaves
the same (even in wacky corner cases) while running in half the time and
75% the allocations.
As well as being an optimisation, this helps move toward eliminating use of
missingh.
(Data.List.Split.splitOn is nearly as slow as Data.List.Utils.split and
allocates even more.)
I have not benchmarked the effect on git-annex, but would not be surprised
to see some parsing of eg, large streams from git commands run twice as
fast, and possibly in less memory.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Any config names can be set using this; git-annex commands will only look
at specific ones that make sense and are worth the overhead of querying the
branch.
This might also be useful for storing whatever other config-type stuff the
user might want to shove into the git-annex branch.
This commit was sponsored by Jochen Bartl on Patreon.
Docs say vicfg can configure everything from git-annex branch,
so it ought to configure numcopies.
Note that commenting out existing numcopies does not unset it.
This commit was sponsored by Thom May on Patreon.
Since the user does not know whether it will run su or sudo, indicate
whether the password prompt will be for root or the user's password,
when possible.
I assume that programs like gksu that can prompt for either depending on
system setup will make clear in their prompt what they're asking for.
This way we know that after enable-tor, the tor hidden service is fully
published and working, and so there should be no problems with it at
pairing time.
It has to start up its own temporary listener on the hidden service. It
would be nice to have it start the remotedaemon running, so that extra
step is not needed afterwards. But, there may already be a remotedaemon
running, in communication with the assistant and we don't want to start
another one. I thought about trying to HUP any running remotedaemon, but
Windows does not make it easy to do that. In any case, having the user
start the remotedaemon themselves lets them know it needs to be running
to serve the hidden service.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
weasel explained that apparmor limits on what files tor can read do not
apply to sockets (because they're not files). And apparently the
problems I was seeing with hidden services not being accessible had to
do with onion address propigation and not the location of the socket
file.
remotedaemon looks up the HiddenServicePort in torrc, so if it was
previously configured with the socket in /etc, that will still work.
This commit was sponsored by Denis Dzyubenko on Patreon.