The code is only needed because for a long time, git-annex didn't
install hooks in repos on crippled filesystems. Now it does, and they
work at least on FAT (where all files are executable) and Windows.
It would be possible to remove this code in v8 simply by re-installing
the hooks.
This means that Command.Move and Command.Get don't need to
manually set the stage, and is a lot cleaner conceptually.
Also, this makes Command.Sync.syncFile use the worker pool better.
In the scenario where it first downloads content and then uploads it to
some other remotes, it will start in TransferStage, then enter VerifyStage
and then go back to TransferStage for each transfer to the remotes.
Before, it entered CleanupStage after the download, and stayed in it for
the upload, so too many transfer jobs could run at the same time.
Note that, in Remote.Git, it uses runTransfer and also verifyKeyContent
inside onLocal. That has a Annex state for the remote, with no worker pool.
So the resulting calls to enteringStage won't block in there.
While Remote.Git.copyToRemote does do checksum verification, I
realized that should not use a verification slot in the WorkerPool
to do it. Because, it's reading back from eg, a removable disk to checksum.
That will contend with other writes to that disk. It's best to treat
that checksum verification as just part of the transer. So, removed the todo
item about that, as there's nothing needing to be done.
Rather than limiting it to PerformStage and CleanupStage, this opens it
up so any number of stages can be added as needed by commands.
Each concurrent command has a set of stages that it uses, and only
transitions between those can block waiting for a free slot in the
worker pool. Calling enteringStage for some other stage does not block,
and has very little overhead.
Note that while before the Annex state was duplicated on the first call
to commandAction, this now happens earlier, in startConcurrency.
That means that seek stage actions should that use startConcurrency
and then modify Annex state won't modify the state of worker threads
they then start. I audited all of them, and only Command.Seek
did so; prepMerge changes the working directory and so has to come
before startConcurrency.
Also, the remote list is built before duplicating the state, which means
that it gets built earlier now than it used to. This would only have an
effect of making commands that end up not needing to perform any actions
unncessary build the remote list (only when they're run with concurrency
enable), but that's a minor overhead compared to commands seeking
through the work tree and determining they don't need to do anything.
get, move, copy, sync: When -J or annex.jobs has enabled concurrency,
checksum verification uses a separate job pool than is used for
downloads, to keep bandwidth saturated.
Not yet done for upload checksum verification, but that only affects
remotes on local disks.
Fixes the last wart in the StartMessage transition. A few commands
include other CommandStart actions that generate output, and
do not themselves need to display a start/end message.
The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.
(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)
But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.
Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.
Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.
In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.
To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.
Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)
The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.
One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.
Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.
In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.
Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
The filtering is fairly efficient as far as building the trees goes,
since it reuses adjustTree. But it still needs to traverse the whole
tree, and look up the keys used by every file.
The tree that gets recorded to export.log is the filtered tree.
This way resumes of interrupted sync to an export uses it without
needing to recalculate it. And, a change to the preferred content
settings of the remote will result in a different tree, so the export
will be updated accordingly.
The original tree is still used in the remote tracking branch.
That branch represents the special remote as a git remote, and if it
were a normal git remote, the tree in its head would not be affected by
preferred content.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
This works, and tested syncing both gets changes from a special remote
and sends changes to it, keeping it fully in sync nicely!
But have not tried it with a subdir configured.
Users may want sync to only export, or only import and this is broadly
analagous to push and pull, so it makes sense to use the same
configuration for it.
The branch is only updated once the export is 100% complete. This way,
if an export is started but interrupted and so the remote does not yet
contain some of the files, an import will make a commit on the old
branch, and so won't delete the missing files.
Purifying exportActions will allow introspecting and modifying it,
which is needed to add progress bar display to it.
Only S3 and WebDAV ran an Annex action while constructing ExportActions.
There was a small performance gain from them doing that, since a
resource was able to be prepared and reused for multiple actions by
Command.Export.
As seen in commit 809cfbbd8a and
5d394023eb S3 and WebDAV actually create a
new handle for each access in normal, non-export use. It doesn't seem
worth making export use of them marginally more efficient than normal
use. It would be better to do that work upfront when constructing the
remote. Or perhaps use a MVar to cache a handle.
This commit was sponsored by Nick Piper on Patreon.
Like the earlier fixed one in Command.Export, it occurred when the same
tree was exported by multiple clones. Previous fix was incomplete since
several other places looked at the list of exported trees to detect when
there was an export conflict. Added a single unified function to avoid
missing any places it needed to be fixed.
This commit was sponsored by mo on Patreon.
This fixes a bug with the numcopies counting when using sync --content.
It did not always pass the local repo uuid to handleDropsFrom, and so the
numcopies counting was off by one, and unwanted local content would only be
dropped when there were numcopies+1 remote copies.
Also, support dropping local content that has reached an
exporttree remote that is not untrusted (currently only S3 remotes
with versioning).
When an export conflict prevents accessing a special remote, be clearer
about what the problem is and how to resolve it.
This commit was sponsored by Trenton Cronholm on Patreon.
This completes initial support for --hide-missing, although the
assistant still needs to be updated and it perhaps needs to be sped up,
and maybe there needs to be a way for git-annex get to operate on
missing files. Opened some more todos for those things.
This commit was sponsored by Henrik Riomar.
This relies on git ls-files --with-tree, which I'm using in a way that
its man page does not document. Hm. I emailed the git list to try to get
the docs improved, but at least the git test suite does test the same
kind of use case I'm using here.
Performance impact when not in an adjusted branch is limited to some
additional MVar accesses, and a single git call to determine the name of
the current branch. So very minimal.
When in an adjusted branch, the performance impact is
in Annex.WorkTree.lookupFile, which starts doing an equal amount of work
for files that didn't exist as it already did for files that were
unlocked.
This commit was sponsored by Jochen Bartl on Patreon.
Both Command.Sync and Annex.Ingest had their own versions of this.
The one in Annex.Ingest used Git.Branch.currentUnsafe, but does not seem
to need it. That is only checking to see if it's in an adjusted unlocked
branch, and when in an adjusted branch, the branch does in fact exist,
so the added check that Git.Branch.current does is fine.
This commit was sponsored by Denis Dzyubenko on Patreon.
This is groundwork for nested seek loops, eg seeking over all files and
then performing commandActions on a list of remotes, which can be done
concurrently.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Only display the warning when the current branch has a tree that is not
the same as the tree in the export.
Note that it doesn't check to see if the current tree is
in incompleteExportedTreeish; it might be worth checking that and reminding
the user about an incomplete export, but when export tracking is not
configured, they are probably not in the right clone of the repository to
resolve the incomplete export.
This commit was sponsored by Ethan Aubin.
Added annex.commitmessage config that can specify a commit message for the
git-annex branch instead of the usual "update".
This commit was supported by the NSF-funded DataLad project.
Added remote.name.annex-speculate-present config that can be used to
make cache remotes.
Implemented it in Remote.keyPossibilities, which is used by the
get/move/copy/mirror commands, and nothing else. This way, things like
whereis will not show content that's speculatively present.
The assistant and sync --content were not using Remote.keyPossibilities,
and were changed to use it.
The efficiency hit should be small; Remote.keyPossibilities is only
used before transferring a file, which is the expensive operation.
And, it's only doing one lookup of the remoteList and a very cheap
filter over it.
Note that, git-annex still updates the location log when copying content
to a remote with annex-speculate-present set. In this case, the location
tracking will indicate that content is present in the remote. This may
not be wanted for caches, or may not be a real problem for them. TBD.
This commit was supported by the NSF-funded DataLad project.
Unfortunately one more use remains..
This should be just as fast as the other method. The remote's Git.Repo
has already had its config read, so Annex.new's call to Git.Config.read
is a noop.
Thid commit was sponsored by andrea rota.
This is groundwork for letting a repo be instantiated the first time
it's actually used, instead of at startup.
The only behavior change is that some old special cases for xmpp remotes
were removed. Where before git-annex silently did nothing with those
no-longer supported remotes, it may now fail in some way.
The additional IO action should have no performance impact as long as
it's simply return.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon
move: Added --safe option, which makes move honor numcopies settings.
Also --unsafe enables the default behavior, anticipating that the
default may one day change.
This commit was sponsored by Ethan Aubin.
sync: Fix bug that prevented pulling changes into direct mode repositories
that were committed to remotes using git commit rather than git-annex sync.
This commit was supported by the NSF-funded DataLad project.
Fourth or fifth try at this and finally found a way to make it work.
Absurd amount of busy-work forced on me by change in cabal's behavior.
Split up Utility modules that need posix stuff out of ones used by
Setup. Various other hacks around inability for Setup to use anything
that ifdefs a use of unix.
Probably lost a full day of my life to this.
This is how build systems make their users hate them. Just saying.
Clean up some uses of showStart with "" for the file,
or in some cases, a non-filename description string. That would
generate bad json, although none of the commands doing that
supported --json.
Using "" for the file resulted in output like "foo rest";
now the extra space is eliminated.
This commit was sponsored by Fernando Jimenez on Patreon.
This avoids all the complication about redundant work discussed in
the previous try at fixing this. At the expense of needing each command
that could have the problem to be patched to simply wrap the action in
onlyActionOn once the key is known. But there do not seem to be many
such commands.
onlyActionOn' should not be used with a CommandStart (or CommandPerform),
although the types do allow it. onlyActionOn handles running the whole
CommandStart chain. I couldn't immediately see a way to avoid mistken
use of onlyActionOn'.
This commit was supported by the NSF-funded DataLad project.
git annex add, git annex lock etc make multiple seek passes,
and each seek pass checked that files existed. That was unncessary
redundant work.
Fixed by adding a new WorkTreeItem type, make seek actions use it,
and check that the files exist when constructing it.
This commit was supported by the NSF-funded DataLad project.
Also deletes any tagged pushes that the assistant might have done,
since those would also prevent resetting a branch back.
This commit was sponsored by andrea rota.
This is similar to the pusher thread, but a separate thread because git
pushes can be done in parallel with exports, and updating a big export
should not prevent other git pushes going out in the meantime.
The exportThread only runs at most every 30 seconds, since updating an
export is more expensive than pushing. This may need to be tuned.
Added a separate channel for export commits; the committer records a
commit in that channel.
Also, reconnectRemotes records a dummy commit, to make the exporter
thread wake up and make sure all exports are up-to-date. So,
connecting a drive with a directory special remote export will
immediately update it, and getting online will automatically
update S3 and WebDAV exports.
The transfer queue is not involved in exports. Instead, failed
exports are retried much like failed pushes.
This commit was sponsored by Ewen McNeill.
Added remote configuration settings annex-ignore-command and
annex-sync-command, which are dynamic equivilants of the annex-ignore
and annex-sync configurations.
For this I needed a new DynamicConfig infrastructure. Its implementation
should be as fast as before when there is no dynamic config, and it caches
so shell commands are only run once.
Note that annex-ignore-command exits nonzero when the remote should be ignored.
While that may seem backwards, it allows using the same command for it as
for annex-sync-command when you want to disable both.
This commit was sponsored by Trenton Cronholm on Patreon.
* Added annex.resolvemerge configuration, which can be set to false to
disable the usual automatic merge conflict resolution done by git-annex
sync and the assistant.
* sync: Added --no-resolvemerge option.
Note that disabling merge conflict resolution is probably not a good idea
in a direct mode repo or adjusted branch. Since updates to both are done
outside the usual work tree, if it fails the tree is not left in a
conflicted state, and it would be hard to manually resolve the conflict.
Still, made annex.resolvemerge be supported in those cases for consistency.
This commit was sponsored by Riku Voipio.
The former can be useful to make remotes that don't get fully synced with
local changes, which comes up in a lot of situations.
The latter was mostly added for symmetry, but could be useful (though less
likely to be).
Implementing `remote.<name>.annex-pull` was a bit tricky, as there's no one
place where git-annex pulls/fetches from remotes. I audited all
instances of "fetch" and "pull". A few cases were left not checking this
config:
* Git.Repair can try to pull missing refs from a remote, and if the local
repo is corrupted, that seems a reasonable thing to do even though
the config would normally prevent it.
* Assistant.WebApp.Gpg and Remote.Gcrypt and Remote.Git do fetches
as part of the setup process of a remote. The config would probably not
be set then, and having the setup fail seems worse than honoring it if it
is already set.
I have not prevented all the code that does a "merge" from merging branches
from remotes with remote.<name>.annex-pull=false. That could perhaps
be done, but it would need a way to map from branch name to remote name,
and the way refspecs work makes that hard to get really correct. So if the
user fetches manually, the git-annex branch will get merged, for example.
Anther way of looking at/justifying this is that the setting is called
"annex-pull", not "annex-merge".
This commit was supported by the NSF-funded DataLad project.