The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.
(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)
But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.
Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.
Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.
In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.
To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.
Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)
The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.
One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.
Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.
In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.
Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
When running multiple concurrent actions, the cleanup phase is run in a
separate queue than the main action queue. This can make some commands
faster, because less time is spent on bookkeeping in between each file
transfer.
But as far as I can see, nothing will be sped up much by this yet, because
all the existing cleanup actions are very light-weight. This is just groundwork
for deferring checksum verification to cleanup time.
This change does mean that if the user expects -J2 will mean that they see no
more than 2 jobs running at a time, they may be surprised to see 4 in some
cases (if the cleanup actions are slow enough to notice).
It might also make sense to enable background cleanup without the -J,
for at least one cleanup action. Indeed, that's the behavior that -J1
has now. At some point in the future, it make make sense to make the
behavior with no -J the same as -J1. The only reason it's not currently
is that git-annex can build w/o concurrent-output, and also any bugs
in concurrent-output (such as perhaps misbehaving on non-VT100 compatible
terminals) are avoided by default by only using it when -J is used.
Add back support for ftp urls, which was disabled as part of the fix for
security hole CVE-2018-10857 (except for configurations which enabled curl
and bypassed public IP address restrictions). Now it will work if allowed
by annex.security.allowed-ip-addresses.
Renamed annex.security.allowed-http-addresses to
annex.security.allowed-ip-addresses because it is not really specific to
the http protocol, also limiting eg, git-annex's use of ftp and via
youtube-dl, several other protocols.
The old name for the config will still work.
If both old and new name are set, the new name will win.
* init: When the repository already has a description, don't change it.
* describe: When run with no description parameter it used to set
the description to "", now it will error out.
Importing from a special remote honors its preferred content too; unwanted
files are not imported. But, some preferred content expressions can't be
checked before files are imported, and trying to import with such an
expression will fail.
Tested this with scenarios including changing the preferred content
expression and making sure merging the import didn't delete files that were
no longer wanted.
There was one minor inefficiency mentioned in the todo that I punted on.
Make the import have the previous import as a parent, so eg `git log --stat`
displays a useful diff.
Also a minor optimisation, only calculate the depth of the imported history
once.
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.
adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.
That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
The filtering is fairly efficient as far as building the trees goes,
since it reuses adjustTree. But it still needs to traverse the whole
tree, and look up the keys used by every file.
The tree that gets recorded to export.log is the filtered tree.
This way resumes of interrupted sync to an export uses it without
needing to recalculate it. And, a change to the preferred content
settings of the remote will result in a different tree, so the export
will be updated accordingly.
The original tree is still used in the remote tracking branch.
That branch represents the special remote as a git remote, and if it
were a normal git remote, the tree in its head would not be affected by
preferred content.
Only when the preferred content expression includes them will a parse
failure due to them needing keys result in the preferred content
expression not parsing in keyless mode.
This will let import try to match preferred content expressions before
downloading the content and generating its key.
If an expression needs a key, it preferredContentParser with
preferredContentKeylessTokens will fail to parse it.
standard and groupwanted are not in preferredContentKeylessTokens
because they may refer to an expression that refers to a key.
That needs further work to support them.
Added the ability to run one job per CPU (core), by setting annex.jobs=cpus,
or using option --jobs=cpus or -Jcpus.
Built with future expansion in mind, including not defaulting matching on
Concurrency so more constructors can later be added, and using "cpu"
instead of "0".
Fixes bug that caused git-annex to fail to add a file when another
git-annex process cleaned up the temp directory it was using.
Solution is just to push withOtherTmp out to a higher level, so that
the whole ingest process can be completed inside it.
But in the assistant, that was not practical to do, since withOtherTmp runs
in the Annex monad and the assistant does not. Worked around by introducing
a separate temp directory that only the assistant uses for lockdown.
Since only one assistant can run at a time, it's easy to clean up that
directory of old cruft at startup.
This is only done for correctness sake; I don't see any way that it
would have caused a problem here. The jlog file escaped withOtherTmp
so another process could swoop in and delete it, but the file is only
used as a buffer for a list of filenames, and its handle gets rewound
and they're read back out, which will still work even if it's already
been deleted.
The only reason I didn't just pre-delete the file and keep the handle
open is I'm not sure that works on all OS's (eg Windows). If there was
a problem that this fixed it might involve an OS that doesn't support
deleting an open file or something like that.
bf7ecd6892 went too far and broke
importing, the old tree was used on the remote tracking branch and not
the newly imported tree.
Test suite noticed the problem luckily.
Fix reversion in last release that caused wrong tree to be written to
remote tracking branch after an export of a subtree.
The invariant "commitsha should have the treesha as its tree"
was not met due to a bug. Guarantee it's met by catting the commitsha
to find its actual tree. A little bit slower, but this is not run often.
This way no history is lost, neither what was exported to the remote,
or the history of changes that is imported from it. No complicated
correlation of two possibly very different histories is needed, just
record what we know and then git merge will do a good job.
Also, it notices when the remote tracking branch doesn't need to be updated,
and avoids doing anything, so noop remotes are super cheap.
The only catch here is that, since the commits generated for imports
from the remote don't have a stable date or author/committer, each
(non-noop) import generates different commits for the same imported
trees. So, when the imported remote tracking branch is merged into master
and then a change is imported again, there will be an extra series of
commits, which will get more and more expensive each time.
This seems to call for making stable commits for imports. Also that
seems a good idea to make importing in several repositories have the
same result.
* Added mimeencoding= term to annex.largefiles expressions.
This is probably mostly useful to match non-text files with eg
"mimeencoding=binary"
* git-annex matchexpression: Added --mimeencoding option.
Switch listContents to being a proper CommandStart, so if it throws an
exception, it will be treated like any other command action that fails.
downloadImport apparently does not ever throw an exception,
and itself uses commandAction, so it can't be a CommandStart.
Fix bug that caused importing from a special remote to repeatedly download
unchanged files when multiple files in the remote have the same content.
Unfortunately, there's really no good way to remove a uniqueness constraint
from a sqlite database. The best that can be done is to make a new table
and copy the data over. But that would require using persistent's
migrations or raw sql, and I don't want to do either.
Instead, a sledgehammer approach: Renamed .git/annex/cid to
.git/annex/cids. When the new database doesn't exist, it will be populated
from the git-annex branch.
Noting deletes the old database. Don't want to delete it out from under
some long-running git-annex process that might be using it. It could
eventually be deleted. But this is such a new feature, probably few repos
have the database in any case.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
untested
This won't be super slow, but it does need to diff two likely large
trees, and since the git-annex branch rarely sits still, it will most
likely be run at the beginning of every import.
A possible speed improvement would be to only run this when the database
did not contain a ContentIdentifier. But that would only speed up
imports when there is no new version of a file on the special remote,
at most renames of existing files being imported.
A better speed improvement would be to record something in the git-annex
branch that indicates when an import has been run, and only do the diff
if the git-annex branch has record of a newer import than we've seen
before. Then, it would only run when there is in fact new
ContentIdentifier information available from a remote. Certianly doable,
but didn't want to complicate things yet.
For now, it's only allowed when exporttree=yes is also set.
That simplified the implementation, but could later be changed if
there's a remote that makes sense to be an import but not an export.
However, it may work just as well to make a remote be readonly to
prevent export to it while still allowing import.
This does not avoid all possible races, but it does avoid all likely
ones, and is demonstratably better than git's own handling of races
where files get modified at the same time as it's updating the working
tree.
The main thing this won't detect are not unlikely races where part
of a file gets changed while it's being copied and then the file is
restored to its original condition before the modification check.
No, it's more likely that the limitations of checking inode, size,
and mtime won't detect certian modifications, involving eg mmapped
files.
Made some api changes.
listImportableContents needs to provide the size
of the data, so the downloader can check disk free space.
retrieveExportWithContentIdentifier is passed the filepath to write to
Use temporary "CID" key during download of a ContentIdentifier from a
remote, so withTmp can be used and then move the content to the real key
once it's known.
Added graftTree but it's buggy.
Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.
Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
Does not yet have a way to update with new information from the
git-annex branch, which will be needed when multiple repos are importing
from the same remote.
Avoid performing repository fixups for submodules and git-worktrees
when there's a .noannex file that will prevent git-annex from being
used in the repository.
This change is ok as long as the .noannex file is really going to prevent
git-annex from being used. But, init --force could override the file.
Which would result in the repo being initialized without the fixups
having run.
To avoid that situation decided to change init, to not let --force be used
to override a .noannex file. Instead the user can just delete the file.
If the worktree file already exists, and is annexed and uses the same
key, avoid failing, nothing needs to be done.
Had to add lookupFileNotHidden to handle the case where an adjust --hide-missing
is in use, and the worktree file was hidden due to the object content
being missing. lookupFile would return the key of the hidden file,
but it makes sense that after fromkey succeeds, the worktree must
contain the file it was supposed to set up.
Need to create the directory after the lock is held, not before.
The other racing process would need to shut down at just the wrong time,
running cleanupOtherTmp.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
I am very doubtful that commit 613e747d91
was right about this doing anything, and I've verified that without it,
ctrl-c sends sigINT to child processes, and git-annex get does not
continue to the next item.
It seems likely that the real problem back then was something catching
the async exception.
Hard to see how installing a default signal handler could cause any
change from default behavior either.
One reason to want to get rid of this cruft now is that tasty has a
sigINT handler of its own, and this would override it.
(Tasty is not currently setting that handler up the way git-annex uses
it, due to a problem in tasty, but that will hopefully change.)
* Switch to using .git/annex/othertmp for tmp files other than partial
downloads, and make stale files left in that directory when git-annex
is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
delete anything lingering in there after it's 1 week old.
Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
Adding that field broke the Read/Show serialization back-compat,
and also the Eq and Ord instances were not blinded to it, which broke
git annex fsck and probably more.
I think that the new approach used in formatKeyVariety will be nearly
as fast, but have not benchmarked it.
This reverts commit 4536c93bb2.
That broke Read/Show of a Key, and unfortunately Key is read in at least
one place; the GitAnnexDistribution data type.
It would be worth bringing this optimisation back, but it would need
either a custom Read/Show instance that preserves back-compat, or
wrapping Key in a data type that contains the serialization, or changing
how GitAnnexDistribution is serialized.
Also, the Eq instance would need to compare keys with and without a
cached seralization the same.
The builder produces a lazy ByteString, and L.toStrict has to copy it,
but needing to use the builder is no longer to common case; the
serialization will normally be cached already as a strict ByteString,
and this avoids keyFile' needing to use L.toStrict . serializeKey'
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
It means that every place a Key has any of its fields changed, the cache
has to be dropped. I've grepped and found them all. But, it would be
better to avoid that gotcha somehow..
Now there's a ByteString used all the way from disk to Key.
The main complication in this conversion was the use of fromInternalGitPath
in several places to munge things on Windows. The things that used that
were changed to parse the ByteString using either path separator.
Also some code that had read from files to a String lazily was changed
to read a minimal strict ByteString.
A keyName could contain "/", though this is unlikely and certianly only
ever could happen with WORM keys.
The change to addunused to escape that is no problem at all.
The change to VariantFile to escape it means that different versions of
git-annex could resolve a merge conflict differently in this case, which
is unfortunate. There would be different .variant files used, so the two
resolutions would themselves merge together without additional
conflicts, but the user would have to clean up the extra .variant
files.