Except for the assistant, which I think may use them between threads?
Most of the uses of SomeException were already catching only async exceptions.
But I did find a few places that were accidentially catching them.
This handles all createProcessSuccess callers, and aside from process
pools, the complete conversion of all process running to async exception
safety should be complete now.
Also, was able to remove from Utility.Process the old API that I now
know was not a good idea. And proof it was bad: The code size went *down*,
despite there being a fair bit of boilerplate for some future API to
reduce.
This handles all sites where checkSuccessProcess/ignoreFailureProcess
is used, except for one: Git.Command.pipeReadLazy
That one will be significantly more work to convert to bracketing.
(Also skipped Command.Assistant.autoStart, but it does not need to
shut down the processes it started on exception because they are
git-annex assistant daemons..)
forceSuccessProcess is done, except for createProcessSuccess.
All call sites of createProcessSuccess will need to be converted
to bracketing.
(process pools still todo also)
Not yet 100% done, so far I've grepped for waitForProcess and converted
everything that uses that to start the process with withCreateProcess.
Except for some things like P2P.IO and Assistant.TransferrerPool,
and Utility.CoProcess, that manage a pool of processes. See #2
in https://git-annex.branchable.com/todo/more_extensive_retries_to_mask_transient_failures/#comment-209f8a8c38e63fb3a704e1282cb269c7
for how those will need to be dealt with.
checkSuccessProcess, ignoreFailureProcess, and forceSuccessProcess calls waitForProcess, so
callers of them will also need to be dealt with, and have not been yet.
Added annex.skipunknown git config, that can be set to false to change the
behavior of commands like `git annex get foo*`, to not skip over files/dirs
that are not checked into git and are explicitly listed in the command
line.
Significant complexity was needed to handle git-annex add, which uses some
git ls-files calls, but needs to not use --error-unmatch because of course
the files are not known to git.
annex.skipunknown is planned to change to default to false in a
git-annex release in early 2022. There's a todo for that.
Try to enable special remotes configured with autoenable=yes when git-annex
auto-initialization happens in a new clone of an existing repo. Previously,
git-annex init had to be explicitly run to enable them. That was a bit of a
wart of a special case for users to need to keep in mind.
Special remotes cannot display anything when autoenabled this way, to avoid
interfering with the output of git-annex query commands.
Any error messages will be hidden, and if it fails, nothing is displayed.
The user will realize the remote isn't enable when they try to use it,
and can run git-annex init manually then to try the autoenable again and
see what failed.
That seems like a reasonable approach, and it's less complicated than
communicating something across a pipe in order to display it as a side
message. Other reason not to do that is that, if the first command the
user runs is one like git-annex find that has machine readable output,
any message about autoenable failing would need to not be displayed anyway.
So better to not display a failure message ever, for consistency.
(Had to split out Remote.List.Util to avoid an import cycle.)
retrieveExport is part of ongoing transition to make remote methods
throw exceptions, rather than silently hide them.
getKey very rarely fails, and when it does it's always for the same reason
(user configured annex.backend to url for some reason). So, this will
avoid dealing with Nothing everywhere it's used.
This commit was sponsored by Ilya Shlyakhter on Patreon.
When storing content on remote fails, always display a reason why.
Since the Storer used by special remotes already did, this mostly affects
git remotes, but not entirely. For example, if git-lfs failed to connect to
the endpoint, it used to silently return False.
* addurl --preserve-filename: New option, uses server-provided filename
without any sanitization, but with some security checking.
Not yet implemented for remotes other than the web.
* addurl, importfeed: Avoid adding filenames with leading '.', instead
it will be replaced with '_'.
This might be considered a security fix, but a CVE seems unwattanted.
It was possible for addurl to create a dotfile, which could change
behavior of some program. It was also possible for a web server to say
the file name was ".git" or "foo/.git". That would not overrwrite the
.git directory, but would cause addurl to fail; of course git won't
add "foo/.git".
sanitizeFilePath is too opinionated to remain in Utility, so moved it.
The changes to mkSafeFilePath are because it used sanitizeFilePath.
In particular:
isDrive will never succeed, because "c:" gets munged to "c_"
".." gets sanitized now
".git" gets sanitized now
It will never be null, because sanitizeFilePath keeps the length
the same, and splitDirectories never returns a null path.
Also, on the off chance a web server suggests a filename of "",
ignore that, rather than trying to save to such a filename, which would
fail in some way.
addurl: When run with --fast on an url that
annex.security.allowed-ip-addresses prevents accessing, display a more
useful message.
(Also importfeed --fast potentially.)
Limited to min of -JN or number of CPU cores, because it will often be
CPU bound, once it's read the gitignore file for a directory.
In some situations it's more disk bound, but in any case it's unlikely
to be the main bottleneck that -J is used to avoid. Eg, when dropping,
this is used for numcopies checks, but the main bottleneck will be
accessing the remotes to verify presence. So the user might decide to
-J32 that, but having 32 check-attr processes would just waste however
many filehandles they open, and probably worsen their performance due to
CPU contention.
Note that, I first tried just letting up to the -JN be started. However,
even when it's no bottleneck at all, that still results in all of them
being started. Why? Well, all the worker threads start up nearly
simulantaneously, so there's a thundering herd..
Avoid running a large number of git cat-file child processes when run with
a large -J value.
This implementation takes care to avoid adding any overhead to git-annex
when run without -J. When run with -J, there is a small bit of added
overhead, to manipulate the resource pool. That optimisation added a
fair bit of complexity.
This does mean that RemoteDaemon.Transport.Tor's call runs it, otherwise
no change, but this is groundwork for doing more such expensive actions
in dupState.
Fixes a failure mode where git-annex sync would try to run git-annex and
complain that it failed to find it in ~/.config/git-annex/program or PATH,
when there was a git-annex in /usr/bin/, but the original one was run
from elsewhere (eg, ~/bin) and happened not to be present any longer.
Now, it will fall back to using git-annex from PATH in such a case.
Which might fail due to some version incompatability, but still better
than a misleading error message.
Also made readProgramFile only read the file, not look for git-annex in
PATH as a fallback. That fallback may have confused Assistant.Upgrade,
which really wants the value from the file.
The journal read optimisation in aeca7c220 later got fixed in eedd73b84
to stage and commit any files that were left in the journal by a
previous git-annex run. That's necessary for the optimisation to work
correctly. But it also meant that alwayscommit=false started committing
the previous git-annex processes journalled changes, which defeated the
purpose of the config setting entirely.
So, disable the optimisation when alwayscommit=false, leaving the
files in the journal and not committing them. See my comments on the bug
report for why this seemed the best approach.
Also fixes a problem when annex.merge-annex-branches=false and there
are changes in the journal. That config indirectly prevents committing
the journal. (Which seems a bit odd given its name, but it always has..)
So, when there were changes in the journal, perhaps left there due to
alwayscommit=false being set before, the optimisation would prevent
git-annex from reading the journal files, and it would operate with out
of date information.
This change does impact git-annex config
eg "git annex config --set annex.addunlocked on"
will store "on" and new git-annex will understand that value, while
old git-annex will error:
git-annex: bad annex.addunlocked configuration in git annex config:
Parse failure: near "on"
That seems acceptable.
Not special remote configs that are only documented as =true or =false
however. Having git-annex support other values for those would break
backwards compatability when used with old versions of git-annex. And
older versions ignore invalid special remote configs.. That would not
be a good combination.
Git has an obnoxious special case in git config, a line "foo" is the same
as "foo = true". That means there is no way to examine the output of
git config and tell if it was run with --null or not, since a "foo"
in the first line could be such a boolean, or could be followed by its
value on the next line if --null were used.
So, rather than trying to do such a detection, track the style of config
at all the points where it's generated.
aeca7c2207 was predicated on the
assumption that updateTo would stage any journal files, but in one case
it did not actually do so. The test suite happened to expose the bug.
The only price paid is one additional MVar read per write to the journal.
Presumably writing a journal file dominiates over a MVar read time by
several orders of magnitude.
--batch does not get the speedup because then it needs to notice when
another process has made a change. Also made the assistant and other damon
modes bypass the optimisation, which would not help them anyway.
Improve git-annex's ability to find the path to its program, especially
when it needs to run itself in another repo to upgrade it.
Some parts of the code used readProgramFile, probably because I forgot that
programPath exists.
I noticed this when a git-annex auto-upgrade failed because it was running
git-annex upgrade --autoonly, but the code to run git-annex used
readProgramFile, which happened to point to an older build of git-annex.
This was originally added so that unannex could prevent the hook from
running while files were in a state that the hook would interpret as
old-style unlocked and so would lock.
Now that's gone, so the only thing the hook was preventing was two
pre-commit processes running simulantaneously. But such concurrency
is normal in git-annex and should not be a problem.
Does mean that .git/hooks/pre-commit-annex might run more concurrently,
that seems the only risk of it causing any problems.
This means it will still be a .git file when git-annex init runs. That's
ok, the repo probably contains no annexed objects yet, and even if it does,
git-annex init does not care if symlinks in the worktree don't point to the
objects.
I made init, at the end, run the conversion code. Not really necessary
because the next git-annex command could do it just as well. But, this
avoids commands that don't normally write to the repo needing to write to
it, which might avoid some problem or other, and seems worth avoiding
generally.
Since it was used on both worktree and .git/annex files, split into
multiple functions.
In passing, this also improves permissions of created directories in
.git/annex, using createAnnexDirectory on those.
It will create foo/.git/annex/, but not foo/.git/ and not foo/.
This will avoid it creating an empty path to a repo when a drive is
yanked out and the mount point goes away, for example.
git-annex config: Only allow configs be set that are ones git-annex
actually supports reading from repo-global config, to avoid confused users
trying to set other configs with this.
It's important that it be clear that it overrides a config, such that
reloading the git config won't change it, and in particular, setConfig
won't change it.
Most of the calls to changeGitConfig were actually after setConfig,
which was redundant and unncessary. So removed those.
The only remaining one, besides --debug, is in the handling of
repository-global config values. That one's ok, because the
way mergeGitConfig is implemented, it does not override any value that
is set in git config. If a value with a repo-global setting was passed
to setConfig, it would set it in the git config, reload the git config,
re-apply mergeGitConfig, and use the newly set value, which is the right
thing.
Fix serious regression in gcrypt and encrypted git-lfs remotes.
Since version 7.20200202.7, git-annex incorrectly stored content
on those remotes without encrypting it.
Problem was, Remote.Git enumerates all git remotes, including git-lfs
and gcrypt. It then dispatches to those. So, Remote.List used the
RemoteConfigParser from Remote.Git, instead of from git-lfs or gcrypt,
and that parser does not know about encryption fields, so did not
include them in the ParsedRemoteConfig. (Also didn't include other
fields specific to those remotes, perhaps chunking etc also didn't
get through.)
To fix, had to move RemoteConfig parsing down into the generate methods
of each remote, rather than doing it in Remote.List.
And a consequence of that was that ParsedRemoteConfig had to change to
include the RemoteConfig that got parsed, so that testremote can
generate a new remote based on an existing remote.
(I would have rather fixed this just inside Remote.Git, but that was not
practical, at least not w/o re-doing work that Remote.List already did.
Big ugly mostly mechanical patch seemed preferable to making git-annex
slower.)
* init --version: When the version given is one that automatically
upgrades to a newer version, use the newer version instead.
* Auto upgrades from older repo versions, like v5, now jump right to v8.
remoteAnnexConfig will avoid bugs like
a3a674d15b
Use now more generic remoteConfig in a couple places that built
non-annex config settings manually before.
Fix support for repositories tuned with annex.tune.branchhash1=true,
including --all not working and git-annex log not displaying anything for
annexed files.
* When git-annex is built with a ssh that does not support ssh connection
caching, default annex.sshcaching to false, but let the user override it.
* Improve warning messages further when ssh connection caching cannot
be used, to clearly state why.
A warning message is unsatisfying. But erroring out is too hard a failure,
especially since it may well work fine if the user has enabled passwordless
ssh.
I did think about falling back to one ssh connection at a time in this
case, but it would have needed a rework of every ssh call, which
seems far overboard for such a niche problem. There's no single place where
git-annex runs ssh, so no one place that it could block a concurrent call
on a semaphore. And, even if it did fall back to one ssh connection at a
time, it seems to me that doing so without warning the user about the
problem just invites bug reports like "git-annex is ignoring my -J2 and
only doing one download at a time". So a warning is needed, and I suppose
is good enough.
If git-credential has it cached and does not prompt, this will
unfortunately result in a brief flicker, as the displayed console
regions are hidden while running it and then re-displayed. Better than a
corrupted display.
Actually, I tried it and don't see a visible flicker, so probably only
over a slow ssh will it be apparent.
using git credential to get the password
One thing this doesn't do is wrap the password prompting inside the prompt
action. So with -J, the output can be a bit garbled.
getRemoteConfigPassedThrough was never returning anything, Typeable
prevented the type checker from noticing a dumb mistake.
parseRemoteConfig was not adding Accepted values as PassedThrough
preferreddir can be used with any special remote, so its parser needs to
be included in the commonFieldParsers.
initremote with uuid= changed to delete that field, so it does not
need to be included in commonFieldParsers. Note that, existing remotes
initialized before this change will have the field in remote.log.
This will not cause problems parsing, because the value will be
Accepted.
Grepping for 'Accepted "' found these, and I'm pretty sure this is all of
them.
Needed so Remote.External can query the external program for its
configs. When the external program does not support the query,
the passthrough option will make all input fields be available.
Remote now contains a ParsedRemoteConfig. The parsing happens when the
Remote is constructed, rather than when individual configs are used.
This is more efficient, and it lets initremote/enableremote
reject configs that have unknown fields or unparsable values.
It also allows for improved type safety, as shown in
Remote.Helper.Encryptable where things that used to match on string
configs now match on data types.
This is a work in progress, it does not build yet.
The main risk in this conversion is forgetting to add a field to
RemoteConfigParser. That will prevent using that field with
initremote/enableremote, and will prevent remotes that already are set
up from seeing that configuration. So will need to check carefully that
every field that getRemoteConfigValue is called on has been added to
RemoteConfigParser.
(One such case I need to remember is that credPairRemoteField needs to be
included in the RemoteConfigParser.)
This is a first step toward that goal, using the ProposedAccepted type
in RemoteConfig lets initremote/enableremote reject bad parameters that
were passed in a remote's configuration, while avoiding enableremote
rejecting bad parameters that have already been stored in remote.log
This does not eliminate every place where a remote config is parsed and a
default value is used if the parse false. But, I did fix several
things that expected foo=yes/no and so confusingly accepted foo=true but
treated it like foo=no. There are still some fields that are parsed with
yesNo but not not checked when initializing a remote, and there are other
fields that are parsed in other ways and not checked when initializing a
remote.
This also lays groundwork for rejecting unknown/typoed config keys.
Git will eventually switch to sha2 and there will not be one single
shaSize anymore, but two (40 and 64).
Changed all parsers for git plumbing output to support both sizes of
shas.
One potential problem this does not deal with is, if somewhere in
git-annex it reads two shas from different sources, and compares them
to see if they're the same sha, it would fail if they're sha1 and sha256
of the same value. I don't know if that will really be a concern.
smudge: When annex.largefiles=anything, files that were already stored in
git, and have not been modified could sometimes be converted to being
stored in the annex. Changes in 7.20191024 made this more of a problem.
This case is now detected and prevented.
* annex.addunlocked can be set to an expression with the same format used by
annex.largefiles, in case you want to default to unlocking some files but
not others.
* annex.addunlocked can be configured by git-annex config.
Added a git-annex-matching-expression man page, broken out from
tips/largefiles.
A tricky consequence of this is that git-annex add --relaxed
honors annex.addunlocked, but an expression might want to know the size
or content of an url, which it's not going to download. I decided it was
better not to fail, and just dummy up some plausible data in that case.
Performance impact should be negligible. The global config is already
loaded for annex.largefiles. The expression only has to be parsed once,
and in the simple true/false case, it should not do any additional work
matching it.
annex.largefiles can be configured by git-annex config, to more easily set
a default that will also be used by clones, without needing to shoehorn the
expression into the gitattributes file. The git config and gitattributes
override that.
Whenever something is added to git-annex config, we have to consider what
happens if a user puts a purposfully bad value in there. Or, if a new
git-annex adds some new value that an old git-annex can't parse.
In this case, a global annex.largefiles that can't be parsed currently
makes an error be thrown. That might not be ideal, but the gitattribute
behaves the same, and is almost equally repo-global.
Performance notes:
git-annex add and addurl construct a matcher once
and uses it for every file, so the added time penalty for reading the global
config log is minor. If the gitattributes annex.largefiles were deprecated,
git-annex add would get around 2% faster (excluding hashing), because
looking that up for each file is not fast. So this new way of setting
it is progress toward speeding up add.
git-annex smudge does need to load the log every time. As well as checking
the git attribute. Not ideal. Setting annex.gitaddtoannex=false avoids
both overheads.
Remove dup definitions and just use the RawFilePath one. </> etc are
enough faster that it's probably faster than building a String directly,
although I have not benchmarked.
the encode' and decode' functions on Windows should not apply the
filesystem encoding, which does not work there. Instead, convert to and
from UTF-8.
Also, avoid exporting encodeW8 and decodeW8. Both use the filesystem
encoding, so won't work as expected on windows.
My ByteString rewrite oversimplified it, resulting in any _ in a journal
file turning into a / in the git-annex branch, which was often the wrong
filename, or sometimes (//) an invalid filename that git
refused to add.
git-annex find is now RawFilePath end to end, no string conversions.
So is git-annex get when it does not need to get anything.
So this is a major milestone on optimisation.
Benchmarks indicate around 30% speedup in both commands.
Probably many other performance improvements. All or nearly all places
where a file is statted use RawFilePath now.
Adds a dependency on filepath-bytestring, an as yet unreleased fork of
filepath that operates on RawFilePath.
Git.Repo also changed to use RawFilePath for the path to the repo.
This does eliminate some RawFilePath -> FilePath -> RawFilePath
conversions. And filepath-bytestring's </> is probably faster.
But I don't expect a major performance improvement from this.
This is mostly groundwork for making Annex.Location use RawFilePath,
which will allow for a conversion-free pipleline.
Since the sqlite branch uses blobs extensively, there are some
performance benefits, ByteStrings now get stored and retrieved w/o
conversion in some cases like in Database.Export.
Only done on those calls to getFileStatus that had a RawFilePath, not a
FilePath. The others would probably be just as fast if converted to use
it with toRawFilePath, but I'm not 100% sure.
Note that genInodeCache' uses fromRawFilePath, but that value only gets
used on Windows, so on unix the thunk will never be evaluated.
This was already optimised before, but profiling found that delEntry was
around 1.5% of the total runtime of git-annex whereis. It was being
called once per environment variable per file processed.
Fixed by better caching. Since withIndexFile is almost always run with
the same .git/annex/index file, it can cache the modified environment,
rather than re-modifying it each time called.
The parser and looking up config keys in the map should both be faster
due to using ByteString.
I had hoped this would speed up startup time, but any improvement to
that was too small to measure. Seems worth keeping though.
Note that the parser breaks up the ByteString, but a config map ends up
pointing to the config as read, which is retained in memory until every
value from it is no longer used. This can change memory usage
patterns marginally, but won't affect git-annex.
Finally builds (oh the agoncy of making it build), but still very
unmergable, only Command.Find is included and lots of stuff is badly
hacked to make it compile.
Benchmarking vs master, this git-annex find is significantly faster!
Specifically:
num files old new speedup
48500 4.77 3.73 28%
12500 1.36 1.02 66%
20 0.075 0.074 0% (so startup time is unchanged)
That's without really finishing the optimization. Things still to do:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions.
* Use versions of IO actions like getFileStatus that take a RawFilePath.
* Eliminate some Data.ByteString.Lazy.toStrict, which is a slow copy.
* Use ByteString for parsing git config to speed up startup.
It's likely several of those will speed up git-annex find further.
And other commands will certianly benefit even more.
Goal is to make git-annex faster by using ByteString for all the
worktree traversal. For now, this is focusing on Command.Find,
in order to benchmark how much it helps. (All other commands are
temporarily disabled)
Currently in a very bad unbuildable in-between state.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
It's not necessary. And if the bare repo somehow has a pointer
file in it with the same name as a file in HEAD, that file would be
populated, which would be surprising since the file is not really under
git's control.
* git-lfs: The url provided to initremote/enableremote will now be
stored in the git-annex branch, allowing enableremote to be used without
an url. initremote --sameas can be used to add additional urls.
* git-lfs: When there's a git remote with an url that's known to be
used for git-lfs, automatically enable the special remote.
See the comment for a trace of the deadlock.
Added a new StartStage. New worker threads begin in the StartStage.
Once a thread is ready to do work, it moves away from the StartStage,
and no thread will ever transition back to it.
A thread that blocks waiting on another thread that is processing
the same key will block while in the StartStage. That other thread
will never switch back to the StartStage, and so the deadlock is avoided.
Convert Utility.Url to return Either String so the error message can be
displated in the annex monad and so captured.
(When curl is used, its errors are still not caught.)
Delete the old export dbs on upgrade.
Testing this an exporting to a directory with both exporttree=yes and
importtree=yes, it refused to let an interrupted export proceed after
upgrade, with "unsafe to overwrite file". An import resolved the
problem.
It will be populated automatically by the next command that needs data
from it, the same way it gets populated in a fresh clone. That may be a
little expensive, but it's a one time cost, and no slower than in a
fresh clone.
The old db is cleaned up when a new incremental fsck is started.
The incremental fsck won't pick up where the old one left off, but I
consider this a minor enough thing that it can just be documented and
won't be a problem.
Renamed the database to .git/annex/keysdb;
the old .git/annex/keys gets deleted during the upgrade.
It is possible that an old git-annex process is running during the
upgrade. If so, it will be able to continue using the old keys db until the
upgrade is complete, and then will presumably fail in some ugly way. Or
perhaps the upgrade will be unable to delete the open files on some
systems, and so fail with an ugly error message.
It's also possible for multiple processes to be running the upgrade
concurrently. That should be fine; they will both write the same
information into the keys db.
Other databases still need to be upgraded.
Fix bug that lost modifications to unlocked files when init is re-ran in an
already initialized repo.
In retrospect needing scanUnlockedFiles False in the direct mode upgrade
path was a good hint that it was unsafe when used with True.
However, this bug did not affect upgrade from v5. In such an upgrade, an
unlocked file that is modified is left as-is. The only place
scanUnlockedFiles True did overwrite modified unlocked files is during an
git-annex init of a repo that was already initialized by git-annex.
(I also tried a scenario where the repo had not been initialized by
git-annex yet, but was cloned from a v7 repo with an unlocked file, and the
pointer file replaced with some other content, and the data loss did not
occur in that situation.)
Since the fixed scanUnlockedFiles avoids overwriting non-pointer files,
it should be safe to run in any situation, so there's no need any longer
for the parameter.
This is a non-backwards compatable change, so not suitable for merging
w/o a annex.version bump and transition code. Not yet tested.
This improves performance of git-annex benchmark --databases
across the board by 10-25%, since eg Key roundtrips as a ByteString.
(serializeKey' produces a lazy ByteString, so there is still a
copy involved in converting it to a strict ByteString. It may be faster
to switch to using bytestring-strict-builder.)
FilePath and Key are both stored as blobs. This avoids mojibake in some
situations. It would be possible to use varchar instead, if persistent
could avoid converting that to Text, but it seems there is no good
way to do so. See doc/todo/sqlite_database_improvements.mdwn
Eliminated some ugly artifacts of using Read/Show serialization;
constructors and quoted strings are no longer stored in sqlite.
Renamed SRef to SSha to reflect that it is only ever a git sha,
not a ref name. Since it is limited to the characters in a sha,
it is not affected by mojibake, so still uses String.
When the submodule's parent repo has an adjusted unlocked branch,
it gets cloned by git, but git checks out master. git annex init then
fails because it wants to enter the adjusted branch, but:
adjusted branch adjusted/master(unlocked) already exists.
Aborting because that branch may have changes that have not yet reached master
Note that init actually then exits 0, leaving master checked out.
This could also happen, absent submodules, if the parent repo has
an adjusted unlocked branch, but it is not checked out. In the more common
case where that branch is checked out, the clone uses the same branch,
so no problem.
The choices to fix this:
* Init could delete the existing adjusted branch, and re-adjust.
But then running init inside an adjusted branch on a crippled filesystem
would lose any changes that have not been synced back to master.
* Init could sync any changes back to master, but that would be very surprising
behavior for it.
* Init could simply check out the existing adjusted branch. If the branch
is diverged from master, well, sync will sort that out later.
This mirrors the behavior of cloning a repo that has an adjusted branch
checked out that has not yet been synced back to master.
Picked this choice.
All that needs to be retained in remote.log is the sameas-uuid.
The rest of the config is eliminated. This doesn't save enough space to
bother with, but it prevents anything sensitive in the config of the
dead sameas remote from lingering around.
Note that minimizesameasdead does not update the VectorClock when
changing the log line. That's normally a no-no, but in this case,
it makes each DropDead result in the exact same file contents,
and vector clocks are not needed because the transition breaks
the history chain.
forget --drop-dead: Remove several classes of git-annex log files when they
become empty, further reducing the size of the git-annex branch.
Noticed while testing sameas uuid removal, but it could happen other times
too.
An empty log file is always treated by git-annex the same as no file
being present, and when the files are per-key, it can be a sizable space
saving to exclude them from the tree.
It would have been a lot less round-about to just make git annex dead
also add the uuids of sameas remotes to the trust.log as dead.
But, that would fail in the case where there's an unmerged other clone
that has a sameas remote that the current repo does not know about.
Then it would not get marked as dead.
Handling it at transition time avoids that scenario.
Note that the generation of trustmap' in dropDead should only
happen once, due to the partial application.
This solves the problem of sameas remotes trampling over per-remote
state. Used for:
* per-remote state, of course
* per-remote metadata, also of course
* per-remote content identifiers, because two remote implementations
could in theory generate the same content identifier for two different
peices of content
While chunk logs are per-remote data, they don't use this, because the
number and size of chunks stored is a common property across sameas
remotes.
External special remote had a complication, where it was theoretically
possible for a remote to send SETSTATE or GETSTATE during INITREMOTE or
EXPORTSUPPORTED. Since the uuid of the remote is typically generate in
Remote.setup, it would only be possible to pass a Maybe
RemoteStateHandle into it, and it would otherwise have to construct its
own. Rather than go that route, I decided to send an ERROR in this case.
It seems unlikely that any existing external special remote will be
affected. They would have to make up a git-annex key, and set state for
some reason during INITREMOTE. I can imagine such a hack, but it doesn't
seem worth complicating the code in such an ugly way to support it.
Unfortunately, both TestRemote and Annex.Import needed the Remote
to have a new field added that holds its RemoteStateHandle.
I found a way to avoid inheritance complicating anything outside of
Logs.Remote. It seems fine to require all inherited values to be
inherited and not set in the sameas remote's config. Since inherited
values will be used for stuff like encryption and perhaps chunking, which
control the actual content stored on the remote, it seems likely that
there will not be any reason to need them to vary between two remotes
that access the same underlying data store.
The newer version of containers is free; the minimum ghc version is
bundled with a newer version than that.
Initremote sets that, so after both initremote and enableremote,
the git config will be set.
Any remote that does not use Annex.SpecialRemote won't set
annex-config-uuid. But that's only Remote.Git, which doesn't use
RemoteConfig anyway.
initremote --sameas=remotename sets sameas-name and sameas-uuid
Using sameas-name rather than name prevents old git-annex initremote
from enabling a sameas remote by name, since it would not handle it
correctly.
When dropping an unlocked file, preserve its mtime, which avoids git status
unncessarily running the clean filter on the file.
If the index file has close to the same mtime as a work tree file, git will
not trust the index to be up-to-date, and re-runs the clean filter
unncessarily. Preserving the mtime when depopulating a pointer file avoids
git status doing a little (or maybe a lot) of unncessary work.
There are other places that the mtime could be preserved, including other
places where pointer files are written perhaps, but also
populatePointerFile. But, I don't know of cases where those lead to git
status doing unncessary work, so I just fixed the one I'm aware of for now.
Fix bug in handling of annex.largefiles that use largerthan/smallerthan.
When adding a modified file, it incorrectly used the file size of the old
version of the file, not the current size.
That was the only largefiles limit that didn't directly look at the file on
disk already. Added a new type to keep straight the two different ways such
a limit can be matched. I kind of wanted to extend MatchingFile or FileInfo
to indicate that the matcher is supposed to operate on files from disk or
annex, but it turned out to be too complex to implement it that way.
This also changes the LimitAnnexFiles case when lookupFileKey does not find
a key. It used to fall back to statting the file, now it always returns
False. I doubt the old code could really get to that point, but if it
somehow does, it's better for preferred content matching to be consistent.
debian oldoldstable has 2.1, and that's what i386ancient uses. It would be
better to require git 2.2, which is needed to use adjusted branches, but
can't do that w/o losing support for some old linux kernels or a
complicated git backport.
This brings back .git/annex/misctmp, but only for init. If an init
is interrupted while probing using that temp directory, the files it left
will get deleted 1 week later by a subsequent git-annex run.
Can be set to false to prevent any automatic repository upgrades.
Also, removed direct mode specific upgrade code in Annex.Init, and made
needsUpgrade always include the name/path of the repo, so if
there's a problem it's clear what repo has the problem.
And, made needsUpgrade catch any exceptions that might occur during the
upgrade, so it can display a more useful error message than just the
exception.
No longer used. The only possible user of it would be code in
Upgrade.V5, so I verified that the parts of Annex.Content it used were
not used to manipulate direct mode files.
Three reasons:
* Committing as part of an upgrade is very unusual and unexpected.
* The commit was failing with a weird error message when done during an
automatic upgrade.
* Let me remove more of that sweet^Whorrible direct mode code.
* Automatically convert direct mode repositories to v7 with adjusted
unlocked branches and set annex.thin.
* init: When run on a crippled filesystem with --version=5,
will error out, since version 7 is needed for adjusted unlocked branch.
* direct: This command always errors out as direct mode is no longer
supported.
* indirect: This command has become a deprecated noop.
* proxy: This command is deprecated because it was only needed in direct
mode. (But it continues to work.)
Also removed mentions of direct mode throughough the documentation.
I have not removed all the direct mode code yet.
When upgrading a direct mode repo to v7 with adjusted unlocked branches,
fix a bug that prevented annex.thin from taking effect for the files in
working tree.
The hard links used to be ok, but commit 8e22114735 accidentially
broke them. It repopulates the worktree file, which is already a hard link,
and when it's creating the new file, the link count is already 2, and so it
doesn't make a hard link then.
Rather than direct mode, which this is a small step on the path to
removing.
Init on a crippled filesystem already used v7 adjusted branches,
and like that, this doesn't pose any interoperability issues with old
versions of git-annex that clone the same repo, because files are only
unlocked on the adjusted branch.
Turns out that 7be690f326 broke the
test suite on the i386ancient builder. There, git show-ref --verify HEAD
fails with "'HEAD' - not a valid ref". Apparently git 2.1.4 didn't
support that.
headExists works there and does the same thing.
This code was already AGPL, except for the bit split out
to Utility/MD5.hs in commit 426053cb6c.
That commit accidentially updated the license of this file from AGPL
to GPL.
Thanks to Sean Whitton for spotting this.
In 40ecf58d4b I changed the license of code I
wrote from GPL to AGPL. But, two files containing code I wrote combined
with code by others were updated to say their license is AGPL, while in
fact part of it was (the code I wrote) but part remained under the original
license (the code written by others).
Remote/Ddar.hs is now changed entirely back to GPL 3.
Annex/DirHashes.hs stays AGPL, but I broke out Utility/MD5.hs with the code
not written by me, and corrected its license statement to GPL-2, which
is the actual version of the GPL included with the code in its original
distribution at http://www.cs.ox.ac.uk/people/ian.lynagh/md5/
I made some improvements to its API after splitting it out of git-annex,
so merge those back in.
This is groundwork for removing the embedded copy of it and depending on
it.
Also moved the managerResponseTimeout disabling to Annex.Url as it's
git-annex specific.
This commit was sponsored by Ethan Aubin on Patreon.
Support running v7 upgrade in a repo where there is no branch checked out,
but HEAD is set directly to some other ref.
This commit was sponsored by Jack Hill on Patreon.
Drop support for building with ghc older than 8.4.4, and with older
versions of serveral haskell libraries than will be included in Debian 10.
The only remaining version ifdefs in the entire code base are now a couple
for aws!
This commit should only be merged after the Debian 10 release.
And perhaps it will need to wait longer than that; it would make
backporting new versions of git-annex to Debian 9 (stretch) which
has been actively happening as recently as this year.
This commit was sponsored by Ilya Shlyakhter.
No behavior changes, but this shows everywhere that a progress meter
could be displayed when hashing a file to add to the annex.
Many of the places don't make sense to display a progress meter though,
eg when importing the copy of the file probably swamps the hashing of
the file.
init: Fix a reversion in the last release that prevented automatically
generating and setting a description for the repository.
Seemed best to factor out uuidDescMapRaw that does not
have the default mempty descrition behavior.
I don't much like that behavior, but I know things depend on it.
One thing in particular is `git annex info` which lists the uuids and
descriptions; if the current repo has been initialized in some way that
means it does not have a description, it would not show up w/o that.
(Not only repos created due to this bug might lack that. For example a repo
that was marked dead and had --drop-dead delete its git-annex branch info,
and then came back from the dead would similarly not be in the uuid.log.
Also there have been other versions of git-annex that didn't set a default
description; for years there was no default description.)
Avoid an unncessary STM transaction. This will happen when the worker
pool is not completely full of the new stage, which is the common case.
In the uncommon case, this adds only a tiny bit of overhead for the
extra traversal of the worker pool. And the thread is going to block
for some time anyway.
When all worker threads are running and enteringStage is called,
it waits for an idle slot. If all off the other threads then call it in
turn, a deadlock occurrs.
This is the same problem I didn't actually fix in
5a9842d7ed.
Fixed by doing two separate STM transactions, the first replaces its
active thread with an idle thread, and the second waits for another idle
thread. That guarantees there will eventually be an idle thread to find.
The changes to WorkerPool were necessary because it can't add an idle
thread containing the Annex state and go on to run an action using that
same state, so I had to remove the Annex state from IdleWorker.
This means that Command.Move and Command.Get don't need to
manually set the stage, and is a lot cleaner conceptually.
Also, this makes Command.Sync.syncFile use the worker pool better.
In the scenario where it first downloads content and then uploads it to
some other remotes, it will start in TransferStage, then enter VerifyStage
and then go back to TransferStage for each transfer to the remotes.
Before, it entered CleanupStage after the download, and stayed in it for
the upload, so too many transfer jobs could run at the same time.
Note that, in Remote.Git, it uses runTransfer and also verifyKeyContent
inside onLocal. That has a Annex state for the remote, with no worker pool.
So the resulting calls to enteringStage won't block in there.
While Remote.Git.copyToRemote does do checksum verification, I
realized that should not use a verification slot in the WorkerPool
to do it. Because, it's reading back from eg, a removable disk to checksum.
That will contend with other writes to that disk. It's best to treat
that checksum verification as just part of the transer. So, removed the todo
item about that, as there's nothing needing to be done.
Rather than limiting it to PerformStage and CleanupStage, this opens it
up so any number of stages can be added as needed by commands.
Each concurrent command has a set of stages that it uses, and only
transitions between those can block waiting for a free slot in the
worker pool. Calling enteringStage for some other stage does not block,
and has very little overhead.
Note that while before the Annex state was duplicated on the first call
to commandAction, this now happens earlier, in startConcurrency.
That means that seek stage actions should that use startConcurrency
and then modify Annex state won't modify the state of worker threads
they then start. I audited all of them, and only Command.Seek
did so; prepMerge changes the working directory and so has to come
before startConcurrency.
Also, the remote list is built before duplicating the state, which means
that it gets built earlier now than it used to. This would only have an
effect of making commands that end up not needing to perform any actions
unncessary build the remote list (only when they're run with concurrency
enable), but that's a minor overhead compared to commands seeking
through the work tree and determining they don't need to do anything.
The hoped for optimisation of CommandStart with -J did not materialize.
In fact, not runnign CommandStart in parallel is slower than -J3.
So, CommandStart are still run in parallel.
(The actual bad performance I've been seeing with -J in my big repo
has to do with building the remoteList.)
But, this is still progress toward making -J faster, because it gets rid
of the onlyActionOn roadblock in the way of making CommandCleanup jobs
run separate from CommandPerform jobs.
Added OnlyActionOn constructor for ActionItem which fixes the
onlyActionOn breakage in the last commit.
Made CustomOutput include an ActionItem, so even things using it can
specify OnlyActionOn.
In Command.Move and Command.Sync, there were CommandStarts that used
includeCommandAction, so output messages, which is no longer allowed.
Fixed by using startingCustomOutput, but that's still not quite right,
since it prevents message display for the includeCommandAction run
inside it too.
The goal is to be able to run CommandStart in the main thread when -J is
used, rather than unncessarily passing it off to a worker thread, which
incurs overhead that is signficant when the CommandStart is going to
quickly decide to stop.
To do that, the message it displays needs to be displayed in the worker
thread, after the CommandStart has run.
Also, the change will mean that CommandStart will no longer necessarily
run with the same Annex state as CommandPerform. While its docs already
said it should avoid modifying Annex state, I audited all the
CommandStart code as part of the conversion. (Note that CommandSeek
already sometimes runs with a different Annex state, and that has not been
a source of any problems, so I am not too worried that this change will
lead to breakage going forward.)
The only modification of Annex state I found was it calling
allowMessages in some Commands that default to noMessages. Dealt with
that by adding a startCustomOutput and a startingUsualMessages.
This lets a command start with noMessages and then select the output it
wants for each CommandStart.
One bit of breakage: onlyActionOn has been removed from commands that used it.
The plan is that, since a StartMessage contains an ActionItem,
when a Key can be extracted from that, the parallel job runner can
run onlyActionOn' automatically. Then commands won't need to worry about
this detail. Future work.
Otherwise, this was a fairly straightforward process of making each
CommandStart compile again. Hopefully other behavior changes were mostly
avoided.
In a few cases, a command had a CommandStart that called a CommandPerform
that then called showStart multiple times. I have collapsed those
down to a single start action. The main command to perhaps suffer from it
is Command.Direct, which used to show a start for each file, and no
longer does.
Another minor behavior change is that some commands used showStart
before, but had an associated file and a Key available, so were changed
to ShowStart with an ActionItemAssociatedFile. That will not change the
normal output or behavior, but --json output will now include the key.
This should not break it for anyone using a real json parser.
When running multiple concurrent actions, the cleanup phase is run in a
separate queue than the main action queue. This can make some commands
faster, because less time is spent on bookkeeping in between each file
transfer.
But as far as I can see, nothing will be sped up much by this yet, because
all the existing cleanup actions are very light-weight. This is just groundwork
for deferring checksum verification to cleanup time.
This change does mean that if the user expects -J2 will mean that they see no
more than 2 jobs running at a time, they may be surprised to see 4 in some
cases (if the cleanup actions are slow enough to notice).
It might also make sense to enable background cleanup without the -J,
for at least one cleanup action. Indeed, that's the behavior that -J1
has now. At some point in the future, it make make sense to make the
behavior with no -J the same as -J1. The only reason it's not currently
is that git-annex can build w/o concurrent-output, and also any bugs
in concurrent-output (such as perhaps misbehaving on non-VT100 compatible
terminals) are avoided by default by only using it when -J is used.
Add back support for ftp urls, which was disabled as part of the fix for
security hole CVE-2018-10857 (except for configurations which enabled curl
and bypassed public IP address restrictions). Now it will work if allowed
by annex.security.allowed-ip-addresses.
Renamed annex.security.allowed-http-addresses to
annex.security.allowed-ip-addresses because it is not really specific to
the http protocol, also limiting eg, git-annex's use of ftp and via
youtube-dl, several other protocols.
The old name for the config will still work.
If both old and new name are set, the new name will win.
* init: When the repository already has a description, don't change it.
* describe: When run with no description parameter it used to set
the description to "", now it will error out.
Importing from a special remote honors its preferred content too; unwanted
files are not imported. But, some preferred content expressions can't be
checked before files are imported, and trying to import with such an
expression will fail.
Tested this with scenarios including changing the preferred content
expression and making sure merging the import didn't delete files that were
no longer wanted.
There was one minor inefficiency mentioned in the todo that I punted on.
Make the import have the previous import as a parent, so eg `git log --stat`
displays a useful diff.
Also a minor optimisation, only calculate the depth of the imported history
once.
Prevents merging the import from deleting the non-preferred files from
the branch it's merged into.
adjustTree previously appended the new list of items to the old, which
could result in it generating a tree with multiple files with the same
name. That is not good and confuses some parts of git. Gave it a
function to resolve such conflicts.
That allowed dealing with the problem of what happens when the import
contains some files (or subtrees) with the same name as files that were
filtered out of the export. The files from the import win.
The filtering is fairly efficient as far as building the trees goes,
since it reuses adjustTree. But it still needs to traverse the whole
tree, and look up the keys used by every file.
The tree that gets recorded to export.log is the filtered tree.
This way resumes of interrupted sync to an export uses it without
needing to recalculate it. And, a change to the preferred content
settings of the remote will result in a different tree, so the export
will be updated accordingly.
The original tree is still used in the remote tracking branch.
That branch represents the special remote as a git remote, and if it
were a normal git remote, the tree in its head would not be affected by
preferred content.
Only when the preferred content expression includes them will a parse
failure due to them needing keys result in the preferred content
expression not parsing in keyless mode.
This will let import try to match preferred content expressions before
downloading the content and generating its key.
If an expression needs a key, it preferredContentParser with
preferredContentKeylessTokens will fail to parse it.
standard and groupwanted are not in preferredContentKeylessTokens
because they may refer to an expression that refers to a key.
That needs further work to support them.
Added the ability to run one job per CPU (core), by setting annex.jobs=cpus,
or using option --jobs=cpus or -Jcpus.
Built with future expansion in mind, including not defaulting matching on
Concurrency so more constructors can later be added, and using "cpu"
instead of "0".
Fixes bug that caused git-annex to fail to add a file when another
git-annex process cleaned up the temp directory it was using.
Solution is just to push withOtherTmp out to a higher level, so that
the whole ingest process can be completed inside it.
But in the assistant, that was not practical to do, since withOtherTmp runs
in the Annex monad and the assistant does not. Worked around by introducing
a separate temp directory that only the assistant uses for lockdown.
Since only one assistant can run at a time, it's easy to clean up that
directory of old cruft at startup.
This is only done for correctness sake; I don't see any way that it
would have caused a problem here. The jlog file escaped withOtherTmp
so another process could swoop in and delete it, but the file is only
used as a buffer for a list of filenames, and its handle gets rewound
and they're read back out, which will still work even if it's already
been deleted.
The only reason I didn't just pre-delete the file and keep the handle
open is I'm not sure that works on all OS's (eg Windows). If there was
a problem that this fixed it might involve an OS that doesn't support
deleting an open file or something like that.
bf7ecd6892 went too far and broke
importing, the old tree was used on the remote tracking branch and not
the newly imported tree.
Test suite noticed the problem luckily.
Fix reversion in last release that caused wrong tree to be written to
remote tracking branch after an export of a subtree.
The invariant "commitsha should have the treesha as its tree"
was not met due to a bug. Guarantee it's met by catting the commitsha
to find its actual tree. A little bit slower, but this is not run often.
This way no history is lost, neither what was exported to the remote,
or the history of changes that is imported from it. No complicated
correlation of two possibly very different histories is needed, just
record what we know and then git merge will do a good job.
Also, it notices when the remote tracking branch doesn't need to be updated,
and avoids doing anything, so noop remotes are super cheap.
The only catch here is that, since the commits generated for imports
from the remote don't have a stable date or author/committer, each
(non-noop) import generates different commits for the same imported
trees. So, when the imported remote tracking branch is merged into master
and then a change is imported again, there will be an extra series of
commits, which will get more and more expensive each time.
This seems to call for making stable commits for imports. Also that
seems a good idea to make importing in several repositories have the
same result.
* Added mimeencoding= term to annex.largefiles expressions.
This is probably mostly useful to match non-text files with eg
"mimeencoding=binary"
* git-annex matchexpression: Added --mimeencoding option.
Switch listContents to being a proper CommandStart, so if it throws an
exception, it will be treated like any other command action that fails.
downloadImport apparently does not ever throw an exception,
and itself uses commandAction, so it can't be a CommandStart.
Fix bug that caused importing from a special remote to repeatedly download
unchanged files when multiple files in the remote have the same content.
Unfortunately, there's really no good way to remove a uniqueness constraint
from a sqlite database. The best that can be done is to make a new table
and copy the data over. But that would require using persistent's
migrations or raw sql, and I don't want to do either.
Instead, a sledgehammer approach: Renamed .git/annex/cid to
.git/annex/cids. When the new database doesn't exist, it will be populated
from the git-annex branch.
Noting deletes the old database. Don't want to delete it out from under
some long-running git-annex process that might be using it. It could
eventually be deleted. But this is such a new feature, probably few repos
have the database in any case.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
untested
This won't be super slow, but it does need to diff two likely large
trees, and since the git-annex branch rarely sits still, it will most
likely be run at the beginning of every import.
A possible speed improvement would be to only run this when the database
did not contain a ContentIdentifier. But that would only speed up
imports when there is no new version of a file on the special remote,
at most renames of existing files being imported.
A better speed improvement would be to record something in the git-annex
branch that indicates when an import has been run, and only do the diff
if the git-annex branch has record of a newer import than we've seen
before. Then, it would only run when there is in fact new
ContentIdentifier information available from a remote. Certianly doable,
but didn't want to complicate things yet.
For now, it's only allowed when exporttree=yes is also set.
That simplified the implementation, but could later be changed if
there's a remote that makes sense to be an import but not an export.
However, it may work just as well to make a remote be readonly to
prevent export to it while still allowing import.
This does not avoid all possible races, but it does avoid all likely
ones, and is demonstratably better than git's own handling of races
where files get modified at the same time as it's updating the working
tree.
The main thing this won't detect are not unlikely races where part
of a file gets changed while it's being copied and then the file is
restored to its original condition before the modification check.
No, it's more likely that the limitations of checking inode, size,
and mtime won't detect certian modifications, involving eg mmapped
files.
Made some api changes.
listImportableContents needs to provide the size
of the data, so the downloader can check disk free space.
retrieveExportWithContentIdentifier is passed the filepath to write to
Use temporary "CID" key during download of a ContentIdentifier from a
remote, so withTmp can be used and then move the content to the real key
once it's known.