This needs the content to be present in order to hash it. But it's not
possible for a module used by Backend.URL to call inAnnex because that
would entail a dependency loop. So instead, rely on the fact that
Command.Migrate calls inAnnex before performing a migration.
But, Command.ExamineKey calls fastMigrate and the key may or may not
exist, and it's not wanting to actually perform a migration in any case.
To handle that, had to add an additional value to fastMigrate to
indicate whether the content is inAnnex.
Factored generateEquivilantKey out of Remote.Web.
Note that migrateFromURLToVURL hardcodes use of the SHA256E backend.
It would have been difficult not to, given all the dependency loop
issues. But --backend and annex.backend are used to tell git-annex
migrate to use VURL in any case, so there's no config knob that
the user could expect to configure that.
Sponsored-by: Brock Spratlen on Patreon
Considerable difficulty to work around an import cycle. Had to move the
list of backends (except for VURL) to Backend.Variety to VURL could use
it.
Sponsored-by: Kevin Mueller on Patreon
When downloading a VURL from the web, make sure that the equivilant key
log is populated.
Unfortunately, this does not hash the content while it's being
downloaded from the web. There is not an interface in Backend currently
for incrementally hash generation, only for incremental verification of an
existing hash. So this might add a noticiable delay, and it has to show
a "(checksum...") message. This could stand to be improved.
But, that separate hashing step only has to happen on the first download
of new content from the web. Once the hash is known, the VURL key can have
its hash verified incrementally while downloading except when the
content in the web has changed. (Doesn't happen yet because
verifyKeyContentIncrementally is not implemented yet for VURL keys.)
Note that the equivilant key log file is formatted as a presence log.
This adds a tiny bit of overhead (eg "1 ") per line over just listing the
urls. The reason I chose to use that format is it seems possible that
there will need to be a way to remove an equivilant key at some point in
the future. I don't know why that would be necessary, but it seemed wise
to allow for the possibility.
Downloads of VURL keys from other special remotes that claim urls,
like bittorrent for example, does not popilate the equivilant key log.
So for now, no checksum verification will be done for those.
Sponsored-by: Nicholas Golder-Manning on Patreon
Converted warning and similar to use StringContainingQuotedPath. Most
warnings are static strings, some do refer to filepaths that need to be
quoted, and others don't need quoting.
Note that, since quote filters out control characters of even
UnquotedString, this makes all warnings safe, even when an attacker
sneaks in a control character in some other way.
When json is being output, no quoting is done, since json gets its own
quoting.
This does, as a side effect, make warning messages in json output not
be indented. The indentation is only needed to offset warning messages
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brett Eisenberg on Patreon
The temporary URL key used for the download, before the real key is
generated, was blocked by annex.securehashesonly.
Fixed by passing the Backend that will be used for the final key into
runTransfer. When a Backend is provided, have preCheckSecureHashes
check that, rather than the key being transferred.
Sponsored-by: unqueued on Patreon
All backends use genKeyName, which uses preSanitizeKeyName, which will
escape newlines.
I suspect this newline prevention was added back when git-annex was
reading the output of sha1sum..
Sponsored-by: Kevin Mueller on Patreon
This adds the overhead of a copy when serializing and deserializing keys.
I have not benchmarked much, but runtimes seem barely changed at all by that.
When a lot of keys are in memory, it improves memory use.
And, it prevents keys sometimes getting PINNED in memory and failing to GC,
which is a problem ByteString has sometimes. In particular, git-annex sync
from a borg special remote had that problem and this improved its memory
use by a large amount.
Sponsored-by: Shae Erisson on Patreon
The goal is that Database.Keys be able to use it; it can't use
Annex.Content.Presence due to an import loop.
Several other things also needed to be moved to Annex.Verify as a
conseqence.
Checksum as content is received from a remote git-annex repository, rather
than doing it in a second pass.
Not tested at all yet, but I imagine it will work!
Not implemented for any special remotes, and also not implemented for
copies from local remotes. It may be that, for local remotes, it will
suffice to use rsync, rely on its checksumming, and simply return Verified.
(It would still make a checksumming pass when cp is used for COW, I guess.)
This is groundwork for external backends, but also makes sense to keep
this information with the rest of a Backend's implementation.
Also, removed isVerifiable. I noticed that the same information is
encoded by whether a Backend implements verifyKeyContent or not.
retrieveExport is part of ongoing transition to make remote methods
throw exceptions, rather than silently hide them.
getKey very rarely fails, and when it does it's always for the same reason
(user configured annex.backend to url for some reason). So, this will
avoid dealing with Nothing everywhere it's used.
This commit was sponsored by Ilya Shlyakhter on Patreon.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
Previously attempted in 4536c93bb2
and reverted in 96aba8eff7.
The problems mentioned in the latter commit are addressed now:
Read/Show of KeyData is backwards-compatible with Read/Show of Key from before
this change, so Types.Distribution will keep working.
The Eq instance is fixed.
Also, Key has smart constructors, avoiding needing to remember to update
the cached serialization.
Used git-annex benchmark:
find is 7% faster
whereis is 3% faster
get when all files are already present is 5% faster
Generally, the benchmarks are running 0.1 seconds faster per 2000 files,
on a ram disk in my laptop.
No behavior changes, but this shows everywhere that a progress meter
could be displayed when hashing a file to add to the annex.
Many of the places don't make sense to display a progress meter though,
eg when importing the copy of the file probably swamps the hashing of
the file.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
This reverts commit 4536c93bb2.
That broke Read/Show of a Key, and unfortunately Key is read in at least
one place; the GitAnnexDistribution data type.
It would be worth bringing this optimisation back, but it would need
either a custom Read/Show instance that preserves back-compat, or
wrapping Key in a data type that contains the serialization, or changing
how GitAnnexDistribution is serialized.
Also, the Eq instance would need to compare keys with and without a
cached seralization the same.
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.
It means that every place a Key has any of its fields changed, the cache
has to be dropped. I've grepped and found them all. But, it would be
better to avoid that gotcha somehow..
It takes a single key-value backend, rather than the unncessary and confusing list.
The old option still works if set.
Simplified some old old code too.
This commit was sponsored by Thomas Hochstein on Patreon.
Where before the "name" of a key and a backend was a string, this makes
it a concrete data type.
This is groundwork for allowing some varieties of keys to be disabled
in file2key, so git-annex won't use them at all.
Benchmarks ran in my big repo:
old git-annex info:
real 0m3.338s
user 0m3.124s
sys 0m0.244s
new git-annex info:
real 0m3.216s
user 0m3.024s
sys 0m0.220s
new git-annex find:
real 0m7.138s
user 0m6.924s
sys 0m0.252s
old git-annex find:
real 0m7.433s
user 0m7.240s
sys 0m0.232s
Surprising result; I'd have expected it to be slower since it now parses
all the key varieties. But, the parser is very simple and perhaps
sharing KeyVarieties uses less memory or something like that.
This commit was supported by the NSF-funded DataLad project.
Backend.lookupFile is changed to always fall back to catKey when
operating on a file that's not a symlink.
catKey is changed to understand pointer files, as well as annex symlinks.
Before, catKey needed a file mode witness, to be sure it was looking at a
symlink. That was complicated stuff. Now, it doesn't actually care if a
file in git is a symlink or not; in either case asking git for the content
of the file will get the pointer to the key.
This does mean that git-annex will treat a link
foo -> WORM--bar as a git-annex file, and also treats
a regular file containing annex/objects/WORM--bar as a git-annex file.
Calling catKey could make git-annex commands need to do more work than
before. This would especially be the case if a repo contained many regular
files, and only a few annexed files, as now git-annex will need to ask
git about the contents of the regular files.
The content of unstable keys can potentially be different in different
repos, so eg, resuming a chunked upload started by another repo would
corrupt data.
Needed for resuming from chunks.
Url keys are considered not stable. I considered treating url keys with a
known size as stable, but just don't feel that is enough information.
Only fsck and reinject and the test suite used the Backend, and they can
look it up as needed from the Key. This simplifies the code and also speeds
it up.
There is a small behavior change here. Before, all commands would warn when
acting on an annexed file with an unknown backend. Now, only fsck and
reinject show that warning.
SHA3 is still waiting for final standardization.
Although this is looking less likely given
https://www.cdt.org/blogs/joseph-lorenzo-hall/2409-nist-sha-3
In the meantime, cryptohash implements skein, and it's used by some of the
haskell ecosystem (for yesod sessions, IIRC), so this implementation is
likely to continue working. Also, I've talked with the cryprohash author
and he's a reasonable guy.
It makes sense to have an alternate high security hash, in case some
horrible attack is found against SHA2 tomorrow, or in case SHA3 comes out
and worst fears are realized.
I'd also like to support using skein for HMAC. But no hurry there and
a new version of cryptohash has much nicer HMAC code, so I will probably
wait until I can use that version.
This avoids commit churn by the assistant when eg,
replacing a file with a symlink.
But, just as importantly, it prevents the working tree being left with a
deleted file if git-annex, or perhaps the whole system, crashes at the
wrong time.
(It also probably avoids confusing displays in file managers.)
Refactored annex link code into nice clean new library.
Audited and dealt with calls to createSymbolicLink.
Remaining calls are all safe, because:
Annex/Link.hs: ( liftIO $ createSymbolicLink linktarget file
only when core.symlinks=true
Assistant/WebApp/Configurators/Local.hs: createSymbolicLink link link
test if symlinks can be made
Command/Fix.hs: liftIO $ createSymbolicLink link file
command only works in indirect mode
Command/FromKey.hs: liftIO $ createSymbolicLink link file
command only works in indirect mode
Command/Indirect.hs: liftIO $ createSymbolicLink l f
refuses to run if core.symlinks=false
Init.hs: createSymbolicLink f f2
test if symlinks can be made
Remote/Directory.hs: go [file] = catchBoolIO $ createSymbolicLink file f >> return True
fast key linking; catches failure to make symlink and falls back to copy
Remote/Git.hs: liftIO $ catchBoolIO $ createSymbolicLink loc file >> return True
ditto
Upgrade/V1.hs: liftIO $ createSymbolicLink link f
v1 repos could not be on a filesystem w/o symlinks
Audited and dealt with calls to readSymbolicLink.
Remaining calls are all safe, because:
Annex/Link.hs: ( liftIO $ catchMaybeIO $ readSymbolicLink file
only when core.symlinks=true
Assistant/Threads/Watcher.hs: ifM ((==) (Just link) <$> liftIO (catchMaybeIO $ readSymbolicLink file))
code that fixes real symlinks when inotify sees them
It's ok to not fix psdueo-symlinks.
Assistant/Threads/Watcher.hs: mlink <- liftIO (catchMaybeIO $ readSymbolicLink file)
ditto
Command/Fix.hs: stopUnless ((/=) (Just link) <$> liftIO (catchMaybeIO $ readSymbolicLink file)) $ do
command only works in indirect mode
Upgrade/V1.hs: getsymlink = takeFileName <$> readSymbolicLink file
v1 repos could not be on a filesystem w/o symlinks
Audited and dealt with calls to isSymbolicLink.
(Typically used with getSymbolicLinkStatus, but that is just used because
getFileStatus is not as robust; it also works on pseudolinks.)
Remaining calls are all safe, because:
Assistant/Threads/SanityChecker.hs: | isSymbolicLink s -> addsymlink file ms
only handles staging of symlinks that were somehow not staged
(might need to be updated to support pseudolinks, but this is
only a belt-and-suspenders check anyway, and I've never seen the code run)
Command/Add.hs: if isSymbolicLink s || not (isRegularFile s)
avoids adding symlinks to the annex, so not relevant
Command/Indirect.hs: | isSymbolicLink s -> void $ flip whenAnnexed f $
only allowed on systems that support symlinks
Command/Indirect.hs: whenM (liftIO $ not . isSymbolicLink <$> getSymbolicLinkStatus f) $ do
ditto
Seek.hs:notSymlink f = liftIO $ not . isSymbolicLink <$> getSymbolicLinkStatus f
used to find unlocked files, only relevant in indirect mode
Utility/FSEvents.hs: | Files.isSymbolicLink s = runhook addSymlinkHook $ Just s
Utility/FSEvents.hs: | Files.isSymbolicLink s ->
Utility/INotify.hs: | Files.isSymbolicLink s ->
Utility/INotify.hs: checkfiletype Files.isSymbolicLink addSymlinkHook f
Utility/Kqueue.hs: | Files.isSymbolicLink s = callhook addSymlinkHook (Just s) change
all above are lower-level, not relevant
Audited and dealt with calls to isSymLink.
Remaining calls are all safe, because:
Annex/Direct.hs: | isSymLink (getmode item) =
This is looking at git diff-tree objects, not files on disk
Command/Unused.hs: | isSymLink (LsTree.mode l) = do
This is looking at git ls-tree, not file on disk
Utility/FileMode.hs:isSymLink :: FileMode -> Bool
Utility/FileMode.hs:isSymLink = checkMode symbolicLinkMode
low-level
Done!!
The expensive scan uses lookupFile, but in direct mode, that doesn't work
for files that are present. So the scan was not finding things that are
present that need to be uploaded. (It did find things not present that
needed to be downloaded.)
Now lookupFile also works in direct mode. Note that it still prefers
symlinks on disk to info committed to git, in direct mode. This is
necessary to make things like Assistant.Threads.Watcher.onAddSymlink
work correctly, when given a new symlink not yet checked into git (or
replacing a file checked into git).
Now there's a Config type, that's extracted from the git config at startup.
Note that laziness means that individual config values are only looked up
and parsed on demand, and so we get implicit memoization for all of them.
So this is not only prettier and more type safe, it optimises several
places that didn't have explicit memoization before. As well as getting rid
of the ugly explicit memoization code.
Not yet done for annex.<remote>.* configuration settings.
Kqueue needs to remember which files failed to be added due to being open,
and retry them. This commit gets the data in place for such a retry thread.
Broke KeySource out into its own file, and added Eq and Ord instances
so it can be stored in a Set.