Commit graph

473 commits

Joey Hess
aaacf431d8
handle importtree=yes config
For now, it's only allowed when exporttree=yes is also set.
That simplified the implementation, but could later be changed if
there's a remote that makes sense as an import but not an export.
However, it may work just as well to make a remote readonly to
prevent export to it while still allowing import.
2019-03-04 16:07:35 -04:00
Joey Hess
88ccfaa78c
storeExportWithContentIdentifierM for directory special remote
Not sure if my reasoning about the races really holds.

It would certainly be possible to better guard against races by using
Linux-specific renameat2 with RENAME_EXCHANGE or RENAME_NOREPLACE.

Or by using link and relying on it not overwriting existing files -- but
that would need a filesystem that supports hard links, and the directory
special remote can be used on filesystems that don't.
2019-03-04 14:46:25 -04:00
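
A minimal sketch of the link-based idea from the commit above, assuming a
hypothetical linkNoReplace helper (not git-annex's actual code); link(2)
fails with EEXIST when the destination already exists, so it never
replaces a file:

import Control.Exception (IOException, try)
import System.IO.Error (isAlreadyExistsError)
import System.Posix.Files (createLink)

-- Publish a tmp file at its final location without ever replacing an
-- existing file: createLink fails when dest already exists.
linkNoReplace :: FilePath -> FilePath -> IO Bool
linkNoReplace tmp dest = do
    r <- try (createLink tmp dest) :: IO (Either IOException ())
    case r of
        Right () -> return True
        Left e
            | isAlreadyExistsError e -> return False
            | otherwise -> ioError e
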
Joey Hess
e2e57f8556
initial export support for directory special remote
This does not guard against race conditions yet; it's only for testing
purposes.
2019-02-27 13:42:34 -04:00
Joey Hess
45aacd888b
import downloader complete (untested)
Made some api changes.

listImportableContents needs to provide the size
of the data, so the downloader can check disk free space.

retrieveExportWithContentIdentifier is passed the filepath to write to.

Use a temporary "CID" key during download of a ContentIdentifier from a
remote, so withTmp can be used, and then move the content to the real key
once it's known.
2019-02-27 13:15:02 -04:00
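
A hedged sketch of the interface shape this commit describes; the type and
field names below are illustrative stand-ins, not git-annex's actual API:

import qualified Data.ByteString as S

newtype ContentIdentifier = ContentIdentifier S.ByteString
    deriving (Eq, Show)

-- The listing includes each item's size, so the downloader can check
-- free disk space before retrieving it.
data ImportableItem = ImportableItem
    { importableLocation :: FilePath
    , importableContentIdentifier :: ContentIdentifier
    , importableSize :: Integer
    }

-- The retriever is handed the filepath to write the download to.
type RetrieveWithContentIdentifier =
    ContentIdentifier -> FilePath -> IO ()
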
Joey Hess
f4b773e9a1
incomplete action to download files from import 2019-02-26 15:25:28 -04:00
Joey Hess
4747fa923d
export: Deprecated the --tracking option.
Instead, users can configure remote.<name>.annex-tracking-branch themselves.
2019-02-23 15:54:33 -04:00
Joey Hess
bab6c570b0
buildImportTrees is fully working
buildImportCommit not yet tested
2019-02-22 12:41:17 -04:00
Joey Hess
8fdea8f444
WIP
Added graftTree but it's buggy.

Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.

Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
2019-02-21 17:32:59 -04:00
Joey Hess
fd304dce60
split out Types.Import and some changes to the types in it 2019-02-21 13:39:09 -04:00
Joey Hess
936aee6a60
quickcheck property for parsing of content identifier logs 2019-02-21 13:17:43 -04:00
Joey Hess
ccc0684d21
no remotes support import yet 2019-02-20 16:59:04 -04:00
Joey Hess
0442842622
add import tree interface to Remote 2019-02-20 15:35:22 -04:00
Joey Hess
d839c2110a
fix encoding of metadata containing newlines
This fixes a reversion in the ByteString conversion. The old code used
isSpace to decide when the metadata value needs to be base64 encoded,
and that was incorrectly changed to only checking whether it contained ' '.

Note that only '\n' and '\r' were added and not other sorts of
whitespace that isSpace matches, like '\t' and '\v'. Only the former
would cause problems.
2019-02-20 14:26:18 -04:00
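
A small sketch of the check described above (the function name is
illustrative, not the real one): base64-encode a metadata value when it
contains a space, newline, or carriage return, rather than everything
isSpace would match:

import qualified Data.ByteString.Char8 as S8

needsBase64 :: S8.ByteString -> Bool
needsBase64 = S8.any (\c -> c == ' ' || c == '\n' || c == '\r')
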
Joey Hess
1b8026b2cb
constrain Arbitrary MetaField to ascii
Same reason other Arbitrary instances have been constrained. I saw a test
failure on Windows that was probably caused by non-ascii there.
2019-02-18 17:50:06 -04:00
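
A hedged sketch of constraining an Arbitrary instance to ascii
alphanumerics along these lines; the newtype is a stand-in for MetaField:

import Test.QuickCheck

newtype FieldName = FieldName String
    deriving (Show)

instance Arbitrary FieldName where
    arbitrary = FieldName
        <$> listOf1 (elements (['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9']))
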
Joey Hess
9cebfd7002
purify exportActions
Purifying exportActions will allow introspecting and modifying it,
which is needed to add progress bar display to it.

Only S3 and WebDAV ran an Annex action while constructing ExportActions.
There was a small performance gain from them doing that, since a
resource could be prepared once and reused for multiple actions by
Command.Export.

As seen in commits 809cfbbd8a and
5d394023eb, S3 and WebDAV actually create a
new handle for each access in normal, non-export use. It doesn't seem
worth making export use of them marginally more efficient than normal
use. It would be better to do that work upfront when constructing the
remote, or perhaps to use an MVar to cache a handle.

This commit was sponsored by Nick Piper on Patreon.
2019-01-30 15:11:40 -04:00
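
A rough sketch of the MVar idea mentioned at the end of the commit above:
open a connection handle lazily, once, and reuse it across actions. The
names are illustrative, and ConnHandle stands in for an S3 or WebDAV
handle:

import Control.Concurrent.MVar

data ConnHandle = ConnHandle  -- stand-in for an S3/WebDAV handle

-- Open the handle on first use, cache it in the MVar, and reuse it.
withCachedHandle :: MVar (Maybe ConnHandle) -> IO ConnHandle -> (ConnHandle -> IO a) -> IO a
withCachedHandle var open use = do
    h <- modifyMVar var $ \cached -> case cached of
        Just h' -> return (Just h', h')
        Nothing -> do
            h' <- open
            return (Just h', h')
    use h
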
Joey Hess
f76c4a0973
avoid Arbitrary generating excessively long lists
Turns out what it was doing often generated overly long lists, or spun with
suchThat rejecting too-large numbers. Limit lists to 10.
2019-01-21 13:50:24 -04:00
Joey Hess
e8ff3c3e73
fix build with old ghc 2019-01-18 14:08:10 -04:00
Joey Hess
a5764c4a78
fix build with old ghc 2019-01-18 13:59:29 -04:00
Joey Hess
d5f2463702
misctmp cleanup
* Switch to using .git/annex/othertmp for tmp files other than partial
  downloads, and make stale files left in that directory when git-annex
  is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
  delete anything lingering in there after it's 1 week old.

Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
2019-01-17 16:02:22 -04:00
Joey Hess
c3afb3434d
remove recently added cache from KeyVariety
Adding that field broke the Read/Show serialization back-compat,
and also the Eq and Ord instances did not ignore it, which broke
git-annex fsck and probably more.

I think that the new approach used in formatKeyVariety will be nearly
as fast, but have not benchmarked it.
2019-01-16 16:33:08 -04:00
Joey Hess
96aba8eff7
Revert "cache the serialization of a Key"
This reverts commit 4536c93bb2.

That broke Read/Show of a Key, and unfortunately Key is read in at least
one place: the GitAnnexDistribution data type.

It would be worth bringing this optimisation back, but it would need
either a custom Read/Show instance that preserves back-compat, or
wrapping Key in a data type that contains the serialization, or changing
how GitAnnexDistribution is serialized.

Also, the Eq instance would need to compare keys with and without a
cached serialization as equal.
2019-01-16 16:21:59 -04:00
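
A rough sketch of the wrapper idea mentioned above, using a simplified
stand-in for Key; the point is that Eq ignores the cached serialization,
so keys with and without a cache compare equal:

import qualified Data.ByteString as S

-- Simplified stand-in for git-annex's Key.
data Key = Key
    { keyName :: S.ByteString
    , keyVariety :: S.ByteString
    }
    deriving (Eq, Ord, Read, Show)

-- Hypothetical wrapper carrying a cached serialization.
data CachedKey = CachedKey
    { cachedKey :: Key
    , cachedSerialization :: Maybe S.ByteString
    }
    deriving (Show)

instance Eq CachedKey where
    a == b = cachedKey a == cachedKey b
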
Joey Hess
4536c93bb2
cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

It means that every place a Key has any of its fields changed, the cache
has to be dropped. I've grepped and found them all. But it would be
better to avoid that gotcha somehow.
2019-01-14 16:37:28 -04:00
Joey Hess
d3ab5e626b
rename key2file and file2key
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.

Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
2019-01-14 13:03:35 -04:00
Joey Hess
727767e1e2
make everything build again after ByteString Key changes 2019-01-11 16:39:46 -04:00
Joey Hess
151562b537
convert key2file and file2key to use builder and attoparsec
The new parser is significantly stricter than the old one:

The old file2key allowed the fields to come in any order,
but the new one requires the fixed order that git-annex has always used.
Hopefully this will not cause any breakage.

And the old file2key allowed eg SHA1-m1-m2-m3-m4-m5-m6--xxxx
while the new does not allow duplication of fields. This could potentially
improve security, because allowing lots of extra junk like that in a key
could potentially be used in a SHA1 collision attack, although the current
attacks need binary data and not this kind of structured numeric data.

Speed improved of course, and fairly substantially, in microbenchmarks:

benchmarking old/key2file
time                 2.264 μs   (2.257 μs .. 2.273 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.265 μs   (2.260 μs .. 2.275 μs)
std dev              21.17 ns   (13.06 ns .. 39.26 ns)

benchmarking new/key2file'
time                 1.744 μs   (1.741 μs .. 1.747 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.745 μs   (1.742 μs .. 1.751 μs)
std dev              13.55 ns   (9.099 ns .. 21.89 ns)

benchmarking old/file2key
time                 6.114 μs   (6.102 μs .. 6.129 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.118 μs   (6.106 μs .. 6.143 μs)
std dev              55.00 ns   (30.08 ns .. 100.2 ns)

benchmarking new/file2key'
time                 1.791 μs   (1.782 μs .. 1.801 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.792 μs   (1.785 μs .. 1.804 μs)
std dev              32.46 ns   (20.59 ns .. 50.82 ns)
variance introduced by outliers: 19% (moderately inflated)
2019-01-11 16:33:42 -04:00
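
A simplified sketch of a fixed-field-order attoparsec parser in the style
described above; it only handles a variety, an optional size field, and
the name, whereas the real parser handles more fields:

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as S

data SimpleKey = SimpleKey
    { kVariety :: S.ByteString
    , kSize :: Maybe Integer
    , kName :: S.ByteString
    }
    deriving (Show)

keyParser :: Parser SimpleKey
keyParser = SimpleKey
    <$> takeWhile1 (/= '-')                                 -- variety first
    <*> option Nothing (Just <$> (string "-s" *> decimal))  -- optional size
    <*> (string "--" *> takeByteString)                     -- then the name

parseSimpleKey :: S.ByteString -> Either String SimpleKey
parseSimpleKey = parseOnly keyParser
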
Joey Hess
b552551b33
use ByteString in Key for speed
This is an easy win for parseKeyVariety:

benchmarking old/parseKeyVariety
time                 1.515 μs   (1.512 μs .. 1.517 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.515 μs   (1.513 μs .. 1.517 μs)
std dev              6.417 ns   (4.992 ns .. 8.113 ns)

benchmarking new/parseKeyVariety
time                 54.97 ns   (54.70 ns .. 55.40 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 55.42 ns   (55.05 ns .. 56.03 ns)
std dev              1.562 ns   (969.5 ps .. 2.442 ns)
variance introduced by outliers: 44% (moderately inflated)

For formatKeyVariety, using a Builder is marginally worse than building a
String... (This is with criterion evaluating fully to nf not whnf)

benchmarking old/formatKeyVariety
time                 434.3 ns   (428.0 ns .. 440.4 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 430.6 ns   (428.2 ns .. 433.9 ns)
std dev              9.166 ns   (6.932 ns .. 11.94 ns)
variance introduced by outliers: 27% (moderately inflated)

benchmarking Builder/formatKeyVariety
time                 526.5 ns   (524.7 ns .. 528.8 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 526.1 ns   (524.9 ns .. 528.5 ns)
std dev              5.687 ns   (3.762 ns .. 8.000 ns)

Manually building the ByteString was better, but still slightly slower than String,
due to the inefficient need to S.pack . show the HashSize:

benchmarking formatKeyVariety
time                 459.5 ns   (455.8 ns .. 463.2 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 459.9 ns   (457.4 ns .. 466.6 ns)
std dev              11.65 ns   (6.860 ns .. 21.41 ns)
variance introduced by outliers: 35% (moderately inflated)

So I cheated and made parseKeyVariety cache the original ByteString,
for formatKeyVariety to use instead of re-building it. Final benchmark:

benchmarking new/formatKeyVariety
time                 50.64 ns   (50.57 ns .. 50.73 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 51.05 ns   (50.60 ns .. 52.71 ns)
std dev              2.790 ns   (259.6 ps .. 5.916 ns)
variance introduced by outliers: 75% (severely inflated)

benchmarking new/parseKeyVariety
time                 71.88 ns   (71.54 ns .. 72.24 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 71.97 ns   (71.69 ns .. 72.47 ns)
std dev              1.249 ns   (910.7 ps .. 1.791 ns)
variance introduced by outliers: 22% (moderately inflated)
2019-01-11 16:32:51 -04:00
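
For reference, a minimal criterion setup of the sort that produced the
output above, with stand-in functions rather than git-annex's; nf forces
results to normal form rather than just whnf:

import Criterion.Main
import qualified Data.ByteString.Char8 as S8

oldFormat :: Int -> String
oldFormat n = "SHA" ++ show n

newFormat :: Int -> S8.ByteString
newFormat n = S8.pack ("SHA" ++ show n)

main :: IO ()
main = defaultMain
    [ bgroup "formatKeyVariety"
        [ bench "old" $ nf oldFormat 256
        , bench "new" $ nf newFormat 256
        ]
    ]
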
Joey Hess
ed8d9a29fe
add missing case 2019-01-10 17:17:37 -04:00
Joey Hess
591e4b145f
convert old uuid-based log parsers to attoparsec
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.

There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo  bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after running `git annex describe here "foo  bar"`,
you wouldn't get that same string back out when git-annex displayed repo
descriptions.

On the other hand, some other parsers relied on the old behavior, and the
attoparsec rewrites had to deal with the issue themselves...

For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1  group2 " due to excess
whitespace, which would be even more confusing behavior.

The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
2019-01-10 16:34:20 -04:00
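
A minimal sketch of the takeWhile1 approach described above, with the log
layout simplified: the uuid is everything up to the first space, and the
rest of the line is the value:

import Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString as S

data LogLine = LogLine
    { logUUID :: S.ByteString
    , logValue :: S.ByteString
    }
    deriving (Show)

logLineParser :: Parser LogLine
logLineParser = LogLine
    <$> takeWhile1 (/= ' ')              -- uuid up to the first space
    <*> (char ' ' *> takeTill (== '\n')) -- rest of the line is the value
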
Joey Hess
6f66b53a30
newtype Group to ByteString
This may speed up queries for things in groups, due to Eq and Ord being faster.
2019-01-09 15:05:49 -04:00
Joey Hess
2fef43dd71
convert all per-uuid log files to use Builder
Mostly didn't push the ByteStrings down very deep, but none of these log
files are written to frequently at all, so the slight remaining
inefficiency doesn't matter.

In Logs.UUID, removed the fixBadUUID code that cleaned up after a bug in
git-annex versions 3.20111105-3.20111110. In the unlikely event that a repo was
last touched by that ancient git-annex version, the descriptions of remotes
would appear missing when used with this version of git-annex. That is such minor
breakage, and so unlikely to still be a problem for any repos, that it was not
worth forward-porting that code to ByteString.
2019-01-09 14:00:35 -04:00
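
A hedged sketch (names illustrative) of rendering a per-uuid log with a
ByteString Builder rather than String concatenation:

import Data.ByteString.Builder (Builder, byteString, charUtf8, toLazyByteString)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

buildLogLine :: S.ByteString -> S.ByteString -> Builder
buildLogLine uuid value =
    byteString uuid <> charUtf8 ' ' <> byteString value <> charUtf8 '\n'

renderLog :: [(S.ByteString, S.ByteString)] -> L.ByteString
renderLog = toLazyByteString . mconcat . map (uncurry buildLogLine)
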
Joey Hess
16c798b5ef
switch MetaValue to ByteString and MetaField to Text
MetaField was already limited to alphanumerics, so it makes sense to use
Text for it.

Note that technically a UUID can contain invalid UTF-8, and so
remoteMetaDataPrefix's use of T.pack . fromUUID could replace non-UTF8
values with '?' or whatever. In practice, a UUID is usually also text,
I only kept open the possibility of it containing invalid UTF-8 to avoid
breaking parsing of strange UUIDs in git-annex branch files. So, I
decided to let this edge case slip by.

Have not updated the rest of the code base yet for this change, as the
change took 2.5 hours longer than I expected to get working properly.
2019-01-07 14:18:24 -04:00
Joey Hess
11d6e2e260
new improved benchmark command that can benchmark anything git-annex does 2019-01-04 13:46:36 -04:00
Joey Hess
ec1b9da72f
avoid abusing from/toRawFilePath for non-FilePaths 2019-01-01 22:44:04 -04:00
Joey Hess
894716512d
add a UUIDDesc type containing a ByteString
Groundwork for handling uuid.log using ByteString
2019-01-01 16:17:54 -04:00
Joey Hess
9cc6d5549b
convert UUID from String to ByteString
This should make == comparison of UUIDs somewhat faster, and perhaps a
few other operations around maps of UUIDs etc.

FromUUID/ToUUID are used to convert String, which is still used for all
IO of UUIDs. Eventually the hope is those instances can be removed,
and all git-annex branch log files etc use ByteString throughout, for a
real speed improvement.

Note the use of fromRawFilePath / toRawFilePath -- while a UUID usually
contains only alphanumerics and so could be treated as ascii, it's
conceivable that some git-annex repository has been initialized using
a UUID that is not only not a canonical UUID, but contains high unicode
or invalid unicode. Using the filesystem encoding avoids any problems
with such a thing. However, a NUL in a UUID seems extremely unlikely,
so I didn't use encodeBS / decodeBS to avoid their extra overhead in
handling NULs.

The Read/Show instance for UUID luckily serializes the same way for
ByteString as it did for String.
2019-01-01 14:45:33 -04:00
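
A hedged sketch of the shape described above (simplified; the real
definitions live in git-annex's UUID types): UUID as a ByteString, with
conversion classes for the String-based IO still in use elsewhere:

import qualified Data.ByteString as S

data UUID = NoUUID | UUID S.ByteString
    deriving (Eq, Ord, Read, Show)

class FromUUID a where
    fromUUID :: UUID -> a

class ToUUID a where
    toUUID :: a -> UUID

instance FromUUID S.ByteString where
    fromUUID (UUID b) = b
    fromUUID NoUUID = S.empty

instance ToUUID S.ByteString where
    toUUID b
        | S.null b = NoUUID
        | otherwise = UUID b
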
Joey Hess
029ae8d4db
support findref and --branch with file matching options
* findref: Support file matching options: --include, --exclude,
  --want-get, --want-drop, --largerthan, --smallerthan, --accessedwithin
* Commands supporting --branch now apply file matching options --include,
  --exclude, --want-get, --want-drop to filenames from the branch.
  Previously, combining --branch with those would fail to match anything.
* add, import, findref: Support --time-limit.

This commit was sponsored by Jake Vosloo on Patreon.
2018-12-09 13:38:35 -04:00
Joey Hess
ab7746a2ae
annex.cachecreds: New config to allow disabling of credentials caching for special remotes.
Note that it does not prevent storing p2p access tokens or multicast
encryption keys, since those are not cached; the previous commit
established the distinction.

How well this works depends on how often getRemoteCredPair is called and
how expensive it is. In some cases setting this will result in an annoying
number of gpg password prompts and/or slowdowns due to reading creds
from the git-annex branch and decrypting, which could be improved by calling
getRemoteCredPair less often.

This commit was sponsored by Ilya Shlyakhter on Patreon.
2018-12-04 14:16:56 -04:00
Joey Hess
850d19d038
add dropFromEnd 2018-11-23 11:24:05 -04:00
Joey Hess
234842a347
v7
Install new git hooks in this version.

This does beg the question of what to do if git later gets eg a
post-smudge hook, that could run git-annex smudge --update. I think the
thing to do in that case would be to make git-annex smudge --update
install the new hooks. That way, as the user uses git-annex, the hook
would be created pretty quickly and without needing any extra syscalls
except for when git-annex smudge --update is called.

I considered doing something like that for installation of the
post-checkout and post-merge hooks, which would have avoided the need
for v7. But the only place it was cheap to do it would be in git-annex smudge
which could cheaply notice that smudge.log didn't exist yet and so know
the hooks needed to be installed. But since smudge used to populate pointer
files, it would be quite surprising if a single git checkout/merge failed
to update the work tree, and so that idea didn't work out.

The other reason for v7 is psychological -- users don't need to worry
about whether they might be running an old version of git-annex that
doesn't support their v7 repository very well. And bug reports about
"v6" have gotten a bit of a bad association in my head since they often
hit one of the known limitations and didn't realize it was experimental.

newtyped RepoVersion Int to avoid needing 2 comparisons in
versionSupportsUnlockedPointers etc. Also it's just nicer.

This commit was sponsored by John Pellman on Patreon.
2018-10-25 18:24:23 -04:00
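
A small sketch of the newtype mentioned above; the version threshold is
illustrative, but the point is that deriving Ord lets one comparison
replace checking against several versions:

newtype RepoVersion = RepoVersion Int
    deriving (Eq, Ord, Show)

-- One Ord comparison instead of two equality checks.
versionSupportsUnlockedPointers :: RepoVersion -> Bool
versionSupportsUnlockedPointers v = v >= RepoVersion 6
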
Joey Hess
8be5a7269a
refactor getCurrentBranch
Both Command.Sync and Annex.Ingest had their own versions of this.

The one in Annex.Ingest used Git.Branch.currentUnsafe, but does not seem
to need it. That is only checking to see if it's in an adjusted unlocked
branch, and when in an adjusted branch, the branch does in fact exist,
so the added check that Git.Branch.current does is fine.

This commit was sponsored by Denis Dzyubenko on Patreon.
2018-10-19 17:29:18 -04:00
Joey Hess
38d691a10f
removed the old Android app
Running git-annex linux builds in termux seems to work well enough that the
only reason to keep the Android app would be to support Android 4-5, which
the old Android app supported; I don't know whether the termux method works
on those versions (although I see no reason why it would not).
According to [1], Android 4-5 remains on around 29% of devices, down from
51% one year ago.

[1] https://www.statista.com/statistics/271774/share-of-android-platforms-on-mobile-devices-with-android-os/

This is a rather large commit, but mostly very straightforward removal of
android ifdefs and patches and associated cruft.

Also, removed support for building with very old ghc < 8.0.1, and with
yesod < 1.4.3, and without concurrent-output, which were only being used
by the cross build.

Some documentation specific to the Android app (screenshots etc) needs
to be updated still.

This commit was sponsored by Brett Eisenberg on Patreon.
2018-10-13 01:41:11 -04:00
Joey Hess
6ba3dea566
annex.jobs
Added annex.jobs setting, which is like using the -J option.

Of course, -J overrides annex.jobs.

This commit was sponsored by Trenton Cronholm on Patreon.
2018-10-04 12:47:27 -04:00
Joey Hess
bc31b93c77
remote.name.annex-security-allow-unverified-downloads
Added remote.name.annex-security-allow-unverified-downloads, a per-remote
setting for annex.security.allow-unverified-downloads.

This commit was sponsored by Brock Spratlen on Patreon.
2018-09-25 15:34:47 -04:00
Joey Hess
4ecba916a1
annex.maxextensionlength
Added annex.maxextensionlength for use cases where extensions longer than 4
characters are needed.

This commit was sponsored by Henrik Riomar on Patreon.
2018-09-24 12:10:18 -04:00
Joey Hess
0a7c5a9982
dropdead per-remote metadata
Had to refactor pure code into separate modules so it is accessible
inside Annex.Branch.Transitions.

This commit was sponsored by Peter on Patreon.
2018-09-05 13:52:46 -04:00
Joey Hess
24b76cb8e0
fix prefixing 2018-08-31 13:39:50 -04:00
Joey Hess
b3d42283ad
use per-remote metadata storage for S3 version ID
Since the same key can be stored in a versioned S3 bucket multiple times
with different version IDs, this allows tracking them all. Not currently
needed, but if we ever want to drop from a versioned S3 bucket, we'll
need to know them all.

This commit was supported by the NSF-funded DataLad project.
2018-08-31 13:27:29 -04:00
Joey Hess
5c99f6247e
per-remote metadata storage
Actually very straightforward reuse of the metadata log file code.
Although I had to add a todo item, as git-annex forget won't clean up
a dead remote's metadata yet.

This would be worth adding to the external special remote interface
sometime. Have not opened a todo though, guess I'll wait until something
needs it.

This commit was supported by the NSF-funded DataLad project.
2018-08-31 12:23:22 -04:00
Joey Hess
8b39db20b5
export appendonly support
Make `git annex export` check appendonly when removing a file from an
export, and not update the location log, since the remote still contains
the content.

This commit was supported by the NSF-funded DataLad project.
2018-08-30 11:18:20 -04:00
Joey Hess
02630b39ee
add Remote.readonly
Does nothing yet.

Considered making bup readonly, but while the content can't be removed,
it is still possible to delete a branch, so didn't.

This commit was supported by the NSF-funded DataLad project.
2018-08-30 11:12:18 -04:00