Commit graph

491 commits

Joey Hess
258a7c5cd1
add Key to all ActionItem constructors 2019-06-06 12:53:24 -04:00
Joey Hess
4932972487
fix STM deadlock
659640e224 was buggy: it had an STM
deadlock because two actions both wanted to takeTMVar the WorkerPool
and so blocked one another.

Fixed by completely reworking how the pool is maintained. Maintenance
threads now wait for the Async actions and update the WorkerPool. This
means twice as many threads as before, but they are green threads, so
each only uses a few extra bytes of RAM.
2019-06-05 20:07:35 -04:00
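
A minimal sketch of the pattern described above (a free-slot counter
standing in for the real WorkerPool; not git-annex's actual code):

    import Control.Concurrent.Async (async, wait)
    import Control.Concurrent.STM
    import Control.Monad (forM_, void)

    type WorkerPool = TMVar Int  -- number of free worker slots

    runJob :: WorkerPool -> IO () -> IO ()
    runJob pool job = do
        -- block until a slot is free, then claim it
        atomically $ do
            free <- takeTMVar pool
            check (free > 0)
            putTMVar pool (free - 1)
        a <- async job
        -- maintenance thread: wait for the job, then release the slot,
        -- so no transaction holds the TMVar while waiting on another thread
        void $ async $ do
            wait a
            atomically $ takeTMVar pool >>= putTMVar pool . (+ 1)

    main :: IO ()
    main = do
        pool <- newTMVarIO 2
        forM_ [1 .. 5 :: Int] $ \n -> runJob pool (print n)
        -- wait for all slots to be returned before exiting
        atomically $ do
            free <- takeTMVar pool
            check (free == 2)
            putTMVar pool free
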
Joey Hess
659640e224
separate queue for cleanup actions
When running multiple concurrent actions, the cleanup phase is run in a
separate queue than the main action queue. This can make some commands
faster, because less time is spent on bookkeeping in between each file
transfer.

But as far as I can see, nothing will be sped up much by this yet, because
all the existing cleanup actions are very lightweight. This is just groundwork
for deferring checksum verification to cleanup time.

This change does mean that a user who expects -J2 to run no more than
2 jobs at a time may be surprised to see 4 in some cases (if the cleanup
actions are slow enough to notice).

It might also make sense to enable background cleanup without -J,
for at least one cleanup action. Indeed, that's the behavior that -J1
has now. At some point in the future, it may make sense to make the
behavior with no -J the same as -J1. The only reason it's not that way
currently is that git-annex can build without concurrent-output, and also
any bugs in concurrent-output (such as perhaps misbehaving on non-VT100
compatible terminals) are avoided by default by only using it when -J is
used.
2019-06-05 17:54:35 -04:00
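
A toy sketch of the two-stage arrangement described above, using a pair
of semaphores in place of git-annex's actual queues (names and types are
illustrative):

    import Control.Concurrent.Async (forConcurrently_)
    import Control.Concurrent.QSem
    import Control.Exception (bracket_)

    data Job = Job
        { mainAction :: IO ()
        , cleanupAction :: IO ()
        }

    -- Main actions and cleanup actions are limited separately, so with
    -- 2 jobs up to 2 transfers plus 2 cleanups can be in flight at once.
    runJobs :: Int -> [Job] -> IO ()
    runJobs n jobs = do
        mainSem <- newQSem n     -- limits concurrent main actions
        cleanupSem <- newQSem n  -- limits concurrent cleanup actions
        forConcurrently_ jobs $ \j -> do
            bracket_ (waitQSem mainSem) (signalQSem mainSem) (mainAction j)
            bracket_ (waitQSem cleanupSem) (signalQSem cleanupSem) (cleanupAction j)

    main :: IO ()
    main = runJobs 2
        [ Job (putStrLn ("transfer " ++ show i)) (putStrLn ("cleanup " ++ show i))
        | i <- [1 .. 4 :: Int]
        ]
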
Joey Hess
c04b2af3e1
improved WorkerPool abstraction
No behavior changes.
2019-06-05 14:26:48 -04:00
Joey Hess
0b5cc89687
wording 2019-06-05 12:20:11 -04:00
Joey Hess
6e51b9ae88
clarify 2019-06-04 21:49:53 -04:00
Joey Hess
500f72ec3d
comment typo 2019-06-04 14:40:07 -04:00
Joey Hess
1871295765
rename annex.security.allowed-http-addresses
Renamed annex.security.allowed-http-addresses to
annex.security.allowed-ip-addresses because it is not really specific to
the http protocol; it also limits, eg, git-annex's use of ftp and, via
youtube-dl, several other protocols.

The old name for the config will still work.

If both old and new name are set, the new name will win.
2019-05-30 12:43:40 -04:00
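
A hypothetical sketch of how such an aliased lookup can be done,
preferring the new key and falling back to the old one (the Map-based
config and the helper name are illustrative, not git-annex's real types):

    import Control.Applicative ((<|>))
    import qualified Data.Map as M

    type GitConfig = M.Map String String

    -- Prefer the new config name; fall back to the old one so existing
    -- configurations keep working. If both are set, the new name wins.
    allowedIPAddresses :: GitConfig -> Maybe String
    allowedIPAddresses c =
        M.lookup "annex.security.allowed-ip-addresses" c
            <|> M.lookup "annex.security.allowed-http-addresses" c

    main :: IO ()
    main = print $ allowedIPAddresses $
        M.fromList [("annex.security.allowed-http-addresses", "all")]
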
Joey Hess
e06feb7316
honor preferred content when importing
Importing from a special remote honors its preferred content too; unwanted
files are not imported. But, some preferred content expressions can't be
checked before files are imported, and trying to import with such an
expression will fail.

Tested this with scenarios including changing the preferred content
expression and making sure merging the import didn't delete files that were
no longer wanted.

There was one minor inefficiency mentioned in the todo that I punted on.
2019-05-21 14:38:06 -04:00
Joey Hess
82186ca58f
annex.jobs=cpus etc
Added the ability to run one job per CPU (core), by setting annex.jobs=cpus,
or using option --jobs=cpus or -Jcpus.

Built with future expansion in mind, including not using a catch-all case
when matching on Concurrency, so more constructors can later be added, and
using "cpu" instead of "0".
2019-05-10 13:27:08 -04:00
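
An illustrative sketch of the idea (not the real git-annex types):
"cpus" gets its own constructor rather than a magic number, and code
matching on Concurrency avoids a catch-all case so later constructors
force it to be revisited:

    import GHC.Conc (getNumProcessors)
    import Text.Read (readMaybe)

    data Concurrency
        = NonConcurrent
        | Concurrent Int
        | ConcurrentPerCpu

    parseConcurrency :: String -> Maybe Concurrency
    parseConcurrency "cpus" = Just ConcurrentPerCpu
    parseConcurrency s = Concurrent <$> readMaybe s

    -- Deliberately no catch-all case: adding a constructor later will
    -- produce an incomplete-pattern warning here.
    numberOfJobs :: Concurrency -> IO Int
    numberOfJobs NonConcurrent = return 1
    numberOfJobs (Concurrent n) = return n
    numberOfJobs ConcurrentPerCpu = getNumProcessors

    main :: IO ()
    main = maybe (fail "bad -J value") numberOfJobs (parseConcurrency "cpus") >>= print
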
Joey Hess
9dd764e6f7
Added mimeencoding= term to annex.largefiles expressions.
* Added mimeencoding= term to annex.largefiles expressions.
  This is probably mostly useful to match non-text files with eg
  "mimeencoding=binary"
* git-annex matchexpression: Added --mimeencoding option.
2019-04-30 12:17:22 -04:00
Joey Hess
40ecf58d4b
update licenses from GPL to AGPL
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.

Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.

(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
2019-03-13 15:48:14 -04:00
Joey Hess
8ae0db925b
fix name of annex-tracking-branch config 2019-03-11 13:56:59 -04:00
Joey Hess
2912429640
better indicate when special remotes do not support renameExport
Avoid a warning message when renameExport is not supported, and just
fall back to deleting with a subsequent re-upload. Especially needed for
importtree remotes, where renameExport needs to be disabled.

This changes the external special remote protocol, but in a
backwards-compatible way. A reply of UNSUPPORTED-REQUEST to an older
version of git-annex will cause it to make renameExport return False.
2019-03-11 12:53:24 -04:00
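
A rough sketch of that backwards-compatible handling (simplified; the
real code lives in the external special remote support):

    -- A remote that replies UNSUPPORTED-REQUEST to the rename request is
    -- treated as not supporting renameExport, so the caller quietly falls
    -- back to delete and re-upload instead of warning about a failure.
    data RenameReply
        = RenameExportSuccess
        | RenameExportFailure
        | UnsupportedRequest

    -- Nothing means "rename not supported here".
    handleRenameReply :: RenameReply -> Maybe Bool
    handleRenameReply RenameExportSuccess = Just True
    handleRenameReply RenameExportFailure = Just False
    handleRenameReply UnsupportedRequest = Nothing

    main :: IO ()
    main = case handleRenameReply UnsupportedRequest of
        Nothing -> putStrLn "renameExport unsupported, falling back to delete + upload"
        Just ok -> putStrLn ("renameExport " ++ if ok then "succeeded" else "failed")
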
Joey Hess
3b412aaae0
simplify Applicative instance 2019-03-06 16:44:17 -04:00
Joey Hess
c6c5f6336b
avoid whitespace in Arbitrary UUID and empty UUID 2019-03-06 15:44:27 -04:00
Joey Hess
46d33e804a
added checkPresentExportWithContentIdentifier
Ugh, don't like needing to add this, but I can't see a way around it.
2019-03-05 16:03:03 -04:00
Joey Hess
8c54604e67
import+export from directory special remote fully working
Had to add two more API calls to override export APIs that are not safe
for use in combination with import.

It's unfortunate that removeExportDirectory is documented to be allowed
to remove non-empty directories. I'm not entirely sure why it's that
way; my best guess is that it was intended to make it easy to implement
with just rm -rf.
2019-03-05 14:20:14 -04:00
Joey Hess
aaacf431d8
handle importtree=yes config
For now, it's only allowed when exporttree=yes is also set.
That simplified the implementation, but could later be changed if
there's a remote that makes sense to be an import but not an export.
However, it may work just as well to make a remote be readonly to
prevent export to it while still allowing import.
2019-03-04 16:07:35 -04:00
Joey Hess
88ccfaa78c
storeExportWithContentIdentifierM for directory special remote
Not sure if my reasoning about the races really holds.

It would certainly be possible to better guard against races by using
Linux-specific renameat2 with RENAME_EXCHANGE or RENAME_NOREPLACE.

Or by using link and relying on it not overwriting existing files -- but
that would need a filesystem that supports hard links, and the directory
special remote can be used on filesystems that don't.
2019-03-04 14:46:25 -04:00
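
A sketch of the "link and rely on it not overwriting" idea mentioned
above, assuming a POSIX system with hard link support (the temp file
naming is illustrative only):

    import Control.Exception (IOException, try)
    import System.Posix.Files (createLink, removeLink)

    -- Write to a temp file, then createLink, which fails if the
    -- destination already exists, so an existing file is never silently
    -- replaced by a concurrent writer.
    storeWithoutOverwrite :: FilePath -> String -> IO Bool
    storeWithoutOverwrite dest content = do
        let tmp = dest ++ ".tmp"
        writeFile tmp content
        r <- try (createLink tmp dest) :: IO (Either IOException ())
        removeLink tmp
        return (either (const False) (const True) r)

    main :: IO ()
    main = storeWithoutOverwrite "example.dat" "hello\n" >>= print
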
Joey Hess
e2e57f8556
initial export support for directory special remote
This does not guard against race conditions yet; it's only for testing
purposes.
2019-02-27 13:42:34 -04:00
Joey Hess
45aacd888b
import downloader complete (untested)
Made some api changes.

listImportableContents needs to provide the size
of the data, so the downloader can check disk free space.

retrieveExportWithContentIdentifier is passed the filepath to write to.

Use temporary "CID" key during download of a ContentIdentifier from a
remote, so withTmp can be used and then move the content to the real key
once it's known.
2019-02-27 13:15:02 -04:00
Joey Hess
f4b773e9a1
incomplete action to download files from import 2019-02-26 15:25:28 -04:00
Joey Hess
4747fa923d
export: Deprecated the --tracking option.
Instead, users can configure remote.<name>.annex-tracking-branch themselves.
2019-02-23 15:54:33 -04:00
Joey Hess
bab6c570b0
buildImportTrees is fully working
buildImportCommit not yet tested
2019-02-22 12:41:17 -04:00
Joey Hess
8fdea8f444
WIP
Added graftTree but it's buggy.

Should use graftTree in Annex.Branch.graftTreeish; it will be faster
than the current implementation there.

Started Annex.Import, but untested and it doesn't yet handle tree
grafting.
2019-02-21 17:32:59 -04:00
Joey Hess
fd304dce60
split out Types.Import and some changes to the types in it 2019-02-21 13:39:09 -04:00
Joey Hess
936aee6a60
quickcheck property for parsing of content identifier logs 2019-02-21 13:17:43 -04:00
Joey Hess
ccc0684d21
no remotes support import yet 2019-02-20 16:59:04 -04:00
Joey Hess
0442842622
add import tree interface to Remote 2019-02-20 15:35:22 -04:00
Joey Hess
d839c2110a
fix encoding of metadata containing newlines
This fixes a reversion in the ByteString conversion. The old code used
isSpace to decide when the metadata value needs to be base64 encoded,
and that was incorrectly changed to only checking whether it contained ' '.

Note that only '\n' and '\r' were added and not other sorts of
whitespace that isSpace matches, like '\t' and '\v'. Only the former
would cause problems.
2019-02-20 14:26:18 -04:00
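
A simplified sketch of the resulting encoding decision (not the actual
git-annex code):

    import qualified Data.ByteString.Char8 as B8

    -- A metadata value is base64 encoded when it contains a space,
    -- newline, or carriage return: the characters that would break the
    -- word- and line-oriented log format.
    needsBase64 :: B8.ByteString -> Bool
    needsBase64 = B8.any (`elem` [' ', '\n', '\r'])

    main :: IO ()
    main = mapM_ (print . needsBase64)
        [ B8.pack "plain-value"
        , B8.pack "has space"
        , B8.pack "has\nnewline"
        ]
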
Joey Hess
1b8026b2cb
constrain Arbitrary MetaField to ascii
Same reason other Arbitrary instances have been constrained. I saw a test
failure on Windows that was probably caused by non-ascii there.
2019-02-18 17:50:06 -04:00
Joey Hess
9cebfd7002
purify exportActions
Purifying exportActions will allow introspecting and modifying it,
which is needed to add progress bar display to it.

Only S3 and WebDAV ran an Annex action while constructing ExportActions.
There was a small performance gain from them doing that, since a
resource was able to be prepared and reused for multiple actions by
Command.Export.

As seen in commit 809cfbbd8a and
5d394023eb, S3 and WebDAV actually create a
new handle for each access in normal, non-export use. It doesn't seem
worth making export use of them marginally more efficient than normal
use. It would be better to do that work upfront when constructing the
remote. Or perhaps use an MVar to cache a handle.

This commit was sponsored by Nick Piper on Patreon.
2019-01-30 15:11:40 -04:00
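
A sketch, with illustrative types only, of why a pure record of actions
is easier to extend than one built inside an Annex action: each field
can be wrapped, eg to add progress display, without re-running any
setup:

    {-# LANGUAGE RankNTypes #-}

    data ExportActions m = ExportActions
        { storeExport :: FilePath -> m Bool
        , removeExport :: FilePath -> m Bool
        }

    -- Wrap every action in the record with some extra behavior.
    wrapActions :: (forall a. m a -> m a) -> ExportActions m -> ExportActions m
    wrapActions f ea = ExportActions
        { storeExport = f . storeExport ea
        , removeExport = f . removeExport ea
        }

    main :: IO ()
    main = do
        let base = ExportActions (\_ -> return True) (\_ -> return True) :: ExportActions IO
            traced = wrapActions (\a -> putStrLn "export action" >> a) base
        storeExport traced "example" >>= print
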
Joey Hess
f76c4a0973
avoid Arbitrary generating excessively long lists
Turns out what it was doing often generated overly long lists, or spun
with suchThat rejecting too-large numbers. Lists are now limited to 10.
2019-01-21 13:50:24 -04:00
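
A small sketch of that kind of bounded generator (illustrative, not the
actual Arbitrary instances):

    import Test.QuickCheck

    -- Pick a length of at most 10 and generate exactly that many
    -- elements, instead of letting suchThat spin rejecting oversized
    -- values.
    boundedList :: Arbitrary a => Gen [a]
    boundedList = do
        n <- choose (0, 10)
        vectorOf n arbitrary

    main :: IO ()
    main = sample (boundedList :: Gen [Int])
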
Joey Hess
e8ff3c3e73
fix build with old ghc 2019-01-18 14:08:10 -04:00
Joey Hess
a5764c4a78
fix build with old ghc 2019-01-18 13:59:29 -04:00
Joey Hess
d5f2463702
misctmp cleanup
* Switch to using .git/annex/othertmp for tmp files other than partial
  downloads, and make stale files left in that directory when git-annex
  is interrupted be cleaned up promptly by subsequent git-annex processes.
* The .git/annex/misctmp directory is no longer used and git-annex will
  delete anything lingering in there after it's 1 week old.

Also, in Annex.Ingest, made the filename it uses in the tmp dir be
prefixed with "ingest-" to avoid potentially using a filename used by
some other code.
2019-01-17 16:02:22 -04:00
Joey Hess
c3afb3434d
remove recently added cache from KeyVariety
Adding that field broke the Read/Show serialization back-compat,
and also the Eq and Ord instances were not blinded to it, which broke
git annex fsck and probably more.

I think that the new approach used in formatKeyVariety will be nearly
as fast, but have not benchmarked it.
2019-01-16 16:33:08 -04:00
Joey Hess
96aba8eff7
Revert "cache the serialization of a Key"
This reverts commit 4536c93bb2.

That broke Read/Show of a Key, and unfortunately Key is read in at least
one place: the GitAnnexDistribution data type.

It would be worth bringing this optimisation back, but it would need
either a custom Read/Show instance that preserves back-compat, or
wrapping Key in a data type that contains the serialization, or changing
how GitAnnexDistribution is serialized.

Also, the Eq instance would need to compare keys with and without a
cached serialization the same.
2019-01-16 16:21:59 -04:00
Joey Hess
4536c93bb2
cache the serialization of a Key
This will speed up the common case where a Key is deserialized from
disk, but is then serialized to build eg, the path to the annex object.

It means that every place where a Key has any of its fields changed, the
cache has to be dropped. I've grepped and found them all, but it would be
better to avoid that gotcha somehow.
2019-01-14 16:37:28 -04:00
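
A minimal sketch of the caching idea and its gotcha (not the actual Key
type):

    import qualified Data.ByteString.Char8 as B8

    data Key = Key
        { keyBackend :: B8.ByteString
        , keyName :: B8.ByteString
        , keySerialization :: Maybe B8.ByteString  -- cached serialized form
        }

    serializeKey :: Key -> B8.ByteString
    serializeKey k = case keySerialization k of
        Just s -> s  -- fast path: reuse the cached serialization
        Nothing -> keyBackend k <> B8.pack "--" <> keyName k

    -- Any field change must invalidate the cache; forgetting this is the
    -- gotcha mentioned above.
    setKeyName :: B8.ByteString -> Key -> Key
    setKeyName n k = k { keyName = n, keySerialization = Nothing }

    main :: IO ()
    main = B8.putStrLn $ serializeKey $
        setKeyName (B8.pack "newname") (Key (B8.pack "SHA256E") (B8.pack "oldname") Nothing)
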
Joey Hess
d3ab5e626b
rename key2file and file2key
What these generate is not really suitable to be used as a filename,
which is why keyFile and fileKey further escape it. These are just
serializing Keys.

Also removed a quickcheck test that was very unlikely to test anything
useful, since it relied on random chance creating something that looks
like a serialized key. The other test is sufficient for testing what
that was intended to test anyway.
2019-01-14 13:03:35 -04:00
Joey Hess
727767e1e2
make everything build again after ByteString Key changes 2019-01-11 16:39:46 -04:00
Joey Hess
151562b537
convert key2file and file2key to use builder and attoparsec
The new parser is significantly stricter than the old one:

The old file2key allowed the fields to come in any order,
but the new one requires the fixed order that git-annex has always used.
Hopefully this will not cause any breakage.

And the old file2key allowed eg SHA1-m1-m2-m3-m4-m5-m6--xxxx
while the new does not allow duplication of fields. This could potentially
improve security, because allowing lots of extra junk like that in a key
could potentially be used in a SHA1 collision attack, although the current
attacks need binary data and not this kind of structured numeric data.

Speed improved of course, and fairly substantially, in microbenchmarks:

benchmarking old/key2file
time                 2.264 μs   (2.257 μs .. 2.273 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.265 μs   (2.260 μs .. 2.275 μs)
std dev              21.17 ns   (13.06 ns .. 39.26 ns)

benchmarking new/key2file'
time                 1.744 μs   (1.741 μs .. 1.747 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.745 μs   (1.742 μs .. 1.751 μs)
std dev              13.55 ns   (9.099 ns .. 21.89 ns)

benchmarking old/file2key
time                 6.114 μs   (6.102 μs .. 6.129 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.118 μs   (6.106 μs .. 6.143 μs)
std dev              55.00 ns   (30.08 ns .. 100.2 ns)

benchmarking new/file2key'
time                 1.791 μs   (1.782 μs .. 1.801 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.792 μs   (1.785 μs .. 1.804 μs)
std dev              32.46 ns   (20.59 ns .. 50.82 ns)
variance introduced by outliers: 19% (moderately inflated)
2019-01-11 16:33:42 -04:00
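
A toy attoparsec parser in the spirit of the stricter file2key: fields
in a fixed order, no repetition (greatly simplified from the real
parser):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Attoparsec.ByteString.Char8
    import qualified Data.ByteString.Char8 as B8

    data Key = Key
        { keyBackend :: B8.ByteString
        , keySize :: Maybe Integer
        , keyName :: B8.ByteString
        } deriving (Show)

    keyParser :: Parser Key
    keyParser = Key
        <$> takeWhile1 (/= '-')                                 -- backend name
        <*> option Nothing (Just <$> (string "-s" *> decimal))  -- optional size field
        <*> (string "--" *> takeByteString)                     -- rest is the key name

    main :: IO ()
    main = print $ parseOnly keyParser (B8.pack "SHA256E-s1048576--examplename")
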
Joey Hess
b552551b33
use ByteString in Key for speed
This is an easy win for parseKeyVariety:

benchmarking old/parseKeyVariety
time                 1.515 μs   (1.512 μs .. 1.517 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.515 μs   (1.513 μs .. 1.517 μs)
std dev              6.417 ns   (4.992 ns .. 8.113 ns)

benchmarking new/parseKeyVariety
time                 54.97 ns   (54.70 ns .. 55.40 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 55.42 ns   (55.05 ns .. 56.03 ns)
std dev              1.562 ns   (969.5 ps .. 2.442 ns)
variance introduced by outliers: 44% (moderately inflated)

For formatKeyVariety, using a Builder is marginally worse than building a
String... (This is with criterion evaluating fully to nf not whnf)

benchmarking old/formatKeyVariety
time                 434.3 ns   (428.0 ns .. 440.4 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 430.6 ns   (428.2 ns .. 433.9 ns)
std dev              9.166 ns   (6.932 ns .. 11.94 ns)
variance introduced by outliers: 27% (moderately inflated)

benchmarking Builder/formatKeyVariety
time                 526.5 ns   (524.7 ns .. 528.8 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 526.1 ns   (524.9 ns .. 528.5 ns)
std dev              5.687 ns   (3.762 ns .. 8.000 ns)

Manually building the ByteString was better, but still slightly slower than String,
due to the inefficient need to S.pack . show the HashSize:

benchmarking formatKeyVariety
time                 459.5 ns   (455.8 ns .. 463.2 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 459.9 ns   (457.4 ns .. 466.6 ns)
std dev              11.65 ns   (6.860 ns .. 21.41 ns)
variance introduced by outliers: 35% (moderately inflated)

So I cheated and made parseKeyVariety cache the original ByteString,
for formatKeyVariety to use instead of re-building it. Final benchmark:

benchmarking new/formatKeyVariety
time                 50.64 ns   (50.57 ns .. 50.73 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 51.05 ns   (50.60 ns .. 52.71 ns)
std dev              2.790 ns   (259.6 ps .. 5.916 ns)
variance introduced by outliers: 75% (severely inflated)

benchmarking new/parseKeyVariety
time                 71.88 ns   (71.54 ns .. 72.24 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 71.97 ns   (71.69 ns .. 72.47 ns)
std dev              1.249 ns   (910.7 ps .. 1.791 ns)
variance introduced by outliers: 22% (moderately inflated)
2019-01-11 16:32:51 -04:00
Joey Hess
ed8d9a29fe
add missing case 2019-01-10 17:17:37 -04:00
Joey Hess
591e4b145f
convert old uuid-based log parsers to attoparsec
This preserves the workaround for the old bug that caused NoUUID items
to be stored in the log, prefixing log lines with " ". It's now handled
implicitly, by using takeWhile1 (/= ' ') to get the uuid.

There is a behavior change from the old parser, which split the value
into words and then recombined it. That meant that "foo  bar" and "foo\tbar"
came out as "foo bar". That behavior was not documented, and seems
surprising; it meant that after a git-annex describe here "foo  bar",
you wouldn't get that same string back out when git-annex displayed repo
descriptions.

Otoh, some other parsers relied on the old behavior, and the attoparsec
rewrites had to deal with the issue themselves...

For group.log, there are some edge cases around the user providing a
group name with a leading or trailing space. The old parser would ignore
such excess whitespace. The new parser does too, because the alternative
is to refuse to parse something like " group1  group2 " due to excess
whitespace, which would be even more confusing behavior.

The only git-annex branch log file that is not converted to attoparsec
and bytestring-builder now is transitions.log.
2019-01-10 16:34:20 -04:00
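
A sketch in the style of these parsers: the uuid is everything up to the
first space, and the rest of the line is kept verbatim, so "foo  bar" no
longer collapses to "foo bar" (simplified; the real parsers also handle
a trailing timestamp and the NoUUID workaround):

    {-# LANGUAGE OverloadedStrings #-}
    import Data.Attoparsec.ByteString.Char8
    import qualified Data.ByteString.Char8 as B8

    data LogLine = LogLine
        { logUUID :: B8.ByteString
        , logValue :: B8.ByteString
        } deriving (Show)

    logLineParser :: Parser LogLine
    logLineParser = LogLine
        <$> takeWhile1 (/= ' ')           -- uuid: everything up to the first space
        <*> (char ' ' *> takeByteString)  -- value: the rest, whitespace preserved

    main :: IO ()
    main = print $ parseOnly logLineParser
        (B8.pack "e605dca6-446a-11e0-8b2a-002170d25c55 foo  bar")
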
Joey Hess
6f66b53a30
newtype Group to ByteString
This may speed up queries for things in groups, due to Eq and Ord being faster.
2019-01-09 15:05:49 -04:00
Joey Hess
2fef43dd71
convert all per-uuid log files to use Builder
Mostly didn't push the ByteStrings down very deep, but none of these log
files are written to frequently, so the slight remaining inefficiency
doesn't matter.

In Logs.UUID, removed the fixBadUUID code that cleaned up after a bug in
git-annex versions 3.20111105-3.20111110. In the unlikely event that a repo was
last touched by that ancient git-annex version, the descriptions of remotes
would appear missing when used with this version of git-annex. That is such minor
breakage, and so unlikely to still be a problem for any repos, that it was not
worth forward-porting that code to ByteString.
2019-01-09 14:00:35 -04:00
Joey Hess
16c798b5ef
switch MetaValue to ByteString and MetaField to Text
MetaField was already limited to alphanumerics, so it makes sense to use
Text for it.

Note that technically a UUID can contain invalid UTF-8, and so
remoteMetaDataPrefix's use of T.pack . fromUUID could replace non-UTF8
values with '?' or whatever. In practice, a UUID is usually also text;
I only kept open the possibility of it containing invalid UTF-8 to avoid
breaking parsing of strange UUIDs in git-annex branch files. So, I
decided to let this edge case slip by.

Have not updated the rest of the code base yet for this change, as the
change took 2.5 hours longer than I expected to get working properly.
2019-01-07 14:18:24 -04:00
Joey Hess
11d6e2e260
new improved benchmark command that can benchmark anything git-annex does 2019-01-04 13:46:36 -04:00