merging sqlite and bs branches

Since the sqlite branch uses blobs extensively, there are some
performance benefits: ByteStrings now get stored and retrieved without
conversion in some cases, such as in Database.Export.
This commit is contained in:
Joey Hess 2019-12-06 15:17:54 -04:00
commit 2f9a80d803
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
266 changed files with 2860 additions and 1325 deletions


@ -0,0 +1,14 @@
In neurophysiology we encounter HUGE files (HDF5 .nwb files).
Sizes reach hundreds of GBs per file (thus exceeding any possible file system memory cache size). While operating in the cloud or on a fast connection, it is possible to fetch the files at speeds of up to 100 MB/s.
Upon successful download, such files are then read back by git-annex for checksum validation, often at slower speeds (e.g. <60 MB/s on an EC2 SSD drive).
So, ironically, checksum validation does not just double, but rather nearly triples the overall time to obtain a file.
I think ideally:
- (at minimum) for built-in special remotes (such as web), it would be great if git-annex checksummed incrementally as data comes in;
- it would be made possible for external special remotes to provide the desired checksum for obtained content. git-annex should of course first inform them of the type (backend) of checksum it is interested in, and maybe external remotes could report which checksums they support.
If an example is needed, here is http://datasets.datalad.org/allen-brain-observatory/visual-coding-2p/.git with >50GB files such as ophys_movies/ophys_experiment_576261945.h5 .
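The first bullet can be had today for simple cases with a plain pipeline, since `tee` forks the stream so hashing happens as data arrives (a sketch; `fetch_checksummed`, the URL, and SHA256 as the backend are placeholders, not git-annex API):

```shell
# download and checksum in a single pass over the data: tee writes the
# file while sha256sum consumes a second copy of the stream, so no
# second read of a huge file is needed (names here are invented)
fetch_checksummed() {
    url=$1; out=$2
    curl -sL "$url" | tee "$out" | sha256sum | cut -d' ' -f1 > "$out.sha256"
}
# usage: fetch_checksummed https://example.com/ophys_experiment.h5 ophys_experiment.h5
```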
[[!meta author=yoh]]
[[!tag projects/dandi]]


@ -0,0 +1,11 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="use named pipes?"
date="2019-11-25T16:45:26Z"
content="""
For external remotes, git-annex can pass a named pipe to the `TRANSFER` request as the `FILE` parameter, and use `tee` to create a separate stream for checksumming.
An external remote could also do its own checksum checking and then set remote.<name>.annex-verify=false.
Could also make a “wrapper” external remote that delegates all requests to a given external remote but does checksum-checking in parallel with downloading (by creating a named pipe and passing that to the wrapped remote).
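A minimal sketch of the fifo-plus-tee arrangement (all file names invented; `printf` stands in for the external remote writing its download to the `FILE` it was given):

```shell
# git-annex would hand the external remote a named pipe as FILE; tee
# writes the real content file while sha256sum consumes a second copy
mkfifo transfer.fifo
tee content.tmp < transfer.fifo | sha256sum | cut -d' ' -f1 > content.sha256 &
# the external remote would now receive: TRANSFER RETRIEVE <key> transfer.fifo
# and write the download there; printf stands in for the remote here
printf 'retrieved bytes' > transfer.fifo
wait
```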
"""]]


@ -0,0 +1,5 @@
It would be useful to have a [[`git-annex-cat`|forum/Is_there_a___34__git_annex_cat-file__34___type_command__63__/]] command that outputs the contents of an annexed file without storing it in the annex. This [[can be faster|OPT: "bundle" get + check (of checksum) in a single operation]] than `git-annex-get` followed by `cat`, even if the file is already present. It avoids some failure modes of `git-annex-get` (like running out of local space, or contending for locks). It supports the common use case of just needing a file for some operation, without needing to remember to drop it later. It could be used to implement a web server or FUSE filesystem that serves git-annex repo files on demand.
If the file is not present, or `remote.here.cost` is higher than `remote.someremote.cost` where the file is present, `someremote` would get a `TRANSFER` request where the `FILE` argument is a named pipe, and a `cat` of that named pipe would be started.
If the file is not annexed, for uniformity `git-annex-cat file` would just call `cat file`.
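A rough approximation with existing plumbing commands, as a sketch (`annex_cat` is an invented name; a real `git-annex-cat` would stream from the remote through a named pipe rather than `get` the whole file first):

```shell
# sketch of git-annex-cat built from existing commands; uses lookupkey
# and contentlocation to find present content, falling back to plain cat
annex_cat() {
    f=$1
    key=$(git annex lookupkey -- "$f" 2>/dev/null) || key=
    if [ -n "$key" ]; then
        loc=$(git annex contentlocation "$key" 2>/dev/null) ||
            { git annex get --quiet -- "$f" >/dev/null 2>&1 &&
              loc=$(git annex contentlocation "$key"); }
        cat "$loc"     # annexed and (now) present
    else
        cat -- "$f"    # not annexed: plain cat, for uniformity
    fi
}
```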


@ -0,0 +1,8 @@
[[!comment format=mdwn
username="yarikoptic"
avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
subject="reference original bug report"
date="2019-11-29T17:58:28Z"
content="""
The original bug report was https://git-annex.branchable.com/bugs/git-lfs_remote_URL_is_not_recorded__63__/ for an attempt to share some NWB data on GitHub's LFS.
"""]]


@ -0,0 +1,11 @@
After unlocking a file, `git status` runs the smudge filter. That is
unnecessary, and when many files were unlocked, it can take a long time
because [[git_smudge_clean_interface_suboptiomal]] means it runs git-annex
once per file.
It should be possible to avoid that, as was done with git drop in [[!commit
1113caa53efedbe7ab1d98b74010160f20473e8d]]. I tried making Command.Unlock
use restagePointerFile, but that did not help: git update-index then
smudges it during the `git annex unlock`, which is no faster (though at least
doing it then would avoid the surprise of a slow `git status` or `git
commit -a`). Afterwards, `git status` smudged it again; I'm unsure why!


@ -0,0 +1,46 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="moving unlocked file onto locked file isn't possible"
date="2019-11-24T16:36:24Z"
content="""
`git mv` won't move an unlocked file onto a locked file (trace below).
\"The right solution is to improve the smudge/clean filter interface\" -- of course, but realistically, do you think git devs can be persuaded to do [[this|todo/git_smudge_clean_interface_suboptiomal]] sometime soon? Even if yes, it still seems better to avoid adding a step to common git workflows, than to make the step fast.
[[!format sh \"\"\"
(master_env_v164_py36) 11:14 [t1] $ ls
bar foo
(master_env_v164_py36) 11:14 [t1] $ git init
Initialized empty Git repository in /tmp/t1/.git/
(master_env_v164_py36) 11:14 [t1] $ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ git annex add foo
add foo ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ git annex add bar
add bar ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ ls -alt
total 0
drwxrwxr-x 8 ilya ilya 141 Nov 24 11:14 .git
drwxrwxr-x 3 ilya ilya 40 Nov 24 11:14 .
lrwxrwxrwx 1 ilya ilya 108 Nov 24 11:14 bar -> .git/annex/objects/jx/MV/MD5E-s4--c157a79031e1c40f85931829bc5fc552/MD5E-s4--c157a79031\
e1c40f85931829bc5fc552
lrwxrwxrwx 1 ilya ilya 108 Nov 24 11:14 foo -> .git/annex/objects/00/zZ/MD5E-s4--d3b07384d113edec49eaa6238ad5ff00/MD5E-s4--d3b07384d1\
13edec49eaa6238ad5ff00
drwxrwxrwt 12 root root 282 Nov 24 11:14 ..
(master_env_v164_py36) 11:14 [t1] $ git annex unlock bar
unlock bar ok
(recording state in git...)
(master_env_v164_py36) 11:16 [t1] $ git mv bar foo
fatal: destination exists, source=bar, destination=foo
(master_env_v164_py36) 11:17 [t1] $
\"\"\"]]
"""]]


@ -0,0 +1,126 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="even git mv -f seems to work correctly"
date="2019-11-24T17:25:32Z"
content="""
Also, `git mv` seems to reuse the already-smudged object contents of the source file for the target file, so even with `git mv -f` only the checksum gets checked into git:
[[!format sh \"\"\"
+ cat ./test-git-mv
#!/bin/bash
set -eu -o pipefail -x
cat $0
TEST_DIR=/tmp/test_dir
mkdir -p $TEST_DIR
chmod -R u+w $TEST_DIR
rm -rf $TEST_DIR
mkdir -p $TEST_DIR
pushd $TEST_DIR
git init
git annex init
git --version
git annex version
rm .git/info/attributes
echo foo > foo
echo bar > bar
git annex add foo bar
git check-attr -a foo
git check-attr -a bar
echo 'bar filter=annex' > .gitattributes
git add .gitattributes
git check-attr -a foo
git check-attr -a bar
git annex unlock bar
git mv bar foo || true
git mv -f bar foo
git commit -m add
git log -p
+ TEST_DIR=/tmp/test_dir
+ mkdir -p /tmp/test_dir
+ chmod -R u+w /tmp/test_dir
+ rm -rf /tmp/test_dir
+ mkdir -p /tmp/test_dir
+ pushd /tmp/test_dir
/tmp/test_dir /tmp
+ git init
Initialized empty Git repository in /tmp/test_dir/.git/
+ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
+ git --version
git version 2.20.1
+ git annex version
git-annex version: 7.20191024-g6dc2272
build flags: Assistant Webapp Pairing S3 WebDAV Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.21.1 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.1.0 ghc-8.6.5 http-client-0.5.14 persistent-sqlite-2.9.3 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs hook external
operating system: linux x86_64
supported repository versions: 7
upgrade supported from repository versions: 0 1 2 3 4 5 6
local repository version: 7
+ rm .git/info/attributes
+ echo foo
+ echo bar
+ git annex add foo bar
add foo ok
add bar ok
(recording state in git...)
+ git check-attr -a foo
+ git check-attr -a bar
+ echo 'bar filter=annex'
+ git add .gitattributes
+ git check-attr -a foo
+ git check-attr -a bar
bar: filter: annex
+ git annex unlock bar
unlock bar ok
(recording state in git...)
+ git mv bar foo
fatal: destination exists, source=bar, destination=foo
+ true
+ git mv -f bar foo
+ git commit -m add
[master (root-commit) 8610c0d] add
2 files changed, 2 insertions(+)
create mode 100644 .gitattributes
create mode 100644 foo
+ git log -p
commit 8610c0d8f327140608e71dc229f167731552d284
Author: Ilya Shlyakhter <ilya_shl@alum.mit.edu>
Date: Sun Nov 24 12:24:28 2019 -0500
add
diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..649f07e
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+bar filter=annex
diff --git a/foo b/foo
new file mode 100644
index 0000000..266ae50
--- /dev/null
+++ b/foo
@@ -0,0 +1 @@
+/annex/objects/MD5E-s4--c157a79031e1c40f85931829bc5fc552
\"\"\"]]
"""]]


@ -0,0 +1,37 @@
git-annex uses FilePath (String) extensively. That's a slow data type.
Converting to ByteString and RawFilePath should speed it up
significantly, according to [[/profiling]].
I've made a test branch, `bs`, to see what kind of performance improvement
to expect.
Benchmarking `git-annex find`, speedups range from 28-66%. The files fly by
much more snappily. Other commands likely also speed up, but do more work
than find so the improvement is not as large.
The `bs` branch is in a mergeable state now, but still needs work:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions. Or at least most of them. There are likely
quite a few places where a value is converted back and forth several times.
As a first step, profile and look for the hot spots. Known hot spots:
* keyFile uses fromRawFilePath and that adds around 3% overhead in `git-annex find`.
Converting it to a RawFilePath needs a version of `</>` for RawFilePaths.
* getJournalFileStale uses fromRawFilePath, and adds 3-5% overhead in
`git-annex whereis`. Converting it to RawFilePath needs a version
of `</>` for RawFilePaths. It also needs a ByteString.readFile
for RawFilePath.
* System.FilePath is not available for RawFilePath, and many of the
conversions are to get a FilePath in order to use that library.
It should be entirely straightforward to make a version of System.FilePath
that can operate on RawFilePath, except possibly there could be some
complications due to Windows.
* Use versions of IO actions like getFileStatus that take a RawFilePath,
avoiding a conversion. Note that these are only available on unix, not
windows, so a compatibility shim will be needed.
(I can't seem to find any library that provides one.)


@ -0,0 +1,44 @@
[[!comment format=mdwn
username="joey"
subject="""profiling"""
date="2019-11-26T20:05:28Z"
content="""
Profiling the early version of the `bs` branch.
Tue Nov 26 16:05 2019 Time and Allocation Profiling Report (Final)
git-annex +RTS -p -RTS find
total time = 2.75 secs (2749 ticks @ 1000 us, 1 processor)
total alloc = 1,642,607,120 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
inAnnex'.\ Annex.Content Annex/Content.hs:(103,61)-(118,31) 31.2 46.8
keyFile' Annex.Locations Annex/Locations.hs:(567,1)-(577,30) 5.3 6.2
encodeW8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(189,1)-(191,70) 3.3 4.2
>>=.\ Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44) 2.9 0.8
>>=.\.succ' Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76 2.6 0.3
keyFile'.esc Annex.Locations Annex/Locations.hs:(573,9)-(577,30) 2.5 5.5
parseLinkTarget Annex.Link Annex/Link.hs:(254,1)-(262,25) 2.4 4.4
getAnnexLinkTarget'.probesymlink Annex.Link Annex/Link.hs:78:9-46 2.4 2.8
w82s Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:217:1-15 2.3 6.0
keyPath Annex.Locations Annex/Locations.hs:(606,1)-(608,23) 1.9 4.0
parseKeyVariety Types.Key Types/Key.hs:(323,1)-(371,42) 1.8 0.0
getState Annex Annex.hs:(251,1)-(254,27) 1.7 0.4
fileKey'.go Annex.Locations Annex/Locations.hs:588:9-55 1.4 0.8
fileKey' Annex.Locations Annex/Locations.hs:(586,1)-(596,41) 1.4 1.7
hashUpdates.\.\.\ Crypto.Hash Crypto/Hash.hs:85:48-99 1.3 0.0
parseLinkTargetOrPointer Annex.Link Annex/Link.hs:(239,1)-(243,25) 1.2 0.1
withPtr Basement.Block.Base Basement/Block/Base.hs:(395,1)-(404,31) 1.2 0.6
primitive Basement.Monad Basement/Monad.hs:72:5-18 1.0 0.1
decodeBS' Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:151:1-31 1.0 2.8
mkKeySerialization Types.Key Types/Key.hs:(115,1)-(117,22) 0.7 1.1
w82c Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:211:1-28 0.6 1.1
Comparing with [[/profiling]] results, the alloc is down significantly.
And the main IO actions are getting a larger share of the runtime.
There is still significant conversion going on: encodeW8 and w82s and
decodeBS' and w82c. Likely another 5% or so speedup if that's eliminated.
"""]]


@ -0,0 +1,19 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="representing paths"
date="2019-11-27T15:08:40Z"
content="""
Thanks for working on this Joey.
I don't know Haskell or git-annex architecture, so my thoughts might make no sense, but I'll post just in case.
\"There are likely quite a few places where a value is converted back and forth several times\" -- as a quick/temp fix, could memoization speed this up? Or memoizing the results of some system calls?
The many filenames flying around often share long prefixes. Could that be used to speed things up? E.g. if they could be represented as pointers into some compact storage, maybe cache performance would improve.
\"git annex find... files fly by much more snappily\" -- does this mean `git-annex-find` is testing each file individually, as opposed to constructing a SQL query to an indexed db? Maybe, simpler `git-annex-find` queries that are fully mappable to SQL queries could be special-cased?
Sorry for naive comments, I'll eventually read up on Haskell and make more sense...
"""]]


@ -0,0 +1,8 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="parallelization"
date="2019-11-27T17:23:14Z"
content="""
When operating on many files, maybe run N parallel commands where the i'th command ignores paths for which `(hash(filename) modulo N) != i`. Or, if the git index has size I, the i'th command ignores paths that are not lexicographically between `index[(I/N)*i]` and `index[(I/N)*(i+1)]` (for the index state at command start). Extending [[git-annex-matching-options]] with `--block=i` would let this be done using `xargs`.
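One way to approximate this today, splitting by line number rather than by filename hash (a sketch; `run_blocked` is an invented helper, and `xargs -d` assumes GNU xargs):

```shell
# split stdin (a list of file names, one per line) into N block files
# and run one command per block in parallel; approximates the proposed
# --block=i option with plain shell tools
run_blocked() {
    n=$1; shift
    awk -v n="$n" '{ print > ("block." (NR % n)) }'
    for i in $(seq 0 $((n - 1))); do
        [ -s "block.$i" ] && xargs -d '\n' "$@" < "block.$i" &
    done
    wait
}
# e.g.: git annex find --not --in here | run_blocked 4 git annex get
```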
"""]]