merging sqlite and bs branches

Since the sqlite branch uses blobs extensively, there are some
performance benefits: ByteStrings now get stored and retrieved without
conversion in some cases, such as in Database.Export.
This commit is contained in:
Joey Hess 2019-12-06 15:17:54 -04:00
commit 2f9a80d803
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
266 changed files with 2860 additions and 1325 deletions


@ -0,0 +1,14 @@
In neurophysiology we encounter HUGE files (HDF5 .nwb files).
Sizes reach hundreds of GBs per file (thus exceeding any possible file system memory cache size). While operating in the cloud or on a fast connection, it is possible to fetch the files at speeds of up to 100 MB/s.
Upon successful download, such files are then read back by git-annex for checksum validation, often at slower speeds (e.g. <60 MB/s on an EC2 SSD drive).
So, ironically, checksum validation does not just double, but rather nearly triples the overall time to obtain a file.
I think ideally:
- (at minimum) for built-in special remotes (such as web), it would be great if git-annex checksummed incrementally as data comes in;
- it would be made possible for external special remotes to provide the desired checksum for obtained content. git-annex should of course first inform them of the type (backend) of checksum it is interested in, and maybe external remotes could report which checksums they support.
If an example is needed, here is http://datasets.datalad.org/allen-brain-observatory/visual-coding-2p/.git with >50GB files such as ophys_movies/ophys_experiment_576261945.h5 .
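The first bullet can be had today for simple cases with a plain pipeline, since `tee` forks the stream so hashing happens as data arrives (a sketch; `fetch_checksummed`, the URL, and SHA256 as the backend are placeholders, not git-annex API):

```shell
# download and checksum in a single pass over the data: tee writes the
# file while sha256sum consumes a second copy of the stream, so no
# second read of a huge file is needed (names here are invented)
fetch_checksummed() {
    url=$1; out=$2
    curl -sL "$url" | tee "$out" | sha256sum | cut -d' ' -f1 > "$out.sha256"
}
# usage: fetch_checksummed https://example.com/ophys_experiment.h5 ophys_experiment.h5
```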
[[!meta author=yoh]]
[[!tag projects/dandi]]


@ -0,0 +1,11 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="use named pipes?"
date="2019-11-25T16:45:26Z"
content="""
For external remotes, git-annex can pass a named pipe to the `TRANSFER` request as the `FILE` parameter, and use `tee` to create a separate stream for checksumming.
An external remote could also do its own checksum checking and then set remote.<name>.annex-verify=false.
Could also make a “wrapper” external remote that delegates all requests to a given external remote but does checksum-checking in parallel with downloading (by creating a named pipe and passing that to the wrapped remote).
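A minimal sketch of the fifo-plus-tee arrangement (all file names invented; `printf` stands in for the external remote writing its download to the `FILE` it was given):

```shell
# git-annex would hand the external remote a named pipe as FILE; tee
# writes the real content file while sha256sum consumes a second copy
mkfifo transfer.fifo
tee content.tmp < transfer.fifo | sha256sum | cut -d' ' -f1 > content.sha256 &
# the external remote would now receive: TRANSFER RETRIEVE <key> transfer.fifo
# and write the download there; printf stands in for the remote here
printf 'retrieved bytes' > transfer.fifo
wait
```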
"""]]


@ -0,0 +1,5 @@
It would be useful to have a [[`git-annex-cat`|forum/Is_there_a___34__git_annex_cat-file__34___type_command__63__/]] command that outputs the contents of an annexed file without storing it in the annex. This [[can be faster|OPT: "bundle" get + check (of checksum) in a single operation]] than `git-annex-get` followed by `cat`, even if the file is already present. It avoids some failure modes of `git-annex-get` (like running out of local space, or contending for locks). It supports the common use case of just needing a file for some operation, without needing to remember to drop it later. It could be used to implement a web server or FUSE filesystem that serves git-annex repo files on demand.
If the file is not present, or `remote.here.cost` is higher than `remote.someremote.cost` where the file is present, `someremote` would get a `TRANSFER` request where the `FILE` argument is a named pipe, and a `cat` of that named pipe would be started.
If the file is not annexed, for uniformity `git-annex-cat file` would just call `cat file`.
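A rough approximation with existing plumbing commands, as a sketch (`annex_cat` is an invented name; a real `git-annex-cat` would stream from the remote through a named pipe rather than `get` the whole file first):

```shell
# sketch of git-annex-cat built from existing commands; uses lookupkey
# and contentlocation to find present content, falling back to plain cat
annex_cat() {
    f=$1
    key=$(git annex lookupkey -- "$f" 2>/dev/null) || key=
    if [ -n "$key" ]; then
        loc=$(git annex contentlocation "$key" 2>/dev/null) ||
            { git annex get --quiet -- "$f" >/dev/null 2>&1 &&
              loc=$(git annex contentlocation "$key"); }
        cat "$loc"     # annexed and (now) present
    else
        cat -- "$f"    # not annexed: plain cat, for uniformity
    fi
}
```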


@ -0,0 +1,8 @@
[[!comment format=mdwn
username="yarikoptic"
avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
subject="reference original bug report"
date="2019-11-29T17:58:28Z"
content="""
The original bug report was https://git-annex.branchable.com/bugs/git-lfs_remote_URL_is_not_recorded__63__/ for an attempt to share some NWB data on GitHub's LFS.
"""]]


@ -0,0 +1,11 @@
After unlocking a file, `git status` runs the smudge filter. That is
unnecessary, and when many files were unlocked, it can take a long time
because [[git_smudge_clean_interface_suboptiomal]] means it runs git-annex
once per file.
It should be possible to avoid that, as was done with git drop in [[!commit
1113caa53efedbe7ab1d98b74010160f20473e8d]]. I tried making Command.Unlock
use restagePointerFile, but that did not help: git update-index then
smudges it during the `git annex unlock`, which is no faster (though at least
doing it then would avoid the surprise of a slow `git status` or `git
commit -a`). Afterwards, `git status` smudged it again; I'm unsure why!


@ -0,0 +1,46 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="moving unlocked file onto locked file isn't possible"
date="2019-11-24T16:36:24Z"
content="""
`git mv` won't move an unlocked file onto a locked file (trace below).
\"The right solution is to improve the smudge/clean filter interface\" -- of course, but realistically, do you think git devs can be persuaded to do [[this|todo/git_smudge_clean_interface_suboptiomal]] sometime soon? Even if yes, it still seems better to avoid adding a step to common git workflows, than to make the step fast.
[[!format sh \"\"\"
(master_env_v164_py36) 11:14 [t1] $ ls
bar foo
(master_env_v164_py36) 11:14 [t1] $ git init
Initialized empty Git repository in /tmp/t1/.git/
(master_env_v164_py36) 11:14 [t1] $ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ git annex add foo
add foo ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ git annex add bar
add bar ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ ls -alt
total 0
drwxrwxr-x 8 ilya ilya 141 Nov 24 11:14 .git
drwxrwxr-x 3 ilya ilya 40 Nov 24 11:14 .
lrwxrwxrwx 1 ilya ilya 108 Nov 24 11:14 bar -> .git/annex/objects/jx/MV/MD5E-s4--c157a79031e1c40f85931829bc5fc552/MD5E-s4--c157a79031\
e1c40f85931829bc5fc552
lrwxrwxrwx 1 ilya ilya 108 Nov 24 11:14 foo -> .git/annex/objects/00/zZ/MD5E-s4--d3b07384d113edec49eaa6238ad5ff00/MD5E-s4--d3b07384d1\
13edec49eaa6238ad5ff00
drwxrwxrwt 12 root root 282 Nov 24 11:14 ..
(master_env_v164_py36) 11:14 [t1] $ git annex unlock bar
unlock bar ok
(recording state in git...)
(master_env_v164_py36) 11:16 [t1] $ git mv bar foo
fatal: destination exists, source=bar, destination=foo
(master_env_v164_py36) 11:17 [t1] $
\"\"\"]]
"""]]


@ -0,0 +1,126 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="even git mv -f seems to work correctly"
date="2019-11-24T17:25:32Z"
content="""
Also, `git mv` seems to reuse the already-smudged object contents of the source file for the target file, so even with `git mv -f` only the checksum gets checked into git:
[[!format sh \"\"\"
+ cat ./test-git-mv
#!/bin/bash
set -eu -o pipefail -x
cat $0
TEST_DIR=/tmp/test_dir
mkdir -p $TEST_DIR
chmod -R u+w $TEST_DIR
rm -rf $TEST_DIR
mkdir -p $TEST_DIR
pushd $TEST_DIR
git init
git annex init
git --version
git annex version
rm .git/info/attributes
echo foo > foo
echo bar > bar
git annex add foo bar
git check-attr -a foo
git check-attr -a bar
echo 'bar filter=annex' > .gitattributes
git add .gitattributes
git check-attr -a foo
git check-attr -a bar
git annex unlock bar
git mv bar foo || true
git mv -f bar foo
git commit -m add
git log -p
+ TEST_DIR=/tmp/test_dir
+ mkdir -p /tmp/test_dir
+ chmod -R u+w /tmp/test_dir
+ rm -rf /tmp/test_dir
+ mkdir -p /tmp/test_dir
+ pushd /tmp/test_dir
/tmp/test_dir /tmp
+ git init
Initialized empty Git repository in /tmp/test_dir/.git/
+ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
+ git --version
git version 2.20.1
+ git annex version
git-annex version: 7.20191024-g6dc2272
build flags: Assistant Webapp Pairing S3 WebDAV Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.21.1 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.1.0 ghc-8.6.5 http-client-0.5.14 persistent-sqlite-2.9.3 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs hook external
operating system: linux x86_64
supported repository versions: 7
upgrade supported from repository versions: 0 1 2 3 4 5 6
local repository version: 7
+ rm .git/info/attributes
+ echo foo
+ echo bar
+ git annex add foo bar
add foo ok
add bar ok
(recording state in git...)
+ git check-attr -a foo
+ git check-attr -a bar
+ echo 'bar filter=annex'
+ git add .gitattributes
+ git check-attr -a foo
+ git check-attr -a bar
bar: filter: annex
+ git annex unlock bar
unlock bar ok
(recording state in git...)
+ git mv bar foo
fatal: destination exists, source=bar, destination=foo
+ true
+ git mv -f bar foo
+ git commit -m add
[master (root-commit) 8610c0d] add
2 files changed, 2 insertions(+)
create mode 100644 .gitattributes
create mode 100644 foo
+ git log -p
commit 8610c0d8f327140608e71dc229f167731552d284
Author: Ilya Shlyakhter <ilya_shl@alum.mit.edu>
Date: Sun Nov 24 12:24:28 2019 -0500
add
diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..649f07e
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+bar filter=annex
diff --git a/foo b/foo
new file mode 100644
index 0000000..266ae50
--- /dev/null
+++ b/foo
@@ -0,0 +1 @@
+/annex/objects/MD5E-s4--c157a79031e1c40f85931829bc5fc552
\"\"\"]]
"""]]


@ -0,0 +1,37 @@
git-annex uses FilePath (String) extensively. That's a slow data type.
Converting to ByteString and RawFilePath should speed it up
significantly, according to [[/profiling]].
I've made a test branch, `bs`, to see what kind of performance improvement
to expect.
Benchmarking `git-annex find`, speedups range from 28-66%. The files fly by
much more snappily. Other commands likely also speed up, but do more work
than find so the improvement is not as large.
The `bs` branch is in a mergeable state now, but still needs work:
* Eliminate all the fromRawFilePath, toRawFilePath, encodeBS,
decodeBS conversions. Or at least most of them. There are likely
quite a few places where a value is converted back and forth several times.
As a first step, profile and look for the hot spots. Known hot spots:
* keyFile uses fromRawFilePath and that adds around 3% overhead in `git-annex find`.
Converting it to a RawFilePath needs a version of `</>` for RawFilePaths.
* getJournalFileStale uses fromRawFilePath, and adds 3-5% overhead in
`git-annex whereis`. Converting it to RawFilePath needs a version
of `</>` for RawFilePaths. It also needs a ByteString.readFile
for RawFilePath.
* System.FilePath is not available for RawFilePath, and many of the
conversions are to get a FilePath in order to use that library.
It should be entirely straightforward to make a version of System.FilePath
that can operate on RawFilePath, except possibly there could be some
complications due to Windows.
* Use versions of IO actions like getFileStatus that take a RawFilePath,
avoiding a conversion. Note that these are only available on unix, not
windows, so a compatibility shim will be needed.
(I can't seem to find any library that provides one.)


@ -0,0 +1,44 @@
[[!comment format=mdwn
username="joey"
subject="""profiling"""
date="2019-11-26T20:05:28Z"
content="""
Profiling the early version of the `bs` branch.
Tue Nov 26 16:05 2019 Time and Allocation Profiling Report (Final)
git-annex +RTS -p -RTS find
total time = 2.75 secs (2749 ticks @ 1000 us, 1 processor)
total alloc = 1,642,607,120 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
inAnnex'.\ Annex.Content Annex/Content.hs:(103,61)-(118,31) 31.2 46.8
keyFile' Annex.Locations Annex/Locations.hs:(567,1)-(577,30) 5.3 6.2
encodeW8 Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:(189,1)-(191,70) 3.3 4.2
>>=.\ Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:(146,9)-(147,44) 2.9 0.8
>>=.\.succ' Data.Attoparsec.Internal.Types Data/Attoparsec/Internal/Types.hs:146:13-76 2.6 0.3
keyFile'.esc Annex.Locations Annex/Locations.hs:(573,9)-(577,30) 2.5 5.5
parseLinkTarget Annex.Link Annex/Link.hs:(254,1)-(262,25) 2.4 4.4
getAnnexLinkTarget'.probesymlink Annex.Link Annex/Link.hs:78:9-46 2.4 2.8
w82s Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:217:1-15 2.3 6.0
keyPath Annex.Locations Annex/Locations.hs:(606,1)-(608,23) 1.9 4.0
parseKeyVariety Types.Key Types/Key.hs:(323,1)-(371,42) 1.8 0.0
getState Annex Annex.hs:(251,1)-(254,27) 1.7 0.4
fileKey'.go Annex.Locations Annex/Locations.hs:588:9-55 1.4 0.8
fileKey' Annex.Locations Annex/Locations.hs:(586,1)-(596,41) 1.4 1.7
hashUpdates.\.\.\ Crypto.Hash Crypto/Hash.hs:85:48-99 1.3 0.0
parseLinkTargetOrPointer Annex.Link Annex/Link.hs:(239,1)-(243,25) 1.2 0.1
withPtr Basement.Block.Base Basement/Block/Base.hs:(395,1)-(404,31) 1.2 0.6
primitive Basement.Monad Basement/Monad.hs:72:5-18 1.0 0.1
decodeBS' Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:151:1-31 1.0 2.8
mkKeySerialization Types.Key Types/Key.hs:(115,1)-(117,22) 0.7 1.1
w82c Utility.FileSystemEncoding Utility/FileSystemEncoding.hs:211:1-28 0.6 1.1
Comparing with [[/profiling]] results, the alloc is down significantly.
And the main IO actions are getting a larger share of the runtime.
There is still significant conversion going on: encodeW8 and w82s and
decodeBS' and w82c. Likely another 5% or so speedup if that's eliminated.
"""]]


@ -0,0 +1,19 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="representing paths"
date="2019-11-27T15:08:40Z"
content="""
Thanks for working on this Joey.
I don't know Haskell or git-annex architecture, so my thoughts might make no sense, but I'll post just in case.
\"There are likely quite a few places where a value is converted back and forth several times\" -- as a quick/temp fix, could memoization speed this up? Or memoizing the results of some system calls?
The many filenames flying around often share long prefixes. Could that be used to speed things up? E.g. if they could be represented as pointers into some compact storage, maybe cache performance would improve.
\"git annex find... files fly by much more snappily\" -- does this mean `git-annex-find` is testing each file individually, as opposed to constructing a SQL query to an indexed db? Maybe, simpler `git-annex-find` queries that are fully mappable to SQL queries could be special-cased?
Sorry for naive comments, I'll eventually read up on Haskell and make more sense...
"""]]


@ -0,0 +1,8 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="parallelization"
date="2019-11-27T17:23:14Z"
content="""
When operating on many files, maybe run N parallel commands where the i'th command ignores paths for which `(hash(filename) modulo N) != i`. Or, if the git index has size I, the i'th command ignores paths that are not lexicographically between `index[(I/N)*i]` and `index[(I/N)*(i+1)]` (for the index state at command start). Extending [[git-annex-matching-options]] with `--block=i` would let this be done using `xargs`.
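One way to approximate this today, splitting by line number rather than by filename hash (a sketch; `run_blocked` is an invented helper, and `xargs -d` assumes GNU xargs):

```shell
# split stdin (a list of file names, one per line) into N block files
# and run one command per block in parallel; approximates the proposed
# --block=i option with plain shell tools
run_blocked() {
    n=$1; shift
    awk -v n="$n" '{ print > ("block." (NR % n)) }'
    for i in $(seq 0 $((n - 1))); do
        [ -s "block.$i" ] && xargs -d '\n' "$@" < "block.$i" &
    done
    wait
}
# e.g.: git annex find --not --in here | run_blocked 4 git annex get
```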
"""]]