From 424b1912d6a6e78db044518a7b5fa31a058785ab Mon Sep 17 00:00:00 2001
From: Joey Hess
Date: Wed, 1 Jul 2020 12:28:44 -0400
Subject: [PATCH] followup and add link

---
 doc/design/caching_database.mdwn              |  4 +---
 ..._690c0dcbfc112f6abd94d02c248ce68b._comment | 24 +++++++++++++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)
 create mode 100644 doc/todo/speed_up_git_annex_sync_--content_--all/comment_4_690c0dcbfc112f6abd94d02c248ce68b._comment

diff --git a/doc/design/caching_database.mdwn b/doc/design/caching_database.mdwn
index f53753a18d..be0dd17fca 100644
--- a/doc/design/caching_database.mdwn
+++ b/doc/design/caching_database.mdwn
@@ -1,6 +1,7 @@
 * [[metadata]] for views
 * [[todo/cache_key_info]]
 * [[bugs/indeterminite_preferred_content_state_for_duplicated_file]]
+* [[todo/speed_up_git_annex_sync_--content_--all]]
 
 What do all these have in common? They could all be improved by
 using some kind of database to locally store the information in an
@@ -11,9 +12,6 @@ generated and updated by looking at the git repository.
 
 * Metadata can be updated by looking at the git-annex branch,
   either its current state, or the diff between the old and new versions
-* Direct mode mappings can be updated by looking at the current branch,
-  to see which files map to which key. Or the diff between the old
-  and new versions of the branch.
 * Incremental fsck information is not stored in git, but can be
   "regenerated" by running fsck again.
   (Perhaps doesn't quite fit, but let it slide..)
diff --git a/doc/todo/speed_up_git_annex_sync_--content_--all/comment_4_690c0dcbfc112f6abd94d02c248ce68b._comment b/doc/todo/speed_up_git_annex_sync_--content_--all/comment_4_690c0dcbfc112f6abd94d02c248ce68b._comment
new file mode 100644
index 0000000000..8ac734863f
--- /dev/null
+++ b/doc/todo/speed_up_git_annex_sync_--content_--all/comment_4_690c0dcbfc112f6abd94d02c248ce68b._comment
@@ -0,0 +1,24 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 4"""
+ date="2020-07-01T16:13:23Z"
+ content="""
+It's 80s in my big repo. But of course it would also have to be read back
+in and parsed, so seems it would take 160s or so. (It's going to be a dozen
+or so gb of data anywhere the speed of git-annex sync --all is a problem.)
+
+Cross-referencing it with `git ls-tree -r git-annex` to get filenames
+would mean git-annex would take more memory the more keys are stored in it.
+Which is something I have been careful to avoid.
+
+An sqlite database could surely be faster, especially if it's designed so
+it can be queried for things like "all keys in repo A that are not in repo
+B". But a sqlite database shouldn't only benefit --all, so it also needs to
+be able to do queries like "all keys that have files in HEAD, that are in
+repo A and not in repo B". With that, `git annex get` etc could also get
+faster.
+
+Anyway, it seems like --all is not really the problem for you; I guess
+you would see similar runtime if you ran git-annex sync --content with the
+larger of your two branches checked out than you do with --all.
+"""]]
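
The two queries named in the comment ("all keys in repo A that are not in repo B", and the same restricted to keys that have files in HEAD) could be sketched roughly as below. This is only an illustration using Python's `sqlite3` module, not git-annex's actual database layout: the `locations` and `head_files` tables, their columns, and the helper names `open_db`, `keys_in_a_not_b` and `head_keys_in_a_not_b` are all hypothetical. Results are yielded row by row from the cursor, so the caller's memory use need not grow with the number of keys.

```python
import sqlite3

# Hypothetical schema, invented for this sketch (not git-annex's real one):
#   locations(key, repo)  one row per (key, repository that has its content)
#   head_files(file, key) one row per annexed file in the checked-out branch
def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS locations (
            key  TEXT NOT NULL,
            repo TEXT NOT NULL,
            PRIMARY KEY (key, repo)
        );
        CREATE TABLE IF NOT EXISTS head_files (
            file TEXT PRIMARY KEY,
            key  TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS head_files_key ON head_files (key);
    """)
    return db

def keys_in_a_not_b(db, a, b):
    """Yield all keys present in repo a but not in repo b."""
    cur = db.execute(
        "SELECT key FROM locations WHERE repo = ? "
        "EXCEPT "
        "SELECT key FROM locations WHERE repo = ? "
        "ORDER BY key",
        (a, b),
    )
    for (key,) in cur:
        yield key

def head_keys_in_a_not_b(db, a, b):
    """Yield keys that have files in HEAD, are in repo a, and not in repo b."""
    cur = db.execute(
        "SELECT DISTINCT h.key FROM head_files h "
        "JOIN locations la ON la.key = h.key AND la.repo = ? "
        "WHERE NOT EXISTS (SELECT 1 FROM locations lb "
        "                  WHERE lb.key = h.key AND lb.repo = ?) "
        "ORDER BY h.key",
        (a, b),
    )
    for (key,) in cur:
        yield key
```

Keeping the key-to-file mapping in its own indexed table is what would let the second query avoid holding the output of `git ls-tree -r` in memory; whether that is fast enough at a dozen or so gb of location data is untested here.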