update exportdb tree in getImportableContents

This avoids bottlenecking on git check-ignore in a particular situation.
Also, there may have been a correctness issue with it not having updated it.
When the exportdb is already up-to-date, this is not expensive. And the
exportdb is updated elsewhere, so usually it is up-to-date.

Sponsored-by: Joshua Antonishen on Patreon
This commit is contained in:
Joey Hess 2023-06-08 18:36:24 -04:00
parent 5934e7d402
commit 532b227086
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
3 changed files with 64 additions and 1 deletions

View file

@ -1049,7 +1049,10 @@ getImportableContents r importtreeconfig ci matcher = do
Just c' -> Just <$> filterunwantedchunk dbhandle c'
)
opendbhandle = Export.openDb (Remote.uuid r)
opendbhandle = do
h <- Export.openDb (Remote.uuid r)
void $ Export.updateExportTreeFromLog h
return h
wanted dbhandle (loc, (_cid, sz))
| ingitdir = pure False

View file

@ -0,0 +1,8 @@
[[!comment format=mdwn
username="joey"
subject="""Re: comment 17"""
date="2023-06-08T20:56:25Z"
content="""
I've fixed that, now `git-annex sync` avoids updating the adjusted branch
when there have been no changes to available content.
"""]]

View file

@ -0,0 +1,52 @@
[[!comment format=mdwn
username="joey"
subject="""comment 19"""
date="2023-06-08T21:14:35Z"
content="""
I ran the second test case with 150000 files, and here's how long the syncs
took:
1. 0m0.170s
2. 33m37.810s
3. 0m36.644s
4. 2m58.773s
5. 13m38.126s
6. 0m3.933s
7. still running after 85 minutes
8. tbd
Sync 2 took longer than your results in comment #11, but consistent with my
laptop being slower. And I think 33 minutes to import 150k files is fine.
Still not seeing sync 5 take as long as it did for you.
Nowhere in the same ballpark. 13 minutes seems ok for a sync --content
that has to scan 150000 files.
The 7th sync is seeming too slow to me. For you it took equally long as the
5th sync, and both are --content syncs. So maybe I'm seeing the same
problem but only on the 7th for some reason?
For me it seemed to take a long time after outputting "list source ok". At
that point strace showed only a lot of futex(). And the cpu was pegged. And
it had the cidsdb open. Hmmm.. This is feeling a bit like the problem you
originally reported.
Interrupted the 7th sync and ran again...
The "list source" takes more than 15 minutes. It's bottlenecked on checking
git ignores. Bottleneck that I didn't notice with a smaller
number of files. Fixed that by making sure the export db was
populated, which it usually is, but not in the 7th sync's situation.
Now "list source" completes in less than 2 minutes.
And.. after that, it was back to the tight futex() loop.. And this time I
had intrumented the cidsdb, and it was importKeys
calling getContentIdentifierKeys.
Here's the kicker: It's only running getContentIdentifierKeys 15 times
per second. So that will take 166 minutes for all 150000 files.
Each call to getContentIdentifierKeys is taking 0.05 seconds.
So, this bug is back to the original problem of being bottlenecked on the
cidsdb. And it is smelling like a lack of indexes. Yay!
"""]]