git-annex/doc/bugs/importtree_spends_hours_reading_cidsdb.mdwn
jgoerzen e1fa970010
2023-05-30 12:23:28 +00:00

48 lines
2.4 KiB
Markdown

### Please describe the problem.
Hi!
I've got a directory special remote set up, with importtree. After scanning through it (import filename... ok), it stops reading from the filesystem and spends hours just reading from cidsdb. Verifying with strace, there is not reading from the filesystem here.
### What steps will reproduce the problem?
Here are my prep steps:
```
git init
git config annex.thin true
git annex init 'local hub'
git annex wanted . "include=* and exclude=testdata/*"
touch mtree
git annex add mtree
git annex sync
git annex adjust --unlock-present
git annex initremote source type=directory directory=/acrypt/git-annex/bind-ro/testdata importtree=yes encryption=none
git annex enableremote source directory=/acrypt/git-annex/bind-ro/testdata
git config remote.source.annex-readonly true
git config remote.source.annex-tracking-branch main:testdata # (says to put the files in the "$REPO" directory)
git config annex.securehashesonly true
git config annex.genmetadata true
git config annex.diskreserve 100M
git annex sync
```
It is that last sync that runs into this problem.
### What version of git-annex are you using? On what operating system?
10.20230407 on Debian bullseye
### Please provide any additional information below.
There are about 150,000 files in that tree. This problem occurs *after* git-annex is done scanning the tree (verified with strace and lsof). This problem does not occur with the same number of files in a local/standard git-annex repo (as opposed to the directory special remote).
.git/annex/cidsdb/db is only 51M so it is certainly entirely cached. git-annex is entirely CPU-bound at this point.
I can rerun the sync with an unchanged import directory. It still takes 107 minutes, the majority of which is spent reading cidsdb. Only the first minute or two are spent scanning the source area.
I have tested this on a source directory that's 2.2G and another that's 1.1T, both with about 150,000 files. After the first import, the subsequent syncs are similar in performance. In other words, this behavior appears to be related to the number of files, not the size of files.
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Yes, and I hope to use it for a project to archive family photos and videos to BD-R (that's what this is about here)