Commit graph

43967 commits

Author SHA1 Message Date
Joey Hess
5934e7d402
Merge branch 'master' of ssh://git-annex.branchable.com 2023-06-08 16:54:05 -04:00
Joey Hess
6821ba8dab
sync: use log to track adjusted branch needs updating
Speeds up sync in an adjusted branch by avoiding re-adjusting the branch
unnecessarily, particularly when it is adjusted with --hide-missing or
--unlock-present.

When there are a lot of files, that was the majority of the time of a
--no-content sync.

Uses a log file, which is updated when content presence changes. This
adds a little bit of overhead to every file get/drop when on such an
adjusted branch. The overhead is minimal for get of any size of file,
but might be noticeable for drop in some cases. It seems like a reasonable
trade-off. It would be possible to update the log file only at the end, but
then it would not happen if the command is interrupted.

When not in an adjusted branch, there should be no additional overhead.
(getCurrentBranch is an MVar read, and it avoids the MVar read of
getGitConfig.)

Note that this does not deal with situations such as:
git checkout master, git-annex get, git checkout adjusted branch,
git-annex sync. The sync won't know that the adjusted branch needs to be
updated. Dealing with that would add overhead to operation in non-adjusted
branches, which I don't like. Also, there are other situations like having
two adjusted branches that both need to be updated like this, and switching
between them and sync not updating.

This does mean a behavior change to sync, since it did previously deal
with those situations. But, the documentation did not say that it did.
The man pages only talk about sync updating the adjusted branch after
it transfers content.

I did consider making sync keep track of content it transferred (and
dropped) and only update the adjusted branch then, not to catch up to other
changes made previously. That would perform better. But it seemed rather
hard to implement, and also it would have problems with races with a
concurrent get/drop, which this implementation avoids.

And it seemed pretty likely someone had gotten used to get/drop followed by
sync updating the branch. It seems much less likely someone is switching
branches, doing get/drop, and then switching back and expecting sync to update
the branch.

Re-running git-annex adjust still does a full re-adjusting of the branch,
for anyone who needs that.

Sponsored-by: Leon Schuermann on Patreon
2023-06-08 14:35:41 -04:00
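The log-based tracking described in the commit above can be illustrated with a minimal Python sketch. This is not git-annex's code (which is Haskell); the class name, method names, and the one-file log format are all assumptions made for illustration. It shows only the general idea: every get/drop records a presence change right away, and sync re-adjusts the branch only when such a record exists. Concurrency and multiple adjusted branches are deliberately ignored.

```python
import os
import tempfile


class AdjustedBranchLog:
    """Hypothetical flag log recording that content presence has changed."""

    def __init__(self, path):
        self.path = path

    def record_presence_change(self):
        # Called on every get/drop while on an adjusted branch. Writing
        # immediately means an interrupted command still leaves a record.
        with open(self.path, "w") as f:
            f.write("1\n")

    def needs_update(self):
        return os.path.exists(self.path)

    def mark_updated(self):
        # Called once the branch has been re-adjusted.
        try:
            os.remove(self.path)
        except FileNotFoundError:
            pass


def sync(log, readjust):
    """Run readjust() only if the log says it is needed; report whether it ran."""
    if not log.needs_update():
        return False
    readjust()
    log.mark_updated()
    return True


if __name__ == "__main__":
    log = AdjustedBranchLog(os.path.join(tempfile.mkdtemp(), "adjust.log"))
    print(sync(log, lambda: None))   # nothing recorded yet
    log.record_presence_change()     # e.g. after a get or drop
    print(sync(log, lambda: None))   # a change was recorded
```

As in the commit message, a change made while a different branch is checked out would not be noticed by this sketch either.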
Joey Hess
637f19bebb
fix adjusted branch update breakage
Introduced recently in commit 64fc34b3da.

adjustBranch changes the sha that is recorded for the current branch
(eg the adjusted branch). So, have to get the original sha before
calling it.

Sponsored-by: Jack Hill on Patreon
2023-06-08 13:33:58 -04:00
yarikoptic
96a6946a14 stalling report 2023-06-08 15:46:29 +00:00
Joey Hess
7888702955
update 2023-06-07 11:32:53 -04:00
Joey Hess
3e3d225ca0
Merge branch 'master' of ssh://git-annex.branchable.com 2023-06-07 11:16:39 -04:00
Joey Hess
64fc34b3da
narrow window where HEAD is detached
Updating an adjusted branch can take a while when there are a lot of
files. HEAD was detached at the start, so if eg git-annex sync was
interrupted at the wrong point, there was a possibly wide window where
it would leave the repo with HEAD detached.

There's still a window, just much narrower. I don't know if it's
possible to close the window entirely. While git can clearly update
the currently checked out branch in eg git merge, it doesn't seem to
provide another way to do it.

Sponsored-by: Graham Spencer on Patreon
2023-06-07 11:10:54 -04:00
nobodyinperson
2fe032c4ee Added a comment 2023-06-07 04:49:04 +00:00
Joey Hess
5bc37c2de2
comment 2023-06-06 15:17:09 -04:00
Joey Hess
d63af3f52e
comment 2023-06-06 14:45:48 -04:00
Joey Hess
3c15e0f7a0
cache negative lookups of global numcopies and mincopies
Speeds up eg git-annex sync --content by up to 50%. When it does not need
to transfer or drop anything, it now noops a lot more quickly.

I didn't see anything else in the sync --content noop loop that could really
be sped up. It has to cat git objects to keys, stat object files, etc.

Sponsored-by: unqueued on Patreon
2023-06-06 14:43:25 -04:00
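Caching a negative lookup means remembering "not set" as a result in its own right, so that an unset value is not looked up again for every file. The following Python sketch shows the pattern under stated assumptions: `CachedLookup` and `fetch` are hypothetical names, and the real change lives in git-annex's Haskell code, which is not reproduced here.

```python
_NOT_FETCHED = object()  # sentinel distinct from None, so None can be cached


class CachedLookup:
    """Caches the result of fetch(), including a negative result (None)."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._value = _NOT_FETCHED
        self.fetch_count = 0  # only here to make the effect observable

    def get(self):
        if self._value is _NOT_FETCHED:
            self.fetch_count += 1
            self._value = self._fetch()
        return self._value


if __name__ == "__main__":
    # Stand-in for a global numcopies setting that has never been configured.
    numcopies = CachedLookup(lambda: None)
    for _ in range(1000):
        numcopies.get()
    print(numcopies.fetch_count)
```

A cache that stored only successful results would call `fetch` on every iteration in this example, which is the kind of repeated work the commit removes.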
Joey Hess
4437e187e6
update 2023-06-06 13:04:47 -04:00
Joey Hess
3efcb58b6a
comment 2023-06-06 13:02:15 -04:00
Joey Hess
4c88f68061
Merge branch 'master' of ssh://git-annex.branchable.com 2023-06-06 12:48:47 -04:00
nobodyinperson
aa61ac4273 Added a comment 2023-06-06 12:54:36 +00:00
nobodyinperson
cf7249d00c 2023-06-06 12:49:11 +00:00
Mowgli
6c60e1d715 Added a comment 2023-06-05 20:35:19 +00:00
Mowgli
c0b2eb3914 Added a comment: comment something 2023-06-05 20:33:42 +00:00
jgoerzen
432e7cd9f3 Added a comment 2023-06-05 19:32:29 +00:00
Joey Hess
cfad0def18
wrap 2023-06-05 15:15:20 -04:00
Joey Hess
1f0f774ab7
close this release blocker 2023-06-05 15:10:52 -04:00
Joey Hess
4c9326dab5
reject 2023-06-05 15:00:39 -04:00
Joey Hess
07db8e234a
comment and wontfix 2023-06-05 14:40:25 -04:00
Joey Hess
528882a6df
comment 2023-06-05 14:08:12 -04:00
Joey Hess
190a538c0b
Merge branch 'master' of ssh://git-annex.branchable.com 2023-06-05 11:46:19 -04:00
Joey Hess
c6c6e3f5d6
update 2023-06-05 11:45:18 -04:00
jgoerzen
2c2a84caac Added a comment 2023-06-02 21:44:54 +00:00
Joey Hess
fe1b2dfb4b
speed up very first tree import by 25%
Reading from the cidsdb is responsible for about 25% of the runtime of
an import. Since the cidmap is used to store the same information in
ram, the cidsdb is not written to during an import any longer. And so,
if it started off empty (and updateFromLog wasn't needed), those reads
can just be skipped.

This is kind of a cheesy optimisation, since after any import from any
special remote, the database will no longer be empty, so it's a single
use optimisation. But it's probably not uncommon to start by importing a
lot of files, and it can save a lot of time then.

Sponsored-by: Brock Spratlen on Patreon
2023-06-02 13:30:30 -04:00
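The reasoning in the commit above is that if the database was empty when the import started, and nothing is written to it during the import, then every read from it must come back empty and can be skipped. A small Python sketch of that idea follows. A dict stands in for the cidsdb, and `CidLookup`, `needs_update_from_log`, and `db_reads` are hypothetical names chosen for illustration.

```python
class CidLookup:
    """Looks up keys by content identifier, skipping a database known to be empty."""

    def __init__(self, db, needs_update_from_log=False):
        self.db = db          # stand-in for the cidsdb
        self.cidmap = {}      # in-memory record of what this process imported
        # Reads can be skipped only if the database started empty and is not
        # about to be populated from the log.
        self.skip_db = len(db) == 0 and not needs_update_from_log
        self.db_reads = 0     # only here to make the effect observable

    def record(self, cid, key):
        # During an import, only the in-memory map is updated.
        self.cidmap[cid] = key

    def lookup(self, cid):
        if cid in self.cidmap:
            return self.cidmap[cid]
        if self.skip_db:
            return None
        self.db_reads += 1
        return self.db.get(cid)


if __name__ == "__main__":
    first_import = CidLookup(db={})
    first_import.record("cid-1", "key-1")
    print(first_import.lookup("cid-1"), first_import.lookup("cid-2"))
    print(first_import.db_reads)
```

Once the database holds anything, `skip_db` is false and every miss in the map goes to the database, which matches the commit's remark that this helps only the very first import.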
Joey Hess
b43fb4923f
comment 2023-06-02 13:11:24 -04:00
Joey Hess
b8750bcb17
Merge branch 'master' of ssh://git-annex.branchable.com 2023-06-02 12:14:03 -04:00
Joey Hess
b40b368857
comment 2023-06-02 12:13:50 -04:00
jgoerzen
5dcbf7d41e Added a comment 2023-06-02 03:25:27 +00:00
Joey Hess
f6dd34ca81
sync content with import remotes
This didn't use to be needed because importKeys would import all
content and so doing another pass was redundant.

But since 40017089f2 it uses
importChanges, so only new files are imported. If a file that was
already imported before was dropped, that would prevent sync --content
from getting its content again.

Sponsored-by: Jack Hill on Patreon
2023-06-01 18:52:19 -04:00
Joey Hess
92e4ed3cc0
retitle 2023-06-01 18:44:11 -04:00
Joey Hess
7178db5e06
Merge branch 'master' of ssh://git-annex.branchable.com 2023-06-01 18:43:29 -04:00
Joey Hess
2e92cef13f
comment 2023-06-01 18:43:17 -04:00
jgoerzen
53eeca40ae Added a comment 2023-06-01 21:26:23 +00:00
Joey Hess
f1fe13c79c
devblog 2023-06-01 15:07:03 -04:00
Joey Hess
594110a6af
comment 2023-06-01 14:21:55 -04:00
Joey Hess
40017089f2
use importChanges optimisation
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.

Benchmarks:

Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.

An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.

Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.

Sponsored-by: k0ld on Patreon
2023-06-01 13:47:00 -04:00
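For scale, the benchmark timings quoted in the commit above can be turned into speedup factors. The numbers in this Python snippet are copied from the commit message; the labels and the `speedup` helper are just for illustration.

```python
def speedup(before_seconds, after_seconds):
    """Return how many times faster the new timing is than the old one."""
    return before_seconds / after_seconds


# (before, after) in seconds, as quoted in the commit message.
benchmarks = {
    "10000 files, 1 new file": (26.06, 2.59),
    "10000 files, no changes": (24.3, 1.99),
    "20000 files, no changes": (125.95, 3.84),
}

if __name__ == "__main__":
    for label, (before, after) in benchmarks.items():
        print(f"{label}: {speedup(before, after):.1f}x faster")
```

The factor grows with the number of files in these three measurements, because the old timing roughly quintupled when the file count doubled while the new one less than doubled.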
Joey Hess
029b08f54b
Merge branch 'master' of ssh://git-annex.branchable.com 2023-05-31 16:34:03 -04:00
Joey Hess
c6acf574c7
implement importChanges optimisation (not used yet)
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.

I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.

Next step will be to change importTree to importChanges and modify
recordImportTree et al to handle it, by using adjustTree.

Sponsored-by: Brett Eisenberg on Patreon
2023-05-31 16:01:34 -04:00
Joey Hess
7298123520
build git trees using ContentIdentifier to speed up import
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.

Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.

Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.

A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.

Sponsored-by: Luke Shumaker on Patreon
2023-05-31 12:46:54 -04:00
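The plan sketched in the commit above, remembering the previous tree and diffing it against the new one, can be illustrated in Python. Here a "tree" is simplified to a dict mapping file path to content identifier, and `changed_files` is a hypothetical helper; git-annex itself builds real git tree objects and diffs those.

```python
def changed_files(old_tree, new_tree):
    """Compare two path -> content identifier mappings.

    Returns (to_import, removed): paths that are new or whose content
    identifier changed, and paths that no longer exist in the new tree.
    """
    to_import = {
        path: cid
        for path, cid in new_tree.items()
        if old_tree.get(path) != cid
    }
    removed = sorted(path for path in old_tree if path not in new_tree)
    return to_import, removed


if __name__ == "__main__":
    last_import = {"a.txt": "cid1", "b.txt": "cid2", "c.txt": "cid3"}
    current = {"a.txt": "cid1", "b.txt": "cid9", "d.txt": "cid4"}
    to_import, removed = changed_files(last_import, current)
    print(to_import)
    print(removed)
```

Only the entries in `to_import` need further processing, so the cost of an import scales with the number of changes rather than the total number of files.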
Joey Hess
51319f8558
update 2023-05-30 17:19:23 -04:00
Joey Hess
f6aa097a39
avoid import writing to cidsdb initially
Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database.

Before, import would write to both the git-annex branch and also to the
sqlite database. But then the next time it was run, needsUpdateFromLog
would see the branch had changed, so run updateFromLog, which would make
the same writes to the sqlite database a second time.

Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.

Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.

But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.

I don't think there's a good way to prevent or detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog
at the beginning anyway.

Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made.
So, omitting database writes can't change the behavior of the current
process.

Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap,
but does not need to query it either. It might be possible to make that
code path also only update the git-annex branch and not the db, but I
haven't checked.

Sponsored-by: Noam Kremen on Patreon
2023-05-30 17:05:28 -04:00
jgoerzen
f47e7abd57 Added a comment 2023-05-30 20:58:21 +00:00
Joey Hess
c1e415887a
improve test descriptions 2023-05-30 16:11:29 -04:00
Joey Hess
5070087a63
repair: Fix handling of git ref names on Windows
Sponsored-by: Kevin Mueller on Patreon
2023-05-30 16:09:13 -04:00
Joey Hess
9ca81ed02a
update 2023-05-30 15:49:52 -04:00
Joey Hess
aaeae746f0
comment and a neat idea 2023-05-30 15:42:34 -04:00