It just so happens that everywhere that checks attrs other than
annex.largefiles parses the value further, and failed to parse
unspecifiedAttr in a way that behaved the same as if nothing was set. So
this is not a bug fix or behavior change. What it does so is prevent
future uses of checkAttr from needing to remember to handle checking for
this edge case.
Sponsored-by: Dartmouth College's DANDI project
This avoids bottlenecking on git check-ignore in a particular situation.
Also, there may have been a correctness issue with it not having updated it.
When the exportdb is already up-to-date, this is not expensive. And the
exportdb is updated elsewhere, so usually it is up-to-date.
Sponsored-by: Joshua Antonishen on Patreon
Speeds up sync in an adjusted branch by avoiding re-adjusting the branch
unncessarily, particularly when it is adjusted with --hide-missing or
--unlock-present.
When there are a lot of files, that was the majority of the time of a
--no-content sync.
Uses a log file, which is updated when content presence changes. This
adds a little bit of overhead to every file get/drop when on such an
adjusted branch. The overhead is minimal for get of any size of file,
but might be noticable for drop in some cases. It seems like a reasonable
trade-off. It would be possible to update the log file only at the end, but
then it would not happen if the command is interrupted.
When not in an adjusted branch, there should be no additional overhead.
(getCurrentBranch is an MVar read, and it avoids the MVar read of
getGitConfig.)
Note that this does not deal with situations such as:
git checkout master, git-annex get, git checkout adjusted branch,
git-annex sync. The sync won't know that the adjusted branch needs to be
updated. Dealing with that would add overhead to operation in non-adjusted
branches, which I don't like. Also, there are other situations like having
two adjusted branches that both need to be updated like this, and switching
between them and sync not updating.
This does mean a behavior change to sync, since it did previously deal
with those situations. But, the documentation did not say that it did.
The man pages only talk about sync updating the adjusted branch after
it transfers content.
I did consider making sync keep track of content it transferred (and
dropped) and only update the adjusted branch then, not to catch up to other
changes made previously. That would perform better. But it seemed rather
hard to implement, and also it would have problems with races with a
concurrent get/drop, which this implementation avoids.
And it seemed pretty likely someone had gotten used to get/drop followed by
sync updating the branch. It seems much less likely someone is switching
branches, doing get/drop, and then switching back and expecting sync to update
the branch.
Re-running git-annex adjust still does a full re-adjusting of the branch,
for anyone who needs that.
Sponsored-by: Leon Schuermann on Patreon
Introduced recently in commit 64fc34b3da.
adjustBranch changes the sha that is recorded for the current branch
(eg the adjusted branch). So, have to get the original sha before
calling it.
Sponsored-by: Jack Hill on Patreon
Updating an adjusted branch can take a while when there are a lot of
files. HEAD was detached at the start, so if eg git-annex sync was
interrupted at the wrong point, there was a possibly wide window where
it would leave the repo with HEAD detached.
There's still a window, just much narrower. I don't know if it's
possible to close the window entirely. While git can clearly update
the currently checked out branch in eg git merge, it doesn't seem to
provide another way to do it.
Sponsored-by: Graham Spencer on Patreon
Reading from the cidsdb is responsible for about 25% of the runtime of
an import. Since the cidmap is used to store the same information in
ram, the cidsdb is not written to during an import any longer. And so,
if it started off empty (and updateFromLog wasn't needed), those reads
can just be skipped.
This is kind of a cheesy optimisation, since after any import from any
special remote, the database will no longer be empty, so it's a single
use optimisation. But it's probably not uncommon to start by importing a
lot of files, and it can save a lot of time then.
Sponsored-by: Brock Spratlen on Patreon
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
Benchmarks:
Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.
An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.
Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.
Sponsored-by: k0ld on Patreon
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et all to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon
This gets the trees built, but it does not use them. Next step will be
to remember the tree for next time an import is done, and diff between
old and new trees to find the files that have changed.
Added --missing to the mktree parameters. That only disables a check, so
it's ok to do everywhere mktree is used. It probably also speeds up
mktree to disable the check.
Note that git fsck does not complain about the resulting tree objects
that point to shas that are not in the repository. Even with --strict.
A quick benchmark, importing 10000 files, this slowed it down
from 2:04.06 to 2:04.28. So it will more than pay for itself.
Sponsored-by: Luke Shumaker on Patreon
Speed up importing trees from special remotes somewhat by avoiding
redundant writes to sqlite database.
Before, import would write to both the git-annex branch and also to the
sqlite database. But then the next time it was run, needsUpdateFromLog
would see the branch had changed, so run updateFromLog, which would make
the same writes to the sqlite database a second time.
Now import writes only to the git-annex branch. The next time it's run,
needsUpdateFromLog sees that the branch has changed and so calls
updateFromLog, which updates the sqlite database.
Why defer the write to the sqlite database like this? It seems that it
could write to the database as it goes, and at the end call
recordAnnexBranchTree to indicate that the information in the git-annex
branch has all been written to the cidsdb. That would avoid the second
import doing extra work.
But, there could be other processes running at the same time, and one of
them may update the git-annex branch, eg merging a remote git-annex branch
into it. Any cids logs on that merged git-annex branch would not be
reflected in the cidsdb yet. If the import then called
recordAnnexBranchTree, the cidsdb would never get updated with that merged
information.
I don't think there's a good way to prevent, or to detect that situation.
So, it can't call recordAnnexBranchTree at the end. So it might as well
wait until the next run and do updateFromLog then. It could instead do
updateFromLog at the end, but it's going to check needsUpdateFromLog
at the beginning anyway.
Note that the database writes were queued, so there is already a cidmap
that is used to remember changes that the current process has made.
So, omitting database writes can't change the behavior of the current
process.
Also note that thirdpartypopulatedimport uses recordcidkeyindb, which
reflects what it already did. That code path does not use the cidmap,
but does not need to query it either. It might be possible to make that
code path also only update the git-annex branch and not the db, but I
haven't checked.
Sponsored-by: Noam Kremen on Patreon
I noticed git-annex was using a lot of CPU when downloading from youtube,
and was not displaying progress. Turns out that yt-dlp (and I think also
youtube-dl) sometimes only knows an estimated size, not the actual size,
and displays the progress output slightly differently for that. That broke
the parser. And, the parser was feeding chunks that failed to parse back
as a remainder, which caused it to try to re-parse the entire output each
time, so it got slower and slower.
Using --progress-template like this should avoid parsing problems as well
as future proof against output changes. But it will work with only yt-dlp.
So, this seemed like the right time to deprecate youtube-dl, and default
to yt-dlp when available.
git-annex will still use youtube-dl if that's all that's available.
However, since the progress parser for youtube-dl was buggy, and I don't
want to maintain two different progress parsers (especially since
youtube-dl is no longer in debian unstable having been replaced by
yt-dlp), made git-annex no longer try to parse youtube-dl's progress.
Also, updated docs for yt-dlp being default. It did not seem worth
renaming annex.youtube-dl-options and annex.youtube-dl-command.
Note that yt-dlp does not seem to document the fields available in the
progress template. I found them by reading the source and looking at
the templates it uses internally. Also note that the use of "i" (rather
than "s") in progressTemplate makes it display floats rounded to integers;
particularly the estimated total size can be a float. That also does not
seem to be documented but I assume is a python thing?
Sponsored-by: Joshua Antonishen on Patreon
This makes annexFileMode be just an application of setAnnexPerm',
which avoids having 2 functions that do different versions of the same
thing.
Fixes some buggy behavior for some combinations of core.sharedRepository
and umask.
Sponsored-by: Jack Hill on Patreon
These two missed setting it.
It rarely matters that the journal gets the right perm. But, when using
annex.alwayscommit=false, someone else may come along later and want
to append to the journal file.
It probably never matters what the sentinal perms are, but for
completeness..
Sponsored-by: Luke Shumaker on Patreon
This spams the user with a lot of messages, but it seems like busywork to
avoid that and only warn once, since this warning will go away when it gets
implemented.
Also fix parsing of the octal value.
Sponsored-by: Kevin Mueller on Patreon
New command, currently limited to changing autoenable= setting of a special remote.
It will probably never be used for more than that given the limitations on
it.
Sponsored-by: Brock Spratlen on Patreon
When displaying a ByteString like "💕", safeOutput operates on
individual bytes like "\240\159\146\149" and isControl '\146' = True,
so it got truncated to just "\240".
So, only treat the low control characters, and DEL, as control
characters.
Also split Utility.Terminal out of Utility.SafeOutput. The latter needs
win32, but Utility.SafeOutput is used by Control.Exception, which is
used by Setup.
Sponsored-by: Nicholas Golder-Manning on Patreon
I'm on the fence about this. Notice that pulling from a git remote can
pull branches that have escape sequences in their names. Git will
display those as-is. Arguably git should try harder to avoid that.
But, names of remotes are usually up to the local user, and autoenable
changes that, and so it makes sense that git chooses to display control
characters in names of remotes, and so autoenable needs to guard against
it.
Sponsored-by: Graham Spencer on Patreon
Seems unlikely to have a tab in a path, but it's not a control character
that needs to be prevented either.
Left \n \r \v and \a as other non-threatening control characters
that are still obnoxious to have in a filepath because of how it causes
issues with display and/or with shell scripting.
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brock Spratlen on Patreon
Converted warning and similar to use StringContainingQuotedPath. Most
warnings are static strings, some do refer to filepaths that need to be
quoted, and others don't need quoting.
Note that, since quote filters out control characters of even
UnquotedString, this makes all warnings safe, even when an attacker
sneaks in a control character in some other way.
When json is being output, no quoting is done, since json gets its own
quoting.
This does, as a side effect, make warning messages in json output not
be indented. The indentation is only needed to offset warning messages
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brett Eisenberg on Patreon
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)
error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.
Changed a lot of existing error to giveup when it is not strictly an
internal error.
Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.
Sponsored-by: Noam Kremen on Patreon
When the filenames are part of the git repository or other files that
might have attacker-controlled names, quote them in error messages.
This is fairly complete, although I didn't do the one in
Utility.DirWatcher.INotify.hs because that doesn't have access to
Git.Filename or Annex.
But it's also quite possible I missed some. And also while scanning for
these, I found giveup used with other things that could be attacker
controlled to contain control characters (eg Keys). So, I'm thinking
it would also be good for giveup to just filter out control characters.
This commit is then not the only line of defence, but just good
formatting when git-annex displays a filename in an error message.
Sponsored-by: Kevin Mueller on Patreon
As well as escape sequences, control characters seem unlikely to be desired when
doing addurl, and likely to trip someone up. So disallow them as well.
I did consider going the other way and allowing filenames with control characters
and escape sequences, since git-annex is in the process of escaping display
of all filenames. Might still be a better idea?
Also display the illegal filename git quoted when it rejects it.
Sponsored-by: Nicholas Golder-Manning on Patreon
Added StringContainingQuotedPath, which is used for ActionItemOther.
In the process, checked every ActionItemOther for those containing
filenames, and made them use quoting.
Sponsored-by: Graham Spencer on Patreon
This reverts commit 66eb63dd82.
git-annex init is the only thing that uses ensureCommit. So overriding
there will make later commits to the git-annex branch or by git-annex sync
fail.
It's ugly that git-annex init sets user.name and user.email, but it only
does it on systems that are badly configured.
When it's set and git cannot determine user.name or user.email, this will
result in git-annex init failing when committing to create the git-annex
branch. Other git-annex commands that commit can also fail.
Sponsored-by: Jack Hill on Patreon
Avoid setting user.name and user.email in the git config when git is unable
to detect them.
git-annex has good reason to want to ensure git commit succeeds when eg
committing to the git-annex branch. But it's not playing nice to set these
values where other commands can see them.
Sponsored-by: Brett Eisenberg on Patreon
Fix laziness bug introduced in last release that breaks use of
--unlock-present and --hide-missing adjusted branches.
Since there is a writeFile of the same file immediately after readFile, it
may still have the file open for read (or may have happened to read it
already and closed it).
I was not able to reproduce the problem in brief testing, but this seems
obvious.
Sponsored-by: Luke Shumaker on Patreona
Remote.Directory makes a temp file, then calls this, and since the temp
file exists, it prevented probing if CoW works.
Note that deleting the empty file does mean there's a small window for a
race. If another process is also exporting to the remote, that could let it
make the same temp file. However, the temp filename actually has the
processes's pid in it, which avoids that being a problem.
This may have been a reversion caused by commits around
63d508e885, but I haven't gone back and
tested to be sure. The directory special remote had supposedly supported
CoW for this going back to about half a year before that.
Sponsored-by: Graham Spencer on Patreon
The temporary URL key used for the download, before the real key is
generated, was blocked by annex.securehashesonly.
Fixed by passing the Backend that will be used for the final key into
runTransfer. When a Backend is provided, have preCheckSecureHashes
check that, rather than the key being transferred.
Sponsored-by: unqueued on Patreon
view: Support annex.maxextensionlength when generating filenames for the
view branch.
Note that refining an existing view will reuse the extension length that was
configured when initially constructing the view. This is necessarily the case
because it reuses the filenames.
Also view files used to have all extensions at the end, no matter how
many there were. Since annex.maxextensionlength's documentation includes
that it's limited to 2 extensions, I made it consistent with that.
Sponsored-by: k0ld on Patreon
I don't know of scenarios where that can happen (besides the bug
fixed by the parent commit), but there probably are some.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
Avoid failure to update adjusted branch --unlock-present after git-annex
drop when annex.adjustedbranchrefresh=1
At higher values, it did flush the queue, which ran restagePointerFiles.
But at 1, adjustedBranchRefreshFull gets added to the queue, and while
restagePointerFiles is also in the queue, it runs after that.
Sponsored-by: Brock Spratlen on Patreon
Works around this bug in unix-compat:
https://github.com/jacobstanley/unix-compat/issues/56
getFileStatus and other FilePath using functions in unix-compat do not do
UNC conversion on Windows.
Made Utility.RawFilePath use convertToWindowsNativeNamespace to do the
necessary conversion on windows to support long filenames.
Audited all imports of System.PosixCompat.Files to make sure that no
functions that operate on FilePath were imported from it. Instead, use
the equvilants from Utility.RawFilePath. In particular the
re-export of that module in Common had to be removed, which led to lots
of other changes throughout the code.
The changes to Build.Configure, Build.DesktopFile, and Build.TestConfig
make Utility.Directory not be needed to build setup. And so let it use
Utility.RawFilePath, which depends on unix, which cannot be in
setup-depends.
Sponsored-by: Dartmouth College's Datalad project
I had thought this would not make sense to combine with view branches,
since removing files from a view changes metadata.
However, that's committing removal of files. With --hide-missing, the
files get removed when git-annex updates the branch itself, so there is
no conflict.
It does not seem likely to be very useful, but it does work! And that's
nice because it means all types of adjusted branches can be combined with
view branches.
Sponsored-by: Max Thoursie on Patreon
When generating the view, check if the key is present.
When syncing in a view branch with an adjustment, run adjustedBranchRefreshFull
the same as is done when syncing in other adjusted branches. This is
needed because the docs for git-annex adjust --unlock-present suggest
using git-annex sync to update the branch when annex.adjustedbranchrefresh
is not set.
Note that, with annex.adjustedbranchrefresh set, it just works! The
adjusted branch gets updated in the usual way and it doesn't matter that
there's a view branch underneath.
And of course, re-running git-annex adjut --unlock-present also works,
as suggested in the docs.
Sponsored-by: Erik Bjäreholt on Patreon
Just make pointer files rather than symlinks, easy.
As for the other adjustments:
--lock is the default for views
--fix happens automatically in views
--hide-missing probably does not make sense when combined with views,
because deleting a file from a view removes metadata
--unlock-present will need a bit more work
An adjusted view branch has a name like
"refs/heads/adjusted/views/master(author=_)(unlocked)", so it is a view
branch that has been converted to an adjusted branch.
Made Logs.View support such branch names. So now git-annex sync and
pre-commit handle updating metadata on commit in such a branch.
Much remains to be done to fully support adjusted view branches,
including actually applying the adjustment when updating the view branch.
Sponsored-by: Graham Spencer on Patreon
When git-annex adjust is run in a view branch, and the adjusted branch
already exists, overwrite the old adjusted branch with the new one
without being forced.
Usually overwriting an adjusted branch is avoided because it could lose
data. But when a view branch has been adjusted, there is no data to lose
in the adjusted branch, because the only changes that can be made of
significance are to move files between directories. Which changes
metadata on commit. And the old branch has already been committed.
Sponsored-by: Lawrence Brogan on Patreon
An adjusted view branch has a name like
"adjusted/views/master(author=_)(unlocked)"
and so the adjustment starts at the last open paren, not the first open
paren.
Note that git-annex sync still does not do anything useful when run in
such a branch, because it does not realize that it is a view branch.
This is only groundwork for adjusted view branches.
This also fixes adjusted branches when the basis branch name contains
parens for some other reason, though that is not common in a git branch
name.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
view: Fix a reversion in 10.20230214 that omitted a file from a view when
the file had no metadata set, but the view only used path fields.
Sponsored-by: Jack Hill on Patreon
A benchmark in my sound repository with `git-annex view feedtitle=*`
took 2:52 wall clock time before and 1:58 after. Though it still only used
130% of CPU.
This is the same kind of optimisation that is in seekFilteredKeys, though
that precaches location logs while this streams the metadata logs direct
to parsing them.
seekFilteredKeys contains more streaming, to find the annexed files, and
this could be further sped up with similar streaming.
Sponsored-by: Nicholas Golder-Manning on Patreon
* sync: When run in a view branch, refresh the view branch to reflect any
changes that have been made to the parent branch or metadata.
This is basically working, but probably needs some more work to deal with
all the edge cases of things sync does.
Sponsored-by: Lawrence Brogan on Patreon
* view: New field?=glob and ?tag syntax that includes a directory "_"
in the view for files that do not have the specified metadata set.
* Added annex.viewunsetdirectory git config to change the name of the
"_" directory in a view.
When in a view using the new syntax, old git-annex will fail to parse the
view log. It errors with "Not in a view.", which is not ideal. But that
only affects view commands.
annex.viewunsetdirectory is included in the View for a couple of reasons.
One is to avoid needing to warn the user that it should not be changed when
in a view, since that would confuse git-annex. Another reason is that it
helped with plumbing the value through to some pure functions.
annex.viewunsetdirectory is actually mangled the same as any other view
directory. So if it's configured to something like "N/A", there won't be
multiple levels of directories, which would also confuse git-annex.
Sponsored-By: Jack Hill on Patreon
Use separate stages for download and upload. In the common case where
it downloads the file from one remote and then uploads to the other,
those are by far the most expensive operations, and there's a decent
chance the two remotes bottleneck on different resources.
Suppose it's being run with -J2 and a bunch of 10 mb files. Two threads
will be started both downloading from the src remote. They will probably
finish at the same time. Then two threads will be started uploading to
the dst remote. They will probably take the same time as well. Before
this change, it would alternate back and forth, bottlenecking on src and dst.
With this change, as soon as the two threads start uploading to dst, two
more threads are able to start, downloading from src. So bandwidth to
both remotes is saturated more often.
Other commands that use transferStages only send in one direction at a
time. So the worker threads for the other direction will sit idle, and
there will be no change in their behavior.
Sponsored-by: Dartmouth College's DANDI project
Lock the local content for drop after getting it from src, to prevent another
process from using the local content as a copy and dropping it from src,
which would prevent dropping the local content after sending it to dest.
Support resuming an interrupted move that downloaded the content from
src, leaving the local content populated. In this case, the location log
has not been updated to say the content is present locally, so we can
assume that it's resuming and go ahead and drop the local content after
sending it to dest.
Note that if a `git-annex get` is being ran at the same time as a
`git-annex move --from --to`, it may get a file just before the move
processes it. So the location log has not been updated yet, and the move
thinks it's resuming. Resulting in local copy being dropped after it's
sent to the dest. This race is something we'll just have to live with,
it seems.
I also gave up on the idea of checking if the location log had been updated
by a `git-annex get` that is ran at the same time. That wouldn't work, because
the location log is precached in the seek stage, so reading it again after
sending the content to dest would not notice changes made to it, unless the cache
were invalidated, which would slow it down a lot. That idea anyway was subject
to races where it would not detect the concurrent `git-annex get`.
So concurrent `git-annex get` will have results that may be surprising.
To make that less surprising, updated the documentation of this feature to
be explicit that it downloads content to the local repository
temporarily.
Sponsored-by: Dartmouth College's DANDI project
Prep for move --to --from, which needs to download from a src repo
without updating the location log for the local repo, before sending the
content on to the dest repo.
Note that caller of download' already update the log themselves.
See previous commit a422a056f2
that pushed it up to download from getViaTmpFrom.
(Also removed in passing a debug print + readline that I accidentially
committed last week on this branch.)
Sponsored-by: Dartmouth College's DANDI project
Note that when this is specified and an older git-annex is used to
enableremote such a special remote, it will simply ignore the cost= field
and use whatever the default cost is.
In passing, fixed adb to support the remote.name.cost and
remote.name.cost-command configs.
Sponsored-by: Dartmouth College's DANDI project
readish ignores a trailing string after a number, but to support values
like "YYYY:MM:DD" which it makes sense to compare lexographically,
require the whole string to be parsed as a number in order to enable
numeric comparison.
Sponsored-by: Max Thoursie on Patreon
Improve handling of some .git/annex/ subdirectories being on other
filesystems, in the bittorrent special remote, and youtube-dl integration,
and git-annex addurl.
The only one of these that I've confirmed to be a problem is in the
bittorrent special remote when .git/annex/tmp and .git/annex/othertmp are
on different filesystems.
As well as auditing for renameFile, also audited for createLink, all of
those are ok as are the other remaining renameFile calls. Also audited all
code paths that use .git/annex/othertmp, and did not find any other
cross-device problems. So, removing mention of othertmp needing to be on
the same device.
Sponsored-by: Dartmouth College's Datalad project
Change --metadata comparisons < > <= and >= to fall back to lexicographical
comparisons when one or both values being compared are not numbers.
Sponsored-by: Erik Bjäreholt on Patreon
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8)
The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.
The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.
Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.
streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.
Sponsored-by: Dartmouth College's Datalad project
Debian is going to drop youtube-dl which is not active upstream, and yt-dlp
is the replacement. This will make it be used if youtube-dl gets removed.
If an old version of youtube-dl remains installed, git-annex will still use
it. That might not be desirable, but changing git-annex to use yt-dlp in
preference to youtube-dl when both are installed risks breaking when
the user has annex.youtube-dl-options set to something that is supported
by youtube-dl, but not by yt-dlp.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
init: Avoid scanning for annexed files, which can be lengthy in a
large repository. Instead that scan is done on demand. This lets git-annex
init be run and some query commands be used in a repository without
waiting.
Note that autoinit already behaved this way, so while this will mean some
commands like git-annex get/unlock/add will do the scan the first time run,
that is not really a significant behavior change.
And, it's really better to have a consistent behavior. The reason for
the inconsistency was a strange bug discussed in
b3c4579c79. Avoiding reconcileStaged in
init will keep avoiding whatever that was.
Sponsored-by: Dartmouth College's DANDI project
Clean the standalone environment before running the su command
to run "sh". Otherwise, PATH leaked through, causing it to run
git-annex.linux/bin/sh, but GIT_ANNEX_DIR was not set,
which caused that script to not work:
[2022-10-26 15:07:02.145466106] (Utility.Process) process [938146] call: pkexec ["sh","-c","cd '/home/joey/tmp/git-annex.linux/r' && '/home/joey/tmp/git-annex.linux/git-annex' 'enable-tor' '1000'"]
/home/joey/tmp/git-annex.linux/bin/sh: 4: exec: /exe/sh: not found
Changed programPath to not use GIT_ANNEX_PROGRAMPATH,
but instead run the scripts at the top of GIT_ANNEX_DIR.
That works both when the standalone environment is set up, and when it's
not.
Sponsored-by: Kevin Mueller on Patreon
Make --batch mode handle unstaged annexed files consistently whether the
file is unlocked or not. Before this, a unstaged locked file
would have the symlink on disk examined and operated on in --batch mode,
while an unstaged unlocked file would be skipped.
Note that, when not in batch mode, unstaged files are skipped over too.
That is actually somewhat new behavior; as late as 7.20191114 a
command like `git-annex whereis .` would operate on unstaged locked
files and skip over unstaged unlocked files. That changed during
optimisation of CmdLine.Seek with apparently little fanfare or notice.
Turns out that rmurl still behaved that way when given an unstaged file
on the command line. It was changed to use lookupKeyStaged to
handle its --batch mode. That also affected its non-batch mode, but
since that's just catching up to the change earlier made to most
other commands, I have not mentioed that in the changelog.
It may be that other uses of lookupKey should also change to
lookupKeyStaged. But it may also be that would slow down some things,
or lead to unwanted behavior changes, so I've kept the changes minimal
for now.
An example of a place where the use of lookupKey is better than
lookupKeyStaged is in Command.AddUrl, where it looks to see if the file
already exists, and adds the url to the file when so. It does not matter
there whether the file is staged or not (when it's locked). The use of
lookupKey in Command.Unused likewise seems good (and faster).
Sponsored-by: Nicholas Golder-Manning on Patreon
When running eg git-annex get, for each file it has to read from and
write to the keys database. But it's reading exclusively from one table,
and writing to a different table. So, it is not necessary to flush the
write to the database before reading. This avoids writing the database
once per file, instead it will buffer 1000 changes before writing.
Benchmarking getting 1000 small files from a local origin,
git-annex get now takes 13.62s, down from 22.41s!
git-annex drop now takes 9.07s, down from 18.63s!
Wowowowowowowow!
(It would perhaps have been better if there were separate databases for
the two tables. At least it would have avoided this complexity. Ah well,
this is better than splitting the table in a annex.version upgrade.)
Sponsored-by: Dartmouth College's Datalad project
The flush was only done Annex.run' to make sure that the queue was flushed
before git-annex exits. But, doing it there means that as soon as one
change gets queued, it gets flushed soon after, which contributes to
excessive writes to the database, slowing git-annex down.
(This does not yet speed git-annex up, but it is a stepping stone to
doing so.)
Database queues do not autoflush when garbage collected, so have to
be flushed explicitly. I don't think it's possible to make them
autoflush (except perhaps if git-annex sqitched to using ResourceT..).
The comment in Database.Keys.closeDb used to be accurate, since the
automatic flushing did mean that all writes reached the database even
when closeDb was not called. But now, closeDb or flushDb needs to be
called before stopping using an Annex state. So, removed that comment.
In Remote.Git, change to using quiesce everywhere that it used to use
stopCoProcesses. This means that uses on onLocal in there are just as
slow as before. I considered only calling closeDb on the local git remotes
when git-annex exits. But, the reason that Remote.Git calls stopCoProcesses
in each onLocal is so as not to leave git processes running that have files
open on the remote repo, when it's on removable media. So, it seemed to make
sense to also closeDb after each one, since sqlite may also keep files
open. Although that has not seemed to cause problems with removable
media so far. It was also just easier to quiesce in each onLocal than
once at the end. This does likely leave performance on the floor, so
could be revisited.
In Annex.Content.saveState, there was no reason to close the db,
flushing it is enough.
The rest of the changes are from auditing for Annex.new, and making
sure that quiesce is called, after any action that might possibly need
it.
After that audit, I'm pretty sure that the change to Annex.run' is
safe. The only concern might be that this does let more changes get
queued for write to the db, and if git-annex is interrupted, those will be
lost. But interrupting git-annex can obviously already prevent it from
writing the most recent change to the db, so it must recover from such
lost data... right?
Sponsored-by: Dartmouth College's Datalad project
When importing from versioned remotes, fix tracking of the content of
deleted files.
Only S3 supports versioning so far, so only it was affected.
But, the draft import/export interface for external remotes also seemed to
need a change, so that versionedExport could be set.
Well, actually, fix a typo that has always been in the implementation of
that. "inbacked" used to work, but let's not tell users about that; they
might try to use it and expect git-annex to keep supporting the typo..
Sponsored-by: Jack Hill on Patreon
The user sets these hooks deliberately so they should always be run. For
example this allows hooks to be used to manage file permissions on NTFS
volumes in WSL1.
This is much easier and less failure-prone than having the user run
git update-index --refresh themselves.
Sponsored-by: Dartmouth College's DANDI project
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
Fixes updating git index file after getting an unlocked file when
annex.stalldetection is set.
The transferrer may want to send additional protocol messages when it's
shut down. Closing the read handle prevented it from doing that, and caused
it to crash rather than cleanly shutting down.
Draining the handle without processing the protocol seemed ok to do,
because anything it outputs is going to be some side message displayed
at shutdown. Displaying those once per transferrer process that is running
seems unncessary.
Sponsored-by: Dartmouth College's DANDI project
This partly fixes an issue where there are duplicate files in the
special remote, and the first file gets swapped with another duplicate,
or deleted. The swap case is fixed by this, the deleted case will need
other changes.
This makes retrieveExportWithContentIdentifier take a list of allowed
ContentIdentifier, same as storeExportWithContentIdentifier,
removeExportWithContentIdentifier, and
checkPresentExportWithContentIdentifier.
Of the special remotes that support importtree, borg is a special case
and does not use content identifiers, S3 I assume can't get mixed up
like this, directory certainly has the problem, and adb also appears to
have had the problem.
Sponsored-by: Graham Spencer on Patreon
Use curl for downloads from git remotes when annex.url-options and other
git configs are set.
If the url needs a password, curl will fail, and git credential will not be
used to prompt for it. But the user can set --netrc in url-options and
put the password in the netrc file.
This also means that url-options settings like -4 will take effect.
That was the case before commit 1883f7ef8f
forced conduit to be used.
autoEnableSpecialRemotes runs a subprocess, and if the uuid for a git
remote has not been probed yet, that will do a http get that will prompt
for a password. And then the parent process will subsequently prompt
for a password when getting annexed files from the remote.
So the solution is for autoEnableSpecialRemotes to run remoteList before
the subprocess, which will probe for the uuid for the git remote in the
same process that will later be used to get annexed files.
But, Remote.Git imports Annex.Init, and Remote.List imports Remote.Git,
so Annex.Init cannot import Remote.List. Had to pass remoteList into
functions in Annex.Init to get around this dependency loop.
When accessing a git remote over http needs a git credential prompt for a
password, cache it for the lifetime of the git-annex process, rather than
repeatedly prompting.
The git-lfs special remote already caches the credential when discovering
the endpoint. And presumably commands like git pull do as well, since they
may download multiple urls from a remote.
The TMVar CredentialCache is read, so two concurrent calls to
getBasicAuthFromCredential will both prompt for a credential.
There would already be two concurrent password prompts in such a case,
and existing uses of `prompt` probably avoid it. Anyway, it's no worse
than before.
Help the user get annex.dbdir configured when their filesystem is not
one that sqlite works on.
The change in Database.Handle makes an error from sqlite not be ignored
besides being displayed, which it was before. I can't see any reason
git-annex would want to ignore these errors.
I chose to use the fsck database rather than the keys database because
opening the keys database populates it, and see commit
b3c4579c79.
The placement of the call to checkSqliteWorks inside checkInitializeAllowed
avoids annex.uuid getting set before it's called.
Sponsored-by: Dartmouth College's Datalad project
Use curl when annex.security.allowed-url-schemes includes an url scheme not
supported by git-annex internally, as long as
annex.security.allowed-ip-addresses is configured to allow using curl.
Sponsored-by: Luke Shumaker on Patreon
This allows annex.dbdir to be set globally or always set to the same
value when needed. Each repository uses a subdirectory of it.
Sponsored-by: Dartmouth College's Datalad project
Completes work started in e60766543f
I've verified that all the sqlite databases get stored in annex.dbdir
and are created successfully. If annex.dbdir does not exist, it will be
created; its parent directory must already exist though.
Sponsored-by: Dartmouth College's Datalad project
This should not change the behavior of it, unless there are multiple top
directories, and then it should behave the same as if there was a single
top directory that was actually above the directory to be created.
Sponsored-by: Dartmouth College's Datalad project
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.
annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.
It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.
Sponsored-by: Dartmouth College's Datalad project
Work around bug in git 2.37 that causes a segfault when when
core.untrackedCache is set, and broke git-annex init.
Depending on when git gets fixed and how widely the buggy versions are
used, this could be reverted quite soon, or need to linger for a long time.
It only makes git-annex init a tiny bit slower in a new repo.
Sponsored-by: Max Thoursie on Patreon
(And v9 later on to v10.)
When v9/v10 were added, making v8 automatically upgrade was deferred
"for a few months" to prevent interoperability problems if users also
have an old version of git-annex. Of course that could still be the
case, but there has been a good amount of time and this can't be put off
forever.
Allow setting annex.autoupgraderepository to false to avoid this upgrade.
Previously, that only prevented upgrades from no longer supported git-annex
versions, but v8 is still supported, and users may want to keep on v8 to
interoperate with an old git-annex version.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
I was thinking that discardIncompleteAppend would make it strict, since
it looks at the end of the bytestring. But, it's applied lazily..
This probably fixes windows, which was failing:
git-annex.exe: .git\annex\journal\trust.log: DeleteFile "\\\\?\\C:\\Users\\runneradmin\\.t\\5\\tmprepo22\\.git\\annex\\journal\\trust.log": permission denied (The process cannot access the file because it is being used by another process.)