Except when configuration makes curl be used. It did not seem worth
trying to tail the file when curl is downloading.
But when an interrupted download is resumed, it does not read the whole
existing file to hash it. Same reason discussed in
commit 7eb3742e4b76d1d7a487c2c53bf25cda4ee5df43; that could take a long
time with no progress being displayed. And also there's an open http
request, which needs to be consumed; taking a long time to hash the file
might cause it to time out.
Also in passing implemented it for git and external special remotes when
downloading from the web. Several others like S3 are within striking
distance now as well.
Sponsored-by: Dartmouth College's DANDI project
This fixes the recent reversion that annex.verify is not honored,
because retrieveChunks was passed RemoteVerify baser, but baser
did not have export/import set up.
Sponsored-by: Dartmouth College's DANDI project
This eliminates the distinction between decodeBS and decodeBS', encodeBS
and encodeBS', etc. The old implementation truncated at NUL, and the
primed versions had to do extra work to avoid that problem. The new
implementation does not truncate at NUL, and is also a lot faster.
(Benchmarked at 2x faster for decodeBS and 3x for encodeBS; more for the
primed versions.)
Note that filepath-bytestring 1.4.2.1.8 contains the same optimisation,
and upgrading to it will speed up to/fromRawFilePath.
AFAIK, nothing relied on the old behavior of truncating at NUL. Some
code used the faster versions in places where I was sure there would not
be a NUL. So this change is unlikely to break anything.
Also, moved s2w8 and w82s out of the module, as they do not involve
filesystem encoding really.
Sponsored-by: Shae Erisson on Patreon
* Deal with clock skew, both forwards and backwards, when logging
information to the git-annex branch.
* GIT_ANNEX_VECTOR_CLOCK can now be set to a fixed value (eg 1)
rather than needing to be advanced each time a new change is made.
* Misuse of GIT_ANNEX_VECTOR_CLOCK will no longer confuse git-annex.
When changing a file in the git-annex branch, the vector clock to use is now
determined by first looking at the current time (or GIT_ANNEX_VECTOR_CLOCK
when set), and comparing it to the newest vector clock already in use in
that file. If a newer time stamp was already in use, advance it forward by
a second instead.
When the clock is set to a time in the past, this avoids logging with
an old timestamp, which would risk that log line later being ignored in favor
of "newer" line that is really not newer.
When a log entry has been made with a clock that was set far ahead in the
future, this avoids newer information being logged with an older timestamp
and so being ignored in favor of that future-timestamped information.
Once all clocks get fixed, this will result in the vector clocks being
incremented, until finally enough time has passed that time gets back ahead
of the vector clock value, and then it will return to usual operation.
(This latter situation is not ideal, but it seems the best that can be done.
The issue with it is, since all writers will be incrementing the last
vector clock they saw, there's no way to tell when one writer made a write
significantly later in time than another, so the earlier write might
arbitrarily be picked when merging. This problem is why git-annex uses
timestamps in the first place, rather than pure vector clocks.)
Advancing forward by 1 second is somewhat arbitrary. setDead
advances a timestamp by just 1 picosecond, and the vector clock could
too. But then it would interfere with setDead, which wants to be
overrulled by any change. So it could use 2 picoseconds or something,
but that seems weird. It could just as well advance it forward by a
minute or whatever, but then it would be harder for real time to catch
up with the vector clock when forward clock slew had happened.
A complication is that many log files contain several different peices of
information, and it may be best to only use vector clocks for the same peice
of information. For example, a key's location log file contains
InfoPresent/InfoMissing for each UUID, and it only looks at the vector
clocks for the UUID that is being changed, and not other UUIDs.
Although exactly where the dividing line is can be hard to determine.
Consider metadata logs, where a field "tag" can have multiple values set
at different times. Should it advance forward past the last tag?
Probably. What about when a different field is set, should it look at
the clocks of other fields? Perhaps not, but currently it does, and
this does not seems like it will cause any problems.
Another one I'm not entirely sure about is the export log, which is
keyed by (fromuuid, touuid). So if multiple repos are exporting to the
same remote, different vector clocks can be used for that remote.
It looks like that's probably ok, because it does not try to determine
what order things occurred when there was an export conflict.
Sponsored-by: Jochen Bartl on Patreon
git-annex get when run as the first git-annex command in a new repo did not
populate unlocked files. (Reversion in version 8.20210621)
I am not entirely happy with this, because I don't understand how
428c91606b caused the problem in the first
place, and I don't fully understand how skipping calling scanAnnexedFiles
during autoinit avoids the problem.
Kept the explicit call to scanAnnexedFiles during git-annex init,
so that when reconcileStaged is expensive, it can be made to run then,
rather than at some later point when the information is needed.
Sponsored-by: Brock Spratlen on Patreon
The pass was needed to populate files when annex.thin was set,
but in commit 73e0cbbb19,
reconcileStaged started to do that. So, this second pass is not needed
any longer.
An easy way to see this in action is to have an unlocked file, and touch the
object file.
While all code that compares inode caches for object files needs to be
prepared for this kind of problem and fall back to verification, having
fsck notice it and correct it is cheap (as long as fsck is being run
anyway) and ensures that if it happens for some unusual reason, there's a
way for the user to notice that it's happening.
Not that, when annex.thin is in use, the earlier call to isUnmodified
(and also potentially earlier calls to inAnnex in eg, verifyLocationLog)
will fix up the same problem silently. That might prevent the warning
being displayed, although probably it still will be, because the
Database.Keys write of the InodeCache will be queued but will not have
happened yet. I can't see a way to improve this, but it's not great.
Sponsored-by: Dartmouth College's Datalad project
This avoids it calling enteringStage VerifyStage when it's used in
places that only fall back to verification rarely, and which might be
called while in TransferStage and be going to perform a transfer after
the verification.
Some uses of linkFromAnnex are inside replaceWorkTreeFile, which was
already safe, but others use it directly on the work tree file, which
was race-prone. Eg, if the work tree file was first removed, then
linkFromAnnex called to populate it, the user could have re-written it in
the interim.
This came to light during an audit of all calls of addInodeCaches,
looking for such races. All the other uses of it seem ok.
Sponsored-by: Brett Eisenberg on Patreon
In Annex.Content, the object file was statted after pointer files were
populated. But if annex.thin is set, once the pointer files are
populated, the object file can potentially be modified via the hard
link. So, it was possible, though seemingly very unlikely, for the inode
of the modified object file to be cached.
Command.Fix and Command.Fsck had similar problems, statting the work
tree files after they were in place. Changed them to stat the temp file
that gets moved into place. This does rely on .git/annex being on the
same filesystem. If it's not, the cached inode will not be the same as
the one that the temp file gets moved to. Result will be that git-annex
will later need to do an expensive verification of the content of the
worktree files. Note that the cross-filesystem move of the temp file
already is a larger amount of extra work, so this seems acceptable.
Sponsored-by: Luke Shumaker on Patreon
When git-annex lock repopulates the object file by copying an associated
file that still has its content, it negected to update the inode cache.
I was not able to actually get this code to successfully repopulate the
object file; the associated file gets replaced with a dangling pointer
before unlock is able to do that. (By what I'm not sure..
reconcileStaged?) Which might be itself a bug, but
anyway this makes me doubtful that this was really leading to a stale
inode cache. Still, in case there is some situation in which it does
work, fixed it to update the inode cache.
Which is the same as the git merge option.
After last commit, this turns out to be needed in the test suite, and when
doing git-annex import from special remote, followed by a git-annex merge.
Sponsored-by: Svenne Krap on Patreon
sync, merge, post-receive: Avoid merging unrelated histories, which used to
be allowed only to support direct mode repositories.
(However, sync does still merge unrelated histories when importing trees
from special remotes, and the assistant still merges unrelated histories
always.)
See 556b2ded2b for why this was added
back in 2016, for direct mode.
This is a behavior change, which might break something that was relying
on sync merging unrelated histories, but git had a good reason to
prevent it, since it's easy to foot shoot with it, and git-annex should
follow suit.
Sponsored-by: Noam Kremen on Patreon
* sync: When --quiet is used, run git commit, push, and pull without
their ususual output.
* merge: When --quiet is used, run git merge without its usual output.
This might also make --quiet work better for some other commands
that make commits, like git-annex adjust.
Sponsored-by: Kevin Mueller on Patreon
Handles keys that are substrings of other keys, as well as pointer files
that contain a newline after the key.
Note that -S does not match regexp, while -G does by default. Docs are
not clear, determined experimentally. The only other difference in
changing to -G is that if a file used to contain the key and changed
in some way, while still containing the key, -G will match and -S would
not. So eg, annex links that git annex fix rewrites will match, and
files that change lock status will match. Which is an improvement anyway.
Sponsored-by: Jochen Bartl on Patreon
Does not check the reflog, but otherwise works.
It's possible for it to display something that is not an annexed file,
if a non-annexed file somehow ends up containing something that looks
like the key's name. This seems very unlikely to happen, and it would
add a lot of complexity to detect it and somehow skip over that file,
since the git log would need to either be run again, or not limited to 1
result and canceled once enough results have been read.
Also, it kind of seems ok, if a file refers to a key, to consider that
as a place the key was used, for some definition of used. So, I punted
on dealing with that. May revisit later.
Sponsored-by: Brock Spratlen on Patreon
Forces eg, download with youtube-dl without falling back to raw download.
Since youtube-dl failing due to an url not being supported is difficult to
distinguish from it failing due to being blocked in some way, this can be
useful to avoid the fallback of git-annex downloading the raw web page and
adding that.
Since --raw also prevents using special remotes, --no-raw also
allows special remote downloads. Although it's always possible that some
special remote may claim an url and fall back to raw download of the
content, which --no-raw cannot prevent.
Sponsored-by: Boyd Stephen Smith Jr. on Patreon
Dropping an object with drop --unused or dropunused will mark it as
dead, preventing fsck --all from complaining about it after it's been
dropped from all repositories.
If another repository still has a copy, it won't be treated as dead
until it's also dropped from there.
The drop has to use --unused, can't be --key or something else, because
this indicates that the user has recently ran git-annex unused. If it
checked the unused log on every drop, bad things would happen when the
unused log was out of date, eg a file used to be unused but then got
re-added. Marking such a file as dead could be confusing. When the user
uses --unused/dropunused, they must consider the unused information to be
up-to-date.
The particular workflow this enables is:
git annex add foo
git annex unannex foo
git annex unused
git annex drop --unused / dropunused
git annex fsck --all # no warnings
The docs for git-annex unannex say to use git-annex unused and dropunused,
so the user should be pointed in this direction when they want to undo an
accidental add.
Sponsored-by: Brock Spratlen on Patreon
sync: Partly work around github behavior that first branch to be pushed to
a new repository is assumed to be the head branch, by not pushing
synced/git-annex first.
github expects master (or whatever the name is) to be pushed first, but
git-annex sync can't, because it's got to also support pushes to non-bare
repos where pushing master fails, as explained in the big comment. So
pushing synced/master is not entirely a fix, but at least it makes github
default to a branch with the stuff the user expects in it, not a bunch of
annex log files.
Aside from fixing github to not make this assumption, or improving
the git push protocol to include what the current HEAD is, the only other
approach I can think of is to identify git push's progress messages and
display those when pushing master, while filtering out error messages
about non-fast-forward etc. But git doesn't provide a way to separate out
or identify its progress messages.
Sponsored-by: Luke Shumaker on Patreon
Eg, before with a .gitattributes like:
*.2 annex.numcopies=2
*.1 annex.numcopies=1
And foo.1 and foo.2 having the same content and key, git-annex drop foo.1 foo.2
would succeed, leaving just 1 copy, despite foo.2 needing 2 copies.
It dropped foo.1 first and then skipped foo.2 since its content was gone.
Now that the keys database includes locked files, this longstanding wart
can be fixed.
Sponsored-by: Noam Kremen on Patreon
When the log has an activity that is not known, eg added by a future
version of git-annex, it used to be treated as no activity at all,
which would make git-annex expire think it should expire the repository,
despite it having some kind of recent activity.
Hopefully there will be no reason to add a new activity until enough
time has passed that this commit is in use everywhere.
Sponsored-by: Jake Vosloo on Patreon
This makes git checkout and git merge hooks do the work to catch up with
changes that they made to the tree. Rather than doing it at some later
point when the user is not thinking about that past operation.
Sponsored-by: Dartmouth College's Datalad project
When two files have the same content, and a required content expression
matches one but not the other, dropping the latter file will fail as it
would also remove the content of the required file.
This will slow down drop (w/o --auto), dropunused, mirror, and move, by one
keys db lookup per file. But I did include an optimisation to avoid a
double db lookup in the drop --auto / sync --content case. I suspect that
dropunused could also use PreferredContentChecked True, but haven't
entirely thought it through and it's rarely used with enough files for the
optimisation to matter.
Sponsored-by: Dartmouth College's Datalad project
* drop: When two files have the same content, and a preferred content
expression matches one but not the other, do not drop the file.
* sync --content, assistant: Fix an edge case where a file that is not
preferred content did not get dropped.
The sync --content edge case is that handleDropsFrom loaded associated files
and used them without verifying that the information from the database was
not stale.
It seemed best to avoid changing --want-drop's behavior, this way when
debugging a preferred content expression with it, the files matched will
still reflect the expression. So added a note to the --want-drop documentation,
to make clear it may not behave identically to git-annex drop --auto.
While it would be possible to introspect the preferred content
expression to see if it matches on filenames, and only look up the
associated files when it does, it's generally fairly rare for 2 files to
have the same content, and the database lookup is already avoided when
there's only 1 file, so I did not implement that further optimisation.
Note that there are still some situations where the associated files
database does not get locked files recorded in it, which will prevent
this fix from working.
Sponsored-by: Dartmouth College's Datalad project
Before only unlocked files were included.
The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.
reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.
On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.
reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticable overhead. What may turn out to be a noticable
slowdown is after changing to a branch, it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. Also, after adding a lot of files,
or deleting a lot of files, or moving a large subdirectory, etc.
Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.
Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
run and accesses the keys db, reconcileStaged will run and update it.
There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.
However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
They're only needed to cover a gc edge case, and it's better someone
gets caught by that edge case than that someone who does not know about
them ends up with a filtered git-annex branch that contains such a tree
when some of the files listed in it are ones they wanted to *remove*
from the repository.
It's not currently possible to exclude a sameas repo using its
annex-config-uuid. (Remote.nameToUUID rejects them).
Since there's no real documented way to learn those, this seems ok, at
least for now. Also it avoids the problem of someone excluding the
parent but including the sameas, which would probably make the sameas
repo not usable when using the filtered branch.
Added a note to man page about what happens to information that is
recorded in the private journal. Since it uses Branch.get, that
information will be copied when options allow. It seemed better to allow
it and document it than not allow it, since the options allow excluding
repositories and so can be used to exclude private repos if desired.
Not tested yet but should work.
Noted a possible optimisation, which should probably be added, to
speed it up in cases where there is no uuid filtering being done.
It would need Annex.Branch to add a function like getRef that uses
catFileDetails, so the sha is also returned. The difficulty would be
making it support the precached file content; if it didn't it would
probably not be any faster and could even be slower. So probably the
precaching would need to be changed to also cache the sha.
This is less erorr-prone, and easier for the user to reason about; it
preserves the man page's promise that only explicitly included
information will be copied.
ghc 8.8.4 seems to have changed something that broke code that has been
successfully using forkProcess since 2012. Likely a change to GC internals.
Since forkProcess has never had clear documentation about how to
use it safely, avoid using it at all. Instead, when git-annex needs to
daemonize itself, re-run the git-annex command, in a new process group
and session.
This commit was sponsored by Luke Shumaker on Patreon.
Similar to what commit 675556fd9a did for
adding a non-annexed file, this prevents the smudge clean filter
recognising the inode if git add is later run on the unannexed file.
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.
When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.
Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.
It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.
git-annex add was changed, when adding a small file, to remove the inode
cache for it. This is necessary to keep the recipe in
doc/tips/largefiles.mdwn for converting from annex to git working.
It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn
which the earlier try at this change introduced.
Fix behavior of several commands, including reinject, addurl, and rmurl
when given an absolute path to an unlocked file, or a relative path that
leaves and re-enters the repository.
To avoid slowing down all the cases where the paths are already ok
with an unncessary call to getCurrentDirectory, put in an optimisation
in relPathCwdToFile. That will probably also speed up other parts of
git-annex by some small amount, but I have not benchmarked.
Note that I did not convert branchFileRef, because it seems likely that
it will be used with a file that is not provided by the user, so is already
in a sane format. This is certainly true for the way git-annex uses it,
though maybe arguable to the extent Git.Ref is a reusable library.
GHC was complaining about it possibly being a homoglyph:
Command/Multicast.hs:111:36: error:
warning: treating Unicode character <U+2212> as identifier character rather than as '-' symbol [-Wunicode-homoglyph]
-- using a nice prime, namely 2521−1 but the sheer size of this
^
|
111 | -- using a nice prime, namely 2521−1 but the sheer size of this
| ^
1 warning generated.
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.
When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.
Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.
It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.
fromkey: Create an unlocked file when used in an adjusted branch where the
file should be unlocked, or when configured by annex.addunlocked.
There is some overlap with code in Annex.Ingest, however it's not quite the
same because ingesting has a temp file with the content, where here the
content, if any, is in the annex object file. So it eg, makes sense for
Annex.Ingest to copy the execute mode of the content file, but it does not make
sense for fromkey to do that.
Also changed in passing to stage the file in git directly, rather than
using git add. One consequence of that is that if the file is gitignored,
it will still get added, rather than the old behavior:
The following paths are ignored by one of your .gitignore files:
ignored
hint: Use -f if you really want to add them.
hint: Turn this message off by running
hint: "git config advice.addIgnoredFile false"
git-annex: user error (xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"] exited 123)
That old behavior was a surprise to me, and so I consider it a bug, and doubt
anyone would have relied on it.
Note that, when on an --hide-missing branch, it is possible to fromkey a key
that is not present (needs --force). The annex link or pointer file still gets
written in this case. It doesn't seem to make any sense not to write it,
because then fromkey would not do anything useful in this case, and this way
the file can be committed and synced to master, and the branch re-adjusted to
hide the new missing file.
This commit was sponsored by Noam Kremen on Patreon.
* Fix bug that could make git-annex importfeed not see recently recorded
state when configured with annex.alwayscommit=false.
* importfeed: Made "checking known urls" phase run 12 times faster.
The massive speedup is because it no longer queries for metadata
accompanying each url. Instead it processes the whole git-annex branch and
checks all metadata files for feed item ids, and uses any it finds.
This could result in a behavior change, in an unlikely situation: If a feed
id is recorded in a key's metadata, but the url gets removed, the old code
would not see that item id and would re-download it if it finds an url for
it in a feed, while the new code will see the item id. I don't think
the old behavior was intentional, and it may be that the new behavior is
better. Not gonna worry about this.
This adds a separate journal, which does not currently get committed to
an index, but is planned to be committed to .git/annex/index-private.
Changes that are regarding a UUID that is private will get written to
this journal, and so will not be published into the git-annex branch.
All log writing should have been made to indicate the UUID it's
regarding, though I've not verified this yet.
Currently, no UUIDs are treated as private yet, a way to configure that
is needed.
The implementation is careful to not add any additional IO work when
privateUUIDsKnown is False. It will skip looking at the private journal
at all. So this should be free, or nearly so, unless the feature is
used. When it is used, all branch reads will be about twice as expensive.
It is very lucky -- or very prudent design -- that Annex.Branch.change
and maybeChange are the only ways to change a file on the branch,
and Annex.Branch.set is only internal use. That let Annex.Branch.get
always yield any private information that has been recorded, without
the risk that Annex.Branch.set might be called, with a non-private UUID,
and end up leaking the private information into the git-annex branch.
And, this relies on the way git-annex union merges the git-annex branch.
When reading a file, there can be a public and a private version, and
they are just concacenated together. That will be handled the same as if
there were two diverged git-annex branches that got union merged.
When downloading content from a remote, if the content is able to be
verified during the transfer, skip checksumming it a second time.
Note that in this case, the fsck output does not include "(checksum)"
which it does when the checksumming is done separately from the download.
This commit was sponsored by Brock Spratlen on Patreon.
This uses a DebugSelector, rather than debug levels, which will allow
for a later option like --debug-from=Process to only
see debuging about running processes.
The module name that contains the thing being debugged is used as the
DebugSelector (in most cases; does not need to be a hard and fast rule).
Debug calls were changed to add that. hslogger did not display
that first parameter to debugM, but the DebugSelector does get
displayed.
Also fastDebug will allow doing debugging in places that are used in
tight loops, with the DebugSelector coming from the Annex Reader
essentially for free. Not done yet.
Values in AnnexRead can be read more efficiently, without MVar overhead.
Only a few things have been moved into there, and the performance
increase so far is not likely to be noticable.
This is groundwork for putting more stuff in there, particularly a value
that indicates if debugging is enabled.
The obvious next step is to change option parsing to not run in the
Annex monad to set values in AnnexState, and instead return a pure value
that gets stored in AnnexRead.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
I don't think this was really intentional behavior. It may be that it was
useful to include it so it could be passed to rmurl, since without it rmurl
would not actually remove the url. Since that was changed earlier today,
now seems like a good time to clean up the display of these urls.
This commit was sponsored by Jochen Bartl on Patreon.
fsck: When --from is used in combination with --all or similar options, do
not verify required content, which can't be checked properly when operating
on keys.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
unregisterurl: Fix a bug that caused an url to not be unregistered when it
is claimed by a special remote other than the web.
See commit f175d4cc90 for rationalle.
* rmurl: When youtube-dl was used for an url, it no longer needs to be
prefixed with "yt:" in order to be removed.
* rmurl: If an url is both used by the web and also claimed by another
special remote, fix a bug that caused the url to to not be removed.
The youtube-dl change is a consequence of how the bug fix is implemented.
But I also think it's the right thing to do. Consider that, before,
git-annex addurl $url followed by git-annex rmurl $url would not remove the
url in the case where youtube-dl was used. That was surprising behavior.
In the unlikely case where a special remote claims an url, and it's been
added using OtherDownloader, but it was also added already as a web url,
it seems better for rmurl to remove both than to arbitrarily remove only one.
And in the case the bug report was filed for, when an url was added as a
web url, but a special remote now claims it, that should not prevent rmurl
removing the web url.
Calling setUrlMissing lets other callers of it behave differently.
Probably the calls to it in eg, Remote.External and Remote.BitTorrent are
fine, since they don't mangle the url and just remove what was provided,
and the OtherDownloader form of a bittorrent url, respectively.
I suspect unregisterurl needs to have a similar change made to rmurl, for
similar reasons.
Like import was using ActionItemWorkTreeFile, it's ok to use it for export,
even though it might not correspond with a file in the work tree.
And renamed it to ActionItemTreeFile to make that clearer.
Note that when an export has to rename files, it still uses
ActionItemOther, so file will still be null in that case, but as no file is
being transferred, that seems ok.
import: When the previously exported tree contained a submodule,
preserve it in the imported tree so it does not get deleted.
The export exclude log, which was used for non-preferred content,
now also includes the submodules. Since the log format is git ls-tree
output, this does not break backwards compatibility.
It's not necessary to log location of GIT keys, because these files are
not annexed files and so git-annex will never need to get them.
This corresponds to code in Annex.Import that already checked before
updating the location log when handling deleted files.
Older versions of git-annex that used SHA1 keys for non-annexed files
also unncessarily updated the location log for them.
GIT keys still appear in the git-annex branch for content identifier
logs, so kept the documentation of them in backends.mdwn
This commit was sponsored by Jake Vosloo on Patreon.
This solves the problem that import of such files gets confused and
converts them back to annexed files.
The import code already used GIT keys internally when it determined a
file should not be annexed. So now when it sees a GIT key that export
used, it already does the right thing.
This also means that even older version of git-annex can import and will
do the right thing, once a fixed version has exported. Still, there may
be other complications around upgrades; still need to think it all
through.
Moved gitShaKey and keyGitSha from Key to Annex.Export since they're
only used for export/import.
Documented GIT keys in backends, since they do appear in the git-annex
branch now.
This commit was sponsored by Graham Spencer on Patreon.
Added LinkType to ProvidedInfo, and unified MatchingKey with
ProvidedInfo. They're both used in the same way, so there was no real
reason to keep separate.
Note that addLocked and addUnlocked still set matchNeedsFileName,
because to handle MatchingFile, they do need it. However, they
don't use it when MatchingInfo is provided. This should be ok,
the --branch case will be able skip checking matchNeedsFileName,
since it will provide a filename in any case.
Implemented by generalizing registerurl. Without the implicit batch mode
of registerurl since that is only a backwards compatability thing
(see commit 1d1054faa6).
unannex, uninit: When an annexed file is modified, don't overwrite the
modified version with an older version from the annex
This commit was sponsored by Mark Reidenbach on Patreon.
When a git remote is configured with an absolute path, use that path,
rather than making it relative. If it's configured with a relative path,
use that.
Git.Construct.fromPath changed to preserve the path as-is,
rather than making it absolute. And Annex.new changed to not
convert the path to relative. Instead, Git.CurrentRepo.get
generates a relative path.
A few things that used fromAbsPath unncessarily were changed in passing to
use fromPath instead. I'm seeing fromAbsPath as a security check,
while before it was being used in some cases when the path was
known absolute already. It may be that fromAbsPath is not really needed,
but only git-annex-shell uses it now, and I'm not 100% sure that there's
not some input that would cause a relative path to be used, opening a
security hole, without the security check. So left it as-is.
Test suite passes and strace shows the configured remote url is used
unchanged in the path into it. I can't be 100% sure there's not some code
somewhere that takes an absolute path to the repo and converts it to
relative and uses it, but it seems pretty unlikely that the code paths used
for a git remote would call such code. One place I know of is gitAnnexLink,
but I'm pretty sure that git remotes never deal with annex symlinks. If
that did get called, it generates a path relative to cwd, which would have
been wrong before this change as well, when operating on a remote.
When annex.stalldetection is not enabled, and a likely stall is detected,
display a suggestion to enable it.
Note that the progress meter display is not taken down when displaying
the message, so it will display like this:
0% 8 B 0 B/s
Transfer seems to have stalled. To handle stalling transfers, configure annex.stalldetection
0% 10 B 0 B/s
Although of course if it's really stalled, it will never update
again after the message. Taking down the progress meter and starting
a new one doesn't seem too necessary given how unusual this is,
also this does help show the state it was at when it stalled.
Use of uninterruptibleCancel here is ok, the thread it's canceling
only does STM transactions and sleeps. The annex thread that gets
forked off is separate to avoid it being canceled, so that it
can be joined back at the end.
A module cycle required moving from dupState the precaching of the
remote list. Doing it at startConcurrency should cover all the cases
where the remote list is used in concurrent actions.
This commit was sponsored by Kevin Mueller on Patreon.
Missed this when implementing it because of the default case catching
the new constructor. So, removed that default case to make sure
future types of adjusted branches don't make the same mistake.
Complicated by git-annex addurl --fast which adds the file whose content
is not present, so it needs to stay unlocked when on such a branch.
This commit was sponsored by Brock Spratlen on Patreon.
34a535ebea broke the test suite.
Getting a file started failing in one case, because the annex object did
not have its inode cached, so was not trusted to be unmodified.
This adds something very similar to what was added to linkAnnex
in commit 2e9341a47d -- if there are not
yet any inodes cached for a key, add the inode of the annex object when
adding the inode of the unlocked file.
Feels like this should be handled in a more principled way. How
do we know the addInodeCaches call in getMoveRaceRecovery just above
this change is currently correct? It doesn't add the annex object inode
cache. Ah well, maybe sometime when I've not had my entire evening eaten
by a reversion that the test suite caught as I was cooking dinner.
Avoids the smudge --clean filter failing because URL keys do not support
genKey. Instead the modified content will be added using the default
backend.
This commit was sponsored by Jochen Bartl on Patreon.
This avoids the smudge --clean filter failing on the URL keys.
git checkout runs the post-checkout hook, which runs smudge --update.
That populates all the pointer files, but it neglected to store their inode
caches in the keys db. With that done, and the keys db flushed before
smudge --clean gets run (by restagePointerFile), the isUnmodifiedCheap
check can tell the file is not modified, so will not try to re-ingest it,
which does not work with URL keys because they do not support genKey.
It also seems possible that the isUnmodifiedCheap was also failing for
non-URL keys, which would cause them to be re-ingested, leading to a lot of
extra work. I have not verified that, but don't see why it wouldn't have
happened. So this probably also speeds up checking out adjusted branches.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Configuring chunking and encryption for a git remote has no effect, so
skip testing those variants in the TestRemote call.
It would be better if TestRemote itself could do this, but it
doesn't seem possible there. There is no way to look at a Remote and
tell if it supports chunking or encryption.
Note that, while the test suite displays output as it it's testing
exporting, it actually skips doing anything for the tests when run on
the git remote. So at least does not waste time even though the output
is not ideal.
This commit was sponsored by Noam Kremen on Patreon.
Since unconsidered use of trusted repositories can lead to data loss.
Trusted has always been this way, but it used to be acceptable for
git-annex to be set up so that data could be lost without using --force,
and most or all other ways that can happen have already been eliminated.
This commit was sponsored by Mark Reidenbach on Patreon.
This is conceptually very simple, just making a 1 that was hard coded be
exposed as a config option. The hard part was plumbing all that, and
dealing with complexities like reading it from git attributes at the
same time that numcopies is read.
Behavior change: When numcopies is set to 0, git-annex used to drop
content without requiring any copies. Now to get that (highly unsafe)
behavior, mincopies also needs to be set to 0. It seemed better to
remove that edge case, than complicate mincopies by ignoring it when
numcopies is 0.
This commit was sponsored by Denis Dzyubenko on Patreon.
* add: Significantly speed up adding lots of non-large files to git,
by disabling the annex smudge filter when running git add.
* add --force-small: Run git add rather than updating the index itself,
so any other smudge filters than the annex one that may be enabled will
be used.