sync: Partly work around github behavior that first branch to be pushed to
a new repository is assumed to be the head branch, by not pushing
synced/git-annex first.
github expects master (or whatever the name is) to be pushed first, but
git-annex sync can't, because it's got to also support pushes to non-bare
repos where pushing master fails, as explained in the big comment. So
pushing synced/master is not entirely a fix, but at least it makes github
default to a branch with the stuff the user expects in it, not a bunch of
annex log files.
Aside from fixing github to not make this assumption, or improving
the git push protocol to include what the current HEAD is, the only other
approach I can think of is to identify git push's progress messages and
display those when pushing master, while filtering out error messages
about non-fast-forward etc. But git doesn't provide a way to separate out
or identify its progress messages.
Sponsored-by: Luke Shumaker on Patreon
Eg, before with a .gitattributes like:
*.2 annex.numcopies=2
*.1 annex.numcopies=1
And foo.1 and foo.2 having the same content and key, git-annex drop foo.1 foo.2
would succeed, leaving just 1 copy, despite foo.2 needing 2 copies.
It dropped foo.1 first and then skipped foo.2 since its content was gone.
Now that the keys database includes locked files, this longstanding wart
can be fixed.
Sponsored-by: Noam Kremen on Patreon
When the log has an activity that is not known, eg added by a future
version of git-annex, it used to be treated as no activity at all,
which would make git-annex expire think it should expire the repository,
despite it having some kind of recent activity.
Hopefully there will be no reason to add a new activity until enough
time has passed that this commit is in use everywhere.
Sponsored-by: Jake Vosloo on Patreon
This makes git checkout and git merge hooks do the work to catch up with
changes that they made to the tree. Rather than doing it at some later
point when the user is not thinking about that past operation.
Sponsored-by: Dartmouth College's Datalad project
When two files have the same content, and a required content expression
matches one but not the other, dropping the latter file will fail as it
would also remove the content of the required file.
This will slow down drop (w/o --auto), dropunused, mirror, and move, by one
keys db lookup per file. But I did include an optimisation to avoid a
double db lookup in the drop --auto / sync --content case. I suspect that
dropunused could also use PreferredContentChecked True, but haven't
entirely thought it through and it's rarely used with enough files for the
optimisation to matter.
Sponsored-by: Dartmouth College's Datalad project
* drop: When two files have the same content, and a preferred content
expression matches one but not the other, do not drop the file.
* sync --content, assistant: Fix an edge case where a file that is not
preferred content did not get dropped.
The sync --content edge case is that handleDropsFrom loaded associated files
and used them without verifying that the information from the database was
not stale.
It seemed best to avoid changing --want-drop's behavior, this way when
debugging a preferred content expression with it, the files matched will
still reflect the expression. So added a note to the --want-drop documentation,
to make clear it may not behave identically to git-annex drop --auto.
While it would be possible to introspect the preferred content
expression to see if it matches on filenames, and only look up the
associated files when it does, it's generally fairly rare for 2 files to
have the same content, and the database lookup is already avoided when
there's only 1 file, so I did not implement that further optimisation.
Note that there are still some situations where the associated files
database does not get locked files recorded in it, which will prevent
this fix from working.
Sponsored-by: Dartmouth College's Datalad project
Before only unlocked files were included.
The initial scan now scans for locked as well as unlocked files. This
does mean it gets a little bit slower, although I optimised it as well
as I think it can be.
reconcileStaged changed to diff from the current index to the tree of
the previous index. This lets it handle deletions as well, removing
associated files for both locked and unlocked files, which did not
always happen before.
On upgrade, there will be no recorded previous tree, so it will diff
from the empty tree to current index, and so will fully populate the
associated files, as well as removing any stale associated files
that were present due to them not being removed before.
reconcileStaged now does a bit more work. Most of the time, this will
just be due to running more often, after some change is made to the
index, and since there will be few changes since the last time, it will
not be a noticable overhead. What may turn out to be a noticable
slowdown is after changing to a branch, it has to go through the diff
from the previous index to the new one, and if there are lots of
changes, that could take a long time. Also, after adding a lot of files,
or deleting a lot of files, or moving a large subdirectory, etc.
Command.Lock used removeAssociatedFile, but now that's wrong because a
newly locked file still needs to have its associated file tracked.
Command.Rekey used removeAssociatedFile when the file was unlocked.
It could remove it also when it's locked, but it is not really
necessary, because it changes the index, and so the next time git-annex
run and accesses the keys db, reconcileStaged will run and update it.
There are probably several other places that use addAssociatedFile and
don't need to any more for similar reasons. But there's no harm in
keeping them, and it probably is a good idea to, if only to support
mixing this with older versions of git-annex.
However, mixing this and older versions does risk reconcileStaged not
running, if the older version already ran it on a given index state. So
it's not a good idea to mix versions. This problem could be dealt with
by changing the name of the gitAnnexKeysDbIndexCache, but that would
leave the old file dangling, or it would need to keep trying to remove
it.
They're only needed to cover a gc edge case, and it's better someone
gets caught by that edge case than that someone who does not know about
them ends up with a filtered git-annex branch that contains such a tree
when some of the files listed in it are ones they wanted to *remove*
from the repository.
It's not currently possible to exclude a sameas repo using its
annex-config-uuid. (Remote.nameToUUID rejects them).
Since there's no real documented way to learn those, this seems ok, at
least for now. Also it avoids the problem of someone excluding the
parent but including the sameas, which would probably make the sameas
repo not usable when using the filtered branch.
Added a note to man page about what happens to information that is
recorded in the private journal. Since it uses Branch.get, that
information will be copied when options allow. It seemed better to allow
it and document it than not allow it, since the options allow excluding
repositories and so can be used to exclude private repos if desired.
Not tested yet but should work.
Noted a possible optimisation, which should probably be added, to
speed it up in cases where there is no uuid filtering being done.
It would need Annex.Branch to add a function like getRef that uses
catFileDetails, so the sha is also returned. The difficulty would be
making it support the precached file content; if it didn't it would
probably not be any faster and could even be slower. So probably the
precaching would need to be changed to also cache the sha.
This is less erorr-prone, and easier for the user to reason about; it
preserves the man page's promise that only explicitly included
information will be copied.
ghc 8.8.4 seems to have changed something that broke code that has been
successfully using forkProcess since 2012. Likely a change to GC internals.
Since forkProcess has never had clear documentation about how to
use it safely, avoid using it at all. Instead, when git-annex needs to
daemonize itself, re-run the git-annex command, in a new process group
and session.
This commit was sponsored by Luke Shumaker on Patreon.
Similar to what commit 675556fd9a did for
adding a non-annexed file, this prevents the smudge clean filter
recognising the inode if git add is later run on the unannexed file.
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.
When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.
Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.
It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.
git-annex add was changed, when adding a small file, to remove the inode
cache for it. This is necessary to keep the recipe in
doc/tips/largefiles.mdwn for converting from annex to git working.
It also avoids bugs/case_where_using_pathspec_with_git-commit_leaves_s.mdwn
which the earlier try at this change introduced.
Fix behavior of several commands, including reinject, addurl, and rmurl
when given an absolute path to an unlocked file, or a relative path that
leaves and re-enters the repository.
To avoid slowing down all the cases where the paths are already ok
with an unncessary call to getCurrentDirectory, put in an optimisation
in relPathCwdToFile. That will probably also speed up other parts of
git-annex by some small amount, but I have not benchmarked.
Note that I did not convert branchFileRef, because it seems likely that
it will be used with a file that is not provided by the user, so is already
in a sane format. This is certainly true for the way git-annex uses it,
though maybe arguable to the extent Git.Ref is a reusable library.
GHC was complaining about it possibly being a homoglyph:
Command/Multicast.hs:111:36: error:
warning: treating Unicode character <U+2212> as identifier character rather than as '-' symbol [-Wunicode-homoglyph]
-- using a nice prime, namely 2521−1 but the sheer size of this
^
|
111 | -- using a nice prime, namely 2521−1 but the sheer size of this
| ^
1 warning generated.
smudge: Fix a case where an unlocked annexed file that annex.largefiles
does not match could get its unchanged content checked into git, due to git
running the smudge filter unecessarily.
When the file has the same inodecache as an already annexed file,
we can assume that the user is not intending to change how it's stored in
git.
Note that checkunchangedgitfile already handled the inverse case, where the
file was added to git previously. That goes further and actually sha1
hashes the new file and checks if it's the same hash in the index.
It would be possible to generate a key for the file and see if it's the
same as the old key, however that could be considerably more expensive than
sha1 of a small file is, and it is not necessary for the case I have, at
least, where the file is not modified or touched, and so its inode will
match the cache.
fromkey: Create an unlocked file when used in an adjusted branch where the
file should be unlocked, or when configured by annex.addunlocked.
There is some overlap with code in Annex.Ingest, however it's not quite the
same because ingesting has a temp file with the content, where here the
content, if any, is in the annex object file. So it eg, makes sense for
Annex.Ingest to copy the execute mode of the content file, but it does not make
sense for fromkey to do that.
Also changed in passing to stage the file in git directly, rather than
using git add. One consequence of that is that if the file is gitignored,
it will still get added, rather than the old behavior:
The following paths are ignored by one of your .gitignore files:
ignored
hint: Use -f if you really want to add them.
hint: Turn this message off by running
hint: "git config advice.addIgnoredFile false"
git-annex: user error (xargs ["-0","git","--git-dir=.git","--work-tree=.","--literal-pathspecs","add","--"] exited 123)
That old behavior was a surprise to me, and so I consider it a bug, and doubt
anyone would have relied on it.
Note that, when on an --hide-missing branch, it is possible to fromkey a key
that is not present (needs --force). The annex link or pointer file still gets
written in this case. It doesn't seem to make any sense not to write it,
because then fromkey would not do anything useful in this case, and this way
the file can be committed and synced to master, and the branch re-adjusted to
hide the new missing file.
This commit was sponsored by Noam Kremen on Patreon.
* Fix bug that could make git-annex importfeed not see recently recorded
state when configured with annex.alwayscommit=false.
* importfeed: Made "checking known urls" phase run 12 times faster.
The massive speedup is because it no longer queries for metadata
accompanying each url. Instead it processes the whole git-annex branch and
checks all metadata files for feed item ids, and uses any it finds.
This could result in a behavior change, in an unlikely situation: If a feed
id is recorded in a key's metadata, but the url gets removed, the old code
would not see that item id and would re-download it if it finds an url for
it in a feed, while the new code will see the item id. I don't think
the old behavior was intentional, and it may be that the new behavior is
better. Not gonna worry about this.
This adds a separate journal, which does not currently get committed to
an index, but is planned to be committed to .git/annex/index-private.
Changes that are regarding a UUID that is private will get written to
this journal, and so will not be published into the git-annex branch.
All log writing should have been made to indicate the UUID it's
regarding, though I've not verified this yet.
Currently, no UUIDs are treated as private yet, a way to configure that
is needed.
The implementation is careful to not add any additional IO work when
privateUUIDsKnown is False. It will skip looking at the private journal
at all. So this should be free, or nearly so, unless the feature is
used. When it is used, all branch reads will be about twice as expensive.
It is very lucky -- or very prudent design -- that Annex.Branch.change
and maybeChange are the only ways to change a file on the branch,
and Annex.Branch.set is only internal use. That let Annex.Branch.get
always yield any private information that has been recorded, without
the risk that Annex.Branch.set might be called, with a non-private UUID,
and end up leaking the private information into the git-annex branch.
And, this relies on the way git-annex union merges the git-annex branch.
When reading a file, there can be a public and a private version, and
they are just concacenated together. That will be handled the same as if
there were two diverged git-annex branches that got union merged.
When downloading content from a remote, if the content is able to be
verified during the transfer, skip checksumming it a second time.
Note that in this case, the fsck output does not include "(checksum)"
which it does when the checksumming is done separately from the download.
This commit was sponsored by Brock Spratlen on Patreon.
This uses a DebugSelector, rather than debug levels, which will allow
for a later option like --debug-from=Process to only
see debuging about running processes.
The module name that contains the thing being debugged is used as the
DebugSelector (in most cases; does not need to be a hard and fast rule).
Debug calls were changed to add that. hslogger did not display
that first parameter to debugM, but the DebugSelector does get
displayed.
Also fastDebug will allow doing debugging in places that are used in
tight loops, with the DebugSelector coming from the Annex Reader
essentially for free. Not done yet.
Values in AnnexRead can be read more efficiently, without MVar overhead.
Only a few things have been moved into there, and the performance
increase so far is not likely to be noticable.
This is groundwork for putting more stuff in there, particularly a value
that indicates if debugging is enabled.
The obvious next step is to change option parsing to not run in the
Annex monad to set values in AnnexState, and instead return a pure value
that gets stored in AnnexRead.
Not yet used, but allows getting the size of items in the tree fairly
cheaply.
I noticed that CmdLine.Seek uses ls-tree and the feeds the files into
another long-running process to check their size. That would be an
example of a place that might be sped up by using this. Although in that
particular case, it only needs to know the size of unlocked files, not
locked. And since enabling --long probably doubles the ls-tree runtime
or more, the overhead of using it there may outwweigh the benefit.
I don't think this was really intentional behavior. It may be that it was
useful to include it so it could be passed to rmurl, since without it rmurl
would not actually remove the url. Since that was changed earlier today,
now seems like a good time to clean up the display of these urls.
This commit was sponsored by Jochen Bartl on Patreon.
fsck: When --from is used in combination with --all or similar options, do
not verify required content, which can't be checked properly when operating
on keys.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
unregisterurl: Fix a bug that caused an url to not be unregistered when it
is claimed by a special remote other than the web.
See commit f175d4cc90 for rationalle.
* rmurl: When youtube-dl was used for an url, it no longer needs to be
prefixed with "yt:" in order to be removed.
* rmurl: If an url is both used by the web and also claimed by another
special remote, fix a bug that caused the url to to not be removed.
The youtube-dl change is a consequence of how the bug fix is implemented.
But I also think it's the right thing to do. Consider that, before,
git-annex addurl $url followed by git-annex rmurl $url would not remove the
url in the case where youtube-dl was used. That was surprising behavior.
In the unlikely case where a special remote claims an url, and it's been
added using OtherDownloader, but it was also added already as a web url,
it seems better for rmurl to remove both than to arbitrarily remove only one.
And in the case the bug report was filed for, when an url was added as a
web url, but a special remote now claims it, that should not prevent rmurl
removing the web url.
Calling setUrlMissing lets other callers of it behave differently.
Probably the calls to it in eg, Remote.External and Remote.BitTorrent are
fine, since they don't mangle the url and just remove what was provided,
and the OtherDownloader form of a bittorrent url, respectively.
I suspect unregisterurl needs to have a similar change made to rmurl, for
similar reasons.
Like import was using ActionItemWorkTreeFile, it's ok to use it for export,
even though it might not correspond with a file in the work tree.
And renamed it to ActionItemTreeFile to make that clearer.
Note that when an export has to rename files, it still uses
ActionItemOther, so file will still be null in that case, but as no file is
being transferred, that seems ok.
import: When the previously exported tree contained a submodule,
preserve it in the imported tree so it does not get deleted.
The export exclude log, which was used for non-preferred content,
now also includes the submodules. Since the log format is git ls-tree
output, this does not break backwards compatibility.