Noticed that Semigroup instance of Map is not suitable to use
for MapLog. For example, it behaved like this:
ghci> parseTrustLog "foo 1 timestamp=10\nfoo 2 timestamp=11" <> parseTrustLog "foo X timestamp=12"
fromList [(UUID "foo",LogEntry {changed = VectorClock 11s, value = SemiTrusted})]
Which was wrong, it lost the newer DeadTrusted value.
Luckily, nothing used that Semigroup when operating on a MapLog. And this
provides a safe instance.
Sponsored-by: Graham Spencer on Patreon
CSV format so it can be fed into a program to graph it.
Note that dead repositories are not yet handled so their sizes show as
nonzero after they are marked dead.
Sponsored-By: k0ld on Patreon
This can take a lot of memory. I decided to violate the usual rule in
git-annex that it operate in constant memory no matter how many annexed
objects. In this case, it would be hard to be fast without using a big
map of the location logs. The main difficulty here is that there can be
many git-annex branches and it needs to display a consistent view at a
point in time, which means merging information from multiple git-annex
branches.
I have not checked if there are any laziness leaks in this code. It
takes 1 gb to run in my big repo, which is around what I estimated
before writing it.
2 options that are documented are not yet implemented.
Small bug: With eg --when=1h, it will display at 12:00 then 1:10 if the
next change after 12:59 is then. Then it waits until after 2:10 to
display the next change. It ought to wait until after 2:00.
Sponsored-by: Brock Spratlen on Patreon
Factored out overLocationLogs from CmdLine.Seek, which can calculate this
pretty fast even in a large repo. In my big repo, the time to run git-annex
info went up from 1.33s to 8.5s.
Note that the "backend usage" stats are for annexed files in the working
tree only, not all annexed files. This new data source would let that be
changed, but that would be a confusing behavior change. And I cannot
retitle it either, out of fear something uses the current title (eg parsing
the json).
Also note that, while time says "402108maxresident" in my big repo now,
up from "54092maxresident", top shows the RES constant at 64mb, and it
was 48mb before. So I don't think there is a memory leak. I tried using
deepseq to force full evaluation of addKeyCopies and memory use didn't
change, which also says no memory leak. And indeed, not even calling
addKeyCopies resulted in the same memory use. Probably the increased memory
usage is buffering the stream of data from git in overLocationLogs.
Sponsored-by: Brett Eisenberg on Patreon
Minor optimisation, but a win in every case, except for a couple where
it's a wash.
Note that replaceFile still takes a FilePath, because it needs to
operate on Chars to truncate unicode filenames properly.
Note that the use of s2w8 in genUUIDInNameSpace made it truncate unicode
characters. Luckily, genUUIDInNameSpace is only ever used on ASCII
strings as far as I can determine. In particular, git-remote-gcrypt's
gcrypt-id is an ASCII string.
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.
Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.
Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.
Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or would entagle updating two databases in a complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.
Sponsored-by: unqueued on Patreon
push: When on an adjusted branch, propagate changes to parent branch before
updating export remotes.
This is a somewhat redundant call to propigateAdjustedCommits, since it
also gets called at pushLocal time. That other one needs to come after
importing from importtree remotes though, and seekExportContent has to come
earlier, so I don't see a way to avoid doing it twice.
Note that git-annex sync also manages to avoid the problem, it's only
git-annex push that had the bug.
Sponsored-by: Leon Schuermann on Patreon
A misleading message was displayed in several cases.
If the user has run eg: git config
remote.push-win-remote.annex-tracking-branch 'adjusted/main(unlocked)'
That is not supported, and now it will tell them it's not a valid
configuration. A user reported doing that, but I don't know if it's a
common point of confusion. If it is a common problem, a better message
would be possible, or it could convert back from the adjusted branch to
the actual branch.
Sponsored-by: Graham Spencer on Patreon
Removed the prior code that checked for keys used by current versions of
the files being acted on. It is redundant with the associated files
check (so long as the associated files database is always up-to-date,
which reconcileStaged should accomplish).
Sponsored-by: Luke T. Shumaker on Patreon
The tricky thing about this turned out to be handling renames and reverts.
For that, it has to make two passes over the git log, and to avoid
buffering a possibly huge amount of logs in memory (ie the whole git log of
an entire repository!), runs git log twice.
(It might be possible to speed this up by asking git log to show a diff,
and so avoid needing to use catKey.)
Sponsored-By: Brock Spratlen on Patreon
Removed the dontCheck repoExists, because running it in a repo that has not
been initialized yet would update location log with nouuid. And I guess
it's ok for it to only support running in git-annex repos.
This is groundwork for making special remotes like borg be skipped by
sync when on an offline drive.
Added AVAILABILITY UNAVAILABLE reponse and the UNAVAILABLERESPONSE extension
to the external special remote protocol. The extension is needed because
old git-annex, if it sees that response, will display a warning
message. (It does continue as if the remote is globally available, which
is acceptable, and the warning is only displayed at initremote due to
remote.name.annex-availability caching, but still it seemed best to make
this a protocol extension.)
The remote.name.annex-availability git config is no longer used any
more, and is documented as such. It was only used by external special
remotes to cache the availability, to avoid needing to start the
external process every time. Now that availability is queried as an
Annex action, the external is only started by sync (and the assistant),
when they actually check availability.
Sponsored-by: Nicholas Golder-Manning on Patreon
Fix behavior when importing a tree from a directory remote when the
directory does not exist. An empty tree was imported, rather than the
import failing. Merging that tree would delete every file in the
branch, if those files had been exported to the directory before.
The problem was that dirContentsRecursive returned [] when the directory
did not exist. Better for it to throw an exception. But in commit
74f0d67aa3 back in 2012, I made it never
theow exceptions, because exceptions throw inside unsafeInterleaveIO become
untrappable when the list is being traversed.
So, changed it to list the contents of the directory before entering
unsafeInterleaveIO. So exceptions are thrown for the directory. But still
not if it's unable to list the contents of a subdirectory. That's less of a
problem, because the subdirectory does exist (or if not, it got removed
after being listed, and it's ok to not include it in the list). A
subdirectory that has permissions that don't allow listing it will have its
contents omitted from the list still.
(Might be better to have it return a type that includes indications of
errors listing contents of subdirectories?)
The rest of the changes are making callers of dirContentsRecursive
use emptyWhenDoesNotExist when they relied on the behavior of it not
throwing an exception when the directory does not exist. Note that
it's possible some callers of dirContentsRecursive that used to ignore
permissions problems listing a directory will now start throwing exceptions
on them.
The fix to the directory special remote consisted of not making its
call in listImportableContentsM use emptyWhenDoesNotExist. So it will
throw an exception as desired.
Sponsored-by: Joshua Antonishen on Patreon
Only display warning when git-annex sync (without --content or
--no-content) is used with repositories that have preferred content
configured.
Sponsored-by: Leon Schuermann on Patreon
I considered a more wide-ranging config option to make other commands
also show dead repositories. But it would be difficult to implement that
because Remote.keyLocations is used to get locations, filtering out dead
repos, and commands like get then try to use those locations. So a config
setting would make dead repos sometimes be acted on by commands.
Sponsored-by: unqueued on Patreon
And annex.largefiles and annex.addunlocked.
Also git-annex matchexpression --explain explains why its input
expression matches or fails to match.
When there is no limit, avoid explaining why the lack of limit
matches. This is also done when no preferred content expression is set,
although in a few cases it defaults to a non-empty matcher, which will
be explained.
Sponsored-by: Dartmouth College's DANDI project
importfeed bug fix: When -J was used with multiple feeds, some feeds did
not get their items downloaded.
In my case, I had added a feed to the end of the list, and no items from it
were ever downloaded.
Sponsored-by: Leon Schuermann on Patreon
This causes changes to the original branch to get merged with a single
sync. Before, it took 2 syncs; the first happened to update the synced/
branch, and the second merged changes from the synced/ branch into the
ajusted branch.
Using mergeToAdjustedBranch when tomerge == origbranch is probably
overkill, but it does work fine.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
(And allow it to be used in the --template although that seems unlikely to
be very useful.)
My use case for this is that one of the podcast feeds I subscribe to is
sometimes leaking episodes of some other podcast. The other podcast is also
very close to spam, so this may be a form of intentional spamming. I have
not been able to catch the podcast feed containing those episodes, so I
don't know which one is at fault. So putting this in the metadata will let
me eventually catch it.
Command.Add.seek starts concurrency with CommandStages. And for
Command.Sync, it needs TransferStages. So, to get both types of concurrency
for the two different parts, it either needs to change the type of
concurrency in between, or just call startConcurrency once for each.
It seems safe enough to call startConcurrency twice, because it does shut
down concurrency (mostly) at the end, and eg the old Annex.workers get
emptied.
Sponsored-by: unqueued on Patreon
This ended up having an interface like sync, rather than like get/copy/drop.
That let it be implemented in terms of sync, which took a lot less code.
Also, it lets it handle many of the edge cases that sync does, such as
getting files that are not visible in a --hide-missing branch, and sending
files to exporttree remotes.
As well as being easier to implement, `git-annex satisfy myremote` makes
sense as it satisfies the preferred content settings of the remote.
`git-annex satisfy somefile` does not form a sentence that makes sense. So
while -C can be a little bit annoying, it still makes sense to have this
syntax.
Note that, while I initially thought this would also satisfy numcopies, it
does not. Arguably it ought to. But, sync does not send files in order to
satisfy numcopies, it only sends files to satisfy preferred content. And
it's important that this transfer the same files as sync does, because
it will probably be used in a workflow where the user sometimes syncs and
sometimes satisfies, and does not expect satisfy to do things that sync
would not do.
(Also opened a new bug that also affects sync et all, not only this command.)
Sponsored-by: Nicholas Golder-Manning on Patreon
This was already possible, but it was rather hard to come up with the
complex shell command needed.
Note that the diff output starts with "diff a/... b/...".
I left off the "--git" because it's not a git format diff.
Rather than after it, which can leave one wondering what file it's
downloading.
youtubeDl should not ever return Right Nothing in normal operation,
becaause it's already asked youtube-dl if it supports the url. So it
would have to succeed at that, then not download any file, but also
exit successfully, in order for the new error message to display.
Also display the name of yt-dlp when using it.
Fixes a failure like this:
curl: (33) HTTP server doesn't seem to support byte ranges. Cannot resume.
That happens because the whole web page has already been downloaded
previously, and kept, so now addurl tries to download it, and curl asks the
server to resume from the last byte. And youtube.com can't, for whatever
stupid reason.
So, delete the temp file after determining that youtube-dl can be used.
That will fail, and it already exports whole trees.
f6dd34ca81 made it sync content with
import remotes, and if an import remote is also an export remote, that
caused this new failure mode.
Sponsored-by: Brock Spratlen on Patreon
* config: Added the --show-origin and --for-file options.
* config: Support annex.numcopies and annex.mincopies.
There is a little bit of redundancy here with other code elsewhere that
combines the various configs and selects which to use. But really only
for the special case of annex.numcopies, which is a git config that does
not override the annex branch setting and for annex.mincopies, which does
not have a git config but does have gitattributes settings as well as the
annex branch setting.
That seems small enough, and unlikely enough to grow into a mess that it was
worth supporting annex.numcopies and annex.mincopies in git-annex config
--show-origin. Because these settings are a prime thing that someone might
get confused about and want to know where they were configured.
And, it followed that git-annex config might as well support those two
for --set and --get as well. While this is redundant with the speclialized
commands, it's only a little code and it makes it more consistent.
Note that --set does not have as nice output as numcopies/mincopies
commands in some special cases like setting to 0 or a negative number.
It does avoid setting to a bad value thanks to the smart
constructors (eg configuredNumCopies).
As for other git-annex branch configurations that are not set by git-annex
config, things like trust and wanted that are specific to a repository
don't map to a git config name, so don't really fit into git-annex config.
And they are only configured in the git-annex branch with no local override
(at least so far), so --show-origin would not be useful for them.
Sponsored-by: Dartmouth College's DANDI project
This didn't used to be needed because importKeys would import all
content and so doing another pass was redundant.
But since 40017089f2 it uses
importChanges, so only new files are imported. If a file that was
already imported before was dropped, that would prevent sync --content
from gettng its content again.
Sponsored-by: Jack Hill on Patreon
Large speed up to importing trees from special remotes that contain a lot
of files, by only processing changed files.
Benchmarks:
Importing from a special remote that has 10000 files, that have all been
imported before, and 1 new file sped up from 26.06 to 2.59 seconds.
An import with no change and 10000 unchanged files sped up from 24.3 to
1.99 seconds.
Going up to 20000 files, an import with no changes sped up from
125.95 to 3.84 seconds.
Sponsored-by: k0ld on Patreon
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et all to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon