pull, sync: When operating on content, automatically hard link objects
that have been migrated.
Added annex.syncmigrations config that can be set to false to prevent
pull and sync from migrating object content.
I think that true is a good default for this config, because it avoids
users having to re-download migrated content or learning about migration.
But, some users will surely not like it, whether because it does take some
time (especially for the first git-annex branch scan when there is a long
history), or because they want to deal with it manually, or because their
filesystem doesn't support hard links and they don't want it to copy
objects.
Sponsored-by: k0ld on Patreon
This allows getting rid of the ugly and error prone handling of
"bag of bytes" String in Remote.Helper.Encryptable.
Avoiding breakage like that dealt with by commit
9862d64bf9
And allows converting Utility.Gpg to use ByteString for IO, which is
a welcome change.
Tested the new git-annex interoperability with old, using all 3
encryption= types.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
Note the use of fromString and toString from Data.ByteString.UTF8 dated
back to commit 9b93278e8a. Back then it
was using the dataenc package for base64, which operated on Word8 and
String. But with the switch to sandi, it uses ByteString, and indeed
fromB64' and toB64' were already using ByteString without that
complication. So I think there is no risk of such an encoding related
breakage.
I also tested the case that 9b93278e8a
fixed:
git-annex metadata -s foo='a …' x
git-annex metadata x
metadata x
foo=a …
In Remote.Helper.Encryptable, it was avoiding using Utility.Base64
because of that UTF8 conversion. Since that's no longer done, it can
just use it now.
This is groundwork for making special remotes like borg be skipped by
sync when on an offline drive.
Added AVAILABILITY UNAVAILABLE reponse and the UNAVAILABLERESPONSE extension
to the external special remote protocol. The extension is needed because
old git-annex, if it sees that response, will display a warning
message. (It does continue as if the remote is globally available, which
is acceptable, and the warning is only displayed at initremote due to
remote.name.annex-availability caching, but still it seemed best to make
this a protocol extension.)
The remote.name.annex-availability git config is no longer used any
more, and is documented as such. It was only used by external special
remotes to cache the availability, to avoid needing to start the
external process every time. Now that availability is queried as an
Annex action, the external is only started by sync (and the assistant),
when they actually check availability.
Sponsored-by: Nicholas Golder-Manning on Patreon
And annex.largefiles and annex.addunlocked.
Also git-annex matchexpression --explain explains why its input
expression matches or fails to match.
When there is no limit, avoid explaining why the lack of limit
matches. This is also done when no preferred content expression is set,
although in a few cases it defaults to a non-empty matcher, which will
be explained.
Sponsored-by: Dartmouth College's DANDI project
Currently it only displays explanations of options like --in and --copies.
In the future, it should explain preferred content expression evaluation
and other decisions.
The explanations of a few things could be better. In particular,
"standard" will just appear as-is (or as "!standard" if it doesn't
match), rather than explaining why the standard preferred content expression
for the group matches or not.
Currently as implemented, it goes to stdout, and so commands like
git-annex find that have custom output will not display --explain
information. Perhaps that should change, dunno.
Sponsored-by: Dartmouth College's DANDI project
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et all to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon
Split out two new commands, git-annex pull and git-annex push. Those plus a
git commit are equivilant to git-annex sync.
In a sense, git-annex sync conflates 3 things, and it would have been
better to have push and pull from the beginning and not sync. Although
note that git-annex sync --content is faster than a pull followed by a
push, because it only has to walk the tree once, look at preferred
content once, etc. So there is some value in git-annex sync in speed, as
well as user convenience.
And it would be hard to split out pull and push from sync, as far as the
implementaton goes. The implementation inside sync was easy, just adjust
SyncOptions so it does the right thing.
Note that the new commands default to syncing content, unless
annex.synccontent is explicitly set to false. I'd like sync to also do
that, but that's a hard transition to make. As a start to that
transition, I added a note to git-annex-sync.mdwn that it may start to
do so in a future version of git-annex. But a real transition would
necessarily involve displaying warnings when sync is used without
--content, and time.
Sponsored-by: Kevin Mueller on Patreon
Had to convert uninit to do everything that can error out inside a
CommandStart. This was harder than feels nice.
(Also, in passing, converted CommandCheck to use a data type, not a
weird number that it was not clear how it managed to be unique.)
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
For expire, the normal output is unchanged, but the --json output includes the uuid
in machine parseable form. Which could be very useful for this somewhat obscure
command. That needed ActionItemUUID to be implemented, which seemed like a lot
of work, but then ---
I had been going to skip implementing them for trust, untrust, dead, semitrust,
and describe, but putting the uuid in the json is useful information, it tells
what uuid git-annex picked given the input. It was not hard to support
these once ActionItemUUID was implemented.
Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
When a nonexistant file is passed to a command and --json-error-messages
is enabled, output a JSON object indicating the problem.
(But git ls-files --error-unmatch still displays errors about such files in
some situations.)
I don't like the duplication of the name of the command introduced by this,
but I can't see a great way around it. One way would be to pass the Command
instead.
When json is not enabled, the stderr is unchanged. This is necessary
because some commands like find have custom output. So dislaying
"find foo not found" would be wrong. So had to complicate things with
toplevelFileProblem having different output with and without json.
When not using --json-error-messages but still using --json, it displays
the error to stderr, but does display a json object without the error. It
does have an errorid though. Unsure how useful that behavior is.
Sponsored-by: Dartmouth College's Datalad project
This reverts commit a325524454.
Turns out this was predicated on an incorrect belief that json output
didn't already sometimes lack the "key" field. Since json output already
can when `giveup` was used, it seems unncessary to add a whole new
option for this.
Added a --json-exceptions option, which makes some exceptions be output in json.
The distinction is that --json-error-messages is for messages relating
to a particular ActionItem, while --json-exceptions is for messages that
are not, eg ones for a file that does not exist.
It's unfortunate that we need two switches with such a fine distinction
between them, but I'm worried about maintaining backwards compatability
in the json output, to avoid breaking anything that parses it, and this was
the way to make sure I didn't.
toplevelWarning is generally used for the latter kind of message. And
the other calls to toplevelWarning could be converted to showException. The
only possible gotcha is that if toplevelWarning is ever called after
starting acting on a file, it will add to the --json-error-messages of the
json displayed for that file and converting to showException would be a
behavior change. That seems unlikely, but I didn't convery everything to
avoid needing to satisfy myself it was not a concern.
Sponsored-by: Dartmouth College's Datalad project
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.
Sponsored-by: Brock Spratlen on Patreon
Added StringContainingQuotedPath, which is used for ActionItemOther.
In the process, checked every ActionItemOther for those containing
filenames, and made them use quoting.
Sponsored-by: Graham Spencer on Patreon
This is by no means complete, but escaping filenames in actionItemDesc does
cover most commands.
Note that for ActionItemBranchFilePath, the value is branch:file, and I
choose to only quote the file part (if necessary). I considered quoting the
whole thing. But, branch names cannot contain control characters, and while
they can contain unicode, git coes not quote unicode when displaying branch
names. So, it would be surprising for git-annex to quote unicode in a
branch name.
The find command is the most obvious command that still needs to be
dealt with. There are probably other places that filenames also get
displayed, eg embedded in error messages.
Some other commands use ActionItemOther with a filename, I think that
ActionItemOther should either be pre-sanitized, or should explicitly not
be used for filenames, so that needs more work.
When --json is used, unicode does not get escaped, but control
characters were already escaped in json.
(Key escaping may turn out to be needed, but I'm ignoring that for now.)
Sponsored-by: unqueued on Patreon
This turns out to be possible after all, because the old one decomposed
a unicode Char to multiple Word8s and encoded those. It should be faster
in some places, particularly in Git.Filename.encodeAlways.
The old version encoded all unicode by default as well as ascii control
characters and also '"'. The new one only encodes ascii control
characters by default.
That old behavior was visible in Utility.Format.format, which did escape
'"' when used in eg git-annex find --format='${escaped_file}\n'
So made sure to keep that working the same. Although the man page only
says it will escape "unusual" characters, so it might be able to be
changed.
Git.Filename.encodeAlways also needs to escape '"' ; that was the
original reason that was escaped.
Types.Transferrer I judge is ok to not escape '"', because the escaped
value is sent in a line-based protocol, which is decoded at the other
end by decode_c. So old git-annex and new will be fine whether that is
escaped or not, the result will be the same.
Note that when asked to escape a double quote, it is escaped to \"
rather than to \042. That's the same behavior as git has. It's
perhaps somehow more of a special case than it needs to be.
Sponsored-by: k0ld on Patreon
This speeds up a few things, notably CmdLine.Seek using Git.Filename
which uses decode_c and this avoids a conversion to String and back,
and probably the ByteString implementation of decode_c is also faster
for simple cases at least than the string version.
encode_c cannot be converted to ByteString (or if it did, it would have
to convert right back to String in order to handle unicode).
Sponsored-by: Brock Spratlen on Patreon
A repository can have a newline in its description due to being in a
directory containing a newline, or due to git-annex describe being
passed a string with a newline in it for some reason. Putting that
newline in uuid.log breaks its format.
So, escape the newline when it enters uuid.log, to \n
This is a one-way escaping, it is not converted back to a newline
when reading the log. If it were, commands like git-annex info and
whereis would display a multi-line description, which could be confusing
to read.
And, implementing roundtripping would necessarily cause problems if an
old version of git-annex were used to set a description that contained
whatever special character is used to escape the \n. Eg, a \ or if
it used the ! prefix before base64 data that is used in some other logs,
the ! character. Then the description set by the old git-annex would not
roundtrip.
There just doesn't seem to be any benefit of roundtripping newlines through,
so why bother? And, git often displays \n for newline when a filename
contains a newline, so git-annex doing it in this case seems sorta ok
by analogy to git.
(Some other git-annex logs can also have newlines put into them if the
user really wants to break git-annex. For example:
git-annex config annex.largefiles "foo
bar"
The full list is probably config.log, remote.log, group.log,
preferred-content.log, required-content.log,
group-preferred-content.log, schedule.log. Probably there is no
good reason to use a newline in any of these, and the breakage is
probably limited to the bad data the user put in not coming back out.
And users can write any garbage to log files themselves manually in any
case. So, I am not going to address all of those at this time. If a
problem such as this one with the newline in the repository path comes
up, it can be dealt with on a case by case basis.)
Sponsored-by: Dartmouth College's Datalad project
* view: New field?=glob and ?tag syntax that includes a directory "_"
in the view for files that do not have the specified metadata set.
* Added annex.viewunsetdirectory git config to change the name of the
"_" directory in a view.
When in a view using the new syntax, old git-annex will fail to parse the
view log. It errors with "Not in a view.", which is not ideal. But that
only affects view commands.
annex.viewunsetdirectory is included in the View for a couple of reasons.
One is to avoid needing to warn the user that it should not be changed when
in a view, since that would confuse git-annex. Another reason is that it
helped with plumbing the value through to some pure functions.
annex.viewunsetdirectory is actually mangled the same as any other view
directory. So if it's configured to something like "N/A", there won't be
multiple levels of directories, which would also confuse git-annex.
Sponsored-By: Jack Hill on Patreon
Use separate stages for download and upload. In the common case where
it downloads the file from one remote and then uploads to the other,
those are by far the most expensive operations, and there's a decent
chance the two remotes bottleneck on different resources.
Suppose it's being run with -J2 and a bunch of 10 mb files. Two threads
will be started both downloading from the src remote. They will probably
finish at the same time. Then two threads will be started uploading to
the dst remote. They will probably take the same time as well. Before
this change, it would alternate back and forth, bottlenecking on src and dst.
With this change, as soon as the two threads start uploading to dst, two
more threads are able to start, downloading from src. So bandwidth to
both remotes is saturated more often.
Other commands that use transferStages only send in one direction at a
time. So the worker threads for the other direction will sit idle, and
there will be no change in their behavior.
Sponsored-by: Dartmouth College's DANDI project
When importing from versioned remotes, fix tracking of the content of
deleted files.
Only S3 supports versioning so far, so only it was affected.
But, the draft import/export interface for external remotes also seemed to
need a change, so that versionedExport could be set.
It's hard to know what's a good default for this. But 1 mb seems way too
small, because it's very easy for a git pull or some similar operation
that we don't think of as using much space to use up 1 mb of space.
Most people would want to free up some space if a filesystem only had 100
mb free. But on a small VPS, it's probably not uncommon to have only 1 gb
free. So 1 gb is too large for annex.diskreserve.
While old 1 gb USB keys are around, it's unlikely that anyone is
relying on them to shuttle annex data around; it would be worth anyone's
time to upgrade to a 32 gb or larger cheap modern USB key ($5).
Sponsored-by: Kevin Mueller on Patreon
This partly fixes an issue where there are duplicate files in the
special remote, and the first file gets swapped with another duplicate,
or deleted. The swap case is fixed by this, the deleted case will need
other changes.
This makes retrieveExportWithContentIdentifier take a list of allowed
ContentIdentifier, same as storeExportWithContentIdentifier,
removeExportWithContentIdentifier, and
checkPresentExportWithContentIdentifier.
Of the special remotes that support importtree, borg is a special case
and does not use content identifiers, S3 I assume can't get mixed up
like this, directory certainly has the problem, and adb also appears to
have had the problem.
Sponsored-by: Graham Spencer on Patreon
This allows annex.dbdir to be set globally or always set to the same
value when needed. Each repository uses a subdirectory of it.
Sponsored-by: Dartmouth College's Datalad project
WIP: This is mostly complete, but there is a problem: createDirectoryUnder
throws an error when annex.dbdir is set to outside the git repo.
annex.dbdir is a workaround for filesystems where sqlite does not work,
due to eg, the filesystem not properly supporting locking.
It's intended to be set before initializing the repository. Changing it
in an existing repository can be done, but would be the same as making a
new repository and moving all the annexed objects into it. While the
databases get recreated from the git-annex branch in that situation, any
information that is in the databases but not stored in the branch gets
lost. It may be that no information ever gets stored in the databases
that cannot be reconstructed from the branch, but I have not verified
that.
Sponsored-by: Dartmouth College's Datalad project
This is intended for users who want to see what it would output in order to
eg, check if a file would be added to git or the annex. It is not intended
as a way for scripts to get information.
Sponsored-by: Dartmouth College's Datalad project
Added annex.alwayscompact setting which can be unset to speed up writes to
the git-annex branch in some cases.
Sponsored-by: Dartmouth College's DANDI project
--backend is no longer a global option, and is only accepted by commands
that actually need it.
Three commands that used to support backend but don't any longer are
watch, webapp, and assistant. It would be possible to make them support it,
but I doubt anyone used the option with these. And in the case of webapp
and assistant, the option was handled inconsistently, only taking affect
when the command is run with an existing git-annex repo, not when it
creates a new one.
Also, renamed GlobalOption etc to AnnexOption. Because there are many
options of this type that are not actually global (any more) and get
added to commands that need them.
Sponsored-by: Kevin Mueller on Patreon
None of the special remotes do it yet, but this lays the groundwork.
Added MustFinishIncompleteVerify so that, when an incremental verify is
started but not complete, it can be forced to finish it. Otherwise, it
would have skipped doing it when verification is disabled, but
verification must always be done when retrievin from export remotes
since files can be modified during retrieval.
Note that retrieveExportWithContentIdentifier doesn't support incremental
verification yet. And I'm not sure if it can -- it doesn't know the Key
before it downloads the content. It seems a new API call would need to
be split out of that, which is provided with the key.
Sponsored-by: Dartmouth College's Datalad project
Ignore annex.numcopies set to 0 in gitattributes or git config, or by
git-annex numcopies or by --numcopies, since that configuration would make
git-annex easily lose data. Same for mincopies.
This is a continuation of the work to make data only be able to be lost
when --force is used. It earlier led to the --trust option being disabled,
and similar reasoning applies here.
Most numcopies configs had docs that strongly discouraged setting it to 0
anyway. And I can't imagine a use case for setting to 0. Not that there
might not be one, but it's just so far from the intended use case of
git-annex, of managing and storing your data, that it does not seem like
it makes sense to cater to such a hypothetical use case, where any
git-annex drop can lose your data at any time.
Using a smart constructor makes sure every place avoids 0. Note that this
does mean that NumCopies is for the configured desired values, and not the
actual existing number of copies, which of course can be 0. The name
configuredNumCopies is used to make that clear.
Sponsored-by: Brock Spratlen on Patreon
Default to the number of CPU cores, which seems about optimal
on my laptop. Using one more saves me 2 seconds actually.
Better packing of workers improves speed significantly.
In 2 tests runs, I saw segfaulting workers despite my attempt
to work around that issue. So detect when a worker does, and re-run it.
Removed installSignalHandlers again, because I was seeing an
error "lost signal due to full pipe", which I guess was somehow caused
by using it.
Sponsored-by: Dartmouth College's Datalad project
In aeson 2.0, Text has been replaced by the Key type and HashMap by the
KeyMap interface. Accomodating this required adding some CPP in order to
still be able to compile with aeson < 2.0. The required changes were:
* Prevent Key from being re-exported by Utilities.Aeson, as it clashes
with git-annex's own Key type.
* Fix up convertion from String/Text to Key (or Text in aeson 1.*) in a
couple of places
* Import Data.Aeson.KeyMap instead of Data.HashMap.Strict, as they are
mostly API-compatible. insertWith needs to be replaced by unionWith,
however, as KeyMap lacks the former function.
annex.skipunknown now defaults to false, so commands like `git annex get foo*`
will not silently skip over files/dirs that are not checked into git.
Sponsored-by: Brock Spratlen on Patreon
v10 will run 1 year after the upgrade to v9, to give time for any v8
processes to die. Until that point, the v10 upgrade will be tried by
every process but deferred, so added support for deferring upgrades.
The upgrade prevention lock file that will be used by v10 is not yet
implemented, so it does not yet defer.
Sponsored-by: Dartmouth College's Datalad project
Capstone to this feature. Any transitions that have been performed on an
unmerged remote ref but not on the local git-annex branch, or vice-versa
have to be applied on the fly when reading files.
Sponsored-by: Dartmouth College's Datalad project
Improved support for using git-annex in a read-only repository, git-annex
branch information from remotes that cannot be merged into the git-annex
branch will now not crash it, but will be merged in memory.
To avoid this making git-annex behave one way in a read-only repository,
and another way when it can write, it's important that Annex.Branch.get
return the same thing (modulo log file compaction) in both cases.
This manages that mostly. There are some exceptions:
- When there is a transition in one of the remote git-annex branches
that has not yet been applied to the local or other git-annex branches.
Transitions are not handled.
- `git-annex log` runs git log on the git-annex branch, and so
it will not be able to show information coming from the other, not yet
merged branches.
- Annex.Branch.files only looks at files in the git-annex branch and not
unmerged branches. This affects git-annex info output.
- Annex.Branch.hs.overBranchFileContents ditto. Affects --all and
also importfeed (but importfeed cannot work in a read-only repo
anyway).
- CmdLine.Seek.seekFilteredKeys when precaching location logs.
Note use of Annex.Branch.fullname
- Database.ContentIdentifier.needsUpdateFromLog and updateFromLog
These warts make this not suitable to be merged yet.
This readonly code path is more expensive, since it has to query several
branches. The value does get cached, but still large queries will be
slower in a read-only repository when there are unmerged git-annex
branches.
When annex.merge-annex-branches=false, updateTo skips doing anything,
and so the read-only repository code does not get triggered. So a user who
is bothered by the extra work can set that.
Other writes to the repository can still result in permissions errors.
This includes the initial creation of the git-annex branch, and of course
any writes to the git-annex branch.
Sponsored-by: Dartmouth College's Datalad project
When non-concurrent git coprocesses have been started, setConcurrency
used to not stop them, and so could leak processes when enabling
concurrency, eg when forkState is called.
I do not think that ever actually happened, given where setConcurrency
is called. And it probably would only leak one of each process, since it
never downgrades from concurrent to non-concurrent.