Add tuning, docs, etc.
Not sure if status is the right place to remote size.. perhaps unused
should report the size and also warn if it sees more keys than the bloom
filter allows?
Locking is used, so that, if there are multiple git-annex processes
using a remote concurrently, the stop hook is only run by the last
process that uses it.
To avoid commits of data to the git-annex branch after each command
is run, set annex.alwayscommit=false. Its data will then be committed
less frequently, when a merge or sync is done.
useful when adding hundreds of thousands of files on a system with plenty
of memory.
git add gets quite slow in such a large repository, so if the system has
more than the ~32 mb of memory the queue can use by default, it's a useful
optimisation to increase the queue size, in order to decrease the number
of times git add is run.
Can be used to specify what file the url is added to. This can be used to
override the default filename that is used when adding an url, which is
based on the url. Or, when the file already exists, the url is recorded as
another location of the file.
Fscking a remote is now supported. It's done by retrieving
the contents of the specified files from the remote, and checking them,
so can be an expensive operation.
(Several optimisations are possible, to speed it up, of course.. This is
the slow and stupid remote fsck to start with.)
Still, if the remote is a special remote, or a git repository that you
cannot run fsck in locally, it's nice to have the ability to fsck it.
If you have any directory special remotes, now would be a good time to
fsck them, in case you were hit by the data loss bug fixed in the
previous release!
This overrides the trust.log, and is overridden by the command-line trust
parameters.
It would have been nicer to have Logs.Trust.trustMap just look up the
configuration for all remotes, but a dependency loop prevented that
(Remotes depends on Logs.Trust in several ways). So instead, look up
the configuration when building remotes, storing it in the same forcetrust
field used for the command-line trust parameters.
This needs to run git log on the location log files to get at all past
versions of the file, which tends to be a bit slow.
It would be possible to make a version optimised for showing the location
logs for every key. That would only need to run git log once, so would be
faster, but it would need to process an enormous amount of data, so
would not speed up the individual file case.
In the future it would be nice to support log --format. log --json also
doesn't work right yet.
Dotfiles, and files inside dotdirs are not added by "git annex add" unless
the dotfile or directory is explicitly listed. So "git annex add ." will
add all untracked files in the current directory except for those in
dotdirs.
One reason for this is that it will make git-annex more usable with vcsh,
where you don't want "vcsh big annex add" to check in all the dotfiles
that are already versioned in other repositories.
(If you're using vcsh for repos that contain non-dotfiles, this won't help,
and you'll need to .gitignore such things, but this will cover the common
case.)
A more general reason why this seems like a good idea is the same reason ls
ignores dotfiles, just the unix convention that they are cruft that is kept
out of the way most of the time.
All the other git-annex commands still do deal with any dotfiles that do
get into the annex. This seemed right because if I've gone to the trouble
to add a dotfile, I will want "git annex get ." to get it along with
everything else.
It would be nice if command-specific options were supported. The first
difficulty is that which command is being called is not known until after
getopt; but that could be worked around by finding the first non-dashed
parameter. Storing the settings without putting them in the annex monad is
the next difficulty; it could perhaps be handled by making the seek stage
pass applicable settings into the start stage (and from there on to perform
as needed). But that still leaves a problem, what data type to use to
represent the options between getopt and seek?
Using Sets is the right thing; they have constant size lookup like my
SizeList, and logn insertation, which beats nub to death.
Runs faster than --fast mode did before, and gives accurate counts.
13 seconds total runtime with a warm cache in a repository with 40 thousand
keys.
find: Rather than only showing files whose contents are present, when used
with --exclude --copies or --in, displays all files that match the
specified conditions.
Note that this is a behavior change for find --exclude! Old behavior
can be gotten with find --in . --exclude=...
get, drop: Added --auto option, which decides whether to get/drop content
as needed to work toward the configured numcopies.
The problem with bundling it up in optimize was that I then found I wanted
to run an optmize that did not drop files, only got them. Considered adding
a --only-get switch to it, but that seemed wrong. Instead, let's make
existing subcommands optionally smarter.
Note that the only actual difference between drop and drop --auto is that
the latter does not even try to drop a file if it knows of not enough
copies, and does not print any error messages about files it was unable to
drop.
It might be nice to make get avoid asking git for attributes when not in
auto mode. For now it always asks for attributes.
This includes a generic JSONStream library built on top of Text.JSON
(somewhat hackishly).
It would be possible to stream out a single json document describing
all actions, but it's probably better for consumers if they can expect
one json document per line, so I did it that way instead.
Output from external programs used for transferring files is not
currently hidden when outputting json, which probably makes it not very
useful there. This may be dealt with if there is demand for json
output for --get or --move to be parsable.
The version, status, and find subcommands have hand-crafted output and
don't do json. The whereis subcommand needs to be modified to produce
useful json.
Backends are now only used to generate keys (and check them); they
are not arbitrary key-value stores for data, because it turned out such
a store is better modeled as a special remote. Updated docs to not
imply backends do more than they do now.
Sometimes I'm tempted to rename "backend" to "keytype" or something,
which would really be more clear. But it would be an annoying transition
for users, with annex.backends etc.
The tricky part about this is that to generate a key, the file must be
present already. Worked around by adding (back) an URL key type, which
is used for addurl --fast.
This allows eg, `git-annex -c annex.rsync-options=-6 get file`
The overridden git configs are not passed on to git plumbing commands
that are run. Perhaps someone will find a need to do that, but I don't yet
and it would require storing more state to know what config settings
have been overridden and need to be passed on.
get not honoring --from has surprised me a few times, so least surprise
suggests it should just behave like copy --from. This leaves the difference
between get and copy being that copy always requires the remote to copy
from, while get will decide whether to get a file from a key/value store or
a remote.
This takes advantage of the debug logging done by missingh, and I added
my own debug messages for executeFile calls. There are still some other
low-level ways git-annex runs stuff that are not shown by debugging,
but this gets most of it easily.