While writing this documentation, I realized that there needed to be a way
to stay in a view like tag=* while adding a filter like tag=work that
applies to the same field.
So, there are really two ways a view can be refined. It can have a new
"field=explicitvalue" filter added to it, which does not change the
"shape" of the view, but narrows the files it shows.
Or, it can have a new view added, which adds another level of
subdirectories.
So, added a vfilter command, which takes explicit values to add to the
filter, and rejects changes that would change the shape of the view.
And, made vadd only accept changes that change the shape of the view.
And, changed the View data type slightly; now components that can match
multiple metadata values can be visible, or not visible.
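A rough sketch of the two kinds of refinement, in Python rather than git-annex's Haskell. The Component type, the shape function, and the exact rules below are assumptions made for illustration, not the real View data type:

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Component:
    field: str
    values: Tuple[str, ...]   # explicit values, or ("*",) to match anything
    visible: bool             # visible components add a subdirectory level

def shape(view):
    """The view's "shape": the fields that become subdirectory levels."""
    return tuple(c.field for c in view if c.visible)

def vfilter(view, field, value):
    """Narrow the view to an explicit value without changing its shape."""
    if "*" in value:
        raise ValueError("vfilter needs an explicit value; use vadd")
    for i, c in enumerate(view):
        if c.field == field:
            # Keep the component's visibility, so the directory level stays.
            narrowed = replace(c, values=(value,))
            return view[:i] + (narrowed,) + view[i + 1:]
    # A new explicit filter narrows the files but adds no directory level.
    return view + (Component(field, (value,), visible=False),)

def vadd(view, field, glob):
    """Add a new level of subdirectories; reject anything else."""
    if field in shape(view):
        raise ValueError("vadd would not change the shape; use vfilter")
    return view + (Component(field, (glob,), visible=True),)
```

Here vfilter on tag=work inside a tag=* view keeps the single "tag" directory level, which is why visibility has to be tracked separately from the values.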
This commit was sponsored by Stelian Iancu.
So the user can now switch to a view and then move files around within it
to manage metadata. For example, moving a file into a new directory
when in the tag=* view adds a tag to it.
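As an illustration of how a location in a view can imply metadata, here is a minimal Python sketch. It assumes each directory level of the view corresponds to one visible field, in order; the function names are hypothetical and the real code works from git's diff output instead:

```python
from pathlib import PurePosixPath

def metadata_from_view_path(visible_fields, path):
    """Map each directory level of a view path to a (field, value) pair."""
    dirs = PurePosixPath(path).parts[:-1]   # drop the filename
    if len(dirs) != len(visible_fields):
        raise ValueError("path does not match the shape of the view")
    return set(zip(visible_fields, dirs))

def added_metadata(visible_fields, old_path, new_path):
    """Metadata implied by the new location but not by the old one."""
    return (metadata_from_view_path(visible_fields, new_path)
            - metadata_from_view_path(visible_fields, old_path))
```

For a file moved from work/report.pdf to urgent/report.pdf in a view on the tag field, added_metadata yields the single pair ("tag", "urgent").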
Implementation is fairly efficient. One diff-index, which is no more
expensive than the first stage of a git commit, followed by possibly
some cat-file --batch traffic to find the key (when deleting a file).
Very similar to what's done in direct mode when committing. And like
direct mode when updating the WC after a merge, it has to buffer the
diff-tree values in order to make 2 passes over them.
When not in a view, pre-commit now does one extra git symbolic-ref,
which is tiny overhead.
This commit was sponsored by Andrew Eskridge.
I was careful to write the code so it's clear how laziness memoizes it,
although it's likely that much less explicit currying would have had
the same effect. Verified that the memoization works using a Debug.Trace.
Removed instance, got it all to build using fromRef. (With a few things
that really need to show something using a ref for debugging stubbed out.)
Then added back Read instance, and made Logs.View use it for serialization.
This changes the view log format.
(And a vpop command, which is still a bit buggy.)
Still need to do vadd and vrm, though this also adds their documentation.
Currently not very happy with the view log data serialization. I had to
lose the TDFA regexps temporarily, so I can have Read/Show instances of
View. I expect the view log format will change in some incompatible way
later, probably adding last known refs for the parent branch to View
or something like that.
Anyway, it basically works, although it's a bit slow looking up the
metadata. The actual git branch construction is about as fast as it can be
using the current git plumbing.
This commit was sponsored by Peter Hogg.
Promising work toward metadata driven filter branches. A few methods
to construct them are stubbed out; all the data types and pure code
seem good.
This commit was sponsored by Walter Somerville.
Adds metadata log, and command.
Note that unsetting field values seems to currently be broken.
And in general this has had all of 2 minutes worth of testing.
This commit was sponsored by Julien Lefrique.
ef24751922 described a bug moving between
remotes in direct mode; I can no longer reproduce it with this strange
workaround removed. Also test suite still passes. Hope the broken code just
got fixed in the meantime.
Potentially fixes some FD leak if an action on an opened file handle fails
for some reason. There have been some hard to reproduce reports of
git-annex leaking FDs, and this may solve them.
Seems that locking of annexed objects when they're being dropped was broken
in direct mode:
* When taking the lock before dropping, it created the .git/annex/objects
file, as an empty file. It seems that the dropping code deleted that,
but that is not right, and for all I know could in some situation cause
a corrupted object to leak out.
* When the lock was checked, it actually tried to open each direct mode
file, and checked if it was locked. Not the same lock used above, and
could also fail if some consumer of the file locked it.
Fixed this, and added windows support by switching direct mode to lock a
.lck file.
Several places assumed this would not happen, and when the AssociatedFile
was Nothing, did nothing.
As part of this, preferred content checks pass the Key around.
Note that checkMatcher is sometimes now called with Just Key and Just File.
It currently constructs a FileMatcher, ignoring the Key. However, if it
constructed a FileKeyMatcher, which contained both, then it might be
possible to speed up parts of Limit, which currently call the somewhat
expensive lookupFileKey to get the Key.
I have not made this optimisation yet, because I am not sure if the key is
always the same. Will need some significant checking to satisfy myself
that's the case..
Checking .gitattributes adds a full minute to a git annex find looking for
files that don't have enough copies. 2:25 increases to 3:27. I feel this is
too much of a slowdown to justify making it the default. So, exposed two
versions of the preferred content expression, a slow one and a fast but
approximate one.
I'm using the approximate one in the default preferred content expressions
to avoid slowing down the assistant.
* Add numcopiesneeded preferred content expression.
* Client, transfer, incremental backup, and archive repositories
now want to get content that does not yet have enough copies.
This means the assistant will make copies of files that don't yet
meet the configured numcopies, even to places that would not normally want
the file.
For example, if numcopies is 4, and there are 2 client repos and
2 transfer repos, and 2 removable backup drives, the file will be sent
to both transfer repos in order to make 4 copies. Once a removable drive
gets a copy of the file, it will be dropped from one transfer repo or the
other (but not both).
Another example, numcopies is 3 and there is a client that has a backup
removable drive and two small archive repos. Normally once one of the small
archives has a file, it will not be put into the other one. But, to satisfy
numcopies, the assistant will duplicate it into the other small archive
too, if the backup repo is not available to receive the file.
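The arithmetic behind these examples can be sketched in Python as below. This is a simplification with hypothetical function names; the real preferred content expressions take more into account than a single "normally wanted" flag:

```python
def copies_needed(numcopies, present_copies):
    """How many more copies are needed to satisfy numcopies."""
    return max(0, numcopies - present_copies)

def wants_content(normally_wanted, numcopies, present_copies):
    """A repository wants a file if it normally would, or if copies are lacking."""
    return normally_wanted or copies_needed(numcopies, present_copies) > 0

def may_drop(normally_wanted, numcopies, present_copies):
    """A copy not normally wanted here can go once numcopies holds without it."""
    return (not normally_wanted) and present_copies - 1 >= numcopies
```

In the first example, with numcopies 4 and only the 2 clients holding the file, a transfer repo wants it. Once a removable drive makes it 5 copies, one transfer repo may drop (5 - 1 >= 4), after which the other may not (4 - 1 < 4).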
I notice that these examples are fairly unlikely setups .. the old behavior
was not too bad, but it's nice to finally have it really correct.
.. Almost. I have skipped checking the annex.numcopies .gitattributes
out of fear it will be too slow.
This commit was sponsored by Florian Schlegel.
* numcopies: New command, sets global numcopies value that is seen by all
clones of a repository.
* The annex.numcopies git config setting is deprecated. Once the numcopies
command is used to set the global number of copies, any annex.numcopies
git configs will be ignored.
* assistant: Make the prefs page set the global numcopies.
This global numcopies setting is needed to let preferred content
expressions operate on numcopies.
It's also convenient, because typically if you want git-annex to preserve N
copies of files in a repo, you want it to do that no matter which repo it's
running in. Making it global avoids needing to warn the user about gotchas
involving inconsistent annex.numcopies settings.
(See changes to doc/numcopies.mdwn.)
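The precedence described above amounts to something like the following Python sketch. The function name is hypothetical, and the fallback default of 1 is an assumption not stated in this message:

```python
def effective_numcopies(global_numcopies, git_config_numcopies, default=1):
    """Pick the numcopies value to honor.

    global_numcopies: value set with the numcopies command, or None if unset.
    git_config_numcopies: the deprecated annex.numcopies git config, or None.
    default: assumed fallback when neither is set.
    """
    if global_numcopies is not None:
        # Once the global value is set, annex.numcopies is ignored.
        return global_numcopies
    if git_config_numcopies is not None:
        return git_config_numcopies
    return default
```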
Added a new variety of git-annex branch log file, that holds only 1 value.
Will probably be useful for other stuff later.
This commit was sponsored by Nicolas Pouillard.
I've been disliking how the command seek actions were written for some
time, with their inversion of control and ugly workarounds.
The last straw to fix it was sync --content, which didn't fit the
Annex [CommandStart] interface well at all. I have not yet made it take
advantage of the changed interface though.
The crucial change, and probably why I didn't do it this way from the
beginning, is to make each CommandStart action be run with exceptions
caught, and if it fails, increment a failure counter in annex state.
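The idea, sketched in Python rather than in the Annex monad; AnnexState and run_actions are hypothetical stand-ins for the real state and seek machinery:

```python
from dataclasses import dataclass

@dataclass
class AnnexState:
    failures: int = 0   # stand-in for the failure counter kept in annex state

def run_actions(state, actions):
    """Run every action, counting failures instead of aborting on the first."""
    for action in actions:
        try:
            action()
        except Exception:
            state.failures += 1
    # The overall command succeeds only if no action failed.
    return state.failures == 0
```

Because the counter lives in the state, the list of actions can be consumed one at a time and nothing else needs to be retained between them.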
So I finally remove the very first code I wrote for git-annex, which
was before I had exception handling in the Annex monad, and so ran outside
that monad, passing state explicitly as it ran each CommandStart action.
This was a real slog from 1 to 5 am.
Test suite passes.
Memory usage is lower than before, sometimes by a couple of megabytes, and
remains constant, even when running in a large repo, and even when
repeatedly failing and incrementing the error counter. So no accidental
laziness space leaks.
Wall clock speed is identical, even in large repos.
This commit was sponsored by an anonymous bitcoiner.
Similar to the assistant, this honors any configured preferred content
expressions.
I am not entirely happy with the implementation. It would be nicer if
the seek function returned a list of actions which included the individual
file gets and copies and drops, rather than the current list of calls to
syncContent. This would allow getting rid of the somewhat redundant display
of "sync file [ok|failed]" after the get/put display.
But, to do that, withFilesInGit would need to somehow be able to construct
such a mixed action list. And it would be less efficient than the current
implementation, which is able to reuse several values between eg get and
drop.
Note that currently this does not try to satisfy numcopies when
getting/putting files (numcopies are of course checked when dropping
files!) This makes it like the assistant, and unlike get --auto
and copy --auto, which do duplicate files when numcopies is not yet
satisfied. I don't know if this is the right decision; it only seemed to
make sense to have this parallel the assistant as far as possible to start
with, since I know the assistant works.
This commit was sponsored by Øyvind Andersen Holm.
This adds a http HEAD before the download is done. That was already the
case when the assistant was running, and it seems worth it to avoid filling
up the whole disk, like happened to my server today.
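A minimal Python sketch of the check the HEAD request makes possible. The function names and the reserve parameter are hypothetical, and treating an unknown size as acceptable is an assumption, not necessarily what git-annex does:

```python
import shutil

def free_space(path="."):
    """Bytes currently free on the filesystem holding path."""
    return shutil.disk_usage(path).free

def safe_to_download(content_length, free_bytes, reserve_bytes=0):
    """Decide from the size reported by HEAD whether the download fits on disk."""
    if content_length is None:
        return True   # assumption: the server gave no size, so nothing to check
    return content_length + reserve_bytes <= free_bytes
```

The cost is one extra round trip per URL, in exchange for refusing a download before any of it has been written.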
This allows a remote to store a piece of arbitrary state associated with a
key. This is needed to support Tahoe, where the file-cap is calculated from
the data stored in it, and used to retrieve a key later. Glacier also would
be much improved by using this.
GETSTATE and SETSTATE are added to the external special remote protocol.
Note that the state is left as-is even when a key is removed from a remote.
It's up to the remote to decide when it wants to clear the state.
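A toy Python model of the git-annex side of this exchange. It assumes the remote sends "SETSTATE Key Value" and "GETSTATE Key", that the reply to GETSTATE is a "VALUE" line, and that keys contain no spaces; StateStore is a hypothetical in-memory stand-in for the remote state log:

```python
class StateStore:
    """Per-key state kept on behalf of one remote (in memory, for illustration)."""

    def __init__(self):
        self._state = {}

    def handle(self, line):
        """Handle one message from the remote; return the reply line, if any."""
        cmd, _, rest = line.partition(" ")
        if cmd == "SETSTATE":
            key, _, value = rest.partition(" ")
            self._state[key] = value
            return None                      # assumed: no reply to SETSTATE
        if cmd == "GETSTATE":
            # Assumed: unset state is reported as an empty value.
            return "VALUE " + self._state.get(rest, "")
        raise ValueError("unsupported message: " + line)
```

Note that nothing here clears the state when a key is removed, matching the behavior described above.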
The remote state log, $KEY.log.rmt, is a UUID-based log. However,
rather than using the old UUID-based log format, I created a new variant
of that format. The new variant is more space efficient (since it lacks the
"timestamp=" hack), easier to parse (and the parser doesn't mess with
whitespace in the value), and avoids compatibility cruft in the old one.
This seemed worth cleaning up for these new files, since there could be a
lot of them, while before UUID-based logs were only used for a few log
files at the top of the git-annex branch. The transition code has also
been updated to handle these new UUID-based logs.
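To show why a format without the "timestamp=" suffix is easy to parse, here is a Python sketch. It assumes each line has the layout "timestamp uuid value", with the timestamp written as seconds followed by "s" and the value running verbatim to the end of the line; that layout and the function names are assumptions for illustration:

```python
def parse_line(line):
    """Split one log line into (timestamp, uuid, value), keeping the value verbatim."""
    timestamp, uuid, value = line.split(" ", 2)
    return float(timestamp.rstrip("s")), uuid, value

def parse_log(text):
    """Return the newest value recorded for each UUID."""
    latest = {}
    for line in text.splitlines():
        ts, uuid, value = parse_line(line)
        if uuid not in latest or ts >= latest[uuid][0]:
            latest[uuid] = (ts, value)
    return {uuid: value for uuid, (ts, value) in latest.items()}
```

Splitting at most twice from the left means any whitespace inside the value is left untouched.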
This commit was sponsored by Daniel Hofer.
The assistant's commit code also always avoids git commit, for simplicity.
Indirect mode sync still does a git commit -a to catch unstaged changes.
Note that this means that direct mode sync no longer runs the pre-commit
hook or any other hooks git commit might call. The git annex pre-commit
hook action for direct mode is however explicitly run. (The assistant
already ran git commit with hooks disabled, so no change there.)
0980f3dae6 broke support for local remotes
from direct mode repos, because the relative path was taken to be from the
gitdir, rather than from the work tree.
This works around horribleness in the Mavericks cpp, which falls over on
the #if when configure is running. Moving it avoids the file being built at
that point.
But it's also a location that makes sense..
Because that allowed writing to symlinks of files that are not present,
which followed the link and put bad content in an object location.
fsck: Fix up .git/annex/objects directory permissions.
This commit was sponsored by an anonymous bitcoin donor.
This works for both direct and indirect mode.
It may need some performance tuning.
Note that unlike git status, it only shows the status of the work tree, not
the status of the index. So only one status letter, not two .. and since
files that have been added and not yet committed do not differ between the
work tree and the index, they are not shown. Might want to add display of
the index vs the last commit eventually.
This commit was sponsored by an unknown bitcoin contributor, whose
contribution has been going up lately! ;)
Now that direct mode sets core.bare=true, git's normal prohibition about
pushing into the currently checked out branch doesn't work.
A simple fix for this would be an update hook which blocks the pushes..
but git hooks must be executable, and git-annex needs to be usable on eg,
FAT, which lacks x bits.
Instead, enabling direct mode switches the branch (eg master) to a special
purpose branch (eg annex/direct/master). This branch is not pushed when
syncing; instead any changes that git annex sync commits get written to
master, and it's pushed (along with synced/master) to the remote.
Note that initialization has been changed to always call setDirect,
even if it's just setDirect False for indirect mode. This is needed because
if the user has just cloned a direct mode repo, that nothing has synced
with before, it may have no master branch, and only an annex/direct/master.
Resulting in that branch being checked out locally too. Calling setDirect False
for indirect mode moves back out of this branch, to a new master branch,
and ensures that a manual "git push" doesn't push changes directly to
the annex/direct/master of the remote. (It's possible that the user
makes a commit w/o using git-annex and pushes it, but nothing I can do
about that really.)
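The branch naming scheme can be sketched in Python as follows. The function names are hypothetical; only the annex/direct/ and synced/ prefixes come from the description above:

```python
DIRECT_PREFIX = "annex/direct/"

def to_direct_branch(branch):
    """Name of the special purpose branch used while in direct mode."""
    if branch.startswith(DIRECT_PREFIX):
        return branch
    return DIRECT_PREFIX + branch

def from_direct_branch(branch):
    """Name of the ordinary branch a direct mode branch stands in for."""
    if branch.startswith(DIRECT_PREFIX):
        return branch[len(DIRECT_PREFIX):]
    return branch

def branches_to_push(current_branch):
    """Sync pushes the ordinary branch and its synced/ twin, never the direct one."""
    original = from_direct_branch(current_branch)
    return [original, "synced/" + original]
```

So with annex/direct/master checked out, a sync pushes master and synced/master.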
This commit was sponsored by Jonathan Harrington.
This guarantees that stopping an existing socket never fails.
This might be the route out of the mess of needing to worry about socket
lengths in general. However, it would need quite a lot of refactoring
to make every place in git-annex that runs ssh run it with a cwd that was
determined by the location of its connection caching socket. If this
wasn't already such a mess, I'd consider even the thought of that API a bad
idea..
The control socket path passed to ssh needs to be 17 characters shorter
than the maximum unix domain socket length, because ssh appends stuff to it
to make a temporary filename. Closes: #725512
Also, take the shorter of the relative and the absolute paths to the
socket. Typically the relative path will be a lot shorter (unless
deep inside a subdirectory of the repository), and so using it will
avoid flirting with the maximum safe socket lengths in more situations,
and so lead to less breakage if all my attempts at fixing this are
still buggy.
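A Python sketch of that selection. The limit of 100 is an assumed conservative value, since the real maximum is platform specific; lengths are counted in characters here for simplicity, and choose_socket_path is a hypothetical name:

```python
import posixpath

MAX_SOCKET_PATH = 100   # assumed conservative limit; the real one varies by platform
SSH_TEMP_SUFFIX = 17    # room ssh needs to append its temporary filename suffix

def choose_socket_path(socket_path, cwd):
    """Return the shorter of the relative and absolute socket paths, or None."""
    absolute = posixpath.normpath(posixpath.join(cwd, socket_path))
    relative = posixpath.relpath(absolute, cwd)
    best = min(relative, absolute, key=len)
    if len(best) > MAX_SOCKET_PATH - SSH_TEMP_SUFFIX:
        return None   # too long to be safe; the caller has to cope without it
    return best
```

From the top of a repository the relative form wins easily; from deep inside a subdirectory the run of "../" components can make the absolute form shorter.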
My implementation does not guard against double locking of the journal. But
it does ensure that the journal is always locked when operated on, by using
a type that is only produced by lockJournal, and which is required as a
parameter of all functions that operate on the journal.
Note that I had to add the fooStale functions for cases where it does not
make sense to lock the journal when querying it. I was more concerned about
ensuring that anything that modifies the journal is locked.
setJournalFile's implementation ensures that any query of the journal will
get one value or the other atomically, even if the journal is being changed
at the time.
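The witness idea, approximated in Python. The Haskell code gets this checked by the type system at compile time, whereas this sketch can only check at run time; all names here are hypothetical:

```python
import threading
from contextlib import contextmanager

class JournalLocked:
    """Witness that the journal lock is held; only lock_journal creates one."""
    _token = object()

    def __init__(self, token):
        if token is not JournalLocked._token:
            raise TypeError("use lock_journal() to obtain a JournalLocked")

_journal_lock = threading.Lock()
_journal = {}

@contextmanager
def lock_journal():
    # Not reentrant: like the real code, this does not guard against double locking.
    with _journal_lock:
        yield JournalLocked(JournalLocked._token)

def set_journal_file(witness, name, content):
    """Modify the journal; callers must present the lock witness."""
    if not isinstance(witness, JournalLocked):
        raise TypeError("journal must be locked before it is modified")
    _journal[name] = content

def get_journal_file_stale(name):
    """Query without locking; the result may be stale by the time it is used."""
    return _journal.get(name)
```

A caller writes `with lock_journal() as jl: set_journal_file(jl, name, content)`, and any function that takes the witness documents in its signature that it needs the lock.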
This may not strictly be needed -- the transition code bypasses the
journal. However, this ensures that the git-annex branch is only
committed with the journal locked. This will allow for further
improvements.
Overridable with --user-agent option.
Not yet done for S3 or WebDAV due to limitations of libraries used --
neither allows a user-agent header to be specified.
This commit sponsored by Michael Zehrer.
Since 006cf7976f was incomplete, not being
able to get the right mode of the file when the index differs from HEAD,
this is a final workaround. Only buffering the start of the file
in this case avoids leaking memory.
This does not prevent git-cat-file being asked to output the whole file,
which needs to be consumed, and can be slow. But this only happens in a
rare edge case.
Done using a mode witness, which ensures it's fixed everywhere.
Fixing catFileKey was a bear, because git cat-file does not provide a
nice way to query for the mode of a file and there is no other efficient
way to do it. Oh, for libgit2..
Note that I am looking at tree objects from HEAD, rather than the index.
Because git cat-file cannot show a tree object for the index.
So this fix is technically incomplete. The only cases where it matters
are:
1. A new large file has been directly staged in git, but not committed.
2. A file that was committed to HEAD as a symlink has been staged
directly in the index.
This could be fixed a lot better using libgit2.
The second commit had some bad refs which resulted in the race detection
code running. But that commit was unnecessary anyway, it only was there to
merge in the other refs.
Wrote nice pure transition calculator, and ugly code to stage its results
into the git-annex branch. Also had to split up several Log modules
that Annex.Branch needed to use, but that themselves used Annex.Branch.
The transition calculator is limited to looking at and changing one file at
a time. While this made the implementation relatively easy, it precludes
transitions that do stuff like deleting old url log files for keys that are
being removed because they are no longer present anywhere.
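The one-file-at-a-time restriction can be pictured with this Python sketch. The drop_uuid transition and both function names are hypothetical examples, not the transitions git-annex actually implements:

```python
def drop_uuid(uuid):
    """Hypothetical transition: remove log lines that mention a UUID."""
    def transition(filename, content):
        kept = [line for line in content.splitlines(keepends=True)
                if uuid not in line.split()]
        return "".join(kept)
    return transition

def run_transition(transition, files):
    """Apply a per-file transition; report only files whose content changes."""
    changes = {}
    for filename, content in files.items():
        # The transition sees exactly one file, so it cannot base its
        # decision on what any other file in the branch contains.
        new_content = transition(filename, content)
        if new_content != content:
            changes[filename] = new_content
    return changes
```

A transition of this shape is pure and easy to test, but deleting one file because of what another file says is outside what it can express.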