This is so git remotes on servers without git-annex installed can be used
to keep clients' git repos in sync.
This is a behavior change, but since annex-sync can be set to disable
syncing with a remote, I think it's acceptable.
Introduced a new per-remote option 'annex-rsync-transport' to specify
the remote shell that is to be used with rsync. When the value is
'ssh', connections are cached unless 'sshcaching' is unset.
Most remotes have meters in their implementations of retrieveKeyFile
already. Simply hooking these up to the transfer log makes that information
available. Easy peasy.
This is particularly valuable information for encrypted remotes, which
otherwise bypass the assistant's polling of temp files, and so don't have
good progress bars yet.
Still some work to do here (see progressbars.mdwn changes), but this
is entirely an improvement from the lack of progress bars for encrypted
downloads.
Unless highRandomQuality=false (or --fast) is set, use Libgcrypt's
'GCRY_VERY_STRONG_RANDOM' level by default for cipher generation, like
it's done for OpenPGP key generation.
On the assistant side, the random quality is left to the old (lower)
level, in order not to scare the user with an endless page load due to
the blocking PRNG waiting for IO actions.
* since this is a crippled filesystem anyway, git-annex doesn't use
symlinks on it
* so there's no reason to use the mixed case hash directories that we're
stuck using to avoid breaking everyone's symlinks to the content
* so we can do what is already done for all bare repos, and make non-bare
repos on crippled filesystems use the all-lower case hash directories
* which are, happily, all 3 letters long, so they cannot conflict with
mixed case hash directories
* so I was able to 100% fix this and even resuming `git annex add` in the
test case will recover and it will all just work.
There was confusion in different parts of the progress bar code about
whether an update contained the total number of bytes transferred, or the
number of bytes transferred since the last update. One way this bug
showed up was progress bars that seemed to stick at zero for a long time.
In order to fix it comprehensively, I add a new BytesProcessed data type,
that is explicitly a total quantity of bytes, not a delta.
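A minimal sketch of the idea (the real type may carry more instances):

    newtype BytesProcessed = BytesProcessed Integer
        deriving (Eq, Ord, Show)

    -- Progress callbacks always receive a running total, never a delta.
    type MeterUpdate = BytesProcessed -> IO ()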
Note that this doesn't necessarily fix every problem with progress bars.
Particularly, buffering can now cause progress bars to seem to run ahead
of transfers, reaching 100% when data is still being uploaded.
This got broken in commit e9238e9588.
I observed a key that had been copied to a remote, but the location
log was out of date, and due to this bug, git annex transferkey failed
and so the file could not be dropped when it was moved to an archive
directory.
Pass subcommand as a regular param, which allows passing git parameters
like -c before it. This was already done in the piping set of functions,
but not the command running set.
Pity that the library does not provide a function to extract the status
code from the StatusCodeException, so when they had to add a new field,
it broke every single place that pattern-matches the constructor apart.
In general, git-annex does not try to preserve file permissions. For
example, they don't round trip through special remotes. So it's ok to not
preserve them for git remotes either.
On crippled filesystems, rsync has been observed failing after the file
was transferred because it couldn't set some permission or other.
With an encrypted rsync remote, the encrypted file can be renamed, rather
than being copied, in crippled filesystem mode. This gets back to just as
fast as non-crippled mode for this very common case.
Cannot make a hard link, have to copy.
I did find a way to make it work without setting up a tree, just using
--include and --exclude. But it needs the same hash directories to be used
on both sides, which is normally not the case. Still, I hope one day I will
convert non-bare repos to use the same hash dirs as everything else, and
then this will get more efficient.
git annex init probes for crippled filesystems, and sets direct mode, as
well as `annex.crippledfilesystem`.
Avoid manipulating permissions of files on crippled filesystems.
That would likely cause an exception to be thrown.
Very basic support in Command.Add for crippled filesystems; avoids the lock
down entirely since doing it needs both permissions and hard links.
Will make this better soon.
However, I don't yet have a reliable way to deal with files being modified
while they're being transferred. I have code that detects it on the sending
side, but the receiver is still free to move the wrong content into its
annex, and record that it has the content. So that's not acceptable, and
I'll need to work on it some more.
However, at this point I can use a direct mode repository as a remote and
transfer files from and to it.
Higher than any other remote; this is mostly due to the long retrieval
time, so it'd make sense to get a file from nearly any other remote
(unless it's behind a very slow connection).
Ensure that each file has something written to it, even if the bytestring
chunk size is greater than the configured chunksize.
This means we may write a bit larger than the configured value, but only
when the configured value is very small; ie, < 8 kb.
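A sketch of how such splitting can work, operating on the underlying
chunks of a lazy ByteString (function name illustrative):

    import qualified Data.ByteString as S
    import qualified Data.ByteString.Lazy as L

    -- Split into pieces of at most the configured size, but always put
    -- at least one chunk into each piece, so every file gets something
    -- written even when a single chunk exceeds the configured size.
    splitchunks :: Int -> L.ByteString -> [[S.ByteString]]
    splitchunks sz = go . L.toChunks
      where
        go [] = []
        go (c:cs) = piece : go rest
          where
            (piece, rest) = accum (S.length c) [c] cs
        accum _ acc [] = (reverse acc, [])
        accum n acc (c:cs)
            | n + S.length c > sz = (reverse acc, c:cs)
            | otherwise = accum (n + S.length c) (c:acc) cs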
Files are now written to a tmp directory in the remote, and once all
chunks are written, etc, it's moved into the final place atomically.
For now, checkpresent still checks every single chunk of a file, because
the old method could leave partially transferred files with some chunks
present and others not.
Both the directory and webdav special remotes used to have to buffer
the whole file contents before it could be decrypted, as they read
from chunks. Now the chunks are streamed through gpg with no buffering.
This allows deleting all chunks for a file with a single http command,
so it's a win after all.
However, this does not look in the mixed case hash directories, which were
in the past used by the directory, etc remotes.
The benefit of using a compatible directory structure does not outweigh the
cost in complexity of handling the multiple locations content can be stored
in directory special remotes. And this also allows doing away with the parent
directories, which can't be made unwritable in DAV, so have no benefit
there. This will save 2 http calls per file store.
But, kept the directory hashing, just in case.
bup 0.25 does not accept that; and bup split reads from stdin by
default if no file is given. I'm not sure what version of bup changed this.
This only affected bup special remotes that were encrypted.
Aka solve the github problem.
Note that it's possible the initial configlist will fail for some network
reason etc, and then the fetch succeeds. In this case, a usable remote gets
disabled. But it does print a message, and this only happens once per
remote, so that seems ok.
Rather than store decrypted creds in the environment, store them in the
creds cache file.
This way, a single git-annex can have multiple S3 remotes using different
creds.
When a transfer fails, the progress info can be used to intelligently
retry it. If the transfer managed to make some progress, but did not
fully complete, then there's a good chance that a retry will finish it
(or at least make more progress).
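A sketch of the policy, with the transfer and progress-reading actions
left abstract (both hypothetical):

    -- Retry a transfer as long as each failed attempt made progress.
    retry :: IO Bool -> IO Integer -> IO Bool
    retry transfer bytesComplete = go 0
      where
        go sofar = do
            ok <- transfer
            if ok
                then return True
                else do
                    sofar' <- bytesComplete
                    if sofar' > sofar
                        then go sofar'   -- progress was made; retry
                        else return False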
Easy!
Note that with an encrypted remote, rsync will be sending a little more
data than the key size, so displayed progress may get to 100% slightly
quicker than it should. I doubt this is a big enough effect to worry about.
cp is used here, but we can just watch the size of the destination file.
This commit made from within the ruins of an old mill, overlooking a
beautiful waterfall.
The current implementation parses rsync's output a character at a time, which
is hardly efficient. It could be sped up a lot by using hGetBufSome,
but that would require going really low-level, down to raw C style buffers
(good example of that here: http://users.aber.ac.uk/afc/stricthaskell.html)
But rsync doesn't output very much, so currently it seems ok.
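For reference, a rough sketch of that parsing; rsync --progress emits
lines ending in carriage returns, whose first number is the byte count
(number formatting varies between rsync versions):

    import Data.Char (isDigit)
    import System.IO (Handle, hGetChar, hIsEOF)

    -- Accumulate characters until the end of a progress line, then
    -- pull out the byte count and pass it to the update action.
    parseRsyncProgress :: Handle -> (Integer -> IO ()) -> IO ()
    parseRsyncProgress h update = go ""
      where
        go buf = do
            eof <- hIsEOF h
            if eof
                then return ()
                else do
                    c <- hGetChar h
                    if c == '\r' || c == '\n'
                        then do
                            maybe (return ()) update (parsebytes buf)
                            go ""
                        else go (buf ++ [c])
        parsebytes line = case reads (dropWhile (not . isDigit) line) of
            [(n, _)] -> Just n
            _ -> Nothing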
Transfer info files are updated when the callback is called, updating
the number of bytes transferred.
Left unused p variables at every place the callback should be used.
Which is rather a lot...
Turns out that recvkey already does this same check. This avoids a transfer
file being created for the download that never happened, which in turn
will avoid the assistant seeing that the download has finished, when no
transfer actually took place.
Currently only the web special remote is readonly, but it'd be possible to
also have readonly drives, or other remotes. These are handled in the
assistant by only downloading from them, and never trying to upload to
them.
This commit includes a paydown on technical debt incurred two years ago,
when I didn't know that it was bad to make custom Read and Show instances
for types. As the routes need Read and Show for Transfer, which includes a
Key, and writing my own Read instance for Key was not practical,
I had to finally clean that up.
So the compact Key read and show functions are now file2key and key2file,
and Read and Show are now derived instances.
Changed all code that used the old instances, compiler checked.
(There are a few places, particularly in Command.Unused and the test
suite, where the Show instance continues to be used for legitimate
comparisons; ie show key_x == show key_y (though really in a bloom filter).)
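Roughly, the compact form works like this (simplified; the real Key has
more fields):

    data Key = Key
        { keyName :: String
        , keyBackendName :: String
        , keySize :: Maybe Integer
        } deriving (Eq, Ord, Read, Show)  -- Read and Show now derived

    -- The compact serialization lives in an ordinary function instead
    -- of the Show instance; file2key is its inverse.
    key2file :: Key -> FilePath
    key2file k = keyBackendName k
        ++ maybe "" (\s -> "-s" ++ show s) (keySize k)
        ++ "--" ++ keyName k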
Make Utility.Process wrap the parts of System.Process that I use,
and add debug logging to them.
Also wrote some higher-level code that allows running an action
with handles to a process's stdin or stdout (or both), and checking
its exit status, all in a single function call.
As a bonus, the debug logging now indicates whether the process
is being run to read from it, feed it data, chat with it (writing and
reading), or just call it for its side effect.
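For example, the read-from-a-process case can look roughly like this
(illustrative, not the exact Utility.Process API):

    import System.Exit (ExitCode(..))
    import System.IO (Handle, hClose)
    import System.Process

    -- Run an action with a handle to the process's stdout, then wait
    -- and check the exit status, all in one call.
    withOutputHandle :: FilePath -> [String] -> (Handle -> IO a) -> IO (a, Bool)
    withOutputHandle cmd args action = do
        (_, Just hout, _, pid) <- createProcess (proc cmd args)
            { std_out = CreatePipe }
        r <- action hout
        hClose hout
        code <- waitForProcess pid
        return (r, code == ExitSuccess)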
Test suite now passes with -threaded!
I traced back all the hangs with -threaded to System.Cmd.Utils. It seems
it's just crappy/unsafe/outdated, and should not be used. System.Process
seems to be the cool new thing, so converted all the code to use it
instead.
In the process, --debug stopped printing commands it runs. I may try to
bring that back later.
Note that even SafeSystem was switched to use System.Process. Since that
was a modified version of code from System.Cmd.Utils, it needed to be
converted too. I also got rid of nearly all calls to forkProcess,
and all calls to executeFile, which I'm also doubtful about working
well with -threaded.
This *almost* works.
Along the way, I noticed that the --uuid parameter was being accidentally
passed after the --, so that has never been actually used by
git-annex-shell to verify it's running in the expected repository. Oops. Fixed.
In order to record a semi-useful filename associated with the key,
this required plumbing the filename all the way through to the remotes'
storeKey and retrieveKeyFile.
Note that there is potential for deadlock here, narrowly avoided.
Suppose the repos are A and B. A sends file foo to B, and at the same
time, B gets file foo from A. So, A locks its upload transfer info file,
and then locks B's download transfer info file. At the same time,
B is taking the two locks in the opposite order. This is only not a
deadlock because the lock code does not wait, and aborts. So one of A or
B's transfers will be aborted and the other transfer will continue.
Whew!
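A sketch of the non-blocking lock attempt, using the unix package's
fcntl locking (API as of the time; details illustrative):

    import Control.Exception (try)
    import System.IO (SeekMode(AbsoluteSeek))
    import System.Posix.IO

    -- Try to take an exclusive lock without blocking; setLock fails
    -- immediately if another process holds a conflicting lock.
    tryLock :: FilePath -> IO (Maybe Fd)
    tryLock f = do
        fd <- openFd f ReadWrite (Just 0o644) defaultFileFlags
        r <- try (setLock fd (WriteLock, AbsoluteSeek, 0, 0))
        case r :: Either IOError () of
            Left _ -> closeFd fd >> return Nothing
            Right _ -> return (Just fd)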
Not including such remotes turned out to have other consequences,
including annex-trustlevel git config being ignored. Instead, add guards
before each operation that might try to operate on such a repo.
Prelude.undefined error message was introduced by
bb4f31a0ee.
It seems best to filter out local repositories that cannot be accessed
from the list of remotes, rather than keeping them in and making
everything that uses the list have to deal with remotes that may have an
unknown
location.
Besides fixing the error message, this also makes unavailable local
remotes' names not be shown in various messages, including in git annex
status output.
Also, move --to an unavailable local repository now avoids some ugly
errors like "changeWorkingDirectory: does not exist".
This was shown redundantly for a tricky reason -- while it runs
inside a doSideAction block that would appear to suppress it,
the action being run is in a different state monad; for the remote,
and so the suppression doesn't work.
Always suppressing the message when committing to a local remote is
ok to do though -- it mirrors the /dev/nulling of the git annex shell commit
output. And it turns out that any time there is a git-annex branch state
change to commit on the remote, the local repo has also had a similar
change made, and so the message has been shown already.
The environment needs to override git-config. Changed when git config is
read, and avoid rereading it once it's been read.
chdir for both worktree settings.
Baked into the code was an assumption that a repository's git directory
could be determined by adding ".git" to its work tree (or nothing for bare
repos). That fails when core.worktree, or GIT_DIR and GIT_WORK_TREE are
used to separate the two.
This was attacked at the type level, by storing the gitdir and worktree
separately, so Nothing for the worktree means a bare repo.
A complication arose because we don't learn whether a repository is bare
until its configuration is read. So another Location type handles
repositories that have not had their config read yet. I am not entirely
happy with this being a Location type, rather than representing them
entirely separate from the Git type. The new code is not worse than the
old, but better types could enforce more safety.
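A sketch of the shape of the types (simplified from the real module):

    -- Nothing for the worktree means a bare repository.
    data RepoLocation
        = Local { gitdir :: FilePath, worktree :: Maybe FilePath }
        | LocalUnknown FilePath  -- path known, config not read yet
        | Url String
        deriving (Show, Eq)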
Added support for core.worktree. Overriding it with -c isn't supported
because it's not really clear what to do if a git repo's config is read, is
not bare, and is then overridden to bare. What is the right git directory
in this case? I will worry about this if/when someone has a use case for
overriding core.worktree with -c. (See Git.Config.updateLocation)
Also removed and renamed some functions like gitDir and workTree that
misused git's terminology.
One minor regression is known: git annex add in a bare repository does not
print a nice error message, but runs git ls-files in a way that fails
earlier with a less nice error message. This is because before --work-tree
was always passed to git commands, even in a bare repo, while now it's not.
annex.ssh-options, annex.rsync-options, annex.bup-split-options.
And adjust types to avoid the bugs that broke several config settings
recently. Now "annex." prefixing is enforced at the type level.
Rsync special remotes can be configured with shellescape=no to avoid shell
quoting that is normally done when using rsync over ssh. This is known to
be needed for certain rsync hosting providers (specifically
hidrive.strato.com) that use rsync over ssh but do not pass it through the
shell.
This option avoids gpg key distribution, at the expense of flexibility, and
with the requirement that all clones of the git repository be equally
trusted.
Continue using the key name as bup ref name, to preserve backwards
compatibility, unless it is an illegal git ref. In that case, use a sha256
of the key name instead.
getConfig got a remote-specific config, and this confusing name caused it
to be used a couple of places that only were interested in global configs.
Rename to getRemoteConfig and make getConfig only get global configs.
There are no behavior changes here, but remote.<name>.annex-web-options
never actually worked (and per-remote web options are an unlikely enough
use case that I didn't make it work), so fix the documentation for it.
openSUSE ships rsync with a patch adding SIP protocol support.
https://gist.github.com/2026167
With this patch, running rsync with no hostname parameter is apparently
supposed to list SIP hosts on the network. Practically, it does nothing
and exits 0.
git-annex uses rsync in a very special way to allow git-annex-shell to be
run on the remote host, and so did not need to specify a hostname, or a
file to transfer as an rsync parameter. So it sent ":", a degenerate case of
"host:file".
But the patched rsync cannot differentiate ":" (an empty host) from no
host parameter at all (surely a bug in the SIP patch).
Results were that getting files failed, as rsync seemed to succeed, but the
requested file failed to arrive. Also I think that sending files will
make git-annex think a file has been transferred to the remote when
really rsync does nothing.
The workaround for this buggy rsync patch is to use "dummy:" as the
hostname.
Locking is used, so that, if there are multiple git-annex processes
using a remote concurrently, the stop hook is only run by the last
process that uses it.
That was actually really easy. But, when getting a file from an encrypted
directory special remote, no meter can be shown, because the total file
size is not known.
Avoiding writing files larger than a specified size is useful for certain
things. For example, box.com has a file size limit of 100 mb. Could also
be useful on really crappy removable media.
Added Annex.cleanup, which is a general purpose interface for adding
actions to run at the end.
Remotes with the old git-annex-shell will commit every time, and have no
commit command, so hide stderr when running the commit command.
Now gitattributes are looked up, efficiently, in only the places that
really need them, using the same approach used for cat-file.
The old CheckAttr code seemed very fragile, in the way it streamed files
through git check-attr.
I actually found that cad8824852
was still deadlocking with ghc 7.4, at the end of adding a lot of files.
This should fix that problem, and avoid future ones.
The best part is that this removes withAttrFilesInGit and withNumCopies,
which were complicated Seek methods, as well as simplifying the types
for several other Seek methods that had a Backend tupled in.
If there's no Content-Length, or the key has no size, this check is not
done, but it should happen most of the time, and protect against web
content that has changed.
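The check amounts to this (a sketch, with both sizes optional):

    -- Verify the downloaded size only when both the key size and the
    -- Content-Length are known.
    sizeOk :: Maybe Integer -> Maybe Integer -> Bool
    sizeOk (Just keysize) (Just contentlength) = keysize == contentlength
    sizeOk _ _ = True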
Done by adding a oneshot mode, in which location log changes are written to
the journal, but not committed. Taking advantage of git-annex's existing
ability to recover in this situation.
This is used by git-annex-shell and other places where changes are made to
a remote's location log.
This reverts commit 6da40100c9.
On closer examination, this change is wrong. The bup special remote
can be configured with "buprepo=", which makes it use the default
~/.bup repo. This change makes it use a different temp dir each time,
which I'm sure would not be appreciated by anyone with that
configuration.
Bup insisting on creating ~/.bup even when using a different repo
does seem like a bug in *something*, but I'm leaning toward the bug
being in bup itself.
This drops the >>! and >>? with the nice low fixity. IfElse does have
undocumented >>=>>! and >>=>>? operators, but I deem that too fishy.
Anyway, using whenM and unlessM is easier; I sometimes mixed the operators
up.
Ssh connection caching is now enabled automatically by git-annex. Only one
ssh connection is made to each host per git-annex run, which can speed some
things up a lot, as well as avoiding repeated password prompts. Concurrent
git-annex processes also share ssh connections. Cached ssh connections are
shut down when git-annex exits.
Note: The rsync special remote does not yet participate in the ssh
connection caching.
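Connection caching of this sort is built on OpenSSH's connection
multiplexing; a sketch of the options involved (socket path illustrative):

    -- Options to share one master connection per host between
    -- git-annex processes, via a control socket.
    sshCachingOptions :: FilePath -> [String]
    sshCachingOptions socketfile =
        [ "-S", socketfile
        , "-o", "ControlMaster=auto"
        , "-o", "ControlPersist=yes"
        ]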
For a local git remote, can symlink the file.
For a git remote using rsync, can preseed any local content.
There are a few reasons to use fsck --from on a normal git remote.
One is if it's using gitosis or similar, and you don't have shell access
to run git annex locally. Another reason could be if you just want to
fsck certain files of a bare remote.
When moving a file to the remote failed, and partially transferred content
was left behind in the directory, re-running the same move would think it
succeeded and delete the local copy.
I reproduced data loss when moving files to a partition that was almost
full. Interrupting a transfer could have similar results.
Easily fixed by using a temp file which is then moved atomically into place
once the transfer completes.
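A sketch of the pattern (helper names illustrative):

    import Control.Monad (when)
    import System.Directory (createDirectoryIfMissing, renameFile)
    import System.FilePath ((</>), takeFileName)

    -- Transfer into a temp file, and rename into place only on
    -- success; rename is atomic on the same filesystem, so a partial
    -- transfer can never be mistaken for a completed one.
    storeAtomically :: FilePath -> FilePath -> (FilePath -> IO Bool) -> IO Bool
    storeAtomically tmpdir dest transfer = do
        createDirectoryIfMissing True tmpdir
        let tmp = tmpdir </> takeFileName dest
        ok <- transfer tmp
        when ok $ renameFile tmp dest
        return ok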
I've audited other calls to copyFileExternal, and other special remote
file transfer code; everything else seems to use temp files correctly
(rsync, git), or otherwise use atomic transfers (bup, S3).
With --fast, unavailable local remotes are filtered out of the fast set.
This way, if there are local remotes, --fast always acts only on them,
and if none are mounted, acts on nothing. This consistency is better
than --fast acting on different remotes depending on what's mounted.
Rsync is only run once, with include / exclude rules used to specify
exactly what to delete. This is faster, and avoids ugly error messages
from rsync, and doesn't fail if the content already got deleted somehow.
A crash on parsing was fixed a while ago. This adds support for fully
correctly parsing multiline git config values, using git config --null.
Since git-annex-shell configlist uses normal git config output, I left in
support for that too; the two forms of config output can be easily
identified by the parser. Since configlist only prints the annex.uuid
config, there's no risk of multiline values there, so no need to change it.
Needed due to this scenario: Bare repo origin is made, foo is cloned from it;
foo is initialized; a file is added to foo's annex; git annex move --to origin
Since the git-annex branch has not yet been pushed to origin, it doesn't
auto-initialize. When the content is sent to it, it's stored, but
the remote has NoUUID, and so nothing is logged in the location log.
Then the content is removed from the local repo, and git-annex has lost
track of it.
git annex fsck in origin will find the lost content, but let's not let this
happen. Content should only be sent to initialized remotes.
This cannot happen for non-local remotes, since git-annex-shell always
checks that the repo is initialized.
Directory special remotes will now always store keys in the lowercase name,
which avoids the complication of catching failures to create the mixed case
name.
Git remotes using http will now try the lowercase name first.
Supporting multiple directory hash types will allow converting to a
different one, without a flag day.
gitAnnexLocation now checks which of the possible locations have a file.
This means more statting of files. Several places currently use
gitAnnexLocation and immediately check if the returned file exists;
those need to be optimised.
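A sketch of the lookup (names illustrative):

    import Control.Monad (filterM)
    import Data.Maybe (fromMaybe, listToMaybe)
    import System.Directory (doesFileExist)

    -- Check each possible location for the content, preferring the
    -- first; fall back to the preferred location when none exist.
    locateContent :: FilePath -> [FilePath] -> IO FilePath
    locateContent preferred alternates = do
        found <- filterM doesFileExist (preferred : alternates)
        return $ fromMaybe preferred (listToMaybe found)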