Been meaning to do this for some time; Android port was last straw.
Note that newer versions of the uuid library have a Data.UUID.V4 that
generates random UUIDs slightly more cleanly, but Debian has an old version
of the library, so I do it slightly round-about.
These files were left behind, and made getKeysPresent find keys that were
not present. It would be expensive to make getKeysPresent check that the
actual key files are present (it just lists the directories). But that's not
needed if we just clean up the stale cache and mapping files.
To handle systems that were in direct mode and got switched back with stale
direct mode files, made cleanObjectLoc remove all files in the key's directory.
git annex unused will still list keys that are gone but for which the stale
direct mode files exists. To deal with that, made dropunused remove the key's
directory even if the key does not seem to be present.
Making the pre-commit hook look at git diff-index to find changed direct
mode files and update the mappings works pretty well.
One case where it does not work is when a file is git annex added, and then
git rmed, and then this is committed. That's a no-op commit, so the hook
probably doesn't even run, and it certianly never notices that the file
was deleted, so the mapping will still have the original filename in it.
For this and other reasons, it's important that the mappings still be
treated as possibly inconsistent.
Also, the assistant now allows the pre-commit hook to run when in direct
mode, so the mappings also get updated there.
The most common way for a mapping to be stale is when a file was deleted,
or renamed. Nothing updates the mappings for deletions yet.
But they can also become stale in other ways. For example a file can
be modified.
So, the mapping is not trusted to be consistent. When we get a key,
only replace symlinks that still point to that key with its content.
When we drop a key, only put back symlinks for files that still have
the direct mode content.
New setting, can be used to disable autocommit of changed files by the
assistant, while it still does data syncing and other tasks.
Also wired into webapp UI
Union merges involving two or more repositories could sometimes result in
data from one repository getting lost. This could result in the location
log data becoming wrong, and fsck being needed to fix it.
NB: I audited for any other occurrences of this problem. There are other
places than union merge where multiple changes are fed into update-index
in a stream, but they all involve working copy files being staged, or their
deletion being staged, and in this case it's fine for the later changes
to override the earlier ones.
It used to not log to daemon.log when a repository was first created, and
when starting the webapp. Now both do. Redirecting stdout and stderr to the
log is tricky when starting the webapp, because the web browser may want to
communicate with the user. (Either a console web browser, or web.browser = echo)
This is handled by restoring the original fds when running the browser.
A while ago I added code to support recent versions of shakespeare-js,
(commit fe11b3a940). But it seems that resulted
in quoting of all strings inserted into javascript files, which means it's
now impossible to do the type of metaprogramming that longpolling.julius
relied on. I have found another way to accomplish the same thing without
needing to generate unique function names. Hopefully it's portable.
Opinion of shakespeare-js now at rock bottom. One of these days, this
needs to be redone to use Fay.
since some systems may have configuration problems or other issues that
prevent web browsers from connecting to the right localhost IP for the
webapp.
Tested on both ipv4 and ipv6 localhost. Url for the latter looks like:
http://[::1]:50676
The expensive scan uses lookupFile, but in direct mode, that doesn't work
for files that are present. So the scan was not finding things that are
present that need to be uploaded. (It did find things not present that
needed to be downloaded.)
Now lookupFile also works in direct mode. Note that it still prefers
symlinks on disk to info committed to git, in direct mode. This is
necessary to make things like Assistant.Threads.Watcher.onAddSymlink
work correctly, when given a new symlink not yet checked into git (or
replacing a file checked into git).
Would like to also have restart UI, but that's rather harder to do,
seems it'd need to start another copy of the webapp, and redirect the
browser to its new url, but running two assistants in the same repo at
the same time isn't good.
* SHA*E backends: Exclude non-alphanumeric characters from extensions.
* migrate: Remove leading \ in SHA* checksums, and non-alphanumerics
from extensions of SHA*E keys.
* Bugfix: Remove leading \ from checksums output by sha*sum commands,
when the filename contains \ or a newline. Closes: #696384
* fsck: Still accept checksums with a leading \ as valid, now that
above bug is fixed.
* migrate: Remove leading \ in checksums
The newline after the filename was included in it.
This was generally benign -- mostly these filenames are just displayed,
and the newline didn't matter.
But in the assistant, it caused unexpected dropping of preferred
content.
A characteristic of this bug is that the drop was displayed like this:
drop some_file
ok
* get/copy --auto: Transfer data even if it would exceed numcopies,
when preferred content settings want it.
* drop --auto: Fix dropping content when there are no preferred content
settings.
It was doubly broken; both missing a slash, and containing
"runshell git-annex", while some parts of the code expected it to be a
simple path to a program. This appears to include the transfer queue
runner, and the code that starts a new assistant process when switching to
another repository in the webapp.
Ensure that each file has something written to it, even if the bytestring
chunk size is greater than the configured chunksize.
This means we may write a bit larger than the configured value, but only
when the configured value is very small; ie, < 8 kb.
Files are now written to a tmp directory in the remote, and once all
chunks are written, etc, it's moved into the final place atomically.
For now, checkpresent still checks every single chunk of a file, because
the old method could leave partially transferred files with some chunks
present and others not.
Both the directory and webdav special remotes used to have to buffer
the whole file contents before it could be decrypted, as they read
from chunks. Now the chunks are streamed through gpg with no buffering.
Wrote a better git remote name sanitizer. Git blows up on lots of weird
stuff, especially if it starts the remote name, but I managed to get
some common punctuation working.
MountWatcher can't do this, because it uses the session dbus,
and won't have access to the new DBUS_SESSION_BUS_ADDRESS if a new session
is started.
Bumped dbus library version, FD leak in it is fixed.
For now, when dbus goes away, the assistant keeps running but does not fall
back or reconnect. To do so needs more changes to the DBus library; in
particular a connectSessionWith and connectSystemWith to let me specify
my own clientThreadRunner.
So, it might be called sha1sum, or on some other OS, it might be called
sha1. It might be hidden away off of PATH on that OS. That's just expected
insanity; UNIX has been this way since 1980's. And these days, nobody even
gives the flying flip about standards that we briefly did in the 90's
after the first round of unix wars.
But it's the 2010's now, and we've certainly learned something.
So, let's make it so sometimes sha1 is a crazy program that wants to run as
root so it can lock memory while prompting for a passphrase, and outputting
binary garbage. Yes, that'd be wise. Let's package that in major Linux
distros, too, so users can stumble over it.
bup 0.25 does not accept that; and bup split reads from stdin by
default if no file is given. I'm not sure what version of bup changed this.
This only affected bup special remotes that were encrypted.
Monitors git-annex branch for changes, which are noticed by the Merger
thread whenever the branch ref is changed (either due to an incoming push,
or a local change), and refreshes cached config values for modified config
files.
Rate limited to run no more often than once per minute. This is important
because frequent git-annex branch changes happen when files are being
added, or transferred, etc.
A primary use case is that, when preferred content changes are made,
and get pushed to remotes, the remotes start honoring those settings.
Other use cases include propigating repository description and trust
changes to remotes, and learning when a remote has added a new special
remote, so the webapp can present the GUI to enable that special remote
locally.
Also added a uuid.log cache. All other config files already had caches.
in= was problimatic in two ways. First, it referred to a remote by name,
but preferred content expressions can be evaluated elsewhere, where that
remote doesn't exist, or a different remote has the same name. This name
lookup code could error out at runtime. Secondly, in= seemed pretty useless.
in=here did not cause content to be gotten, but it did let present content
be dropped.
present is more useful, although "not present" is unstable and should be
avoided.
When in a subdir, both the normal filepath, and the filepath relative to
the top of the git repo are needed for matching. The former for key lookup,
and the latter for include/exclude to match against. Previously, key lookup
didn't work in this situation.
The old code was just wrong in taking fromPath of GIT_DIR -- that made an
localUnknown location with the GIT_DIR in it, which only worked by
accident, and failed in submodules.
Now that this is handled correctly, git-annex can be used in git submodules.
Also, fixed infelicity where Git.CurrentRepo and Git.Config.updateLocation
were both dealing with core.worktree. Now updateLocation handles it for
Local as well as for LocalUnknown repos.
Aka solve the github problem.
Note that it's possible the initial configlist will fail for some network
reason etc, and then the fetch succeeds. In this case, a usable remote gets
disabled. But it does print a message, and this only happens once per
remote, so that seems ok.
There was one forkProcess lurking in test.hs, and that seems to be
responsible for recent buildd failures on amd64 and armhf. I was able to
reproduce it pretty easily on amd64, and even once on i386, and it was
clearly that same bad old threaded runtime hang. So removing this
forkProcess should fix it. Odd that it lurked for some months before
popping up.
Setting GIT_INDEX_FILE clobbers the rest of the environment, making git
not read ~/.gitconfig, and blow up if GECOS didn't have a name for the
user.
I'm not entirely happy with getEnvironment being run every time now,
that's somewhat expensive. It may make sense to just set GIT_COMMITTER_*
and GIT_AUTHOR_*, but I worry that clobbering the rest could break PATH,
or GIT_PATH, or something else that might be used by a command run in here.
And caching the environment is not a good idea either; it can change..
webapp: Adds newly created repositories to one of these groups:
clients, drives, servers
This is heuristic, but it's a pretty good heuristic, and can always be
configured.
Both when queueing downloads, and uploads, consults the preferred content
settings.
I didn't make it check yet when requeing failed transfers or queuing
deferred downloads; dealing with the preferred content settings (or indeed,
other settings) changing while the assistant is running still needs work.
Incomplete; I need to finish parsing and saving. This will also be used
for editing transfer control expresssions.
Removed the group display from the status output, I didn't really
like that format, and vicfg can be used to see as well as edit rempository
group membership.
Makes it safe to use git annex unlock with the watcher/assistant.
And also to mix use of the watcher/assistant with regular files stored in git.
Long ago, I had avoided doing this check, except during the startup scan,
because it would be slow to run ls-files repeatedly.
But then I added the lsof check, and to make that fast, got it to detect
batch file adds. So let's move the ls-files check to also occur when it'll
have a batch, and can check them all with one call.
This does slow down adding a single file by just a bit, but really only
a little bit. (The lsof check is probably more expensive.) It also
speeds up the startup scan, especially when there are lots of new files
found by the scan.
Also, fixed the sleep for annex.delayadd to not run while the threadstate
lock is held, so it doesn't unnecessarily freeze everything else.
Also, --force no longer makes it skip the lsof check, which was not
documented, and seems never a good idea.
This fixes a problem I was seeing in the assistant where two remotes would
attempt to sync with one another at the same time, and both failed pushing
the diverged git-annex branch. Then when both tried to resolve the failed
push, they each modified their git-annex branch, which again each blocked
the other from pushing into it. The result was that the git-annex
branches were perpetually diverged (despite having the same content!) and
once the assistant fell into this trap, it couldn't get out and always
had to do the slow push/fail/pull/merge/push/fail cycle.
The default backend used when adding files to the annex is changed from
SHA256 to SHA256E, to simplify interoperability with OSX, media players,
and various programs that needlessly look at symlink targets.
To get old behavior, add a .gitattributes containing: * annex.backend=SHA256
Avoid crashing when "git annex get" fails to download from one location,
and falls back to downloading from a second location.
The problem is that git annex get calls download recursively from within
itself if the first download attempt fails. So the first time through, it
writes a transfer info file, which is then overwritten on the second,
recursive call. Then on cleanup, it tries to delete the file twice, which
of course doesn't work.
Fixed both by not crashing if the transfer file is removed, and by
changing Get to not run download recursively like that. It's the only
thing that did so, and it just seems like a bad idea.
This reverts commit abde98cda2.
Temporarily dropping from master, since this actually uses stuff
that's only currently availble in the assistant branch. Will come back when
I merge that, and can wait..
Using Crypto's version of the hashes would be another option.
I need to benchmark it. The SHA2 library (which provides SHA1 also,
confusing name) may be the fastest option, but is not currently in Debian.
This *almost* works.
Along the way, I noticed that the --uuid parameter was being accidentially
passed after the --, so that has never been actually used by
git-annex-shell to verify it's running in the expected repository. Oops. Fixed.
In order to record a semi-useful filename associated with the key,
this required plumbing the filename all the way through to the remotes'
storeKey and retrieveKeyFile.
Note that there is potential for deadlock here, narrowly avoided.
Suppose the repos are A and B. A sends file foo to B, and at the same
time, B gets file foo from A. So, A locks its upload transfer info file,
and then locks B's download transfer info file. At the same time,
B is taking the two locks in the opposite order. This is only not a
deadlock because the lock code does not wait, and aborts. So one of A or
B's transfers will be aborted and the other transfer will continue.
Whew!
Note this is per-remote, so trying to get the same file from multiple
remotes can still let duplicate downloads run. (And uploading the same file
to multiple remotes is not duplicate at all of course.)
get, move, and copy are the only git-annex subcommands that transfer
files, but there's still git-annex-shell recvkey and sendkey to deal with too.
I considered modifying retrieveKeyFile or getViaTmp, but they are called
by other code that does not involve expensive file transfers (migrate)
or that does file transfers that should not be checked by this (fsck --from).
Accept arbitrarily encoded repository filepaths etc when reading git config
output. This fixes support for remotes with unusual characters in their
names.
For example, a remote with a url of /tmp/çüş was previously
skipped, because the filename wasn't encoded right so it didn't think it
was available. And when setting the annex-uuid of a remote named "çüş",
it used to add it under a mis-encoded form of the remote's name. Both these
cases now work ok in my testing.
Prelude.undefined error message was introduced by
bb4f31a0ee.
It seems best to filter out local repositories that cannot be accessed
from the list of remotes, rather than keeping them in and making every
thing that uses the list have to deal with remotes that may have an unknown
location.
Besides fixing the error message, this also makes unavailable local
remotes' names not be shown in various messages, including in git annex
status output.
Also, move --to an unavailable local repository now avoids some ugly
errors like "changeWorkingDirectory: does not exist".
Was decoding the git-cat-file of the symlink target as utf8, but that can't
do, unix filenames are from the 70's and need this shiny disco
fileSystemEncoding.
Could not reproduce the build failure I had seen related to this,
but the numbers were wrong with statfs64. Probably pulling from the wrong
place in the structure. statvfs seems to work..
This ensures that all special remotes show up in git annex status.
Before, a special remote that was not manually described, and was not
a current git remote, did not show up there, although initremote did list
it.
Anything that tries to open the file for write, or delete the file,
or replace it with something else, will not affect the add.
Only if a process has the file open for write before add starts
can it still change it while (or after) it's added to the annex.
(fsck will catch this later of course)
Resetting an unlocked file to the branch head failed if it had just been
added, not committed, and unlocked, since the branch didbn't have it.
The code was concerned about dropping any changes that might be staged in the
index, but I cannot see why.
The environment needs to override git-config. Changed when git config is
read, and avoid rereading it once it's been read.
chdir for both worktree settings.
Baked into the code was an assumption that a repository's git directory
could be determined by adding ".git" to its work tree (or nothing for bare
repos). That fails when core.worktree, or GIT_DIR and GIT_WORK_TREE are
used to separate the two.
This was attacked at the type level, by storing the gitdir and worktree
separately, so Nothing for the worktree means a bare repo.
A complication arose because we don't learn where a repository is bare
until its configuration is read. So another Location type handles
repositories that have not had their config read yet. I am not entirely
happy with this being a Location type, rather than representing them
entirely separate from the Git type. The new code is not worse than the
old, but better types could enforce more safety.
Added support for core.worktree. Overriding it with -c isn't supported
because it's not really clear what to do if a git repo's config is read, is
not bare, and is then overridden to bare. What is the right git directory
in this case? I will worry about this if/when someone has a use case for
overriding core.worktree with -c. (See Git.Config.updateLocation)
Also removed and renamed some functions like gitDir and workTree that
misused git's terminology.
One minor regression is known: git annex add in a bare repository does not
print a nice error message, but runs git ls-files in a way that fails
earlier with a less nice error message. This is because before --work-tree
was always passed to git commands, even in a bare repo, while now it's not.
Amoung other things, this makes unlocking a WORM backed file and then
re-adding it without making any changes not add a new object, as the
timestamp is preserved.
annex.ssh-options, annex.rsync-options, annex.bup-split-options.
And adjust types to avoid the bugs that broke several config settings
recently. Now "annex." prefixing is enforced at the type level.
Rsync special remotes can be configured with shellescape=no to avoid shell
quoting that is normally done when using rsync over ssh. This is known to
be needed for certian rsync hosting providers (specificially
hidrive.strato.com) that use rsync over ssh but do not pass it through the
shell.
This option avoids gpg key distribution, at the expense of flexability, and
with the requirement that all clones of the git repository be equally
trusted.
This is incomplete, it does not honor it yet for hash directories
and other annex bookkeeping files. Some of that is not needed for a bare
repo; some of it may be.
git-annex (but not git-annex-shell) supports the git help.autocorrect
configuration setting, doing fuzzy matching using the restricted
Damerau-Levenshtein edit distance, just as git does. This adds a build
dependency on the haskell edit-distance library.
Continue using the key name as bup ref name, to preserve backwards
compatability, unless it is an illegal git ref. In that case, use a sha256
of the key name instead.
Don't check if configure indicated checks won't work. This should fix a
FTBFS on mipsel, where configure correctly detects the checks won't work,
while garbage is returned for disk space info at git-annex runtime. It also
means that, when built via cabal, disk space checks are not enabled,
unfortunatly.
* git-annex now behaves as git-annex-shell if symlinked to and run by that
name. The Makefile sets this up, saving some 8 mb of installed size.
* git-union-merge is a demo program, so it is no longer built by default.
openSUSE patches rsync with a patch adding SIP protocol support.
https://gist.github.com/2026167
With this patch, running rsync with no hostname parameter is apparently
supposed to list SIP hosts on the network. Practically, it does nothing
and exits 0.
git-annex uses rsync in a very special way to allow git-annex-shell to be
run on the remote host, and so did not need to specify a hostname, or a
file to transfer as a rsync parameter. So it sent ":", a degenerate case of
"host:file".
But the patch cannot differentiate ":" with no host parameter
(a bug in the SIP patch surely).
Results were that getting files failed, as rsync seemed to succeed, but the
requested file failed to arrive. Also I think that sending files will
make git-annex think a file has been transferred to the remote when
really rsync does nothing.
The workaround for this buggy rsync patch is to use "dummy:" as the
hostname.