I've seen rss feeds that have no permalinks, only guids (which are
sometimes in the form of permalinks, argh/sigh).
I had previously avoided trusting guids to be globally unique, because my
survey of rss feeds that I subscribe to shows a lot of pretty bad
"guids" like "2 at http://serialpodcast.org" or even worse "oth20150401-hq".
The worry was that, with two podcasts generating guids that badly, there's
no guarantee they're actually globally unique.
But, I'm seeing too many url changes that result in redundant files, so
let's try this. If feeds are so broken that guids overlap, they could just
as well incorrectly call them permalinks too.
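A minimal sketch of the dedup idea, with a made-up Item type and field names
(illustrative only; the real importfeed code differs):

  import Control.Applicative ((<|>))
  import qualified Data.Set as S

  -- Illustrative only: identify a feed item by its permalink when present,
  -- falling back to its guid, and skip items whose identifier has already
  -- been seen. This lets a feed change its enclosure urls without causing
  -- redundant downloads.
  data Item = Item
      { itemPermalink :: Maybe String
      , itemGuid :: Maybe String
      , itemUrl :: String
      }

  itemIdentifier :: Item -> Maybe String
  itemIdentifier i = itemPermalink i <|> itemGuid i

  newItems :: S.Set String -> [Item] -> [Item]
  newItems seen = filter (maybe True (`S.notMember` seen) . itemIdentifier)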
"checkPresent baser" was wrong; the baser has a dummy checkPresent action
not the real one. So, to fix this, we need to call preparecheckpresent to
get a checkpresent action that can be used to check if chunks are present.
Note that, for remotes like S3, this means that the preparer is run,
which opens a S3 handle, that will be used for each checkpresent of a
chunk. That's a good thing; if we're resuming an upload that's already many
chunks in, it'll reuse that same http connection for each chunk it checks.
Still, it's not a perfectly ideal thing, since this is a different http
connection than the one that will be used to upload chunks. It would be
nice to improve the API so that both use the same http connection.
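Roughly the shape of the fix, with heavily simplified types (the real
Preparer plumbing in git-annex is more involved; the names below are
illustrative):

  -- Simplified illustration, not the real git-annex types. Instead of using
  -- the baser's dummy checkPresent, run the preparer once to obtain a real
  -- checkpresent action, and reuse it for every chunk. For a remote like S3,
  -- the preparer opens one handle (one http connection), which is then
  -- shared across all the per-chunk presence checks.
  newtype ChunkKey = ChunkKey String

  type CheckPresent = ChunkKey -> IO Bool
  type Preparer = (CheckPresent -> IO Bool) -> IO Bool

  checkChunksPresent :: Preparer -> [ChunkKey] -> IO Bool
  checkChunksPresent preparecheckpresent chunks =
      preparecheckpresent $ \checkpresent ->
          fmap and (mapM checkpresent chunks)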
The branch needs to be created when merging from the remote in sync,
since we diff between it and the remote's sync branch. But git annex merge
should not be creating sync branches.
This was a reversion caused by the relative path changes in 5.20150113.
Other uses of addAuthorizedKeys seem to be ok. If the user enters a
directory like ~/annex, it writes GIT_ANNEX_SHELL_DIRECTORY=annex, and
git-annex-shell assumes that's relative to HOME.
This makes git annex unused use around 48 mb more memory than it did before,
but the massive increase in accuracy makes this worthwhile for all but the
smallest systems.
Also, I want to use the bloom filter for sync --all --content, to avoid
dropping files that the preferred content doesn't want, and 1/1000
false positives would be far too many in that use case, even if it were
acceptable for unused.
Actual memory use numbers, by accuracy setting (1 in N false positives):

  1000:     21.06user 3.42system 0:26.40elapsed 92%CPU, 501552 KB maxresident
  1000000:  21.41user 3.55system 0:26.84elapsed 93%CPU, 549496 KB maxresident
  10000000: 21.84user 3.52system 0:27.89elapsed 90%CPU, 549920 KB maxresident
Based on these numbers, 10 million seemed a better pick than 1 million.
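For a sense of how accuracy translates into memory, the standard bloom filter
sizing formula (generic math, not git-annex's actual sizing code) ties the
false positive rate to bits per element:

  -- Generic bloom filter math: a filter with false positive rate p needs
  -- about -ln p / (ln 2)^2, i.e. roughly 1.44 * log2 (1/p), bits per element.
  bitsPerElement :: Double -> Double
  bitsPerElement p = negate (log p) / (log 2 ** 2)

  -- bitsPerElement 0.001     ~= 14.4   (1 in 1000)
  -- bitsPerElement 0.0000001 ~= 33.5   (1 in 10 million)

So tightening the false positive rate from 1 in 1000 to 1 in 10 million costs
only a bit more than doubling the filter's size, which is why the jump in
accuracy is relatively cheap.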
backup: Use the new "anything" terminal. This means that content that
is not unused, but has no associated file, will be wanted by backup repos.
unwanted: "not anything" will result in any and all content moving
off of these repos.
incremental backup: Remove the "(include=* or unused)",
so it matches content that has no associated files
but is not unused.
client: Add a include=* to the expression. This limits it to matching
only files in the work tree. Without this change, sync --all --content
would match a key against the expression, and since it matches
exclude=archive/*, the client repo would have wanted the file content.
The "and not unused" would have kept unused objects out, but not
objects that were not known to be unused, or objects that another branch
referred to. In practice, everything would have flooded into client repos
without this change.
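To make the mechanics concrete, consider a deliberately simplified client
expression (not the actual standard one):

  include=* and exclude=archive/* and not unused

A key with no associated file trivially satisfies exclude=archive/* and
not unused, so without the include=* term, sync --all --content would have
wanted it in the client repo; with include=*, only keys that have a file in
the work tree can match.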
In my tests, this has to be set when uploading a file to the bucket
and then the file can be accessed using the bucketname.s3.amazonaws.com
url.
Setting it when creating the bucket didn't seem to make the whole bucket
public, or allow accessing files stored in it. But I have gone ahead and
also sent it when creating the bucket, just in case it's needed in some
situations.
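A sketch of how the ACL might be passed with the aws library; the record
field and constructor used here (poAcl, AclPublicRead) are written from
memory of that library's API and should be checked against the version
actually in use:

  import qualified Aws.S3 as S3
  import qualified Data.Text as T
  import Network.HTTP.Client (RequestBody)

  -- Sketch only: when the remote is configured to be public, send the
  -- public-read canned ACL along with each object upload.
  mkPut :: Bool -> S3.Bucket -> T.Text -> RequestBody -> S3.PutObject
  mkPut public bucket object body = (S3.putObject bucket object body)
      { S3.poAcl = if public then Just S3.AclPublicRead else Nothing }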
This is especially useful because the caller doesn't need to generate valid
url keys, which involves some escaping of characters, and may involve
taking an md5sum of the url if it's too long.
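For illustration only, the kind of work being avoided looks roughly like
this (the escaping rules and length limit below are made up, not git-annex's
actual ones):

  import Data.Digest.Pure.MD5 (md5)
  import qualified Data.ByteString.Lazy.Char8 as L

  -- Hypothetical sketch of building a key name from a url: escape awkward
  -- characters, and if the result is too long, truncate it and append an
  -- md5sum of the full url to keep it unique.
  urlKeyName :: String -> String
  urlKeyName u
      | length sanitized <= maxlen = sanitized
      | otherwise = take maxlen sanitized ++ "-" ++ show (md5 (L.pack u))
    where
      sanitized = map escape u
      maxlen = 120
      escape c
          | c `elem` "/:?&=%# " = '_'
          | otherwise = c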
The one exception is in Utility.Daemon. As long as a process only
daemonizes once, which seems reasonable, and as long as it avoids calling
checkDaemon once it's already running as a daemon, the fcntl locking
gotchas won't be a problem there.
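For reference, the gotcha in question is generic POSIX fcntl behavior, not
anything git-annex-specific; a tiny sketch (using the unix package's older
openFd signature, which takes a Maybe FileMode):

  import System.IO (SeekMode(AbsoluteSeek))
  import System.Posix.IO

  -- fcntl locks are per-process: closing *any* file descriptor open on the
  -- file drops locks that were taken through a different descriptor. That
  -- is why a shared lock pool has to mediate all opens and closes.
  main :: IO ()
  main = do
      fd1 <- openFd "lockfile" ReadWrite (Just 0o644) defaultFileFlags
      setLock fd1 (WriteLock, AbsoluteSeek, 0, 0)
      fd2 <- openFd "lockfile" ReadWrite (Just 0o644) defaultFileFlags
      closeFd fd2  -- oops: this also releases the lock taken via fd1
      closeFd fd1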
Annex.LockFile has its own separate lock pool layer, which has been
renamed to LockCache. This is a persistent cache of locks, which persist
until closed.
This is not quite done; lockContent still needs to be converted.
Only the assistant uses these, and only the assistant cleans them up, so
make only git annex transferkeys write them.
There is one behavior change from this. If glacier is being used, and a
manual git annex get --from glacier fails because the file isn't available
yet, the assistant will no longer later see that failed transfer file and
retry the get. Hope no-one depended on that old behavior.
I've tested all the dataenc to sandi conversions except Assistant.XMPP,
and all have unchanged behavior, including behavior on large unicode code
points.
For example, it failed to get files from a bucket named S3.
Also fixes `git annex initremote UPPERCASE type=S3`, which failed with the
new aws library, with a signing error message.
The setDifferences that got added to initialize turns out to make a git
commit before ensureCommit has been used. Thus, repo init can fail
when the system has a broken hostname etc.
Move the ensureCommit to the very first thing to avoid this kind of breakage.
The directory special remote was not affected in its normal configuration,
since annex-directory is an absolute path normally. But it could fail
when a relative path was used.
The git remote was affected even when an absolute path to it was used in
.git/config, since git-annex now converts all such paths to relative.
Since we started using this for git repos, a remote on another drive
resulted in a bogus relative path to it being used by git-annex, which
didn't work.
This is a nearly free feature; it piggybacks on the location log lookups
done for the numcopies stats. So, the only extra overhead is updating
the map of repository sizes.
However, I had to switch to Data.Map.Strict, which needs containers 0.5.
If backporting to wheezy, will probably need to revert this commit.
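A sketch of the accumulation, with made-up names (the real code hangs this
off the existing location log traversal):

  import qualified Data.Map.Strict as M

  newtype UUID = UUID String
      deriving (Eq, Ord)

  type RepoSizes = M.Map UUID Integer

  -- Hypothetical sketch: while walking keys for the numcopies stats, also
  -- fold each key's size into a per-repository total. Data.Map.Strict
  -- (new in containers 0.5) keeps the running sums evaluated instead of
  -- building up a thunk per key.
  addKey :: Integer -> [UUID] -> RepoSizes -> RepoSizes
  addKey keysize repos m = foldr (\u -> M.insertWith (+) u keysize) m repos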
This works, and seems fairly robust. Clean get of 20 files at -J3. At -J10,
there are some messages about ssh multiplexing, probably due to a race
spinning up the ssh connection cacher. But, it manages to get all the files
ok regardless.
The progress bars are a scrambled mess though, due to bugs in
ascii-progress, which I've already filed. Particularly this one:
https://github.com/yamadapc/haskell-ascii-progress/issues/8
webapp: When adding another local repository, and combining it with the
current repository, the new repository's remote path was set to "." rather
than the path to the current repository. This was a reversion caused by the
relative path changes in 5.20150113.
git-checkignore refuses to work if any pathspec options are set. Urgh.
I audited the rest of git, and no other commands used by git-annex have
such limitations. Indeed, AFAICS, *all* other commands support
--literal-pathspecs. So, worked around this where git-checkignore is
called.
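The workaround boils down to leaving that one option out when constructing
the check-ignore command line; a hypothetical sketch, not the actual
Git.CheckIgnore code:

  -- Hypothetical: every other git command gets the global
  -- --literal-pathspecs option, but it has to be filtered out before
  -- running git check-ignore.
  checkIgnoreCommand :: [String] -> [String]
  checkIgnoreCommand globalopts =
      filter (/= "--literal-pathspecs") globalopts
          ++ ["check-ignore", "-z", "--stdin"]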
I don't quite understand the cause of the deadlock. It only occurred
when git-annex-shell transferinfo was being spawned over ssh to feed
download transfer progress back. And if I removed this line from
feedprogressback, the deadlock didn't occur:
  bytes <- readSV v
The problem was not a leaked FD, as far as I could see. So what was it?
I don't know.
Anyway, this is a nice clean implementation, that avoids the deadlock.
Just fork off the async threads to handle filtering the stdout and stderr,
and let them clean up their handles whenever they decide to exit.
I've verified that the handles do get promptly closed, although a little
later than I would expect. Presumably that "little later" is what
was making waiting on the threads deadlock.
Despite the late exit, the last line of stdout and stderr appears where
I'd want it to, so I guess this is ok.
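The shape of that implementation, simplified (names here are illustrative,
and the real code does actual filtering rather than just relaying):

  import Control.Concurrent.Async (async)
  import Control.Monad (unless)
  import System.IO

  -- Fork a thread per output handle that relays (or filters) lines until
  -- EOF and then closes its handle. Nothing waits on these threads, so a
  -- slow reader cannot deadlock the caller; each thread cleans up its
  -- handle whenever it gets around to exiting.
  relayOutput :: Handle -> Handle -> IO ()
  relayOutput from to = do
      _ <- async (go >> hClose from)
      return ()
    where
      go = do
          eof <- hIsEOF from
          unless eof $ do
              hGetLine from >>= hPutStrLn to
              go

One of these gets forked for the child's stdout and one for its stderr, and
the parent simply doesn't wait on them.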
The stderr reader blocks waiting for all stderr, and so blocks the process
from ever exiting.
I tried several ways to get around this, but no success yet. For now,
disable the stderr reader entirely.