I've seen RSS feeds that have no permalinks, only guids (which are
sometimes in the form of permalinks, argh/sigh).

I had previously avoided trusting guids to be globally unique, because a
survey of the RSS feeds I subscribe to showed a lot of pretty bad
"guids", like "2 at http://serialpodcast.org" or even worse "oth20150401-hq".
The worry was that, with podcasts generating guids this badly, there's
no guarantee they're actually globally unique across feeds.

But I'm seeing too many URL changes that result in redundant files, so
let's try trusting guids. If feeds are so broken that their guids overlap,
they could just as well have incorrectly called them permalinks too.
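
For reference, an RSS 2.0 item carries its guid like this (values
illustrative; the isPermaLink attribute defaults to true, which is how a
guid gets to masquerade as a permalink):

    <item>
      <title>Episode 2</title>
      <guid isPermaLink="false">2 at http://serialpodcast.org</guid>
      <enclosure url="http://example.com/ep2.mp3" type="audio/mpeg" length="12345678"/>
    </item>
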
"checkPresent baser" was wrong; the baser has a dummy checkPresent action
not the real one. So, to fix this, we need to call preparecheckpresent to
get a checkpresent action that can be used to check if chunks are present.
Note that, for remotes like S3, this means that the preparer is run,
which opens a S3 handle, that will be used for each checkpresent of a
chunk. That's a good thing; if we're resuming an upload that's already many
chunks in, it'll reuse that same http connection for each chunk it checks.
Still, it's not a perfectly ideal thing, since this is a different http
connection that the one that will be used to upload chunks. It would be
nice to improve the API so that both use the same http connection.
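
The prepare-once, check-per-chunk shape is something like this (a minimal
sketch with made-up names and a stubbed connection, not the real API):

    import Control.Exception (bracket)

    newtype Conn = Conn String  -- stands in for an S3/http handle

    -- Open the expensive resource once, hand back a checkpresent action
    -- that reuses it for every chunk, and clean up afterwards.
    withCheckPresent :: ((String -> IO Bool) -> IO a) -> IO a
    withCheckPresent use = bracket open close $ \conn ->
        use (checkChunk conn)
      where
        open = putStrLn "opening http connection" >> return (Conn "handle")
        close _ = putStrLn "closing http connection"
        -- stub: a real version would query the remote for the chunk's key
        checkChunk (Conn _) key = putStrLn ("checking " ++ key) >> return False

    main :: IO ()
    main = withCheckPresent $ \checkpresent ->
        mapM_ (\k -> checkpresent k >>= print) ["chunk1", "chunk2", "chunk3"]
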
The branch needs to be created when merging from the remote in sync,
since we diff between it and the remote's sync branch. But git annex merge
should not be creating sync branches.
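
Concretely, with master checked out, the refs involved look like this
(remote name illustrative):

    synced/master                        the local sync branch
    refs/remotes/origin/synced/master    the remote's sync branch

sync diffs these two when merging, which is why the local one has to
exist; a bare git annex merge has no reason to create it.
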
This was a reversion caused by the relative path changes in 5.20150113.
Other uses of addAuthorizedKeys seem to be ok. If the user enters a
directory like ~/annex, it writes GIT_ANNEX_SHELL_DIRECTORY=annex, and
git-annex-shell assumes that's relative to HOME.
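
The line written to ~/.ssh/authorized_keys is roughly of this shape
(abbreviated; options and key are illustrative):

    command="GIT_ANNEX_SHELL_DIRECTORY=annex ~/.ssh/git-annex-shell",no-agent-forwarding,no-port-forwarding,no-X11-forwarding ssh-rsa AAAA... user@host

Note the relative "annex": git-annex-shell resolves it against HOME.
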
This makes git annex unused use around 48 MB more memory than it did before,
but the massive increase in accuracy makes this worthwhile for all but the
smallest systems.

Also, I want to use the bloom filter for sync --all --content, to avoid
dropping files that the preferred content doesn't want, and 1/1000
false positives would be far too many in that use case, even if it were
acceptable for unused.

Actual memory use numbers:

    1000:     21.06user 3.42system 0:26.40elapsed 92%CPU (0avgtext+0avgdata 501552maxresident)k
    1000000:  21.41user 3.55system 0:26.84elapsed 93%CPU (0avgtext+0avgdata 549496maxresident)k
    10000000: 21.84user 3.52system 0:27.89elapsed 90%CPU (0avgtext+0avgdata 549920maxresident)k

Based on these numbers, 10 million seemed a better pick than 1 million.
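
For a feel of the sizing tradeoff, here's a sketch using
Data.BloomFilter.Easy from the bloomfilter package (assuming that API;
suggestSizing gives the bits and hash functions needed for a given
capacity and false positive rate):

    import qualified Data.BloomFilter.Easy as B

    main :: IO ()
    main = do
        -- (bits, hash functions) needed for 10 million keys:
        print (B.suggestSizing 10000000 0.001)      -- at 1/1000 false positives
        print (B.suggestSizing 10000000 0.0000001)  -- far more accurate, more RAM
        -- lookups can false-positive, but never false-negative:
        let b = B.easyList 0.001 ["key1", "key2"]
        print (B.elem "key1" b)  -- True
        print (B.elem "key3" b)  -- almost certainly False
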
backup: Use the new "anything" terminal. This means that content that
is not unused but has no associated file will be wanted by backup repos.

unwanted: "not anything" will result in any and all content moving
off of these repos.

incremental backup: Remove the "(include=* or unused)",
so it matches content that has no associated files
but is not unused.

client: Add an include=* to the expression. This limits it to matching
only files in the work tree. Without this change, sync --all --content
would match a bare key against the expression, and since a key with no
filename trivially matches exclude=archive/*, the client repo would have
wanted the file content.
The "and not unused" would have kept unused objects out, but not
objects that were not known to be unused, or objects that another branch
referred to. In practice, everything would have flooded into client repos
without this change.
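
For illustration (repo names made up), repos are put into these groups
and told to use the standard expressions like so:

    git annex group mybackup backup
    git annex wanted mybackup standard
    git annex group olddrive unwanted
    git annex wanted olddrive standard

And the client expression now has roughly this shape (a simplified
sketch, not the verbatim standard expression):

    include=* and (exclude=archive/* and exclude=*/archive/*) and not unused

The leading include=* is what keeps bare keys, which have no filename to
match, from slipping past the archive excludes.
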