This needs the content to be present in order to hash it. But it's not
possible for a module used by Backend.URL to call inAnnex because that
would entail a dependency loop. So instead, rely on the fact that
Command.Migrate calls inAnnex before performing a migration.
But, Command.ExamineKey calls fastMigrate and the key may or may not
exist, and it does not want to actually perform a migration in any case.
To handle that, had to add an additional value to fastMigrate to
indicate whether the content is inAnnex.
Factored generateEquivilantKey out of Remote.Web.
Note that migrateFromURLToVURL hardcodes use of the SHA256E backend.
It would have been difficult not to, given all the dependency loop
issues. But --backend and annex.backend are used to tell git-annex
migrate to use VURL in any case, so there's no config knob that
the user could expect to configure that.
Sponsored-by: Brock Spratlen on Patreon
Considerable difficulty to work around an import cycle. Had to move the
list of backends (except for VURL) to Backend.Variety so VURL could use
it.
Sponsored-by: Kevin Mueller on Patreon
When downloading a VURL from the web, make sure that the equivalent key
log is populated.
Unfortunately, this does not hash the content while it's being
downloaded from the web. There is not an interface in Backend currently
for incremental hash generation, only for incremental verification of an
existing hash. So this might add a noticeable delay, and it has to show
a "(checksum...)" message. This could stand to be improved.
But, that separate hashing step only has to happen on the first download
of new content from the web. Once the hash is known, the VURL key can have
its hash verified incrementally while downloading except when the
content in the web has changed. (Doesn't happen yet because
verifyKeyContentIncrementally is not implemented yet for VURL keys.)
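To illustrate what incremental hash generation during a download could look
like, here is a minimal Python sketch. It is not git-annex's Haskell code and
not an existing Backend interface: the names hash_while_writing and chunks are
hypothetical, and SHA-256 is assumed only for the example.

```python
import hashlib
import io
from typing import BinaryIO, Iterable


def hash_while_writing(chunks: Iterable[bytes], dest: BinaryIO) -> str:
    """Write each downloaded chunk to dest and feed it to the hasher as it
    arrives, so no separate pass over the finished file is needed."""
    hasher = hashlib.sha256()
    for chunk in chunks:
        dest.write(chunk)
        hasher.update(chunk)
    return hasher.hexdigest()


if __name__ == "__main__":
    # Simulated download delivered in three chunks (hypothetical data).
    out = io.BytesIO()
    digest = hash_while_writing([b"hello ", b"wor", b"ld"], out)
    # Compare against hashing the complete content in one go.
    print(digest == hashlib.sha256(out.getvalue()).hexdigest())
```

With something along these lines, the "(checksum...)" step after a first
download would not need to re-read the file.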
Note that the equivalent key log file is formatted as a presence log.
This adds a tiny bit of overhead (eg "1 ") per line over just listing the
urls. The reason I chose to use that format is it seems possible that
there will need to be a way to remove an equivalent key at some point in
the future. I don't know why that would be necessary, but it seemed wise
to allow for the possibility.
Downloads of VURL keys from other special remotes that claim urls,
like bittorrent for example, do not populate the equivalent key log.
So for now, no checksum verification will be done for those.
Sponsored-by: Nicholas Golder-Manning on Patreon
Fix a crash opening sqlite databases when run in a non-unicode locale,
with a remote that uses a non-unicode filepath. In that situation
converting to Text fails.
The fix needs git-annex to be built with persistent-sqlite 2.13.3.
Building against older versions still works, but that version is used when
building with stack.
Database.RawFilePath is a lot of code copied from persistent-sqlite and
lightly modified, since only 1 function in persistent-sqlite was made to
support RawFilePath. This is a bit of a pain, and I hope that
persistent-sqlite will eventually switch to using OsPath, allowing this
module to be removed from git-annex.
Sponsored-by: k0ld on Patreon
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.
commitMigration is a bit of a bear. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.
Sponsored-by: Graham Spencer on Patreon
This is intended to guard against LLM code theft, which is the current
bubble technology du jour.
Note that authorJoeyHess' with a year older than the year I began
developing git-annex will behave badly, by intention. Eg, it will spin
and eventually crash.
This is not the first anti-LLM protection in git-annex. For example see
9562da790f. That method, while much harder
for an adversary to detect and remove, also complicates code somewhat
significantly, and needs extensions to be enabled. There are also
probably significantly fewer ways to implement that method in Haskell.
This new approach, by contrast, will be easy to add throughout the code
base, with very little effort, and without complicating reading or
maintaining it any more than noticing that yes, I am the author of this
code.
An adversary could of course remove all calls to these functions
before feeding code into their LLM-based laundry facility. I think this
would need to be done manually, or with the help of some fairly advanced
Haskell parsing though. In some cases, authorJoeyHess needs to be
removed, while in other places it needs to be replaced with a value.
Also a monadic use of authorJoeyHess' may involve other added monadic
machinery which would need to be eliminated to keep the code compiling.
Alternatively, an adversary could replace my name with something
innocuous. This would be clear intent to remove author attribution
from my code, even more than running it through an LLM laundry is.
If you work for a large company that is laundering my code through an
LLM, please do us a favor and use your immense privilege to quit and go
do something socially beneficial. I will not explain further
developments of this code in such detail, and you have better things to
do than playing cat and mouse with me as I explore directions such as
extending this approach to the type level.
Sponsored-by: k0ld on Patreon
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.
Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.
Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.
Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or would entangle updating two databases in a complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.
Sponsored-by: unqueued on Patreon
This drops a full recompile on my new 12 core laptop from 4:00 to 2:47.
It would be possible for me to use:
    cabal configure --ghc-options=-j
But that also makes cabal parallelize ghc for each package it installs
to satisfy git-annex's dependencies. Since cabal is already configured
to parallelize installing dependencies, that would use N^2 cpu cores,
which seems like a bad idea.
And also I'd have to remember to do it.
So I'm thinking it's better to do it by default. If a system that is
building git-annex is also busy with other things, let the scheduler
sort it out. If this impacts someone particularly badly, they can of
course avoid it with:
    cabal configure --ghc-options=-j1
crypton is a fork of cryptonite, and cryptonite's github repo has been
archived. Some deps are already using crypton so it's clearly the way
forward.
Added a build flag without a default, so cabal configure will select on its
own which to use. stack files pin to cryptonite for now.
Sponsored-by: Nicholas Golder-Manning on Patreon
AFAICS all git-annex builds are using the git-lfs library not the vendored
copy.
Debian stable now includes a new enough haskell-git-lfs package as well.
Last time this was tried it did not.
Since 393275c105, Setup.hs no longer
installs the man pages. Since the cabal package is only used to install
git-annex with cabal, it doesn't need to include files like these that
are not used when installing with cabal.
The tricky thing about this turned out to be handling renames and reverts.
For that, it has to make two passes over the git log, and to avoid
buffering a possibly huge amount of logs in memory (ie the whole git log of
an entire repository!), runs git log twice.
(It might be possible to speed this up by asking git log to show a diff,
and so avoid needing to use catKey.)
Sponsored-By: Brock Spratlen on Patreon
I can't seem to get stack to resolve dependencies with Win32-2.13.4.0,
no matter what I try. Why it blows up, I don't know.
And allow-newer: true actually causes it to downgrade Win32 to the one
version that won't build. Unbelievable that it allows downgrades.
So just gonna have to wait for that to get into stackage nightly, and
then stack.yaml can be updated to use that, and the changes in this
commit reverted.
For whatever reason, putting Win32-2.13.4.0 in stack.yaml results in
stack blowing up with many unrelated dependency problems.
But making git-annex depend on that version lets stack resolve deps.
This enables some new features that need the new aws.
Use http-client-restricted-0.1.0 because it uses the crypton side of the
cryptonite/crypton fork, which seems to be needed for ghc-9.6.2.
Dependency on connection removed because of the cryptonite/crypton fork.
This avoids needing a build flag. It was only used to throw a typed
exception in Utility.Url, which nothing depended on.
Used a fork of bloomfilter because it's not being maintained and no longer
builds as of this ghc version. (I have been trying to contact its
maintainer about it, and emailed him today suggesting I take over the
package.)
Sponsored-by: Brock Spratlen on Patreon
Anything still relying on that, eg via cabal v1-install will need to
change to using make install-home. Which was added back in 2019 in
6491b62614 because cabal new-build
(now the default) already didn't use Setup in a way that let its
installation of those things work.
Notably this means Setup does not need to depend on unix-compat, which is
useful because in 0.7 it removed System.PosixCompat.User, which Setup
needed to determine where to install the desktop files. See
https://github.com/haskell-pkg-janitors/unix-compat/issues/3
This ended up having an interface like sync, rather than like get/copy/drop.
That let it be implemented in terms of sync, which took a lot less code.
Also, it lets it handle many of the edge cases that sync does, such as
getting files that are not visible in a --hide-missing branch, and sending
files to exporttree remotes.
As well as being easier to implement, `git-annex satisfy myremote` makes
sense as it satisfies the preferred content settings of the remote.
`git-annex satisfy somefile` does not form a sentence that makes sense. So
while -C can be a little bit annoying, it still makes sense to have this
syntax.
Note that, while I initially thought this would also satisfy numcopies, it
does not. Arguably it ought to. But, sync does not send files in order to
satisfy numcopies, it only sends files to satisfy preferred content. And
it's important that this transfer the same files as sync does, because
it will probably be used in a workflow where the user sometimes syncs and
sometimes satisfies, and does not expect satisfy to do things that sync
would not do.
(Also opened a new bug that also affects sync et al, not only this command.)
Sponsored-by: Nicholas Golder-Manning on Patreon
optparse-applicative switched to the 'prettyprinter' library in its latest
release, which means the 'H.text' function has disappeared. Instead, 'H.pretty'
can be used to convert all 'Pretty a' types into a renderable document.
Speeds up sync in an adjusted branch by avoiding re-adjusting the branch
unnecessarily, particularly when it is adjusted with --hide-missing or
--unlock-present.
When there are a lot of files, that was the majority of the time of a
--no-content sync.
Uses a log file, which is updated when content presence changes. This
adds a little bit of overhead to every file get/drop when on such an
adjusted branch. The overhead is minimal for get of any size of file,
but might be noticeable for drop in some cases. It seems like a reasonable
trade-off. It would be possible to update the log file only at the end, but
then it would not happen if the command is interrupted.
When not in an adjusted branch, there should be no additional overhead.
(getCurrentBranch is an MVar read, and it avoids the MVar read of
getGitConfig.)
Note that this does not deal with situations such as:
git checkout master, git-annex get, git checkout adjusted branch,
git-annex sync. The sync won't know that the adjusted branch needs to be
updated. Dealing with that would add overhead to operation in non-adjusted
branches, which I don't like. Also, there are other situations like having
two adjusted branches that both need to be updated like this, and switching
between them and sync not updating.
This does mean a behavior change to sync, since it did previously deal
with those situations. But, the documentation did not say that it did.
The man pages only talk about sync updating the adjusted branch after
it transfers content.
I did consider making sync keep track of content it transferred (and
dropped) and only update the adjusted branch then, not to catch up to other
changes made previously. That would perform better. But it seemed rather
hard to implement, and also it would have problems with races with a
concurrent get/drop, which this implementation avoids.
And it seemed pretty likely someone had gotten used to get/drop followed by
sync updating the branch. It seems much less likely someone is switching
branches, doing get/drop, and then switching back and expecting sync to update
the branch.
Re-running git-annex adjust still does a full re-adjusting of the branch,
for anyone who needs that.
Sponsored-by: Leon Schuermann on Patreon
For simplicity, I've not tried to make it handle History yet, so when
there is a history, a full import will still be done. Probably the right
way to handle history is to first diff from the current tree to the last
imported tree. Then, diff from the current tree to each of the
historical trees, and recurse through the history diffing from child tree
to parent tree.
I don't think that will need a record of the previously imported
historical trees, and so Logs.Import doesn't store them. Although I did
leave room for future expansion in that log just in case.
Next step will be to change importTree to importChanges and modify
recordImportTree et al to handle it, by using adjustTree.
Sponsored-by: Brett Eisenberg on Patreon
assist: New command, which is the same as git-annex sync but with
new files added and content transferred by default.
(Also this fixes another reversion in git-annex sync:
--commit, --no-commit, and --message were not enabled, oops.)
See added comment for why git-annex assist does commit staged
changes elsewhere in the work tree, but only adds files under
the cwd.
Note that it does not support --no-commit, --no-push, --no-pull
like sync does. My thinking is, why should it? If you want that
level of control, use git commit, git annex push, git annex pull.
Sync only got those options because pull and push were not split
out.
Sponsored-by: k0ld on Patreon
Split out two new commands, git-annex pull and git-annex push. Those plus a
git commit are equivalent to git-annex sync.
In a sense, git-annex sync conflates 3 things, and it would have been
better to have push and pull from the beginning and not sync. Although
note that git-annex sync --content is faster than a pull followed by a
push, because it only has to walk the tree once, look at preferred
content once, etc. So there is some value in git-annex sync in speed, as
well as user convenience.
And it would be hard to split out pull and push from sync, as far as the
implementation goes. The implementation inside sync was easy, just adjust
SyncOptions so it does the right thing.
Note that the new commands default to syncing content, unless
annex.synccontent is explicitly set to false. I'd like sync to also do
that, but that's a hard transition to make. As a start to that
transition, I added a note to git-annex-sync.mdwn that it may start to
do so in a future version of git-annex. But a real transition would
necessarily involve displaying warnings when sync is used without
--content, and time.
Sponsored-by: Kevin Mueller on Patreon
New command, currently limited to changing autoenable= setting of a special remote.
It will probably never be used for more than that given the limitations on
it.
Sponsored-by: Brock Spratlen on Patreon
When displaying a ByteString like "💕", safeOutput operates on
individual bytes like "\240\159\146\149" and isControl '\146' = True,
so it got truncated to just "\240".
So, only treat the low control characters, and DEL, as control
characters.
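
The byte-level problem can be reproduced outside Haskell. The Python sketch
below is only an illustration, not the safeOutput code: naive_filter and
low_control_filter are hypothetical names, and the naive version assumes each
byte is treated as the character with the same code point, as the description
above implies.

```python
import unicodedata


def naive_filter(data: bytes) -> bytes:
    """Drop every byte whose code point Unicode classes as a control
    character (category Cc). That includes the C1 range 0x80-0x9F, which
    overlaps with the continuation bytes of UTF-8 sequences."""
    return bytes(b for b in data if unicodedata.category(chr(b)) != "Cc")


def low_control_filter(data: bytes) -> bytes:
    """Drop only the low control characters (below 0x20) and DEL (0x7F),
    leaving multi-byte UTF-8 sequences intact."""
    return bytes(b for b in data if b >= 0x20 and b != 0x7F)


if __name__ == "__main__":
    # U+1F495 encodes to the bytes 240 159 146 149 in UTF-8.
    heart = "\U0001F495".encode("utf-8")
    print(list(naive_filter(heart)))
    print(low_control_filter(heart) == heart)
    # An escape sequence loses its ESC byte, and the trailing DEL is dropped.
    print(low_control_filter(b"a\x1b[31mb\x7f"))
```

Restricting the filter to bytes below 0x20 plus 0x7F still removes terminal
escape sequences' ESC byte, while no byte of a valid multi-byte UTF-8
sequence falls in that range.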
Also split Utility.Terminal out of Utility.SafeOutput. Utility.Terminal
needs win32, but Utility.SafeOutput is used by Control.Exception, which is
used by Setup.
Sponsored-by: Nicholas Golder-Manning on Patreon
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)
error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.
Changed a lot of existing error to giveup when it is not strictly an
internal error.
Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.
Sponsored-by: Noam Kremen on Patreon