Commit graph

1703 commits

Author SHA1 Message Date
Joey Hess
d33ab4bbe4
add preciseSize 2024-08-11 15:40:21 -04:00
Joey Hess
bd5affa362
use hmac in balanced preferred content
This deals with the possible security problem that someone could make an
unusually low UUID and generate keys that are all constructed to hash to
a number that, mod the number of repositories in the group, == 0.
So balanced preferred content would always put those keys in the
repository with the low UUID as long as the group contains the
number of repositories that the attacker anticipated.
Presumably the attacker than holds the data for ransom? Dunno.

Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs
combined together as the "secret", and the key as the "message". Now any
change in the set of UUIDs in a group will invalidate the attacker's
constructed keys from hashing to anything in particular.

Given that there are plenty of other things someone can do if they can
write to the repository -- including modifying preferred content so only
their repository wants files, and numcopies so other repositories drom
them -- this seems like safeguard enough.

Note that, in balancedPicker, combineduuids is memoized.
2024-08-10 16:32:54 -04:00
Joey Hess
3ce2e95a5f
balanced preferred content and --rebalance
This all works fine. But it doesn't check repository sizes yet, and
without repository size checking, once a repository gets full, there
will be no other repository that will want its files.

Use of sha2 seems unncessary, probably alder2 or md5 or crc would have
been enough. Possibly just summing up the bytes of the key mod the number
of repositories would have sufficed. But sha2 is there, and probably
hardware accellerated. I doubt very much there is any security benefit
to using it though. If someone wants to construct a key that will be
balanced onto a given repository, sha2 is certianly not going to stop
them.
2024-08-09 14:16:09 -04:00
Joey Hess
99b7a0cfe9
use REMOVE-BEFORE in P2P protocol
Only clusters still need to be fixed to close this todo.
2024-07-04 13:47:38 -04:00
Joey Hess
e9bd03fc4b
use Boottime clock on linux
Better to advance while suspended, that way a stale content retention
lock will expire while suspended.

The fallback the Monotonic to support older kernels assumes that there
are not systems where Boottime sometimes succeeds and sometimes fails.
If that ever happened, the clock would probably not monotonically
advance as it read from different clocks on different calls! I don't see
any indication in clock_gettime(2) errono list that it can fail
intermittently so probably this is ok.
2024-07-03 18:44:38 -04:00
Joey Hess
5b6150e5d5
factor out Utility.MonotonicClock 2024-07-03 17:54:01 -04:00
Joey Hess
f18740699e
P2P protocol version 2, adding SUCCESS-PLUS and ALREADY-HAVE-PLUS
Client side support for SUCCESS-PLUS and ALREADY-HAVE-PLUS
is complete, when a PUT stores to additional repositories
than the expected on, the location log is updated with the
additional UUIDs that contain the content.

Started implementing PUT fanout to multiple remotes for clusters.
It is untested, and I fear fencepost errors in the relative
offset calculations. And it is missing proxying for the protocol
after DATA.
2024-06-18 16:21:40 -04:00
Joey Hess
da2c02162c
Fix Windows build with Win32 2.13.4+
Thanks, Oleg Tolmatcev
2024-06-03 13:04:15 -04:00
Yaroslav Halchenko
87e2ae2014
run codespell throughout fixing typos automagically
=== Do not change lines below ===
{
 "chain": [],
 "cmd": "codespell -w",
 "exit": 0,
 "extra_inputs": [],
 "inputs": [],
 "outputs": [],
 "pwd": "."
}
^^^ Do not change lines above ^^^
2024-05-01 15:46:21 -04:00
Joey Hess
81608c3c37
windows build fix 2024-03-26 13:51:51 -04:00
Joey Hess
418e97e847
fix build warnings on windows 2024-03-26 13:11:53 -04:00
Joey Hess
7c5007279c
Windows: Fix escaping output to terminal when using old versions of MinTTY 2024-03-26 13:09:21 -04:00
Joey Hess
db95de6f2b
clean up windows build warnings about unused imports 2024-03-26 13:06:52 -04:00
Joey Hess
5a3ed3f910
attribution armoring 2024-03-07 19:19:45 -04:00
Joey Hess
c2d6c02c27
Added dependency on unbounded-delays
And stop vendoring part of it.

This is a free dependency because tasty depends on it.

Sponsored-by: Leon Schuermann on Patreon
2024-02-27 13:11:59 -04:00
Joey Hess
8e9ee31621
webapp: Added --port option, and annex.port config
The getSocket comment that mentioned using ":port"
in the hostname seems to have been incorrect or be out of date.
After all, the bug report came when the user first tried doing that,
and it didn't work.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-01-25 14:08:36 -04:00
Joey Hess
8da85fd3a3
RawFilePath conversion
Sponsored-by: Dartmouth College's DANDI project
2024-01-19 14:26:21 -04:00
Joey Hess
703a70cafa
avoid watchFileSize running backward
This is groundwork for using watchFileSize for downloads from external
special remotes.

In Annex.Content.downloadUrl, this potentially avoids jitter in the
progress meter. When downloading with conduit, the meter gets updated based
on both the size of the file, and on the data flowing through conduit.
If that has not yet been flushed to the file, it seems possible for the
meter to run backwards when meter is updated with the file size.
It's probably only a few kb of jitter, so may not be visible.

Sponsored-by: Dartmouth College's DANDI project
2024-01-19 14:11:27 -04:00
Joey Hess
7e69063a29
support annex.shared-sop-command for encryption=shared
This works well, and it interoperates with gpg in my testing (although some
SOP commands might choose to use a profile that does not so caveat emptor).

Note that for creating the Cipher, gpg --gen-random is still used. SOP
does not have an eqivilant, and as long as the user has gpg around,
which seems likely, it doesn't matter that it uses gpg here, it's not being
used for encryption. That seemed better than implementing a second way
to get high quality entropy, at least for now.

The need for the sop command to run in an empty directory has each call
to encrypt and decrypt creating a new temporary directory. That is some
unncessary overhead, though probably swamped by the overhead of running
the sop command. This could be improved in the future by passing an
already empty directory to them, or a sufficiently empty directory
(.git/annex/tmp would probably suffice).

Sponsored-by: Brett Eisenberg on Patreon
2024-01-12 13:31:18 -04:00
Joey Hess
dd3e779020
more groundwork for StatelessOpenPGP
no behavior changes
2024-01-12 13:11:36 -04:00
Joey Hess
790600f7b2
close send side of password pipe on exec
This avoids a hang approximately 1% of the time when running the test
suite on StatelessOpenPGP.

Since I've not seen git-annex hang when running git like that, I guess
git probably does something that avoids hanging similarly. Still, fixed
the same problem in Utility.Gpg too.

Sponsored-by: Kevin Mueller on Patreon
2024-01-10 17:31:58 -04:00
Joey Hess
d98f02a5b0
test annex.shared-sop-command
Test a specified Stateless OpenPGP command with eg:
git-annex test --test-git-config annex.shared-sop-command=sqop

Also documented that config and another one, but so far only the test suite
uses the configs, have not yet implemented using it for actual symmetric
encryption.

Sponsored-by: Joshua Antonishen on Patreon
2024-01-10 16:30:38 -04:00
Joey Hess
812cbf0e17
Stateless OpenPGP interface
Implemented according to
https://www.ietf.org/archive/id/draft-dkg-openpgp-stateless-cli-09.html#name-encrypt-encrypt-a-message

Not yet used by git-annex.

Sponsored-by: Leon Schuermann on Patreon
2024-01-10 15:59:35 -04:00
Joey Hess
b728e935bc
correct comment 2024-01-10 15:59:16 -04:00
Joey Hess
478f0870d1
update comment 2024-01-10 13:24:09 -04:00
Joey Hess
de6a297d36
assistant: When generating a gpg secret key, avoid hardcoding the key algorithm and size
This aims to future-proof gpg key generation. OpenPGP is in flux with a
conflict over standards ongoing. It seems not unlikely that different
systems will have different gpg commands that support different algorithms.

This also simplifies the code by using the --quick-gen-key interface rather
than the experimental batch interface. It seems less likely that
--quick-gen-key will break than an experimental interface (whose
documentation I can no longer find).

--quick-gen-key is supported since gpg 2.1.0 (2014).

Sponsored-by: Graham Spencer on Patreon
2024-01-09 15:31:53 -04:00
Joey Hess
0e9bc41588
Revert "import Data.Time.Clock to build with time-1.9.1"
This reverts commit 7484191284.

Not necessary after all.
2023-12-27 19:11:15 -04:00
Joey Hess
7484191284
import Data.Time.Clock to build with time-1.9.1
In more recent versions Data.Time exports secondsToNominalDiffTime etc,
but originally it did not.
2023-12-27 16:58:40 -04:00
Joey Hess
4b52657b37
fix build with old version of time package
Can't truncate timestamp resolution with that version.
2023-12-27 15:33:46 -04:00
Joey Hess
c64db46b7f
refactor 2023-12-18 21:35:00 -04:00
Joey Hess
9a67ed0f10
importtree: support preferred content expressions needing keys
When importing from a special remote, support preferred content expressions
that use terms that match on keys (eg "present", "copies=1"). Such terms
are ignored when importing, since the key is not known yet.

When "standard" or "groupwanted" is used, the terms in those
expressions also get pruned accordingly.

This does allow setting preferred content to "not (copies=1)" to make a
special remote into a "source" type of repository. Importing from it will
import all files. Then exporting to it will drop all files from it.

In the case of setting preferred content to "present", it's pruned on
import, so everything gets imported from it. Then on export, it's applied,
and everything in it is left on it, and no new content is exported to it.

Since the old behavior on these preferred content expressions was for
importtree to error out, there's no backwards compatability to worry about.
Except that sync/pull/etc will now import where before it errored out.
2023-12-18 16:27:59 -04:00
Joey Hess
eb59da9dd2
Lower precision of timestamps in git-annex branch
This can reduce the size of the branch by up to 8%. My test was
running git-annex add 1000 times on one file each.
Lots of different high-resolution timestamps were recorded before
and eliminating those, after packing, the git repo was 8% smaller.

Due to the use of vector clocks, high resolution timestamps are
not necessary to make clear which information is most recent when
eg, a value is changed repeatedly in the same second. In such a
case, the vector clock will be advanced to the next second after
the last modification. For example, running
git-annex numcopies 1; git-annex numcopies 2
The first will record the current second, while the next records
the second after that even if it runs in the same second.

As for conflicting information written to two different clones of the
repository, this will make git-annex sometimes pick information that
was written earlier in a second over information written later in the
same second. Usually git-annex does not write conflicting information,
but there are some cases where it could. Eg, storing an object on a remote
can update the remote state log with some state. If two repos both store the
same object, and end up storing different remote state for some reason,
this can result in one that ran a tiny bit later winning. Such a situation
seems unlikely to be user visible. And a small amount of clock skew could
already result in such things.

The only case I can think of where this might be a user visible change
is if a configuration command like git-annex numcopies is being run
in 2 clones of a repository on the same machine at very
close to the same time. Then the user will know which they ran last,
and git-annex won't.

If that did become a problem, this could be dialed back to eg log
milliseconds with still some space saving.
2023-12-11 15:04:06 -04:00
Joey Hess
d06aee7ce0
make commitMigration interuption safe
Fixed inversion of control issue, so the tree is recorded
in streamLogFile finalizer.

Sponsored-by: Leon Schuermann on Patreon
2023-12-06 16:29:58 -04:00
Joey Hess
0bd8b17b59
log migration trees to git-annex branch
This will allow distributed migration: Start a migration in one clone of
a repo, and then update other clones.

commitMigration is a bit of a bear.. There is some inversion of control
that needs some TMVars. Also streamLogFile's finalizer does not handle
recording the trees, so an interrupt at just the wrong time can cause
migration.log to be emptied but the git-annex branch not updated.

Sponsored-by: Graham Spencer on Patreon
2023-12-06 15:40:03 -04:00
Joey Hess
1a586f80e6
remove debug print 2023-12-05 15:56:58 -04:00
Joey Hess
a6eb7d7339
prevent relatedTemplate from truncating a filename to end in "."
Avoid a problem with temp file names ending in "." on certian filesystems
that have problems with such filenames.

relatedTemplate is quite an ugly hack really; since it doesn't know the max
filename length of the filesystem it can only assume that the filename is
max allowed length. When given the input "lh.aparc.DKTatlas.annot", it
wants to reserve 20 characters for tempfile so it truncates to "lh.". That
ending period is apparently a problem on some filesystem (FAT eats it, but
does not throw EINVAL; ntfs does not seem bothered by it, I don't know what
FUSE filesystem the bug reporter was really using).

Sponsored-by: Brett Eisenberg on Patreon
2023-12-05 12:38:14 -04:00
Joey Hess
c1037da2e5
attribution armoring
Read a fun paper where they got chatgpt to emit large chunks of code
https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
"ChatGPT memorized significant fractions of its training dataset"
2023-11-29 14:42:44 -04:00
Joey Hess
f1c2e18b8d
improve attribution armoring
Split out an author parameter, will make it easier to add authors and
reads better.

Got rid of the function without the copyright year, because an adversary
could have mechanically changed the function with a copyright year to
the one without, and so bypassed the protection of LLM copyright
year hallucination.

Sponsored-by: Luke T. Shumaker on Patreon
2023-11-21 11:34:21 -04:00
Joey Hess
e901d31feb
exhaustiveness check fix 2023-11-20 21:34:29 -04:00
Joey Hess
dab9687184
improve attribution armoring 2023-11-20 21:20:37 -04:00
Joey Hess
d5d570a96c
avoid replacing otherwise
While authorJoeyHess is True same as otherwise, ghc's exhastiveness
checker turns out to special case otherwise. So this avoids warnings.
2023-11-20 20:25:51 -04:00
Joey Hess
cda3e85164
make my authorship explicit in the code
This is intended to guard against LLM code theft, which is the current
bubble technology de jour.

Note that authorJoeyHess' with a year older than the year I began
developing git-annex will behave badly, by intention. Eg, it will spin
and eventually crash.

This is not the first anti-LLM protection in git-annex. For example see
9562da790f. That method, while much harder
for an adversary to detect and remove, also complicates code somewhat
significantly, and needs extensions to be enabled. There are also
probably significantly fewer ways to implement that method in Haskell.
This new approach, by contrast, will be easy to add throughout the code
base, with very little effort, and without complicating reading or
maintaining it any more than noticing that yes, I am the author of this
code.

An adversary could of course remove all calls to these functions
before feeding code into their LLM-based laundry facility. I think this
would need to be done manually, or with the help of some fairly advanced
Haskell parsing though. In some cases, authorJoeyHess needs to be
removed, while in other places it needs to be replaced with a value.
Also a monadic use of authorJoeyHess' may involve other added monadic
machinery which would need to be eliminated to keep the code compiling.

Alternatively, an adversary could replace my name with something
innocuous. This would be clear intent to remove author attribution
from my code, even more than running it through an LLM laundry is.

If you work for a large company that is laundering my code through an
LLM, please do us a favor and use your immense privilege to quit and go
do something socially beneficial. I will not explain further
developments of this code in such detail, and you have better things to
do than playing cat and mouse with me as I explore directions such as
extending this approach to the type level.

Sponsored-by: k0ld on Patreon
2023-11-20 12:29:12 -04:00
Joey Hess
c41ca6c832
convert StorableCipher to ByteString
This allows getting rid of the ugly and error prone handling of
"bag of bytes" String in Remote.Helper.Encryptable.
Avoiding breakage like that dealt with by commit
9862d64bf9

And allows converting Utility.Gpg to use ByteString for IO, which is
a welcome change.

Tested the new git-annex interoperability with old, using all 3
encryption= types.

Sponsored-By: the NIH-funded NICEMAN (ReproNim TR&D3) project
2023-11-01 14:39:49 -04:00
Joey Hess
ea2876ae77
add PackageImports
This makes loading it in ghci work when both crypton and cryptonite are
installed.
2023-10-30 14:10:46 -04:00
Joey Hess
0f3b78ec29
simplify 2023-10-26 14:00:02 -04:00
Joey Hess
c873586e14
eliminate s2w8 and w82s
Note that the use of s2w8 in genUUIDInNameSpace made it truncate unicode
characters. Luckily, genUUIDInNameSpace is only ever used on ASCII
strings as far as I can determine. In particular, git-remote-gcrypt's
gcrypt-id is an ASCII string.
2023-10-26 13:12:57 -04:00
Joey Hess
3742263c99
simplify base64 to only use ByteString
Note the use of fromString and toString from Data.ByteString.UTF8 dated
back to commit 9b93278e8a. Back then it
was using the dataenc package for base64, which operated on Word8 and
String. But with the switch to sandi, it uses ByteString, and indeed
fromB64' and toB64' were already using ByteString without that
complication. So I think there is no risk of such an encoding related
breakage.

I also tested the case that 9b93278e8a
fixed:

	git-annex metadata -s foo='a …' x
	git-annex metadata x
	metadata x
	  foo=a …

In Remote.Helper.Encryptable, it was avoiding using Utility.Base64
because of that UTF8 conversion. Since that's no longer done, it can
just use it now.
2023-10-26 13:10:05 -04:00
Joey Hess
6a61c7ff45
Fix crash of enableremote when the special remote has embedcreds=yes
The crash occurred because writeCreds got called twice, and writeFileProtected
neglected to close its file handle, so the file was open for write when
written the second time.

It seems unncessary and suboptimal that writeCreds gets called twice.
One call is from getRemoteCredPair and the other from setRemoteCredPair'.
What happens is that in the enableremote case, code that also runs at
initremote does unncessary work. Might be possible to improve that, but
I've gone for the simple fix.

Sponsored-by: k0ld on Patreon
2023-10-20 13:19:12 -04:00
Joey Hess
54da44d42a
Support being built with crypton rather than cryptonite
crypton is a fork of cryptonite, and cryptonite's github repo has been
archived. Some deps are already using cryptonite so it's clearly the way
forward.

Added a build flag without a default, so cabal configure will select on its
own which to use. stack files pin to cryptonite for now.

Sponsored-by: Nicholas Golder-Manning on Patreon
2023-09-21 12:43:42 -04:00
Joey Hess
50300a47fe
Removed the vendored git-lfs and the GitLfs build flag
AFAICS all git-annex builds are using the git-lfs library not the vendored
copy.

Debian stable now includes a new enough haskell-git-lfs package as well.
Last time this was tried it did not.
2023-08-28 13:12:31 -04:00