Note the use of fromString and toString from Data.ByteString.UTF8 dated
back to commit 9b93278e8a. Back then it
was using the dataenc package for base64, which operated on Word8 and
String. But with the switch to sandi, it uses ByteString, and indeed
fromB64' and toB64' were already using ByteString without that
complication. So I think there is no risk of such an encoding related
breakage.
I also tested the case that 9b93278e8a
fixed:
git-annex metadata -s foo='a …' x
git-annex metadata x
metadata x
foo=a …
In Remote.Helper.Encryptable, it was avoiding using Utility.Base64
because of that UTF8 conversion. Since that's no longer done, it can
just use it now.
In aeson 2.0, Text has been replaced by the Key type and HashMap by the
KeyMap interface. Accomodating this required adding some CPP in order to
still be able to compile with aeson < 2.0. The required changes were:
* Prevent Key from being re-exported by Utilities.Aeson, as it clashes
with git-annex's own Key type.
* Fix up convertion from String/Text to Key (or Text in aeson 1.*) in a
couple of places
* Import Data.Aeson.KeyMap instead of Data.HashMap.Strict, as they are
mostly API-compatible. insertWith needs to be replaced by unionWith,
however, as KeyMap lacks the former function.
This does not change the overall license of the git-annex program, which
was already AGPL due to a number of sources files being AGPL already.
Legally speaking, I'm adding a new license under which these files are
now available; I already released their current contents under the GPL
license. Now they're dual licensed GPL and AGPL. However, I intend
for all my future changes to these files to only be released under the
AGPL license, and I won't be tracking the dual licensing status, so I'm
simply changing the license statement to say it's AGPL.
(In some cases, others wrote parts of the code of a file and released it
under the GPL; but in all cases I have contributed a significant portion
of the code in each file and it's that code that is getting the AGPL
license; the GPL license of other contributors allows combining with
AGPL code.)
This fixes a reversion in the ByteString conversion. The old code used
isSpace to decide when the metadata value needs to be base64 encoded,
and that incorrectly changed to only checking if it contained ' '.
Note that only '\n' and '\r' were added and not other sorts of
whitespace that isSpace matches, like '\t' and '\v'. Only the former
would cause problems.
MetaField was already limited to alphanumerics, so it makes sense to use
Text for it.
Note that technically a UUID can contain invalid UTF-8, and so
remoteMetaDataPrefix's use of T.pack . fromUUID could replace non-UTF8
values with '?' or whatever. In practice, a UUID is usually also text,
I only kept open the possibility of it containing invalid UTF-8 to avoid
breaking parsing of strange UUIDs in git-annex branch files. So, I
decided to let this edge case slip by.
Have not updated the rest of the code base yet for this change, as the
change took 2.5 hours longer than I expected to get working properly.
Since the same key can be stored in a versioned S3 bucket multiple times
with different version IDs, this allows tracking them all. Not currently
needed, but if we ever want to drop from a versioned S3 bucket, we'll
need to know them all.
This commit was supported by the NSF-funded DataLad project.
Actually very straightforward reuse of the metadata log file code.
Although I had to add a todo item as git-annex forget won't clean up
dead remote's metadata yet.
This would be worth adding to the external special remote interface
sometime. Have not opened a todo though, guess I'll wait until something
needs it.
This commit was supported by the NSF-funded DataLad project.
Switch to Data.Map.Strict everywhere that used it.
There are still lots of lazy maps in git-annex. I think switching these
is safe. The risk is that there might be a map that is used in a way
that relies on the values not being evaluated to WHNF, and switching to
strict might result in bad performance or memory use. So, I have not
switched everything.
As long as all code imports Utility.Aeson rather than Data.Aeson,
and no Strings that may contain utf-8 characters are used for eg, object
keys via T.pack, this is guaranteed to fix the problem everywhere that
git-annex generates json.
It's kind of annoying to need to wrap ToJSON with a ToJSON', especially
since every data type that has a ToJSON instance has to be ported over.
However, that only took 50 lines of code, which is worth it to ensure full
coverage. I initially tried an alternative approach of a newtype FileEncoded,
which had to be used everywhere a String was fed into aeson, and chasing
down all the sites would have been far too hard. Did consider creating an
intentionally overlapping instance ToJSON String, and letting ghc fail
to build anything that passed in a String, but am not sure that wouldn't
pollute some library that git-annex depends on that happens to use ToJSON
String internally.
This commit was supported by the NSF-funded DataLad project.
When adding a new version of a file, and annex.genmetadata is enabled,
don't copy the data metadata from the old version of the file, instead use
the mtime of the file. Rationalle being that the user has requested to
generate metadata and so would expect to get the new mtime into metadata.
Also, avoid warning about copying metadata when all the old metadata is
date metadata. Which was rather the harder part.
This commit was sponsored by Boyd Stephen Smith Jr. on Patreon.
Motivation is to remove all metadata when it gets copied from a previous
version of the file, and that is not deisrable.
This commit was supported by the NSF-funded DataLad project.
metadata --json output format has changed, adding a inner json object
named "fields" which contains only the fields and their values.
This should be easier to parse than the old format, which mixed up
metadata fields with other keys in the json object.
Any consumers of the old format will need to be updated.
This adds a dependency on unordered-containers for parsing MetaData
from JSON, but it's a free dependency; aeson pulls in that library.
This was caused by 23e9d3bb77
an Arbitrary String is not necessarily encoded using the filesystem
encoding, and in a non-utf8 locale, encodeBS throws an exception on such a
string. All I could think to do is limit test data to ascii.
This shouldn't be a problem in practice, because the all Strings in
git-annex that are not generated by Arbitrary should be loaded in a way
that does apply the filesystem encoding.
The fix is to stop using w82s, which does not properly reconstitute unicode
strings. Instrad, use utf8 bytestring to get the [Word8] to base64. This
passes unicode through perfectly, including any invalid filesystem encoded
characters.
Note that toB64 / fromB64 are also used for creds and cipher
embedding. It would be unfortunate if this change broke those uses.
For cipher embedding, note that ciphers can contain arbitrary bytes (should
really be using ByteString.Char8 there). Testing indicated it's not safe to
use the new fromB64 there; I think that characters were incorrectly
combined.
For credpair embedding, the username or password could contain unicode.
Before, that unicode would fail to round-trip through the b64.
So, I guess this is not going to break any embedded creds that worked
before.
This bug may have affected some creds before, and if so,
this change will not fix old ones, but should fix new ones at least.
This fixes all instances of " \t" in the code base. Most common case
seems to be after a "where" line; probably vim copied the two space layout
of that line.
Done as a background task while listening to episode 2 of the Type Theory
podcast.
Note that this is a nearly entirely free feature. The data was already
stored in the metadata log in an easily accessible way, and already was
parsed to a time when parsing the log. The generation of the metadata
fields may even be done lazily, although probably not entirely (the map
has to be evaulated to when queried).
Using the extract(1) program to do the heavy lifting.
Decided to make git-annex run pre-commit-annex when committing. Since
git-annex pre-commit also runs it, it'll be run when git commit is run too,
via the pre-commit hook. This basically gives back the pre-commit hook
that git-annex took away. The implementation avoids repeatedly looking
for the hook script when the assistant is running and committing
repeatedly; only checks if the hook is available once.
To make the script simpler, made git-annex metadata -s field?=value
only set a field when it's not already got a value.
This commit was sponsored by bak.
So the user can now switch to a view and then move files around within it
to manage metadata. For example, moving a file into a new directory
when in the tags=* view adds a tag to it.
Implementation is fairly efficient. One diff-index, which is no more
expensive than the first stage of a git commit, followed by possibly
some cat-file --batch traffic to find the key (when deleting a file).
Very similar to what's done in direct mode when committing. And like
direct mode when updating the WC after a merge, it has to buffer the
diff-tree values in order to make 2 passes over them.
When not in a view, pre-commit now does one extra git symbolic-ref,
which is tiny overhead.
This commit was sponsored by Andrew Eskridge.
(And a vpop command, which is still a bit buggy.)
Still need to do vadd and vrm, though this also adds their documentation.
Currently not very happy with the view log data serialization. I had to
lose the TDFA regexps temporarily, so I can have Read/Show instances of
View. I expect the view log format will change in some incompatable way
later, probably adding last known refs for the parent branch to View
or something like that.
Anyway, it basically works, although it's a bit slow looking up the
metadata. The actual git branch construction is about as fast as it can be
using the current git plumbing.
This commit was sponsored by Peter Hogg.
Promosing work toward metadata driven filter branches. A few methods
to construct them are stubbed out; all the data types and pure code
seems good.
This commit was sponsored by Walter Somerville.
Adds metadata log, and command.
Note that unsetting field values seems to currently be broken.
And in general this has had all of 2 minutes worth of testing.
This commit was sponsored by Julien Lefrique.
A very haskell commit! Just data types, instances to serialize the metadata
to a nice format, and QuickCheck tests.
This commit was sponsored by Andreas Leha.