Improve SHA*E extension extraction code

Do not treat parts of the filename that contain punctuation or other
non-alphanumeric characters as extensions. Before, such characters were
filtered out.

Note that in 45308ec78b "foo.ba__________r"
was munged to ".bar" and so incorrectly treated as an extension. That was
fixed by changing the filter order, but not allowing punctuation seems a
better fix.

This assumes that extensions containing punctuation are rare. "_" seems the
most likely character; I used it in ikiwiki "._comment" files. But I can't
recall seeing it anywhere else. It certainly seems that no commonly used
extensions contain punctuation. If git-annex doesn't treat "._comment"
as an extension, it's not likely to break software that expects to see that
extension like some software expects to see .epub or .mp3.

This commit was sponsored by Jack Hill on Patreon.
Joey Hess 2018-03-05 11:25:01 -04:00
parent 6d6e3c6c49
commit 07e253b1fb
4 changed files with 34 additions and 1 deletions


@@ -94,7 +94,7 @@ selectExtension f
     | otherwise = intercalate "." ("":es)
   where
     es = filter (not . null) $ reverse $
-        take 2 $ map (filter validInExtension) $
+        take 2 $ filter (all validInExtension) $
         takeWhile shortenough $
         reverse $ splitc '.' $ takeExtensions f
     shortenough e = length e <= 4 -- long enough for "jpeg"
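
The effect of swapping `map (filter validInExtension)` for `filter (all validInExtension)` can be tried out with a minimal standalone sketch. It is not the git-annex source: `validInExtension` is assumed here to accept only alphanumeric characters, `splitc` is a stand-in for git-annex's helper of the same name, and `selectWith`, `oldSelect` and `newSelect` are hypothetical names introduced only for the comparison.

```haskell
import Data.Char (isAlphaNum)
import Data.List (intercalate)
import System.FilePath (takeExtensions)

-- Assumption: only alphanumeric characters are valid in an extension.
validInExtension :: Char -> Bool
validInExtension = isAlphaNum

-- Stand-in for git-annex's splitc: split a string on a separator character.
splitc :: Char -> String -> [String]
splitc c s = case break (== c) s of
    (i, _ : rest) -> i : splitc c rest
    (i, []) -> [i]

-- Hypothetical helper: the selection pipeline from the hunk above,
-- parameterised over the step that handles invalid characters.
selectWith :: ([String] -> [String]) -> FilePath -> String
selectWith clean f
    | null es = ""
    | otherwise = intercalate "." ("" : es)
  where
    es = filter (not . null) $ reverse $
        take 2 $ clean $
        takeWhile shortenough $
        reverse $ splitc '.' $ takeExtensions f
    shortenough e = length e <= 4 -- long enough for "jpeg"

-- Before: invalid characters are stripped out of each candidate part.
oldSelect :: FilePath -> String
oldSelect = selectWith (map (filter validInExtension))

-- After: any part containing an invalid character is dropped entirely.
newSelect :: FilePath -> String
newSelect = selectWith (filter (all validInExtension))
```

Under these assumptions, `oldSelect "song (feat. xy).mp3"` yields `".xy.mp3"` while `newSelect` yields `".mp3"`; `oldSelect "foo.ba_r"` yields `".bar"` while `newSelect` yields `""`; and both yield `".tar.gz"` for `"foo.tar.gz"`.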


@@ -3,6 +3,9 @@ git-annex (6.20180228) UNRELEASED; urgency=medium
   * Support exporttree=yes for rsync special remotes.
   * Dial back optimisation when building on arm, which prevents
     ghc and llc from running out of memory when optimising some files.
+  * Improve SHA*E extension extraction code to not treat parts of the
+    filename that contain punctuation or other non-alphanumeric characters
+    as extensions. Before, such characters were filtered out.
  -- Joey Hess <id@joeyh.name> Wed, 28 Feb 2018 11:53:03 -0400


@ -3,6 +3,8 @@ Files with special unicode characters(in this case japanese) for some reason hav
This is an issue because it causes errors when using glacier-cli when uploading copies to Glacier vault.
[[!meta title="kanji in key extension cause glacier-cli upload error"]]
### What steps will reproduce the problem?
Here's how it looks for me:


@ -0,0 +1,28 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2018-03-05T14:47:20Z"
content="""
The easy workaround to bugs like this is to migrate the file to the
SHA256 backend rather than SHA256E.
It may be obvious to us that a file ending in "(feat. xy).mp3"
has an extension of ".mp3" and not of ". xy).mp3", but this is not very
obvious to git-annex, which would like to treat a file ending in ".tar.gz"
as having that compound extension.
The only rule I can think of that would help git-annex understand this is
to not allow punctuation (other than ".") in file extensions. It
actually already filters punctuation out of extensions, which is why the
extension it comes up with is ".xy.mp3". But it could notice the space and
closing paren in the filename and assume those are not part of an
extension. It might bite some file with an extension like ".foo_", but I
can't recall seeing many such extensions. Ok, made this change.
It remains a bug in the glacier special remote if unicode characters
prevent uploading to it. We can't limit file extensions to ASCII; it's
perfectly reasonable to use your native language characters in a file
extension. Leaving the bug open, since my change does nothing about whatever
upload bug glacier-cli has. Is the python program failing?
"""]]