S3: Allow removing files from IA, but warn about derived versions potentially still existing there.

Removal works, only derives are a potential issue, so allow removing
with a warning. This way, unexporting a file works, and behavior is
consistent with IA remotes whether or not exporttree=yes.
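
As a rough illustration of the warn-then-proceed behavior described above,
here is a minimal standalone sketch. It uses plain IO rather than the Annex
monad, and RemoteInfo, isInternetArchive and warnRemoval are hypothetical
names made up for this example, not the identifiers used in the commit.

```haskell
import Control.Monad (when)
import System.IO (hPutStrLn, stderr)

-- Hypothetical stand-in for the remote's configuration; it carries only
-- the one flag this example needs.
newtype RemoteInfo = RemoteInfo { isInternetArchive :: Bool }

-- Run a removal action. When the remote is the Internet Archive, print a
-- warning to stderr first. The action's result is passed through unchanged,
-- so the removal is allowed rather than refused.
warnRemoval :: RemoteInfo -> IO a -> IO a
warnRemoval info action = do
    when (isInternetArchive info) $
        hPutStrLn stderr
            "Derived versions of removed file may still be present in the Internet Archive"
    action
```

Because the wrapper is polymorphic in the action's result, the same function
can wrap both the key-based removal and the export removal.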

Also tested exporting filenames containing unicode, spaces, underscores.
All worked, despite the IA's FAQ saying they don't.

This commit was sponsored by Trenton Cronholm on Patreon.
Joey Hess 2017-09-12 12:33:08 -04:00
parent 7f0e2a4685
commit 267f47c473
GPG key ID: DB12DB0FF05F8F38
4 changed files with 33 additions and 23 deletions


@ -9,6 +9,8 @@ git-annex (6.20170819) UNRELEASED; urgency=medium
* Support building with feed-1.0, while still supporting older versions.
* init: Display an additional message when it detects a filesystem that
allows writing to files whose write bit is not set.
* S3: Allow removing files from IA, but warn about derived versions
potentially still existing there.
-- Joey Hess <id@joeyh.name> Mon, 28 Aug 2017 12:20:59 -0400


@ -278,14 +278,17 @@ retrieveCheap _ _ _ = return False
- While it may remove the file, there are generally other files
- derived from it that it does not remove. -}
remove :: S3Info -> S3Handle -> Remover
remove info h k
remove info h k = warnIARemoval info $ do
res <- tryNonAsync $ sendS3Handle h $
S3.DeleteObject (T.pack $ bucketObject info k) (bucket info)
return $ either (const False) (const True) res
warnIARemoval :: S3Info -> Annex a -> Annex a
warnIARemoval info a
| isIA info = do
warning "Cannot remove content from the Internet Archive"
return False
| otherwise = do
res <- tryNonAsync $ sendS3Handle h $
S3.DeleteObject (T.pack $ bucketObject info k) (bucket info)
return $ either (const False) (const True) res
warning "Derived versions of removed file may still be present in the Internet Archive"
a
| otherwise = a
checkKey :: Remote -> S3Info -> Maybe S3Handle -> CheckPresent
checkKey r info Nothing k = case getpublicurl info of
@ -342,7 +345,7 @@ retrieveExportS3 r info _k loc f p =
return True
removeExportS3 :: Remote -> S3Info -> Key -> ExportLocation -> Annex Bool
removeExportS3 r info _k loc =
removeExportS3 r info _k loc = warnIARemoval info $
catchNonAsync go (\e -> warning (show e) >> return False)
where
go = withS3Handle (config r) (gitconfig r) (uuid r) $ \h -> do
@ -620,9 +623,9 @@ getBucketObject c = munge . key2file
getBucketExportLocation :: RemoteConfig -> ExportLocation -> FilePath
getBucketExportLocation c (ExportLocation loc) = getFilePrefix c ++ loc
{- Internet Archive limits filenames to a subset of ascii,
- with no whitespace. Other characters are xml entity
- encoded. -}
{- Internet Archive documentation limits filenames to a subset of ascii.
- While other characters seem to work now, this entity encodes everything
- else to avoid problems. -}
iaMunge :: String -> String
iaMunge = (>>= munge)
where
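
The body of iaMunge is cut off in this hunk. As a rough idea of what
"entity encodes everything else" can look like, here is a separate minimal
sketch. The names isSafe and entityEncode, the set of characters passed
through, and the use of XML numeric character references are all assumptions
made for this example; they are not taken from the commit.

```haskell
import Data.Char (isAlphaNum, isAscii, ord)

-- Characters passed through untouched. This set is an assumption made for
-- the example, not the list the Internet Archive documents.
isSafe :: Char -> Bool
isSafe c = (isAscii c && isAlphaNum c) || c `elem` "_-."

-- Replace every character outside the safe set with an XML numeric
-- character reference built from its code point.
entityEncode :: String -> String
entityEncode = concatMap encode
  where
    encode c
        | isSafe c  = [c]
        | otherwise = "&#" ++ show (ord c) ++ ";"
```

With these assumptions, a space (code point 32) in a filename would be
emitted as `&#32;` while letters, digits, `_`, `-` and `.` are left alone.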


@ -11,9 +11,10 @@ comply with their [terms of service](http://www.archive.org/about/terms.php).
A nice added feature is that whenever git-annex sends a file to the
Internet Archive, it records its url, the same as if you'd run `git annex
addurl`. So any users who can clone your repository can download the files
from archive.org, without needing any login or password info. This makes
the Internet Archive a nice way to publish the large files associated with
a public git repository.
from archive.org, without needing any login or password info.
The url to the content in the Internet Archive is also displayed by
`git annex whereis`. This makes the Internet Archive a nice way to
publish the large files associated with a public git repository.
## webapp setup
@ -50,10 +51,15 @@ Then you can annex files and copy them to the remote as usual:
# git annex copy photo1.jpeg --fast --to archive-panama
copy (to archive-panama...) ok
Once a file has been stored on archive.org, it cannot be (easily) removed
from it. Also, git-annex whereis will tell you a public url for the file
on archive.org. (It may take a while for archive.org to make the file
publicly visible.)
It may take a while for archive.org to make files publicly visible after
they've been uploaded.
## removing files
While files can be removed from the Internet Archive,
[derived versions](https://archive.org/help/derivatives.php)
of some files may continue to be stored there after the originals
were removed. git-annex warns about this problem.
## exporting trees
@ -63,6 +69,7 @@ are important, you can run `git annex initremote` with an additional
parameter "exporttree=yes", and then use [[git-annex-export]] to publish
a tree of files to the Internet Archive.
Note that the Internet Archive does not support filenames containing
whitespace and some other characters. Exporting such problem filenames will
fail; you can rename the file and re-export.
Note that the Internet Archive may not support certain characters
in filenames ([see FAQ](http://archive.org/about/faqs.php#1099)).
If exporting a filename fails due to such limitations, you would need
to rename it in your git annex repository in order to export it.


@ -29,8 +29,6 @@ Work is in progress. Todo list:
Would need git-annex sync to export to the master tree?
This is similar to the little-used preferreddir= preferred content
setting and the "public" repository group.
* Test export to IA via S3. In particular, does removing an exported file
work?
Low priority: