add back inode to directory special remote ContentIdentifier

Directory special remotes with importtree=yes have changed to once more
take inodes into account. This will cause extra work when importing from a
directory on a FAT filesystem that changes inodes on every mount.

To avoid that extra work, set ignoreinodes=yes when initializing a new
directory special remote, or change the configuration of your existing
remote: git-annex enableremote foo ignoreinodes=yes

This will mean a one-time re-import of all contents from every directory
special remote due to the changed setting.

73df633a62 thought
it was too unlikely that there would be modifications that the inode number
was needed to notice. That was probably right; it's very unlikely that a
file will get modified and end up with the same size and mtime as before.
But, what was not considered is that a program like NextCloud might write
two files with different content so closely together that they share the
mtime. The inode is necessary to detect that situation.

Sponsored-by: Max Thoursie on Patreon
This commit is contained in:
Joey Hess 2022-03-21 13:12:02 -04:00
parent d219626794
commit 3e2f1f73cb
No known key found for this signature in database
GPG key ID: DB12DB0FF05F8F38
6 changed files with 117 additions and 45 deletions

View file

@ -21,6 +21,13 @@ git-annex (10.20220223) UNRELEASED; urgency=medium
Thanks, sternenseemann for the patch. Thanks, sternenseemann for the patch.
* test: Runs tests in parallel to speed up the test suite. * test: Runs tests in parallel to speed up the test suite.
* test: Added --jobs option. * test: Added --jobs option.
* Directory special remotes with importtree=yes have changed to once more
take inodes into account. This will cause extra work when importing
from a directory on a FAT filesystem that changes inodes on every
mount. To avoid that extra work, set ignoreinodes=yes when initializing
a new directory special remote, or change the configuration of your
existing remote:
git-annex enableremote foo ignoreinodes=yes
-- Joey Hess <id@joeyh.name> Wed, 23 Feb 2022 14:14:09 -0400 -- Joey Hess <id@joeyh.name> Wed, 23 Feb 2022 14:14:09 -0400

View file

@ -1,6 +1,6 @@
{- A "remote" that is just a filesystem directory. {- A "remote" that is just a filesystem directory.
- -
- Copyright 2011-2021 Joey Hess <id@joeyh.name> - Copyright 2011-2022 Joey Hess <id@joeyh.name>
- -
- Licensed under the GNU AGPL version 3 or higher. - Licensed under the GNU AGPL version 3 or higher.
-} -}
@ -54,6 +54,8 @@ remote = specialRemoteType $ RemoteType
, configParser = mkRemoteConfigParser , configParser = mkRemoteConfigParser
[ optionalStringParser directoryField [ optionalStringParser directoryField
(FieldDesc "(required) where the special remote stores data") (FieldDesc "(required) where the special remote stores data")
, yesNoParser ignoreinodesField (Just False)
(FieldDesc "ignore inodes when importing/exporting")
] ]
, setup = directorySetup , setup = directorySetup
, exportSupported = exportIsSupported , exportSupported = exportIsSupported
@ -64,12 +66,17 @@ remote = specialRemoteType $ RemoteType
directoryField :: RemoteConfigField directoryField :: RemoteConfigField
directoryField = Accepted "directory" directoryField = Accepted "directory"
ignoreinodesField :: RemoteConfigField
ignoreinodesField = Accepted "ignoreinodes"
gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote) gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote)
gen r u rc gc rs = do gen r u rc gc rs = do
c <- parsedRemoteConfig remote rc c <- parsedRemoteConfig remote rc
cst <- remoteCost gc cheapRemoteCost cst <- remoteCost gc cheapRemoteCost
let chunkconfig = getChunkConfig c let chunkconfig = getChunkConfig c
cow <- liftIO newCopyCoWTried cow <- liftIO newCopyCoWTried
let ii = IgnoreInodes $ fromMaybe True $
getRemoteConfigValue ignoreinodesField c
return $ Just $ specialRemote c return $ Just $ specialRemote c
(storeKeyM dir chunkconfig cow) (storeKeyM dir chunkconfig cow)
(retrieveKeyFileM dir chunkconfig cow) (retrieveKeyFileM dir chunkconfig cow)
@ -99,15 +106,15 @@ gen r u rc gc rs = do
, renameExport = renameExportM dir , renameExport = renameExportM dir
} }
, importActions = ImportActions , importActions = ImportActions
{ listImportableContents = listImportableContentsM dir { listImportableContents = listImportableContentsM ii dir
, importKey = Just (importKeyM dir) , importKey = Just (importKeyM ii dir)
, retrieveExportWithContentIdentifier = retrieveExportWithContentIdentifierM dir cow , retrieveExportWithContentIdentifier = retrieveExportWithContentIdentifierM ii dir cow
, storeExportWithContentIdentifier = storeExportWithContentIdentifierM dir cow , storeExportWithContentIdentifier = storeExportWithContentIdentifierM ii dir cow
, removeExportWithContentIdentifier = removeExportWithContentIdentifierM dir , removeExportWithContentIdentifier = removeExportWithContentIdentifierM ii dir
-- Not needed because removeExportWithContentIdentifier -- Not needed because removeExportWithContentIdentifier
-- auto-removes empty directories. -- auto-removes empty directories.
, removeExportDirectoryWhenEmpty = Nothing , removeExportDirectoryWhenEmpty = Nothing
, checkPresentExportWithContentIdentifier = checkPresentExportWithContentIdentifierM dir , checkPresentExportWithContentIdentifier = checkPresentExportWithContentIdentifierM ii dir
} }
, whereisKey = Nothing , whereisKey = Nothing
, remoteFsck = Nothing , remoteFsck = Nothing
@ -351,8 +358,8 @@ removeExportLocation topdir loc =
mkExportLocation loc' mkExportLocation loc'
in go (upFrom loc') =<< tryIO (removeDirectory p) in go (upFrom loc') =<< tryIO (removeDirectory p)
listImportableContentsM :: RawFilePath -> Annex (Maybe (ImportableContentsChunkable Annex (ContentIdentifier, ByteSize))) listImportableContentsM :: IgnoreInodes -> RawFilePath -> Annex (Maybe (ImportableContentsChunkable Annex (ContentIdentifier, ByteSize)))
listImportableContentsM dir = liftIO $ do listImportableContentsM ii dir = liftIO $ do
l <- dirContentsRecursive (fromRawFilePath dir) l <- dirContentsRecursive (fromRawFilePath dir)
l' <- mapM (go . toRawFilePath) l l' <- mapM (go . toRawFilePath) l
return $ Just $ ImportableContentsComplete $ return $ Just $ ImportableContentsComplete $
@ -360,41 +367,38 @@ listImportableContentsM dir = liftIO $ do
where where
go f = do go f = do
st <- R.getFileStatus f st <- R.getFileStatus f
mkContentIdentifier f st >>= \case mkContentIdentifier ii f st >>= \case
Nothing -> return Nothing Nothing -> return Nothing
Just cid -> do Just cid -> do
relf <- relPathDirToFile dir f relf <- relPathDirToFile dir f
sz <- getFileSize' f st sz <- getFileSize' f st
return $ Just (mkImportLocation relf, (cid, sz)) return $ Just (mkImportLocation relf, (cid, sz))
-- Make a ContentIdentifier that contains the size and mtime of the file. newtype IgnoreInodes = IgnoreInodes Bool
-- Make a ContentIdentifier that contains the size and mtime of the file,
-- and also normally the inode, unless ignoreinodes=yes.
--
-- If the file is not a regular file, this will return Nothing. -- If the file is not a regular file, this will return Nothing.
-- mkContentIdentifier :: IgnoreInodes -> RawFilePath -> FileStatus -> IO (Maybe ContentIdentifier)
-- The inode is zeroed because often this is used for import from a mkContentIdentifier (IgnoreInodes ii) f st =
-- FAT filesystem, whose inodes change each time it's mounted, and liftIO $ fmap (ContentIdentifier . encodeBS . showInodeCache)
-- including inodes would cause repeated re-hashing of files, and <$> if ii
-- bloat the git-annex branch with changes to content identifier logs. then toInodeCache' noTSDelta f st 0
-- else toInodeCache noTSDelta f st
-- This does mean that swaps of two files with the same size and
-- mtime won't be noticed, nor will modifications to files that
-- preserve the size and mtime. Both very unlikely so acceptable.
mkContentIdentifier :: RawFilePath -> FileStatus -> IO (Maybe ContentIdentifier)
mkContentIdentifier f st =
fmap (ContentIdentifier . encodeBS . showInodeCache)
<$> toInodeCache' noTSDelta f st 0
guardSameContentIdentifiers :: a -> ContentIdentifier -> Maybe ContentIdentifier -> a guardSameContentIdentifiers :: a -> ContentIdentifier -> Maybe ContentIdentifier -> a
guardSameContentIdentifiers cont old new guardSameContentIdentifiers cont old new
| new == Just old = cont | new == Just old = cont
| otherwise = giveup "file content has changed" | otherwise = giveup "file content has changed"
importKeyM :: RawFilePath -> ExportLocation -> ContentIdentifier -> ByteSize -> MeterUpdate -> Annex (Maybe Key) importKeyM :: IgnoreInodes -> RawFilePath -> ExportLocation -> ContentIdentifier -> ByteSize -> MeterUpdate -> Annex (Maybe Key)
importKeyM dir loc cid sz p = do importKeyM ii dir loc cid sz p = do
backend <- chooseBackend f backend <- chooseBackend f
unsizedk <- fst <$> genKey ks p backend unsizedk <- fst <$> genKey ks p backend
let k = alterKey unsizedk $ \kd -> kd let k = alterKey unsizedk $ \kd -> kd
{ keySize = keySize kd <|> Just sz } { keySize = keySize kd <|> Just sz }
currcid <- liftIO $ mkContentIdentifier absf currcid <- liftIO $ mkContentIdentifier ii absf
=<< R.getFileStatus absf =<< R.getFileStatus absf
guardSameContentIdentifiers (return (Just k)) cid currcid guardSameContentIdentifiers (return (Just k)) cid currcid
where where
@ -406,8 +410,8 @@ importKeyM dir loc cid sz p = do
, inodeCache = Nothing , inodeCache = Nothing
} }
retrieveExportWithContentIdentifierM :: RawFilePath -> CopyCoWTried -> ExportLocation -> ContentIdentifier -> FilePath -> Annex Key -> MeterUpdate -> Annex Key retrieveExportWithContentIdentifierM :: IgnoreInodes -> RawFilePath -> CopyCoWTried -> ExportLocation -> ContentIdentifier -> FilePath -> Annex Key -> MeterUpdate -> Annex Key
retrieveExportWithContentIdentifierM dir cow loc cid dest mkkey p = retrieveExportWithContentIdentifierM ii dir cow loc cid dest mkkey p =
precheck docopy precheck docopy
where where
f = exportPath dir loc f = exportPath dir loc
@ -449,7 +453,7 @@ retrieveExportWithContentIdentifierM dir cow loc cid dest mkkey p =
-- Check before copy, to avoid expensive copy of wrong file -- Check before copy, to avoid expensive copy of wrong file
-- content. -- content.
precheck cont = guardSameContentIdentifiers cont cid precheck cont = guardSameContentIdentifiers cont cid
=<< liftIO . mkContentIdentifier f =<< liftIO . mkContentIdentifier ii f
=<< liftIO (R.getFileStatus f) =<< liftIO (R.getFileStatus f)
-- Check after copy, in case the file was changed while it was -- Check after copy, in case the file was changed while it was
@ -470,7 +474,7 @@ retrieveExportWithContentIdentifierM dir cow loc cid dest mkkey p =
#else #else
postchecknoncow cont = do postchecknoncow cont = do
#endif #endif
currcid <- liftIO $ mkContentIdentifier f currcid <- liftIO $ mkContentIdentifier ii f
#ifndef mingw32_HOST_OS #ifndef mingw32_HOST_OS
=<< getFdStatus fd =<< getFdStatus fd
#else #else
@ -484,22 +488,22 @@ retrieveExportWithContentIdentifierM dir cow loc cid dest mkkey p =
-- the modified version was copied CoW, and then the file was -- the modified version was copied CoW, and then the file was
-- restored to the original content before this check. -- restored to the original content before this check.
postcheckcow cont = do postcheckcow cont = do
currcid <- liftIO $ mkContentIdentifier f currcid <- liftIO $ mkContentIdentifier ii f
=<< R.getFileStatus f =<< R.getFileStatus f
guardSameContentIdentifiers cont cid currcid guardSameContentIdentifiers cont cid currcid
storeExportWithContentIdentifierM :: RawFilePath -> CopyCoWTried -> FilePath -> Key -> ExportLocation -> [ContentIdentifier] -> MeterUpdate -> Annex ContentIdentifier storeExportWithContentIdentifierM :: IgnoreInodes -> RawFilePath -> CopyCoWTried -> FilePath -> Key -> ExportLocation -> [ContentIdentifier] -> MeterUpdate -> Annex ContentIdentifier
storeExportWithContentIdentifierM dir cow src _k loc overwritablecids p = do storeExportWithContentIdentifierM ii dir cow src _k loc overwritablecids p = do
liftIO $ createDirectoryUnder dir (toRawFilePath destdir) liftIO $ createDirectoryUnder dir (toRawFilePath destdir)
withTmpFileIn destdir template $ \tmpf tmph -> do withTmpFileIn destdir template $ \tmpf tmph -> do
liftIO $ hClose tmph liftIO $ hClose tmph
void $ fileCopier cow src tmpf p Nothing void $ fileCopier cow src tmpf p Nothing
let tmpf' = toRawFilePath tmpf let tmpf' = toRawFilePath tmpf
resetAnnexFilePerm tmpf' resetAnnexFilePerm tmpf'
liftIO (getFileStatus tmpf) >>= liftIO . mkContentIdentifier tmpf' >>= \case liftIO (getFileStatus tmpf) >>= liftIO . mkContentIdentifier ii tmpf' >>= \case
Nothing -> giveup "unable to generate content identifier" Nothing -> giveup "unable to generate content identifier"
Just newcid -> do Just newcid -> do
checkExportContent dir loc checkExportContent ii dir loc
overwritablecids overwritablecids
(giveup "unsafe to overwrite file") (giveup "unsafe to overwrite file")
(const $ liftIO $ rename tmpf dest) (const $ liftIO $ rename tmpf dest)
@ -509,16 +513,16 @@ storeExportWithContentIdentifierM dir cow src _k loc overwritablecids p = do
(destdir, base) = splitFileName dest (destdir, base) = splitFileName dest
template = relatedTemplate (base ++ ".tmp") template = relatedTemplate (base ++ ".tmp")
removeExportWithContentIdentifierM :: RawFilePath -> Key -> ExportLocation -> [ContentIdentifier] -> Annex () removeExportWithContentIdentifierM :: IgnoreInodes -> RawFilePath -> Key -> ExportLocation -> [ContentIdentifier] -> Annex ()
removeExportWithContentIdentifierM dir k loc removeablecids = removeExportWithContentIdentifierM ii dir k loc removeablecids =
checkExportContent dir loc removeablecids (giveup "unsafe to remove modified file") $ \case checkExportContent ii dir loc removeablecids (giveup "unsafe to remove modified file") $ \case
DoesNotExist -> return () DoesNotExist -> return ()
KnownContentIdentifier -> removeExportM dir k loc KnownContentIdentifier -> removeExportM dir k loc
checkPresentExportWithContentIdentifierM :: RawFilePath -> Key -> ExportLocation -> [ContentIdentifier] -> Annex Bool checkPresentExportWithContentIdentifierM :: IgnoreInodes -> RawFilePath -> Key -> ExportLocation -> [ContentIdentifier] -> Annex Bool
checkPresentExportWithContentIdentifierM dir _k loc knowncids = checkPresentExportWithContentIdentifierM ii dir _k loc knowncids =
checkPresentGeneric' dir $ checkPresentGeneric' dir $
checkExportContent dir loc knowncids (return False) $ \case checkExportContent ii dir loc knowncids (return False) $ \case
DoesNotExist -> return False DoesNotExist -> return False
KnownContentIdentifier -> return True KnownContentIdentifier -> return True
@ -539,12 +543,12 @@ data CheckResult = DoesNotExist | KnownContentIdentifier
-- --
-- So, it suffices to check if the destination file's current -- So, it suffices to check if the destination file's current
-- content is known, and immediately run the callback. -- content is known, and immediately run the callback.
checkExportContent :: RawFilePath -> ExportLocation -> [ContentIdentifier] -> Annex a -> (CheckResult -> Annex a) -> Annex a checkExportContent :: IgnoreInodes -> RawFilePath -> ExportLocation -> [ContentIdentifier] -> Annex a -> (CheckResult -> Annex a) -> Annex a
checkExportContent dir loc knowncids unsafe callback = checkExportContent ii dir loc knowncids unsafe callback =
tryWhenExists (liftIO $ R.getFileStatus dest) >>= \case tryWhenExists (liftIO $ R.getFileStatus dest) >>= \case
Just destst Just destst
| not (isRegularFile destst) -> unsafe | not (isRegularFile destst) -> unsafe
| otherwise -> catchDefaultIO Nothing (liftIO $ mkContentIdentifier dest destst) >>= \case | otherwise -> catchDefaultIO Nothing (liftIO $ mkContentIdentifier ii dest destst) >>= \case
Just destcid Just destcid
| destcid `elem` knowncids -> callback KnownContentIdentifier | destcid `elem` knowncids -> callback KnownContentIdentifier
-- dest exists with other content -- dest exists with other content

View file

@ -61,3 +61,5 @@ Unfortunately, I don't have the output of git-annex anymore. However, the used c
### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) ### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I've been successfully using git-annex to manage the growing collection of photos I take for several years now. Synchronizing the collection across several disks while files are renamed and some disks are only rarely synchronized and some contain only a subset of the photos wouldn't have been possible without git-annex. Also, using git-annex gives me the safety that the original photos will never be modified. There is nothing better than using git-annex to verify that all photos have been copied unmodified from a memory card before wiping it. I've been successfully using git-annex to manage the growing collection of photos I take for several years now. Synchronizing the collection across several disks while files are renamed and some disks are only rarely synchronized and some contain only a subset of the photos wouldn't have been possible without git-annex. Also, using git-annex gives me the safety that the original photos will never be modified. There is nothing better than using git-annex to verify that all photos have been copied unmodified from a memory card before wiping it.
> [[fixed|done]] --[[Joey]]

View file

@ -0,0 +1,36 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2022-03-21T16:20:27Z"
content="""
The directory special remote use the size and mtime as the
ContentIdentifier of a file. It used to use inode number as well, but
some filesystems make up new inodes on each mount so that was dropped.
So when two files have the same size and mtime, it will import one of them,
and then when it comes to the other one, see that it already has that
ContentIdentifier and use the content it already imported.
For example:
joey@darkstar:/tmp/t>git-annex initremote d type=directory directory=../d importtree=yes exporttree=yes encryption=none
joey@darkstar:/tmp/t>echo hi > ../d/1
joey@darkstar:/tmp/t>echo no > ../d/2
joey@darkstar:/tmp/t>touch --reference=../d/1 ../d/2
joey@darkstar:/tmp/t>git-annex import master --from d
joey@darkstar:/tmp/t>git merge d/master
joey@darkstar:/tmp/t>cmp 1 2
joey@darkstar:/tmp/t>
Perhaps NextCloud is for some reason giving multiple files the same mtime?
This is rather unlikely to happen naturally, especially on modern
filesystems that have high resolution mtimes. Even "echo 1> 1; echo 2>2"
generates files with 2 different mtimes. To get the same mtime, the two
files have to be written by the same process at very close to the same
time.
Probably that's why NextCloud managed to get the same mtime, although why
it would be rewriting the content of the file when the content had not
changed I don't know and perhaps that's a bug of some kind in it.
"""]]

View file

@ -0,0 +1,15 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2022-03-21T16:38:43Z"
content="""
So this makes me think that [[!73df633a6215faa093e4a7524c6d328aa988aed1]]
needs to be reverted and the inode number used again. But that did fix a
bug of a sort, that importing from a fat filesystem after it was remounted
would re-import everything. So I suppose the best that can be done is to
make it configurable whether inodes should be ignored or not when importing
from a directory special remote. And presumably default to not ignoring
them.
done
"""]]

View file

@ -39,6 +39,14 @@ remote:
by [[git-annex-import]]. It will not be usable as a general-purpose by [[git-annex-import]]. It will not be usable as a general-purpose
special remote. special remote.
* `ignoreinodes` - Usually when importing, the inode numbers
of files are used to detect when files have changed. Since some
filesystems generate new inode numbers each time they are mounted,
that can lead to extra work being done. Setting this to "yes" will
ignore the inode numbers and so avoid that extra work.
This should not be used when the filesystem has stable inode numbers,
as it does risk confusing two files that have the same size and mtime.
Setup example: Setup example:
# git annex initremote usbdrive type=directory directory=/media/usbdrive/ encryption=none # git annex initremote usbdrive type=directory directory=/media/usbdrive/ encryption=none