2018-01-02 21:17:10 +00:00
|
|
|
{- git-annex log files
|
|
|
|
-
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
- Copyright 2018-2023 Joey Hess <id@joeyh.name>
|
2018-01-02 21:17:10 +00:00
|
|
|
-
|
2018-10-25 18:43:13 +00:00
|
|
|
- Licensed under the GNU AGPL version 3 or higher.
|
2018-01-02 21:17:10 +00:00
|
|
|
-}
|
|
|
|
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
{-# LANGUAGE CPP, BangPatterns #-}
|
2020-10-20 20:42:28 +00:00
|
|
|
|
|
|
|
module Logs.File (
|
|
|
|
writeLogFile,
|
|
|
|
withLogHandle,
|
|
|
|
appendLogFile,
|
|
|
|
modifyLogFile,
|
|
|
|
streamLogFile,
|
fix deadlock in restagePointerFiles
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8af53f810982e002e435c0d7c040c59)
The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.
The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.
Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.
streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.
Sponsored-by: Dartmouth College's Datalad project
2022-12-08 18:18:54 +00:00
|
|
|
streamLogFileUnsafe,
|
2020-10-20 20:42:28 +00:00
|
|
|
checkLogFile,
|
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 18:38:59 +00:00
|
|
|
calcLogFile,
|
fix deadlock in restagePointerFiles
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8af53f810982e002e435c0d7c040c59)
The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.
The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.
Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.
streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.
Sponsored-by: Dartmouth College's Datalad project
2022-12-08 18:18:54 +00:00
|
|
|
calcLogFileUnsafe,
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
fileLines,
|
|
|
|
fileLines',
|
2020-10-20 20:42:28 +00:00
|
|
|
) where
|
2018-01-02 21:17:10 +00:00
|
|
|
|
|
|
|
import Annex.Common
|
|
|
|
import Annex.Perms
|
2018-10-25 18:43:13 +00:00
|
|
|
import Annex.LockFile
|
2019-05-20 20:37:04 +00:00
|
|
|
import Annex.ReplaceFile
|
2018-01-02 21:17:10 +00:00
|
|
|
import Utility.Tmp
|
|
|
|
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
import qualified Data.ByteString as S
|
|
|
|
import qualified Data.ByteString.Char8 as S8
|
2020-10-20 20:42:28 +00:00
|
|
|
import qualified Data.ByteString.Lazy as L
|
|
|
|
import qualified Data.ByteString.Lazy.Char8 as L8
|
|
|
|
|
2018-01-04 18:46:58 +00:00
|
|
|
-- | Writes content to a file, replacing the file atomically, and
|
|
|
|
-- making the new file have whatever permissions the git repository is
|
|
|
|
-- configured to use. Creates the parent directory when necessary.
|
2020-10-29 16:02:46 +00:00
|
|
|
writeLogFile :: RawFilePath -> String -> Annex ()
|
|
|
|
writeLogFile f c = createDirWhenNeeded f $ viaTmp writelog (fromRawFilePath f) c
|
2018-01-02 21:17:10 +00:00
|
|
|
where
|
2020-11-05 22:45:37 +00:00
|
|
|
writelog tmp c' = do
|
|
|
|
liftIO $ writeFile tmp c'
|
|
|
|
setAnnexFilePerm (toRawFilePath tmp)
|
2018-10-25 18:43:13 +00:00
|
|
|
|
2019-05-20 20:37:04 +00:00
|
|
|
-- | Runs the action with a handle connected to a temp file.
|
|
|
|
-- The temp file replaces the log file once the action succeeds.
|
2020-10-29 16:02:46 +00:00
|
|
|
withLogHandle :: RawFilePath -> (Handle -> Annex a) -> Annex a
|
2019-05-20 20:37:04 +00:00
|
|
|
withLogHandle f a = do
|
|
|
|
createAnnexDirectory (parentDir f)
|
2020-10-29 16:02:46 +00:00
|
|
|
replaceGitAnnexDirFile (fromRawFilePath f) $ \tmp ->
|
2019-05-20 20:37:04 +00:00
|
|
|
bracket (setup tmp) cleanup a
|
|
|
|
where
|
|
|
|
setup tmp = do
|
2023-10-26 17:36:49 +00:00
|
|
|
setAnnexFilePerm tmp
|
|
|
|
liftIO $ openFile (fromRawFilePath tmp) WriteMode
|
2019-05-20 20:37:04 +00:00
|
|
|
cleanup h = liftIO $ hClose h
|
|
|
|
|
2018-10-25 18:43:13 +00:00
|
|
|
-- | Appends a line to a log file, first locking it to prevent
|
|
|
|
-- concurrent writers.
|
2022-08-11 20:57:44 +00:00
|
|
|
appendLogFile :: RawFilePath -> RawFilePath -> L.ByteString -> Annex ()
|
2020-10-29 16:02:46 +00:00
|
|
|
appendLogFile f lck c =
|
2020-11-03 14:11:04 +00:00
|
|
|
createDirWhenNeeded f $
|
2020-10-29 16:02:46 +00:00
|
|
|
withExclusiveLock lck $ do
|
2020-11-03 14:11:04 +00:00
|
|
|
liftIO $ withFile f' AppendMode $
|
|
|
|
\h -> L8.hPutStrLn h c
|
2020-11-05 22:45:37 +00:00
|
|
|
setAnnexFilePerm (toRawFilePath f')
|
2020-11-03 14:11:04 +00:00
|
|
|
where
|
|
|
|
f' = fromRawFilePath f
|
2018-10-25 18:43:13 +00:00
|
|
|
|
2020-10-20 20:42:28 +00:00
|
|
|
-- | Modifies a log file.
|
|
|
|
--
|
|
|
|
-- If the function does not make any changes, avoids rewriting the file
|
|
|
|
-- for speed, but that does mean the whole file content has to be buffered
|
|
|
|
-- in memory.
|
|
|
|
--
|
|
|
|
-- The file is locked to prevent concurrent writers, and it is written
|
|
|
|
-- atomically.
|
2022-08-11 20:57:44 +00:00
|
|
|
modifyLogFile :: RawFilePath -> RawFilePath -> ([L.ByteString] -> [L.ByteString]) -> Annex ()
|
2020-10-20 20:42:28 +00:00
|
|
|
modifyLogFile f lck modf = withExclusiveLock lck $ do
|
|
|
|
ls <- liftIO $ fromMaybe []
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
<$> tryWhenExists (fileLines <$> L.readFile f')
|
2020-10-20 20:42:28 +00:00
|
|
|
let ls' = modf ls
|
|
|
|
when (ls' /= ls) $
|
2020-11-03 14:11:04 +00:00
|
|
|
createDirWhenNeeded f $
|
|
|
|
viaTmp writelog f' (L8.unlines ls')
|
2020-10-20 20:42:28 +00:00
|
|
|
where
|
2020-11-03 14:11:04 +00:00
|
|
|
f' = fromRawFilePath f
|
|
|
|
writelog lf b = do
|
|
|
|
liftIO $ L.writeFile lf b
|
2020-11-05 22:45:37 +00:00
|
|
|
setAnnexFilePerm (toRawFilePath lf)
|
2020-10-20 20:42:28 +00:00
|
|
|
|
|
|
|
-- | Checks the content of a log file to see if any line matches.
|
2022-10-07 17:19:17 +00:00
|
|
|
checkLogFile :: RawFilePath -> RawFilePath -> (L.ByteString -> Bool) -> Annex Bool
|
|
|
|
checkLogFile f lck matchf = withSharedLock lck $ bracket setup cleanup go
|
2020-10-20 20:42:28 +00:00
|
|
|
where
|
2022-10-07 17:19:17 +00:00
|
|
|
setup = liftIO $ tryWhenExists $ openFile f' ReadMode
|
2020-10-20 20:42:28 +00:00
|
|
|
cleanup Nothing = noop
|
|
|
|
cleanup (Just h) = liftIO $ hClose h
|
|
|
|
go Nothing = return False
|
|
|
|
go (Just h) = do
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
!r <- liftIO (any matchf . fileLines <$> L.hGetContents h)
|
2020-10-20 20:42:28 +00:00
|
|
|
return r
|
2022-10-07 17:19:17 +00:00
|
|
|
f' = fromRawFilePath f
|
2020-10-20 20:42:28 +00:00
|
|
|
|
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 18:38:59 +00:00
|
|
|
-- | Folds a function over lines of a log file to calculate a value.
|
2022-10-07 17:19:17 +00:00
|
|
|
calcLogFile :: RawFilePath -> RawFilePath -> t -> (L.ByteString -> t -> t) -> Annex t
|
fix deadlock in restagePointerFiles
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8af53f810982e002e435c0d7c040c59)
The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.
The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.
Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.
streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.
Sponsored-by: Dartmouth College's Datalad project
2022-12-08 18:18:54 +00:00
|
|
|
calcLogFile f lck start update =
|
|
|
|
withSharedLock lck $ calcLogFileUnsafe f start update
|
|
|
|
|
|
|
|
-- | Unsafe version that does not do locking.
|
|
|
|
calcLogFileUnsafe :: RawFilePath -> t -> (L.ByteString -> t -> t) -> Annex t
|
|
|
|
calcLogFileUnsafe f start update = bracket setup cleanup go
|
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 18:38:59 +00:00
|
|
|
where
|
2022-10-07 17:19:17 +00:00
|
|
|
setup = liftIO $ tryWhenExists $ openFile f' ReadMode
|
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 18:38:59 +00:00
|
|
|
cleanup Nothing = noop
|
|
|
|
cleanup (Just h) = liftIO $ hClose h
|
|
|
|
go Nothing = return start
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
go (Just h) = go' start =<< liftIO (fileLines <$> L.hGetContents h)
|
add restage log
When pointer files need to be restaged, they're first written to the
log, and then when the restage operation runs, it reads the log. This
way, if the git-annex process is interrupted before it can do the
restaging, a later git-annex process can do it.
Currently, this lets a git-annex get/drop command be interrupted and
then re-ran, and as long as it gets/drops additional files, it will
clean up after the interrupted command. But more changes are
needed to make it easier to restage after an interrupted process.
Kept using the git queue to run the restage action, even though the
list of files that it builds up for that action is not actually used by
the action. This could perhaps be simplified to make restaging a cleanup
action that gets registered, rather than using the git queue for it. But
I wasn't sure if that would cause visible behavior changes, when eg
dropping a large number of files, currently the git queue flushes
periodically, and so it restages incrementally, rather than all at the
end.
In restagePointerFiles, it reads the restage log twice, once to get
the number of files and size, and a second time to process it.
This seemed better than reading the whole file into memory, since
potentially a huge number of files could be in there. Probably the OS
will cache the file in memory and there will not be much performance
impact. It might be better to keep running tallies in another file
though. But updating that atomically with the log seems hard.
Also note that it's possible for calcRestageLog to see a different file
than streamRestageLog does. More files may be added to the log in
between. That is ok, it will only cause the filterprocessfaster heuristic to
operate with slightly out of date information, so it may make the wrong
choice for the files that got added and be a little slower than ideal.
Sponsored-by: Dartmouth College's DANDI project
2022-09-23 18:38:59 +00:00
|
|
|
go' v [] = return v
|
|
|
|
go' v (l:ls) = do
|
|
|
|
let !v' = update l v
|
|
|
|
go' v' ls
|
2022-10-07 17:19:17 +00:00
|
|
|
f' = fromRawFilePath f
|
2020-10-20 20:42:28 +00:00
|
|
|
|
2022-09-23 17:49:01 +00:00
|
|
|
-- | Streams lines from a log file, passing each line to the processor,
|
|
|
|
-- and then empties the file at the end.
|
2018-10-25 18:43:13 +00:00
|
|
|
--
|
2022-09-23 17:49:01 +00:00
|
|
|
-- If the processor is interrupted or throws an exception, the log file is
|
2018-10-25 18:43:13 +00:00
|
|
|
-- left unchanged.
|
|
|
|
--
|
2022-09-23 17:49:01 +00:00
|
|
|
-- There is also a finalizer, that is run once all lines have been
|
|
|
|
-- streamed. It is run even if the log file does not exist. If the
|
|
|
|
-- finalizer throws an exception, the log file is left unchanged.
|
2018-10-25 18:43:13 +00:00
|
|
|
--
|
|
|
|
-- Locking is used to prevent writes to to the log file while this
|
|
|
|
-- is running.
|
2022-09-23 17:49:01 +00:00
|
|
|
streamLogFile :: FilePath -> RawFilePath -> Annex () -> (String -> Annex ()) -> Annex ()
|
|
|
|
streamLogFile f lck finalizer processor =
|
fix deadlock in restagePointerFiles
Fix a hang that occasionally occurred during commands such as move.
(A bug introduced in 10.20220927, in
commit 6a3bd283b8af53f810982e002e435c0d7c040c59)
The restage.log was kept locked while running a complex index refresh
action. In an unusual situation, that action could need to write to the
restage log, which caused a deadlock.
The solution is a two-stage process. First the restage.log is moved to a
work file, which is done with the lock held. Then the content of the work
file is read and processed, which happens without the lock being held.
This is all done in a crash-safe manner.
Note that streamRestageLog may not be fully safe to run concurrently
with itself. That's ok, because restagePointerFiles uses it with the
index lock held, so only one can be run at a time.
streamRestageLog does delete the restage.old file at the end without
locking. If a calcRestageLog is run concurrently, it will either see the
file content before it was deleted, or will see it's missing. Either is
ok, because at most this will cause calcRestageLog to report more
work remains to be done than there is.
Sponsored-by: Dartmouth College's Datalad project
2022-12-08 18:18:54 +00:00
|
|
|
withExclusiveLock lck $ do
|
|
|
|
streamLogFileUnsafe f finalizer processor
|
|
|
|
liftIO $ writeFile f ""
|
|
|
|
setAnnexFilePerm (toRawFilePath f)
|
|
|
|
|
|
|
|
-- Unsafe version that does not do locking, and does not empty the file
|
|
|
|
-- at the end.
|
|
|
|
streamLogFileUnsafe :: FilePath -> Annex () -> (String -> Annex ()) -> Annex ()
|
|
|
|
streamLogFileUnsafe f finalizer processor = bracketOnError setup cleanup go
|
2018-10-25 18:43:13 +00:00
|
|
|
where
|
|
|
|
setup = liftIO $ tryWhenExists $ openFile f ReadMode
|
|
|
|
cleanup Nothing = noop
|
|
|
|
cleanup (Just h) = liftIO $ hClose h
|
2022-09-23 17:49:01 +00:00
|
|
|
go Nothing = finalizer
|
2018-10-25 18:43:13 +00:00
|
|
|
go (Just h) = do
|
2022-09-23 17:49:01 +00:00
|
|
|
mapM_ processor =<< liftIO (lines <$> hGetContents h)
|
2018-10-25 18:43:13 +00:00
|
|
|
liftIO $ hClose h
|
2022-09-23 17:49:01 +00:00
|
|
|
finalizer
|
2018-10-25 18:43:13 +00:00
|
|
|
|
2020-10-29 16:02:46 +00:00
|
|
|
createDirWhenNeeded :: RawFilePath -> Annex () -> Annex ()
|
2018-10-25 18:43:13 +00:00
|
|
|
createDirWhenNeeded f a = a `catchNonAsync` \_e -> do
|
|
|
|
-- Most of the time, the directory will exist, so this is only
|
|
|
|
-- done if writing the file fails.
|
|
|
|
createAnnexDirectory (parentDir f)
|
|
|
|
a
|
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.
writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.
I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.
This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.
When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.
Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.
Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 18:23:23 +00:00
|
|
|
|
|
|
|
-- On windows, readFile does NewlineMode translation,
|
|
|
|
-- stripping CR before LF. When converting to ByteString,
|
|
|
|
-- use this to emulate that.
|
|
|
|
fileLines :: L.ByteString -> [L.ByteString]
|
|
|
|
#ifdef mingw32_HOST_OS
|
|
|
|
fileLines = map stripCR . L8.lines
|
|
|
|
where
|
|
|
|
stripCR b = case L8.unsnoc b of
|
|
|
|
Nothing -> b
|
|
|
|
Just (b', e)
|
|
|
|
| e == '\r' -> b'
|
|
|
|
| otherwise -> b
|
|
|
|
#else
|
|
|
|
fileLines = L8.lines
|
|
|
|
#endif
|
|
|
|
|
|
|
|
fileLines' :: S.ByteString -> [S.ByteString]
|
|
|
|
#ifdef mingw32_HOST_OS
|
|
|
|
fileLines' = map stripCR . S8.lines
|
|
|
|
where
|
|
|
|
stripCR b = case S8.unsnoc b of
|
|
|
|
Nothing -> b
|
|
|
|
Just (b', e)
|
|
|
|
| e == '\r' -> b'
|
|
|
|
| otherwise -> b
|
|
|
|
#else
|
|
|
|
fileLines' = S8.lines
|
|
|
|
#endif
|