This didn't work. In case I want to revisit, here's what I tried.
diff --git a/Annex/Proxy.hs b/Annex/Proxy.hs
index 48222872c1..e4e526d3dd 100644
--- a/Annex/Proxy.hs
+++ b/Annex/Proxy.hs
@@ -26,16 +26,21 @@ import Logs.UUID
import Logs.Location
import Utility.Tmp.Dir
import Utility.Metered
+import Utility.ThreadScheduler
+import Utility.OpenFd
import Git.Types
import qualified Database.Export as Export
import Control.Concurrent.STM
import Control.Concurrent.Async
+import Control.Concurrent.MVar
import qualified Data.ByteString as B
+import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as L
import qualified System.FilePath.ByteString as P
import qualified Data.Map as M
import qualified Data.Set as S
+import System.IO.Unsafe
proxyRemoteSide :: ProtocolVersion -> Bypass -> Remote -> Annex RemoteSide
proxyRemoteSide clientmaxversion bypass r
@@ -240,21 +245,99 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
writeVerifyChunk iv h b
storetofile iv h (n - fromIntegral (B.length b)) bs
- proxyget offset af k = withproxytmpfile k $ \tmpfile -> do
+ proxyget offset af k = withproxytmpfile k $ \tmpfile ->
+ let retrieve = tryNonAsync $ Remote.retrieveKeyFile r k af
+ (fromRawFilePath tmpfile) nullMeterUpdate vc
+ in case fromKey keySize k of
+ Just size | size > 0 -> do
+ cancelv <- liftIO newEmptyMVar
+ donev <- liftIO newEmptyMVar
+ streamer <- liftIO $ async $
+ streamdata offset tmpfile size cancelv donev
+ retrieve >>= \case
+ Right _ -> liftIO $ do
+ putMVar donev ()
+ wait streamer
+ Left err -> liftIO $ do
+ putMVar cancelv ()
+ wait streamer
+ propagateerror err
+ _ -> retrieve >>= \case
+ Right _ -> liftIO $ senddata offset tmpfile
+ Left err -> liftIO $ propagateerror err
+ where
-- Don't verify the content from the remote,
-- because the client will do its own verification.
- let vc = Remote.NoVerify
- tryNonAsync (Remote.retrieveKeyFile r k af (fromRawFilePath tmpfile) nullMeterUpdate vc) >>= \case
- Right _ -> liftIO $ senddata offset tmpfile
- Left err -> liftIO $ propagateerror err
+ vc = Remote.NoVerify
+ streamdata (Offset offset) f size cancelv donev = do
+ sendlen offset size
+ waitforfile
+ x <- tryNonAsync $ do
+ fd <- openFdWithMode f ReadOnly Nothing defaultFileFlags
+ h <- fdToHandle fd
+ hSeek h AbsoluteSeek offset
+ senddata' h (getcontents size)
+ case x of
+ Left err -> do
+ throwM err
+ Right res -> return res
+ where
+ -- The file doesn't exist at the start.
+ -- Wait for some data to be written to it as well,
+ -- in case an empty file is first created and then
+ -- overwritten. When there is an offset, wait for
+ -- the file to get that large. Note that this is not used
+ -- when the size is 0.
+ waitforfile = tryNonAsync (fromIntegral <$> getFileSize f) >>= \case
+ Right sz | sz > 0 && sz >= offset -> return ()
+ _ -> ifM (isEmptyMVar cancelv)
+ ( do
+ threadDelaySeconds (Seconds 1)
+ waitforfile
+ , do
+ return ()
+ )
+
+ getcontents n h = unsafeInterleaveIO $ do
+ isdone <- isEmptyMVar donev <||> isEmptyMVar cancelv
+ c <- BS.hGet h defaultChunkSize
+ let n' = n - fromIntegral (BS.length c)
+ let c' = L.fromChunks [BS.take (fromIntegral n) c]
+ if BS.null c
+ then if isdone
+ then return mempty
+ else do
+ -- Wait for more data to be
+ -- written to the file.
+ threadDelaySeconds (Seconds 1)
+ getcontents n h
+ else if n' > 0
+ then do
+ -- unsafeInterleaveIO causes
+ -- this to be deferred until
+ -- data is read from the lazy
+ -- ByteString.
+ cs <- getcontents n' h
+ return $ L.append c' cs
+ else return c'
+
senddata (Offset offset) f = do
size <- fromIntegral <$> getFileSize f
- let n = max 0 (size - offset)
- sendmessage $ DATA (Len n)
+ sendlen offset size
withBinaryFile (fromRawFilePath f) ReadMode $ \h -> do
hSeek h AbsoluteSeek offset
- sendbs =<< L.hGetContents h
+ senddata' h L.hGetContents
+
+ senddata' h getcontents = do
+ sendbs =<< getcontents h
-- Important to keep the handle open until
-- the client responds. The bytestring
-- could still be lazily streaming out to
@@ -272,6 +355,11 @@ proxySpecialRemote protoversion r ihdl ohdl owaitv oclosedv mexportdb = go
Just FAILURE -> return ()
Just _ -> giveup "protocol error"
Nothing -> return ()
+
+ sendlen offset size = do
+ let n = max 0 (size - offset)
+ sendmessage $ DATA (Len n)
+
{- Check if this repository can proxy for a specified remote uuid,
- and if so enable proxying for it. -}
Without actually simulating cluster implementation at all. Instead, only
the essential fact that cluster gateways know what changes they have
made to each node of a cluster. That is enough for sims like
sizebalanced_cluster.
Noticed that it was quite slow compared with things like action
sendwanted. Guessed that the slowdown is largely due to every step
doing a simulated git pull/push.
So, rather than always doing a pull/push, only do those when no actions
are found without doing a pull/push.
This does mean that step will sometimes experience a split brain
situation, but that seems like a good thing? Because step ought to
explore as many possible scenarios as it reasonably can.
This demonstrates concurrent behavior that looks right. And with a
random seed, the results are deterministic.
init foo
init bar
init backup
connect foo <-> bar
connect foo <-> backup
addmulti 10 testfiles 1mb 1gb foo backup
action foo gitpull backup
wanted foo nothing
wanted bar anything
wanted backup anything
action bar gitpull foo
action foo dropunwanted while action bar getwanted foo
When getting from a remote, have to check that the repo doing the
getting thinks the remote contains the key, but also that the remote
actually does. Before this bug fix, it would get from a repo that used
to have the key, but that had dropped it since the last git pull.
Had to add Read instances to Key and NumCopies and some other similar
types. I only expect to use those in serializing a sim. Of course, this
risks that implementation changes break reading old data. For a sim,
that would not be a big problem.
Not tested yet. This emulates the same checking that is done when
dropping. Note that when dropping from a special remote it is not able
to make a locked copy.
The generator doesn't emit the best possible connect commands,
but it does output something valid. Eg, an input like:
connect A <-> B <-> C <-> D
becomes:
connect A <-> B <-> C
connect C <-> D
Also:
connect A -> B <- C
becomes:
connect A -> B
connect C -> B
Which could be improved.
Also disconnect commands are not prettified at all, but probably there's
no reason to.
Detect when a preferred content expression contains "not present", which
would lead to repeatedly getting and then dropping files, and make it never
match. This also applies to "not balanced" and "not sizebalanced".
--explain will tell the user when this happens
Note that getMatcher calls matchMrun' and does not check for unstable
negated limits. While there is no --present anyway, if there was,
it would not make sense for --not --present to complain about
instability and fail to match.
Reorganized the reposize database directory, and split up a column.
checkStaleSizeChanges needs to run before needLiveUpdate,
otherwise the process won't be holding a lock on its pid file, and
another process could go in and expire the live update it records. It
just so happens that they do get called in the correct order, since
checking balanced preferred content calls getLiveRepoSizes before
needLiveUpdate.
The 1 minute delay between checks is arbitrary, but will avoid excess
work. The downside of it is that, if a process is dropping a file and
gets interrupted, for 1 minute another process can expect a repository
will soon be smaller than it is. And so a process might send data to a
repository when a file is not really going to be dropped from it. But
note that can already happen if a drop takes some time in eg locking and
then fails. So it seems possible that live updates should only be
allowed to increase, rather than decrease the size of a repository.
Only when the preferred content expression being matched uses balanced
preferred content is this overhead needed.
It might be possible to eliminate the locking entirely. Eg, check the
live changes before and after the action and re-run if they are not
stable. For now, this is good enough, it avoids existing preferred
content getting slow. If balanced preferred content turns out to be too
slow to check, that could be tried later.
This makes sure that two threads don't check balanced preferred content at the
same time, so each thread always sees a consistent picture of what is
happening.
This does add a fairly expensive file level lock to every check of
preferred content, in commands that use prepareLiveUpdate. It would
be good to only do that when live updates are actually needed, eg when
the preferred content expression uses balanced preferred content.
Fixed successfullyFinishedLiveSizeChange to not update the rolling total
when a redundant change is in RecentChanges.
Made setRepoSizes clear RecentChanges that are no longer needed.
It might be possible to clear those earlier, this is only a convenient
point to do it.
The reason it's safe to clear RecentChanges here is that, in order for a
live update to call successfullyFinishedLiveSizeChange, a change must be
made to a location log. If a RecentChange gets cleared, and just after
that a new live update is started, making the same change, the location
log has already been changed (since the RecentChange exists), and
so when the live update succeeds, it won't call
successfullyFinishedLiveSizeChange. The reason it doesn't
clear RecentChanges when there is a reduntant live update is because
I didn't want to think through whether or not all races are avoided in
that case.
The rolling total in SizeChanges is never cleared. Instead,
calcJournalledRepoSizes gets the initial value of it, and then
getLiveRepoSizes subtracts that initial value from the current value.
Since the rolling total can only be updated by updateRepoSize,
which is called with the journal locked, locking the journal in
calcJournalledRepoSizes ensures that the database does not change while
reading the journal.
It's possible for two processes or threads to both be doing the same
operation at the same time. Eg, both dropping the same key. If one
finishes and updates the rollingtotal, then the other one needs to be
prevented from later updating the rollingtotal as well. And they could
finish at the same time, or with some time in between.
Addressed this by making updateRepoSize be called with the journal
locked, and only once it's been determined that there is an actual
location change to record in the log. updateRepoSize waits for the
database to be updated.
When there is a redundant operation, updateRepoSize won't be called,
and the redundant LiveUpdate will be removed from the database on
garbage collection.
But: There will be a window where the redundant LiveUpdate is still
visible in the db, and processes can see it, combine it with the
rollingtotal, and arrive at the wrong size. This is a small window, but
it still ought to be addressed. Unsure if it would always be safe to
remove the redundant LiveUpdate? Consider the case where two drops and a
get are all running concurrently somehow, and the order they finish is
[drop, get, drop]. The second drop seems redundant to the first, but
it would not be safe to remove it. While this seems unlikely, it's hard
to rule out that a get and drop at different stages can both be running
at the same time.
I've tested the behavior of the thread that waits for the LiveUpdate to
be finished, and it does get signaled and exit cleanly when the
LiveUpdate is GCed instead.
Made finishedLiveUpdate wait for the thread to finish updating the
database.
There is a case where GC doesn't happen in time and the database is left
with a live update recorded in it. This should not be a problem as such
stale data can also happen when interrupted and will need to be detected
when loading the database.
Balanced preferred content expressions now call startLiveUpdate.
Each command that first checks preferred content (and/or required
content) and then does something that can change the sizes of
repositories needs to call prepareLiveUpdate, and plumb it through the
preferred content check and the location log update.
So far, only Command.Drop is done. Many other commands that don't need
to do this have been updated to keep working.
There may be some calls to NoLiveUpdate in places where that should be
done. All will need to be double checked.
Not currently in a compilable state.
A new repo that has no location log info yet, but has an entry in
uuid.log has 0 size, so make RepoSize aware of that.
Note that a new repo that does not yet appear in uuid.log will still not
be displayed.
When a remote is added but not synced with yet, it has no uuid.log
entry. If git-annex maxsize is used to configure that remote, it needs
to appear in the maxsize table, and the change to Command.MaxSize takes
care of that.
When the specified number of copies is > 1, and some repositories are
too full, it can be better to move content from them to other less full
repositories, in order to make space for new content.
annex.fullybalancedthreshhold is documented, but not implemented yet
This is not tested very well yet, and is known to sometimes take several
runs to stabalize.
Might want to make --rebalance turn balanced=group:N where N > 1
to fullysizebalanced=group:N. Have not yet determined if that will
improve situations enough to be worth the extra work.
The use of catObjectStream is optimally fast. Although it might be
possible to combine this with git-annex branch merge to avoid some
redundant work.
Benchmarking, a git-annex branch that had 100000 files changed
took less than 1.88 seconds to run through this.
At this point the RepoSize database is getting populated, and it
all seems to be working correctly. Incremental updates still need to be
done to make it performant.
Including locking on creation, handling of permissions errors, and
setting repo sizes.
I'm confident that locking is not needed while using this database.
Since writes happen in a single transaction. When there are two writers
that are recording sizes based on different git-annex branch commits,
one will overwrite what the other one recorded. Which is fine, it's only
necessary that the database stays consistent with the content of a
git-annex branch commit.
Plan is to run this when populating Annex.reposizes on demand.
So Annex.reposizes will be up-to-date with the journal, including
crucially journal entries for private repositories. But also
anything that has been written to the journal by another process,
especially if the process was ran with annex.alwayscommit=false.
From there, Annex.reposizes can be kept up to date with changes made
by the running process.
This is very innefficient, it will need to be optimised not to
calculate the sizes of repos every time.
Also, fixed a bug in balancedPicker that caused it to pick a too high
index when some repos were excluded due to being full.
This deals with the possible security problem that someone could make an
unusually low UUID and generate keys that are all constructed to hash to
a number that, mod the number of repositories in the group, == 0.
So balanced preferred content would always put those keys in the
repository with the low UUID as long as the group contains the
number of repositories that the attacker anticipated.
Presumably the attacker than holds the data for ransom? Dunno.
Anyway, the partial solution is to use HMAC (sha256) with all the UUIDs
combined together as the "secret", and the key as the "message". Now any
change in the set of UUIDs in a group will invalidate the attacker's
constructed keys from hashing to anything in particular.
Given that there are plenty of other things someone can do if they can
write to the repository -- including modifying preferred content so only
their repository wants files, and numcopies so other repositories drom
them -- this seems like safeguard enough.
Note that, in balancedPicker, combineduuids is memoized.
This all works fine. But it doesn't check repository sizes yet, and
without repository size checking, once a repository gets full, there
will be no other repository that will want its files.
Use of sha2 seems unncessary, probably alder2 or md5 or crc would have
been enough. Possibly just summing up the bytes of the key mod the number
of repositories would have sufficed. But sha2 is there, and probably
hardware accellerated. I doubt very much there is any security benefit
to using it though. If someone wants to construct a key that will be
balanced onto a given repository, sha2 is certianly not going to stop
them.
This removes versionedExport, which was only used by the S3 special
remote. Instead, versionedexport=yes is a common way for remotes to
indicate that they are versioned.
This is not perfect because it does not handle versioned special
remotes, which should not be untrustworthy, but now are when proxied.
The implementation turned out to be easy, because the exporttree field
is a default field, so is available in RemoteConfig even for git
remotes.