From b5319ec575945d76e503123a122f33d3d1f269db Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 19 Feb 2025 14:29:18 -0400 Subject: [PATCH 01/53] documentation for compute remote and associated commands None of this is implemented yet. --- doc/git-annex-addcomputed.mdwn | 108 +++++++++++++++++++++++++++++++ doc/git-annex-initremote.mdwn | 2 +- doc/git-annex-recompute.mdwn | 55 ++++++++++++++++ doc/git-annex.mdwn | 12 ++++ doc/special_remotes/compute.mdwn | 33 ++++++++++ 5 files changed, 209 insertions(+), 1 deletion(-) create mode 100644 doc/git-annex-addcomputed.mdwn create mode 100644 doc/git-annex-recompute.mdwn create mode 100644 doc/special_remotes/compute.mdwn diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn new file mode 100644 index 0000000000..395d6246e7 --- /dev/null +++ b/doc/git-annex-addcomputed.mdwn @@ -0,0 +1,108 @@ +# NAME + +git-annex addcomputed - adds computed files to the repository + +# SYNOPSIS + +git annex addcomputed `--to=remote [--input name=file ...] [--value name=value ...] [--output name=file ...]` + +git annex addcomputed `--describe=remote` + +# DESCRIPTION + +Adds files to the annex that are computed from input files and values, +using a compute special remote. + +For example, this adds a file `foo.jpeg` to the repository. It is computed +by the "photoconv" compute remote, based on an input file, `foo.raw`. A +configurable "passes" value is set to 10 when computing the file. + + git-annex addcomputed --to photoconv \ + --input raw=foo.raw --output photo=foo.jpeg \ + --value passes=10 + +There can be more than one input file that are combined to compute an +output file. And multiple output files can be computed at the same time. + +The output files are added to the repository as annexed files. +Once a file has been added to a compute remote, commands +like `git-annex get` will use it to compute the content of the file. +It is also possible to use commands like `git-annex drop` on the file, +with the compute remote being counted as one copy of it. + +# OPTIONS + +* `--to=remote` + + Specify which remote will compute the files. + + This must be a compute remote. For example, one can be + initialized with: + + git-annex initremote photoconv type=compute \ + program=git-annex-compute-photoconv + + For details about compute remotes, and a list of some + of the programs that are available, see + + +* `--input name=file` + + Provide a file as input to the computation, with the specified input name. + + The input file can be an annexed file, or a file stored in git. + +* `--output name=file` + + Add the output of the computation to the repository as an annexed file, + with the specified filename. + +* `--value name=value` + + Provide a value to the computation, with the specified name. + +* `--describe=remote` + + Describe all inputs, outputs, and values supported by a compute remote. + + For a machine-readable list, use this with `--json`. + +* `--fast` + + Adds computed files to the repository, without generating their content + yet. + +* `--unreproducible` + + Indicate that the computation is not expected to be fully reproducible. + It can vary, in ways that produce files that equivilant enough to + be interchangeable, but are not necessarily identical. + + This is the default unless the compute remote indicates that it produces + reproducible output. + +* `--reproducible`, `-r` + + Indicate that the computation is expected to be fully reproducible. + + This is the default when the compute remote indicates that it produces + reproducible output. + + If a computation turns out not to be fully reproducible, then getting + the file from the compute remote will later fail with a checksum + verification error. One thing that can be done then is to use + `git-annex recompute --unreproducible`. + +* Also the [[git-annex-common-options]](1) can be used. + +# SEE ALSO + +[[git-annex]](1) + +[[git-annex-recompute]](1) + +# AUTHOR + +Joey Hess + +Warning: Automatically converted into a man page by mdwn2man. Edit with care. diff --git a/doc/git-annex-initremote.mdwn b/doc/git-annex-initremote.mdwn index 0e4514e823..bcb3494b7f 100644 --- a/doc/git-annex-initremote.mdwn +++ b/doc/git-annex-initremote.mdwn @@ -52,7 +52,7 @@ want to use `git annex renameremote`. git annex initremote mys3 type=S3 --whatelse - For a machine-readable list of the parameters, use this with --json. + For a machine-readable list of the parameters, use this with `--json`. * `--fast` diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn new file mode 100644 index 0000000000..c86173c2eb --- /dev/null +++ b/doc/git-annex-recompute.mdwn @@ -0,0 +1,55 @@ +# NAME + +git-annex recompute - update computed files + +# SYNOPSIS + +git-annex recompute [path ...]` + +# DESCRIPTION + +This updates computed files that were added with +[[git-annex-addcomputed]](1). + +By default, this only recomputes files whose input files have changed. +The new contents of the input files are used to re-run the computation, +and when the output is different, the updated computed file is staged +in the repository. + +# OPTIONS + +* `--unchanged` + + Recompute files even when their input files have not changed. + +* `--unreproducible` + + Convert files that were added with `git-annex addcomputed --reproducible` + to be as if they were added without that option. + +* `--reproducible` + + Convert files that were added with `git-annex addcomputed --unreproducible` + to be as if they were added with `--reproducible`. + +* matching options + + The [[git-annex-matching-options]](1) can be used to control what + files to recompute. + + For example, to only recompute files that are computed by the "photoconv" + compute remote, use `--in=photoconv` + +* Also the [[git-annex-common-options]](1) can be used. + +# SEE ALSO + +[[git-annex]](1) + +[[git-annex-addcomputed]](1) + +# AUTHOR + +Joey Hess + +Warning: Automatically converted into a man page by mdwn2man. Edit with care. diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 620aee61cd..85dda2b223 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -186,6 +186,18 @@ content from the key-value store. See [[git-annex-undo]](1) for details. +* `addcomputed` + + Adds computed files to the repository. + + See [[git-annex-addcomputed]](1) for details. + +* `recompute` + + Recomputes computed files. + + See [[git-annex-recompute]](1) for details. + * `multicast` Multicast file distribution. diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn new file mode 100644 index 0000000000..b840f5fcbe --- /dev/null +++ b/doc/special_remotes/compute.mdwn @@ -0,0 +1,33 @@ +While other remotes store the contents of annexed files somewhere, +this special remote uses a program to compute the contents of annexed +files. + +To enable an instance of this special remote: + + # git-annex initremote myremote type=compute program=git-annex-compute-foo + +The `program` parameter is the only required parameter. It is the name of the +program to use to compute the contents of annexed files. It must start with +"git-annex-compute-". The program needs to be installed somewhere in the +`PATH`. + +To add a file to a compute special remote, use the [[git-annex-addcomputed]] +command. Once a file has been added to a compute special remote, commands +like `git-annex get` will use it to compute the content of the file. + +You can provide other parameters to `initremote`, in order to provide +default configuration values to use when adding files with +[[git-annex-addcomputed]]. To see a list of all the configuration values +supported by a given program, pass `--whatelse` to `initremote`: + + # git-annex initremote myremote type=compute program=git-annex-compute-foo --whatelse + +## compute programs + +To write programs used by the compute special remote, see the +[[design/compute_special_remote_interface]]. + +Have you written a generally useful (and secure) compute program? +List it here! + +* ... From c1b53dbbd05e9c302d9d4cec3da01aeee8fced95 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 20 Feb 2025 13:27:47 -0400 Subject: [PATCH 02/53] wip --- Remote/Compute.hs | 222 ++++++++++++++++++ Remote/List.hs | 2 + .../compute_special_remote_interface.mdwn | 26 +- doc/git-annex.mdwn | 5 + doc/special_remotes.mdwn | 1 + git-annex.cabal | 1 + 6 files changed, 246 insertions(+), 11 deletions(-) create mode 100644 Remote/Compute.hs diff --git a/Remote/Compute.hs b/Remote/Compute.hs new file mode 100644 index 0000000000..3fd52af5c8 --- /dev/null +++ b/Remote/Compute.hs @@ -0,0 +1,222 @@ +{- Compute remote. + - + - Copyright 2025 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +module Remote.Compute (remote) where + +import Annex.Common +import Types.Remote +import Types.ProposedAccepted +import Types.Creds +import Config +import Config.Cost +import Remote.Helper.Special +import Remote.Helper.ExportImport +import Annex.SpecialRemote.Config +import Annex.UUID +import Logs.RemoteState +import Utility.Metered +import qualified Git +import qualified Utility.SimpleProtocol as Proto + +import Control.Concurrent.STM +import qualified Data.Map as M +import qualified Data.Set as S + +remote :: RemoteType +remote = RemoteType + { typename = "compute" + , enumerate = const $ findSpecialRemotes "compute" + , generate = gen + , configParser = mkRemoteConfigParser + [ optionalStringParser programField + (FieldDesc $ "compute program (must start with \"" ++ safetyPrefix ++ "\")") + ] + , setup = setupInstance + , exportSupported = exportUnsupported + , importSupported = importUnsupported + , thirdPartyPopulated = False + } + +gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote) +gen r u rc gc rs = case getComputeProgram rc of + Left _err -> return Nothing + Right program -> do + interface <- liftIO $ newTMVarIO Nothing + c <- parsedRemoteConfig remote rc + cst <- remoteCost gc c veryExpensiveRemoteCost + return $ Just $ mk program interface c cst + where + mk program interface c cst = Remote + { uuid = u + , cost = cst + , name = Git.repoDescribe r + , storeKey = storeKeyUnsupported + , retrieveKeyFile = computeKey program interface + , retrieveKeyFileInOrder = pure True + , retrieveKeyFileCheap = Nothing + , retrievalSecurityPolicy = RetrievalAllKeysSecure + , removeKey = dropKey rs + , lockContent = Nothing + , checkPresent = checkKey program interface + , checkPresentCheap = False + , exportActions = exportUnsupported + , importActions = importUnsupported + , whereisKey = Nothing + , remoteFsck = Nothing + , repairRepo = Nothing + , config = c + , gitconfig = gc + , localpath = Nothing + , getRepo = return r + , readonly = True + , appendonly = False + , untrustworthy = False + , availability = pure LocallyAvailable + , remotetype = remote + , mkUnavailable = return Nothing + , getInfo = return [] + , claimUrl = Nothing + , checkUrl = Nothing + , remoteStateHandle = rs + } + +setupInstance :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) +setupInstance _ mu _ c _ = do + ComputeProgram program <- either giveup return (getComputeProgram c) + unlessM (liftIO $ inSearchPath program) $ + giveup $ "Cannot find " ++ program ++ " in PATH" + u <- maybe (liftIO genUUID) return mu + gitConfigSpecialRemote u c [("compute", "true")] + return (c, u) + +newtype ComputeProgram = ComputeProgram String + deriving (Show) + +getComputeProgram :: RemoteConfig -> Either String ComputeProgram +getComputeProgram c = case fromProposedAccepted <$> M.lookup programField c of + Just program + | safetyPrefix `isPrefixOf` program -> + Right (ComputeProgram program) + | otherwise -> Left $ + "The program's name must begin with \"" ++ safetyPrefix ++ "\"" + Nothing -> Left "Specify program=" + +-- Limiting the program to "git-annex-compute-" prefix is important for +-- security, it prevents autoenabled compute remotes from running arbitrary +-- programs. +safetyPrefix :: String +safetyPrefix = "git-annex-compute-" + +programField :: RemoteConfigField +programField = Accepted "program" + +type Name = String +type Description = String +type Id = String + +data InterfaceItem + = InterfaceInput Id Description + | InterfaceOptionalInput Id Description + | InterfaceValue Name Description + | InterfaceOptionalValue Name Description + | InterfaceOutput Id Description + | InterfaceReproducible + deriving (Show, Eq) + +-- List order matters, because when displaying the interface to the +-- user, need to display it in the same order as the program +-- does. +data Interface = Interface [InterfaceItem] + deriving (Show, Eq) + +instance Proto.Receivable InterfaceItem where + parseCommand "INPUT" = Proto.parse2 InterfaceInput + parseCommand "INPUT?" = Proto.parse2 InterfaceOptionalInput + parseCommand "VALUE" = Proto.parse2 InterfaceValue + parseCommand "VALUE?" = Proto.parse2 InterfaceOptionalValue + parseCommand "OUTPUT" = Proto.parse2 InterfaceOutput + parseCommand "REPRODUCIBLE" = Proto.parse0 InterfaceReproducible + +getInterface :: ComputeProgram -> TMVar (Maybe Interface) -> IO (Either String Interface) +getInterface program iv = + atomically (takeTMVar iv) >>= \case + Nothing -> getInterface' program >>= \case + Left err -> do + atomically $ putTMVar iv Nothing + return (Left err) + Right interface -> ret interface + Just interface -> ret interface + where + ret interface = do + atomically $ putTMVar iv (Just interface) + return (Right interface) + +getInterface' :: ComputeProgram -> IO (Either String Interface) +getInterface' (ComputeProgram program) = + catchMaybeIO (readProcess program ["interface"]) >>= \case + Nothing -> return $ Left $ "Failed to run " ++ program + Just output -> return $ case parseInterface output of + Right i -> Right i + Left err -> Left $ program ++ " interface output problem: " ++ err + +parseInterface :: String -> Either String Interface +parseInterface = go [] . lines + where + go is [] + | null is = Left "empty interface output" + | otherwise = Right (Interface (reverse is)) + go is (l:ls) + | null l = go is ls + | otherwise = case Proto.parseMessage l of + Just i -> go (i:is) ls + Nothing -> Left $ "Unable to parse line: \"" ++ l ++ "\"" + +data ComputeInput = ComputeInput Key FilePath + deriving (Show, Eq) + +data ComputeValue = ComputeValue String + +data ComputeState = ComputeState + { computeInputs :: M.Map Id ComputInput + , computeValues :: M.Map Id ComputeValue + } + deriving (Show, Eq) + +-- The state is URI encoded. +-- +-- A ComputeValue with Id "foo" is represented as "vfoo=value" +-- A ComputeInput with Id "foo" is represented as "kfoo=key&pfoo=path" +formatComputeState :: ComputeState -> String +formatComputeState st = + map formatinput (computeInputes st) + ++ concatMap formatvalue (computeValues st) + +parseComputeState :: String -> ComputeState +parseComputeState = + +-- TODO +computeKey :: ComputeProgram -> TMVar (Maybe Interface) -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification +computeKey program iv key _af dest p vc = + liftIO (getInterface program iv) >>= \case + Left err -> giveup err + Right interface -> undefined + +-- TODO Make sure that the remote state meets the program's current +-- interface. +checkKey :: ComputeProgram -> TMVar (Maybe Interface) -> Key -> Annex Bool +checkKey program iv _ = + liftIO (getInterface program iv) >>= \case + Left err -> giveup err + Right interface -> undefined + +-- Removing remote state will prevent computing the key. +dropKey :: RemoteStateHandle -> Maybe SafeDropProof -> Key -> Annex () +dropKey rs _ k = setRemoteState rs k mempty + +storeKeyUnsupported :: Key -> AssociatedFile -> Maybe OsPath -> MeterUpdate -> Annex () +storeKeyUnsupported _ _ _ _ = giveup "transfer to compute remote not supported; use git-annex addcomputed instead" + diff --git a/Remote/List.hs b/Remote/List.hs index a266f2d2f2..9d39ddd81d 100644 --- a/Remote/List.hs +++ b/Remote/List.hs @@ -40,6 +40,7 @@ import qualified Remote.Borg import qualified Remote.Rclone import qualified Remote.Hook import qualified Remote.External +import qualified Remote.Compute remoteTypes :: [RemoteType] remoteTypes = map adjustExportImportRemoteType @@ -63,6 +64,7 @@ remoteTypes = map adjustExportImportRemoteType , Remote.Rclone.remote , Remote.Hook.remote , Remote.External.remote + , Remote.Compute.remote ] {- Builds a list of all Remotes. diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index f82fdc22c5..8b1a732e7a 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -51,12 +51,16 @@ In the example above, the program is expected to output something like: If possible, the program should write the content of the file it is computing directly to the file listed in COMPUTING, rather than writing to -somewhere else and renaming it at the end. If git-annex sees that the file -corresponding to the key it requested be computed is growing, it will use -its file size when displaying progress to the user. +somewhere else and renaming it at the end. Except, when the program writes +the file it computes out of order, it should write to a file somewhere else +and rename it at the end. + +If git-annex sees that the file corresponding to the key it requested be +computed is growing, it will use its file size when displaying progress to +the user. The program can also output lines to stdout to indicate its current -progress. +progress: PROGRESS 50% @@ -67,23 +71,23 @@ output, but not for progress displays. If the program exits nonzero, nothing it computed will be stored in the git-annex repository. -The program must also support listing the inputs and outputs that it +When run with the "interface" parameter, the program must describe its +interface. This is a list of the inputs and outputs that it supports. This allows `git-annex addcomputed` and `git-annex initremote` to list inputs and outputs, and also lets them reject invalid inputs and outputs. -In this mode, the program is run with a "list" parameter. -It should output lines, in the form: +The output is lines, in the form: - INPUT[?] Name Description - VALUE[?] Name Description + INPUT[?] Id Description + VALUE[?] Id Description OUTPUT Id Description Use "INPUT" when a file is an input to the computation, and "VALUE" for all other input values. Use "INPUT?" and "VALUE?" for optional inputs and values. -The program can also optionally output a "REPRODUCIBLE" line. +The interface can also optionally include a "REPRODUCIBLE" line. That indicates that the results of its computations are expected to be bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if the `--reproducible` @@ -93,7 +97,7 @@ An example `git-annex-compute-foo` shell script follows: #!/bin/sh set -e - if [ "$1" = list ]; then + if [ "$1" = interface ]; then echo "INPUT raw A photo in RAW format" echo "VALUE? passes Number of passes" echo "OUTPUT photo Computed JPEG" diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 85dda2b223..daed2be98a 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1957,6 +1957,11 @@ Remotes are configured using these settings in `.git/config`. the location of the borg repository to use. Normally this is automatically set up by `git annex initremote`, but you can change it if needed. +* `remote..annex-compute` + + Used to identify compute special remotes. + Normally this is automatically set up by `git annex initremote`. + * `remote..annex-ddarrepo` Used by ddar special remotes, this configures diff --git a/doc/special_remotes.mdwn b/doc/special_remotes.mdwn index 04f2feb9c6..0c4ff0131f 100644 --- a/doc/special_remotes.mdwn +++ b/doc/special_remotes.mdwn @@ -11,6 +11,7 @@ the content of files. * [[Amazon_Glacier|glacier]] * [[bittorrent]] * [[bup]] +* [[compute]] * [[ddar]] * [[directory]] * [[gcrypt]] (encrypted git repositories!) diff --git a/git-annex.cabal b/git-annex.cabal index fae2a3bbb8..0e95331084 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -930,6 +930,7 @@ Executable git-annex Remote.BitTorrent Remote.Borg Remote.Bup + Remote.Compute Remote.Ddar Remote.Directory Remote.Directory.LegacyChunked From 4f1eea90616e4ffaac12d1e0bd5c304d605df808 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 21 Feb 2025 14:51:02 -0400 Subject: [PATCH 03/53] remove unused adjustedBranchRefresh associated file parameter --- Annex/AdjustedBranch.hs | 12 ++++-------- Annex/Content.hs | 20 ++++++++++---------- Annex/Import.hs | 4 ++-- Annex/Ingest.hs | 13 +++---------- Annex/Transfer.hs | 2 +- CmdLine/GitRemoteAnnex.hs | 4 ++-- Command/Multicast.hs | 2 +- Command/ReKey.hs | 2 +- Command/RecvKey.hs | 2 +- Command/Reinject.hs | 2 +- Command/SetKey.hs | 2 +- Command/TestRemote.hs | 8 ++++---- Command/TransferKey.hs | 2 +- Command/TransferKeys.hs | 2 +- Command/Transferrer.hs | 4 ++-- P2P/Annex.hs | 2 +- Remote/Git.hs | 2 +- Upgrade/V0.hs | 3 +-- Upgrade/V1.hs | 2 +- 19 files changed, 39 insertions(+), 51 deletions(-) diff --git a/Annex/AdjustedBranch.hs b/Annex/AdjustedBranch.hs index 99cd40e835..95bd8cfc34 100644 --- a/Annex/AdjustedBranch.hs +++ b/Annex/AdjustedBranch.hs @@ -308,16 +308,12 @@ updateAdjustedBranch adj (AdjBranch currbranch) origbranch -- adjustment is stable. return True -{- Passed an action that, if it succeeds may get or drop the Key associated - - with the file. When the adjusted branch needs to be refreshed to reflect +{- Passed an action that, if it succeeds may get or drop a key. + - When the adjusted branch needs to be refreshed to reflect - those changes, it's handled here. - - - - Note that the AssociatedFile must be verified by this to point to the - - Key. In some cases, the value was provided by the user and might not - - really be an associated file. -} -adjustedBranchRefresh :: AssociatedFile -> Annex a -> Annex a -adjustedBranchRefresh _af a = do +adjustedBranchRefresh :: Annex a -> Annex a +adjustedBranchRefresh a = do r <- a go return r diff --git a/Annex/Content.hs b/Annex/Content.hs index c4a0f8490c..f01432669e 100644 --- a/Annex/Content.hs +++ b/Annex/Content.hs @@ -376,16 +376,16 @@ lockContentUsing contentlocker key fallback a = withContentLockFile key $ \mlock {- Runs an action, passing it the temp file to get, - and if the action succeeds, verifies the file matches - the key and moves the file into the annex as a key's content. -} -getViaTmp :: RetrievalSecurityPolicy -> VerifyConfig -> Key -> AssociatedFile -> Maybe FileSize -> (OsPath -> Annex (Bool, Verification)) -> Annex Bool -getViaTmp rsp v key af sz action = +getViaTmp :: RetrievalSecurityPolicy -> VerifyConfig -> Key -> Maybe FileSize -> (OsPath -> Annex (Bool, Verification)) -> Annex Bool +getViaTmp rsp v key sz action = checkDiskSpaceToGet key sz False $ - getViaTmpFromDisk rsp v key af action + getViaTmpFromDisk rsp v key action {- Like getViaTmp, but does not check that there is enough disk space - for the incoming key. For use when the key content is already on disk - and not being copied into place. -} -getViaTmpFromDisk :: RetrievalSecurityPolicy -> VerifyConfig -> Key -> AssociatedFile -> (OsPath -> Annex (Bool, Verification)) -> Annex Bool -getViaTmpFromDisk rsp v key af action = checkallowed $ do +getViaTmpFromDisk :: RetrievalSecurityPolicy -> VerifyConfig -> Key -> (OsPath -> Annex (Bool, Verification)) -> Annex Bool +getViaTmpFromDisk rsp v key action = checkallowed $ do tmpfile <- prepTmp key resuming <- liftIO $ doesPathExist tmpfile (ok, verification) <- action tmpfile @@ -400,7 +400,7 @@ getViaTmpFromDisk rsp v key af action = checkallowed $ do else verification if ok then ifM (verifyKeyContentPostRetrieval rsp v verification' key tmpfile) - ( pruneTmpWorkDirBefore tmpfile (moveAnnex key af) + ( pruneTmpWorkDirBefore tmpfile (moveAnnex key) , do verificationOfContentFailed tmpfile return False @@ -507,8 +507,8 @@ withTmp key action = do - accepted into the repository. Will display a warning message in this - case. May also throw exceptions in some cases. -} -moveAnnex :: Key -> AssociatedFile -> OsPath -> Annex Bool -moveAnnex key af src = ifM (checkSecureHashes' key) +moveAnnex :: Key -> OsPath -> Annex Bool +moveAnnex key src = ifM (checkSecureHashes' key) ( do #ifdef mingw32_HOST_OS {- Windows prevents deletion of files that are not @@ -523,7 +523,7 @@ moveAnnex key af src = ifM (checkSecureHashes' key) where storeobject dest = ifM (liftIO $ doesPathExist dest) ( alreadyhave - , adjustedBranchRefresh af $ modifyContentDir dest $ do + , adjustedBranchRefresh $ modifyContentDir dest $ do liftIO $ moveFile src dest -- Freeze the object file now that it is in place. -- Waiting until now to freeze it allows for freeze @@ -776,7 +776,7 @@ removeAnnex (ContentRemovalLock key) = withObjectLoc key $ \file -> -- it's unmodified. resetpointer file = unlessM (liftIO $ isSymbolicLink <$> R.getSymbolicLinkStatus (fromOsPath file)) $ ifM (isUnmodified key file) - ( adjustedBranchRefresh (AssociatedFile (Just file)) $ + ( adjustedBranchRefresh $ depopulatePointerFile key file -- Modified file, so leave it alone. -- If it was a hard link to the annex object, diff --git a/Annex/Import.hs b/Annex/Import.hs index b351504ace..2e86df920d 100644 --- a/Annex/Import.hs +++ b/Annex/Import.hs @@ -863,7 +863,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec ia loc [cid] tmpfile (Left k) (combineMeterUpdate p' p) - ok <- moveAnnex k af tmpfile + ok <- moveAnnex k tmpfile when ok $ logStatus NoLiveUpdate k InfoPresent return (Just (k, ok)) @@ -906,7 +906,7 @@ importKeys remote importtreeconfig importcontent thirdpartypopulated importablec p case keyGitSha k of Nothing -> do - ok <- moveAnnex k af tmpfile + ok <- moveAnnex k tmpfile when ok $ do recordcidkey cidmap cid k logStatus NoLiveUpdate k InfoPresent diff --git a/Annex/Ingest.hs b/Annex/Ingest.hs index 695a0cb063..07b5dad282 100644 --- a/Annex/Ingest.hs +++ b/Annex/Ingest.hs @@ -198,17 +198,11 @@ ingest' preferredbackend meterupdate (Just (LockedDown cfg source)) mk restage = | otherwise = gounlocked key mcache golocked key mcache = - tryNonAsync (moveAnnex key naf (contentLocation source)) >>= \case + tryNonAsync (moveAnnex key (contentLocation source)) >>= \case Right True -> success key mcache Right False -> giveup "failed to add content to annex" Left e -> restoreFile (keyFilename source) key e - -- moveAnnex uses the AssociatedFile provided to it to unlock - -- locked files when getting a file in an adjusted branch. - -- That case does not apply here, where we're adding an unlocked - -- file, so provide it nothing. - naf = AssociatedFile Nothing - gounlocked key (Just cache) = do -- Remove temp directory hard link first because -- linkToAnnex falls back to copying if a file @@ -377,7 +371,7 @@ addAnnexedFile matcher file key mtmp = ifM (addUnlocked matcher mi (isJust mtmp) stagePointerFile file mode =<< hashPointerFile key Database.Keys.addAssociatedFile key =<< inRepo (toTopFilePath file) case mtmp of - Just tmp -> ifM (moveAnnex key af tmp) + Just tmp -> ifM (moveAnnex key tmp) ( linkunlocked mode >> return True , writepointer mode >> return False ) @@ -388,11 +382,10 @@ addAnnexedFile matcher file key mtmp = ifM (addUnlocked matcher mi (isJust mtmp) , do addSymlink file key Nothing case mtmp of - Just tmp -> moveAnnex key af tmp + Just tmp -> moveAnnex key tmp Nothing -> return True ) where - af = AssociatedFile (Just file) mi = case mtmp of Just tmp -> MatchingFile $ FileInfo { contentFile = tmp diff --git a/Annex/Transfer.hs b/Annex/Transfer.hs index 45969003ae..7ec629e442 100644 --- a/Annex/Transfer.hs +++ b/Annex/Transfer.hs @@ -78,7 +78,7 @@ download r key f d witness = Just StallDetectionDisabled -> go Nothing Just sd -> runTransferrer sd r key f d Download witness where - go sd = getViaTmp (Remote.retrievalSecurityPolicy r) vc key f Nothing $ \dest -> + go sd = getViaTmp (Remote.retrievalSecurityPolicy r) vc key Nothing $ \dest -> download' (Remote.uuid r) key f sd d (go' dest) witness go' dest p = verifiedAction $ Remote.retrieveKeyFile r key f dest p vc diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index 79d6befd5b..beacd137a3 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -927,7 +927,7 @@ downloadGitBundle rmt k = getKeyExportLocations rmt k >>= \case getexport loc = catchNonAsync (getexport' loc) (const (pure False)) getexport' loc = - getViaTmp rsp vc k (AssociatedFile Nothing) Nothing $ \tmp -> do + getViaTmp rsp vc k Nothing $ \tmp -> do v <- Remote.retrieveExport (Remote.exportActions rmt) k loc tmp nullMeterUpdate return (True, v) @@ -986,7 +986,7 @@ generateGitBundle rmt bs manifest = tmp nullMeterUpdate if (bundlekey `notElem` inManifest manifest) then do - unlessM (moveAnnex bundlekey (AssociatedFile Nothing) tmp) $ + unlessM (moveAnnex bundlekey tmp) $ giveup "Unable to push" return (bundlekey, uploadaction bundlekey) else return (bundlekey, noop) diff --git a/Command/Multicast.hs b/Command/Multicast.hs index 280f862fe4..f29db57e47 100644 --- a/Command/Multicast.hs +++ b/Command/Multicast.hs @@ -213,7 +213,7 @@ storeReceived f = do warning $ "Received a file " <> QuotedPath f <> " that is not a git-annex key. Deleting this file." liftIO $ removeWhenExistsWith removeFile f Just k -> void $ logStatusAfter NoLiveUpdate k $ - getViaTmpFromDisk RetrievalVerifiableKeysSecure AlwaysVerify k (AssociatedFile Nothing) $ \dest -> unVerified $ + getViaTmpFromDisk RetrievalVerifiableKeysSecure AlwaysVerify k $ \dest -> unVerified $ liftIO $ catchBoolIO $ do renameFile f dest return True diff --git a/Command/ReKey.hs b/Command/ReKey.hs index 3f02f2ab60..8688dff25c 100644 --- a/Command/ReKey.hs +++ b/Command/ReKey.hs @@ -128,7 +128,7 @@ linkKey file oldkey newkey = ifM (isJust <$> isAnnexLink file) - and vulnerable to corruption. -} linkKey' :: VerifyConfig -> Key -> Key -> Annex Bool linkKey' v oldkey newkey = - getViaTmpFromDisk RetrievalAllKeysSecure v newkey (AssociatedFile Nothing) $ \tmp -> unVerified $ do + getViaTmpFromDisk RetrievalAllKeysSecure v newkey $ \tmp -> unVerified $ do oldobj <- calcRepo (gitAnnexLocation oldkey) isJust <$> linkOrCopy' (return True) newkey oldobj tmp Nothing diff --git a/Command/RecvKey.hs b/Command/RecvKey.hs index b1cd926236..c3f0eb3289 100644 --- a/Command/RecvKey.hs +++ b/Command/RecvKey.hs @@ -28,7 +28,7 @@ start :: (SeekInput, Key) -> CommandStart start (_, key) = fieldTransfer Download key $ \_p -> do -- This matches the retrievalSecurityPolicy of Remote.Git let rsp = RetrievalAllKeysSecure - ifM (getViaTmp rsp DefaultVerify key (AssociatedFile Nothing) Nothing go) + ifM (getViaTmp rsp DefaultVerify key Nothing go) ( do logStatus NoLiveUpdate key InfoPresent _ <- quiesce True diff --git a/Command/Reinject.hs b/Command/Reinject.hs index 7ea45623fb..0e5d2651d3 100644 --- a/Command/Reinject.hs +++ b/Command/Reinject.hs @@ -129,7 +129,7 @@ perform src key = do ) where move = checkDiskSpaceToGet key Nothing False $ - moveAnnex key (AssociatedFile Nothing) src + moveAnnex key src cleanup :: Key -> CommandCleanup cleanup key = do diff --git a/Command/SetKey.hs b/Command/SetKey.hs index b7db0200df..0026f82295 100644 --- a/Command/SetKey.hs +++ b/Command/SetKey.hs @@ -36,7 +36,7 @@ perform file key = do -- the file might be on a different filesystem, so moveFile is used -- rather than simply calling moveAnnex; disk space is also -- checked this way. - ok <- getViaTmp RetrievalAllKeysSecure DefaultVerify key (AssociatedFile Nothing) Nothing $ \dest -> unVerified $ + ok <- getViaTmp RetrievalAllKeysSecure DefaultVerify key Nothing $ \dest -> unVerified $ if dest /= file then liftIO $ catchBoolIO $ do moveFile file dest diff --git a/Command/TestRemote.hs b/Command/TestRemote.hs index b35ee6ecb2..3bc161d3fe 100644 --- a/Command/TestRemote.hs +++ b/Command/TestRemote.hs @@ -301,7 +301,7 @@ test runannex mkr mkk = Just verifier -> do loc <- Annex.calcRepo (gitAnnexLocation k) verifier k loc - get r k = logStatusAfter NoLiveUpdate k $ getViaTmp (Remote.retrievalSecurityPolicy r) (RemoteVerify r) k (AssociatedFile Nothing) Nothing $ \dest -> + get r k = logStatusAfter NoLiveUpdate k $ getViaTmp (Remote.retrievalSecurityPolicy r) (RemoteVerify r) k Nothing $ \dest -> tryNonAsync (Remote.retrieveKeyFile r k (AssociatedFile Nothing) dest nullMeterUpdate (RemoteVerify r)) >>= \case Right v -> return (True, v) Left _ -> return (False, UnVerified) @@ -375,13 +375,13 @@ testUnavailable runannex mkr mkk = , check (`notElem` [Right True, Right False]) "checkPresent" $ \r k -> Remote.checkPresent r k , check (== Right False) "retrieveKeyFile" $ \r k -> - logStatusAfter NoLiveUpdate k $ getViaTmp (Remote.retrievalSecurityPolicy r) (RemoteVerify r) k (AssociatedFile Nothing) Nothing $ \dest -> + logStatusAfter NoLiveUpdate k $ getViaTmp (Remote.retrievalSecurityPolicy r) (RemoteVerify r) k Nothing $ \dest -> tryNonAsync (Remote.retrieveKeyFile r k (AssociatedFile Nothing) dest nullMeterUpdate (RemoteVerify r)) >>= \case Right v -> return (True, v) Left _ -> return (False, UnVerified) , check (== Right False) "retrieveKeyFileCheap" $ \r k -> case Remote.retrieveKeyFileCheap r of Nothing -> return False - Just a -> logStatusAfter NoLiveUpdate k $ getViaTmp (Remote.retrievalSecurityPolicy r) (RemoteVerify r) k (AssociatedFile Nothing) Nothing $ \dest -> + Just a -> logStatusAfter NoLiveUpdate k $ getViaTmp (Remote.retrievalSecurityPolicy r) (RemoteVerify r) k Nothing $ \dest -> unVerified $ isRight <$> tryNonAsync (a k (AssociatedFile Nothing) dest) ] @@ -443,7 +443,7 @@ randKey sz = withTmpFile (literalOsPath "randkey") $ \f h -> do k <- case Types.Backend.genKey Backend.Hash.testKeyBackend of Just a -> a ks nullMeterUpdate Nothing -> giveup "failed to generate random key (backend problem)" - _ <- moveAnnex k (AssociatedFile Nothing) f + _ <- moveAnnex k f return k getReadonlyKey :: Remote -> OsPath -> Annex Key diff --git a/Command/TransferKey.hs b/Command/TransferKey.hs index 9732e7d656..2425082305 100644 --- a/Command/TransferKey.hs +++ b/Command/TransferKey.hs @@ -63,7 +63,7 @@ toPerform key af remote = go Upload af $ fromPerform :: Key -> AssociatedFile -> Remote -> CommandPerform fromPerform key af remote = go Upload af $ download' (uuid remote) key af Nothing stdRetry $ \p -> - logStatusAfter NoLiveUpdate key $ getViaTmp (retrievalSecurityPolicy remote) vc key af Nothing $ \t -> + logStatusAfter NoLiveUpdate key $ getViaTmp (retrievalSecurityPolicy remote) vc key Nothing $ \t -> tryNonAsync (Remote.retrieveKeyFile remote key af t p vc) >>= \case Right v -> return (True, v) Left e -> do diff --git a/Command/TransferKeys.hs b/Command/TransferKeys.hs index f06a687c71..07a0051ed0 100644 --- a/Command/TransferKeys.hs +++ b/Command/TransferKeys.hs @@ -50,7 +50,7 @@ start = do return True | otherwise = notifyTransfer direction af $ download' (Remote.uuid remote) key af Nothing stdRetry $ \p -> - logStatusAfter NoLiveUpdate key $ getViaTmp (Remote.retrievalSecurityPolicy remote) (RemoteVerify remote) key af Nothing $ \t -> do + logStatusAfter NoLiveUpdate key $ getViaTmp (Remote.retrievalSecurityPolicy remote) (RemoteVerify remote) key Nothing $ \t -> do r <- tryNonAsync (Remote.retrieveKeyFile remote key af t p (RemoteVerify remote)) >>= \case Left e -> do warning (UnquotedString (show e)) diff --git a/Command/Transferrer.hs b/Command/Transferrer.hs index f84f783597..a87fedd2b2 100644 --- a/Command/Transferrer.hs +++ b/Command/Transferrer.hs @@ -55,7 +55,7 @@ start = do -- so caller is responsible for doing notification -- and for retrying, and updating location log, -- and stall canceling. - let go p = getViaTmp (Remote.retrievalSecurityPolicy remote) (RemoteVerify remote) key af Nothing $ \t -> do + let go p = getViaTmp (Remote.retrievalSecurityPolicy remote) (RemoteVerify remote) key Nothing $ \t -> do Remote.verifiedAction (Remote.retrieveKeyFile remote key af t p (RemoteVerify remote)) in download' (Remote.uuid remote) key af Nothing noRetry go noNotification @@ -72,7 +72,7 @@ start = do runner (AssistantDownloadRequest _ key (TransferAssociatedFile file)) remote = notifyTransfer Download file $ download' (Remote.uuid remote) key file Nothing stdRetry $ \p -> - logStatusAfter NoLiveUpdate key $ getViaTmp (Remote.retrievalSecurityPolicy remote) (RemoteVerify remote) key file Nothing $ \t -> do + logStatusAfter NoLiveUpdate key $ getViaTmp (Remote.retrievalSecurityPolicy remote) (RemoteVerify remote) key Nothing $ \t -> do r <- tryNonAsync (Remote.retrieveKeyFile remote key file t p (RemoteVerify remote)) >>= \case Left e -> do warning (UnquotedString (show e)) diff --git a/P2P/Annex.hs b/P2P/Annex.hs index a6beb64eb3..15a829550b 100644 --- a/P2P/Annex.hs +++ b/P2P/Annex.hs @@ -81,7 +81,7 @@ runLocal runst runner a = case a of iv <- startVerifyKeyContentIncrementally DefaultVerify k let runtransfer ti = Right <$> transfer download' k af Nothing (\p -> - logStatusAfter NoLiveUpdate k $ getViaTmp rsp DefaultVerify k af Nothing $ \tmp -> + logStatusAfter NoLiveUpdate k $ getViaTmp rsp DefaultVerify k Nothing $ \tmp -> storefile tmp o l getb iv validitycheck p ti) let fallback = return $ Left $ ProtoFailureMessage "transfer already in progress, or unable to take transfer lock" diff --git a/Remote/Git.hs b/Remote/Git.hs index 71c6571554..cda705cb0e 100644 --- a/Remote/Git.hs +++ b/Remote/Git.hs @@ -682,7 +682,7 @@ copyToRemote' repo r st@(State connpool duc _ _ _) key af o meterupdate let checksuccess = liftIO checkio >>= \case Just err -> giveup err Nothing -> return True - logStatusAfter NoLiveUpdate key $ Annex.Content.getViaTmp rsp verify key af (Just sz) $ \dest -> + logStatusAfter NoLiveUpdate key $ Annex.Content.getViaTmp rsp verify key (Just sz) $ \dest -> metered (Just (combineMeterUpdate meterupdate p)) key bwlimit $ \_ p' -> copier object dest key p' checksuccess verify ) diff --git a/Upgrade/V0.hs b/Upgrade/V0.hs index ea8c8e7de9..a5cf83e36e 100644 --- a/Upgrade/V0.hs +++ b/Upgrade/V0.hs @@ -25,8 +25,7 @@ upgrade = do olddir <- fromRepo gitAnnexDir keys <- getKeysPresent0 olddir forM_ keys $ \k -> - moveAnnex k (AssociatedFile Nothing) - (olddir toOsPath (keyFile0 k)) + moveAnnex k (olddir toOsPath (keyFile0 k)) -- update the symlinks to the key files -- No longer needed here; V1.upgrade does the same thing diff --git a/Upgrade/V1.hs b/Upgrade/V1.hs index b9ae3af8a8..d0aaba73a3 100644 --- a/Upgrade/V1.hs +++ b/Upgrade/V1.hs @@ -85,7 +85,7 @@ moveContent = do let d = parentDir f liftIO $ allowWrite d liftIO $ allowWrite f - _ <- moveAnnex k (AssociatedFile Nothing) f + _ <- moveAnnex k f liftIO $ removeDirectory d updateSymlinks :: Annex () From e0b46ef7ad5817fe16c84d83583ca10302834c00 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 21 Feb 2025 15:02:53 -0400 Subject: [PATCH 04/53] compute special remote mostly implemented Except for some of the hard parts: progress displays, incremental verification, and getting inputs before running a computation. Untested! In order to test this, git-annex addcomputed needs to be implemented. --- Remote/Compute.hs | 409 ++++++++++++++++++++++++++++++++-------------- 1 file changed, 290 insertions(+), 119 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 18ebe950f7..3012337f3d 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -5,7 +5,19 @@ - Licensed under the GNU AGPL version 3 or higher. -} -module Remote.Compute (remote) where +{-# LANGUAGE OverloadedStrings #-} + +module Remote.Compute ( + remote, + Interface, + ComputeState(..), + setComputeState, + getComputeStates, + InterfaceEnv, + interfaceEnv, + getComputeProgram, + runComputeProgram, +) where import Annex.Common import Types.Remote @@ -18,21 +30,23 @@ import Remote.Helper.Special import Remote.Helper.ExportImport import Annex.SpecialRemote.Config import Annex.UUID +import Annex.Content +import Annex.Tmp import Logs.MetaData import Utility.Metered -import Utility.Hash import Utility.TimeStamp -import Git.FilePath +import Utility.Env import qualified Git import qualified Utility.SimpleProtocol as Proto +import Network.HTTP.Types.URI import Control.Concurrent.STM import Data.Time.Clock import Data.Either -import Data.Char -import Data.Ord +import Text.Read import qualified Data.Map as M import qualified Data.Set as S +import qualified Data.ByteString as B import qualified Data.Text as T import qualified Data.Text.Encoding as T @@ -106,19 +120,20 @@ computeConfigParser :: RemoteConfig -> Annex RemoteConfigParser computeConfigParser rc = do Interface interface <- case getComputeProgram rc of Left _ -> pure $ Interface [] - Right program -> liftIO (getInterfaceUncached program) >>= return . \case + Right program -> liftIO (getInterface program) >>= return . \case Left _ -> Interface [] Right interface -> interface let m = M.fromList $ mapMaybe collectfields interface - let ininterface f = case toField (fromProposedAccepted f) of - Just f' -> M.member f' m - Nothing -> False + let ininterface f = M.member (Field (fromProposedAccepted f)) m return $ RemoteConfigParser { remoteConfigFieldParsers = [ optionalStringParser programField (FieldDesc $ "compute program (must start with \"" ++ safetyPrefix ++ "\")") ] - , remoteConfigRestPassthrough = Just (ininterface, M.toList $ M.mapKeys fromField m) + , remoteConfigRestPassthrough = Just + ( ininterface + , M.toList $ M.mapKeys fromField m + ) } where collectfields (InterfaceInput f d) = Just (f, FieldDesc d) @@ -150,7 +165,7 @@ programField = Accepted "program" type Description = String -newtype Field = Field MetaField +newtype Field = Field { fromField :: String } deriving (Show, Eq, Ord) data InterfaceItem @@ -175,23 +190,33 @@ instance Proto.Receivable InterfaceItem where parseCommand "VALUE?" = Proto.parse2 InterfaceOptionalValue parseCommand "OUTPUT" = Proto.parse2 InterfaceOutput parseCommand "REPRODUCIBLE" = Proto.parse0 InterfaceReproducible + parseCommand _ = Proto.parseFail + +data ProcessOutput + = Computing Field FilePath + | Progress PercentFloat + deriving (Show, Eq) + +instance Proto.Receivable ProcessOutput where + parseCommand "COMPUTING" = Proto.parse2 Computing + parseCommand "PROGRESS" = Proto.parse1 Progress + parseCommand _ = Proto.parseFail instance Proto.Serializable Field where serialize = fromField - deserialize = toField + deserialize = Just . Field --- While MetaField is case insensitive, environment variable names are not, --- so make Field always lower cased. -toField :: String -> Maybe Field -toField f = Field <$> toMetaField (T.pack (map toLower f)) +newtype PercentFloat = PercentFloat Float + deriving (Show, Eq) -fromField :: Field -> String -fromField (Field f) = T.unpack (fromMetaField f) +instance Proto.Serializable PercentFloat where + serialize (PercentFloat p) = show p + deserialize s = PercentFloat <$> readMaybe s -getInterface :: ComputeProgram -> TMVar (Maybe Interface) -> IO (Either String Interface) -getInterface program iv = +getInterfaceCached :: ComputeProgram -> TMVar (Maybe Interface) -> IO (Either String Interface) +getInterfaceCached program iv = atomically (takeTMVar iv) >>= \case - Nothing -> getInterfaceUncached program >>= \case + Nothing -> getInterface program >>= \case Left err -> do atomically $ putTMVar iv Nothing return (Left err) @@ -202,8 +227,8 @@ getInterface program iv = atomically $ putTMVar iv (Just interface) return (Right interface) -getInterfaceUncached :: ComputeProgram -> IO (Either String Interface) -getInterfaceUncached (ComputeProgram program) = +getInterface :: ComputeProgram -> IO (Either String Interface) +getInterface (ComputeProgram program) = catchMaybeIO (readProcess program ["interface"]) >>= \case Nothing -> return $ Left $ "Failed to run " ++ program Just output -> return $ case parseInterface output of @@ -235,148 +260,294 @@ data ComputeState = ComputeState { computeInputs :: M.Map Field ComputeInput , computeValues :: M.Map Field ComputeValue , computeOutputs :: M.Map Field ComputeOutput - , computeTimeEstimate :: NominalDiffTime } deriving (Show, Eq) --- Generates a hash of a ComputeState. --- --- This is used as a short unique identifier in the metadata fields, --- since more than one ComputeState may be stored in the compute remote's --- metadata for a given Key. --- --- A md5 is fine for this. It does not need to protect against intentional --- collisions. And 2^64 is a sufficiently small chance of accidental --- collision. -hashComputeState :: ComputeState -> String -hashComputeState state = show $ md5s $ - mconcat (map (go goi) (M.toAscList (computeInputs state))) - <> - mconcat (map (go gov) (M.toAscList (computeValues state))) - <> - mconcat (map (go goo) (M.toAscList (computeOutputs state))) - <> - encodeBS (show (computeTimeEstimate state)) +{- Formats a ComputeState as an URL query string. + - + - Prefixes fields with "k" and "f" for computeInputs, with + - "v" for computeValues and "o" for computeOutputs. + - + - When the passed Key is an output, rather than duplicate it + - in the query string, that output has no value. + - + - Fields in the query string are sorted. This is in order to ensure + - that the same ComputeState is always formatted the same way. + - + - Example: "ffoo=somefile&kfoo=WORM--foo&oresult&vbar=11" + -} +formatComputeState :: Key -> ComputeState -> B.ByteString +formatComputeState k st = renderQuery False $ sortOn fst $ concat + [ concatMap formatinput $ M.toList (computeInputs st) + , map formatvalue $ M.toList (computeValues st) + , map formatoutput $ M.toList (computeOutputs st) + ] where - go c (Field f, v) = T.encodeUtf8 (fromMetaField f) <> c v - goi (ComputeInput k f) = serializeKey' k <> encodeBS f - gov (ComputeValue s) = encodeBS s - goo (ComputeOutput k) = serializeKey' k + formatinput (f, ComputeInput key file) = + [ ("k" <> fb, Just (serializeKey' key)) + , ("f" <> fb, Just (toRawFilePath file)) + ] + where + fb = encodeBS (fromField f) + formatvalue (f, ComputeValue v) = + ("v" <> encodeBS (fromField f), Just (encodeBS v)) + formatoutput (f, ComputeOutput key) = + ("o" <> encodeBS (fromField f), + if key == k + then Nothing + else Just (serializeKey' key) + ) -computeStateMetaData :: ComputeState -> MetaData -computeStateMetaData = undefined - --- FIXME: Need to unswizzle the mixed up metadata based on hash prefixes. -metaDataComputeStates :: MetaData -> [ComputeState] -metaDataComputeStates (MetaData m) = - go (ComputeState mempty mempty mempty 0) (M.toList m) +parseComputeState :: Key -> B.ByteString -> Maybe ComputeState +parseComputeState k b = + let q = parseQuery b + st = go emptycomputestate (M.fromList q) q + in if st == emptycomputestate then Nothing else Just st where - go c ((f,v):rest) = - let c' = case T.unpack (fromMetaField f) of - ('i':'n':'p':'u':'t':'-':f') -> case M.lookup m =<< toMetaField (T.pack ("key-" ++ f')) of - Nothing -> c - Just kv -> case deserializeKey' (fromMetaValue kv) of - Just k -> c - { computeInputs = - M.insert (toField f) - (ComputeInput k (decodeBS (fromMetaValue v))) - (computeOutputs c) - } - Nothing -> c - ('v':'a':'l':'u':'e':'-':f') -> c - { computeValues = - M.insert (toField f) - (ComputeValue (decodeBS (fromMetaValue v))) - (computeValues c) - } - ('o':'u':'t':'p':'u':'t':'-':f') -> - case deserializeKey' (fromMetaValue v) of - Just k -> c + emptycomputestate = ComputeState mempty mempty mempty + go c _ [] = c + go c m ((f, v):rest) = + let c' = fromMaybe c $ case decodeBS f of + ('f':f') -> do + file <- fromRawFilePath <$> v + kv <- M.lookup (encodeBS ('k':f')) m + key <- deserializeKey' =<< kv + Just $ c + { computeInputs = + M.insert (Field f') + (ComputeInput key file) + (computeInputs c) + } + ('v':f') -> do + val <- decodeBS <$> v + Just $ c + { computeValues = + M.insert (Field f') + (ComputeValue val) + (computeValues c) + } + ('o':f') -> case v of + Just kv -> do + key <- deserializeKey' kv + Just $ c { computeOutputs = - M.insert (toField f) - (ComputeOutput k) + M.insert (Field f') + (ComputeOutput key) (computeOutputs c) } - Nothing -> c - ('t':'i':'m':'e':'-':f') -> - case parsePOSIXTime (fromMetaValue v) of - Just t -> c { computeTimeEstimate = t } - Nothing -> c - _ -> c - in go c' rest + Nothing -> Just $ c + { computeOutputs = + M.insert (Field f') + (ComputeOutput k) + (computeOutputs c) + } + _ -> Nothing + in go c' m rest -getComputeStates :: RemoteStateHandle -> Key -> Annex [ComputeState] +{- The per remote metadata is used to store ComputeState. This allows + - recording multiple ComputeStates that generate the same key. + - + - The metadata fields are numbers (prefixed with "t" to make them legal + - field names), which are estimates of how long it might take to run + - the computation (in seconds). + -} +setComputeState :: RemoteStateHandle -> Key -> NominalDiffTime -> ComputeState -> Annex () +setComputeState rs k ts st = addRemoteMetaData k rs $ MetaData $ M.singleton + (mkMetaFieldUnchecked $ T.pack ('t':show (truncateResolution 1 ts))) + (S.singleton (MetaValue (CurrentlySet True) (formatComputeState k st))) + +getComputeStates :: RemoteStateHandle -> Key -> Annex [(NominalDiffTime, ComputeState)] getComputeStates rs k = do - RemoteMetaData _ m <- getCurrentRemoteMetaData rs k - return (metaDataComputeStates m) + RemoteMetaData _ (MetaData m) <- getCurrentRemoteMetaData rs k + return $ go [] (M.toList m) + where + go c [] = concat c + go c ((f, s) : rest) = + let sts = mapMaybe (parseComputeState k . fromMetaValue) + (S.toList s) + in case parsePOSIXTime (T.encodeUtf8 (T.drop 1 (fromMetaField f))) of + Just ts -> go (zip (repeat ts) sts : c) rest + Nothing -> go c rest -setComputeState :: RemoteStateHandle -> Key -> ComputeState -> Annex () -setComputeState rs k st = addRemoteMetaData k rs (computeStateMetaData st) +data InterfaceEnv = InterfaceEnv [(String, Either Key String)] + +data InterfaceOutputs = InterfaceOutputs (M.Map Field Key) {- Finds the first compute state that provides everything required by the - interface, and returns a list of what should be provided to the program - - in its environment. + - in its environment, and what outputs the program is expected to make. -} -interfaceEnv :: [ComputeState] -> Interface -> Either String [(String, Either Key String)] +interfaceEnv :: [ComputeState] -> Interface -> Either String (InterfaceEnv, InterfaceOutputs) interfaceEnv states interface = go Nothing states where go (Just firsterr) [] = Left firsterr - go Nothing [] = interfaceEnv' (ComputeState mempty mempty mempty 0) interface + go Nothing [] = interfaceEnv' (ComputeState mempty mempty mempty) interface go firsterr (state:rest) = case interfaceEnv' state interface of Right v -> Right v Left e | null rest -> Left (fromMaybe e firsterr) | otherwise -> go (firsterr <|> Just e) rest -interfaceEnv' :: ComputeState -> Interface -> Either String [(String, Either Key String)] -interfaceEnv' state (Interface interface) = - case partitionEithers (mapMaybe go interface) of - ([], env) -> Right $ - map (\(f, v) -> (fromField f, v)) env +interfaceEnv' :: ComputeState -> Interface -> Either String (InterfaceEnv, InterfaceOutputs) +interfaceEnv' state interface@(Interface i) = + case partitionEithers (mapMaybe go i) of + ([], r) -> Right + ( InterfaceEnv (map (\(f, v) -> (fromField f, v)) r) + , interfaceOutputs state interface + ) (problems, _) -> Left $ unlines problems where - go (InterfaceInput name desc) = - case M.lookup name (computeInputs state) of + go (InterfaceInput field desc) = + case M.lookup field (computeInputs state) of Just (ComputeInput key _file) -> Just $ - Right (name, Left key) + Right (field, Left key) Nothing -> Just $ - Left $ "Missing required input \"" ++ fromField name ++ "\" -- " ++ desc - go (InterfaceOptionalInput name desc) = - case M.lookup name (computeInputs state) of + Left $ "Missing required input \"" ++ fromField field ++ "\" -- " ++ desc + go (InterfaceOptionalInput field _desc) = + case M.lookup field (computeInputs state) of Just (ComputeInput key _file) -> Just $ - Right (name, Left key) + Right (field, Left key) Nothing -> Nothing - go (InterfaceValue name desc) = - case M.lookup name (computeValues state) of + go (InterfaceValue field desc) = + case M.lookup field (computeValues state) of Just (ComputeValue v) -> Just $ - Right (name, Right v) - nothing -> Just $ - Left $ "Missing required value \"" ++ fromField name ++ "\" -- " ++ desc - go (InterfaceOptionalValue name desc) = - case M.lookup name (computeValues state) of + Right (field, Right v) + Nothing -> Just $ + Left $ "Missing required value \"" ++ fromField field ++ "\" -- " ++ desc + go (InterfaceOptionalValue field _desc) = + case M.lookup field (computeValues state) of Just (ComputeValue v) -> Just $ - Right (name, Right v) + Right (field, Right v) Nothing -> Nothing go (InterfaceOutput _ _) = Nothing go InterfaceReproducible = Nothing +interfaceOutputs :: ComputeState -> Interface -> InterfaceOutputs +interfaceOutputs state (Interface interface) = + InterfaceOutputs $ M.fromList $ mapMaybe go interface + where + go (InterfaceOutput field _) = do + ComputeOutput key <- M.lookup field (computeOutputs state) + Just (field, key) + go _ = Nothing + +computeProgramEnvironment :: InterfaceEnv -> Annex [(String, String)] +computeProgramEnvironment (InterfaceEnv ienv) = do + environ <- filter (caninherit . fst) <$> liftIO getEnvironment + interfaceenv <- mapM go ienv + return $ environ ++ interfaceenv + where + envprefix = "ANNEX_COMPUTE_" + caninherit v = not (envprefix `isPrefixOf` v) + go (f, Right v) = return (envprefix ++ f, v) + go (f, Left k) = + ifM (inAnnex k) + ( do + objloc <- calcRepo (gitAnnexLocation k) + return (envprefix ++ f, fromOsPath objloc) + , giveup "missing an input to the computation" + ) + +runComputeProgram + :: ComputeProgram + -> Key + -> AssociatedFile + -> OsPath + -> MeterUpdate + -> VerifyConfig + -> (InterfaceEnv, InterfaceOutputs) + -> Annex Verification +runComputeProgram (ComputeProgram program) k _af dest p vc (ienv, InterfaceOutputs iout) = do + environ <- computeProgramEnvironment ienv + withOtherTmp $ \tmpdir -> + go environ tmpdir + `finally` liftIO (removeDirectoryRecursive tmpdir) + where + go environ tmpdir = do + let pr = (proc program []) + { cwd = Just $ fromOsPath tmpdir + , std_out = CreatePipe + , env = Just environ + } + computing <- liftIO $ withCreateProcess pr $ + processoutput mempty tmpdir + finish computing tmpdir + + processoutput computing tmpdir _ (Just h) _ pid = + hGetLineUntilExitOrEOF pid h >>= \case + Just l + | null l -> processoutput computing tmpdir Nothing (Just h) Nothing pid + | otherwise -> parseoutput computing l >>= \case + Just computing' -> + processoutput computing' tmpdir Nothing (Just h) Nothing pid + Nothing -> do + hClose h + ifM (checkSuccessProcess pid) + ( giveup $ program ++ " output included an unparseable line: \"" ++ l ++ "\"" + , giveup $ program ++ " exited unsuccessfully" + ) + Nothing -> do + hClose h + unlessM (checkSuccessProcess pid) $ + giveup $ program ++ " exited unsuccessfully" + return computing + processoutput _ _ _ _ _ _ = error "internal" + + parseoutput computing l = case Proto.parseMessage l of + Just (Computing field file) -> + case M.lookup field iout of + Just key -> do + when (key == k) $ + -- XXX can start watching the file and updating progess now + return () + return $ Just $ + M.insert key (toRawFilePath file) computing + Nothing -> return (Just computing) + Just (Progress percent) -> do + -- XXX + return Nothing + Nothing -> return Nothing + + finish computing tmpdir = do + case M.lookup k computing of + Nothing -> giveup $ program ++ " exited successfully, but failed to output a filename" + Just file -> do + let file' = tmpdir file + unlessM (liftIO $ doesFileExist file') $ + giveup $ program ++ " exited sucessfully, but failed to write the computed file" + catchNonAsync (liftIO $ moveFile file' dest) + (\err -> giveup $ "failed to move the computed file: " ++ show err) + + -- Try to move any other computed object files into the annex. + forM_ (M.toList computing) $ \(key, file) -> + when (k /= key) $ do + let file' = tmpdir file + whenM (liftIO $ doesFileExist file') $ + whenM (verifyKeyContentPostRetrieval RetrievalAllKeysSecure vc verification k file') $ + void $ tryNonAsync $ moveAnnex k file' + + return verification + + -- The program might not be reproducible, so require strong + -- verification. + verification = MustVerify + computeKey :: RemoteStateHandle -> ComputeProgram -> TMVar (Maybe Interface) -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification -computeKey rs program iv k _af dest p vc = - liftIO (getInterface program iv) >>= \case +computeKey rs program iv k af dest p vc = + liftIO (getInterfaceCached program iv) >>= \case Left err -> giveup err Right interface -> do - states <- sortBy (comparing computeTimeEstimate) + states <- map snd . sortOn fst <$> getComputeStates rs k - case interfaceEnv states interface of - Left err -> giveup err - Right ienv -> undefined -- TODO + either giveup (runComputeProgram program k af dest p vc) + (interfaceEnv states interface) -- Make sure that the compute state has everything needed by -- the program's current interface. checkKey :: RemoteStateHandle -> ComputeProgram -> TMVar (Maybe Interface) -> Key -> Annex Bool checkKey rs program iv k = do - states <- getComputeStates rs k - liftIO (getInterface program iv) >>= \case + states <- map snd <$> getComputeStates rs k + liftIO (getInterfaceCached program iv) >>= \case Left err -> giveup err Right interface -> case interfaceEnv states interface of From b804f8a3ccee9dc1f54fc5ae749f55f82d9a1c78 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Fri, 21 Feb 2025 15:09:46 -0400 Subject: [PATCH 05/53] update --- doc/design/compute_special_remote_interface.mdwn | 5 ----- 1 file changed, 5 deletions(-) diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 707c65d742..cc03b4861f 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -87,11 +87,6 @@ Use "INPUT" when a file is an input to the computation, and "VALUE" for all other input values. Use "INPUT?" and "VALUE?" for optional inputs and values. -Note that the Name and Id both have to be legal git-annex metadata field -names. And should be lower cased. The user is allowed to use any case -for the names when providing inputs and values to `git-annex addcomputed` -though. - The interface can also optionally include a "REPRODUCIBLE" line. That indicates that the results of its computations are expected to be bit-for-bit reproducible. From 490174b06821bbbe59dad9779932fcf4ee144671 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 24 Feb 2025 12:41:25 -0400 Subject: [PATCH 06/53] new compute program interface This is much more flexible, and also simpler to understand. --- .../compute_special_remote_interface.mdwn | 128 ++++++++---------- 1 file changed, 56 insertions(+), 72 deletions(-) diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index cc03b4861f..33c33ad2ad 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -8,62 +8,69 @@ When an compute special remote is initremoted, a program is specified: git-annex initremote myremote type=compute program=git-annex-compute-foo The user adds an annexed file that is computed by the program by running -a command like this: +a command like one of these: - git-annex addcomputed --to myremote \ - --input raw=file.raw --value passes=10 \ - --output photo=file.jpeg + git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10 + git-annex addcomputed --to=myremote -- compress in out --level=9 + git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz -That command and later `git-annex get` of a computed file both -run the program the same way. +Whatever values the user passes to `git-annex addcomputed` are passed on to +the program, followed by any values that the user provided to +`git-annex initremote`. -The program is passed inputs to the computation via environment variables, -which are all prefixed with `"ANNEX_COMPUTE_"`. +To simplify the program's option parsing, any value that the user provides +that is in the form "foo=bar" will also result in an environment variable +being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. -In the example above, the program will be passed this environment: - - ANNEX_COMPUTE_INPUT_raw=/path/.git/annex/objects/.. - ANNEX_COMPUTE_VALUE_passes=10 - -Default values that are provided to `git-annex initremote` will also be set -in the environment. Eg `git-annex initremote myremote type=compute -program=foo passes=9` will set `ANNEX_COMPUTE_VALUE_passes=9` by default. - -For security, the program should avoid exposing values from `ANNEX_COMPUTE_*` -variables to the shell unprotected, or otherwise executing them. - -The program will also inherit other environment variables -that were set when git-annex was run, like PATH. (`ANNEX_COMPUTE_*` -environment variables are not inherited.) +For security, the program should avoid exposing user input to the shell +unprotected, or otherwise executing it. The program is run in a temporary directory, which will be cleaned up after -it exits. It writes the files that it computes to that directory. +it exits. -Before starting the main computation, the program must output a list of the -files that it will compute, in the form "COMPUTING Id filename". -Here "Id" is a short identifier for a particular file, which the -user specifies when running `git-annex addcomputed`. +The content of any annexed file in the repository can be an input +to the computation. The program requests an input by writing a line to +stdout: -In the example above, the program is expected to output something like: + INPUT file.raw - COMPUTING photo out.jpeg - COMPUTING sidecar otherfile +Then it can read a line from stdin, which will be the path to the content +(eg a `.git/annex/objects/` path). -If possible, the program should write the content of the file it is -computing directly to the file listed in COMPUTING, rather than writing to -somewhere else and renaming it at the end. Except, when the program writes -the file it computes out of order, it should write to a file somewhere else +If the program needs multiple input files, it should output multiple +`INPUT` lines at once, and then read multiple paths from stdin. This +allows retrival of the inputs to potentially run in parallel. + +If an input file is not available, the program's stdin will be closed +without a path being written to it. So when reading from stdin fails, +the program should exit. + +The program computes one or more output files. For each output file that it +will compute, the program should write a line to stdout: + + OUTPUT file.jpeg + +The filename of the output file is both the filename in the program's +temporary directory, and also the filename that will be added to the +git-annex repository by `git-annex compute`. + +If git-annex sees that an output file is growing, it will use its file size +when displaying progress to the user. So if possible, the program should +write the content to the file it is computing directly, rather than writing +to somewhere else and renaming it at the end. But, if the program seeks +around and writes out of order, it should write to a file somewhere else and rename it at the end. -If git-annex sees that the file corresponding to the key it requested be -computed is growing, it will use its file size when displaying progress to -the user. - The program can also output lines to stdout to indicate its current progress: PROGRESS 50% +The program can optionally also output a "REPRODUCIBLE" line. That +indicates that the results of its computations are expected to be +bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if +the `--reproducible` option is set. + Anything that the program outputs to stderr will be displayed to the user. This stderr should be used for error messages, and possibly computation output, but not for progress displays. @@ -71,42 +78,19 @@ output, but not for progress displays. If the program exits nonzero, nothing it computed will be stored in the git-annex repository. -When run with the "interface" parameter, the program must describe its -interface. This is a list of the inputs and outputs that it -supports. This allows `git-annex addcomputed` and `git-annex initremote` to -list inputs and outputs, and also lets them reject invalid inputs and -outputs. - -The output is lines, in the form: - - INPUT[?] Name Description - VALUE[?] Name Description - OUTPUT Id Description - -Use "INPUT" when a file is an input to the computation, -and "VALUE" for all other input values. Use "INPUT?" and "VALUE?" -for optional inputs and values. - -The interface can also optionally include a "REPRODUCIBLE" line. -That indicates that the results of its computations are -expected to be bit-for-bit reproducible. -That makes `git-annex addcomputed` behave as if the `--reproducible` -option is set. - An example `git-annex-compute-foo` shell script follows: #!/bin/sh set -e - if [ "$1" = interface ]; then - echo "INPUT raw A photo in RAW format" - echo "VALUE? passes Number of passes" - echo "OUTPUT photo Computed JPEG" - echo "REPRODUCIBLE" - exit 0 + if [ "$1" != "convert" ]; then + echo "Usage: convert input output [passes=n]" >&2 + exit 1 fi - if [ -z "$ANNEX_COMPUTE_VALUE_passes" ]; then - ANNEX_COMPUTE_VALUE_passes=1 + if [ -z "$ANNEX_COMPUTE_passes" ]; + ANNEX_COMPUTE_passes=1 fi - echo "COMPUTING photo out.jpeg" - frobnicate --passes="$ANNEX_COMPUTE_VALUE_passes" \ - <"$ANNEX_COMPUTE_INPUT_raw" >out.jpeg + echo "INPUT "$2" + read input + echo "OUTPUT $3" + echo REPRODUCIBLE + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" From 921850d05cc450fd2cc8879b8fcd18144c706ee9 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 24 Feb 2025 13:48:46 -0400 Subject: [PATCH 07/53] support addcomputed --fast This complicates the interface but it's still simpler to understand than the old interface. --- .../compute_special_remote_interface.mdwn | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 33c33ad2ad..5c771c17ad 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -14,8 +14,8 @@ a command like one of these: git-annex addcomputed --to=myremote -- compress in out --level=9 git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz -Whatever values the user passes to `git-annex addcomputed` are passed on to -the program, followed by any values that the user provided to +Whatever values the user passes to `git-annex addcomputed` are passed to +the program in `ARGV`, followed by any values that the user provided to `git-annex initremote`. To simplify the program's option parsing, any value that the user provides @@ -45,8 +45,15 @@ If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, the program should exit. -The program computes one or more output files. For each output file that it -will compute, the program should write a line to stdout: +When `git-annex addcomputed --fast` is being used to add a computation +to the git-annex repository without actually performing it, the +response to each "INPUT" will be an empty line rather than the path to +an input file. In that case, the program should proceed with the rest of +its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not +perform any computation. + +For each output file that it will compute, the program should write a +line to stdout: OUTPUT file.jpeg @@ -93,4 +100,6 @@ An example `git-annex-compute-foo` shell script follows: read input echo "OUTPUT $3" echo REPRODUCIBLE - frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + if [ -n "$input" ]; then + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + fi From 40be51c98a3035f5c2af4c95c4ccf0e85010c558 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 24 Feb 2025 15:48:42 -0400 Subject: [PATCH 08/53] reimplement using new compute program interface --- Remote/Compute.hs | 509 +++++++++++++++++----------------------------- 1 file changed, 185 insertions(+), 324 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 3012337f3d..2142c96e48 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -9,12 +9,9 @@ module Remote.Compute ( remote, - Interface, ComputeState(..), setComputeState, getComputeStates, - InterfaceEnv, - interfaceEnv, getComputeProgram, runComputeProgram, ) where @@ -40,9 +37,7 @@ import qualified Git import qualified Utility.SimpleProtocol as Proto import Network.HTTP.Types.URI -import Control.Concurrent.STM import Data.Time.Clock -import Data.Either import Text.Read import qualified Data.Map as M import qualified Data.Set as S @@ -66,23 +61,22 @@ gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle gen r u rc gc rs = case getComputeProgram rc of Left _err -> return Nothing Right program -> do - interface <- liftIO $ newTMVarIO Nothing c <- parsedRemoteConfig remote rc cst <- remoteCost gc c veryExpensiveRemoteCost - return $ Just $ mk program interface c cst + return $ Just $ mk program c cst where - mk program interface c cst = Remote + mk program c cst = Remote { uuid = u , cost = cst , name = Git.repoDescribe r , storeKey = storeKeyUnsupported - , retrieveKeyFile = computeKey rs program interface + , retrieveKeyFile = computeKey rs program , retrieveKeyFileInOrder = pure True , retrieveKeyFileCheap = Nothing , retrievalSecurityPolicy = RetrievalAllKeysSecure , removeKey = dropKey rs , lockContent = Nothing - , checkPresent = checkKey rs program interface + , checkPresent = checkKey rs , checkPresentCheap = False , exportActions = exportUnsupported , importActions = importUnsupported @@ -114,33 +108,19 @@ setupInstance _ mu _ c _ = do gitConfigSpecialRemote u c [("compute", "true")] return (c, u) --- The RemoteConfig is allowed to contain fields from the program's --- interface. That provides defaults for git-annex addcomputed. computeConfigParser :: RemoteConfig -> Annex RemoteConfigParser -computeConfigParser rc = do - Interface interface <- case getComputeProgram rc of - Left _ -> pure $ Interface [] - Right program -> liftIO (getInterface program) >>= return . \case - Left _ -> Interface [] - Right interface -> interface - let m = M.fromList $ mapMaybe collectfields interface - let ininterface f = M.member (Field (fromProposedAccepted f)) m - return $ RemoteConfigParser - { remoteConfigFieldParsers = - [ optionalStringParser programField - (FieldDesc $ "compute program (must start with \"" ++ safetyPrefix ++ "\")") - ] - , remoteConfigRestPassthrough = Just - ( ininterface - , M.toList $ M.mapKeys fromField m - ) - } - where - collectfields (InterfaceInput f d) = Just (f, FieldDesc d) - collectfields (InterfaceOptionalInput f d) = Just (f, FieldDesc d) - collectfields (InterfaceValue f d) = Just (f, FieldDesc d) - collectfields (InterfaceOptionalValue f d) = Just (f, FieldDesc d) - collectfields _ = Nothing +computeConfigParser _ = return $ RemoteConfigParser + { remoteConfigFieldParsers = + [ optionalStringParser programField + (FieldDesc $ "compute program (must start with \"" ++ safetyPrefix ++ "\")") + ] + -- Pass through all other params, which git-annex addcomputed adds + -- to the input params. + , remoteConfigRestPassthrough = Just + ( const True + , [] + ) + } newtype ComputeProgram = ComputeProgram String deriving (Show) @@ -163,49 +143,20 @@ safetyPrefix = "git-annex-compute-" programField :: RemoteConfigField programField = Accepted "program" -type Description = String - -newtype Field = Field { fromField :: String } - deriving (Show, Eq, Ord) - -data InterfaceItem - = InterfaceInput Field Description - | InterfaceOptionalInput Field Description - | InterfaceValue Field Description - | InterfaceOptionalValue Field Description - | InterfaceOutput Field Description - | InterfaceReproducible +data ProcessCommand + = ProcessInput FilePath + | ProcessOutput FilePath + | ProcessReproducible + | ProcessProgress PercentFloat deriving (Show, Eq) --- List order matters, because when displaying the interface to the --- user, need to display it in the same order as the program --- does. -data Interface = Interface [InterfaceItem] - deriving (Show, Eq) - -instance Proto.Receivable InterfaceItem where - parseCommand "INPUT" = Proto.parse2 InterfaceInput - parseCommand "INPUT?" = Proto.parse2 InterfaceOptionalInput - parseCommand "VALUE" = Proto.parse2 InterfaceValue - parseCommand "VALUE?" = Proto.parse2 InterfaceOptionalValue - parseCommand "OUTPUT" = Proto.parse2 InterfaceOutput - parseCommand "REPRODUCIBLE" = Proto.parse0 InterfaceReproducible +instance Proto.Receivable ProcessCommand where + parseCommand "INPUT" = Proto.parse1 ProcessInput + parseCommand "OUTPUT" = Proto.parse1 ProcessOutput + parseCommand "REPRODUCIBLE" = Proto.parse0 ProcessReproducible + parseCommand "PROGRESS" = Proto.parse1 ProcessProgress parseCommand _ = Proto.parseFail -data ProcessOutput - = Computing Field FilePath - | Progress PercentFloat - deriving (Show, Eq) - -instance Proto.Receivable ProcessOutput where - parseCommand "COMPUTING" = Proto.parse2 Computing - parseCommand "PROGRESS" = Proto.parse1 Progress - parseCommand _ = Proto.parseFail - -instance Proto.Serializable Field where - serialize = fromField - deserialize = Just . Field - newtype PercentFloat = PercentFloat Float deriving (Show, Eq) @@ -213,136 +164,80 @@ instance Proto.Serializable PercentFloat where serialize (PercentFloat p) = show p deserialize s = PercentFloat <$> readMaybe s -getInterfaceCached :: ComputeProgram -> TMVar (Maybe Interface) -> IO (Either String Interface) -getInterfaceCached program iv = - atomically (takeTMVar iv) >>= \case - Nothing -> getInterface program >>= \case - Left err -> do - atomically $ putTMVar iv Nothing - return (Left err) - Right interface -> ret interface - Just interface -> ret interface - where - ret interface = do - atomically $ putTMVar iv (Just interface) - return (Right interface) - -getInterface :: ComputeProgram -> IO (Either String Interface) -getInterface (ComputeProgram program) = - catchMaybeIO (readProcess program ["interface"]) >>= \case - Nothing -> return $ Left $ "Failed to run " ++ program - Just output -> return $ case parseInterface output of - Right i -> Right i - Left err -> Left $ program ++ " interface output problem: " ++ err - -parseInterface :: String -> Either String Interface -parseInterface = go [] . lines - where - go is [] - | null is = Left "empty interface output" - | otherwise = Right (Interface (reverse is)) - go is (l:ls) - | null l = go is ls - | otherwise = case Proto.parseMessage l of - Just i -> go (i:is) ls - Nothing -> Left $ "Unable to parse line: \"" ++ l ++ "\"" - -data ComputeInput = ComputeInput Key FilePath - deriving (Show, Eq) - -data ComputeValue = ComputeValue String - deriving (Show, Eq) - -data ComputeOutput = ComputeOutput Key - deriving (Show, Eq) - data ComputeState = ComputeState - { computeInputs :: M.Map Field ComputeInput - , computeValues :: M.Map Field ComputeValue - , computeOutputs :: M.Map Field ComputeOutput + { computeParams :: [String] + , computeInputs :: M.Map FilePath Key + , computeOutputs :: M.Map FilePath (Maybe Key) + , computeReproducible :: Bool } deriving (Show, Eq) {- Formats a ComputeState as an URL query string. - - - Prefixes fields with "k" and "f" for computeInputs, with - - "v" for computeValues and "o" for computeOutputs. + - Prefixes computeParams with 'p', computeInputs with 'i', + - and computeOutput with 'o'. - - When the passed Key is an output, rather than duplicate it - in the query string, that output has no value. - - - Fields in the query string are sorted. This is in order to ensure - - that the same ComputeState is always formatted the same way. + - Example: "psomefile&pdestfile&pbaz&isomefile=WORM--foo&odestfile=" - - - Example: "ffoo=somefile&kfoo=WORM--foo&oresult&vbar=11" + - The computeParams are in the order they were given. The computeInputs + - and computeOutputs are sorted in ascending order for stability. -} formatComputeState :: Key -> ComputeState -> B.ByteString -formatComputeState k st = renderQuery False $ sortOn fst $ concat - [ concatMap formatinput $ M.toList (computeInputs st) - , map formatvalue $ M.toList (computeValues st) - , map formatoutput $ M.toList (computeOutputs st) +formatComputeState k st = renderQuery False $ concat + [ map formatparam (computeParams st) + , map formatinput (M.toAscList (computeInputs st)) + , mapMaybe formatoutput (M.toAscList (computeOutputs st)) ] where - formatinput (f, ComputeInput key file) = - [ ("k" <> fb, Just (serializeKey' key)) - , ("f" <> fb, Just (toRawFilePath file)) - ] - where - fb = encodeBS (fromField f) - formatvalue (f, ComputeValue v) = - ("v" <> encodeBS (fromField f), Just (encodeBS v)) - formatoutput (f, ComputeOutput key) = - ("o" <> encodeBS (fromField f), + formatparam p = ("p" <> encodeBS p, Nothing) + formatinput (file, key) = + ("i" <> toRawFilePath file, Just (serializeKey' key)) + formatoutput (file, (Just key)) = Just $ + ("o" <> toRawFilePath file, if key == k then Nothing else Just (serializeKey' key) ) + formatoutput (_, Nothing) = Nothing parseComputeState :: Key -> B.ByteString -> Maybe ComputeState parseComputeState k b = - let q = parseQuery b - st = go emptycomputestate (M.fromList q) q + let st = go emptycomputestate (parseQuery b) in if st == emptycomputestate then Nothing else Just st where - emptycomputestate = ComputeState mempty mempty mempty - go c _ [] = c - go c m ((f, v):rest) = + emptycomputestate = ComputeState mempty mempty mempty False + go :: ComputeState -> [QueryItem] -> ComputeState + go c [] = c { computeParams = reverse (computeParams c) } + go c ((f, v):rest) = let c' = fromMaybe c $ case decodeBS f of - ('f':f') -> do - file <- fromRawFilePath <$> v - kv <- M.lookup (encodeBS ('k':f')) m - key <- deserializeKey' =<< kv + ('p':p) -> Just $ c + { computeParams = p : computeParams c + } + ('i':i) -> do + key <- deserializeKey' =<< v Just $ c - { computeInputs = - M.insert (Field f') - (ComputeInput key file) + { computeInputs = + M.insert i key (computeInputs c) } - ('v':f') -> do - val <- decodeBS <$> v - Just $ c - { computeValues = - M.insert (Field f') - (ComputeValue val) - (computeValues c) - } - ('o':f') -> case v of + ('o':o) -> case v of Just kv -> do key <- deserializeKey' kv Just $ c - { computeOutputs = - M.insert (Field f') - (ComputeOutput key) + { computeOutputs = + M.insert o (Just key) (computeOutputs c) } Nothing -> Just $ c - { computeOutputs = - M.insert (Field f') - (ComputeOutput k) + { computeOutputs = + M.insert o (Just k) (computeOutputs c) } _ -> Nothing - in go c' m rest + in go c' rest {- The per remote metadata is used to store ComputeState. This allows - recording multiple ComputeStates that generate the same key. @@ -369,162 +264,142 @@ getComputeStates rs k = do Just ts -> go (zip (repeat ts) sts : c) rest Nothing -> go c rest -data InterfaceEnv = InterfaceEnv [(String, Either Key String)] - -data InterfaceOutputs = InterfaceOutputs (M.Map Field Key) - -{- Finds the first compute state that provides everything required by the - - interface, and returns a list of what should be provided to the program - - in its environment, and what outputs the program is expected to make. - -} -interfaceEnv :: [ComputeState] -> Interface -> Either String (InterfaceEnv, InterfaceOutputs) -interfaceEnv states interface = go Nothing states - where - go (Just firsterr) [] = Left firsterr - go Nothing [] = interfaceEnv' (ComputeState mempty mempty mempty) interface - go firsterr (state:rest) = case interfaceEnv' state interface of - Right v -> Right v - Left e - | null rest -> Left (fromMaybe e firsterr) - | otherwise -> go (firsterr <|> Just e) rest - -interfaceEnv' :: ComputeState -> Interface -> Either String (InterfaceEnv, InterfaceOutputs) -interfaceEnv' state interface@(Interface i) = - case partitionEithers (mapMaybe go i) of - ([], r) -> Right - ( InterfaceEnv (map (\(f, v) -> (fromField f, v)) r) - , interfaceOutputs state interface - ) - (problems, _) -> Left $ unlines problems - where - go (InterfaceInput field desc) = - case M.lookup field (computeInputs state) of - Just (ComputeInput key _file) -> Just $ - Right (field, Left key) - Nothing -> Just $ - Left $ "Missing required input \"" ++ fromField field ++ "\" -- " ++ desc - go (InterfaceOptionalInput field _desc) = - case M.lookup field (computeInputs state) of - Just (ComputeInput key _file) -> Just $ - Right (field, Left key) - Nothing -> Nothing - go (InterfaceValue field desc) = - case M.lookup field (computeValues state) of - Just (ComputeValue v) -> Just $ - Right (field, Right v) - Nothing -> Just $ - Left $ "Missing required value \"" ++ fromField field ++ "\" -- " ++ desc - go (InterfaceOptionalValue field _desc) = - case M.lookup field (computeValues state) of - Just (ComputeValue v) -> Just $ - Right (field, Right v) - Nothing -> Nothing - go (InterfaceOutput _ _) = Nothing - go InterfaceReproducible = Nothing - -interfaceOutputs :: ComputeState -> Interface -> InterfaceOutputs -interfaceOutputs state (Interface interface) = - InterfaceOutputs $ M.fromList $ mapMaybe go interface - where - go (InterfaceOutput field _) = do - ComputeOutput key <- M.lookup field (computeOutputs state) - Just (field, key) - go _ = Nothing - -computeProgramEnvironment :: InterfaceEnv -> Annex [(String, String)] -computeProgramEnvironment (InterfaceEnv ienv) = do +computeProgramEnvironment :: ComputeState -> Annex [(String, String)] +computeProgramEnvironment st = do environ <- filter (caninherit . fst) <$> liftIO getEnvironment - interfaceenv <- mapM go ienv - return $ environ ++ interfaceenv + let addenv = mapMaybe go (computeParams st) + return $ environ ++ addenv where envprefix = "ANNEX_COMPUTE_" caninherit v = not (envprefix `isPrefixOf` v) - go (f, Right v) = return (envprefix ++ f, v) - go (f, Left k) = - ifM (inAnnex k) - ( do - objloc <- calcRepo (gitAnnexLocation k) - return (envprefix ++ f, fromOsPath objloc) - , giveup "missing an input to the computation" - ) + go p + | '=' `elem` p = + let (f, v) = separate (== '=') p + in Just (envprefix ++ f, v) + | otherwise = Nothing + +newtype ImmutableState = ImmutableState Bool runComputeProgram :: ComputeProgram - -> Key - -> AssociatedFile - -> OsPath - -> MeterUpdate - -> VerifyConfig - -> (InterfaceEnv, InterfaceOutputs) - -> Annex Verification -runComputeProgram (ComputeProgram program) k _af dest p vc (ienv, InterfaceOutputs iout) = do - environ <- computeProgramEnvironment ienv + -> ComputeState + -> ImmutableState + -> (OsPath -> Annex (Key, Maybe OsPath)) + -> (ComputeState -> OsPath -> Annex v) + -> Annex v +runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = withOtherTmp $ \tmpdir -> - go environ tmpdir + go tmpdir `finally` liftIO (removeDirectoryRecursive tmpdir) where - go environ tmpdir = do - let pr = (proc program []) - { cwd = Just $ fromOsPath tmpdir + go tmpdir = do + environ <- computeProgramEnvironment state + let pr = (proc program (computeParams state)) + { cwd = Just (fromOsPath tmpdir) + , std_in = CreatePipe , std_out = CreatePipe , env = Just environ } - computing <- liftIO $ withCreateProcess pr $ - processoutput mempty tmpdir - finish computing tmpdir + state' <- bracket + (liftIO $ createProcess pr) + (liftIO . cleanupProcess) + (getinput state tmpdir) + cont state' tmpdir - processoutput computing tmpdir _ (Just h) _ pid = - hGetLineUntilExitOrEOF pid h >>= \case + getinput state' tmpdir p = + liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> processoutput computing tmpdir Nothing (Just h) Nothing pid - | otherwise -> parseoutput computing l >>= \case - Just computing' -> - processoutput computing' tmpdir Nothing (Just h) Nothing pid - Nothing -> do - hClose h - ifM (checkSuccessProcess pid) - ( giveup $ program ++ " output included an unparseable line: \"" ++ l ++ "\"" - , giveup $ program ++ " exited unsuccessfully" - ) + | null l -> getinput state' tmpdir p + | otherwise -> do + state'' <- parseoutput p state' l + getinput state'' tmpdir p Nothing -> do - hClose h - unlessM (checkSuccessProcess pid) $ + liftIO $ hClose (stdoutHandle p) + liftIO $ hClose (stdinHandle p) + unlessM (liftIO $ checkSuccessProcess (processHandle p)) $ giveup $ program ++ " exited unsuccessfully" - return computing - processoutput _ _ _ _ _ _ = error "internal" + return state' - parseoutput computing l = case Proto.parseMessage l of - Just (Computing field file) -> - case M.lookup field iout of - Just key -> do - when (key == k) $ - -- XXX can start watching the file and updating progess now - return () - return $ Just $ - M.insert key (toRawFilePath file) computing - Nothing -> return (Just computing) - Just (Progress percent) -> do + parseoutput p state' l = case Proto.parseMessage l of + Just (ProcessInput f) -> + let knowninput = M.member f (computeInputs state') + in checkimmutable knowninput l $ do + (k, mp) <- getinputcontent (toOsPath f) + liftIO $ hPutStrLn (stdinHandle p) $ + maybe "" fromOsPath mp + return $ if knowninput + then state' + else state' + { computeInputs = + M.insert f k + (computeInputs state') + } + Just (ProcessOutput f) -> + let knownoutput = M.member f (computeOutputs state') + in checkimmutable knownoutput l $ + return $ if knownoutput + then state' + else state' + { computeOutputs = + M.insert f Nothing + (computeOutputs state') + } + Just (ProcessProgress percent) -> do -- XXX - return Nothing - Nothing -> return Nothing - - finish computing tmpdir = do - case M.lookup k computing of - Nothing -> giveup $ program ++ " exited successfully, but failed to output a filename" - Just file -> do - let file' = tmpdir file - unlessM (liftIO $ doesFileExist file') $ - giveup $ program ++ " exited sucessfully, but failed to write the computed file" - catchNonAsync (liftIO $ moveFile file' dest) - (\err -> giveup $ "failed to move the computed file: " ++ show err) + return state' + Just ProcessReproducible -> + return $ state' { computeReproducible = True } + Nothing -> giveup $ + program ++ " output included an unparseable line: \"" ++ l ++ "\"" + + checkimmutable True _ a = a + checkimmutable False l a + | not immutablestate = a + | otherwise = giveup $ + program ++ " is not behaving the same way it used to, now outputting: \"" ++ l ++ "\"" + +computeKey :: RemoteStateHandle -> ComputeProgram -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification +computeKey rs (ComputeProgram program) k af dest p vc = do + states <- map snd . sortOn fst -- least expensive probably + <$> getComputeStates rs k + case mapMaybe computeskey states of + ((keyfile, state):_) -> runComputeProgram + (ComputeProgram program) + state + (ImmutableState True) + (getinputcontent state) + (go keyfile) + [] -> giveup "Missing compute state" + where + getinputcontent state f = + case M.lookup (fromOsPath f) (computeInputs state) of + Just inputkey -> do + obj <- calcRepo (gitAnnexLocation inputkey) + -- XXX get input object when not present + return (inputkey, Just obj) + Nothing -> error "internal" + + computeskey state = + case M.keys $ M.filter (== Just k) (computeOutputs state) of + (keyfile : _) -> Just (keyfile, state) + [] -> Nothing + + go keyfile state tmpdir = do + let keyfile' = tmpdir toOsPath keyfile + unlessM (liftIO $ doesFileExist keyfile') $ + giveup $ program ++ " exited sucessfully, but failed to write the computed file" + catchNonAsync (liftIO $ moveFile keyfile' dest) + (\err -> giveup $ "failed to move the computed file: " ++ show err) -- Try to move any other computed object files into the annex. - forM_ (M.toList computing) $ \(key, file) -> - when (k /= key) $ do - let file' = tmpdir file - whenM (liftIO $ doesFileExist file') $ - whenM (verifyKeyContentPostRetrieval RetrievalAllKeysSecure vc verification k file') $ - void $ tryNonAsync $ moveAnnex k file' + forM_ (M.toList $ computeOutputs state) $ \case + (file, (Just key)) -> + when (k /= key) $ do + let file' = tmpdir toOsPath file + whenM (liftIO $ doesFileExist file') $ + whenM (verifyKeyContentPostRetrieval RetrievalAllKeysSecure vc verification k file') $ + void $ tryNonAsync $ moveAnnex k file' + _ -> noop return verification @@ -532,27 +407,13 @@ runComputeProgram (ComputeProgram program) k _af dest p vc (ienv, InterfaceOutpu -- verification. verification = MustVerify -computeKey :: RemoteStateHandle -> ComputeProgram -> TMVar (Maybe Interface) -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification -computeKey rs program iv k af dest p vc = - liftIO (getInterfaceCached program iv) >>= \case - Left err -> giveup err - Right interface -> do - states <- map snd . sortOn fst - <$> getComputeStates rs k - either giveup (runComputeProgram program k af dest p vc) - (interfaceEnv states interface) - --- Make sure that the compute state has everything needed by --- the program's current interface. -checkKey :: RemoteStateHandle -> ComputeProgram -> TMVar (Maybe Interface) -> Key -> Annex Bool -checkKey rs program iv k = do - states <- map snd <$> getComputeStates rs k - liftIO (getInterfaceCached program iv) >>= \case - Left err -> giveup err - Right interface -> - case interfaceEnv states interface of - Right _ -> return True - Left _ -> return False +-- Make sure that the compute state exists. +checkKey :: RemoteStateHandle -> Key -> Annex Bool +checkKey rs k = do + states <- getComputeStates rs k + if null states + then giveup "Missing compute state" + else return True -- Unsetting the compute state will prevent computing the key. dropKey :: RemoteStateHandle -> Maybe SafeDropProof -> Key -> Annex () From 556f44d404de3c51cbaf87ad6d4837025c8defa3 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 24 Feb 2025 16:15:04 -0400 Subject: [PATCH 09/53] update for new interface --- doc/git-annex-addcomputed.mdwn | 54 ++++++++++++---------------------- 1 file changed, 18 insertions(+), 36 deletions(-) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 395d6246e7..487bb70ff1 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -4,31 +4,34 @@ git-annex addcomputed - adds computed files to the repository # SYNOPSIS -git annex addcomputed `--to=remote [--input name=file ...] [--value name=value ...] [--output name=file ...]` - -git annex addcomputed `--describe=remote` +git annex addcomputed `--to=remote -- ...` # DESCRIPTION -Adds files to the annex that are computed from input files and values, +Adds files to the annex that are computed from input files, using a compute special remote. -For example, this adds a file `foo.jpeg` to the repository. It is computed -by the "photoconv" compute remote, based on an input file, `foo.raw`. A -configurable "passes" value is set to 10 when computing the file. +Once a file has been added to a compute remote, commands +like `git-annex get` will use it to compute the content of the file. - git-annex addcomputed --to photoconv \ - --input raw=foo.raw --output photo=foo.jpeg \ - --value passes=10 +The syntax of this command after the `--` is up to the program that +the compute special remote is set up to run to perform the comuptation. + +To see the program's usage, you can run: + + git-annex addcomputed --to=foo + +Generally you will provide an input file (or files), and often also an +output filename, and additional parameters to control the computation. There can be more than one input file that are combined to compute an output file. And multiple output files can be computed at the same time. -The output files are added to the repository as annexed files. -Once a file has been added to a compute remote, commands -like `git-annex get` will use it to compute the content of the file. -It is also possible to use commands like `git-annex drop` on the file, -with the compute remote being counted as one copy of it. +Some examples of how this might look: + + git-annex addcomputed --to=x -- convert file.raw file.jpeg passes=10 + git-annex addcomputed --to=y -- compress foo --level=9 + git-annex addcomputed --to=z -- clip foo 2:01-3:00 combine with bar to baz # OPTIONS @@ -46,27 +49,6 @@ with the compute remote being counted as one copy of it. of the programs that are available, see -* `--input name=file` - - Provide a file as input to the computation, with the specified input name. - - The input file can be an annexed file, or a file stored in git. - -* `--output name=file` - - Add the output of the computation to the repository as an annexed file, - with the specified filename. - -* `--value name=value` - - Provide a value to the computation, with the specified name. - -* `--describe=remote` - - Describe all inputs, outputs, and values supported by a compute remote. - - For a machine-readable list, use this with `--json`. - * `--fast` Adds computed files to the repository, without generating their content From ce05a92ee7db4b260003ffe40083d49dbfd0a635 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 24 Feb 2025 16:39:55 -0400 Subject: [PATCH 10/53] add field desc --- Remote/Compute.hs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 2142c96e48..08b7f385d4 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -118,7 +118,7 @@ computeConfigParser _ = return $ RemoteConfigParser -- to the input params. , remoteConfigRestPassthrough = Just ( const True - , [] + , [("*", FieldDesc "all other parameters are passed to compute program")] ) } From 2e1fe1620e14f0f8681b2872b696645e6e00156d Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 15:08:38 -0400 Subject: [PATCH 11/53] handle comutations in subdirs of the git repository Eg, a computation might be run in "foo/" and refer to "../bar" as an input or output. So, the subdir is part of the computation state. Also, prevent input or output of files that are outside the git repository. Of course, the program can access any file on disk if it wants to; this is just a guard against mistakes. And it may also be useful if the program comunicates with something less trusted than it, eg a container image, so input/output files communicated by that are not the source of security problems. --- Remote/Compute.hs | 104 ++++++++++++------ .../compute_special_remote_interface.mdwn | 8 +- doc/git-annex-addcomputed.mdwn | 4 +- doc/git-annex-recompute.mdwn | 4 +- 4 files changed, 81 insertions(+), 39 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 08b7f385d4..06017e6365 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -12,8 +12,10 @@ module Remote.Compute ( ComputeState(..), setComputeState, getComputeStates, + ComputeProgram, getComputeProgram, runComputeProgram, + ImmutableState(..), ) where import Annex.Common @@ -33,6 +35,7 @@ import Logs.MetaData import Utility.Metered import Utility.TimeStamp import Utility.Env +import Utility.Tmp.Dir import qualified Git import qualified Utility.SimpleProtocol as Proto @@ -166,8 +169,9 @@ instance Proto.Serializable PercentFloat where data ComputeState = ComputeState { computeParams :: [String] - , computeInputs :: M.Map FilePath Key - , computeOutputs :: M.Map FilePath (Maybe Key) + , computeInputs :: M.Map OsPath Key + , computeOutputs :: M.Map OsPath (Maybe Key) + , computeSubdir :: OsPath , computeReproducible :: Bool } deriving (Show, Eq) @@ -175,12 +179,12 @@ data ComputeState = ComputeState {- Formats a ComputeState as an URL query string. - - Prefixes computeParams with 'p', computeInputs with 'i', - - and computeOutput with 'o'. + - and computeOutput with 'o'. Uses "d" for computeSubdir. - - When the passed Key is an output, rather than duplicate it - in the query string, that output has no value. - - - Example: "psomefile&pdestfile&pbaz&isomefile=WORM--foo&odestfile=" + - Example: "psomefile&pdestfile&pbaz&isomefile=WORM--foo&odestfile=&d=subdir" - - The computeParams are in the order they were given. The computeInputs - and computeOutputs are sorted in ascending order for stability. @@ -190,13 +194,14 @@ formatComputeState k st = renderQuery False $ concat [ map formatparam (computeParams st) , map formatinput (M.toAscList (computeInputs st)) , mapMaybe formatoutput (M.toAscList (computeOutputs st)) + , [("d", Just (fromOsPath (computeSubdir st)))] ] where formatparam p = ("p" <> encodeBS p, Nothing) formatinput (file, key) = - ("i" <> toRawFilePath file, Just (serializeKey' key)) + ("i" <> fromOsPath file, Just (serializeKey' key)) formatoutput (file, (Just key)) = Just $ - ("o" <> toRawFilePath file, + ("o" <> fromOsPath file, if key == k then Nothing else Just (serializeKey' key) @@ -208,7 +213,7 @@ parseComputeState k b = let st = go emptycomputestate (parseQuery b) in if st == emptycomputestate then Nothing else Just st where - emptycomputestate = ComputeState mempty mempty mempty False + emptycomputestate = ComputeState mempty mempty mempty "." False go :: ComputeState -> [QueryItem] -> ComputeState go c [] = c { computeParams = reverse (computeParams c) } go c ((f, v):rest) = @@ -220,7 +225,7 @@ parseComputeState k b = key <- deserializeKey' =<< v Just $ c { computeInputs = - M.insert i key + M.insert (toOsPath i) key (computeInputs c) } ('o':o) -> case v of @@ -228,14 +233,21 @@ parseComputeState k b = key <- deserializeKey' kv Just $ c { computeOutputs = - M.insert o (Just key) + M.insert (toOsPath o) + (Just key) (computeOutputs c) } Nothing -> Just $ c { computeOutputs = - M.insert o (Just k) + M.insert (toOsPath o) + (Just k) (computeOutputs c) } + ('d':[]) -> do + subdir <- v + Just $ c + { computeSubdir = toOsPath subdir + } _ -> Nothing in go c' rest @@ -288,14 +300,14 @@ runComputeProgram -> (ComputeState -> OsPath -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = - withOtherTmp $ \tmpdir -> - go tmpdir - `finally` liftIO (removeDirectoryRecursive tmpdir) + withOtherTmp $ \othertmpdir -> + withTmpDirIn othertmpdir "compute" go where go tmpdir = do environ <- computeProgramEnvironment state + subdir <- liftIO $ getsubdir tmpdir let pr = (proc program (computeParams state)) - { cwd = Just (fromOsPath tmpdir) + { cwd = Just (fromOsPath subdir) , std_in = CreatePipe , std_out = CreatePipe , env = Just environ @@ -303,16 +315,26 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) state' <- bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) - (getinput state tmpdir) - cont state' tmpdir + (getinput state tmpdir subdir) + cont state' subdir + + getsubdir tmpdir = do + let subdir = tmpdir computeSubdir state + ifM (dirContains <$> absPath tmpdir <*> absPath subdir) + ( do + createDirectoryIfMissing True subdir + return subdir + -- Ignore unsafe value in state. + , return tmpdir + ) - getinput state' tmpdir p = + getinput state' tmpdir subdir p = liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> getinput state' tmpdir p + | null l -> getinput state' tmpdir subdir p | otherwise -> do - state'' <- parseoutput p state' l - getinput state'' tmpdir p + state'' <- parseoutput p tmpdir subdir state' l + getinput state'' tmpdir subdir p Nothing -> do liftIO $ hClose (stdoutHandle p) liftIO $ hClose (stdinHandle p) @@ -320,28 +342,36 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) giveup $ program ++ " exited unsuccessfully" return state' - parseoutput p state' l = case Proto.parseMessage l of - Just (ProcessInput f) -> - let knowninput = M.member f (computeInputs state') - in checkimmutable knowninput l $ do - (k, mp) <- getinputcontent (toOsPath f) + parseoutput p tmpdir subdir state' l = case Proto.parseMessage l of + Just (ProcessInput f) -> do + let f' = toOsPath f + let knowninput = M.member f' (computeInputs state') + checksafefile tmpdir subdir f' "input" + checkimmutable knowninput l $ do + (k, mp) <- getinputcontent f' + mp' <- liftIO $ maybe (pure Nothing) + (Just <$$> relPathDirToFile subdir) + mp liftIO $ hPutStrLn (stdinHandle p) $ - maybe "" fromOsPath mp + maybe "" fromOsPath mp' + liftIO $ hFlush (stdinHandle p) return $ if knowninput then state' else state' { computeInputs = - M.insert f k + M.insert f' k (computeInputs state') } - Just (ProcessOutput f) -> - let knownoutput = M.member f (computeOutputs state') - in checkimmutable knownoutput l $ + Just (ProcessOutput f) -> do + let f' = toOsPath f + checksafefile tmpdir subdir f' "output" + let knownoutput = M.member f' (computeOutputs state') + checkimmutable knownoutput l $ return $ if knownoutput then state' else state' { computeOutputs = - M.insert f Nothing + M.insert f' Nothing (computeOutputs state') } Just (ProcessProgress percent) -> do @@ -352,6 +382,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> giveup $ program ++ " output included an unparseable line: \"" ++ l ++ "\"" + checksafefile tmpdir subdir f fileaction = do + let err problem = giveup $ + program ++ " tried to " ++ fileaction ++ " a file that is " ++ problem ++ ": " ++ fromOsPath f + unlessM (liftIO $ dirContains <$> absPath tmpdir <*> absPath (subdir f)) $ + err "outside the git repository" + when (any (\p -> dropTrailingPathSeparator p == literalOsPath ".git") (splitPath f)) $ + err "inside the .git directory" + checkimmutable True _ a = a checkimmutable False l a | not immutablestate = a @@ -385,7 +423,7 @@ computeKey rs (ComputeProgram program) k af dest p vc = do [] -> Nothing go keyfile state tmpdir = do - let keyfile' = tmpdir toOsPath keyfile + let keyfile' = tmpdir keyfile unlessM (liftIO $ doesFileExist keyfile') $ giveup $ program ++ " exited sucessfully, but failed to write the computed file" catchNonAsync (liftIO $ moveFile keyfile' dest) @@ -395,7 +433,7 @@ computeKey rs (ComputeProgram program) k af dest p vc = do forM_ (M.toList $ computeOutputs state) $ \case (file, (Just key)) -> when (k /= key) $ do - let file' = tmpdir toOsPath file + let file' = tmpdir file whenM (liftIO $ doesFileExist file') $ whenM (verifyKeyContentPostRetrieval RetrievalAllKeysSecure vc verification k file') $ void $ tryNonAsync $ moveAnnex k file' diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 5c771c17ad..34b7da7e77 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -26,7 +26,9 @@ For security, the program should avoid exposing user input to the shell unprotected, or otherwise executing it. The program is run in a temporary directory, which will be cleaned up after -it exits. +it exits. Note that it may be run in a subdirectory of its temporary +directory. Eg, when `git-annex addcomputed` is run in a `foo/bar/` +subdirectory of the git repository. The content of any annexed file in the repository can be an input to the computation. The program requests an input by writing a line to @@ -93,10 +95,10 @@ An example `git-annex-compute-foo` shell script follows: echo "Usage: convert input output [passes=n]" >&2 exit 1 fi - if [ -z "$ANNEX_COMPUTE_passes" ]; + if [ -z "$ANNEX_COMPUTE_passes" ]; then ANNEX_COMPUTE_passes=1 fi - echo "INPUT "$2" + echo "INPUT $2" read input echo "OUTPUT $3" echo REPRODUCIBLE diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 487bb70ff1..bca6e1144d 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -33,6 +33,8 @@ Some examples of how this might look: git-annex addcomputed --to=y -- compress foo --level=9 git-annex addcomputed --to=z -- clip foo 2:01-3:00 combine with bar to baz + + # OPTIONS * `--to=remote` @@ -54,7 +56,7 @@ Some examples of how this might look: Adds computed files to the repository, without generating their content yet. -* `--unreproducible` +* `--unreproducible`, `-u` Indicate that the computation is not expected to be fully reproducible. It can vary, in ways that produce files that equivilant enough to diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index c86173c2eb..2800a74106 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -22,12 +22,12 @@ in the repository. Recompute files even when their input files have not changed. -* `--unreproducible` +* `--unreproducible`, `-u` Convert files that were added with `git-annex addcomputed --reproducible` to be as if they were added without that option. -* `--reproducible` +* `--reproducible`, `-r` Convert files that were added with `git-annex addcomputed --unreproducible` to be as if they were added with `--reproducible`. From a154e91513506f5edfd36aea1783f32da13715f1 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 15:45:14 -0400 Subject: [PATCH 12/53] add git-annex addcomputed Working pretty well. Mostly. But: * Does not yet support inputs that are non-annexed files checked into git * --fast is currently broken (will need something like VURL keys) * --unreproducible still uses a checksumming backend, so drop and get again will likely fail (needs probably to use an URL key or something like one) The compute special remote seems to work pretty well too. Eg, getting from it works, and dropping content that is present in it works. --- CmdLine/GitAnnex.hs | 2 + Command/AddComputed.hs | 160 +++++++++++++++++++++++++++++++++++++++++ Remote/Compute.hs | 2 +- git-annex.cabal | 1 + 4 files changed, 164 insertions(+), 1 deletion(-) create mode 100644 Command/AddComputed.hs diff --git a/CmdLine/GitAnnex.hs b/CmdLine/GitAnnex.hs index 6596e269e9..71d9f2e51f 100644 --- a/CmdLine/GitAnnex.hs +++ b/CmdLine/GitAnnex.hs @@ -133,6 +133,7 @@ import qualified Command.ExtendCluster import qualified Command.UpdateProxy import qualified Command.MaxSize import qualified Command.Sim +import qualified Command.AddComputed import qualified Command.Version import qualified Command.RemoteDaemon #ifdef WITH_ASSISTANT @@ -265,6 +266,7 @@ cmds testoptparser testrunner mkbenchmarkgenerator = map addGitAnnexCommonOption , Command.UpdateProxy.cmd , Command.MaxSize.cmd , Command.Sim.cmd + , Command.AddComputed.cmd , Command.Version.cmd , Command.RemoteDaemon.cmd #ifdef WITH_ASSISTANT diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs new file mode 100644 index 0000000000..d80fb168da --- /dev/null +++ b/Command/AddComputed.hs @@ -0,0 +1,160 @@ +{- git-annex command + - + - Copyright 2025 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE OverloadedStrings #-} + +module Command.AddComputed where + +import Command +import qualified Git +import qualified Annex +import qualified Remote.Compute +import qualified Types.Remote as Remote +import Annex.CatFile +import Annex.Content.Presence +import Annex.Ingest +import Types.RemoteConfig +import Types.KeySource +import Messages.Progress +import Utility.MonotonicClock +import Logs.Location + +import qualified Data.Map as M +import Data.Time.Clock + +cmd :: Command +cmd = notBareRepo $ + command "addcomputed" SectionCommon "add computed files to annex" + (paramRepeating paramExpression) + (seek <$$> optParser) + +data AddComputedOptions = AddComputedOptions + { computeParams :: CmdParams + , computeRemote :: DeferredParse Remote + , reproducible :: Reproducible + } + +optParser :: CmdParamsDesc -> Parser AddComputedOptions +optParser desc = AddComputedOptions + <$> cmdParams desc + <*> (mkParseRemoteOption <$> parseToOption) + <*> (fromMaybe Unreproducible <$> parseReproducible) + +data Reproducible = Reproducible | Unreproducible + +parseReproducible :: Parser (Maybe Reproducible) +parseReproducible = r <|> unr + where + r = flag Nothing (Just Reproducible) + ( long "reproducible" + <> short 'r' + <> help "computation is fully reproducible" + ) + unr = flag Nothing (Just Unreproducible) + ( long "unreproducible" + <> short 'u' + <> help "computation is not fully reproducible" + ) + +seek :: AddComputedOptions -> CommandSeek +seek o = startConcurrency commandStages (seek' o) + +seek' :: AddComputedOptions -> CommandSeek +seek' o = do + r <- getParsed (computeRemote o) + unless (Remote.typename (Remote.remotetype r) == Remote.typename Remote.Compute.remote) $ + giveup "That is not a compute remote." + + let rc = unparsedRemoteConfig (Remote.config r) + case Remote.Compute.getComputeProgram rc of + Left err -> giveup $ + "Problem with the configuration of the compute remote: " ++ err + Right program -> commandAction $ start o r program + +start :: AddComputedOptions -> Remote -> Remote.Compute.ComputeProgram -> CommandStart +start o r program = starting "addcomputed" ai si $ perform o r program + where + ai = ActionItemUUID (Remote.uuid r) (UnquotedString (Remote.name r)) + si = SeekInput (computeParams o) + +perform :: AddComputedOptions -> Remote -> Remote.Compute.ComputeProgram -> CommandPerform +perform o r program = do + repopath <- fromRepo Git.repoPath + subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") + let state = Remote.Compute.ComputeState + { Remote.Compute.computeParams = computeParams o + , Remote.Compute.computeInputs = mempty + , Remote.Compute.computeOutputs = mempty + , Remote.Compute.computeSubdir = subdir + , Remote.Compute.computeReproducible = + case reproducible o of + Reproducible -> True + Unreproducible -> False + } + fast <- Annex.getRead Annex.fast + starttime <- liftIO currentMonotonicTimestamp + Remote.Compute.runComputeProgram program state + (Remote.Compute.ImmutableState False) + (getinputcontent fast) + (go starttime) + next $ return True + where + getinputcontent fast p = catKeyFile p >>= \case + Just inputkey -> do + obj <- calcRepo (gitAnnexLocation inputkey) + if fast + then return (inputkey, Nothing) + else ifM (inAnnex inputkey) + ( return (inputkey, Just obj) + , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p + ) + Nothing -> ifM (liftIO $ doesFileExist p) + ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p + , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p + ) + + go starttime state tmpdir = do + endtime <- liftIO currentMonotonicTimestamp + let ts = calcduration starttime endtime + let outputs = Remote.Compute.computeOutputs state + when (M.null outputs) $ + giveup "The computation succeeded, but it did not generate any files." + oks <- forM (M.keys outputs) $ \outputfile -> do + showAction $ "adding " <> QuotedPath outputfile + k <- catchNonAsync (addfile tmpdir outputfile) + (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) + return (outputfile, Just k) + let state' = state + { Remote.Compute.computeOutputs = M.fromList oks + } + forM_ (mapMaybe snd oks) $ \k -> do + Remote.Compute.setComputeState + (Remote.remoteStateHandle r) + k ts state' + logChange NoLiveUpdate k (Remote.uuid r) InfoPresent + + addfile tmpdir outputfile = do + let outputfile' = tmpdir outputfile + let ld = LockedDown ldc $ KeySource + { keyFilename = outputfile + , contentLocation = outputfile' + , inodeCache = Nothing + } + sz <- liftIO $ getFileSize outputfile' + metered Nothing sz Nothing $ \_ p -> + ingestAdd p (Just ld) >>= \case + Nothing -> giveup "key generation failed" + Just k -> return k + + ldc = LockDownConfig + { lockingFile = True + , hardlinkFileTmpDir = Nothing + , checkWritePerms = True + } + + calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = + fromIntegral (endtime - starttime) :: NominalDiffTime diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 06017e6365..cb2bd1f479 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -179,7 +179,7 @@ data ComputeState = ComputeState {- Formats a ComputeState as an URL query string. - - Prefixes computeParams with 'p', computeInputs with 'i', - - and computeOutput with 'o'. Uses "d" for computeSubdir. + - and computeOutputs with 'o'. Uses "d" for computeSubdir. - - When the passed Key is an output, rather than duplicate it - in the query string, that output has no value. diff --git a/git-annex.cabal b/git-annex.cabal index 0e95331084..5ed414a8dd 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -654,6 +654,7 @@ Executable git-annex CmdLine.Usage Command Command.Add + Command.AddComputed Command.AddUnused Command.AddUrl Command.Adjust From 16f529c05fb048a5653695bd1d04f2f2f88a8060 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 16:36:22 -0400 Subject: [PATCH 13/53] addcomputed --fast and --unreproducible working For these, use VURL and URL keys, with an "annex-compute:" URI prefix. These URL keys will look something like this: URL--annex-compute&cbar4,63pconvert,3-f4d3d72cf3f16ac9c3e9a8012bde4462 Generally it's too long so most of it gets md5summed. It's a little ugly, but it's what fell out of the existing URL key generation machinery. I did consider special casing to eg "URL--annex-compute&c4d3d72cf3f16ac9c3e9a8012bde4462". But it seems at least possibly useful that the name of the file that was computed is visible and perhaps one or two words of the git-annex compute command parameters. Note that two different output files from the same computation will get the same URL key. And these keys should remain stable. --- Command/AddComputed.hs | 56 +++++++++++++++++++--------------- Remote/Compute.hs | 20 ++++++++++-- doc/git-annex-addcomputed.mdwn | 2 +- 3 files changed, 50 insertions(+), 28 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index d80fb168da..01a334bf9e 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -20,8 +20,9 @@ import Annex.Ingest import Types.RemoteConfig import Types.KeySource import Messages.Progress -import Utility.MonotonicClock import Logs.Location +import Utility.MonotonicClock +import Backend.URL (fromUrl) import qualified Data.Map as M import Data.Time.Clock @@ -42,19 +43,19 @@ optParser :: CmdParamsDesc -> Parser AddComputedOptions optParser desc = AddComputedOptions <$> cmdParams desc <*> (mkParseRemoteOption <$> parseToOption) - <*> (fromMaybe Unreproducible <$> parseReproducible) + <*> (fromMaybe (Reproducible False) <$> parseReproducible) -data Reproducible = Reproducible | Unreproducible +newtype Reproducible = Reproducible { isReproducible :: Bool } parseReproducible :: Parser (Maybe Reproducible) parseReproducible = r <|> unr where - r = flag Nothing (Just Reproducible) + r = flag Nothing (Just (Reproducible True)) ( long "reproducible" <> short 'r' <> help "computation is fully reproducible" ) - unr = flag Nothing (Just Unreproducible) + unr = flag Nothing (Just (Reproducible False)) ( long "unreproducible" <> short 'u' <> help "computation is not fully reproducible" @@ -90,17 +91,14 @@ perform o r program = do , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir - , Remote.Compute.computeReproducible = - case reproducible o of - Reproducible -> True - Unreproducible -> False + , Remote.Compute.computeReproducible = isreproducible } fast <- Annex.getRead Annex.fast starttime <- liftIO currentMonotonicTimestamp Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getinputcontent fast) - (go starttime) + (go starttime fast) next $ return True where getinputcontent fast p = catKeyFile p >>= \case @@ -117,7 +115,7 @@ perform o r program = do , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p ) - go starttime state tmpdir = do + go starttime fast state tmpdir = do endtime <- liftIO currentMonotonicTimestamp let ts = calcduration starttime endtime let outputs = Remote.Compute.computeOutputs state @@ -125,7 +123,7 @@ perform o r program = do giveup "The computation succeeded, but it did not generate any files." oks <- forM (M.keys outputs) $ \outputfile -> do showAction $ "adding " <> QuotedPath outputfile - k <- catchNonAsync (addfile tmpdir outputfile) + k <- catchNonAsync (addfile fast state tmpdir outputfile) (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) return (outputfile, Just k) let state' = state @@ -137,24 +135,32 @@ perform o r program = do k ts state' logChange NoLiveUpdate k (Remote.uuid r) InfoPresent - addfile tmpdir outputfile = do - let outputfile' = tmpdir outputfile - let ld = LockedDown ldc $ KeySource - { keyFilename = outputfile - , contentLocation = outputfile' - , inodeCache = Nothing - } - sz <- liftIO $ getFileSize outputfile' - metered Nothing sz Nothing $ \_ p -> - ingestAdd p (Just ld) >>= \case - Nothing -> giveup "key generation failed" - Just k -> return k + addfile fast state tmpdir outputfile + | fast || not isreproducible = do + let stateurl = Remote.Compute.computeStateUrl state outputfile + let k = fromUrl stateurl Nothing isreproducible + addSymlink outputfile k Nothing + return k + | otherwise = do + let outputfile' = tmpdir outputfile + let ld = LockedDown ldc $ KeySource + { keyFilename = outputfile + , contentLocation = outputfile' + , inodeCache = Nothing + } + sz <- liftIO $ getFileSize outputfile' + metered Nothing sz Nothing $ \_ p -> + ingestAdd p (Just ld) >>= \case + Nothing -> giveup "key generation failed" + Just k -> return k ldc = LockDownConfig { lockingFile = True , hardlinkFileTmpDir = Nothing , checkWritePerms = True } - + calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = fromIntegral (endtime - starttime) :: NominalDiffTime + + isreproducible = isReproducible (reproducible o) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index cb2bd1f479..1157ac581d 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -12,6 +12,7 @@ module Remote.Compute ( ComputeState(..), setComputeState, getComputeStates, + computeStateUrl, ComputeProgram, getComputeProgram, runComputeProgram, @@ -36,6 +37,7 @@ import Utility.Metered import Utility.TimeStamp import Utility.Env import Utility.Tmp.Dir +import Utility.Url import qualified Git import qualified Utility.SimpleProtocol as Proto @@ -190,7 +192,10 @@ data ComputeState = ComputeState - and computeOutputs are sorted in ascending order for stability. -} formatComputeState :: Key -> ComputeState -> B.ByteString -formatComputeState k st = renderQuery False $ concat +formatComputeState k = formatComputeState' (Just k) + +formatComputeState' :: Maybe Key -> ComputeState -> B.ByteString +formatComputeState' mk st = renderQuery False $ concat [ map formatparam (computeParams st) , map formatinput (M.toAscList (computeInputs st)) , mapMaybe formatoutput (M.toAscList (computeOutputs st)) @@ -202,7 +207,7 @@ formatComputeState k st = renderQuery False $ concat ("i" <> fromOsPath file, Just (serializeKey' key)) formatoutput (file, (Just key)) = Just $ ("o" <> fromOsPath file, - if key == k + if Just key == mk then Nothing else Just (serializeKey' key) ) @@ -251,6 +256,17 @@ parseComputeState k b = _ -> Nothing in go c' rest +{- A compute: url for a given output file of a computation. -} +computeStateUrl :: ComputeState -> OsPath -> URLString +computeStateUrl st p = + "annex-compute:" ++ fromOsPath p ++ "?" + ++ decodeBS (formatComputeState' Nothing st') + where + -- Omit computeOutputs, so this gives the same result whether + -- it's called on a ComputeState with the computeOutputs + -- Keys populated or not. + st' = st { computeOutputs = mempty } + {- The per remote metadata is used to store ComputeState. This allows - recording multiple ComputeStates that generate the same key. - diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index bca6e1144d..9f096770b7 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -54,7 +54,7 @@ Some examples of how this might look: * `--fast` Adds computed files to the repository, without generating their content - yet. + yet. * `--unreproducible`, `-u` From 233a6954b9e98d599867eb32746c58925a71fbe4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 17:00:00 -0400 Subject: [PATCH 14/53] ingest when --unreproducible is used without --fast --- Command/AddComputed.hs | 33 ++++++++++++++++++++------------- 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 01a334bf9e..4edce492e2 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -21,6 +21,7 @@ import Types.RemoteConfig import Types.KeySource import Messages.Progress import Logs.Location +import Utility.Metered import Utility.MonotonicClock import Backend.URL (fromUrl) @@ -136,23 +137,29 @@ perform o r program = do logChange NoLiveUpdate k (Remote.uuid r) InfoPresent addfile fast state tmpdir outputfile - | fast || not isreproducible = do - let stateurl = Remote.Compute.computeStateUrl state outputfile - let k = fromUrl stateurl Nothing isreproducible - addSymlink outputfile k Nothing - return k - | otherwise = do - let outputfile' = tmpdir outputfile - let ld = LockedDown ldc $ KeySource + | fast = do + addSymlink outputfile stateurlk Nothing + return stateurlk + | isreproducible = do + sz <- liftIO $ getFileSize outputfile' + metered Nothing sz Nothing $ \_ p -> + ingestwith $ ingestAdd p (Just ld) + | otherwise = ingestwith $ + ingestAdd' nullMeterUpdate (Just ld) (Just stateurlk) + where + stateurl = Remote.Compute.computeStateUrl state outputfile + stateurlk = fromUrl stateurl Nothing True + outputfile' = tmpdir outputfile + ld = LockedDown ldc $ KeySource { keyFilename = outputfile , contentLocation = outputfile' , inodeCache = Nothing } - sz <- liftIO $ getFileSize outputfile' - metered Nothing sz Nothing $ \_ p -> - ingestAdd p (Just ld) >>= \case - Nothing -> giveup "key generation failed" - Just k -> return k + ingestwith a = a >>= \case + Nothing -> giveup "key generation failed" + Just k -> do + logStatus NoLiveUpdate k InfoPresent + return k ldc = LockDownConfig { lockingFile = True From 71e92a509a0411b419c44d7ad5fb0fe91542c326 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 17:10:41 -0400 Subject: [PATCH 15/53] use compute program REPRODUCIBLE by default --- Command/AddComputed.hs | 12 +++++++----- doc/git-annex-addcomputed.mdwn | 9 ++++++--- 2 files changed, 13 insertions(+), 8 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 4edce492e2..0d614a1fc0 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -37,14 +37,14 @@ cmd = notBareRepo $ data AddComputedOptions = AddComputedOptions { computeParams :: CmdParams , computeRemote :: DeferredParse Remote - , reproducible :: Reproducible + , reproducible :: Maybe Reproducible } optParser :: CmdParamsDesc -> Parser AddComputedOptions optParser desc = AddComputedOptions <$> cmdParams desc <*> (mkParseRemoteOption <$> parseToOption) - <*> (fromMaybe (Reproducible False) <$> parseReproducible) + <*> parseReproducible newtype Reproducible = Reproducible { isReproducible :: Bool } @@ -92,7 +92,7 @@ perform o r program = do , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir - , Remote.Compute.computeReproducible = isreproducible + , Remote.Compute.computeReproducible = False } fast <- Annex.getRead Annex.fast starttime <- liftIO currentMonotonicTimestamp @@ -140,7 +140,7 @@ perform o r program = do | fast = do addSymlink outputfile stateurlk Nothing return stateurlk - | isreproducible = do + | isreproducible state = do sz <- liftIO $ getFileSize outputfile' metered Nothing sz Nothing $ \_ p -> ingestwith $ ingestAdd p (Just ld) @@ -170,4 +170,6 @@ perform o r program = do calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = fromIntegral (endtime - starttime) :: NominalDiffTime - isreproducible = isReproducible (reproducible o) + isreproducible state = case reproducible o of + Just v -> isReproducible v + Nothing -> Remote.Compute.computeReproducible state diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 9f096770b7..1da287f0cf 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -53,8 +53,11 @@ Some examples of how this might look: * `--fast` - Adds computed files to the repository, without generating their content - yet. + Adds computed files to the repository, without doing any work yet to + compute their content. + + This implies `--unreproducible`, because even if the compute remote + produces reproducible output, it's not available. * `--unreproducible`, `-u` @@ -70,7 +73,7 @@ Some examples of how this might look: Indicate that the computation is expected to be fully reproducible. This is the default when the compute remote indicates that it produces - reproducible output. + reproducible output (except when using `--fast`). If a computation turns out not to be fully reproducible, then getting the file from the compute remote will later fail with a checksum From f8c7cea019ec3c3092cfb3eacd7c9b16a96ea034 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 17:23:38 -0400 Subject: [PATCH 16/53] pdate demo program needed a mkdir --- doc/design/compute_special_remote_interface.mdwn | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 34b7da7e77..f8fa92ed46 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -92,16 +92,17 @@ An example `git-annex-compute-foo` shell script follows: #!/bin/sh set -e if [ "$1" != "convert" ]; then - echo "Usage: convert input output [passes=n]" >&2 - exit 1 + echo "Usage: convert input output [passes=n]" >&2 + exit 1 fi if [ -z "$ANNEX_COMPUTE_passes" ]; then - ANNEX_COMPUTE_passes=1 + ANNEX_COMPUTE_passes=1 fi echo "INPUT $2" read input echo "OUTPUT $3" echo REPRODUCIBLE if [ -n "$input" ]; then - frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + mkdir -p "$(dirname "$3")" + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" fi From 2b8428bb170b72ee7a9eb643b1cfee0ffdbbf89d Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 17:26:28 -0400 Subject: [PATCH 17/53] wording --- doc/design/compute_special_remote_interface.mdwn | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index f8fa92ed46..cd53a04aa1 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -27,8 +27,8 @@ unprotected, or otherwise executing it. The program is run in a temporary directory, which will be cleaned up after it exits. Note that it may be run in a subdirectory of its temporary -directory. Eg, when `git-annex addcomputed` is run in a `foo/bar/` -subdirectory of the git repository. +directory. This is done when `git-annex addcomputed` was run in a subdirectory +of the git repository. The content of any annexed file in the repository can be an input to the computation. The program requests an input by writing a line to @@ -40,8 +40,8 @@ Then it can read a line from stdin, which will be the path to the content (eg a `.git/annex/objects/` path). If the program needs multiple input files, it should output multiple -`INPUT` lines at once, and then read multiple paths from stdin. This -allows retrival of the inputs to potentially run in parallel. +`INPUT` lines first, and then read multiple paths from stdin. This +allows retrieval of the inputs to potentially run in parallel. If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, From e702cb94fffbdcd023f95ce0255ebf2f37c89dc4 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 18:44:40 -0400 Subject: [PATCH 18/53] add compute remote uuid to compute state url Otherwise, two different compute remotes that happen to take the same input would use the same compute state url. Which seems wrong. --- Command/AddComputed.hs | 2 +- Remote/Compute.hs | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 0d614a1fc0..7afd23a129 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -147,7 +147,7 @@ perform o r program = do | otherwise = ingestwith $ ingestAdd' nullMeterUpdate (Just ld) (Just stateurlk) where - stateurl = Remote.Compute.computeStateUrl state outputfile + stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir outputfile ld = LockedDown ldc $ KeySource diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 1157ac581d..b6ba1dbf2e 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -257,9 +257,9 @@ parseComputeState k b = in go c' rest {- A compute: url for a given output file of a computation. -} -computeStateUrl :: ComputeState -> OsPath -> URLString -computeStateUrl st p = - "annex-compute:" ++ fromOsPath p ++ "?" +computeStateUrl :: Remote -> ComputeState -> OsPath -> URLString +computeStateUrl r st p = + "annex-compute:" ++ fromUUID (uuid r) ++ "/" ++ fromOsPath p ++ "?" ++ decodeBS (formatComputeState' Nothing st') where -- Omit computeOutputs, so this gives the same result whether From a5b53fa98ac5cdae7434fee535cbdc5b9d1137a7 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 25 Feb 2025 18:45:55 -0400 Subject: [PATCH 19/53] todo --- Command/AddComputed.hs | 1 + 1 file changed, 1 insertion(+) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 7afd23a129..8af8e7c900 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -88,6 +88,7 @@ perform o r program = do repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") let state = Remote.Compute.ComputeState + -- TODO add inherited initremote parameters { Remote.Compute.computeParams = computeParams o , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty From eed522a0f8f47ca3334412cca9ec23d3f71ef295 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 26 Feb 2025 09:45:35 -0400 Subject: [PATCH 20/53] addcomputed inherits extra initremote parameters This is limited because the remote config is a field/value map. So order is not preserved, and when 2 parameters have the same field name, only the last one will be passed. --- Command/AddComputed.hs | 4 ++-- Remote/Compute.hs | 6 ++++++ doc/git-annex-addcomputed.mdwn | 6 +++++- doc/special_remotes/compute.mdwn | 17 ++++++++--------- 4 files changed, 21 insertions(+), 12 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 8af8e7c900..8b983a738c 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -88,8 +88,8 @@ perform o r program = do repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") let state = Remote.Compute.ComputeState - -- TODO add inherited initremote parameters - { Remote.Compute.computeParams = computeParams o + { Remote.Compute.computeParams = computeParams o ++ + Remote.Compute.defaultComputeParams r , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir diff --git a/Remote/Compute.hs b/Remote/Compute.hs index b6ba1dbf2e..09ab45687a 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -17,6 +17,7 @@ module Remote.Compute ( getComputeProgram, runComputeProgram, ImmutableState(..), + defaultComputeParams, ) where import Annex.Common @@ -127,6 +128,11 @@ computeConfigParser _ = return $ RemoteConfigParser ) } +defaultComputeParams :: Remote -> [String] +defaultComputeParams = map mk . M.toList . getRemoteConfigPassedThrough . config + where + mk (f, v) = fromProposedAccepted f ++ '=' : v + newtype ComputeProgram = ComputeProgram String deriving (Show) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 1da287f0cf..245d4a04b0 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -33,7 +33,9 @@ Some examples of how this might look: git-annex addcomputed --to=y -- compress foo --level=9 git-annex addcomputed --to=z -- clip foo 2:01-3:00 combine with bar to baz - +Note that parameters that were passed to `git-annex initremote` +when setting up the compute special remote will be appended to the end of +the parameters provided to `git-annex addcomputed`. # OPTIONS @@ -88,6 +90,8 @@ Some examples of how this might look: [[git-annex-recompute]](1) +[[git-annex-initremote]](1) + # AUTHOR Joey Hess diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index b840f5fcbe..c3f4186008 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -2,6 +2,10 @@ While other remotes store the contents of annexed files somewhere, this special remote uses a program to compute the contents of annexed files. +To add a file to a compute special remote, use the [[git-annex-addcomputed]] +command. Once a file has been added to a compute special remote, commands +like `git-annex get` will use it to compute the content of the file. + To enable an instance of this special remote: # git-annex initremote myremote type=compute program=git-annex-compute-foo @@ -11,16 +15,11 @@ program to use to compute the contents of annexed files. It must start with "git-annex-compute-". The program needs to be installed somewhere in the `PATH`. -To add a file to a compute special remote, use the [[git-annex-addcomputed]] -command. Once a file has been added to a compute special remote, commands -like `git-annex get` will use it to compute the content of the file. +All other "field=value" parameters passed to `initremote` will be passed +to the program when running [[git-annex-addcomputed]]. Note that when the +program takes a dashed option, it can be provided after "--": -You can provide other parameters to `initremote`, in order to provide -default configuration values to use when adding files with -[[git-annex-addcomputed]]. To see a list of all the configuration values -supported by a given program, pass `--whatelse` to `initremote`: - - # git-annex initremote myremote type=compute program=git-annex-compute-foo --whatelse + # git-annex initremote myremote type=compute program=git-annex-compute-foo -- --level=9 ## compute programs From d49f371acc922cff3db6132511b22e4484254578 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 26 Feb 2025 09:47:56 -0400 Subject: [PATCH 21/53] showOutput when the compute program eg displays usage, it needs to start on its own line --- Command/AddComputed.hs | 1 + 1 file changed, 1 insertion(+) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 8b983a738c..fad9c1dc30 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -97,6 +97,7 @@ perform o r program = do } fast <- Annex.getRead Annex.fast starttime <- liftIO currentMonotonicTimestamp + showOutput Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getinputcontent fast) From 3bec89a3c35167536ee9df77f0a32fd89bd0df49 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 26 Feb 2025 11:25:32 -0400 Subject: [PATCH 22/53] started git-annex recompute The perform action of this still needs work to do the right thing. In particular, it currently behaves as if --others was always set. And, it duplicates a lot of code from addcomputed. --- CmdLine/GitAnnex.hs | 2 + Command/AddComputed.hs | 49 ++++---- Command/Recompute.hs | 202 +++++++++++++++++++++++++++++++++ Remote/Compute.hs | 77 ++++++++----- doc/git-annex-addcomputed.mdwn | 6 +- doc/git-annex-recompute.mdwn | 32 ++++-- git-annex.cabal | 1 + 7 files changed, 304 insertions(+), 65 deletions(-) create mode 100644 Command/Recompute.hs diff --git a/CmdLine/GitAnnex.hs b/CmdLine/GitAnnex.hs index 71d9f2e51f..8dc64f8b7b 100644 --- a/CmdLine/GitAnnex.hs +++ b/CmdLine/GitAnnex.hs @@ -134,6 +134,7 @@ import qualified Command.UpdateProxy import qualified Command.MaxSize import qualified Command.Sim import qualified Command.AddComputed +import qualified Command.Recompute import qualified Command.Version import qualified Command.RemoteDaemon #ifdef WITH_ASSISTANT @@ -267,6 +268,7 @@ cmds testoptparser testrunner mkbenchmarkgenerator = map addGitAnnexCommonOption , Command.MaxSize.cmd , Command.Sim.cmd , Command.AddComputed.cmd + , Command.Recompute.cmd , Command.Version.cmd , Command.RemoteDaemon.cmd #ifdef WITH_ASSISTANT diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index fad9c1dc30..9ff13f1f70 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -17,7 +17,6 @@ import qualified Types.Remote as Remote import Annex.CatFile import Annex.Content.Presence import Annex.Ingest -import Types.RemoteConfig import Types.KeySource import Messages.Progress import Logs.Location @@ -68,23 +67,20 @@ seek o = startConcurrency commandStages (seek' o) seek' :: AddComputedOptions -> CommandSeek seek' o = do r <- getParsed (computeRemote o) - unless (Remote.typename (Remote.remotetype r) == Remote.typename Remote.Compute.remote) $ + unless (Remote.Compute.isComputeRemote r) $ giveup "That is not a compute remote." - let rc = unparsedRemoteConfig (Remote.config r) - case Remote.Compute.getComputeProgram rc of - Left err -> giveup $ - "Problem with the configuration of the compute remote: " ++ err - Right program -> commandAction $ start o r program + commandAction $ start o r -start :: AddComputedOptions -> Remote -> Remote.Compute.ComputeProgram -> CommandStart -start o r program = starting "addcomputed" ai si $ perform o r program +start :: AddComputedOptions -> Remote -> CommandStart +start o r = starting "addcomputed" ai si $ perform o r where ai = ActionItemUUID (Remote.uuid r) (UnquotedString (Remote.name r)) si = SeekInput (computeParams o) -perform :: AddComputedOptions -> Remote -> Remote.Compute.ComputeProgram -> CommandPerform -perform o r program = do +perform :: AddComputedOptions -> Remote -> CommandPerform +perform o r = do + program <- Remote.Compute.getComputeProgram r repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") let state = Remote.Compute.ComputeState @@ -100,24 +96,10 @@ perform o r program = do showOutput Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) - (getinputcontent fast) + (getInputContent fast) (go starttime fast) next $ return True where - getinputcontent fast p = catKeyFile p >>= \case - Just inputkey -> do - obj <- calcRepo (gitAnnexLocation inputkey) - if fast - then return (inputkey, Nothing) - else ifM (inAnnex inputkey) - ( return (inputkey, Just obj) - , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p - ) - Nothing -> ifM (liftIO $ doesFileExist p) - ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p - , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p - ) - go starttime fast state tmpdir = do endtime <- liftIO currentMonotonicTimestamp let ts = calcduration starttime endtime @@ -175,3 +157,18 @@ perform o r program = do isreproducible state = case reproducible o of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state + +getInputContent :: Bool -> OsPath -> Annex (Key, Maybe OsPath) +getInputContent fast p = catKeyFile p >>= \case + Just inputkey -> do + obj <- calcRepo (gitAnnexLocation inputkey) + if fast + then return (inputkey, Nothing) + else ifM (inAnnex inputkey) + ( return (inputkey, Just obj) + , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p + ) + Nothing -> ifM (liftIO $ doesFileExist p) + ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p + , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p + ) diff --git a/Command/Recompute.hs b/Command/Recompute.hs new file mode 100644 index 0000000000..95f8f3e16f --- /dev/null +++ b/Command/Recompute.hs @@ -0,0 +1,202 @@ +{- git-annex command + - + - Copyright 2025 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE OverloadedStrings #-} + +module Command.Recompute where + +import Command +import qualified Git +import qualified Annex +import qualified Remote.Compute +import qualified Remote +import qualified Types.Remote as Remote +import Annex.CatFile +import Annex.Content.Presence +import Annex.Ingest +import Git.FilePath +import Types.RemoteConfig +import Types.KeySource +import Messages.Progress +import Logs.Location +import Utility.Metered +import Utility.MonotonicClock +import Backend.URL (fromUrl) +import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent) + +import qualified Data.Map as M +import Data.Time.Clock + +cmd :: Command +cmd = notBareRepo $ + command "recompute" SectionCommon "recompute computed files" + paramPaths (seek <$$> optParser) + +data RecomputeOptions = RecomputeOptions + { recomputeThese :: CmdParams + , originalOption :: Bool + , othersOption :: Bool + , reproducible :: Maybe Reproducible + , computeRemote :: Maybe (DeferredParse Remote) + } + +optParser :: CmdParamsDesc -> Parser RecomputeOptions +optParser desc = RecomputeOptions + <$> cmdParams desc + <*> switch + ( long "original" + <> help "recompute using original content of input files" + ) + <*> switch + ( long "others" + <> help "stage other files that are recomputed in passing" + ) + <*> parseReproducible + <*> optional (mkParseRemoteOption <$> parseRemoteOption) + +seek :: RecomputeOptions -> CommandSeek +seek o = startConcurrency commandStages (seek' o) + +seek' :: RecomputeOptions -> CommandSeek +seek' o = do + computeremote <- maybe (pure Nothing) (Just <$$> getParsed) + (computeRemote o) + let seeker = AnnexedFileSeeker + { startAction = const $ start o computeremote + , checkContentPresent = Nothing + , usesLocationLog = True + } + withFilesInGitAnnex ww seeker + =<< workTreeItems ww (recomputeThese o) + where + ww = WarnUnmatchLsFiles "recompute" + +start :: RecomputeOptions -> Maybe Remote -> SeekInput -> OsPath -> Key -> CommandStart +start o (Just computeremote) si file key = + stopUnless (notElem (Remote.uuid computeremote) <$> loggedLocations key) $ + start' o computeremote si file key +start o Nothing si file key = do + rs <- catMaybes <$> (mapM Remote.byUUID =<< loggedLocations key) + case sortOn Remote.cost $ filter Remote.Compute.isComputeRemote rs of + [] -> stop + (r:_) -> start' o r si file key + +start' :: RecomputeOptions -> Remote -> SeekInput -> OsPath -> Key -> CommandStart +start' o r si file key = + Remote.Compute.getComputeState + (Remote.remoteStateHandle r) key >>= \case + Nothing -> stop + Just state -> + stopUnless (shouldrecompute state) $ + starting "recompute" ai si $ + perform o r file key state + where + ai = mkActionItem (key, file) + + shouldrecompute state + | originalOption o = return True + | otherwise = + anyM (inputchanged state) $ + M.toList (Remote.Compute.computeInputs state) + + inputchanged state (inputfile, inputkey) = do + -- Note that the paths from the remote state are not to be + -- trusted to point to a file in the repository, but using + -- the path with catKeyFile will only succeed if it + -- is checked into the repository. + p <- fromRepo $ fromTopFilePath $ asTopFilePath $ + Remote.Compute.computeSubdir state inputfile + catKeyFile p >>= return . \case + Just k -> k /= inputkey + -- When an input file is missing, go ahead and + -- recompute. This way, the user will see the + -- computation fail, with an error message that + -- explains the problem. + -- XXX check that this works well + Nothing -> True + +perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform +perform o r file key oldstate = do + program <- Remote.Compute.getComputeProgram r + let recomputestate = oldstate + { Remote.Compute.computeInputs = mempty + , Remote.Compute.computeOutputs = mempty + } + fast <- Annex.getRead Annex.fast + starttime <- liftIO currentMonotonicTimestamp + showOutput + Remote.Compute.runComputeProgram program recomputestate + (Remote.Compute.ImmutableState False) + (getinputcontent program fast) + (go starttime fast) + next $ return True + where + getinputcontent program fast p + | originalOption o = + case M.lookup p (Remote.Compute.computeInputs oldstate) of + Just inputkey -> return (inputkey, Nothing) + Nothing -> Remote.Compute.computationBehaviorChangeError program + "requesting a new input file" p + | otherwise = getInputContent fast p + + go starttime fast state tmpdir = do + endtime <- liftIO currentMonotonicTimestamp + let ts = calcduration starttime endtime + let outputs = Remote.Compute.computeOutputs state + when (M.null outputs) $ + giveup "The computation succeeded, but it did not generate any files." + oks <- forM (M.keys outputs) $ \outputfile -> do + showAction $ "adding " <> QuotedPath outputfile + k <- catchNonAsync (addfile fast state tmpdir outputfile) + (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) + return (outputfile, Just k) + let state' = state + { Remote.Compute.computeOutputs = M.fromList oks + } + forM_ (mapMaybe snd oks) $ \k -> do + Remote.Compute.setComputeState + (Remote.remoteStateHandle r) + k ts state' + logChange NoLiveUpdate k (Remote.uuid r) InfoPresent + + addfile fast state tmpdir outputfile + | fast = do + addSymlink outputfile stateurlk Nothing + return stateurlk + | isreproducible state = do + sz <- liftIO $ getFileSize outputfile' + metered Nothing sz Nothing $ \_ p -> + ingestwith $ ingestAdd p (Just ld) + | otherwise = ingestwith $ + ingestAdd' nullMeterUpdate (Just ld) (Just stateurlk) + where + stateurl = Remote.Compute.computeStateUrl r state outputfile + stateurlk = fromUrl stateurl Nothing True + outputfile' = tmpdir outputfile + ld = LockedDown ldc $ KeySource + { keyFilename = outputfile + , contentLocation = outputfile' + , inodeCache = Nothing + } + ingestwith a = a >>= \case + Nothing -> giveup "key generation failed" + Just k -> do + logStatus NoLiveUpdate k InfoPresent + return k + + ldc = LockDownConfig + { lockingFile = True + , hardlinkFileTmpDir = Nothing + , checkWritePerms = True + } + + calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = + fromIntegral (endtime - starttime) :: NominalDiffTime + + isreproducible state = case reproducible o of + Just v -> isReproducible v + Nothing -> Remote.Compute.computeReproducible state diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 09ab45687a..b412fc4df6 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -9,14 +9,16 @@ module Remote.Compute ( remote, + isComputeRemote, ComputeState(..), setComputeState, - getComputeStates, + getComputeState, computeStateUrl, ComputeProgram, getComputeProgram, runComputeProgram, ImmutableState(..), + computationBehaviorChangeError, defaultComputeParams, ) where @@ -63,8 +65,11 @@ remote = RemoteType , thirdPartyPopulated = False } +isComputeRemote :: Remote -> Bool +isComputeRemote r = typename (remotetype r) == typename remote + gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote) -gen r u rc gc rs = case getComputeProgram rc of +gen r u rc gc rs = case getComputeProgram' rc of Left _err -> return Nothing Right program -> do c <- parsedRemoteConfig remote rc @@ -107,7 +112,7 @@ gen r u rc gc rs = case getComputeProgram rc of setupInstance :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) setupInstance _ mu _ c _ = do - ComputeProgram program <- either giveup return (getComputeProgram c) + ComputeProgram program <- either giveup return $ getComputeProgram' c unlessM (liftIO $ inSearchPath program) $ giveup $ "Cannot find " ++ program ++ " in PATH" u <- maybe (liftIO genUUID) return mu @@ -136,8 +141,15 @@ defaultComputeParams = map mk . M.toList . getRemoteConfigPassedThrough . config newtype ComputeProgram = ComputeProgram String deriving (Show) -getComputeProgram :: RemoteConfig -> Either String ComputeProgram -getComputeProgram c = case fromProposedAccepted <$> M.lookup programField c of +getComputeProgram :: Remote -> Annex ComputeProgram +getComputeProgram r = + case getComputeProgram' (unparsedRemoteConfig (config r)) of + Right program -> return program + Left err -> giveup $ + "Problem with the configuration of compute remote " ++ name r ++ ": " ++ err + +getComputeProgram' :: RemoteConfig -> Either String ComputeProgram +getComputeProgram' c = case fromProposedAccepted <$> M.lookup programField c of Just program | safetyPrefix `isPrefixOf` program -> Right (ComputeProgram program) @@ -285,8 +297,15 @@ setComputeState rs k ts st = addRemoteMetaData k rs $ MetaData $ M.singleton (mkMetaFieldUnchecked $ T.pack ('t':show (truncateResolution 1 ts))) (S.singleton (MetaValue (CurrentlySet True) (formatComputeState k st))) -getComputeStates :: RemoteStateHandle -> Key -> Annex [(NominalDiffTime, ComputeState)] -getComputeStates rs k = do +{- When multiple ComputeStates have been recorded for the same key, + - this returns one that is probably less expensive to compute, + - based on the original time it took to compute it. -} +getComputeState:: RemoteStateHandle -> Key -> Annex (Maybe ComputeState) +getComputeState rs k = headMaybe . map snd . sortOn fst + <$> getComputeStatesUnsorted rs k + +getComputeStatesUnsorted :: RemoteStateHandle -> Key -> Annex [(NominalDiffTime, ComputeState)] +getComputeStatesUnsorted rs k = do RemoteMetaData _ (MetaData m) <- getCurrentRemoteMetaData rs k return $ go [] (M.toList m) where @@ -369,7 +388,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) let f' = toOsPath f let knowninput = M.member f' (computeInputs state') checksafefile tmpdir subdir f' "input" - checkimmutable knowninput l $ do + checkimmutable knowninput "inputting" f' $ do (k, mp) <- getinputcontent f' mp' <- liftIO $ maybe (pure Nothing) (Just <$$> relPathDirToFile subdir) @@ -388,7 +407,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) let f' = toOsPath f checksafefile tmpdir subdir f' "output" let knownoutput = M.member f' (computeOutputs state') - checkimmutable knownoutput l $ + checkimmutable knownoutput "outputting" f' $ return $ if knownoutput then state' else state' @@ -412,25 +431,31 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) when (any (\p -> dropTrailingPathSeparator p == literalOsPath ".git") (splitPath f)) $ err "inside the .git directory" - checkimmutable True _ a = a - checkimmutable False l a + checkimmutable True _ _ a = a + checkimmutable False requestdesc p a | not immutablestate = a - | otherwise = giveup $ - program ++ " is not behaving the same way it used to, now outputting: \"" ++ l ++ "\"" + | otherwise = computationBehaviorChangeError (ComputeProgram program) requestdesc p + +computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a +computationBehaviorChangeError (ComputeProgram program) requestdesc p = + giveup $ program ++ " is not behaving the same way it used to, now " ++ requestdesc ++ ": " ++ fromOsPath p computeKey :: RemoteStateHandle -> ComputeProgram -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification -computeKey rs (ComputeProgram program) k af dest p vc = do - states <- map snd . sortOn fst -- least expensive probably - <$> getComputeStates rs k - case mapMaybe computeskey states of - ((keyfile, state):_) -> runComputeProgram - (ComputeProgram program) - state - (ImmutableState True) - (getinputcontent state) - (go keyfile) - [] -> giveup "Missing compute state" +computeKey rs (ComputeProgram program) k af dest p vc = + getComputeState rs k >>= \case + Just state -> + case computeskey state of + Just keyfile -> runComputeProgram + (ComputeProgram program) + state + (ImmutableState True) + (getinputcontent state) + (go keyfile) + Nothing -> missingstate + Nothing -> missingstate where + missingstate = giveup "Missing compute state" + getinputcontent state f = case M.lookup (fromOsPath f) (computeInputs state) of Just inputkey -> do @@ -441,7 +466,7 @@ computeKey rs (ComputeProgram program) k af dest p vc = do computeskey state = case M.keys $ M.filter (== Just k) (computeOutputs state) of - (keyfile : _) -> Just (keyfile, state) + (keyfile : _) -> Just keyfile [] -> Nothing go keyfile state tmpdir = do @@ -470,7 +495,7 @@ computeKey rs (ComputeProgram program) k af dest p vc = do -- Make sure that the compute state exists. checkKey :: RemoteStateHandle -> Key -> Annex Bool checkKey rs k = do - states <- getComputeStates rs k + states <- getComputeStatesUnsorted rs k if null states then giveup "Missing compute state" else return True diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 245d4a04b0..58261da181 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -78,9 +78,9 @@ the parameters provided to `git-annex addcomputed`. reproducible output (except when using `--fast`). If a computation turns out not to be fully reproducible, then getting - the file from the compute remote will later fail with a checksum - verification error. One thing that can be done then is to use - `git-annex recompute --unreproducible`. + a computed file from the compute remote will later fail with a + checksum verification error. One thing that can be done then is to use + `git-annex recompute --original --unreproducible`. * Also the [[git-annex-common-options]](1) can be used. diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index 2800a74106..6e1a32f0d9 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -1,6 +1,6 @@ # NAME -git-annex recompute - update computed files +git-annex recompute - recompute computed files # SYNOPSIS @@ -9,18 +9,24 @@ git-annex recompute [path ...]` # DESCRIPTION This updates computed files that were added with -[[git-annex-addcomputed]](1). +[[git-annex-addcomputed]](1). + +When the output of the computation is different, the updated computed +file is staged in the repository. By default, this only recomputes files whose input files have changed. -The new contents of the input files are used to re-run the computation, -and when the output is different, the updated computed file is staged -in the repository. +The new contents of the input files are used to re-run the computation. # OPTIONS -* `--unchanged` +* `--original` - Recompute files even when their input files have not changed. + Use the original content of input files. + +* `--others` + + When recomputing one file also generates new versions of other files, + stage those other files in the repository too. * `--unreproducible`, `-u` @@ -32,14 +38,20 @@ in the repository. Convert files that were added with `git-annex addcomputed --unreproducible` to be as if they were added with `--reproducible`. +* `--remote=name` + + Only recompute files that were computed by this compute remote. + + When this option is not used, all computed files are recomputed using + whatever compute remote was originally used to add them. In cases where + a file can be computed by multiple remotes, the one with the lowest + configured cost will be used. + * matching options The [[git-annex-matching-options]](1) can be used to control what files to recompute. - For example, to only recompute files that are computed by the "photoconv" - compute remote, use `--in=photoconv` - * Also the [[git-annex-common-options]](1) can be used. # SEE ALSO diff --git a/git-annex.cabal b/git-annex.cabal index 5ed414a8dd..88203be956 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -728,6 +728,7 @@ Executable git-annex Command.Proxy Command.Pull Command.Push + Command.Recompute Command.ReKey Command.ReadPresentKey Command.RecvKey From 53d107ca472d2bf239866339b232f253f8d86f0e Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 26 Feb 2025 14:05:37 -0400 Subject: [PATCH 23/53] refactor --- Command/AddComputed.hs | 12 ++---------- Command/Recompute.hs | 15 ++------------- Remote/Compute.hs | 12 +++++++++--- 3 files changed, 13 insertions(+), 26 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 9ff13f1f70..f27932405e 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -21,11 +21,9 @@ import Types.KeySource import Messages.Progress import Logs.Location import Utility.Metered -import Utility.MonotonicClock import Backend.URL (fromUrl) import qualified Data.Map as M -import Data.Time.Clock cmd :: Command cmd = notBareRepo $ @@ -92,17 +90,14 @@ perform o r = do , Remote.Compute.computeReproducible = False } fast <- Annex.getRead Annex.fast - starttime <- liftIO currentMonotonicTimestamp showOutput Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (go starttime fast) + (go fast) next $ return True where - go starttime fast state tmpdir = do - endtime <- liftIO currentMonotonicTimestamp - let ts = calcduration starttime endtime + go fast state tmpdir ts = do let outputs = Remote.Compute.computeOutputs state when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." @@ -151,9 +146,6 @@ perform o r = do , checkWritePerms = True } - calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = - fromIntegral (endtime - starttime) :: NominalDiffTime - isreproducible state = case reproducible o of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 95f8f3e16f..42a313ee75 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -10,26 +10,21 @@ module Command.Recompute where import Command -import qualified Git import qualified Annex import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote import Annex.CatFile -import Annex.Content.Presence import Annex.Ingest import Git.FilePath -import Types.RemoteConfig import Types.KeySource import Messages.Progress import Logs.Location import Utility.Metered -import Utility.MonotonicClock import Backend.URL (fromUrl) import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent) import qualified Data.Map as M -import Data.Time.Clock cmd :: Command cmd = notBareRepo $ @@ -127,12 +122,11 @@ perform o r file key oldstate = do , Remote.Compute.computeOutputs = mempty } fast <- Annex.getRead Annex.fast - starttime <- liftIO currentMonotonicTimestamp showOutput Remote.Compute.runComputeProgram program recomputestate (Remote.Compute.ImmutableState False) (getinputcontent program fast) - (go starttime fast) + (go fast) next $ return True where getinputcontent program fast p @@ -143,9 +137,7 @@ perform o r file key oldstate = do "requesting a new input file" p | otherwise = getInputContent fast p - go starttime fast state tmpdir = do - endtime <- liftIO currentMonotonicTimestamp - let ts = calcduration starttime endtime + go fast state tmpdir ts = do let outputs = Remote.Compute.computeOutputs state when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." @@ -194,9 +186,6 @@ perform o r file key oldstate = do , checkWritePerms = True } - calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = - fromIntegral (endtime - starttime) :: NominalDiffTime - isreproducible state = case reproducible o of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state diff --git a/Remote/Compute.hs b/Remote/Compute.hs index b412fc4df6..e3ec2a8fdd 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -41,6 +41,7 @@ import Utility.TimeStamp import Utility.Env import Utility.Tmp.Dir import Utility.Url +import Utility.MonotonicClock import qualified Git import qualified Utility.SimpleProtocol as Proto @@ -338,7 +339,7 @@ runComputeProgram -> ComputeState -> ImmutableState -> (OsPath -> Annex (Key, Maybe OsPath)) - -> (ComputeState -> OsPath -> Annex v) + -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = withOtherTmp $ \othertmpdir -> @@ -353,11 +354,13 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) , std_out = CreatePipe , env = Just environ } + starttime <- liftIO currentMonotonicTimestamp state' <- bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) (getinput state tmpdir subdir) - cont state' subdir + endtime <- liftIO currentMonotonicTimestamp + cont state' subdir (calcduration starttime endtime) getsubdir tmpdir = do let subdir = tmpdir computeSubdir state @@ -435,6 +438,9 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) checkimmutable False requestdesc p a | not immutablestate = a | otherwise = computationBehaviorChangeError (ComputeProgram program) requestdesc p + + calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = + fromIntegral (endtime - starttime) :: NominalDiffTime computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a computationBehaviorChangeError (ComputeProgram program) requestdesc p = @@ -469,7 +475,7 @@ computeKey rs (ComputeProgram program) k af dest p vc = (keyfile : _) -> Just keyfile [] -> Nothing - go keyfile state tmpdir = do + go keyfile state tmpdir ts = do let keyfile' = tmpdir keyfile unlessM (liftIO $ doesFileExist keyfile') $ giveup $ program ++ " exited sucessfully, but failed to write the computed file" From d6a010a6155e545142b59ca54042b91a98430c68 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 26 Feb 2025 15:51:31 -0400 Subject: [PATCH 24/53] recompute closer to working properly Proper behavior without --others implemented. And eliminated most of the code duplication through refactoring. Also, changed it to not stage recomputed files. This way, git diff will show files that have differences. --- Command/AddComputed.hs | 108 ++++++++++++++++++++++------------- Command/Recompute.hs | 78 ++++--------------------- doc/git-annex-recompute.mdwn | 8 +-- 3 files changed, 81 insertions(+), 113 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index f27932405e..b2b55fb605 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -14,6 +14,7 @@ import qualified Git import qualified Annex import qualified Remote.Compute import qualified Types.Remote as Remote +import Backend import Annex.CatFile import Annex.Content.Presence import Annex.Ingest @@ -24,6 +25,7 @@ import Utility.Metered import Backend.URL (fromUrl) import qualified Data.Map as M +import Data.Time.Clock cmd :: Command cmd = notBareRepo $ @@ -94,73 +96,97 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (go fast) + (addComputed "adding" True r (reproducible o) (const True) fast) next $ return True + +addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Bool) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () +addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir ts = do + let outputs = Remote.Compute.computeOutputs state + when (M.null outputs) $ + giveup "The computation succeeded, but it did not generate any files." + oks <- forM (M.keys outputs) $ \outputfile -> do + showAction $ addaction <> " " <> QuotedPath outputfile + k <- catchNonAsync (addfile outputfile) + (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) + return (outputfile, Just k) + let state' = state + { Remote.Compute.computeOutputs = M.fromList oks + } + forM_ (mapMaybe snd oks) $ \k -> do + Remote.Compute.setComputeState + (Remote.remoteStateHandle r) + k ts state' + logChange NoLiveUpdate k (Remote.uuid r) InfoPresent where - go fast state tmpdir ts = do - let outputs = Remote.Compute.computeOutputs state - when (M.null outputs) $ - giveup "The computation succeeded, but it did not generate any files." - oks <- forM (M.keys outputs) $ \outputfile -> do - showAction $ "adding " <> QuotedPath outputfile - k <- catchNonAsync (addfile fast state tmpdir outputfile) - (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) - return (outputfile, Just k) - let state' = state - { Remote.Compute.computeOutputs = M.fromList oks - } - forM_ (mapMaybe snd oks) $ \k -> do - Remote.Compute.setComputeState - (Remote.remoteStateHandle r) - k ts state' - logChange NoLiveUpdate k (Remote.uuid r) InfoPresent - - addfile fast state tmpdir outputfile + addfile outputfile | fast = do - addSymlink outputfile stateurlk Nothing + when (wantfile outputfile) $ + if stagefiles + then addSymlink outputfile stateurlk Nothing + else makelink stateurlk return stateurlk - | isreproducible state = do + | isreproducible = do sz <- liftIO $ getFileSize outputfile' metered Nothing sz Nothing $ \_ p -> - ingestwith $ ingestAdd p (Just ld) - | otherwise = ingestwith $ - ingestAdd' nullMeterUpdate (Just ld) (Just stateurlk) + if wantfile outputfile + then ingesthelper p Nothing + else genkey p + | otherwise = + if wantfile outputfile + then ingesthelper nullMeterUpdate + (Just stateurlk) + else return stateurlk where stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir outputfile - ld = LockedDown ldc $ KeySource - { keyFilename = outputfile - , contentLocation = outputfile' - , inodeCache = Nothing - } + ld = LockedDown ldc ks + ks = KeySource + { keyFilename = outputfile + , contentLocation = outputfile' + , inodeCache = Nothing + } ingestwith a = a >>= \case - Nothing -> giveup "key generation failed" + Nothing -> giveup "ingestion failed" Just k -> do logStatus NoLiveUpdate k InfoPresent return k - + genkey p = do + backend <- chooseBackend outputfile + fst <$> genKey ks p backend + makelink k = void $ makeLink outputfile k Nothing + ingesthelper p mk + | stagefiles = ingestwith $ + ingestAdd' p (Just ld) mk + | otherwise = ingestwith $ do + mk' <- fst <$> ingest p (Just ld) mk + maybe noop makelink mk' + return mk' + ldc = LockDownConfig { lockingFile = True , hardlinkFileTmpDir = Nothing , checkWritePerms = True } - isreproducible state = case reproducible o of + isreproducible = case reproducibleconfig of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state getInputContent :: Bool -> OsPath -> Annex (Key, Maybe OsPath) getInputContent fast p = catKeyFile p >>= \case - Just inputkey -> do - obj <- calcRepo (gitAnnexLocation inputkey) - if fast - then return (inputkey, Nothing) - else ifM (inAnnex inputkey) - ( return (inputkey, Just obj) - , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p - ) + Just inputkey -> getInputContent' fast inputkey (fromOsPath p) Nothing -> ifM (liftIO $ doesFileExist p) ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p ) + +getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe OsPath) +getInputContent' fast inputkey filedesc = do + obj <- calcRepo (gitAnnexLocation inputkey) + if fast + then return (inputkey, Nothing) + else ifM (inAnnex inputkey) + ( return (inputkey, Just obj) + , giveup $ "The computation needs the content of a file which is not present: " ++ filedesc + ) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 42a313ee75..a5a82b7028 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -15,14 +15,9 @@ import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote import Annex.CatFile -import Annex.Ingest import Git.FilePath -import Types.KeySource -import Messages.Progress import Logs.Location -import Utility.Metered -import Backend.URL (fromUrl) -import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent) +import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) import qualified Data.Map as M @@ -111,81 +106,28 @@ start' o r si file key = -- recompute. This way, the user will see the -- computation fail, with an error message that -- explains the problem. - -- XXX check that this works well Nothing -> True perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform perform o r file key oldstate = do program <- Remote.Compute.getComputeProgram r - let recomputestate = oldstate - { Remote.Compute.computeInputs = mempty - , Remote.Compute.computeOutputs = mempty - } fast <- Annex.getRead Annex.fast showOutput - Remote.Compute.runComputeProgram program recomputestate - (Remote.Compute.ImmutableState False) + Remote.Compute.runComputeProgram program oldstate + (Remote.Compute.ImmutableState True) (getinputcontent program fast) - (go fast) + (addComputed "processing" False r (reproducible o) wantfile fast) next $ return True where getinputcontent program fast p - | originalOption o = + | originalOption o = case M.lookup p (Remote.Compute.computeInputs oldstate) of - Just inputkey -> return (inputkey, Nothing) + Just inputkey -> getInputContent' fast inputkey + (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p | otherwise = getInputContent fast p - go fast state tmpdir ts = do - let outputs = Remote.Compute.computeOutputs state - when (M.null outputs) $ - giveup "The computation succeeded, but it did not generate any files." - oks <- forM (M.keys outputs) $ \outputfile -> do - showAction $ "adding " <> QuotedPath outputfile - k <- catchNonAsync (addfile fast state tmpdir outputfile) - (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) - return (outputfile, Just k) - let state' = state - { Remote.Compute.computeOutputs = M.fromList oks - } - forM_ (mapMaybe snd oks) $ \k -> do - Remote.Compute.setComputeState - (Remote.remoteStateHandle r) - k ts state' - logChange NoLiveUpdate k (Remote.uuid r) InfoPresent - - addfile fast state tmpdir outputfile - | fast = do - addSymlink outputfile stateurlk Nothing - return stateurlk - | isreproducible state = do - sz <- liftIO $ getFileSize outputfile' - metered Nothing sz Nothing $ \_ p -> - ingestwith $ ingestAdd p (Just ld) - | otherwise = ingestwith $ - ingestAdd' nullMeterUpdate (Just ld) (Just stateurlk) - where - stateurl = Remote.Compute.computeStateUrl r state outputfile - stateurlk = fromUrl stateurl Nothing True - outputfile' = tmpdir outputfile - ld = LockedDown ldc $ KeySource - { keyFilename = outputfile - , contentLocation = outputfile' - , inodeCache = Nothing - } - ingestwith a = a >>= \case - Nothing -> giveup "key generation failed" - Just k -> do - logStatus NoLiveUpdate k InfoPresent - return k - - ldc = LockDownConfig - { lockingFile = True - , hardlinkFileTmpDir = Nothing - , checkWritePerms = True - } - - isreproducible state = case reproducible o of - Just v -> isReproducible v - Nothing -> Remote.Compute.computeReproducible state + wantfile outputfile + | othersOption o = True + | otherwise = outputfile == file diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index 6e1a32f0d9..b5176285e7 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -11,12 +11,12 @@ git-annex recompute [path ...]` This updates computed files that were added with [[git-annex-addcomputed]](1). -When the output of the computation is different, the updated computed -file is staged in the repository. - By default, this only recomputes files whose input files have changed. The new contents of the input files are used to re-run the computation. +When the output of the computation is different, the computed file is +updated with the new content. + # OPTIONS * `--original` @@ -26,7 +26,7 @@ The new contents of the input files are used to re-run the computation. * `--others` When recomputing one file also generates new versions of other files, - stage those other files in the repository too. + update those other files too. * `--unreproducible`, `-u` From 5d2a608a5630e732e3a49b95c415a348aa04e034 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 26 Feb 2025 15:59:47 -0400 Subject: [PATCH 25/53] todo --- Command/Recompute.hs | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index a5a82b7028..44d89f6a33 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -108,6 +108,8 @@ start' o r si file key = -- explains the problem. Nothing -> True +-- TODO When reproducible is not set, preserve the +-- reproducible/unreproducible of the input key. perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform perform o r file key oldstate = do program <- Remote.Compute.getComputeProgram r From 9c2c3002a6233a191425854bb69623d9bdd43194 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 27 Feb 2025 11:10:44 -0400 Subject: [PATCH 26/53] fix recompute of renamed files When a computed file has been renamed, a recompute needs to write to the new filename. I decided to remove --others because it's not clear what it should do in the face of renames. Should it update only other files that have not been renamed? Or update files that use the old key to the new key anywhere in the tree? Or write the other files to the cwd, ignoring renames? Since --others is just a way to save on compute time, adding this complexity at this point seems like a bad idea. May revisit later. Added temporary TODO-compute file --- Command/AddComputed.hs | 50 ++++++++++++++++++------------------ Command/Recompute.hs | 22 ++++++++-------- TODO-compute | 36 ++++++++++++++++++++++++++ doc/git-annex-recompute.mdwn | 5 ---- 4 files changed, 71 insertions(+), 42 deletions(-) create mode 100644 TODO-compute diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index b2b55fb605..071015e014 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -96,11 +96,11 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (addComputed "adding" True r (reproducible o) (const True) fast) + (addComputed "adding" True r (reproducible o) Just fast) next $ return True -addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Bool) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () -addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir ts = do +addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Maybe OsPath) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () +addComputed addaction stagefiles r reproducibleconfig destfile fast state tmpdir ts = do let outputs = Remote.Compute.computeOutputs state when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." @@ -120,29 +120,29 @@ addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir where addfile outputfile | fast = do - when (wantfile outputfile) $ - if stagefiles - then addSymlink outputfile stateurlk Nothing - else makelink stateurlk + case destfile outputfile of + Nothing -> noop + Just f + | stagefiles -> addSymlink f stateurlk Nothing + | otherwise -> makelink f stateurlk return stateurlk | isreproducible = do sz <- liftIO $ getFileSize outputfile' metered Nothing sz Nothing $ \_ p -> - if wantfile outputfile - then ingesthelper p Nothing - else genkey p - | otherwise = - if wantfile outputfile - then ingesthelper nullMeterUpdate - (Just stateurlk) - else return stateurlk + case destfile outputfile of + Just f -> ingesthelper f p Nothing + Nothing -> genkey outputfile p + | otherwise = case destfile outputfile of + Just f -> ingesthelper f nullMeterUpdate + (Just stateurlk) + Nothing -> return stateurlk where stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir outputfile - ld = LockedDown ldc ks - ks = KeySource - { keyFilename = outputfile + ld f = LockedDown ldc (ks f) + ks f = KeySource + { keyFilename = f , contentLocation = outputfile' , inodeCache = Nothing } @@ -151,16 +151,16 @@ addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir Just k -> do logStatus NoLiveUpdate k InfoPresent return k - genkey p = do + genkey f p = do backend <- chooseBackend outputfile - fst <$> genKey ks p backend - makelink k = void $ makeLink outputfile k Nothing - ingesthelper p mk + fst <$> genKey (ks f) p backend + makelink f k = void $ makeLink f k Nothing + ingesthelper f p mk | stagefiles = ingestwith $ - ingestAdd' p (Just ld) mk + ingestAdd' p (Just (ld f)) mk | otherwise = ingestwith $ do - mk' <- fst <$> ingest p (Just ld) mk - maybe noop makelink mk' + mk' <- fst <$> ingest p (Just (ld f)) mk + maybe noop (makelink f) mk' return mk' ldc = LockDownConfig diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 44d89f6a33..2193216d29 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -29,7 +29,6 @@ cmd = notBareRepo $ data RecomputeOptions = RecomputeOptions { recomputeThese :: CmdParams , originalOption :: Bool - , othersOption :: Bool , reproducible :: Maybe Reproducible , computeRemote :: Maybe (DeferredParse Remote) } @@ -41,10 +40,6 @@ optParser desc = RecomputeOptions ( long "original" <> help "recompute using original content of input files" ) - <*> switch - ( long "others" - <> help "stage other files that are recomputed in passing" - ) <*> parseReproducible <*> optional (mkParseRemoteOption <$> parseRemoteOption) @@ -111,25 +106,28 @@ start' o r si file key = -- TODO When reproducible is not set, preserve the -- reproducible/unreproducible of the input key. perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform -perform o r file key oldstate = do +perform o r file key origstate = do program <- Remote.Compute.getComputeProgram r fast <- Annex.getRead Annex.fast showOutput - Remote.Compute.runComputeProgram program oldstate + Remote.Compute.runComputeProgram program origstate (Remote.Compute.ImmutableState True) (getinputcontent program fast) - (addComputed "processing" False r (reproducible o) wantfile fast) + (addComputed "processing" False r (reproducible o) destfile fast) next $ return True where getinputcontent program fast p | originalOption o = - case M.lookup p (Remote.Compute.computeInputs oldstate) of + case M.lookup p (Remote.Compute.computeInputs origstate) of Just inputkey -> getInputContent' fast inputkey (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p | otherwise = getInputContent fast p - wantfile outputfile - | othersOption o = True - | otherwise = outputfile == file + destfile outputfile + | Just outputfile == origfile = Just file + | otherwise = Nothing + + origfile = headMaybe $ M.keys $ M.filter (== Just key) + (Remote.Compute.computeOutputs origstate) diff --git a/TODO-compute b/TODO-compute new file mode 100644 index 0000000000..31d8aaa7b2 --- /dev/null +++ b/TODO-compute @@ -0,0 +1,36 @@ +* recompute could ingest keys for other files than the one being + recomputed, and remember them. Then recomputing those files could just + use those keys, without re-running a computation. (Better than --others + which got removed.) + +* `git-annex recompute foo bar baz`, when foo depends on bar which depends + on baz, and when baz has changed, will not recompute foo, because bar has + not changed. It then recomputes bar. So running the command again is + needed to recompute foo. + + What it could do is, after it recomputes bar, notice that it already + considered foo, and revisit foo, and recompute it then. It could either + use a bloom filter to remember the files it considered but did not + compute, or it could just notice that the command line includes foo + (or includes a directory that contains foo), and then foo is not + modified. + + Or it could build a DAG and traverse it, but building a DAG of a large + directory tree has its own problems. + +* recompute should use the same key backend for a file that it used before + (except when --reproducible/--unreproducible is passed). + +* Check recompute's handling of --reproducible and --unreproducible. + +* addcomputed should honor annex.addunlocked. + +* Perhaps recompute should write a new version of a file as an unlocked + file when the file is currently unlocked? + +* Support non-annexed files as inputs to computations. + +* Should addcomputed honor annex.smallfiles? That would seem to imply + that recompute should also support recomputing non-annexed files. + Otherwise, adding a file and then recomputing it would vary in + what the content of the file is, depending on annex.smallfiles setting. diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index b5176285e7..b65488bae8 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -23,11 +23,6 @@ updated with the new content. Use the original content of input files. -* `--others` - - When recomputing one file also generates new versions of other files, - update those other files too. - * `--unreproducible`, `-u` Convert files that were added with `git-annex addcomputed --reproducible` From 1704b5e327dcea691685e7c1e84f90575a3ddd52 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 27 Feb 2025 14:54:03 -0400 Subject: [PATCH 27/53] refactoring --- Backend.hs | 12 +++++++++++- Backend/VURL/Utilities.hs | 8 ++++++++ Remote/Web.hs | 19 ++++++------------- 3 files changed, 25 insertions(+), 14 deletions(-) diff --git a/Backend.hs b/Backend.hs index 4a7ace6524..de4c7bbee8 100644 --- a/Backend.hs +++ b/Backend.hs @@ -10,13 +10,14 @@ module Backend ( builtinList, defaultBackend, - defaultHashBackend, + hashBackend, genKey, getBackend, chooseBackend, lookupBackendVariety, lookupBuiltinBackendVariety, maybeLookupBackendVariety, + unknownBackendVarietyMessage, isStableKey, isCryptographicallySecureKey, isCryptographicallySecure, @@ -54,6 +55,15 @@ defaultBackend = maybe cache return =<< Annex.getState Annex.backend valid name = not (null name) lookupname = lookupBackendVariety . parseKeyVariety . encodeBS +{- A hashing backend. Takes git config into account, but + - guarantees the backend is cryptographically secure. -} +hashBackend :: Annex Backend +hashBackend = do + db <- defaultBackend + return $ if isCryptographicallySecure db + then db + else defaultHashBackend + {- Generates a key for a file. -} genKey :: KeySource -> MeterUpdate -> Backend -> Annex (Key, Backend) genKey source meterupdate b = case B.genKey b of diff --git a/Backend/VURL/Utilities.hs b/Backend/VURL/Utilities.hs index 82e5939e7c..0fdb038ccb 100644 --- a/Backend/VURL/Utilities.hs +++ b/Backend/VURL/Utilities.hs @@ -56,3 +56,11 @@ generateEquivilantKey b f = , contentLocation = f , inodeCache = Nothing } + +recordVurlKey :: Backend -> OsPath -> Key -> [Key] -> Annex Bool +recordVurlKey b f key eks = generateEquivilantKey b f >>= \case + Nothing -> return False + Just ek -> do + unless (ek `elem` eks) $ + setEquivilantKey key ek + return True diff --git a/Remote/Web.hs b/Remote/Web.hs index 4728a64c6a..0fdad0e92c 100644 --- a/Remote/Web.hs +++ b/Remote/Web.hs @@ -30,7 +30,7 @@ import Annex.SpecialRemote.Config import Logs.Remote import Logs.EquivilantKeys import Backend -import Backend.VURL.Utilities (generateEquivilantKey) +import Backend.VURL.Utilities (recordVurlKey) import qualified Data.Map as M @@ -169,18 +169,11 @@ downloadKey urlincludeexclude key _af dest p vc = | otherwise = return (Just v) recordvurlkey eks = do - -- Make sure to pick a backend that is cryptographically - -- secure. - db <- defaultBackend - let b = if isCryptographicallySecure db - then db - else defaultHashBackend - generateEquivilantKey b dest >>= \case - Nothing -> return Nothing - Just ek -> do - unless (ek `elem` eks) $ - setEquivilantKey key ek - return (Just Verified) + b <- hashBackend + ifM (recordVurlKey b dest key eks) + ( return (Just Verified) + , return Nothing + ) uploadKey :: Key -> AssociatedFile -> Maybe OsPath -> MeterUpdate -> Annex () uploadKey _ _ _ _ = giveup "upload to web not supported" From e6ae5e8d564d514675ce85251d0243e4d2ad7af6 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 27 Feb 2025 15:12:29 -0400 Subject: [PATCH 28/53] many recompute improvements I've lost track of them all, but it includes: * Using the same key backend as was used in the original computation. * Fixing bug that prevented updating the source file key in the compute state * Handling --reproducible and --unreproducible. * recompute --original of a file using VURL, when the result is different, but the key remains the same, makes the object file be updated with the new content * Detecting some other ways the program behavior can change, just for completeness. * Also adds --backend to addcomputed. --- Command/AddComputed.hs | 43 +++++++++++++-------- Command/Recompute.hs | 69 ++++++++++++++++++++++++++++------ Remote/Compute.hs | 8 ++-- TODO-compute | 29 ++++++++------ doc/git-annex-addcomputed.mdwn | 4 ++ doc/git-annex-recompute.mdwn | 28 ++++++++------ 6 files changed, 127 insertions(+), 54 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 071015e014..20eacf954f 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -28,7 +28,7 @@ import qualified Data.Map as M import Data.Time.Clock cmd :: Command -cmd = notBareRepo $ +cmd = notBareRepo $ withAnnexOptions [backendOption] $ command "addcomputed" SectionCommon "add computed files to annex" (paramRepeating paramExpression) (seek <$$> optParser) @@ -96,11 +96,22 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (addComputed "adding" True r (reproducible o) Just fast) + (addComputed "adding" True r (reproducible o) chooseBackend Just fast) next $ return True -addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Maybe OsPath) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () -addComputed addaction stagefiles r reproducibleconfig destfile fast state tmpdir ts = do +addComputed + :: StringContainingQuotedPath + -> Bool + -> Remote + -> Maybe Reproducible + -> (OsPath -> Annex Backend) + -> (OsPath -> Maybe OsPath) + -> Bool + -> Remote.Compute.ComputeState + -> OsPath + -> NominalDiffTime + -> Annex () +addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fast state tmpdir ts = do let outputs = Remote.Compute.computeOutputs state when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." @@ -146,22 +157,24 @@ addComputed addaction stagefiles r reproducibleconfig destfile fast state tmpdir , contentLocation = outputfile' , inodeCache = Nothing } + genkey f p = do + backend <- choosebackend outputfile + fst <$> genKey (ks f) p backend + makelink f k = void $ makeLink f k Nothing + ingesthelper f p mk + | stagefiles = ingestwith $ do + k <- maybe (genkey f p) return mk + ingestAdd' p (Just (ld f)) (Just k) + | otherwise = ingestwith $ do + k <- maybe (genkey f p) return mk + mk' <- fst <$> ingest p (Just (ld f)) (Just k) + maybe noop (makelink f) mk' + return mk' ingestwith a = a >>= \case Nothing -> giveup "ingestion failed" Just k -> do logStatus NoLiveUpdate k InfoPresent return k - genkey f p = do - backend <- chooseBackend outputfile - fst <$> genKey (ks f) p backend - makelink f k = void $ makeLink f k Nothing - ingesthelper f p mk - | stagefiles = ingestwith $ - ingestAdd' p (Just (ld f)) mk - | otherwise = ingestwith $ do - mk' <- fst <$> ingest p (Just (ld f)) mk - maybe noop (makelink f) mk' - return mk' ldc = LockDownConfig { lockingFile = True diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 2193216d29..4a3c8355ad 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -14,10 +14,13 @@ import qualified Annex import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote +import Annex.Content import Annex.CatFile import Git.FilePath import Logs.Location import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) +import Backend (maybeLookupBackendVariety, unknownBackendVarietyMessage) +import Types.Key import qualified Data.Map as M @@ -62,7 +65,7 @@ seek' o = do start :: RecomputeOptions -> Maybe Remote -> SeekInput -> OsPath -> Key -> CommandStart start o (Just computeremote) si file key = - stopUnless (notElem (Remote.uuid computeremote) <$> loggedLocations key) $ + stopUnless (elem (Remote.uuid computeremote) <$> loggedLocations key) $ start' o computeremote si file key start o Nothing si file key = do rs <- catMaybes <$> (mapM Remote.byUUID =<< loggedLocations key) @@ -103,31 +106,73 @@ start' o r si file key = -- explains the problem. Nothing -> True --- TODO When reproducible is not set, preserve the --- reproducible/unreproducible of the input key. perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform -perform o r file key origstate = do +perform o r file origkey origstate = do program <- Remote.Compute.getComputeProgram r - fast <- Annex.getRead Annex.fast + reproducibleconfig <- getreproducibleconfig showOutput Remote.Compute.runComputeProgram program origstate - (Remote.Compute.ImmutableState True) - (getinputcontent program fast) - (addComputed "processing" False r (reproducible o) destfile fast) + (Remote.Compute.ImmutableState False) + (getinputcontent program) + (go program reproducibleconfig) next $ return True where - getinputcontent program fast p + go program reproducibleconfig state tmpdir ts = do + checkbehaviorchange program state + addComputed "processing" False r reproducibleconfig + choosebackend destfile state tmpdir ts + + checkbehaviorchange program state = do + let check s w a b = forM_ (M.keys (w a)) $ \f -> + unless (M.member f (w b)) $ + Remote.Compute.computationBehaviorChangeError program s f + + check "not using input file" + Remote.Compute.computeInputs origstate state + check "outputting" + Remote.Compute.computeOutputs state origstate + check "not outputting" + Remote.Compute.computeOutputs origstate state + + getinputcontent program p | originalOption o = case M.lookup p (Remote.Compute.computeInputs origstate) of - Just inputkey -> getInputContent' fast inputkey + Just inputkey -> getInputContent' False inputkey (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p - | otherwise = getInputContent fast p + | otherwise = getInputContent False p destfile outputfile | Just outputfile == origfile = Just file | otherwise = Nothing - origfile = headMaybe $ M.keys $ M.filter (== Just key) + origfile = headMaybe $ M.keys $ M.filter (== Just origkey) (Remote.Compute.computeOutputs origstate) + + origbackendvariety = fromKey keyVariety origkey + + getreproducibleconfig = case reproducible o of + Just (Reproducible True) -> return (Just (Reproducible True)) + -- A VURL key is used when the computation was + -- unreproducible. So recomputing should too, but that + -- will result in the same VURL key. Since moveAnnex + -- will prefer the current annex object to a new one, + -- delete the annex object first, so that if recomputing + -- generates a new version of the file, it replaces + -- the old version. + v -> case origbackendvariety of + VURLKey -> do + lockContentForRemoval origkey noop removeAnnex + -- in case computation fails or is interupted + logStatus NoLiveUpdate origkey InfoMissing + return (Just (Reproducible False)) + _ -> return v + + choosebackend _outputfile + -- Use the same backend as was used to compute it before, + -- so if the computed file is the same, there will be + -- no change. + | otherwise = maybeLookupBackendVariety origbackendvariety >>= \case + Just b -> return b + Nothing -> giveup $ unknownBackendVarietyMessage origbackendvariety diff --git a/Remote/Compute.hs b/Remote/Compute.hs index e3ec2a8fdd..a8a3cdd32e 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -399,8 +399,8 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hPutStrLn (stdinHandle p) $ maybe "" fromOsPath mp' liftIO $ hFlush (stdinHandle p) - return $ if knowninput - then state' + return $ if immutablestate + then state else state' { computeInputs = M.insert f' k @@ -411,8 +411,8 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) checksafefile tmpdir subdir f' "output" let knownoutput = M.member f' (computeOutputs state') checkimmutable knownoutput "outputting" f' $ - return $ if knownoutput - then state' + return $ if immutablestate + then state else state' { computeOutputs = M.insert f' Nothing diff --git a/TODO-compute b/TODO-compute index 31d8aaa7b2..fe128b0e4d 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,3 +1,20 @@ +* VURL keys don't currently have the hash key recorded in the equivilant + key log by addcompute or when getting from a compute remote. + +* need progress bars for computations and implement PROGRESS message + +* get input files for a computation (so `git-annex get .` gets every file, + even when input files in a directory are processed after computed files) + +* autoinit security + +* Support non-annexed files as inputs to computations. + +* addcomputed should honor annex.addunlocked. + +* Perhaps recompute should write a new version of a file as an unlocked + file when the file is currently unlocked? + * recompute could ingest keys for other files than the one being recomputed, and remember them. Then recomputing those files could just use those keys, without re-running a computation. (Better than --others @@ -18,18 +35,6 @@ Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. -* recompute should use the same key backend for a file that it used before - (except when --reproducible/--unreproducible is passed). - -* Check recompute's handling of --reproducible and --unreproducible. - -* addcomputed should honor annex.addunlocked. - -* Perhaps recompute should write a new version of a file as an unlocked - file when the file is currently unlocked? - -* Support non-annexed files as inputs to computations. - * Should addcomputed honor annex.smallfiles? That would seem to imply that recompute should also support recomputing non-annexed files. Otherwise, adding a file and then recomputing it would vary in diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 58261da181..3301381c66 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -82,6 +82,10 @@ the parameters provided to `git-annex addcomputed`. checksum verification error. One thing that can be done then is to use `git-annex recompute --original --unreproducible`. +* `--backend` + + Specifies which key-value backend to use. + * Also the [[git-annex-common-options]](1) can be used. # SEE ALSO diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index b65488bae8..fb895aa75c 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -21,17 +21,7 @@ updated with the new content. * `--original` - Use the original content of input files. - -* `--unreproducible`, `-u` - - Convert files that were added with `git-annex addcomputed --reproducible` - to be as if they were added without that option. - -* `--reproducible`, `-r` - - Convert files that were added with `git-annex addcomputed --unreproducible` - to be as if they were added with `--reproducible`. + Re-run the computation with the original input files. * `--remote=name` @@ -42,6 +32,22 @@ updated with the new content. a file can be computed by multiple remotes, the one with the lowest configured cost will be used. +* `--unreproducible`, `-u` + + Indicate that the computation is not expected to be fully reproducible. + It can vary, in ways that produce files that equivilant enough to + be interchangeable, but are not necessarily identical. + + This is the default unless the compute remote indicates that it produces + reproducible output. + +* `--reproducible`, `-r` + + Indicate that the computation is expected to be fully reproducible. + + This is the default when the compute remote indicates that it produces + reproducible output. + * matching options The [[git-annex-matching-options]](1) can be used to control what From d2091730e931f5b77e57b33ce58ff7d6cf65fa7c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 27 Feb 2025 16:17:42 -0400 Subject: [PATCH 29/53] refactor --- Backend/VURL/Utilities.hs | 25 ------------------------- Logs/EquivilantKeys.hs | 36 +++++++++++++++++++++++++++++++++++- Remote/Web.hs | 6 +----- 3 files changed, 36 insertions(+), 31 deletions(-) diff --git a/Backend/VURL/Utilities.hs b/Backend/VURL/Utilities.hs index 0fdb038ccb..46b06c41b8 100644 --- a/Backend/VURL/Utilities.hs +++ b/Backend/VURL/Utilities.hs @@ -10,10 +10,8 @@ module Backend.VURL.Utilities where import Annex.Common import Types.Key import Types.Backend -import Types.KeySource import Logs.EquivilantKeys import qualified Backend.Hash -import Utility.Metered migrateFromURLToVURL :: Key -> Backend -> AssociatedFile -> Bool -> Annex (Maybe Key) migrateFromURLToVURL oldkey newbackend _af inannex @@ -41,26 +39,3 @@ migrateFromVURLToURL oldkey newbackend _af _ (keyData oldkey) { keyVariety = URLKey } | otherwise = return Nothing - --- The Backend must use a cryptographically secure hash. -generateEquivilantKey :: Backend -> OsPath -> Annex (Maybe Key) -generateEquivilantKey b f = - case genKey b of - Just genkey -> do - showSideAction (UnquotedString Backend.Hash.descChecksum) - Just <$> genkey source nullMeterUpdate - Nothing -> return Nothing - where - source = KeySource - { keyFilename = mempty -- avoid adding any extension - , contentLocation = f - , inodeCache = Nothing - } - -recordVurlKey :: Backend -> OsPath -> Key -> [Key] -> Annex Bool -recordVurlKey b f key eks = generateEquivilantKey b f >>= \case - Nothing -> return False - Just ek -> do - unless (ek `elem` eks) $ - setEquivilantKey key ek - return True diff --git a/Logs/EquivilantKeys.hs b/Logs/EquivilantKeys.hs index 0a0117301e..32accda777 100644 --- a/Logs/EquivilantKeys.hs +++ b/Logs/EquivilantKeys.hs @@ -1,6 +1,6 @@ {- Logs listing keys that are equivalent to a key. - - - Copyright 2024 Joey Hess + - Copyright 2024-2025 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -10,6 +10,8 @@ module Logs.EquivilantKeys ( getEquivilantKeys, setEquivilantKey, + updateEquivilantKeys, + generateEquivilantKey, ) where import Annex.Common @@ -17,6 +19,11 @@ import qualified Annex import Logs import Logs.Presence import qualified Annex.Branch +import qualified Backend.Hash +import Types.KeySource +import Types.Backend +import Types.Remote (Verification(..)) +import Utility.Metered getEquivilantKeys :: Key -> Annex [Key] getEquivilantKeys key = do @@ -29,3 +36,30 @@ setEquivilantKey key equivkey = do config <- Annex.getGitConfig addLog (Annex.Branch.RegardingUUID []) (equivilantKeysLogFile config key) InfoPresent (LogInfo (serializeKey' equivkey)) + +-- The Backend must use a cryptographically secure hash. +-- +-- This returns Verified when when an equivilant key has been added to the +-- log (or was already in the log). This is to avoid hashing the object +-- again later. +updateEquivilantKeys :: Backend -> OsPath -> Key -> [Key] -> Annex (Maybe Verification) +updateEquivilantKeys b obj key eks = generateEquivilantKey b obj >>= \case + Nothing -> return Nothing + Just ek -> do + unless (ek `elem` eks) $ + setEquivilantKey key ek + return (Just Verified) + +generateEquivilantKey :: Backend -> OsPath -> Annex (Maybe Key) +generateEquivilantKey b obj = + case genKey b of + Just genkey -> do + showSideAction (UnquotedString Backend.Hash.descChecksum) + Just <$> genkey source nullMeterUpdate + Nothing -> return Nothing + where + source = KeySource + { keyFilename = mempty -- avoid adding any extension + , contentLocation = obj + , inodeCache = Nothing + } diff --git a/Remote/Web.hs b/Remote/Web.hs index 0fdad0e92c..a097782efe 100644 --- a/Remote/Web.hs +++ b/Remote/Web.hs @@ -30,7 +30,6 @@ import Annex.SpecialRemote.Config import Logs.Remote import Logs.EquivilantKeys import Backend -import Backend.VURL.Utilities (recordVurlKey) import qualified Data.Map as M @@ -170,10 +169,7 @@ downloadKey urlincludeexclude key _af dest p vc = recordvurlkey eks = do b <- hashBackend - ifM (recordVurlKey b dest key eks) - ( return (Just Verified) - , return Nothing - ) + updateEquivilantKeys b dest key eks uploadKey :: Key -> AssociatedFile -> Maybe OsPath -> MeterUpdate -> Annex () uploadKey _ _ _ _ = giveup "upload to web not supported" From b813549b2d256cfd73b0b0491bbb4e2c5af7ee6c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 27 Feb 2025 16:18:04 -0400 Subject: [PATCH 30/53] fix build --- Command/Recompute.hs | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 4a3c8355ad..ecbc0fde56 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -10,7 +10,6 @@ module Command.Recompute where import Command -import qualified Annex import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote @@ -120,7 +119,7 @@ perform o r file origkey origstate = do go program reproducibleconfig state tmpdir ts = do checkbehaviorchange program state addComputed "processing" False r reproducibleconfig - choosebackend destfile state tmpdir ts + choosebackend destfile False state tmpdir ts checkbehaviorchange program state = do let check s w a b = forM_ (M.keys (w a)) $ \f -> From 2bd64059f18c1ca325036b6e86ec020b74caa8bf Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 27 Feb 2025 16:19:41 -0400 Subject: [PATCH 31/53] record VURL key hashes when getting from compute remote Like when getting from the web special remote, when the output of the computation has changed, record the new hash of the content as an equivilant key for the VURL key. Still needs to be done for addcomputed and recompute. --- Remote/Compute.hs | 35 ++++++++++++++++++++++++----------- TODO-compute | 2 +- 2 files changed, 25 insertions(+), 12 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index a8a3cdd32e..84170fc5dd 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -36,6 +36,7 @@ import Annex.UUID import Annex.Content import Annex.Tmp import Logs.MetaData +import Logs.EquivilantKeys import Utility.Metered import Utility.TimeStamp import Utility.Env @@ -44,6 +45,8 @@ import Utility.Url import Utility.MonotonicClock import qualified Git import qualified Utility.SimpleProtocol as Proto +import Types.Key +import Backend import Network.HTTP.Types.URI import Data.Time.Clock @@ -447,7 +450,7 @@ computationBehaviorChangeError (ComputeProgram program) requestdesc p = giveup $ program ++ " is not behaving the same way it used to, now " ++ requestdesc ++ ": " ++ fromOsPath p computeKey :: RemoteStateHandle -> ComputeProgram -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification -computeKey rs (ComputeProgram program) k af dest p vc = +computeKey rs (ComputeProgram program) k _af dest p vc = getComputeState rs k >>= \case Just state -> case computeskey state of @@ -475,28 +478,38 @@ computeKey rs (ComputeProgram program) k af dest p vc = (keyfile : _) -> Just keyfile [] -> Nothing - go keyfile state tmpdir ts = do + go keyfile state tmpdir _ts = do + hb <- hashBackend + let updatevurl key getobj = + if (fromKey keyVariety key == VURLKey) + then do + obj <- getobj + updateEquivilantKeys hb obj key + =<< getEquivilantKeys key + else return Nothing + let keyfile' = tmpdir keyfile unlessM (liftIO $ doesFileExist keyfile') $ giveup $ program ++ " exited sucessfully, but failed to write the computed file" catchNonAsync (liftIO $ moveFile keyfile' dest) (\err -> giveup $ "failed to move the computed file: " ++ show err) - + mverification <- updatevurl k (pure dest) + -- Try to move any other computed object files into the annex. forM_ (M.toList $ computeOutputs state) $ \case (file, (Just key)) -> when (k /= key) $ do let file' = tmpdir file - whenM (liftIO $ doesFileExist file') $ - whenM (verifyKeyContentPostRetrieval RetrievalAllKeysSecure vc verification k file') $ - void $ tryNonAsync $ moveAnnex k file' + whenM (liftIO $ doesFileExist file') $ do + whenM (verifyKeyContentPostRetrieval RetrievalAllKeysSecure vc MustVerify key file') $ do + moved <- moveAnnex key file' `catchNonAsync` const (pure False) + when moved $ + void $ updatevurl key (calcRepo (gitAnnexLocation key)) _ -> noop - return verification - - -- The program might not be reproducible, so require strong - -- verification. - verification = MustVerify + -- The program might not be reproducible, + -- so require strong verification. + return $ fromMaybe MustVerify mverification -- Make sure that the compute state exists. checkKey :: RemoteStateHandle -> Key -> Annex Bool diff --git a/TODO-compute b/TODO-compute index fe128b0e4d..1f3ac6c9d5 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,5 +1,5 @@ * VURL keys don't currently have the hash key recorded in the equivilant - key log by addcompute or when getting from a compute remote. + key log by addcompute * need progress bars for computations and implement PROGRESS message From 63d73d8d1b874c5a77d3d3329722b357dcaa1854 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 10:57:56 -0400 Subject: [PATCH 32/53] record VURL key hashes in addcomputed and recompute --- Command/AddComputed.hs | 6 ++++++ Logs/EquivilantKeys.hs | 9 +++++++-- Remote/Compute.hs | 5 +---- TODO-compute | 3 --- 4 files changed, 14 insertions(+), 9 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 20eacf954f..857e495ad0 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -19,8 +19,10 @@ import Annex.CatFile import Annex.Content.Presence import Annex.Ingest import Types.KeySource +import Types.Key import Messages.Progress import Logs.Location +import Logs.EquivilantKeys import Utility.Metered import Backend.URL (fromUrl) @@ -174,6 +176,10 @@ addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fas Nothing -> giveup "ingestion failed" Just k -> do logStatus NoLiveUpdate k InfoPresent + when (fromKey keyVariety k == VURLKey) $ do + hb <- hashBackend + void $ addEquivilantKey hb k + =<< calcRepo (gitAnnexLocation k) return k ldc = LockDownConfig diff --git a/Logs/EquivilantKeys.hs b/Logs/EquivilantKeys.hs index 32accda777..b238675724 100644 --- a/Logs/EquivilantKeys.hs +++ b/Logs/EquivilantKeys.hs @@ -11,6 +11,7 @@ module Logs.EquivilantKeys ( getEquivilantKeys, setEquivilantKey, updateEquivilantKeys, + addEquivilantKey, generateEquivilantKey, ) where @@ -37,8 +38,6 @@ setEquivilantKey key equivkey = do addLog (Annex.Branch.RegardingUUID []) (equivilantKeysLogFile config key) InfoPresent (LogInfo (serializeKey' equivkey)) --- The Backend must use a cryptographically secure hash. --- -- This returns Verified when when an equivilant key has been added to the -- log (or was already in the log). This is to avoid hashing the object -- again later. @@ -50,6 +49,12 @@ updateEquivilantKeys b obj key eks = generateEquivilantKey b obj >>= \case setEquivilantKey key ek return (Just Verified) +addEquivilantKey :: Backend -> Key -> OsPath -> Annex (Maybe Verification) +addEquivilantKey b key obj = + updateEquivilantKeys b obj key + =<< getEquivilantKeys key + +-- The Backend must use a cryptographically secure hash. generateEquivilantKey :: Backend -> OsPath -> Annex (Maybe Key) generateEquivilantKey b obj = case genKey b of diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 84170fc5dd..eaef6d44fb 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -482,10 +482,7 @@ computeKey rs (ComputeProgram program) k _af dest p vc = hb <- hashBackend let updatevurl key getobj = if (fromKey keyVariety key == VURLKey) - then do - obj <- getobj - updateEquivilantKeys hb obj key - =<< getEquivilantKeys key + then addEquivilantKey hb key =<< getobj else return Nothing let keyfile' = tmpdir keyfile diff --git a/TODO-compute b/TODO-compute index 1f3ac6c9d5..dfa629ab8b 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,6 +1,3 @@ -* VURL keys don't currently have the hash key recorded in the equivilant - key log by addcompute - * need progress bars for computations and implement PROGRESS message * get input files for a computation (so `git-annex get .` gets every file, From 6ebab7fb00477ee473035bee9644b252f049e928 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 11:08:36 -0400 Subject: [PATCH 33/53] factor out Annex.GitShaKey --- Annex/Export.hs | 30 +----------------------------- Annex/GitShaKey.hs | 41 +++++++++++++++++++++++++++++++++++++++++ Annex/Import.hs | 2 +- Command/Export.hs | 1 + git-annex.cabal | 1 + 5 files changed, 45 insertions(+), 30 deletions(-) create mode 100644 Annex/GitShaKey.hs diff --git a/Annex/Export.hs b/Annex/Export.hs index 60039ef3b9..4ce30e2fca 100644 --- a/Annex/Export.hs +++ b/Annex/Export.hs @@ -11,16 +11,13 @@ module Annex.Export where import Annex import Annex.CatFile +import Annex.GitShaKey import Types -import Types.Key import qualified Git import qualified Types.Remote as Remote import Git.Quote import Messages -import Data.Maybe -import qualified Data.ByteString.Short as S (fromShort, toShort) - -- From a sha pointing to the content of a file to the key -- to use to export it. When the file is annexed, it's the annexed key. -- When the file is stored in git, it's a special type of key to indicate @@ -31,31 +28,6 @@ exportKey sha = mk <$> catKey sha mk (Just k) = k mk Nothing = gitShaKey sha --- Encodes a git sha as a key. This is used to represent a non-annexed --- file that is stored on a special remote, which necessarily needs a --- key. --- --- This is not the same as a SHA1 key, because the mapping needs to be --- bijective, also because git may not always use SHA1, and because git --- takes a SHA1 of the file size + content, while git-annex SHA1 keys --- only checksum the content. -gitShaKey :: Git.Sha -> Key -gitShaKey (Git.Ref s) = mkKey $ \kd -> kd - { keyName = S.toShort s - , keyVariety = OtherKey "GIT" - } - --- Reverse of gitShaKey -keyGitSha :: Key -> Maybe Git.Sha -keyGitSha k - | fromKey keyVariety k == OtherKey "GIT" = - Just (Git.Ref (S.fromShort (fromKey keyName k))) - | otherwise = Nothing - --- Is a key storing a git sha, and not used for an annexed file? -isGitShaKey :: Key -> Bool -isGitShaKey = isJust . keyGitSha - warnExportImportConflict :: Remote -> Annex () warnExportImportConflict r = do isimport <- Remote.isImportSupported r diff --git a/Annex/GitShaKey.hs b/Annex/GitShaKey.hs new file mode 100644 index 0000000000..1413039c51 --- /dev/null +++ b/Annex/GitShaKey.hs @@ -0,0 +1,41 @@ +{- Encoding a git sha as a Key + - + - Copyright 2017-2025 Joey Hess + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE OverloadedStrings #-} + +module Annex.GitShaKey where + +import Types +import Types.Key +import qualified Git + +import Data.Maybe +import qualified Data.ByteString.Short as S (fromShort, toShort) + +-- Encodes a git sha as a Key. This is used to represent a non-annexed +-- file. For example, when storing a git sha on a special remote. +-- +-- This is not the same as a SHA1 key, because the mapping needs to be +-- bijective, also because git may not always use SHA1, and because git +-- takes a SHA1 of the file size + content, while git-annex SHA1 keys +-- only checksum the content. +gitShaKey :: Git.Sha -> Key +gitShaKey (Git.Ref s) = mkKey $ \kd -> kd + { keyName = S.toShort s + , keyVariety = OtherKey "GIT" + } + +-- Reverse of gitShaKey +keyGitSha :: Key -> Maybe Git.Sha +keyGitSha k + | fromKey keyVariety k == OtherKey "GIT" = + Just (Git.Ref (S.fromShort (fromKey keyName k))) + | otherwise = Nothing + +-- Is a key storing a git sha, and not used for an annexed file? +isGitShaKey :: Key -> Bool +isGitShaKey = isJust . keyGitSha diff --git a/Annex/Import.hs b/Annex/Import.hs index 2e86df920d..2d2526a544 100644 --- a/Annex/Import.hs +++ b/Annex/Import.hs @@ -38,12 +38,12 @@ import qualified Annex import Annex.Link import Annex.LockFile import Annex.Content -import Annex.Export import Annex.RemoteTrackingBranch import Annex.HashObject import Annex.Transfer import Annex.CheckIgnore import Annex.CatFile +import Annex.GitShaKey import Annex.VectorClock import Annex.SpecialRemote.Config import Command diff --git a/Command/Export.hs b/Command/Export.hs index b4acaac401..3be1a67c93 100644 --- a/Command/Export.hs +++ b/Command/Export.hs @@ -26,6 +26,7 @@ import Types.Remote import Types.Export import Annex.Export import Annex.Content +import Annex.GitShaKey import Annex.Transfer import Annex.CatFile import Annex.FileMatcher diff --git a/git-annex.cabal b/git-annex.cabal index 88203be956..2123b73663 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -562,6 +562,7 @@ Executable git-annex Annex.FileMatcher Annex.Fixup Annex.GitOverlay + Annex.GitShaKey Annex.HashObject Annex.Hook Annex.Import From a0d6a6ea2a486081f2f3c561b0eb4757a17e9d96 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 11:59:04 -0400 Subject: [PATCH 34/53] support git files as input to computations Using GIT keys, like are used when exporting git files to special remotes. Except here the GIT key refers to a file checked into the git repo. Note that, since the compute remote uses catObject to get the content, a symlink that is checked into git does not get followed. This is important for security, because following a symlink and adding the content to the repo as an annex object would allow exfiltrating content from outside the repository. Instead, the behavior with a symlink is to run the computation on the symlink target. This may turn out to be confusing, and it might be worth addcomputed checking if the file in git is a symlink and erroring out. Or it could follow symlinks as long as the destination is a file in the repisitory. --- Command/AddComputed.hs | 42 ++++++++++++++++++++++------------ Git/Types.hs | 2 +- Remote/Compute.hs | 35 ++++++++++++++++++---------- TODO-compute | 4 ++-- doc/git-annex-addcomputed.mdwn | 4 ++-- 5 files changed, 56 insertions(+), 31 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 857e495ad0..b0127b10ba 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -11,6 +11,8 @@ module Command.AddComputed where import Command import qualified Git +import qualified Git.Types as Git +import qualified Git.Ref as Git import qualified Annex import qualified Remote.Compute import qualified Types.Remote as Remote @@ -18,6 +20,7 @@ import Backend import Annex.CatFile import Annex.Content.Presence import Annex.Ingest +import Annex.GitShaKey import Types.KeySource import Types.Key import Messages.Progress @@ -192,20 +195,31 @@ addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fas Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state -getInputContent :: Bool -> OsPath -> Annex (Key, Maybe OsPath) +getInputContent :: Bool -> OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath)) getInputContent fast p = catKeyFile p >>= \case - Just inputkey -> getInputContent' fast inputkey (fromOsPath p) - Nothing -> ifM (liftIO $ doesFileExist p) - ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p - , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p - ) + Just inputkey -> getInputContent' fast inputkey filedesc + Nothing -> inRepo (Git.fileRef p) >>= \case + Just fileref -> catObjectMetaData fileref >>= \case + Just (sha, _, t) + | t == Git.BlobObject -> + getInputContent' fast (gitShaKey sha) filedesc + | otherwise -> + badinput $ ", not a git " ++ decodeBS (Git.fmtObjectType t) + Nothing -> notcheckedin + Nothing -> notcheckedin + where + filedesc = fromOsPath p + badinput s = giveup $ "The computation needs an input file " ++ s ++ ": " ++ fromOsPath p + notcheckedin = badinput "that is not checked into the git repository" -getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe OsPath) -getInputContent' fast inputkey filedesc = do - obj <- calcRepo (gitAnnexLocation inputkey) - if fast - then return (inputkey, Nothing) - else ifM (inAnnex inputkey) - ( return (inputkey, Just obj) - , giveup $ "The computation needs the content of a file which is not present: " ++ filedesc +getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe (Either Git.Sha OsPath)) +getInputContent' fast inputkey filedesc + | fast = return (inputkey, Nothing) + | otherwise = case keyGitSha inputkey of + Nothing -> ifM (inAnnex inputkey) + ( do + obj <- calcRepo (gitAnnexLocation inputkey) + return (inputkey, Just (Right obj)) + , giveup $ "The computation needs the content of an annexed file which is not present: " ++ filedesc ) + Just sha -> return (inputkey, Just (Left sha)) diff --git a/Git/Types.hs b/Git/Types.hs index a32d07d4f7..1ad145452b 100644 --- a/Git/Types.hs +++ b/Git/Types.hs @@ -145,7 +145,7 @@ newtype RefDate = RefDate String {- Types of objects that can be stored in git. -} data ObjectType = BlobObject | CommitObject | TreeObject - deriving (Show) + deriving (Show, Eq) readObjectType :: S.ByteString -> Maybe ObjectType readObjectType "blob" = Just BlobObject diff --git a/Remote/Compute.hs b/Remote/Compute.hs index eaef6d44fb..564ecbda70 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -35,6 +35,8 @@ import Annex.SpecialRemote.Config import Annex.UUID import Annex.Content import Annex.Tmp +import Annex.GitShaKey +import Annex.CatFile import Logs.MetaData import Logs.EquivilantKeys import Utility.Metered @@ -43,10 +45,11 @@ import Utility.Env import Utility.Tmp.Dir import Utility.Url import Utility.MonotonicClock -import qualified Git -import qualified Utility.SimpleProtocol as Proto import Types.Key import Backend +import qualified Git +import qualified Utility.FileIO as F +import qualified Utility.SimpleProtocol as Proto import Network.HTTP.Types.URI import Data.Time.Clock @@ -341,7 +344,7 @@ runComputeProgram :: ComputeProgram -> ComputeState -> ImmutableState - -> (OsPath -> Annex (Key, Maybe OsPath)) + -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = @@ -395,12 +398,17 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) let knowninput = M.member f' (computeInputs state') checksafefile tmpdir subdir f' "input" checkimmutable knowninput "inputting" f' $ do - (k, mp) <- getinputcontent f' - mp' <- liftIO $ maybe (pure Nothing) - (Just <$$> relPathDirToFile subdir) - mp + (k, inputcontent) <- getinputcontent f' + mp <- case inputcontent of + Nothing -> pure Nothing + Just (Right f'') -> liftIO $ + Just <$> relPathDirToFile subdir f'' + Just (Left gitsha) -> do + liftIO . F.writeFile (subdir f') + =<< catObject gitsha + return (Just f') liftIO $ hPutStrLn (stdinHandle p) $ - maybe "" fromOsPath mp' + maybe "" fromOsPath mp liftIO $ hFlush (stdinHandle p) return $ if immutablestate then state @@ -467,10 +475,13 @@ computeKey rs (ComputeProgram program) k _af dest p vc = getinputcontent state f = case M.lookup (fromOsPath f) (computeInputs state) of - Just inputkey -> do - obj <- calcRepo (gitAnnexLocation inputkey) - -- XXX get input object when not present - return (inputkey, Just obj) + Just inputkey -> case keyGitSha inputkey of + Nothing -> do + obj <- calcRepo (gitAnnexLocation inputkey) + -- XXX get input object when not present + return (inputkey, Just (Right obj)) + Just gitsha -> + return (inputkey, Just (Left gitsha)) Nothing -> error "internal" computeskey state = diff --git a/TODO-compute b/TODO-compute index dfa629ab8b..b3f67016a7 100644 --- a/TODO-compute +++ b/TODO-compute @@ -5,13 +5,13 @@ * autoinit security -* Support non-annexed files as inputs to computations. - * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked file when the file is currently unlocked? +* compute on files in submodules + * recompute could ingest keys for other files than the one being recomputed, and remember them. Then recomputing those files could just use those keys, without re-running a computation. (Better than --others diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 3301381c66..faff1d96b6 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -8,8 +8,8 @@ git annex addcomputed `--to=remote -- ...` # DESCRIPTION -Adds files to the annex that are computed from input files, -using a compute special remote. +Adds files to the annex that are computed from input files in the +repository, using a compute special remote. Once a file has been added to a compute remote, commands like `git-annex get` will use it to compute the content of the file. From b01a0d232384c56fd53e3d23718e83d3a50ea5dc Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 14:56:49 -0400 Subject: [PATCH 35/53] avoid recomputing every time on git inputs --- Command/Recompute.hs | 26 ++++++++++++++++++-------- 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index ecbc0fde56..81fe35cbff 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -13,8 +13,10 @@ import Command import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote +import qualified Git.Ref as Git import Annex.Content import Annex.CatFile +import Annex.GitShaKey import Git.FilePath import Logs.Location import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) @@ -93,17 +95,25 @@ start' o r si file key = inputchanged state (inputfile, inputkey) = do -- Note that the paths from the remote state are not to be -- trusted to point to a file in the repository, but using - -- the path with catKeyFile will only succeed if it + -- the path with git cat-file will only succeed if it -- is checked into the repository. p <- fromRepo $ fromTopFilePath $ asTopFilePath $ Remote.Compute.computeSubdir state inputfile - catKeyFile p >>= return . \case - Just k -> k /= inputkey - -- When an input file is missing, go ahead and - -- recompute. This way, the user will see the - -- computation fail, with an error message that - -- explains the problem. - Nothing -> True + case keyGitSha inputkey of + Nothing -> + catKeyFile p >>= return . \case + Just k -> k /= inputkey + Nothing -> inputfilemissing + Just inputgitsha -> inRepo (Git.fileRef p) >>= \case + Just fileref -> catObjectMetaData fileref >>= return . \case + Just (sha, _, _) -> sha /= inputgitsha + Nothing -> inputfilemissing + Nothing -> return inputfilemissing + + -- When an input file is missing, go ahead and recompute. This way, + -- the user will see the computation fail, with an error message that + -- explains the problem. + inputfilemissing = True perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform perform o r file origkey origstate = do From 89bfeada8785c68491e015c7d7bc16876b16dacc Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 15:12:19 -0400 Subject: [PATCH 36/53] recompute: display one of the changed files --- Command/Recompute.hs | 32 +++++++++++++++++++------------- doc/git-annex-recompute.mdwn | 3 ++- 2 files changed, 21 insertions(+), 14 deletions(-) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 81fe35cbff..2eda098867 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -79,18 +79,22 @@ start' o r si file key = Remote.Compute.getComputeState (Remote.remoteStateHandle r) key >>= \case Nothing -> stop - Just state -> - stopUnless (shouldrecompute state) $ - starting "recompute" ai si $ - perform o r file key state + Just state -> shouldrecompute state >>= \case + Nothing -> stop + Just mreason -> starting "recompute" ai si $ do + maybe noop showNote mreason + perform o r file key state where ai = mkActionItem (key, file) shouldrecompute state - | originalOption o = return True - | otherwise = - anyM (inputchanged state) $ - M.toList (Remote.Compute.computeInputs state) + | originalOption o = return (Just Nothing) + | otherwise = firstM (inputchanged state) + (M.toList (Remote.Compute.computeInputs state)) + >>= return . \case + Nothing -> Nothing + Just (inputfile, _) -> Just $ Just $ + QuotedPath inputfile <> " changed" inputchanged state (inputfile, inputkey) = do -- Note that the paths from the remote state are not to be @@ -109,11 +113,13 @@ start' o r si file key = Just (sha, _, _) -> sha /= inputgitsha Nothing -> inputfilemissing Nothing -> return inputfilemissing - - -- When an input file is missing, go ahead and recompute. This way, - -- the user will see the computation fail, with an error message that - -- explains the problem. - inputfilemissing = True + where + -- When an input file is missing, go ahead and recompute. + -- This way, the user will see the computation fail, + -- with an error message that explains the problem. + -- Or, if the input file is only optionally used by the + -- computation, it might succeed. + inputfilemissing = True perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform perform o r file origkey origstate = do diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index fb895aa75c..0f8dd56901 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -15,7 +15,8 @@ By default, this only recomputes files whose input files have changed. The new contents of the input files are used to re-run the computation. When the output of the computation is different, the computed file is -updated with the new content. +updated with the new content. The updated file is written to the worktree, +but is not staged, in order to avoid overwriting any staged changes. # OPTIONS From f32d2aecceafe1730cc5d6ad684e0d4230e690ff Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 15:47:09 -0400 Subject: [PATCH 37/53] autoenable security for compute special remote Added annex.security.autoenable-compute-programs and only allow autoenabling special remotes that use compute programs on that list. The reason this is needed is a user might have some compute programs that are less safe to use than others. They might want to use an unsafe one only with one repository, where they are the only committer or other committers are trusted. They might be ok with others being used by any repository, and if so they can add them to the list. Another reason would be a user who has installed a compute program by accident. Eg, it might be included with git-annex at some point, or pulled in by some dependency. That user doesn't necessarily want that compute program to be used in an autoenabled special remote. --- Remote/Compute.hs | 18 +++++++++++++++++- TODO-compute | 2 -- Types/GitConfig.hs | 3 +++ .../compute_special_remote_interface.mdwn | 7 +++---- doc/git-annex.mdwn | 7 +++++++ doc/special_remotes/compute.mdwn | 6 ++++++ 6 files changed, 36 insertions(+), 7 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 564ecbda70..d43e745e95 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -23,6 +23,7 @@ module Remote.Compute ( ) where import Annex.Common +import qualified Annex import Types.Remote import Types.ProposedAccepted import Types.MetaData @@ -118,8 +119,23 @@ gen r u rc gc rs = case getComputeProgram' rc of } setupInstance :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) -setupInstance _ mu _ c _ = do +setupInstance ss mu _ c _ = do ComputeProgram program <- either giveup return $ getComputeProgram' c + case ss of + AutoEnable _ -> do + l <- maybe [] words + . annexAutoEnableComputePrograms + <$> Annex.getGitConfig + unless (program `elem` l) $ do + let remotename = fromMaybe "(unknown)" (lookupName c) + giveup $ unwords + [ "Not auto-enabling compute special remote" + , remotename + , "because its compute program" + , program + , " is not listed in annex.security.autoenable-compute-programs" + ] + _ -> noop unlessM (liftIO $ inSearchPath program) $ giveup $ "Cannot find " ++ program ++ " in PATH" u <- maybe (liftIO genUUID) return mu diff --git a/TODO-compute b/TODO-compute index b3f67016a7..547730914e 100644 --- a/TODO-compute +++ b/TODO-compute @@ -3,8 +3,6 @@ * get input files for a computation (so `git-annex get .` gets every file, even when input files in a directory are processed after computed files) -* autoinit security - * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 255778387f..6ea4503d1a 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -146,6 +146,7 @@ data GitConfig = GitConfig , annexAllowedUrlSchemes :: S.Set Scheme , annexAllowedIPAddresses :: String , annexAllowUnverifiedDownloads :: Bool + , annexAutoEnableComputePrograms :: Maybe String , annexMaxExtensionLength :: Maybe Int , annexMaxExtensions :: Maybe Int , annexJobs :: Concurrency @@ -261,6 +262,8 @@ extractGitConfig configsource r = GitConfig getmaybe (annexConfig "security.allowed-http-addresses") -- old name , annexAllowUnverifiedDownloads = (== Just "ACKTHPPT") $ getmaybe (annexConfig "security.allow-unverified-downloads") + , annexAutoEnableComputePrograms = + getmaybe (annexConfig "security.autoenable-compute-programs") , annexMaxExtensionLength = getmayberead (annexConfig "maxextensionlength") , annexMaxExtensions = getmayberead (annexConfig "maxextensions") , annexJobs = fromMaybe NonConcurrent $ diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index cd53a04aa1..8b62a601fa 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -26,13 +26,12 @@ For security, the program should avoid exposing user input to the shell unprotected, or otherwise executing it. The program is run in a temporary directory, which will be cleaned up after -it exits. Note that it may be run in a subdirectory of its temporary +it exits. Note that it may be run in a subdirectory of a temporary directory. This is done when `git-annex addcomputed` was run in a subdirectory of the git repository. -The content of any annexed file in the repository can be an input -to the computation. The program requests an input by writing a line to -stdout: +The content of any file in the repository can be an input to the +computation. The program requests an input by writing a line to stdout: INPUT file.raw diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index daed2be98a..2146104456 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -2201,6 +2201,13 @@ Remotes are configured using these settings in `.git/config`. Per-remote configuration of annex.security.allow-unverified-downloads. +* `annex.security.autoenable-compute-programs` + + This is a space separated list of compute programs eg + "git-annex-compute-foo git-annex-compute-bar". Listing a compute + program here allows compute special remotes that use that program to be + autoenabled. + # CONFIGURATION OF ASSISTANT * `annex.delayadd` diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index c3f4186008..811640e2f6 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -15,6 +15,12 @@ program to use to compute the contents of annexed files. It must start with "git-annex-compute-". The program needs to be installed somewhere in the `PATH`. +The `autoenable` parameter can be set to "true" like with other special +remotes to make git-annex automatically enable this special remote when +run in a new clone of the repository. However, for security, autoenabling +is only done when the git config `annex.security.autoenable-compute-programs` +includes the name of the compute program. + All other "field=value" parameters passed to `initremote` will be passed to the program when running [[git-annex-addcomputed]]. Note that when the program takes a dashed option, it can be provided after "--": From 52f51d065a09507e71e21fb814dfb6f22dd31ea8 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Mon, 3 Mar 2025 16:07:04 -0400 Subject: [PATCH 38/53] rename config to annex.security.allowed-compute-programs And require for enable as well as autoenable. It seemed asking for trouble for `git-annex enable foo` to use whatever compute program is stored in the git config, without verifying that the user wants that program to be used. Note that it would be good to allow `git-annex enable foo program=...` to be used without the program being in the git config. Not implemented yet though. --- Remote/Compute.hs | 26 ++++++++++++-------------- TODO-compute | 3 +++ Types/GitConfig.hs | 6 +++--- doc/git-annex.mdwn | 4 ++-- doc/special_remotes/compute.mdwn | 9 ++++----- 5 files changed, 24 insertions(+), 24 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index d43e745e95..2903f926b2 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -121,21 +121,19 @@ gen r u rc gc rs = case getComputeProgram' rc of setupInstance :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) setupInstance ss mu _ c _ = do ComputeProgram program <- either giveup return $ getComputeProgram' c + allowedprograms <- maybe [] words . annexAllowedComputePrograms + <$> Annex.getGitConfig case ss of - AutoEnable _ -> do - l <- maybe [] words - . annexAutoEnableComputePrograms - <$> Annex.getGitConfig - unless (program `elem` l) $ do - let remotename = fromMaybe "(unknown)" (lookupName c) - giveup $ unwords - [ "Not auto-enabling compute special remote" - , remotename - , "because its compute program" - , program - , " is not listed in annex.security.autoenable-compute-programs" - ] - _ -> noop + Init -> noop + _ -> unless (program `elem` allowedprograms) $ do + let remotename = fromMaybe "(unknown)" (lookupName c) + giveup $ unwords + [ "Not enabling compute special remote" + , remotename + , "because its compute program" + , program + , "is not listed in annex.security-allowed-compute-programs" + ] unlessM (liftIO $ inSearchPath program) $ giveup $ "Cannot find " ++ program ++ " in PATH" u <- maybe (liftIO genUUID) return mu diff --git a/TODO-compute b/TODO-compute index 547730914e..3d02d9cc00 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,3 +1,6 @@ +* allow git-annex enableremote with program= explicitly specified, + without checking annex.security.allowed-compute-programs + * need progress bars for computations and implement PROGRESS message * get input files for a computation (so `git-annex get .` gets every file, diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 6ea4503d1a..eeae1a0c7e 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -146,7 +146,7 @@ data GitConfig = GitConfig , annexAllowedUrlSchemes :: S.Set Scheme , annexAllowedIPAddresses :: String , annexAllowUnverifiedDownloads :: Bool - , annexAutoEnableComputePrograms :: Maybe String + , annexAllowedComputePrograms :: Maybe String , annexMaxExtensionLength :: Maybe Int , annexMaxExtensions :: Maybe Int , annexJobs :: Concurrency @@ -262,8 +262,8 @@ extractGitConfig configsource r = GitConfig getmaybe (annexConfig "security.allowed-http-addresses") -- old name , annexAllowUnverifiedDownloads = (== Just "ACKTHPPT") $ getmaybe (annexConfig "security.allow-unverified-downloads") - , annexAutoEnableComputePrograms = - getmaybe (annexConfig "security.autoenable-compute-programs") + , annexAllowedComputePrograms = + getmaybe (annexConfig "security.allowed-compute-programs") , annexMaxExtensionLength = getmayberead (annexConfig "maxextensionlength") , annexMaxExtensions = getmayberead (annexConfig "maxextensions") , annexJobs = fromMaybe NonConcurrent $ diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 2146104456..5a39aa3bfa 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -2201,12 +2201,12 @@ Remotes are configured using these settings in `.git/config`. Per-remote configuration of annex.security.allow-unverified-downloads. -* `annex.security.autoenable-compute-programs` +* `annex.security.allowed-compute-programs` This is a space separated list of compute programs eg "git-annex-compute-foo git-annex-compute-bar". Listing a compute program here allows compute special remotes that use that program to be - autoenabled. + enabled by `git-annex enableremote` or autoenabled. # CONFIGURATION OF ASSISTANT diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 811640e2f6..264cec825a 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -15,11 +15,10 @@ program to use to compute the contents of annexed files. It must start with "git-annex-compute-". The program needs to be installed somewhere in the `PATH`. -The `autoenable` parameter can be set to "true" like with other special -remotes to make git-annex automatically enable this special remote when -run in a new clone of the repository. However, for security, autoenabling -is only done when the git config `annex.security.autoenable-compute-programs` -includes the name of the compute program. +Any program can be passed to `git-annex initremote`. However, when enabling +a compute special remote later with `git-annex enableremote` or due to +"autoenable=true", the program must be listed in the git config +`annex.security.allowed-compute-programs`. All other "field=value" parameters passed to `initremote` will be passed to the program when running [[git-annex-addcomputed]]. Note that when the From b395bd4f56a7a8e3120278dc5bbfcc44fa512a5a Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 10:02:33 -0400 Subject: [PATCH 39/53] move showOutput into compute remote --- Command/AddComputed.hs | 1 - Remote/Compute.hs | 1 + 2 files changed, 1 insertion(+), 1 deletion(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index b0127b10ba..f54f2de802 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -97,7 +97,6 @@ perform o r = do , Remote.Compute.computeReproducible = False } fast <- Annex.getRead Annex.fast - showOutput Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 2903f926b2..60b2e30185 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -374,6 +374,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) , std_out = CreatePipe , env = Just environ } + showOutput starttime <- liftIO currentMonotonicTimestamp state' <- bracket (liftIO $ createProcess pr) From 4e6324131deef2c48e1da62fc7e16082cac18b06 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 11:06:58 -0400 Subject: [PATCH 40/53] compute remote: get input files from other remotes This needed some refactoring to avoid cycles, since Remote.Compute cannot import Remote.List. Instead, it uses Annex.remotes. Which must be populated by something else, but we know it has been, because something is using Remote.Compute, which it must have found in the remote list, which populates that. In Remote.Compute, keyPossibilities' is called with all loggedLocations, without the trustExclude DeadTrusted that keyLocations does. There is another cycle there. This may be a problem if a dead repository is still a remote. This is missing cycle prevention, and it's certianly possible to make 2 files in the compute remote co-depend on one-another. Hopefully not in a real world situation, but it an attacker could certainly do it. Cycle prevention will need to be added to this. --- Remote.hs | 36 ++++------------------------------ Remote/Compute.hs | 34 ++++++++++++++++++++++++++++---- Remote/List/Util.hs | 48 ++++++++++++++++++++++++++++++++++++++++++++- TODO-compute | 3 +++ 4 files changed, 84 insertions(+), 37 deletions(-) diff --git a/Remote.hs b/Remote.hs index cfe771bb12..ab75383cfa 100644 --- a/Remote.hs +++ b/Remote.hs @@ -319,22 +319,11 @@ remoteFromUUID u = ifM ((==) u <$> getUUID) remotesChanged findinmap -{- Filters a list of remotes to ones that have the listed uuids. -} -remotesWithUUID :: [Remote] -> [UUID] -> [Remote] -remotesWithUUID rs us = filter (\r -> uuid r `elem` us) rs - -{- Filters a list of remotes to ones that do not have the listed uuids. -} -remotesWithoutUUID :: [Remote] -> [UUID] -> [Remote] -remotesWithoutUUID rs us = filter (\r -> uuid r `notElem` us) rs - {- List of repository UUIDs that the location log indicates may have a key. - Dead repositories are excluded. -} keyLocations :: Key -> Annex [UUID] keyLocations key = trustExclude DeadTrusted =<< loggedLocations key -{- Whether to include remotes that have annex-ignore set. -} -newtype IncludeIgnored = IncludeIgnored Bool - {- Cost ordered lists of remotes that the location log indicates - may have a key. - @@ -342,33 +331,16 @@ newtype IncludeIgnored = IncludeIgnored Bool -} keyPossibilities :: IncludeIgnored -> Key -> Annex [Remote] keyPossibilities ii key = do - u <- getUUID - -- uuids of all remotes that are recorded to have the key - locations <- filter (/= u) <$> keyLocations key - speclocations <- map uuid - . filter (remoteAnnexSpeculatePresent . gitconfig) - <$> remoteList - -- there are unlikely to be many speclocations, so building a Set - -- is not worth the expense - let locations' = speclocations ++ filter (`notElem` speclocations) locations - fst <$> remoteLocations ii locations' [] + locations <- keyLocations key + keyPossibilities' ii key locations =<< remoteList {- Given a list of locations of a key, and a list of all - trusted repositories, generates a cost-ordered list of - remotes that contain the key, and a list of trusted locations of the key. -} remoteLocations :: IncludeIgnored -> [UUID] -> [UUID] -> Annex ([Remote], [UUID]) -remoteLocations (IncludeIgnored ii) locations trusted = do - let validtrustedlocations = nub locations `intersect` trusted - - -- remotes that match uuids that have the key - allremotes <- remoteList - >>= if not ii - then filterM (not <$$> liftIO . getDynamicConfig . remoteAnnexIgnore . gitconfig) - else return - let validremotes = remotesWithUUID allremotes locations - - return (sortBy (comparing cost) validremotes, validtrustedlocations) +remoteLocations ii locations trusted = + remoteLocations' ii locations trusted =<< remoteList {- Displays known locations of a key and helps the user take action - to make them accessible. -} diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 60b2e30185..8cc23a6f44 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -32,14 +32,17 @@ import Config import Config.Cost import Remote.Helper.Special import Remote.Helper.ExportImport +import Remote.List.Util import Annex.SpecialRemote.Config import Annex.UUID import Annex.Content import Annex.Tmp import Annex.GitShaKey import Annex.CatFile +import qualified Annex.Transfer import Logs.MetaData import Logs.EquivilantKeys +import Logs.Location import Utility.Metered import Utility.TimeStamp import Utility.Env @@ -359,6 +362,8 @@ runComputeProgram -> ComputeState -> ImmutableState -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) + -- ^ get input file's content, or Nothing when adding a computation + -- without actually performing it -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = @@ -491,13 +496,34 @@ computeKey rs (ComputeProgram program) k _af dest p vc = getinputcontent state f = case M.lookup (fromOsPath f) (computeInputs state) of Just inputkey -> case keyGitSha inputkey of - Nothing -> do - obj <- calcRepo (gitAnnexLocation inputkey) - -- XXX get input object when not present - return (inputkey, Just (Right obj)) + Nothing -> + let retkey = do + obj <- calcRepo (gitAnnexLocation inputkey) + return (inputkey, Just (Right obj)) + in ifM (inAnnex inputkey) + ( retkey + , do + getinputcontent' f inputkey + retkey + ) Just gitsha -> return (inputkey, Just (Left gitsha)) Nothing -> error "internal" + + getinputcontent' f inputkey = do + remotelist <- Annex.getState Annex.remotes + locs <- loggedLocations inputkey + rs <- keyPossibilities' (IncludeIgnored False) inputkey locs remotelist + if null rs + then return () + else void $ firstM (getinputcontentfrom f inputkey) rs + + -- TODO cycle prevention + getinputcontentfrom f inputkey r = do + showAction $ "getting input " <> QuotedPath f + <> " from " <> UnquotedString (name r) + Annex.Transfer.download r inputkey (AssociatedFile (Just f)) + Annex.Transfer.stdRetry Annex.Transfer.noNotification computeskey state = case M.keys $ M.filter (== Just k) (computeOutputs state) of diff --git a/Remote/List/Util.hs b/Remote/List/Util.hs index 382a98fa5d..866bd36867 100644 --- a/Remote/List/Util.hs +++ b/Remote/List/Util.hs @@ -1,6 +1,6 @@ {- git-annex remote list utils - - - Copyright 2011-2020 Joey Hess + - Copyright 2011-2025 Joey Hess - - Licensed under the GNU AGPL version 3 or higher. -} @@ -10,6 +10,11 @@ module Remote.List.Util where import Annex.Common import qualified Annex import qualified Git.Config +import Annex.UUID +import Types.Remote +import Config.DynamicConfig + +import Data.Ord {- Call when remotes have changed. Re-reads the git config, and - invalidates the cache so the remoteList will be re-generated next time @@ -22,3 +27,44 @@ remotesChanged = do , Annex.gitremotes = Nothing , Annex.repo = newg } + +{- Whether to include remotes that have annex-ignore set. -} +newtype IncludeIgnored = IncludeIgnored Bool + +keyPossibilities' + :: IncludeIgnored + -> Key + -> [UUID] + -- ^ uuids of remotes that are recorded to have the key + -> [Remote] + -- ^ all remotes + -> Annex [Remote] +keyPossibilities' ii key remotelocations rs = do + u <- getUUID + let locations = filter (/= u) remotelocations + let speclocations = map uuid + $ filter (remoteAnnexSpeculatePresent . gitconfig) rs + -- there are unlikely to be many speclocations, so building a Set + -- is not worth the expense + let locations' = speclocations ++ filter (`notElem` speclocations) locations + fst <$> remoteLocations' ii locations' [] rs + +remoteLocations' :: IncludeIgnored -> [UUID] -> [UUID] -> [Remote] -> Annex ([Remote], [UUID]) +remoteLocations' (IncludeIgnored ii) locations trusted rs = do + let validtrustedlocations = nub locations `intersect` trusted + + -- remotes that match uuids that have the key + allremotes <- if not ii + then filterM (not <$$> liftIO . getDynamicConfig . remoteAnnexIgnore . gitconfig) rs + else return rs + let validremotes = remotesWithUUID allremotes locations + + return (sortBy (comparing cost) validremotes, validtrustedlocations) + +{- Filters a list of remotes to ones that have the listed uuids. -} +remotesWithUUID :: [Remote] -> [UUID] -> [Remote] +remotesWithUUID rs us = filter (\r -> uuid r `elem` us) rs + +{- Filters a list of remotes to ones that do not have the listed uuids. -} +remotesWithoutUUID :: [Remote] -> [UUID] -> [Remote] +remotesWithoutUUID rs us = filter (\r -> uuid r `notElem` us) rs diff --git a/TODO-compute b/TODO-compute index 3d02d9cc00..c0a05ef8db 100644 --- a/TODO-compute +++ b/TODO-compute @@ -6,6 +6,8 @@ * get input files for a computation (so `git-annex get .` gets every file, even when input files in a directory are processed after computed files) + started implementation, but must avoid cycles! + * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked @@ -37,3 +39,4 @@ that recompute should also support recomputing non-annexed files. Otherwise, adding a file and then recomputing it would vary in what the content of the file is, depending on annex.smallfiles setting. + From 4b6fabae65f48a198bb6fa32b846ef9c3df7363d Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 12:43:50 -0400 Subject: [PATCH 41/53] better wording Avoids this contradiction: (Auto enabling special remote foo...) Not enabling compute special remote c2 because [..] --- Remote/Compute.hs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 8cc23a6f44..e4051bc877 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -131,7 +131,7 @@ setupInstance ss mu _ c _ = do _ -> unless (program `elem` allowedprograms) $ do let remotename = fromMaybe "(unknown)" (lookupName c) giveup $ unwords - [ "Not enabling compute special remote" + [ "Unable to enable compute special remote" , remotename , "because its compute program" , program From f4e0d6a04372872608951632710e97ec762879ea Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 12:51:38 -0400 Subject: [PATCH 42/53] update location log after getting input file from remote --- Remote/Compute.hs | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index e4051bc877..9e821ff9eb 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -39,6 +39,7 @@ import Annex.Content import Annex.Tmp import Annex.GitShaKey import Annex.CatFile +import Annex.RepoSize.LiveUpdate import qualified Annex.Transfer import Logs.MetaData import Logs.EquivilantKeys @@ -513,17 +514,19 @@ computeKey rs (ComputeProgram program) k _af dest p vc = getinputcontent' f inputkey = do remotelist <- Annex.getState Annex.remotes locs <- loggedLocations inputkey - rs <- keyPossibilities' (IncludeIgnored False) inputkey locs remotelist - if null rs + remotes <- keyPossibilities' (IncludeIgnored False) inputkey locs remotelist + if null remotes then return () - else void $ firstM (getinputcontentfrom f inputkey) rs + else void $ firstM (getinputcontentfrom f inputkey) remotes -- TODO cycle prevention getinputcontentfrom f inputkey r = do showAction $ "getting input " <> QuotedPath f <> " from " <> UnquotedString (name r) - Annex.Transfer.download r inputkey (AssociatedFile (Just f)) - Annex.Transfer.stdRetry Annex.Transfer.noNotification + lu <- prepareLiveUpdate Nothing inputkey AddingKey + logStatusAfter lu inputkey $ + Annex.Transfer.download r inputkey (AssociatedFile (Just f)) + Annex.Transfer.stdRetry Annex.Transfer.noNotification computeskey state = case M.keys $ M.filter (== Just k) (computeOutputs state) of From 51538fa0a8eb224b64cd28a57ea3ee59fb749d95 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 13:13:18 -0400 Subject: [PATCH 43/53] improve error message when unable to get an input file In this case, the compute program is run the same as if addcomputed --fast were used, so it should succeed, without outputting a computed file. computeInputsUnavailable is in ComputeState for simplicity, but it is not serialized with the rest of the ComputeState. --- Command/AddComputed.hs | 1 + Remote/Compute.hs | 37 +++++++++++++++++++++++-------------- 2 files changed, 24 insertions(+), 14 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index f54f2de802..226f2c0c08 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -95,6 +95,7 @@ perform o r = do , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir , Remote.Compute.computeReproducible = False + , Remote.Compute.computeInputsUnavailable = False } fast <- Annex.getRead Annex.fast Remote.Compute.runComputeProgram program state diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 9e821ff9eb..b54a196e6f 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -218,6 +218,7 @@ data ComputeState = ComputeState , computeOutputs :: M.Map OsPath (Maybe Key) , computeSubdir :: OsPath , computeReproducible :: Bool + , computeInputsUnavailable :: Bool } deriving (Show, Eq) @@ -261,7 +262,7 @@ parseComputeState k b = let st = go emptycomputestate (parseQuery b) in if st == emptycomputestate then Nothing else Just st where - emptycomputestate = ComputeState mempty mempty mempty "." False + emptycomputestate = ComputeState mempty mempty mempty "." False False go :: ComputeState -> [QueryItem] -> ComputeState go c [] = c { computeParams = reverse (computeParams c) } go c ((f, v):rest) = @@ -363,8 +364,8 @@ runComputeProgram -> ComputeState -> ImmutableState -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) - -- ^ get input file's content, or Nothing when adding a computation - -- without actually performing it + -- ^ get input file's content, or Nothing the input file's + -- content is not available -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = @@ -431,9 +432,13 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hPutStrLn (stdinHandle p) $ maybe "" fromOsPath mp liftIO $ hFlush (stdinHandle p) + let state'' = state' + { computeInputsUnavailable = + isNothing mp || computeInputsUnavailable state' + } return $ if immutablestate - then state - else state' + then state'' + else state'' { computeInputs = M.insert f' k (computeInputs state') @@ -444,7 +449,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) let knownoutput = M.member f' (computeOutputs state') checkimmutable knownoutput "outputting" f' $ return $ if immutablestate - then state + then state' else state' { computeOutputs = M.insert f' Nothing @@ -488,7 +493,7 @@ computeKey rs (ComputeProgram program) k _af dest p vc = state (ImmutableState True) (getinputcontent state) - (go keyfile) + (postcompute keyfile) Nothing -> missingstate Nothing -> missingstate where @@ -503,9 +508,10 @@ computeKey rs (ComputeProgram program) k _af dest p vc = return (inputkey, Just (Right obj)) in ifM (inAnnex inputkey) ( retkey - , do - getinputcontent' f inputkey - retkey + , ifM (getinputcontent' f inputkey) + ( retkey + , return (inputkey, Nothing) + ) ) Just gitsha -> return (inputkey, Just (Left gitsha)) @@ -515,9 +521,7 @@ computeKey rs (ComputeProgram program) k _af dest p vc = remotelist <- Annex.getState Annex.remotes locs <- loggedLocations inputkey remotes <- keyPossibilities' (IncludeIgnored False) inputkey locs remotelist - if null remotes - then return () - else void $ firstM (getinputcontentfrom f inputkey) remotes + anyM (getinputcontentfrom f inputkey) remotes -- TODO cycle prevention getinputcontentfrom f inputkey r = do @@ -533,7 +537,12 @@ computeKey rs (ComputeProgram program) k _af dest p vc = (keyfile : _) -> Just keyfile [] -> Nothing - go keyfile state tmpdir _ts = do + postcompute keyfile state tmpdir _ts + | computeInputsUnavailable state = + giveup "Input file(s) unavailable." + | otherwise = postcompute' keyfile state tmpdir + + postcompute' keyfile state tmpdir = do hb <- hashBackend let updatevurl key getobj = if (fromKey keyVariety key == VURLKey) From 1ee4d018f3e03de1e06f99a6dd2138c6141c864c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 14:06:55 -0400 Subject: [PATCH 44/53] cycle detection --- Remote/Compute.hs | 50 +++++++++++++++++++++++++++++++++++++++++++---- TODO-compute | 5 ----- 2 files changed, 46 insertions(+), 9 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index b54a196e6f..53f08c6cf9 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -518,12 +518,10 @@ computeKey rs (ComputeProgram program) k _af dest p vc = Nothing -> error "internal" getinputcontent' f inputkey = do - remotelist <- Annex.getState Annex.remotes - locs <- loggedLocations inputkey - remotes <- keyPossibilities' (IncludeIgnored False) inputkey locs remotelist + remotes <- avoidCycles [k] inputkey + =<< keyPossibilities inputkey anyM (getinputcontentfrom f inputkey) remotes - -- TODO cycle prevention getinputcontentfrom f inputkey r = do showAction $ "getting input " <> QuotedPath f <> " from " <> UnquotedString (name r) @@ -571,6 +569,50 @@ computeKey rs (ComputeProgram program) k _af dest p vc = -- The program might not be reproducible, -- so require strong verification. return $ fromMaybe MustVerify mverification + +keyPossibilities :: Key -> Annex [Remote] +keyPossibilities key = do + remotelist <- Annex.getState Annex.remotes + locs <- loggedLocations key + keyPossibilities' (IncludeIgnored False) key locs remotelist + +{- Filter out any remotes that, in order to compute the inputkey, would + - need to get the outputkey from some remote. + - + - This only finds cycles of compute special remotes, not any other + - similar type of special remote that might have its own input keys. + - There are no other such special remotes in git-annex itself, so this + - is the best that can be done. + - + - Note that, in a case where a compute special remote needs the outputkey + - to compute the inputkey, but could get the outputkey from either this + - remote, or some other, non-compute remote, that is filtered out as a + - cycle because it's not possible to prevent that remote getting from this + - remote. + -} +avoidCycles :: [Key] -> Key -> [Remote] -> Annex [Remote] +avoidCycles outputkeys inputkey = filterM go + where + go r + | iscomputeremote r = + getComputeState (remoteStateHandle r) inputkey >>= \case + Nothing -> return True + Just state + | inputsoutput state -> return False + | otherwise -> checkdeeper state + | otherwise = return True + + iscomputeremote r = remotetype r == remote + + inputsoutput state = not $ M.null $ + M.filter (`elem` outputkeys) + (computeInputs state) + + checkdeeper state = + flip allM (M.elems (computeInputs state)) $ \inputkey' -> do + rs <- keyPossibilities inputkey' + rs' <- avoidCycles (inputkey:outputkeys) inputkey' rs + return (rs' == rs) -- Make sure that the compute state exists. checkKey :: RemoteStateHandle -> Key -> Annex Bool diff --git a/TODO-compute b/TODO-compute index c0a05ef8db..3eff82e710 100644 --- a/TODO-compute +++ b/TODO-compute @@ -3,11 +3,6 @@ * need progress bars for computations and implement PROGRESS message -* get input files for a computation (so `git-annex get .` gets every file, - even when input files in a directory are processed after computed files) - - started implementation, but must avoid cycles! - * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked From a2fc471e14a9fcccb15e5265f6964b8a71b0399a Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 14:54:13 -0400 Subject: [PATCH 45/53] safer git sha object filename Rather than use the filename provided by INPUT, which could come from user input, and so could be something that looks like a dashed parameter, use a .git/object/ filename. This avoids user input passing through INPUT and back out, with the file path then passed to a command, which could do something unexpected with a dashed parameter, or other special parameter. Added a note in the design about being careful of passing user input to commands. They still have to be careful of that in general, just not in this case. --- Remote/Compute.hs | 18 ++++++++++++++---- .../compute_special_remote_interface.mdwn | 4 +++- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 53f08c6cf9..58e0ef6e8b 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -425,10 +425,9 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> pure Nothing Just (Right f'') -> liftIO $ Just <$> relPathDirToFile subdir f'' - Just (Left gitsha) -> do - liftIO . F.writeFile (subdir f') - =<< catObject gitsha - return (Just f') + Just (Left gitsha) -> + Just <$> (liftIO . relPathDirToFile subdir + =<< populategitsha gitsha tmpdir) liftIO $ hPutStrLn (stdinHandle p) $ maybe "" fromOsPath mp liftIO $ hFlush (stdinHandle p) @@ -479,6 +478,17 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = fromIntegral (endtime - starttime) :: NominalDiffTime + -- Writes to a .git/objects/ file in the tmpdir, rather than + -- using the input filename, to avoid exposing the input filename + -- to the program as a parameter, which could parse it as a dashed + -- option or other special parameter. + populategitsha gitsha tmpdir = do + let f = tmpdir ".git" "objects" + toOsPath (Git.fromRef' gitsha) + liftIO $ createDirectoryIfMissing True $ takeDirectory f + liftIO . F.writeFile f =<< catObject gitsha + return f + computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a computationBehaviorChangeError (ComputeProgram program) requestdesc p = giveup $ program ++ " is not behaving the same way it used to, now " ++ requestdesc ++ ": " ++ fromOsPath p diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 8b62a601fa..0dfd93e314 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -23,7 +23,9 @@ that is in the form "foo=bar" will also result in an environment variable being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. For security, the program should avoid exposing user input to the shell -unprotected, or otherwise executing it. +unprotected, or otherwise executing it. And when running a command, make +sure that whatever user input is passed to it can result in only safe and +expected behavior. The program is run in a temporary directory, which will be cleaned up after it exits. Note that it may be run in a subdirectory of a temporary From 2e77c2b76250878ea850fc10b478196ecaa2c601 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 15:02:02 -0400 Subject: [PATCH 46/53] update todo --- TODO-compute | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/TODO-compute b/TODO-compute index 3eff82e710..09cc853898 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,3 +1,13 @@ +* would be nice to have a way to see what computations are used by a + compute remote for a file. Put it in `whereis` output? But it's not an + url. Maybe a separate command? That would also allow querying for eg, + what files are inputs for another file. + +* "getting input from " message uses the original filename, + but that file might have been renamed. Would be more clear to use + whatever file in the tree currently points to the key it's getting + (what if there is not one?) + * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs From 17ce1b4e7ba2dfd3344eb175bdaabc869371d53d Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 15:46:30 -0400 Subject: [PATCH 47/53] mark unused parameter While unused, it seems to make sense to keep it, since it explains what the function is doing. --- Remote/List/Util.hs | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Remote/List/Util.hs b/Remote/List/Util.hs index 866bd36867..e022d23190 100644 --- a/Remote/List/Util.hs +++ b/Remote/List/Util.hs @@ -39,7 +39,7 @@ keyPossibilities' -> [Remote] -- ^ all remotes -> Annex [Remote] -keyPossibilities' ii key remotelocations rs = do +keyPossibilities' ii _key remotelocations rs = do u <- getUUID let locations = filter (/= u) remotelocations let speclocations = map uuid From 4a4a614b0d1a413368b6cd173c00e953a21b10b7 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Tue, 4 Mar 2025 15:50:15 -0400 Subject: [PATCH 48/53] OsPath build fixes --- Remote/Compute.hs | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 58e0ef6e8b..8c06dd9061 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -262,7 +262,15 @@ parseComputeState k b = let st = go emptycomputestate (parseQuery b) in if st == emptycomputestate then Nothing else Just st where - emptycomputestate = ComputeState mempty mempty mempty "." False False + emptycomputestate = ComputeState + { computeParams = mempty + , computeInputs = mempty + , computeOutputs = mempty + , computeSubdir = literalOsPath "." + , computeReproducible = False + , computeInputsUnavailable = False + } + go :: ComputeState -> [QueryItem] -> ComputeState go c [] = c { computeParams = reverse (computeParams c) } go c ((f, v):rest) = @@ -370,7 +378,7 @@ runComputeProgram -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = withOtherTmp $ \othertmpdir -> - withTmpDirIn othertmpdir "compute" go + withTmpDirIn othertmpdir (literalOsPath "compute") go where go tmpdir = do environ <- computeProgramEnvironment state @@ -483,7 +491,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) -- to the program as a parameter, which could parse it as a dashed -- option or other special parameter. populategitsha gitsha tmpdir = do - let f = tmpdir ".git" "objects" + let f = tmpdir literalOsPath ".git" literalOsPath "objects" toOsPath (Git.fromRef' gitsha) liftIO $ createDirectoryIfMissing True $ takeDirectory f liftIO . F.writeFile f =<< catObject gitsha @@ -510,7 +518,7 @@ computeKey rs (ComputeProgram program) k _af dest p vc = missingstate = giveup "Missing compute state" getinputcontent state f = - case M.lookup (fromOsPath f) (computeInputs state) of + case M.lookup f (computeInputs state) of Just inputkey -> case keyGitSha inputkey of Nothing -> let retkey = do From ccc454a791c303f04c2e38b6afdc03b02049d6c7 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Wed, 5 Mar 2025 13:46:06 -0400 Subject: [PATCH 49/53] computation progress display --- Command/AddComputed.hs | 1 + Command/Recompute.hs | 1 + Remote/Compute.hs | 85 +++++++++++++++++++++++++++++++----------- TODO-compute | 2 - 4 files changed, 66 insertions(+), 23 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 226f2c0c08..f05f3bdfcd 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -101,6 +101,7 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) + Nothing (addComputed "adding" True r (reproducible o) chooseBackend Just fast) next $ return True diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 2eda098867..8233bc87e7 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -129,6 +129,7 @@ perform o r file origkey origstate = do Remote.Compute.runComputeProgram program origstate (Remote.Compute.ImmutableState False) (getinputcontent program) + Nothing (go program reproducibleconfig) next $ return True where diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 8c06dd9061..8b64dee56e 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -44,6 +44,7 @@ import qualified Annex.Transfer import Logs.MetaData import Logs.EquivilantKeys import Logs.Location +import Messages.Progress import Utility.Metered import Utility.TimeStamp import Utility.Env @@ -59,6 +60,8 @@ import qualified Utility.SimpleProtocol as Proto import Network.HTTP.Types.URI import Data.Time.Clock import Text.Read +import Control.Concurrent.STM +import Control.Concurrent.Async import qualified Data.Map as M import qualified Data.Set as S import qualified Data.ByteString as B @@ -209,8 +212,10 @@ newtype PercentFloat = PercentFloat Float deriving (Show, Eq) instance Proto.Serializable PercentFloat where - serialize (PercentFloat p) = show p - deserialize s = PercentFloat <$> readMaybe s + serialize (PercentFloat p) = show p ++ "%" + deserialize s = do + s' <- reverse <$> stripPrefix "%" (reverse s) + PercentFloat <$> readMaybe s' data ComputeState = ComputeState { computeParams :: [String] @@ -374,9 +379,11 @@ runComputeProgram -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) -- ^ get input file's content, or Nothing the input file's -- content is not available + -> Maybe (Key, MeterUpdate) + -- ^ update meter for this key -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) -> Annex v -runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = +runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent meterkey cont = withOtherTmp $ \othertmpdir -> withTmpDirIn othertmpdir (literalOsPath "compute") go where @@ -391,10 +398,10 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) } showOutput starttime <- liftIO currentMonotonicTimestamp - state' <- bracket + state' <- withmeterfile $ \meterfile -> bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) - (getinput state tmpdir subdir) + (getinput tmpdir subdir state meterfile) endtime <- liftIO currentMonotonicTimestamp cont state' subdir (calcduration starttime endtime) @@ -408,13 +415,13 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) , return tmpdir ) - getinput state' tmpdir subdir p = + getinput tmpdir subdir state' meterfile p = liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> getinput state' tmpdir subdir p + | null l -> getinput tmpdir subdir state' meterfile p | otherwise -> do - state'' <- parseoutput p tmpdir subdir state' l - getinput state'' tmpdir subdir p + state'' <- parseoutput p tmpdir subdir state' meterfile l + getinput tmpdir subdir state'' meterfile p Nothing -> do liftIO $ hClose (stdoutHandle p) liftIO $ hClose (stdinHandle p) @@ -422,7 +429,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) giveup $ program ++ " exited unsuccessfully" return state' - parseoutput p tmpdir subdir state' l = case Proto.parseMessage l of + parseoutput p tmpdir subdir state' meterfile l = case Proto.parseMessage l of Just (ProcessInput f) -> do let f' = toOsPath f let knowninput = M.member f' (computeInputs state') @@ -453,7 +460,12 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" - let knownoutput = M.member f' (computeOutputs state') + knownoutput <- case M.lookup f' (computeOutputs state') of + Nothing -> return False + Just mk -> do + when (mk == fmap fst meterkey) $ + meterfile (subdir f') + return True checkimmutable knownoutput "outputting" f' $ return $ if immutablestate then state' @@ -463,12 +475,12 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) (computeOutputs state') } Just (ProcessProgress percent) -> do - -- XXX + liftIO $ updatepercent percent return state' Just ProcessReproducible -> return $ state' { computeReproducible = True } Nothing -> giveup $ - program ++ " output included an unparseable line: \"" ++ l ++ "\"" + program ++ " output an unparseable line: \"" ++ l ++ "\"" checksafefile tmpdir subdir f fileaction = do let err problem = giveup $ @@ -497,26 +509,57 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO . F.writeFile f =<< catObject gitsha return f + withmeterfile a = case meterkey of + Nothing -> a (const noop) + Just (_, progress) -> do + filev <- liftIO newEmptyTMVarIO + endv <- liftIO $ newEmptyTMVarIO + let meterfile = void . liftIO + . atomically . tryPutTMVar filev + let endmeterfile = atomically $ putTMVar endv () + tid <- liftIO $ async $ do + v <- liftIO $ atomically $ + (Right <$> takeTMVar filev) + `orElse` + (Left <$> takeTMVar endv) + case v of + Right f -> watchFileSize f progress $ \_ -> + void $ liftIO $ atomically $ + takeTMVar endv + Left () -> return () + a meterfile + `finally` liftIO (endmeterfile >> wait tid) + + updatepercent (PercentFloat percent) = case meterkey of + Nothing -> noop + Just (k, progress) -> case fromKey keySize k of + Nothing -> noop + Just sz -> + progress $ BytesProcessed $ floor $ + fromIntegral sz * percent / 100 + computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a computationBehaviorChangeError (ComputeProgram program) requestdesc p = giveup $ program ++ " is not behaving the same way it used to, now " ++ requestdesc ++ ": " ++ fromOsPath p computeKey :: RemoteStateHandle -> ComputeProgram -> Key -> AssociatedFile -> OsPath -> MeterUpdate -> VerifyConfig -> Annex Verification -computeKey rs (ComputeProgram program) k _af dest p vc = +computeKey rs (ComputeProgram program) k _af dest meterupdate vc = getComputeState rs k >>= \case - Just state -> + Just state -> case computeskey state of - Just keyfile -> runComputeProgram - (ComputeProgram program) - state - (ImmutableState True) - (getinputcontent state) - (postcompute keyfile) + Just keyfile -> go state keyfile Nothing -> missingstate Nothing -> missingstate where missingstate = giveup "Missing compute state" + go state keyfile = metered (Just meterupdate) k Nothing $ \_ p -> + runComputeProgram (ComputeProgram program) state + (ImmutableState True) + (getinputcontent state) + (Just (k, p)) + (postcompute keyfile) + getinputcontent state f = case M.lookup f (computeInputs state) of Just inputkey -> case keyGitSha inputkey of diff --git a/TODO-compute b/TODO-compute index 09cc853898..5b212695c6 100644 --- a/TODO-compute +++ b/TODO-compute @@ -11,8 +11,6 @@ * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs -* need progress bars for computations and implement PROGRESS message - * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked From c6c6e2632d6ed70cee22344f252ac74a26e66057 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 6 Mar 2025 12:41:30 -0400 Subject: [PATCH 50/53] avoid unncessary git-annex branch changes for recompute and addcomputed --- Command/AddComputed.hs | 32 ++++++++++------ Command/Recompute.hs | 35 +++++++++++++----- Remote/Compute.hs | 84 +++++++++++++++++++++++++----------------- 3 files changed, 96 insertions(+), 55 deletions(-) diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index f05f3bdfcd..4774caae9b 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -20,6 +20,7 @@ import Backend import Annex.CatFile import Annex.Content.Presence import Annex.Ingest +import Annex.UUID import Annex.GitShaKey import Types.KeySource import Types.Key @@ -94,35 +95,35 @@ perform o r = do , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir - , Remote.Compute.computeReproducible = False - , Remote.Compute.computeInputsUnavailable = False } fast <- Annex.getRead Annex.fast Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) Nothing - (addComputed "adding" True r (reproducible o) chooseBackend Just fast) + (addComputed (Just "adding") True r (reproducible o) chooseBackend Just fast) next $ return True addComputed - :: StringContainingQuotedPath + :: Maybe StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Annex Backend) -> (OsPath -> Maybe OsPath) -> Bool - -> Remote.Compute.ComputeState + -> Remote.Compute.ComputeProgramResult -> OsPath -> NominalDiffTime -> Annex () -addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fast state tmpdir ts = do - let outputs = Remote.Compute.computeOutputs state +addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fast result tmpdir ts = do when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." oks <- forM (M.keys outputs) $ \outputfile -> do - showAction $ addaction <> " " <> QuotedPath outputfile + case maddaction of + Just addaction -> showAction $ + addaction <> " " <> QuotedPath outputfile + Nothing -> noop k <- catchNonAsync (addfile outputfile) (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) return (outputfile, Just k) @@ -133,8 +134,15 @@ addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fas Remote.Compute.setComputeState (Remote.remoteStateHandle r) k ts state' - logChange NoLiveUpdate k (Remote.uuid r) InfoPresent + + let u = Remote.uuid r + unlessM (elem u <$> loggedLocations k) $ + logChange NoLiveUpdate k u InfoPresent where + state = Remote.Compute.computeState result + + outputs = Remote.Compute.computeOutputs state + addfile outputfile | fast = do case destfile outputfile of @@ -179,7 +187,9 @@ addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fas ingestwith a = a >>= \case Nothing -> giveup "ingestion failed" Just k -> do - logStatus NoLiveUpdate k InfoPresent + u <- getUUID + unlessM (elem u <$> loggedLocations k) $ + logStatus NoLiveUpdate k InfoPresent when (fromKey keyVariety k == VURLKey) $ do hb <- hashBackend void $ addEquivilantKey hb k @@ -194,7 +204,7 @@ addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fas isreproducible = case reproducibleconfig of Just v -> isReproducible v - Nothing -> Remote.Compute.computeReproducible state + Nothing -> Remote.Compute.computeReproducible result getInputContent :: Bool -> OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath)) getInputContent fast p = catKeyFile p >>= \case diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 8233bc87e7..6b21ce8ee7 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -15,6 +15,7 @@ import qualified Remote import qualified Types.Remote as Remote import qualified Git.Ref as Git import Annex.Content +import Annex.UUID import Annex.CatFile import Annex.GitShaKey import Git.FilePath @@ -131,12 +132,13 @@ perform o r file origkey origstate = do (getinputcontent program) Nothing (go program reproducibleconfig) - next $ return True + next cleanup where - go program reproducibleconfig state tmpdir ts = do - checkbehaviorchange program state - addComputed "processing" False r reproducibleconfig - choosebackend destfile False state tmpdir ts + go program reproducibleconfig result tmpdir ts = do + checkbehaviorchange program + (Remote.Compute.computeState result) + addComputed Nothing False r reproducibleconfig + choosebackend destfile False result tmpdir ts checkbehaviorchange program state = do let check s w a b = forM_ (M.keys (w a)) $ \f -> @@ -168,6 +170,10 @@ perform o r file origkey origstate = do origbackendvariety = fromKey keyVariety origkey + recomputingvurl = case origbackendvariety of + VURLKey -> True + _ -> False + getreproducibleconfig = case reproducible o of Just (Reproducible True) -> return (Just (Reproducible True)) -- A VURL key is used when the computation was @@ -177,13 +183,22 @@ perform o r file origkey origstate = do -- delete the annex object first, so that if recomputing -- generates a new version of the file, it replaces -- the old version. - v -> case origbackendvariety of - VURLKey -> do + v -> if recomputingvurl + then do lockContentForRemoval origkey noop removeAnnex - -- in case computation fails or is interupted - logStatus NoLiveUpdate origkey InfoMissing return (Just (Reproducible False)) - _ -> return v + else return v + + cleanup = do + case reproducible o of + Just (Reproducible True) -> noop + -- in case computation failed, update + -- location log for removal done earlier + _ -> when recomputingvurl $ do + u <- getUUID + unlessM (elem u <$> loggedLocations origkey) $ + logStatus NoLiveUpdate origkey InfoMissing + return True choosebackend _outputfile -- Use the same backend as was used to compute it before, diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 8b64dee56e..b6ec907bda 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -18,6 +18,7 @@ module Remote.Compute ( getComputeProgram, runComputeProgram, ImmutableState(..), + ComputeProgramResult(..), computationBehaviorChangeError, defaultComputeParams, ) where @@ -222,8 +223,6 @@ data ComputeState = ComputeState , computeInputs :: M.Map OsPath Key , computeOutputs :: M.Map OsPath (Maybe Key) , computeSubdir :: OsPath - , computeReproducible :: Bool - , computeInputsUnavailable :: Bool } deriving (Show, Eq) @@ -272,8 +271,6 @@ parseComputeState k b = , computeInputs = mempty , computeOutputs = mempty , computeSubdir = literalOsPath "." - , computeReproducible = False - , computeInputsUnavailable = False } go :: ComputeState -> [QueryItem] -> ComputeState @@ -330,16 +327,23 @@ computeStateUrl r st p = - The metadata fields are numbers (prefixed with "t" to make them legal - field names), which are estimates of how long it might take to run - the computation (in seconds). + - + - Avoids redundantly recording a ComputeState when the per remote metadata + - already contains it. -} setComputeState :: RemoteStateHandle -> Key -> NominalDiffTime -> ComputeState -> Annex () -setComputeState rs k ts st = addRemoteMetaData k rs $ MetaData $ M.singleton - (mkMetaFieldUnchecked $ T.pack ('t':show (truncateResolution 1 ts))) - (S.singleton (MetaValue (CurrentlySet True) (formatComputeState k st))) +setComputeState rs k ts st = do + l <- map snd <$> getComputeStatesUnsorted rs k + unless (st `elem` l) go + where + go = addRemoteMetaData k rs $ MetaData $ M.singleton + (mkMetaFieldUnchecked $ T.pack ('t':show (truncateResolution 1 ts))) + (S.singleton (MetaValue (CurrentlySet True) (formatComputeState k st))) {- When multiple ComputeStates have been recorded for the same key, - this returns one that is probably less expensive to compute, - based on the original time it took to compute it. -} -getComputeState:: RemoteStateHandle -> Key -> Annex (Maybe ComputeState) +getComputeState :: RemoteStateHandle -> Key -> Annex (Maybe ComputeState) getComputeState rs k = headMaybe . map snd . sortOn fst <$> getComputeStatesUnsorted rs k @@ -372,6 +376,12 @@ computeProgramEnvironment st = do newtype ImmutableState = ImmutableState Bool +data ComputeProgramResult = ComputeProgramResult + { computeState :: ComputeState + , computeInputsUnavailable :: Bool + , computeReproducible :: Bool + } + runComputeProgram :: ComputeProgram -> ComputeState @@ -381,7 +391,7 @@ runComputeProgram -- content is not available -> Maybe (Key, MeterUpdate) -- ^ update meter for this key - -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) + -> (ComputeProgramResult -> OsPath -> NominalDiffTime -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent meterkey cont = withOtherTmp $ \othertmpdir -> @@ -398,12 +408,13 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) } showOutput starttime <- liftIO currentMonotonicTimestamp - state' <- withmeterfile $ \meterfile -> bracket + let startresult = ComputeProgramResult state False False + result <- withmeterfile $ \meterfile -> bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) - (getinput tmpdir subdir state meterfile) + (getinput tmpdir subdir startresult meterfile) endtime <- liftIO currentMonotonicTimestamp - cont state' subdir (calcduration starttime endtime) + cont result subdir (calcduration starttime endtime) getsubdir tmpdir = do let subdir = tmpdir computeSubdir state @@ -415,24 +426,25 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) , return tmpdir ) - getinput tmpdir subdir state' meterfile p = + getinput tmpdir subdir result meterfile p = liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> getinput tmpdir subdir state' meterfile p + | null l -> getinput tmpdir subdir result meterfile p | otherwise -> do - state'' <- parseoutput p tmpdir subdir state' meterfile l - getinput tmpdir subdir state'' meterfile p + result' <- parseoutput p tmpdir subdir result meterfile l + getinput tmpdir subdir result' meterfile p Nothing -> do liftIO $ hClose (stdoutHandle p) liftIO $ hClose (stdinHandle p) unlessM (liftIO $ checkSuccessProcess (processHandle p)) $ giveup $ program ++ " exited unsuccessfully" - return state' + return result - parseoutput p tmpdir subdir state' meterfile l = case Proto.parseMessage l of + parseoutput p tmpdir subdir result meterfile l = case Proto.parseMessage l of Just (ProcessInput f) -> do let f' = toOsPath f - let knowninput = M.member f' (computeInputs state') + let knowninput = M.member f' + (computeInputs (computeState result)) checksafefile tmpdir subdir f' "input" checkimmutable knowninput "inputting" f' $ do (k, inputcontent) <- getinputcontent f' @@ -446,21 +458,21 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hPutStrLn (stdinHandle p) $ maybe "" fromOsPath mp liftIO $ hFlush (stdinHandle p) - let state'' = state' + let result' = result { computeInputsUnavailable = - isNothing mp || computeInputsUnavailable state' + isNothing mp || computeInputsUnavailable result } return $ if immutablestate - then state'' - else state'' + then result' + else modresultstate result' $ \s -> s { computeInputs = M.insert f' k - (computeInputs state') + (computeInputs s) } Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" - knownoutput <- case M.lookup f' (computeOutputs state') of + knownoutput <- case M.lookup f' (computeOutputs $ computeState result) of Nothing -> return False Just mk -> do when (mk == fmap fst meterkey) $ @@ -468,20 +480,23 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) return True checkimmutable knownoutput "outputting" f' $ return $ if immutablestate - then state' - else state' + then result + else modresultstate result $ \s -> s { computeOutputs = M.insert f' Nothing - (computeOutputs state') - } + (computeOutputs s) + } Just (ProcessProgress percent) -> do liftIO $ updatepercent percent - return state' + return result Just ProcessReproducible -> - return $ state' { computeReproducible = True } + return $ result { computeReproducible = True } Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" + modresultstate result f = + result { computeState = f (computeState result) } + checksafefile tmpdir subdir f fileaction = do let err problem = giveup $ program ++ " tried to " ++ fileaction ++ " a file that is " ++ problem ++ ": " ++ fromOsPath f @@ -596,10 +611,11 @@ computeKey rs (ComputeProgram program) k _af dest meterupdate vc = (keyfile : _) -> Just keyfile [] -> Nothing - postcompute keyfile state tmpdir _ts - | computeInputsUnavailable state = + postcompute keyfile result tmpdir _ts + | computeInputsUnavailable result = giveup "Input file(s) unavailable." - | otherwise = postcompute' keyfile state tmpdir + | otherwise = + postcompute' keyfile (computeState result) tmpdir postcompute' keyfile state tmpdir = do hb <- hashBackend From 1e9bb30c4e53dd20b841d18969d7ebc0b1234fc7 Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 6 Mar 2025 12:52:12 -0400 Subject: [PATCH 51/53] update --- TODO-compute | 13 +++++++++++-- doc/git-annex-recompute.mdwn | 4 ++-- 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/TODO-compute b/TODO-compute index 5b212695c6..8d26a0777d 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,7 +1,8 @@ * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an url. Maybe a separate command? That would also allow querying for eg, - what files are inputs for another file. + what files are inputs for another file. Or it could be exposed in the + Remote interface, and made into a file matching option. * "getting input from " message uses the original filename, but that file might have been renamed. Would be more clear to use @@ -16,6 +17,15 @@ * Perhaps recompute should write a new version of a file as an unlocked file when the file is currently unlocked? + Problem: Since recompute does not stage the file, it would have to write + the content to the working tree. And then the user would need to + git-annex add. But then, if the key was a VURL key, it would add it with + the default backend instead, and the file would no longer use a computed + key. + + So it, seems that, for this to be done, recompute would need to stage the + pointer file. + * compute on files in submodules * recompute could ingest keys for other files than the one being @@ -42,4 +52,3 @@ that recompute should also support recomputing non-annexed files. Otherwise, adding a file and then recomputing it would vary in what the content of the file is, depending on annex.smallfiles setting. - diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index 0f8dd56901..498c85e26c 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -29,9 +29,9 @@ but is not staged, in order to avoid overwriting any staged changes. Only recompute files that were computed by this compute remote. When this option is not used, all computed files are recomputed using - whatever compute remote was originally used to add them. In cases where + whatever compute remote was originally used to add them. (In cases where a file can be computed by multiple remotes, the one with the lowest - configured cost will be used. + configured cost is used.) * `--unreproducible`, `-u` From 4979df54d58986bc9cfa1530993aa0b6f841138c Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 6 Mar 2025 13:34:51 -0400 Subject: [PATCH 52/53] update --- TODO-compute | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/TODO-compute b/TODO-compute index 8d26a0777d..7749ad1be3 100644 --- a/TODO-compute +++ b/TODO-compute @@ -14,9 +14,10 @@ * addcomputed should honor annex.addunlocked. -* Perhaps recompute should write a new version of a file as an unlocked - file when the file is currently unlocked? - + What about recompute? It seems it should either write the new version of + the file as an unlocked file when the old version was unlocked, or also + honor annex.addunlocked. + Problem: Since recompute does not stage the file, it would have to write the content to the working tree. And then the user would need to git-annex add. But then, if the key was a VURL key, it would add it with From e952753846d9157f59f4c0e00c4aff437a21f85f Mon Sep 17 00:00:00 2001 From: Joey Hess Date: Thu, 6 Mar 2025 14:22:45 -0400 Subject: [PATCH 53/53] preparing to merge compute --- CHANGELOG | 4 ++ ..._2546562f7a00e082cd0500debc904cf3._comment | 22 ++++++ ..._d1561153a3916411ed8caa92fa53893c._comment | 69 +++++++++++++++++++ ...ompute_special_remote_remaining_todos.mdwn | 18 ++++- 4 files changed, 111 insertions(+), 2 deletions(-) create mode 100644 doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment create mode 100644 doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment rename TODO-compute => doc/todo/compute_special_remote_remaining_todos.mdwn (82%) diff --git a/CHANGELOG b/CHANGELOG index 475277f8f4..8c944a4bfb 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,5 +1,9 @@ git-annex (10.20250116) UNRELEASED; urgency=medium + * Added the compute special remote. + * addcomputed: New command, adds a file that is generated by a compute + special remote. + * recompute: New command, recomputes computed files. * Support help.autocorrect settings "prompt", "never", and "immediate". * Allow setting remote.foo.annex-tracking-branch to a branch name that contains "/", as long as it's not a remote tracking branch. diff --git a/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment new file mode 100644 index 0000000000..1416d77bde --- /dev/null +++ b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment @@ -0,0 +1,22 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: DataLad exploration of the compute on demand space""" + date="2025-03-06T17:39:04Z" + content=""" +Thanks for explaining the design points of datalad-remake. Some +different design choices than I have made, but mostly they strike me as +implementing what is easier/possible from outside git-annex. + +Eg, storing the compute inputs under `.datalad` in the branch is fine -- +and might even be useful if you want to make a branch that changes +something in there -- but of course in the git-annex implementation it +stores the equvilant thing in the git-annex branch. + +I do hope I'm not closing off the design space from such differences +by dropping a compute special remote right into git-annex. But I also +expect that having a standard and easy way for at least simple +computations will lead to a lot of contributions as others use it. + +Your fMRI case seems like one that my compute remote could handle well +and easily. +"""]] diff --git a/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment new file mode 100644 index 0000000000..bfacbdf57d --- /dev/null +++ b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment @@ -0,0 +1,69 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 22""" + date="2025-03-06T17:54:50Z" + content=""" +I've merged the compute special remote now. +See [[special_remotes/compute]], [[git-annex-addcomputed]] +and [[git-annex-recompute]]. + +I have opened [[todo/compute_special_remote_remaining_todos]] with +some various ways that I want to improve it further. Including, notably, +computing on inputs from submodules, which is not currently supported at +all. + +---- + +Here I'll go down mih's original and quite useful design criteria and see +how the compute special remote applies to them: + +### Generate annex keys (that have never existed) + +`git-annex addcomputed --fast` + +### Re-generate annex keys + +`git-annex addcomputed` optionally with the --reproducible option, +followed by a later `git-annex get` + +Another thing that fits under this heading is when one of the original +input files has gotten modified, and you want to compute a new version of +the output file from it, using the same method as was used to compute it +before. That's `git-annex recompute $output_file` + +### Worktree provisioning? + +This is the main thing I didn't implement. Given that git-annex is working +with large files and needs to support various filesystems and OS's that +lack hardlinks and softlinks, it's hard to do this inexpensively. + +Also, it turned out to make sense for the compute program to request +the input files it needs, since this lets git-annex learn what the input +files are, so it can make them available when regenerating a computed file +later. And so the protocol just has git-annex respond with the path to +the content of the file. + +### Request one key, receive many + +This is supported. (So is using multiple inputs to produce one (or more) +outputs.) + +### Instruction deposition + +`git-annex addcomputed` + +### Storage redundancy tests + +It did make sense to have it automatically `git-annex get` the inputs. +Well, I think it makes sense in most cases, this may become a tunable +setting of the compute special remote. + +### Trust + +Handled by requiring the user install a `git-annex-compute-foo` command +in PATH, and provide the name of the command to `initremote`. + +And for later `enableremote` or `autoenable=true`, it will only +allow programs that are listed in the annex.security.allowed-compute-programs +git config. +"""]] diff --git a/TODO-compute b/doc/todo/compute_special_remote_remaining_todos.mdwn similarity index 82% rename from TODO-compute rename to doc/todo/compute_special_remote_remaining_todos.mdwn index 7749ad1be3..bb522398a4 100644 --- a/TODO-compute +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,3 +1,19 @@ +This is the remainder of my todo list while I was building the +compute special remote. --[[Joey]] + +* write a tip showing how to use this + +* Write some simple compute programs so we have something to start with. + + - convert between images eg jpeg to png + - run a command in a singularity container (that is one of the inputs) + - run a wasm binary (that is one of the inputs) + +* compute on input files in submodules + +* annex.diskreserve can be violated if getting a file computes it but also + some other output files, which get added to the annex. + * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an url. Maybe a separate command? That would also allow querying for eg, @@ -27,8 +43,6 @@ So it, seems that, for this to be done, recompute would need to stage the pointer file. -* compute on files in submodules - * recompute could ingest keys for other files than the one being recomputed, and remember them. Then recomputing those files could just use those keys, without re-running a computation. (Better than --others