add design document for import tree

parent 2f67c4ac87
commit d128c8c3ec

3 changed files with 203 additions and 141 deletions
@@ -4,6 +4,12 @@ and content from the tree.

(See also [[todo/export]] and [[todo/dumb, unsafe, human-readable_backend]])

Note that this document was written with the assumption that only git-annex
is writing to the special remote. But
[[importing_trees_from_special_remotes]] invalidates that assumption,
and some additional things were needed to deal with it. See that link for
details.

[[!toc ]]

## configuring a special remote for tree export

doc/design/importing_trees_from_special_remotes.mdwn (new file, 192 lines)

@@ -0,0 +1,192 @@
Importing trees from special remotes allows data published by others to be
gathered. It also combines with [[exporting_trees_to_special_remotes]]
to let a special remote act as a kind of git working tree without `.git`,
in which the user can alter data as they like and use git-annex to pull
their changes into the local repository's version control.

(See also [[todo/import_tree]].)

The basic idea is to have a `git annex import --from remote` command.

It would find changed/new/deleted files on the remote, download the
changed/new files and inject them into the annex, and generate a new
treeish that has the modifications in it, with the previously exported
treeish as its parent.

Updating the local working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.
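The change-detection step described above can be sketched as follows. This is an illustrative Python model, not git-annex's actual implementation; all names here are hypothetical. It compares the remote's current (location, content identifier) listing against the state recorded at the last import or export:

```python
# Hypothetical sketch: classify files on the remote as new, changed, or
# deleted by diffing two location -> content-identifier mappings.

def classify_changes(last_state, current_listing):
    """Both arguments map export location -> content identifier."""
    new = [loc for loc in current_listing if loc not in last_state]
    deleted = [loc for loc in last_state if loc not in current_listing]
    changed = [loc for loc, cid in current_listing.items()
               if loc in last_state and last_state[loc] != cid]
    return new, changed, deleted

last = {"a.txt": "cid1", "b.txt": "cid2", "c.txt": "cid3"}
now = {"a.txt": "cid1", "b.txt": "cid9", "d.txt": "cid4"}
new, changed, deleted = classify_changes(last, now)
# d.txt is new, b.txt changed, c.txt was deleted
```

Only the new and changed locations then need downloading before the import treeish can be built.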

## content identifiers

The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keeps track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.

git-annex can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.
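The streaming lookup described above can be modeled like this (an illustrative sketch with hypothetical names, not git-annex code): resolve each content identifier to a key from the known mapping, and fall back to downloading only when the identifier has never been seen:

```python
# Hypothetical sketch: walk the remote's file list, mapping each content
# identifier to a key, downloading only unknown content.

def keys_for_listing(listing, cid_to_key, download):
    """listing: iterable of (location, cid) pairs.
    cid_to_key: known content identifier -> key mapping.
    download(location): fetches content, returns a newly generated key."""
    tree = {}
    for location, cid in listing:
        key = cid_to_key.get(cid)
        if key is None:
            key = download(location)   # content not seen before
            cid_to_key[cid] = key      # remember the mapping for next time
        tree[location] = key
    return tree

known = {"cid1": "SHA256-k1"}
downloads = []
def fake_download(loc):
    downloads.append(loc)
    return "SHA256-new-" + loc

tree = keys_for_listing([("a", "cid1"), ("b", "cid2")], known, fake_download)
# only "b" needs downloading; "a" resolves from the known mapping
```

Since each file is processed as it streams in, only the tree object under construction needs to be held in memory, not the whole listing.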

----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be as unique as possible, but not necessarily fully unique.
  A hash of the content would be ideal.
  A (size, mtime, inode) tuple is as good a content identifier as git uses in
  its index.
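The (size, mtime, inode) tuple mentioned above can be built straight from `stat()`; a minimal sketch (hypothetical helper name):

```python
# Hypothetical sketch: a stat-based content identifier, per the
# (size, mtime, inode) idea above. Stable while the file is unchanged,
# and any rewrite alters at least the mtime (usually size too).

import os
import tempfile

def stat_content_identifier(path):
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ino)

with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("hello")
    path = f.name

cid1 = stat_content_identifier(path)
cid2 = stat_content_identifier(path)
# unchanged file -> identical identifier
```

Like git's index, this is cheap to compute but not fully unique: a write that preserves size, mtime, and inode would go unnoticed, which is why a content hash would be ideal when the remote can provide one.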

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.

----

The content identifier needs to be stored somehow for later use.

It would be good to store the content identifiers only locally, if
possible.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled in merge as usual.

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive.) Any modifications made to
exported files before the first import tree would not be noticed. That seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is used to import tree from R for a while, and the
user gets used to editing files on R and importing them? Then they stop
using A and switch to clone B, which does not have the content identifier
information that A did. It seems that in this case, B needs to re-download
everything, to build up the map of content identifiers.
(Anything could have changed since the last time A imported.)
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning, a content identifier is
already stored. When the content identifier is (mtime, size, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason generates something larger, it could simply be hashed to produce
the content identifier.
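The size-capping idea above can be sketched in a few lines (an illustrative model; the constant and helper name are hypothetical, and the real limit and hash choice would be up to git-annex):

```python
# Hypothetical sketch: cap content identifier size by hashing any
# identifier longer than a typical hash digest.

import hashlib

MAX_CID_BYTES = 64  # about the length of a hex SHA-256 digest

def normalize_cid(raw: bytes) -> bytes:
    """Use the remote's identifier as-is when small enough, else hash it."""
    if len(raw) <= MAX_CID_BYTES:
        return raw
    # An oversized identifier is replaced by its hash: still stable,
    # and still changes whenever the underlying identifier changes.
    return hashlib.sha256(raw).hexdigest().encode()

small = normalize_cid(b"etag-0123456789")
large = normalize_cid(b"x" * 1000)
```

Hashing keeps the stored data bounded without weakening the identifier's two essential properties: stability for unchanged files and change on modification.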

## safety

Since the special remote can be written to at any time by something other
than git-annex, git-annex needs to take care when exporting to it, to avoid
overwriting such changes.

This is similar to how git merge avoids overwriting modified files in the
working tree.

Surprisingly, git merge doesn't avoid overwrites in all conditions! I
modified git's merge.c to sleep for 10 seconds after `refresh_index()`, and
verified that changes made to the work tree in that window were silently
overwritten by git merge. In git's case, the race window is normally quite
narrow, so this is very unlikely to happen.

Also, git merge can overwrite a file that a process has open for write;
the process's changes then get lost. This was verified with the perl
oneliner below, run in a worktree and followed a second later by a git
pull; the lines that it appended to the file got lost:

	perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

git-annex should take care to be at least as safe as git merge when
exporting to a special remote that supports imports.

The situations to keep in mind are these:

1. File is changed on the remote after an import tree, and an export wants
   to also change it. Need to avoid the export overwriting the
   file. Or, need a way to detect such an overwrite and recover the version
   of the file that got overwritten, after the fact.

2. File is changed on the remote while it's being imported, and part of one
   version + part of the other version is downloaded. Need to detect this
   and fail the import.

3. File is changed on the remote after its content identifier is checked
   and before it's downloaded, so the wrong version gets downloaded.
   Need to detect this and fail the import.
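Situations 2 and 3 can both be guarded against by checking the content identifier immediately before and again after the download, failing when either check disagrees. A minimal illustrative model (hypothetical names, not git-annex code):

```python
# Hypothetical sketch: detect situations 2 and 3 above by bracketing the
# download with content identifier checks.

def safe_retrieve(get_cid, download, location, expected_cid):
    """Download location, failing unless the remote's content identifier
    matches expected_cid both before and after the transfer."""
    if get_cid(location) != expected_cid:
        raise RuntimeError("file changed before download")
    data = download(location)
    if get_cid(location) != expected_cid:
        # rewritten mid-transfer; data may mix two versions
        raise RuntimeError("file changed during download")
    return data

# A stable remote: the identifier is unchanged across the transfer.
data = safe_retrieve(lambda loc: "cid1", lambda loc: b"content", "a", "cid1")
```

Note this is only as strong as the content identifier itself: a write that leaves a weak identifier like (size, mtime, inode) unchanged would slip through, which is another argument for hash-based identifiers where available.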

## api design

This is an extension to the ExportActions api.

	listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)])

	getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)

	retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)

	storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)

listContents finds the current set of files that are stored in the remote,
some of which may have been written by programs other than git-annex,
along with their content identifiers. It returns a list of those, often in
a single-node tree.

listContents may also find past versions of files that are stored in the
remote, when it supports storing multiple versions of files. Since it
returns a tree of lists of files, it can represent anything from a linear
history to a full branching version control history.
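One possible reading of that `Tree` of listings can be modeled as follows (an illustrative sketch only; the actual tree orientation and field names in git-annex may differ): each node carries one version's listing, and its parents are older versions, so a single node is the common no-history case, a chain is linear history, and multiple parents give branching history.

```python
# Hypothetical model of listContents' tree of listings.

from dataclasses import dataclass, field

@dataclass
class ListingTree:
    listing: list                                 # [(location, cid), ...]
    parents: list = field(default_factory=list)   # older versions, if any

# A linear two-version history: cid1 was later overwritten by cid2.
current = ListingTree([("a.txt", "cid2")],
                      parents=[ListingTree([("a.txt", "cid1")])])
```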

retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
(The callback that generates a key will let eg S3 record the S3 version id for
the key.)

retrieveExportWithContentIdentifier should detect when the file it has
downloaded may not match the requested content identifier (eg when
something else wrote to it while it was being retrieved), and fail
in that case.

storeExportWithContentIdentifier stores content and returns the
content identifier corresponding to what it stored. It can either get
the content identifier in reply to the store (as S3 does with versioning),
or it can store to a temp location, get the content identifier of that,
and then rename the content into place.

storeExportWithContentIdentifier must avoid overwriting any file that may
have been written to the remote by something else (unless that version of
the file can later be recovered by listContents), so it will typically
need to query for the content identifier before moving the new content
into place. FIXME: How does it know when it's safe to overwrite a file?
Should it be passed the content identifier that it's allowed to overwrite?

storeExportWithContentIdentifier needs to handle the case when there's a
race with a concurrent writer. It needs to avoid getting the wrong
ContentIdentifier for data written by the other writer. It may detect such
races and fail, or it could succeed and overwrite the other file, so long
as it can later be recovered by listContents.
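The check-before-overwrite behavior described above, combined with the idea of being passed the content identifier that may be overwritten, can be sketched like this (an illustrative model with hypothetical names; a real remote would additionally need the check and the rename to be atomic to fully close the race):

```python
# Hypothetical model of a safe store: refuse to overwrite unless the
# destination still holds the content identifier we were told we may
# replace.

def safe_store(remote, location, data, expected_cid):
    """remote models the special remote as {location: (cid, data)}.
    Returns the new content identifier, or None when the file at
    location no longer has the identifier we may overwrite."""
    current = remote.get(location)
    if current is not None and current[0] != expected_cid:
        return None  # a concurrent writer got there first; refuse
    new_cid = "cid-%x" % (hash(data) & 0xffff)
    remote[location] = (new_cid, data)
    return new_cid

remote = {"a.txt": ("cid1", b"old")}
ok = safe_store(remote, "a.txt", b"new", "cid1")        # matches: stored
blocked = safe_store(remote, "a.txt", b"newer", "cid1") # changed: refused
```

Failing (returning None) is the conservative choice; overwriting anyway is only acceptable on remotes where listContents can later recover the clobbered version.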

## multiple git-annex repos accessing a special remote

If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

This can be reduced to the same problem as exports of two
different trees to the same remote, which is already handled with the
export log.

Once a tree has been imported from the remote, it's
in the same state as exporting that same tree to the remote, so
update the export log to say that the remote has that treeish exported
to it. A conflict between two export log entries will be handled as
usual, with the user being prompted to re-export the tree they want
to be on the remote. (May need to reword that prompt.)
@@ -3,80 +3,13 @@ and the remote allows files to somehow be edited on it, then there ought
to be a way to import the changes back from the remote into the git repository.
The command could be `git annex import --from remote`

It would find changed/new/deleted files on the remote.
Download the changed/new files and inject into the annex.
Generate a new treeish, with parent the treeish that was exported,
that has the modifications in it.
See [[design/importing_trees_from_special_remotes]] for current design for
this.

Updating the working copy is then done by merging the import treeish.
This way, conflicts will be detected and handled as normal by git.
## race conditions

## content identifiers

The remote is responsible for collecting a list of
files currently in it, along with some content identifier. That data is
sent to git-annex. git-annex keeps track of which content identifier(s) map
to which keys, and uses the information to determine when a file on the
remote has changed or is new.

git-annex can simply build git tree objects as the file list
comes in, looking up the key corresponding to each content identifier
(or downloading the content from the remote and adding it to the annex
when there's no corresponding key yet). It might be possible to avoid
git-annex buffering much tree data in memory.

----

A good content identifier needs to:

* Be stable, so when a file has not changed, the content identifier
  remains the same.
* Change when a file is modified.
* Be as unique as possible, but not necessarily fully unique.
  A hash of the content would be ideal.
  A (size, mtime, inode) tuple is as good a content identifier as git uses in
  its index.

git-annex will need a way to get the content identifiers of files
that it stores on the remote when exporting a tree to it, so it can later
know if those files have changed.

----

The content identifier needs to be stored somehow for later use.

It would be good to store the content identifiers only locally, if
possible.

Would local storage pose a problem when multiple repositories import from
the same remote? In that case, perhaps different trees would be imported,
and merged into master. So the two repositories then have differing
masters, which can be reconciled in merge as usual.

Since exporttree remotes don't have content identifier information yet, it
needs to be collected the first time import tree is used. (Or import
everything, but that is probably too expensive.) Any modifications made to
exported files before the first import tree would not be noticed. That seems
acceptable as long as this only affects exporttree remotes created before
this feature was added.

What if repo A is used to import tree from R for a while, and the
user gets used to editing files on R and importing them? Then they stop
using A and switch to clone B, which does not have the content identifier
information that A did. It seems that in this case, B needs to re-download
everything, to build up the map of content identifiers.
(Anything could have changed since the last time A imported.)
That seems too expensive!

Would storing content identifiers in the git-annex branch be too
expensive? Probably not. For S3 with versioning, a content identifier is
already stored. When the content identifier is (mtime, size, inode),
that's a small amount of data. The maximum size of a content identifier
could be limited to the size of a typical hash, and if a remote for some
reason generates something larger, it could simply be hashed to produce
the content identifier.
## race conditions TODO

(Some thoughts about races that the design should cover now, but kept here
for reference.)

A file could be modified on the remote while
it's being exported, and if the remote then uses the mtime of the modified
@@ -179,73 +112,4 @@ Since this is acceptable in git, I suppose we can accept it here too..

----

If multiple repos can access the remote at the same time, then there's a
potential problem when one is exporting a new tree, and the other one is
importing from the remote.

> This can be reduced to the same problem as exports of two
> different trees to the same remote, which is already handled with the
> export log.
>
> Once a tree has been imported from the remote, it's
> in the same state as exporting that same tree to the remote, so
> update the export log to say that the remote has that treeish exported
> to it. A conflict between two export log entries will be handled as
> usual, with the user being prompted to re-export the tree they want
> to be on the remote. (May need to reword that prompt.)
> --[[Joey]]

## api design

Pulling all of the above together, this is an extension to the
ExportActions api.

	listContents :: Annex (Tree [(ExportLocation, ContentIdentifier)])

	getContentIdentifier :: ExportLocation -> Annex (Maybe ContentIdentifier)

	retrieveExportWithContentIdentifier :: ExportLocation -> ContentIdentifier -> (FilePath -> Annex Key) -> MeterUpdate -> Annex (Maybe Key)

	storeExportWithContentIdentifier :: FilePath -> Key -> ExportLocation -> MeterUpdate -> Annex (Maybe ContentIdentifier)

listContents finds the current set of files that are stored in the remote,
some of which may have been written by programs other than git-annex,
along with their content identifiers. It returns a list of those, often in
a single-node tree.

listContents may also find past versions of files that are stored in the
remote, when it supports storing multiple versions of files. Since it
returns a tree of lists of files, it can represent anything from a linear
history to a full branching version control history.

retrieveExportWithContentIdentifier is used when downloading a new file from
the remote that listContents found. retrieveExport can't be used because
it has a Key parameter and the key is not yet known in this case.
(The callback that generates a key will let eg S3 record the S3 version id for
the key.)

retrieveExportWithContentIdentifier should detect when the file it has
downloaded may not match the requested content identifier (eg when
something else wrote to it), and fail in that case.

storeExportWithContentIdentifier is used to get the content identifier
corresponding to what it stores. It can either get the content
identifier in reply to the store (as S3 does with versioning), or it can
store to a temp location, get the content identifier of that, and then
rename the content into place.

storeExportWithContentIdentifier must avoid overwriting any file that may
have been written to the remote by something else (unless that version of
the file can later be recovered by listContents), so it will typically
need to query for the content identifier before moving the new content
into place.

storeExportWithContentIdentifier needs to handle the case when there's a
race with a concurrent writer. It needs to avoid getting the wrong
ContentIdentifier for data written by the other writer. It may detect such
races and fail, or it could succeed and overwrite the other file, so long
as it can later be recovered by listContents.

----

See also [[adb_special_remote]]
Joey Hess