This todo is about `git-annex import branch --from remote`, which is
implemented now.

> [[done]] --[[Joey]]

## race conditions

(Some thoughts about races that the design now covers, kept here for
reference.)

A file could be modified on the remote while it's being exported, and if
the remote then uses the mtime of the modified file in the content
identifier, the modification would never be noticed by imports.

To fix this race, we need an atomic move operation on the remote. Upload
the file to a temp file, then get its content identifier, and then move
it from the temp file to its final location. Alternatively, upload a file
and get the content identifier atomically, which eg S3 with versioning
enabled provides. It would make sense to have the storeExport operation
always return a content identifier and document that it needs to get it
atomically by either using a temp file or something specific to the
remote.

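Here is a minimal sketch of that contract, using the local filesystem as
a stand-in for a remote and a size+mtime value as the content identifier;
the names are illustrative, not git-annex's real interface:

    import System.Directory (copyFile, getFileSize,
        getModificationTime, renameFile)

    newtype ContentIdentifier = ContentIdentifier String
        deriving (Eq, Show)

    -- A cheap stand-in content identifier: size plus mtime.
    contentIdentifier :: FilePath -> IO ContentIdentifier
    contentIdentifier f = do
        sz <- getFileSize f
        mt <- getModificationTime f
        return (ContentIdentifier (show sz ++ " " ++ show mt))

    -- Upload to a temp name, take the content identifier of the
    -- temp file, and only then rename it into place. Because the
    -- identifier is read before the rename, it always describes the
    -- bytes that actually landed at the final location.
    storeExport :: FilePath -> FilePath -> IO ContentIdentifier
    storeExport localfile dest = do
        let tmp = dest ++ ".tmp"
        copyFile localfile tmp
        cid <- contentIdentifier tmp
        renameFile tmp dest
        return cid
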
----

There's also a race where a file gets changed on the remote after an
import tree, and an export then overwrites it with something else.

One solution would be to only allow one of importtree or exporttree to a
given remote. This reduces the use cases a lot though, and perhaps so far
that the import tree feature is not worth building. The adb special
remote needs both. Also, such a limitation seems like one that users
might try to work around by initializing two remotes using the same data
and trying to use one for import and the other for export.

Really fixing this race needs locking or an atomic operation. Locking
seems unlikely to be a portable enough solution.

An atomic rename operation could at least narrow the race significantly,
eg (a code sketch follows the list):

1. get content identifier of $file, check if it's what was expected,
   else abort (optional, but would catch most problems)
2. upload new version of $file to $tmp1
3. rename current $file to $tmp2
4. get content identifier of $tmp2, check if it's what was expected to
   be; if not, $file was modified after the last import tree, and that
   conflict has to be resolved; otherwise, delete $tmp2
5. rename $tmp1 to $file

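A sketch of those five steps, again using the local filesystem and a
size+mtime value as stand-ins for the remote and its content
identifiers; a Nothing result means a conflict was detected and has to
be resolved:

    import System.Directory (copyFile, getFileSize,
        getModificationTime, removeFile, renameFile)

    newtype ContentIdentifier = ContentIdentifier String deriving Eq

    -- Same stand-in content identifier as in the previous sketch.
    contentIdentifier :: FilePath -> IO ContentIdentifier
    contentIdentifier f = do
        sz <- getFileSize f
        mt <- getModificationTime f
        return (ContentIdentifier (show sz ++ " " ++ show mt))

    exportReplacing :: ContentIdentifier -> FilePath -> FilePath
        -> IO (Maybe ContentIdentifier)
    exportReplacing expected localfile file = do
        let tmp1 = file ++ ".new"
            tmp2 = file ++ ".old"
        -- 1. optional early check of the current content identifier
        cur <- contentIdentifier file
        if cur /= expected
            then return Nothing
            else do
                -- 2. upload the new version alongside the old one
                copyFile localfile tmp1
                -- 3. move the current version out of the way
                renameFile file tmp2
                -- 4. re-check what was displaced; a mismatch means
                --    the file changed after the last import tree
                displaced <- contentIdentifier tmp2
                if displaced /= expected
                    then return Nothing -- conflict; tmp2 is kept
                    else do
                        removeFile tmp2
                        -- 5. move the new version into place
                        renameFile tmp1 file
                        Just <$> contentIdentifier file
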
That leaves a race if the file gets overwritten after it's moved out of
the way. If the rename refuses to overwrite existing files, that race
would be detected by it failing. renameat(2) with `RENAME_NOREPLACE` can
do that, but many special remote interfaces probably don't provide such
an operation.

S3 lacks a rename operation; it can only copy and then delete. That is
not good enough: it risks the file being replaced with new content
before the delete, and the new content then being deleted.

Is this race really a significant problem? One way to look at it is
analogous to a git merge overwriting a locally modified file. Git can
certainly use similar techniques to entirely detect and recover from
such races (but not the similar race described in the next section).
But, git does not actually do that! I modified git's merge.c to sleep
for 10 seconds after `refresh_index()`, and verified that changes made
to the work tree in that window were silently overwritten by git merge.
In git's case, the race window is normally quite narrow and this is very
unlikely to happen (the similar race described in the next section is
more likely).

If git-annex could get the race window similarly small, it would perhaps
be ok. Eg:

1. upload new version of $file to $tmp
2. get content identifier of $file, check if it's what was expected,
   else abort
3. rename (or copy and delete) $tmp to $file

The race window between #2 and #3 could be quite narrow for some
remotes. But S3, lacking a rename, does a copy that can be very slow for
large files.

S3, with versioning, could detect the race after the fact, by listing
the versions of the file, and checking if any of the versions is one
that git-annex did not know the file already had.
[Using this API](https://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETVersion.html),
with version-id-marker set to the previous version of the file, should
list only the previous and current versions; if there's an intermediate
version then the race occurred, and it could roll the change back, or
otherwise recover the overwritten version. This could be done at import
time, to detect a previous race and recover from it: importing a tree
with the file(s) that were overwritten due to the race, leading to a
tree import conflict that the user can resolve. This likely generalizes
to importing a sequence of trees, so each version written to S3 gets
imported.

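A sketch of that after-the-fact check; listVersionsAfter is a
hypothetical wrapper around the API linked above, with key-marker and
version-id-marker filled in, returning version ids newest first:

    type Key = String
    type VersionId = String

    -- Hypothetical wrapper around S3's GET Bucket Object versions
    -- call, listing the versions of one key starting from the
    -- version we already knew about.
    listVersionsAfter :: Key -> VersionId -> IO [VersionId]
    listVersionsAfter _ _ = return [] -- placeholder; would call S3

    -- After storing newv over what was believed to be oldv, any
    -- listed version that is neither of the two was written by a
    -- third party in between: the race happened, and that version
    -- is what needs to be recovered.
    raceDetected :: Key -> VersionId -> VersionId -> IO Bool
    raceDetected key oldv newv = do
        vs <- listVersionsAfter key oldv
        return (any (\v -> v /= oldv && v /= newv) vs)
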
----

A remaining race is that, if the file is open for write at the same time
it's renamed, the write might happen after the content identifier is
checked, and then whatever is written to it will be lost.

But: Git worktree update has the same race condition. Verified with this
perl oneliner, run in a worktree and followed a second later by a git
pull; the lines it appended to the file got lost:

    perl -e 'open (OUT, ">>foo") || die "$!"; sleep(10); while (<>) { print OUT $_ }'

Since this is acceptable in git, I suppose we can accept it here too..

## S3 versioning and import

Listing a versioned S3 bucket with past versions results in S3 sending a
list that's effectively:

    foo current-version
    foo past-version
    bar deleted
    bar past-version
    bar even-older-version

Each item on the list also has a LastModified date, and IsLatest is set
for the current version of each file.

This needs to be converted into an ImportableContents tree of file
trees.

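That shape, roughly, as a sketch that matches the examples below (not
necessarily git-annex's exact definition):

    -- A current tree of files, paired with a list of older trees;
    -- info is whatever identifies a version of a file (here, an S3
    -- version id).
    data ImportableContents info = ImportableContents
        { importableContents :: [(FilePath, info)]
        , importableHistory :: [ImportableContents info]
        }
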
Getting the current file tree is easy: just filter on IsLatest.

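For example, as a sketch that assumes the bucket listing has been parsed
into one record per row (the field names are made up):

    -- One row of the version listing above.
    data Entry = Entry
        { entryKey :: FilePath
        , entryVersion :: String
        , entryIsLatest :: Bool
        , entryIsDelete :: Bool
        }

    -- The current tree is just the rows S3 flags as latest, minus
    -- delete markers.
    currentTree :: [Entry] -> [Entry]
    currentTree = filter (\e -> entryIsLatest e && not (entryIsDelete e))
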
Getting the past file trees seems hard. Two things are in tension:

* Want to generate the same file tree in this import that was used in
  past imports. Since the file tree is converted to a git tree, this
  avoids a proliferation of git trees.

* Want the past file trees to reflect what was actually in the S3
  bucket at different past points in time.

So while it would work fine to just make one past file tree per file,
containing only that single file, the user would not like the resulting
history when they explored it with git.

With the example above, the user expects something like this:

    ImportableContents [(foo, current-version)]
      [ ImportableContents [(foo, past-version), (bar, past-version)]
        [ ImportableContents [(bar, even-older-version)]
          []
        ]
      ]

And the user would like the inner-most list to also include
(foo, past-version) if it was in the S3 bucket at the same time
(bar, even-older-version) was added. So depending on the past
modification times of foo vs bar, they may really expect:

    let l = ImportableContents [(foo, current-version)]
      [ ImportableContents [(foo, past-version), (bar, past-version)]
        [ ImportableContents [(foo, past-version), (bar, even-older-version)]
          [ ImportableContents [(foo, past-version)]
            []
          ]
        ]
      ]

Now, suppose that foo is deleted and subsequently bar is added back, so
S3 now sends this list:

    bar new-version
    bar deleted
    bar past-version
    bar even-older-version
    foo deleted
    foo current-version
    foo past-version

The user would expect this to result in:

    ImportableContents [(bar, new-version)]
      [ ImportableContents []
        l
      ]

But l needs to be the same as the l above, to avoid a proliferation of
git trees.

What is the algorithm here?

1. Build a list of files with historical versions ([[a]]).
2. Extract a snapshot from the list.
3. Remove too-new versions from the list.
4. Recurse with the new list.

Extracting a snapshot:

Map over the list, taking the head version of each item and tracking the
most recent modification time. Add the filenames to a snapshot list
(unless the item is a deletion).

Removing too-new versions:

Map over the list, and when the head version of a file matches the most
recent modification time, pop it off.

This results in a list that contains only versions from before the
snapshot.

Overall this is perhaps a bit better than O(n^2), because the size of
the list decreases as it goes?

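Here is a sketch of those steps; Version, History, and Snapshot are
simplified stand-ins for git-annex's real types, and each file's version
list is assumed to be ordered newest first, as S3 returns it:

    import Data.Time (UTCTime)

    data Version = Version
        { versionId :: String
        , modTime :: UTCTime
        , isDelete :: Bool
        }

    -- One entry per file, with that file's versions newest first.
    type History = [(FilePath, [Version])]

    -- Stand-in for ImportableContents: a tree of (file, version)
    -- pairs plus the chain of older trees.
    data Snapshot = Snapshot [(FilePath, String)] [Snapshot]
        deriving Show

    -- Take the newest remaining version of each file as the
    -- snapshot, omitting delete markers; then pop every head
    -- version stamped with the newest modification time seen, and
    -- recurse on the shortened lists.
    buildSnapshots :: History -> [Snapshot]
    buildSnapshots h = case [ (f, v) | (f, v:_) <- h ] of
        [] -> []
        heads ->
            let newest = maximum (map (modTime . snd) heads)
                tree = [ (f, versionId v) | (f, v) <- heads
                       , not (isDelete v) ]
                rest = [ (f, popNewest newest vs) | (f, vs) <- h ]
            in [Snapshot tree (buildSnapshots rest)]
      where
        popNewest t (v:vs) | modTime v == t = vs
        popNewest _ vs = vs

    -- Every round pops at least the version carrying the newest
    -- modification time, so the lists shrink and the recursion
    -- terminates. Which of the two expected nestings above comes
    -- out depends on the relative modification times of foo and
    -- bar, exactly as discussed.
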
---

See also [[adb_special_remote]].

[[!tag confirmed]]