Say we have 2 backup drives and want to fill them both evenly with files,
different files in each drive. Currently, preferred content cannot express
that entirely:

* One way is to use `a-m*` and `n-z*`, but that's unlikely to split
  filenames evenly.
* Or, both repos could take whatever files, perhaps at random, that the
  other repo is not known to contain. But then the repos will race and
  both get the same file, or similarly if they are not communicating
  frequently.

So, let's add a new expression: `balanced(group)`

This would work by taking the list of uuids of all repositories in the
group, and sorting it, which yields a list of repositories numbered
0..M-1. To decide which repository wants key K, convert K to a number N
in some stable way, and then `N mod M` yields the number of the
repository that wants it, while all the rest don't. (Since git-annex keys
can be pretty long and not all of them are random hashes, let's md5sum
the key and then use the md5 as a number.)

This expression is stable as long as the members of the group don't
change. I think that's stable enough to work as a preferred content
expression.

Now, you may want to be able to add a third repo and have the data be
rebalanced, with some moving to it. And that would happen. However, as
this scheme stands, it's equally likely that adding repo3 will make repo1
and repo2 want to swap files between them. So, we'll want to add some
precautions to avoid a lot of data moving around in this case:

    (balanced(backup) and not copies=backup:1) or present

So once a file lands on a backup drive, it stays there, even if adding
more backup drives changes the balancing.

-----

Some limitations:

* The item size is not taken into account. One repo could end up with a
  much larger item or items, and so fill up faster. And the other repo
  wouldn't then notice it was full and take up some slack.
* With the complicated expression above, adding a new repo when one is
  full would not necessarily result in new files going to one of the 2
  repos that still have space. Some items would end up going to the full
  repo.

These can be dealt with by noticing when a repo is full and moving some
of its files (any will do) to other repos in its group. I don't see a way
to make preferred content express that movement though; it would need to
be a manual/scripted process.

> Could the size of each repo be recorded (either actual disk size or
> desired max size), and when a repo is too full to hold an object, be
> left out of the set of repos used to calculate where to store that
> object?
>
> With the preferred content expression above with "present" in it,
> a repo being full would not cause any content to be moved off of it;
> only new content that had not yet reached any of the repos in the
> group would be affected. That seems good.
>
> This would need only a single one-time write to the git-annex branch,
> to record the repo size. Then a local counter for each repository could
> be updated from the git-annex branch location log changes.
> There is a todo about doing this:
> [[todo/track_free_space_in_repos_via_git-annex_branch]].
>
> Of course, in the time after the git-annex branch was updated and
> before it reaches the local repo, a repo can be full without us knowing
> about it. Stores to it would fail, and perhaps be retried, until the
> updated git-annex branch was synced.

-----

What if we have 5 backup repos and want each file to land in 3 of them?
There's a simple change that can support that: `balanced(group:3)`

This works the same as before, but rather than just `N mod M`, take
`(N+I) mod M` where I is [0..2], to get the list of 3 repositories that
want a key.

This does not really avoid the limitations above, but having more repos
that want each file will reduce the chances that no repo will be able to
take a given file.
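The selection scheme described above can be sketched as follows. This is
a minimal illustration, not git-annex code; it assumes repositories are
identified by their uuid strings, and interprets the md5 digest of the
key as a big integer to get N:

```python
import hashlib

def balanced_wants(key, group_uuids, copies=1):
    """Return the uuids of the repositories in the group that want this key.

    The group's uuids are sorted to get a stable numbering 0..M-1, the key
    is md5summed and the digest treated as a number N, and the repositories
    numbered (N+I) mod M, for I in 0..copies-1, want the key.
    """
    members = sorted(group_uuids)
    m = len(members)
    n = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return [members[(n + i) % m] for i in range(copies)]
```

Each repository would evaluate this locally and want the key only when
its own uuid is in the returned list; since the computation depends only
on the key and the sorted group membership, every repository reaches the
same answer, with no communication needed.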
In the [[iabackup]] scenario, new clients will just be assigned until all
the files reach the desired level of replication.

However.. Imagine there are 9 repos, all full, and some files have not
reached the desired level of replication. Seems like adding 1 more repo
will make only 3 in 10 files be wanted by that new repo. Even if the repo
has space for all the files, it won't be sufficient, and more repos would
need to be added.

One way to avoid this problem would be if preferred content was only used
for the initial distribution of files to a repo. If the repo has gotten
all the files it wants, it could make a second pass and opportunistically
get files it doesn't want, but that it has space for, and that don't have
enough copies yet. Although this gets back to the original problem of
multiple repos racing downloads and files getting more than the desired
number of copies.

> With the above idea of tracking when repos are full, the new repo
> would want all files when the other 9 repos are full.

----

Of course this is not limited to backup drives. A more complicated
example: There are 4 geographically distributed datacenters, each of
which has some number of drives. Each file should have 1 copy stored in
each datacenter, on some drive there.

This can be implemented by making a group for each datacenter, which all
of its drives are in, and using `balanced()` to pick the drive that holds
the copy of the file. The preferred content expression would be eg:

    (balanced(datacenterA) and not copies=datacenterA:1) or present

In such a situation, to avoid an `N^2` remote interconnect, there might
be a transfer repository in each datacenter, that is in front of its
drives. The transfer repository should want files that have not yet
reached the destination drive. How to write a preferred content
expression for that? It might be sufficient to use
`not copies=datacenterA:1`, so long as the file reaching any drive in the
datacenter is enough.
But it may be desirable to add something analogous to `inallgroup=` that
checks if a file is in the place that `balanced()` picks for it in a
group. Eg, `balancedgroup=datacenterA` for 1 copy and
`balancedgroup=group:datacenterA:2` for N copies.

----

Another possibility to think about is to have one repo calculate which
files to store on which repos, to best distribute and pack them. The
first repo that writes a solution would win, and other nodes would work
to move files around as needed.

In a split brain situation, there would be sets of repos doing work
toward different solutions. On merge, it would make sense to calculate a
new solution that takes that work into account as well as possible. (Some
work would surely have been in vain.)
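A sketch of how such a `balancedgroup=` match might be evaluated, reusing
the mod-M scheme described earlier. This is speculative; the function and
parameter names are illustrative, not git-annex's actual internals:

```python
import hashlib

def balancedgroup_matches(key, repo_uuid, group_uuids, copies=1):
    """Check whether this repository is one of those that balanced()
    picks for the key in the given group: the proposed balancedgroup=
    match, generalized to N copies.

    Assumes the scheme described above: sort the group's uuids, md5sum
    the key and treat the digest as a number N, then pick the members
    numbered (N+I) mod M for I in 0..copies-1.
    """
    members = sorted(group_uuids)
    m = len(members)
    n = int(hashlib.md5(key.encode()).hexdigest(), 16)
    wanted = {members[(n + i) % m] for i in range(copies)}
    return repo_uuid in wanted
```

For any key, exactly `copies` of the group's members match, so an
expression like `balancedgroup=group:datacenterA:2` would settle which 2
drives in the datacenter should end up holding the file.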