remove old closed bugs and todo items to speed up wiki updates and reduce size

Remove closed bugs and todos that were last edited or commented before 2022.

Except for ones tagged projects/*, since projects like datalad want to keep
around records of old deleted bugs longer.

Command line used:

	for f in $(grep -l '|done\]\]' -- ./*.mdwn); do if ! grep -q "projects/" "$f"; then d="$(echo "$f" | sed 's/.mdwn$//')"; if [ -z "$(git log --since=01-01-2022 --pretty=oneline -- "$f")" -a -z "$(git log --since=01-01-2022 --pretty=oneline -- "$d")" ]; then git rm -- "./$f" ; git rm -rf "./$d"; fi; fi; done
	for f in $(grep -l '\[\[done\]\]' -- ./*.mdwn); do if ! grep -q "projects/" "$f"; then d="$(echo "$f" | sed 's/.mdwn$//')"; if [ -z "$(git log --since=01-01-2022 --pretty=oneline -- "$f")" -a -z "$(git log --since=01-01-2022 --pretty=oneline -- "$d")" ]; then git rm -- "./$f" ; git rm -rf "./$d"; fi; fi; done
Joey Hess, 2023-01-05 15:09:30 -04:00
commit 4d90053e17 (parent acdd5fbab6)
GPG key ID: DB12DB0FF05F8F38 (no known key found for this signature in database)
427 changed files with 0 additions and 15690 deletions

@@ -1,8 +0,0 @@
# As is
On FAT disks, git-annex uses an adjusted unlocked branch. Files then take up double the space: once in the file tree and once in the .git folder.
# As I would like it
On such disks, with the annex.thin option, git-annex would keep content only in the file tree, and the content of the files in the .git folder would be wiped.
> [[done]], dup of other todo, and I don't know how to avoid the problem
> with git deleting the file. --[[Joey]]
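
For reference, on a filesystem that does support hard links, annex.thin is
enabled with a plain config setting; a minimal sketch per the git-annex
manpage (running `git annex fix` afterwards replaces existing work tree
copies with hard links):

    git config annex.thin true
    git annex fix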

@@ -1,15 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-02-08T18:17:00Z"
content="""
This has the following problem: You run git pull. A file got deleted. git
deletes the file in the repository directory. That was the only copy of the
content, so it's now impossible to revert the deletion and get the file
back, which you're supposed to be able to do.
This is why git-annex has to either make a copy or hard link the file
away for safekeeping.
As already discussed in [[annex.thin without hardlinks]].
"""]]

@@ -1,23 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2016-09-21T17:11:06Z"
content="""
Standard encfs warning: It's buggy and insecure. Don't use it.
You can find many other problems caused by encfs on this site, and
<https://defuse.ca/audits/encfs.htm> has described security problems with
encfs for years.
It would not help for `git-annex add` to check some kind of filename limit,
because it would not prevent you doing this:
git annex add smallenough
git mv smallenough oh-oops-my-name-is-too-long-for-encfs
git commit -m haha
A git pre-commit hook can of course be written that blocks such commits.
I am not inclined to complicate git-annex just to handle encfs given how
broken encfs is.
"""]]

@@ -1,14 +0,0 @@
[[!comment format=mdwn
username="interfect@b151490178830f44348aa57b77ad58c7d18e8fe7"
nickname="interfect"
subject="Pre Commit Hook"
date="2016-09-21T19:20:04Z"
content="""
I'm basically stuck with whatever home directory encryption Canonical deigns to give me in their setup wizard, given my time and attention budget. I've looked a bit at the security problems with it and they mostly seem to be that it's a bit leaky due to not hiding structures and sizes. Hiding contents is better than not hiding contents, so that's what I've got.
Anyway, a pre-commit hook, or maybe an update hook, would be a great solution. I'd like one to be on the wiki somewhere as a useful tip for actually using git annex effectively across a bunch of non-ideal environments. It would be great if a \"git annex init\" could set it up for me, too.
Any ideas for writing a pre-commit script that works on Linux, Mac, Windows, Android, and whatever weird embedded NAS things people might want to use it on? If I went with an update script over a pre-commit, that would make platform support less of a problem, but then you'd get Git Annex into weird situations when syncing.
How would Git Annex react if I made a commit on one system, but then my central syncing repo's update script rejected the commit for breaking the rules on file names? If I have a commit that isn't allowed to be pushed to a particular remote, how would I use git annex to get it out of the history of any repos it might have already gotten to?
"""]]

@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="interfect@b151490178830f44348aa57b77ad58c7d18e8fe7"
nickname="interfect"
subject="comment 3"
date="2016-09-21T19:32:06Z"
content="""
Also, I think Ubuntu is \"ecryptfs\" and not \"encfs\" anyway.
"""]]

@@ -1,23 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2016-09-21T20:16:44Z"
content="""
If an update hook rejects a push, then `git annex sync` will just note that
it was unable to push. It will sync in only one direction until the problem
that prevents pushing gets resolved.
It might try pushing to a different branch name than usual to get around
> some other problems that cause pushes to fail, so be sure to have the update
> hook check pushes to all branches (except for the git-annex branch).
I don't know why you'd want to filter such a commit out of the git history.
You could just fix it by renaming the problem file and make a commit on top
of the problem commit. Just make the update hook only look at the diff
between the old version of the branch and the new version, so it won't be
tripped up by intermediate commits that violate its rules.
(I know that Ubuntu uses encfs or something like that by default, but
surely they have not removed the Debian installer's support for full
disk encryption?)
"""]]

@@ -1,10 +0,0 @@
[[!comment format=mdwn
username="interfect@b151490178830f44348aa57b77ad58c7d18e8fe7"
nickname="interfect"
subject="comment 5"
date="2016-09-21T22:49:55Z"
content="""
OK, I'll try something like that.
(Full disk encryption is still there; I think on one system I just have ecryptfs, because I want to be able to get in over ssh sometimes, and on one I have *both* FDE and ecryptfs on, because I enjoy performance penalties.)
"""]]

@@ -1,9 +0,0 @@
Would it be hard to support MD5E keys that omit the -sSIZE part, the way this is allowed for URL keys? I have a use case where I have the MD5 hashes and filenames of files stored in the cloud, but not their sizes, and want to construct keys for these files to use with setpresentkey and registerurl. I could construct URL keys, but then I lose the error-checking and have to set annex.security.allow-unverified-downloads . Or maybe, extend URL keys to permit an -hMD5 hash to be part of the key?
Another (and more generally useful) solution would be [[todo/alternate_keys_for_same_content/]]. Then can start with a URL-based key but then attach an MD5 to it as metadata, and have the key treated as a checksum-containing key, without needing to migrate the contents to a new key.
> Closing, because [[external_backends]] is implemented, so you should be
> able to roll your own backend for your use case here. Assuming you can't
> just use regular MD5E and omit the file size field, which will work too.
> [[done]]
> --[[Joey]]

@@ -1,9 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2019-01-22T15:54:03Z"
content="""
Have you tried just constructing MD5E keys without the size value?
git-annex still supports keys from before v2 repo version that did not
include size, so I'd guess it would work ok.
"""]]

@@ -1,27 +0,0 @@
What steps will reproduce the problem?
Sync a lot of small files.
What is the expected output? What do you see instead?
The expected output is hopefully a fast transfer.
But currently it seems like git-annex is only using one thread to transfer (per host or total?).
An option to select the number of transfer threads to use (possibly per host) would be very nice.
> Opening a lot of connections to a single host is probably not desirable.
>
> I do want to do something to allow slow hosts to not hold up transfers to
> other hosts, which might involve running multiple queued transfers at
> once. The webapp already allows the user to force a given transfer to
> happen immediately. --[[Joey]]
And maybe also an option to limit how long a queue the browser should show; it can become quite resource-intensive with a long queue.
> The queue is limited to 20 items for this reason. --[[Joey]]
---
> There has been a lot of improvement in both parallelization support
> and per-file overhead on speed since this todo was filed. This todo does
> not look relevant enough to leave open, so [[done]] --[[Joey]]
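
For later readers, the parallelisation that landed since then is used
roughly like this (flag and config names as I understand current
git-annex):

    # use 4 concurrent jobs for this run
    git annex get -J4 .
    # or make concurrency the default for this repository
    git config annex.jobs 4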

@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="https://www.google.com/accounts/o8/id?id=AItOawnKT33H9qVVGJOybP00Zq2NZmNAyB65mic"
nickname="Lucas"
subject="comment 1"
date="2014-11-12T07:58:07Z"
content="""
Opening multiple connections to a host can be preferable sometimes and it's unlikely to be an issue at all for the larger remotes like Google, Microsoft or S3.
For example, the OneDrive provider spends a lot of time sitting around waiting for initialisation between uploads. Using, say 5 threads instead of 1 would allow it to continue doing things while it waits.
Multiple connections can also vastly improve upload speeds for users with congested home internet connections.
"""]]

@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="https://launchpad.net/~krastanov-stefan"
nickname="krastanov-stefan"
subject="Status of this issue"
date="2014-12-27T15:18:42Z"
content="""
I was unable to find a way to tell git-annex that certain remotes should receive multiple transfers in parallel. Is this implemented yet or on the roadmap? If neither would modifying the webapp to bear this logic without touching git-annex itself be a solution (asking mainly because it can be done with a greasemonkey script)?
"""]]

@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="lhunath@3b4ff15f4600f3276d1776a490b734fca0f5c245"
nickname="lhunath"
subject="Simultaneous transfers"
date="2018-02-02T17:37:27Z"
content="""
I highly recommend ensuring that:
1. Each remote can configure a number of maximum simultaneous transfers, where each type of remote comes with a sensible default number.
2. Transfers to multiple individual remotes happen in parallel regardless of their simultaneous transfers setting.
Judging from the fact that simultaneous transfers happen just fine when you hit the > icon in the webapp, I would assume that most of the underbelly for this is already present.
"""]]

@@ -1,41 +0,0 @@
```
From 92dfde25409ae2268ab2251920ed11646c122870 Mon Sep 17 00:00:00 2001
From: Reiko Asakura <asakurareiko@protonmail.ch>
Date: Tue, 26 Oct 2021 15:46:38 -0400
Subject: [PATCH] Call freezeContent after move into annex
This change better supports Windows ACL management using
annex.freezecontent-command and annex.thawcontent-command and matches
the behaviour of adding an unlocked file.
By calling freezeContent after the file has moved into the annex,
the file's delete permission can be denied. If the file's delete
permission is denied before moving into the annex, the file cannot
be moved or deleted. If the file's delete permission is not denied after
moving into the annex, it will likely inherit a grant for the delete
permission which allows it to be deleted irrespective of the permissions
of the parent directory.
---
Annex/Content.hs | 3 +++
1 file changed, 3 insertions(+)
diff --git a/Annex/Content.hs b/Annex/Content.hs
index da65143ab..89c36e612 100644
--- a/Annex/Content.hs
+++ b/Annex/Content.hs
@@ -346,6 +346,9 @@ moveAnnex key af src = ifM (checkSecureHashes' key)
liftIO $ moveFile
(fromRawFilePath src)
(fromRawFilePath dest)
+ -- On Windows the delete permission must be denied only
+ -- after the content has been moved in the annex.
+ freezeContent dest
g <- Annex.gitRepo
fs <- map (`fromTopFilePath` g)
<$> Database.Keys.getAssociatedFiles key
--
2.30.2
```
> [[applied|done]] --[[Joey]]

@@ -1,49 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-10-26T17:49:00Z"
content="""
Thank you for putting this patch together. It is especially helpful to get
patches from a windows user, since it's far from my comfort zone.
---
My first concern was what happens if git-annex is interrupted after moving
the object into place but before freezeContent. Leaving an object file
with possibly unsafe permissions. Looks like `git-annex fsck` will
correct that, if it's run.
As you mentioned, when an unlocked file is added, and linkToAnnex
is called, it does move the object into the annex before freezeContent.
Although that may have been an oversight really. It could just as well
freeze before moving and so avoid leaving the file with the wrong
permissions when interrupted.
And there are other situations where being interrupted can have the same
result. Eg, in lockContentForRemoval, it calls thawContent, then an action
that may take long enough to be interrupted, and then freezeContent.
And it's hard to see any other way that could work; it can't
move the object out of the object directory before thawing it.
So, this seems ok, I suppose.
---
In Annex.Ingest, `lockDown'` calls freezeContent on the file
when it's still in the work tree. So I think that would have the same
problem you're trying to prevent with this patch?
Command.Import also has a call to freezeContent that is not on the final
object file location.
A windows-specific feature like this risks getting broken, so maybe
it would be good to change freezeContent to avoid such problems. Eg,
it could be changed to take a Key, and freeze the object file
for that Key. But at least the call in Annex.Ingest needs to happen
before there is a Key.
So perhaps there should be a freezeContent
and a separate freezeObject, which takes a Key. There could
then be a separate annex.freezeobject-command that gets run only
for freezeObject, not freezeContent.
"""]]

@@ -1,9 +0,0 @@
[[!comment format=mdwn
username="asakurareiko@f3d908c71c009580228b264f63f21c7274df7476"
nickname="asakurareiko"
avatar="http://cdn.libravatar.org/avatar/a865743e357add9d15081840179ce082"
subject="comment 2"
date="2021-10-26T19:54:53Z"
content="""
Sorry, I missed explaining a few things and made a mistake in the patch. I made my freeze script detect whether the input is inside or outside of .git/annex/objects, so there is no problem with calling freezeContent on something in the working tree. The problem is the missing freezeContent call on the final object, because the delete permission can only be denied at that point. The easiest way to fix that without compromising the safety of the previous behaviour is to add another freezeContent call after moveFile.
"""]]

@@ -1,18 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2021-10-27T17:56:15Z"
content="""
Ah, making your script smart is reasonable enough.
I hope you might consider sharing the script in a tip?
Looking at your updated patch, you now leave the freezeContent call before it
moves to the object file, and add another call afterwards. I think that would
be objectionable if the user has a freeze hook that is expensive
to run, because it would unnecessarily run twice. I fairly well satisfied
myself in comment #1 that it's ok to defer freezeContent to after it's
moved the object file into place.
So, I've applied it, but modified to remove that earlier freezeContent.
"""]]

@@ -1,3 +0,0 @@
Sqlite docs [say](https://www.sqlite.org/pragma.html#pragma_synchronous) "commits can be orders of magnitude faster with synchronous OFF". The downside is a chance of db corruption if power fails at a bad moment, but since git-annex's dbs can be re-generated from git data, maybe that's a tradeoff some users would be ok with? One usually knows when power has failed.
> [[closing|done]] per comments --[[Joey]]

@@ -1,47 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-06-23T17:05:38Z"
content="""
I think it could at least use synchronous=NORMAL, entirely safely, since it
uses WAL mode.
"WAL mode is always consistent with synchronous=NORMAL, but WAL mode does
lose durability. A transaction committed in WAL mode with
synchronous=NORMAL might roll back following a power loss or system crash."
It's certainly already possible for a power loss or ctrl-c while git-annex
is running to cause database changes to be lost, since git-annex buffers
several changes together into a transaction and until it sends that
transaction, can lose the data.
Exactly how well git-annex recovers from that probably varies, eg
Database.Keys.reconcileStaged flushes the transactions before updating its
own state files, so on power loss it will just run again and recover. The
fsck database gets recovered likewise. But there are probably other write points
where getting the data recovered is harder.
For example, moveAnnex updates the inode cache at the end when it populated
a pointer file. If that database write is lost, git-annex won't know that
the pointer file is populated with annexed content. So it will treat it as
a possibly modified unlocked file, and when it eventually has a reason to,
will re-hash it, and then should recover the lost information.
Quite possible there are situations where it fails to recover the lost
information and does something annoying. But like I said, such situations
can already happen and setting synchronous=NORMAL does not make them more
likely.
It would still make sense to benchmark it before changing to it. It may
well be that git-annex's buffering of changes into larger transactions
already has a similar performance gain as the pragma and that the pragma
does not speed it up.
As far as OFF goes, I'd need to see some serious performance improvements
in benchmarking, and also be sure that git-annex always recovered well,
which would have to somehow include detecting corrupted sqlite databases
and rebuilding them. I don't know if it's really possible to detect.
Might some form of corrupted sqlite database cause sqlite, and thus
git-annex, to crash? And rebuilding might entail re-hashing the entire
repository, so very expensive.
"""]]

@@ -1,19 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="recovering from sqlite db corruption"
date="2021-06-23T18:45:47Z"
content="""
>detecting corrupted sqlite databases and rebuilding them. I don't know if it's really possible to detect.
Could you detect whether a git-annex command finished normally, by creating a marker file when it starts, and deleting the marker file as the last thing before exiting?
The next command then checks if the previous one crashed, and rebuilds the dbs if yes (or just warns the user and offers to rebuild).
>Rebuilding might entail re-hashing the entire repository
Aren't all file hashes recorded in git, which would not be affected by a sqlite crash?
"""]]

@@ -1,102 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2021-06-23T19:15:55Z"
content="""
Benchmarked with NORMAL:

    joey@darkstar:~/tmp/t>/usr/bin/time ~/git-annex.synchrousNORMAL add 1??? --quiet
    6.99user 5.09system 0:11.63elapsed 103%CPU (0avgtext+0avgdata 68356maxresident)k
    143760inputs+40352outputs (819major+404774minor)pagefaults 0swaps
    joey@darkstar:~/tmp/t>/usr/bin/time ~/git-annex.synchrousNORMAL add 2??? --quiet
    7.71user 5.15system 0:11.93elapsed 107%CPU (0avgtext+0avgdata 69876maxresident)k
    11336inputs+42648outputs (9major+414417minor)pagefaults 0swaps
    joey@darkstar:~/tmp/t>/usr/bin/time ~/git-annex.synchrousNORMAL add 3??? --quiet
    7.99user 5.16system 0:12.20elapsed 107%CPU (0avgtext+0avgdata 70452maxresident)k
    11952inputs+44200outputs (8major+415267minor)pagefaults 0swaps
    joey@darkstar:~/tmp/t>/usr/bin/time ~/git-annex.synchrousNORMAL add 4??? --quiet
    8.30user 5.25system 0:12.62elapsed 107%CPU (0avgtext+0avgdata 69496maxresident)k
    17784inputs+45776outputs (9major+416640minor)pagefaults 0swaps

Which is no improvement over git-annex with no pragmas. Actually slower.

    joey@darkstar:~/tmp/t>/usr/bin/time ~/git-annex.orig add 1??? --quiet
    6.89user 5.36system 0:11.39elapsed 107%CPU (0avgtext+0avgdata 50576maxresident)k
    47064inputs+40352outputs (5616major+404472minor)pagefaults 0swaps
    joey@darkstar:~/tmp/u>/usr/bin/time git-annex add 2??? --quiet
    7.76user 5.09system 0:11.88elapsed 108%CPU (0avgtext+0avgdata 70848maxresident)k
    12776inputs+42648outputs (9major+414346minor)pagefaults 0swaps
    joey@darkstar:~/tmp/u>/usr/bin/time git-annex add 3??? --quiet
    7.90user 5.26system 0:12.14elapsed 108%CPU (0avgtext+0avgdata 71676maxresident)k
    13824inputs+44200outputs (8major+415258minor)pagefaults 0swaps
    joey@darkstar:~/tmp/u>/usr/bin/time git-annex add 4??? --quiet
    8.22user 5.38system 0:12.49elapsed 108%CPU (0avgtext+0avgdata 71652maxresident)k
    14216inputs+45776outputs (8major+416784minor)pagefaults 0swaps

OFF also benchmarks very close to the same.

    joey@darkstar:~/tmp/v>/usr/bin/time ~/git-annex.synchrousOFF add 1??? --quiet
    6.85user 5.58system 0:12.01elapsed 103%CPU (0avgtext+0avgdata 71100maxresident)k
    50080inputs+40352outputs (16major+405312minor)pagefaults 0swaps
    joey@darkstar:~/tmp/v>/usr/bin/time ~/git-annex.synchrousOFF add 2??? --quiet
    7.64user 5.31system 0:11.96elapsed 108%CPU (0avgtext+0avgdata 71392maxresident)k
    12672inputs+42640outputs (8major+414373minor)pagefaults 0swaps
    joey@darkstar:~/tmp/v>/usr/bin/time ~/git-annex.synchrousOFF add 3??? --quiet
    8.02user 5.15system 0:12.19elapsed 108%CPU (0avgtext+0avgdata 71556maxresident)k
    11648inputs+43928outputs (8major+415140minor)pagefaults 0swaps
    joey@darkstar:~/tmp/v>/usr/bin/time ~/git-annex.synchrousOFF add 4??? --quiet
    8.24user 5.24system 0:12.41elapsed 108%CPU (0avgtext+0avgdata 71224maxresident)k
    10952inputs+45304outputs (8major+416560minor)pagefaults 0swaps

One pass did run 0.08s faster, which could be due to not syncing, but it does
not seem a significant optimisation, at least not on this SSD.
Should be noted that transactions build up 1000 changes, and that benchmark
was operating on 1000 files per run, so it probably only wrote one or two
transactions.
Here's the patch that adds a pragma:

    diff --git a/Database/Handle.hs b/Database/Handle.hs
    index d7f1822dc..2d66af5e6 100644
    --- a/Database/Handle.hs
    +++ b/Database/Handle.hs
    @@ -1,11 +1,11 @@
    {- Persistent sqlite database handles.
    -
    - - Copyright 2015-2019 Joey Hess <id@joeyh.name>
    + - Copyright 2015-2021 Joey Hess <id@joeyh.name>
    -
    - Licensed under the GNU AGPL version 3 or higher.
    -}
    -{-# LANGUAGE TypeFamilies, FlexibleContexts #-}
    +{-# LANGUAGE TypeFamilies, FlexibleContexts, OverloadedStrings #-}
    module Database.Handle (
    DbHandle,
    @@ -34,6 +34,7 @@ import qualified Data.Text as T
    import Control.Monad.Trans.Resource (runResourceT)
    import Control.Monad.Logger (runNoLoggingT)
    import System.IO
    +import Lens.Micro
    {- A DbHandle is a reference to a worker thread that communicates with
    - the database. It has a MVar which Jobs are submitted to. -}
    @@ -194,10 +195,13 @@ runSqliteRobustly tablename db a = do
    maxretries = 100 :: Int
    rethrow msg e = throwIO $ userError $ show e ++ "(" ++ msg ++ ")"
    -
    +
    + conninfo = over extraPragmas (const ["PRAGMA synchronous=OFF"]) $
    + mkSqliteConnectionInfo db
    +
    go conn retries = do
    r <- try $ runResourceT $ runNoLoggingT $
    - withSqlConnRobustly (wrapConnection conn) $
    + withSqlConnRobustly (wrapConnectionInfo conninfo conn) $
    runSqlConn a
    case r of
    Right v -> return v
"""]]

@@ -1,10 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="thanks "
date="2021-06-24T16:40:48Z"
content="""
Thanks for doing the benchmarking; seems like git-annex's batching of operations already captures whatever speedup de-synchronizing could give.
"""]]

@@ -1,14 +0,0 @@
Like git annex runs git-annex, git-annex foo could run git-annex-foo when
it's not built-in.
One user of this would be annex-review-unused, which
its author would rather name git-annex-reviewunused if that
made "git annex reviewunused" work.
In CmdLine, where autocorrect is handled, it would need to
search the path for all "git-annex-" commands and then
either dispatch the one matching the inputcmdname,
or do autocorrect with the list of those commands
included along with the builtins. --[[Joey]]
> [[done]] --[[Joey]]
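
Now that this is done, an addon command is just an executable named
git-annex-<name> somewhere on PATH; a minimal sketch:

    cat >~/bin/git-annex-hello <<'EOF'
    #!/bin/sh
    echo "hello from an addon command"
    EOF
    chmod +x ~/bin/git-annex-hello
    git annex hello   # dispatches to git-annex-hello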

@@ -1,43 +0,0 @@
Currently annex.thin needs hard link support to be efficient;
it hard links the content from .git/annex/objects into the work tree.
When hard links are not supported, two copies of checked out files exist on
disk.
Would it be possible to make it work w/o hard links? Note that direct mode
does avoid two copies of files.
IIRC the main reason for the hard link is so, when git checkout deletes a
work tree file, the only copy of the file is not lost. Seems this would
need a git hook run before checkout to rescue such files.
Also some parts of git-annex's code, including `withObjectLoc`, assume
that the .annex/objects is present, and so it would need to be changed
to look at the work tree file. --[[Joey]]
> Git hook is not sufficient. Consider the case of "rm file; git checkout file"
> Without hard links, if the only copy of the annex object was in that
> deleted file, it can't be restored. Now, direct mode did have the same
> problem, but it didn't support `git checkout`, so the user didn't have
> reason to expect such a workflow to work.
>
> So, I think this is not possible to implement in a way that won't
> lead to users experiencing data loss when using it and doing
> perfectly normal git things like this.
>
> (Although to be fair, annex.thin has its own data loss scenarios,
> involving modifying a file potentially losing the only copy of
> the old version. The difference, I think, is that with it,
> you modify the file yourself and so lose the old version; the data
> loss does not happen when you run git checkout or git pull!)
>
> In the meantime,
> git-annex has gotten support for directory special remotes with
> import/export tree. This can be used instead, for use cases such as a
> device with a FAT filesystem. The git-annex repo can live on another
> filesystem that does support hard links or symlinks, or where using
> double disk space is not as much of a problem, or can even be a bare
> git repo. That syncs up with the FAT device through tree import and
> export. Once content has been imported to the git-annex repo,
> the user can delete files from the FAT device without losing data.
>
> So this seems about as good as it can get. [[done]] --[[Joey]]
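
A sketch of that suggested setup; the remote name and mount point are
hypothetical:

    git annex initremote fatdrive type=directory directory=/media/fat \
        exporttree=yes importtree=yes encryption=none
    git annex export master --to fatdrive
    # later, pick up changes made directly on the FAT device
    git annex import master --from fatdrive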

@@ -1,38 +0,0 @@
Add a git config to limit the bandwidth of transfers to/from remotes.
rsync has --bwlimit, so used to work, but is not used with modern
git-annex for p2p transfers. (bup also has a --bwlimit)
This should be possible to implement in a way that works for any remote
that streams to/from a bytestring, by just pausing for a fraction of a
second when it's running too fast. The way the progress reporting interface
works, it will probably work to put the delay in there. --[[Joey]]
[[confirmed]]
> Implemented and works well. [[done]] --[[Joey]]
> Note: A local git remote, when resuming an interrupted
> transfer, has to hash the file (with default annex.verify settings),
> and that hashing updates the progress bar, and so the bwlimit can kick
> in and slow down that initial hashing, before any data copying begins.
> This seems perhaps ok; if you've bwlimited a local git remote,
> presumably you're wanting to limit disk IO. The only reason it might not be ok
> is if the intent is to limit IO to the disk containing the remote
> but not the one containing the annex repo. (This also probably
> holds for the directory special remote.)
> Other remotes, including git over ssh, when resuming don't have that
> problem. Looks like chunked special remotes narrowly avoid it, just
> because their implementation choose to not do incremental verification
> when resuming. It might be worthwhile to differentiate between progress
> updates for incremental verification setup and for actual transfers, and
> only rate limit the latter, just to avoid fragility in the code.
> I have not done so yet though, and am closing this..
> --[[Joey]]
> (One other small caveat is that it pauses after each chunk, which means
> it pauses unnecessarily after the last chunk of the file. It doesn't know
> it's the last chunk, and it would be hard to teach it. And the chunks
> tend to be 32kb or so, and the pauses a small fraction of a second. So
> mentioning this only for completeness.) --[[Joey]]
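
For reference, assuming the config names this ended up using (annex.bwlimit
and a per-remote variant), usage looks roughly like this, with values being
a size per second:

    # limit all transfers
    git config annex.bwlimit 1MiB
    # or limit only transfers to/from one remote
    git config remote.origin.annex-bwlimit 512KiB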

@@ -1,7 +0,0 @@
AFAICT, the `annex/` subdir in a bare annex repo is the exact same layout as a directory special remote.
It'd be very useful if its parameters could be customised just like an actual directory special remote to allow for e.g. encrypted and/or chunked storage. I have a use-case where this could significantly simplify things.
An interesting side-effect of this would be a tweakable location for a bare repo's storage which could be used to separate metadata and data (i.e. git repo on SSD for fast syncs and actual data on an HDD).
> [[rejected|done]] --[[Joey]]

@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Lukey"
avatar="http://cdn.libravatar.org/avatar/c7c08e2efd29c692cc017c4a4ca3406b"
subject="comment 1"
date="2021-05-25T16:48:26Z"
content="""
You can already do this by setting the `remote.<name>.annex-ignore` config option for the bare repo and initializing an independent directory special-remote.
"""]]

@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Atemu"
avatar="http://cdn.libravatar.org/avatar/d1f0f4275931c552403f4c6707bead7a"
subject="comment 2"
date="2021-05-26T07:11:20Z"
content="""
The problem is that I need this repo to stay a remote from the eyes of all other repos; I need to be able to get files from and add new ones to it. I just want its storage back-end to work a little differently so that it fits my use-case.
"""]]

@@ -1,22 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2021-05-27T16:15:31Z"
content="""
It would not make sense for a non-bare git repository to have annexed
contents in it encrypted or chunked, because that would prevent actually
accessing the annexed files at all; git-annex symlinks have to point to a
complete, non-encrypted file.
Bare git repositories are a very minor special case of non-bare git
repositories; they do not have a work tree or index. In other
respects, they are the same, and it's entirely possible to manually
convert a git repo to or from bare, or even temporarily use a bare repo
with a work tree.
It would be extremely inelegant if git-annex did something that broke
that. Which this would.
I think you should possibly use an rsync special remote, which also has the
same layout as a directory special remote.
"""]]

@@ -1,5 +0,0 @@
When you want to dead a file in your checkout, you can only do so via the key of the file. You can find the corresponding key with a bit of bash like this: `git annex dead --key $(basename $(readlink file))` but that shouldn't be necessary IMO.
It'd be a lot better if you could just dead files like this: `git annex dead --file file` or even like this: `git annex dead --file file1 file2 file3 otherfiles.*` (or maybe even like this: `git annex dead --file file1 file2 --key $key1 $key2`).
> [[done]] in another way --[[Joey]]
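
(The one-liner can also be written with git-annex's own plumbing, which is
less fragile than parsing the symlink; lookupkey is an existing command,
the filename is hypothetical:)

    git annex dead --key "$(git annex lookupkey file)"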

@@ -1,25 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-05-31T19:08:32Z"
content="""
I suppose this could be useful, but note that `git annex fsck` without
--all will still warn if it finds a file in the working tree with no
existing content, even if its key has been marked dead. Because having a
file in the working tree that you can't get is certainly a bad situation.
So, if this feature got implemented, you would want to follow `git annex
dead` of a file with `git rm` of the file. Probably.
The other reason dead only operates on keys is that the expected
workflow was that the user will lose data, will delete the lost file out of
their working tree, or overwrite it or whatever, and then at some later
point get annoyed that fsck --all complains about it, and so then mark it
dead. But if you want to be proactive, marking a file dead is certainly
useful to be able to do.
I'd also be concerned that `git annex dead` or `git annex dead .` run
accidentally could be an annoying mistake to recover from. Certainly
it should not default to marking all files dead when there are no
parameters!
"""]]

@@ -1,26 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2021-06-25T18:22:04Z"
content="""
In <https://git-annex.branchable.com/forum/Forget_about_accidentally_added_file__63__/>
there is an idea of `git annex unannex --forget file`
And using unannex for this makes some sense; it's intended to be used to undo an
accidental `git-annex add`. When it's used that way, and later `git-annex
unused` finds the object file is not used by anything and the object gets
deleted, fsck --all will start complaining about it.
But there are still many ways it could go wrong. Being run recursively by
accident. Or another file, perhaps in another branch, using the same key,
which gets marked as dead.
Hmm, `git annex dropunused` (or `drop --unused`)
could mark the key as dead. At that point it's known to be unused.
This way, the existing workflow of git-annex unannex followed by git-annex
unused followed by dropping can be followed, and fsck --all does
not later complain about the key.
Done!
"""]]

@@ -1,38 +0,0 @@
Consider this, where branch foo has ten to a hundred thousand files
not in the master branch:

    git checkout foo
    touch newfile
    git annex add newfile

After recent changes to reconcileStaged, the result can be:

    add newfile 0b 100%    # cursor sits here for several seconds

This is because it has to look in the keys db to see if there's an
associated file that's unlocked and needs populating with the content of
this newly available key, so it does reconcileStaged, which can take some
time.

One fix would be, if reconcileStaged is taking a long time, make it display
a note about what it's doing:

    add newfile 0b 100% (scanning annexed files...)

It would also be possible to do the scan before starting to add files,
which would look more consistent and would avoid it getting stuck
with the progress display in view:

    (scanning annexed files...)
    add newfile ok
> [[done]] --[[Joey]]
It might also be possible to make reconcileStaged run a less expensive
scan in this case, eg the scan it did before
[[!commit 428c91606b434512d1986622e751c795edf4df44]]. In this case, it
only really cares about associated files that are unlocked, and so
diffing from HEAD to the index is sufficient, because the git checkout
will have run the smudge filter on all the unlocked ones in HEAD and so it
will already know about those associated files. However, I can't say I like
this idea much because it complicates using the keys db significantly.

@@ -1,12 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-06-08T15:21:02Z"
content="""
Made `git-annex smudge --update` run the scan, and so the post-checkout or
post-merge hook will call it.
That avoids the scenario shown above. But adding a lot of files to the
index can still cause a later pause for reconcileStaged without indication
what it's doing.
"""]]

@@ -1,22 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2021-06-08T16:03:13Z"
content="""
I tried making reconcileStaged display the message itself, this is the
result:

    add foo
    100%  30 B  73 KiB/s 0s(scanning for annexed files...)
    ok

So for that to be done, showSideAction would need to clear the progress
bar display first. Note that the display is ok when concurrent output is
enabled:

    add c (scanning for annexed files...)
    ok
Ok.. Fixed that display glitch, and made reconcileStaged display
the message itself when it's taking a while to run.
"""]]

@@ -1,32 +0,0 @@
The protocol has `GETCONFIG`, which gives access to the configuration
stored in remote.log, but it does not provide a good way to access git
configs set on the remote.
Datalad uses `GETCONFIG name` to get the remote name, and
then using git config to get its configs. That is suboptimal
because sameas remotes use sameas-name instead, and also because
the two names are not necessarily the same, eg `git remote rename` can
rename the git remote while the git-annex config still uses the other name.
<https://github.com/datalad/datalad/issues/4259>
One way to do that is `GETUUID` and then look for the git remote with
annex-uuid set to that, in order to learn its name and then find its other git
configs. But, it's also possible for there to be multiple git remotes with the
same annex-uuid. (This does not happen with sameas remotes, but like a git repo
can have multiple remotes pointing to it by different paths, the same can be
set up for a special remote, at least in theory.)
So, the protocol should be extended. Either with a way to get/set a single git
config (like `GETCONFIG`/`SETCONFIG` do with the remote.log config), or with a
way to get the git remote name.
The latter has the problem that this business of there being multiple
names for different related things that might be different but are probably
the same is perhaps not something people want to learn about.
The former seems conceptually simpler, but there might be things that
`git config` could do, that providing an interface on top of it would not
allow. The --type option is one thing that comes to mind. --[[Joey]]
> [[done]] as the GETGITREMOTENAME protocol extension and message.
> --[[Joey]]

@@ -1,7 +0,0 @@
`git annex fsck` currently spams the terminal with all keys in a repo and prints `git-annex: fsck: n failed` at the end if errors occur. Finding these errors in a sea of `ok`s is not trivial however.
A simple solution to this could be an fsck option which skips printing ok'd (and perhaps also dead) keys, i.e. `--no-ok` and `--no-dead`.
[[!meta title="mention common options on per-command man pages"]]
> common option man page and references [[done]] --[[Joey]]

@@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Lukey"
avatar="http://cdn.libravatar.org/avatar/c7c08e2efd29c692cc017c4a4ca3406b"
subject="comment 1"
date="2021-05-10T12:21:37Z"
content="""
Just use the `--quiet` option, then it will only show the errors (failed files/keys).
"""]]

@@ -1,10 +0,0 @@
[[!comment format=mdwn
username="Atemu"
avatar="http://cdn.libravatar.org/avatar/d1f0f4275931c552403f4c6707bead7a"
subject="comment 2"
date="2021-05-10T14:13:55Z"
content="""
Thanks, that's exactly what I'm looking for!
It's not in the git-annex-fsck manpage though for some reason.
"""]]

@@ -1,15 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2021-05-10T15:07:06Z"
content="""
Normally the common options are not included in every command's man page
because there are over 100 lines of them. However, I do think it's worth
including --quiet on fsck's man page in this specific case and am doing
that.
Maybe individual command man pages should mention that there are
also a bunch of common options. Perhaps those should be split out of the
git-annex man page, like the git-annex-matching-options man page is
handled.
"""]]

@@ -1,55 +0,0 @@
ATM `annex get` (in particular with '--json --json-error-messages --json-progress') channels to the user the error from an attempt to get a key from a remote, with a message that lacks information about the remote and/or the specifics of that particular attempt (e.g. which URL was attempted from the web remote), e.g.
```
$> git clone https://github.com/dandisets/000029 && cd 000029
Cloning into '000029'...
remote: Enumerating objects: 326, done.
remote: Counting objects: 100% (326/326), done.
remote: Compressing objects: 100% (160/160), done.
remote: Total 326 (delta 137), reused 295 (delta 106), pack-reused 0
Receiving objects: 100% (326/326), 45.53 KiB | 1.30 MiB/s, done.
Resolving deltas: 100% (137/137), done.
dandiset.yaml sub-RAT123/ sub-anm369962/ sub-anm369963/ sub-anm369964/
$> git update-ref refs/remotes/origin/git-annex b822a8d40ff348a60602f13d0add989bd24e727a # URLs fixed since then
$> git annex get sub-RAT123
get sub-RAT123/sub-RAT123.nwb (from web...)
download failed: Not Found
ok
(recording state in git...)
$> git annex version | head -n 1
git-annex version: 8.20210803+git165-g249d424b8-1~ndall+1
```
NB. That "download failed: Not Found" is also channeled in that form (without any extra information) among "errors" of `--json-error-messages` (and each progress message within `--json-progress`)
As such, the message is not really informative, and might even be a bit confusing to the user, since `get` eventually reports `ok` here.
I think it is useful to channel such information, but it should be extended; e.g. in this case it could be
```
failed to retrieve content from 'web' remote: https://api.dandiarchive.org/api/dandisets/000029/versions/draft/assets/b3675aad-db07-4fd4-9cce-c95f1184e7a3/download/ - Not Found
```
or similar. Even though considerably longer, it immediately provides feedback about which remote the retrieval failed from, and which particular URL was attempted.
refs in DataLad issues:
- from web remote: ["download failed: Not Found"](https://github.com/datalad/datalad/pull/5936)
- from ["failed to retrieve content from remote"](https://github.com/datalad/datalad/issues/5750)
> I think this is specific to downloading urls, although it can happen
> for a few remotes (web, external). There's really no reason to display
> a download failed message if it successfully downloads a later url.
> (After all, if it had tried the working url first, it would never display
> anything about the broken url.)
>
> When all urls fail, it makes sense to display each url and why it failed
> when using the web (or external) remote, so the user can decide what to
> do about each of the problems.
>
> [[done]] --[[Joey]]

@@ -1,3 +0,0 @@
Can git-annex-get be extended so that "git-annex-get --batch --key" fetches the keys (rather than filenames) given in the input?
> [[done]] --[[Joey]]

@@ -1,18 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2019-09-18T17:07:56Z"
content="""
--key can't be reused for another meaning like this, it would make "--key
foo" be ambiguous.
It would need to be some other option, --batch-key or whatever.
Adding this would seem to open the door to adding it to every command that
supports --batch now. I'm unsure if the added complexity justifies it.
I'd be more sanguine if there were a way to reuse the existing batch
machinery and apply it to keys. But many commands' --batch honor file
matching options (eg --copies or --include), and that cannot be done when
using keys.
"""]]

@@ -1,9 +0,0 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="Usefulness of batch key processing"
date="2020-05-15T09:21:15Z"
content="""
This would be quite helpful to tools using git-annex (eg. [annex-to-web](https://gitlab.com/chrysn/annex-to-web), issue [2](https://gitlab.com/chrysn/annex-to-web/-/issues/2)), especially for short-running things like `whereis` where the launching time dominates over the processing time.
"""]]

@@ -1,9 +0,0 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="Re: Usefulness of batch key processing"
date="2020-05-15T09:33:22Z"
content="""
Concerning the filtering, I'd find a note that \"--batch-keys is mutually exclusive with filtering\" perfectly acceptable if that makes implementation easier. (Or \"only with the filtering options that apply to keys\" -- as I found that `git annex whereis --in web --key=...` does work well with the key input).
"""]]

@@ -1,13 +0,0 @@
[[!comment format=mdwn
username="https://christian.amsuess.com/chrysn"
nickname="chrysn"
avatar="http://christian.amsuess.com/avatar/c6c0d57d63ac88f3541522c4b21198c3c7169a665a2f2d733b4f78670322ffdc"
subject="Another example"
date="2021-08-15T17:42:54Z"
content="""
The program at [[forum/Migrate_mark_files_dead]] shows again how batch-key would be useful, here for `git annex drop --from remote` and `git annex dead`.
I don't have numbers as I can't run it in batch, but comparing to other multi-file batch drop operations, I guesstimate this makes the difference between a script running for an hour invoking git-annex-drop a thousand times (with interruptions if the SSH agent decides to ask confirmation for a key again) and one running for five minutes with --batch-key.
Like with the original use case of annex-to-web, filtering is not an issue for this application.
"""]]

@@ -1,20 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2021-08-25T18:06:29Z"
content="""
I've implemented --batch-keys for the commands: get, drop, move, copy, whereis
That covers everything mentioned here except for dead, but that does not
support --batch yet, so if batch mode is needed for it, it can just use
--batch, not --batch-keys. However, after a recent change that makes
dropping unused keys automatically mark them dead, I suspect there
will not be a use case for that.
Most of the other commands that use --batch don't make sense to support
--batch-keys. Eg, add and find can't operate on keys, while
fromkey already operates on keys. About the only one that might is
rmurl, but it uses a custom batch format so would not be able to use the
current --batch-keys implementation. If someone needs that or some other
one, they can open a new todo.
"""]]

@@ -1,86 +0,0 @@
Files in the git-annex branch use timestamps to ensure that the most
recently recorded state wins. This is unsatisfying, because it requires
accurate clocks among all users. It would be better to use vector clocks,
where possible, but it is not possible to use vector clocks for all
information in the branch.
To see why vector clocks can't be used for some information in the branch,
consider location log files. They are meant to reflect the actual state of
an external resource. Vector clocks can ensure that a consistent state is
agreed on by distributed users, but there's no way to guarantee that state
matches the actual state.
For example, let's assume there's a vector clock consisting of an
integer, and an object is being added and removed from a remote by multiple
parties. First Alice logs (present, 1), and then some time later, Alice
logs (missing, 2). Meanwhile, Bob merges (present, 1) from Alice
and then logs (missing, 2), followed by (present, 3). At some later point,
they merge back up, and the winning state is (present, 3) as it has the
highest vector clock. Is the content really present on the remote?
Well, we don't know, Alice could have removed it before Bob stored it,
or afterwards.
But, other information in the branch could use vector clocks. Consider
numcopies setting. It's fine if the winner of a conflict over that is not
the one who set it most recently, as long as a value can be consistently
determined. So, the numcopies setting, and similar other configuration, is not
trying to track an external state, and so it could use vector clocks.
How would these vector clocks work, and how to transition to using them
without confusing old versions of git-annex that expect timestamps? A
change to a log could simply increment the clock from the previous
version of the log. This would make the new git-annex normally lose
when a conflicting change was written by an old git-annex, but the result
would be consistent, so that's acceptable.
Files that are related to external state need to continue to use
timestamps. But this could still be improved. Currently, if the clock is
wrongly set far in the future, logs using those timestamps will win over
other logs for a long time. This could break git-annex badly as there
becomes no way to correct wrong information.
Experimenting with `GIT_ANNEX_VECTOR_CLOCK`, it looks like `git annex fsck`
is able to recover from wrong location information being recorded with a
far future timestamp. It replaces that timestamp with the current one.
However, if that then gets union merged with a change to the same location
log made in another repository, fsck's correction can be lost in the merge.
Re-running the fsck will eventually get the information corrected, once a
non-union merge happens. However, `git annex fsck` can't correct other
logs, like remote state logs, if they end up with bad information with
a far future timestamp.
There's a mirror problem of information being recorded with a timestamp
in the past and being ignored. But, at least in that case, re-recording
good information with the right timestamp will fix the problem.
Consider making git-annex ignore future timestamps
(with some amount of allowance for minor clock skew). There are two
problems, one is that currently valid information gets ignored, until it's
able to be re-recorded. The second is that when the timestamp slips
into the past, the old, invalid information suddenly starts being taken
into account.
---
A better idea: When writing new information, check if the old
value for the log has a timestamp `>=` current timestamp. If so, don't use the
current timestamp for the new information, instead increment the old
timestamp. So when there's clock skew (forwards or backwards), this makes
it fall back, effectively to vector clocks.
This would work for both kinds of logs. For configuration changes,
it's kind of better than using only vector clocks, because in the absence
of clock skew, the most recent change to a configuration wins. For state
changes, it keeps the benefits of timestamps except when there's clock
skew, in which case there are not any benefits of timestamps anymore
so vector clocks is the best that can be done. --[[Joey]]
(How would `GIT_ANNEX_VECTOR_CLOCK` interact with this? Maybe, when that's
set to a low number, it would be treated as the current time. So this would
let it be used and not, without issues, and also would let it be set to a
low number once, and not need to be changed, since git-annex would
increment as necessary.)
> The `vectorclock` branch has this mostly implemented. --[[Joey]]
> > [[done]] --[[Joey]]

@@ -1,44 +0,0 @@
`git annex whereused` would report where in the git repository a
key is used, as a complement to `git-annex unused`.
Use cases include users not getting confused about why git-annex unused
says a key is used.
Also, it could scan through history to find where a key *was* used.
git-annex unused outputs a suggestion to use a rather hairy `git log -S`
command to do that currently.
If it does both these things, it could explain why git-annex unused
considers a key used despite a previous git rev referring to it. Eg:

    # git annex whereused SHA1--foo
    checking index... unused
    checking branches... unused
    checking tags... unused
    checking history... last used in master^40:somefile
    checking reflog... last used in HEAD@{30}:somefile

--[[Joey]]
> First pass is a keys db lookup to filenames.
>
> The historical pass can be done fairly efficiently by using
> `git log -Skey --exclude=*/git-annex --glob=* --exclude=git-annex --remotes=* --tags=* --pretty='%H' --raw`
> and fall back to `git log -Skey --walk-reflogs --pretty='%gd' --raw` if nothing was found.
>
> That makes git log check all commits reachable from those refs,
> probably as efficiently as possible, and stop after one match.
> It won't allow quite as nice a display as above.
>
> Parse the log output for commit sha and filename. Double-check
> by catting the file's object and making sure it parses as an annex
> link or pointer.
>
> Then use `git describe --contains --all` to get a description of the commit
> sha, which will be something like "master~2" or "origin/master~2",
> and add ":filename" to get the ref to output.
>
> Or, if it was found in the ref log, take the "HEAD@{n}" from log
> output, and add ":filename"
[[done]] --[[Joey]]
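
A usage sketch of the implemented command; the option spellings here are my
best understanding of its man page, and the key is hypothetical:

    # where is this key used in the current branch?
    git annex whereused --key=SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
    # also search old versions of branches and the reflog for unused keys
    git annex whereused --unused --historical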

@@ -1,8 +0,0 @@
From [another thread](https://git-annex.branchable.com/todo/add_option_to_use_sqlite__39__s_synchronous__61__OFF/#comment-dbc9fdf5fd6d73f3e628bfe94b2a43a2):
>Quite possible there are situations where it fails to recover the lost information and does something annoying. But like I said, such situations can already happen
Maybe, there are some simple ways to harden git-annex against possible weirdness following abrupt interruptions? E.g. using flag files to detect when a prior operation got interrupted,
and rebuilding the sqlite dbs from git data. Or tagging sqlite records with the timestamp of their creation, and not using the data if the relevant worktree files got modified since then.
> [[closing|done]] per comment --[[Joey]]

@@ -1,10 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-06-25T16:06:59Z"
content="""
I could make that statement about basically any program. I think git-annex
deals with interruptions well. It is written with idempotency in mind. I
interrupt it all the time. It always behaves well. That is not a proof that
there is not some unforeseen situation where I have made a mistake.
"""]]

@@ -1,15 +0,0 @@
If a tree containing a non-annexed file (checked directly into git) is exported,
and then an import is done from the remote, the new tree will have that
file annexed, and so merging it converts to annexed (there is no merge
conflict).
If the user is using annex.largefiles to configure or list
the non-annexed files, they'll be ok, but otherwise they'll be in for some
pain.
The importer could check for each file, if there's a corresponding file in
the branch it's generating the import for, if that file is annexed.
This corresponds to how git-annex add (and the smudge filter) handles these
files. But this might be slow when importing a large tree of files.
> [[fixed|done]] --[[Joey]]

@@ -1,10 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-03-05T16:38:25Z"
content="""
This leads to worse behavior than just converting to annexed from
non-annexed. The converted file's contents don't verify due to some
confusion between git and git-annex's use of SHA1. See
<https://git-annex.branchable.com/forum/__96__git_annex_import__96___from_directory_loses_contents__63__/>
"""]]

@@ -1,30 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2021-03-05T16:42:03Z"
content="""
> The importer could check for each file, if there's a corresponding file in the branch it's generating the import for, if that file is annexed.
Should it check the branch it's generating the import for though?
If the non-annexed file is "foo" and master is exported, then in master
that file is renamed to "bar", the import should not look at the new master
to see if the "foo" from the remote should be annexed. The correct tree
to consult would be the tree that was exported to the remote last.
It seems reasonable to look at the file in that exported tree to see if it was
non-annexed before, and if the ContentIdentifier is the same as what
was exported before, keep it non-annexed on import. If the ContentIdentifier
has changed, apply annex.largefiles to decide whether or not to annex it.
The export database stores information about that tree already,
but it does not keep track of whether a file was exported annexed or not.
So changing the database to include an indication of that, and using it
when importing, seems like a way to solve this problem, and without slowing
things down much.
*Alternatively* the GitKey that git-annex uses for these files when
exporting is represented as a SHA1 key with no size field. That's unusual;
nothing else normally creates such a key. (Although some advanced users may
for some reason.) Just treating such keys as non-annexed files when
importing would be at least a bandaid if not a real fix.
"""]]

@@ -1,14 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2021-03-05T17:31:32Z"
content="""
Wait... The import code has a separate "GIT" key type that it uses
internally once it's decided a file should be non-annexed. Currently
that never hits disk. Using that rather than a SHA1 key for the export
database could be a solution.
(Using that rather than "SHA1" for the keys would also avoid
the problem that the current GitKey hardcodes an assumption
that git uses sha1..)
"""]]

View file

@ -1,32 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2021-03-05T17:44:54Z"
content="""
In fact, a very simple patch that just makes a GitKey generate a
"GIT" key seems to have solved this problem! Files that were non-annexed
on export remain so on import, until they're changed, and then
annex.largefiles controls what happens.
Once non-annexed files have been exported using the new version, they'll
stay non-annexed on import. Even when an old version of git-annex is doing
the importing!
When an old git-annex version had done the export, and a new one imports,
what happens is that the file gets imported as an annexed file. Exporting
first with the new version avoids that unwanted conversion.
Interestingly though, when that conversion happens the annexed file does
not use the SHA1 key from git, so its content can still be retrieved. I'm not
quite sure how that problem was avoided in this case, but something avoided
the worst behavior.
It would be possible to special case the handling of SHA1 keys without a
size to make importing from an old export not do the conversion. But that
risks breakage for some user who is generating their own SHA1 keys and not
including a size in them. Or for some external special remote that supports
IMPORTKEY and generates SHA1 keys without a size. It seems better to avoid
that potential breakage of unrelated things, and keep the upgrade process
somewhat complicated when non-annexed files were exported before, than to
streamline the upgrade.
"""]]

View file

@ -1,30 +0,0 @@
When a FAT filesystem is unmounted and remounted, the inode numbers all
change. This makes import tree from a directory special remote on FAT
think the files have changed, and so it re-imports them. Since the content
is unchanged, the unnecessary work that is done is limited to hashing
the file on the FAT filesystem. But that can be a lot of work when the tree
being imported has a lot of large files in it.
This makes import tree potentially much slower than the legacy import
interface (although that interface also re-hashes when used with
--duplicate/--skip-duplicates).
Also, the content identifier log gets another entry, with a content
identifier with the new inode number. So over time this can bloat the log.
It may be better to omit the inode number from the content
identifier for such a filesystem, instead relying on size and mtime?
Although that would risk missing swaps of files with the same size and
mtime, that seems like an unlikely thing, and in any case git-annex would
import the data, and only miss the renaming of the files. It would also
miss modifications that don't change size and preserve the mtime; such
modifications are theoretically possible, but unlikely.
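As a rough illustration of the fields involved (GNU stat assumed), the inode
column is the one that changes on FAT across a remount, while size and mtime
stay stable:

    # print inode, size in bytes, and mtime (epoch seconds) for each file
    stat -c '%i %s %Y' /mnt/fatdrive/*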
But how to detect when it's a FAT filesystem with this problem?
The method git-annex uses when running on a FAT filesystem, of maintaining
an inode sentinel file and checking it to tell when inodes have changed,
would need importing to write to the drive. That seems strange, and the
drive could even be read-only. Maybe the directory special remote should
just not use inode numbers at all?
> [[done]] --[[Joey]]

View file

@ -1,12 +0,0 @@
[glacier-cli](https://github.com/basak/glacier-cli) calls its own command `glacier` rather than `glacier-cli` or something else. This conflicts with [boto](https://github.com/boto/boto/)'s own `glacier` executable, as noted here:
* <https://github.com/basak/glacier-cli/issues/30>
* <https://github.com/basak/glacier-cli/issues/47>
Whilst the `glacier-cli` project should resolve this conflict, it would be good if git-annex could be made to use a configurable path for this executable, rather than just assuming that it has been installed as `glacier`. After all, its installation procedure is simply telling the user to run `ln -s`, so there's no reason why the user couldn't make the target of this command `~/bin/glacier-cli` rather than `~/bin/glacier` - it's really irrelevant what the source file inside the git repo is called.
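For instance (paths hypothetical), nothing stops the user from picking a
non-conflicting name at installation time:

    # install glacier-cli's script under a name that doesn't clash with boto's
    ln -s ~/src/glacier-cli/glacier.py ~/bin/glacier-cli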
Of course, [`checkSaneGlacierCommand`](https://github.com/joeyh/git-annex/blob/master/Remote/Glacier.hs#L307) is still very much worth having, for safety.
> Well, it never got renamed, and checkSaneGlacierCommand does check for
> the conflict, so I don't see any point in making the name configurable.
> [[done]] --[[Joey]]

View file

@ -1,7 +0,0 @@
[[!comment format=mdwn
username="basak"
subject="comment 1"
date="2015-04-24T15:48:48Z"
content="""
Well, it's supposed to be a command line command, and I don't type `cd-cli` and `ls-cli`. So while `glacier-cli` might be fine as a project name and is fine for a name for integration, I don't think it makes sense to call it that in `/usr/bin/`, which is why I didn't. I'd have preferred to see boto integrate an improved `glacier` command, or for packaging to provide this one as an alternative (like `mawk` vs. `gawk` as `/usr/bin/awk`). But upstream boto considers themselves deprecated, so that's not going to happen. One of these days I'll package glacier-cli up for Debian, at which point I'll see if the boto maintainer is interested in doing something, since I don't actually believe anybody uses boto's glacier command (since it's mostly useless).
"""]]

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="https://adamspiers.wordpress.com/"
nickname="adamspiers"
subject="Good point"
date="2015-04-24T15:55:29Z"
content="""
glacier-cli would be a rather silly name to put in `/usr/bin`. How about `glcr`, as suggested [here](https://github.com/basak/glacier-cli/issues/30#issuecomment-95972840)?
"""]]

View file

@ -1,10 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2015-04-24T17:23:10Z"
content="""
I don't want to complicate git-annex more with configurable names for
programs, and glacier is not at all special in this regard; any program
could be installed under any name. We pick non-conflicting names to
avoid integration nightmares. Pick a name and I'll use it.
"""]]

View file

@ -1,26 +0,0 @@
Like was recently done for preferred content, when checking numcopies for a
drop, it could check if other files are using the same key, and if so check
that their numcopies (and mincopies) is satisfied as well.
There would be an efficiency tradeoff of course, since it would have to
query the keys db. The question, I suppose, is: if someone sets different
numcopies for different files via .gitattributes, and those files use the same
key, will the user think it's a problem that numcopies can be violated in
some circumstances? I think users might well consider that a problem,
if they happened to encounter the behavior.
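As a sketch of the scenario, two paths with identical content (and thus the
same key) can carry different numcopies settings:

    # .gitattributes
    scratch/* annex.numcopies=1
    keep/* annex.numcopies=3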
It may also be worth considering making --all (etc) also check numcopies of
associated files. Although then, in a bare repo, it would behave
differently than in a non-bare repo. (Also if this is done, the preferred
content checking should also behave the same way.) The docs for --all
do say that it bypasses checking .gitattributes numcopies.
--[[Joey]]
> Note that the assistant and git-annex sync already check numcopies
> for all known associated files, so already handled this for unlocked
> files. With the recent change to also track
> associated files for locked files, they also handle it for those.
>
> But, git-annex drop/move/mirror don't yet.
>
> > [[fixed|done]] (did not change --all behavior) --[[Joey]]

View file

@ -1,3 +0,0 @@
Right now, non-annexed files get passed through the `annex` clean/smudge filter (see [[forum/Adding_files_to_git__58___Very_long___34__recording_state_in_git__34___phase]]). It would be better if `git-annex` configured the filter only for the annexed unlocked files, in the `.gitattributes` file at the root of the repository.
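Roughly, the idea is to go from the blanket rule git-annex conventionally
installs to per-file entries (a sketch; the listed path is hypothetical):

    # current: every file is passed through the filter
    * filter=annex
    # proposed: only annexed unlocked files would be listed
    photos/IMG_0001.jpg filter=annex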
> not a viable solution, [[done]] --[[Joey]]

View file

@ -1,19 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2019-11-22T16:01:26Z"
content="""
It immediately occurs to me that the proposal would break this:
    git annex add foo
    git annex add bar
    git annex unlock bar
    git mv bar foo
    git commit -m add
Since foo was a locked file, gitattributes would prevent from being
smudged, so the large content that was in bar gets committed directly to git.
The right solution is to improve the smudge/clean filter interface so it's
not so slow, which there is copious discussion of elsewhere.
"""]]

View file

@ -1,46 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="moving unlocked file onto locked file isn't possible"
date="2019-11-24T16:36:24Z"
content="""
`git mv` won't move an unlocked file onto a locked file (trace below).
\"The right solution is to improve the smudge/clean filter interface\" -- of course, but realistically, do you think git devs can be persuaded to do [[this|todo/git_smudge_clean_interface_suboptiomal]] sometime soon? Even if yes, it still seems better to avoid adding a step to common git workflows, than to make the step fast.
[[!format sh \"\"\"
(master_env_v164_py36) 11:14 [t1] $ ls
bar foo
(master_env_v164_py36) 11:14 [t1] $ git init
Initialized empty Git repository in /tmp/t1/.git/
(master_env_v164_py36) 11:14 [t1] $ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ git annex add foo
add foo ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ git annex add bar
add bar ok
(recording state in git...)
(master_env_v164_py36) 11:14 [t1] $ ls -alt
total 0
drwxrwxr-x 8 ilya ilya 141 Nov 24 11:14 .git
drwxrwxr-x 3 ilya ilya 40 Nov 24 11:14 .
lrwxrwxrwx 1 ilya ilya 108 Nov 24 11:14 bar -> .git/annex/objects/jx/MV/MD5E-s4--c157a79031e1c40f85931829bc5fc552/MD5E-s4--c157a79031\
e1c40f85931829bc5fc552
lrwxrwxrwx 1 ilya ilya 108 Nov 24 11:14 foo -> .git/annex/objects/00/zZ/MD5E-s4--d3b07384d113edec49eaa6238ad5ff00/MD5E-s4--d3b07384d1\
13edec49eaa6238ad5ff00
drwxrwxrwt 12 root root 282 Nov 24 11:14 ..
(master_env_v164_py36) 11:14 [t1] $ git annex unlock bar
unlock bar ok
(recording state in git...)
(master_env_v164_py36) 11:16 [t1] $ git mv bar foo
fatal: destination exists, source=bar, destination=foo
(master_env_v164_py36) 11:17 [t1] $
\"\"\"]]
"""]]

View file

@ -1,126 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="even git mv -f seems to work correctly"
date="2019-11-24T17:25:32Z"
content="""
Also, `git mv` seems to reuse the already-smudged object contents of the source file for the target file, so even with `git mv -f` only the checksum gets checked into git:
[[!format sh \"\"\"
+ cat ./test-git-mv
#!/bin/bash
set -eu -o pipefail -x
cat $0
TEST_DIR=/tmp/test_dir
mkdir -p $TEST_DIR
chmod -R u+w $TEST_DIR
rm -rf $TEST_DIR
mkdir -p $TEST_DIR
pushd $TEST_DIR
git init
git annex init
git --version
git annex version
rm .git/info/attributes
echo foo > foo
echo bar > bar
git annex add foo bar
git check-attr -a foo
git check-attr -a bar
echo 'bar filter=annex' > .gitattributes
git add .gitattributes
git check-attr -a foo
git check-attr -a bar
git annex unlock bar
git mv bar foo || true
git mv -f bar foo
git commit -m add
git log -p
+ TEST_DIR=/tmp/test_dir
+ mkdir -p /tmp/test_dir
+ chmod -R u+w /tmp/test_dir
+ rm -rf /tmp/test_dir
+ mkdir -p /tmp/test_dir
+ pushd /tmp/test_dir
/tmp/test_dir /tmp
+ git init
Initialized empty Git repository in /tmp/test_dir/.git/
+ git annex init
init (scanning for unlocked files...)
ok
(recording state in git...)
+ git --version
git version 2.20.1
+ git annex version
git-annex version: 7.20191024-g6dc2272
build flags: Assistant Webapp Pairing S3 WebDAV Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.21.1 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.1.0 ghc-8.6.5 http-client-0.5.14 persistent-sqlite-2.9.3 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs hook external
operating system: linux x86_64
supported repository versions: 7
upgrade supported from repository versions: 0 1 2 3 4 5 6
local repository version: 7
+ rm .git/info/attributes
+ echo foo
+ echo bar
+ git annex add foo bar
add foo ok
add bar ok
(recording state in git...)
+ git check-attr -a foo
+ git check-attr -a bar
+ echo 'bar filter=annex'
+ git add .gitattributes
+ git check-attr -a foo
+ git check-attr -a bar
bar: filter: annex
+ git annex unlock bar
unlock bar ok
(recording state in git...)
+ git mv bar foo
fatal: destination exists, source=bar, destination=foo
+ true
+ git mv -f bar foo
+ git commit -m add
[master (root-commit) 8610c0d] add
2 files changed, 2 insertions(+)
create mode 100644 .gitattributes
create mode 100644 foo
+ git log -p
commit 8610c0d8f327140608e71dc229f167731552d284
Author: Ilya Shlyakhter <ilya_shl@alum.mit.edu>
Date: Sun Nov 24 12:24:28 2019 -0500
add
diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..649f07e
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1 @@
+bar filter=annex
diff --git a/foo b/foo
new file mode 100644
index 0000000..266ae50
--- /dev/null
+++ b/foo
@@ -0,0 +1 @@
+/annex/objects/MD5E-s4--c157a79031e1c40f85931829bc5fc552
\"\"\"]]
"""]]

View file

@ -1,10 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="installing clean/smudge filter lazily"
date="2021-03-19T02:30:13Z"
content="""
\"the proposal would break this\" -- suppose [[`git-annex-unlock`|git-annex-unlock]] was changed to install the clean/smudge filter for `*` if not installed yet?
Related: [Avoid lengthy \"Scanning for unlocked files ...\"](https://git-annex.branchable.com/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/)
"""]]

View file

@ -1,10 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 5"""
date="2021-03-23T16:02:47Z"
content="""
> "the proposal would break this" -- suppose git-annex-unlock was changed to install the clean/smudge filter for * if not installed yet?
git-annex unlock is not the only way unlocked files can appear in your
tree. Consider git pull.
"""]]

View file

@ -1,28 +0,0 @@
Some things to do with the [[design/P2P_protocol]]
are works in progress, needing a future flag day to complete.
## VERSION over tor
Old versions of git-annex, before 6.20180312, which speak the P2P protocol
over tor, don't support VERSION, and attempting to negotiate a version
will cause the server to hang up the connection. To deal with this
historical bug, the version is not currently negotiated when using the
protocol over tor. At some point in the future, when all peers can be
assumed to be upgraded, this should be changed.
> [[done]] --[[Joey]]
## git-annex-shell fallbacks
When using git-annex-shell p2pio, git-annex assumes that if it exits 1,
it does not support that, and falls back to the old sendkey/recvkey,
etc.
At some point in the future, once all git-annex and git-annex-shell
can be assumed to be upgraded to 6.20180312, this fallback can be removed.
It will allow removing a lot of code from git-annex-shell and a lot of
fallback code from Remote.Git.
> [[done]] --[[Joey]]
[[!tag confirmed]]

View file

@ -1,12 +0,0 @@
git-annex has good support for running commands in parallel, but there
are still some things that could be improved, tracked here:
* Maybe support -Jn in more commands. Just needs changing a few lines of code
and testing each.
* Maybe extend --jobs/annex.jobs for more control. `--jobs=cpus` is already
supported; it might be good to have `--jobs=cpus-1` to leave a spare
cpu to avoid contention, or `--jobs=remotes*2` to run 2 jobs per remote
(see the example after this list).
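For context, the existing knob looks like this (options shown are real; the
`cpus-1` and `remotes*2` forms above are the hypothetical extensions):

    # explicit job count
    git annex get -J4
    # one job per CPU core, already supported
    git annex get --jobs=cpus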
> Ok, those are maybe good ideas, but this needs to be closed at some
> point, so [[done]] --[[Joey]]

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="parallelization"
date="2019-11-27T17:23:14Z"
content="""
When operating on many files, maybe run N parallel commands where i'th command ignores paths for which `(hash(filename) modulo N) != i`. Or, if git index has size I, i'th command ignores paths that are not lexicographically between `index[(I/N)*i]` and `index[(I/N)*(i+1)]` (for index state at command start). Extending [[git-annex-matching-options]] with `--block=i` would let this be done using `xargs`.
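A rough shell sketch of the hash-partitioning idea (GNU xargs and cksum
assumed; this approximates it outside git-annex rather than via a real
`--block` option):

    # split pending work into N buckets by hashing each path, one worker per bucket
    N=4
    for i in $(seq 0 $((N-1))); do : > batch.$i; done
    git annex find --not --in=here | while read -r f; do
        h=$(printf '%s' "$f" | cksum | cut -d' ' -f1)
        printf '%s\n' "$f" >> batch.$((h % N))
    done
    for i in $(seq 0 $((N-1))); do
        xargs -a batch.$i -d '\n' git annex get -- &
    done
    wait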
"""]]

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2020-01-30T19:24:47Z"
content="""
How would running parallel commands with xargs be better than the current
-J?
"""]]

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="running parallel commands with xargs"
date="2020-02-20T20:48:33Z"
content="""
\"How would running parallel commands with xargs be better than the current -J\" -- it would allow writing wrappers that kill/retry stuck git-annex process trees, as suggested [[here|https://git-annex.branchable.com/todo/more_extensive_retries_to_mask_transient_failures/#comment-209f8a8c38e63fb3a704e1282cb269c7]].
"""]]

View file

@ -1,28 +0,0 @@
Hello,
By means of bisection I have determined that commit 4bf7940d6b912fbf692b268f621ebd41ed871125, recently uploaded to Debian after the bullseye freeze, is responsible for breaking the annex-to-annex-reinject script which ships with Git::Annex. Here is a minimal reproducer of the problem:
    spwhitton@melete:~/tmp>echo foo >bar
    spwhitton@melete:~/tmp>mkdir annex
    spwhitton@melete:~/tmp>cp bar annex
    spwhitton@melete:~/tmp>cd annex
    spwhitton@melete:~/tmp/annex>git init
    spwhitton@melete:~/tmp/annex>git annex add bar
    spwhitton@melete:~/tmp/annex>git annex drop --force bar
    spwhitton@melete:~/tmp/annex>git annex reinject --known /home/spwhitton/tmp/bar
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    fatal: './../bar' is outside repository at '/home/spwhitton/tmp/annex'
    git-annex: fd:15: Data.ByteString.hGetLine: end of file
--spwhitton
> [[fixed|done]] --[[Joey]]

View file

@ -1,11 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-10-01T16:42:36Z"
content="""
Also happens with a relative path to the file. And also
`git annex reinject ../bar bar` fails the same way.
Fixed. In case you want to cherry-pick the fix, it's the commit adding
this comment, as well as the 2 prior commits fixing bugs in dirContains.
"""]]

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="spwhitton"
avatar="http://cdn.libravatar.org/avatar/9c3f08f80e67733fd506c353239569eb"
subject="comment 2"
date="2021-10-02T17:04:02Z"
content="""
Thanks so much for the fix! It looks like cherry-picking breaks the test suite, so I'll probably just wait for the next release.
"""]]

View file

@ -1,3 +0,0 @@
Small (non-annexed) files might also be used for performance reasons, so git-annex-sync should have an option to automatically resolve merge conflicts in small files too.
> [[wontfix|done]] --[[Joey]]

View file

@ -1,27 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-06-23T16:43:34Z"
content="""
There's a reason people don't want git to automatically resolve merge
conflicts of code, and for all git-annex knows small files are code.
Or looking at it from the other perspective, non-technical git-annex
assistant users need an automatic merge conflict resolution of annexed
files, since the assistant commits changes to those files and otherwise
they could end up with a conflict they don't understand how to resolve.
And, git-annex sync inherited that from the assistant. Which may or may not
have been the best decision. One thing in favor of it being a reasonable
decision is that a conflict in an annexed file will mostly be resolved by
picking one version of the file or the other, unlike conflicts in source
code which are often resolved by using brain sweat. Large and often binary
files not being very possible for human brains to deal with directly. Or
perhaps by a tool that combines the two versions in some way, in which case
the conflict resolution leaves both versions easily accessible for such a
tool.
So git-annex does know, or can make some reasonable assumptions, about
annexed files, but generalizing those assumptions to small files would not
make sense.
"""]]

View file

@ -1,10 +0,0 @@
[[!comment format=mdwn
username="Lukey"
avatar="http://cdn.libravatar.org/avatar/c7c08e2efd29c692cc017c4a4ca3406b"
subject="comment 2"
date="2021-06-24T17:43:36Z"
content="""
The idea is to solve the conflicts in a similar way to conflicts in annexed files. I.e. by creating two files file.version-a and file.version-b.
"""]]

View file

@ -1,16 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="resolving merge conflicts"
date="2021-06-24T18:03:40Z"
content="""
\"Small files\" here means \"non-annexed files\", right?
Whether a file is annexed, and whether its merge conflicts should be auto-resolved by creating two files `file.version-a` and `file.version-b`, seem like orthogonal things.
One might check small binary files directly into git, and one might annex source code files e.g. just for the simplicity of annexing everything (as [[DataLad|projects/datalad]] does or at least used to).
So, maybe, `.gitattributes` should control which files' merge conflicts get auto-resolved?
"""]]

View file

@ -1,9 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2021-06-25T16:12:02Z"
content="""
I have no interest in taking a feature added for the assistant down the
road of making source code files that are checked into git be handled in
some other way when merging.
"""]]

View file

@ -1,9 +0,0 @@
It'd be very useful if you could specify a size limit for drop/move/copy/get-type operations. `git annex move --to other --limit 1G` would move at most 1G of data to the other repo for example.
This way you could quickly "garbage collect" a few dozen GiB from your annex repo when you're running out of space without dropping everything for example.
Another issue this could be used to mitigate is that, for some reason, git-annex doesn't properly auto-stop the transfer when the repos on my external drives are full.
I imagine there are many more use-cases where quickly being able to set a limit for the amount of data a command should act on could come in handy.
> [[done]] --[[Joey]]

View file

@ -1,34 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-06-04T18:07:44Z"
content="""
I agree this could be useful.
Implementation is complicated by it needing to only count the size when a
file is acted on. Eg `git annex get` shouldn't stop when it's seen enough
files that already have content present.
So it seems it would need to be implemented next to where showStartMessage
is used in commandAction, looking at the size of the key in the
StartMessage (or possibly file when there's no key?) and when it would go
over the limit, rather than proceeding to perform the action it could skip
doing anything and go on to the next file.
I don't think there is a good way to make it immediately exit
when it reaches the limit, so if there were subsequent smaller files
after a skipped file that could be processed still, it still would.
It would probably also make sense to make it later exit with 101 like
--time-limit does, or another special exit code, to indicate it didn't
process everything.
Hmm, if an action fails, should the size of the file be counted or not?
If failures are not counted, incomplete transfers could result in a
lot more work/disk space than desired. But if failures are counted,
then after failing to drop a bunch of files, or failing early on to get a
bunch of files, it could stop seemingly prematurely. Also there's a problem with
concurrency, if it needs to know the result of running jobs before deciding
whether to start a new job. There seems to be no entirely good answer here, but the
concurrency problem seems only solvable by updating the count at start time.
"""]]

View file

@ -1,14 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 2"""
date="2021-06-04T20:35:26Z"
content="""
--size-limit is implemented, for most git-annex commands.
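Usage looks like this (size syntax per git-annex's usual conventions; hedged):

    # move at most about 1 gigabyte of annexed data to the other repo;
    # a special exit code indicates the limit stopped it before finishing
    git annex move --to other --size-limit=1gb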
Ones like `git-annex add` that don't operate on annexed files don't support
it, at least yet.
Ones like git-annex export/import/sync I'm not sure it makes sense to
support it, since they kind of operate at a higher level than individual
files.
"""]]

View file

@ -1,18 +0,0 @@
When adding a lot of small files to git with `git annex add`,
it is slow because git runs the smudge filter on all files
and [[that_is_slow|todo/git_smudge_clean_interface_suboptiomal]].
But `git-annex add --force-small` is much much faster, because that
bypasses git add entirely, hashing the content and staging it in the index
from git-annex. So could that same method be used to speed up the slow case?
My concern with doing this is that there may be things that `git add`
does that are not done when bypassing it. The only one I can think of is
that, if the user has other smudge/clean filters than the git-annex one
installed, they would not be run either. It could be argued that's a bug
with the existing `--force-small` too, but at least that's not the default.
Possible alternate approach: Unsetting filter.annex.smudge and
filter.annex.clean when running `git add`?
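In shell terms the alternate approach amounts to roughly this (a sketch using
`cat` as an explicit pass-through; the real implementation may neutralize the
filter differently):

    # run git add with the annex filter made a no-op for this invocation
    git -c filter.annex.smudge=cat -c filter.annex.clean=cat add -- smallfile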
> This approach is a winner! [[done]] --[[Joey]]

View file

@ -1,19 +0,0 @@
reconcileStaged should be able to be sped up by improving streaming through
git, similar to [[!commit 0f54e5e0ae73b89bb6743bf298915619da00c3f4]].
Normally it's plenty fast enough, but users who often switch between
branches that have tens to hundreds of thousands of diverged files will
find it slow, and this should speed it up by somewhere around 3x (excluding
sqlite writes). --[[Joey]]
> Implemented this. Benchmarked it in a situation where 100,000 annexed
> files were added to the index (by checking out a branch with more annexed
> files). old: 50 seconds; new: 41 seconds
> Also benchmarked when 100,000 annexed files were removed from the index.
> old: 26 seconds; new: 17 seconds.
>
> Adding associated files to the sqlite db is clearly more expensive than
> removing from it.
>
> [[done]] --[[Joey]]

View file

@ -1,14 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="keys db optimization"
date="2021-06-02T16:53:02Z"
content="""
\"users who often switch between branches that have tens to hundreds of thousands of diverged files will find it slow\" -- that's my use case ;) Could one keys-to-files db be kept per branch?
Maybe, the keys db could be split, based e.g. on prefix of md5 of the key, into separate sqlite files, and the writing to them parallelized?
It's common to be working on a many-core machine.
Is the keys-to-locked-files db used for anything besides detecting keys used by more than one file? For that one purpose there might be faster solutions.
But, if it's implemented, maybe it could also be used to remove the [[limitation|git-annex-preferred-content]] that \"when a command is run with the --all option, or in a bare repository, there is no filename associated with an annexed object, and so \"include=\" and \"exclude=\" will not match\"?
"""]]

View file

@ -1,9 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="matching include/exclude based on file extension in the key"
date="2021-06-02T17:02:58Z"
content="""
Actually, the include/exclude limitation above could be removed by just looking at the keys themselves, if the include/exclude expression is of the form `*.ext` and the keys include file extensions.
"""]]

View file

@ -1,12 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 3"""
date="2021-06-04T17:45:21Z"
content="""
It is not very useful to detect if a key is used by more than one file if
you don't know the files. In any case, yes, the keys db is used for a large
number of things, when it comes to unlocked files.
[[todo/numcopies_check_other_files_using_same_key]] has some thoughts on
--all, but I doubt it will make sense to change --all.
"""]]

View file

@ -1,14 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 4"""
date="2021-06-04T17:49:43Z"
content="""
Keys with extensions do not necessarily have the same extension as used in
the worktree files that include/exclude match on.
I'm not sure why all these wild ideas are being thrown out there when this
todo is about a specific, simple improvement that will speed up the git
part of the scanning by about 3x? It's like you somehow consider this an
emergency where increasingly wild measures have to be taken to prevent me
from making a terrible mistake?
"""]]

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="Ilya_Shlyakhter"
avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
subject="&quot;why all these wild ideas are being thrown out there&quot;"
date="2021-06-04T22:15:32Z"
content="""
It just seemed like all the speedup possibilities from `annex.supportunlocked=false` are getting undone to optimize a not-too-common scenario?
"""]]

View file

@ -1,9 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 6"""
date="2021-06-07T15:48:07Z"
content="""
annex.supportunlocked=false still prevents the smudge/clean filter from
being used, which can significantly speed up git if the repository has a
lot of files stored in git.
"""]]

View file

@ -1,3 +0,0 @@
For commands like [[`git-annex-whereis`|git-annex-whereis]] that take a `path` argument, it would help if this could be generalized to taking a [tree-ish](https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddeftree-ishatree-ishalsotreeish). E.g. for `git-annex-whereis` this could be used to look up where previous file versions are stored.
> [[done]] before this was filed

View file

@ -1,8 +0,0 @@
[[!comment format=mdwn
username="joey"
subject="""comment 1"""
date="2021-05-12T16:23:02Z"
content="""
This is already supported by whereis (and quite a lot of other commands)
with the --branch option, which is documented to support a treeish.
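For example:

    # list locations for annexed files as they were two commits ago
    git annex whereis --branch=HEAD~2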
"""]]

View file

@ -1,9 +0,0 @@
Based on an irc conversation earlier today:
    19:50 < warp> joeyh: what is the best way to figure out the (remote) filename for a file stored in an rsync remote?
    20:43 < joeyh> warp: re your other question, probably the best thing would be to make the whereis command print out locations for each remote, as it always does for the web special remotes
> Several remotes do now populate whereis with urls, but an rsync remote
> does not in general have http urls to content in it. So I don't think
> it makes sense to do anything for rsync remotes. [[closing|done]] --[[Joey]]