Commit graph

263 commits

Author SHA1 Message Date
Joey Hess
5cc8d9d03b
replace removeLink with removeFile
removeFile calls unlink so removes anything not a directory. So these
are replaceable in order to convert to OsPath.
2025-02-02 14:16:58 -04:00
Joey Hess
71195cce13
more OsPath conversion
Sponsored-by: k0ld
2025-02-01 14:06:38 -04:00
Joey Hess
474cf3bc8b
more OsPath conversion
Sponsored-by: Brock Spratlen
2025-02-01 11:54:19 -04:00
Joey Hess
9b79f0f43d
use file-io for readFile/writeFile/appendFile on ByteStrings
These are all straightforward, and easy small performance wins.

Sponsored-by: Nicholas Golder-Manning
2025-01-22 14:30:25 -04:00
Joey Hess
793ddecd4b
use openTempFile from file-io
And follow-on changes.

Note that relatedTemplate was changed to operate on a RawFilePath, and
so when it counts the length, it is now the number of bytes, not the
number of code points. This will just make it truncate shorter strings
in some cases, the truncation is still unicode aware.

When not building with the OsPath flag, toOsPath . fromRawFilePath and
fromRawFilePath . fromOsPath do extra conversions back and forth between
String and ByteString. That overhead could be avoided, but that's the
non-optimised build mode, so didn't bother.

Sponsored-by: unqueued
2025-01-22 11:41:43 -04:00
Joey Hess
c7cca43ab0
RawFilePath conversion for Utility.Directory.Stream 2025-01-20 19:25:52 -04:00
Joey Hess
1ceece3108
RawFilePath conversion of System.Directory
By using System.Directory.OsPath, which takes and returns OsString,
which is a ShortByteString. So, things like dirContents currently have the
overhead of copying that to a ByteString, but that should be less than
the overhead of using Strings which often in turn were converted to
RawFilePaths.

Added Utility.OsString and the OsString build flag. That flag is turned
on in the stack.yaml, and will be turned on automatically by cabal when
built with new enough libraries. The stack.yaml change is a bit ugly,
and that could be reverted for now if it causes any problems.

Note that Utility.OsString.toOsString on windows is avoiding only a
check of encoding that is documented as being unlikely to fail. I don't
think it can fail in git-annex; if it could, git-annex didn't contain
such an encoding check before, so at worst that should be a wash.
2025-01-20 19:17:33 -04:00
Joey Hess
a73fa77417
added hooks corresponding to annex.*-command
* Added freezecontent-annex and thawcontent-annex hooks that
  correspond to the git configs annex.freezecontent and
  annex.thawcontent.
* Added secure-erase-annex hook that corresponds to the git config
  annex.secure-erase-command.
* Added commitmessage-annex hook that corresponds to the git config
  annex.commitmessage-command.
* Added http-headers-annex hook that corresponds to the git config
  annex.http-headers-command.
  that correspond to the post-update-annex and pre-commit-annex hooks.

The use case for these is eg, setting up a git repository that is run in a
container, where the easiest way to provide a script is by putting it in
.git/hooks/, rather than copying it into the container in a way that puts
it in PATH.

This is all the ones that make sense to add for annex.*-config git configs.
annex.youtube-dl-command is not a hook, it's telling git-annex what command
to run. So is annex.shared-sop-command. So omitted those.

May later also want to add hooks corresponding to
`remote.<name>.annex-cost-command` etc.

Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2025-01-10 14:54:42 -04:00
Joey Hess
5df1b2b36e
configs annex.post-update-command and annex.pre-commit-command
Added git configs annex.post-update-command and annex.pre-commit-command
that correspond to the git-annex hook scripts post-update-annex and
pre-commit-annex.

Note that the hook files take precience over the git config, since the git
config can includ global config which should be overridden by local config.

These new git configs are probably not super useful. Especially the
pre-commit-annex hook is there to install scripts to instead of the
pre-commit hook, since git-annex installs that hook itself. So why would
someone want to use a git config for that? Only reason I can think of would
be in a global git config. Or possibly because it's easier to set a git
config than write a hook script, on an OS like Windows.

The real reason I'm adding these is as groundwork for making other
annex.*-command git configs also be available as hook scripts. I want
to avoid having some things available as only git hooks and others as
both gitconfigs and git hooks. (It seems that some annex.*-command configs
don't translate to git hooks though.)

In the man page, moved documentation of the hooks to be next to the
documentation of the git configs. This is to avoid repitition.
2025-01-10 13:27:51 -04:00
Joey Hess
db89e39df6
partially fix concurrency issue in updating the rollingtotal
It's possible for two processes or threads to both be doing the same
operation at the same time. Eg, both dropping the same key. If one
finishes and updates the rollingtotal, then the other one needs to be
prevented from later updating the rollingtotal as well. And they could
finish at the same time, or with some time in between.

Addressed this by making updateRepoSize be called with the journal
locked, and only once it's been determined that there is an actual
location change to record in the log. updateRepoSize waits for the
database to be updated.

When there is a redundant operation, updateRepoSize won't be called,
and the redundant LiveUpdate will be removed from the database on
garbage collection.

But: There will be a window where the redundant LiveUpdate is still
visible in the db, and processes can see it, combine it with the
rollingtotal, and arrive at the wrong size. This is a small window, but
it still ought to be addressed. Unsure if it would always be safe to
remove the redundant LiveUpdate? Consider the case where two drops and a
get are all running concurrently somehow, and the order they finish is
[drop, get, drop]. The second drop seems redundant to the first, but
it would not be safe to remove it. While this seems unlikely, it's hard
to rule out that a get and drop at different stages can both be running
at the same time.
2024-08-26 09:43:32 -04:00
Joey Hess
06064f897c
update Annex.reposizes when changing location logs
The live update is only needed when Annex.reposizes has already been
populated.
2024-08-15 13:27:14 -04:00
Joey Hess
63a3cedc45
slightly improve hairy types 2024-08-14 16:04:18 -04:00
Joey Hess
3e6eb2a58d
implement journalledRepoSizes
Plan is to run this when populating Annex.reposizes on demand.
So Annex.reposizes will be up-to-date with the journal, including
crucially journal entries for private repositories. But also
anything that has been written to the journal by another process,
especially if the process was ran with annex.alwayscommit=false.

From there, Annex.reposizes can be kept up to date with changes made
by the running process.
2024-08-14 13:53:24 -04:00
Joey Hess
8ac2685b33
calcBranchRepoSizes without journal files
This will be used to prime the RepoSizes database, which will always
contain values that correpond to information in the git-annex branch, so
without anything from journal files.

Factored out overJournalFileContents which will later be used to
update Annex.reposizes to include information from journal files.
This will be partitcularly important to support private UUIDs which only
ever get to journal files and not to the branch.
2024-08-14 03:19:30 -04:00
Joey Hess
5afbea25e7
avoid counting size of keys that are in the journal twice
In calcRepoSizes and also git-annex info, when a key was in the journal,
it was passed to the callback twice, so the calculated size was wrong.
2024-08-13 13:23:39 -04:00
Joey Hess
467d80101a
improve handling of unmerged git-annex branches in readonly repo
git-annex info was displaying a message that didn't make sense in
context.

In calcRepoSizes, it seems better to return the information from the
git-annex branch, rather than giving up. Especially since balanced
preferred content uses it, and we can't just give up evaluating a
preferred content expression if git-annex is to be usable in such a
readonly repo.

Commit 6d7ecd9e5d nobly wanted git-annex
to behave the same with such unmerged branches as it does when it can
merge them. But for the purposes of preferred content, it seems to me
there's a sense that such an unmerged branch is the same as a remote we
have not pulled from. The balanced preferred content will either way
operate under outdated information, and so make not the best choices.
2024-08-13 13:13:12 -04:00
Joey Hess
bd3d327d8a
smarter BranchState cache invalidation
Only invalidate a just-written file in the cache, not the whole cache.

This will avoid the possibly performance impact of cache invalidation
mentioned in commit 770aac97a7
2024-07-28 12:33:32 -04:00
Joey Hess
25a6ab6f11
Avoid grafting in export tree objects that are missing
They could be missing due to an interrupted git-annex at just the wrong
time during a prior graft, after which the tree objects got garbage
collected.

Or they could be missing because of manual messing with the git-annex
branch, eg resetting it to back before the graft commit.

Sponsored-by: Dartmouth College's OpenNeuro project
2024-06-07 16:51:50 -04:00
Joey Hess
b32c4c2e98
atomic git-annex branch update when regrafting in transition
Fix a bug where interrupting git-annex while it is updating the git-annex
branch could lead to git fsck complaining about missing tree objects.

Interrupting git-annex while regraftexports is running in a transition
that is forgetting git-annex branch history would leave the
repository with a git-annex branch that did not contain the tree shas
listed in export.log. That lets those trees be garbage collected.

A subsequent run of the same transition then regrafts the trees listed
in export.log into the git-annex branch. But those trees have been lost.

Note that both sides of `if neednewlocalbranch` are atomic now. I had
thought only the True side needed to be, but I do think there may be
cases where the False side needs to be as well.

Sponsored-by: Dartmouth College's OpenNeuro project
2024-06-07 16:34:10 -04:00
Joey Hess
adcebbae47
clean up git-remote-annex git-annex branch handling
Implemented alternateJournal, which git-remote-annex
uses to avoid any writes to the git-annex branch while setting up
a special remote from an annex:: url.

That prevents the remote.log from being overwritten with the special
remote configuration from the url, which might not be 100% the same as
the existing special remote configuration.

And it prevents an overwrite deleting of other stuff that was
already in the remote.log.

Also, when the branch was created by git-remote-annex, only delete it
at the end if nothing else has been written to it by another command.
This fixes the race condition described in
797f27ab05, where git-remote-annex
set up the branch and git-annex init and other commands were
run at the same time and their writes to the branch were lost.
2024-05-15 17:33:38 -04:00
Joey Hess
2c73845d90
multiple -m second try
Test suite passes this time. When committing the adjusted branch, use
the old method to make a message that old git-annex can consume. Also
made the code accept the new message, so that eventually
commitTreeExactMessage can be removed.

Sponsored-by: Kevin Mueller on Patreon
2024-04-09 12:56:47 -04:00
Joey Hess
a8dd85ea5a
Revert "multiple -m"
This reverts commit cee12f6a2f.

This commit broke git-annex init run in a repo that was cloned from a
repo with an adjusted branch checked out.

The problem is that findAdjustingCommit was not able to identify the
commit that created the adjusted branch. It seems that there is an extra
"\n" at the end of the commit message that it does not expect.

Since backwards compatability needs to be maintained, cannot just make
findAdjustingCommit accept it with the "\n". Will have to instead
have one commitTree variant that uses the old method, and use it for
adjusted branch committing.
2024-04-02 17:29:07 -04:00
Joey Hess
cee12f6a2f
multiple -m
sync, assist, import: Allow -m option to be specified multiple times, to
provide additional paragraphs for the commit message.

The option parser didn't allow multiple -m before, so there is no risk of
behavior change breaking something that was for some reason using multiple
-m already.

Pass through to git commands, so that the method used to assemble the
paragrahs is whatever git does. Which might conceivably change in the
future.

Note that git commit-tree has supported -m since git 1.7.7. commitTree
was probably not using it since it predates that version. Since the
configure script prevents building git-annex with git older than 2.1,
there is no risk that it's not supported now.

Sponsored-by: Nicholas Golder-Manning on Patreon
2024-03-27 15:58:27 -04:00
Joey Hess
a69871491f
avoid build warning on windows
Since append was only exported by Annex.Common on unix, excluding it
from import caused a build warning on windows.
2024-03-26 13:16:33 -04:00
Joey Hess
68e99513f0
added annex.commitmessage-command config
Sponsored-by: the NIH-funded NICEMAN (ReproNim TR&D3) project
2024-02-12 14:35:22 -04:00
Joey Hess
2114253eaf
update comment
The segfault seems to be fixed with git 2.43, I'm not sure what the
affected range was.
2024-01-20 11:25:22 -04:00
Joey Hess
f1ce15036f
started migrate --update
This is most of the way there, but not quite working.

The layout of migrate.tree/ needs to be changed to follow this approach.
git log will list all the files in tree order, so the new layout needs
to alternate old and new keys. Can that be done? git may not document
tree order, or may not preserve it here.

Alternatively, change to using git log --format=raw and extract
the tree header from that, then use
git diff --raw $tree:migrate.tree/old $tree:migrate.tree/new
That will be a little more expensive, but only when there are lots of
migrations.

Sponsored-by: Joshua Antonishen on Patreon
2023-12-07 15:50:52 -04:00
Joey Hess
be6b56df4c
remove unused import 2023-11-01 13:14:39 -04:00
Joey Hess
eb42935e58
Windows: Fix CRLF handling in some log files
In particular, the mergedrefs file was written with CR added to each line,
but read without CRLF handling. This resulted in each update of the file
adding CR to each line in it, growing the number of lines, while also
preventing the optimisation from working, so it remerged unncessarily.

writeFile and readFile do NewlineMode translation on Windows. But the
ByteString conversion prevented that from happening any longer.

I've audited for other cases of this, and found three more
(.git/annex/index.lck, .git/annex/ignoredrefs, and .git/annex/import/). All
of those also only prevent optimisations from working. Some other files are
currently both read and written with ByteString, but old git-annex may have
written them with NewlineMode translation. Other files are at risk for
breakage later if the reader gets converted to ByteString.

This is a minimal fix, but should be enough, as long as I remember to use
fileLines when splitting a ByteString into lines. This leaves files written
using ByteString without CR added, but that's ok because old git-annex has
no difficulty reading such files.

When the mergedrefs file has gotten lines that end with "\r\r\r\n", this
will eventually clean it up. Each update will remove a single trailing CR.

Note that S8.lines is still used in eg Command.Unused, where it is parsing
git show-ref, and similar in Git/*. git commands don't include CR in their
output so that's ok.

Sponsored-by: Joshua Antonishen on Patreon
2023-10-30 14:23:23 -04:00
Joey Hess
0da1d40cd4
Improve memory use of --all when using annex.private
This does not improve Annex.Branch.files at all, since it still uses ++ to
combine the lists, so forcing all but the last one.

But when there are a lot of files in the private journal, it does avoid
--all (or a bare repo) from buffering the filenames in memory.

See commit 653b719472 for prior discussion of
this buffering.

Sponsored-by: Graham Spencer on Patreon
2023-10-24 13:20:55 -04:00
Joey Hess
8bde6101e3
sqlite datbase for importfeed
importfeed: Use caching database to avoid needing to list urls on every
run, and avoid using too much memory.

Benchmarking in my podcasts repo, importfeed got 1.42 seconds faster,
and memory use dropped from 203000k to 59408k.

Database.ImportFeed is Database.ContentIdentifier with the serial number
filed off. There is a bit of code duplication I would like to avoid,
particularly recordAnnexBranchTree, and getAnnexBranchTree. But these use
the persistent sqlite tables, so despite the code being the same, they
cannot be factored out.

Since this database includes the contentidentifier metadata, it will be
slightly redundant if a sqlite database is ever added for metadata. I
did consider making such a generic database and using it for this. But,
that would then need importfeed to update both the url database and the
metadata database, which is twice as much work diffing the git-annex
branch trees. Or would entagle updating two databases in a complex way.
So instead it seems better to optimise the database that
importfeed needs, and if the metadata database is used by another command,
use a little more disk space and do a little bit of redundant work to
update it.

Sponsored-by: unqueued on Patreon
2023-10-23 16:46:22 -04:00
Joey Hess
c268dc5878
only stage regular files from the journal
git-annex only writes regular files there, but other things may drop junk
like empty .DAV directories around the tree. And trying to hash such things
can have weird and hard to understand effects. So it seems best to do a
small amount of work in statting the journal file to make sure it's a
regular file.

Sponsored-by: Jack Hill on Patreon
2023-10-10 13:22:02 -04:00
Joey Hess
adda6c1088
Add git-annex remote refs that are not newer to the merged refs list
Significant startup speed increase by avoiding repeatedly checking if some
remote git-annex branch refs need to be merged when it is not newer.

One way this could happen is when there are 2 remotes that are themselves
connected. The git-annex branch on the first remote gets updated. Then the
second remote pulls from the first, and merges in its git-annex branch.
Then the local repo pulls from the second remote, and merges its git-annex
branch. At this point, a pull from the first remote will get a git-annex
branch that is not newer, but is not on the merged refs list.

In my big repo, git-annex startup time dropped from 4 seconds to 0.1 seconds.
There were 5 to 10 such remote refs out of 18 remotes.

Sponsored-by: Graham Spencer on Patreon
2023-08-09 13:31:36 -04:00
Joey Hess
8b6c7bdbcc
filter out control characters in all other Messages
This does, as a side effect, make long notes in json output not
be indented. The indentation is only needed to offset them
underneath the display of the file they apply to, so that's ok.

Sponsored-by: Brock Spratlen on Patreon
2023-04-11 12:58:01 -04:00
Joey Hess
cd544e548b
filter out control characters in error messages
giveup changed to filter out control characters. (It is too low level to
make it use StringContainingQuotedPath.)

error still does not, but it should only be used for internal errors,
where the message is not attacker-controlled.

Changed a lot of existing error to giveup when it is not strictly an
internal error.

Of course, other exceptions can still be thrown, either by code in
git-annex, or a library, that include some attacker-controlled value.
This does not guard against those.

Sponsored-by: Noam Kremen on Patreon
2023-04-10 13:50:51 -04:00
Yaroslav Halchenko
84b0a3707a
Apply codespell -w throughout 2023-03-17 15:14:58 -04:00
Joey Hess
f09e299156
rawfilepath conversion 2023-02-27 15:06:32 -04:00
Joey Hess
a23fd7349f
work around git segfault
Work around bug in git 2.37 that causes a segfault when when
core.untrackedCache is set, and broke git-annex init.

Depending on when git gets fixed and how widely the buggy versions are
used, this could be reverted quite soon, or need to linger for a long time.
It only makes git-annex init a tiny bit slower in a new repo.

Sponsored-by: Max Thoursie on Patreon
2022-08-04 14:20:57 -04:00
Joey Hess
d905232842
use ResourcePool for hash-object handles
Avoid starting an unncessary number of git hash-object processes when
concurrency is enabled.

Sponsored-by: Dartmouth College's DANDI project
2022-07-25 17:32:39 -04:00
Joey Hess
4e88137a28
prevent appends except when annex.alwayscompact=false
I would like for a new repo version to enable appends, but to do so
safely would need a v11 followed by a 1 year delay followed by a v12
that does it. Since a similar v9 and v10 transition is currently
happening, and is less than 6 months along in most repos, it does not
feel wise to stack up another year-long transition behind that. What if
I need to hurry up a new repo version for some other change?

Added todo so I remember to make this change at some time when a v11
and probably v12 repo version do make sense.

Sponsored-by: Dartmouth College's DANDI project
2022-07-20 13:23:55 -04:00
Joey Hess
6f1fd3abdd
no locking of journal on read after all
Finally have a final design, and it turns out not to need locking on read.
2022-07-20 10:57:28 -04:00
Joey Hess
d0860b7f0e
fix build
After 28b0aaea54
2022-07-18 16:44:32 -04:00
Joey Hess
28b0aaea54
re-add lock journal before reading journal files
This reverts commit 2e6e9876e3.

This is gonna be needed after all.. The append will only be atomic if
the journal is locked, because the file being appended will have to be
moved out of the way to avoid an old version of git-annex seeing an
incomplete write to it. When git-annex finds that the file is not in the
journal, and checks the append location, locking will be needed to avoid
a race causing it to miss it in the append location too due to it being
moved back to the journal.
2022-07-18 16:40:25 -04:00
Joey Hess
36f0bdcd57
add annex.alwayscompact
Added annex.alwayscompact setting which can be unset to speed up writes to
the git-annex branch in some cases.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 16:39:19 -04:00
Joey Hess
de18d92de6
efficient but unsafe journal file append
This is only for checking performance, it's not safe.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 14:17:12 -04:00
Joey Hess
2e6e9876e3
Revert "lock journal before reading journal files"
This reverts commit 47358a6f95.

This added overhead, and will not be needed, because appends are going
to have to be made atomic for other reasons than avoiding incomplete
reads of data being appended.

In particular, when git-annex is interrupted in the middle of an append,
it must not leave the file with a partially written line. So appending
has to somehow be made fully atomic.
2022-07-18 13:38:12 -04:00
Joey Hess
ce455223df
split out appending to journal from writing, high level only
Currently this is not an improvement, but it allows for optimising
appendJournalFile later. With an optimised appendJournalFile, this will
greatly speed up access patterns like git-annex addurl of a lot of urls
to the same key, where the log file can grow rather large. Appending
rather than re-writing the journal file for each line can save a lot of
disk writes.

It still has to read the current journal or branch file, to check
if it can append to it, and so when the journal file does not exist yet,
it can write the old content from the branch to it. Probably the re-reads
are better cached by the filesystem than repeated writes. (If the
re-reads turn out to keep performance bad, they could be eliminated, at
the cost of not being able to compact the log when replacing old
information in it. That could be enabled by a switch.)

While the immediate need is to affect addurl writes, it was implemented
at the level of presence logs, so will also perhaps speed up location logs.
The only added overhead is the call to isNewInfo, which only needs to
compare ByteStrings. Helping to balance that out, it avoids compactLog
when it's able to append.

Sponsored-by: Dartmouth College's DANDI project
2022-07-18 13:22:50 -04:00
Joey Hess
47358a6f95
lock journal before reading journal files
This is not currently necessary; journal files are updated atomically.

However, for faster appends to large journal files, locking on read will
be needed, because appends are not atomic.

Sponsored-by: Dartmouth College's DANDI project
2022-07-15 14:43:29 -04:00
Joey Hess
1b680d330b
revert accidental change 2022-07-13 15:17:08 -04:00
Joey Hess
68e9b7f987
comment 2022-07-13 13:44:43 -04:00