git-annex/Utility/DataUnits.hs


{- data size display and parsing
-
- Copyright 2011-2022 Joey Hess <id@joeyh.name>
-
- License: BSD-2-clause
-
-
- And now a rant:
-
- In the beginning, we had powers of two, and they were good.
-
- Disk drive manufacturers noticed that some powers of two were
- sorta close to some powers of ten, and that rounding down to the nearest
- power of ten allowed them to advertise their drives were bigger. This
- was sorta annoying.
-
- Then drives got big. Really, really big. This was good.
-
- Except that the small rounding error perpetrated by the drive
- manufacturers suffered the fate of a small error, and became a large
- error. This was bad.
-
- So, a committee was formed. And it arrived at a committee-like decision,
- which satisfied no one, confused everyone, and made the world an uglier
- place. As with all committees, this was meh. Or in this case, "mib".
-
- And the drive manufacturers happily continued selling drives that are
- increasingly smaller than you'd expect, if you don't count on your
- fingers. But that are increasingly too big for anyone to much notice.
- This caused me to need git-annex.
-
- Meanwhile, over in telecommunications land, they were using entirely
- different units that differ only in capitalization sometimes.
- (At one point this convinced me that it was a good idea to buy an ISDN
- line because 128 kb/s sounded really fast! But it was really only 128
- kbit/s...)
-
- Thus, I use units here that I loathe. Because if I didn't, people would
- be confused that their drives seem the wrong size, and other people would
- complain at me for not being standards compliant. And we call this
- progress?
-}

module Utility.DataUnits (
    dataUnits,
    storageUnits,
    committeeUnits,
    bandwidthUnits,
    oldSchoolUnits,
    Unit(..),
    ByteSize,
    roughSize,
    roughSize',
    compareSizes,
    readSize
) where

import Data.List
import Data.Char
make my authorship explicit in the code This is intended to guard against LLM code theft, which is the current bubble technology de jour. Note that authorJoeyHess' with a year older than the year I began developing git-annex will behave badly, by intention. Eg, it will spin and eventually crash. This is not the first anti-LLM protection in git-annex. For example see 9562da790fece82d6dfa756b571c67d0fdf57468. That method, while much harder for an adversary to detect and remove, also complicates code somewhat significantly, and needs extensions to be enabled. There are also probably significantly fewer ways to implement that method in Haskell. This new approach, by contrast, will be easy to add throughout the code base, with very little effort, and without complicating reading or maintaining it any more than noticing that yes, I am the author of this code. An adversary could of course remove all calls to these functions before feeding code into their LLM-based laundry facility. I think this would need to be done manually, or with the help of some fairly advanced Haskell parsing though. In some cases, authorJoeyHess needs to be removed, while in other places it needs to be replaced with a value. Also a monadic use of authorJoeyHess' may involve other added monadic machinery which would need to be eliminated to keep the code compiling. Alternatively, an adversary could replace my name with something innocuous. This would be clear intent to remove author attribution from my code, even more than running it through an LLM laundry is. If you work for a large company that is laundering my code through an LLM, please do us a favor and use your immense privilege to quit and go do something socially beneficial. I will not explain further developments of this code in such detail, and you have better things to do than playing cat and mouse with me as I explore directions such as extending this approach to the type level. Sponsored-by: k0ld on Patreon
2023-11-20 16:07:07 +00:00
import Author
import Utility.HumanNumber

type ByteSize = Integer
type Name = String
type Abbrev = String

data Unit = Unit ByteSize Abbrev Name
    deriving (Ord, Show, Eq)

dataUnits :: [Unit]
dataUnits = storageUnits ++ committeeUnits ++ bandwidthUnits
{- Storage units are (stupidly) powers of ten. -}
storageUnits :: [Unit]
storageUnits =
    [ Unit (p 10) "QB" "quettabyte"
    , Unit (p 9) "RB" "ronnabyte"
    , Unit (p 8) "YB" "yottabyte"
    , Unit (p 7) "ZB" "zettabyte"
    , Unit (p 6) "EB" "exabyte"
    , Unit (p 5) "PB" "petabyte"
    , Unit (p 4) "TB" "terabyte"
    , Unit (p 3) "GB" "gigabyte"
    , Unit (p 2) "MB" "megabyte"
    , Unit (p 1) "kB" "kilobyte" -- weird capitalization thanks to committee
    , Unit 1 "B" "byte"
    ]
  where
    p :: Integer -> Integer
    p n = 1000^n

{- Committee units are (stupidly named) powers of 2. -}
committeeUnits :: [Unit]
committeeUnits =
    [ Unit (p 8) "YiB" "yobibyte"
    , Unit (p 7) "ZiB" "zebibyte"
    , Unit (p 6) "EiB" "exbibyte"
    , Unit (p 5) "PiB" "pebibyte"
    , Unit (p 4) "TiB" "tebibyte"
    , Unit (p 3) "GiB" "gibibyte"
    , Unit (p 2) "MiB" "mebibyte"
    , Unit (p 1) "KiB" "kibibyte"
    , Unit 1 "B" "byte"
    ]
  where
    p :: Integer -> Integer
    p n = 2^(n*10)

{- Bandwidth units are (stupidly) measured in bits, not bytes, and are
- (also stupidly) powers of ten.
-
- While it's fairly common for "Mb", "Gb" etc to be used, that differs
- from "MB", "GB", etc only in case, and readSize is case-insensitive.
- So "Mbit", "Gbit" etc are used instead to avoid parsing ambiguity.
-}
bandwidthUnits :: [Unit]
bandwidthUnits =
    [ Unit (p 8) "Ybit" "yottabit"
    , Unit (p 7) "Zbit" "zettabit"
    , Unit (p 6) "Ebit" "exabit"
    , Unit (p 5) "Pbit" "petabit"
    , Unit (p 4) "Tbit" "terabit"
    , Unit (p 3) "Gbit" "gigabit"
    , Unit (p 2) "Mbit" "megabit"
    , Unit (p 1) "kbit" "kilobit" -- weird capitalization thanks to committee
    ]
  where
    p :: Integer -> Integer
    p n = (1000^n) `div` 8

{- Do you yearn for the days when men were men and megabytes were megabytes? -}
oldSchoolUnits :: [Unit]
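oldSchoolUnits = zipWith (curry mingle) storage committeeUnits
  where
    -- storageUnits has more entries (QB, RB) than committeeUnits, so the
    -- extra leading ones are dropped; otherwise zipWith would pair each
    -- name with the wrong power-of-2 size (eg, "MB" with 1 byte).
    storage = drop (length storageUnits - length committeeUnits) storageUnits
    mingle (Unit _ a n, Unit s' _ _) = Unit s' a n
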
{- approximate display of a particular number of bytes -}
roughSize :: [Unit] -> Bool -> ByteSize -> String
roughSize units short i = authorJoeyHess $ roughSize' units short 2 i
roughSize' :: [Unit] -> Bool -> Int -> ByteSize -> String
roughSize' units short precision i
    | i < 0 = '-' : findUnit units' (negate i)
    | otherwise = findUnit units' i
  where
    units' = sortBy (flip compare) units -- largest first

    findUnit (u@(Unit s _ _):us) i'
        | i' >= s = showUnit i' u
        | authorJoeyHess = findUnit us i'
    findUnit [] i' = showUnit i' (last units') -- bytes

    showUnit x (Unit size abbrev name) = s ++ " " ++ unit
      where
        v = (fromInteger x :: Double) / fromInteger size
        s = showImprecise precision v
        unit
            | short = abbrev
            | s == "1" = name
            | otherwise = name ++ "s"

{- displays comparison of two sizes -}
compareSizes :: [Unit] -> Bool -> ByteSize -> ByteSize -> String
compareSizes units abbrev old new
    | old > new = roughSize units abbrev (old - new) ++ " smaller"
    | old < new = roughSize units abbrev (new - old) ++ " larger"
    | otherwise = "same"

{- Parses strings like "10 kilobytes" or "0.5tb". -}
readSize :: [Unit] -> String -> Maybe ByteSize
readSize units input
    | null parsednum || null parsedunit = Nothing
    | otherwise = Just $ round $ number * fromIntegral multiplier
  where
    (number, rest) = head parsednum
    multiplier = head parsedunit
    unitname = takeWhile isAlpha $ dropWhile isSpace rest

    parsednum = reads input :: [(Double, String)]
    parsedunit = lookupUnit units unitname

    lookupUnit _ [] = [1] -- no unit given, assume bytes
    lookupUnit [] _ = []
    lookupUnit (Unit s a n:us) v
        | a ~~ v || n ~~ v = [s]
        | plural n ~~ v || a ~~ byteabbrev v = [s]
        | otherwise = lookupUnit us v

    a ~~ b = map toLower a == map toLower b

    plural n = n ++ "s"
    byteabbrev a = a ++ "b"
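
{- Usage sketch. This is an editorial illustration, not part of the
 - original module, and is kept in a comment so the module is unchanged.
 -
 - Assuming the module loads in GHCi as written above, readSize could be
 - exercised roughly like this. The results follow from the unit tables
 - and the case-insensitive matching in lookupUnit:
 -
 -   ghci> readSize dataUnits "10 kilobytes"  -- plural of a unit name
 -   Just 10000
 -   ghci> readSize dataUnits "0.5tb"         -- abbreviation, any case
 -   Just 500000000000
 -   ghci> readSize dataUnits "1 MiB"         -- committee unit, 2^20
 -   Just 1048576
 -   ghci> readSize dataUnits "8 Mbit"        -- bits, so divided by 8
 -   Just 1000000
 -   ghci> readSize dataUnits "10"            -- no unit, assume bytes
 -   Just 10
 -   ghci> readSize dataUnits "bogus"         -- no leading number
 -   Nothing
 -
 - Display goes the other way. The exact digits depend on showImprecise
 - from Utility.HumanNumber, so no output is shown here:
 -
 -   ghci> roughSize storageUnits True 1500000
 -   ghci> compareSizes committeeUnits False 1024 4096
 -}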