Will Floating Point 8 Solve AI/ML Overhead? (semiengineering.com)
90 points by rbanffy on Jan 16, 2023 | 80 comments


For LLM, INT8 is old news but still exciting. FP8 would definitely be an improvement. However the new coolness is INT4.

> Excitingly, we manage to reach the INT4 weight quantization for GLM-130B while existing successes have thus far only come to the INT8 level. Memory-wise, by comparing to INT8, the INT4 version helps additionally save half of the required GPU memory to 70GB, thus allowing GLM130B inference on 4 × RTX 3090 Ti (24G) or 8 × RTX 2080 Ti (11G). Performance-wise, Table 2 left indicates that without post-training at all, the INT4-version GLM-130B experiences almost no performance degradation, thus maintaining the advantages over GPT-3 on common benchmarks.

Page 7 https://arxiv.org/pdf/2210.02414.pdf
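As a rough sanity check on the memory figures in the quote, weight storage scales linearly with bits per parameter. The sketch below assumes exactly 130e9 parameters and counts weights only (no activations or quantization scales), which is presumably why the paper's 70GB sits a little above the raw number.

```python
# Back-of-the-envelope weight memory for a 130B-parameter model.
# Assumption: exactly 130e9 parameters, weights only.
PARAMS = 130e9

def weight_gb(bits_per_weight: int) -> float:
    """Weight storage in decimal gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(bits):.0f} GB")

# Total VRAM of the two GPU setups mentioned in the quote.
print("4 x 24 GB =", 4 * 24, "GB;", "8 x 11 GB =", 8 * 11, "GB")
```

Both GPU configurations clear the raw 65 GB INT4 figure, with some headroom for everything that is not weights.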


Hopper seems to drop int4 support so maybe it's old news now?

https://en.m.wikipedia.org/wiki/Hopper_(microarchitecture)


At this rate, we're going to end up with FP1 (1-bit floating point) numbers...

I guess that's nonsensical. 1-bit in all FP is the sign bit. So I guess the minimum size is 2-bit FP (1-bit sign + 1-bit exponent + 0-bit implicit 1 mantissa)


Binary neural networks, where weights and/or activations are just 0/1s, are an active research area. In theory they could be implemented very efficiently in hardware. But in contrast to FP16 (or to some extent, int8), just quantizing FP32 to 1 bit doesn't work very well. There have been successful methods in practice. There was a company called Xnor.ai that was built partially around this technology, but it was sold to Apple a couple years ago. I don't know what's the current SOTA in this area, though.


We have created a binary embedding to help us with the embedding sizes. I think in the future we will see a lot more research in reducing the model and embedding sizes.

https://medium.com/ozonetel-ai/compressing-bert-sentence-emb...


*binarized


At FP2, you are probably better off with {-1, 0, 1, NaN} (sign+mantissa) rather than sign/exponent. You basically bit pack.

FP3 gives you sign, 1x "exponent", 1x mantissa, so still kinda bit packing.

I could see FP4 with sign, 1x exponent, 2x mantissa. Exponent would really just be a 4x multiplier, giving +/-0,1,2,3,4,8,12

Or invert all those, so you are expressing common fractions on 0..1
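To make the FP4 idea above concrete, here is a minimal sketch that enumerates every code of a 4-bit format with 1 sign bit, 1 "exponent" bit acting as a 4x multiplier, and 2 mantissa bits. The bit layout (sign in the top bit, multiplier next, mantissa in the low two bits) is an assumption for illustration only.

```python
# Enumerate the values of the FP4 layout described above.
# Assumed layout: [sign][x4 multiplier][2-bit mantissa].
def fp4_value(code: int) -> int:
    sign = -1 if code & 0b1000 else 1
    multiplier = 4 if code & 0b0100 else 1
    mantissa = code & 0b0011
    return sign * multiplier * mantissa

# Collect the distinct magnitudes over all 16 codes.
magnitudes = sorted({abs(fp4_value(c)) for c in range(16)})
print(magnitudes)  # [0, 1, 2, 3, 4, 8, 12]
```

Note that zero is encoded four times in this layout, so only 13 of the 16 codes carry distinct values.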


Real life has the E3 series: 1, 2.2, 4.7, and then 10, 22, 47, 100, 220, 470, 1000, etc. Etc.

EEs would recognize these as the preferred resistor values for projects (though the E6 series is more commonly used in projects, the E3 and E1 values are preferred).

That's 3 values per decade, which is slightly more dispersed than a FP4 that consists of 1 sign + 3 exponent + 0 (implicit mantissa 1 bit).

Or the values -128, -64, -32, -16, -8, -4, -2, -1, 1, 2, ... 128.

Maybe we can take -128 and call that zero instead, cause zero is useful.

--------

Given how even E3 is still useful in real world electrical engineering problems, I'm more inclined to allocate more bits to the exponent than the mantissa.


> Real life has the E3 series: 1, 2.2, 4.7, and then 10, 22, 47, 100, 220, 470, 1000, etc. Etc.

Took me until today to realise that sequence is a rounded version of 10^(n/3) for integer n.


Almost. There are some manual adjustments at various places to help with overlap from tolerances, but the pure math layout is probably what you'd want for ML.
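The 10^(n/3) observation is easy to check. This small sketch compares the ideal geometric steps with the nominal E3 values quoted in the thread; the 4.7 entry is one of the places where the standardized value differs from plain rounding (4.64 would round to 4.6).

```python
# Compare the ideal geometric steps 10^(n/3) with the nominal E3 values.
E3_NOMINAL = [1.0, 2.2, 4.7]  # values quoted in the thread

ideal = [10 ** (n / 3) for n in range(3)]
for nominal, exact in zip(E3_NOMINAL, ideal):
    print(f"nominal {nominal:>4}  ideal {exact:.3f}")
```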


E3 is only relevant to decimal numbers. I don't see how computing can benefit from it, unless base conversion for human interaction is a particularly large part of the problem. And there is a reason why stuff like BCD isn't used anywhere near anything resembling high performance computing: it's practically never worth it.


If you're going to bother doing floats you should probably make them balanced around 1.

And exponent seems to be much more important for these small sizes. The first paper that shows up for FP4 almost has negative mantissa bits. Their encoding has 0, 1/64, 1/16, 1/4, 1, 4, 16, 64.


1 bit would work fine - make the values represent +-1 or so


That's once we get into asymmetric numeral coding, where you could use numbers that take a fraction of a bit.


I think I read somewhere it only goes as low as int4. Can't find the reference.


1-bit unsigned float would probably be useful to some


So +0, +infinity, -0, -infinity, no NaNs?


The GLM work seems exciting but has there been any other research groups/LMs that have achieved similar performance even after such drastic quantisation?


For inference, sure. Not yet for training.

Weight quantization is much easier than weight + activation.
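For readers who have not seen weight-only quantization, here is a minimal sketch of symmetric "absmax" quantization of one weight row to signed 4-bit integers. This is a generic textbook scheme, not the method used in the GLM-130B paper, and the example weights are made up. It assumes the row contains at least one nonzero weight.

```python
def quantize_int4(weights):
    """Symmetric absmax quantization of one weight row to signed 4-bit ints."""
    scale = max(abs(w) for w in weights) / 7  # map the largest weight to +/-7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.7, -0.42, 0.12, 0.0]  # made-up example values
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
print(q, [round(r, 3) for r in restored])
```

Only the weights are touched here; activations stay in higher precision, which is part of why this is the easier case.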


Kind of a weird article, as 8-bit quantization has been widely used in production for neural networks for a number of years now. The title of the article is a bit misleading since it's widely known that 8-bit quantization does work and is extremely effective at improving inference throughput and latency. I'm not 100% sure if I'm reading the article correctly since it's a bit oblique, but it seems like the news here is that work is being done to formally specify a cross-vendor FP8 standard, as what exists right now is de facto standards from different vendors.


INT8 quantisation has been used in production for years. FP8 has not.


FP8 provides some nice accuracy benefits over INT8 but if you swap it out that doesn't affect your overhead.


The article mentions the 8 bit quantization, I believe this is about training in fp8 as native format. The latest GPUs provide huge flops for those, Tim Dettmers updated his gpu article and he talks about this, the claim is 0.66 PFLOPS for an RTX 4090.


This is very different, FP8 is a new and active area of research that has new hardware that is just supporting it. You are thinking of (U)INT8 quantization


The title is nonsensical. The faster the compute is, or the faster inference is (through e.g. precision), the larger the models people will train, because accuracy / output quality increases indefinitely with model size, and everyone knows this. So a different precision will not "Solve the AI/ML Overhead", that's nonsense. People will just use as large a model as they can for their latency budget at inference & for their $ budget at training, whatever it is.


Chinchilla suggests that most models now are undertrained. It's been the lore that model size was the bottleneck but we've since passed that and now training data is the limiting factor.


chinchilla only says that it would be more computationally efficient to have more data than to make the model larger, not that making the model larger wouldn't still benefit in performance gain


Sure, but why would you spend on compute that you don't need?


if there is not enough data then all we can do is increase model size, which supports my original point


A modern-day analog to the Jevons paradox.


The article, comparing single and double precision:

>the mantissa jumps from 32 bits to 52 bits

Rather from 23 (+1 for the implicit MSB) to 52 (+1), I suppose.
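The 24-bit and 53-bit significand widths can be observed directly. This sketch uses Python's `struct` module to round a value to binary32; the helper name `to_float32` is made up for this example.

```python
import struct
import sys

def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# binary64: 52 stored fraction bits + 1 implicit bit = 53 significant bits.
print(sys.float_info.mant_dig)              # 53
# binary32: 23 stored + 1 implicit = 24 bits, so 2**24 + 1 is not representable.
print(to_float32(2.0**24 + 1) == 2.0**24)   # True
# The same thing happens to binary64 one step past 2**53.
print(float(2**53 + 1) == 2.0**53)          # True
```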


Related:

https://ai.facebook.com/blog/making-floating-point-math-high...

Which is Meta's 8-bit data type, originally called (8, 1, alpha, beta, gamma). I think they realized that's a terrible name, so I think they are calling it Deepfloat or something now.


To be honest, the second name has connotations that are even more terrible than the original name.

They need to seek help from a better naming team, please. It would help them, us, and the world, so much.


Will Doubling Disk Size Solve Storage?


The practical question of interest is: will this make it possible to run GPT-3 size models on normal desktops with GPU? Like Stable Diffusion.


At some point the practical question would be how do you get all the data onto the desktop.


People are already downloading 100GB games. And data rates are growing much faster than RAM capacities. The logistics of downloading a model smaller than GPU RAM are unlikely to ever get complicated.


Using bittorrent, of course :)

Once AI models become usable on desktops without esoteric hardware, their datasets will go the way of all large media which people have been distributing for many years.


Storage isn’t much of an issue these days. The bottleneck is computing values from RAM; FP8 can improve this drastically.


I feel like the fastest tier of ram can only get so big before speed of light delays become relevant.


Really for me just the mantissa would be fine; no need for exponent bc so much of what I worked on is between 0..1

There was an interesting paper from the Allen Institute a few years ago describing a system with 1 bit weights that worked pretty well! Since I read it I've been musing on trying that, though it seems unlikely I will be able to any time soon.


If you just have a mantissa, aren't you doing fixed point math?


Yes, just looking for a weight in the range 0 <= x < 1. But I want to do large numbers of calculations using the GPU, else I'd use the SIMD int instructions (AVX)
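A mantissa-only weight in [0, 1) is plain unsigned fixed point. The sketch below assumes an 8-bit width (the thread does not fix one) and shows the one non-obvious step, the right shift after a multiply.

```python
FRAC_BITS = 8  # assumed width; all bits are fraction bits, no exponent

def to_fixed(x: float) -> int:
    """Encode 0 <= x < 1 as an unsigned 8-bit fixed-point integer."""
    if not 0.0 <= x < 1.0:
        raise ValueError("x must be in [0, 1)")
    return int(x * (1 << FRAC_BITS))  # truncate toward zero

def from_fixed(q: int) -> float:
    """Decode a fixed-point integer back to a float."""
    return q / (1 << FRAC_BITS)

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the raw product needs a right shift."""
    return (a * b) >> FRAC_BITS

half, quarter = to_fixed(0.5), to_fixed(0.25)
print(half, quarter, from_fixed(fixed_mul(half, quarter)))  # 128 64 0.125
```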


Just do fixed point bruh.


it is, but doesn't give me the hardware affordance I want: https://news.ycombinator.com/item?id=34405604


“High on the ML punch list is how to run models more efficiently using less power, especially in critical applications like self-driving vehicles where latency becomes a matter of life or death.”

Never ever heard of inference latency being a bottleneck here…


This is a great article.

I have had to deal with this lately, and FP8 / INT8 is very important, especially for mobile/on-device inference. AI will start to get embedded in all kinds of devices (even low-power ones), and often backend inference is not desired or feasible. It is actually pretty awesome if the industry moves towards INT8 / FP8 inference as a standard.

"But if FP8 becomes real, and if the popular training tools begin to develop ML models with FP8 as the native format, it could be a huge boon to embedded inference deployments. Eight-bit weights take the same storage space, whether they are INT8 or FP8."


Could be 8-bit posits may be enough. Has that been done? At scale. I do not know.


Posits aren't the answer to any question worth asking.


Why not?


Original posits are variable width, making them nearly useless for high performance parallel computations. Later versions don't add anything of use for low precision neural networks, and lack of hardware support anywhere makes them too slow for anything other than toying around.

See also http://people.eecs.berkeley.edu/~wkahan/UnumSORN.pdf and https://www.youtube.com/watch?v=LZAeZBVAzVw


So for reference to everyone, the way a fixed-width posit works is by unary-encoding the high bits of the exponent, then encoding the low bits if space is available, then encoding mantissa into whatever bits are left.

Near 1.0 they look like a normal float. Maybe they have an extra bit or two of precision. Then as you get closer to 0 or infinity, every time the exponent field would run out you instead reset it, at the cost of one bit of precision.

The main benefit is that they have a lot more dynamic range than an equivalent float. The downside is the further you go into that range, the less accurate your numbers are.

They can also trade off accuracy near 1 for accuracy far from 1. Different exponent widths represent different points on this scale.

At smaller bit sizes, they have a good mix of preserving precision while being quite hard to overflow.

Overall they're not very different from standard floating point. Most of the bold claims from Gustafson are outside the scope of the posit itself.
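The encoding described above (sign, unary regime, optional exponent bits, then whatever fraction bits are left) can be sketched as a small decoder. This is an illustrative sketch, not a reference implementation. It defaults to 8 bits with `es=0`, which is an assumption chosen because it matches the posit8 examples that come up later in the thread. The rounding helper is a simplification: it picks the nearest value by brute force and ignores the standard's special handling near the extremes.

```python
from fractions import Fraction

def decode_posit(bits: int, n: int = 8, es: int = 0):
    """Decode an n-bit posit bit pattern. Returns a Fraction, or None for NaR."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None  # NaR: the single "not a real" pattern
    negative = bits >> (n - 1) == 1
    if negative:
        bits = (-bits) & mask  # negative posits are two's complement
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]  # bits after the sign
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run   # unary-coded regime
    rest = body[run + 1:]                 # skip the regime terminator bit
    exp_bits, frac_bits = rest[:es], rest[es:]
    exponent = 0
    for b in exp_bits:
        exponent = exponent * 2 + b
    exponent <<= es - len(exp_bits)       # truncated exponent bits count as 0
    frac = sum(Fraction(b, 2 ** i) for i, b in enumerate(frac_bits, start=1))
    value = (1 + frac) * Fraction(2) ** (k * 2 ** es + exponent)
    return -value if negative else value

def round_to_posit(x, n: int = 8, es: int = 0):
    """Nearest representable posit to x; ties go to the even bit pattern."""
    x = Fraction(x)
    best = None
    for bits in range(1 << n):
        v = decode_posit(bits, n, es)
        if v is None:
            continue
        key = (abs(v - x), bits & 1)
        if best is None or key < best[0]:
            best = (key, v)
    return best[1]

print(decode_posit(0b01000001))          # 33/32, i.e. 1.03125
print(round_to_posit(Fraction(33, 16)))  # 2.0625 is a tie, goes to 2
print(round_to_posit(20))                # 20 is a tie between 16 and 24, goes to 16
```

The longer the regime run, the fewer bits remain for the fraction, which is the tapered precision the comment describes.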

-

Naively I would expect them to be kind of useful, if we ignore the issue of hardware support. Do neural networks need to represent extreme values with just as much precision as non-extreme values? And is the risk of overflow mitigated sufficiently at the same time? If so, yeah, the whole idea is useless.


As far as I'm aware, the original unum proposals that Kahan was arguing against have been discarded by Gustafson and all that remains now for current advocacy is the posit type, which is essentially a floating-point type that's fixed-width with a variable-width exponent.

I don't know what the hardware costs of posits look like, since I'm not a hardware engineer, so I can't comment on that. For larger sizes, posits seem to be inferior to IEEE 754 floating-point. For smaller sizes (say 16-bit and smaller), posits may work better, as the limited size means that IEEE 754's scale invariant nature [1] isn't as relevant, and packing more distinct numbers into the same bitwidth is more valuable [2].

[1] Put simply, in a IEEE 754 number, it doesn't matter if you measure your distance in nanometers, meters, or light-years--you'll get the same relative error either way. This is emphatically not the case in posits, where your relative error depends on the scale of the numbers.

[2] Posits combine ±infinity and NaN into a single value, and also do away with -0.0. From a numerical perspective, this is actually pretty cringe--there's a useful distinction there (and Kahan's talk gives some examples here)--but by the time you're at small bitwidths, you're likely limiting yourself to situations where the utility of these special values is questionable.


As to [2] I am very skeptical of the value of all the NaNs, -0 and Infs floating point has. NaN breaks x==x which is a pretty fundamental relationship for numbers to have. +-Inf sound useful in theory, but in practice they rarely give you a more useful result than NaN or the maximum/minimum value of your type (returning Inf on overflow has infinitely more error than returning the largest positive value, and if that isn't meaningful then an Inf probably wasn't either). Once you've gotten rid of -Inf, it becomes clear that -0.0 is a mistake. It breaks the identity 0+x==x and 0-x==-x. Furthermore, IEEE specifies sqrt(-0.0)==-0.0 and log(-0.0)==-Inf which are both nonsensical if you consider -0.0 as a limit from the negatives. Floats also have the unfortunate property that inv(x) can be infinite for finite x.


The value of -0 as distinguished from +0 has a few uses. The most obvious one is preserving sign in the case of overflow. A less obvious use case is handling branch cuts. There are uses in a few more cases: I've heard it's occasionally useful in things like coordinate systems, since something like "0°5'3" W" can be stored as (-0.0, 5, 3) after explosion and still display correctly. It's definitely niche, but it does have its uses.

Returning a distinct value that retains the fact that it overflowed is quite useful--if you get that result out of the computation, you know you overflowed the computation rather than silently getting a meaningless result. Note in particular that infinities end up being sticky values: once a value goes infinite, it tends to stay infinity, which isn't true for largest finite values. Distinguishing between various kinds of "invalid" values turns out to be moderately useful in practice--I've used infinities a couple of times in my own code.

NaNs are useful in representing a different kind of error than overflowed computation. Now there is a lot of room to criticize IEEE 754 here: "x != x" was quite frankly a mistake (basically the primary reason for it was the creators wanted to make testing for NaN easier than calling isnan(x)...). sNaNs are of course an abomination that just makes things worse. Multiple NaN payloads were originally intended (in part) to let developers debug the sources of NaNs, but this requires support that never really materialized. However, NaN payloads did find new use in making NaN-boxing a useful technique, and dedicating an entire exponent to special values simplifies several numerical analysis lemmas.


> handling branch cuts

I agree this sounds great in theory, but I don't think it works very well in practice. i.e. what about 1/(x+1)? Also branch cuts matter most for complex arithmetic, and there +-0 doesn't help since you don't know the phase of the zero. Also, realistically, floating point has finite precision so there are very few non-toy examples where you can do an actual computation and reliably end up on the correct branch. I'd rather have all the real numbers represented before we start adding hyper-reals to the number system.

> Returning a distinct value that retains the fact that it overflowed is quite useful

Agreed, and I think that NaR in Posits does a good job of that while not taking a ridiculous number of values.


> I agree this sounds great in theory, but I don't think it works very well in practice.

I've actually done it once in practice myself. I forget the exact details, though. As I said, it is a niche use case, but it's useful to have when you are in that niche.


>NaN breaks x==x which is a pretty fundamental relationship for numbers to have

NaN is not a number, so it should NOT satisfy "fundamental relationships for numbers to have".

>+-Inf sound useful in theory, but in practice they rarely give you a more useful resu

There are algorithms that are more performant using infs, and without having a way to denote overflow, you'd have to pre-check every operation to do serious numerical work, which basically cuts your performance in half.

>Once you've gotten rid of -Inf, it becomes clear that -0.0 is a mistake

>It breaks the identity 0+x==x and 0-x==-x.

No, you have some fundamental misunderstanding. IEEE explicitly guarantees these hold, even for -0.

> Furthermore, IEEE specifies sqrt(-0.0)==-0.0 and log(-0.0)==-Inf which are both nonsensical if you consider -0.0 as a limit from the negatives.

You're making up strawmen. -0 is not a "limit from the negatives" any more than +0 is a limit from the positives, which would break other made up requirements. That is why making up stuff that has zero bearing on what IEEE 754 specifies is arguing strawmen.

>Floats also have the unfortunate property that inv(x) can be infinite for finite x.

Integers have the same property: -(X) can not be the negative of X. So this is not a problem except in made up goofiness.

Every objection you post is a lack of understanding numerical analysis and the needs of actual scientific software.

So you're skeptical- do you write numerical software professionally? I do, and have, and will do it in the future. There are very, very good reasons for all of those pieces you don't see the need for.

There's a reason unums have not caught on with the field of numerical software or numerical analysis - they simply don't allow writing robust, performant software, they solve no real issues, and add significant problems.


>so it should NOT satisfy "fundamental relationships for numbers to have".

If you have a list with a NaN in it, how should you make sort terminate (and where should the NaN end up)? I understand that in theory it is kind of arguable that NaN should be different, but breaking the total order is a really dumb decision.

>you'd have to pre-check every operation to do serious numerical work, which basically cuts your performance in half.

Can you give an example? Saturating overflow tends to do the same thing.

>IEEE explicitly guarantees these hold

This is kind of true. -0.0+0.0==0.0 and 0.0-0.0==0.0. IEEE does define -0.0==0.0 so IEEE does technically make this hold, but only by redefining == so that two different numbers are ==

> -0 is not a "limit from the negatives"

Then what is it? It's not a real number, and Kahan's justification of them comes from branch cuts of analytic functions, which only makes sense in the context of limits https://homes.cs.washington.edu/~ztatlock/599z-17sp/papers/b...

> Integers have the same property

Yeah, and it sucks there too. In the fp case it makes it really annoying to do things like divide an array by a float quickly and accurately. You would want to take the inverse of the divisor and multiply by that, but doing so isn't safe if the divisor is subnormal.

Yes. My day job is in solving Differential Algebraic equations, but I also have written a bunch of Julia's Libm.


>If you have a list with a NaN in it, how should you make sort terminate

Do whatever you want. If you're sorting floats, sort them to the front. Every language I've ever used for developing numerical software has a trivial IsNaN equivalent. So that's not a complaint worthy of claiming NaNs are not useful. I've written lots of numerical software and not once has this been an issue for me.
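The "sort them to the front" approach can be sketched with a sort key that never compares a NaN against anything. The function name is made up for this example, and it shows only one possible policy.

```python
import math

def sort_nans_first(values):
    """Sort floats ascending, with every NaN moved to the front."""
    # The key's first element separates NaNs (False) from numbers (True);
    # NaNs get a placeholder 0.0 so the NaN itself is never compared.
    return sorted(
        values,
        key=lambda v: (not math.isnan(v), 0.0 if math.isnan(v) else v),
    )

result = sort_nans_first([3.0, float("nan"), 1.0, float("nan"), 2.0])
print(result)  # [nan, nan, 1.0, 2.0, 3.0]
```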

What value do you assign sqrt of a negative without some NaN type item? Or any of tons of other "not a number" results?

>so IEEE does technically make this hold, but only by redefining ==

There's no "redefining ==" here. You are upset that bit patterns are different, but == is not for bit patterns. You are confusing == for floats with == for bit patterns, which are not and need not be the same thing. I've never seen a language that gets these confused. If you want float ==, simply use language ==. If you want bitwise ==, then you usually have to do (often not portable) fiddling to convert to a bit pattern. It's like claiming reference == and structure field == should be the same, but both have uses. So languages have all sorts of ways to use the concept of equality, and they are all useful. Confusing them does not make the ones you don't like invalid or not extremely useful for people that do understand and use them.

>Yes. My day job is in solving Differential Algebraic equations, but I also have written a bunch of Julia's Libm.

Good. Then you should understand why, as an example, the C++ standard library has a massive number of functions like fma, expm1, log1p, hypot, and many more. Sure, you can simply write log(1+x) instead of using log1p, but log1p is vastly better in this case because properties of IEEE 754 allow more precision. Instead of hypot(x,y) you could write sqrt(x*x+y*y), but hypot is much better. These functions exist since IEEE provides tools to analyze these and make much better versions than the naive way to write them. Unums, with varying precision, make this vastly harder (and lose precision over the domain, making it hard to analyze anything).
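The difference between the naive and the dedicated functions is easy to see with extreme inputs. This sketch uses Python's `math` module rather than the C++ standard library mentioned above; the input values are chosen purely to trigger the failure modes.

```python
import math

x = 1e-20
naive_log = math.log(1 + x)   # 1 + x rounds to exactly 1.0, so this is 0.0
stable_log = math.log1p(x)    # keeps the information in the tiny argument

big = 1e200
naive_hyp = math.sqrt(big * big + big * big)  # big*big overflows to inf
stable_hyp = math.hypot(big, big)             # stays finite

print(naive_log, stable_log)
print(naive_hyp, stable_hyp)
```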

So unums, with varying precision, violate fundamental properties for scientific computing, namely, they lose precision in really messy ways. You cannot start with P digits of precision and do even simple math and get an answer with P digits of precision. IEEE does allow this.

For example, sqrt(x^2)=|x| in IEEE (for no under/overflow). This does not work in unums, since they lose precision. Square something and lose digits. Fundamental to lots of scientific computation is the requirement to maintain precision throughout a calculation. Unums fail this spectacularly, making it incredibly messy to do correct scientific work.


The posit standard has a NaR value that does everything I wish NaN and Inf did in IEEE. It is the result of 1/0, sqrt(-1), etc. There is only one of them, it compares equal with itself, and it is defined as less than all other posits. Real numbers have a total order so it's silly that floating point doesn't. Furthermore, the posit ordering operations (bitwise) are the same as the signed integer ones, which makes your processor simpler and makes it easier to do things like write radix sorts for floats.

> You are confusing == for floats with == for bit patterns

The problem is that == for floats doesn't behave like an equality operation. x==x doesn't hold (reflexivity) and x==y => f(x)==f(y) doesn't hold. These are the two most important parts of what equality means.
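Both properties being argued about here can be observed in a few lines. The sketch uses `math.atan2` as the function f, which is one convenient choice because it is sensitive to the sign of zero.

```python
import math

nan = float("nan")
pos_zero, neg_zero = 0.0, -0.0

reflexive = (nan == nan)               # False: NaN is not equal to itself
zeros_equal = (pos_zero == neg_zero)   # True: the two zeros compare equal
angle_pos = math.atan2(0.0, pos_zero)  # 0.0
angle_neg = math.atan2(0.0, neg_zero)  # pi: equal inputs, different outputs

print(reflexive, zeros_equal, angle_pos, angle_neg)
```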

To take your example of sqrt(x*x), for Float16, of the 65k values, 34k give exact answers (counting NaNs as exact otherwise subtract 2k), 16k overflow and 5k underflow. There are also 9k inexact answers of which 6k are within 2 ULPs, and the others are further off (since x*x loses precision due to subnormals). So in other words you get exact answers 1/2 of the time and close answers 60% of the time. With Posit16, you get 47k exact answers, and 18k inexact answers. How inexact are these inexact answers? 15k are within 2 ULP and only 2.9k aren't. (Of the 2.9k that aren't, Float16 would have overflowed or underflowed in all but 278 of the cases and these 278 cases are all accurate to less than 4 ULPs).

Posits do lose the ability to do error free transforms, but IMO for 32 bit and smaller math, this isn't a major loss as if you want more accuracy you can use more bits and it will usually be faster than the error free transform.


I've done a similar experiment with log1p(expm1(x)) and for that Float16 has 35k exact, 26k overflow, 1.3k within 4 ULP and 3k with more than 4 ULP error. Posits for comparison are 38k exact, 19k within 4 ULP, and 8k more than 4 ULP.


>To take your example of sqrt(x*x),

Yes, for small floats posits do ok, but they fail for other sizes. For example, here's float32 vs posit32 for 100,000 random values in ranges 1e2,1e4,..,1e18.

Posit32 fails on (respectively) 21%, 71%, 91%, 97%, 99%, 99.9%, 99.97%, 99.99% of the cases. Float32 fails on 0 of them. Julia code at the bottom.

Posit even fails on simple integer multiplication so often that you'd be terribly pressed to know ahead of time when it happens. For example, take integers 1 to 40 for i and j, multiply as posit16 and as float16, and see how they do. Posit fails 1.75% of the values, float fails none.

This is simple multiplication of numbers well within range. The same problem happens in posit32,64,anysize, but not for the same sized floats.

>These are The two most important parts of what equality means.

As a PhD in math, this is not what equality means. You'll find nothing like that here for example https://en.wikipedia.org/wiki/Equality_(mathematics)

And if you're worried about equality, you might notice that in posit16, 27*39 gives 1052 instead of 1053, which is real (in)equality. You worry so much about made up concerns that you miss the crazy bad results scattered throughout posits.

Posits of all sizes make errors when multiplying by powers of two that floats do not make (due to their inability to keep digits). For example, in posit8, 2.0*1.03125 returns 2.0, 10*2 returns 16, and examples this bad can be found for any size posit.

To see this, take 1e6 random values in 0-100, mult by 2, then divide by 2, and see how many made it round trip. All float16 values do. 4% of posit16 values do not round trip. These are small numbers - the entire computation stays in the range 0-200, and this is even the base of the underlying number. Posit32 has the same failure rate for the same reason: posits lose precision even under small multiplications.
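The float16 half of this experiment can be reproduced with the standard library alone, since `struct` supports the binary16 format. This is a sketch with an arbitrary seed and 10,000 samples rather than 1e6; the posit16 side is not reproduced here.

```python
import random
import struct

def f16(x: float) -> float:
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

random.seed(0)  # arbitrary seed, for reproducibility
failures = 0
for _ in range(10_000):
    x = f16(random.uniform(0.0, 100.0))
    doubled = f16(x * 2)      # round after each operation, like real float16
    back = f16(doubled / 2)
    if back != x:
        failures += 1

print(failures)  # 0
```

Multiplying or dividing by 2 only changes the exponent of a binary float, so as long as nothing overflows or underflows the round trip is exact.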

As a result, posits fail at x-y=0 means x=y, which is also pretty fundamental, is it not?

Want to compute a discriminant sqrt(b*b-4*a*c)? Good luck, nearby values for a,b,c don't give smooth results, and routinely give imaginary numbers when they should be real (due to the above screwiness around powers of 2).

There's so many failure cases, not even at the edge of the ranges, where posits fail and equivalent sized floats don't, that doing any simple computations is error prone.

Here's the Julia code for the sqrt failures. You can do similar error checks for a ton of computations and you'll find posits failing a significant amount of them.

     # count failures of Float32 and Posit32 in Julia
     # for sqrt(x*x) ==?= x
     using Random
     using SoftPosit # assumption: Posit32 comes from the SoftPosit.jl package

     Random.seed!(1234) # make reproducible

     for s in 1:9 # scales 1e2, 1e4, 1e6, ..., 1e18
         scale = 100.0f0^s

         badF,goodF = 0,0
         badP,goodP = 0,0
         for i in 1:10000
             f = rand()*scale # Float64 sample in [0, scale)

             # Float32 path: square, then take the square root
             f1::Float32 = f
             f2 = f1*f1
             f3 = sqrt(f2)

             @assert typeof(f1) == Float32
             @assert typeof(f2) == Float32
             @assert typeof(f3) == Float32

             # Posit32 path: same computation
             p1 = Posit32(f)
             p2 = p1*p1
             p3 = sqrt(p2)

             @assert typeof(p1) == Posit32
             @assert typeof(p2) == Posit32
             @assert typeof(p3) == Posit32

             if f1 != f3
                 badF+=1
             else
                 goodF+=1
             end

             if p1 != p3
                 badP+=1
             else
                 goodP+=1
             end
         end

         println("Scaling: $(scale)")
         println("float: $(goodF) good, $(badF) bad, $(100*badF/(goodF+badF)) % failed")
         println("posit: $(goodP) good, $(badP) bad, $(100*badP/(goodP+badP)) % failed")
     end


> Posit32 fails on (respectively) 21%, 71%, 91%, 97%, 99%, 99.9%, 99.97%, 99.99% of the cases. Float32 fails on 0 of them. Julia code at the bottom.

A lot of those are only losing a bit or two of precision, though, and many of them are happening in a region where posit32 has more bits of precision than float32 to start with.

I'm not very fond of having only 2 exponent bits on the standard posit32. It makes the region with significant precision loss a lot bigger. But if you give a posit exactly two fewer exponent bits than a float, it shouldn't do worse than that float by more than one bit anywhere. I think that would be a better apples-to-apples comparison. Standard posit32 is tuned much more toward having bonus precision near 1.0, and other tests would show it beating float32 for many tasks.

> Posit even fails on simple integer multiplication so often that you'd be terribly pressed to know ahead of time when it happens. For example, take integers 1 to 40 for i and j, multiply as posit16 and as float16, and see how they do. Posit fails 1.75% of the values, float fails none.

Posit16 starts failing to represent odd numbers at 1024. Float16 starts failing to represent odd numbers at 2048.

Not ideal but not a big issue.

1052 vs. 1053 is definitely not a "crazy bad problem"!

And on the other hand, consider multiplying numbers from 1 to 400. Around the high end of the results, posit16 will store numbers below 65k with 8 bits of mantissa, and numbers above 65k with 7 bits of mantissa. Float16 will store numbers below 65k with 10 bits of mantissa, and everything above 65k becomes infinity.

> Posits of all sizes make errors when multiplying by powers of two that floats do not make (due to their inability to keep digits). For example, in posit8, 2.0 * 1.03125 returns 2.0, 10 * 2 returns 16, and examples this bad can be found for any size posit.

What's your chosen IEEE competitor?

The example on wikipedia has 3 bits of mantissa. It also says 2.0 * 33/32 == 2.0

And while that format can represent 20, it only has a third of the dynamic range. Tradeoffs. No 8 bit format is going to be good.

What kind of float would pass the 1.03125 test, anyway? You'd need 5 bits of mantissa to represent that. In an 8 bit float, that means you have a... 2 bit exponent? No, that would be ridiculous, it would have to be a 3 bit exponent with no sign bit? That would imply the smallest normal value is 1/4 and the largest value is 15.75

If we face that against an 8 bit unsigned posit with es=1, then it would have 5 bits of mantissa in [1/4, 4), 4 bits of mantissa in [4, 16), and 3 bits of mantissa in [16, 64).

That means this posit would be able to represent both 2.0 * 1.03125 and 20, and it would have four times as much as dynamic range.

I wasn't expecting such a blowout until I started doing the math. Wow, congrats posit.

> To see this, take 1e6 random values in 0-100, mult by 2, then divide by 2, and see how many made it round trip. All float16 values do. 4% of posit16 values do not round trip. These are small numbers - the entire computation stays in the range 0-200, and this is even the base of the underlying number. Posit32 has the same failure rate for the same reason: posits lose precision even under small multiplications.

Yes, they lose one bit of precision if you cross certain thresholds. That is a cost to keep in mind. But look at which values fail to round-trip. It should mostly be numbers between 8 and 16, I think? Those numbers started with 11 bits of mantissa, and they were reduced to 10. Float16 is always 10. This is not a problem. For posit32 it's even more stark. Those numbers start with 27 bits of mantissa, and get reduced to 26 bits. Float32 is always 23.

> As a result, posits fail at x-y=0 means x=y, which is also pretty fundamental, is it not?

I don't think that would happen. Do you have an example number?

What can happen is that k*x - k*y=0 even though x!=y. With normal floating point that can't happen when k is a power of 2, but it can happen with lots of values of k. 4/3 * 3 - 4/3 * 3.0000000000000004 = 0, with normal floating point. In fact every third pair of adjacent floats incorrectly returns 0.
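That example can be run directly with Python floats, which are IEEE 754 doubles on essentially every platform. The variable names are just for illustration.

```python
x = 3.0
y = 3.0000000000000004         # the next double after 3.0
k = 4 / 3

collapsed = k * x - k * y      # both products round to 4.0
scaled = 2.0 * x - 2.0 * y     # power-of-two scaling keeps them distinct

print(x != y, collapsed, scaled)
```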


>A lot of those are only losing a bit or two of precision, though, and many of them are happening in a region where posit32 has more bits of precision than float32 to start with.

Most aren't in those regions. Posits only have more precision for small numbers compared to most of those in this test. Numbers in these ranges are common in computing - write a mapper, or any physics sim, or a CAD tool, or a video game, or nearly anything, and you'll soon find numbers needed over many orders of (decimal) magnitude. Posits simply don't handle any of these cases well.

And losing "only a bit or two" when doing two operations leads to massive loss in long calculations with even less stable operations. That these lose so much immediately is absolutely terrible for things that do actual numerical work. Things like BLAS, which underlie huge amounts of computing, have plenty of papers on analysis of what happens in actual practice, and posits will not handle much at all of it.

>1052 vs. 1053 is definitely not a "crazy bad problem"!

Really? When it shows up in spreadsheets in various forms I expect people will think otherwise. The Intel floating point fiasco and Excel bugs with vastly less error certainly caused major problems.

>And on the other hand, consider multiplying numbers from 1 to 400. Around the high end of the results, posit16 will store numbers below 65k with 8 bits of mantissa, and numbers above 65k with 7 bits of mantissa. Float16 will store numbers below 65k with 10 bits of mantissa, and everything above 65k becomes infinity.

Both overflow - float16 to inf (denotes overflow, correct), posit16 to NaR (not a real, incorrect - the result is a real number). float16 also then says 0 < inf, correct. Posit16 says NaR < 0, which is not correct.

>Standard posit32 is tuned much more toward having bonus precision near 1.0,

Congrats if you only have calcs that stay near 1.0. If that is your use case, then a format tuned specifically to that case will outperform posits (like lots of things being tried in ML).

>What kind of float would pass the 1.03125 test, anyway?

All of the IEEE formats - multiplying or dividing by two never loses precision until overflow or underflow since it's merely incrementing/decrementing the exponent. Posits, with their changing precision, lose significance over their entire range. A fundamental rule in scientific computing is to keep the same precision - if you know something with D digits (or bits) of accuracy, you should keep that precision, otherwise you simply get bad answers. This is taught in elementary schools if I recall.
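The float16 half of this claim can be checked with nothing but the standard library, since Python's struct module supports IEEE 754 binary16 through the "e" format code. The helper names below are hypothetical.

```python
import random
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def float16_round_trips(n=20000, seed=0):
    """True if (x * 2) / 2 == x for n random float16 values in 0-100."""
    rng = random.Random(seed)
    for _ in range(n):
        x = to_float16(rng.uniform(0.0, 100.0))
        doubled = to_float16(x * 2)
        if to_float16(doubled / 2) != x:
            return False
    return True


print(float16_round_trips(), to_float16(2.0 * 1.03125))
```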

Also why did you ignore the 10*2 = 16 posit case? And these happen for all size posits for reasonable ranges, but none of the IEEE formats. The values I gave were for examples. Run your own checks and you'll find them all over the map for any size posit.

>> As a result, posits fail at x-y=0 means x=y, which is also pretty fundamental, is it not?

>I don't think that would happen. Do you have an example number?

It's a fundamental property of IEEE 754 that the difference (or sum) of two representable values is itself representable in the same IEEE 754 format (and it's also true for mult and div, all of which allows multiprecision like double-double to exist). These properties are fundamental in proving theorems and are used by compilers to optimize code. It's also a theorem for posits that none of these fundamental properties hold, with the conclusion that there exist values violating it.

Examples where the error in addition is not representable are a = Posit16(0x0001) = 2^-114: a+a should be 2^-113, which is not representable as posit16; a+a as posit16 gives 2^-112, with an error from the correct value of 2^-113, which is not representable. For Posit32 this happens for example at Posit32(0x0...03). I gave plenty of examples above from which you can compute that posit errors are not posit representable, making plenty of algorithms impossible without significantly more computation.

IEEE was ratified in 1985 and had already been incorporated into hardware from major vendors (using pre-draft). bfloat was invented around 2018 and has significant major hardware vendor support (Google, Intel (even in Xeon processors), FPGAs, ARMv8.6, AMD ROCm, and NVidia CUDA among others). Unums were invented in 2015, and what major hardware support do they have? None? There are even plenty of other float formats adopted in microcontrollers and other hardware used in practice. I've seen none that implement posits. Usually when something is really good it gets incorporated into hardware quite rapidly. Posits have not.

Why do you think that is?


> Most aren't in those regions.

If you're only looking at places where posit is worse than float, and ignoring the places where it's better, you should use a posit with a longer exponent where none of the values lose more than a single bit of precision. (There will be low-precision values, but they will be numbers that are completely unrepresentable in the equivalent float)

> Really? When it shows up in spreadsheets in various forms I expect people will think otherwise. The Intel floating point fiasco and Excel bugs with vastly less error certainly caused major problems.

Surely 2257 vs. 2258 is roughly as bad? But float16 has the exact same problem with those numbers. It's not "crazy bad" that the threshold is in a slightly different spot. 16 bit numbers are not appropriate for spreadsheets no matter what format.

> Both overflow - float16 to inf (denotes overflow, correct), posit16 to NaR (not a real, incorrect - the result is a real number). float16 also then says 0 < inf, correct. Posit16 says NaR < 0, which is not correct.

posit16 most certainly does not overflow on 400x400. It doesn't overflow on 4000x4000 either, or 4 million x 4 million.
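For a rough sense of the headroom, here is the largest finite value of each format. This assumes posit16 with es=2 (as in the 2022 posit standard); posit_maxpos is a made-up helper.

```python
def posit_maxpos(nbits, es):
    """Largest finite posit(nbits, es): useed ** (nbits - 2), with useed = 2 ** (2 ** es)."""
    return 2.0 ** ((nbits - 2) * 2 ** es)


FLOAT16_MAX = 65504.0                        # largest finite IEEE 754 binary16 value

for a in (400.0, 4000.0, 4e6):
    product = a * a
    print(a, product > FLOAT16_MAX, product < posit_maxpos(16, 2))
```

The products far above 65k are stored with very few fraction bits, but they stay finite.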

And if you treat "infinity" literally it's not correct at all. If you're going to say "denotes overflow, correct" then it's only fair to say the same thing about NaR. Pretend it's "not a result" maybe? It's just a name.

> All of the IEEE formats - multiplying or dividing by two never loses precision until overflow or underflow since it's merely incrementing/decrementing the exponent.

Let me rephrase. What 8 bit float can represent 1.03125 in the first place?

> Also why did you ignore the 10*2 = 16 posit case? And these happen for all size posits for reasonable ranges, but none of the IEEE formats. The values I gave were for examples. Run your own checks and you'll find them all over the map for any size posit.

There is no standard 8 bit IEEE format as far as I know.

I didn't ignore that case. I pointed out how the standard posit8 fails it. But if we're using some kind of weird custom float with no sign bit, I think it's valid to use a better-balanced posit. Posit8_1 is the best competitor to a custom float with 5 mantissa bits. If you also remove the sign bit, then it can do 10 * 2 = 20.

> It's a fundamental property of IEEE 754 that the difference (or sum) of two representable values is itself representable in the same IEEE 754 format

Except when you hit infinity.

> Examples where the error in addition is not representable are a= Posit16(0x0001)=2^-114, a+a should be 2^-113, not representable as posit16, a+a as posit 16 gives 2^-112, with error from correct of 2^-113, not representable.

1. Such aggressive rounding only happens in the last couple values next to 0. If you were using float16 you wouldn't be able to represent that value, it would just be 0.

2. Does that give you x - y = 0 for different x and y, the thing I asked about?

> Why do you think that is?

I have no idea how hard it is to implement, to be honest. But adding a few more bits is easy in comparison. And it's not that different from normal floating point, so who wants to bother?

bfloat is just truncating differently; I'm not surprised it was very quickly implemented.


Also to be clear, unums v1 and v2 were (mostly) dumb ideas that haven't gone anywhere. Unums v3 (aka posits) are a (IMO) really good idea for how to generate a better floating point standard (see https://posithub.org/docs/posit_standard-2.pdf)


> lack of hardware support

That seems fixable. Don't people make chips that do what you want a lot of? A chip with an array of 8-bit posit PUs could process a hell of a lot in parallel, subject only to getting the arguments and results to useful places.


I am curious whether folks have tried hybrid methods using ensembles.

Train the main model using FP8 (or other quantized approaches) and have a small calibrating model at FP32 that is trained afterward.


I'm more intrigued by the adoption of Posits for ML (particularly DL) tasks: https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/posits...


> People who follow a strict neuromorphic interpretation have even discussed binary neural networks, in which the input functions like an axon spike, just 0 or 1.

How do you perform differentiation with this datatype?


Presumably you could accumulate activation counts along the activation path, in this case.

It wouldn't provide any gradient at a single unit, though.


Or perhaps using batches, I suppose.


In the old days of CS, people were talking about optimizations in the big-O sense.

Nowadays the talk is mostly about optimization of constant factors, so it seems.


I don't know when the "old days" were for you, but I remember plenty of constant factor optimization talk in CS two+ decades ago. In fact the "ignore constant factors" mantra was only valid in theoretical CS; most practical subjects saw plenty of gains in constant factors.


Why go so extreme when you can have fp12? Perhaps have 4 high bit exponent and low signed int8 mantissa.

Or vice versa, 7 bit exponent, sign and 4 bit mantissa.


I think the general idea is to make use of SIMD, and generally the max size is 2^x. So if you're trying to multiply as many numbers as possible in, say, 64 bits, FP8 would get you 8x8 and FP16 would get you 4x4. FP12 would get you 5x5 with some unused space which would be a huge amount of extra work to implement for a 20% gain in efficiency.


You can pack arrays of NPOT elements by breaking each element into POT components and transposing. So e.g. you could have one array of 8-bit mantissas and one array of 4-bit exponents.
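A minimal sketch of that split for a 12-bit element. The layout (4-bit exponent in the high bits, 8-bit mantissa in the low bits) is an assumption taken from the fp12 suggestion upthread, and pack_fp12 / unpack_fp12 are hypothetical names.

```python
def pack_fp12(values):
    """Split 12-bit values into a byte array of mantissas and a nibble array of exponents."""
    mantissas = bytes(v & 0xFF for v in values)
    exps = [(v >> 8) & 0xF for v in values]
    if len(exps) % 2:
        exps.append(0)                       # pad to an even number of nibbles
    exponents = bytes(exps[i] | (exps[i + 1] << 4)
                      for i in range(0, len(exps), 2))
    return mantissas, exponents


def unpack_fp12(mantissas, exponents):
    """Rebuild the 12-bit values from the two power-of-two arrays."""
    out = []
    for i, m in enumerate(mantissas):
        nibble = (exponents[i // 2] >> (4 * (i % 2))) & 0xF
        out.append((nibble << 8) | m)
    return out


values = [0xABC, 0x123, 0xFFF, 0x001, 0x800]
mantissas, exponents = pack_fp12(values)
print(len(mantissas), len(exponents), unpack_fp12(mantissas, exponents) == values)
```

Both arrays have power-of-two element sizes, so each can be loaded with ordinary aligned SIMD loads and recombined in registers.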


Supposedly, 8-bit numbers could be even better if they were posits, rather than straight floats, but I haven't seen any commercial interest in posit arithmetic yet.


no



