Comments:"Take care, its Floating Point Arithmetic | The Limbeck"
URL:http://thelimbeck.wordpress.com/2013/04/12/take-care-its-floating-point-arithmetic/
Yesterday night, I was just idling on IRC and happened to come by an interesting discussion that was taking place there. The original issue was that doing Taylor approximation used to give identity for very small numbers, which gradually turned out to be an enlightening discussion. Thanks to Boris Zbarsky for clearing my doubts on this topic.
This post is a note to self (and others as well) so that I can keep them in mind in future.
For example we want to calculate . By the normal Taylor Series expansion, we would get:
But if we want to add them in a computer as then you would probably get a correct answer. But if you were to do the same for a very small
you would probably end up getting
, which is not correct.
The correct way out of the problem would be to use something like . The reason being for a very small
the correct answer is “basically
“. So for this we would need to use:
- Next term error estimation to decide how many terms we would want
- Add the terms from smallest to largest
This gives as some important insight into floating point arithmetic. For example we just want to add the following floats , and
instances of
. So what is
?
You would say almost (even I thought so). But to the computer that is exactly one. This where floating point arithmetic turns evil, because single precision floats only have
mantissa bits. So the rest of the thing is rounded off, and you get one.
But if you start off by adding with the smaller guys, they can actually accumulate and give the correct result. So if we add ,
times and then add
, we would get
, quite opposite to the previous result.
Floating point arithmetic is not arithmetic we learnt in school. Its not commutative nor associative nor distributive. It also implies that just because these properties are not satisfied compilers can optimize floating point operations either. The same reason why you have -ffast-math to tell gcc just to do you floating point operations fast and not care about the correct answer at all.