Taming Floating Point Error

If you’ve been a software engineer for long enough, it is very likely that you’ve seen this example of floating point perfidy:

>>> 3.0 / 10
0.3
>>> 0.1 * 3
0.30000000000000004

We understand that this is due to the fact that floating point numbers, stored in only 64 bits, cannot represent the entire real number line. Moreover, when we perform operations on these floating point numbers, the errors inherent in their representation can accumulate and multiply. The moral of the story is: never use a floating point number to represent money.

At least, that is the moral for financial applications. At Square, we used a long amount_cents, and we got on with our lives. However, for most applications that have a good reason to use floating point, this can’t be the end of the story. If floating point were the unpredictable, unreliable thing that I once believed it to be, we wouldn’t be able to numerically solve differential equations, or linear systems, or land on the moon. Rather, there is a science of floating point error, part of a broader science of numerical error, which seeks to tame and understand what happens to errors as they flow through our calculations. In general, numerical error can be quantified rather precisely, as we’ll see. Along the way we’ll look at the Fundamental Axiom of Floating Point Arithmetic, which, at the very least, sounds way cooler than “10 things every developer should know about floating point numbers”.
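For completeness, here is a minimal Python sketch of the integer-cents idea (the variable names are mine, not Square’s actual code): keep amounts as integer counts of the smallest currency unit, and all of the arithmetic stays exact.

# A minimal sketch of the integer-cents approach: store money as an
# integer number of the smallest currency unit, so every value is exact.
price_cents = 1999                       # $19.99
quantity = 3
total_cents = price_cents * quantity     # exact integer arithmetic: 5997
print(f"${total_cents // 100}.{total_cents % 100:02d}")   # $59.97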

Machine Epsilon

Fundamentally, error in floating point is due to the problem of “roundoff”. Just as in base 10, we cannot represent the number \(1/3\) without rounding it off somewhere:

\begin{equation*} \frac{1}{3} = 0.33333333333333 \dots \approx 0.33333333, \end{equation*}

in base 2, we cannot represent many numbers without rounding. Of course, some numbers we can represent exactly. The number 1, for example. Or any integer in the range \((-2^{53}, 2^{53})\). Also notably, many fractions can be exactly represented:

\begin{align*}
\frac{1}{2} &= 0.1_2 \\
\frac{3}{4} &= 0.11_2 \\
\frac{17}{8} &= 10.001_2 \\
&\vdots
\end{align*}
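A quick way to check this in Python: `fractions.Fraction` recovers the exact value a float stores, so dyadic fractions like these round-trip with zero error, while \(1/10\) does not.

>>> from fractions import Fraction
>>> Fraction(0.75) == Fraction(3, 4)     # dyadic fraction: stored exactly
True
>>> Fraction(17 / 8) == Fraction(17, 8)  # also exact
True
>>> Fraction(0.1)                        # 1/10: the stored value is only close
Fraction(3602879701896397, 36028797018963968)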

However, a number like \(1/10\), just like \(1/3\) in base 10, must be rounded to fit in the 24 or 53 bits of the mantissa. When we enter 0.1 in a console or in source code, the value that is actually stored is very slightly different from 0.1. According to this excellent IEEE-754 Floating Point Converter, the floating point number that is actually stored (for 64-bit floating point) is

\begin{align*}
&(0.0001100110011001100110011001100110011001100110011001101)_2 = \\
&(0.1000000000000000055511151231257827021181583404541015625)_{10}
\end{align*}

So the initial input to our calculation was flawed! We weren’t calculating 0.1 * 3, we were actually calculating

0.1000000000000000055511151231257827021181583404541015625 * 3

How much of an error is this? We can get an idea by counting the zeros in between the significant digits, \(0.100\dots 00055\). In this case, there are 16. So in simply entering a number which is not representable exactly in floating point, we have incurred a relative error of roughly \(10^{-16}\).
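You can reproduce this without an online converter: `decimal.Decimal` converts a float exactly, so a short Python check (my snippet, not part of the original example) shows both the stored value and the relative error it implies.

from decimal import Decimal

stored = Decimal(0.1)            # Decimal(float) converts the stored value exactly
print(stored)                    # 0.1000000000000000055511151231257827021181583404541015625

rel_err = abs(stored - Decimal("0.1")) / Decimal("0.1")
print(rel_err)                   # roughly 5.55e-17, i.e. on the order of 10**-16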

Indeed, in all cases we can expect to incur a relative error of roughly \(10^{-16}\). This magnitude is called machine epsilon, often written \(\epsilon_{\text{machine}}\). It comes from the relative spacing between two successive floating point numbers. Unlike on the real number line, for every representable floating point number \(x\) there is a next floating point number, and it is approximately \(x + \epsilon_{\text{machine}} x\). So an arbitrary real number \(x_0\) falls between two floating point values \(x\) and \(x + \epsilon x\) (leaving off the subscript for conciseness). When we represent \(x_0\) in floating point, we get one of these two values. Denote the floating point representation of \(x_0\) by \(\text{fl}(x_0)\). The absolute error is

\begin{equation*} e_{\text{abs}} = |\text{fl}(x_0) - x_0| \leq \max(|x_0 - x|, |x_0 - (x + \epsilon x)|) \leq |\epsilon x|. \end{equation*}

The relative error, then, i.e. the absolute error divided by the true value, is

\begin{equation*} e_{\text{rel}} = \frac{e_{\text{abs}}}{x_0} \leq \frac{|\epsilon x|}{x_0} \approx \epsilon. \end{equation*}

Cool! So we’ve seen that the worst we can do, in relative terms, when representing a number in floating point, is approximately \(10^{-16}\). This is, for almost all practical purposes, very good. Because remember, we’re speaking of a relative error. That means we’re able to represent even very small values very accurately. Here’s the nearest floating point number to \(10^{-20}\):

\begin{equation*} 0.000000000000000000009999999999999999451532\dots \end{equation*}

If you’re curious, I invite you to count the 9’s. There are 16 of them. Even when dealing with extremely small numbers, we maintain the same relative precision.
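A quick check in Python (assuming Python 3.9+ for `math.nextafter`): machine epsilon is available in `sys.float_info`, and measuring the gap to the next representable float confirms that the relative spacing stays around \(10^{-16}\) across wildly different magnitudes.

import math
import sys

print(sys.float_info.epsilon)               # 2.220446049250313e-16

for x in (1.0, 1e-20, 1e15):
    gap = math.nextafter(x, math.inf) - x   # distance to the next float up
    print(x, gap / x)                       # relative spacing ~ 1e-16 each time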

The Fundamental Axiom of Floating Point Arithmetic

Now you might be thinking: wait! It’s all well and good that we can get excellent relative accuracy when representing floating point numbers, but what about when we go to do something with them? Here we’ve got two floating point numbers, both of which are inexact, and we’re about to multiply them! Who knows what might happen?

This concern is well-founded, because the algorithms of floating point arithmetic must be implemented with finite precision. If we are asked to multiply two large numbers with pen and paper, the algorithm that most of us will use is the one we learned in school, which involves lots of addition, carrying, sub-multiplications, and so on. If all of those intermediate steps are using some kind of binary representation, then the intermediate products may be losing precision as we go along! This seems like a recipe for disaster. Fortunately, there is a property that we can require of a floating point implementation, one which is satisfied by IEEE-754 and most other popular floating point standards, that will save us from total anarchy. This is what Trefethen and Bau, Numerical Linear Algebra, refer to as the Fundamental Axiom of Floating Point Arithmetic:

All floating point arithmetic operations are exact up to a relative error of \(\epsilon_{\text{machine}}\).

This means that for any two floating point numbers, say \(x\) and \(y\), any operation involving them will give a floating point result which is within a factor of \(1 + \epsilon_{\text{machine}}\) of the true result. This is easiest to understand if we use a special notation to represent the floating point version of, say, \(+\). Let’s write \(\oplus\) to denote floating point addition, and \(+\) to denote exact addition. Then the Fundamental Axiom tells us,

\begin{equation*} \frac{|(x \oplus y) - (x + y)|}{|x + y|} \leq \epsilon_{\text{machine}}. \end{equation*}

Remember, \(x\) and \(y\) are exactly representable as floating point numbers, but of course they are also, mathematically, just real numbers. So \(x + y\) is a real number (a point on the number line), which may not be exactly representable in floating point. The Fundamental Axiom is telling us that the floating point operation \(\oplus\) can do no worse than simply trying to represent the number \(x + y\) as a floating point value directly, assuming our computer had access to the mathematical, infinite precision object \(x + y\).

Put another way, we incur no extra error by going through floating point arithmetic, than we would by using a magical computer to do exact arithmetic on our floating point values, and then casting back to floating point. In mathematical terms,

\begin{equation*} \frac{\left|(\text{fl}(a) \oplus \text{fl}(b)) - (a + b)\right|}{a + b} \approx \frac{\left|\text{fl}(\text{fl}(a) + \text{fl}(b)) - (a + b)\right|}{a + b} \approx \epsilon_{\text{machine}}. \end{equation*}

Recall that \(\text{fl}(a)\) is the mathematical number we actually represent, when we try to represent \(a\) in floating point. So the first fraction gives the relative error from performing \(\oplus\) on the floating point representations of \(a\) and \(b\), and the second fraction gives the relative error from performing the exact arithmetic operation, \(+\), on the floating point representations of \(a\) and \(b\), then casting the result to floating point.
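We can watch the Axiom at work with `fractions.Fraction`, which does exact rational arithmetic on the stored values. This little check (my sketch; the variable names are mine) relies on IEEE-754 addition being correctly rounded, which is exactly what the Axiom asks of \(\oplus\).

from fractions import Fraction

a, b = 0.1, 0.2
exact_sum = Fraction(a) + Fraction(b)    # exact + applied to the stored values fl(a), fl(b)

# IEEE-754 addition is correctly rounded, so a + b is exactly the
# nearest float to that exact sum:
print(a + b == float(exact_sum))         # True

rel_err = abs(Fraction(a + b) - exact_sum) / exact_sum
print(float(rel_err))                    # about 9.3e-17, comfortably below machine epsilon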

Error Analysis

The Fundamental Axiom of Floating Point Arithmetic allows us to analyze the numerical error that may be incurred by even complex arithmetic operations. To get an idea of how this works, let’s consider the problem of computing the length of a 2D vector, \([x, y]\). Mathematically, this has the form

\begin{equation*} d = \sqrt{x^2 + y^2}. \end{equation*}

There are several stages to the computation, and at each one we will incur a little bit of numerical error:

  • Enter the numbers \(x\) and \(y\) into the computer, incurring roundoff error.
  • Compute \(x^2\) and \(y^2\), using floating point multiplication.
  • Compute \(x^2 + y^2\), using floating point addition.
  • Compute \(\sqrt{x^2 + y^2}\), using the floating point square root operation.

At each step, the result we get will be equal to the true result, times some error factor \((1 + \epsilon)\), where each \(\epsilon\) is very small, on the order of \(\epsilon_\text{machine}\). Each operation may have a different error \(\epsilon\), but we’ll use the same symbol for all of them. We don’t care so much about the exact value of \(\epsilon\), only that it is very small.

We’re going to carry these \(\epsilon\) through the computation to see how they affect the final result. To make that easier, we can use special rules of arithmetic to manipulate \(\epsilon\):

\begin{align*}
(1 + m\epsilon)(1 + n\epsilon) &\approx 1 + (m + n)\epsilon, \\
\sqrt{1 + n\epsilon} &\approx 1 + \frac{n}{2}\epsilon,
\end{align*}

where we drop terms of order \(\epsilon^2\) and higher, since they are negligibly small compared to \(\epsilon\) itself.

With these rules, we’re ready to do the error calculation. Red expressions are the new error factor for each calculation, which appear in blue in the next expression down:

\begin{align*}
\text{fl}(x) &= {\color{red}{(1 + \epsilon)}} x \\
\text{fl}(y) &= {\color{red}{(1 + \epsilon)}} y \\[1ex]
\text{fl}(x) \otimes \text{fl}(x) &= (1 + \epsilon) \left[ {\color{blue}{(1 + \epsilon)}}^2 x^2 \right] = (1 + \epsilon)^3 x^2 \\
&= (1 + 3\epsilon + 3\epsilon^2 + \epsilon^3) x^2 \\
&= {\color{red}{(1 + 3\epsilon)}} x^2 \\
\text{fl}(y) \otimes \text{fl}(y) &= (1 + \epsilon) \left[ {\color{blue}{(1 + \epsilon)}}^2 y^2 \right] = (1 + \epsilon)^3 y^2 \\
&= (1 + 3\epsilon + 3\epsilon^2 + \epsilon^3) y^2 \\
&= {\color{red}{(1 + 3\epsilon)}} y^2 \\[1ex]
[\text{fl}(x) \otimes \text{fl}(x)] \oplus [\text{fl}(y) \otimes \text{fl}(y)] &= (1 + \epsilon) \left[ {\color{blue}{(1 + 3\epsilon)}} x^2 + {\color{blue}{(1 + 3\epsilon)}} y^2 \right] \\
&= (1 + \epsilon + 3\epsilon + 3\epsilon^2) \left[ x^2 + y^2 \right] \\
&= {\color{red}{(1 + 4\epsilon)}} (x^2 + y^2) \\[1ex]
\sqrt[\text{fl}]{[\text{fl}(x) \otimes \text{fl}(x)] \oplus [\text{fl}(y) \otimes \text{fl}(y)]} &= (1 + \epsilon) \sqrt{{\color{blue}{(1 + 4\epsilon)}}(x^2 + y^2)} \\
&= (1 + \epsilon)\sqrt{1 + 4\epsilon} \sqrt{x^2 + y^2} \\
&= (1 + \epsilon)(1 + 2\epsilon) \sqrt{x^2 + y^2} \\
&= {\color{red}{(1 + 3\epsilon)}} \sqrt{x^2 + y^2}
\end{align*}

Phew! After carrying our error factors through all steps of the calculation, we find that the total relative error is at most about 3 times machine epsilon. Keep in mind, this is an upper bound. We can’t really say anything about what the actual error will be, only that it is not greater than roughly \(3 \epsilon_{\text{machine}}\).
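We can sanity-check this bound empirically. The sketch below (the helper name and test values are mine) computes \(\sqrt{x^2 + y^2}\) naively in floating point, compares it against a high-precision `decimal` reference, and reports the relative error in units of machine epsilon.

import math
from decimal import Decimal, getcontext

getcontext().prec = 50          # plenty of precision for a reference value
EPS = 2.0 ** -52                # machine epsilon for 64-bit floats

def rel_error_in_eps(x, y):
    naive = math.sqrt(x * x + y * y)                     # the floating point computation
    exact = (Decimal(x) ** 2 + Decimal(y) ** 2).sqrt()   # high-precision reference
    rel = abs(Decimal(naive) - exact) / exact
    return float(rel) / EPS                              # error in units of epsilon

for x, y in [(0.1, 0.2), (3.0, 4.0), (1e-8, 1e8)]:
    print(x, y, rel_error_in_eps(x, y))                  # stays well under ~3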

A Cautionary Tale

The error analysis we did above was pretty carefree in manipulating the magical error value \(\epsilon\). We didn’t have to be too careful, because multiplication and addition of positive values (like \(x^2\) and \(y^2\)) are safe under those rules. However, when we’re considering subtraction, we have to be much more careful. Suppose \(X\) and \(Y\) are very large numbers, but that they are very close together: \(X - Y\) is small. Let’s try to compute \(X - Y\) in floating point and see what we get. We might be tempted to write

\begin{align*}
\text{fl}(X) \ominus \text{fl}(Y) &= (1 + \epsilon) \left[ (1 + \epsilon) X - (1 + \epsilon) Y \right] \\
&= (1 + \epsilon)^2 (X - Y) \\
&= (1 + 2\epsilon) (X - Y).
\end{align*}

However, we can’t pull out a factor of \(1 + \epsilon\) from the subtraction! We have to remember that we are only pretending that \(\epsilon\) is the same across these operations. In fact, to be more precise, we should write

\begin{equation*}
\text{fl}(X) \ominus \text{fl}(Y) = (1 + \epsilon_\ominus) \left[ (1 + \epsilon_X) X - (1 + \epsilon_Y) Y \right]
\end{equation*}

to distinguish between the error values we get at different steps. Now, consider what happens if \(\epsilon_Y = 0\), but \(\epsilon_X \neq 0\). Well, then,

\begin{equation*}
(1 + \epsilon_X) X - (1 + \epsilon_Y) Y = X + \epsilon_X X - Y = (X - Y) + \epsilon_X X.
\end{equation*}

We supposed above that \(X - Y\) was small, but \(X\) was large. Now we have shown that the absolute error incurred by subtracting the floating point representations of \(X\) and \(Y\) is of the order \(\epsilon_\text{machine} X\), which, relative to the small true result \(X - Y\), may be much larger than \(\epsilon_\text{machine}\)! Here’s a Python snippet to illustrate:

>>> (1e15 + 0.1) - 1e15
0.125

We should have gotten 0.1, but ended up with 0.125, an error of 25%! This effect is known as “catastrophic cancellation”, and numerical analysts are very careful to design their algorithms so as to avoid it.
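One standard remedy (my example, not one from the original post) is to rewrite the expression so that the cancellation happens symbolically instead of numerically. For instance, \(\sqrt{x + 1} - \sqrt{x}\) for large \(x\) subtracts two nearly equal numbers, but the algebraically equivalent form \(1 / (\sqrt{x + 1} + \sqrt{x})\) involves no subtraction at all:

import math

x = 1e15

# Naive form: subtracts two nearly equal square roots, so most digits cancel.
naive = math.sqrt(x + 1) - math.sqrt(x)

# Rewritten via the conjugate: mathematically the same value, no cancellation.
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(naive)   # badly contaminated by cancellation
print(stable)  # about 1.5811e-08, accurate to essentially full precision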

Learning More

The Fundamental Axiom of Floating Point Arithmetic gives us a starting point for analyzing numerical methods, but it is by no means the whole story. The very best numerical algorithms have a magical property called “backwards stability”, which essentially ensures that they exactly solve a problem which is almost the correct problem: the problem they solve exactly is within machine precision of the problem you meant to solve. These algorithms are far more accurate than they have any right to be. Once you start using the algorithms of numerical linear algebra to solve differential equations that evolve in time, there is a whole science of ensuring that the roundoff error incurred at each time step does not blow up exponentially. With these tools, rather than being intimidated by numerical error, we can anticipate and control it. While seeing your calculation give almost exactly the right answer is a good feeling, there is something even more satisfying about computing the error you got, and verifying that it, too, is exactly as large as expected.

Everything I know about this stuff, I learned from the excellent book Numerical Linear Algebra, by Lloyd Trefethen and David Bau, as well as from the inimitable Anne Greenbaum of the University of Washington.

