1: Note that the implementations included here will both output and calculate in double precision. We can still apply sophisticated methods in this setting, but as a general rule you’ll need higher intermediate precision to obtain an approximation which is indistinguishable from the true value in double precision. This is one (but not the only) reason why it’s generally advisable to use an optimized exp(x) function provided by your platform rather than write one from scratch; here we use the platform-provided exp(x) function as a benchmark.
2: In most cases, whichever numerical library you’re using will map to the platform-provided exp(x) function under the hood. For example, this is what numpy.exp does.
To motivate the definition of $e$, we will recall some calculus. Let $f$ be a function which is infinitely differentiable at some real value $a$. This is to say that we can repeatedly differentiate $f$: for every $k^{\text{th}}$ derivative $f^{(k)}$, we can differentiate $f^{(k)}$ again to obtain the $(k + 1)^{\text{th}}$ derivative $f^{(k + 1)}$. Any function with this property has a unique Taylor series expansion “centered” at $a$. The Taylor series of $f$ centered at $a$ and evaluated at $x$ is defined to be
$$ f(x) = \frac{f(a)}{0!}(x - a)^0 + \frac{f'(a)}{1!}(x - a)^1 + \ldots + \frac{f^{(k)}(a)}{k!}(x - a)^k + \ldots $$
This is a power series, or infinite polynomial in one variable. The center of expansion $a$ determines the neighborhood in which the Taylor series best approximates $f$, and the coefficient of each Taylor term is determined by repeatedly differentiating $f$ and evaluating the result at $a$. A common center of expansion is $a$ = 0, in which case the Taylor series is also called a Maclaurin series and is centered around the origin. This can be considered the “default” setting. If you cut off all terms of the Taylor expansion after some term $k$, you obtain a polynomial of degree $k$. The coefficient of the $k^{\text{th}}$ term of the series (or polynomial, if truncated) is given by
$$ a_{k} = \frac{f^{(k)}(a)}{k!} $$
where 0! = 1.
For a concrete example, consider the Taylor series expansion of the sine function. The sine function is not only infinitely differentiable, but cyclic.
$$ \begin{aligned} \sin^{(1)}(x) &= \cos(x), \\ \sin^{(2)}(x) &= \cos^{(1)}(x) = -\sin(x), \\ \sin^{(3)}(x) &= -\sin^{(1)}(x) = -\cos(x), \\ \sin^{(4)}(x) &= -\cos^{(1)}(x) = \sin(x) \end{aligned} $$
We determine each $k^{\text{th}}$ coefficient of the Taylor series by evaluating $f^{(k)}$ at $a$ and dividing it by the factorial of $k$. If we want to expand the sine function around the origin ($a$ = 0), we obtain the cyclic coefficients
$$ \begin{aligned} \sin(0) &= 0, \\ \sin^{(1)}(0) &= \cos(0) = 1, \\ \sin^{(2)}(0) &= \cos^{(1)}(0) = -\sin(0) = 0, \\ \sin^{(3)}(0) &= -\sin^{(1)}(0) = -\cos(0) = -1, \\ \sin^{(4)}(0) &= -\cos^{(1)}(0) = \sin(0) = 0 \end{aligned} $$
Since $(x - 0)^{k} = x^{k}$, we have the Taylor series expansion
$$ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \ldots $$
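Before moving on, this expansion is easy to check numerically. The following sketch (sin_taylor is a hypothetical helper written only for this illustration, not part of the implementations later in the post) sums the first several terms of the series and compares the result against math.sin:

```python
import math

def sin_taylor(x, n_terms=10):
    # Maclaurin series of sin(x) truncated after n_terms terms:
    # sin(x) = x - x^3/3! + x^5/5! - ...
    total = 0.0
    for k in range(n_terms):
        total += (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
    return total

# Near the center a = 0 the truncated series is extremely accurate,
# but far from the center it degrades quickly.
print(abs(sin_taylor(0.5) - math.sin(0.5)))    # tiny error
print(abs(sin_taylor(15.0) - math.sin(15.0)))  # large error with only 10 terms
```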
Truncating the Taylor expansion of a function $f$ at any term $k$ gives a finite approximation of $f$ using the degree-$k$ Taylor polynomial. A Taylor polynomial of $f$ centered at $a$ produces very accurate approximations of $f(x)$ when $x$ is relatively close to $a$. As $x$ moves farther away from $a$, the accuracy of the Taylor polynomial rapidly decreases, which means more terms of the Taylor series (i.e. a higher degree polynomial) are required for an accurate approximation. Consider the following plot, which shows the values of $\sin(x)$ over the interval $[-20, 20]$ compared to its Taylor polynomials of degree 1, 3, 5, 7 and 9 centered at $a$ = 0.
Observe that the Taylor approximation of $\sin(x)$ is more accurate when $x$ is near $a$ = 0, but quickly flies away from the true value of $\sin(x)$ further away from 0. The degree 1 Taylor polynomial is only an accurate approximation for $\sin(x)$ for a very small interval near the origin, whereas the degree 9 Taylor polynomial is very accurate within $[-5, 5]$. How long the approximation holds until it becomes extremely inaccurate depends on the number of terms of the Taylor polynomial. A higher degree polynomial will maintain a better approximation of $\sin(x)$ for longer, but any finite polynomial will eventually become extremely inaccurate.
The mathematical constant $e$ is (almost) entirely motivated by the very nice properties it exhibits under exponentiation. In particular, the definition of $e$ was born out of the desire to find a continuous function which is its own derivative and which maps the additive identity 0 to the multiplicative identity 1. This is because solving difficult integration and differentiation problems is vastly more expedient with such a function. By extension a significant fraction of problems in applied mathematics and physics reduce to solving differential equations, for which such a function is fundamental.
As it happens, $f(x) = e^x$ uniquely satisfies this property. We can show this, and define $e$ directly in the process, by starting from the Taylor series representation of an arbitrary function $f$ infinitely differentiable at $a$ = 0. Suppose $a_0, a_1, \ldots$ are the coefficients of the Taylor series of $f$ centered at $a$. Then we have the Taylor series
$$ f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots $$
It follows from the linearity of differentiation that the Taylor series expansion of the first derivative $f'$ is
$$ f'(x) = a_1 + 2a_2 x + 3a_3 x^2 + \ldots $$
To determine a function which is its own derivative, we solve for the coefficients $a_0, a_1, \ldots$ which satisfy $f = f’$:
$$ a_0 + a_1 x + a_2 x^2 + \ldots = a_1 + 2a_2 x + 3a_3 x^2 + \ldots $$
From here we can see the pattern
$$ \begin{aligned} a_0 &= a_1, \\ a_1 &= 2a_2, \\ a_2 &= 3a_3 \end{aligned} $$
and so on, which is equivalent to
$$ \begin{aligned} a_1 &= a_0, \\ a_2 &= \frac{a_1}{2}, \\ a_3 &= \frac{a_2}{3} \end{aligned} $$
By induction we have a recurrence relation which defines the $k^{\text{th}}$ coefficient $a_k$
$$ a_k = \frac{a_{k - 1}}{k} $$
Given $a_0$ = 1 (which is forced by our requirement that $f$ map 0 to 1), we find that the Taylor series of a function which is its own derivative is
$$ f(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots. $$
We denote this function with $e^x$, where $e$ is defined to be the value of this function at $x$ = 1.
$$ e = f(1) = \sum_{k = 0}^{\infty}\frac{1}{k!} = 2.71828\ldots $$
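As a quick sanity check (a sketch, not part of the implementations later in this post), we can sum this series directly in double precision using the recurrence $a_k = \frac{a_{k - 1}}{k}$ derived above:

```python
# Approximate e = sum of 1/k! using the recurrence a_k = a_{k-1} / k.
term = 1.0       # the k = 0 term, 1/0! = 1
e_approx = 0.0
for k in range(1, 25):
    e_approx += term
    term /= k    # next term 1/k! = (1/(k-1)!) / k
print(e_approx)  # 2.718281828459045, i.e. e to double precision
```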
A more intuitive illustration of why $e^x$ is special is given by the following graph, in which exponential functions with various bases are plotted alongside their derivatives. An exponential function with a base less than $e$, like $b$ = 2, is everywhere greater than its derivative. But when the base is greater than $e$, like $b$ = 4, the function is everywhere smaller than its derivative. Only when the base is exactly $e$ do the function and its derivative coincide.
3: Almost all real numbers are not computable. The reals which are computable are frequently not exactly representable to a desirable level of accuracy because they’re either irrational (and therefore have infinite decimal expansions) or rational with very long decimal expansions. The best we can do is approximate their values to some acceptable accuracy.

The IEEE 754 floating point standard discretizes real intervals into a computable form by mapping all nearby real values in given neighborhoods to a single rounded value. Internally, an IEEE 754 binary floating point number $N$ is represented using the normalized form
$$ N = \pm b_1.b_2b_3 \ldots b_p \times 2^{E_{k}} $$
where the first bit is allocated for the sign (the sign bit), the $p$ bits $b_1.b_2b_3 \ldots b_p$ comprise the mantissa, or significand, and $E_{k}$ is an integer exponent consisting of $k$ bits. Note that since this form is normalized, $b_1 = 1$, while each of $b_2, \ldots, b_p$ may equal 0 or 1. IEEE 754 single precision binary floating point numbers have a total size of $32$ bits: $8$ are allocated for the exponent $E \in [-126, 127]$ and $23$ are allocated for the mantissa (with $p$ = 24 accounting for the implicit normalized bit). Thus you can represent $2^{32}$ different values in single precision floating point, with underflow and overflow limits of $2^{-126} \approx 1.2 \times 10^{-38}$ and $2^{128} \approx 3.4 \times 10^{38}$, respectively. Likewise IEEE 754 double precision floating point values have a total size of $64$ bits: $11$ bits are allocated for the exponent $E \in [-1022, 1023]$ and $52$ bits are allocated for the mantissa (with $p = 53$). You can represent $2^{64}$ distinct values in double precision, with underflow and overflow limits of $2^{-1022} \approx 2.2 \times 10^{-308}$ and $2^{1024} \approx 1.8 \times 10^{308}$, respectively.
Floating point values are not evenly distributed. This is illustrated by the following density plot, which graphs all 16-bit floating point values against their base 2 exponents.
We can see the values are relatively densely clustered near $0$ and increasingly sparse the further you move away from the origin. As another example, half of all $32$-bit floating point numbers reside in the real interval $[-1, 1]$. Meanwhile the smallest and largest $32$-bit floating point numbers are $-3.4 \times 10^{38}$ and $3.4 \times 10^{38}$, respectively. More generally, in each interval $[2^n, 2^{n + 1}]$, the available floating point values are distributed with a spacing of $2^{n - p}$ between them, where $p$ is the number of stored mantissa bits for the precision under consideration (23 in single precision, 52 in double precision). As you move farther away from the origin, it becomes more likely that your calculations will bump into real values in between the available floating point values, which will be rounded to the nearest available value instead.
For any binary floating point system, we can derive the maximum precision available in that system using the number of bits $p$ available in the mantissa. Given the uneven distribution of floating point values, the absolute spacing between representable values grows as you move away from the origin (this contributes to approximation error!). For this reason we define machine epsilon, denoted by $\epsilon_{M}$, which is the difference between $1$ and the least representable floating point value greater than $1$. In single precision floating point, the smallest representable value greater than $1$ is $1 + 2^{-23}$, so $\epsilon_{M} = 2^{-23}$. Likewise in double precision we have $\epsilon_{M} = 2^{-52}$. The maximum decimal precision of these systems can be obtained by converting $\epsilon_{M}$ to a decimal value. Putting this all together, we see that $32$-bit floating point has a maximum decimal precision of about $7$ digits, and $64$-bit floating point has a maximum decimal precision of about $16$ digits. You’ll only be able to achieve this precision for a subset of the available floating point values; at the boundaries furthest from the origin you’ll only be able to obtain $6$ decimal digits of precision with $32$-bit values and $15$ decimal digits of precision with $64$-bit values.
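These quantities are easy to inspect directly. The following sketch uses NumPy’s finfo and spacing helpers, which report the machine epsilon and the local gap between adjacent floating point values described above:

```python
import numpy as np

# Machine epsilon for single and double precision.
print(np.finfo(np.float32).eps)   # 2**-23, about 1.19e-07
print(np.finfo(np.float64).eps)   # 2**-52, about 2.22e-16

# The absolute gap between adjacent doubles grows with magnitude.
print(np.spacing(1.0))            # 2**-52
print(np.spacing(2.0 ** 52))      # 1.0
print(np.spacing(1e300))          # a very large absolute gap
```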
In summary, single and double precision floating point systems have the following characteristics:
| Characteristics | Single Precision | Double Precision |
|---|---|---|
| Total Size | 32 bits | 64 bits |
| Sign Size | 1 bit | 1 bit |
| Mantissa Size | 23 bits ($p$ = 24) | 52 bits ($p$ = 53) |
| Exponent Size | 8 bits | 11 bits |
| Exponent Range | $[-126, 127]$ | $[-1022, 1023]$ |
| Machine Epsilon | $2^{-23}$ | $2^{-52}$ |
| Underflow Limit | $2^{-126} \approx 1.2 \times 10^{-38}$ | $2^{-1022} \approx 2.2 \times 10^{-308}$ |
| Overflow Limit | $2^{128} \approx 3.4 \times 10^{38}$ | $2^{1024} \approx 1.8 \times 10^{308}$ |
| Max Decimal Precision | 7 | 16 |
| Min Decimal Precision | 6 | 15 |
Numerical analysis traditionally considers four main categories of error. These are:

- Fundamental error. This arises when a model intrinsically deviates from reality in a nontrivial way. For example, the Kermack-McKendrick SIR model in epidemiology does not consider the possibility of reinfection by default, so it will exhibit significant fundamental error if a recovered population is also susceptible to infection. For our purposes we don’t need to be concerned about fundamental error.

- Floating point error. This type of error can occur in several ways. In the most straightforward sense, you cannot achieve more accuracy than is provided for by the precision of your floating point system. Any real number with more than 16 significant decimal digits simply cannot be expressed to perfect accuracy in double precision. You may find more subtle errors introduced through operations such as equality comparison, because each floating point number corresponds to infinitely many real values which are rounded to it. Floating point error also occurs if, in the course of your calculations, an intermediate result falls outside the underflow/overflow limits, even if completing the calculation would bring it back into the range of representable values.

- Discretization error. This is also known as truncation error. It occurs whenever you take a continuous function and approximate it by finitely many values. One example of discretization error is polynomial interpolation, where you use $n + 1$ points to approximate a continuous function $f(x)$ using a polynomial $P_{n}$ of degree $n$. A related example occurs when approximating functions using their Taylor series. A Taylor series cannot be evaluated in finite time because it has infinitely many terms, so you must truncate the series at some term $n$, which results in a polynomial $T_n$ of degree $n$. The infinitely many remaining terms of the Taylor series sliced off after truncation comprise the discretization error in your calculation.

- Convergence error. This is also known as gradual loss of significance. It occurs when an iterative calculation performs too much work and pushes results beyond the available precision. This is distinct from floating point error because it’s related to iterative repetition: the initial result may be representable (but rounded), and repeated iterations compound that initial rounding error until it passes a precision threshold, at which point further iterations decrease, rather than increase, the accuracy of the result.
Here is a straightforward example of floating point error. In double precision we have
$$ \left(1 + \frac{1}{2}\epsilon_{M}\right) - 1 = 0 \neq 1.11022 \times 10^{-16} = \left(1 - 1\right) + \frac{1}{2}\epsilon_{M} $$
Similarly we can demonstrate that floating point numbers are not associative under addition:
$$ (0.1 + 0.2) + 0.3 = 0.600\ldots 001 \neq 0.6 = 0.1 + (0.2 + 0.3) $$
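Both examples can be reproduced in a few lines (a sketch; the printed digits assume IEEE 754 double precision, which is what Python floats use):

```python
eps = 2.0 ** -52  # double precision machine epsilon

# The order of operations changes the result near the precision floor.
print((1.0 + 0.5 * eps) - 1.0)   # 0.0
print((1.0 - 1.0) + 0.5 * eps)   # 1.1102230246251565e-16

# Floating point addition is not associative.
print((0.1 + 0.2) + 0.3)         # 0.6000000000000001
print(0.1 + (0.2 + 0.3))         # 0.6
```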
Convergence error is particularly dangerous. Consider this very simple iterative calculation in Python:
```python
x = 10 / 9
for k in range(20):
    x -= 1
    x *= 10
    print(x)
```
which will return
1.1111111111111116, 1.1111111111111160, 1.1111111111111605, 1.1111111111116045, 1.1111111111160454, 1.1111111111604544, 1.1111111116045436, 1.1111111160454357, 1.1111111604543567, 1.1111116045435665, 1.1111160454356650, 1.1111604543566500, 1.1116045435665000, 1.1160454356650007, 1.1604543566500070, 1.6045435665000696, 6.0454356650006960, 50.454356650006960, 494.54356650006960, 4935.4356650006960
This is a textbook example of loss of significance. Each result has 16 decimal digits total, but the precision of the result decreases over time. Note that with a starting value of $x = \frac{10}{9}$, after one iteration we will have
$$ f(x) = (x - 1) \cdot 10 = \left(\frac{10}{9} - 1\right) \cdot 10 = \frac{1}{9} \cdot 10 = \frac{10}{9} $$
Therefore the value of $x$ should remain the same after every iteration. But instead we have a value which is rapidly diverging away from the true value with each iteration.
Finally, it’s not uncommon to see all of these types of errors occur in the same approximation. Suppose you are approximating a continuous function $f(x)$ by its Taylor series. By choosing a truncation point for the series, you immediately encounter discretization error and thereby place a lower bound on the total error of your approximation. Then, depending on the implementation of the Taylor series approximation, your intermediate calculations on the individual Taylor terms may suffer from compounding error over time. Your result may not be precisely representable in floating point, which will further decrease the accuracy of your calculation through rounding error.
We have two ways of measuring error. Given a function $f$ and its approximation $F$, the absolute error at some $x$ is given by
$$ \epsilon(x) = |f(x) - F(x)| $$
and the relative error at $x$ is given by
$$ \eta(x) = \left|\frac{f(x) - F(x)}{f(x)}\right| = \frac{\epsilon(x)}{|f(x)|} $$
Relative error depends on, and normalizes, absolute error. It’s useful when evaluating the accuracy of an approximation over a very wide range of possible values. For example, suppose we have an approximation $F$ of a function $f$ which returns values between 0 and 4,294,967,296. An absolute error of $\epsilon$ = 10 is essentially worthless for small values in that range, but quite good for larger values near the upper bound. To get a better picture of how accurate an approximation function is over such a range, we can instead consider the relative error. This comes with the caveat that the relative error is undefined at $x$ when the true value of $f$ at $x$ is precisely 0, because otherwise we’d have a division by $0$. Luckily we won’t encounter this problem here because there is no value of $x$ which satisfies $e^x$ = 0.
Absolute and relative error give us a way to rigorously quantify the accuracy of an approximation. If we can show an upper bound on one of these error metrics, we can assert the worst-case accuracy of the approximation. Moreover these error metrics are amenable to common summary statistics, so we can use them to determine the mean accuracy, or the accuracy by quantile. Relative error also provides us with a way to assess the accuracy of our calculation in terms of digits of precision. For any relative error $\eta$, the corresponding precision is the maximum integer $p$ which satisfies
$$ \eta \leq 5 \times \beta^{-p} $$
where $\beta$ is the base under consideration. Decimal precision assigns $\beta$ = 10 and binary precision assigns $\beta$ = 2.
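For example, a small helper (digits_of_precision is a hypothetical name chosen here for illustration) can invert that inequality to report how many base-$\beta$ digits a given relative error supports:

```python
import math

def digits_of_precision(eta, beta=10):
    # Largest integer p satisfying eta <= 5 * beta**(-p); assumes eta > 0.
    return math.floor(math.log(5.0 / eta, beta))

# Using the worst-case relative error reported later for the Taylor implementation:
print(digits_of_precision(8.39803008291388e-15))   # 14
# Machine epsilon in double precision corresponds to roughly 16 decimal digits:
print(digits_of_precision(2.220446049250313e-16))  # 16
```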
To avoid having to deal with underflow and overflow limits, we’ll implement $e^x$ for $x$ in the domain $[-709, 709]$. These values are near the limits which can be calculated in double precision (though technically we can evaluate $x$ down to -740 or so, if we don’t care about having a symmetric domain). Before we start on an implementation we need to establish a good benchmark function to compare it against. We’ll use the exp(x)
functions provided by NumPy in Python and the standard library in C++.
Next we’ll write a reporting function rel_error
which will return the relative error of our implementation versus the benchmark. Since we’re assessing an interval of values, this function will take as its inputs the “true” function $f$ given by the benchmark, our approximation $F(x)$, the infimum $a$ and supremum $b$ of the interval $[a, b]$ to operate over, and the number $n$ of values $x$ to evaluate in the interval. It will return a vector of the relative errors at each value of $x$ we evaluated. The output of this function will also be used to graphically assess relative errors using matplotlib.
We also need a few convenience functions, starting with linspace
. The linspace
function will take as input the bounds $a, b$ of an interval and a positive integer $n$, then return a vector consisting of $n$ linearly-spaced floating point values in the interval. NumPy provides this function out of the box, but we’ll write our own for the C++ tests. We’ll also write a few C++ functions for calculating summary statistics which are provided by NumPy.
```python
import numpy as np
import matplotlib.pyplot as plt


def rel_error(f, F, a, b, n, plot=False, save=False, file=None):
    """
    Benchmark function for assessing relative error of two functions.
    The ideal function is passed first, the function under testing is
    passed second.
    :param f: The benchmark function to be compared against.
    :param F: The implementation of the function we want to test.
    :param a: The left boundary of the interval to consider.
    :param b: The right boundary of the interval to consider.
    :param n: The number of values in the interval to look at.
    :param plot: Boolean, whether or not to plot the errors in matplotlib.
    :param save: Boolean, whether to save plot to disk or show in stdout.
    :param file: String name of a file location to save plot. Must not be
        None if save is True.
    :returns: The array of relative errors |f(x) - F(x)| / |f(x)| at each
        of the n linearly spaced x values in the interval [a, b].
    """
    x = np.linspace(a, b, n)
    trials = np.zeros(x.shape)
    for i, xi in enumerate(x):
        trials[i] = F(xi)
    control = f(x)
    errors = np.abs(trials - control) / np.abs(control)
    if plot:
        fig = plt.figure()
        axes = fig.add_subplot(1, 1, 1)
        axes.plot(x, errors, color='b')
        axes.set_xlabel(r'$x$', fontsize=12)
        axes.set_ylabel(r'$\eta_{x}$', fontsize=12)
        axes.set_title(r'Relative error of $e^x$ approximation in double precision',
                       fontsize=12)
        axes.grid()
        if save:
            assert(file is not None)
            plt.savefig(file + '.svg', format='svg', transparent=True)
        else:
            plt.show()
    return errors
```
```cpp
#include <vector>
#include <cassert>
#include <cmath>
#include <numeric>
#include <algorithm>

std::vector<double> linspace(double a, double b, int n) {
    /*
     * Function which returns a std::vector containing n linearly-spaced
     * points in the interval [a, b], inclusive.
     * Args:
     *  - a: Lower bound of an interval in double precision
     *  - b: Upper bound of an interval in double precision
     *  - n: Positive integer number of linearly-spaced values to return.
     * Returns:
     *  - A vector of n double precision values within [a, b]
     */
    assert(a < b && n > 0);
    std::vector<double> points;
    int iter = 0;
    double d = b - a;
    for (int i = 0; i <= n - 2; i++) {
        points.insert(points.begin() + iter, a + (i * d) / (n - 1));
        iter++;
    }
    // We also need to insert the upper bound at the end, or the result is incomplete.
    points.insert(points.begin() + iter, b);
    return points;
}

template <typename Function>
std::vector<double> rel_error(Function f, Function F, double a, double b, int n) {
    /*
     * Benchmark function for assessing relative error of two functions over n values in
     * an interval [a, b].
     * Args:
     *  - f: Benchmark function to be compared against.
     *  - F: Approximation of the true function.
     *  - a: Left boundary of the interval. Must be less than b.
     *  - b: Right boundary of the interval. Must be greater than a.
     *  - n: The number of values in the interval to consider.
     * Returns:
     *  - A vector matching the input precision consisting of the relative errors of
     *    f and F at every x evaluated from the interval [a, b]
     */
    std::vector<double> x = linspace(a, b, n);
    std::vector<double> control;
    std::vector<double> test;
    for (auto i : x) {
        control.push_back(f(i));
        test.push_back(F(i));
    }
    std::vector<double> errors;
    errors.reserve(control.size());
    for (std::size_t i = 0; i < control.size(); i++) {
        errors.push_back(std::abs(control[i] - test[i]) / std::abs(control[i]));
    }
    return errors;
}

double mean(std::vector<double> nums) {
    /*
     * Basic function which returns the mean of double precision
     * values contained in the input vector. Returns the mean as
     * a double precision number.
     */
    return std::accumulate(nums.begin(), nums.end(), 0.0) / nums.size();
}

double var(std::vector<double> nums) {
    /*
     * Basic function which returns the variance of double
     * precision values contained in the input vector. Returns
     * the variance as a double precision number.
     */
    std::vector<double> square_vec;
    square_vec.reserve(nums.size());
    for (auto i : nums) {
        square_vec.push_back(i * i);
    }
    return mean(square_vec) - pow(mean(nums), 2);
}

double median(std::vector<double> nums) {
    /*
     * Basic function which returns the median of double
     * precision values contained in the input vector. Returns
     * the median as a double precision number.
     */
    std::sort(nums.begin(), nums.end());
    if (nums.size() % 2 == 0) {
        // Average the two middle values when the size is even.
        return (nums[nums.size() / 2 - 1] + nums[nums.size() / 2]) / 2;
    } else {
        return nums[nums.size() / 2];
    }
}
```
For reporting error statistics, we’ll use the following main
harnesses.
```python
if __name__ == '__main__':
    a = -709
    b = 709
    n = 10000
    errors = rel_error(np.exp, taylor_exp, a, b, n)
    print("Max relative error = {}".format(errors.max()))
    print("Min relative error = {}".format(errors.min()))
    print("Avg relative error = {}".format(np.average(errors)))
    print("Med relative error = {}".format(np.median(errors)))
    print("Var relative error = {}".format(np.var(errors)))
    s = 0
    for error in errors:
        if error > 5e-15:
            s += 1
    print("{} percent of the values have less than 15 digits of precision".format(
        s / len(errors) * 100
    ))
```
```cpp
#include <iostream>
#include <algorithm>

int main() {
    double a = -709.0;
    double b = 709.0;
    int n = 10000;
    std::vector<double> errors = rel_error(exp, taylor_exp, a, b, n);
    std::cout << "Max relative error: "
              << *std::max_element(errors.begin(), errors.end()) << std::endl;
    std::cout << "Min relative error: "
              << *std::min_element(errors.begin(), errors.end()) << std::endl;
    std::cout << "Avg relative error: " << mean(errors) << std::endl;
    std::cout << "Med relative error: " << median(errors) << std::endl;
    std::cout << "Var relative error: " << var(errors) << std::endl;
    double s = 0;
    for (auto i : errors) {
        if (i > 5e-15) {
            s += 1;
        }
    }
    std::cout << s / errors.size() * 100
              << " percent of the values have less than 15 digits of precision."
              << std::endl;
    return 0;
}
```
This will give us the maximum, minimum, mean and median relative error over the interval $[-709, 709]$. It will also tell us the variance of the relative error and the percentage of the sample which has less precision than a given threshold.
Taylor series approximation
In the Background section we showed how $e^x$ is defined using its Taylor series representation. Since this is the “canonical” definition of $e^x$, it’s also the most straightforward method for implementing the function. We can use the Taylor series representation to calculate $e^x$ to arbitrary precision by evaluating finitely many terms of the Taylor series. By definition, we have
$$ e^x = \sum_{k = 0}^{\infty} \frac{x^k}{k!} = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots $$
There are two problems we need to address in order to use this series in an implementation. The first is that, since we’re implementing the exp(x) function, it’s likely that we don’t have a general pow(a, b) function we can use to evaluate arbitrary exponents of arbitrary bases.
4: In fact, we might be implementing exp(x) precisely because we want that function and don’t presently have it. While it’s not the most optimal method for doing so, we can naively define pow(a, b) with exp(x) using the identity $a^b = e^{b \log{a}}$.
The second problem is that we have factorials in the denominators. Dividing by a fully evaluated factorial is both expensive and highly prone to error. For example, even at relatively small term indices, completely evaluating the factorial in the denominator will overflow (already at $171!$ in double precision) before we have a chance to perform the division. This is numerically unstable, so we need to find an alternative way to represent the Taylor series.
Note that for any polynomial $a_0 + a_1 x + a_2 x^2 + \ldots$, we can iteratively factor out the variable $x$ to obtain the equivalent nested form $a_0 + x(a_1 + x(a_2 + \ldots))$. Furthermore, since $k! = k \cdot (k - 1)!$, each factorial denominator contains the previous one as a factor: we can factor $3!$ out of $4!$, $2!$ out of $3!$, and so on. Factoring out terms in order to nest multiplications in this way is known as Horner’s method of polynomial evaluation. When we apply Horner’s method to the Taylor series representation of $e^x$, we obtain the new form
$$ e^x = 1 + x\left(1 + \frac{x}{2}\left(1 + \frac{x}{3}\left(1 + \ldots\right)\right)\right) $$
This is much more efficient and stable for implementation, and we’ll use this representation for the truncated series instead.
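In code, the nested form becomes a single backwards loop. This sketch (with an arbitrarily chosen truncation point n, not yet the heuristic derived below) shows the core idea behind the implementations that follow:

```python
import math

def horner_exp(x, n=30):
    # Evaluate 1 + x(1 + x/2(1 + x/3(1 + ...))) truncated after n nested terms.
    result = 1.0
    for k in range(n, 0, -1):
        result = result * (x / k) + 1.0
    return result

print(horner_exp(1.0))                 # approximately 2.718281828459045
print(abs(horner_exp(1.0) - math.e))   # error near the double precision floor
```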
Now we’ll do a bit of error analysis on the Taylor series. We can prove an upper bound on the relative error of the Taylor series truncated at the $n^{\text{th}}$ term of the series. Let $T_{n}(x)$ denote the Taylor series of $e^x$ truncated at the $n^{\text{th}}$ term, let $\epsilon_{n}$ denote the absolute error of $T_{n}(x)$ versus $e^{x}$ and let $\eta_n$ denote the relative error of $T_{n}(x)$ versus $e^x$. By the definition of absolute error, we have
$$ e^x = T_{n}(x) + \epsilon_n(x) $$
Since $e^{x}$ is an infinitely differentiable function, it follows from Taylor’s theorem that the absolute error can be represented using the Lagrange form
$$ \epsilon_n(x) = \frac{f^{(n + 1)}(c)}{(n + 1)!}(x - a)^{n + 1} $$
where $f^{(n + 1)}(c)$ denotes the $(n + 1)^{\text{th}}$ derivative of $f(x) = e^x$ evaluated at some real value $c$ satisfying $a \leq c \leq x$. Since $e^x$ is its own derivative and our center of expansion is $a = 0$, the foregoing identity simplifies to
$$ \epsilon_n(x) = e^{c}\frac{x^{n + 1}}{(n + 1)!} $$
For all $a, b$ such that $b > a$, it follows that $e^b > e^a$. This is to say that $e^x$ is an increasing function , and its maximum value on the interval $[a, x]$ is attained at $x$. Therefore we can conclude that, since $c$ is at most $x$,
$$ \epsilon_n(x) \leq e^{x}\frac{x^{n + 1}}{(n + 1)!} $$
It follows from the definition of relative error that
$$ \eta_n = \frac{\epsilon_n}{e^x} \leq \frac{x^{n + 1}}{(n + 1)!} $$
which gives us an upper bound on the relative error of the Taylor series of $e^x$ truncated at the $n^{\text{th}}$ term. Furthermore we can see that the absolute error of $T_{n + 1}(x)$ and $T_{n}(x)$ is simply the $(n + 1)^{\text{th}}$ term of the Taylor series, because subtracting $T_{n}(x)$ from $T_{n + 1}(x)$ cancels all terms except for the $(n + 1)^{\text{th}}$ term.
$$ \begin{aligned} |T_{n + 1}(x) - T_{n}(x)| &= \left|\left(1 + x + \ldots + \frac{x^n}{n!} + \frac{x^{n + 1}}{(n + 1)!}\right) - \left(1 + x + \ldots + \frac{x^n}{n!}\right)\right| \\ &= \left|\frac{x^{n + 1}}{(n + 1)!}\right| \end{aligned} $$
Thus the relative error of $T_{n}(x)$ is bounded above by the $(n + 1)^{\text{th}}$ term of the Taylor series. This is not the optimal upper bound, but it’s a useful heuristic for proceeding.

5: You cannot in general find the exact value of the absolute error $\epsilon_n$ after truncating the Taylor series of some function $f$ at the $n^{\text{th}}$ term. If you could, you wouldn’t need an approximation since $f(x) = T_{n}(x) + \epsilon_{n}$. Typically the goal of error analysis is to identify a suitable bound on the error and work from there to ascertain a useful precision goal. We could certainly make the bound we’ve proven tighter using more sophisticated analysis.
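As a quick numerical check of this bound (a sketch; taylor_exp_partial is a hypothetical helper written only for this check, and the choices of x and n are arbitrary):

```python
import math

def taylor_exp_partial(x, n):
    # Taylor polynomial of e^x centered at 0, truncated after the x^n / n! term.
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

x, n = 2.0, 10
actual = abs(math.exp(x) - taylor_exp_partial(x, n)) / math.exp(x)  # relative error
bound = x ** (n + 1) / math.factorial(n + 1)                        # (n+1)th Taylor term
print(actual, bound, actual <= bound)  # the bound holds, with room to spare
```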
We can apply the relative error upper bound we’ve just proven to determine a lower bound on the number of terms we’ll need to achieve that relative error. Suppose we’d like to achieve an approximation with a relative error of at most $\eta$. Then we’re interested in determining the number $n$ of Taylor terms which will be required to achieve a relative error less than or equal to $\eta$. By the foregoing error analysis we know that the relative error of the Taylor series truncated at the $n^{\text{th}}$ term is bounded above by the value of the $(n + 1)^{\text{th}}$ Taylor term. We can start by finding an $n$ which satisfies
$$ \left|\frac{x^n}{n!}\right| \leq \eta $$
By applying logarithms to both sides of the inequality we obtain the equivalent inequality
$$ \log\left(\left|\frac{x^n}{n!}\right|\right) \leq \log(\eta) $$
To simplify working with the factorial, we apply Stirling’s approximation . By the algebra of logarithms, we can then reduce this inequality to
$$ n\log(x) - n\log(n) + n \leq \log(\eta) $$
and then to
$$ \log\left(\frac{x}{n}\right)^{n} + n \leq \log(\eta) $$
We can simplify this further by introducing the multiplicative identity using $\log(e) = 1$:
$$ \log\left(\frac{x}{n}\right)^n + n\log(e) \leq \log(\eta) $$
We apply log rules again to reduce the second term on the left:
$$ \log\left(\frac{x}{n}\right)^n + \log(e^n) \leq \log(\eta) $$
Then we can collapse the entire left side to a single term:
$$ \log\left(\frac{x e}{n}\right)^n \leq \log(\eta) $$
which is equivalent to
$$ \log\left(\frac{x e}{n}\right) \leq \frac{1}{n}\log(\eta) $$
and finally
$$ \frac{xe}{n} \leq \eta^{1/n} $$
Therefore in order to approximate $e^x$ with a relative error less than or equal to $\eta$, we must calculate at least $n$ Taylor terms, where $n$ satisfies
$$ \frac{xe}{\eta^{1/n}} \leq n $$
This puts a rough floor on the number of terms we’ll need. We’ll often need significantly more. The limit of $\frac{1}{n}$ as $n$ approaches $\infty$ is 0. It follows that, for any real $\eta > 0$, the limit of $\eta^{1/n}$ as $n$ approaches $\infty$ is $\eta^0$ = 1. Thus as $n$ increases, the floor on the number of terms we’ll need becomes asymptotically closer to
$$ xe \leq n $$
With this preliminary analysis in hand, we can move on to implementation.
Our precision analysis uses $xe$ Taylor terms as a rough lower bound. We seek an approximation of $e^x$ which is indistinguishable from the true value in double precision (or very nearly so). This is to say that we’d like to target a relative error of machine epsilon, $\eta = \epsilon_{M}$. We can’t achieve $\epsilon_{M}$ across the entire interval due to the slow convergence of Taylor series and the sparsity of float values far from the origin. However we can get within one digit of precision almost everywhere. To do so, we’ll need more than $xe$ Taylor terms.
Heuristically, multiplying $xe$ by 12 gives a sweet spot for the number of terms to use in double precision. Using a multiplicative constant less than 12 quickly reduces the amount of accuracy you can obtain as you approach the bounds of $[-709, 709]$. On the other hand, the Taylor series of $e^x$ converges so slowly that a constant greater than 12 provides essentially no meaningful utility. In fact if we use a constant greater than 12, the additional terms of the Taylor series are so small that they vanish under the double precision floor, so at best they provide no change in the calculation while making our algorithm less efficient, and at worst they contribute to a gradual loss of significance. Therefore we use a truncation point of 12$\lceil |x|e \rceil$, where $\lceil |x|e \rceil$ is the ceiling of $|x|e$ (because we can only have a positive integer number of Taylor terms). We use $|x|$ rather than $x$ because $x$ may be negative; by the symmetry $e^{-x}$ = $\frac{1}{e^{x}}$, when $x$ is negative we evaluate $e^{|x|}$ and return its reciprocal.
```python
import numpy as np
import math


def taylor_exp(x):
    """
    Evaluates f(x) = e^x for any x in the interval [-709, 709].
    If x < -709 or x > 709, raises an assertion error because of
    underflow/overflow limits. Implemented using the Taylor series
    approximation of e^x truncated at ceil(|x| * e) * 12 terms, which
    achieves at least 14 and at most 16 digits of precision over the
    entire interval.
    Performance - There are exactly 36 * ceil(|x| * e) + 5 operations;
    69,413 in the worst case (x = 709 or -709):
        - (12 * ceil(|x| * e)) + 2 multiplications
        - (12 * ceil(|x| * e)) + 1 divisions
        - (12 * ceil(|x| * e)) additions
        - 1 rounding
        - 1 absolute value
    Accuracy - over a sample of 10,000 equi-spaced points in
    [-709, 709] we have the following error statistics:
        - Max relative error = 8.39803008291388e-15
        - Min relative error = 0.0
        - Avg relative error = 1.2504273150972772e-15
        - Med relative error = 9.45400364584735e-16
        - Var relative error = 1.2390105184708059e-30
        - 0.91 percent of the values have less than 15 digits of precision
    :param x: (int) or (float) power of e to evaluate
    :return: (float) approximation of e^x
    """
    # Check that x is a valid input.
    assert(-709 <= x <= 709)
    # When x = 0 we know e^x = 1, so we exit early.
    if x == 0:
        return 1
    # Normalize x to a non-negative value to take advantage of
    # reciprocal symmetry. But keep track of the original sign
    # in case we need to return the reciprocal later.
    x0 = np.abs(x)
    # First term of the Taylor expansion of e^x is 1.
    # Tn is the variable we will return for e^x, and its
    # value at any term is the sum of all currently evaluated
    # terms in the Taylor series.
    Tn = 1
    # Choose a truncation point for the Taylor series,
    # then work down from there evaluating the polynomial
    # using Horner's method.
    n = math.ceil(x0 * np.e) * 12
    for k in range(n, 0, -1):
        Tn = Tn * (x0 / k) + 1
    # If the original input is less than 0, we want the
    # reciprocal because e^-x = 1 / e^x
    if x < 0:
        # 1 division if original sign is negative
        Tn = 1 / Tn
    return Tn
```
```cpp
#include <cmath>
#include <cassert>

double taylor_exp(double x) {
    /*
     * Evaluates f(x) = e^x for any x in the interval [-709, 709].
     * If x < -709 or x > 709, raises an assertion error. Implemented
     * using the truncated Taylor series of e^x with ceil(|x| * e) * 12
     * terms. Achieves at least 14 and at most 16 digits of precision
     * over the entire interval.
     * Performance - There are exactly 36 * ceil(|x| * e) + 5
     * operations; 69,413 in the worst case (x = 709 or -709):
     *  - (12 * ceil(|x| * e)) + 2 multiplications
     *  - (12 * ceil(|x| * e)) + 1 divisions
     *  - (12 * ceil(|x| * e)) additions
     *  - 1 rounding
     *  - 1 absolute value
     * Accuracy - Over a sample of 10,000 linearly spaced points in
     * [-709, 709] we have the following error statistics:
     *  - Max relative error = 8.39803e-15
     *  - Min relative error = 0.0
     *  - Avg relative error = 0.0
     *  - Med relative error = 1.90746e-15
     *  - Var relative error = 0.0
     *  - 0.88 percent of the values have less than 15 digits of precision
     * Args:
     *  - x: (double float) power of e to evaluate
     * Returns:
     *  - (double float) approximation of e^x in double precision
     */
    // Check that x is a valid input.
    assert(-709 <= x && x <= 709);
    // When x = 0 we already know e^x = 1.
    if (x == 0) {
        return 1.0;
    }
    // Normalize x to a non-negative value to take advantage of
    // reciprocal symmetry. But keep track of the original sign
    // in case we need to return the reciprocal of e^x later.
    double x0 = std::abs(x);
    // First term of Taylor expansion of e^x at a = 0 is 1.
    // tn is the variable we will return for e^x, and its
    // value at any time is the sum of all currently evaluated
    // Taylor terms thus far.
    double tn = 1.0;
    // Choose a truncation point for the Taylor series using the
    // heuristic bound 12 * ceil(|x| e), then work down from there
    // using Horner's method.
    int n = (int)ceil(x0 * M_E) * 12;
    for (int i = n; i > 0; i--) {
        tn = tn * (x0 / i) + 1.0;
    }
    // If the original input x is less than 0, we want the reciprocal
    // of the e^x we calculated.
    if (x < 0) {
        tn = 1 / tn;
    }
    return tn;
}
```
Taylor series approximations provide arbitrary precision at the cost of inefficiency because they converge quite slowly. These functions therefore calculate comparatively many terms for their approximations, but even though the intermediate calculations use the same (double) precision as the output, we’re able to obtain a minimum precision of nearly 15 digits. Likewise less than 1% of all values in our sample exhibit less than 15 digits of precision. To get a better idea of the error distribution over the entire interval, let’s plot the relative error of this implementation versus the benchmark exp(x)
function accessed through NumPy at each value of $x$ in our sample. If we restrict our plot to the sample of $x$ values in the interval $[-1, 1]$, we find that it performs extremely well.
However, consider what happens when we plot the relative error across the entire interval $[-709, 709]$.
This plot illustrates a typical “wing” pattern evident in the variance of Taylor approximation relative error. The variance of the relative error is small near 0 and large towards the boundaries of $[-709, 709]$. Moreover the size of the relative error at $x$ is proportionate to the distance of $x$ from 0. This is because Taylor series are comparatively accurate near the center $a$ of approximation (in this case, the origin). As $x$ moves away from the origin, the Taylor polynomial requires more terms to achieve the same level of accuracy. At the same time, the inherent sparsity of floating point values farther away from the origin works against the approximation, which compounds the overall loss of precision. Thus the approximation becomes increasingly less efficient and less precise as the absolute value of $x$ increases. The next plot illustrates the number of operations used to approximate $e^x$.
This algorithm exhibits linear time complexity - for each unit increase in the absolute value of $x$, the number of operations grows by approximately $36e \approx 98$. We also see that the amount of work the algorithm performs is commensurate with the variance of its relative error, meaning that this implementation is both less efficient and less accurate away from the center of the interval.
We can somewhat mitigate the aforementioned limitations by combining the core methodology of Taylor series approximation with a technique called range reduction . In so doing, we reduce the area of Taylor approximation to a small neighborhood which is bounded around the origin regardless of the value of $x$. This will trade off some precision in return for significant performance improvements. The basic idea is to decompose $e^x$ by factorizing it as a product of a very small power of $e$ and an integer power of 2, since both of these components are highly efficient to compute. We use the identity
$$ e^x = e^r \cdot 2^k $$
6: Incidentally, this is the way the exp(x)
function is implemented in the MPFR
arbitrary precision library.
Here $|r| \leq \frac{1}{2}\log 2$ and $k$ is a non-negative integer. The implementation:

- solves for a $k$ to satisfy these constraints,
- computes $r$ using the value of $k$,
- evaluates a truncated Taylor polynomial of $e^r$, and
- left shifts the result by $k$ to obtain $e^x$.
While we often don’t have a general pow(a, b)
function when implementing exp(x)
, it’s very likely we can efficiently calculate powers of 2 using either exp2(x)
or left shifts. In particular, note that $e^x = e^r \cdot 2^k$ implies $x = r + k \log 2$, by taking the natural logarithm of both sides of the equality. Then by isolating $k$ we obtain the identity
$$ k = \left\lceil \frac{x}{\log 2} - \frac{1}{2} \right\rceil $$
which gives us a formula for determining $k$. Likewise we can compute $r$ when $k$ is known by isolating $r$ in $x = r + k \log 2$
$$ r = x - k \log 2 $$
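A quick numerical check of these identities (a sketch; x is chosen arbitrarily and math.isclose is used only to confirm the reassembled product matches the library value):

```python
import math

x = 100.0
ln2 = math.log(2.0)
k = math.ceil(x / ln2 - 0.5)   # integer power of 2
r = x - k * ln2                # reduced argument

print(abs(r) <= 0.5 * ln2)                                # True
print(math.isclose(math.exp(r) * 2.0 ** k, math.exp(x)))  # True
```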
We only need 14 Taylor terms to evaluate $e^r$ to 16 digits of precision for any $|r| \leq \frac{1}{2}\log 2$, because $r$ is very close to the center $a = 0$. This is a very large improvement over the $12 \lceil |x| e\rceil$ terms required in the previous implementation. That being said, we cannot expect the same precision characteristics as we obtained with the last algorithm. The range reduction step achieves a significant increase in efficiency by trading off a modest (single digit) decrease in precision attributable to the left shift step. To see this, let $\epsilon$ be the absolute error of $T_{n}$ and $e^r$. Then by the foregoing identity, we have
$$ e^x = 2^k \cdot (e^r - \epsilon) $$
which shows that whatever error we have in the initial approximation of $e^r$ will be compounded when we multiply $e^r$ by $2^k$. This motivates us to choose the smallest $k$ which satisfies the constraints. Our implementations are as follows. Given that we can expect higher relative error, we will assess how many values achieve less than 14 digits of precision rather than 15.
```python
import numpy as np
import math


def reduced_taylor_exp(x):
    """
    Evaluates f(x) = e^x for any x in the interval [-709, 709].
    If x < -709 or x > 709, raises an assertion error because of
    underflow/overflow limits. Performs a range reduction step
    by applying the identity e^x = e^r * 2^k for |r| <= 1/2 log(2)
    and non-negative integer k. e^r is approximated using a Taylor
    series truncated at 14 terms, which is enough because |r| is at
    most 1/2 log(2). The result is left shifted by k to obtain
    e^r * 2^k = e^x.
    Performance: In the worst case we have 51 operations:
        - 16 multiplications
        - 16 divisions
        - 14 additions
        - 2 subtractions
        - 1 rounding
        - 1 left shift
        - 1 absolute value
    Accuracy: Over a sample of 10,000 equi-spaced points in
    [-709, 709] we have the following statistics:
        - Max relative error = 7.98411243625574e-14
        - Min relative error = 0.0
        - Avg relative error = 1.7165594254816366e-14
        - Med relative error = 1.140642685235478e-14
        - Var relative error = 2.964698541882666e-28
        - 6.29 percent of the values have less than 14 digits of precision
    :param x: (int) or (float) power of e to evaluate
    :return: (float) approximation of e^x
    """
    # If x is not a valid value, exit early
    assert(-709 <= x <= 709)
    # If x = 0, we know e^x = 1 so exit early
    if x == 0:
        return 1
    # Normalize x to a positive value since we can use reciprocal
    # symmetry to only work on positive values. Keep track of
    # the original sign for return value
    x0 = np.abs(x)
    # Hard code the value of natural log(2) in double precision
    l = 0.6931471805599453
    # Solve for an integer k satisfying x = k log(2) + r
    k = math.ceil((x0 / l) - 0.5)
    # p = 2^k
    p = 1 << k
    # r is the reduced argument satisfying |r| <= 0.5 log(2)
    r = x0 - (k * l)
    # Setup the Taylor series to evaluate e^r after
    # range reduction x -> r. The Taylor series only
    # needs to approximate on the small interval |r| <= 0.5 log(2)
    Tn = 1
    # We need at most 14 terms to achieve 16 digits
    # of precision for e^r on this reduced interval
    for i in range(14, 0, -1):
        Tn = Tn * (r / i) + 1
    # e^x = e^r * 2^k, so multiply p by Tn. This loses us
    # about two digits of precision because we compound the
    # relative error by 2^k.
    p *= Tn
    # If the original sign is negative, return reciprocal
    if x < 0:
        p = 1 / p
    return p
```
```cpp
#include <cmath>
#include <cassert>

double reduced_taylor_exp(double x) {
    /*
     * Evaluates f(x) = e^x for any x in the interval [-709, 709].
     * If x < -709 or x > 709, raises an assertion error. Performs a
     * range reduction step by applying the identity e^x = e^r * 2^k
     * for |r| <= 1/2 log(2) and non-negative integer k. e^r is
     * evaluated using a Taylor series truncated at 14 terms, which is
     * sufficient to achieve 16 digits of precision for r so close to 0.
     * The result is multiplied by 2^k to obtain e^r * 2^k = e^x.
     * Performance: In the worst case we have 51 operations:
     *  - 16 multiplications
     *  - 16 divisions
     *  - 14 additions
     *  - 2 subtractions
     *  - 1 rounding
     *  - 1 left shift
     *  - 1 absolute value
     * Accuracy: Over a sample of 10,000 linearly spaced points in
     * [-709, 709], we have the following error statistics:
     *  - Max relative error = 7.99528e-14
     *  - Min relative error = 0
     *  - Avg relative error = 0
     *  - Med relative error = 2.27878e-14
     *  - Var relative error = 0
     *  - 6.4 percent of the values have less than 14 digits of precision.
     * Args:
     *  - x: (double float) power of e to evaluate
     * Returns:
     *  - (double float) approximation of e^x
     */
    // Make sure x is a valid input
    assert(-709 <= x && x <= 709);
    // When x is 0, we know e^x is 1
    if (x == 0) {
        return 1;
    }
    // Normalize x to a positive value to take advantage of
    // reciprocal symmetry, but keep track of the original value
    double x0 = std::abs(x);
    // Solve for an integer k satisfying x = k log(2) + r, with |r| <= 1/2 log(2)
    int k = ceil((x0 / M_LN2) - 0.5);
    // Determine r using the k we computed
    double r = x0 - (k * M_LN2);
    // Setup the Taylor series to evaluate e^r after range reduction
    // x -> r. This only needs to approximate over the small interval |r| <= 1/2 log(2)
    double tk = 1.0;
    double tn = 1.0;
    // We only need 14 terms to achieve 16 digits of precision for e^r
    for (int i = 1; i < 14; i++) {
        tk *= r / i;
        tn += tk;
    }
    // Scale the result by 2^k to recover e^x = e^r * 2^k
    tn = tn * exp2(k);
    if (x < 0) {
        return 1 / tn;
    }
    return tn;
}
```
This implementation is much faster. Our worst-case operation count is reduced by 3 orders of magnitude, from about 70,000 operations in the first algorithm to just 51 in this algorithm. And as expected we have lost one digit of precision on average. If we can tolerate the single digit loss of precision, this is an excellent performance improvement. We still achieve 13 digits of precision in the worst case, and 14 digits of precision on average. Only about 6% of values exhibit less than 14 digits of precision. Here is the plot of the error distribution across the interval $[-1, 1]$.
Likewise here is the plot of the error distribution across the entire interval.
We can see that the relative error is highly symmetric and much more regular than the previous plot. The variance of the relative error still clearly increases further away from the origin. However, there are six distinctive and symmetric subintervals on either side of the origin, with jumps in variance between subintervals closer to the origin and subintervals closer to the outer boundaries. This illustrates another benefit of range reduction, which is that the relative error is much more predictable and easier to reason about. We will also apply range reduction for the other implementation methodologies.
Lagrange interpolation
Now we will move from Taylor series approximation to the more general realm of polynomial interpolation. By the unisolvence theorem, for any set of $n$ + 1 values $y_1, y_2, \ldots, y_{n + 1}$ located at distinct points $x_1, x_2, \ldots, x_{n + 1}$, there exists a unique polynomial of degree at most $n$, denoted by $P_n$, which passes exactly through each of the $n$ + 1 points. This polynomial is called an interpolant. Likewise functions can be interpolated: $P_n$ is the unique interpolant of degree at most $n$ of a function $f$ at points $x_1, x_2, \ldots, x_{n + 1}$ if $P_n(x_i) = f(x_i)$ for all $x_i$ with $1 \leq i \leq n + 1$.
The set of all polynomials with degree less than or equal to $n$, denoted by $\mathbb{F}[x]$, comprises a vector space of dimension $n$ + 1. This is to say that polynomials with degree at most $n$ are equivalently represented as $(n + 1)$-dimensional coordinate vectors and form a linear structure under the operations of vector addition and scalar multiplication. It follows that every polynomial in the space $\mathbb{F}[x]$ is a linear combination of a basis of $\mathbb{F}[x]$. The standard basis of $\mathbb{F}[x]$ is the set of monomials
$$ \{1, x, x^{2}, \ldots, x^{n}\} $$
This is also called the monomial basis . When we use the monomial basis, every vector of $\mathbb{F}[x]$ is a polynomial with the representation
$$ P_n(x) = \sum_{k = 0}^{n}p_{k}x^{k} = p_0 + p_{1}x + p_{2}x^2 + \ldots + p_{n}x^n $$
where $p_0, p_1, \ldots, p_n$ comprise the coordinate vector of $P_n$. This is to say that $p_0, p_1, \ldots, p_n$ are the unique scalar coefficients which define $P_n$ as a linear combination of the monomial basis. Given that the polynomials comprise a vector space, the problem of determining the unique interpolant $P_{n}$ for a function $f$ reduces to solving a system of linear equations. With $n$ + 1 distinct points $x_0, x_1, \ldots, x_n$ and their corresponding values $f(x_0), f(x_1), \ldots, f(x_n)$, we have the system of $n$ + 1 equations in $n$ + 1 unknowns
$$ \begin{aligned} P_{n}(x_0) &= p_0 + p_{1}x_0 + p_{2}x_{0}^{2} + \ldots + p_{n}x_{0}^{n} = f(x_0) \\ P_{n}(x_1) &= p_0 + p_{1}x_1 + p_{2}x_{1}^{2} + \ldots + p_{n}x_{1}^{n} = f(x_1) \\ P_{n}(x_2) &= p_0 + p_{1}x_2 + p_{2}x_{2}^{2} + \ldots + p_{n}x_{2}^{n} = f(x_2) \\ &\;\;\vdots \\ P_{n}(x_n) &= p_0 + p_{1}x_n + p_{2}x_{n}^{2} + \ldots + p_{n}x_{n}^{n} = f(x_n) \end{aligned} $$
Solving this system gives us the coordinate vector defining the unique interpolating polynomial $P_n$ of $f$ with respect to the monomial basis. To solve this system, we can represent the points $x_0, x_1, \ldots, x_n$ and their powers using the Vandermonde matrix
$$ \begin{bmatrix} 1 & x_0 & x_0^2 & \ldots & x_0^n \\ 1 & x_1 & x_1^2 & \ldots & x_1^n \\ 1 & x_2 & x_2^2 & \ldots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \ldots & x_n^n \end{bmatrix} $$
which allows us to represent the entire system of equations using the matrix equation
$$ \begin{bmatrix} 1 & x_0 & x_0^2 & \ldots & x_0^n \\ 1 & x_1 & x_1^2 & \ldots & x_1^n \\ 1 & x_2 & x_2^2 & \ldots & x_2^n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \ldots & x_n^n \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix} = \begin{bmatrix} f(x_0) \\ f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} $$
Solving this matrix equation is equivalent to inverting the Vandermonde matrix, because multiplying the inverse of the Vandermonde matrix by the vector of $f(x_i)$ values yields the coordinate vector of $p_i$ scalars.
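For instance, a monomial-basis interpolant of $e^x$ on a handful of points can be found by solving the Vandermonde system directly with NumPy (a sketch; the sample points and test point are chosen arbitrarily):

```python
import numpy as np

# Interpolate e^x at 5 points in [-1, 1] using the monomial basis.
xi = np.linspace(-1.0, 1.0, 5)
V = np.vander(xi, increasing=True)   # rows are [1, x_i, x_i^2, x_i^3, x_i^4]
p = np.linalg.solve(V, np.exp(xi))   # coordinate vector p_0, ..., p_4

# Evaluate the interpolant P_4 at a test point and compare against e^x.
x = 0.37
P4 = np.polyval(p[::-1], x)          # polyval expects highest-degree coefficient first
print(abs(P4 - np.exp(x)))           # small interpolation error
```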
The choice of basis has a material impact on how efficiently we can find the interpolating polynomial. If we want to avoid doing matrix multiplication and inversion to find the interpolating polynomial, we can instead use the Lagrange basis of $\mathbb{F}[x]$. Corresponding to the points $x_0, x_1, \ldots, x_n$, each vector in the Lagrange basis is a polynomial of the form
$$ \ell_{i}(x) = \frac{x - x_0}{x_i - x_0} \cdot \ldots \cdot \frac{x - x_{i - 1}}{x_i - x_{i - 1}} \cdot \frac{x - x_{i + 1}}{x_i - x_{i + 1}} \cdot \ldots \cdot \frac{x - x_n}{x_i - x_n} $$
or, more succinctly,
$$ \ell_{i}(x) = \prod_{j = 0, j \neq i}^{n} \frac{x - x_j}{x_i - x_j} $$
The Lagrange basis comes with a very strong advantage: the value of each basis polynomial $\ell_{i}$ at a point $x_j$ is equal to the Kronecker delta
$$ \delta_{i, j} = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases} $$
It follows that the $(n + 1) \times (n + 1)$ matrix involved in the matrix equation will be the identity matrix instead of the Vandermonde matrix. This is easier to compute because we don’t need to perform an inversion step - the identity matrix is its own inverse. This also simplifies the representation of the interpolant, because the Lagrange form is a linear combination of the Lagrange basis polynomials $\ell_i(x)$ and the corresponding values $f(x_i)$.
$$ P_n(x) = \sum_{i = 0}^{n}f(x_i)\ell_i(x) = f(x_0)\ell_0(x) + f(x_1)\ell_1(x) + \ldots + f(x_n)\ell_n(x) $$
Let’s take a look at a range reduced implementation.
```python
import numpy as np
import math


class LagrangeInterpolation:
    """
    Calculates the Lagrange interpolant of a function f parameterized
    by n distinct x values and corresponding f(x) values. Stores the
    point data, then approximates the value of f at x by calculating
    the Lagrange basis polynomials with respect to x and taking their
    linear combination with respect to f(x_0), ..., f(x_n).
    """
    def __init__(self, xi, yi):
        self.xi = xi
        self.yi = yi

    def interpolate(self, x):
        """
        :param x: (double float) point at which to approximate f
        :return: (double float) approximation of f(x)
        """
        # Initialize the basis as a vector of 1s with the same length as the n points
        basis = np.ones(self.xi.shape)
        # Double for loop is O(n^2), where n is the number of distinct x points
        for i in range(len(self.xi)):
            for j in range(len(self.xi)):
                # If the i index is equal to the j index, skip this.
                # We're running through this loop to calculate
                # l_i(x) = (x - x_0) / (x_i - x_0) * ... * (x - x_n) / (x_i - x_n)
                if i == j:
                    continue
                basis[i] *= (x - self.xi[j]) / (self.xi[i] - self.xi[j])
        # Return the linear combination of the basis and f(x_0), ..., f(x_n) values
        return sum(basis[i] * self.yi[i] for i in range(len(self.xi)))


def lagrange_exp(x):
    """
    Approximates f(x) = e^x for any x on the interval [-709, 709]
    using Lagrange interpolation. If x < -709 or x > 709, raises an
    assertion error due to double precision underflow/overflow limits.
    Performs a range reduction step by applying the identity
    e^x = e^r * 2^k for |r| <= 1/2 log(2) and integer k. The Lagrange
    interpolant approximates the value of e^r, then the result is
    multiplied by 2^k to obtain the final approximation for e^x.
    Performance: We sample 15 distinct points of e^x and their values
    to construct a Lagrange basis consisting of 15 polynomials. Since
    the Lagrange basis is computed for each approximation f(x), we
    have a floor of
        - 15^2 multiplications
        - 15^2 divisions
        - 2 * 15^2 subtractions
    for approximately 4 * 15^2 operations overall. The linear
    combination step adds another 15 additions and 15 multiplications
    by running through the length of the basis once more, and the
    interpolation proper further includes
        - 1 absolute value
        - 1 rounding
        - 1 division
        - 2 subtractions
        - 1 left shift
        - 2 multiplications
    for another 8 operations. Therefore there are (4 * 15^2) + 38 = 938
    operations total.
    Accuracy: Over a sample of 10,000 equi-spaced points in [-709, 709],
    we have the following error statistics:
        - Max relative error = 8.014646895154806e-14
        - Min relative error = 0.0
        - Avg relative error = 1.716737407758425e-14
        - Med relative error = 1.131501762186489e-14
        - Var relative error = 2.963335086848574e-28
        - 6.329 percent of the values have less than 14 digits of precision
    :param x: (int) or (float) power of e to evaluate
    :return: (float) approximation of e^x
    """
    assert(-709 <= x <= 709)
    if x == 0:
        return 1
    x0 = np.abs(x)
    l = 0.6931471805599453
    k = math.ceil((x0 / l) - 0.5)
    p = 1 << k
    r = x0 - (k * l)
    # lagrange is an instance of the LagrangeInterpolation class
    Pn = lagrange.interpolate(r) * p
    return Pn if x > 0 else 1 / Pn
```
#include <cassert>
#include <cmath>
#include <vector>

class LagrangeInterpolation {
    std::vector<double> xi;
    std::vector<double> yi;
public:
    LagrangeInterpolation(std::vector<double> x_points, std::vector<double> y_points) {
        xi = x_points;
        yi = y_points;
    }
    double interpolate(double x) {
        /*
         * Calculates the Lagrange interpolant of a function f parameterized by n distinct
         * x values and corresponding f(x) values. Stores the point data, then approximates
         * the value of f at x by calculating the Lagrange basis polynomials with respect
         * to x and taking their linear combination with respect to f(x_0), ..., f(x_n).
         * Args:
         * - x (double float) point at which to approximate f
         * Returns:
         * - (double float) approximation of f(x)
         */
        // Initialize the basis as a vector of 1s with the same length as the n points
        std::vector<double> basis(xi.size(), 1);
        double val = 0;
        // Double for loop is O(n^2), where n is the number of distinct x points
        for (int i = 0; i < xi.size(); i++) {
            for (int j = 0; j < xi.size(); j++) {
                // If the i index is equal to the j index, skip this
                // We're running through this loop to calculate
                // l_i(x) = (x - x_0) / (x_i - x_0) * ... * (x - x_n) / (x_i - x_n)
                if (i == j) {
                    continue;
                }
                basis[i] *= (x - xi[j]) / (xi[i] - xi[j]);
            }
        }
        // Return the linear combination of the basis and f(x_0), ..., f(x_n) values
        for (int i = 0; i < basis.size(); i++) {
            val += (basis[i] * yi[i]);
        }
        return val;
    }
};

double lagrange_exp(double x) {
    /*
     * Approximates f(x) = e^x for any x on the interval [-709, 709]
     * using Lagrange interpolation.
     * If x < -709 or x > 709, raises an assertion error due to
     * double precision underflow/overflow limits. Performs a
     * range reduction step by applying the identity e^x = e^r * 2^k
     * for |r| <= 1/2 log(2) and integer k. The Lagrange interpolant
     * approximates the value of e^r, then the result is multiplied
     * by 2^k to obtain the final approximation for e^x.
     * Performance: We sample 15 distinct points of e^x and their values to construct
     * a Lagrange basis consisting of 15 polynomials. Since the Lagrange basis is
     * computed for each approximation f(x), we have a floor of
     * - 15^2 multiplications
     * - 15^2 divisions
     * - 2 * 15^2 subtractions
     * for approximately 4 * 15^2 operations overall. The linear combination step adds
     * another 15 additions and 15 multiplications by running through the length of
     * the basis once more, and the interpolation proper further includes
     * - 1 absolute value
     * - 1 rounding
     * - 1 division
     * - 2 subtractions
     * - 1 left shift
     * - 2 multiplications
     * for another 8 operations. Therefore there are (4 * 15^2) + 38 = 938 operations
     * total.
     * Accuracy: Over a sample of 10,000 equi-spaced points in
     * [-709, 709], we have the following error statistics:
     * Max relative error: 8.00308e-14
     * Min relative error: 0
     * Avg relative error: 0
     * Med relative error: 2.26284e-14
     * Var relative error: 0
     * 6.44 percent of the values have less than 14 digits of precision.
     * Args:
     * - (double float) power of e to approximate
     * Returns:
     * - (double float) approximation of e^x
     */
    assert(-709 <= x && x <= 709);
    if (x == 0) {
        return 1;
    }
    double x0 = std::abs(x);
    int k = std::ceil((x0 / M_LN2) - 0.5);
    double r = x0 - (k * M_LN2);
    // lagrange is an instance of the LagrangeInterpolation class
    double pn = lagrange.interpolate(r);
    pn = pn * std::exp2(k);
    if (x < 0) {
        return 1 / pn;
    }
    return pn;
}
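Both versions above refer to a module-level lagrange object that is constructed elsewhere. As a minimal sketch of how one might set it up in Python, the snippet below samples 15 points of $e^x$ and builds the interpolation object; the choice of sampling interval here ($[-1, 1]$, equi-spaced) is an assumption on my part and may differ from the sampling used to produce the error statistics quoted in the docstrings.

import numpy as np

# Hypothetical setup: build the module-level `lagrange` instance that
# lagrange_exp refers to from 15 equi-spaced samples of e^x on [-1, 1].
xi = np.linspace(-1, 1, 15)
yi = np.exp(xi)
lagrange = LagrangeInterpolation(xi, yi)

print(lagrange_exp(1))     # approximately 2.718281828459045
print(lagrange_exp(-10))   # approximately 4.5399929762484854e-05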
First consider the error distribution of the Lagrange interpolation class versus the benchmark function over $[-1, 1]$.
This is clearly not as accurate as the non-reduced Taylor series approximation, which was able to achieve 16 digits of precision everywhere on $[-1, 1]$. But when we plot the relative error distribution of this implementation, we find that it’s remarkably similar to that of the reduced Taylor series approximation.
The accuracy characteristics are extremely similar, because the range reduction step dominates any idiosyncrasies of the Lagrange interpolant. We still have a worst-case precision of 13 significant digits, and all but approximately 6.5% of the values actually have 14 significant digits of precision. But this algorithm is about 18.8 times less efficient than the reduced Taylor series approximation method, because calculating the Lagrange basis has O($n^2$) time complexity and cannot be memoized. Since we need to recalculate the Lagrange basis for every input $x$, there are over 900 operations required to approximate $e^x$ anywhere on the interval.
Let $P_n$ be the interpolant of $e^x$ with degree $n$ on the interval $[a, b]$, and let $\epsilon_n$ be the absolute error between $P_n$ and $e^x$. Then by the Lagrange remainder theorem, we have the identity
$$ e^x = P_n(x) + \epsilon_n(x) $$
where the error term is given by
$$ \epsilon_n(x) = \frac{e^{(n + 1)}(c)}{(n + 1)!}(x - x_0)(x - x_1)\ldots(x - x_n) $$
for some $c \in [a, b]$, where $e^{(n + 1)}$ denotes the $(n + 1)^{\text{th}}$ derivative of $e^x$. Note the similarity with the error term in Taylor series approximation. Both have $\frac{e^{(n + 1)}(c)}{(n + 1)!}$ as a factor. But the absolute error of the Taylor series approximation multiplies this term by $(x - a)^{n + 1}$, where $a$ is the center of expansion. The Lagrange absolute error is slightly different, because it instead multiplies that factor by
$$ \prod_{i = 0}^{n}(x - x_i) $$
This is to say that the information in the Taylor series error term is implicit in the center of expansion, whereas the Lagrange interpolation error depends on the $n + 1$ distinct points $x_0, x_1, \ldots, x_n$. We can simplify this error in a similar fashion as the Taylor series error, because $e^x$ is its own derivative.
$$ \epsilon_n(x) = e^c \frac{(x - x_0)(x - x_1)\ldots(x - x_n)}{(n + 1)!} $$
Likewise we can once again exploit the fact that $e^x$ is an increasing function and $x \in [a, b]$ to obtain the inequality
$$ \epsilon_n(x) \leq e^x \frac{(x - x_0)(x - x_1)\ldots(x - x_n)}{(n + 1)!} $$
which allows us to bound the relative error at $x$ like so:
$$ \eta_n(x) = \left| \frac{\epsilon_n(x)}{e^x}\right| \leq \left|\frac{(x - x_0)(x - x_1)\ldots(x - x_n)}{(n + 1)!}\right| $$
Clearly we would like the maximum relative error on the entire interval $[a, b]$ to be as small as possible. We can observe that $(x - x_0)(x - x_1)\ldots(x - x_n)$ is itself a factorized polynomial and that the distinct points $x_0, x_1, \ldots, x_n$ are its roots. Thus the problem of reducing the relative error is equivalent to the problem of choosing roots which minimize its local maxima. As it turns out, an equi-spaced selection of points $x_0, x_1, \ldots, x_n$ is rarely the optimal choice for minimizing the relative error over the entire interval. This leads us to an important drawback of Lagrange interpolation when the points are chosen in such a fashion.
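To get a concrete feel for this bound, here is a small sketch that evaluates the right-hand side of the relative error bound on a dense grid for 15 equi-spaced nodes on $[-1, 1]$; the node count, interval, and grid size are illustrative assumptions rather than anything prescribed above.

import math

import numpy as np

# Estimate the Lagrange relative error bound |prod_i (x - x_i)| / (n + 1)!
# over [-1, 1] for 15 equi-spaced nodes (so n = 14 and (n + 1)! = 15!).
nodes = np.linspace(-1, 1, 15)
grid = np.linspace(-1, 1, 10001)
node_poly = np.array([np.prod(x - nodes) for x in grid])
bound = np.abs(node_poly) / math.factorial(len(nodes))
print(bound.max())   # worst-case value of the bound over the interval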
When the $n + 1$ distinct points $x_0, x_1, \ldots, x_n$ sampled for interpolation are equi-spaced, the interpolation is susceptible to numerical instability. Functions whose Taylor series do not have an infinite radius of convergence will exhibit increasingly large error perturbations under interpolation as the degree $n$ of the interpolant increases. In other words, suppose a function $f$ is infinitely differentiable at some $a$. There exists a radius $R$ such that the Taylor series of $f$ converges absolutely at any $c$ with $|a - c| \leq R$. If this radius is infinite, then $f$ has an infinite radius of convergence and interpolants of $f$ will converge to $f$ uniformly regardless of their degree. But if the radius is finite, there will be a degree at which interpolants of $f$ actually become increasingly inaccurate within certain subintervals. This is known as Runge's phenomenon. 7: Closely related is the Gibbs phenomenon in the approximation of functions using Fourier series.
Luckily this concern does not affect approximations of $e^x$, because the Taylor series of $e^x$ converges absolutely everywhere. Therefore $e^x$ can be interpolated with a polynomial of arbitrarily high degree without suffering from this instability (though in practice there is no benefit to using more than a degree 15 polynomial in double precision). But to illustrate this problem more clearly, we can look at Lagrange interpolation of the sine function. The sine function also converges everywhere, but high-degree interpolation through equi-spaced points is badly conditioned in double precision, so rounding errors produce the same kind of blow-up near the interval boundaries. Here is a plot of the degree 10 Lagrange interpolant of sine against its true value on the interval $[-1, 1]$.
Now look at what happens when we increase the degree of the interpolant to 60.
We can see that the value returned by the interpolant has started to noticeably deviate from the true value of $\sin(x)$ near the boundaries of $[-1, 1]$. This is despite the fact that this interpolant is of a higher degree than the first one. If we zoom in on $[-1, -0.9]$ we can see this behavior more clearly.
This illustrates an important point of caution when working with Lagrange interpolation: in general, approximations of a function over an interval do not uniformly improve everywhere on that interval when the sampling points of interpolation are equi-spaced on the interval. Look what happens when we push the degree of interpolation up to 100.
The absolute error of the interpolant at the boundaries of the interval continues to increase commensurate with the degree of the interpolant. Now the error visually dominates the true value of the sine function, such that we can’t distinguish it from a flat line.
The absolute error of the degree 100 interpolant is so high that it’s off from the true value of $\sin(x)$ by over 1,000,000,000! The counterintuitive result is that approximation results are not necessarily improved by higher degree interpolants. In fact, they may become (significantly) worse.
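A comparison along the lines described above can be generated by reusing the LagrangeInterpolation class from earlier; the following is a rough sketch only, and the grid sizes, degrees, and plotting details are illustrative assumptions.

import matplotlib.pyplot as plt
import numpy as np

# Compare sin(x) against its Lagrange interpolants built from equi-spaced
# points, for increasing interpolation degree.
grid = np.linspace(-1, 1, 2001)
for degree in (10, 60, 100):
    xi = np.linspace(-1, 1, degree + 1)   # degree + 1 equi-spaced sample points
    yi = np.sin(xi)
    interp = LagrangeInterpolation(xi, yi)
    approx = np.array([interp.interpolate(x) for x in grid])
    plt.plot(grid, approx, label=f"degree {degree} interpolant")
plt.plot(grid, np.sin(grid), label="sin(x)")
plt.legend()
plt.show()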
Barycentric algorithm
We can leverage a few properties of the Lagrange basis polynomials to make the interpolation algorithm more efficient. In particular, we want to find a way to interpolate arbitrary values of $f(x)$ which doesn’t require us to recalculate the basis for each $x$. It would be better if we could pre-calculate the basis just once, because it’s an O($n^2$) operation. To explore this further, suppose we wanted to find the sum of all Lagrange basis polynomials, irrespective of the value of $x$. Then we would calculate
$$ \sum_{i = 0}^{n} \ell_i(x) = \ell_0(x) + \ell_1(x) + \ldots + \ell_n(x) $$
Equivalently, we can write this as a sum with a coefficient of 1 on each term:
$$ 1\ell_0(x) + 1\ell_1(x) + \ldots + 1\ell_n(x) $$
Then by the definition of Lagrange interpolation, this is equivalent to
$$ f(x_0)\ell_0(x) + f(x_1)\ell_1(x) + \ldots + f(x_n)\ell_n(x) $$
where $f(x_0) = f(x_1) = \ldots = f(x_n) =$ 1. In other words, the sum of Lagrange basis polynomials for arbitrary $x$ is given by interpolating the constant function $f(x) = 1$. Therefore, letting $f(x) = 1$, we have
$$ 1 = f(x) = \sum_{i = 0}^{n}f(x_i)\ell_i(x) = \sum_{i = 0}^{n}1\ell_i(x) = \sum_{i = 0}^{n}\ell_i(x) $$
from which we conclude that the sum of Lagrange basis polynomials is always equal to 1. Now we’ll decompose each basis polynomial. We define a series of weights $\lambda_i$ such that
$$ \lambda_i = \frac{1}{\prod_{j = 0, j \neq i}^{n}(x_i - x_j)} $$
which is to say that each weight $\lambda_i$ is the reciprocal of the product appearing in the denominator of $\ell_i(x)$, and does not depend on the value of $x$. Next we observe that the numerator of each basis polynomial $\ell_i(x)$ contains every factor $(x - x_j)$ except $(x - x_i)$. Therefore we define a convenient shorthand for the full product
$$ \mathcal{L}(x) = (x - x_0)(x - x_1) \ldots (x - x_n) = \prod_{i = 0}^{n}(x - x_i) $$
Now we can represent each basis polynomial in decomposed form
$$ \ell_i(x) = \frac{\mathcal{L}(x)\lambda_i}{x - x_i} $$
which gives us the new Lagrange interpolation identity
$$ P_n(x) = \sum_{i = 0}^{n}\frac{\mathcal{L}(x)\lambda_i}{x - x_i}f(x_i) = \mathcal{L}(x)\sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i}f(x_i) $$
because $\mathcal{L}(x)$ does not depend on $i$ and can be factored out of the summation. Further note that
$$ \sum_{i = 0}^{n}\ell_i(x) = \sum_{i = 0}^{n}\frac{\mathcal{L}(x)\lambda_i}{x - x_i} = \mathcal{L}(x)\sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i} $$
Since we’ve already proven the sum of all Lagrange basis polynomials is equal to 1, we can divide the new Lagrange interpolation identity by the sum of basis polynomials without changing the result. Doing so allows us to cancel $\mathcal{L}(x)$. Then we obtain the Barycentric interpolation algorithm
$$ \begin{aligned} P_n(x) &= \mathcal{L}(x)\sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i}f(x_i) \bigg/ 1 \\ &= \mathcal{L}(x)\sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i}f(x_i) \bigg/ \sum_{i = 0}^{n}\ell_i(x) \\ &= \mathcal{L}(x)\sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i}f(x_i) \bigg/ \mathcal{L}(x)\sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i} \\ &= \sum_{i = 0}^{n}\frac{\lambda_i f(x_i)}{x - x_i} \bigg/ \sum_{i = 0}^{n}\frac{\lambda_i}{x - x_i} \end{aligned} $$
Let’s take a look at an implementation of this formula.
class BarycentricInterpolation:
    """
    Interpolates a function using the Barycentric algorithm. __init__ precalculates
    the weights, which takes O(n^2) time in the number n + 1 of distinct
    interpolation points. The weights are the same for all values of x, so this is
    calculated once then saved. The interpolate member function uses the weights
    and yi = f(x_i) values to approximate the function. With the weights
    pre-calculated, each evaluation of x takes only O(n) time. As a small complexity
    tradeoff, the values of f(x_i) multiplied by the weights are also pre-calculated
    because they do not depend on the input x.
    """
    def __init__(self, x_points, y_points):
        self.xi = x_points
        self.yi = y_points
        self.weights = []
        self.wi = []
        self.n = len(self.xi)
        for j in range(self.n):
            w = 1
            for k in range(self.n):
                if k == j:
                    continue
                w *= (self.xi[j] - self.xi[k])
            self.weights.append(1 / w)
        for i in range(self.n):
            self.wi.append(self.weights[i] * self.yi[i])

    def interpolate(self, x):
        r"""
        Interpolates f(x) using the Barycentric algorithm with the n + 1 distinct
        points x_0, x_1, ..., x_n, the pre-calculated weights \lambda_0, \lambda_1,
        ..., \lambda_n, and the pre-calculated terms f(x_0)\lambda_0,
        f(x_1)\lambda_1, ..., f(x_n)\lambda_n
        Performance: With the weights \lambda_i and f(x_i)\lambda_i pre-calculated,
        we have:
        - n + 1 subtractions
        - 2 * (n + 1) additions
        - 2 * (n + 1) + 1 divisions
        which is 5(n + 1) + 1 operations total. Assuming we interpolate using
        15 points, which is equal to 81 operations.
        :param x: (double float) point of f to interpolate
        :return: (double float) approximation of f(x)
        """
        # We consider the final numerator and denominator separately, and
        # initially they equal 0.
        p_num = 0
        p_den = 0
        # Perform one loop through each of the n + 1 points x_0, ..., x_n
        for j in range(self.n):
            # We have a special case that allows us to eject out of the
            # loop, when the currently considered x_j is equal to the
            # input x. Then we simply return the corresponding f(x_j).
            if self.xi[j] == x:
                return self.yi[j]
            # If x_j is not equal to the input x, we save the value of
            # x - x_j, because this value is needed twice.
            d = x - self.xi[j]
            # Add the value of f(x_j)\lambda_j / (x - x_j) to the numerator
            p_num += self.wi[j] / d
            # Add the value of \lambda_j / (x - x_j) to the denominator
            p_den += self.weights[j] / d
        # Now we just return the numerator over the denominator
        return p_num / p_den
#include <vector>

class BarycentricInterpolation {
    std::vector<double> weights;
    std::vector<double> xi;
    std::vector<double> yi;
    std::vector<double> wi;
public:
    BarycentricInterpolation(std::vector<double> x_points, std::vector<double> y_points) {
        /*
         * Interpolates a function using the Barycentric algorithm. The constructor
         * precalculates the weights, which takes O(n^2) time in the number
         * n + 1 of distinct interpolation points. The weights are the same for
         * all values of x, so this is calculated once then saved. The interpolate
         * member function uses the weights and yi = f(x_i) values to approximate
         * the function. With the weights pre-calculated, each evaluation of x
         * takes only O(n) time. As a small complexity tradeoff, the values
         * of f(x_i) multiplied by the weights are also pre-calculated because
         * they do not depend on the input x.
         */
        xi = x_points;
        yi = y_points;
        for (int j = 0; j < x_points.size(); j++) {
            double w = 1;
            for (int k = 0; k < x_points.size(); k++) {
                if (k == j) {
                    continue;
                }
                w *= x_points[j] - x_points[k];
            }
            weights.push_back(1 / w);
        }
        for (int i = 0; i < x_points.size(); i++) {
            wi.push_back(weights[i] * y_points[i]);
        }
    }
    double interpolate(double x) {
        /*
         * Interpolates f(x) using the Barycentric algorithm with the n + 1
         * distinct points x_0, x_1, ..., x_n, the pre-calculated weights
         * \lambda_0, \lambda_1, ..., \lambda_n, and the pre-calculated terms
         * f(x_0)\lambda_0, f(x_1)\lambda_1, ..., f(x_n)\lambda_n
         * Performance: With the weights \lambda_i and f(x_i)\lambda_i
         * pre-calculated, we have:
         * - n + 1 subtractions
         * - 2 * (n + 1) additions
         * - 2 * (n + 1) + 1 divisions
         * which is 5(n + 1) + 1 operations total. Assuming we interpolate
         * using 15 points, which is equal to 81 operations.
         * Args:
         * - (double float) point of f to interpolate
         * Returns:
         * - (double float) approximation of f(x)
         */
        // We consider the final numerator and denominator separately, and
        // initially they equal 0.
        double p_num = 0;
        double p_den = 0;
        // Perform one loop through each of the n + 1 points x_0, ..., x_n
        for (int j = 0; j < xi.size(); j++) {
            // We have a special case that allows us to eject out of the
            // loop, when the currently considered x_j is equal to the
            // input x. Then we simply return the corresponding f(x_j).
            if (xi[j] == x) {
                return yi[j];
            }
            // If x_j is not equal to the input x, we save the value of
            // x - x_j, because this value is needed twice.
            double d = x - xi[j];
            // Add the value of f(x_j)\lambda_j / (x - x_j) to the numerator
            p_num += wi[j] / d;
            // Add the value of \lambda_j / (x - x_j) to the denominator
            p_den += weights[j] / d;
        }
        // Now return the numerator over the denominator
        return (p_num / p_den);
    }
};
This algorithm is a significant performance improvement. Whereas the previous implementation of Lagrange interpolation took O($n^2$) time on every evaluation of $x$, this one requires only O($n$) time per evaluation. We are able to pre-calculate not only the weights $\lambda_0, \lambda_1, \ldots, \lambda_n$ upon initialization, but also the values $f(x_0)\lambda_0, f(x_1)\lambda_1, \ldots, f(x_n)\lambda_n$. Assuming we interpolate using 15 points, we observe a tenfold reduction in operation count from the first implementation. The accuracy characteristics are actually somewhat better compared to the previous algorithm, so this implementation is strictly better overall. Consider the error distribution of this algorithm over $[-1, 1]$.
Visually the error distribution is the same. However the Barycentric algorithm achieves 14 digits of precision almost everywhere over $[-1, 1]$, which is an improvement by about a full digit of precision. As with the previous implementations, including the range reduction step actually compresses the error distribution such that it’s more accurate everywhere, including $[-1, 1]$. Here is the error distribution of our lagrange_exp
function with the Barycentric interpolation algorithm used for approximating $e^r$ instead of the original Lagrange interpolation technique.
So we see that it exhibits the same relative error distribution as the previous implementations when range reduction is used. Next we will consider methods of point selection which improve on an equi-spaced choice of values $x_0, x_1, \ldots, x_n$.
Chebyshev interpolation
We’ve seen that the unique $n$ degree interpolant $P_n$ of a function $f$ is a vector in the space $\mathbb{F}[x]$. Furthermore, there are different ways of representing $P_n$ depending on the chosen basis of $\mathbb{F}[x]$. Recall that the error of the Lagrange interpolant depends on the (factorized) polynomial
$$ (x - x_0)(x - x_1)\ldots(x - x_n) $$
If we minimize the extrema of this polynomial by choosing appropriate roots $x_0, x_1, \ldots, x_n$, we likewise minimize the error of the interpolant $P_n$ versus $f$. This coincides with the problem of choosing optimal interpolation points $x_0, x_1, \ldots, x_n$, and determining a polynomial with such optimal roots reduces to a particular differential equation. The solutions to that equation are called the Chebyshev polynomials, and the respective roots of each polynomial are called the Chebyshev nodes. 8: Technically there are two polynomial series commonly referred to with the term "Chebyshev." The Chebyshev polynomials depicted here are specifically the Chebyshev polynomials "of the first kind." There are also Chebyshev polynomials "of the second kind" which we won't use.
The first and second Chebyshev polynomials are defined as $T_0(x) = 1$ and $T_1(x) = x$, respectively. Further Chebyshev polynomials can be determined using the recurrence relation
$$ T_n(x) = 2xT_{n - 1}(x) - T_{n - 2}(x) $$
Moreover, the $n^{\text{th}}$ Chebyshev polynomial is also characterized by the identity
$$ T_n(\cos \theta) = \cos(n \theta) $$
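Both properties are easy to check numerically. The short sketch below evaluates $T_n(x)$ with the recurrence and compares it against the cosine identity; the helper name chebyshev_t and the test angle are my own choices, not anything defined earlier.

import math

def chebyshev_t(n, x):
    # Evaluate T_n(x) iteratively via T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x),
    # starting from T_0(x) = 1 and T_1(x) = x.
    if n == 0:
        return 1.0
    t_prev, t_curr = 1.0, x
    for _ in range(2, n + 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

# Check the identity T_n(cos(theta)) = cos(n * theta) at an arbitrary angle.
theta = 0.7
print(chebyshev_t(5, math.cos(theta)), math.cos(5 * theta))   # the two values agree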
To avoid Runge’s phenomenon, we can perform Lagrange interpolation using the roots of these polynomials as the selection of interpolation points. But we can also directly use the Chebyshev polynomials as a series for interpolation.
Orthogonality is a generalization of perpendicularity to arbitrary vector spaces. Whereas perpendicularity is the geometric notion of vectors meeting at right angles, orthogonality is defined using the inner product. In turn, an inner product provides a rigorous notion of the distance and angle of two vectors with respect to one another. A vector space equipped with the additional structure of an inner product is called an inner product space. Two vectors $x, y$ are orthogonal if and only if their inner product is $\langle x, y \rangle = 0$. Orthogonality also implies linear independence, from which it follows that an orthogonal set of $n$ vectors comprises an orthogonal basis of an $n$-dimensional vector space. We can prove that the set of the first $n + 1$ Chebyshev polynomials is orthogonal, and thus forms a basis. Let
$$ \{T_0(x), T_1(x), \ldots, T_n(x)\} $$
be a set of Chebyshev polynomials. By definition, these are vectors in the space $\mathbb{F}[x]$ of all polynomials with degree less than or equal to $n$. There are many possible inner products, and a set of vectors can be orthogonal with respect to one but not orthogonal with respect to another. One of the most common inner products is the $L^2$ inner product, defined on an interval $[a, b]$:
$$ \langle f, g \rangle = \int_{a}^{b} f(x)g(x)dx $$
Chebyshev polynomials are not orthogonal with respect to the $L^2$ inner product. However, we can define a new inner product from any other inner product by introducing a weight function. So given a weight function $w$, we have the new inner product
$$ \langle f, g \rangle = \int_{a}^{b} f(x)g(x)w(x)dx $$
In this case, we can prove that the Chebyshev polynomials are orthogonal on the interval $[-1, 1]$ with respect to the weight
$$ w(x) = \frac{1}{\sqrt{1 - x^2}} $$
To show this, we need to prove
$$ \int_{-1}^{1}T_{j}(x)T_{k}(x)\frac{1}{\sqrt{1 - x^2}}dx = \begin{cases} 0, & j \neq k \\ a, & j = k \end{cases} $$
for some nonzero constant $a$.
Using the identity $T_n(\cos \theta) = \cos(n \theta)$ together with the change of variable $x = \cos \theta$ (so that $dx = -\sin \theta \, d\theta$ and $\sqrt{1 - x^2} = \sin \theta$ on $[0, \pi]$), we obtain a new definite integral with the weight implicit:
$$ \langle T_j, T_k \rangle = \int_{0}^{\pi}\cos(j x)\cos(k x)dx $$
Now suppose $j = k =$ 0. By applying the trigonometric identity $\cos(0 x) = \cos(0) = 1$, we have
$$ \int_{0}^{\pi}\cos(j x)\cos(k x)dx = \int_{0}^{\pi}1 \cdot 1dx = \pi $$
Next suppose $j = k \neq 0$. Applying the identity $\cos^2(jx) = \frac{1}{2}\cos(2jx) + \frac{1}{2}$ and the fact that $\sin(2\pi j) = 0$ for any integer $j$, we have
$$ \int_{0}^{\pi} \cos^2(jx)dx = \int_{0}^{\pi} \left(\frac{1}{2}\cos(2jx) + \frac{1}{2}\right)dx = \frac{\sin(2\pi j)}{4j} + \frac{\pi}{2} = \frac{\pi}{2} $$
Finally suppose $j \neq k$. Applying the product-to-sum identity
$$ \cos(jx)\cos(kx) = \frac{1}{2}\left[\cos((j - k)x) + \cos((j + k)x)\right] $$
and the fact that $\int_{0}^{\pi}\cos(mx)dx = \frac{\sin(m\pi)}{m} = 0$ for any nonzero integer $m$ (which both $j - k$ and $j + k$ are, since $j \neq k$ and $j, k \geq 0$), we have
$$ \int_{0}^{\pi}\cos(jx)\cos(kx)dx = \frac{1}{2}\int_{0}^{\pi}\cos((j - k)x)dx + \frac{1}{2}\int_{0}^{\pi}\cos((j + k)x)dx = 0 $$
Therefore we see the Chebyshev polynomials satisfy orthogonality with respect to the weight $\frac{1}{\sqrt{1 - x^2}}$ because
$$ \langle T_j, T_k \rangle = \int_{-1}^{1}T_{j}(x)T_{k}(x)\frac{1}{\sqrt{1 - x^2}}dx = \begin{cases} 0, & j \neq k \\ \pi, & j = k = 0 \\ \frac{\pi}{2}, & j = k \neq 0 \end{cases} $$
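All three cases can be checked numerically through the substitution form of the inner product, $\langle T_j, T_k \rangle = \int_{0}^{\pi}\cos(j\theta)\cos(k\theta)d\theta$. The sketch below uses scipy's quadrature routine, which is an added dependency not used elsewhere in this article, and the helper name is mine.

import math

from scipy.integrate import quad

def cheb_inner(j, k):
    # <T_j, T_k> computed as the integral of cos(j*theta) * cos(k*theta)
    # over [0, pi], i.e. after the substitution x = cos(theta).
    value, _ = quad(lambda theta: math.cos(j * theta) * math.cos(k * theta), 0, math.pi)
    return value

print(cheb_inner(2, 3))   # approximately 0       (j != k)
print(cheb_inner(0, 0))   # approximately pi      (j = k = 0)
print(cheb_inner(4, 4))   # approximately pi / 2  (j = k != 0)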
Thus the Chebyshev polynomials comprise an orthogonal basis for functions which satisfy certain continuity constraints on $[-1, 1]$. This range may appear small, but given that we’re often performing range reduction into a smaller interval than $[-1, 1]$ anyway, it doesn’t present much of an obstacle. It’s more important for us to show that candidate functions for Chebyshev interpolation have unique representations as linear combinations of the Chebyshev basis polynomials. The orthogonality of the Chebyshev polynomials also allows us to efficiently compute the scalar coefficients of these representations.
A function $f$ is called Lipschitz continuous, or simply Lipschitz, on a domain $[a, b]$ if there exists a constant $L$ such that
$$ |f(x) - f(y)| \leq L|x - y| $$
for all $x, y \in [a, b]$.
9: Sobolev spaces are frequently used in numerical analysis and approximation theory. For more information, refer to these notes.
Therefore any function which is Lipschitz continuous on $[-1, 1]$ has a unique representation as an absolutely and uniformly convergent Chebyshev series
$$ f(x) = \sum_{k = 0}^{\infty}a_kT_k(x) = a_0 + a_1x + a_2T_2(x) + \ldots $$
where, by the orthogonality of the polynomials, each coefficient $a_k$ is defined
$$ a_k = \frac{2}{\pi}\int_{-1}^{1}\frac{f(x)T_k(x)}{\sqrt{1 - x^2}}dx $$
except for the first coefficient $a_0$, which has the factor $\frac{1}{\pi}$ instead of $\frac{2}{\pi}$. 10: A proof of these facts is given in Trefethen's Approximation Theory and Approximation Practice, Chapter 3, Theorem 3.1.
11: In fact, $e^x$ is Lipschitz on the domain $(-\infty, 1]$.
On the other hand, the sine function is globally Lipschitz because its first derivative, the cosine function, is bounded: $|\cos(x)| \leq 1$ for all $x \in \mathbb{R}$.
Whereas an inner product defines a notion of vector distance and angle, a norm defined on a space formalizes the notion of vector length, and is denoted by $||\cdot||$. One of the ways we can rigorously bound the absolute error of an approximation is by minimizing the norm of the error. For example, the uniform norm, or sup norm, of a function $f$ on an interval $[a, b]$ is defined as
$$ ||f||_\infty = \max{|f(x)|}, \quad x \in [a, b] $$
Then for any function $f$ defined on $[a, b]$ with approximation $F$, the norm of the absolute error is defined
$$ ||\epsilon||_\infty = \max{|f(x) - F(x)|}, \quad x \in [a, b] $$
Therefore minimizing the uniform norm of the absolute error minimizes the maximum absolute error. For any function $f$ defined on $[a, b]$, there exists a unique polynomial $P_n$ which is the "best" approximation of $f$, in the sense that the uniform norm of $P_n - f$ is the minimum possible for a polynomial with degree $n$ on the domain. Then $P_n$ is called the minimax polynomial of $f$. Minimax polynomials are difficult to compute and do not have a straightforward representation. But Chebyshev interpolants are very close to best approximation polynomials while being much easier to understand.
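In practice the uniform norm of the error is usually estimated by sampling a dense grid rather than solved for exactly. The sketch below is a minimal version of that idea; the helper name, grid size, and the degree 3 Taylor polynomial used as the example approximation are all illustrative assumptions.

import numpy as np

def sup_norm_error(f, F, a, b, samples=100001):
    # Estimate ||f - F||_inf on [a, b] by taking the maximum absolute
    # difference over a dense grid of sample points.
    grid = np.linspace(a, b, samples)
    return np.max(np.abs(f(grid) - F(grid)))

# Example: worst-case absolute error of a degree 3 Taylor polynomial of e^x on [-1, 1].
print(sup_norm_error(np.exp, lambda x: 1 + x + x**2 / 2 + x**3 / 6, -1, 1))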
Our first implementation will approximate $e^x$ using our familiar range reduction technique with the identity $e^x = e^r \cdot 2^k$. We will first approximate $e^r$ on the interval $[-1, 1]$ by determining the coordinate vector of $e^r$ with respect to the Chebyshev polynomial basis in dimension 14. Then we will take the linear combination of that coordinate vector and the Chebyshev basis, and multiply the result by $2^k$. Since each coefficient $a_k$ is a definite integral on $[-1, 1]$, the coefficients do not depend on the value of $x$ and we can pre-calculate them. Therefore we calculate each coefficient 12: For a concrete example of how to do this in Mathematica, see this blog post concerning Chebyshev approximation of the sine function.
$$ a_k = \frac{2}{\pi}\int_{-1}^{1} \frac{e^x T_{k}(x)}{\sqrt{1 - x^2}}dx $$
for $0 \leq k \leq n$, and find that the coefficients are
1.266065877752008335598244625214717537923, 1.130318207984970054415392055219726613610, 0.2714953395340765623657051399899818507081, 0.04433684984866380495257149525979922986386, 0.00547424044209373265027616843118645948703, 0.000542926311913943750362147810307554678760, 0.00004497732295429514665469032811091269841937, 3.198436462401990505863872976602295688795e-6, 1.992124806672795725961064384805589035648e-7, 1.103677172551734432616996091335324170860e-8, 5.50589607967374725047142040200552692791e-10, 2.497956616984982522712010934218766985311e-11, 1.039152230678570050499634672423840849837e-12, 3.991263356414401512887720401532162026594e-14
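If you want to reproduce these coefficients without a computer algebra system (to double precision only; the values above were evidently computed at higher precision), the substitution $x = \cos \theta$ turns each coefficient into $a_k = \frac{2}{\pi}\int_{0}^{\pi} e^{\cos \theta}\cos(k\theta)d\theta$, with $\frac{1}{\pi}$ for $k = 0$. A sketch using scipy's quadrature, with a helper name of my own choosing:

import math

from scipy.integrate import quad

def cheb_coeff(k):
    # a_k = (2/pi) * integral_0^pi e^{cos(theta)} cos(k*theta) dtheta,
    # with the factor 1/pi instead of 2/pi when k = 0.
    value, _ = quad(lambda theta: math.exp(math.cos(theta)) * math.cos(k * theta), 0, math.pi)
    factor = 1 / math.pi if k == 0 else 2 / math.pi
    return factor * value

coeffs = [cheb_coeff(k) for k in range(14)]
print(coeffs[0])   # approximately 1.2660658777520083, matching the first value above
print(coeffs[1])   # approximately 1.1303182079849700, matching the second value above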
This gives us what we need to write the rest of our implementation. We’ll apply the standard range reduction technique and use Chebyshev approximation to evaluate $e^r$. But we’ll also consider the error distribution of the non-reduced Chebyshev approximation over $[-1, 1]$.
import math

import numpy as np


def cheb_series_exp(x):
    """
    Approximates f(x) = e^x for any x on the interval [-709, 709] using Chebyshev
    interpolation in the Chebyshev basis with degree 13. If x < -709 or x > 709,
    raises an assertion error due to double precision underflow/overflow limits.
    Performs a range reduction step by applying the identity e^x = e^r * 2^k for
    |r| <= 1/2 log(2) and integer k. The Chebyshev interpolant approximates the
    value of e^r, then the result is multiplied by 2^k to obtain the final
    approximation for e^x.
    Performance: In the worst case we have 65 operations:
    - 34 multiplications
    - 13 subtractions
    - 12 additions
    - 2 divisions
    - 1 rounding
    - 1 left shift
    - 1 absolute value
    Accuracy: Over a sample of 10,000 equi-spaced points in [-709, 709], we have
    the following error statistics.
    - Max relative error = 8.133024023260273e-14
    - Min relative error = 0.0
    - Avg relative error = 1.7128494413598806e-14
    - Med relative error = 1.127445546004485e-14
    - Var relative error = 2.938702723948832e-28
    - 6.31 percent of the values have less than 14 digits of precision
    :param x: (int) or (float) power of e to evaluate
    :return: (float) approximation of e^x
    """
    assert(-709 <= x <= 709)
    if x == 0:
        return 1
    # Scalar coefficients of e^x representation as linear combination
    # of the Chebyshev polynomial basis
    coeffs = [
        1.266065877752008335598244625214717537923,
        1.130318207984970054415392055219726613610,
        0.2714953395340765623657051399899818507081,
        0.04433684984866380495257149525979922986386,
        0.00547424044209373265027616843118645948703,
        0.000542926311913943750362147810307554678760,
        0.00004497732295429514665469032811091269841937,
        3.198436462401990505863872976602295688795e-6,
        1.992124806672795725961064384805589035648e-7,
        1.103677172551734432616996091335324170860e-8,
        5.50589607967374725047142040200552692791e-10,
        2.497956616984982522712010934218766985311e-11,
        1.039152230678570050499634672423840849837e-12,
        3.991263356414401512887720401532162026594e-14
    ]
    x0 = np.abs(x)
    l = 0.6931471805599453
    k = math.ceil((x0 / l) - 0.5)
    r = x0 - (k * l)
    # We need the first two Chebyshev polynomials to
    # determine the others
    ti, tj = 1, r
    # Use the first two Chebyshev coefficients to
    # initialize the linear combination sum
    p = coeffs[0] + (coeffs[1] * r)
    # Iterate through the coefficients once
    # We implement the Chebyshev polynomial recurrence
    # relation iteratively because it's more stable
    for i in range(2, 14):
        tk = (2 * r * tj) - ti
        p += coeffs[i] * tk
        ti, tj = tj, tk
    # Back out of the range reduction using e^x = e^r * 2^k
    p *= 1 << k
    if x < 0:
        p = 1 / p
    return p
#include <cassert>
#include <cmath>
#include <vector>

double cheb_series_exp(double x) {
    /*
     * Approximates f(x) = e^x for any x on the interval [-709, 709] using
     * Chebyshev interpolation in the Chebyshev basis with degree 13.
     * If x < -709 or x > 709, raises an assertion error due to
     * double precision underflow/overflow limits. Performs a
     * range reduction step by applying the identity e^x = e^r * 2^k
     * for |r| <= 1/2 log(2) and integer k. The Chebyshev interpolant
     * approximates the value of e^r, then the result is multiplied
     * by 2^k to obtain the final approximation for e^x.
     * Performance: In the worst case we have 65 operations:
     * - 34 multiplications
     * - 13 subtractions
     * - 12 additions
     * - 2 divisions
     * - 1 rounding
     * - 1 left shift
     * - 1 absolute value
     * Accuracy: Over a sample of 10,000 equi-spaced points in
     * [-709, 709], we have the following error statistics.
     * - Max relative error: 8.13302e-14
     * - Min relative error: 0
     * - Avg relative error: 0
     * - Med relative error: 2.25423e-14
     * - Var relative error: 0
     * - 6.31 percent of the values have less than 14 digits of precision.
     * Args:
     * - (double float) power of e to approximate
     * Returns:
     * - (double float) approximation of e^x
     */
    assert(-709 <= x && x <= 709);
    if (x == 0) {
        return 1;
    }
    double x0 = std::abs(x);
    int k = std::ceil((x0 / M_LN2) - 0.5);
    double r = x0 - (k * M_LN2);
    // Scalar coefficients of e^x representation as linear
    // combination of Chebyshev polynomial basis
    std::vector<double> coeffs = {
        1.266065877752008335598244625214717537923,
        1.130318207984970054415392055219726613610,
        0.2714953395340765623657051399899818507081,
        0.04433684984866380495257149525979922986386,
        0.00547424044209373265027616843118645948703,
        0.000542926311913943750362147810307554678760,
        0.00004497732295429514665469032811091269841937,
        3.198436462401990505863872976602295688795e-6,
        1.992124806672795725961064384805589035648e-7,
        1.103677172551734432616996091335324170860e-8,
        5.50589607967374725047142040200552692791e-10,
        2.497956616984982522712010934218766985311e-11,
        1.039152230678570050499634672423840849837e-12,
        3.991263356414401512887720401532162026594e-14
    };
    // We need the first two Chebyshev polynomials to
    // determine the others
    double ti = 1;
    double tj = r;
    // Use the first two coefficients to initialize the
    // linear combination sum
    double p = coeffs[0] + (coeffs[1] * r);
    // Iterate through the coefficients once
    // We implement the Chebyshev polynomial recurrence
    // relation iteratively because it's more stable
    for (int i = 2; i < 14; i++) {
        double tk = (2 * r * tj) - ti;
        p += coeffs[i] * tk;
        ti = tj;
        tj = tk;
    }
    // Back out of the range reduction using e^x = e^r * 2^k
    p *= std::exp2(k);
    if (x < 0) {
        p = 1 / p;
    }
    return p;
}
Plotting the relative error of the non-reduced implementation over $[-1, 1]$, we find that it’s significantly more accurate than Lagrange interpolation, including using the Barycentric algorithm. We also see a certain symmetry which is not exhibited by the Lagrange interpolant; this is expected by the Equioscillation Theorem since Chebyshev interpolants closely approximate optimal interpolants of the same degree.
Then when we plot the relative error of the range-reduced implementation over $[-709, 709]$, we observe the same relative error distribution as those of the previous implementations.
The operation count is constant, like the Barycentric interpolation scheme, but about 25% smaller. We can further reduce the operation count if we apply a change of basis from Chebyshev to monomial. Given that both bases have the same dimension and span the same vector space, there exists an invertible linear map $T: P_{n} \rightarrow P_{n}$ which transforms vectors represented in the Chebyshev basis into their unique representation with respect to the monomial basis. This amounts to left-multiplying the coordinate vector $[a_0, a_1, \ldots, a_n]$ by the change of basis matrix $[T]$, the matrix representation of the map $T$. Doing so with the coordinate vector above, we obtain
0.99999999999999999710, 1.0000000000000000304, 0.50000000000000176860, 0.16666666666666168651, 0.041666666666492764626, 0.0083333333335592763712, 0.0013888888951224622134, 0.00019841269432671705526, 0.000024801486520816039662, 2.7557622535824850520e-6, 2.7632293485109074120e-7, 2.4994303701340372837e-8, 2.0862193241800986288e-9
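One way to carry out this change of basis numerically is with numpy's Chebyshev module, which provides a conversion from Chebyshev series coefficients to monomial (power series) coefficients. The sketch below is only a double precision check; the variable cheb_coeffs simply restates the 14 Chebyshev-basis coefficients listed in the previous section, and the monomial values above were presumably computed at higher precision, so agreement is limited to roughly 16 digits.

from numpy.polynomial import chebyshev

# The 14 Chebyshev-basis coefficients of e^x listed earlier.
cheb_coeffs = [
    1.266065877752008335598244625214717537923,
    1.130318207984970054415392055219726613610,
    0.2714953395340765623657051399899818507081,
    0.04433684984866380495257149525979922986386,
    0.00547424044209373265027616843118645948703,
    0.000542926311913943750362147810307554678760,
    0.00004497732295429514665469032811091269841937,
    3.198436462401990505863872976602295688795e-6,
    1.992124806672795725961064384805589035648e-7,
    1.103677172551734432616996091335324170860e-8,
    5.50589607967374725047142040200552692791e-10,
    2.497956616984982522712010934218766985311e-11,
    1.039152230678570050499634672423840849837e-12,
    3.991263356414401512887720401532162026594e-14,
]
# cheb2poly applies the change of basis, returning monomial coefficients
# ordered from degree 0 upward.
mono_coeffs = chebyshev.cheb2poly(cheb_coeffs)
print(mono_coeffs[:4])   # begins approximately 1.0, 1.0, 0.5, 0.16666...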
This allows us to implement the Chebyshev interpolation algorithm using fewer operations.
import math

import numpy as np


def cheb_mono_exp(x):
    """
    Approximates f(x) = e^x for any x on the interval [-709, 709] using Chebyshev
    interpolation in the monomial basis with degree 13. If x < -709 or x > 709,
    raises an assertion error due to double precision underflow/overflow limits.
    Performs a range reduction step by applying the identity e^x = e^r * 2^k for
    |r| <= 1/2 log(2) and integer k. The Chebyshev interpolant approximates the
    value of e^r, then the result is multiplied by 2^k to obtain the final
    approximation for e^x.
    Performance: In the worst case we have 31 operations.
    - 14 multiplications
    - 2 divisions
    - 12 additions
    - 2 subtractions
    - 1 rounding
    - 1 left shift
    - 1 absolute value
    Accuracy: Over a sample of 10,000 equi-spaced points in [-709, 709], we have
    the following error statistics.
    - Max relative error = 8.197045651378647e-14
    - Min relative error = 0.0
    - Avg relative error = 1.7408224033601725e-14
    - Med relative error = 1.149513869636177e-14
    - Var relative error = 3.005578974516822e-28
    - 6.41 percent of the values have less than 14 digits of precision
    :param x: (int) or (float) power of e to evaluate
    :return: (float) approximation of e^x
    """
    assert(-709 <= x <= 709)
    if x == 0:
        return 1
    poly = [
        0.99999999999999999710,
        1.0000000000000000304,
        0.50000000000000176860,
        0.16666666666666168651,
        0.041666666666492764626,
        0.0083333333335592763712,
        0.0013888888951224622134,
        0.00019841269432671705526,
        0.000024801486520816039662,
        2.7557622535824850520e-6,
        2.7632293485109074120e-7,
        2.4994303701340372837e-8,
        2.0862193241800986288e-9
    ]
    x0 = np.abs(x)
    l = 0.6931471805599453
    k = math.ceil((x0 / l) - 0.5)
    p = 1 << k
    r = x0 - (k * l)
    # Evaluate the polynomial with Horner's method, starting from the
    # highest-degree coefficient and working down to the constant term
    Pn = poly[-1]
    for i in range(len(poly) - 2, -1, -1):
        Pn = Pn * r + poly[i]
    Pn *= p
    if x < 0:
        Pn = 1 / Pn
    return Pn
#include <cassert>
#include <cmath>
#include <vector>

double cheb_mono_exp(double x) {
    /*
     * Approximates f(x) = e^x for any x on the interval [-709, 709] using
     * Chebyshev interpolation in the monomial basis with degree 13.
     * If x < -709 or x > 709, raises an assertion error due to
     * double precision underflow/overflow limits. Performs a
     * range reduction step by applying the identity e^x = e^r * 2^k
     * for |r| <= 1/2 log(2) and integer k. The Chebyshev interpolant
     * approximates the value of e^r, then the result is multiplied
     * by 2^k to obtain the final approximation for e^x.
     * Performance: In the worst case we have 31 operations.
     * - 14 multiplications
     * - 2 divisions
     * - 12 additions
     * - 2 subtractions
     * - 1 rounding
     * - 1 left shift
     * - 1 absolute value
     * Accuracy: Over a sample of 10,000 equi-spaced points in [-709, 709],
     * we have the following error statistics.
     * - Max relative error = 8.01465e-14
     * - Min relative error = 0
     * - Avg relative error = 0
     * - Med relative error = 2.27191e-14
     * - Var relative error = 0
     * - 6.38 percent of the values have less than 14 digits of precision.
     * Args:
     * - (double float) power of e to approximate
     * Returns:
     * - (double float) approximation of e^x
     */
    assert(-709 <= x && x <= 709);
    if (x == 0) {
        return 1;
    }
    double x0 = std::abs(x);
    int k = std::ceil((x0 / M_LN2) - 0.5);
    double r = x0 - (k * M_LN2);
    std::vector<double> coeffs = {
        1.000000000000000,
        1.000000000000000,
        0.500000000000002,
        0.166666666666680,
        0.041666666666727,
        0.008333333333342,
        0.001388888888388,
        1.984126978734782e-4,
        2.480158866546844e-5,
        2.755734045527853e-6,
        2.755715675968011e-7,
        2.504861486483735e-8,
        2.088459690899721e-9,
        1.632461784798319e-10
    };
    // Evaluate the polynomial with Horner's method, starting from the
    // highest-degree coefficient and iterating through the rest in reverse
    double pn = 1.143364767943110e-11;
    for (auto i = coeffs.rbegin(); i != coeffs.rend(); i++) {
        pn = pn * r + *i;
    }
    pn *= std::exp2(k);
    if (x < 0) {
        return 1 / pn;
    }
    return pn;
}
This implementation has a constant time complexity, with 31 operations required to approximate $e^x$ for any $x \in [-709, 709]$. This makes it the most efficient algorithm thus far. On the other hand, its accuracy matches every other implementation using range reduction: it hits at least 13 digits of precision everywhere on the interval, and at least 14 digits of precision for approximately 94% of the sampled values. That being said, there are two considerations worth noting. First, the smoothness characteristics of $e^x$ are such that the Chebyshev and Taylor approximations are asymptotically equivalent, and more importantly, practically equivalent in double precision. This is to say that we could implement the range reduced, truncated Taylor series approximation in an explicit manner and it would be essentially identical to this one in appearance, efficiency and accuracy. While this is true for $e^x$, it is not true for all functions.
Second, we can observe a (very) minor reduction in accuracy going from the algorithm using the Chebyshev basis to the algorithm using the monomial basis, as well as a degradation of the equioscillation property. Consider the error distribution over $[-1, 1]$ when we use the monomial basis instead of the Chebyshev basis.
This is because the conditioning of the monomial basis is worse than that of the Chebyshev basis. For the function $e^x$, the difference in conditioning is negligible. But for other functions, it would be appropriate to explicitly investigate the difference in conditioning, because there could be numerical consequences to changing the basis like this. In the case of $e^x$, the loss in precision resulting from the change in basis is so insignificant that it doesn’t meaningfully change the accuracy characteristics of the new algorithm.
Given that $e^x$ is very well-behaved, let’s motivate Chebyshev interpolation further by showing how it performs with the sine function. To test this we’ll use the Barycentric interpolation algorithm with points $x_0, x_1, \ldots, x_n$ sampled at the Chebyshev nodes. Recall that these are the roots of the Chebyshev polynomials, and by the foregoing error analysis in the section on Lagrange interpolation, should approximately minimize the error of the interpolant. First we will need a function for computing the nodes.
import math

import numpy as np


def cheb_nodes(n, a=-1, b=1):
    # The k-th node is the k-th root of the degree n Chebyshev polynomial,
    # mapped from [-1, 1] onto the interval [a, b]
    nodes = np.zeros(n)
    for k in range(n):
        nodes[k] = ((a + b) / 2) + (((b - a) / 2) * math.cos(((2 * k + 1) * math.pi) / (2 * n)))
    return nodes


xi = cheb_nodes(15)
yi = np.sin(2 * np.pi * xi)
bary_cheb = BarycentricInterpolation(xi, yi)
When we plot the interpolant bary_cheb
against the true value of $\sin(x)$ for $x \in [-1, 1]$ with 10 Chebyshev nodes, we see very similar interpolation characteristics as compared to the Lagrange interpolation with 10 equi-spaced points.
But in contrast to the Lagrange interpolation with equi-spaced points, when we increase the number of interpolation points to 100 we do not observe any catastrophic increase in error at the boundaries of the interval.
The Barycentric interpolation algorithm is not doing the heavy lifting here. What's happening is that the Chebyshev nodes provide a better approximation near the boundaries of the interval under consideration, because they exhibit clustering. Whereas standard Lagrange interpolation samples equi-spaced points, the Chebyshev nodes on an interval $[a, b]$ are naturally clustered closer to the edges. This gives tighter control over the approximation, which reduces both the absolute and relative error of the interpolant at each value of $x$.
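The clustering is easy to see by printing the gaps between consecutive nodes; the sketch below reuses the cheb_nodes helper defined above, and the node count of 15 is an arbitrary choice.

import numpy as np

# The gaps between consecutive Chebyshev nodes shrink toward the ends of the
# interval, while equi-spaced points have uniform gaps.
nodes = np.sort(cheb_nodes(15))
print(np.diff(nodes))                      # small gaps near -1 and 1, larger in the middle
print(np.diff(np.linspace(-1, 1, 15)))     # constant gaps for comparison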
Carathéodory-Fejer approximation
Minimax approximation