# Least Squares, Method of


a method in the theory of errors for estimating unknown quantities from measurement results containing random errors. The method is also used to approximate a given function by other (simpler) functions and often proves useful in the analysis of observational data (calculus of observations).

The method of least squares was proposed by C. F. Gauss (1794–95) and A. M. Legendre (1805–06). It was first used in the analysis of astronomical and geodetic observations. A rigorous mathematical foundation and a determination of the limits of the method’s meaningful applicability were provided by A. A. Markov (the elder) and A. N. Kolmogorov. The method is now one of the most important branches of mathematical statistics and is widely used for statistical inference in many branches of science and engineering.

According to Gauss, the basis of the method of least squares is essentially the assumption that the “loss” resulting from the replacement of the exact (unknown) value of a physical quantity μ by its approximate value *X*, calculated from the results of observations, is proportional to the square of the error (*X* − μ)^{2}. Under these conditions, the optimal estimate is naturally taken to be the value *X* that is free of systematic error and minimizes the mean “loss.” It is precisely this requirement that constitutes the basis of the method of least squares. In the general case, the problem of finding an optimal estimate *X* in the sense of the method of least squares is extremely complex, so in actual practice the problem is narrowed: *X* is chosen as the linear function of the observed values that is free of systematic errors and has the minimum mean loss among all linear functions. If the random observation errors obey a normal distribution and the quantity μ to be estimated depends linearly on the mean values of the observation results (a case often encountered in applications of the method of least squares), then the solution of this narrower problem is simultaneously the solution of the general problem. In this case the optimal estimate *X* obeys a normal distribution with mean value μ, and consequently the probability density of the random variable *X*,

p(x; μ) = (1/σ√(2π)) exp [−(x − μ)^{2}/2σ^{2}]

regarded as a function of μ, attains, at *x* = *X*, its maximum at the point μ = *X*. (This property expresses the precise meaning of the prevalent assertion in the theory of errors that “the estimate *X* calculated according to the method of least squares is the most likely value of the unknown parameter μ.”)

**Case of a single unknown.** Suppose *n* independent observations giving results *Y*_{1}, *Y*_{2}, . . . , *Y*_{n} have been carried out to estimate the value of the unknown μ, that is, *Y*_{1} = μ + δ_{1}, *Y*_{2} = μ + δ_{2}, . . . , *Y*_{n} = μ + δ_{n}, where δ_{1}, δ_{2}, . . . , δ_{n} are the random errors. (By the definition accepted in classical error theory, random errors are independent random variables with zero mathematical expectation: Eδ_{i} = 0; if Eδ_{i} ≠ 0, the Eδ_{i} are said to be systematic errors.) According to the method of least squares, we take as the estimate of μ that *X* for which the sum of squares

S(*X*) = Σ p_{i}(*Y*_{i} − *X*)^{2}

is minimized, where p_{i} = *k*/σ_{i}^{2} and σ_{i}^{2} = Dδ_{i} = Eδ_{i}^{2} (the coefficient *k* > 0 may be selected arbitrarily). The quantity p_{i} is called the weight, and σ_{i} the standard deviation, of the measurement with number *i*. In particular, if the measurements are all equally precise, then σ_{1} = σ_{2} = . . . = σ_{n}, and in this case we may set p_{1} = p_{2} = . . . = p_{n} = 1. But if each *Y*_{i} is the arithmetic mean of n_{i} equally precise measurements, then we set p_{i} = n_{i}.

Table 1

| *n* | 2 | 3 | 4 | 5 | 10 | 20 | 30 |
|---|---|---|---|---|---|---|---|
| *t* | 63.66 | 9.92 | 5.84 | 4.60 | 3.25 | 2.86 | 2.76 |

The sum S(*X*) is least if we select as *X* the weighted mean

*Ȳ* = (1/*P*) Σ p_{i}*Y*_{i}

where *P* = Σ p_{i}. The estimate *Ȳ* of μ is free of systematic errors, has weight *P*, and has variance D*Ȳ* = *k*/*P*. In particular, if the measurements are all equally precise, then *Ȳ* is the arithmetic mean of the measurement results:

*Ȳ* = (1/*n*) Σ *Y*_{i}
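As an illustrative sketch (the measurement values and standard deviations below are hypothetical), the weighted mean and its variance follow directly from the definitions above:

```python
# Weighted-mean estimate: Ybar = (1/P) * sum(p_i * Y_i), where the weights
# are p_i = k / sigma_i**2 and P = sum(p_i); the factor k > 0 is arbitrary.

def weighted_mean(Y, sigma, k=1.0):
    p = [k / s**2 for s in sigma]                     # weights p_i
    P = sum(p)                                        # total weight P
    ybar = sum(pi * yi for pi, yi in zip(p, Y)) / P   # weighted mean
    var = k / P                                       # D(Ybar) = k / P
    return ybar, var

# Hypothetical measurements of one quantity with unequal precision
Y = [10.1, 9.8, 10.3]
sigma = [0.1, 0.2, 0.3]
ybar, var = weighted_mean(Y, sigma)
```

Since *P* = *k* Σ σ_{i}^{−2}, the variance *k*/*P* = 1/Σ σ_{i}^{−2} does not depend on the arbitrary factor *k*, and neither does *Ȳ*.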

It can be proved under certain general assumptions that if the number *n* of observations is sufficiently large, then the distribution of the estimate *Ȳ* differs little from a normal distribution with mathematical expectation μ and variance *k*/*P*. In this case, the absolute error of the approximate equality μ ≈ *Ȳ* is less than t√(*k*/*P*) with probability nearly equal to the value of the integral

I(t) = √(2/π) ∫_{0}^{t} e^{−u^{2}/2} du   (1)

For example, I(1.96) = 0.950, I(2.58) = 0.990, and I(3.00) = 0.997.
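These values can be verified with the Python standard library, since I(t) coincides with erf(t/√2), the probability that a standard normal variable falls in (−t, t):

```python
import math

def I(t):
    # I(t) = sqrt(2/pi) * integral_0^t exp(-u**2 / 2) du = erf(t / sqrt(2))
    return math.erf(t / math.sqrt(2))

print(round(I(1.96), 3), round(I(2.58), 3), round(I(3.00), 3))
# prints: 0.95 0.99 0.997
```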

If the weights p_{i} of the measurements are given but the factor *k* remains undetermined prior to the observations, then this factor and the dispersion of the estimate *Ȳ* can be approximated by the formulas

*k* ≈ S(*Ȳ*)/(*n* − 1)

and

D*Ȳ* = *k*/*P* ≈ s^{2} = S(*Ȳ*)/[(*n* − 1)*P*]

(neither estimate has systematic errors).
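A sketch of these estimates on hypothetical, equally precise data (p_{i} = 1, so *k* is the variance of a single measurement and S(*Ȳ*) is the usual sum of squared residuals):

```python
def estimate_k(Y, p):
    # Ybar minimizes S(X) = sum(p_i * (Y_i - X)**2); then
    # k ≈ S(Ybar)/(n - 1) and D(Ybar) = k/P ≈ S(Ybar)/((n - 1) * P)
    P = sum(p)
    ybar = sum(pi * yi for pi, yi in zip(p, Y)) / P
    S = sum(pi * (yi - ybar)**2 for pi, yi in zip(p, Y))
    n = len(Y)
    k_hat = S / (n - 1)                # variance per unit weight
    var_ybar = S / ((n - 1) * P)       # estimate of D(Ybar)
    return ybar, k_hat, var_ybar

Y = [10.02, 9.97, 10.01, 9.99, 10.03]   # hypothetical equally precise data
ybar, k_hat, var_ybar = estimate_k(Y, [1.0] * len(Y))
```

With unit weights, `k_hat` is the ordinary sample variance and `var_ybar` = `k_hat`/n, the familiar variance of the arithmetic mean.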

In the case, important in practice, when the errors δ_{i} obey a normal distribution, it is possible to find the exact probability with which the absolute error of the approximate equality μ ≈ *Ȳ* is less than ts (*t* is an arbitrary positive number). This probability, regarded as a function of *t*, is called the Student distribution function with *n* − 1 degrees of freedom and is calculated from the formula

I_{n−1}(t) = C_{n−1} ∫_{0}^{t} (1 + v^{2}/(*n* − 1))^{−n/2} dv   (2)

where the constant C_{n−1} is selected so that I_{n−1}(∞) = 1. For large *n*, equation (2) can be replaced by equation (1); for small *n*, however, the use of equation (1) leads to large errors. For example, according to equation (1), the value I = 0.99 corresponds to *t* = 2.58; the true values of *t* for small *n*, found by solving the equation I_{n−1}(t) = 0.99, are presented in Table 1.
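Formula (2) can be evaluated with only the standard library: the constant C_{n−1} is expressible through the gamma function, C_{n−1} = 2Γ(n/2)/[√((n − 1)π) Γ((n − 1)/2)], and the integral can be handled by Simpson’s rule. The sketch below reproduces the first entry of Table 1 and the value I_{9}(2.262) ≈ 0.95 used in the example that follows.

```python
import math

def student_I(df, t, steps=20000):
    # Two-sided Student probability, formula (2):
    # I_df(t) = C_df * integral_0^t (1 + v**2/df) ** (-(df + 1)/2) dv,
    # with C_df = 2 * gamma((df + 1)/2) / (sqrt(df * pi) * gamma(df / 2))
    # chosen so that I_df(infinity) = 1.
    C = 2 * math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    f = lambda v: (1 + v * v / df) ** (-(df + 1) / 2)
    h = t / steps
    total = f(0) + f(t)            # Simpson's rule; steps must be even
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return C * total * h / 3

print(round(student_I(1, 63.66), 2))    # first column of Table 1: 0.99
print(round(student_I(9, 2.262), 2))    # 0.95
```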

EXAMPLE. To determine the mass of some body, ten independent, equally precise weighings were carried out, giving the results *Y*_{i} (in grams) shown in Table 2. Here n_{i} is the number of weighings in which the value *Y*_{i} was observed; n = Σ n_{i} = 10. Since all the weighings are equally precise, we set p_{i} = n_{i} and select as the estimate of the unknown mass μ the quantity

*Ȳ* = Σ n_{i}*Y*_{i}/Σ n_{i} = 18.431

Setting, for example, I_{9} = 0.95 and using tables of the Student distribution with nine degrees of freedom, we find *t* = 2.262; we therefore take the quantity

ts = 2.262 √(S(*Ȳ*)/(9 · 10)) ≈ 0.011

as a bound on the absolute error of the approximate equality μ ≈ 18.431. Thus, 18.420 < μ < 18.442.
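The figures of this example can be reproduced from Table 2 (the value t = 2.262 for I_{9} = 0.95 is taken from the text):

```python
import math

values = [18.41, 18.42, 18.43, 18.44, 18.45, 18.46]   # Y_i from Table 2
counts = [1, 3, 3, 1, 1, 1]                           # n_i from Table 2

n = sum(counts)                                        # 10 weighings
ybar = sum(c * v for c, v in zip(counts, values)) / n  # weighted mean
S = sum(c * (v - ybar)**2 for c, v in zip(counts, values))
s = math.sqrt(S / ((n - 1) * n))    # s**2 = S(Ybar) / [(n - 1) * P], P = n
bound = 2.262 * s                   # Student bound for I_9 = 0.95
print(round(ybar, 3), round(bound, 3))   # 18.431 0.011
```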

**Case of several unknowns (linear relations).** Suppose the *n* measurement results *Y*_{1}, *Y*_{2}, . . . , *Y*_{n} are related to the *m* unknowns x_{1}, x_{2}, . . . , x_{m} (*m* < *n*) by the independent linear equations

*Y*_{i} = a_{i1}x_{1} + a_{i2}x_{2} + . . . + a_{im}x_{m} + δ_{i},  i = 1, 2, . . . , n   (3)

where the a_{ij} are known coefficients and the δ_{i} are independent random errors of the measurements. It is desired to estimate the unknowns x_{j}. (This problem may be considered a generalization of the preceding one, in which μ = x_{1} and *m* = 1, a_{i1} = 1, i = 1, 2, . . . , n.)

Since Eδ_{i} = 0, the mean values y_{i} = E*Y*_{i} of the measurement results are related to the unknowns x_{1}, x_{2}, . . . , x_{m} by the linear equations (linear constraints)

y_{i} = a_{i1}x_{1} + a_{i2}x_{2} + . . . + a_{im}x_{m},  i = 1, 2, . . . , n   (4)

Consequently, the unknown values x_{j} constitute a solution of the system (4), whose equations are assumed to be consistent. The precise values y_{i} of the measured variables and the random errors δ_{i} are usually unknown, and therefore in place of the systems (3) and (4) it is customary to write what are called conditional equations:

*Y*_{i} ≈ a_{i1}x_{1} + a_{i2}x_{2} + . . . + a_{im}x_{m},  i = 1, 2, . . . , n

According to the method of least squares, we take as the estimates of the unknowns x_{j} those values *X*_{j} for which the sum of the squares of the deviations

Δ_{i} = *Y*_{i} − (a_{i1}*X*_{1} + a_{i2}*X*_{2} + . . . + a_{im}*X*_{m})

is a minimum (as in the preceding case, p_{i}, the weight of the measurement *Y*_{i}, is inversely proportional to the dispersion of the random error δ_{i}). As a rule, the conditional equations are not consistent; that is, for no values of the *X*_{j} can the differences Δ_{i} all vanish, and then the sum S = Σ p_{i}Δ_{i}^{2} also cannot vanish. The method of least squares prescribes that we take as the estimates those values of the *X*_{j} that minimize the sum S. In the exceptional cases when the conditional equations are consistent, that is, have a solution, this solution coincides with the estimates obtained by the method of least squares.

The sum of squares S is a quadratic polynomial in the variables *X*_{j}. This polynomial attains a minimum at the values *X*_{1}, *X*_{2}, . . . , *X*_{m} for which all of its first partial derivatives vanish:

∂S/∂*X*_{j} = 0,  j = 1, 2, . . . , m

It follows that the estimates *X*_{j} obtained by the method of least squares satisfy the system of “normal” equations, which in the notation proposed by Gauss has the form

[pa_{j}a_{1}]*X*_{1} + [pa_{j}a_{2}]*X*_{2} + . . . + [pa_{j}a_{m}]*X*_{m} = [pYa_{j}],  j = 1, 2, . . . , m   (5)

where

[pa_{j}a_{k}] = Σ_{i} p_{i}a_{ij}a_{ik},  [pYa_{j}] = Σ_{i} p_{i}*Y*_{i}a_{ij}

Table 2

| Y_{i} | 18.41 | 18.42 | 18.43 | 18.44 | 18.45 | 18.46 |
|---|---|---|---|---|---|---|
| n_{i} | 1 | 3 | 3 | 1 | 1 | 1 |

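A minimal sketch (with hypothetical data) of forming and solving the normal equations (5): the coefficients [pa_{j}a_{k}] and right-hand sides [pYa_{j}] are accumulated directly from the conditional equations, and the resulting m × m system is solved by Gaussian elimination.

```python
def solve_normal_equations(A, Y, p):
    """Least-squares estimates X_j from conditional equations Y ≈ A X
    with weights p: solve [pa_j a_k] X = [pY a_j], j = 1, ..., m."""
    n, m = len(A), len(A[0])
    N = [[sum(p[i] * A[i][j] * A[i][k] for i in range(n)) for k in range(m)]
         for j in range(m)]                       # Gauss brackets [pa_j a_k]
    b = [sum(p[i] * Y[i] * A[i][j] for i in range(n)) for j in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(N[r][col]))
        N[col], N[piv] = N[piv], N[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = N[r][col] / N[col][col]
            for c in range(col, m):
                N[r][c] -= f * N[col][c]
            b[r] -= f * b[col]
    X = [0.0] * m
    for j in reversed(range(m)):
        X[j] = (b[j] - sum(N[j][k] * X[k] for k in range(j + 1, m))) / N[j][j]
    return X

# Hypothetical check: data generated exactly by y = 1 + 2t is recovered
A = [[1.0, t] for t in (0.0, 1.0, 2.0, 3.0)]
X = solve_normal_equations(A, [1.0, 3.0, 5.0, 7.0], [1.0] * 4)
```

Since the conditional equations in this check are consistent, the least-squares solution coincides with the exact one, X = (1, 2), as the text notes.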
The estimates *X*_{j} obtained by solving the system of normal equations are free of systematic errors (E*X*_{j} = x_{j}). The variances D*X*_{j} of the *X*_{j} are equal to *k*d_{jj}/*d*, where *d* is the determinant of the system (5) and d_{jj} is the minor corresponding to the diagonal element [pa_{j}a_{j}]. (In other words, *d*/d_{jj} is the weight of the estimate *X*_{j}.) If the proportionality factor *k* (*k* is called the variance per unit weight) is unknown, then in order to estimate it, as well as the dispersions D*X*_{j}, we use the formulas

*k* ≈ S/(*n* − *m*)

and

D*X*_{j} ≈ s_{j}^{2} = S d_{jj}/[(*n* − *m*)*d*]

(S is the minimum value of the initial sum of squares.)

Under certain general assumptions it can be proved that if the number *n* of observations is sufficiently large, then the absolute error of the approximate equality x_{j} ≈ *X*_{j} is less than ts_{j} with probability close to the value of the integral (1). If the random errors of the observations δ_{i} obey a normal distribution, then all the ratios (*X*_{j} − x_{j})/s_{j} are distributed according to the Student distribution with *n* − *m* degrees of freedom, and a precise bound on the absolute error of the approximate equality is obtained here using the integral of equation (2), just as in the case of a single unknown. In addition, the minimum value of the sum S is independent, in the probabilistic sense, of *X*_{1}, *X*_{2}, . . . , *X*_{m}, so that the approximate values s_{j}^{2} of the dispersions D*X*_{j} are independent of the estimates *X*_{j} themselves.

One of the most typical applications of the method of least squares is the “smoothing” of observation results *Y*_{i} by taking a_{ij} = a_{j}(t_{i}) in equations (3), where the a_{j}(t) are known functions of some parameter *t* (if *t* is time, then t_{1}, t_{2}, . . . are the moments at which the observations are performed). The case of polynomial interpolation, when the a_{j}(t) are polynomials [for example, a_{1}(t) = 1, a_{2}(t) = t, a_{3}(t) = t^{2}, . . . ], is often encountered in applications. If t_{2} − t_{1} = t_{3} − t_{2} = . . . = t_{n} − t_{n−1} and the observations are equally precise, then tables of orthogonal polynomials, available in many handbooks on modern computational mathematics, can be used to calculate the estimates *X*_{j}. A second case important in applications is harmonic interpolation, in which trigonometric functions are taken as the a_{j}(t), for example, a_{j}(t) = cos (*j* − 1)t, *j* = 1, 2, . . . , m.
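A sketch of polynomial smoothing with a_{1}(t) = 1, a_{2}(t) = t, a_{3}(t) = t^{2} (the data are hypothetical and noise-free, so the fit recovers the generating coefficients exactly). The 3 × 3 normal system is solved here by Cramer’s rule, which also exhibits the determinant *d* of the normal system used in the variance formulas above:

```python
def det3(M):
    # determinant of a 3 x 3 matrix (the quantity d for the normal system)
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

ts = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
Y = [2 - t + 0.5 * t * t for t in ts]      # hypothetical exact observations
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]   # a_1, a_2, a_3
A = [[f(t) for f in basis] for t in ts]

# Normal equations with equal weights p_i = 1
N = [[sum(row[j] * row[k] for row in A) for k in range(3)] for j in range(3)]
b = [sum(y * row[j] for y, row in zip(Y, A)) for j in range(3)]

# Cramer's rule: X_j = det(N with column j replaced by b) / det(N)
d = det3(N)
X = []
for j in range(3):
    Nj = [row[:] for row in N]
    for r in range(3):
        Nj[r][j] = b[r]
    X.append(det3(Nj) / d)
# X recovers the coefficients (2, -1, 0.5) of 2 - t + 0.5 t^2
```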

EXAMPLE. To estimate the accuracy of a method of chemical analysis by means of the method of least squares, the concentration of CaO was determined in ten reference samples of previously known composition. The results of the equally precise observations are given in Table 3, where *i* is the number of the experiment, t_{i} is the true CaO concentration, T_{i} is the CaO concentration determined by the chemical analysis, and *Y*_{i} = T_{i} − t_{i} is the error of the chemical analysis.

If the results of the chemical analysis are free of systematic errors, then E*Y*_{i} = 0. But if such errors exist, then to a first approximation they may be represented in the form E*Y*_{i} = α + βt_{i} (α is called the systematic error and βt_{i} the method error), or, what is the same thing,

E*Y*_{i} = (α + βt̄) + β(t_{i} − t̄)

where

t̄ = (t_{1} + t_{2} + . . . + t_{10})/10 = 23.25
To find estimates of α and β, it suffices to estimate the coefficients x_{1} = α + βt̄ and x_{2} = β. In this case the conditional equations have the form

*Y*_{i} = x_{1} + x_{2}(t_{i} − t̄),  i = 1, 2, . . . , 10

Therefore, a_{i1} = 1 and a_{i2} = t_{i} − t̄, and all the p_{i} are equal to 1, since the observations are equally precise. Since [a_{1}a_{2}] = [a_{2}a_{1}] = Σ(t_{i} − t̄) = 0, the system of normal equations takes the particularly simple form

[a_{1}a_{1}]*X*_{1} = [Ya_{1}],  [a_{2}a_{2}]*X*_{2} = [Ya_{2}]

where

[a_{1}a_{1}] = 10,  [a_{2}a_{2}] = Σ(t_{i} − t̄)^{2} = 1569,  [Ya_{1}] = Σ*Y*_{i} = −3.5,  [Ya_{2}] = Σ*Y*_{i}(t_{i} − t̄) = −8.225
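The decoupling used here is worth noting: because Σ(t_{i} − t̄) = 0, the off-diagonal Gauss bracket [a_{1}a_{2}] vanishes and each unknown is estimated independently. A sketch with hypothetical data drawn from E Y = α + βt with α = 0.5, β = 0.2:

```python
def fit_centered_line(ts, Y):
    # Conditional equations Y_i = x1 + x2 * (t_i - tbar); since
    # sum(t_i - tbar) = 0 the normal equations decouple:
    #   X1 = [Ya1]/[a1a1] = sum(Y_i) / n   (estimates alpha + beta * tbar)
    #   X2 = [Ya2]/[a2a2]                  (estimates beta)
    n = len(ts)
    tbar = sum(ts) / n
    X1 = sum(Y) / n
    X2 = (sum(y * (t - tbar) for y, t in zip(Y, ts))
          / sum((t - tbar)**2 for t in ts))
    return X1, X2, tbar

ts = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [0.5 + 0.2 * t for t in ts]      # hypothetical noise-free errors
X1, X2, tbar = fit_centered_line(ts, Y)
```

Here `X2` returns β = 0.2 and `X1` returns α + βt̄ = 0.9; with noisy data the same formulas give the least-squares estimates.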

The variances of the components of the solution of this system are given by

D*X*_{1} = *k*/[a_{1}a_{1}] = *k*/10

and

D*X*_{2} = *k*/[a_{2}a_{2}] = *k*/1569

where *k* is the unknown variance per unit weight (in the given case, *k* is the variance of any one of the *Y*_{i}). Since the components of the solution here take the values *X*_{1} = −0.35 and *X*_{2} = −0.00524, we have

*k* ≈ S/(*n* − *m*) = S/8,  s_{1} = √(*k*/10) ≈ 0.065,  s_{2} = √(*k*/1569) ≈ 0.0052

If the random observation errors obey a normal distribution, the ratios ǀ*X*_{j} − x_{j}ǀ/s_{j} (j = 1, 2) are distributed according to the Student distribution. In particular, if the observation results are free of systematic errors, then x_{1} = x_{2} = 0, and therefore the ratios ǀ*X*_{1}ǀ/s_{1} and ǀ*X*_{2}ǀ/s_{2} must obey the Student distribution. Using tables of the Student distribution with *n* − *m* = 8 degrees of freedom, we can verify that if in fact x_{1} = x_{2} = 0, then each of these ratios is at most 5.04 with probability 0.999 and at most 2.31 with probability 0.95. In this case, ǀ*X*_{1}ǀ/s_{1} = 5.38 > 5.04, so we should reject the hypothesis that systematic errors are absent. At the same time, the hypothesis that method errors are absent (x_{2} = 0) is consistent with the observed results, since ǀ*X*_{2}ǀ/s_{2} = 1.004 < 2.31. We may therefore conclude that the approximate formula *t* = *T* + 0.35 should be used to determine *t* from the observed value *T*.

In many cases of practical importance, and particularly in estimating complicated nonlinear relations, the number of unknown parameters may be quite large and the method of least squares will turn out to be effective only when modern computing technology is used.

### REFERENCES

Markov, A. A. *Ischislenie veroiatnostei*, 4th ed. Moscow, 1924.
Kolmogorov, A. N. “K obosnovaniiu metoda naimen’shikh kvadratov.” *Uspekhi matematicheskikh nauk*, 1946, vol. 1, issue 1.
Linnik, Iu. V. *Metod naimen’shikh kvadratov i osnovy matematiko-statisticheskoi teorii obrabotki nabliudenii*, 2nd ed. Moscow, 1962.
Helmert, F. R. *Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate . . .*, 2nd ed. Leipzig, 1907.

Table 3

| i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| t_{i} | 4 | 8 | 12.5 | 16 | 20 | 25 | 31 | 36 | 40 | 40 |
| Y_{i} | −0.3 | −0.2 | −0.4 | −0.4 | −0.2 | −0.5 | −0.1 | −0.5 | −0.6 | −0.5 |