Documentation / Methodological guide:
When using linear regression, one assumes that there is only one output variable and at least one input variable. The data from the training database are stored here in a matrix $\mathbf{X}$ ($n_S \times n_X$), where $n_S$ is the number of elements in the set and $n_X$ is the number of input variables to be used. The idea is to write any output $y$ as $y(\mathbf{x}) = \sum_{j=0}^{p-1} \beta_j h_j(\mathbf{x})$, where $(\beta_j)_{j=0,\dots,p-1}$ are the regression coefficients and $(h_j)_{j=0,\dots,p-1}$ are the regressors: simple functions depending on one or more input variables [7] that will be the new basis for the linear regression. A classical simple case is to have $h_0(\mathbf{x}) = 1$ and $h_j(\mathbf{x}) = x_j$ for $j = 1, \dots, n_X$ (so that $p = n_X + 1$).
The regressor matrix $H$ ($n_S \times p$) is then constructed from the training data and is filled with
$$H_{ij} = h_j(\mathbf{x}^{(i)}), \quad i = 1, \dots, n_S, \quad j = 0, \dots, p-1.$$
In the case where the number of points ($n_S$) is greater than the number of input variables ($n_X$), this estimation is just a minimisation of $\|\mathbf{y} - H\boldsymbol{\beta}\|^2$, which leads to the general form of the solution $\hat{\boldsymbol{\beta}} = (H^T H)^{-1} H^T \mathbf{y}$. From this, the estimated values of the output from the regression are computed as $\hat{\mathbf{y}} = H\hat{\boldsymbol{\beta}} = P\mathbf{y}$, if one calls $P = H (H^T H)^{-1} H^T$.
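To make the construction concrete, here is a minimal sketch in Python/NumPy (the language, the synthetic data and the variable names are illustrative, not part of the guide itself): it builds the regressor matrix $H$ for the classical simple case above and solves the least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_S, n_X = 50, 3                       # number of samples and of input variables

# Illustrative training data: inputs X (n_S x n_X) and output y (n_S,)
X = rng.uniform(-1.0, 1.0, size=(n_S, n_X))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + 0.05 * rng.standard_normal(n_S)

# Regressor matrix for the simple case: h_0 = 1, h_j(x) = x_j  (p = n_X + 1)
H = np.column_stack([np.ones(n_S), X])

# Least-squares solution: beta_hat = (H^T H)^{-1} H^T y
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)

# Projection ("hat") matrix P and estimated outputs y_hat = H beta_hat = P y
P = H @ np.linalg.solve(H.T @ H, H.T)
y_hat = H @ beta_hat

print(beta_hat)                        # close to [2.0, 1.5, -0.7, 0.3]
```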
As a result, a vector of parameters $\hat{\boldsymbol{\beta}}$ is computed and used to re-estimate the output values. A few quality criteria are also computed, such as the coefficient of determination $R^2$ and the adjusted one, $R^2_{adj}$. There is an interesting interpretation of the $R^2$ criterion, in the specific case of a linear regression, coming from the previously introduced matrix $P$, once considered as a projection matrix. It is indeed symmetric and the following relation holds: $P^2 = P$, so the estimation by the linear regression is an orthogonal projection of $\mathbf{y}$ onto the subspace spanned by the columns of $H$. This is depicted in Figure IV.4: it shows that the variance, $\|\mathbf{y} - \bar{y}\mathbf{1}\|^2$ (with $\bar{y}$ the mean of the observed outputs and $\mathbf{1}$ the vector of ones), can be decomposed into its component explained by the model, $\|\hat{\mathbf{y}} - \bar{y}\mathbf{1}\|^2$, and the residual part, $\|\mathbf{y} - \hat{\mathbf{y}}\|^2$. From this, the formula in Equation IV.1 can also be written
$$R^2 = \frac{\|\hat{\mathbf{y}} - \bar{y}\mathbf{1}\|^2}{\|\mathbf{y} - \bar{y}\mathbf{1}\|^2}.$$
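Continuing the same illustrative sketch, the following lines compute the two quality criteria and check numerically that $P$ is a symmetric, idempotent projection matrix, so that the variance decomposition above holds. The adjusted formula used here, $R^2_{adj} = 1 - \frac{n_S - 1}{n_S - p}(1 - R^2)$, is the usual textbook one and is an assumption, not quoted from the guide.

```python
# Quality criteria (reusing H, P, y, y_hat from the sketch above)
p = H.shape[1]
y_bar = y.mean()

ss_tot = np.sum((y - y_bar) ** 2)      # total variance (up to a 1/n_S factor)
ss_exp = np.sum((y_hat - y_bar) ** 2)  # part explained by the model
ss_res = np.sum((y - y_hat) ** 2)      # residual part

R2 = ss_exp / ss_tot
R2_adj = 1.0 - (n_S - 1) / (n_S - p) * (1.0 - R2)   # usual adjusted formula

# P is a symmetric, idempotent (projection) matrix, hence the decomposition
assert np.allclose(P, P.T)                      # symmetry
assert np.allclose(P @ P, P)                    # idempotence: P^2 = P
assert np.isclose(ss_tot, ss_exp + ss_res)      # Pythagorean decomposition

print(R2, R2_adj)
```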
Figure IV.4. Schematic view of the projection of the original value from the code onto the subspace spanned by the columns of H (in blue).
For theoretical completeness, in most cases the matrix $H$ is decomposed following a Singular Value Decomposition (SVD) such that $H = U S V^T$. In this context, $U$ and $V$ are orthogonal matrices and $S$ is a diagonal matrix (that can also be stored as a vector of singular values). This decomposition always exists, assuming that the number of samples is greater than the number of inputs ($n_S > n_X$). This has two main advantages: the first one is that there is no matrix inversion to be performed, since the solution becomes $\hat{\boldsymbol{\beta}} = V S^{-1} U^T \mathbf{y}$ (where $S^{-1}$ only requires inverting the diagonal entries), which implies that this procedure is more robust. The second advantage appears when considering the matrix that links directly the output variable and its estimation through the surrogate model: it can now simply be written as $P = U U^T$. This is highly practical once one knows that this matrix is used to compute the Leave-One-Out uncertainty, only considering its diagonal components $P_{ii}$.
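The SVD route can be sketched in the same illustrative way. The Leave-One-Out formula below, $\Delta_i = (y_i - \hat{y}_i)/(1 - P_{ii})$, is the standard identity for linear smoothers; the exact convention the guide uses for the Leave-One-Out uncertainty is assumed here, not quoted.

```python
# Thin SVD of the regressor matrix: H = U diag(s) V^T
U, s, Vt = np.linalg.svd(H, full_matrices=False)

# Solution without forming (H^T H)^{-1}: beta_hat = V diag(1/s) U^T y
beta_svd = Vt.T @ ((U.T @ y) / s)
assert np.allclose(beta_svd, beta_hat)

# The projection matrix simplifies to P = U U^T
P_svd = U @ U.T
assert np.allclose(P_svd, P)

# Leave-One-Out residuals from the diagonal of P alone:
# delta_i = (y_i - y_hat_i) / (1 - P_ii)   [standard linear-smoother identity]
loo_residuals = (y - y_hat) / (1.0 - np.diag(P_svd))
loo_rmse = np.sqrt(np.mean(loo_residuals ** 2))
print(loo_rmse)
```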