# Implementation in a nutshell

If one calls $\mathbf{X}$ the original sample, whose dimension is $(n_X,n_S)$, the idea behind PCA is to find the projection matrix $\mathbf{P}$, of dimension $(n_X,n_X)$, that re-expresses the data optimally as a new sample, called hereafter $\mathbf{Y}$, with the same dimension $(n_X,n_S)$. The rows of $\mathbf{P}$ form a new basis in which to represent the columns of $\mathbf{X}$, and this new basis will later become our principal component directions. Recalling the aim of PCA, the way this projection matrix is determined is crucial and should be designed so as to

- find the best linear combinations of variables, so that the minimum number of rows (principal components) of $\mathbf{P}$ is needed to carry as much inertia as possible;
- rank the principal components, so that if one is not satisfied with the new representation, it is simple to add an extra principal component to improve it.

This can be done by investigating the covariance matrix $\mathbf{C_X}$ of $\mathbf{X}$ which, by definition, describes the linear relations between variables and can be computed from the centered sample matrix $\mathbf{X_C}$[^metho_pca_nutshell] as

[^metho_pca_nutshell]: The centered matrix is defined as $\mathbf{X_C} = \mathbf{X} - \bar{\mathbf{x}}\,\mathbf{1_{n_S}}^T$ where $\bar{\mathbf{x}}$ is the column vector of mean values of every variable and $\mathbf{1_{n_S}}$ is a vector of ones whose dimension is $n_S$.

```{math}
\mathbf{C_X} = \frac{1}{n_S-1}\mathbf{X_C}\mathbf{X_C}^T
```

If one considers the resulting covariance matrix $\mathbf{C_Y}$, the aim is to maximise the signal, measured by the variance (the diagonal entries, which represent the variances of the principal components), while minimising the covariance between them. As the lowest reachable covariance value is 0, if the resulting covariance matrix $\mathbf{C_Y}$ happens to be diagonal, our objectives are achieved. From the very definition of the covariance matrix, one can see that

```{math}
\mathbf{C_Y} = \frac{1}{n_S-1} \mathbf{Y_C} \mathbf{Y_C}^T
             = \frac{1}{n_S-1} (\mathbf{P}\mathbf{X_C}) (\mathbf{P}\mathbf{X_C})^T
             = \frac{1}{n_S-1} \mathbf{P} \mathbf{X_C} \mathbf{X_C}^T \mathbf{P}^T
             = \mathbf{P} \mathbf{C_X} \mathbf{P}^T
```

As $\mathbf{C_X}$ is symmetric, it is orthogonally diagonalisable and can be written $\mathbf{C_X} = \mathbf{E} \mathbf{S} \mathbf{E}^{T}$. In this equation, $\mathbf{E}$ is an orthonormal matrix whose columns are the orthonormal eigenvectors of $\mathbf{C_X}$, and $\mathbf{S}$ is a diagonal matrix whose diagonal entries are the eigenvalues of $\mathbf{C_X}$. Given this, choosing $\mathbf{P}=\mathbf{E}^T$ leads to

```{math}
\mathbf{C_Y} = \mathbf{P} \mathbf{C_X} \mathbf{P}^T = \mathbf{E}^T \mathbf{E} \mathbf{S} \mathbf{E}^{T} \mathbf{E} = \mathbf{S}
```

At this stage, the matrix $\mathbf{S}$ is not unique: the eigenvalues can be permuted along the diagonal in many ways, as long as $\mathbf{E}$ is changed accordingly.
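To make this diagonalisation route concrete, here is a minimal NumPy sketch (the variable names and the synthetic sample are purely illustrative, not taken from the rest of this document): it centers $\mathbf{X}$, diagonalises $\mathbf{C_X}$, builds $\mathbf{P}=\mathbf{E}^T$ and checks that $\mathbf{C_Y}$ comes out diagonal with the eigenvalues on its diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample: n_X = 3 variables (rows), n_S = 500 samples (columns)
n_X, n_S = 3, 500
X = rng.normal(size=(n_X, n_S))
X[2] = 0.8 * X[0] + 0.2 * X[1] + 0.05 * rng.normal(size=n_S)  # correlated variable

# Centre each variable (row) and build the covariance matrix C_X
X_C = X - X.mean(axis=1, keepdims=True)
C_X = X_C @ X_C.T / (n_S - 1)

# Orthogonal diagonalisation C_X = E S E^T (eigh is suited to symmetric matrices)
eigenvalues, E = np.linalg.eigh(C_X)

# Rank the principal components by decreasing eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, E = eigenvalues[order], E[:, order]

# Projection matrix P = E^T and projected sample Y = P X_C
P = E.T
Y = P @ X_C

# The covariance of Y is diagonal, carrying the eigenvalues of C_X
C_Y = Y @ Y.T / (n_S - 1)
print(np.allclose(C_Y, np.diag(eigenvalues)))   # True up to round-off
```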
Finally, an interesting link can be drawn between this protocol and a very classical method of linear algebra, already mentioned elsewhere in this document, called the Singular Value Decomposition (*SVD*[^metho_pca_nutshell_2]), leading to

[^metho_pca_nutshell_2]: The SVD is applied here to a matrix whose number of rows should be greater than its number of columns, hence the transposition of $\mathbf{X_C}$.

**{eq}`svd_formula`: General form of an SVD**

```{math}
:label: svd_formula
\mathbf{X_C}^T = \mathbf{U} \boldsymbol\Sigma \mathbf{V}^{T}
\;\; {\rm where} \;\;
\mathbf{X_C}^T(n_{S},n_{X}), \;\;
\mathbf{U}(n_{S},n_{S}), \;\;
\mathbf{V}(n_{X},n_{X})
\;\; {\rm and} \;\;
\boldsymbol\Sigma(n_{S},n_{X})
```

In this context, $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices (their columns being respectively the *left singular vectors* and *right singular vectors* of $\mathbf{X_C}^T$), while $\boldsymbol\Sigma$ is a diagonal matrix storing the singular values of $\mathbf{X_C}^T$ in decreasing order. The last step is to invoke the linear algebra theorem stating that the non-zero singular values of $\mathbf{X_C}^T$ are the square roots of the non-zero eigenvalues of $\mathbf{X_C}\mathbf{X_C}^T$ and $\mathbf{X_C}^T\mathbf{X_C}$ (the corresponding eigenvectors being the columns of $\mathbf{V}$ and $\mathbf{U}$ respectively). Gathering all this, one can see that by performing the SVD of the centered original sample matrix, the projection matrix can be identified as $\mathbf{P}=\mathbf{V}^T$ and the resulting covariance matrix, $\mathbf{C_Y} = \boldsymbol\Sigma^T\boldsymbol\Sigma / (n_S-1)$, is proportional to the squared singular values. A final interesting property comes from the SVD itself: as $\boldsymbol\Sigma$ gathers the singular values in decreasing order, it ensures the uniqueness of the transformation and gives access to the principal components in a hierarchical way.
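As a complement to the diagonalisation sketch above, here is a minimal NumPy sketch of the SVD route (again on an illustrative synthetic sample): it applies the SVD to $\mathbf{X_C}^T$, identifies $\mathbf{P}=\mathbf{V}^T$ and checks that the squared singular values, divided by $n_S-1$, match the eigenvalues of $\mathbf{C_X}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same layout as before: variables as rows, samples as columns (n_S > n_X)
n_X, n_S = 3, 500
X = rng.normal(size=(n_X, n_S))
X[2] = 0.8 * X[0] + 0.2 * X[1] + 0.05 * rng.normal(size=n_S)

X_C = X - X.mean(axis=1, keepdims=True)

# SVD of the transposed centred sample: X_C^T = U Sigma V^T
U, sigma, Vt = np.linalg.svd(X_C.T, full_matrices=False)

# Projection matrix P = V^T, principal components Y = P X_C
P = Vt
Y = P @ X_C

# Variances of the principal components: sigma^2 / (n_S - 1),
# identical to the eigenvalues of C_X obtained by diagonalisation
variances = sigma**2 / (n_S - 1)
C_X = X_C @ X_C.T / (n_S - 1)
eigenvalues = np.sort(np.linalg.eigvalsh(C_X))[::-1]
print(np.allclose(variances, eigenvalues))   # True up to round-off
```

In practice the SVD route is often preferred, since it avoids forming $\mathbf{C_X}$ explicitly and returns the principal components already ranked by decreasing singular value.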