Demos
Demo 2: Principal Component Analysis
If you have problems displaying the page, try zooming out (Ctrl+- or Cmd+-) and refreshing the page (F5 or Cmd+R). Remember that you can ask questions or leave a comment on the last text slide. The following demo will visualize how a Principal Component Analysis (PCA) works based on the data set introduced in Demo 1: Iris data. To the right is an interactive figure with various options to play with. You can get a guided tour by navigating these text slides (press the right-left buttons, click the entries underneath, or swipe).
Reasons for doing PCA
PCA is a dimensionality reduction method. This means that a PCA can be used to obtain a new description of a data set in fewer dimensions. Using a PCA allows us to visualize the features of a high-dimensional space in e.g. 2-dimensions. A PCA can also help machine learning models as a pre-processing step.
The data
For this demo, we start by applying a PCA to a 2-dimensional problem (it's easier to visualize). The first plot on the left is a scatter plot of the petal dimensions (lengths and widths) from the iris data. We have coloured the dots based on the kind of iris the observation represents, as we also did in Demo 1.
Projection
The starting point of how a PCA gives us a lower-dimensional representation is that the PCA describes the data in a new coordinate system. Describing a data set in a new coordinate system requires that you project the data onto the basis vectors of the new coordinate system. On the figure below, you see two directions: one blue and one orange (Direction 1 and 2, respectively). If we project each point in the original feature space onto these directions, we get the projection you see on the right plot. Below the plot, you see some sliders (one is called direction1_x). By changing these you change the coordinates of the vectors the data set is projected onto. Try changing the x and y value of the two directions and see how it changes the projection.
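The projection itself is just a matrix product. The sketch below illustrates this with NumPy on a few made-up petal measurements (the data and the two directions are placeholders, not the demo's actual values):

```python
import numpy as np

# Toy stand-in for petal (length, width) measurements, one row per flower
X = np.array([[1.4, 0.2],
              [4.7, 1.4],
              [5.1, 1.9]])

# Two directions (rows), playing the role of the sliders in the figure
directions = np.array([[1.0, 0.0],   # Direction 1
                       [0.0, 1.0]])  # Direction 2

# Projecting every point onto both directions at once
projected = X @ directions.T
print(projected)
```

With these particular directions (the standard basis) the projection equals the original data; changing `directions` is the code analogue of moving the xy-sliders.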
Constraints on the basis
When we do a PCA, we obtain a basis of vectors that are constrained - they follow certain rules. One of these rules is that they form an orthonormal basis. Try clicking the orthonormality button to see how it affects the relationship between the directions. An orthonormal basis is defined by having vectors that are orthogonal (at right angles to each other) and of unit length (normalized vectors). When we impose this on the directions, the only free parameter for our 2-dimensional toy problem is the angle of the first direction, so the xy-sliders are replaced by a single slider controlling the angle of Direction 1.
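In two dimensions, the angle-slider idea can be sketched directly: given an angle for Direction 1, orthonormality fixes Direction 2 up to sign. A minimal illustration (the function name is our own, not part of the demo):

```python
import numpy as np

def basis_from_angle(theta):
    """Orthonormal 2x2 basis where Direction 1 makes angle theta with the x-axis."""
    d1 = np.array([np.cos(theta), np.sin(theta)])
    d2 = np.array([-np.sin(theta), np.cos(theta)])  # rotated 90 degrees from d1
    return np.vstack([d1, d2])

V = basis_from_angle(0.3)
# Orthonormality check: V @ V.T is the identity matrix
print(np.allclose(V @ V.T, np.eye(2)))  # True
```

The check `V @ V.T == I` is exactly what "orthogonal and unit-length" means in matrix form.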
Initial processing of the data
When doing a PCA, the first step is to subtract the mean of each feature from that dimension of the data (see Section 3.3.1 in the book). We can do this in the plot below by pressing Centering. An optional step is to divide each feature by its standard deviation. We can do this by pressing Scaling. Try applying the two processing steps to see how it changes the data we are projecting onto the directions.
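The two processing steps correspond to simple column-wise operations. A sketch with toy data (the numbers are illustrative, not the iris measurements):

```python
import numpy as np

X = np.array([[1.4, 0.2],
              [4.7, 1.4],
              [5.1, 1.9]])

# Centering: subtract each feature's mean, so every column has mean 0
X_centered = X - X.mean(axis=0)

# Scaling (optional): divide by each feature's standard deviation, so every column has std 1
X_scaled = X_centered / X_centered.std(axis=0)

print(X_centered.mean(axis=0))  # ~ [0, 0]
print(X_scaled.std(axis=0))     # ~ [1, 1]
```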
Singular Value Decomposition
Once the pre-processing is done, we compute the Singular Value Decomposition (SVD). By doing this, we obtain a matrix of vectors ($\mathbf{V}$) called the principal directions ($\mathbf{v}_1, \cdots, \mathbf{v}_n$). Try clicking "Set sliders to principal directions" to see how Directions 1 and 2 change. You can also try to omit the processing steps to see how this affects the SVD (and therefore the principal directions). You can always reset the view by pressing Reset, and if the data has disappeared out of view, you can try using Rescale (this does not change the data, just the limits on the axes). The PCA algorithm ensures that the chosen Direction 1 maximizes the variation we see along the projection onto that direction. Notice how most of the range of the projected data is along the "x"-axis (Direction 1-axis) in the projection plot.
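The SVD step can be sketched in a few lines with NumPy (again on placeholder data). The rows of the matrix `Vt` returned by `np.linalg.svd` are the principal directions $\mathbf{v}_1, \cdots, \mathbf{v}_n$:

```python
import numpy as np

X = np.array([[1.4, 0.2],
              [4.7, 1.4],
              [5.1, 1.9]], dtype=float)
X = X - X.mean(axis=0)  # centering first

U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(Vt)  # rows are the principal directions

# Projecting onto the principal directions; variance is largest along Direction 1
projected = X @ Vt.T
print(projected.var(axis=0))
```

Because the singular values in `S` come out in decreasing order, the variance along the first projected axis is guaranteed to be at least as large as along the second.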
Scale difference
You might have noticed that applying Scaling does not change the projection very much. This is because the scaling does not change the features (petal width and length) much relative to each other (they were already on similar scales). However, if we had other features measured on wildly different scales, the scaling would be an important part of the pre-processing. To see why this is, try changing the unit we measure the petal width in to millimeters instead of centimeters (remember to turn off Scaling to see the effect - you might want to rescale, too). When we determine the principal directions now (without scaling), Direction 1 points mostly along the feature measured on the largest scale, but the variation captured by that direction is solely due to the difference in magnitude of the feature.
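This unit effect is easy to reproduce in code. Below, synthetic petal-like data (random placeholder values, not the iris set) has its width converted from centimeters to millimeters; without scaling, the first principal direction then points almost entirely along the millimeter-scale feature:

```python
import numpy as np

rng = np.random.default_rng(0)
length_cm = rng.normal(4.0, 1.0, 50)   # lengths in cm
width_cm = rng.normal(1.2, 0.3, 50)    # widths in cm

# Same data, but width converted to millimetres (x10)
X_mm = np.column_stack([length_cm, width_cm * 10])
X_mm = X_mm - X_mm.mean(axis=0)        # centred, but NOT scaled

_, _, Vt = np.linalg.svd(X_mm, full_matrices=False)
# Direction 1 is dominated by the mm-scale feature (larger second component)
print(np.abs(Vt[0]))
```

Dividing each column by its standard deviation before the SVD would remove this artifact, which is the point of the Scaling button.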
Dimensionality reduction
The original data (on the left) is in two dimensions. The projection you see on the right plot is also in two dimensions, so as of yet we have not reduced the dimensionality of the data in any way (we have merely rotated the coordinate system). The dimensionality reduction is obtained by keeping only some dimensions of the projection in whatever we do with the data afterwards. Since Direction 1 has the largest variation, using this direction alone (the projection onto it), we obtain a 1-dimensional description with as much variance explained as possible. When doing PCA for $N$-dimensional problems, we get $N$ principal directions. We can then choose to include only the number of directions we want. This choice is usually based on the amount of variance explained by the directions, and you can read more about that and PCA in Chapter 3 on page 29.
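The reduction step and the variance-explained criterion can be sketched as follows (synthetic placeholder data; the variance fractions follow from the singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D data with much more spread along one direction
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.5], [0.0, 0.3]])
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the first principal direction -> 1-dimensional representation
X_1d = X @ Vt[0]

# Fraction of the total variance explained by each direction
explained = S**2 / np.sum(S**2)
print(explained)
```

For an $N$-dimensional problem the same code yields $N$ fractions summing to one, and one typically keeps directions until their cumulative sum is "large enough".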