02409 Multivariate Statistics - PCA Exercises

# Principal Component Analysis on the 'Sundhed' dataset

1. Analysis on all variables

1. Should we analyze the correlation- or the variance-covariance matrix? Why?

2. Perform statistical tests on the eigenvalues in order to assess the relevant number of components to retain!

2. Exclude ‘alder’ (age) and ‘vegt’ (weight)

1. Repeat the above analyses!

3. Condition on alder and vegt

1. Compare the partial correlations, and the eigenstructure based on them, with the results from part 2!

2. Are there big differences?

3. Explain this by means of adequate statistical analyses.
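One possible way to condition on alder and vegt in SAS is the `partial` statement, which both `proc corr` and `proc princomp` accept. The sketch below assumes the data are stored in a data set called `sundhed`; that name is hypothetical, so substitute whatever your own data step creates.

```sas
/* Sketch only: 'sundhed' is an assumed data set name. */

/* Partial correlations of the remaining numeric variables, given alder and vegt */
proc corr data=sundhed;
   partial alder vegt;
run;

/* PCA on the partial correlation matrix (conditioning on alder and vegt) */
proc princomp data=sundhed;
   partial alder vegt;
run;
```

With no `var` statement, both procedures analyze all numeric variables that are not named in the `partial` statement.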

4. Partial correlations based on vegt (by hand!)

1. Reproduce some of the partial correlations by hand, compute the test statistic for assessing whether the true partial correlations are zero, and find the associated p-values!
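As a reminder for the hand calculation, the standard first-order formulas are given below. The symbols are introduced here for illustration: $r_{ij}$ is the ordinary correlation between variables $i$ and $j$, $k$ denotes the conditioning variable (vegt), and $n$ is the number of observations. The test assumes multivariate normality.

$$
r_{ij\cdot k} = \frac{r_{ij} - r_{ik}\, r_{jk}}{\sqrt{\left(1 - r_{ik}^{2}\right)\left(1 - r_{jk}^{2}\right)}}
$$

$$
t = \frac{r_{ij\cdot k}\,\sqrt{n - 3}}{\sqrt{1 - r_{ij\cdot k}^{2}}} \sim t(n - 3) \quad \text{under } H_0 : \rho_{ij\cdot k} = 0
$$

The two-sided p-value is $2\,P\left(T \ge |t|\right)$ with $T \sim t(n-3)$. When conditioning on $q$ variables instead of one, the degrees of freedom become $n - 2 - q$.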

# Principal Component Analysis on Beef characterization

In this exercise you will work on data reflecting different aspects of meat: chemical, physical and sensory attributes.
The example comes from the paper 'The use of principal component analysis (PCA) to characterize beef' by G. Destefanis, M.T. Barge, A. Brugiapaglia and S. Tassone,
where the authors wish to find the parameters that best describe a piece of meat.

The data come from measurements of a certain muscle from young bulls of five different breeds. The breeds and the number of animals of each breed are shown in the table.

| Breed | Number of animals (n) |
|---|---|
| Hypertrophied Piemontese | 23 |
| Normal Piemontese | 12 |
| Hypertrophied Piemontese x Friesian crossbreed | 10 |
| Friesian | 11 |
| Belgian blue and white | 23 |

Eighteen different measurements were made on each animal. The chemical measurements are pH value (pH), water (W), protein (P), ether extract (E), hydroxyproline content (Hy) and heat-solubility of collagen (Cs). The physical parameters are lightness (L) and hue (H), which relate to the color of the meat, drip loss (Dl), cooking loss (Cl) and shear force (WB), which gives an indication of the tenderness of the meat. Finally, seven sensory parameters (A, Te, Tf, Tr, Ji, Js, Oa) were assessed by a trained panel.

Exercises:

1. What are the variables and observations in this problem? And how many of each do we have?

The measurements have different magnitudes, which means that we have to normalize the data in order to get a proper result from our PCA. You are given the correlation matrix of the normalized data in the file PCAbeef.sas and will use it to find the PCs. The file can be found here.
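The link between normalization and the correlation matrix can be written out as follows. The notation is introduced here for illustration: $x_{ij}$ is the value of variable $j$ for animal $i$, $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$, $S$ is the sample covariance matrix and $R$ the sample correlation matrix.

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
R = D^{-1/2}\, S\, D^{-1/2},
\qquad
D = \operatorname{diag}\left(s_1^{2}, \dots, s_p^{2}\right)
$$

The covariance matrix of the standardized values $z_{ij}$ equals $R$, so a PCA on the correlation matrix is the same as a PCA on the covariance matrix of the normalized data.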

2. Replace the ?'s in PCAbeef.sas in order to use the correlation matrix to find the principal components.

3. How much of the variation in the data do the first four PCs explain?

4. Look at the principal components. What kind of parameters dominate the first two PCs? Hint: Look at the magnitude of the elements of the PCs.
In what way can this be related to the correlation matrix? Give a suggestion/strategy for exclusion of variables.
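For reference when reading the output, the quantities asked about in exercises 3 and 4 can be sketched as follows. The notation is introduced here: $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of the correlation matrix $R$, $e_{ji}$ is element $j$ of eigenvector $i$, and $p = 18$ is the number of variables.

$$
\sum_{i=1}^{p} \lambda_i = \operatorname{tr}(R) = p,
\qquad
\text{proportion explained by the first } k \text{ PCs} = \frac{\sum_{i=1}^{k} \lambda_i}{p}
$$

$$
\operatorname{corr}\left(z_j, \mathrm{PC}_i\right) = e_{ji}\,\sqrt{\lambda_i}
$$

The second relation holds for a PCA on the correlation matrix and shows why large elements in a PC point to variables that are strongly correlated with that component.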

SAS tips:
The correlation matrix can be defined in the same way as the covariance matrix in 'SAS introduction, part2' - just replace 'cov' with 'corr' in two places.
'princomp' takes the extra option 'plots=patterns', which plots PC2 vs PC1, PC3 vs PC2 etc. The plots can be found under the princomp results.
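To illustrate the first tip, here is a minimal sketch of how a correlation matrix can be passed to `proc princomp`. The data set name `toy`, the variable names `x1` to `x3` and the matrix values are made up for illustration and have nothing to do with the beef data.

```sas
/* Sketch only: a made-up 3x3 correlation matrix, not the beef data. */
data toy(type=corr);
   _type_ = 'CORR';            /* marks every row as a row of correlations */
   input _name_ $ x1 x2 x3;    /* _name_ holds the variable name of each row */
   cards;
x1 1.0 0.6 0.3
x2 0.6 1.0 0.4
x3 0.3 0.4 1.0
;

/* PCA computed directly from the correlation matrix */
proc princomp data=toy;
   var x1 x2 x3;
run;
```

Because only the correlation matrix is supplied and not the raw observations, component scores cannot be computed from such a data set.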

# Principal Component Analysis on Olympic data

The following is the data with the correct SAS commands so it can be copied into the SAS Program Editor.

data hep;
input name $ 1-20 hurdles highjump shot run200m longjump javelin run800m;
cards;
Joyner Kersee (USA)  12.69 1.86 15.8 22.56 7.27 45.66 128.51
John (GDR)           12.85 1.8 16.23 23.65 6.71 42.56 126.12
Behmer (GDR)         13.2 1.83 14.2 23.1 6.68 44.54 124.2
Sablovskaite (URS)   13.61 1.8 15.23 23.92 6.25 42.78 132.24
Choubenkova (URS)    13.51 1.74 14.76 23.93 6.32 47.46 127.9
Schulz (GDR)         13.75 1.83 13.5 24.65 6.33 42.82 125.79
Fleming (AUS)        13.38 1.8 12.88 23.59 6.37 40.28 132.54
Greiner (USA)        13.55 1.8 14.13 24.48 6.47 38 133.65
Lajbnerova (CZE)     13.63 1.83 14.28 24.86 6.11 42.2 136.05
Bouraga (URS)        13.25 1.77 12.62 23.59 6.28 39.06 134.74
Wijnsma (HOL)        13.75 1.86 13.01 25.03 6.34 37.86 131.49
Dimitrova (BUL)      13.24 1.8 12.88 23.59 6.37 40.28 132.54
Scheider (SWI)       13.85 1.86 11.58 24.87 6.05 47.5 134.93
Braun (FRG)          13.71 1.83 13.16 24.78 6.12 44.58 142.82
Ruotsalainen (FIN)   13.79 1.8 12.32 24.61 6.08 45.44 137.06
Yuping (CHN)         13.93 1.86 14.21 25 6.4 38.6 146.67
Hagger (GB)          13.47 1.8 12.75 25.47 6.34 35.76 138.48
Brown (USA)          14.07 1.83 12.69 24.83 6.13 44.34 146.43
Mulliner (GB)        14.39 1.71 12.68 24.92 6.1 37.76 138.02
Hautenauve (BEL)     14.04 1.77 11.82 25.61 5.99 35.68 133.9
Kytola (FIN)         14.31 1.77 11.66 25.69 5.75 39.48 133.35
Geremias (BRA)       14.23 1.71 12.95 25.5 5.5 39.64 144.02
Hui-Ing (TAI)        14.85 1.68 10 5.23 5.47 39.14 137.3
Jeong-Mi (KOR)       14.53 1.71 10.83 26.61 5.5 39.26 139.17
Launa (PNG)          16.42 1.5 11.78 26.16 4.88 46.38 163.43
;

The data set consists of the results from the 1988 Olympic women’s heptathlon competition in Seoul. The variables are the following:
• hurdles - results 100m hurdles
• highjump - results high jump
• shot - results shot put
• run200m - results 200m race
• longjump - results long jump
• javelin - results javelin
• run800m - results 800m race

Step 1
Make a scatter plot of the data and calculate the correlation coefficients.
• What relationship do you see between the plots and the correlation coefficients?
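One possible way to get both the correlation coefficients and the scatter plots in a single step is sketched below. It assumes the data step above has been run so that the data set `hep` exists.

```sas
ods graphics on;

/* Correlation matrix plus a scatter-plot matrix of all numeric variables in hep. */
/* nvar=all lifts the default limit of five variables in the matrix plot.         */
proc corr data=hep plots=matrix(nvar=all);
run;
```

Since no `var` statement is given, all numeric variables in `hep` are included.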
Step 2
Run a principal component analysis (PCA) and make the score plot.
• How many principal components should you use?
• Which variables are explained by each principal component? What is the relation to the correlation matrix?
• Do you see any outliers?
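A minimal sketch of the PCA and the score plot could look like this. It assumes the data set `hep` from the data step above; the output data set name `hepscores` is made up for illustration.

```sas
/* PCA on the correlation matrix (the default). */
/* The scores are saved in hepscores as Prin1, Prin2, ... */
proc princomp data=hep out=hepscores;
run;

/* Score plot of the first two components, labelled by athlete */
proc sgplot data=hepscores;
   scatter x=Prin1 y=Prin2 / datalabel=name;
run;
```

Labelling the points by `name` makes it easier to identify individual athletes in the plot.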
Step 3
Think about how you could modify the data set so that you can use the variance-covariance matrix instead of the correlation matrix.