SAS exercises on discriminant analysis (and visualization)

1. Print dataset

 There is a dataset "IRIS" on the disk. Print it using:

proc print data=stat2.iris;
run;

The variables correspond to the examples 6.5 and 6.9 from the book.

SEPALLEN = length of sepal
SEPALWID = width of sepal
PETALLEN = length of petal
PETALWID = width of petal
SPECIES = type of iris

2. Try the following SAS job:

proc discrim data=stat2.iris wcov wcorr pcov pcorr list
pool=yes;
class species;
run;

3. Experiment with priors
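By default PROC DISCRIM assumes equal prior probabilities for the classes. One way to experiment is the PRIORS statement; the sketch below uses proportional priors (estimated from the class frequencies in the data), and the comment notes the alternatives — adapt as needed:

proc discrim data=stat2.iris list pool=yes;
class species;
* alternatives: "priors equal;" (the default) or an explicit list of probabilities;
priors proportional;
run;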

4. Training and test set

Classifying the same observations that were used for "training" is not a sound idea. Two other datasets on the disk let you "train" on one set and "classify" the other. The datasets are "IRISCALI" and "IRISTEST". Print them using:

proc print data=stat2.iriscali;
run;
proc print data=stat2.iristest;
run;

The two datasets together form the complete "IRIS" dataset.

5. Try the following SAS job:

proc discrim data=stat2.iriscali wcov wcorr pcov pcorr list
pool=yes testdata=stat2.iristest testlist;
class species;
testclass species;
run;

6. Try the following SAS job:

proc candisc data=stat2.iris out=toplot distance anova;
class species;

proc plot data=toplot;
plot can2*can1=species;
run;

7. Explore the dataset

By now you should have a good display of how well the data are separated in just 2 dimensions.

8. Check the results

Check the conclusions from last time against this intuitive plot.

9. 3D visualization

You can do the same for a 3D-plot and obtain an even better view of the data. Try!
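Note that PROC CANDISC produces at most (number of classes − 1) canonical variables, so with three species there are only two; a 3D view must therefore use the original measurements. One non-interactive sketch uses PROC G3D from SAS/GRAPH (assuming it is available on your system; the choice of the three variables below is just an example):

proc g3d data=stat2.iris;
* z-axis variable comes after the = sign;
scatter sepallen*sepalwid=petallen;
run;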

10. Scatterplot matrix

You can even have a matrix of scatterplots (or 3D-plots!) by selecting all 4 variables SEPALLEN SEPALWID PETALLEN PETALWID (hold the mouse button down to select all 4) as X and all 4 as Y.
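If the interactive tool is not available, a static scatterplot matrix can be sketched with PROC SGSCATTER (a newer procedure; this assumes it is installed on your system):

proc sgscatter data=stat2.iris;
* one panel per variable pair, points colored by species;
matrix sepallen sepalwid petallen petalwid / group=species;
run;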

11. Brush points

Finally, you can "brush" observations. Point the mouse at some interesting observations, press the mouse button, and "drag" a square from there. When you release the button, the observations inside the square are highlighted in ALL the plots! In this way you can gain insight (!) into multi-dimensional data. Can you find the observation(s) that were most easily misclassified?

12. Classification results

Compare the plug-in (resubstitution) estimates of the misclassification probabilities with the cross-validated values.
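The cross-validated (leave-one-out) error rates can be requested in PROC DISCRIM with the CROSSVALIDATE option; a sketch — compare its error-count estimates with the resubstitution ones printed by default:

proc discrim data=stat2.iris pool=yes crossvalidate;
class species;
run;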

13. Do your own experiments!