SAS introduction, part 4

PROC DISCRIM

PROC DISCRIM is one of the SAS procedures that can perform discriminant analysis. The procedure has many options; some of them are described below.

The syntax is as follows:

proc discrim data=<training data> wcov wcorr pcov pcorr list
pool=yes testdata=<test data> testlist;
var <classification variables>;
class <class variable in training data>;
priors <class1=0.xxxx> <class2=0.xxxx> ..... ;
testclass <class variable in test data>;
run;

DATA=<training data> could be e.g. DATA=STAT2.IRISCALI and is the dataset from which the variance-covariance matrices and mean values are computed.
WCOV prints the "within class" variance-covariance matrix for each class.
WCORR prints the "within class" correlation matrix for each class.
PCOV prints the "pooled" within-class variance-covariance matrix.
PCORR prints the correlation matrix based on the "pooled" within-class variance-covariance matrix.
LIST means that the training data is classified and that the result is listed for every observation. To list only the misclassified observations, use LISTERR instead of LIST.
POOL=YES means that we want to discriminate using the common (pooled) variance-covariance matrix, which gives linear discriminant functions. Alternatively, POOL=NO means that the classification uses a separate variance-covariance matrix for each class, which gives quadratic discriminant functions. Another possibility is POOL=TEST, where a test for the equality of the variance-covariance matrices is performed. The matrices are pooled if the test statistic is not significant at the level given by the SLPOOL= option (0.10 by default).
TESTDATA=<test data> could be e.g. TESTDATA=STAT2.IRISTEST and is a dataset that is classified using the discriminant functions estimated from the training data.
TESTLIST (and TESTLISTERR) have the same meaning as LIST (and LISTERR), but for the test data.
VAR <classification variables>; indicates which variables are used in the classification. If the statement is omitted, SAS uses all numeric variables in the training dataset that are not named in another statement.
CLASS <class variable in training data>; could e.g. be CLASS SPECIES; and tells which variable defines the classes.
PRIORS <class1=0.xxxx> <class2=0.xxxx> ..... ; could be e.g. PRIORS SETOSA=0.333333 VERSICOLOR=0.333333 VIRGINICA=0.333334; and gives the prior probabilities of the classes. If this statement is omitted, SAS uses equal priors. (Remember: the priors should sum to 1.)
TESTCLASS <class variable in test data>; names the variable in the test data that holds the true class of each observation, so that misclassifications in the test data can be identified.
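
As an illustration, the pieces above can be put together as in the following sketch. It uses the dataset names mentioned in the text. The variable names SEPALWID, PETALLEN and PETALWID are assumptions (only SEPALLEN appears elsewhere in these notes), and it is also assumed that the class variable is called SPECIES in both datasets.

```sas
* Sketch only: linear discriminant analysis on the training data
  followed by classification of the test data;
proc discrim data=stat2.iriscali pcov listerr
             pool=yes testdata=stat2.iristest testlisterr;
  class species;
  var sepallen sepalwid petallen petalwid; * variable names partly assumed;
  priors equal;                            * same as leaving out PRIORS;
  testclass species;                       * assumed name in the test data;
run;
```

Replacing POOL=YES by POOL=TEST in this sketch lets SAS decide between a pooled and a class-specific variance-covariance matrix.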

PROC CANDISC

Another possibility is PROC CANDISC, which performs canonical discriminant analysis. Again, the procedure has many options.

The basic syntax is:

proc candisc data=<input data> out=<data to plot> distance anova;
class <class variable>;
var <classification variables>;

* a plot of the canonical variables is interesting, so;

proc plot data=<data to plot>;
plot can2*can1=<class variable>;
run;

DATA=<input data> could be e.g. DATA=STAT2.IRIS
OUT=<data to plot> is the name of the dataset chosen to contain data for the PROC PLOT step. The dataset contains the original variables and the canonical variates as new variables CAN1, CAN2, ...
DISTANCE gives the squared Mahalanobis distances between the group (class) means.
ANOVA gives a univariate one-way analysis of variance for each variable, testing the hypothesis that there is no difference in the mean between the groups.
CLASS <class variable>; could e.g. be CLASS SPECIES; and tells which variable defines the classes.

By letting the plot variable be <class variable> one gets a plot in which the different classes can be told apart. The first character of each value of the variable is used as the plotting symbol. In our case we have called the species "VIRGINICA", "versicolor", and "SETOSA", which gives the plotting symbols "V", "v", and "S" respectively.
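
A complete run could look like the sketch below. The output dataset name CANOUT is hypothetical, and the variable names SEPALWID, PETALLEN and PETALWID are assumptions. The PROC SGPLOT step is an optional alternative to PROC PLOT and is only available in SAS versions that include ODS Graphics.

```sas
* Sketch only: canonical discriminant analysis of the iris data;
proc candisc data=stat2.iris out=canout distance anova;
  class species;
  var sepallen sepalwid petallen petalwid; * variable names partly assumed;
run;

* line-printer plot of the first two canonical variates;
proc plot data=canout;
  plot can2*can1=species;
run;

* optional alternative: a scatter plot with one colour per class;
proc sgplot data=canout;
  scatter x=can1 y=can2 / group=species;
run;
```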

DATA MANIPULATION

You may find it advantageous to be able to fiddle a little with the data. One way to do that is to make a new dataset out of an old one.

data temporary;
set stat2.iris;
if species='VIRGINICA' or species='SETOSA'; * subsetting IF: only observations that satisfy the condition are kept;
logseplen=log(sepallen); * a new variable logseplen, the natural logarithm of sepallen, is created;
run;

Above a new dataset named "temporary" is created. Only observations of species VIRGINICA or SETOSA are retained. A new variable is also created. This dataset can then be referenced and used in subsequent analyses.
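
For example, the new dataset could be checked and then used as training data, as in this sketch. It assumes that the data step above has been run in the same SAS session, so that the dataset TEMPORARY exists.

```sas
* check that only the two species are left;
proc freq data=temporary;
  tables species;
run;

* discriminate between the two species using the new variable only;
proc discrim data=temporary pool=yes listerr;
  class species;
  var logseplen;
run;
```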

MORE DATA MANIPULATION

Some of you might want to try randomly dividing a dataset in two. Here is one way to do that (this is nearly what was done for question 2.4 below).

data train test; * names of two temporary datasets;
set stat2.iris; * the dataset to be divided;
rand=ranuni(7913); * generate a uniform random number between 0 and 1;
if rand>0.5 then output train; * for approx. 50% observations in each;
else output test;

proc print data=train;
proc print data=test;
run;
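
The two datasets can then be used as training and test data in PROC DISCRIM, as in the sketch below. Note that the helper variable RAND is numeric, so without a VAR statement it would be used as a classification variable. Here it is dropped with the DROP= dataset option; listing the wanted variables in a VAR statement would work as well.

```sas
* Sketch only: estimate on TRAIN, classify TEST;
proc discrim data=train(drop=rand) pool=yes
             testdata=test testlisterr;
  class species;
  testclass species;
run;
```

Changing the seed in RANUNI gives a different split, so the misclassification rates will vary somewhat from run to run.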