SAS introduction, part 4
PROC DISCRIM
PROC DISCRIM is one of the SAS procedures that can perform discriminant
analysis. The procedure has many options; the most useful ones are described
below. The syntax is as follows:
proc discrim data=<training data> wcov wcorr pcov pcorr list
pool=yes testdata=<test data> testlist;
var <classification variables>;
class <class variable in training data>;
priors <class1=0.xxxx> <class2=0.xxxx> ... ;
testclass <class variable in test data>;
run;
- DATA=<training data> could be e.g. DATA=STAT2.IRISCALI. This is the dataset
from which the variance-covariance matrices and mean values are computed.
- WCOV prints the within-class variance-covariance matrix for each class.
- WCORR prints the within-class correlation matrix for each class.
- PCOV prints the pooled within-class variance-covariance matrix.
- PCORR prints the pooled within-class correlation matrix.
- LIST means that the training data are classified and that the result is
listed for every observation. To list only the misclassified observations,
use LISTERR instead of LIST.
- POOL=YES means that we want to discriminate using the common (pooled)
variance-covariance matrix, which gives a linear discriminant function.
POOL=NO means that the classification is performed using a separate
variance-covariance matrix for each class, which gives a quadratic
discriminant function. A third possibility is POOL=TEST, where a test for
the equality of the variance-covariance matrices is performed. The matrices
are then pooled if the test statistic is not significant (the significance
level is set with SLPOOL=, default 0.10).
- TESTDATA=<test data> could be e.g. TESTDATA=STAT2.IRISTEST. This dataset is
classified using the discriminant functions estimated from the training data.
- TESTLIST (and TESTLISTERR) have the same meaning as LIST (and LISTERR), but
for the test data.
- VAR <classification variables>; indicates which variables are used in the
classification. If the statement is omitted, SAS uses all numeric variables
in the training dataset that are not named in another statement.
- CLASS <class variable in training data>; could be e.g. CLASS SPECIES; and
names the variable that defines the classes.
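- PRIORS <class1=0.xxxx> <class2=0.xxxx> ... ; could be e.g.
PRIORS SETOSA=0.333333 VERSICOLOR=0.333333 VIRGINICA=0.333334; and gives the
prior probabilities of the classes. If this statement is omitted, SAS uses
equal priors. (Remember: the priors should sum to 1.)
- TESTCLASS <class variable in test data>; names the variable in the test
data that holds the true class of each observation, so that
misclassifications in the test data can be identified.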
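As an illustration, the options above could be combined as in the following
sketch. It assumes that STAT2.IRISCALI and STAT2.IRISTEST both contain the
class variable SPECIES. The names SEPALWID, PETALLEN and PETALWID are
assumptions made for illustration (only SEPALLEN appears elsewhere in this
text), so adjust the VAR statement to the names in your data.

```sas
* sketch only: the variable names in the VAR statement are assumptions;
proc discrim data=stat2.iriscali pool=test listerr
     testdata=stat2.iristest testlisterr;
class species;
var sepallen sepalwid petallen petalwid;
priors equal;        * the default, written out for clarity;
testclass species;   * true classes in the test data;
run;
```

LISTERR and TESTLISTERR keep the output short by listing only the
misclassified observations, and POOL=TEST lets the homogeneity test decide
whether the pooled or the class-specific matrices are used.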
PROC CANDISC
Another possibility is PROC CANDISC, which performs canonical discriminant
analysis. Again, the procedure has many options.
The basic syntax is:
proc candisc data=<input data> out=<data to plot> distance anova;
class <class variable>;
var <classification variables>;
* a plot of the canonical variables is interesting, so;
proc plot data=<data to plot>;
plot can2*can1=<class variable>;
run;
- DATA=<input data> could be e.g. DATA=STAT2.IRIS.
- OUT=<data to plot> names the output dataset that is used in the PROC PLOT
step. It contains the original data plus the canonical variates as new
variables CAN1, CAN2, ...
- DISTANCE prints the squared Mahalanobis distances between the group (class)
means.
- ANOVA prints a univariate one-way analysis of variance for each variable,
testing the hypothesis that there is no difference in the mean between the
groups.
- CLASS <class variable>; could be e.g. CLASS SPECIES; and names the variable
that defines the classes.
By giving <class variable> after the equals sign in the PLOT statement one
gets a plot in which the classes can be told apart. The first nonblank
character of each value of the variable is used as the plotting symbol. In
our case the species are called "VIRGINICA", "versicolor", and "SETOSA",
which gives the plotting symbols "V", "v", and "S" respectively.
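A filled-in version of the two steps might look like the sketch below. It
assumes that STAT2.IRIS contains the class variable SPECIES; CANPLOT is a
hypothetical name chosen here for the output dataset. The VAR statement is
left out, so all numeric variables are used.

```sas
* sketch only: CANPLOT is a hypothetical name for the output dataset;
proc candisc data=stat2.iris out=canplot distance anova;
class species;
run;

* plot the second canonical variate against the first, one symbol per class;
proc plot data=canplot;
plot can2*can1=species;
run;
```

With three classes there are at most two canonical variates, so the plot of
CAN2 against CAN1 shows all of the separation the method finds.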
DATA MANIPULATION
You may find it advantageous to be able to fiddle a little with the data. One
way to do that is to make a new dataset out of an old one:
data temporary;
set stat2.iris;
* subsetting IF: keep the observation only if the condition is true;
if species='VIRGINICA' or species='SETOSA';
* create a new variable logseplen, the natural logarithm of sepallen;
logseplen=log(sepallen);
run;
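Above, a new dataset named "temporary" is created. Only observations of
species VIRGINICA or SETOSA are retained, and a new variable is added. Since
the dataset has a one-level name, it is stored in the WORK library and
disappears when the SAS session ends. Until then it can be referenced by name
in subsequent analyses.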
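For instance, the new dataset could be passed to PROC DISCRIM as in this
sketch, which discriminates between the two retained species using only the
new variable. The choice of options is just an illustration, not a
recommended analysis.

```sas
* sketch only: discriminate the two retained species on the new variable;
proc discrim data=temporary pool=yes listerr;
class species;
var logseplen;
run;
```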
MORE DATA MANIPULATION
Some of you might want to try randomly dividing a dataset in two. Here is one
way to do that (this is nearly what was done for question 2.4 below).
data train test; * names of two temporary datasets;
set stat2.iris; * the dataset to be divided;
rand=ranuni(7913); * generate a uniform random number between 0 and 1;
if rand>0.5 then output train; * for approx. 50% observations in each;
else output test;
proc print data=train;
proc print data=test;
run;
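The two halves can then serve as training and test data. The sketch below
assumes that the class variable is SPECIES, as in the examples above. Since
no VAR statement is given, PROC DISCRIM would use every numeric variable,
including the helper variable RAND, so RAND is dropped with a dataset option.

```sas
* sketch only: RAND is dropped so it is not used as a classification variable;
proc discrim data=train(drop=rand) testdata=test(drop=rand)
     pool=yes testlisterr;
class species;
testclass species;
run;
```

Changing the seed in RANUNI gives a different split, which is a simple way to
see how much the test error rate depends on the particular division.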