SAS introduction, part 4

PROC DISCRIM

PROC DISCRIM is one of the SAS procedures that can perform discriminant analysis. The procedure has many options; some of them are shown in the following.

The syntax is as follows:

proc discrim data=<training data> wcov wcorr pcov pcorr list
pool=yes testdata=<testdata> testlist;

var <classification variables>;

class <class variable in training data>;

priors <class1=0.xxxx> <class2=0.xxxx> ..... ;

testclass <class variable in test data>;

run;
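
As a concrete illustration, the sketch below fills in the template using sashelp.iris, the copy of the iris data that ships with SAS, instead of the course dataset. This is an assumption made so that the example runs on its own: the variable names (Species, SepalLength, SepalWidth, PetalLength, PetalWidth) are those of sashelp.iris and may differ from the ones in stat2.iris. The dataset names train and test, the odd/even split, and the equal priors are likewise chosen only for illustration.

```sas
* assumption: sashelp.iris stands in for the course dataset;
* odd-numbered observations go to train, even-numbered ones to test;
data train test;
  set sashelp.iris;
  if mod(_n_, 2) = 1 then output train;
  else output test;
run;

* linear discriminant analysis with a pooled covariance matrix;
proc discrim data=train wcov pcov list
             pool=yes testdata=test testlist;
  var SepalLength SepalWidth PetalLength PetalWidth;
  class Species;
  priors equal; * every class gets the same prior probability;
  testclass Species;
run;
```

Here wcov and pcov print the within-class and pooled covariance matrices, while list and testlist print the classification of each observation in the training and test data. Because testclass is given, the procedure can also tell which test observations were misclassified.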

PROC CANDISC

Another possibility is PROC CANDISC, which performs canonical discriminant analysis. Like PROC DISCRIM, it has many options.

The basic syntax is:

proc candisc data=<input data> out=<data to plot> distance anova;
class <class variable>;
var <classification variables>;
run;

* a plot of the first two canonical variables is usually informative;

proc plot data=<data to plot>;
plot can2*can1=<class variable>;
run;

Putting <class variable> after the equals sign gives a plot in which the classes can be told apart: each point is plotted using the first character of its class value. This is why, in our case, the species are called "VIRGINICA", "versicolor", and "SETOSA", which gives the plotting symbols "V", "v", and "S", respectively.
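
The following sketch shows the same idea on sashelp.iris, used here as a stand-in for the course dataset (an assumption; its variable names may differ from those in stat2.iris). In sashelp.iris the species are spelled "Setosa", "Versicolor", and "Virginica", so two classes would share the symbol "V". The hypothetical dataset iris_sym therefore lower-cases one of them first, and canout is a hypothetical name for the output dataset.

```sas
* assumption: sashelp.iris stands in for the course dataset;
data iris_sym;
  set sashelp.iris;
  * lower-case one species so that its first letter becomes unique;
  if upcase(Species) = 'VERSICOLOR' then Species = 'versicolor';
run;

* the out= dataset holds the original variables plus can1, can2, and so on;
proc candisc data=iris_sym out=canout distance anova;
  class Species;
  var SepalLength SepalWidth PetalLength PetalWidth;
run;

* plotting symbols are now S, v, and V;
proc plot data=canout;
  plot can2*can1=Species;
run;
```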

DATA MANIPULATION

You may find it advantageous to be able to modify the data a little. One way to do that is to create a new dataset from an existing one.

data temporary;
set stat2.iris;
  if species='VIRGINICA' or species='SETOSA'; * subsetting if, only observations satisfying the condition are kept;
  logseplen=log(sepallen); * create a new variable logseplen, the natural logarithm of sepallen;
run;

Above, a new dataset named "temporary" is created. Only observations of the species VIRGINICA or SETOSA are retained, and a new variable, logseplen, is added. The dataset can then be referenced by name in subsequent analyses.
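
To illustrate how such a derived dataset is used afterwards, here is a minimal sketch that builds a similar dataset from sashelp.iris (an assumption, so that the example is self-contained; the variable names and the spelling of the species values are those of sashelp.iris, not necessarily those of stat2.iris) and then passes it to another procedure. PROC MEANS is chosen only as a simple example of a subsequent analysis.

```sas
* assumption: sashelp.iris stands in for the course dataset;
data temporary;
  set sashelp.iris;
  if Species in ('Virginica', 'Setosa'); * keep two of the three species;
  logseplen = log(SepalLength);          * natural logarithm;
run;

* the new dataset is referenced through data=;
proc means data=temporary n mean std;
  class Species;
  var SepalLength logseplen;
run;
```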

MORE DATA MANIPULATION

Some of you might want to try splitting a dataset randomly into two parts. Here is one way to do that (this is nearly what was done for question 2.4 below).

data train test; * names of two temporary datasets;
set stat2.iris; * the dataset to be divided;
rand=ranuni(7913); * generate a uniform random number between 0 and 1 (7913 is the seed);
if rand>0.5 then output train; * approx. 50% of the observations in each;
else output test;
run;

proc print data=train;
run;
proc print data=test;
run;
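
The split above gives only approximately equal halves, since each observation is assigned independently. If exactly half of the observations are wanted in each part, one alternative (not the method used above) is PROC SURVEYSELECT from SAS/STAT. The sketch below assumes sashelp.iris as a stand-in for the course dataset, and split is a hypothetical name for the intermediate dataset.

```sas
* draw a simple random sample of half the observations;
* outall keeps every observation and adds a 0/1 variable named Selected;
proc surveyselect data=sashelp.iris out=split method=srs samprate=0.5
                  seed=7913 outall;
run;

* route the sampled observations to train and the rest to test;
data train test;
  set split;
  if Selected = 1 then output train;
  else output test;
  drop Selected;
run;
```

As with ranuni, fixing the seed makes the split reproducible from run to run.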