SAS introduction, part 5

General Linear Models

More about the SAS DATA step

It is possible to create new variables from the old ones while you read data into the system:

data xample;
  input x1 x2;
  x11=log(x1); a38=abs(x2); b52=exp(x1/x2);
  cards;
    45.9    98.4
     3      56.3
    45.3   -42
  ;

proc print data=xample;

Here two variables, x1 and x2, are read and used to create the transformed variables x11, a38, and b52. Try it and see what happens.
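
One pitfall with such transformations is that LOG is only defined for positive arguments, and a division fails when the denominator is zero. In both cases SAS sets the result to missing and writes a note in the log. As a small sketch, assuming one prefers to make this explicit, the assignments can be guarded with IF statements. The dataset name xample2 is made up for this illustration, and the RUN statements are additions; the data are the same as above:

data xample2;
  input x1 x2;
  * x11 is left missing when x1 is zero or negative;
  if x1 > 0 then x11=log(x1);
  a38=abs(x2);
  * b52 is left missing when x2 is zero;
  if x2 ne 0 then b52=exp(x1/x2);
  cards;
    45.9    98.4
     3      56.3
    45.3   -42
  ;
run;

proc print data=xample2;
run;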
 

(Simple) use of PROC GLM.

The description of PROC GLM in the SAS manual is among the longest of any of the procedures mentioned so far. Compared to that, the following description is rather limited. In most cases one can simply use the default values provided by SAS.

For ordinary regression in the previous example one simply writes:

proc glm data=xample;
  model x2 = x1;

Then SAS will estimate a and b in the regression:

  x2 = a*x1 + b
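
Put together with the data from the beginning of this part, a complete run could look like the sketch below. The RUN and QUIT statements are additions that are not used elsewhere in this text; they are assumed to be needed when the program is submitted interactively, since PROC GLM otherwise stays active and waits for further statements.

data xample;
  input x1 x2;
  cards;
    45.9    98.4
     3      56.3
    45.3   -42
  ;
run;

proc glm data=xample;
  * x2 is the response and x1 the regressor;
  model x2 = x1;
run;
quit;

In the table of parameter estimates, the row for the intercept corresponds to b and the row for x1 corresponds to a.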

If more variables are to enter the regression one writes:

proc glm data=stat2.sundhed;
  model maxpuls = hvilpuls loebpuls vegt;

If one wants to force the regression line through (0,0), i.e. to fit a model without an intercept, one writes:

proc glm data=xample;
  model x2 = x1 /noint;

PROC GLM and designed experiments.

If one wishes to analyse the results of a designed experiment, it is possible to tell PROC GLM that one or more variables are indicator variables (CLASS variables), which PROC GLM should then treat appropriately. These indicator variables (factors) tell which level a factor has. As an example consider example 3.4 in the book. Here the catalyser could have level A or B. Another example might be a temperature in an experiment that was "high" or "low".

The results in example 3.4 could have been the following:  

                          Exp 1    Exp 2
  Catalyser at level A    34.5     39.7
  Catalyser at level B    23.1     29.4

The first model of example 3.4 could be solved using PROC GLM in the following way:

    data xmpl3_4; * read in the data;
      input level exp yield;
      cards;
               1       1      34.5
               1       2      39.7
               2       1      23.1
               2       2      29.4
       ;

    proc print data=xmpl3_4; * control printout of data;

    proc glm data=xmpl3_4;
      class level; * indicator variables mentioned here;
      model yield=level /noint solution;

A few remarks:

CLASS LEVEL; means that PROC GLM should use the different values of LEVEL as an indicator variable, rather than as a regression variable. One would get the same analysis if LEVEL had the values 101 and 45 instead of 1 and 2.
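
This claim is easy to check. The sketch below recodes the levels in a new dataset (the name xmpl3_4b is chosen here only for illustration; RUN and QUIT are additions) and repeats the analysis:

    data xmpl3_4b;
      set xmpl3_4;
      * relabel level 1 as 101 and level 2 as 45;
      if level=1 then level=101;
      else if level=2 then level=45;
    run;

    proc glm data=xmpl3_4b;
      class level;
      model yield=level /noint solution;
    run;
    quit;

The estimates themselves should be unchanged. Only the order in which the two levels are listed may differ, since PROC GLM by default orders the levels of a CLASS variable by their values.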

MODEL YIELD=LEVEL /NOINT SOLUTION;

means that we want PROC GLM to estimate the Theta's in:

                   E{Y11} = E{Y12} = Theta1

                   E{Y21} = E{Y22} = Theta2

The SOLUTION option in the MODEL statement is necessary if one has a CLASS statement and wants the "solution" (the estimates) printed. (Strange, but that is the way it is!)
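
For this simple model the least squares estimates of Theta1 and Theta2 are just the average yields at each level, (34.5+39.7)/2 = 37.1 and (23.1+29.4)/2 = 26.25. As a check, and assuming the dataset xmpl3_4 from above, the same numbers can be obtained with PROC MEANS (the RUN statement is an addition):

    proc means data=xmpl3_4 mean;
      class level;  * one mean per level of the catalyser;
      var yield;
    run;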

If one uses the following MODEL statement instead of the previous one:

MODEL YIELD=LEVEL /SOLUTION; * Without the NOINT option;

then it means that GLM should estimate Mu and the Theta's in:

                E{Y11} = E{Y12} = Mu + Theta1

                E{Y21} = E{Y22} = Mu + Theta2

corresponding to the second model in example 3.4. PROC GLM will introduce the necessary restrictions, though not the restrictions mentioned in the book. There are ways of making GLM do precisely what one wants, but that is beyond the scope of this text.
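
Whichever restrictions are used, the estimated expected values E{Yij} are the same. A minimal sketch of how to see this, using the OUTPUT statement of PROC GLM to save the predicted values (the names fit3_4 and yhat are chosen here for illustration; RUN and QUIT are additions):

    proc glm data=xmpl3_4;
      class level;
      model yield=level /solution;   * model with intercept;
      output out=fit3_4 predicted=yhat;
    run;
    quit;

    proc print data=fit3_4;
    run;

The column yhat should contain 37.1 for the two observations at level 1 and 26.25 for the two at level 2, i.e. the same expected values as in the model with NOINT, even though the individual parameter estimates look different.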
 

Interactions

In PROC GLM an interaction is specified in the MODEL statement by writing the names of the factors involved with an * between them. If the example above also had the indicator variable TEMPERAT (which must then also be listed in the CLASS statement), the model with all main effects and all interactions could be specified by:

 MODEL YIELD=LEVEL TEMPERAT LEVEL*TEMPERAT /NOINT SOLUTION;
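
As a sketch, a complete call could look as follows. The dataset name xmpl_temp is hypothetical; it is assumed to contain the variables YIELD, LEVEL and TEMPERAT. PROC GLM also accepts the bar notation LEVEL|TEMPERAT as shorthand for the two main effects plus their interaction, which is what is used here:

 proc glm data=xmpl_temp;
   class level temperat;   * both factors are CLASS variables;
   * level|temperat expands to level temperat level*temperat;
   model yield=level|temperat /noint solution;
 run;
 quit;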
 
 

 PROC GLM and the OUTPUT dataset.

As an option one can have PROC GLM produce a dataset, which contains the predicted values and the residuals from the model analysed. This is done in the following way:

  proc glm data=stat2.sundhed;
    model ilt = vegt;
    output out=glm_data predicted=ilt_pred residual=ilt_res;

The above OUTPUT statement contains:

  out=<SAS-datasetname>                ,here "glm_data"
  predicted=<name of pred. variable>   ,here "ilt_pred"
  residual=<name of residual>          ,here "ilt_res"

You are of course free to choose the names yourself.

One now has access to the new variables "ILT_PRED" and "ILT_RES" in the (temporary) SAS dataset "GLM_DATA". SAS also copies all the variables from the input dataset "STAT2.SUNDHED" to the output dataset "GLM_DATA". This is easily checked by printing the data, e.g.:

  proc print data=glm_data;

It is of course possible to operate on "GLM_DATA" as on any other SAS dataset, using procedures of your own choosing.
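
For instance, a common first check of a model is to plot the residuals against the predicted values and to look at the distribution of the residuals. A minimal sketch, using the variable names chosen above (the choice of PROC PLOT and PROC UNIVARIATE is only one possibility, and RUN and QUIT are additions):

  proc plot data=glm_data;
    * residuals on the vertical axis, with a reference line at 0;
    plot ilt_res*ilt_pred / vref=0;
  run;
  quit;

  proc univariate data=glm_data normal;
    var ilt_res;   * summary statistics and tests for normality;
  run;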

Exercises.

  1. Use SAS to solve problem 12 in the ordinary exercise booklet. Note: in this case you must enter the data yourself.
  2. Use PROC INSIGHT (interactive data analysis) to help you visualize the problem. (Once the data step has been entered and run, the dataset will exist in "WORK" if you have used a one-level name for it, or in the library you chose if you used a two-level name.)
  3. A permanent SAS-dataset corresponding to example 3.5 in the book is accessible to you, and can be printed using e.g.

                proc print data=stat2.xmpl3_5;


    Do this and convince yourself that data can be represented this way.

  4. The data can be analysed using the following GLM statements:

    proc glm data=stat2.xmpl3_5;

         class bakterie buffer;
         model udbytte=bakterie bakterie*buffer /noint solution;

    Try it and see if you can enter the (estimated) expected values into the following table:

                                        pH-buffer added    pH-buffer not added
    Type of bacteria: acid producer
    Type of bacteria: neutral
  5. The data are analysed using the following GLM statements:

       proc glm data=stat2.xmpl3_5;
         class bakterie buffer;
         model udbytte=bakterie*buffer bakterie /noint solution;

    and are analysed once again using:

       proc glm data=stat2.xmpl3_5;
         class bakterie buffer;
         model udbytte=bakterie bakterie*buffer /solution;

    Try to fill in the table once again with the (estimated) expected values. Can you explain what is going on? (Hint: pseudo-inverse)