Demo 1: Iris Data

If you have problems displaying the page, try zooming out (Ctrl+- or Cmd+-) and refreshing the page (F5 or Cmd+R). Remember that you can ask questions or leave a comment on the last text slide.

Fisher's Iris data set describes 150 iris flowers. For each flower, a set of measurements was taken: the lengths and widths of the "leaves" of the flower. An iris has a set of inner leaves (petals) and outer leaves (sepals). Measuring the length and width of each type of leaf gives four measurements per flower. Additionally, the species of each flower was recorded. The data set is of interest as a simple example for understanding various visualization methods and machine learning techniques. For instance, the data set is introduced in Chapter 1 on page 3, where Figure 1.1 on page 6 is a scatter plot of the petal lengths and widths, with the iris species marked by colour.

Data table

A typical way to record data such as the Iris data set is to store it in a data table, such as the one shown below. The table contains the measurements of the 150 flowers, one flower per row. The first column holds the sepal lengths, measured in centimetres. The sepal width is in the second column, and so on. The last column shows the species: setosa, versicolor, or virginica. You can sort the table by each of the attributes (or features). For instance, if you click on petal length twice (to sort in descending order), you will see that the largest recorded petal length is 6.9 centimetres and that the flower was of the species virginica.
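The sorting step can also be sketched in a few lines of Python. This is only an illustration: the nine rows are a small hand-picked sample of the 150 flowers, and the column names and the helper `sort_table` are hypothetical names chosen for this sketch.

```python
# Nine sample rows from the Iris data (lengths in centimetres).
# Assumption: this is a small illustrative subset, not the full 150-row table.
COLUMNS = ("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (5.8, 4.0, 1.2, 0.2, "setosa"),
    (4.5, 2.3, 1.3, 0.3, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (4.9, 2.4, 3.3, 1.0, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.7, 2.6, 6.9, 2.3, "virginica"),
]


def sort_table(rows, column, descending=True):
    """Return a new list of rows sorted by the named column (hypothetical helper)."""
    index = COLUMNS.index(column)
    return sorted(rows, key=lambda row: row[index], reverse=descending)


# Same effect as clicking the petal length header twice in the table.
by_petal_length = sort_table(ROWS, "petal_length")
for row in by_petal_length:
    print(dict(zip(COLUMNS, row)))
```

The first row printed is the flower with the largest petal length in the sample.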

Data matrix

The table's first four columns correspond to the data matrix $\mathbf{X}$, as introduced in Section 1.3, "Terminology of Machine Learning", of the book. We often call the four measurements features. Each feature is a dimension of the data set, so the data set is 4-dimensional. We often refer to the type of information stored in the species column as "a label" or "a class label".
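As a rough sketch of this terminology, a table can be split into a data matrix and a list of class labels. The name `y` for the labels and the nine sample rows are assumptions made for this illustration; the full data set has 150 rows.

```python
# Nine sample rows: four measurements (cm) followed by the species.
# Assumption: a small illustrative subset of the full table.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (5.8, 4.0, 1.2, 0.2, "setosa"),
    (4.5, 2.3, 1.3, 0.3, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (4.9, 2.4, 3.3, 1.0, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.7, 2.6, 6.9, 2.3, "virginica"),
]

# Data matrix: one row per flower, one column per feature.
X = [list(row[:4]) for row in ROWS]
# Class labels: one species name per flower (the name y is an assumption).
y = [row[4] for row in ROWS]

n_observations = len(X)
n_features = len(X[0])
print(f"X has {n_observations} rows and {n_features} columns")
print(f"classes: {sorted(set(y))}")
```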

Illustration of an iris

When we work with machine learning, it is important to evaluate, interpret and visualize the data we are working with (see Figure 1.10 on page 15). To gain some intuition for the data set, we can visualize an abstract iris flower where the leaves are represented as ellipses. Below is a plot with some sliders to the left of it. Changing a slider, for instance "petal_width", changes the feature values that the abstract iris is generated from.

Scatter plots of iris features

We can visualize how the features of a data set interact by plotting them against each other. Above we have plotted the petal dimensions against each other ("petal width as a function of petal length"), and the sepal dimensions against each other.

This is done by taking the values of the two features for each observation and marking the corresponding point in the coordinate system. Each entry in the data table above corresponds to a dot in the scatter plot. We have also coloured the dots such that we can see whether each observation, or sample, was of the setosa (red), versicolor (green), or virginica (blue) species. You can choose which observations are highlighted in the scatter plot by selecting them in the data table above.

The visualization of an iris

In addition to the actual data, we have marked the placement of the abstract iris with a crossed circle in each of the plots. On the left plot of petal dimensions, you see a dark purple marker for the chosen petal dimensions. Similarly, you see a lighter purple marker for the sepal dimensions on the right plot.

Try changing the dimensions of the abstract flower, either by moving the sliders or by clicking on the scatter plot, to see what a flower from various parts of the feature space would look like.

Considerations

Based on the scatter plots, which of these two pairs of features would you suspect to be more strongly correlated: the petal lengths and widths, or the sepal lengths and widths?
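One way to check a visual impression like this is to compute a correlation coefficient for each pair of features. The sketch below uses the Pearson correlation coefficient, which is a choice made for this illustration and not something the demo prescribes. The nine rows are a small sample, so the numbers will differ from those for all 150 flowers.

```python
from math import sqrt

# Nine sample rows: sepal length, sepal width, petal length, petal width (cm), species.
# Assumption: a small illustrative subset of the full table.
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (5.8, 4.0, 1.2, 0.2, "setosa"),
    (4.5, 2.3, 1.3, 0.3, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (4.9, 2.4, 3.3, 1.0, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.7, 2.6, 6.9, 2.3, "virginica"),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def column(rows, index):
    """Extract one feature column from the table."""
    return [row[index] for row in rows]


r_sepal = pearson(column(ROWS, 0), column(ROWS, 1))
r_petal = pearson(column(ROWS, 2), column(ROWS, 3))
print(f"sepal length vs. sepal width: r = {r_sepal:.2f}")
print(f"petal length vs. petal width: r = {r_petal:.2f}")
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 a weak one.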

Try sorting the data table by petal length (such that the largest values are on top). Do the virginica flowers with the largest petal lengths also have the largest sepal lengths?

Can you choose a single feature (petal/sepal width/length) that would separate the setosa (red) flowers from the versicolor and virginica flowers (green and blue)? That is, all flowers with a feature value above a certain threshold are setosa and all those below are not, or vice versa.
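To test a candidate feature, one could check whether all setosa values lie on one side of all the other values. The sketch below does this for a feature of your choice. The helper `separating_threshold`, the column names, and the midpoint rule for picking the threshold are assumptions made for this illustration, and the nine sample rows cannot settle the question for all 150 flowers.

```python
# Nine sample rows from the Iris data (lengths in centimetres).
# Assumption: a small illustrative subset of the full table.
COLUMNS = ("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (5.8, 4.0, 1.2, 0.2, "setosa"),
    (4.5, 2.3, 1.3, 0.3, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (4.9, 2.4, 3.3, 1.0, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.7, 2.6, 6.9, 2.3, "virginica"),
]


def separating_threshold(rows, feature, target="setosa"):
    """Return a threshold that puts every target flower on one side and every
    other flower on the other side, or None if no such threshold exists."""
    index = COLUMNS.index(feature)
    inside = [row[index] for row in rows if row[-1] == target]
    outside = [row[index] for row in rows if row[-1] != target]
    if max(inside) < min(outside):
        # Target values are all smaller: split halfway across the gap.
        return (max(inside) + min(outside)) / 2
    if min(inside) > max(outside):
        # Target values are all larger: split halfway across the gap.
        return (min(inside) + max(outside)) / 2
    return None


# Replace "sepal_width" with the feature you want to test.
threshold = separating_threshold(ROWS, "sepal_width")
if threshold is None:
    print("no single threshold separates setosa on this feature (in this sample)")
else:
    print(f"setosa is separated at a threshold of about {threshold:.2f} cm")
```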

If you could choose only two features (either the petal dimensions or the sepal dimensions), which two would you choose for distinguishing between the three classes?
