Slide 13.17: The second iteration for information gain

The Second Iteration for Information Gain

In the second iteration, we need to update our data table. Since the Expensive and Standard values of “Travel cost/km” are each associated with a pure class, we no longer need those rows. For the second iteration, our data table D comes only from the rows with Cheap “Travel cost/km.” We also remove the attribute “Travel cost/km” from the data, because its value is the same (Cheap) in every remaining row and is therefore redundant, as follows.

Attributes				Classes
Gender	Car Ownership	Travel Cost ($)/km	Income Level	Transportation Mode
Female	0	~~Cheap~~	Low	Bus
Male	0	~~Cheap~~	Low	Bus
Male	1	~~Cheap~~	Medium	Bus
Male	1	~~Cheap~~	Medium	Bus
Female	1	~~Cheap~~	Medium	Train

⇓

Attributes			Classes
Gender	Car Ownership	Income Level	Transportation Mode
Female	0	Low	Bus
Male	0	Low	Bus
Male	1	Medium	Bus
Male	1	Medium	Bus
Female	1	Medium	Train
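
The table update above can be sketched in a few lines of Python. This is only an illustration: the list-of-dicts layout, the key name "Travel Cost", and the variable names are assumptions made for this sketch, and only the five Cheap rows shown above are included, so the filter step keeps all of them.

```python
# The five rows shown in the first table above (all have Cheap travel cost).
# The dictionary keys are illustrative names, not part of the original slide.
rows = [
    {"Gender": "Female", "Car Ownership": 0, "Travel Cost": "Cheap",
     "Income Level": "Low", "Transportation Mode": "Bus"},
    {"Gender": "Male", "Car Ownership": 0, "Travel Cost": "Cheap",
     "Income Level": "Low", "Transportation Mode": "Bus"},
    {"Gender": "Male", "Car Ownership": 1, "Travel Cost": "Cheap",
     "Income Level": "Medium", "Transportation Mode": "Bus"},
    {"Gender": "Male", "Car Ownership": 1, "Travel Cost": "Cheap",
     "Income Level": "Medium", "Transportation Mode": "Bus"},
    {"Gender": "Female", "Car Ownership": 1, "Travel Cost": "Cheap",
     "Income Level": "Medium", "Transportation Mode": "Train"},
]

# Step 1: keep only the rows whose travel cost is Cheap.
cheap_rows = [r for r in rows if r["Travel Cost"] == "Cheap"]

# Step 2: drop the travel cost attribute, which is now constant.
reduced = [{k: v for k, v in r.items() if k != "Travel Cost"}
           for r in cheap_rows]

for r in reduced:
    print(r)
```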

Now we have only three attributes: “Gender,” “Car ownership,” and “Income level.” Based on these data, the probability of each class and the degrees of impurity are computed as follows:

  Prob( Bus )   = 4 / 5 = 0.8        # 4B / 5 rows
  Prob( Train ) = 1 / 5 = 0.2        # 1T / 5 rows

  Entropy = –0.8×log₂(0.8) – 0.2×log₂(0.2) = 0.722

  Gini index = 1 – (0.8² + 0.2²) = 0.320

  Classification error = 1 – Max{0.8, 0.2} = 1 – 0.8 = 0.200
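
The three impurity values above can be checked with a short Python sketch. The function name `impurity` and the label list are assumptions made for this illustration; the entropy uses a base-2 logarithm, which is what reproduces the value 0.722.

```python
from collections import Counter
from math import log2


def impurity(labels):
    """Return (entropy, gini, classification_error) for a list of class labels."""
    n = len(labels)
    # Class probabilities, e.g. Bus = 4/5 and Train = 1/5.
    probs = [count / n for count in Counter(labels).values()]
    entropy = -sum(p * log2(p) for p in probs)   # base-2 logarithm
    gini = 1 - sum(p * p for p in probs)
    error = 1 - max(probs)
    return entropy, gini, error


# Transportation modes of the five remaining rows.
labels = ["Bus", "Bus", "Bus", "Bus", "Train"]
entropy, gini, error = impurity(labels)
print(round(entropy, 3), round(gini, 3), round(error, 3))  # 0.722 0.32 0.2
```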