How to Generate a Decision Tree (Cont.)


From table D and from each associated subset Si, we compute the degree of impurity. To do so, we must distinguish whether the table is the parent table D or a subset table Si obtained by splitting on attribute i. If the table is the parent table D, we simply count the number of records in each class. For example, in the parent table below, we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars and 3 Trains (in short 4B, 3C, 3T):

Attributes                                                      Classes
Gender   Car Ownership   Travel Cost ($)/km   Income Level      Transportation Mode
Male     0               Cheap                Low               Bus
Male     1               Cheap                Medium            Bus
Female   1               Cheap                Medium            Train
Female   0               Cheap                Low               Bus
Male     1               Cheap                Medium            Bus
Male     0               Standard             Medium            Train
Female   1               Standard             Medium            Train
Female   1               Expensive            High              Car
Male     2               Expensive            Medium            Car
Female   2               Expensive            High              Car
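
The class counts quoted above (4B, 3C, 3T) can be checked with a short Python sketch. The list `modes` is simply the Transportation Mode column of the parent table typed in by hand, and the variable names are illustrative rather than part of the tutorial.

```python
from collections import Counter

# Transportation Mode column of the parent table D, in row order.
modes = ["Bus", "Bus", "Train", "Bus", "Bus",
         "Train", "Train", "Car", "Car", "Car"]

# Count the number of records in each class.
counts = Counter(modes)
print(counts["Bus"], counts["Car"], counts["Train"])  # 4 3 3
```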

Based on these data, we can compute the probability of each class and the degree of impurity under each of the three measures:
    Prob( Bus )   = 4 / 10 = 0.4        # 4B / 10 rows
    Prob( Car )   = 3 / 10 = 0.3        # 3C / 10 rows
    Prob( Train ) = 3 / 10 = 0.3        # 3T / 10 rows
     Entropy (logarithm base 2)
   = –0.4×log2(0.4) – 0.3×log2(0.3) – 0.3×log2(0.3) = 1.571

     Gini index
   = 1 – (0.4² + 0.3² + 0.3²) = 1 – 0.34 = 0.660
 
     Classification error
   = 1 – Max{0.4, 0.3, 0.3} = 1 – 0.4 = 0.60
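
As a check on the arithmetic, the three impurity measures can be sketched in Python. This is a minimal sketch that assumes a base-2 logarithm for entropy (the base that reproduces the value 1.571); the function names are hypothetical helpers chosen for illustration.

```python
import math

def probabilities(counts):
    """Turn class counts into class probabilities."""
    total = sum(counts)
    return [c / total for c in counts]

def entropy(counts):
    # Assumption: logarithm base 2. Empty classes are skipped,
    # since they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities(counts) if p > 0)

def gini_index(counts):
    # One minus the sum of squared class probabilities.
    return 1 - sum(p ** 2 for p in probabilities(counts))

def classification_error(counts):
    # One minus the probability of the most frequent class.
    return 1 - max(probabilities(counts))

parent = [4, 3, 3]  # 4 Bus, 3 Car, 3 Train
print(f"{entropy(parent):.3f}")               # 1.571
print(f"{gini_index(parent):.3f}")            # 0.660
print(f"{classification_error(parent):.3f}")  # 0.600
```

All three measures return 0 for a table containing a single class, for example `entropy([10])`, which is the pure case a decision tree tries to reach.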