Slide 13.12: How to generate a decision tree (cont.)

From table D and from each associated subset S_i, we compute the degree of impurity. To do so, we must distinguish whether the table is the parent table D or a subset table S_i obtained by splitting on attribute i. For the parent table D, we simply count the number of records in each class. For example, in the parent table below, we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars and 3 Trains (in short 4B, 3C, 3T):

Attributes				Classes
Gender	Car Ownership	Travel Cost ($)/km	Income Level	Transportation Mode
Male	0	Cheap	Low	Bus
Male	1	Cheap	Medium	Bus
Female	1	Cheap	Medium	Train
Female	0	Cheap	Low	Bus
Male	1	Cheap	Medium	Bus
Male	0	Standard	Medium	Train
Female	1	Standard	Medium	Train
Female	1	Expensive	High	Car
Male	2	Expensive	Medium	Car
Female	2	Expensive	High	Car

Based on these data, we can compute the probability of each class and, from those probabilities, the degrees of impurity:

    Prob( Bus )   = 4 / 10 = 0.4        # 4B / 10 rows
    Prob( Car )   = 3 / 10 = 0.3        # 3C / 10 rows
    Prob( Train ) = 3 / 10 = 0.3        # 3T / 10 rows
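
The class counts and probabilities above can be reproduced with a few lines of Python. This is only a minimal sketch: the list `modes` is a hand-copied version of the Transportation Mode column of the parent table, and the variable names are illustrative, not part of the original tutorial.

```python
from collections import Counter

# Transportation Mode column of the parent table D, in row order
# (copied by hand from the table; the names here are illustrative).
modes = ["Bus", "Bus", "Train", "Bus", "Bus",
         "Train", "Train", "Car", "Car", "Car"]

counts = Counter(modes)   # class -> number of records
total = len(modes)        # number of rows in D

# Probability of each class = class count / total number of rows
probs = {cls: n / total for cls, n in counts.items()}

for cls in sorted(probs):
    print(f"Prob({cls}) = {counts[cls]}/{total} = {probs[cls]:.1f}")
```

For a subset table S_i, the same counting would be applied to only the rows that belong to that subset.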

    Entropy (logarithm base 2)
      = -0.4×log2(0.4) - 0.3×log2(0.3) - 0.3×log2(0.3) = 1.571

    Gini index
      = 1 - (0.4² + 0.3² + 0.3²) = 0.660

    Classification error
      = 1 - Max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60
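
As a check on the arithmetic, the three impurity measures can be sketched in Python. The function names below are illustrative rather than taken from any library, and the entropy uses a base-2 logarithm, which is the base that reproduces the value 1.571 for these probabilities.

```python
import math

def entropy(probs):
    """Entropy in bits: -sum(p * log2(p)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index: 1 - sum(p^2)."""
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    """Classification error: 1 - max(p)."""
    return 1 - max(probs)

# Class probabilities of the parent table D: Bus, Car, Train
probs = [0.4, 0.3, 0.3]

print(f"Entropy              = {entropy(probs):.3f}")
print(f"Gini index           = {gini(probs):.3f}")
print(f"Classification error = {classification_error(probs):.3f}")
```

All three measures are zero for a table in which every record belongs to the same class, and they grow as the classes become more evenly mixed.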