How to Generate a Decision Tree (Cont.)
For the table D and for each associated subset Si, we compute the degree of impurity. To compute the degree of impurity, we must distinguish whether it comes from the parent table D or from a subset table Si produced by splitting on attribute i.
If the table is the parent table D, we simply count the number of records in each class. For example, in the parent table below, we can compute the degree of impurity based on transportation mode. In this case we have 4 Buses, 3 Cars and 3 Trains (in short: 4B, 3C, 3T):
| Gender | Car Ownership | Travel Cost ($)/km | Income Level | Transportation Mode (Class) |
|--------|---------------|--------------------|--------------|-----------------------------|
| Male   | 0 | Cheap     | Low    | Bus   |
| Male   | 1 | Cheap     | Medium | Bus   |
| Female | 1 | Cheap     | Medium | Train |
| Female | 0 | Cheap     | Low    | Bus   |
| Male   | 1 | Cheap     | Medium | Bus   |
| Male   | 0 | Standard  | Medium | Train |
| Female | 1 | Standard  | Medium | Train |
| Female | 1 | Expensive | High   | Car   |
| Male   | 2 | Expensive | Medium | Car   |
| Female | 2 | Expensive | High   | Car   |

The first four columns are the attributes; the last column, Transportation Mode, is the class.
Based on these data, we can compute probability of each class and the degrees of impurity:
Prob( Bus ) = 4 / 10 = 0.4 # 4B / 10 rows
Prob( Car ) = 3 / 10 = 0.3 # 3C / 10 rows
Prob( Train ) = 3 / 10 = 0.3 # 3T / 10 rows
Entropy
= –0.4×log2(0.4) – 0.3×log2(0.3) – 0.3×log2(0.3) = 1.571 (using base-2 logarithm)
Gini index
= 1 – (0.4² + 0.3² + 0.3²) = 1 – 0.34 = 0.660
Classification error
= 1 – Max{0.4, 0.3, 0.3} = 1 – 0.4 = 0.600