D and subset table Si is because we would like to compare the difference of impurity degrees before we split the table (i.e. data table D) and after we split the table according to the values of an attribute i (i.e. subset table Si).
The measure to compare the difference of impurity degrees is called information gain.
We would like to know what our gain is if we split the data table based on some attribute values.
Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables.
Each weight is the fraction of records that have the corresponding attribute value.
Suppose we use entropy as the measure of impurity degree; then we have:
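$$\text{Information gain}(i) = \text{Entropy}(D) - \sum_{k} \frac{|k|}{|n|} \, \text{Entropy}(S_{i,k})$$
where i is an attribute, k is a value of attribute i, |k| is the number of records having value k, |n| is the number of records over all attribute values, and S_{i,k} denotes the records of subset table Si that have value k.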
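As a minimal sketch of the formula above, assuming each table is summarized by the number of records per class, the gain can be computed in Python as follows. The function names `entropy` and `information_gain` and the toy table are hypothetical and chosen only for illustration.

```python
from math import log2


def entropy(class_counts):
    """Entropy (base 2) of a table, given the record count of each class."""
    total = sum(class_counts)
    # Classes with zero records contribute nothing, so they are skipped.
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def information_gain(parent_counts, subset_counts):
    """Entropy of the parent table minus the weighted entropy of its subsets.

    subset_counts holds one list of class counts per attribute value k.
    """
    n = sum(parent_counts)  # total number of records in the parent table
    weighted = sum(sum(s) / n * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - weighted


# Hypothetical table (not from the text): 4 records of class A and 4 of class B,
# split by a two-valued attribute into (4A, 0B) and (0A, 4B).
toy_gain = information_gain([4, 4], [[4, 0], [0, 4]])
print(toy_gain)
```

In this toy case the split separates the classes completely, so every subset has zero entropy and the gain equals the full entropy of the parent table.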
For example, our data table D has class counts of 4B, 3C, 3T, which produce an entropy of 1.571.
Now we try the attribute “Travel cost per km”, which splits the table into three subsets of 5, 2 and 3 records:
Entropy: 1.571 – (5/10×0.722 + 2/10×0 + 3/10×0) = 1.210
Gini index: 0.660 – (5/10×0.320 + 2/10×0 + 3/10×0) = 0.500
Classification error: 0.600 – (5/10×0.200 + 2/10×0 + 3/10×0) = 0.500
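The arithmetic in the three lines above can be checked with a short Python sketch. The helper name `gain_from_impurities` is an assumption made for illustration; the parent impurities, subset sizes and subset impurities are copied directly from the text.

```python
def gain_from_impurities(parent_impurity, subset_sizes, subset_impurities):
    """Parent impurity minus the size-weighted sum of the subset impurities."""
    n = sum(subset_sizes)  # total number of records (10 in the example)
    weighted = sum(
        size / n * impurity
        for size, impurity in zip(subset_sizes, subset_impurities)
    )
    return parent_impurity - weighted


# Subset sizes for the three values of "Travel cost per km".
sizes = [5, 2, 3]

# One call per line of the worked example, in the same order.
gain_first = gain_from_impurities(1.571, sizes, [0.722, 0.0, 0.0])
gain_second = gain_from_impurities(0.660, sizes, [0.320, 0.0, 0.0])
gain_third = gain_from_impurities(0.600, sizes, [0.200, 0.0, 0.0])

print(round(gain_first, 3), round(gain_second, 3), round(gain_third, 3))
```

Up to rounding, the three results agree with the values 1.210, 0.500 and 0.500 given above. Only the first subset contributes to the weighted sum, because the other two subsets have zero impurity.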