Information Gain


We compute impurity degrees for both the data table D and the subset tables Si because we want to compare the impurity before the table is split (data table D) with the impurity after it is split according to the values of an attribute i (subset table Si). The measure of this difference is called information gain: it tells us how much we gain by splitting the data table on the values of a given attribute. Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables, where each weight is the fraction of records carrying the corresponding attribute value. Suppose we use entropy as the measurement of impurity degree; then we have:
   Information gain( i ) = Entropy(D) − Σ ( nk/n × Entropy(Sk) )
where i is an attribute, k ranges over the values of attribute i, Sk is the subset of records of table D that have value k, nk is the number of records in Sk, and n is the total number of records in D. The same formula applies with the Gini index or classification error in place of entropy. A sketch of these impurity measures in code follows.
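As a minimal sketch, the impurity measures used here can be written as small Python functions; the names entropy, gini, and classification_error are our own choices, not part of any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def classification_error(labels):
    """Classification error: 1 minus the proportion of the majority class."""
    return 1 - max(Counter(labels).values()) / len(labels)
```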
For example, our data table D has classes of 4B, 3C, 3T, which gives an entropy of 1.571. Splitting on the attribute "Travel cost per km" divides the table into three subsets, with 5, 2, and 3 records and entropies of 0.722, 0, and 0 respectively. The information gains of attribute "Travel cost per km" based on entropy, Gini index, and classification error are computed as follows.
Entropy:
1.571 – (5/10×0.722 + 2/10×0 + 3/10×0) = 1.210
Gini index:
0.660 – (5/10×0.320 + 2/10×0 + 3/10×0) = 0.500
Classification error:
0.600 – (5/10×0.200 + 2/10×0 + 3/10×0) = 0.500
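Continuing the sketch above, information gain itself is just the parent impurity minus the weighted subset impurities, so the three results can be checked in a few lines. Here information_gain, parent, and subsets are our own names, and the class makeup of each subset is an assumption consistent with the record counts and impurity values quoted in the text:

```python
def information_gain(parent, subsets, impurity):
    """Impurity of the parent table minus the weighted impurity of its subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * impurity(s) for s in subsets)
    return impurity(parent) - weighted

# Parent table D: 4 Bus, 3 Car, 3 Train records.
parent = ['B'] * 4 + ['C'] * 3 + ['T'] * 3

# One class assignment per subset that matches the record counts (5, 2, 3)
# and impurities quoted in the text; the exact makeup is an assumption,
# since the split table itself is not reproduced here.
subsets = [
    ['B', 'B', 'B', 'B', 'T'],  # 5 records, entropy 0.722
    ['T', 'T'],                 # 2 records, entropy 0
    ['C', 'C', 'C'],            # 3 records, entropy 0
]

for impurity in (entropy, gini, classification_error):
    print(impurity.__name__, round(information_gain(parent, subsets, impurity), 3))
# entropy 1.21
# gini 0.5
# classification_error 0.5
```

The printed values match the hand computations above: splitting on "Travel cost per km" gains 1.210 bits of entropy (printed as 1.21) and 0.500 under both the Gini index and classification error.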


