Measuring Impurity
Given a data table that contains attributes and the class of each record, we can measure the homogeneity (or heterogeneity) of the table based on its classes.
A table is pure, or homogeneous, if it contains only a single class.
If a data table contains several classes, then the table is impure or heterogeneous.
Several indices measure the degree of impurity quantitatively.
The best-known indices are as follows:
Entropy = $-\sum_j p_j \log_2 p_j$
Gini Index = $1 - \sum_j p_j^2$
Classification Error = $1 - \max_j p_j$
where $p_j$ is the probability of the class value $j$.
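The three indices can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `entropy`, `gini_index`, and `classification_error` are hypothetical, and each function assumes it receives a list of class probabilities that sum to 1.

```python
import math

def entropy(probs):
    """Entropy in bits: sum of -p_j * log2(p_j) over all classes j.
    Classes with p_j == 0 are skipped, since they contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini_index(probs):
    """Gini index: 1 minus the sum of squared class probabilities."""
    return 1 - sum(p * p for p in probs)

def classification_error(probs):
    """Classification error: 1 minus the largest class probability."""
    return 1 - max(probs)

# Two illustrative tables (assumed for this sketch):
pure_table = [1.0]        # a single class, so the table is pure
even_split = [0.5, 0.5]   # two equally likely classes

for name, probs in [("pure", pure_table), ("even split", even_split)]:
    print(name, entropy(probs), gini_index(probs), classification_error(probs))
```

For the pure table all three indices are 0. For the even two-class split the entropy is 1, while the Gini index and the classification error are both 0.5.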
In the following example (10 rows), the class Transportation Mode takes three values: Bus, Car, and Train.
The table has 4 buses, 3 cars, and 3 trains (4B, 3C, and 3T for short).
The first four columns are attributes; the last column, Transportation Mode, is the class.

| Gender | Car Ownership | Travel Cost ($)/km | Income Level | Transportation Mode |
|--------|---------------|--------------------|--------------|---------------------|
| Male   | 0             | Cheap              | Low          | Bus                 |
| Male   | 1             | Cheap              | Medium       | Bus                 |
| Female | 0             | Cheap              | Low          | Bus                 |
| Male   | 1             | Cheap              | Medium       | Bus                 |
| Female | 1             | Expensive          | High         | Car                 |
| Male   | 2             | Expensive          | Medium       | Car                 |
| Female | 2             | Expensive          | High         | Car                 |
| Female | 1             | Cheap              | Medium       | Train               |
| Male   | 0             | Standard           | Medium       | Train               |
| Female | 1             | Standard           | Medium       | Train               |
Based on these data, we can compute the probability of each class as its count divided by the total number of rows.
Prob( Bus ) = 4 / 10 = 0.4 # 4B / 10 rows
Prob( Car ) = 3 / 10 = 0.3 # 3C / 10 rows
Prob( Train ) = 3 / 10 = 0.3 # 3T / 10 rows
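The same counting can be sketched in Python. The list `modes` below is simply the Transportation Mode column of the 10-row example typed in by hand, and the variable names are assumptions made for this illustration.

```python
from collections import Counter

# Transportation Mode column of the 10-row example, in table order.
modes = ["Bus", "Bus", "Bus", "Bus",
         "Car", "Car", "Car",
         "Train", "Train", "Train"]

counts = Counter(modes)   # number of rows per class
total = len(modes)        # total number of rows

# Probability of each class = class count / total rows.
class_probs = {label: count / total for label, count in counts.items()}

for label, prob in class_probs.items():
    print(f"Prob({label}) = {counts[label]} / {total} = {prob}")
```

The resulting probabilities match the hand calculation: 0.4 for Bus and 0.3 each for Car and Train.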
Having the probability of each class, we are now ready to compute the quantitative indices of impurity.