Slide 12.21: A decision tree application using scikit-learn

A Decision Tree Application Using scikit-learn

A decision tree is a flowchart-like model that helps you make decisions based on previous experience. In this example, a person tries to decide whether to go to a comedy show, based on the data saved in data.csv. From this data set, Python can build a decision tree that can be used to decide whether a new show is worth attending. The steps are:

Read the dataset with pandas.
To make a decision tree, all data has to be numerical, so we have to convert the non-numerical columns "Nationality" and "Go" into numerical values. A pandas column (Series) has a map() method that takes a dictionary describing how to convert the values. For example,
```
   { 'UK': 0, 'USA': 1, 'N': 2 }
```
which converts the values UK to 0, USA to 1, and N to 2.
Separate the feature columns from the target column. The feature columns are the columns that we try to predict from, and the target column is the column with the values we try to predict.
Create the decision tree and fit it to our data.
Use the decision tree to predict new values. For example: Should I go see a show starring a 40-year-old American comedian with 10 years of experience and a comedy ranking of 6?
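The conversion and separation steps above can be sketched as follows. This is a minimal illustration, not the lecture's exact program: the three rows are made-up stand-ins for data.csv, and only the column names and the two dictionaries come from the text.

```python
import pandas as pd

# Hypothetical rows standing in for data.csv (NOT the real file contents).
df = pd.DataFrame({
    "Age": [36, 42, 23],
    "Experience": [10, 12, 4],
    "Rank": [9, 4, 6],
    "Nationality": ["UK", "USA", "N"],
    "Go": ["NO", "NO", "YES"],
})

# Series.map replaces each value by looking it up in the dictionary.
# A value that is missing from the dictionary becomes NaN.
df["Nationality"] = df["Nationality"].map({"UK": 0, "USA": 1, "N": 2})
df["Go"] = df["Go"].map({"YES": 1, "NO": 0})

# Feature columns (inputs) and target column (the value to predict).
features = ["Age", "Experience", "Rank", "Nationality"]
X = df[features]
y = df["Go"]

print(X)
print(y)
```

Because unmapped values silently turn into NaN, it is worth checking the converted columns (for example with df.isna().sum()) before fitting a tree on real data.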

The decision tree can give different results from run to run, even when it is fed the same data. This is because scikit-learn considers the features in a randomly permuted order at each split, so when several candidate splits are equally good, the one that is chosen (and therefore the prediction) can differ between runs. Passing a fixed random_state to DecisionTreeClassifier makes the result reproducible. The Python source code for the decision tree application using scikit-learn is below:

# Three lines that let Python draw without a display (non-interactive Agg backend):
import sys
import matplotlib
matplotlib.use( 'Agg' )

import pandas
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

df = pandas.read_csv( "data.csv" )

d = { 'UK': 0, 'USA': 1, 'N': 2 }
df['Nationality'] = df['Nationality'].map( d )
d = { 'YES': 1, 'NO': 0 }
df['Go'] = df['Go'].map( d )

features = [ 'Age', 'Experience', 'Rank', 'Nationality' ]

X = df[ features ]
y = df[ 'Go' ]

dtree = DecisionTreeClassifier( )
dtree = dtree.fit( X, y )

# 
# Training: Uncomment the 3 commands, plot_tree, savefig, & flush, below:
#

# Three lines that draw the fitted tree and write the image to standard output.
#
#tree.plot_tree( dtree, feature_names=features )
#plt.savefig( sys.stdout.buffer )
#sys.stdout.flush( )

# Predict new values.
# Should I go see a show starring a 40-year-old American comedian
#   with 10 years of experience and a comedy ranking of 6?
# The values follow the feature order [Age, Experience, Rank, Nationality]; Nationality 1 means USA.

#
# Testing: Uncomment the 3 print commands below:
#

#print( dtree.predict( [[40, 10, 6, 1]] ) )
#print( "[1] means 'GO'" )
#print( "[0] means 'NO'" )
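
The run-to-run variation mentioned earlier can be removed by fixing the random seed. The sketch below assumes a small, made-up, already-numeric training table in place of data.csv; the helper name fit_and_predict is hypothetical. It also passes the new show as a DataFrame with the same column names used for training, which avoids the feature-name warning that scikit-learn emits for a plain list.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = ["Age", "Experience", "Rank", "Nationality"]

# Hypothetical training rows (NOT the real data.csv); last column is Go (1 = YES, 0 = NO).
train = pd.DataFrame(
    [
        [36, 10, 9, 0, 0],
        [42, 12, 4, 1, 0],
        [23, 4, 6, 2, 0],
        [52, 4, 4, 1, 0],
        [43, 21, 8, 1, 1],
        [44, 14, 5, 0, 0],
        [66, 3, 7, 2, 1],
        [35, 14, 9, 0, 1],
    ],
    columns=features + ["Go"],
)
X, y = train[features], train["Go"]


def fit_and_predict(seed):
    """Fit a tree with a fixed seed and predict for the show from the example."""
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X, y)
    # 40 years old, 10 years of experience, rank 6, Nationality 1 (USA).
    new_show = pd.DataFrame([[40, 10, 6, 1]], columns=features)
    return int(model.predict(new_show)[0])


first = fit_and_predict(0)
second = fit_and_predict(0)
print("same prediction on both runs:", first == second)
```

With the same seed and the same data, the two fits build the same tree, so the two predictions agree. Which answer the tree gives still depends on the training data.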