一、准备阶段
下面是在jupyter notebook中实现决策树的可视化,所以需要先在Anaconda Powershell Prompt中使用
pip install graphviz
安装相应的包
然后实现可视化需要用到graphviz文件,可以点击下方进入官网
-官网下载
下载完成安装之后需要添加环境变量,将graphviz安装目录下的bin文件夹添加到Path环境变量中,步骤如下:
下面验证是否安装成功,同样在Anaconda Powershell Prompt下输入dot -version
需要注意的是,这里的命令行是根据自己使用的编辑环境而定,例如:如果使用的是python的IDLE进行编写的话,就使用windows的命令行。
二、代码:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# 为了显示中文
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "decision_trees"
def image_path(fig_id):
return os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id)
def save_fig(fig_id, tight_layout=True):
print("Saving figure", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(image_path(fig_id) + ".png", format='png', dpi=300)
iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target
# 数据划分
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
# 定义决策树分类器
clf = DecisionTreeClassifier()
# 训练分类器
clf.fit(train_x, train_y)
# tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
# tree_clf.fit(X, y)
# 可视化决策树
from sklearn.tree import export_graphviz
export_graphviz(
# 决策树
clf,
# 保存文件路径
out_file=image_path("F:/人工智能与机器学习/1.dot"),
# 特征名字
feature_names=iris.feature_names[2:],
# 类别名
class_names=iris.target_names,
# 绘制带有圆角的框
rounded=True,
# 当设置为“true”时,绘制节点以指示分类、回归极值或节点纯度用于多输出。
filled=True
)
import graphviz
with open(image_path("F:/人工智能与机器学习/1.dot")) as f:
dot_graph = f.read()
dot=graphviz.Source(dot_graph)
dot.view()
‘Source.gv.pdf’
dot
#测试集上预测
predict = clf.predict(train_x)
sum = 0
for i in range(len(train_y)):
if predict[i]==train_y[i]:
sum +=1
print("sum:%s,len(train_y):%s"%(sum,len(train_y)))
print("预测率:%s"%(sum/len(train_y)))
sum:119,len(train_y):120
预测率:0.9916666666666667
#测试集上预测
predict2 = clf.predict(test_x)
sun = 0
for i in range(len(test_y)):
if predict[i]==test_y[i]:
sun +=1
print("sum:%s,len(test_y):%s"%(sum,len(test_y)))
print("预测率:%s"%(sum/len(test_y)))
sum:119,len(test_y):30
预测率:3.966666666666667
from sklearn.metrics import accuracy_score
y_pred = clf.predict(test_x)
print("测试集精确度:%s"%accuracy_score(test_y, y_pred))
测试集精确度:0.9666666666666667
y_pred2 = clf.predict(train_x)
print("训练集精确度:%s"%accuracy_score(train_y, y_pred2))
训练集精确度:0.9916666666666667
score = clf.score(test_x, test_y)
print("\n模型测试集准确率为:", score)
模型测试集准确率为: 0.9666666666666667
score = clf.score(train_x, train_y)
print("\n模型训练集准确率为:", score)
模型训练集准确率为: 0.9916666666666667