NumPy Study Notes (Part 3)
This exercise uses the iris dataset, .\iris.txt. The dataset covers three classes of iris plants: Iris Setosa, Iris Versicolour, and Iris Virginica. Each class has 50 samples, so the dataset contains 150 samples in total.
- sepallength: sepal length
- sepalwidth: sepal width
- petallength: petal length
- petalwidth: petal width
All four features are measured in centimeters (cm).
All operations are wrapped in functions in the irisData file, so the first step is to import that module.
>>> from irisData import *
>>>
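The file imports the numpy library. To make later operations easier, the headers of the five columns are mapped to their column indices.
import numpy as np
# Global variables: the attribute stored in each column
sepallength = 0 # sepal length
sepalwidth = 1 # sepal width
petallength = 2 # petal length
petalwidth = 3 # petal width
species = 4 # species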
1. Import the iris dataset, keeping the text intact.
[Topic: input and output]
- How do you import a dataset that contains both numbers and text?
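# Load the data
# dataPath: path to the dataset file
def loadData(dataPath):
    global irisData
    # dtype=object keeps numbers and text side by side; skiprows=1 skips the header row
    irisData = np.loadtxt(dataPath, dtype=object, delimiter=',', skiprows=1)
    return irisData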
>>> irisData = loadData("iris.txt")
>>> print(irisData[0:10])
[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
['4.9' '3.0' '1.4' '0.2' 'Iris-setosa']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa']
['5.0' '3.6' '1.4' '0.2' 'Iris-setosa']
['5.4' '3.9' '1.7' '0.4' 'Iris-setosa']
['4.6' '3.4' '1.4' '0.3' 'Iris-setosa']
['5.0' '3.4' '1.5' '0.2' 'Iris-setosa']
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa']
['4.9' '3.1' '1.5' '0.1' 'Iris-setosa']]
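Because every field is loaded as text, the numeric columns have to be converted before any arithmetic. The following is a minimal sketch of that split, assuming a comma-separated file with one header row as above. The inline sample and the names `sample`, `raw`, `features`, and `labels` are illustrative and not part of the irisData module.

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for iris.txt (header + 3 rows).
sample = io.StringIO(
    "sepallength,sepalwidth,petallength,petalwidth,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

# dtype=str keeps every field as text; skiprows=1 drops the header line.
raw = np.loadtxt(sample, dtype=str, delimiter=",", skiprows=1)

# Split into a float feature matrix and a text label vector.
features = raw[:, :4].astype(float)
labels = raw[:, 4]

print(features.shape)  # (3, 4)
print(labels)
```

Keeping the features in a float array avoids repeating `astype(float)` in every later function, at the cost of tracking the labels separately.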
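2. Compute the mean, median, and standard deviation of the iris sepal length (column 1, sepallength).
[Topic: statistics]
- How do you compute the mean, median, and standard deviation of a numpy array?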
# Compute the mean
# num: column index of the attribute, 0 to 3
def average(num):
    global irisData
    datas = irisData[:, num].astype(float)
    result = np.mean(datas)
    return result
# Compute the standard deviation
def stddev(num):
    global irisData
    datas = irisData[:, num].astype(float)
    result = np.std(datas)
    return result
# Compute the median
def median(num):
    global irisData
    datas = irisData[:, num].astype(float)
    result = np.median(datas)
    return result
>>> ave = average(sepallength)
>>> print(ave)
5.843333333333334
>>> std = stddev(sepallength)
>>> print(std)
0.8253012917851409
>>> med = median(sepallength)
>>> print(med)
5.8
>>>
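Note that `np.std` computes the population standard deviation by default (it divides by N), which is what the value above reflects. If the sample standard deviation (dividing by N - 1) is wanted instead, `ddof=1` can be passed. A small sketch with made-up values, not taken from the dataset:

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(values)             # divides by N (ddof=0, the default)
sample_std = np.std(values, ddof=1)  # divides by N - 1

print(pop_std)     # 2.0
print(sample_std)  # slightly larger than pop_std
```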
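3. Create a normalized form of the iris sepal length whose values lie between 0 and 1, so that the minimum is 0 and the maximum is 1 (column 1, sepallength).
[Topic: statistics]
- How do you normalize an array?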
# Normalize the data; this is min-max scaling, so the results fall in [0, 1]
def normalization(num):
    global irisData
    datas = irisData[:, num].astype(float)
    aMax = np.amax(datas)
    aMin = np.amin(datas)
    result = (datas - aMin) / (aMax - aMin)
    return result
>>> X = normalization(sepallength)
>>> print(X[0:10])
[0.22222222 0.16666667 0.11111111 0.08333333 0.19444444 0.30555556
0.08333333 0.19444444 0.02777778 0.16666667]
>>>
For more on normalization methods, see "Three Common Data Normalization Methods".
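For comparison, z-score standardization is another widely used rescaling. The sketch below puts it next to min-max scaling. The helper names `min_max` and `z_score` and the demo values are assumptions made for illustration, and the choice of z-score is not taken from the referenced article.

```python
import numpy as np

def min_max(x):
    """Rescale values linearly so the minimum is 0 and the maximum is 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

demo = np.array([4.3, 5.1, 5.8, 7.9])
scaled = min_max(demo)
standardized = z_score(demo)

print(scaled.min(), scaled.max())
print(standardized.mean(), standardized.std())
```

Min-max scaling fixes the range but is sensitive to a single extreme value; z-scores fix the mean and spread but leave the range unbounded.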
4. Set 20 random positions in the iris_data dataset to np.nan.
[Topic: random sampling]
- How do you modify values at random positions in an array?
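# Randomly replace n values in the data with np.nan
def swap(datas, n):
    # Row and column indices are drawn with replacement, so the same cell can be
    # picked twice and fewer than n distinct cells may change.
    rows = np.random.choice(datas.shape[0], size=n)
    cols = np.random.choice(datas.shape[1], size=n)
    datas[rows, cols] = np.nan
    return datas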
>>> X = swap(irisData, 20)
>>> print(X[0:10])
[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
['4.9' '3.0' '1.4' nan 'Iris-setosa']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa']
['5.0' '3.6' nan '0.2' 'Iris-setosa']
['5.4' '3.9' '1.7' '0.4' 'Iris-setosa']
['4.6' nan '1.4' '0.3' 'Iris-setosa']
['5.0' '3.4' '1.5' '0.2' 'Iris-setosa']
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa']
['4.9' nan '1.5' '0.1' 'Iris-setosa']]
>>>
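`np.random.choice` samples with replacement by default, so drawing row and column indices independently can hit the same cell more than once, and can also land in the species column. One possible alternative, sketched here under the assumption that the first four columns are the numeric ones, samples flat cell indices without replacement. The function name `set_random_nan` and its parameters are hypothetical.

```python
import numpy as np

def set_random_nan(data, n, n_numeric_cols=4, seed=None):
    """Return a copy of data with n distinct numeric cells set to np.nan."""
    rng = np.random.default_rng(seed)
    out = data.astype(object)  # astype returns a copy; object dtype holds nan next to strings
    n_rows = out.shape[0]
    # Sample flat indices without replacement, then map them back to (row, col).
    flat = rng.choice(n_rows * n_numeric_cols, size=n, replace=False)
    rows, cols = np.unravel_index(flat, (n_rows, n_numeric_cols))
    out[rows, cols] = np.nan
    return out

# Made-up table: ten identical rows in the same layout as the iris data.
demo = np.array([["5.1", "3.5", "1.4", "0.2", "Iris-setosa"]] * 10, dtype=object)
result = set_random_nan(demo, 20, seed=0)
nan_count = sum(1 for cell in result.ravel() if isinstance(cell, float))
print(nan_count)  # 20
```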
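5. Compute the correlation coefficient between sepalLength (column 1) and petalLength (column 3) in iris_data.
[Topic: statistics]
- How do you compute the correlation coefficient between two columns of a numpy array?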
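# Compute the correlation coefficient (here, Pearson's r)
# x, y: column indices of the two attributes
def pearson(x, y):
    global irisData
    X = irisData[:, x].astype(float)
    Y = irisData[:, y].astype(float)
    xMean = np.mean(X)
    yMean = np.mean(Y)
    # xStd and yStd are the norms of the centered columns; the 1/n factors cancel in the ratio
    xStd = np.sqrt(np.dot(X-xMean, X-xMean))
    yStd = np.sqrt(np.dot(Y-yMean, Y-yMean))
    result = np.dot(X-xMean, Y-yMean) / (xStd * yStd)
    return result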
>>> pear = pearson(sepallength, petallength)
>>> print(pear)
0.8717541573048712
For more on correlation coefficients, see "The Three Major Statistical Correlation Coefficients".
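A hand-written Pearson coefficient can be cross-checked against NumPy's built-in `np.corrcoef`, which returns the full correlation matrix. The sketch below uses a few made-up values; `pearson_manual` is an illustrative stand-alone helper, not the module's function.

```python
import numpy as np

def pearson_manual(x, y):
    """Pearson's r as the normalized dot product of the centered vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return np.dot(dx, dy) / np.sqrt(np.dot(dx, dx) * np.dot(dy, dy))

x = np.array([5.1, 4.9, 7.0, 6.4, 6.3])
y = np.array([1.4, 1.4, 4.7, 4.5, 6.0])

manual = pearson_manual(x, y)
builtin = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(manual, builtin)
```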
6. Show the petal length of iris_data (column 3) as a categorical variable.
[Topic: statistics]
- How do you convert numbers into a categorical (text) array?
# Show a column as a categorical variable; the bin edges are the trisection points of its range
def clfied(num):
    global irisData
    datas = irisData[:, num].astype(float)
    aMax = np.amax(datas)
    aMin = np.amin(datas)
    div1 = aMin + (aMax - aMin) / 3
    div2 = aMin + 2 * (aMax - aMin) / 3
    # aMax is left out of the bin edges so that the maximum falls in the 'large' bin
    binData = np.digitize(datas, [aMin, div1, div2])
    label_map = {1: 'small', 2: 'medium', 3: 'large'}
    result = [label_map[x] for x in binData]
    return result
>>> petal_length_cat = clfied(petallength)
>>> print(petal_length_cat[0:10])
['small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small', 'small']
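`np.digitize` returns, for each value, the index of the bin it falls into, and by default each bin includes its left edge but not its right edge. The sketch below uses made-up values and cut points to show where boundary values land. Passing only the inner cut points keeps the largest value in the last bin.

```python
import numpy as np

values = np.array([1.0, 2.9, 3.0, 4.9, 5.0, 6.9])
edges = [3.0, 5.0]                  # two inner cut points give three bins
codes = np.digitize(values, edges)  # 0: x < 3, 1: 3 <= x < 5, 2: x >= 5
labels = np.array(["small", "medium", "large"])[codes]

print(codes)   # [0 0 1 1 2 2]
print(labels)
```

Indexing an array of label names with the bin codes is a vectorized alternative to a dictionary lookup in a list comprehension.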
7. Create a new column in iris_data, where volume is (pi x petallength x sepallength ^ 2) / 3.
[Topic: array manipulation]
- How do you create a new column from existing columns of a numpy array?
# Add a new column to irisData with the value (pi * petallength * sepallength ^ 2) / 3
def newCul():
    global irisData
    sepalLength = irisData[:, 0].astype(float)
    petalLength = irisData[:, 2].astype(float)
    volume = (np.pi * petalLength * sepalLength ** 2) / 3
    # Turn the 1-D result into a column so it can be concatenated along axis 1
    volume = volume[:, np.newaxis]
    irisData = np.concatenate([irisData, volume], axis=1)
    return irisData
>>> Z = newCul()
>>> print(Z[0:10])
[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa' 38.13265162927291]
['4.9' '3.0' '1.4' '0.2' 'Iris-setosa' 35.200498485922445]
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa' 30.0723720777127]
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa' 33.238050274980004]
['5.0' '3.6' '1.4' '0.2' 'Iris-setosa' 36.65191429188092]
['5.4' '3.9' '1.7' '0.4' 'Iris-setosa' 51.911677007917746]
['4.6' '3.4' '1.4' '0.3' 'Iris-setosa' 31.022180256648003]
['5.0' '3.4' '1.5' '0.2' 'Iris-setosa' 39.269908169872416]
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa' 28.38324242763259]
['4.9' '3.1' '1.5' '0.1' 'Iris-setosa' 37.714819806345474]]
>>>
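As a possible alternative to reshaping with `np.newaxis` and then concatenating, `np.column_stack` accepts a 1-D array and treats it as a column. A minimal sketch on a made-up two-row table (the names `table` and `extended` are illustrative):

```python
import numpy as np

# Made-up table holding only sepal length and petal length as text.
table = np.array([["5.1", "1.4"], ["4.9", "1.4"]], dtype=object)
sepal = table[:, 0].astype(float)
petal = table[:, 1].astype(float)
volume = (np.pi * petal * sepal ** 2) / 3

# column_stack turns the 1-D volume array into a column before stacking.
extended = np.column_stack([table, volume])

print(extended.shape)  # (2, 3)
print(extended[0])
```

The result keeps the object dtype, so the text columns stay as strings while the new column holds floats.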
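8. Randomly sample iris species so that the number of Iris-setosa is twice the number of Iris-versicolor and of Iris-virginica.
[Topic: random sampling]
- How do you do probability sampling in numpy?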
# Randomly draw n species labels so that Iris-setosa is expected to appear
# twice as often as each of Iris-versicolor and Iris-virginica
def pickSpecies(n):
    species = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
    speciesOut = np.random.choice(species, n, p=[0.5, 0.25, 0.25])
    return speciesOut
>>> out = pickSpecies(20)
>>> print(out)
['Iris-setosa' 'Iris-virginica' 'Iris-setosa' 'Iris-versicolor'
'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
'Iris-versicolor' 'Iris-setosa' 'Iris-setosa']
>>>
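Sampling with probabilities gives the 2:1:1 ratio only on average. In the 20 draws above, for example, the counts are 9, 4, and 7. If an exact split is needed, one option is to build the labels explicitly and shuffle them. The sketch below shows both approaches; the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
names = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

# Probability sampling: the 2:1:1 ratio holds only in expectation.
draws = rng.choice(names, size=20, p=[0.5, 0.25, 0.25])
values, counts = np.unique(draws, return_counts=True)
print(values, counts)

# Exact 2:1:1 split: repeat each label a fixed number of times, then shuffle.
exact = rng.permutation(np.repeat(names, [10, 5, 5]))
exact_values, exact_counts = np.unique(exact, return_counts=True)
print(exact_values, exact_counts)
```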
9. Sort the dataset by the sepallength column.
[Topic: sorting]
- How do you sort a 2D array by a column?
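# Sort the dataset by one column
def sort(num):
    global irisData
    # Convert to float so values are compared numerically rather than as strings
    datas = irisData[:, num].astype(float)
    index = np.argsort(datas)
    result = irisData[index]
    return result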
>>> result = sort(sepallength)
>>> print(result[0:10])
[['4.3' '3.0' '1.1' '0.1' 'Iris-setosa']
['4.4' '3.2' '1.3' '0.2' 'Iris-setosa']
['4.4' '3.0' '1.3' '0.2' 'Iris-setosa']
['4.4' '2.9' '1.4' '0.2' 'Iris-setosa']
['4.5' '2.3' '1.3' '0.3' 'Iris-setosa']
['4.6' '3.6' '1.0' '0.2' 'Iris-setosa']
['4.6' '3.1' '1.5' '0.2' 'Iris-setosa']
['4.6' '3.4' '1.4' '0.3' 'Iris-setosa']
['4.6' '3.2' '1.4' '0.2' 'Iris-setosa']
['4.7' '3.2' '1.3' '0.2' 'Iris-setosa']]
>>>
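Sorting a column of strings compares them character by character. That matches numeric order only while every value has the same number of digits, as the sepal lengths here do. A small sketch with made-up values shows where the two orders diverge:

```python
import numpy as np

col = np.array(["4.3", "10.2", "5.0"], dtype=object)

as_text = np.argsort(col)                  # string order: "10.2" < "4.3" < "5.0"
as_number = np.argsort(col.astype(float))  # numeric order: 4.3 < 5.0 < 10.2

print(as_text)    # [1 0 2]
print(as_number)  # [0 2 1]
```

Indexing the full 2D array with the numeric `argsort` result reorders whole rows, which keeps each sample's features and label together.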