文章目录
- 1. Baseline
- 2. AUC评估要使用predict_proba
- 2.1 导入工具包
- 2.2 特征提取
- 2.3 训练+模型选择
- 2.4 网格/随机搜索 参数+提交
- 2.5 测试结果
- 3. 致谢
新人赛地址
1. Baseline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
train = pd.read_csv("./train_set.csv")
test = pd.read_csv("./test_set.csv")
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25317 entries, 0 to 25316
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 25317 non-null int64
1 age 25317 non-null int64
2 job 25317 non-null object
3 marital 25317 non-null object
4 education 25317 non-null object
5 default 25317 non-null object
6 balance 25317 non-null int64
7 housing 25317 non-null object
8 loan 25317 non-null object
9 contact 25317 non-null object
10 day 25317 non-null int64
11 month 25317 non-null object
12 duration 25317 non-null int64
13 campaign 25317 non-null int64
14 pdays 25317 non-null int64
15 previous 25317 non-null int64
16 poutcome 25317 non-null object
17 y 25317 non-null int64
dtypes: int64(9), object(9)
memory usage: 3.5+ MB
NO | 字段名称 | 数据类型 | 字段描述 |
---|---|---|---|
1 | ID | Int | 客户唯一标识 |
2 | age | Int | 客户年龄 |
3 | job | String | 客户的职业 |
4 | marital | String | 婚姻状况 |
5 | education | String | 受教育水平 |
6 | default | String | 是否有违约记录 |
7 | balance | Int | 每年账户的平均余额 |
8 | housing | String | 是否有住房贷款 |
9 | loan | String | 是否有个人贷款 |
10 | contact | String | 与客户联系的沟通方式 |
11 | day | Int | 最后一次联系的时间(几号) |
12 | month | String | 最后一次联系的时间(月份) |
13 | duration | Int | 最后一次联系的交流时长 |
14 | campaign | Int | 在本次活动中,与该客户交流过的次数 |
15 | pdays | Int | 距离上次活动最后一次联系该客户,过去了多久(999表示没有联系过) |
16 | previous | Int | 在本次活动之前,与该客户交流过的次数 |
17 | poutcome | String | 上一次活动的结果 |
18 | y | Int | 预测客户是否会订购定期存款业务 |
- 相关系数
abs(train.corr()['y']).sort_values(ascending=False)
y 1.000000
ID 0.556627
duration 0.394746
pdays 0.107565
previous 0.088337
campaign 0.075173
balance 0.057564
day 0.031886
age 0.029916
Name: y, dtype: float64
- 绘制数字特征分布图
s = (train.dtypes == 'object')
object_col = list(s[s].index)
object_col
num_col = list(set(train.columns) - set(object_col))
plt.figure(figsize=(25,22))
for (i,col) in enumerate(num_col):
plt.subplot(3,3,i+1)
sns.distplot(train[col]) # kde=False 可不显示密度线
plt.xlabel(col,size=20)
plt.show()
- 分析下训练集
y
标签的比例
len(train[train['y']==1])/len(train['y'])
0.11695698542481336
只有 11% 的人会购买
- 新人赛,数据没有缺失的,直接用模型试试效果
X_train = train.drop(['ID','y'], axis=1)
X_test = test.drop(['ID'], axis=1)
y_train = train['y']
def num_cat_splitor(X_train):
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
num_cols = list(set(X_train.columns) - set(object_cols))
return num_cols, object_cols
num_cols, object_cols = num_cat_splitor(X_train)
# 查看文字变量的种类
for col in object_col:
print(col, sorted(train[col].unique()))
print(col, sorted(test[col].unique()))
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_cols)),
#('imputer', SimpleImputer(strategy="median")),
('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(object_cols)),
('cat_encoder', OneHotEncoder(sparse=False,handle_unknown='ignore')),
])
full_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
X_prepared = full_pipeline.fit_transform(X_train)
from sklearn.ensemble import RandomForestClassifier
prepare_select_and_predict_pipeline = Pipeline([
('preparation', full_pipeline),
('forst_reg', RandomForestClassifier(random_state=0))
])
param_grid = [{
'forst_reg__n_estimators' : [50,100, 150, 200,250,300,330,350],
'forst_reg__max_features':[45,50, 55, 65]
}]
grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=7,
scoring='roc_auc', verbose=2, n_jobs=-1)
grid_search_prep.fit(X_train,y_train)
grid_search_prep.best_params_
final_model = grid_search_prep.best_estimator_
y_pred_test = final_model.predict(X_test)
# AUC 评估准则,需要使用 predict_proba,这里错误!!!
result = pd.DataFrame()
result['ID'] = test['ID']
result['pred'] = y_pred_test
result.to_csv('buy_product_pred.csv',index=False)
排名结果
auc 得分:0.72439844
2. AUC评估要使用predict_proba
AUC 指标,在预测时,应该使用概率来预测,上面做法是错误的(未使用概率预测)。
- 机器学习之分类器性能指标之ROC曲线、AUC值 https://www.cnblogs.com/dlml/p/4403482.html
- 如何理解机器学习和统计中的AUC? https://www.zhihu.com/question/39840928
- sklearn.metrics.roc_auc_score 介绍
AUC 评估模型的优点,在模型正负样本比例失衡的情况下,依然可以很好的评估模型
以下重新对代码进行优化
2.1 导入工具包
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.facecolor']=(1,1,1,1) # pycharm 绘图白底,看得清坐标
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
train = pd.read_csv("./train_set.csv")
test = pd.read_csv("./test_set.csv")
2.2 特征提取
- 查看文字特征的值
for col in object_col:
print(col, sorted(train[col].unique()))
print(col, sorted(test[col].unique()))
job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
job ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown']
marital ['divorced', 'married', 'single']
marital ['divorced', 'married', 'single']
education ['primary', 'secondary', 'tertiary', 'unknown']
education ['primary', 'secondary', 'tertiary', 'unknown']
default ['no', 'yes']
default ['no', 'yes']
housing ['no', 'yes']
housing ['no', 'yes']
loan ['no', 'yes']
loan ['no', 'yes']
contact ['cellular', 'telephone', 'unknown']
contact ['cellular', 'telephone', 'unknown']
month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
month ['apr', 'aug', 'dec', 'feb', 'jan', 'jul', 'jun', 'mar', 'may', 'nov', 'oct', 'sep']
poutcome ['failure', 'other', 'success', 'unknown']
poutcome ['failure', 'other', 'success', 'unknown']
- 二值特征转化为 0, 1
# 对 'default','housing','loan' 3列二值(yes,no)特征转为 0,1
def binaryFeature(data):
data['default_']=0
data['default_'][data['default']=='yes'] = 1
data['housing_']=0
data['housing_'][data['housing']=='yes'] = 1
data['loan_']=0
data['loan_'][data['loan']=='yes'] = 1
return data.drop(['default','housing','loan'], axis=1)
X_train = binaryFeature(train)
X_test = binaryFeature(test)
- 训练集数据切分,用于本地测试
X_train = X_train.drop(['ID'], axis=1)
X_test = X_test.drop(['ID'], axis=1)
# 将训练集拆分一些出来做验证, 分层抽样
from sklearn.model_selection import StratifiedShuffleSplit
splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
for train_idx, vaild_idx in splt.split(X_train, X_train['y']):
train_part = X_train.loc[train_idx]
valid_part = X_train.loc[vaild_idx]
# 训练集拆成两部分 本地测试
train_part_y = train_part['y']
valid_part_y = valid_part['y']
train_part = train_part.drop(['y'], axis=1)
valid_part = valid_part.drop(['y'], axis=1)
- 特征处理管道
def num_cat_splitor(X_train):
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
num_cols = list(set(X_train.columns) - set(object_cols))
return num_cols, object_cols
num_cols, object_cols = num_cat_splitor(X_train)
num_cols.remove('y')
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_cols)),
# ('imputer', SimpleImputer(strategy="median")),
# ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(object_cols)),
('cat_encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),
])
full_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
2.3 训练+模型选择
# 本地测试,选模型
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
rf = RandomForestClassifier()
knn = KNeighborsClassifier()
lr = LogisticRegression()
svc = SVC(probability=True)
gbdt = GradientBoostingClassifier()
models = [knn, lr, svc, rf, gbdt]
param_grid_list = [
# knn
[{
'model__n_neighbors' : [5,15,35,50,100],
'model__leaf_size' : [10,20,30,40,50]
}],
# lr
[{
'model__penalty' : ['l1', 'l2'],
'model__C' : [0.2, 0.5, 1, 1.2, 1.5],
'model__max_iter' : [10000]
}],
# svc
[{
'model__C' : [0.2, 0.5, 1, 1.2],
'model__kernel' : ['rbf']
}],
# rf
[{
# 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
'model__n_estimators' : [200,250,300,330,350],
'model__max_features' : [20,30,40,50],
'model__max_depth' : [5,7]
}],
# gbdt
[{
'model__learning_rate' : [0.1, 0.5],
'model__n_estimators' : [130, 200, 300],
'model__max_features' : ['sqrt'],
'model__max_depth' : [5,7],
'model__min_samples_split' : [500,1000,1200],
'model__min_samples_leaf' : [60, 100],
'model__subsample' : [0.8, 1]
}],
]
for i, model in enumerate(models):
pipe = Pipeline([
('preparation', full_pipeline),
('model', model)
])
grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,
scoring='roc_auc', verbose=2, n_jobs=-1)
grid_search.fit(train_part, train_part_y)
print(grid_search.best_params_)
final_model = grid_search.best_estimator_
pred = final_model.predict_proba(valid_part)[:,1] # roc 必须使用概率预测
print("auc score: ", roc_auc_score(valid_part_y, pred))
- 注意 AUC 评分标准 要使用
predict_proba
方法 !!!
Fitting 3 folds for each of 25 candidates, totalling 75 fits
{'model__leaf_size': 20, 'model__n_neighbors': 50}
auc score: 0.8212256518034133
Fitting 3 folds for each of 10 candidates, totalling 30 fits
{'model__C': 1.2, 'model__max_iter': 10000, 'model__penalty': 'l2'}
auc score: 0.9011510812019533
Fitting 3 folds for each of 4 candidates, totalling 12 fits
{'model__C': 0.2, 'model__kernel': 'rbf'}
auc score: 0.7192431208601267
Fitting 3 folds for each of 40 candidates, totalling 120 fits
{'model__max_depth': 7, 'model__max_features': 20, 'model__n_estimators': 350}
auc score: 0.913398647137746
Fitting 3 folds for each of 144 candidates, totalling 432 fits
{'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__max_features': 'sqrt', 'model__min_samples_leaf': 60, 'model__min_samples_split': 500, 'model__n_estimators': 300, 'model__subsample': 1}
auc score: 0.9299485084368806
可以看见 GBDT 梯度提升下降树模型表现最好
2.4 网格/随机搜索 参数+提交
微调参数列表,使用全部的训练数据训练,使用 RF 和 GBDT 模型 对测试集进行预测
- 网格搜索
# 全量训练,网格搜索,提交
y_train = X_train['y']
X_train_ = X_train.drop(['y'], axis=1)
select_model = [rf, gbdt]
param_grid_list = [
# rf
[{
# 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
'model__n_estimators' : [250,300,350,400],
'model__max_features' : [7,8,10,15,20],
'model__max_depth' : [7,9,10,11]
}],
# gbdt
[{
'model__learning_rate' : [0.03, 0.05, 0.1],
'model__n_estimators' : [200, 300, 350],
'model__max_features' : ['sqrt'],
'model__max_depth' : [7,9,11],
'model__min_samples_split' : [300, 400, 500],
'model__min_samples_leaf' : [50,60,70],
'model__subsample' : [0.8, 1, 1.2]
}],
]
for i, model in enumerate(select_model):
pipe = Pipeline([
('preparation', full_pipeline),
('model', model)
])
grid_search = GridSearchCV(pipe, param_grid_list[i], cv=3,
scoring='roc_auc', verbose=2, n_jobs=-1)
grid_search.fit(X_train_, y_train)
print(grid_search.best_params_)
final_model = grid_search.best_estimator_
pred = final_model.predict_proba(X_test)[:,1] # roc 必须使用概率预测
print(model,'\n finished!')
result = pd.DataFrame()
result['ID'] = test['ID']
result['pred'] = pred
result.to_csv('{}_pred.csv'.format(i), index=False)
Fitting 3 folds for each of 80 candidates, totalling 240 fits
{'model__max_depth': 11, 'model__max_features': 15, 'model__n_estimators': 400}
RandomForestClassifier()
finished!
Fitting 3 folds for each of 729 candidates, totalling 2187 fits
{'model__learning_rate': 0.05, 'model__max_depth': 11, 'model__max_features': 'sqrt',
'model__min_samples_leaf': 50, 'model__min_samples_split': 500,
'model__n_estimators': 300, 'model__subsample': 1}
GradientBoostingClassifier()
finished!
- 随机搜索
# 随机搜索参数
y_train = X_train['y']
X_train_ = X_train.drop(['y'], axis=1)
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
select_model = [rf, gbdt]
param_distribs = [
# rf
[{
# 'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
'model__n_estimators' : randint(low=250, high=500),
'model__max_features' : randint(low=10, high=30),
'model__max_depth' : randint(low=8, high=20)
}],
# gbdt
[{
'model__learning_rate' : np.linspace(0.01, 0.1, 10),
'model__n_estimators' : randint(low=250, high=500),
'model__max_features' : ['sqrt'],
'model__max_depth' : randint(low=8, high=20),
'model__min_samples_split' : randint(low=400, high=1000),
'model__min_samples_leaf' : randint(low=40, high=80),
'model__subsample' : np.linspace(0.5, 1.5, 10)
}],
]
for i, model in enumerate(select_model):
pipe = Pipeline([
('preparation', full_pipeline),
('model', model)
])
rand_search = RandomizedSearchCV(pipe, param_distributions=param_distribs[i], cv=3,
n_iter=20,scoring='roc_auc', verbose=2, n_jobs=-1)
rand_search.fit(X_train_, y_train)
print(rand_search.best_params_)
final_model = rand_search.best_estimator_
pred = final_model.predict_proba(X_test)[:,1] # roc 必须使用概率预测
print(model,'\n finished!')
result = pd.DataFrame()
result['ID'] = test['ID']
result['pred'] = pred
result.to_csv('{}_pred.csv'.format(i), index=False)
Fitting 3 folds for each of 20 candidates, totalling 60 fits
{'model__max_depth': 18, 'model__max_features': 13, 'model__n_estimators': 481}
RandomForestClassifier()
finished!
Fitting 3 folds for each of 20 candidates, totalling 60 fits
{'model__learning_rate': 0.05000000000000001, 'model__max_depth': 15, 'model__max_features': 'sqrt',
'model__min_samples_leaf': 68, 'model__min_samples_split': 905,
'model__n_estimators': 362, 'model__subsample': 0.9444444444444444}
GradientBoostingClassifier()
finished!
2.5 测试结果
RF 模型得分:0.9229160811692528
GBDT 模型得分:0.9332932318964199
第二期排名,暂列第8
3. 致谢
感谢徐师兄一直以来的指点和帮助!
欢迎大家一起分享练习心得,一起继续加油!