A Logistic Regression Classifier for the Breast Cancer Dataset

  • Published: 2019-10-22
  • Difficulty: Hard
  • Category: Classification and Prediction, Logistic Regression
  • Tags: Python, scikit-learn, Logistic Regression, breast cancer dataset

1. Problem Description

Use the LogisticRegression class to classify the breast cancer dataset, then try adding polynomial features to improve the model.

2. Implementation

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Load the breast cancer dataset
cancer = load_breast_cancer()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=0)
# Train the model (liblinear was the default solver when this post was written;
# newer scikit-learn defaults to lbfgs, which may emit convergence warnings on
# this unscaled data)
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Evaluate the results
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print("train_score:%s" % train_score)
print("test_score:%s" % test_score)
print(classification_report(y_test,y_pred))
train_score:0.957286432161
test_score:0.964912280702
             precision    recall  f1-score   support

          0       0.93      0.98      0.95        63
          1       0.99      0.95      0.97       108

avg / total       0.97      0.96      0.97       171
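The classification report above summarizes precision and recall per class; to see the raw error counts behind those figures, a confusion matrix makes them explicit. A minimal sketch, reusing clf, X_test, and y_test from the cell above:

from sklearn.metrics import confusion_matrix

# Rows are true classes (0 = malignant, 1 = benign), columns are predictions.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)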

In [4]:
# For a binary classification problem, LogisticRegression outputs two
# probabilities per sample: the probability of class 0 and the probability of
# class 1, and predicts whichever class has the higher probability. In other
# words, the closer that probability is to 0.5, the less "confident" the
# prediction.
# Print the predicted probabilities
y_pred_p = clf.predict_proba(X_test)
print(y_pred_p[0])
[ 0.99287819  0.00712181]
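These probabilities come from applying the sigmoid (logistic) function to the model's linear decision score. As a sanity check, the first sample's probability of class 1 can be recovered from decision_function by hand (a sketch, reusing clf and X_test from above):

# decision_function returns the log-odds w·x + b for class 1;
# the sigmoid maps it to P(y=1 | x).
score = clf.decision_function(X_test[:1])
p1 = 1.0 / (1.0 + np.exp(-score))
print(p1)  # should match y_pred_p[0, 1]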
In [5]:
# To better understand the data, we can find the test samples whose top
# predicted probability is below 80%. First compute the predicted probabilities
# for every test sample; each sample gets two numbers, the probability of
# "malignant" (class 0) and of "benign" (class 1). Select the samples whose
# malignant probability exceeds 0.2, then, within that subset, those whose
# benign probability also exceeds 0.2.
# Find the samples with top probability below 80%, i.e. both columns > 0.2
y_pred_0 = y_pred_p[:, 0] > 0.2
result = y_pred_p[y_pred_0]
y_pred_1 = result[:, 1] > 0.2
print(result[y_pred_1])
[[ 0.20406213  0.79593787]
 [ 0.49585809  0.50414191]
 [ 0.77393894  0.22606106]
 [ 0.25547513  0.74452487]
 [ 0.41280044  0.58719956]
 [ 0.73974084  0.26025916]
 [ 0.61370663  0.38629337]
 [ 0.77701569  0.22298431]
 [ 0.62624437  0.37375563]
 [ 0.28338046  0.71661954]
 [ 0.79173367  0.20826633]
 [ 0.20526775  0.79473225]
 [ 0.63737015  0.36262985]
 [ 0.28925078  0.71074922]
 [ 0.37287551  0.62712449]]
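The two-step filter above can also be written as a single boolean mask: since the two columns sum to 1, the top probability is below 0.8 exactly when both columns exceed 0.2. An equivalent one-liner, a sketch over the same y_pred_p:

# Keep rows where every class probability exceeds 0.2,
# i.e. the model is less than 80% confident.
uncertain = y_pred_p[(y_pred_p > 0.2).all(axis=1)]
print(uncertain)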
In [7]:
# Next, try to improve the model by adding polynomial features

# Model optimization
# Build a pipeline that expands polynomial features before the classifier
def polymodel(degree=1, penalty='l2'):
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    # liblinear supports both the 'l1' and 'l2' penalties
    logistic_regression = LogisticRegression(penalty=penalty, solver='liblinear')
    pipeline = Pipeline([('poly_features', poly_features), ('logistic_regression', logistic_regression)])
    return pipeline

# Add degree-2 polynomial features with an L1 penalty
clf_poly = polymodel(degree=2, penalty='l1')
clf_poly = clf_poly.fit(X_train, y_train)
train_score = clf_poly.score(X_train, y_train)
test_score = clf_poly.score(X_test, y_test)
# Evaluate the results
print("poly_train_score:%s" % train_score)
print("poly_test_score:%s" % test_score)
poly_train_score:0.992462311558
poly_test_score:0.964912280702
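The degree and penalty were fixed by hand here; in practice both are hyperparameters worth searching. A hedged sketch using GridSearchCV over the same pipeline (the parameter names follow the Pipeline step names defined above; cv=5 and the C grid are arbitrary illustrative choices, not values from this post):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'poly_features__degree': [1, 2],
    'logistic_regression__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(polymodel(penalty='l1'), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)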
In [8]:
# Inspect which features survived the L1 penalty
logistic_regression = clf_poly.named_steps['logistic_regression']
print("parameters:{0}".format(logistic_regression.coef_.shape))
print("effective parameters:{0}".format(np.count_nonzero(logistic_regression.coef_)))

# The output shows that adding degree-2 polynomial features expands the input
# to 495 features, most of which the L1 penalty drops, leaving only 114
# effective features
parameters:(1, 495)
effective parameters:114
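To see which of the 114 surviving features those are, the nonzero-coefficient mask can be mapped back to the expanded feature names. A sketch (get_feature_names_out requires scikit-learn >= 1.0; older versions expose get_feature_names instead):

poly = clf_poly.named_steps['poly_features']
names = poly.get_feature_names_out(cancer.feature_names)
mask = logistic_regression.coef_.ravel() != 0
# Print the polynomial features the L1 penalty kept.
print(names[mask])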