A Logistic Regression Classifier for the Breast Cancer Dataset

  • Published: 2019-10-22
  • Difficulty: Hard
  • Category: Classification and Prediction, Logistic Regression
  • Tags: Python, scikit-learn, Logistic Regression, breast cancer dataset

1. Problem Description

Use the LogisticRegression class to classify the breast cancer dataset, then try adding polynomial features to improve the model.

2. Implementation

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Load the breast cancer dataset
cancer = load_breast_cancer()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=0)
# Train the model (liblinear was the default solver when this post was written;
# newer scikit-learn defaults to lbfgs, which may emit convergence warnings on
# this unscaled data)
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Evaluate the results
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print("train_score:%s" % train_score)
print("test_score:%s" % test_score)
print(classification_report(y_test,y_pred))
train_score:0.957286432161
test_score:0.964912280702
             precision    recall  f1-score   support

          0       0.93      0.98      0.95        63
          1       0.99      0.95      0.97       108

avg / total       0.97      0.96      0.97       171
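The classification report above summarizes precision and recall per class; to see the raw error counts behind those figures, a confusion matrix makes them explicit. A minimal sketch, reusing clf, X_test, and y_test from the cell above:

from sklearn.metrics import confusion_matrix

# Rows are true classes (0 = malignant, 1 = benign), columns are predictions.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)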

In [4]:
# For a binary classification problem, LogisticRegression outputs two
# probabilities per sample: the probability of class 0 and the probability of
# class 1, and predicts whichever class has the higher probability. In other
# words, the closer that probability is to 0.5, the less "confident" the
# prediction.
# Print the predicted probabilities
y_pred_p = clf.predict_proba(X_test)
print(y_pred_p[0])
[ 0.99287819  0.00712181]
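These probabilities come from applying the sigmoid (logistic) function to the model's linear decision score. As a sanity check, the first sample's probability of class 1 can be recovered from decision_function by hand (a sketch, reusing clf and X_test from above):

# decision_function returns the log-odds w·x + b for class 1;
# the sigmoid maps it to P(y=1 | x).
score = clf.decision_function(X_test[:1])
p1 = 1.0 / (1.0 + np.exp(-score))
print(p1)  # should match y_pred_p[0, 1]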
In [5]:
# To better understand the data, we can find the test samples whose top
# predicted probability is below 80%. First compute the predicted probabilities
# for every test sample; each sample gets two numbers, the probability of
# "malignant" (class 0) and of "benign" (class 1). Select the samples whose
# malignant probability exceeds 0.2, then, within that subset, those whose
# benign probability also exceeds 0.2.
# Find the samples with top probability below 80%, i.e. both columns > 0.2
y_pred_0 = y_pred_p[:, 0] > 0.2
result = y_pred_p[y_pred_0]
y_pred_1 = result[:, 1] > 0.2
print(result[y_pred_1])
[[ 0.20406213  0.79593787]
 [ 0.49585809  0.50414191]
 [ 0.77393894  0.22606106]
 [ 0.25547513  0.74452487]
 [ 0.41280044  0.58719956]
 [ 0.73974084  0.26025916]
 [ 0.61370663  0.38629337]
 [ 0.77701569  0.22298431]
 [ 0.62624437  0.37375563]
 [ 0.28338046  0.71661954]
 [ 0.79173367  0.20826633]
 [ 0.20526775  0.79473225]
 [ 0.63737015  0.36262985]
 [ 0.28925078  0.71074922]
 [ 0.37287551  0.62712449]]
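The two-step filter above can also be written as a single boolean mask: since the two columns sum to 1, the top probability is below 0.8 exactly when both columns exceed 0.2. An equivalent one-liner, a sketch over the same y_pred_p:

# Keep rows where every class probability exceeds 0.2,
# i.e. the model is less than 80% confident.
uncertain = y_pred_p[(y_pred_p > 0.2).all(axis=1)]
print(uncertain)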
In [7]:
# Next, try to improve the model by adding polynomial features

# Model optimization
# Build a pipeline that expands polynomial features before the classifier
def polymodel(degree=1, penalty='l2'):
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    # liblinear supports both the 'l1' and 'l2' penalties
    logistic_regression = LogisticRegression(penalty=penalty, solver='liblinear')
    pipeline = Pipeline([('poly_features', poly_features), ('logistic_regression', logistic_regression)])
    return pipeline

# Add degree-2 polynomial features with an L1 penalty
clf_poly = polymodel(degree=2, penalty='l1')
clf_poly = clf_poly.fit(X_train, y_train)
train_score = clf_poly.score(X_train, y_train)
test_score = clf_poly.score(X_test, y_test)
# Evaluate the results
print("poly_train_score:%s" % train_score)
print("poly_test_score:%s" % test_score)
poly_train_score:0.992462311558
poly_test_score:0.964912280702
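The degree and penalty were fixed by hand here; in practice both are hyperparameters worth searching. A hedged sketch using GridSearchCV over the same pipeline (the parameter names follow the Pipeline step names defined above; cv=5 and the C grid are arbitrary illustrative choices, not values from this post):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'poly_features__degree': [1, 2],
    'logistic_regression__C': [0.1, 1.0, 10.0],
}
search = GridSearchCV(polymodel(penalty='l1'), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)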
In [8]:
# Inspect which features survived the L1 penalty
logistic_regression = clf_poly.named_steps['logistic_regression']
print("parameters:{0}".format(logistic_regression.coef_.shape))
print("effective parameters:{0}".format(np.count_nonzero(logistic_regression.coef_)))

# The output shows that adding degree-2 polynomial features expands the input
# to 495 features, most of which the L1 penalty drops, leaving only 114
# effective features
parameters:(1, 495)
effective parameters:114
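To see which of the 114 surviving features those are, the nonzero-coefficient mask can be mapped back to the expanded feature names. A sketch (get_feature_names_out requires scikit-learn >= 1.0; older versions expose get_feature_names instead):

poly = clf_poly.named_steps['poly_features']
names = poly.get_feature_names_out(cancer.feature_names)
mask = logistic_regression.coef_.ravel() != 0
# Print the polynomial features the L1 penalty kept.
print(names[mask])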