- 🍨 This post is a learning-record blog for the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
- 🍖 Original author: K同学啊
I. Notes on the Improvements
To investigate how individual features affect the model's results, we carried out a feature analysis.
Feature Selection
1. SelectFromModel
- How it works: a model-based feature-selection method. It uses a base estimator's (e.g., a decision tree's) feature_importances_ or coef_ attribute to score each feature and keeps those whose importance exceeds a preset threshold (see the sketch after this list).
- When to use: quick feature filtering based on a pre-trained model's importance scores.
- Pros: simple to use; good for fast feature screening.
- Cons: you cannot directly control the final number of selected features, so it is less precise than RFE.
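A minimal sketch of SelectFromModel, assuming X is the feature DataFrame and y the labels; the estimator and threshold below are illustrative choices, not the ones used later in this post:
```python
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Any estimator that exposes feature_importances_ or coef_ can serve as the base.
base = DecisionTreeClassifier(random_state=42)
selector = SelectFromModel(base, threshold="median")  # keep features above the median importance
selector.fit(X, y)
X_filtered = selector.transform(X)
print(X.columns[selector.get_support()])  # names of the kept features
```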
2. RFE (Recursive Feature Elimination)
- How it works: recursively removes the least important features. After each model fit, the lowest-scoring feature is dropped, and the process repeats until the specified number of features remains.
- When to use: when you need precise control over the final number of features.
- Pros: lets you control exactly how many features are kept, and each iteration re-evaluates the overall contribution of all remaining features.
- Cons: computationally expensive, since the model must be retrained many times, especially on large datasets or with complex models.
Summary
- SelectFromModel: suited to quickly simplifying a model with a predefined importance threshold.
- RFE: suited to cases where you need direct control over the final feature count and can accept the higher computational cost.
Methods Used in This Post
The code applies two feature-selection approaches:
RFE (Recursive Feature Elimination)
RFE recursively removes the least important features until the specified number remains. Here it selects the top 20 features from the dataset:
```python
from sklearn.feature_selection import RFE

# Use RFE to select features
rfe_selector = RFE(estimator=tree, n_features_to_select=20)  # keep the top 20 features
rfe_selector.fit(X, y)
X_new = rfe_selector.transform(X)

feature_names = np.array(X.columns)
selected_feature_names = feature_names[rfe_selector.support_]
print(selected_feature_names)
```
Manually Specified Features
The code also hand-picks a feature list, feature_selection, and uses these 20 specific features as the final feature set for the subsequent modeling:
```python
feature_selection = ['年龄', '种族', '教育水平', '身体质量指数(BMI)', '酒精摄入量', '体育活动时间', '饮食质量评分',
                     '睡眠质量评分', '心血管疾病', '收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)',
                     '高密度脂蛋白胆固醇(HDL)', '甘油三酯', '简易精神状态检查(MMSE)得分', '功能评估得分',
                     '记忆抱怨', '行为问题', '日常生活活动(ADL)得分']
X = data_df[feature_selection]
```
In short, the code uses RFE to select the top 20 features automatically, and then trains and predicts with a manually specified list of 20 features. The two approaches can be combined or used on their own, depending on the data and the task. A small sketch comparing the two selections follows.
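A hedged comparison sketch, assuming both selected_feature_names (from the RFE snippet above) and the manual feature_selection list are in scope:
```python
# How much do the two 20-feature selections agree? (illustrative only)
rfe_set = set(selected_feature_names)
manual_set = set(feature_selection)
print("agreed on:", sorted(rfe_set & manual_set))
print("manual only:", sorted(manual_set - rfe_set))
print("RFE only:", sorted(rfe_set - manual_set))
```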
II. Code Implementation
1. Import Libraries
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import LabelEncoder
```
2. Load the Data
```python
plt.rcParams["font.sans-serif"] = ["Microsoft YaHei"]  # display Chinese characters
plt.rcParams['axes.unicode_minus'] = False             # display minus signs

data_df = pd.read_csv("./data/alzheimers_disease_data.csv")
# data_df.head()

data_df.rename(columns={
    "Age": "年龄", "Gender": "性别", "Ethnicity": "种族", "EducationLevel": "教育水平",
    "BMI": "身体质量指数(BMI)", "Smoking": "吸烟情况", "AlcoholConsumption": "酒精摄入量",
    "PhysicalActivity": "体育活动时间", "DietQuality": "饮食质量评分", "SleepQuality": "睡眠质量评分",
    "FamilyHistoryAlzheimers": "家族阿尔茨海默病史", "CardiovascularDisease": "心血管疾病",
    "Diabetes": "糖尿病", "Depression": "抑郁症史", "HeadInjury": "头部受伤", "Hypertension": "高血压",
    "SystolicBP": "收缩压", "DiastolicBP": "舒张压", "CholesterolTotal": "胆固醇总量",
    "CholesterolLDL": "低密度脂蛋白胆固醇(LDL)", "CholesterolHDL": "高密度脂蛋白胆固醇(HDL)",
    "CholesterolTriglycerides": "甘油三酯", "MMSE": "简易精神状态检查(MMSE)得分",
    "FunctionalAssessment": "功能评估得分", "MemoryComplaints": "记忆抱怨",
    "BehavioralProblems": "行为问题", "ADL": "日常生活活动(ADL)得分",
    "Confusion": "混乱与定向障碍", "Disorientation": "迷失方向", "PersonalityChanges": "人格变化",
    "DifficultyCompletingTasks": "完成任务困难", "Forgetfulness": "健忘",
    "Diagnosis": "诊断状态", "DoctorInCharge": "主诊医生"
}, inplace=True)

print(data_df.columns)
```
```
Index(['PatientID', '年龄', '性别', '种族', '教育水平', '身体质量指数(BMI)', '吸烟情况', '酒精摄入量',
       '体育活动时间', '饮食质量评分', '睡眠质量评分', '家族阿尔茨海默病史', '心血管疾病', '糖尿病', '抑郁症史',
       '头部受伤', '高血压', '收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)',
       '高密度脂蛋白胆固醇(HDL)', '甘油三酯', '简易精神状态检查(MMSE)得分', '功能评估得分', '记忆抱怨', '行为问题',
       '日常生活活动(ADL)得分', '混乱与定向障碍', '迷失方向', '人格变化', '完成任务困难', '健忘', '诊断状态',
       '主诊医生'],
      dtype='object')
```
3. Data Processing
```python
print(data_df.isnull().sum())
```
```
PatientID 0
年龄 0
性别 0
种族 0
教育水平 0
身体质量指数(BMI) 0
吸烟情况 0
酒精摄入量 0
体育活动时间 0
饮食质量评分 0
睡眠质量评分 0
家族阿尔茨海默病史 0
心血管疾病 0
糖尿病 0
抑郁症史 0
头部受伤 0
高血压 0
收缩压 0
舒张压 0
胆固醇总量 0
低密度脂蛋白胆固醇(LDL) 0
高密度脂蛋白胆固醇(HDL) 0
甘油三酯 0
简易精神状态检查(MMSE)得分 0
功能评估得分 0
记忆抱怨 0
行为问题 0
日常生活活动(ADL)得分 0
混乱与定向障碍 0
迷失方向 0
人格变化 0
完成任务困难 0
健忘 0
诊断状态 0
主诊医生 0
```
There are no missing values. Next, label-encode the only non-numeric column:
```python
# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Label-encode the non-numeric column
data_df['主诊医生'] = label_encoder.fit_transform(data_df['主诊医生'])
data_df.head()
```
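If you want to see how the encoder mapped each doctor ID to an integer, a small optional check (not in the original code):
```python
# Inspect the learned label mapping (original class -> integer code)
mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(mapping)
```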
4. Disease Prevalence
```python
# Count diagnosed vs. not diagnosed
counts = data_df["诊断状态"].value_counts()

# Convert to percentages
sizes = counts / counts.sum() * 100

# Draw a donut chart
fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(sizes, labels=sizes.index, autopct='%1.2f%%',
                                  startangle=90, wedgeprops=dict(width=0.2))
plt.title("患病占比(1患病, 0没有患病)")
plt.show()
```
5. Initial RNN Model
```python
class model_rnn(nn.Module):
    def __init__(self):
        super(model_rnn, self).__init__()
        # Note: input_size=32 here does not match the 20 selected features;
        # the final model in section 10 uses input_size=20.
        self.rnn0 = nn.RNN(input_size=32, hidden_size=200,
                           num_layers=1, batch_first=True)
        self.fc0 = nn.Linear(200, 50)
        self.fc1 = nn.Linear(50, 2)

    def forward(self, x):
        out, hidden1 = self.rnn0(x)
        out = self.fc0(out)
        out = self.fc1(out)
        return out

model = model_rnn().to(device)  # device is defined in section 10
```
6. Correlation Analysis
```python
plt.figure(figsize=(40, 35))
sns.heatmap(data_df.corr(), annot=True, fmt=".2f")
plt.show()
```
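As an optional follow-up (not in the original code), the heatmap can be summarized by listing the features most correlated with the diagnosis label:
```python
# Features with the strongest absolute correlation to the label
corr_with_label = data_df.corr()["诊断状态"].drop("诊断状态")
print(corr_with_label.abs().sort_values(ascending=False).head(10))
```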
7. Age vs. Diagnosis
```python
print(data_df['年龄'].min(), data_df['年龄'].max())

age_bins = range(60, 91)
grouped = data_df.groupby('年龄').agg({'诊断状态': ['sum', 'size']})
grouped.columns = ['患病', '总人数']
grouped['不患病'] = grouped['总人数'] - grouped['患病']  # number of healthy people per age

# Plot style
sns.set(style="whitegrid")
plt.figure(figsize=(12, 5))

# x-axis labels: ages as strings for display
x = grouped.index.astype(str)

# Draw the bars
plt.bar(x, grouped["不患病"], 0.35, label="不患病", color="skyblue")
plt.bar(x, grouped["患病"], 0.35, label="患病", color="salmon")

# Title and axis labels
plt.title("患病年龄分布", fontproperties='Microsoft YaHei')
plt.xlabel("年龄", fontproperties='Microsoft YaHei')
plt.ylabel("人数", fontproperties='Microsoft YaHei')

# Apply the same font to the legend
plt.legend(prop={'family': 'Microsoft YaHei'})

plt.tight_layout()
plt.show()
```
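A hedged extension (the age_bins variable above is otherwise unused): group ages into 5-year bands and compute the prevalence in each, assuming 诊断状态 is coded 0/1:
```python
# Prevalence per 5-year age band (band edges chosen for illustration)
age_band = pd.cut(data_df['年龄'], bins=range(60, 96, 5), right=False)
print(data_df.groupby(age_band)['诊断状态'].mean())
```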
8. Feature Selection
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

data = data_df.copy()
X = data_df.iloc[:, 1:-2]  # drop PatientID, 诊断状态 and 主诊医生
y = data_df.iloc[:, -2]    # 诊断状态

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Build and evaluate a decision tree
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
pred = tree.predict(X_test)
reporter = classification_report(y_test, pred)
print(reporter)
```
```
              precision    recall  f1-score   support

           0       0.91      0.92      0.92       277
           1       0.85      0.84      0.84       153

    accuracy                           0.89       430
   macro avg       0.88      0.88      0.88       430
weighted avg       0.89      0.89      0.89       430
```
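Before running the RFE snippet from Part I, it can help to eyeball the tree's own importance scores; a small optional sketch:
```python
# Top 10 features by decision-tree importance (order matches X.columns)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```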
9. Build the Dataset
```python
feature_selection = ['年龄', '种族', '教育水平', '身体质量指数(BMI)', '酒精摄入量', '体育活动时间', '饮食质量评分',
                     '睡眠质量评分', '心血管疾病', '收缩压', '舒张压', '胆固醇总量', '低密度脂蛋白胆固醇(LDL)',
                     '高密度脂蛋白胆固醇(HDL)', '甘油三酯', '简易精神状态检查(MMSE)得分', '功能评估得分',
                     '记忆抱怨', '行为问题', '日常生活活动(ADL)得分']
X = data_df[feature_selection]

# Standardize. Strictly speaking this only suits continuous features, and 种族 is
# categorical, but since it is the only categorical feature here I cut a small corner.
sc = StandardScaler()
X = sc.fit_transform(X)

X = torch.tensor(np.array(X), dtype=torch.float32)
y = torch.tensor(np.array(y), dtype=torch.long)

# Split into training and test sets again
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
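A quick optional sanity check; the row counts depend on the CSV, but both tensors should be 2-D with 20 feature columns:
```python
print(X_train.shape, X_test.shape)  # e.g. torch.Size([N_train, 20]) torch.Size([N_test, 20])
```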
10. Model Construction
```python
batch_size = 32

train_dl = DataLoader(TensorDataset(X_train, y_train),
                      batch_size=batch_size,
                      shuffle=True)
test_dl = DataLoader(TensorDataset(X_test, y_test),
                     batch_size=batch_size,
                     shuffle=False)

class Rnn_Model(nn.Module):
    def __init__(self):
        super().__init__()
        # RNN layer. Note: the batches here are 2-D (batch, 20), which recent
        # PyTorch versions treat as a single unbatched sequence of length batch.
        self.rnn = nn.RNN(input_size=20, hidden_size=200, num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(200, 50)
        self.fc2 = nn.Linear(50, 2)

    def forward(self, x):
        x, hidden1 = self.rnn(x)
        x = self.fc1(x)
        x = self.fc2(x)
        return x

# The dataset is small, so CPU is enough
device = "cpu"
model = Rnn_Model().to(device)
model
```
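A minimal shape check of the assumption noted in the comment above (a 2-D input is treated by nn.RNN as one unbatched sequence):
```python
with torch.no_grad():
    out = model(torch.randn(4, 20))  # 4 rows, 20 features
print(out.shape)  # torch.Size([4, 2]): one 2-class logit row per input row
```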
11. Training and Test Functions
```python
def train(data, model, loss_fn, opt):
    size = len(data.dataset)
    batch_num = len(data)
    train_loss, train_acc = 0.0, 0.0

    for X, y in data:
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        opt.zero_grad()   # clear gradients
        loss.backward()   # compute gradients
        opt.step()        # update parameters

        train_loss += loss.item()
        train_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    train_loss /= batch_num
    train_acc /= size
    return train_acc, train_loss


def test(data, model, loss_fn):
    size = len(data.dataset)
    batch_num = len(data)
    test_loss, test_acc = 0.0, 0.0

    with torch.no_grad():
        for X, y in data:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)
            test_loss += loss.item()
            test_acc += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= batch_num
    test_acc /= size
    return test_acc, test_loss
```
12. Training
```python
loss_fn = nn.CrossEntropyLoss()  # loss function
learn_lr = 1e-4                  # learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=learn_lr)  # optimizer

train_acc = []
train_loss = []
test_acc = []
test_loss = []

epochs = 50
for i in range(epochs):
    model.train()
    epoch_train_acc, epoch_train_loss = train(train_dl, model, loss_fn, optimizer)

    model.eval()
    epoch_test_acc, epoch_test_loss = test(test_dl, model, loss_fn)

    train_acc.append(epoch_train_acc)
    train_loss.append(epoch_train_loss)
    test_acc.append(epoch_test_acc)
    test_loss.append(epoch_test_loss)

    # Log progress
    template = ('Epoch:{:2d}, Train_acc:{:.1f}%, Train_loss:{:.3f}, Test_acc:{:.1f}%, Test_loss:{:.3f}')
    print(template.format(i + 1, epoch_train_acc*100, epoch_train_loss,
                          epoch_test_acc*100, epoch_test_loss))

print("Done")
```
```
Epoch: 1, Train_acc:65.7%, Train_loss:0.650, Test_acc:68.4%, Test_loss:0.612
Epoch: 2, Train_acc:68.1%, Train_loss:0.580, Test_acc:70.5%, Test_loss:0.555
Epoch: 3, Train_acc:74.2%, Train_loss:0.526, Test_acc:76.7%, Test_loss:0.497
Epoch: 4, Train_acc:79.3%, Train_loss:0.473, Test_acc:80.9%, Test_loss:0.451
Epoch: 5, Train_acc:83.1%, Train_loss:0.429, Test_acc:81.9%, Test_loss:0.421
Epoch: 6, Train_acc:83.8%, Train_loss:0.402, Test_acc:83.3%, Test_loss:0.406
Epoch: 7, Train_acc:83.8%, Train_loss:0.386, Test_acc:83.3%, Test_loss:0.398
Epoch: 8, Train_acc:84.8%, Train_loss:0.378, Test_acc:83.5%, Test_loss:0.400
Epoch: 9, Train_acc:84.2%, Train_loss:0.376, Test_acc:84.9%, Test_loss:0.399
Epoch:10, Train_acc:84.9%, Train_loss:0.372, Test_acc:83.5%, Test_loss:0.404
Epoch:11, Train_acc:83.5%, Train_loss:0.377, Test_acc:83.7%, Test_loss:0.401
Epoch:12, Train_acc:84.4%, Train_loss:0.373, Test_acc:84.0%, Test_loss:0.399
Epoch:13, Train_acc:84.4%, Train_loss:0.371, Test_acc:83.5%, Test_loss:0.401
Epoch:14, Train_acc:84.9%, Train_loss:0.372, Test_acc:83.3%, Test_loss:0.397
Epoch:15, Train_acc:84.8%, Train_loss:0.370, Test_acc:84.0%, Test_loss:0.396
Epoch:16, Train_acc:85.2%, Train_loss:0.373, Test_acc:83.7%, Test_loss:0.400
Epoch:17, Train_acc:84.6%, Train_loss:0.373, Test_acc:84.0%, Test_loss:0.403
Epoch:18, Train_acc:84.6%, Train_loss:0.371, Test_acc:83.5%, Test_loss:0.401
Epoch:19, Train_acc:85.0%, Train_loss:0.368, Test_acc:83.3%, Test_loss:0.402
Epoch:20, Train_acc:84.5%, Train_loss:0.372, Test_acc:83.3%, Test_loss:0.403
Epoch:21, Train_acc:85.9%, Train_loss:0.371, Test_acc:83.0%, Test_loss:0.404
Epoch:22, Train_acc:84.6%, Train_loss:0.373, Test_acc:82.6%, Test_loss:0.400
Epoch:23, Train_acc:84.2%, Train_loss:0.374, Test_acc:82.8%, Test_loss:0.400
Epoch:24, Train_acc:84.2%, Train_loss:0.372, Test_acc:83.5%, Test_loss:0.400
Epoch:25, Train_acc:84.6%, Train_loss:0.372, Test_acc:83.0%, Test_loss:0.397
Epoch:26, Train_acc:85.0%, Train_loss:0.370, Test_acc:83.3%, Test_loss:0.400
Epoch:27, Train_acc:84.8%, Train_loss:0.373, Test_acc:83.0%, Test_loss:0.398
Epoch:28, Train_acc:84.4%, Train_loss:0.373, Test_acc:84.0%, Test_loss:0.398
Epoch:29, Train_acc:85.0%, Train_loss:0.369, Test_acc:83.7%, Test_loss:0.395
Epoch:30, Train_acc:84.6%, Train_loss:0.370, Test_acc:83.3%, Test_loss:0.397
Epoch:31, Train_acc:84.9%, Train_loss:0.369, Test_acc:84.7%, Test_loss:0.396
Epoch:32, Train_acc:84.9%, Train_loss:0.370, Test_acc:84.2%, Test_loss:0.395
Epoch:33, Train_acc:84.9%, Train_loss:0.370, Test_acc:84.2%, Test_loss:0.395
Epoch:34, Train_acc:84.8%, Train_loss:0.369, Test_acc:84.0%, Test_loss:0.398
Epoch:35, Train_acc:84.1%, Train_loss:0.373, Test_acc:84.4%, Test_loss:0.395
Epoch:36, Train_acc:85.0%, Train_loss:0.370, Test_acc:83.0%, Test_loss:0.400
Epoch:37, Train_acc:84.9%, Train_loss:0.371, Test_acc:83.3%, Test_loss:0.398
Epoch:38, Train_acc:85.0%, Train_loss:0.372, Test_acc:83.7%, Test_loss:0.398
Epoch:39, Train_acc:84.9%, Train_loss:0.369, Test_acc:83.3%, Test_loss:0.398
Epoch:40, Train_acc:85.4%, Train_loss:0.367, Test_acc:84.2%, Test_loss:0.396
Epoch:41, Train_acc:84.8%, Train_loss:0.368, Test_acc:84.0%, Test_loss:0.399
Epoch:42, Train_acc:84.6%, Train_loss:0.370, Test_acc:83.7%, Test_loss:0.396
Epoch:43, Train_acc:84.8%, Train_loss:0.369, Test_acc:83.7%, Test_loss:0.396
Epoch:44, Train_acc:84.8%, Train_loss:0.371, Test_acc:83.3%, Test_loss:0.401
Epoch:45, Train_acc:84.8%, Train_loss:0.372, Test_acc:84.2%, Test_loss:0.399
Epoch:46, Train_acc:85.1%, Train_loss:0.371, Test_acc:83.7%, Test_loss:0.397
Epoch:47, Train_acc:84.9%, Train_loss:0.369, Test_acc:83.5%, Test_loss:0.397
Epoch:48, Train_acc:85.1%, Train_loss:0.371, Test_acc:83.0%, Test_loss:0.397
Epoch:49, Train_acc:84.7%, Train_loss:0.372, Test_acc:83.3%, Test_loss:0.397
Epoch:50, Train_acc:85.0%, Train_loss:0.371, Test_acc:83.7%, Test_loss:0.397
Done
```
13. Model Evaluation
```python
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")  # suppress warnings
from datetime import datetime

current_time = datetime.now()  # current time

epochs_range = range(epochs)

plt.figure(figsize=(12, 3))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_acc, label='Training Accuracy')
plt.plot(epochs_range, test_acc, label='Test Accuracy')
plt.legend(loc='lower right')
plt.title('Training Accuracy')
plt.xlabel(current_time)  # include a timestamp when checking in; otherwise the code screenshot is invalid

plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_loss, label='Training Loss')
plt.plot(epochs_range, test_loss, label='Test Loss')
plt.legend(loc='upper right')
plt.title('Training Loss')
plt.show()
```
14. Confusion Matrix
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict on the full test set
model.eval()
with torch.no_grad():
    pred = model(X_test.to(device)).argmax(1).cpu().numpy()

# Compute the confusion matrix
cm = confusion_matrix(y_test, pred)

# Plot it
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("混淆矩阵")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.tight_layout()  # auto-fit the layout
plt.show()
```
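As an optional cross-check (not in the original post), the same predictions can be summarized with a text report, mirroring the decision-tree evaluation in section 8:
```python
from sklearn.metrics import classification_report

# Per-class precision/recall/F1 for the RNN's test-set predictions
print(classification_report(y_test, pred))
```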