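Contents

1. Data Basis and Preprocessing Goals

2. Preprocessing Steps and Code Walkthrough

2.1 Data Loading and Initial Cleaning

2.2 Label Encoding

2.3 Handling Missing Values

(1) Deleting samples with missing values

(2) Filling with the per-class mean

(3) Filling with the per-class median

(4) Filling with the per-class mode

(5) Linear regression imputation

(6) Random forest imputation

2.4 Feature Standardization

2.5 Dataset Splitting and Class Balancing

(1) Splitting into training and test sets

(2) Handling class imbalance

2.6 Saving the Data

3. Full Code

4. Preprocessing Summary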


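1. Data Basis and Preprocessing Goals

This mineral classification system is built on 矿物数据.xlsx. The file contains 1044 samples, each with 13 element features (chlorine, sodium, magnesium, and so on) and one of 4 mineral labels (A, B, C, D). The data is sensitive and cannot be shared; only the code is provided, for learning purposes. The core goal of preprocessing is to give the later model-training stage reliable input by normalizing formats, handling missing values, and balancing the classes.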


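2. Preprocessing Steps and Code Walkthrough

2.1 Data Loading and Initial Cleaning

First load the data and remove invalid entries:

import pandas as pd

# Load the data
data = pd.read_excel('矿物数据.xlsx', sheet_name='Sheet1')
# Drop the special class E (only 1 sample, no statistical significance);
# .copy() avoids pandas' SettingWithCopyWarning in the loop below
data = data[data['矿物类型'] != 'E'].copy()
# Convert data types: non-numeric symbols (such as "/" and spaces) become NaN
for col in data.columns:
    if col not in ['序号', '矿物类型']:  # the ID and label columns are not converted
        # errors='coerce' turns anything that is not a number into NaN
        data[col] = pd.to_numeric(data[col], errors='coerce')

This step resolves the inconsistent formatting in the raw data and guarantees that every feature column is numeric, which the later steps depend on.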
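To make the effect of errors='coerce' concrete, here is a minimal sketch on a made-up table. The column names and values are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy table: feature columns stored as strings, with "/" and a
# blank standing in for unreadable measurements.
raw = pd.DataFrame({
    "Na": ["12.5", "/", "30"],
    "Mg": [" ", "4.2", "7"],
})

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")

# Count the missing values that the conversion produced in each column.
missing_per_column = clean.isnull().sum()
print(clean)
print(missing_per_column)
```

Counting the NaN values right after conversion is a cheap way to see how many entries were not valid numbers before choosing an imputation method.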

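2.2 Label Encoding

Convert the character labels (A/B/C/D) into integers that a model can consume:

# Define the label mapping
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
# Convert the labels in place, keeping the DataFrame layout
data['矿物类型'] = data['矿物类型'].map(label_dict)
# Separate features and labels
X = data.drop(['序号', '矿物类型'], axis=1)  # feature set
y = data['矿物类型']  # label set

After encoding, the labels take the values 0 to 3, which is the input format that machine learning models expect.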
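One detail worth checking after the mapping: Series.map returns NaN for any label that is not a key of the dictionary. The sketch below, on a hypothetical label column, lists such labels so they can be handled explicitly instead of silently becoming missing values.

```python
import pandas as pd

label_dict = {"A": 0, "B": 1, "C": 2, "D": 3}

# Hypothetical label column containing one value outside the mapping.
labels = pd.Series(["A", "C", "E", "B"])
encoded = labels.map(label_dict)

# Labels that are not keys of label_dict are mapped to NaN; list them.
unmapped = labels[encoded.isnull()].unique().tolist()
print(unmapped)
```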

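2.3 Handling Missing Values

Six strategies were implemented for the missing values in the data:

(1) Deleting samples with missing values

Suitable when the missing rate is very low (<1%); incomplete samples are simply removed:

def drop_missing(train_data, train_label):
    # Join features and labels so rows can be dropped together
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)  # reset the index
    cleaned = combined.dropna()  # drop every row that contains a missing value
    # Separate features and labels again
    return cleaned.drop('矿物类型', axis=1), cleaned['矿物类型']

The advantage of this method is that no estimated values enter the data; the drawback is that useful samples are thrown away, which becomes costly when the missing rate is high.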
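Whether deletion is viable depends on how much data it removes. A minimal sketch, using a hypothetical four-row table, of how the per-column missing rate and the share of rows that dropna() would discard can be measured:

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with one missing value.
features = pd.DataFrame({
    "Cl": [1.0, 2.0, np.nan, 4.0],
    "Na": [10.0, 20.0, 30.0, 40.0],
})

# Share of NaN in each column.
missing_rate = features.isnull().mean()
# Share of rows that contain at least one NaN, i.e. what dropna() would remove.
rows_lost = features.isnull().any(axis=1).mean()

print(missing_rate)
print(rows_lost)
```

Note that the share of rows lost can be much larger than any single column's missing rate when the gaps are spread over many columns.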

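(2) Filling with the per-class mean

For numeric features, group the samples by mineral type, compute the mean within each group, and fill the group's missing values with that mean (this limits interference between classes):

def mean_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    # Fill group by group, one group per mineral type
    filled_groups = []
    for type_id in combined['矿物类型'].unique():
        group = combined[combined['矿物类型'] == type_id]
        # Column means within the group fill that group's missing values
        filled_group = group.fillna(group.mean())
        filled_groups.append(filled_group)
    # Join the groups back together (rows end up ordered by group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('矿物类型', axis=1), filled['矿物类型']

This suits features whose distribution is fairly even, and it avoids mixing the means of different classes.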
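The same per-class mean fill can also be written with groupby and transform, which keeps the rows in their original order. This is an alternative sketch, not the author's implementation; the helper name class_mean_fill and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd

def class_mean_fill(features, labels):
    """Fill NaN with the mean of the same class, keeping the original row order."""
    # transform("mean") returns a frame of the same shape holding each row's class mean.
    class_means = features.groupby(labels).transform("mean")
    return features.fillna(class_means)

# Hypothetical data: two classes, one missing value in each.
X_demo = pd.DataFrame({"Na": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0]})
y_demo = pd.Series([0, 0, 0, 1, 1, 1], name="label")

filled = class_mean_fill(X_demo, y_demo)
print(filled)
```

Because the row order is preserved, the label Series does not have to be rebuilt after filling.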

(3) Filling with the per-class median

When a feature contains extreme values (for example, a few samples whose sodium content is far above the mean), filling with the median is more robust:

def median_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['矿物类型'].unique():
        group = combined[combined['矿物类型'] == type_id]
        # The median is insensitive to extreme values
        filled_group = group.fillna(group.median())
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('矿物类型', axis=1), filled['矿物类型']
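A small numeric illustration of why the median is the more robust fill value. The numbers are made up: four ordinary measurements and one extreme one.

```python
import pandas as pd

# Hypothetical sodium readings with a single extreme value.
values = pd.Series([10.0, 11.0, 12.0, 13.0, 1000.0])

mean_value = values.mean()      # pulled far upward by the outlier
median_value = values.median()  # stays with the bulk of the data

print(mean_value, median_value)
```

Filling a gap with the mean here would insert a value that resembles none of the typical samples, while the median stays inside the typical range.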

(4) Filling with the per-class mode

For discrete features (for example, element contents recorded as integer codes), fill with the mode, the most frequent value:

def mode_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    combined = combined.reset_index(drop=True)
    filled_groups = []
    for type_id in combined['矿物类型'].unique():
        group = combined[combined['矿物类型'] == type_id]
        # Take the mode of each column; None when the column has no mode (all values missing)
        fill_values = group.apply(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
        filled_group = group.fillna(fill_values)
        filled_groups.append(filled_group)
    filled = pd.concat(filled_groups, axis=0).reset_index(drop=True)
    return filled.drop('矿物类型', axis=1), filled['矿物类型']
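Two behaviours of Series.mode matter for the fill rule above: ties return several values in ascending order, so .iloc[0] picks the smallest one, and a column made only of NaN has an empty mode. A short sketch on hypothetical columns:

```python
import pandas as pd

# Hypothetical column in which 1 and 2 both appear twice.
column = pd.Series([2, 1, 1, 2, 3])
modes = column.mode()        # every value tied for the highest count, sorted ascending
fill_value = modes.iloc[0]   # the fill rule keeps only the first of them

# A column with nothing but missing values has no mode at all.
all_missing = pd.Series([float("nan"), float("nan")])
has_mode = not all_missing.mode().empty

print(modes.tolist(), fill_value, has_mode)
```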

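(5) Linear regression imputation

Use linear correlations between features (such as the relation between chlorine and sodium content) to predict the missing values:

from sklearn.linear_model import LinearRegression

def linear_reg_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    features = combined.drop('矿物类型', axis=1)
    # Process columns in ascending order of missing count (fewest first)
    null_counts = features.isnull().sum().sort_values()
    complete_cols = []  # columns with no NaN left: complete from the start or already filled
    for col in null_counts.index:
        complete_cols.append(col)
        if null_counts[col] == 0:
            continue  # nothing to fill in this column
        # Predict the current column from the columns that are already complete,
        # so neither the training rows nor the rows to fill contain NaN
        # (assumes at least one column has no missing values at all)
        predictors = features[complete_cols].drop(col, axis=1)
        missing = features[col].isnull()
        # Train on the rows where the current column is known
        lr = LinearRegression()
        lr.fit(predictors[~missing], features.loc[~missing, col])
        # Predict and fill the missing values
        features.loc[missing, col] = lr.predict(predictors[missing])
    return features, combined['矿物类型']

This method requires some linear relationship between the features, so it fits cases where element contents vary roughly in proportion to each other.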
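The idea can be checked on a toy table in which one column is an exact multiple of another. The data below is hypothetical (Na is constructed as 2 × Cl), so the regression should recover the hidden value.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: Na = 2 * Cl, with one Na value missing.
df = pd.DataFrame({
    "Cl": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Na": [2.0, 4.0, np.nan, 8.0, 10.0],
})

missing = df["Na"].isnull()
# Fit on the rows where Na is known, using Cl as the only predictor.
model = LinearRegression().fit(df.loc[~missing, ["Cl"]], df.loc[~missing, "Na"])
# Predict Na for the rows where it is missing and write the result back.
df.loc[missing, "Na"] = model.predict(df.loc[missing, ["Cl"]])

print(df)
```

On real measurements the relation is never exact, so the filled values carry the regression's error; the sketch only shows the mechanics.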

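(6) Random forest imputation

When the relationships between features are nonlinear, a random forest model is used to predict the missing values:

from sklearn.ensemble import RandomForestRegressor

def rf_fill(train_data, train_label):
    combined = pd.concat([train_data, train_label], axis=1)
    features = combined.drop('矿物类型', axis=1)
    null_counts = features.isnull().sum().sort_values()
    complete_cols = []  # columns with no NaN left: complete from the start or already filled
    for col in null_counts.index:
        complete_cols.append(col)
        if null_counts[col] == 0:
            continue
        # Separate the rows used for training from the rows to be filled
        # (assumes at least one column has no missing values at all)
        predictors = features[complete_cols].drop(col, axis=1)
        missing = features[col].isnull()
        # Random forest regressor (100 trees, fixed seed for reproducibility)
        rfr = RandomForestRegressor(n_estimators=100, random_state=10)
        rfr.fit(predictors[~missing], features.loc[~missing, col])
        # Fill in the predictions
        features.loc[missing, col] = rfr.predict(predictors[missing])
    return features, combined['矿物类型']

A random forest can capture complex relationships between features, so its imputation accuracy is often higher than that of a linear model, at a somewhat higher computational cost.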
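The following sketch compares the two regressors on synthetic data with a purely nonlinear relation (y = x²). The data and the hold-out scheme are assumptions made for illustration; they say nothing about how the two methods rank on the mineral data itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends on x only through a nonlinear (quadratic) relation.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = x ** 2
X = x.reshape(-1, 1)

# Pretend the last 60 values of y are missing and have to be imputed.
X_known, y_known = X[:240], y[:240]
X_missing, y_true = X[240:], y[240:]

lr_pred = LinearRegression().fit(X_known, y_known).predict(X_missing)
rf = RandomForestRegressor(n_estimators=100, random_state=10)
rf_pred = rf.fit(X_known, y_known).predict(X_missing)

# Mean absolute error of each imputation against the hidden true values.
lr_mae = np.mean(np.abs(lr_pred - y_true))
rf_mae = np.mean(np.abs(rf_pred - y_true))
print(lr_mae, rf_mae)
```

Hiding known values and scoring the imputations in this way is also a practical method for choosing between the six strategies on real data.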

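2.4 Feature Standardization

The element contents differ widely in magnitude (sodium can reach the thousands, while selenium mostly lies between 0 and 1), so the effect of scale has to be removed:

from sklearn.preprocessing import StandardScaler

def standardize_features(X_train, X_test):
    # Standardize with the training set's mean and standard deviation
    # (prevents test-set information from leaking into training)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set and transform it
    X_test_scaled = scaler.transform(X_test)  # transform the test set with the same parameters
    # Convert back to DataFrame, keeping feature names and the original index
    return (pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index),
            pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index))

After standardization, every feature has mean 0 and standard deviation 1 on the training set (the test set, scaled with the training parameters, is only approximately so), which keeps the model from being dominated by features with large numeric ranges.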
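A minimal numeric check of what the scaler does, on hypothetical values: the training columns come out with mean 0 and standard deviation 1, while a test row is expressed in units of the training standard deviation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features on very different scales.
X_train = pd.DataFrame({"Na": [100.0, 200.0, 300.0], "Se": [0.1, 0.2, 0.3]})
X_test = pd.DataFrame({"Na": [400.0], "Se": [0.2]})

# Fit on the training set only, then apply the same parameters to both sets.
scaler = StandardScaler().fit(X_train)
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

StandardScaler uses the population standard deviation (ddof=0), which is why np.std with its default arguments reproduces the value 1 on the training columns.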

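2.5 Dataset Splitting and Class Balancing

(1) Splitting into training and test sets

Split 7:3 while keeping the class distribution consistent:

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of both subsets consistent with the original data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)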
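To see what stratify does, the sketch below splits a hypothetical imbalanced label vector (60/30/10 samples per class) and compares the class counts in the two subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60, 30 and 10 samples of classes 0, 1 and 2.
y = pd.Series([0] * 60 + [1] * 30 + [2] * 10)
X = pd.DataFrame({"f": np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Class counts in each subset, ordered by class label.
train_counts = y_tr.value_counts().sort_index().tolist()
test_counts = y_te.value_counts().sort_index().tolist()
print(train_counts, test_counts)
```

Both subsets keep the 6:3:1 ratio of the full data; without stratify, a rare class can end up under-represented in the test set or missing from it.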

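(2) Handling class imbalance

The SMOTE algorithm generates synthetic minority-class samples to even out the class counts:

from imblearn.over_sampling import SMOTE

# Oversample the training set only (the test set keeps its original distribution).
# SMOTE cannot handle NaN, so X_train must already be imputed at this point.
# k_neighbors=1: each synthetic sample is interpolated with the single nearest
# same-class neighbor, which limits the noise introduced
smote = SMOTE(k_neighbors=1, random_state=0)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)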
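The core step of SMOTE, as the algorithm is usually described, is a linear interpolation between a minority sample and one of its nearest same-class neighbors. The NumPy sketch below shows only that step; it is not imblearn's implementation, and the function name interpolate_sample and the two points are hypothetical.

```python
import numpy as np

def interpolate_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and its neighbor."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])          # a minority-class sample
neighbor = np.array([3.0, 6.0])   # its nearest neighbor of the same class
synthetic = interpolate_sample(x, neighbor, rng)
print(synthetic)
```

With k_neighbors=1 the neighbor is always the single closest same-class sample, so every synthetic point stays close to existing data.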

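2.6 Saving the Data

Store the preprocessed data for the later model-training stage:

def save_processed_data(X_train, y_train, X_test, y_test, method):
    # Join features and labels; reset both indexes so the rows align by position
    train_df = pd.concat([X_train.reset_index(drop=True),
                          pd.Series(y_train, name='矿物类型').reset_index(drop=True)], axis=1)
    test_df = pd.concat([X_test.reset_index(drop=True),
                         pd.Series(y_test, name='矿物类型').reset_index(drop=True)], axis=1)
    # Save as Excel, with the preprocessing method in the file name
    train_df.to_excel(f'训练集_{method}.xlsx', index=False)
    test_df.to_excel(f'测试集_{method}.xlsx', index=False)

# Example: save data that was filled by random forest and standardized
save_processed_data(X_train_balanced, y_train_balanced, X_test, y_test, 'rf_fill_standardized')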


3. Full Code

数据预处理.py

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import fill_data

data = pd.read_excel('矿物数据.xlsx')
# Drop the special class E (the whole dataset contains only 1 sample of E);
# reset the index so that features and labels stay aligned
data = data[data['矿物类型'] != 'E'].reset_index(drop=True)
null_num = data.isnull()  # for inspection only
null_total = data.isnull().sum()  # missing values per column, for inspection only

X_whole = data.drop('矿物类型', axis=1).drop('序号', axis=1)
y_whole = data.矿物类型

# Convert the letter labels into integers
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encoded_labels = [label_dict[label] for label in y_whole]
y_whole = pd.DataFrame(encoded_labels, columns=['矿物类型'])

# Convert string data to float; abnormal entries ('\' and spaces) become NaN
for column_name in X_whole.columns:
    X_whole[column_name] = pd.to_numeric(X_whole[column_name], errors='coerce')

# Split the dataset (stratified, as described in section 2.5)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(
    X_whole, y_whole, test_size=0.3, random_state=0, stratify=y_whole)

# Z-score standardization, fitted on the training set only (section 2.4).
# The scaler returns NumPy arrays, so convert back to pandas and keep the index.
scaler = StandardScaler()
X_train_w = pd.DataFrame(scaler.fit_transform(X_train_w),
                         columns=X_train_w.columns, index=X_train_w.index)
X_test_w = pd.DataFrame(scaler.transform(X_test_w),
                        columns=X_test_w.columns, index=X_test_w.index)

# Imputation, 6 methods (uncomment the one to run)
# # 1. Delete rows with missing values
# X_train_fill, y_train_fill = fill_data.cca_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.cca_test_fill(X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 2. Mean fill
# X_train_fill, y_train_fill = fill_data.mean_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mean_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 3. Median fill
# X_train_fill, y_train_fill = fill_data.median_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.median_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 4. Mode fill
# X_train_fill, y_train_fill = fill_data.mode_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.mode_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)
#
# # 5. Linear regression fill
# X_train_fill, y_train_fill = fill_data.linear_train_fill(X_train_w, y_train_w)
# X_test_fill, y_test_fill = fill_data.linear_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
# os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
# fill_data.linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

# 6. Random forest fill
X_train_fill, y_train_fill = fill_data.RandomForest_train_fill(X_train_w, y_train_w)
X_test_fill, y_test_fill = fill_data.RandomForest_test_fill(X_train_fill, y_train_fill, X_test_w, y_test_w)
os_x_train, os_y_train = fill_data.oversampling(X_train_fill, y_train_fill)
fill_data.RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill)

fill_data.py

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


# Oversampling
def oversampling(train_data, train_label):
    oversampler = SMOTE(k_neighbors=1, random_state=0)
    os_x_train, os_y_train = oversampler.fit_resample(train_data, train_label)
    return os_x_train, os_y_train


# 1. Delete rows with missing values
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)  # reset the index
    df_filled = data.dropna()  # drop the rows that contain missing values
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def cca_test_fill(test_data, test_label):
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def cca_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[删除空缺行].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[删除空缺行].xlsx', index=False)


# 2. Mean fill
def mean_train_method(data):
    fill_values = data.mean()
    return data.fillna(fill_values)


def mean_test_method(train_data, test_data):
    fill_values = train_data.mean()  # the statistics come from the training set only
    return test_data.fillna(fill_values)


def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    # Fill each class (0=A, 1=B, 2=C, 3=D) with its own column means
    groups = [mean_train_method(data[data['矿物类型'] == k]) for k in range(4)]
    df_filled = pd.concat(groups, axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    # Fill each test-set class with the means of the same class in the training set
    groups = []
    for k in range(4):
        class_train = train_data_all[train_data_all['矿物类型'] == k]
        class_test = test_data_all[test_data_all['矿物类型'] == k]
        groups.append(mean_test_method(class_train, class_test))
    df_filled = pd.concat(groups, axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def mean_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[平均值填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[平均值填充].xlsx', index=False)


# 3. Median fill
def median_train_method(data):
    fill_values = data.median()
    return data.fillna(fill_values)


def median_test_method(train_data, test_data):
    fill_values = train_data.median()  # the statistics come from the training set only
    return test_data.fillna(fill_values)


def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    # Fill each class (0=A, 1=B, 2=C, 3=D) with its own column medians
    groups = [median_train_method(data[data['矿物类型'] == k]) for k in range(4)]
    df_filled = pd.concat(groups, axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    # Fill each test-set class with the medians of the same class in the training set
    groups = []
    for k in range(4):
        class_train = train_data_all[train_data_all['矿物类型'] == k]
        class_test = test_data_all[test_data_all['矿物类型'] == k]
        groups.append(median_test_method(class_train, class_test))
    df_filled = pd.concat(groups, axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def median_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[中位数填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[中位数填充].xlsx', index=False)


# 4. Mode fill
def mode_train_method(data):
    # First mode of each column; None when a column has no mode (all values missing)
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)


def mode_test_method(train_data, test_data):
    # The modes come from the training set only
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)


def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    # Fill each class (0=A, 1=B, 2=C, 3=D) with its own column modes
    groups = [mode_train_method(data[data['矿物类型'] == k]) for k in range(4)]
    df_filled = pd.concat(groups, axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    # Fill each test-set class with the modes of the same class in the training set
    groups = []
    for k in range(4):
        class_train = train_data_all[train_data_all['矿物类型'] == k]
        class_test = test_data_all[test_data_all['矿物类型'] == k]
        groups.append(mode_test_method(class_train, class_test))
    df_filled = pd.concat(groups, axis=0)
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('矿物类型', axis=1), df_filled.矿物类型


def mode_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[众数填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[众数填充].xlsx', index=False)


# 5. Linear regression fill
def linear_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []  # columns seen so far, all complete once processed
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            # Predict column i from the columns that are already complete
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} in the training set')
    return train_data_X, train_data_all.矿物类型


def linear_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    test_data_X = test_data_all.drop('矿物类型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            # Train on the already filled training set, predict for the test set
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            lr = LinearRegression()
            lr.fit(X_train, y_train)
            y_pred = lr.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} in the test set')
    return test_data_X, test_data_all.矿物类型


def linear_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[线性回归填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[线性回归填充].xlsx', index=False)


# 6. Random forest fill
def RandomForest_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    null_num = train_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []  # columns seen so far, all complete once processed
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            # Predict column i from the columns that are already complete
            X = train_data_X[filling_feature].drop(i, axis=1)
            y = train_data_X[i]
            row_numbers_mg_null = train_data_X[train_data_X[i].isnull()].index.tolist()
            X_train = X.drop(row_numbers_mg_null)
            y_train = y.drop(row_numbers_mg_null)
            X_test = X.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            train_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} in the training set')
    return train_data_X, train_data_all.矿物类型


def RandomForest_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_X = train_data_all.drop('矿物类型', axis=1)
    test_data_X = test_data_all.drop('矿物类型', axis=1)
    null_num = test_data_X.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            # Train on the already filled training set, predict for the test set
            X_train = train_data_X[filling_feature].drop(i, axis=1)
            y_train = train_data_X[i]
            X_test = test_data_X[filling_feature].drop(i, axis=1)
            row_numbers_mg_null = test_data_X[test_data_X[i].isnull()].index.tolist()
            X_test = X_test.iloc[row_numbers_mg_null]
            rfg = RandomForestRegressor(n_estimators=100, random_state=10)
            rfg.fit(X_train, y_train)
            y_pred = rfg.predict(X_test)
            test_data_X.loc[row_numbers_mg_null, i] = y_pred
            print(f'Finished filling column {i} in the test set')
    return test_data_X, test_data_all.矿物类型


def RandomForest_save_file(os_x_train, os_y_train, X_test_fill, y_test_fill):
    data_train = pd.concat([os_x_train, os_y_train], axis=1)
    data_test = pd.concat([X_test_fill, y_test_fill], axis=1)
    data_train.to_excel(r'..//temp_data//训练数据集[随机森林填充].xlsx', index=False)
    data_test.to_excel(r'..//temp_data//测试数据集[随机森林填充].xlsx', index=False)

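4. Preprocessing Summary

The preprocessing stage accomplishes the following:

  1. Cleans out invalid samples and abnormal symbols, and unifies the data format;
  2. Handles missing values with several methods, to suit different data characteristics;
  3. Standardizes the features, removing differences in scale;
  4. Splits and balances the dataset in preparation for model training.

The processed data now meets the input requirements of the models. The next stage is model training, which includes:

  • Choosing classification algorithms such as random forest and SVM;
  • Evaluating the models and tuning hyperparameters;
  • Comparing the classification performance of the different models.

A follow-up article will build on the data preprocessed here and cover the training process and the analysis of results in detail.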
