Machine Learning Toolbox
A hands-on toolbox for machine learning and deep learning: assemble analysis pipelines like LEGO bricks.
This page is a "code toolbox" that provides standardized code blocks for quickly assembling data-mining and model-analysis workflows. The primary language is Python.
Data Preprocessing
Overview
- Data loading templates
  - CSV
  - Excel
  - SQL
- Visual analysis
  - Normality checks (histogram / KDE plot / Q-Q plot)
  - KMO test
  - Bartlett's test of sphericity
  - Correlation heatmaps (Pearson/Spearman)
- Outlier detection and handling
  - Boxplot
  - Z-Score
  - Isolation Forest
- Missing value handling
  - Drop
  - Mean/Median
  - KNN Imputer
- Data type conversion
  - Categorical to Numerical
- Dataset splitting
  - Train/Test/Validation Split
- Class imbalance handling
  - SMOTE
  - Undersampling
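As a starting point, here is a minimal sketch that chains several of the steps above: KNN imputation, a simple z-score outlier rule, a stratified train/test split, and SMOTE oversampling. The DataFrame name df, the 'target' column, and the file path are placeholders for your own data; SMOTE assumes the third-party imbalanced-learn package is installed, and the features are assumed to be all numeric.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # third-party: imbalanced-learn

# Placeholder input: a DataFrame with all-numeric features and a binary 'target' column
df = pd.read_csv('your_data.csv')  # replace with your data source
X, y = df.drop('target', axis=1), df['target']

# Drop columns that are mostly missing, then KNN-impute the rest
X = X.loc[:, X.isna().mean() < 0.5]
X = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(X), columns=X.columns)

# Mask values more than 3 standard deviations from the mean, refill with column medians
z = (X - X.mean()) / X.std()
X = X.mask(z.abs() > 3).fillna(X.median())

# Stratified split; oversample the minority class on the training set only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
Note that SMOTE is applied after the split: oversampling before splitting would leak synthetic copies of training points into the test set.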
Feature Engineering
Overview
- Feature construction
  - Periodic-behavior features via the von Mises distribution
  - Regression model parameters and statistics
- Standardization
  - StandardScaler (Z-Score)
- Normalization
  - MinMaxScaler
- Categorical encoding
  - One-Hot Encoding
  - Label Encoding
- Dimensionality reduction
  - PCA
  - t-SNE
  - LDA
- Feature selection
  - VarianceThreshold
  - SelectKBest
  - RFE (Recursive Feature Elimination)
  - Random forest feature selection (Gini/Permutation)
- Feature generation
  - Polynomial Features
- Text feature extraction
  - TF-IDF
  - Word2Vec
  - BERT embeddings
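The sketch below wires standardization, one-hot encoding, and PCA from the list above into a single scikit-learn pipeline, so every transform is fit on training data only. The column lists num_cols and cat_cols are hypothetical placeholders; sparse_output=False assumes scikit-learn >= 1.2 (older versions use sparse=False).
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column layout; replace with your own
num_cols = ['age', 'income']
cat_cols = ['city', 'segment']

pre = ColumnTransformer([
    ('num', StandardScaler(), num_cols),  # z-score the numeric features
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols),
])

# PCA keeps enough components to explain 95% of the variance
fe_pipeline = Pipeline([
    ('pre', pre),
    ('pca', PCA(n_components=0.95)),
])

# X_new = fe_pipeline.fit_transform(df)  # df is your raw feature DataFrame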
Random Forest Feature Selection
Random forests are not only powerful predictive models but also an effective feature-selection tool. They measure feature importance through two main mechanisms:
1. Impurity-based importance (MDI, Mean Decrease in Impurity)
Also known as Gini importance (available in scikit-learn via feature_importances_).
- Principle: for each feature, sum the reduction in impurity (Gini/entropy for classification, MSE for regression) it produces across all split nodes in all trees.
- Pros: extremely fast (computed as a by-product of training).
- Cons: strongly biased toward features with wide numeric ranges or many categories (high cardinality), such as ID-like features, which can be misleading.
2. Permutation importance
- Principle: after training, on a validation or test set, randomly shuffle (permute) one feature's values while keeping the others fixed, and measure how much model performance (e.g., Accuracy, R²) drops. A large drop means the feature is important.
- Pros: more objective and faithful, free of the high-cardinality bias, and applicable to any model.
- Cons: computationally expensive (requires repeated prediction).
Practical tip: use MDI for a quick first look, but in formal reports or rigorous analysis, rely on permutation importance.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# ==========================================
# 1. Configuration & Data Loading
# ==========================================
# Set plot style
sns.set_style("whitegrid")
plt.rcParams['font.sans-serif'] = ['SimHei'] # For Chinese characters
plt.rcParams['axes.unicode_minus'] = False
# Model Parameters
RF_PARAMS = {
    'n_estimators': 100,
    'max_depth': 10,
    'min_samples_split': 2,
    'n_jobs': -1,
    'random_state': 42
}
# Load Data (Replace with your data source)
# df = pd.read_csv('your_data.csv')
# X = df.drop(['target', 'date'], axis=1)
# y = df['target']
# For demonstration, generate random data
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5, random_state=42)
feature_names = [f'Feature_{i}' for i in range(X.shape[1])]
# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# ==========================================
# 2. Model Training
# ==========================================
print("Training Random Forest...")
model = RandomForestRegressor(**RF_PARAMS) # Use RandomForestClassifier for classification
model.fit(X_train, y_train)
# Basic Evaluation
y_pred = model.predict(X_test)
print(f"Test R2 Score: {r2_score(y_test, y_pred):.4f}")
# ==========================================
# 3. Feature Importance (Gini & Permutation)
# ==========================================
# A. Gini Importance (MDI) - Fast but biased towards high cardinality features
gini_importance = model.feature_importances_
gini_df = pd.DataFrame({
    'Feature': feature_names,
    'Gini_Importance': gini_importance
}).sort_values('Gini_Importance', ascending=False)
# B. Permutation Importance - Slower but more reliable
print("Calculating Permutation Importance...")
perm_result = permutation_importance(
    model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)
perm_df = pd.DataFrame({
    'Feature': feature_names,
    'Permutation_Importance': perm_result.importances_mean,
    'Permutation_Std': perm_result.importances_std
}).sort_values('Permutation_Importance', ascending=False)
# ==========================================
# 4. Visualization
# ==========================================
def plot_importance(df, value_col, title):
    plt.figure(figsize=(10, 6))
    # Plot the top 20 features, largest bar at the top
    plot_df = df.head(20).sort_values(value_col, ascending=True)
    plt.barh(range(len(plot_df)), plot_df[value_col], color='#1f77b4', alpha=0.8)
    plt.yticks(range(len(plot_df)), plot_df['Feature'])
    plt.xlabel('Importance')
    plt.title(title)
    plt.tight_layout()
    plt.show()
# Plot Gini Importance
plot_importance(gini_df, 'Gini_Importance', 'Random Forest Feature Importance (Gini)')
# Plot Permutation Importance
plot_importance(perm_df, 'Permutation_Importance', 'Random Forest Feature Importance (Permutation)')
# Cumulative Importance (to decide how many features to keep)
gini_sorted = gini_df.sort_values('Gini_Importance', ascending=False)
cumulative_importance = np.cumsum(gini_sorted['Gini_Importance'])
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(cumulative_importance) + 1), cumulative_importance, 'b-o')
plt.axhline(y=0.9, color='r', linestyle='--')
plt.text(0, 0.91, '90% Threshold', color='r')
plt.xlabel('Number of Features')
plt.ylabel('Cumulative Importance')
plt.title('Cumulative Feature Importance')
plt.grid(True)
plt.show()
# ==========================================
# 5. Output Report
# ==========================================
print("\nTop 10 Features (Permutation):")
print(perm_df.head(10))

Model Selection
Overview
- Linear models
  - Linear Regression
  - Logistic Regression
- Support vector machines
  - SVM
  - SVR
- Tree models and ensemble learning
  - Decision Tree
  - Random Forest
  - XGBoost
  - LightGBM
  - CatBoost
- Clustering algorithms
  - K-Means
  - DBSCAN
- Basic neural networks
  - MLP (PyTorch/Keras)
- Convolutional neural networks
  - CNN (ResNet, VGG)
- Recurrent neural networks
  - LSTM
  - GRU
- Transformer models
  - HuggingFace Transformers
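A quick way to shortlist candidates from the list above is to score a few models under the same cross-validation setup. A minimal sketch using scikit-learn built-ins on synthetic data; gradient-boosting libraries (XGBoost, LightGBM, CatBoost) can be dropped into the same dict if installed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data for demonstration; substitute your own X, y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'SVC': SVC(),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same 5-fold CV and metric for every candidate keeps the comparison fair
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")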
Parameter Tuning
Overview
- Cross-validation
  - K-Fold Cross Validation
  - Stratified K-Fold
- Hyperparameter search
  - Grid Search
  - Random Search
  - Bayesian Optimization (Optuna)
- Heuristic search
  - Particle Swarm Optimization (PSO)
  - Genetic Algorithm (GA)
- Training strategies
  - Early Stopping
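As an example of combining cross-validation with hyperparameter search, the sketch below runs GridSearchCV over a small random-forest grid using Stratified K-Fold; the grid values are illustrative only. RandomizedSearchCV or Optuna can replace the exhaustive grid when the search space grows.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Illustrative grid; tune the ranges to your problem
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [5, 10, None],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,           # stratified folds preserve the class ratio in each split
    scoring='f1',
    n_jobs=-1,
)
search.fit(X, y)
print("Best params:", search.best_params_)
print(f"Best CV F1: {search.best_score_:.3f}")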
Performance Evaluation
Overview
- Classification metrics
  - Accuracy
  - Precision
  - Recall
  - F1-Score
  - AUC-ROC
- Regression metrics
  - MSE
  - RMSE
  - MAE
  - R2 Score
- Diagnostic plots
  - Confusion Matrix
  - ROC Curve
  - PR Curve
  - Loss/Accuracy Training Curves
- Model interpretation
  - Feature Importance Plot
  - SHAP Values
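To close the loop, a minimal evaluation sketch on synthetic data covering the classification items above: confusion matrix, precision/recall/F1 report, AUC-ROC, and an ROC curve. RocCurveDisplay.from_predictions assumes scikit-learn >= 1.0.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (RocCurveDisplay, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision / recall / F1 per class
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.3f}")

RocCurveDisplay.from_predictions(y_test, y_prob)  # ROC curve from held-out scores
plt.show()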