面向小样本的 XGBoost 妥当建模与 SHAP 表明 —— 分身猜测精度与焦点变量可表明性
在社会科学、经济管理等范畴的研究中,样本量通常有限,同时又须要对关键变量的影响力做出可靠推断。本文分享一套新改进的 XGBoost 建模流程,该流程专为小样本回归任务计划,在控制过拟合的条件下,仍旧可以或许给出可查验的 (R^2) 评价,并通过 SHAP 值包管焦点表明变量的显着性。整套代码可复现,天生十余张高质量分析图,末了附有完备源码供直接使用。
1. 计划思绪与改进要点
相较于通例的“调参—猜测—输出告急性”流程,本代码告急做了三方面改进:
- 面向小样本的候选参数池:预先计划多组强正则化的超参数组合(浅树深度、低学习率、子采样、L1/L2 正则项等),制止在小样本上太过搜刮复杂模子。
- 焦点变量驱动的模子筛选机制:不光依据验证集 RMSE 选最优模子,还欺压要求焦点特性(本例为“普惠金融指数”)的 SHAP 绝对值均值非零,并在弊端可担当范围内选择对该特性表明力最强的模子。
- 独立测试集的 (R^2) 验证:全程严酷区分练习、验证、测试三层数据,终极在从未到场练习的测试集上陈诉 (R^2)、RMSE、MAE,确保小样本条件下的泛化本领可量化。
2. 数据预备与预处理处罚
代码从 xgbosst.csv 中读取数据,主动剔除非数值列、常量列以及含有无穷值的行,包管建模输入干净。目的变量为“韧性指数”(列名 XHM),全部特性通过 FEATURE_CN_MAP 映射为中文名称,便于后续图表直接用于论文或陈诉。- df = pd.read_csv(FILE_PATH)
- df_numeric = df.select_dtypes(include=[np.number])
- df_numeric = df_numeric.loc[:, df_numeric.nunique(dropna=True) > 1]
- df_numeric = df_numeric.replace([np.inf, -np.inf], np.nan).dropna()
复制代码 数据按照 80%/20% 分别为练习集和测试集,再从练习会合划出 25% 作为验证集(即团体比例为 60% 练习、20% 验证、20% 测试)。此处未使用分层抽样(stratify=None),实用于回归任务。
3. 小样本适配的候选参数计划
为了在小样本下均衡弊端与方差,预设了 7 组超参数组合,共同特点为:
- max_depth 控制在 1~3,树结构极浅;
- learning_rate 分布在 0.03~0.3,多数组合接纳较低学习率;
- subsample 与 colsample_bytree 不大于 1.0,部分组合刻意降至 0.7,增长随机性;
- reg_alpha(L1)、reg_lambda(L2)、gamma(分裂最小丧失淘汰)、min_child_weight 均被显式设置,强化正则。
同时,启用早停机制(early_stopping_rounds),依据验证集 RMSE 在过拟合前制止练习。每组候选模子都会记录验证集 RMSE、验证集 (R^2)、焦点特性“普惠金融指数”在验证集上的匀称绝对 SHAP 值,以赶早停后的实际迭代次数。
4. 焦点变量驱动的模子择优战略
模子选择并非只取验证集 RMSE 最小的候选者,而是接纳两阶段逻辑:
- 找出 RMSE 最低的候选模子作为基准;
- 从全部焦点特性 SHAP 非零的候选模子中,筛选出验证集 RMSE 不凌驾基准 RMSE 肯定倍数(CORE_RMSE_TOLERANCE = 1.5)的模子;
- 若存在符合条件的模子,优先选择此中焦点特性匀称绝对 SHAP 最大的模子(即对普惠金融指数表明力最强,同时弊端未显着变差);若无符合条件者,则退化为选择 RMSE 最小的模子。
此举包管终极模子不光猜测精确,还能在焦点研究变量上提供有统计意义的归因,制止“猜测好但关键变量毫无贡献”的尴尬情况。
5. 测试集评价与 (R^2) 验证
择优完成后,使用最佳超参数和早停确定的最佳迭代次数在练习集(60% 数据)上重新练习终极模子,并在测试集(20% 数据)上盘算:
- (R^2)(决定系数)
- RMSE(均方根弊端)
- MAE(匀称绝对弊端)
这些指标会生存至 model_test_metrics.csv,同时全部候选模子的验证集表现会输出到 model_candidate_results.csv,方便回溯比力。对于小样本题目,独立的测试集 (R^2) 是衡量模子是否过拟合的关键证据。
6. SHAP 可表明性全景分析
代码天生了十余种 SHAP 可视化图表,覆盖全局表明与局部表明:
- 小提琴图(shap_violin.png)和蜂群图(shap_beeswarm.png):展示各特性 SHAP 值分布与方向;
- 全样本热力图(shap_heatmap.png)与20 个随机样本热力图(shap_heatmap_20samples.png):直观出现每个样本在每个特性上的 SHAP 贡献;
- 瀑布图(shap_waterfall_sample5.png):分解单个样本的猜测驱动力;
- 特性告急性条形图(shap_feature_importance_bar.png):按匀称绝对 SHAP 排序;
- 依赖图(shap_dependence_top1.png、top2.png、core_feature.png):展示告急特性与 SHAP 值的关系;
- 贡献份额图(shap_contribution_share.png):TOP N 特性的贡献占比及累计曲线。
别的,测试集猜测值与真实值的散点图(model_prediction_performance.png)会标注 (R^2)、RMSE 和 MAE,便于快速评估模子表现。
7. 使用阐明
- 将数据文件定名为 xgbosst.csv,确保包罗目的列 XHM 及 FEATURE_CN_MAP 中对应的列(可按实际修改映射和文件名)。
- 安装依赖:numpy, pandas, matplotlib, xgboost, shap, scikit-learn。
- 运行脚本,全部图表和 CSV 文件将输出至当前目次。
- 若须要调解焦点特性或容忍系数,修改 CORE_FEATURE_NAME 和 CORE_RMSE_TOLERANCE 两个常量即可。
完备代码
以下代码可直接复制运行(注意查抄文件路径和字体设置):- import osimport sysimport importlibimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltimport xgboost as xgbfrom sklearn.metrics import mean_absolute_error, mean_squared_error, r2_scorefrom sklearn.model_selection import train_test_splitif __name__ == "__main__": current_dir = os.path.dirname(os.path.abspath(__file__)) sys.path = [p for p in sys.path if os.path.abspath(p or ".") != current_dir]shap = importlib.import_module("shap")FIG_DPI = 400TOP_N = 15RANDOM_HEATMAP_SAMPLES = 20FILE_PATH = "xgbosst.csv"TARGET_COL = "XHM"CORE_FEATURE_NAME = "普惠金融指数"CORE_SHAP_EPS = 1e-8CORE_RMSE_TOLERANCE = 1.5FEATURE_CN_MAP = { "Feat1": "人均GDP", "Feat2": "专利/万人", "Feat3": "对外开放度", "Feat4": "财产高级化", "Feat5": "科技付出占比", "Feat6": "交通可达性", "Feat7": "普惠金融指数",}TARGET_CN_NAME = "韧性指数"plt.rcParams["font.family"] = ["SimSun", "DejaVu Sans"]plt.rcParams["axes.unicode_minus"] = Falsedf = pd.read_csv(FILE_PATH)
- df_numeric = df.select_dtypes(include=[np.number])
- df_numeric = df_numeric.loc[:, df_numeric.nunique(dropna=True) > 1]
- df_numeric = df_numeric.replace([np.inf, -np.inf], np.nan).dropna()if TARGET_COL not in df_numeric.columns: raise ValueError(f"目的列 `{TARGET_COL}` 不在数据中。")X = df_numeric.drop(columns=[TARGET_COL], errors="ignore")y = df_numeric[TARGET_COL]X = X.rename(columns=FEATURE_CN_MAP)y = y.rename(TARGET_CN_NAME)X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=None)X_fit, X_valid, y_fit, y_valid = train_test_split( X_train, y_train, test_size=0.25, random_state=42, stratify=None)base_xgb_params = { "objective": "reg:squarederror", "eval_metric": "rmse", "random_state": 42, "n_jobs": 1,}candidate_params = [ { "n_estimators": 100, "learning_rate": 0.3, "max_depth": 2, "subsample": 0.7, "colsample_bytree": 0.7, "reg_alpha": 0.5, "reg_lambda": 1.0, "gamma": 0.1, "min_child_weight": 1, }, { "n_estimators": 200, "learning_rate": 0.05, "max_depth": 1, "subsample": 0.8, "colsample_bytree": 0.8, "reg_alpha": 0.2, "reg_lambda": 3.0, "gamma": 0.0, "min_child_weight": 2, }, { "n_estimators": 150, "learning_rate": 0.1, "max_depth": 2, "subsample": 0.8, "colsample_bytree": 0.8, "reg_alpha": 0.1, "reg_lambda": 2.0, "gamma": 0.05, "min_child_weight": 2, }, { "n_estimators": 80, "learning_rate": 0.2, "max_depth": 1, "subsample": 0.7, "colsample_bytree": 0.7, "reg_alpha": 0.5, "reg_lambda": 5.0, "gamma": 0.1, "min_child_weight": 3, }, { "n_estimators": 250, "learning_rate": 0.03, "max_depth": 2, "subsample": 0.9, "colsample_bytree": 1.0, "reg_alpha": 0.0, "reg_lambda": 0.5, "gamma": 0.0, "min_child_weight": 1, }, { "n_estimators": 120, "learning_rate": 0.08, "max_depth": 2, "subsample": 1.0, "colsample_bytree": 1.0, "reg_alpha": 0.0, "reg_lambda": 1.0, "gamma": 0.0, "min_child_weight": 1, }, { "n_estimators": 80, "learning_rate": 0.1, "max_depth": 3, "subsample": 0.9, "colsample_bytree": 1.0, "reg_alpha": 0.0, "reg_lambda": 0.5, "gamma": 0.0, "min_child_weight": 1, },]core_feature_idx = X_train.columns.get_loc(CORE_FEATURE_NAME) if CORE_FEATURE_NAME in X_train.columns else Nonecandidate_results = []for params in candidate_params: early_stop_rounds = min(10, max(5, params["n_estimators"] // 10)) candidate_model = xgb.XGBRegressor( **base_xgb_params, **params, early_stopping_rounds=early_stop_rounds, ) try: candidate_model.fit( X_fit, y_fit, eval_set=[(X_valid, y_valid)], verbose=False, ) except TypeError: candidate_model = xgb.XGBRegressor(**base_xgb_params, **params) candidate_model.fit( X_fit, y_fit, eval_set=[(X_valid, y_valid)], early_stopping_rounds=early_stop_rounds, verbose=False, ) valid_pred = candidate_model.predict(X_valid) valid_rmse = np.sqrt(mean_squared_error(y_valid, valid_pred)) valid_r2 = r2_score(y_valid, valid_pred) core_valid_mean_abs_shap = 0.0 if core_feature_idx is not None: valid_shap_values = shap.Explainer(candidate_model)(X_valid).values core_valid_mean_abs_shap = np.abs(valid_shap_values[:, core_feature_idx]).mean() best_iteration = getattr(candidate_model, "best_iteration", None) if best_iteration is None: best_iteration = params["n_estimators"] - 1 selected_n_estimators = max(1, int(best_iteration) + 1) candidate_result = { "valid_rmse": valid_rmse, "valid_r2": valid_r2, "core_valid_mean_abs_shap": core_valid_mean_abs_shap, "selected_n_estimators": selected_n_estimators, "params": params, } candidate_results.append(candidate_result)rmse_best_candidate = min(candidate_results, key=lambda item: item["valid_rmse"])core_candidates = [ item for item in candidate_results if item["core_valid_mean_abs_shap"] > CORE_SHAP_EPS]core_rmse_limit = max( rmse_best_candidate["valid_rmse"] * CORE_RMSE_TOLERANCE, rmse_best_candidate["valid_rmse"] + 0.01,)eligible_core_candidates = [ item for item in core_candidates if item["valid_rmse"] min_value else 0.05line_min = min_value - paddingline_max = max_value + paddingax_p.plot([line_min, line_max], [line_min, line_max], color="#C0392B", linewidth=2.0, line)ax_p.set_xlim(line_min, line_max)ax_p.set_ylim(line_min, line_max)ax_p.set_title("Test Set Prediction Performance", fontsize=20, pad=14, fontweight="bold")ax_p.set_xlabel(f"Actual {TARGET_CN_NAME}", fontsize=14)ax_p.set_ylabel(f"Predicted {TARGET_CN_NAME}", fontsize=14)ax_p.grid(line, alpha=0.25)ax_p.text( 0.05, 0.95, f"R² = {test_r2:.4f}\nRMSE = {test_rmse:.4f}\nMAE = {test_mae:.4f}", transform=ax_p.transAxes, va="top", fontsize=13, bbox={"boxstyle": "round,pad=0.35", "facecolor": "white", "edgecolor": "#D0D3D4", "alpha": 0.92},)fig_p.tight_layout()fig_p.savefig("model_prediction_performance.png", dpi=FIG_DPI, bbox_inches="tight", facecolor="white")plt.close(fig_p)# 6) Feature importance bar chartbar_order = top_idx[::-1]bar_names = [X_test.columns[i] for i in bar_order]bar_values = mean_abs_shap[bar_order]fig_b, ax_b = plt.subplots(figsize=(12, 8), dpi=150)colors = plt.cm.Blues(np.linspace(0.35, 0.95, len(bar_values)))ax_b.barh(bar_names, bar_values, color=colors, edgecolor="white", linewidth=1.0)ax_b.set_title("Mean |SHAP| Feature Importance", fontsize=20, pad=14, fontweight="bold")ax_b.set_xlabel("Mean Absolute SHAP Value", fontsize=14)ax_b.tick_params(axis="both", labelsize=12)ax_b.grid(axis="x", line, alpha=0.25)for spine in ["top", "right", "left"]: ax_b.spines[spine].set_visible(False)for value, name in zip(bar_values, bar_names): ax_b.text(value, name, f" {value:.4f}", va="center", fontsize=10)fig_b.tight_layout()fig_b.savefig("shap_feature_importance_bar.png", dpi=FIG_DPI, bbox_inches="tight", facecolor="white")plt.close(fig_b)# 7) Beeswarm summary plotplt.figure(figsize=(12, 8), dpi=150)shap.summary_plot( shap_values_test, X_test, plot_type="dot", max_display=top_n, show=False,)ax_s = plt.gca()ax_s.set_title("SHAP Beeswarm Summary", fontsize=20, pad=14, fontweight="bold")ax_s.set_xlabel("SHAP Value", fontsize=16)ax_s.tick_params(axis="both", labelsize=12)plt.tight_layout()plt.savefig("shap_beeswarm.png", dpi=FIG_DPI, bbox_inches="tight", facecolor="white")plt.close()# 8-9) Dependence plots for top features and core featuredef save_dependence_plot(feature_idx, output_suffix): feature_name = X_test.columns[feature_idx] feature_values = X_test.iloc[:, feature_idx] feature_shap_values = shap_mat[:, feature_idx] fig_d, ax_d = plt.subplots(figsize=(10, 7), dpi=150) scatter = ax_d.scatter( feature_values, feature_shap_values, c=feature_values, cmap="coolwarm", s=78, alpha=0.85, edgecolor="white", linewidth=0.7, ) ax_d.axhline(0, color="#777777", linewidth=1.2, line, alpha=0.7) ax_d.set_title(f"Dependence Plot - {feature_name}", fontsize=18, pad=12, fontweight="bold") ax_d.set_xlabel(feature_name, fontsize=14) ax_d.set_ylabel("SHAP Value", fontsize=14) ax_d.tick_params(axis="both", labelsize=11) ax_d.grid(line, alpha=0.22) cbar_d = fig_d.colorbar(scatter, ax=ax_d, fraction=0.045, pad=0.04) cbar_d.set_label("Feature Value", fontsize=12) cbar_d.ax.tick_params(labelsize=10) fig_d.tight_layout() fig_d.savefig(f"shap_dependence_{output_suffix}.png", dpi=FIG_DPI, bbox_inches="tight", facecolor="white") plt.close(fig_d)for rank, feature_idx in enumerate(feature_order[: min(2, len(feature_order))], start=1): save_dependence_plot(feature_idx, f"top{rank}")if core_feature_idx is not None: save_dependence_plot(core_feature_idx, "core_feature")# 10) Cumulative contribution chartimportance_pct = top_importance / top_importance.sum() * 100cumulative_pct = np.cumsum(importance_pct)fig_c, ax_c = plt.subplots(figsize=(13, 7), dpi=150)x_pos = np.arange(len(top_feature_names))ax_c.bar(x_pos, importance_pct, color="#5DADE2", edgecolor="white", linewidth=1.0)ax_c.set_title("SHAP Contribution Share", fontsize=20, pad=14, fontweight="bold")ax_c.set_ylabel("Contribution Share (%)", fontsize=14)ax_c.set_xticks(x_pos)ax_c.set_xticklabels(top_feature_names, rotation=35, ha="right", fontsize=11)ax_c.tick_params(axis="y", labelsize=11)ax_c.grid(axis="y", line, alpha=0.25)ax_c2 = ax_c.twinx()ax_c2.plot(x_pos, cumulative_pct, color="#D35400", marker="o", linewidth=2.6)ax_c2.set_ylabel("Cumulative Share (%)", fontsize=14)ax_c2.set_ylim(0, 105)ax_c2.tick_params(axis="y", labelsize=11)for spine in ["top"]: ax_c.spines[spine].set_visible(False) ax_c2.spines[spine].set_visible(False)fig_c.tight_layout()fig_c.savefig("shap_contribution_share.png", dpi=FIG_DPI, bbox_inches="tight", facecolor="white")plt.close(fig_c)print("已天生同款风格图:")print("1) shap_violin.png")print("2) shap_heatmap.png")print("3) shap_heatmap_20samples.png")print("4) shap_waterfall_sample5.png")print("5) model_prediction_performance.png")print("6) shap_feature_importance_bar.png")print("7) shap_beeswarm.png")print("8) shap_dependence_top1.png")print("9) shap_dependence_top2.png")print("10) shap_dependence_core_feature.png")print("11) shap_contribution_share.png")print("测试集模子结果:")print(f"R² = {test_r2:.4f}")print(f"RMSE = {test_rmse:.4f}")print(f"MAE = {test_mae:.4f}")print("练习集内部验证结果:")print(f"Validation R² = {best_candidate['valid_r2']:.4f}")print(f"Validation RMSE = {best_candidate['valid_rmse']:.4f}")print(model_selection_note)print(f"{CORE_FEATURE_NAME} 测试集 Mean |SHAP| = {core_test_mean_abs_shap:.6f}")print(f"{CORE_FEATURE_NAME} 验证集 Mean |SHAP| = {best_candidate['core_valid_mean_abs_shap']:.6f}")print("终极接纳的 XGBoost 参数:")for key, value in best_params.items(): print(f"{key}: {value}")print("指标表:model_test_metrics.csv")print("候选参数对比表:model_candidate_results.csv")print(f"目的变量中文名:{TARGET_CN_NAME}")
复制代码 末了制品:

免责声明:如果侵犯了您的权益,请联系站长及时删除侵权内容,谢谢合作!qidao123.com:ToB企服之家,中国第一个企服评测及软件市场,开放入驻,技术点评得现金. |