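Summary

This article summarizes six practical model debugging techniques: 1) set breakpoints to inspect code line by line; 2) use the fast_dev_run argument to quickly exercise the full pipeline; 3) limit the number of batches to shorten each epoch; 4) use num_sanity_val_steps to run validation up front; 5) print the model's weights summary with ModelSummary; 6) set example_input_array to show each layer's input and output sizes. These techniques can noticeably speed up debugging, especially when developing and validating large deep learning models, and help developers locate problems quickly and refine the model structure.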

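1. Set a breakpoint

A breakpoint stops code execution so you can inspect variables, and it lets you step through your code one line at a time.

def function_to_debug():
    x = 2

    # set breakpoint
    breakpoint()
    y = x**2

In this example, execution stops just before the line y = x**2 runs.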
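Because breakpoint() dispatches to sys.breakpointhook(), the hook can be swapped out, which makes it possible to see what the debugger would see without an interactive session. The sketch below is illustrative only: recording_hook and captured are hypothetical names introduced here, and the hook relies on sys._getframe, a CPython implementation detail.

```python
import sys

captured = {}  # local variables visible at the breakpoint


def recording_hook(*args, **kwargs):
    # Frame 1 is the function that called breakpoint().
    frame = sys._getframe(1)
    captured.update(dict(frame.f_locals))


def function_to_debug():
    x = 2
    breakpoint()  # dispatches to sys.breakpointhook()
    y = x ** 2
    return y


sys.breakpointhook = recording_hook           # replace the default pdb hook
result = function_to_debug()
sys.breakpointhook = sys.__breakpointhook__   # restore the default hook

print(captured, result)
```

In everyday use you would keep the default hook and step with pdb commands such as n (next line), p x (print a variable), and c (continue). Setting the environment variable PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op.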

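2. Run all of your model code once, quickly

If you have ever had a model train for days only to crash during validation or testing, this trainer argument is about to become your best friend.

With fast_dev_run=True, the trainer pushes only a single batch through the full training → validation → test → prediction flow, so you can quickly check whether the code has any errors: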

trainer = Trainer(fast_dev_run=True)

To change the number of batches, set the argument to an integer. Here we run 7 batches of each stage:

trainer = Trainer(fast_dev_run=7)

When fast_dev_run is enabled, the following components are disabled automatically:

The tuner
Checkpoint callbacks
Early stopping callbacks
All loggers
Logger callbacks (such as LearningRateMonitor and DeviceStatsMonitor)

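3. Shorten the epoch length

In some cases it helps to use only a fraction of your training, validation, test, or prediction data (or a fixed number of batches). For example:

✅ Use only 20% of the training set

✅ Use only 1% of the validation set

When working with a large dataset such as ImageNet, this lets you:

✅ Finish debugging or testing quickly

✅ Avoid waiting for a full epoch to finish

✅ Greatly shorten the feedback loop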

# use only 10% of training data and 1% of val data
trainer = Trainer(limit_train_batches=0.1, limit_val_batches=0.01)

# use 10 batches of train and 5 batches of val
trainer = Trainer(limit_train_batches=10, limit_val_batches=5)
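The float and int forms of these limits mean different things, which the arithmetic below tries to make explicit. This is a plain-Python sketch, not Lightning's own code: batches_to_run is a hypothetical helper, and it assumes a float is a fraction of the loader truncated toward zero while an int is an absolute cap.

```python
def batches_to_run(total_batches: int, limit) -> int:
    """Sketch of how a limit_*_batches value maps to a batch count.

    Assumption: a float is a fraction of the loader, truncated toward zero;
    an int is an absolute cap on the number of batches.
    """
    if isinstance(limit, bool):
        raise TypeError("limit must be an int or a float, not a bool")
    if isinstance(limit, float):
        if not 0.0 <= limit <= 1.0:
            raise ValueError("a float limit must lie in [0.0, 1.0]")
        return int(total_batches * limit)
    return min(total_batches, limit)


total = 1000  # hypothetical number of batches in one epoch

fraction_run = batches_to_run(total, 0.1)  # 10% of the batches
capped_run = batches_to_run(total, 10)     # at most 10 batches
one_batch = batches_to_run(total, 1)       # int 1: a single batch
full_epoch = batches_to_run(total, 1.0)    # float 1.0: the whole epoch

print(fraction_run, capped_run, one_batch, full_epoch)
```

The distinction between 1 and 1.0 is the one most worth remembering: the integer asks for one batch, the float asks for 100% of the data.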

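4. Run a sanity check

At the start of training, Lightning first runs 2 validation steps. This guards against a crash in the validation loop that would otherwise surface only deep into a lengthy training run.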

trainer = Trainer(num_sanity_val_steps=2)
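The value of running validation first is easiest to see in a framework-free sketch. The code below is not Lightning's implementation; fit, training_step, and buggy_validation_step are hypothetical stand-ins that only illustrate the ordering.

```python
def fit(train_batches, val_batches, training_step, validation_step,
        num_sanity_val_steps=2):
    """Framework-free sketch: validate a few batches before training."""
    # Run a few validation batches first so validation bugs fail fast.
    for batch in val_batches[:num_sanity_val_steps]:
        validation_step(batch)
    # Only then start the expensive part.
    for batch in train_batches:
        training_step(batch)
    for batch in val_batches:
        validation_step(batch)


trained = []  # records which training batches actually ran


def training_step(batch):
    trained.append(batch)


def buggy_validation_step(batch):
    # Deliberate bug: the key "label" is missing from every batch.
    return batch["label"]


error = None
try:
    fit(
        train_batches=[{"x": i} for i in range(1000)],
        val_batches=[{"x": 0}, {"x": 1}],
        training_step=training_step,
        validation_step=buggy_validation_step,
    )
except KeyError as exc:
    error = exc

print(type(error).__name__, len(trained))
```

The bug surfaces before a single training batch is processed. In Lightning itself, num_sanity_val_steps=0 skips the check and num_sanity_val_steps=-1 runs every validation batch up front.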

5. Print the LightningModule weights summary

Whenever .fit() is called, the Trainer prints the weights summary of the LightningModule.

trainer.fit(...)

This generates a table like the following:

  | Name  | Type        | Params | Mode
-------------------------------------------
0 | net   | Sequential  | 132 K  | train
1 | net.0 | Linear      | 131 K  | train
2 | net.1 | BatchNorm1d | 1.0 K  | train
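The Params column can be checked by hand. The sketch below assumes the layers are Linear(256, 512) and BatchNorm1d(512) with default settings (bias and affine parameters enabled), which is consistent with the input and output sizes shown in the table in section 6; the helper names are hypothetical.

```python
def linear_params(in_features: int, out_features: int) -> int:
    # One weight per input/output pair plus one bias per output.
    return in_features * out_features + out_features


def batchnorm1d_params(num_features: int) -> int:
    # A learnable scale and shift per feature; the running statistics
    # are buffers, not parameters, so they are not counted.
    return 2 * num_features


net_0 = linear_params(256, 512)   # the "131 K" row
net_1 = batchnorm1d_params(512)   # the "1.0 K" row
net = net_0 + net_1               # the "132 K" row for the Sequential container

print(net_0, net_1, net)
```

A container row such as net reports the sum of its children, so the rows of a nested summary should not be added together to get the model total.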

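To show child modules in the model summary, add a ModelSummary callback:

from lightning.pytorch.callbacks import ModelSummary  # import the model summary callback

trainer = Trainer(callbacks=[ModelSummary(max_depth=-1)])  # configure the callback when creating the trainer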

Parameter notes

ModelSummary(
    # max_depth controls how deep the summary descends:
    #   -1 = every nested submodule, 1 = top-level modules only (default), 0 = layer summary off
    max_depth=-1,
)

Example output (schematic):

| Name        | Type          | Params | In dim       | Out dim      |
|-------------|---------------|--------|--------------|--------------|
| net         | Sequential    | 1.5 M  | [32, 3, 224] | [32, 1000]   |
|  ├─conv1    | Conv2d        | 9.4 K  | [32, 3, 224] | [32, 64,112] |
|  ├─bn1      | BatchNorm2d   | 128    | [32,64,112]  | [32,64,112]  |
|  └─...      | ...           | ...    | ...          | ...          |

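To print the model summary without calling .fit(), use the following:

from lightning.pytorch.utilities.model_summary import ModelSummary  # import the summary class from the utilities module

model = LitModel()  # instantiate your custom model
summary = ModelSummary(model, max_depth=-1)  # build a summary covering every nesting level
print(summary)  # print the structured model report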

Parameter notes

ModelSummary(
    model,         # required: your model, a LightningModule subclass
    max_depth=-1,  # nesting depth: -1 = all submodules, 1 = top-level only (default), 0 = layer summary off
)

Example output (schematic):

╒═════════════╤══════════════╤═════════╤══════════╤═══════════╕
│ Layer       │ Type         │ Params  │ In dim   │ Out dim   │
╞═════════════╪══════════════╪═════════╪══════════╪═══════════╡
│ encoder     │ Sequential   │ 4.7M    │ [32,256] │ [32,512]  │
│ ├─lstm1     │ LSTM         │ 3.2M    │ [32,256] │ [32,128]  │
│ ├─dropout   │ Dropout      │ 0       │ [32,128] │ [32,128]  │
│ └─...       │ ...          │ ...     │ ...      │ ...       │
╘═════════════╧══════════════╧═════════╧══════════╧═══════════╛
Trainable params: 4.7M
Non-trainable params: 0

To turn off the automatic summary, use:

trainer = Trainer(enable_model_summary=False)

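6. Print input and output layer sizes

Another debugging tool is to display the intermediate input and output sizes of all your layers by setting the example_input_array attribute in your LightningModule.

class LitModel(LightningModule):
    def __init__(self, *args, **kwargs):
        super().__init__()  # initialize the module before assigning attributes
        self.example_input_array = torch.Tensor(32, 1, 28, 28)

With the input array set, the summary table includes the input and output dimensions of each layer: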

  | Name  | Type        | Params | Mode  | In sizes  | Out sizes
----------------------------------------------------------------------
0 | net   | Sequential  | 132 K  | train | [10, 256] | [10, 512]
1 | net.0 | Linear      | 131 K  | train | [10, 256] | [10, 512]
2 | net.1 | BatchNorm1d | 1.0 K  | train | [10, 512] | [10, 512]