Guys, I've been having a really rough time lately 😭. A large model kept failing on a single machine with multiple GPUs, and after debugging until my eyes blurred over, I found that directly transferring large tensors between devices behaves in a completely deranged way. Please take a look for me 🙇
Setting aside my mountain-of-spaghetti codebase, just run this minimal script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # limit which GPUs are visible

import torch

# 1. Check the PyTorch version
print(f"PyTorch version: {torch.__version__}")  # needs >= 1.7

# 2. Check the CUDA devices
device0 = torch.device('cuda:0')
device1 = torch.device('cuda:1')
print(f"Device count: {torch.cuda.device_count()}")  # needs >= 2
print(f"CUDA available: {torch.cuda.is_available()}")

# 3. Create a test tensor (simulating real model input)
a = torch.randn((1, 447, 4096), dtype=torch.float16, device="cuda:0")
print(f"a before transfer (cuda:0): \n{a}")

# 4. Transfer to the other device
b = a.to(device1)
# b = a.to("cpu").to(device1)  # workaround: stage through the CPU

print(f"b after transfer (cuda:1): \n{b}")

# 5. Sanity-check the sum (all zeros would give 0)
print(f"Sum of b: {b.sum()}")
The output:
PyTorch version: 2.6.0+cu124
Device count: 2
CUDA available: True
a before transfer (cuda:0):
tensor([[[-0.1302,  0.5620, -0.8608,  ..., -1.2217, -0.9214, -0.7627],
         [ 1.3145,  0.3105, -0.4094,  ..., -0.7773, -2.0195, -0.6938],
         [ 0.5791, -0.9868,  1.1650,  ...,  2.1152,  0.8052, -1.2822],
         ...,
         [ 0.6489,  0.8032,  1.4160,  ..., -1.7891,  1.6729,  0.8071],
         [-1.0586,  0.2969, -0.2120,  ...,  1.1533,  0.1047,  0.4153],
         [ 0.6528, -0.3794,  0.5884,  ..., -0.8838, -0.4670,  0.6138]]],
       device='cuda:0', dtype=torch.float16)
b after transfer (cuda:1):
tensor([[[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         ...,
         [ 0.6489,  0.8032,  1.4160,  ..., -1.7891,  1.6729,  0.8071],
         [-1.0586,  0.2969, -0.2120,  ...,  1.1533,  0.1047,  0.4153],
         [ 0.6528, -0.3794,  0.5884,  ..., -0.8838, -0.4670,  0.6138]]],
       device='cuda:1', dtype=torch.float16)
Sum of b: 145.375
The contents of b are random nonsense that changes from run to run, and sometimes b even comes out all zeros 🤦
Switching to b = a.to("cpu").to(device1) and staging the tensor through the CPU makes everything normal, and the elements of b match a exactly.
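For anyone willing to help reproduce this, here is a minimal diagnostic sketch (the helper name check_p2p_copy is mine, just for illustration, not from the script above). It asks the driver whether peer-to-peer access is reported between the two GPUs, then compares a direct cuda:0 → cuda:1 copy against the CPU-staged copy:

import torch

# Illustrative helper, not part of the original script.
def check_p2p_copy(src="cuda:0", dst="cuda:1"):
    # Does the driver claim the two GPUs can access each other's memory directly?
    # Broken setups sometimes report True here while real P2P copies silently fail.
    print("can_device_access_peer:",
          torch.cuda.can_device_access_peer(torch.device(src), torch.device(dst)))

    a = torch.randn((1, 447, 4096), dtype=torch.float16, device=src)

    direct = a.to(dst)            # direct GPU-to-GPU copy (the path that breaks for me)
    staged = a.to("cpu").to(dst)  # workaround: stage through host memory
    torch.cuda.synchronize()      # make sure both copies have finished

    # Use the staged copy as the reference; if the direct copy disagrees,
    # the device-to-device path is broken on this machine.
    print("direct matches staged:", torch.equal(direct, staged))

check_p2p_copy()

My hunch, based on similar reports, is that the driver claims peer-to-peer access works between these cards when it actually doesn't (IOMMU/ACS settings and some consumer GPUs are commonly blamed), so the direct copy silently transfers garbage, while staging through host memory sidesteps the P2P path entirely. Can anyone confirm?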