Guys, I've been having a really rough time lately 😭. A large model kept failing on a single machine with multiple GPUs, and after debugging until my eyes blurred over, I found that directly transferring large tensors between devices behaves in a completely deranged way. Please take a look for me 🙇
Setting aside my mountain-of-spaghetti codebase, just run this minimal script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # limit which GPUs are visible

import torch

# 1. Check the PyTorch version
print(f"PyTorch version: {torch.__version__}")  # needs >= 1.7

# 2. Check the CUDA devices
device0 = torch.device('cuda:0')
device1 = torch.device('cuda:1')
print(f"Device count: {torch.cuda.device_count()}")  # needs >= 2
print(f"CUDA available: {torch.cuda.is_available()}")

# 3. Create a test tensor (simulating real model input)
a = torch.randn((1, 447, 4096), dtype=torch.float16, device="cuda:0")
print(f"a before transfer (cuda:0): \n{a}")

# 4. Transfer to the other device
b = a.to(device1)
# b = a.to("cpu").to(device1)  # workaround: stage through the CPU

print(f"b after transfer (cuda:1): \n{b}")

# 5. Sanity-check the sum (all zeros would give 0)
print(f"Sum of b: {b.sum()}")
The output:
PyTorch version: 2.6.0+cu124
Device count: 2
CUDA available: True
a before transfer (cuda:0):
tensor([[[-0.1302,  0.5620, -0.8608,  ..., -1.2217, -0.9214, -0.7627],
         [ 1.3145,  0.3105, -0.4094,  ..., -0.7773, -2.0195, -0.6938],
         [ 0.5791, -0.9868,  1.1650,  ...,  2.1152,  0.8052, -1.2822],
         ...,
         [ 0.6489,  0.8032,  1.4160,  ..., -1.7891,  1.6729,  0.8071],
         [-1.0586,  0.2969, -0.2120,  ...,  1.1533,  0.1047,  0.4153],
         [ 0.6528, -0.3794,  0.5884,  ..., -0.8838, -0.4670,  0.6138]]],
       device='cuda:0', dtype=torch.float16)
b after transfer (cuda:1):
tensor([[[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         ...,
         [ 0.6489,  0.8032,  1.4160,  ..., -1.7891,  1.6729,  0.8071],
         [-1.0586,  0.2969, -0.2120,  ...,  1.1533,  0.1047,  0.4153],
         [ 0.6528, -0.3794,  0.5884,  ..., -0.8838, -0.4670,  0.6138]]],
       device='cuda:1', dtype=torch.float16)
Sum of b: 145.375
The contents of b are random nonsense that changes from run to run, and sometimes b even comes out all zeros 🤦
Switching to b = a.to("cpu").to(device1) and staging the tensor through the CPU makes everything normal, and the elements of b match a exactly.
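For anyone willing to help reproduce this, here is a minimal diagnostic sketch (the helper name check_p2p_copy is mine, just for illustration, not from the script above). It asks the driver whether peer-to-peer access is reported between the two GPUs, then compares a direct cuda:0 → cuda:1 copy against the CPU-staged copy:

import torch

# Illustrative helper, not part of the original script.
def check_p2p_copy(src="cuda:0", dst="cuda:1"):
    # Does the driver claim the two GPUs can access each other's memory directly?
    # Broken setups sometimes report True here while real P2P copies silently fail.
    print("can_device_access_peer:",
          torch.cuda.can_device_access_peer(torch.device(src), torch.device(dst)))

    a = torch.randn((1, 447, 4096), dtype=torch.float16, device=src)

    direct = a.to(dst)            # direct GPU-to-GPU copy (the path that breaks for me)
    staged = a.to("cpu").to(dst)  # workaround: stage through host memory
    torch.cuda.synchronize()      # make sure both copies have finished

    # Use the staged copy as the reference; if the direct copy disagrees,
    # the device-to-device path is broken on this machine.
    print("direct matches staged:", torch.equal(direct, staged))

check_p2p_copy()

My hunch, based on similar reports, is that the driver claims peer-to-peer access works between these cards when it actually doesn't (IOMMU/ACS settings and some consumer GPUs are commonly blamed), so the direct copy silently transfers garbage, while staging through host memory sidesteps the P2P path entirely. Can anyone confirm?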