
PyTorch Single-Machine Model Parallel Best Practices

Updated 2025-06-19 10:16

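As deep learning models grow larger and more complex, training on a single GPU often can no longer meet their memory and compute demands. Model parallelism addresses this by placing different parts of a model on different GPUs, which makes it possible to train large models on a single machine. This article walks through best practices for single-machine model parallelism in PyTorch, so that you can train and run large models on limited hardware.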

1. Basic Concepts of Model Parallelism

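The core idea of model parallelism is to place different sub-networks of a model on different GPUs and to move intermediate outputs between them during training. Unlike data parallelism (DataParallel), which replicates the entire model on every GPU, model parallelism gives each GPU only one part of the model. For example, if a model has 10 layers, model parallelism can place the first 5 layers on one GPU and the last 5 layers on another.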

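Because each GPU holds only part of the model, a larger model can fit in the combined GPU memory. The cost is extra communication overhead, since intermediate outputs must be copied between GPUs. In practice, you need to weigh the memory benefit of model parallelism against that communication cost.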
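The trade-off described above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the layer shapes, batch size, and float32 assumption are hypothetical, and it counts parameters only (gradients, optimizer state, and activations would add to the real footprint).

```python
# Hypothetical 10-layer MLP: each entry is (in_features, out_features).
layer_shapes = [(4096, 4096)] * 10
BYTES_PER_PARAM = 4  # assumes float32 parameters


def param_bytes(shapes):
    """Bytes needed for the weights and biases of the given linear layers."""
    return sum((i * o + o) * BYTES_PER_PARAM for i, o in shapes)


# First 5 layers on one GPU, last 5 layers on the other.
placement = {"cuda:0": layer_shapes[:5], "cuda:1": layer_shapes[5:]}
per_device = {dev: param_bytes(shapes) for dev, shapes in placement.items()}

# Size of the activation that crosses the GPU boundary for one batch.
batch_size = 64
boundary_bytes = batch_size * layer_shapes[4][1] * BYTES_PER_PARAM

for dev, nbytes in per_device.items():
    print(f"{dev}: {nbytes / 2**20:.1f} MiB of parameters")
print(f"copied per forward pass: {boundary_bytes / 2**20:.2f} MiB")
```

Each device holds half of the parameters, while every forward pass pays for one copy of the boundary activation (and the backward pass pays for a gradient of the same size travelling the other way).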

2. A Simple Model Parallel Example

Next, we use a simple toy model to show how to implement model parallelism in PyTorch.

(1) Define the model

import torch
import torch.nn as nn
import torch.optim as optim


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).to('cuda:0')  # place the first linear layer on GPU 0
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to('cuda:1')  # place the second linear layer on GPU 1


    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))  # move the input to GPU 0
        return self.net2(x.to('cuda:1'))  # move the intermediate output to GPU 1

(2) Define the loss function and optimizer

model = ToyModel()
loss_fn = nn.MSELoss()  # mean squared error loss
optimizer = optim.SGD(model.parameters(), lr=0.001)  # stochastic gradient descent optimizer

(3) Train the model

optimizer.zero_grad()  # clear accumulated gradients
outputs = model(torch.randn(20, 10))  # forward pass on a random input batch
labels = torch.randn(20, 5).to('cuda:1')  # labels must be on the same device as the outputs (GPU 1)
loss_fn(outputs, labels).backward()  # compute the loss and backpropagate
optimizer.step()  # update the model parameters

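In the code above, the two parts of the model live on two different GPUs, and .to(device) moves the input and the intermediate output to the GPU that holds the next layer. The labels must be on the same device as the model output (cuda:1 here) before the loss is computed. Apart from that, the loss function and optimizer are used exactly as in single-GPU training.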
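When experimenting, it can help to confirm where each parameter and its gradient actually end up. The following is a minimal sketch, not part of the original example: pick_devices is a hypothetical helper that falls back to the CPU when two GPUs are not available, and ToyModel is rewritten here to take its devices as arguments.

```python
import torch
import torch.nn as nn


def pick_devices():
    """Return two device names; fall back to CPU when two GPUs are not available."""
    if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
        return "cuda:0", "cuda:1"
    return "cpu", "cpu"


class ToyModel(nn.Module):
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        return self.net2(x.to(self.dev1))


dev0, dev1 = pick_devices()
model = ToyModel(dev0, dev1)
outputs = model(torch.randn(20, 10))
loss = nn.MSELoss()(outputs, torch.randn(20, 5, device=outputs.device))
loss.backward()

# Report where each parameter lives and where its gradient was accumulated.
for name, p in model.named_parameters():
    print(name, p.device, p.grad.device)
```

Autograd records the device copy as part of the graph, so the gradients flow back across the boundary without any extra code.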

3. Applying Model Parallelism to Existing Modules

An existing model can be made model parallel by subclassing the original model class and overriding its forward method. The following uses ResNet50 as an example.

(1) Define a model parallel ResNet50

from torchvision.models.resnet import ResNet, Bottleneck


num_classes = 1000


class ModelParallelResNet50(ResNet):
    def __init__(self, *args, **kwargs):
        super(ModelParallelResNet50, self).__init__(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, *args, **kwargs)
        self.seq1 = nn.Sequential(
            self.conv1,
            self.bn1,
            self.relu,
            self.maxpool,
            self.layer1,
            self.layer2
        ).to('cuda:0')  # place the first group of layers on GPU 0


        self.seq2 = nn.Sequential(
            self.layer3,
            self.layer4,
            self.avgpool,
        ).to('cuda:1')  # place the remaining layers on GPU 1


        self.fc.to('cuda:1')  # place the fully connected layer on GPU 1


    def forward(self, x):
        x = self.seq2(self.seq1(x).to('cuda:1'))  # copy the intermediate output from GPU 0 to GPU 1
        return self.fc(x.view(x.size(0), -1))  # flatten and run the fully connected layer on GPU 1
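The same pattern can be applied to any model that is already a flat list of layers. The sketch below is an illustration under assumptions: TwoStageSequential and its cut argument are hypothetical names, the layers are small stand-ins rather than ResNet50, and the devices fall back to the CPU when two GPUs are not present.

```python
import torch
import torch.nn as nn


class TwoStageSequential(nn.Module):
    """Hypothetical helper: run layers[:cut] on dev0 and layers[cut:] on dev1."""

    def __init__(self, layers, cut, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.seq1 = nn.Sequential(*layers[:cut]).to(dev0)
        self.seq2 = nn.Sequential(*layers[cut:]).to(dev1)

    def forward(self, x):
        x = self.seq1(x.to(self.dev0))      # first stage on dev0
        return self.seq2(x.to(self.dev1))   # copy once, then second stage on dev1


two_gpus = torch.cuda.is_available() and torch.cuda.device_count() >= 2
dev0, dev1 = ("cuda:0", "cuda:1") if two_gpus else ("cpu", "cpu")

layers = [nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)]
model = TwoStageSequential(layers, cut=2, dev0=dev0, dev1=dev1)
out = model(torch.randn(4, 32))
print(out.shape, out.device)
```

Where to cut is a design choice: a balanced split keeps both stages roughly equally loaded, and cutting where the activation is small keeps the copy cheap.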

(2) Train the parallel model

import torchvision.models as models
import timeit
import matplotlib.pyplot as plt
import numpy as np


# Define the training function
def train(model):
    model.train(True)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001)
    one_hot_indices = torch.LongTensor(batch_size).random_(0, num_classes).view(batch_size, 1)


    for _ in range(num_batches):
        inputs = torch.randn(batch_size, 3, image_w, image_h)
        labels = torch.zeros(batch_size, num_classes).scatter_(1, one_hot_indices, 1)


        optimizer.zero_grad()
        outputs = model(inputs.to('cuda:0'))  # move the inputs to GPU 0
        labels = labels.to(outputs.device)  # move the labels to the device that holds the outputs
        loss_fn(outputs, labels).backward()
        optimizer.step()


# Set the training parameters
num_batches = 3
batch_size = 120
image_w = 128
image_h = 128
num_repeat = 10


# Time the model parallel ResNet50
model = ModelParallelResNet50()
mp_run_times = timeit.repeat("train(model)", globals=globals(), number=1, repeat=num_repeat)
mp_mean, mp_std = np.mean(mp_run_times), np.std(mp_run_times)


# Time the single-GPU ResNet50
model = models.resnet50(num_classes=num_classes).to('cuda:0')
rn_run_times = timeit.repeat("train(model)", globals=globals(), number=1, repeat=num_repeat)
rn_mean, rn_std = np.mean(rn_run_times), np.std(rn_run_times)


# Plot the execution time comparison
def plot(means, stds, labels, fig_name):
    fig, ax = plt.subplots()
    ax.bar(np.arange(len(means)), means, yerr=stds, align='center', alpha=0.5, ecolor='red', capsize=10, width=0.6)
    ax.set_ylabel('ResNet50 Execution Time (Second)')
    ax.set_xticks(np.arange(len(means)))
    ax.set_xticklabels(labels)
    ax.yaxis.grid(True)
    plt.tight_layout()
    plt.savefig(fig_name)
    plt.close(fig)


plot([mp_mean, rn_mean], [mp_std, rn_std], ['Model Parallel', 'Single GPU'], 'mp_vs_rn.png')

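The results show that model parallelism solves the problem of a model being too large to fit on one GPU, but when the model does fit on one GPU, the model parallel version usually runs slower than the single-GPU version. In this implementation only one of the two GPUs is working at any moment while the other sits idle, and copying the intermediate output from cuda:0 to cuda:1 adds further overhead.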
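A rough cost model makes the slowdown easier to reason about. The numbers below are hypothetical per-batch timings, not measurements, and the model assumes both GPUs run each stage at the same speed as the single GPU would.

```python
def single_gpu_time(t_stage0, t_stage1):
    """Both stages run back to back on one GPU, with no copy."""
    return t_stage0 + t_stage1


def model_parallel_time(t_stage0, t_stage1, t_copy):
    """Stages run one after another on two GPUs, plus one device-to-device copy."""
    return t_stage0 + t_copy + t_stage1


# Hypothetical per-batch timings in milliseconds.
t0, t1, copy = 40.0, 45.0, 5.0
single = single_gpu_time(t0, t1)
mp = model_parallel_time(t0, t1, copy)
idle_fraction_gpu0 = 1 - t0 / mp  # share of each batch during which GPU 0 does no compute

print(f"single GPU: {single} ms, model parallel: {mp} ms")
print(f"GPU 0 idle for {idle_fraction_gpu0:.0%} of each batch")
```

Under these assumptions the plain model parallel version can never beat the single GPU, because the two stages still run strictly one after the other.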

4. Speeding Up Model Parallelism by Pipelining Inputs

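To improve the training efficiency of model parallelism, each input batch can be divided into smaller splits that are processed as a pipeline, so that both GPUs can work at the same time.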

(1) Define the pipeline parallel model

class PipelineParallelResNet50(ModelParallelResNet50):
    def __init__(self, split_size=20, *args, **kwargs):
        super(PipelineParallelResNet50, self).__init__(*args, **kwargs)
        self.split_size = split_size


    def forward(self, x):
        splits = iter(x.split(self.split_size, dim=0))  # divide the batch into splits along the batch dimension
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')  # run the first split on GPU 0 and copy the result to GPU 1
        ret = []


        for s_next in splits:
            s_prev = self.seq2(s_prev)  # A. the previous split runs on GPU 1
            ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))


            s_prev = self.seq1(s_next).to('cuda:1')  # B. the next split runs on GPU 0, which can overlap with A because CUDA operations are launched asynchronously


        s_prev = self.seq2(s_prev)
        ret.append(self.fc(s_prev.view(s_prev.size(0), -1)))


        return torch.cat(ret)
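One thing worth checking is that splitting the batch and concatenating the results does not change the output. The sketch below mirrors the loop structure above on a single device, using two small stand-in stages (hypothetical, not ResNet50) that contain no BatchNorm, and compares the result with an unsplit forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two small stand-in stages without BatchNorm (hypothetical, not ResNet50).
seq1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
seq2 = nn.Sequential(nn.Linear(32, 4))


def forward_full(x):
    return seq2(seq1(x))


def forward_pipelined(x, split_size):
    """Same loop structure as the pipelined forward above, on one device."""
    splits = iter(x.split(split_size, dim=0))
    s_prev = seq1(next(splits))
    ret = []
    for s_next in splits:
        ret.append(seq2(s_prev))   # finish the previous split
        s_prev = seq1(s_next)      # start the next split
    ret.append(seq2(s_prev))       # finish the last split
    return torch.cat(ret)


x = torch.randn(10, 16)
with torch.no_grad():
    full = forward_full(x)
    piped = forward_pipelined(x, split_size=4)  # splits of 4, 4, and 2 rows

print(torch.allclose(full, piped, atol=1e-5))
```

This equivalence holds for layers that treat each sample independently. BatchNorm in training mode computes its statistics per split, so for a ResNet50 in training mode the pipelined outputs need not match an unsplit forward pass exactly.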

(2) Test the pipeline parallel model

# Time the pipeline parallel ResNet50
model = PipelineParallelResNet50(split_size=20)
pp_run_times = timeit.repeat("train(model)", globals=globals(), number=1, repeat=num_repeat)
pp_mean, pp_std = np.mean(pp_run_times), np.std(pp_run_times)


# Plot the execution time comparison including pipeline parallelism
plot([mp_mean, rn_mean, pp_mean], [mp_std, rn_std, pp_std], ['Model Parallel', 'Single GPU', 'Pipelining Model Parallel'], 'mp_vs_rn_vs_pp.png')

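Pipelining divides each input batch into several smaller splits and lets the two GPUs work on different splits at the same time, which can noticeably reduce training time compared with the plain model parallel version. The speedup depends on split_size: splits that are too small launch many tiny CUDA kernels, while splits that are too large leave one GPU idle during the first and last splits, so the value is worth tuning experimentally.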
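The effect of the number of splits can be sketched with a small schedule simulation. This is an idealized model with hypothetical timings: it assumes compute time scales linearly with split size and ignores copy time and kernel-launch overhead, which is exactly what makes very small splits less attractive in real measurements.

```python
def pipeline_makespan(num_splits, t_stage0, t_stage1):
    """Finish time of a two-stage pipeline with equal-sized splits.

    t_stage0 and t_stage1 are the (hypothetical) times one split spends on
    GPU 0 and GPU 1; copy time and kernel-launch overhead are ignored.
    """
    stage0_free = 0.0  # time at which GPU 0 finishes its current split
    stage1_free = 0.0  # time at which GPU 1 finishes its current split
    for _ in range(num_splits):
        stage0_free += t_stage0                 # GPU 0 processes splits back to back
        start1 = max(stage0_free, stage1_free)  # GPU 1 waits for its input and for itself
        stage1_free = start1 + t_stage1
    return stage1_free


batch_time0, batch_time1 = 60.0, 60.0  # hypothetical ms for the whole batch on each stage
for num_splits in (1, 2, 6, 12):
    total = pipeline_makespan(num_splits, batch_time0 / num_splits, batch_time1 / num_splits)
    print(f"{num_splits:2d} splits -> {total:.1f} ms")
```

With one split the model reduces to plain model parallelism. As the number of splits grows, the idealized time approaches the time of a single stage, but real hardware adds a per-split cost that this sketch leaves out.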

5. Summary and Outlook

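This article covered how to implement single-machine model parallelism in PyTorch: the basic concept, a toy implementation, applying it to an existing model, and speeding it up with pipelined inputs. Model parallelism makes it possible to train models that do not fit on one GPU, but idle GPUs and communication overhead can become performance bottlenecks. In practice, choose a parallel strategy that suits the model structure and the available hardware, and tune parameters such as split_size experimentally.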

Going further, you can explore distributed model parallel training and multi-machine, multi-GPU parallelism to handle even larger models and more complex training tasks.
