TensorFlow Deep Learning in Practice: Building Transformer Models with Hugging Face

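- 0. Introduction
- 1. Installing Hugging Face
- 2. Text Generation
- 3. Automatic Model Selection and Automatic Tokenization
- 4. Named Entity Recognition
- 5. Summarization
- 6. Model Fine-Tuning
- Related Links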

0. Introduction

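Implementing a Transformer from scratch is rarely the best choice. As with other programming work, there is usually no need to reinvent the wheel. Building from scratch makes sense only when you want to understand the internal details of the architecture or modify it to create a new variant. Many excellent libraries provide high-quality Transformer implementations. Hugging Face is one of the most prominent, and it offers several efficient tools for building Transformers: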

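- Hugging Face provides a common API for working with many Transformer architectures.
- Besides base models, Hugging Face provides models with different kinds of "heads" for specific tasks. For the BERT architecture, for example, it provides TFBertModel, TFBertForSequenceClassification for sentiment analysis, TFBertForTokenClassification for named entity recognition, and TFBertForQuestionAnswering for question answering.
- Custom networks are easy to create from the pretrained weights that Hugging Face provides, for example with TFBertForPreTraining.
- Besides the pipeline() method, a model can be defined in the usual way, trained with fit(), and used for inference with predict(), just like an ordinary TensorFlow model.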

1. Installing Hugging Face

Like other third-party libraries, the Hugging Face library can be installed with pip:

$ pip install transformers[tf]

Then verify the installation by downloading a pretrained sentiment-analysis model:

$ python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

If the installation succeeded, the command prints output like the following:

[{'label': 'POSITIVE', 'score': 0.9998704791069031}]
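
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. As a minimal sketch of how such a result might be consumed, the snippet below hard-codes the output shown above instead of loading a model. The helper name `is_confident` and the 0.9 threshold are illustrative assumptions, not part of the Hugging Face API.

```python
# Result copied from the verification command above (no model is loaded here).
results = [{'label': 'POSITIVE', 'score': 0.9998704791069031}]


def is_confident(result, threshold=0.9):
    """Hypothetical helper: True if the predicted label's score reaches the threshold."""
    return result['score'] >= threshold


for result in results:
    print(f"{result['label']}: {result['score']:.4f} (confident: {is_confident(result)})")
```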

Next, we look at how to use Hugging Face to solve specific tasks.

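2. Text Generation

In this section, we use GPT-2 for natural language generation, the process of producing natural-language output.

(1) Create a text-generation pipeline that uses GPT-2: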

from transformers import pipeline
generator = pipeline(task="text-generation")

(2) Once the model has been downloaded, pass text to the generator and inspect the results:

generator("Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone")
generator("The original theory of relativity is based upon the premise that all coordinate systems in relative uniform translatory motion to each other are equally valid and equivalent ")
generator("It takes a great deal of bravery to stand up to our enemies")

[Figure: generated text]
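
Each call to the text-generation pipeline returns a list of dictionaries with a `generated_text` key, and by default that text begins with the prompt itself. The sketch below shows one way to keep only the continuation. The function name `continuation_only` is hypothetical, and `fake_outputs` is a hand-written placeholder standing in for a real `generator(prompt)` result, so no model is needed to run it.

```python
def continuation_only(prompt, outputs):
    """Return each generated text with the leading prompt removed."""
    continuations = []
    for item in outputs:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        continuations.append(text.strip())
    return continuations


prompt = "It takes a great deal of bravery to stand up to our enemies"
# Placeholder standing in for generator(prompt); the continuation is not real model output.
fake_outputs = [{"generated_text": prompt + " <model continuation goes here>"}]
print(continuation_only(prompt, fake_outputs))
```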

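3. Automatic Model Selection and Automatic Tokenization

Hugging Face tries to automate as many steps as possible.

(1) Importing one of the dozens of available pretrained models is straightforward: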

from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

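The model can then be trained on a downstream task and used for prediction and inference.

(2) Use AutoTokenizer to convert words into the tokens that the model consumes: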

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = "The original theory of relativity is based upon the premise that all coordinate systems"
print(tokenizer(sequence))

[Figure: tokenizer output]
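
The tokenizer for bert-base-uncased is based on WordPiece, which splits words that are not in the vocabulary into smaller known pieces. The following is a simplified sketch of the greedy longest-match-first idea, not the library's implementation: `toy_vocab` is a hypothetical vocabulary invented for this example, and the sketch omits lowercasing, punctuation handling, and the special [CLS] and [SEP] tokens that the real tokenizer adds.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real vocabulary has roughly 30,000 entries.
toy_vocab = {"the", "theory", "of", "relat", "##ivity", "coordinate", "system", "##s"}

sentence = "the theory of relativity coordinate systems"
tokens = [piece for word in sentence.split() for piece in wordpiece(word, toy_vocab)]
print(tokens)
```

In the real tokenizer, each piece is then mapped to its integer index in the vocabulary, which is what the `input_ids` field of the tokenizer output contains.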

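4. Named Entity Recognition

Named entity recognition (NER) is a classic natural language processing task. Also known as entity identification, entity chunking, or entity extraction, it is a subtask of information extraction that locates the named entities mentioned in unstructured text and classifies them into predefined categories such as person names, organizations, locations, time expressions, quantities, monetary values, and percentages. Next, we use Hugging Face to perform named entity recognition.

(1) Create an NER pipeline: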

from transformers import pipeline
ner_pipe = pipeline("ner")
sequence = """Mr. and Mrs. Dursley, of number four, Privet Drive, were
proud to say that they were perfectly normal, thank you very much."""
for entity in ner_pipe(sequence):
    print(entity)

(2) The results are shown below, with the entities identified:
[Figure: NER results]

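The NER model distinguishes nine classes:

- O: outside of a named entity
- B-MISC: beginning of a miscellaneous entity right after another miscellaneous entity
- I-MISC: miscellaneous entity
- B-PER: beginning of a person's name right after another person's name
- I-PER: person's name
- B-ORG: beginning of an organization right after another organization
- I-ORG: organization
- B-LOC: beginning of a location right after another location
- I-LOC: location

These classes are defined in the CoNLL-2003 dataset, and they come with the default model that Hugging Face selects automatically for the pipeline.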
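
Because the tags are assigned per token, consecutive tokens usually have to be merged to recover whole entities. The sketch below illustrates how the tagging scheme above can be decoded: an I- tag continues the current entity when the type matches, and a B- tag always starts a new one. The function `group_entities` is a hypothetical helper, and the tags in `tagged` are hand-assigned for illustration rather than real model output.

```python
def group_entities(tagged_tokens):
    """Merge (word, tag) pairs into (entity_text, entity_type) spans.

    An I-X tag continues the current entity only if that entity has type X;
    a B-X tag always starts a new entity.
    """
    entities = []
    current_words, current_type = [], None
    for word, tag in tagged_tokens:
        if tag == "O":
            prefix, entity_type = "O", None
        else:
            prefix, entity_type = tag.split("-", 1)
        starts_new = prefix == "B" or entity_type != current_type
        # Close the open entity when we leave it or when a new one begins.
        if current_words and (entity_type is None or starts_new):
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if entity_type is not None:
            current_words.append(word)
            current_type = entity_type
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Hand-assigned tags for illustration only.
tagged = [
    ("Vernon", "I-PER"), ("Dursley", "I-PER"),
    ("Petunia", "B-PER"), ("Dursley", "I-PER"),
    ("live", "O"), ("in", "O"),
    ("Privet", "I-LOC"), ("Drive", "I-LOC"),
]
print(group_entities(tagged))
```

In practice the pipeline can perform a similar merge itself when it is created with the `aggregation_strategy` argument, for example `pipeline("ner", aggregation_strategy="simple")`.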

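5. Summarization

Summarization means expressing the most important facts or ideas about something or someone in a short, clear form. With the TensorFlow backend, Hugging Face uses a T5 model as the default for this task.

(1) First, create a summarization pipeline with the default T5 small model: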

from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere"""
print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

The output is as follows:

[Figure: summarization output]

(2) To use a different model, just change the model parameter:

summarizer = pipeline("summarization", model='t5-base')

The output is as follows:

[Figure: summarization output with t5-base]

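6. Model Fine-Tuning

A common pattern for using Transformers is to start from a pretrained large language model (LLM) and then fine-tune it for a specific downstream task. Fine-tuning uses a custom dataset, whereas pretraining is carried out on a very large dataset. This strategy saves computational cost, and it lets us use state-of-the-art models without training one from scratch. Next, we show how to fine-tune a model with TensorFlow: the pretrained model is bert-base-cased, and it is fine-tuned on the Yelp Reviews dataset.
This section uses the datasets library to load the data. Provided by Hugging Face, datasets is a tool for loading, processing, and sharing datasets. Install it with pip: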

$ pip install datasets

(1) First, load and tokenize the Yelp dataset:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # Pad every review to the model's maximum length and truncate longer ones
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Use small subsets of 1,000 examples to keep training time short
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
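
The arguments padding="max_length" and truncation=True make every tokenized review the same length, which for bert-base-cased is 512 tokens. The following simplified sketch shows the effect on a single sequence. The function `pad_or_truncate`, the token ids, and the short `max_length=5` are illustrative assumptions; the real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Cut input_ids to max_length, or right-pad with pad_id, and build the attention mask."""
    ids = input_ids[:max_length]
    attention_mask = [1] * len(ids)  # 1 marks real tokens
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }


# Arbitrary token ids chosen for illustration.
short = pad_or_truncate([11, 12, 13], max_length=5)
long = pad_or_truncate([21, 22, 23, 24, 25, 26, 27], max_length=5)
print(short)
print(long)
```

The attention mask lets the model ignore the padded positions, which is why the next step passes it along with input_ids.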

(2) Next, convert the datasets to TensorFlow format:

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")

# Convert the tokenized datasets to TensorFlow datasets
tf_train_dataset = small_train_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
tf_validation_dataset = small_eval_dataset.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

(3) Load bert-base-cased with TFAutoModelForSequenceClassification, setting num_labels=5 to match the five star ratings in the Yelp dataset:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

(4) Finally, fine-tune the model in the standard TensorFlow way, by compiling it and training it with fit():

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
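
The loss is created with from_logits=True because the classification head outputs one raw score (logit) per class rather than probabilities. As a rough sketch of what that means, the pure-Python snippet below converts a set of logits to probabilities with softmax, computes the sparse categorical cross-entropy for one example, and maps the predicted class index to a star rating. The logits are made-up numbers, not output from the fine-tuned model, and the function names are chosen for this illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(logits, label):
    """Loss for one example: negative log-probability of the true class index."""
    return -math.log(softmax(logits)[label])


# Made-up logits for one review over the 5 classes (index 0 = 1 star, ..., index 4 = 5 stars).
logits = [-1.2, 0.3, 0.1, 2.5, 1.0]
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print("predicted stars:", predicted_class + 1)
print("loss if the true label is 3:", sparse_categorical_crossentropy(logits, 3))
print("loss if the true label is 0:", sparse_categorical_crossentropy(logits, 0))
```

The loss is small when the true class has the highest logit and grows as the model assigns that class a lower probability, which is the quantity fit() minimizes during fine-tuning.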