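數據派THU - Implementing DistilBERT-Style Distillation of BERT-like Models in Code - 鑽石舞台

Source: DeepHub IMBA

This article is about 2,700 characters long; estimated reading time is 9 minutes.

This article walks you through the details of DistilBERT and provides a complete code implementation.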

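Machine learning models keep getting larger. Even with an already trained model, inference time and memory costs soar when the hardware does not match what the model expects to run on. Distillation mitigates this problem by shrinking the network to a reasonable size while minimizing the loss in performance.

In a previous article we described how DistilBERT [1] introduced a simple yet effective distillation technique that can easily be applied to any BERT-like model, but we gave no code. In this article we go into the details and provide a complete code implementation.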

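Initializing the student model

Since we want to initialize a new model from an existing one, we need access to the old model's weights. This article uses the RoBERTa [2] large model provided by Hugging Face as the teacher, and to get at its weights we first need to know how to access them.

Structure of a Hugging Face model

The first thing to try is printing the model, which should give us some insight into how it is put together. We could also dig through the Hugging Face documentation [3], but that is more tedious.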

from transformers import AutoModelForMaskedLM

# Load the pretrained teacher and print its module tree
roberta = AutoModelForMaskedLM.from_pretrained("roberta-large")
print(roberta)

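Running this code prints the nested module structure of the model.

In a Hugging Face model, the sub-components of a module can be accessed with the .children() generator. So to explore the whole model, we call .children() on it and then on each child in turn, which makes for a recursive function: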

from typing import Any

from transformers import AutoModelForMaskedLM

roberta = AutoModelForMaskedLM.from_pretrained("roberta-large")


def visualize_children(
    module: Any,
    level: int = 0,
) -> None:
    """
    Prints the children of (module) and their children too, if there are any.
    Uses the current depth (level) to print things in an orderly manner.
    """
    print(f"{' ' * level}{level}- {type(module).__name__}")
    try:
        for child in module.children():
            visualize_children(child, level + 1)
    except AttributeError:
        # Objects without a .children() method are leaves: nothing more to print
        pass


visualize_children(roberta)

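This prints one line per module, indented by depth and labeled with its class name.

RoBERTa turns out to be structured like other BERT-like models: an embedding block, an encoder made of a stack of identical layers, and a head on top.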

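Copying the teacher's weights

To initialize a BERT-like model the DistilBERT [1] way, we copy everything except the deepest level, the RoBERTa layers, of which we drop half. The steps are as follows. First, we create the student model, with the same architecture as the teacher but half as many hidden layers. For that we only need the teacher's configuration, a dictionary-like object that describes the architecture of a Hugging Face model. It is available as the roberta.config attribute.

The attribute we are interested in is num_hidden_layers. Let's write a function that copies this configuration, halves that attribute, and then creates a new model from the new configuration: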

from transformers import RobertaConfig, RobertaPreTrainedModel


def distill_roberta(
    teacher_model: RobertaPreTrainedModel,
) -> RobertaPreTrainedModel:
    """
    Distills a RoBERTa (teacher_model) the way DistilBERT does for a BERT model.
    The student model has the same configuration, except for the number of hidden layers, which is // by 2.
    The student layers are initialized by copying one out of two layers of the teacher, starting with layer 0.
    The head of the teacher is also copied.
    """
    # Get the teacher configuration as a dictionary
    configuration = teacher_model.config.to_dict()
    # Halve the number of hidden layers
    configuration['num_hidden_layers'] //= 2
    # Convert the dictionary to the student configuration
    configuration = RobertaConfig.from_dict(configuration)
    # Create an uninitialized student model of the same class as the teacher
    student_model = type(teacher_model)(configuration)
    # Initialize the student's weights (distill_roberta_weights is defined further down)
    distill_roberta_weights(teacher=teacher_model, student=student_model)
    # Return the student model
    return student_model
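To see the configuration step in isolation, the sketch below applies the same halving to a plain dictionary instead of a real RobertaConfig. The helper name halve_hidden_layers and the three-key dictionary are illustrative assumptions, not part of the original code; the values are those usually listed for roberta-large (24 layers, hidden size 1024, 16 attention heads).

```python
import copy


def halve_hidden_layers(config: dict) -> dict:
    """Return a copy of (config) with half as many hidden layers (hypothetical helper)."""
    student_config = copy.deepcopy(config)    # leave the teacher's configuration untouched
    student_config["num_hidden_layers"] //= 2
    return student_config


# Assumed subset of a roberta-large style configuration
teacher_config = {
    "num_hidden_layers": 24,
    "hidden_size": 1024,
    "num_attention_heads": 16,
}
student_config = halve_hidden_layers(teacher_config)
print(teacher_config["num_hidden_layers"], student_config["num_hidden_layers"])
```

Only the depth changes: the hidden size and the number of attention heads stay the same, which is what later allows a teacher layer's weights to be loaded into a student layer unchanged.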

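The distill_roberta_weights function will put half of the teacher's weights into the student's layers, so we still need to write it. Since recursion worked well for exploring the teacher model, we can use the same idea to explore and copy its parts: we iterate over the teacher and the student in parallel and copy from one to the other. The only subtlety is the hidden layers, of which only half are copied.

The function is as follows: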

from torch.nn import Module
from transformers.models.roberta.modeling_roberta import RobertaEncoder, RobertaModel


def distill_roberta_weights(
    teacher: Module,
    student: Module,
) -> None:
    """
    Recursively copies the weights of the (teacher) to the (student).
    This function is meant to be first called on a RobertaFor... model, but is then called on every child of that model recursively.
    The only part that's not fully copied is the encoder, of which only half is copied.
    """
    # If the part is an entire RoBERTa model or a RobertaFor..., unpack and iterate
    if isinstance(teacher, RobertaModel) or type(teacher).__name__.startswith('RobertaFor'):
        for teacher_part, student_part in zip(teacher.children(), student.children()):
            distill_roberta_weights(teacher_part, student_part)
    # Else if the part is an encoder, copy one out of every two layers
    elif isinstance(teacher, RobertaEncoder):
        # The encoder's first child is the list of its layers
        teacher_encoding_layers = list(next(teacher.children()))
        student_encoding_layers = list(next(student.children()))
        for i in range(len(student_encoding_layers)):
            student_encoding_layers[i].load_state_dict(teacher_encoding_layers[2 * i].state_dict())
    # Else the part is a head or something else, copy the state_dict
    else:
        student.load_state_dict(teacher.state_dict())

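Through recursion and type checking, this function ensures that the student model is identical to the teacher except for the RoBERTa layers. To change which layers are copied at initialization, you only need to change the for loop in the encoder branch.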
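As a rough illustration of what changing that loop could look like, the sketch below computes which teacher layer feeds each student layer under a few selection strategies. The function teacher_layer_indices and the strategy names are hypothetical and do not appear in the original code; only "every_other" corresponds to the 2 * i indexing used above.

```python
from typing import List


def teacher_layer_indices(num_teacher_layers: int, strategy: str = "every_other") -> List[int]:
    """Index of the teacher layer copied into each student layer (hypothetical helper)."""
    num_student_layers = num_teacher_layers // 2
    if strategy == "every_other":
        # Teacher layers 0, 2, 4, ... as in the encoder branch above
        return [2 * i for i in range(num_student_layers)]
    if strategy == "first_half":
        # Keep only the layers closest to the embeddings
        return list(range(num_student_layers))
    if strategy == "last_half":
        # Keep only the layers closest to the head
        return list(range(num_teacher_layers - num_student_layers, num_teacher_layers))
    raise ValueError(f"Unknown strategy: {strategy}")


# roberta-large has 24 encoder layers, so the student gets 12
indices = teacher_layer_indices(24)
print(indices)
```

With such a list in hand, the encoder branch would index the teacher's layers with indices[i] instead of 2 * i. Which strategy works best is an empirical question that this article does not settle.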

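Now that we have a student model, we need to train it. This part is relatively straightforward; the main question is which loss function to use.

A custom loss function

As a reminder of the DistilBERT training procedure, consider the following diagram:

Turn your attention to the big red box labeled "loss" above. Before detailing what is inside it, we need to know how to collect what we feed into it. The diagram shows that three things are needed: the labels, and the student's and the teacher's embeddings. The labels we already have, since this is supervised learning. Now let's see how to obtain the other two.

Teacher and student inputs

Here we need a function that, given the inputs of a BERT-like model (the two tensors input_ids and attention_mask) and the model itself, returns that model's logits. Since we are using Hugging Face, this is very simple; all we need is to be able to read the code below: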

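from torch import Tensor
from transformers import RobertaPreTrainedModel


def get_logits(
    model: RobertaPreTrainedModel,
    input_ids: Tensor,
    attention_mask: Tensor,
) -> Tensor:
    """
    Given a RoBERTa (model) for classification and the couple of (input_ids) and (attention_mask),
    returns the logits corresponding to the prediction.
    """
    # Run the encoder and keep its first output, the sequence of hidden states
    hidden_states = model.roberta(input_ids, attention_mask=attention_mask)[0]
    # Feed the hidden states to the classification head to get the logits
    return model.classifier(hidden_states)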

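Both the student and the teacher can use this function, but the former is run with gradients and the latter without.

Implementing the loss function

For a detailed description of the loss function, see our previous article; here we use the following figure to explain it:

What we call the "converging cosine-loss" is the regular cosine loss used to align two input vectors. Here is the code: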

import torch
from torch import Tensor
from torch.nn import CrossEntropyLoss, CosineEmbeddingLoss


def distillation_loss(
    teacher_logits: Tensor,
    student_logits: Tensor,
    labels: Tensor,
    temperature: float = 1.0,
) -> Tensor:
    """
    The distillation loss for distilling a BERT-like model.
    The loss takes the (teacher_logits), (student_logits) and (labels) for various losses.
    The (temperature) can be given, otherwise it's set to 1 by default.
    """
    # Temperature scaling and softmax
    student_scaled = student_logits / temperature
    student_probs = student_scaled.softmax(dim=1)
    teacher_probs = (teacher_logits / temperature).softmax(dim=1)
    # Classification loss (problem-specific loss).
    # CrossEntropyLoss applies log-softmax itself, so it receives the scaled logits, not the probabilities
    loss = CrossEntropyLoss()(student_scaled, labels)
    # Cross-entropy teacher-student loss (probability targets require PyTorch >= 1.10)
    loss = loss + CrossEntropyLoss()(student_scaled, teacher_probs)
    # Cosine loss; a target of 1 pulls the two distributions into alignment
    target = torch.ones(teacher_probs.size(0), device=teacher_probs.device)
    loss = loss + CosineEmbeddingLoss()(teacher_probs, student_probs, target)
    # Average the three losses and return the result
    return loss / 3
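To make two ingredients of this loss more concrete, here is a small pure-Python sketch, independent of PyTorch, of a softmax with temperature and of the cosine loss for a pair of vectors that should be aligned (one minus their cosine similarity, which is what CosineEmbeddingLoss computes for a target of 1). The function names and the example logits are assumptions made for illustration only.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax of (logits) divided by (temperature); higher temperatures flatten the result."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_loss(a: List[float], b: List[float]) -> float:
    """1 - cosine similarity: 0 when the vectors point the same way, 1 when orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


logits = [2.0, 1.0, 0.0]                       # made-up logits for three classes
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=4.0)
print(sharp, soft, cosine_loss(sharp, soft))
```

Raising the temperature spreads probability mass over the less likely classes, which is the information the student is meant to pick up from the teacher.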

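That covers the implementation of all the key ideas of DistilBERT, but some things are still missing, such as GPU support and the full training routine. The complete code is linked at the end of the article; for practical use, we recommend the Distillator class it provides.

Results

How do models distilled this way end up performing? For DistilBERT, see the original paper [1]. For RoBERTa, a DistilBERT-like distilled version already exists on Hugging Face. On the GLUE benchmark [4], we can compare the two models: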

As for time and memory costs, this model is about two-thirds the size of roberta-base and twice as fast.
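Halving the number of layers does not halve the model, because the embeddings are kept in full. The back-of-the-envelope count below illustrates this. It assumes the usual roberta-base dimensions (vocabulary of 50265, 514 positions, 1 token type, hidden size 768, feed-forward size 3072, 12 layers), assumes the distilled model keeps 6 of those layers, and ignores the pooler and any task head; the function names are hypothetical.

```python
def encoder_layer_params(hidden: int, intermediate: int) -> int:
    """Weights and biases of one Transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)            # query, key, value, output projections
    feed_forward = (hidden * intermediate + intermediate  # up-projection
                    + intermediate * hidden + hidden)     # down-projection
    layer_norms = 2 * 2 * hidden                          # two LayerNorms, scale and bias each
    return attention + feed_forward + layer_norms


def embedding_params(vocab: int, positions: int, token_types: int, hidden: int) -> int:
    """Word, position and token-type embeddings plus their LayerNorm."""
    return (vocab + positions + token_types) * hidden + 2 * hidden


def model_params(num_layers: int) -> int:
    """Approximate parameter count for the assumed roberta-base dimensions."""
    return embedding_params(50265, 514, 1, 768) + num_layers * encoder_layer_params(768, 3072)


teacher_params = model_params(12)
student_params = model_params(6)
ratio = student_params / teacher_params
print(f"teacher: {teacher_params:,}  student: {student_params:,}  ratio: {ratio:.2f}")
```

Under these assumptions the ratio comes out close to two-thirds, consistent with the size figure quoted above.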

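Conclusion

With the code above we can distill any BERT-like model. Beyond that, there are many other, better methods, such as TinyBERT [5] or MobileBERT [6]. If you think one of them suits your needs better, you should read those papers. You might even try an entirely new distillation method, since this is a rapidly growing field.

The code for this article is available here: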

https://gist.github.com/remi-or/4814577c59f4f38fcc89729ce4ba21e6

References

[1] Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019), Hugging Face

[2] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019), arXiv

[3] Hugging Face team crediting Julien Chaumond, Hugging Face’s RoBERTa documentation, Hugging Face

[4] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding (2019), arXiv

[5] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, TinyBERT: Distilling BERT for Natural Language Understanding (2019), arXiv

[6] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices (2020), arXiv

Editor: 黃繼彥

Proofreader: 林亦霖
