Innovative Pretraining Techniques for Self-Attention Networks

Chapter 1: Overview of Cloze-Driven Pretraining

This paper presents a novel approach to pretraining bidirectional transformers, aimed at boosting performance across various language comprehension tasks. The primary achievements include substantial improvements on the GLUE benchmark and new state-of-the-art results in Named Entity Recognition (NER) and constituency parsing.

Prerequisites

Before delving into the details, it's essential to understand a couple of concepts:

  • Cloze-Reading: This task involves filling in missing words in a sentence. For instance, in "This is a _____ paper," the goal is to predict "research" from the surrounding context.
  • Transformers: If you need a refresher, an introductory blog post on the transformer architecture is a good starting point.

Introduction

The authors note that earlier bidirectional training relied on a separate loss function for each direction. This paper proposes a unified pretraining objective that trains both directions jointly: a bidirectional transformer predicts every token in cloze style, using both its left-to-right and right-to-left context.

The model calculates both forward and backward states utilizing a masked self-attention architecture.

The Two Tower Model

Architecture Diagram of the Two Tower Model

The cloze model defines a probability distribution \( p(t_i \mid t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n) \) for a sentence of \( n \) tokens. It comprises two self-attentional towers: the forward tower operates left-to-right and the backward tower right-to-left. To predict a token, the model combines the representations from both towers, ensuring that neither contains information about the target token.
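Since every token is predicted from its full bidirectional context, the per-sentence training objective can be written as a sum of per-token cloze losses (standard maximum-likelihood notation, not copied from the paper):

```latex
% Cloze objective for one sentence of n tokens:
% each token t_i is predicted from all other tokens, and the losses are summed.
\mathcal{L} = -\sum_{i=1}^{n} \log p\left(t_i \mid t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n\right)
```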

The forward tower generates the representation \( F_i^l \) for token \( i \) at layer \( l \) via self-attention over the previous layer's forward representations \( F^{l-1} \), while the backward tower computes \( B_i^l \) from \( B^{l-1} \) in the reverse direction.
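As a rough illustration (PyTorch, not the authors' code), the two towers can be seen as ordinary self-attention stacks that differ only in their attention masks: the forward tower uses a causal mask and the backward tower its mirror image, so information never reaches a position from the wrong side.

```python
import torch

def forward_mask(n: int) -> torch.Tensor:
    """Causal mask for the forward tower: position i attends to positions <= i."""
    # -inf above the diagonal blocks attention to future tokens.
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

def backward_mask(n: int) -> torch.Tensor:
    """Anti-causal mask for the backward tower: position i attends to positions >= i."""
    # -inf below the diagonal blocks attention to past tokens.
    return torch.tril(torch.full((n, n), float("-inf")), diagonal=-1)
```

These masks are additive: entries of -inf suppress the corresponding attention weights, while entries of 0 leave them unchanged.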

Block Structure

Each block consists of two sub-blocks:

  • The first is a multi-head self-attention block with \( H = 16 \) attention heads.
  • The second is a Feed Forward Network (FFN).

Position information is encoded using fixed sinusoidal position embeddings, as in the original transformer. Tokens are encoded with a character-based CNN: each word is decomposed into characters, the character embeddings are passed through Conv1d layers of varying filter widths, and the outputs are max-pooled and fed through a highway network to produce the final word embedding.
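The post does not give the exact encoder configuration, but the recipe it describes (character embeddings, parallel Conv1d filters, max-pooling, a highway layer) can be sketched roughly as follows; the filter widths and dimensions here are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Builds a word embedding from its characters (illustrative sizes only)."""

    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64), (4, 128)), word_dim=1024):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # One Conv1d per filter width, applied along the character axis.
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_out, kernel_size=width) for width, n_out in filters
        )
        conv_out = sum(n_out for _, n_out in filters)
        # A single highway layer mixing the pooled features.
        self.gate = nn.Linear(conv_out, conv_out)
        self.transform = nn.Linear(conv_out, conv_out)
        self.proj = nn.Linear(conv_out, word_dim)

    def forward(self, char_ids):                       # (batch, chars_per_word)
        x = self.char_emb(char_ids).transpose(1, 2)    # (batch, char_dim, chars_per_word)
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]
        h = torch.cat(pooled, dim=-1)                  # concatenated max-pooled features
        t = torch.sigmoid(self.gate(h))                # highway gate
        h = t * torch.relu(self.transform(h)) + (1 - t) * h
        return self.proj(h)                            # final word embedding
```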

Notably, layer normalization is applied before each of the two sub-blocks, which improves training. The input word embeddings are shared between the two towers.
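Combining the two sub-blocks with pre-normalization, one layer of either tower might look roughly like this (a simplified PyTorch sketch with placeholder sizes, not the released implementation):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One tower layer: pre-LN multi-head self-attention followed by a pre-LN FFN."""

    def __init__(self, d_model=1024, n_heads=16, d_ffn=4096, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))

    def forward(self, x, attn_mask):
        # Layer norm is applied *before* each sub-block, as noted above.
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h                        # residual connection around attention
        x = x + self.ffn(self.ln2(x))    # residual connection around the FFN
        return x
```

Stacking such blocks with `forward_mask` would give the forward tower and with `backward_mask` the backward tower; the two stacks have separate weights but share the input word embeddings.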

Extended Model Architecture Diagram

Combination of Outputs

The outputs of both towers are fed to a self-attention module, followed by an FFN and a softmax over the vocabulary of size \( V \). When predicting token \( i \), the attention module attends over the forward states \( F_1^L, \ldots, F_{i-1}^L \) and the backward states \( B_{i+1}^L, \ldots, B_n^L \), where \( n \) is the sequence length and \( L \) the number of layers. The attention query for token \( i \) combines \( F_{i-1}^L \) and \( B_{i+1}^L \): the base model sums these representations, while larger models concatenate them.
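One way to picture this combination step, per target token, is sketched below (heavily simplified; the class name and sizes are mine, and it assumes a boundary token on each side so that \( 0 < i < n-1 \)):

```python
import torch
import torch.nn as nn

class ClozeCombiner(nn.Module):
    """Simplified per-token pooling of the two towers followed by a vocabulary softmax."""

    def __init__(self, d_model=1024, n_heads=16, vocab_size=50_000):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, fwd, bwd, i):
        """fwd, bwd: (1, n, d) final-layer tower states; i: position of the target token."""
        # Keys/values: forward states strictly left of i and backward states strictly
        # right of i, so neither side has seen the target token itself.
        context = torch.cat([fwd[:, :i], bwd[:, i + 1:]], dim=1)
        # Base-model query: sum of the neighbouring forward and backward states
        # (larger models concatenate instead).
        query = (fwd[:, i - 1] + bwd[:, i + 1]).unsqueeze(1)
        h, _ = self.attn(query, context, context, need_weights=False)
        return torch.log_softmax(self.out(self.ffn(h.squeeze(1))), dim=-1)
```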

Fine-Tuning Adjustments

During fine-tuning, several modifications are applied:

  • All tokens from the input sentences are processed through both towers.
  • The vocabulary softmax is removed, and the model's output is read at the boundary token (see the sketch after this list).
  • When processing two sentences, a special token separates them.
  • It is advantageous to remove the masking of the current token in the final layer that pools the outputs from both towers.
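A hypothetical reading of these adjustments for a sentence-level task: run the full input through both towers, take the pooled representation at the boundary token, and replace the vocabulary softmax with a small task head. The head below is my own illustration, not the paper's.

```python
import torch
import torch.nn as nn

class BoundaryTokenHead(nn.Module):
    """Hypothetical fine-tuning head reading the boundary-token representation."""

    def __init__(self, d_model=1024, n_classes=2, dropout=0.1):
        super().__init__()
        self.classifier = nn.Sequential(nn.Dropout(dropout), nn.Linear(d_model, n_classes))

    def forward(self, pooled_states):
        """pooled_states: (batch, seq, d) two-tower outputs; the boundary token is assumed at position 0."""
        return self.classifier(pooled_states[:, 0])   # task logits
```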

Datasets Utilized

The following datasets were employed:

  • Common Crawl: 9 billion tokens
  • News Crawls: 4.5 billion words
  • Books Corpus + Wikipedia: 800 million words from the Books Corpus plus 2.5 billion words from English Wikipedia (the same combination used by BERT).

Overview of Datasets Used

Results and Evaluation

The results are compelling:

Evaluation Results on GLUE

NER was evaluated on the CoNLL-2003 dataset, and constituency parsing on the Penn Treebank.

NER Evaluation Results on CoNLL Dataset

Conclusion and Further Reading

For additional insights, refer to the original paper and related articles.

Stay connected by following me on LinkedIn (Aziz Belaweid) or GitHub!
