I've read many papers over the years, and I often like to write up a short summary of their contributions. I'm sharing some of these notes on the top papers in the field of NLP, particularly on large language models.
1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Purpose
The authors aim to present a new language representation model that can be effectively fine-tuned without substantial task-specific architecture modifications. They introduce BERT, designed to pretrain bidirectional representations from unlabeled text.
Methods
The authors demonstrate the value of bidirectional pre-training for language representations. In their work, they introduce BERT: Bidirectional Encoder Representations from Transformers. The model architecture is a multi-layer bidirectional Transformer encoder, with a unified input representation that packs a single sentence or a pair of sentences into one token sequence so the same architecture can handle a variety of downstream tasks. Pretraining uses two unsupervised tasks: masked language modeling, where randomly masked tokens are predicted from both left and right context, and next sentence prediction.
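To make the masked-language-modeling setup concrete, here is a minimal sketch (not the authors' code) of how pretraining inputs could be corrupted, using the paper's reported 15% masking rate and 80/10/10 mask/random/keep split; the tokenizer and vocabulary here are placeholders.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Corrupt a token sequence for masked language modeling (BERT-style sketch)."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                       # the model must predict the original token
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            labels.append(None)                      # position not predicted
    return inputs, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab=["dog", "tree", "ran"]))
```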
Findings
BERT achieves state-of-the-art results on 11 NLP tasks, “pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).” The authors further establish the value of the method through ablations showing that removing bidirectionality (left-to-right-only pretraining) hurts performance on every task, with the largest drops on MRPC and SQuAD.
Originality and Value
The research introduces a pre-training approach for language models that significantly improved results on a wide range of NLP tasks. It is able to effectively capture contextual relationships between words and demonstrates the value of bidirectional representations to language modeling.
2. Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Dario Amodei
Purpose
The authors aim to examine the generalizability of pretrained language models on new tasks with limited task-specific data.
Methods
The authors’ experiments are based on GPT-3 models trained at a range of scales, up to 175 billion parameters. They apply these models to several datasets, analyzing few-shot, one-shot, and zero-shot performance (and comparing against fine-tuned systems) to examine the effectiveness of language models on new tasks with minimal new data. The authors also study how performance depends on different components of the setup, including model scale and the number of in-context examples.
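As an illustration of the few-shot setting (in-context learning with no gradient updates), here is a hedged sketch of how a prompt with k demonstrations might be assembled; the exact formatting in the paper varies by task.

```python
def few_shot_prompt(instruction, demonstrations, query):
    """Build a few-shot prompt: task description, k solved examples, then the new input."""
    lines = [instruction]
    for x, y in demonstrations:                       # k in-context examples
        lines.append(f"Input: {x}\nOutput: {y}")
    lines.append(f"Input: {query}\nOutput:")          # the model completes the final output
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "book",
)
print(prompt)
```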
Findings
The authors’ experiments showed strong performance on many NLP tasks in the few-shot setting, in some cases nearly matching or performing competitively with state-of-the-art fine-tuned systems. Their analyses demonstrated predictable, smooth scaling trends in performance without fine-tuning. They identified the importance of pretraining on a large corpus of text data for generalizability.
Originality and Value
The work shows the effectiveness of pretrained language models to generalize on a range of NLP tasks with minimal fine-tuning data.
3. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Purpose
Transfer learning has shown great effectiveness in natural language processing. The authors examine the limits of transfer learning through a systematic study by presenting a unified framework enabling text-based tasks to be converted into a text-to-text format.
Methods
The authors present a text-to-text framework in order to train a single model across NLP tasks without altering the loss function or decoding procedure. In their method, all tasks, including translation, question answering, text classification, and summarization, share the same input format, with a prefix added that tells the model which task to perform. All outputs likewise take the same format of a text target. They pretrain a Transformer model with this approach and fine-tune it on various tasks in order to analyze the capabilities of transfer learning and the performance of their method at scale. They call their model the Text-to-Text Transfer Transformer (T5).
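A hedged sketch of the text-to-text format: every task becomes "prefixed input text in, target text out." The prefixes below follow the style shown in the paper, though the exact strings for every task are not reproduced here.

```python
# Each task is expressed as (input text with a task prefix, target text).
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday to survey the damage ...",
     "six people hospitalized after a storm in attala county."),
]

for source, target in examples:
    # A single encoder-decoder model is trained on all such pairs with the same
    # maximum-likelihood objective, regardless of the underlying task.
    print(f"{source!r} -> {target!r}")
```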
Findings
The T5 approach obtained comparable performance to task-specific architectures and achieved state-of-the-art results on NLP benchmarks when combined with scale. They demonstrate the effectiveness of transfer learning with minimal task-specific fine-tuning.
Originality and Value
The authors present the effectiveness of transfer learning through the performance of their unified text-to-text transformer model as well as its limitations through its dependence on fine-tuning data size. The work is valuable with the growing application of transfer learning in the field of NLP.
4. XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Purpose
BERT’s denoising-autoencoding pretraining captures bidirectional context, an advantage over autoregressive language modeling, but it neglects dependencies between masked positions and suffers from a pretrain-finetune discrepancy introduced by the [MASK] corruption. The authors aim to combine the strengths of both approaches while avoiding their limitations in a new generalized autoregressive pretraining method.
Methods
The authors propose XLNet, a novel method that addresses the limits of autoencoding and autoregressive language modeling. Instead of a fixed forward or backward factorization order, they maximize “the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order”. This enables the capture of bidirectional context, while the autoregressive formulation avoids the pretrain-finetune discrepancy of corruption-based approaches. Beyond the pretraining objective, the authors adopt a new architectural design: they integrate the segment recurrence mechanism and relative positional encoding scheme of Transformer-XL into pretraining to improve the handling of longer text sequences.
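A minimal sketch of the permutation-language-modeling idea: sample a factorization order and let each position be predicted from the tokens that precede it in that order rather than in the original sequence. This illustrates the objective only, not the paper's two-stream attention implementation.

```python
import random

def permutation_contexts(tokens):
    """For one sampled factorization order, list the context each target position sees."""
    order = list(range(len(tokens)))
    random.shuffle(order)                      # a random factorization order z
    seen, contexts = [], {}
    for pos in order:
        # position `pos` is predicted from tokens already "seen" in this order,
        # which may lie to its left or right in the original sequence
        contexts[pos] = [tokens[i] for i in seen]
        seen.append(pos)
    return order, contexts

order, contexts = permutation_contexts("New York is a city".split())
print(order)
print(contexts)
```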
Findings
In evaluations on various benchmarks, including GLUE, SQuAD, RACE, Yelp, IMDB, and ClueWeb09-B, XLNet consistently outperformed BERT. Further, at scale, XLNet’s gains over RoBERTa were greater on tasks that benefit from longer contexts.
Originality and Value
The work addresses the limitations of autoregressive and autoencoding based pretraining methods with a novel and effective generalized pretraining.
5. RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
Purpose
The authors present a replication study of BERT that evaluates and modifies hyperparameters and training data size in order to improve its performance and reveal the effect of careful pretraining strategies and attention to design details.
Methods
The authors reimplement BERT, following the original architecture but with adjustments to the training process in order to optimize its performance, and evaluate their results on various benchmarks against those reported for the original BERT model. With each experiment, they identify the best-performing choice and keep it fixed for the remaining experiments.
Findings
The authors identified adjustments to the original BERT configuration that improved performance without changes to the model architecture: dynamic masking, full-sentence (document-level) inputs without the NSP loss, larger batches trained for longer over more data, and a byte-level BPE encoding scheme with a larger vocabulary.
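Of the adjustments above, dynamic masking is the simplest to illustrate: instead of fixing one mask per sequence during preprocessing (static masking), a fresh mask is sampled every time the sequence is fed to the model. A minimal sketch, not the authors' implementation:

```python
import random

def sample_mask(tokens, rate=0.15):
    """Return a new random set of positions to mask (a fresh mask each call)."""
    return {i for i in range(len(tokens)) if random.random() < rate}

tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking (original BERT preprocessing): one mask fixed for all epochs.
static_mask = sample_mask(tokens)

for epoch in range(3):
    dynamic_mask = sample_mask(tokens)   # dynamic masking: re-sampled every epoch
    print("epoch", epoch, "static:", sorted(static_mask), "dynamic:", sorted(dynamic_mask))
```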
Originality and Value
The authors demonstrate the value of careful design decisions in pretraining BERT and the substantial improvements they can enable. The work is an important lesson for the creation of large language models and for the practice of refining their training details.
6. Language Models are Unsupervised Multitask Learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Purpose
Large language models trained on diverse datasets have been able to perform well across domains and tasks. The authors thus aim to demonstrate that language models learn new tasks implicitly without explicit supervision and determine the zero-shot capabilities of language models on downstream tasks.
Methods
The authors first construct a large and diverse corpus, WebText, in order to train the model on varied domains and contexts for better applicability to more tasks. They emphasize document quality in their web scrape by limiting it to pages curated by humans (outbound links from Reddit receiving at least 3 karma). The authors use a Transformer-based language model architecture following that of GPT, with modifications, to develop GPT-2: they move layer normalization to the input of each sub-block and add an additional layer normalization after the final self-attention block. They evaluate the model’s zero-shot performance on diverse benchmark tasks to establish its multi-task capabilities.
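A hedged PyTorch sketch of the "layer norm at the input of each sub-block" arrangement (a pre-norm Transformer block); the dimensions, initialization, and omitted causal attention mask are simplifications, not taken from the paper.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with layer normalization at the input of each sub-block."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.ln1(x)                            # normalize before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))              # normalize before the feed-forward sub-block
        return x

block = PreNormBlock()
print(block(torch.randn(2, 10, 64)).shape)         # torch.Size([2, 10, 64])
```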
Findings
GPT-2 achieves state-of-the-art results on 7 of the 8 tested language modeling datasets in the zero-shot setting. On certain tasks, like reading comprehension, GPT-2 even obtains zero-shot results competitive with supervised baselines.
Originality and Value
The authors demonstrate the ability of language models to perform effectively in the zero-shot setting without any parameter or architecture modification by training on large and diverse datasets. They present WebText, a dataset containing diverse and high-quality text for training effective generalizable models, and GPT-2, a large language model with strong zero-shot performance.
7. Improving Language Understanding by Generative Pre-Training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
Purpose
The authors aim to improve performance on natural language understanding tasks via generative pretraining that makes use of large unlabeled text corpora.
Methods
The authors’ proposed method consists of two stages. They first pretrain the language model on a diverse corpus through unsupervised generative pretraining, using a multi-layer Transformer decoder and a standard language modeling objective. They then apply discriminative fine-tuning for the specific task. During fine-tuning, the authors use task-specific input transformations for improved transfer, modifying the input format rather than the model for each task. For text classification, they can fine-tune directly without changes. For entailment, they concatenate the premise and hypothesis sequences. For similarity tasks, they capture the lack of ordering between the two sentences by ensuring that the input contains both sentence orderings. For question-answering and commonsense-reasoning tasks, which consist of a context document, a question, and a set of possible answers, they concatenate the document context and the question with each possible answer. The authors then evaluate the effectiveness of their methods on a variety of NLU tasks, including those in the GLUE benchmark.
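A hedged sketch of these traversal-style input transformations: structured inputs are flattened into a single token sequence with special start, delimiter, and extract tokens. The token names below are illustrative, not the paper's exact symbols.

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def entailment_input(premise, hypothesis):
    # premise and hypothesis concatenated with a delimiter token in between
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(text_a, text_b):
    # both orderings are produced, since sentence order carries no meaning for similarity
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]

def qa_inputs(document, question, answers):
    # the document context and question are concatenated with each candidate answer
    return [f"{START} {document} {question} {DELIM} {a} {EXTRACT}" for a in answers]

print(qa_inputs("The sky is blue.", "What color is the sky?", ["blue", "green"]))
```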
Findings
The authors’ proposed framework performs well across the NLU tasks, improving the state of the art on 9 of the 12 datasets. The authors further identify that transferring the pretrained embeddings improves performance and that each additional transformer layer provides further improvements, demonstrating the value of every layer in the pretrained model.
Originality and Value
The authors demonstrate the success of generative pretraining followed by discriminative fine-tuning according to each specific task in order to attain state-of-the-art performance on NLU tasks. Their research enables further study into understanding the effectiveness of unsupervised learning.
8. Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Girish Sastry, Amanda Askell, Pamela Mishkin, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Jack Clark, Gretchen Krueger, Ilya Sutskever
Purpose
Predetermined object classes limit the use and generalizability of computer vision systems. The authors thus draw from natural language processing to expand the potential and performance of vision models with pretraining based on text captions to learn visual concepts.
Methods
The authors’ proposed method, Contrastive Language-Image Pre-training (CLIP), pretrains image and text encoders with natural language supervision. They identify scaling limitations in previous approaches that jointly train an image CNN and a text transformer to predict the exact caption of an image; instead, they pretrain by learning a joint multimodal embedding space with an image encoder (ResNet or Vision Transformer variants) and a text Transformer, trained contrastively to predict which caption is paired with which image in a batch. The pretraining is done on a large dataset of 400 million image-text pairs collected from the internet.
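A hedged PyTorch sketch of the symmetric contrastive objective over a batch of image-text pairs (close in spirit to the pseudocode in the paper, with the encoders omitted): embeddings are normalized, a temperature-scaled similarity matrix is built, and cross-entropy is applied in both directions.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(len(logits))               # the i-th image matches the i-th text
    loss_i = F.cross_entropy(logits, targets)         # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)     # text -> image direction
    return (loss_i + loss_t) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```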
Findings
CLIP is evaluated on several benchmark datasets. Its zero-shot performance on ImageNet matches the accuracy of the original ResNet-50 without using any of that dataset’s labeled training examples. The results show that the method learns transferable visual representations that remain effective across a wide range of datasets and distribution shifts.
Originality and Value
The research presents a novel, scalable method of using natural language supervision to learn transferable visual concepts. CLIP is valuable in its generalizability.
9. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy & Samuel R. Bowman
Purpose
In order to develop models that have a general, flexible, and robust understanding of language like humans, it is important to train with a variety of linguistic tasks and domains. The research paper thus presents the General Language Understanding Evaluation (GLUE) benchmark, which especially targets sample efficiency and knowledge transferability. The benchmark includes nine NLU tasks, an online evaluation platform, and a diagnostic evaluation dataset.
Methods
GLUE’s tasks are diverse in format, domain, and data quantity, and favor models that generalize across NLU tasks. They cover nine English sentence-understanding tasks. The Corpus of Linguistic Acceptability (CoLA) provides a corpus of sentences annotated for grammaticality, and the Stanford Sentiment Treebank (SST-2) provides a corpus of movie reviews annotated for sentiment, both binary classification tasks. Multiple datasets provide similarity and paraphrase tasks in different domains: the Microsoft Research Paraphrase Corpus (MRPC), Quora Question Pairs (QQP), and the Semantic Textual Similarity Benchmark (STS-B). For inference-based tasks, the Multi-Genre Natural Language Inference Corpus (MNLI) contains sentence pairs annotated for entailment, as do the Recognizing Textual Entailment (RTE) datasets; QNLI is derived from the question-paragraph pairs of the Stanford Question Answering Dataset; and the Winograd Schema Challenge (WNLI) is a multiple-choice task of identifying the referent of a pronoun. The WNLI data is converted into sentence-pair classification by creating sentence pairs in which the pronoun is replaced with each candidate referent; the resulting task is to predict whether the sentence with the substituted pronoun is entailed by the original sentence. The QNLI data is likewise converted into sentence-pair classification by forming pairs between each question and the sentences of its corresponding paragraph; the resulting task is to determine whether the context sentence answers the question. These tasks make up the GLUE benchmark. Results can be submitted to the website gluebenchmark.com for scoring, following the models of SemEval and Kaggle.
Findings
The authors compare the baselines based on the performance of each baseline’s MNLI classifier on the diagnostic set. They find the overall performance of the models to be poor. The models trained on the GLUE tasks achieve better results and greatly surpass most of the compared pretrained models.
Originality and Value
The paper introduces a benchmark to develop more general and robust natural language models that can more effectively share information across tasks and provides a dataset to analyze models’ linguistic capabilities.
10. Universal Language Model Fine-tuning for Text Classification
Jeremy Howard, Sebastian Ruder
Purpose
Approaches to training NLP tasks are largely inefficient, requiring task-specific modifications, training from scratch, and lots of data. Inspired by inductive transfer learning from computer vision, the authors present an effective transfer learning method for NLP that can be applied to any NLP task.
Methods
The authors’ transfer learning method, Universal Language Model Fine-tuning (ULMFiT), first pretrains a language model on a large general-domain corpus, making it broadly applicable to fine-tuning across different NLP tasks. ULMFiT then fine-tunes the pretrained language model on the target task’s text using two techniques: discriminative fine-tuning, which tunes each layer with a different learning rate, and slanted triangular learning rates, which first linearly increase and then linearly decay the learning rate over training. Finally, the authors fine-tune the classifier on the target task with gradual unfreezing, where layers are unfrozen one by one each epoch, starting from the last layer. The authors experiment with ULMFiT across different benchmarks to demonstrate its applicability to different text classification tasks (see the sketch of the learning-rate techniques below).
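A hedged sketch of the two learning-rate techniques: the slanted triangular schedule follows the formula reported in the paper with its suggested defaults, and discriminative fine-tuning divides the learning rate by a factor (2.6 in the paper) for each earlier layer.

```python
def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: a short linear warm-up followed by a long linear decay."""
    cut = int(total_steps * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(base_lr, n_layers, decay=2.6):
    """Per-layer learning rates: the last layer gets base_lr, each earlier layer base_lr / decay^k."""
    return [base_lr / (decay ** (n_layers - 1 - i)) for i in range(n_layers)]

print([round(slanted_triangular_lr(t, 100), 5) for t in (0, 5, 10, 50, 99)])
print(discriminative_lrs(0.01, 4))
```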
Findings
ULMFiT is an effective method for fine-tuning large, pre-trained language models and achieves state-of-the-art performance on six of the text classification benchmarks, significantly surpassing previous best results.
Originality and Value
The paper presents ULMFiT, an effective and sample-efficient transfer learning method applicable across NLP tasks. It utilizes novel fine-tuning techniques that achieve state-of-the-art performance on text classification benchmarks and advances developments in transfer learning for NLP.
11. Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
Purpose
The work aims to improve cross-lingual language understanding (XLU), studying the effects of training unsupervised crosslingual representations at scale and presenting an effective multilingual pretrained language model.
Methods
In their experiments developing XLM-R, the authors follow the XLM approach to train a cross-lingual language model, with modifications for scaling. They use a Transformer model trained with the multilingual MLM objective, scale the model to 100 languages, and build a cleaned CommonCrawl corpus covering those languages. They do not use language embeddings, in order to better deal with code-switching. The authors then present evaluations on various benchmarks: XNLI, CoNLL NER, MLQA, and GLUE.
Findings
In their analyses of the results, the authors identify a trade-off between high-resource and low-resource languages: with fixed model capacity, how capacity is allocated across languages (for example, the number of languages covered and the rate at which each is sampled) trades performance between them. They demonstrate that scaling the size of the shared vocabulary can improve the performance of multilingual models on downstream tasks. The model, XLM-R, achieves a new state of the art on the XNLI and MLQA cross-lingual benchmarks, and even without the advantage of a CRF it performs on par with the state of the art in NER. On GLUE, the authors further show that their cross-lingual model remains competitive with monolingual models on monolingual tasks. They also show its surprising effectiveness on, and improvements for, low-resource languages.
Originality and Value
The authors present a scaled state-of-the-art crosslingual model and provide characteristics of effective cross lingual training while demonstrating that crosslingual models do not have to be impaired in monolingual tasks.
12. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov
Purpose
The authors aim to address the limitations of Transformers’ fixed-length context by proposing a novel neural architecture Transformer-XL that enables learning longer-term dependencies.
Methods
The authors’ Transformer architecture consists of two key new components. First, they present a segment-level recurrence mechanism where, during training, “the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment.” The additional input enables access to information further in the past, and the recurrence is not restricted to only the previous segment as extra context. Second, they introduce a relative positional encoding scheme that makes reusing cached states possible without confusing token positions across segments.
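A hedged PyTorch sketch of segment-level recurrence: the previous segment's hidden states are cached and detached (no gradient flows into the past) and concatenated to the current segment as extra keys and values. The relative positional encoding that the paper pairs with this is omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, n_heads, seg_len = 32, 4, 8
embed = nn.Embedding(100, d_model)
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

segments = torch.randint(0, 100, (3, 1, seg_len))   # three consecutive segments, batch of 1
mems = None                                          # cached hidden states from the previous segment

for seg in segments:
    h = embed(seg)                                   # (1, seg_len, d_model)
    # keys/values span the cached previous segment plus the current one
    context = h if mems is None else torch.cat([mems, h], dim=1)
    out, _ = attn(h, context, context)               # queries come only from the current segment
    mems = h.detach()                                # cache without backpropagating into the past
    print(out.shape)
```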
Findings
The authors show that Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers. It surpasses the state-of-the-art performance on five different datasets. “Transformer-XL obtains strong perplexity results, models longer-term dependency than RNNs and Transformer, achieves substantial speedup during evaluation, and is able to generate coherent text articles.”
Originality and Value
The authors present a faster and more effective Transformer architecture that can learn longer-term dependencies without the restrictions of context size.
13. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
Purpose
Pretraining large language models is computationally costly, and masked language modeling learns from only the small fraction of tokens that are masked in each example, leaving much of the training data unused. The authors propose a more sample-efficient pretraining task that instead trains the text encoder as a discriminator.
Methods
The pretraining method proposed by the authors, instead of corrupting the training data with [MASK] tokens, corrupts it by replacing tokens with alternatives sampled from a small generator network. The pretraining task of the main model is then to discriminate between generated (replaced) and original tokens. They train two transformers: a generator trained with masked language modeling to predict the original identities of the masked-out tokens, and a discriminator that predicts, for every token in the corrupted input, whether it was replaced. Generated tokens that happen to match the original token are considered real rather than replaced.
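A hedged sketch of how the discriminator's training labels are derived: some positions are replaced with samples from a generator, and the label at each position records whether the token differs from the original. A real generator would be a small masked LM; here it is simulated with random choices.

```python
import random

def corrupt_and_label(tokens, generator_vocab, replace_rate=0.15):
    """Replace some tokens with 'generator' samples; label 1 = replaced, 0 = original."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < replace_rate:
            sampled = random.choice(generator_vocab)   # stand-in for a masked-LM sample
            corrupted.append(sampled)
            labels.append(int(sampled != tok))         # a lucky identical sample counts as real
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

print(corrupt_and_label("the chef cooked the meal".split(), ["the", "ate", "dog"]))
```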
Findings
The authors’ models consistently surpassed other methods on a range of benchmarks, at both large and small model sizes. Their proposed method, while being more compute-efficient, obtained better results on downstream tasks. Their efficiency analysis suggested that ELECTRA’s gains come from more than just faster training, and that the gains grow larger as the models get smaller.
Originality and Value
The authors present a more efficient pretraining method for language modeling that addresses the compute and data-efficiency limitations of training large language models. Their method surpasses the performance of comparable language models on downstream tasks while taking much less compute to train.
14. Language Modeling with Gated Convolutional Networks
Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier
Purpose
The authors aim to develop an efficient language model that can be parallelized over sequential tokens by using a finite-context approach with stacked convolutions.
Methods
The authors propose a new neural language model that replaces the recurrent connections typically used in RNNs with gated temporal convolutions. The model computes each context as a function of a finite number of preceding words, which they demonstrate is sufficient for strong performance while enabling parallelization. Words are represented by vector embeddings stored in a lookup table, and the input is a sequence of word embeddings. When convolving inputs, the inputs are shifted to prevent access to future context. The output of each layer is a linear projection modulated by gates that control the information passed up the hierarchy, dubbed Gated Linear Units (GLUs). The authors use an adaptive softmax for compute efficiency.
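A hedged sketch of the gated linear unit described above: the layer output is a linear projection elementwise-modulated by a sigmoid gate computed from a second projection. The convolution over a finite window is abstracted into a plain linear map here to keep the gating visible.

```python
import torch

def glu_layer(X, W, b, V, c):
    """Gated Linear Unit: h(X) = (XW + b) * sigmoid(XV + c)."""
    return (X @ W + b) * torch.sigmoid(X @ V + c)

X = torch.randn(5, 16)                    # 5 positions, 16-dimensional inputs
W, V = torch.randn(16, 8), torch.randn(16, 8)
b, c = torch.zeros(8), torch.zeros(8)
print(glu_layer(X, W, b, V, c).shape)     # torch.Size([5, 8])
```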
Findings
The GCNN outperforms comparable LSTM results on the Google Billion Word benchmark, obtaining strong performance with much greater computational efficiency, and it achieves a new state of the art on WikiText-103.
Originality and Value
The work presents an efficient approach to achieve high-performance language modeling with significantly fewer resources.
15. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
Purpose
As language models grow larger, their potential harms become more pervasive as well. The authors aim to address the risks associated with large language models and present developmental recommendations for mitigating these harmful effects.
Methods & Findings
The authors’ initial presentation of the problems of language models is based largely on a review of relevant and current literature.
The authors first discuss the environmental and financial costs of training models. Large language models require tremendous amounts of energy to train, with a significant carbon footprint and global impact, establishing the need for energy efficient model architectures and training methods. Furthermore, they present the financial cost required for accuracy gains, describing that just a minor increase in performance can require disproportionately drastic financial compute costs and carbon emissions.
They further examine practices in the curation and documentation of datasets and stress the importance of carefully constructed corpora. This is important towards building truly diverse datasets and minimizing the ethical problems that come with training on biased, human data. Language models can learn harmful associations through bias and stereotypes which are observable in their generations and predictions. Documentation and exhaustive consideration of the data used to train language models are thus highly important.
The authors then examine the limitations of language models in their apparent language understanding and their tendency to memorize and reproduce training data. They describe them as “stochastic parrots”: while they attain impressive performance on a wide range of benchmarks, they stitch together linguistic forms from their training data without reference to meaning, and they can reproduce that data’s biases and inaccuracies.
Large language models also have potential for misuse and abuse. The authors describe that they may be used to automate certain types of decision-making, such as hiring or lending, in ways that could be harmful due to their flaws and biases. Furthermore, with the large scale of data, language models are also prone to attacks that extract personally identifiable and sensitive information.
The authors finally present paths forward with solutions to reduce the limitations and dangers of large language models, such as using smaller models, training models on a diverse set of data, and developing better evaluation metrics to measure model performance.
Originality and Value
The paper presents a critical analysis of the limitations and potential dangers of large pre-trained language models, with crucial considerations for the developmental practices in NLP, and encourages cautious and thoughtful use of these models. The overview is relevant with the growing size of language models, as the work is important to guide the methods of future research in the field.
16. Semi-supervised Sequence Learning
Andrew M. Dai, Quoc V. Le
Purpose
The nature of unsupervised learning allows for training on large quantities of unlabeled data to improve model quality. This work aims to improve sequence learning with LSTM RNNs by adding an unsupervised pretraining step that improves the subsequent supervised training.
Methods
The authors propose two approaches to semi-supervised sequence learning. The first method uses a sequence autoencoder trained to reproduce input documents, while the second uses a language model. In each method, the weights of the supervised LSTM are initialized with the weights obtained from the unsupervised step. These methods are evaluated on a range of datasets and tasks.
Findings
Using the sequence autoencoder method with LSTMs (SA-LSTM) matches or surpasses previous best models across all datasets, and the LM-LSTM initialization also works well, though less so than the SA-LSTM.
Originality and Value
The authors demonstrate the effectiveness of using LSTMs across NLP tasks by using unsupervised learning with sequence autoencoders and language models. The semi-supervised method can reach or surpass the performance of all previous baselines. Using semi-supervised methods also allows for the utilization of unlabeled data and thus is valuable in low-data tasks.
17. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
Purpose
With the rapid advancements in language models, the authors aim to address the limitations of the GLUE benchmark in evaluating and guiding future research through the development of an expanded dataset.
Methods
The authors base their benchmark on GLUE, with more challenging tasks, diverse task formats, and comprehensive human baselines. They decide on eight tasks that are challenging and beyond the scope of current state-of-the-art systems, evaluable, from public data, and with relatively simple input and output formats.
Findings
The simple most-frequent-class and CBOW baselines attain poor, near-chance performance, while BERT attains significant gains; on one task, however, BERT performs worse than the simple baselines. The best baselines are still substantially behind human performance. The results suggest that the benchmark is challenging, with substantial headroom left for more advanced language models, and should therefore remain useful for longer.
Originality and Value
The work presents a new benchmark for evaluating generalizable language understanding systems that is challenging and enables greater progress and development.
18. Language Models as Knowledge Bases?
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel
Purpose
Language models, being trained on large corpora, have potential to store knowledge extracted from the data that allow for applications as knowledge bases. In this research paper, the authors analyze the utility of the relational knowledge present in pretrained language models to recall factual knowledge, demonstrating their potential.
Methods
To test this question, the authors present the LAMA (LAnguage Model Analysis) probe, a corpus of facts composed of subject-relation-object triples and question-answer pairs from various knowledge sources. The authors experiment with pretrained large language models and compare their results with the performance of various knowledge-extraction baselines. In their experiments, each fact is expressed as a cloze statement, and models are evaluated on how highly they rank the ground-truth token, resembling the metrics used in knowledge base completion.
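A hedged example of the kind of cloze-style query the probe uses, expressed here with the Hugging Face `transformers` fill-mask pipeline; the library and model name are my choice for illustration, not part of the paper.

```python
from transformers import pipeline

# Query a pretrained masked language model with a cloze statement and rank its candidates.
fill = pipeline("fill-mask", model="bert-base-cased")
for candidate in fill("Dante was born in [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```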
Findings
The researchers’ findings demonstrate the ability of pretrained language models to recall stored knowledge without fine-tuning. They find that BERT, in particular, contains relational knowledge competitive with the baselines and with traditional NLP methods that have oracle knowledge, and that it also does well on open-domain question answering. In further analysis, comparing Pearson correlation coefficients, the researchers identify that more appearances of an object in the training data, as well as similarity between subject and object vectors, improve performance, showing that certain knowledge is learned more readily.
Originality and Value
The research introduces the LAMA probe to test language models’ potential as knowledge bases and demonstrates the effectiveness of language models at recalling knowledge from their training corpora as a competitive alternative to structured knowledge bases.
19. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
Suchin Gururangan, Ana Marasovic, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith
Purpose
Pretrained language models have shown success across a broad range of domains. The authors aim to investigate the value of a second phase of pretraining that tailors a model to the domain and to the task of its target application.
Methods
The authors’ experiments work with RoBERTa across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks. They compare domain-adaptive pretraining (DAPT), which continues pretraining on a large corpus of unlabeled text from the target domain, and task-adaptive pretraining (TAPT), which continues pretraining on the unlabeled text of the task’s own training data, as well as their combination and a variant that uses a curated, task-relevant corpus (Curated-TAPT).
Findings
DAPT obtains improvements over the baseline across all domains, including domains that have more overlap with the original pretraining corpus, and the pattern is consistent across high- and low-resource settings. TAPT also consistently improves over the baseline for all tasks across domains. When combined, DAPT followed by TAPT achieves the best performance on all tasks. Finally, Curated-TAPT further improves results.
Originality and Value
The authors demonstrate the value of several variations for adapting pretrained language models to domains and tasks. Their findings show that even large language models, that are known to be so broadly applicable across domains, benefit from additional pretraining when facing the complexity of new tasks.
20. Cross-lingual Language Model Pretraining
Guillaume Lample, Alexis Conneau
Purpose
In this work, the authors demonstrate the effectiveness of cross-lingual language model pretraining on various cross lingual understanding tasks. They propose two learning methods, supervised and unsupervised, to significantly outperform past best results on multiple benchmarks. Such methods are particularly relevant in tasks pertaining to low-resource languages, making them impactful in producing more effective models with less available data.
Methods
The authors propose methods for both unsupervised and supervised cross-lingual learning. The unsupervised objectives are causal language modeling (CLM) and masked language modeling (MLM) on monolingual text, while the supervised objective, translation language modeling (TLM), extends MLM to concatenated pairs of parallel sentences so the model learns to attend across languages. For each, they process the languages with Byte Pair Encoding learned on samples of the corpora.
Findings
The authors compare the performance of their methods against prior baselines on various cross-lingual tasks, including cross-lingual classification (XNLI), unsupervised and supervised machine translation, and low-resource language modeling.
Originality and Value
This research shows “for the first time the strong impact of cross-lingual language model pretraining,” presents methods that surpass state-of-the-art results on various benchmarks, and demonstrates improvements in perplexity for low-resource languages.
21. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
Purpose
Conditional computation, where only parts of a network are active for each example, promises to dramatically increase model capacity without a proportional increase in computation. In practice, however, the approach has faced significant algorithmic and performance challenges. The authors aim to address these challenges and realize conditional computation for efficient, scalable large language models.
Methods
The authors’ work builds on the Mixture-of-Experts (MoE) layer, which activates different expert sub-networks for different inputs so that experts can specialize. To enable large-scale model training, the authors propose the Sparsely-Gated MoE (SGMoE) layer, in which a trainable, noisy top-k gating mechanism selects which few experts are activated for each input, keeping the layer efficient while greatly increasing capacity. The authors further identify that gating networks tend to converge to a state where the same few experts always receive large weights. To diversify the utilization of experts, they define an additional importance loss, scaled by a hand-tuned factor, that encourages all experts to be equally important. The authors evaluated their proposed SGMoE layer on large-scale language modeling and machine translation tasks and measured its performance as well as its computational cost and memory usage.
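A hedged PyTorch sketch of the sparse gating idea: add noise to the gate logits, keep only the top-k experts per input, and renormalize with a softmax so all other experts receive exactly zero weight (and need not be evaluated).

```python
import torch

def noisy_top_k_gates(x, W_gate, W_noise, k=2):
    """Return sparse gate weights of shape (batch, n_experts) with only k nonzeros per row."""
    clean = x @ W_gate
    noisy = clean + torch.randn_like(clean) * torch.nn.functional.softplus(x @ W_noise)
    topk_vals, topk_idx = noisy.topk(k, dim=-1)          # keep the k largest gate logits
    gates = torch.full_like(noisy, float("-inf"))
    gates.scatter_(-1, topk_idx, topk_vals)               # -inf everywhere else
    return torch.softmax(gates, dim=-1)                   # zero weight for unselected experts

x = torch.randn(4, 16)                                    # 4 inputs
gates = noisy_top_k_gates(x, torch.randn(16, 8), torch.randn(16, 8))
print(gates)                                              # each row has exactly 2 nonzero entries
```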
Findings
SGMoE outperformed the state-of-the-art models on several benchmarks, and it also showed significant improvement in terms of computational cost and memory usage. The authors establish the effectiveness of the SGMoE layer for scaling up neural networks with conditional computation.
Originality and Value
The research is “the first to demonstrate major wins from conditional computation in deep networks” and presents methods opening directions to scalable and effective large language models.
22. What Does BERT Look At? An Analysis of BERT’s Attention
Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning
Purpose
The authors aim to understand language models’ attention mechanisms and examine BERT’s learning of specific linguistic features.
Methods
The authors propose a series of methods for analyzing attention mechanisms and demonstrate the linguistic information they capture. They work with the BERT model, extracting its attention maps over unlabeled text and evaluating individual attention heads against linguistic annotations such as dependency syntax and coreference, along with attention-based probing classifiers.
Findings
The analyses demonstrate that certain attention heads correspond well to linguistic notions of syntax and coreference. Certain attention heads in BERT capture syntactic structures by attending to tokens connected by syntactic dependencies. Additionally, some heads encode semantic information by attending to tokens with similar semantic properties, such as subject-verb pairs or related entities. Attention heads show specialized roles for linguistic phenomena such as negation, coreference, and long-distance dependencies. The similarity analysis reveals clusters of attention heads with similar attention patterns.
Originality and Value
The paper provides valuable insights into the attention mechanisms of large language models and an understanding of their capabilities.
23. Exploring the Limits of Language Modeling
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu
Purpose
The paper aims to examine the capabilities of RNNs in large-scale language modeling, looking at their ability to capture long-term dependencies in language and to handle rare and unknown words. The authors build upon effective existing techniques to propose methods that address these challenges more effectively.
Methods
The authors expand upon and unify methods proposed in recent research on RNN-based large-scale language modeling. In particular, they present a Softmax loss based on character-level CNNs. They build upon work applying CNN character embeddings, which allow for efficient parametrization of the word embeddings, to reduce the number of parameters of the Softmax layer, and they propose that the resulting CNN Softmax layer can better handle arbitrary words. To improve efficiency, they combine the word- and character-level models, feeding character-level CNN representations of words into the word-level LSTM. The authors evaluate their training methods in experiments on the One Billion Word Benchmark with various LSTM language model architectures.
Findings
The methods obtain significant improvements over the previous state of the art in large-scale language modeling, reducing perplexity from 51.3 to 30.0 while reducing the number of parameters by a factor of 20.
Originality and Value
The authors demonstrate the capabilities of RNN-based models in large-scale language modeling by exploring recent advances in model architectures and combining them into new, very effective methods.
24. Unified Language Model Pre-training for Natural Language Understanding and Generation
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
Purpose
Pretrained language models are highly effective across a variety of NLP tasks, and with different training objectives they can be fine-tuned for downstream tasks of different types. The authors propose a new UNIfied pretrained Language Model (UNILM) that is jointly optimized for multiple objectives to enable fine-tuning for both NLU and NLG tasks.
Methods
The authors’ method unifies bidirectional, unidirectional, and sequence-to-sequence language modeling. Their model, UNILM, is a shared network of Transformer blocks pretrained on these three types of unsupervised language modeling objectives, implemented as masked prediction tasks with varying contexts controlled by different self-attention masks. The bidirectional LM’s context is the tokens on both sides; the unidirectional LM’s context is the tokens to one side (left or right); and the sequence-to-sequence LM’s context is the full source sequence plus the tokens to the left in the target sequence. Once UNILM is pretrained, it can be fine-tuned with task-specific data for downstream tasks. The authors evaluate the performance of their model on a range of NLU and NLG tasks.
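A hedged sketch of the self-attention masks that implement the three objectives on one shared network: `True` marks positions a token may attend to (source length `s` and target length `t` in the sequence-to-sequence case).

```python
import numpy as np

def bidirectional_mask(n):
    return np.ones((n, n), dtype=bool)                   # every token sees every token

def left_to_right_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))          # each token sees itself and the left context

def seq2seq_mask(s, t):
    n = s + t
    mask = np.zeros((n, n), dtype=bool)                  # target positions are hidden by default
    mask[:, :s] = True                                   # every token sees the full source segment
    mask[s:, s:] = np.tril(np.ones((t, t), dtype=bool))  # target tokens see only their left context
    return mask

print(seq2seq_mask(3, 2).astype(int))
```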
Findings
On several NLU datasets, UNILM’s performance is comparable to or even surpasses BERT’s. UNILM also outperforms previous state-of-the-art models on five NLG datasets, performing well across both classes of NLP tasks.
Originality and Value
The authors present a new language model that can be applied across NLU and NLG tasks with a single transformer with shared parameters and architecture.
25. ERNIE: Enhanced Language Representation with Informative Entities
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu
Purpose
The authors aim to improve the performance of pretrained language models on knowledge-based tasks by incorporating knowledge graphs (KGs) to provide structured information along with natural language.
Methods
The authors present their framework for the model, ERNIE. The architecture includes both an underlying textual encoder (T-Encoder) for basic lexical and syntactic information and an upper knowledgeable encoder (K-Encoder) for aggregating entities from external knowledge graphs together with the textual information from the underlying layer. The authors use a novel pre-training task designed to inject knowledge and fuse the two forms of information. The procedure, denoising entity auto-encoder (dEA), involves randomly masking token-entity alignments and then predicting the corresponding entities for those tokens. Masked language modeling and next sentence prediction tasks are also employed.
Findings
The ERNIE model is evaluated on five NLP datasets covering both general and knowledge-driven tasks: entity typing (labeling an entity mention with its semantic type), relation classification (determining the relation between two entities), and the NLU tasks of GLUE. ERNIE mitigates the noisy-label challenge in entity typing, shows greater data efficiency in relation classification thanks to its injected knowledge, and performs comparably to BERT on GLUE.
Originality and Value
The authors present ERNIE, incorporating knowledge information into a language model to improve performance on knowledge-based tasks.
26. Multi-Task Deep Neural Networks for Natural Language Understanding
Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao
Purpose
The authors aim to improve generalizability in NLU tasks with multi-task learning.
Methods
The authors develop a Multi-Task Deep Neural Network (MT-DNN) that combines a language model with multi-task learning to learn representations across multiple tasks. The output layers of the architecture are task-specific. For all tasks, the input sequence of embedding vectors is fed into the Transformer encoder to generate shared contextual embedding vectors. Task-specific layers generate task-specific representations: these include one for each of Single-Sentence Classification, Text Similarity, Pairwise Text Classification, and Relevance Ranking.
Findings
The authors evaluate the model on various NLU tasks and compare the performance of MT-DNN to the leaderboard models. MT-DNN obtains state-of-the-art results on ten NLU tasks, including eight of the nine GLUE tasks. The authors also demonstrate that the learned representations allow effective domain adaptation with substantially fewer in-domain labels.
Originality and Value
The authors expand on multi-task deep learning by incorporating a transformer language model, achieving state-of-the-art results and improving generalizability in NLU tasks.
27. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi
Purpose
The authors aim to leverage the power of large transformer language models as text encoders for diffusion models in high-fidelity text-to-image generation.
Methods
The authors present Imagen, consisting of a text encoder and conditional diffusion models that map text embeddings to images. They explore BERT, T5, and CLIP as text encoders, freezing their weights for simplicity. A cascade of diffusion models takes the embeddings from the text encoder and generates images at increasing resolutions. The authors introduce a new dynamic thresholding method, which pushes saturated pixels inward at each sampling step to prevent over-saturation. The authors evaluate their method on various benchmarks, generating images from natural language inputs.
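A hedged NumPy sketch of dynamic thresholding as described: at each sampling step, choose a threshold from a high percentile of the predicted pixel magnitudes and, if it exceeds 1, clip to that threshold and rescale back toward the nominal [-1, 1] range. The percentile value here is illustrative.

```python
import numpy as np

def dynamic_threshold(x0_pred, percentile=99.5):
    """Clip and rescale a predicted image to curb pixel saturation (values nominally in [-1, 1])."""
    s = np.percentile(np.abs(x0_pred), percentile)
    s = max(s, 1.0)                      # never shrink the valid range
    return np.clip(x0_pred, -s, s) / s   # push saturated pixels inward

print(dynamic_threshold(np.array([-2.4, -0.5, 0.3, 1.8])))
```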
Findings
Imagen achieves a state-of-the-art zero-shot FID on COCO without training on COCO, attaining high image quality. The authors identify Imagen’s limited ability to generate photorealistic people. In human evaluations against previous approaches, raters strongly prefer Imagen over all other models across metrics.
Originality and Value
The work demonstrates the effectiveness of frozen large pretrained language models as text encoders for the text-to-image generation and shows the impact of scale in these models.
28. PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma…et al.
Purpose
The authors aim to improve few-shot learning performance of large language models with scaling.
Methods
The authors present the PaLM language model, which is based on the standard Transformer architecture. They make multiple modifications, including using the SwiGLU activation instead of GeLU, using a “parallel” formulation in each Transformer block instead of a “serialized” formulation for faster training speed, Multi-Query Attention for cost savings at autoregressive decoding time, and RoPE embeddings instead of absolute position embeddings. The authors compare multiple model sizes: 540B parameters, 62B parameters, and 8B parameters. In training, the Pathways system is used to scale training across two TPU v4 pods using two-way data parallelism at the pod level. PaLM was evaluated across a range of NLP tasks.
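Of the listed modifications, the SwiGLU feed-forward activation is easy to show in isolation; a hedged sketch with arbitrary dimensions (not the paper's):

```python
import torch
import torch.nn.functional as F

def swiglu(x, W, V):
    """SwiGLU: Swish(xW) elementwise-multiplied by a linear gate xV."""
    return F.silu(x @ W) * (x @ V)

x = torch.randn(4, 32)
print(swiglu(x, torch.randn(32, 64), torch.randn(32, 64)).shape)  # torch.Size([4, 64])
```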
Findings
PaLM outperformed the previous state-of-the-art results on 24 of the 29 English NLP benchmark tasks in the 1-shot setting and 28 of the 29 tasks in the few-shot setting. On the Massive Multitask Language Understanding (MMLU) benchmark, PaLM outperforms the Chinchilla model on all the categories except one. After fine-tuning on the SuperGLUE benchmark, PaLM is competitive with state-of-the-art (encoder-decoder model) while outperforming the best decoder-only autoregressive language model. PaLM further outperforms the prior state-of-the-art on 44 out of the 58 common tasks in the BIG-bench benchmark. PaLM achieves state-of-the-art accuracy across reasoning-based arithmetic and commonsense tasks, and surpasses past results on coding tasks. It achieves strong zero-shot translation performance as well, and does well on various other tasks and benchmarks.
Originality and Value
The authors demonstrate the value of scaling in large language models, obtaining high performance across a range of tasks with generalizability and data-efficiency.
29. DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
Purpose
With the advancements of the BERT and RoBERTa models, the authors aim to improve performance on NLP tasks with a new model architecture, DeBERTa, featuring a disentangled attention mechanism and an enhanced mask decoder.
Methods
The authors build upon BERT and RoBERTa to propose a new Transformer-based neural language model, DeBERTa (Decoding-enhanced BERT with disentangled attention). They introduce two novel components. The first is the disentangled attention mechanism. Instead of representing each word as a vector made up of the sum of its word and position embeddings, the authors represent each word with two vectors encoding its content and position separately. Attention weights are computed using disentangled matrices based on each of these vectors. Second, an enhanced mask decoder incorporates absolute positions in the decoding layer to predict the masked tokens in model pre-training. Further, the authors present a new virtual adversarial training method, Scale-invariant Fine-Tuning (SiFT). The authors evaluate DeBERTa’s performance on various NLU tasks.
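A hedged sketch of the disentangled attention score decomposition (content-to-content, content-to-position, position-to-content). The relative-position embeddings are passed in as a full table here for simplicity; the bucketed shared table and other details of the paper are abstracted away, and only the scaling by 1/sqrt(3d) follows the paper.

```python
import math
import torch

def disentangled_scores(H, R, Wq, Wk, Wq_r, Wk_r):
    """H: (n, d) content states. R: (n, n, d) embeddings of the position of j relative to i."""
    Qc, Kc = H @ Wq, H @ Wk                         # content projections
    Qr, Kr = R @ Wq_r, R @ Wk_r                     # relative-position projections, (n, n, d)
    c2c = Qc @ Kc.T                                 # content-to-content
    c2p = torch.einsum("id,ijd->ij", Qc, Kr)        # content of i vs. position of j relative to i
    p2c = torch.einsum("jid,jd->ij", Qr, Kc)        # position of i relative to j vs. content of j
    d = H.shape[-1]
    return (c2c + c2p + p2c) / math.sqrt(3 * d)     # scaled sum of the three terms

n, d = 6, 16
print(disentangled_scores(torch.randn(n, d), torch.randn(n, n, d),
                          *(torch.randn(d, d) for _ in range(4))).shape)
```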
Findings
With half as much training data, DeBERTa surpasses RoBERTa-Large on a range of NLP tasks. When scaled, the single DeBERTa model surpasses the human performance on the SuperGLUE benchmark for the first time, and the ensemble DeBERTa model achieves state-of-the-art performance on the SuperGLUE leaderboard.
Originality and Value
The authors present a highly effective and efficient new model architecture that improves upon Transformer-based language models across a range of NLP tasks.
30. Making Pre-trained Language Models Better Few-shot Learners
Tianyu Gao, Adam Fisch, Danqi Chen
Purpose
Given the effectiveness of the GPT-3 model on few-shot tasks with prompts and demonstrations, the authors study and present techniques for few-shot fine-tuning of smaller language models.
Methods
The authors first introduce approaches for automatic prompt generation. To select label words, for each class in the label space, the top vocabulary words are chosen based on the conditional likelihood assigned by the initial language model. Diverse templates are then generated from the labels using a T5 model, which is trained to fill in missing spans of text: training sentences are structured so that T5 fills in the template around them. They then fine-tune the model with each generated template and use a development set to select the single best template or the top few templates for ensembling. Finally, they dynamically and selectively incorporate demonstrations into each input’s context during fine-tuning.
Findings
The authors show that their method outperforms vanilla fine-tuning by up to 30% and 11% on average. The authors find that “GPT-3”-style in-context learning does not always improve over zero-shot prediction, demonstrating its weakness in smaller language models. Prompt-based fine-tuning was able to greatly outperform standard fine-tuning, and automatically searched templates were comparable to or better than manual ones. Demonstrations in context led to consistent gains in a majority of tasks.
Originality and Value
The authors presented LM-BFF, a set of effective techniques for fine-tuning language models for few-shot learning.
31. The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou, Noah Constant
Purpose
Prompt design has been shown to be effective in utilizing language models, and it allows the models to be frozen, limiting the cost of their growing size; yet it has key drawbacks in the manual construction of quality task descriptions and in the limits of what fits in the input. Therefore, the authors aim to improve prompt-based methods through prompt tuning, a parameter-efficient method that combines the benefits of prompting and fine-tuning.
Methods
The prompt-tuning method works with a frozen pretrained model in which only a small number of tunable prompt tokens are learned per downstream task and prepended to the input. This “soft prompt” is trained end-to-end on the full labeled dataset, so it can condense more signal than hand-designed few-shot prompts. The authors also experiment with “prompt ensembling,” training multiple prompts on the same task to create many separate models that share one frozen network. In their evaluations, they compare their method to model tuning, prompt design, and past baselines on a range of datasets.
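A hedged PyTorch sketch of prompt tuning's trainable part: a small matrix of prompt embeddings is the only learned parameter, prepended to the frozen model's input embeddings. The prompt length and initialization scale below are illustrative choices.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt embeddings prepended to the input of a frozen language model."""
    def __init__(self, embed_dim, prompt_len=20):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds):                     # (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # the frozen model consumes the longer sequence

# Only soft.prompt would be passed to the optimizer; the language model's weights stay frozen.
soft = SoftPrompt(embed_dim=768)
print(soft(torch.randn(2, 10, 768)).shape)               # torch.Size([2, 30, 768])
```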
Findings
The prompt-tuning method improves upon prompting and closes the gap between prompt-based and model-tuned task performance. On the SuperGLUE benchmark, the performance of prompt-tuning is competitive with that of model tuning, approaching closer with larger models. Furthermore, prompt-tuning resulted in improved generalization in zero-shot domain transfer.
Originality and Value
The authors narrow the gap between prompt-based and model-tuning based learning with large language models, addressing the issues of the increasing size of language models. Furthermore, their methods demonstrate the effectiveness of prompt-tuning for generalizability and zero-shot performance.
32. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
Timo Schick, Hinrich Schutze
Purpose
Task instructions improve few-shot learning and enable unsupervised training on NLP tasks. The authors aim to bring the advantages of higher performance supervised methods and unsupervised, task-instruction-based methods to NLP tasks with a semi-supervised approach that restructures inputs as cloze-style tasks.
Methods
The authors’ method, Pattern-Exploiting Training (PET), is based on a masked language model. It uses a pattern-verbalizer pair, where the pattern rewrites the task input as a cloze question with one masked token, and the verbalizer maps each label to a word in the language model’s vocabulary. This pair lets an input be recast as predicting the most likely word at the masked position, which the verbalizer then maps back to a label, rather than predicting a label with no inherent meaning.
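A hedged example of a pattern-verbalizer pair for a sentiment task; the specific pattern, label words, and scores below are illustrative, not taken from the paper.

```python
def pattern(review):
    """Rewrite a classification input as a cloze question with one masked token."""
    return f"{review} It was [MASK]."

verbalizer = {"positive": "great", "negative": "terrible"}   # label -> vocabulary word

def predict(mask_word_scores):
    """Pick the label whose verbalized word the masked LM scores highest at [MASK]."""
    return max(verbalizer, key=lambda label: mask_word_scores[verbalizer[label]])

print(pattern("Best pizza ever!"))
print(predict({"great": 0.62, "terrible": 0.03}))            # dummy masked-LM scores
```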
Findings
The training method PET significantly improves over standard supervised training, unsupervised training and other semi-supervised approaches in the limited-data settings.
Originality and Value
The paper presents a new training method that leverages training data more effectively with a semi-supervised approach that improves upon supervised methods. The advance in training on instruction-based NLP tasks provides greater potential for few-shot learning and task generalizability.
33. Using the Output Embedding to Improve Language Models
Ofir Press, Lior Wolf
Purpose
The authors improve the quality of word embeddings by demonstrating that the output embedding is a valid word embedding and tying it with the input embedding.
Methods
The authors work with the large and small variants of three different model categories: NNLMs, the word2vec skip-gram model, and NMT models. For small models with no regularization, they present a new regularization scheme, adding a projection matrix before the output embedding and a regularizing term to the loss function. For each model, they apply weight tying of the input and output embeddings. For translation models, they propose three-way weight tying (TWWT), where the input embedding of the decoder, the output embedding of the decoder, and the input embedding of the encoder are all tied.
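A hedged PyTorch sketch of weight tying: the output projection reuses the input embedding matrix, so the model stores a single shared embedding.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embedding = nn.Embedding(vocab_size, d_model)        # input embedding U
output = nn.Linear(d_model, vocab_size, bias=False)  # output embedding V (projection to logits)
output.weight = embedding.weight                     # tie V = U, halving the embedding parameters

hidden = torch.randn(2, 5, d_model)                  # hidden states from some language model
logits = output(hidden)
print(logits.shape)                                  # torch.Size([2, 5, 1000])
```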
Findings
The authors find that, in examining the quality of the embeddings, the output embedding is nearly as good as the input embedding. In the trained small NNLM model, however, the output embedding greatly surpasses the input embedding and the tied embedding is comparable to the output embedding. Weight tying significantly reduces perplexity in the NNLM models. In the NMT models, despite having about 28%-52% fewer parameters, the tied models achieve similar performance to the untied models.
Originality and Value
The work presents a novel approach of incorporating output embeddings to improve model performance while reducing model size.
34. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
William Fedus, Barret Zoph, Noam Shazeer
Purpose
Mixture-of-Experts (MoE) models have shown notable effectiveness in NLP by activating different parameters for different inputs. However, the approach’s adoption has been limited by complexity, communication costs, and training instability. The authors aim to address these issues to enable more efficient and scalable sparsely activated models.
Methods
The authors present the Switch Transformer, which simplifies the Mixture-of-Experts routing algorithm: instead of routing each token to several experts, a “switch” layer routes each token only to the single expert with the highest router probability, reducing routing computation and communication costs while preserving model quality (see the sketch below). A differentiable load-balancing auxiliary loss encourages tokens to be spread evenly across experts, and training techniques such as selective precision (keeping the router in float32 while training in bfloat16) and smaller parameter initialization improve stability. The authors replace the feed-forward layers of T5 with switch layers, scale models to over a trillion parameters, and compare against dense T5 baselines that use the same computation per token across pretraining and a variety of NLP tasks.
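A hedged sketch of top-1 ("switch") routing with a simple load-balancing auxiliary loss; the expert networks themselves are omitted so the routing logic stays visible.

```python
import torch

def switch_route(x, W_router, n_experts):
    """Route each token to its single highest-probability expert (top-1 routing)."""
    probs = torch.softmax(x @ W_router, dim=-1)            # (tokens, n_experts)
    expert_prob, expert_idx = probs.max(dim=-1)             # one expert per token
    # auxiliary load-balancing loss: encourage uniform token counts and router mass
    frac_tokens = torch.bincount(expert_idx, minlength=n_experts).float() / len(x)
    frac_probs = probs.mean(dim=0)
    aux_loss = n_experts * torch.sum(frac_tokens * frac_probs)
    return expert_idx, expert_prob, aux_loss

tokens = torch.randn(16, 32)                                # 16 token representations
idx, gate, aux = switch_route(tokens, torch.randn(32, 4), n_experts=4)
print(idx.tolist(), round(aux.item(), 3))
# each token's chosen expert output would be scaled by `gate` before the residual connection
```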
Findings
Switch Transformers match or exceed their dense baselines while pre-training substantially faster at the same computational cost, demonstrating that sparsity can significantly improve efficiency and scalability without compromising performance.
Originality and Value
The paper presents a simple and computationally efficient approach for training language models at extreme scale by activating only a small fraction of the parameters for each input. Furthermore, the experiments establish sparsity as a practical pathway toward ever larger language models.
35. How Can We Know What Language Models Know?
Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig
Purpose
Varied prompt quality can affect the ability to determine the knowledge contained in language models. Therefore, the authors aim to more accurately estimate the knowledge of language models with automatic generation of better prompts.
Methods
The authors propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts in order to extract the knowledge of language models.
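To make this concrete, here is an illustrative sketch (my own example, not the paper’s mined templates or learned ensemble weights) of querying a masked language model with several paraphrased prompts for the same relation and averaging the scores for a candidate answer:

```python
# Query a masked LM with multiple paraphrases of the same relational prompt
# and average the probability assigned to a candidate answer.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token

prompts = [
    f"The capital of France is {MASK}.",
    f"France's capital city is {MASK}.",
    f"{MASK} is the capital of France.",
]

def ensemble_score(candidate):
    # Average the candidate's fill-in probability across all prompts.
    scores = [fill_mask(p, targets=[candidate])[0]["score"] for p in prompts]
    return sum(scores) / len(scores)

print(ensemble_score("paris"))
```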
Findings
The authors’ methods improve accuracy from 31.1% to 39.6%, raising the estimated lower bound on the knowledge contained in large language models. Their optimized ensemble method further raised accuracy to 43.7% on BERT-large.
Originality and Value
The paper provides valuable insight into the capabilities of large language models and presents a method to extract the knowledge contained in a language model more accurately through higher-quality prompts, showing significant improvements in accuracy from better prompting techniques.
36. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
Purpose
Scaling language models alone has shown limited gains on complex tasks involving arithmetic, commonsense, or symbolic reasoning. The authors thus examine a new method of prompting with “chains of thought” to improve large language models on reasoning-based tasks.
Methods
The authors draw from the human tendency to break down complex tasks into multi-step problems and mimic this in their prompts: each few-shot exemplar includes a chain of intermediate reasoning steps leading to the final answer. To examine the effectiveness of such an approach in large language models, the authors work with arithmetic, commonsense, and symbolic reasoning tasks.
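A chain-of-thought prompt is simply a few-shot prompt whose exemplars spell out the intermediate reasoning. A sketch in that spirit (the exemplar follows the style of the paper’s arithmetic examples; the model call itself is omitted):

```python
# Few-shot chain-of-thought prompt: the demonstration shows the reasoning
# steps before the answer, and the new question is appended for the model
# to continue in the same style.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:"""

print(COT_PROMPT)  # send to a large LM and parse the final "The answer is ..." line
```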
Findings
The authors find that the benefit of chain-of-thought prompting grows with model size and is especially pronounced for more complex problems. At its best, chain-of-thought prompting outperforms or is competitive with the previous state of the art.
Originality and Value
The authors demonstrate the effectiveness of chain-of-thought prompting and its ability to improve large language models’ performance on complex reasoning-based tasks beyond scaling.
37. Adversarial NLI: A New Benchmark for Natural Language Understanding
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela
Purpose
Given the value of large datasets for the progress of AI, and with rapid development in NLP requiring longer-lasting benchmarks, the authors aim to present a large and expandable dataset for NLU.
Methods
The authors propose an iterative, adversarial human-and-model-in-the-loop approach for NLU dataset collection for longevity and robustness. This is called HAMLET (Human-And-Model-in-the-Loop Enabled Training). With a base model trained for NLI, a human annotator “adversary” is used, along with human “verifiers.” Given a context, the human annotator constructs intentionally difficult examples that should expose additional model weaknesses to be added to the training set. This is iteratively done for several rounds, each time training a new model and setting a new test set.
Findings
The authors find that base model performance is low and that the rounds become increasingly more difficult. Furthermore, they show that training on more rounds improves robustness. With their dataset, RoBERTa achieves state-of-the-art performance on both the SNLI and MNLI datasets. The varied relative performance of the models across the rounds indicates different weaknesses between BERT, XLNet, and RoBERTa.
Originality and Value
The work presents a new benchmark for NLU designed to be challenging, and proposes an approach for developing dynamic benchmarks.
38. Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel
Purpose
Many released large language models are trained on great amounts of often private data. In their work, the researchers demonstrate that large language models memorize individual training examples that can be extracted, and thus pose a threat of leaking personally identifiable information.
Methods
The authors propose a method for “extracting verbatim sequences from a language model’s training set using only black-box query access.” Their method draws from the differences in losses between train and test examples, which are not significantly different on average but can still be exploited. They first generate a large pool of candidate sequences by sampling from the model. They then flag outputs likely to contain memorized text based on the model’s confidence, as quantified by perplexity.
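A heavily simplified sketch of the generate-then-rank step (using GPT-2 as a stand-in model; the paper additionally uses improved sampling strategies and several membership-inference metrics beyond raw perplexity):

```python
# Sample text from a language model, then rank generations by perplexity:
# unusually confident (low-perplexity) outputs are candidate memorized text.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss       # mean token negative log-likelihood
    return torch.exp(loss).item()

start = tok(tok.eos_token, return_tensors="pt").input_ids
samples = [tok.decode(model.generate(start, do_sample=True, max_length=40,
                                     pad_token_id=tok.eos_token_id)[0])
           for _ in range(5)]
for text in sorted(samples, key=perplexity)[:3]:  # most confident generations first
    print(round(perplexity(text), 1), text[:60].replace("\n", " "))
```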
Findings
The authors’ methods are able to identify 67% of candidate samples as verbatim training examples. These samples contain a significant amount of private and personally identifiable information, including the full name, physical address, email address, phone number, and fax number of an individual. Furthermore, larger language models were more vulnerable, memorizing more training data than smaller language models.
Originality and Value
The authors quantify the privacy risks associated with large language models and demonstrate the extent to which information is memorized in language models.
39. How Much Knowledge Can You Pack Into the Parameters of a Language Model?
Adam Roberts, Colin Raffel, Noam Shazeer
Purpose
Language models are able to implicitly store and retrieve knowledge. The authors aim to explore this ability and determine the utility of this approach by fine-tuning pre-trained models to answer questions.
Methods
The authors work with the T5 models fine-tuned on various open-domain question answering datasets. They use salient span masking, which first uses BERT to identify sentences that contain named entities and dates, and then requires the model to reconstruct masked-out spans from these sentences.
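As an illustration of salient span masking (the tagger below is a hypothetical stand-in, not the actual NER pipeline used in the paper):

```python
# Mask a "salient" span (a named entity or date) and ask the model to
# reconstruct it. find_salient_spans is a toy stand-in for a real tagger.
import re

def find_salient_spans(sentence):
    spans = []
    for ent in ["Marie Curie", "Warsaw"]:          # toy entity list
        i = sentence.find(ent)
        if i != -1:
            spans.append((i, i + len(ent)))
    spans += [m.span() for m in re.finditer(r"\b\d{4}\b", sentence)]  # 4-digit years
    return spans

def mask_salient_span(sentence):
    start, end = find_salient_spans(sentence)[0]
    return sentence[:start] + "[MASK]" + sentence[end:], sentence[start:end]

masked, target = mask_salient_span("Marie Curie was born in Warsaw in 1867.")
print(masked)   # [MASK] was born in Warsaw in 1867.
print(target)   # Marie Curie
```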
Findings
The authors show that across the datasets, performance improves with increasing model size. They are able to achieve state-of-the-art performance on the WebQuestions dataset, and their largest models surpass most other methods on Natural Questions and TriviaQA. This is despite their method not using any external resources and previous approaches operating in the “open-book” setting.
Originality and Value
The work reveals the capabilities of large language models in effectively attaining competitive results in question answering without external knowledge.
40. Learning to Generate Reviews and Discovering Sentiment
Alec Radford, Rafal Jozefowicz, Ilya Sutskever
Purpose
The ability of language models to learn representations reflecting concepts occurring in texts in an unsupervised manner can be utilized in natural language processing tasks. The authors study the use of unsupervised representations to learn concepts pertaining to sentiment for application in sentiment analysis.
Methods
The authors focused on byte-level language modeling for the task of sentiment analysis. To align with this, they worked with the very large Amazon product review dataset. The model architecture used was a single-layer multiplicative LSTM (mLSTM), and methods such as data parallelism were applied to train the large-scale model effectively. Feature representations are extracted by feeding preprocessed, UTF-8 encoded sequences through the model and taking the final cell states. A logistic regression classifier is then trained on these representations for datasets spanning multiple tasks.
Findings
Evaluation of their model on the MR and CR sentiment analysis datasets shows a large improvement over the state of the art, while two other datasets show no significant change. These results indicate the model’s ability to learn representations of text relevant for application in the domain. Further evaluation on the Stanford Sentiment Treebank achieves 91.8% accuracy, again outperforming the previous state of the art.
Originality and Value
The research demonstrates “the sensitivity of learned representations to the data distribution they are trained on”. This quality is shown to make language models’ representations effective for application in tasks relevant to the training domain.
41. Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, et al.
Purpose
Language modeling performance is dependent on factors including model size, data size, and compute used for training. The authors aim to determine the configuration of these factors to optimize model performance.
Methods
The authors’ experiments are based on training language models on WebText2. They characterize model scaling by training a variety of models of differing size (from 768 to 1.5 billion parameters), dataset size (from 22 million to 23 billion tokens), and shape (depth, width, attention heads, and feed-forward dimension), as well as context length and batch size. They analyze the trends for optimal performance as they scale the models.
Findings
The authors’ analyses determine that model performance depends most strongly on scale: model size, dataset size, and the compute used for training. They identify a “sweet spot” in model size for each compute budget that maximizes performance. Additionally, the paper suggests that these scaling laws can be used to predict the performance of future models based on their size, and that larger models will continue to perform better while also being more sample efficient.
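The fitted relations are simple power laws. As a sketch of the functional form (the constants below are only indicative of the magnitudes reported in the paper, not exact fitted values):

```python
# Power-law scaling of loss with parameter count N when data and compute are
# not bottlenecks: L(N) = (N_c / N) ** alpha_N. Analogous laws hold for
# dataset size and compute. Constants are indicative, not exact fits.
def predicted_loss(n_params, n_c=8.8e13, alpha_n=0.076):
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}: predicted loss ~ {predicted_loss(n):.2f}")
```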
Originality and Value
The paper provides valuable analyses relevant for scaling language models that are becoming increasingly larger. The authors’ work is important to optimizing model performance as more models are trained. Furthermore, they present a look into the performance of language models in the future.
42. Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, et al.
Purpose
Implicit multitask learning is thought to underlie large language models’ zero-shot generalization to new tasks. In this paper, the authors aim to create a model that generalizes better to held-out tasks and is robust to diverse prompt wording through explicit multitask training.
Methods
For their data, the authors developed a mixture of natural language prompted datasets, holding out four of the tasks for zero-shot learning. The prompts include an input template, a target template, and associated metadata. The authors then fine-tuned a pretrained model, using an encoder-decoder architecture based on T5, and specifically used the adapted T5+LM model. The authors trained three versions of the model, called T0 (mixture dataset), T0+ (mixture dataset + GPT-3’s evaluation datasets), and T0++ (mixture dataset + GPT-3’s evaluation datasets + SuperGLUE). They evaluated zero-shot generalization of their models on the held out datasets.
Findings
Comparing their multitask prompted training with the T5+LM baseline and various GPT-3 models, the authors observe a significant improvement in zero-shot generalization. The T0 model “[matched or exceeded]...all GPT-3 models on 9 out of 11 held-out datasets”. In two ablation experiments, the authors examine how training on diverse prompts affects robustness to wording, determining that more prompts per dataset are beneficial, while prompts from more datasets are not consistently so.
Originality and Value
This research demonstrates the effectiveness of supervised multitask training of language models for zero-shot generalization, enabling performance that matches or exceeds that of much larger models.
43. Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan
Purpose
Developing generalizable models from a limited number of annotated examples is valuable in multimodal machine learning research. The authors thus introduce the Flamingo family of Visual Language Models (VLMs) to address this ability.
Methods
The authors present the Flamingo Visual Language Models, which bridge pretrained vision-only and language-only models so that the combined model can take in interleaved sequences of visual and textual data and learn visual concepts from natural language for few-shot learning. The pretrained image encoder and language model are kept frozen, and newly added bridging layers that condition the language model on visual features are trained on large-scale multimodal web data, including interleaved image-and-text documents and image/video-text pairs. In their experiments, the authors evaluate the trained model in the few-shot setting by prompting it with a handful of examples for tasks such as visual question answering and image captioning.
Findings
In the low-resource setting, Flamingo surpassed the performance of the state-of-the-art in 6 of the 16 considered tasks despite using minimal task-specific training data.
Originality and Value
The paper presents a novel approach to few-shot learning in vision by utilizing natural language to effectively learn visual concepts from descriptions. Furthermore, the research is important to paths in studying multimodal applications of language models.
44. OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
Purpose
Large language models are computationally costly to train and, because few are fully released, difficult to study. Therefore, the authors present OPT, a suite of decoder-only pre-trained transformers that are much more efficient to train and are released with full, detailed documentation. They aim to replicate GPT-3 with the latest best practices.
Methods
The authors develop eight models of varying sizes, trained unsupervised on the corpora used to train RoBERTa, the Pile, and PushShift.io Reddit. They describe their challenges and experimentation during training, including hardware failures, which they addressed with diagnostic tests and restarts from checkpoints; loss divergence, which they addressed with hyperparameter adjustments and restarts from checkpoints; and other mid-flight changes.
Findings
Despite using a fraction of the computational power to train, OPT matched GPT-3’s overall average performance on the NLP tasks. The authors note that performance on individual tasks, however, showed greater variance.
Originality and Value
The authors provide an effective and well-documented suite of models that are fully available and analyzed for bias and toxicity. They require much less compute to train and still compare favorably to GPT-3.
45. REALTOXICITYPROMPTS: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A. Smith
Purpose
Language models learn toxicity from their training texts and are capable of generating toxic, biased text. In order to address toxic degeneration in language models, the paper introduces REALTOXICITYPROMPTS, a dataset of naturally occurring prompts with toxicity scores, and examines where and why such toxic degeneration appears.
Methods
The authors base “toxicity” on the PERSPECTIVE API, a widely used tool for toxic language detection. To create the REALTOXICITYPROMPTS dataset, they generate toxicity scores with the API for sentences selected from the OPENWEBTEXT CORPUS, sampling equally from four toxicity ranges to obtain 100K sentences. They then split each sentence in half to form prompt-continuation pairs with separate toxicity scores. Using the dataset, they evaluate toxic degeneration in pretrained language models along with data-based detoxification methods (such as further pretraining on non-toxic text, as in DAPT) and decoding-based methods (such as PPLM and word filtering).
Findings
The examination of their dataset shows that although toxic prompts often result in toxic generations, non-toxic prompts also frequently yield toxic generations. Of the detoxification methods, they identify non-toxic DAPT and PPLM as the most effective at reducing toxic generations.
Originality and Value
The paper introduces a dataset for evaluating toxic degenerations in language models and presents the effectiveness of multiple detoxification methods. The authors establish the relevance of addressing toxicity by quantifying toxicity in pretrained language models and various important datasets.
46. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu and Geoffrey Irving
Purpose
As language models grow in size, understanding the gains and other effects of scale is increasingly relevant. The paper aims to present an analysis of the scaling of Transformer-based language modeling.
Methods
The authors present Gopher, a transformer-based language model trained on a large corpus of text data. In training Gopher, the authors apply techniques for scaling language models efficiently, including model parallelism, data parallelism, and gradient accumulation, which allow them to handle the large amounts of data and computation required. In their experiments, the authors examine the performance improvements with scale. They additionally assess toxicity and bias using Perspective API scores. The authors evaluate Gopher on various NLU benchmarks, including an examination of the model’s ability to generalize to unseen data and its robustness to different types of input.
Findings
Gopher achieves state-of-the-art performance across the majority of the 152 diverse evaluated tasks. The analyses show that Gopher makes significant but uneven gains as model size increases, improving more in domains such as science, technology, social sciences, and the humanities, while benefiting less in math, logical reasoning, and common sense. Furthermore, their analyses demonstrate that the toxicity of larger models is more consistent with prompt toxicity than that of smaller models.
Originality and Value
The work provides a large-scale language model and an in-depth analysis of the role of model scale in language model performance. It also provides methods and insights on understanding changes in large language models with scale and assessments of potential harms.
47. Calibrate Before Use: Improving Few-Shot Performance of Language Models
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh
Purpose
Few-shot learning with GPT-3 can yield unstable performance. The authors thus aim to improve few-shot learning with a contextual calibration method that estimates the model’s bias toward each answer and rescales its output probabilities to counteract it.
Methods
The authors work with datasets for text classification, fact retrieval, and information extraction and run on models of the GPT-3 suite. They examine the sensitivity of the models’ performance to prompt format and example selection and attribute the instability to three sources: majority label bias, recency bias, and common token bias. They then estimate these biases by feeding the model a content-free input (such as “N/A”) and apply an affine correction to the output probabilities.
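The calibration step itself is small. A minimal sketch of the idea, assuming we already have label probabilities for a content-free input and for a real test input:

```python
# Contextual calibration: estimate the model's label bias from a content-free
# input and rescale label probabilities so that input would be scored uniformly.
import numpy as np

def calibrate(label_probs, content_free_probs):
    w = 1.0 / np.asarray(content_free_probs)   # diagonal correction W = diag(1 / p_cf)
    scores = w * np.asarray(label_probs)
    return scores / scores.sum()

p_cf = [0.7, 0.3]   # label probabilities the model assigns to the content-free input
p_x = [0.6, 0.4]    # raw label probabilities for a real test input
print(calibrate(p_x, p_cf))  # ~[0.39, 0.61]: the bias toward label 0 is corrected
```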
Findings
The applied calibration reduces variability, increases average and worst-case performance, and improves overall performance across the tasks.
Originality and Value
The authors demonstrate the volatility of language models’ few-shot performance and present an effective method for improving few-shot performance and reducing instability by quantifying and calibrating away the models’ bias. Their work advances the understanding of the difficulties of few-shot learning and why they occur.
48. Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and Laurent Sifre
Purpose
Training large language models is computationally costly. The authors thus investigate the optimal model size and number of training tokens for a given compute budget.
Methods
The authors analyze the effects of greater model size versus number of training tokens on performance when training a transformer language model, exploring compute-optimal configurations with three approaches. They first fix model sizes and vary the number of training tokens; they then vary model size and adjust the number of training tokens based on IsoFLOP profiles so that the total FLOPs are fixed; and finally they fit a parametric loss function. They train over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens. Based on these analyses, the authors determine the compute-optimal model size and token count for the compute budget of an existing language model, Gopher, and train Chinchilla, a new model with the adjusted model and data sizes. They compare Chinchilla with the original model on various tasks.
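A back-of-the-envelope version of the resulting rule of thumb, using the common approximation of about 6ND training FLOPs for N parameters and D tokens and the paper’s roughly equal scaling of N and D (around 20 tokens per parameter); the numbers are indicative only:

```python
# Given a compute budget C ~ 6 * N * D, choose N and D so that D is roughly
# 20 tokens per parameter, per the paper's compute-optimal findings.
def compute_optimal(budget_flops, tokens_per_param=20):
    n_params = (budget_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

n, d = compute_optimal(5.8e23)   # roughly a Chinchilla-scale training budget
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```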
Findings
The authors identify that model size and the number of training tokens should be scaled equally for compute-optimal training. They find that most language models do not align with this and are thus significantly undertrained. Chinchilla, developed by adjusting the model size for Gopher’s compute budget, outperforms Gopher and even larger models across evaluation tasks. They suggest that these findings imply that improvements from scaling to larger datasets depend on data quality.
Originality and Value
Efficient training of large language models is increasingly important as larger and larger models are developed, and the authors show the importance of scaling training data alongside model size. They demonstrate that models can be trained more effectively and efficiently while optimizing computation.
49. Release Strategies and the Social Impacts of Language Models
Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason Blazakis, Kris McGuffie, Jasmine Wang
Purpose
As language models grow in size, accessibility, and uses, they also raise misuse concerns that need to be carefully addressed for safe deployment. The authors discuss the release of large language models, particularly OpenAI’s work, and provide recommendations for responsible publication in AI.
Methods and Findings
The authors examine various release strategies for large language models and the potential social impacts of these models. OpenAI’s GPT-2 models were released in stages, with partnerships across university research labs to conduct risk and benefit analyses. They describe notable uses of GPT-2 models in domains ranging from software engineering and writing to art and health, as well as misuses, including biasing language models to favor certain well-resourced groups.
Originality and Value
The paper provides a valuable set of best practices for responsible model release as large language models become more powerful and deployed to various areas. It highlights the need for careful consideration of the social implications of language models, and the importance of responsible release strategies in order to mitigate potential negative impacts.
50. How Many Data Points is a Prompt Worth?
Teven Le Scao, Alexander M. Rush
Purpose
Prompting is an important method for fine-tuning pretrained language models for classification and can significantly impact results, especially in low-data settings. This research paper thus aims to quantify the efficiency that prompting provides, answering how many data points a prompt is worth with a new metric, average data advantage.
Methods
The authors first establish the two transfer learning settings for text classification that they study: head-based and prompt-based. In the former, the model predicts an output class from the pretrained representations, while in the latter a task-specific pattern string induces a text output as the class prediction. The authors follow PET notation, with a pattern and a verbalizer: the pattern turns the input text into a sequence with a masked token, the model predicts the masked token, and the verbalizer maps the prediction to the corresponding class.
Findings
In all tasks except one, prompt-based learning shows significant improvements in performance. Prompting also remains better as more data is added. The calculated average data advantage metric shows that prompting adds the equivalent of hundreds of data points.
Originality and Value
The research paper demonstrated the effectiveness of prompting for fine-tuning pretrained language models and proposed a metric for measuring the advantage it provides, especially in low-data tasks.
51. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer
Purpose
In-context learning enables large language models to perform new tasks with conditioning on a few new ‘demonstration’ pairs. In this research paper, the authors examine the role of demonstrations in in-context learning performance to understand how this inference works and can be used.
Methods
The authors perform their experiments using 12 models in total: 6 language models, each evaluated with two inference methods, direct and channel. Evaluations are done on datasets covering varied NLP tasks.
Findings
The researchers identify which aspects of demonstrations contribute to performance in order to understand how in-context learning works. They find that ground-truth labels do not play a large role, but that using out-of-distribution inputs instead of training-data inputs drops performance. In studying the impact of the label space, results were inconsistent across direct and channel models: direct models were significantly affected by replacing labels with random English words, while channel models were not. Regarding format, the researchers found that the input-label format itself is important for in-context learning; without it, performance dropped to the level of using no demonstrations at all.
Originality and Value
In-context learning provides important value in its potential to outperform zero-shot learning methods, but is little understood. The work provides insight into the effectiveness of in-context learning that enables avenues for future work.
52. ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
Purpose
Token-free models, which operate directly on raw text, are more robust to different languages and to noise and avoid complex text preprocessing pipelines, but they must handle the much longer sequences that result from operating on bytes. The authors aim to demonstrate that standard transformer architectures can be trained effectively as token-free models.
Methods
The authors develop their model, ByT5, from a standard token-based transformer architecture with small modifications: they feed in raw UTF-8 bytes, with a few ids reserved for special tokens, rather than a SentencePiece vocabulary; they use a span-corruption pretraining objective, in which the model fills in masked spans marked by sentinel ids, adapted to byte spans; and they use a deeper encoder than decoder instead of a balanced configuration. ByT5 is pretrained on a large corpus of raw bytes and fine-tuned on various NLP tasks. The authors demonstrate the effectiveness of the transformer architecture with only minor modifications for byte-based modeling by evaluating their model against mT5.
Findings
The performance of ByT5 is competitive with the token-based mT5 across NLP tasks, and it outperforms mT5 at small model sizes, on generative tasks, on many multilingual tasks, and in the presence of noise. The results show the potential of byte-based models to succeed across NLP tasks while requiring less preprocessing and placing fewer limitations on the types of input they can handle. The gains with ByT5 are achieved despite pretraining on 4x less text than mT5, showing the data efficiency of byte-based models.
Originality and Value
The work establishes the value and potential of byte-based models in NLP tasks in being more applicable across data types and inputs, being more data efficient, and working with transformer architectures. It opens new paths towards the development of byte-based models.
53. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy
Purpose
Diverse and large datasets are important in developing generalizable language models. The authors present The Pile, an 825 GiB English text corpus, and establish its effectiveness for training large language models for improved cross-domain and generalized performance.
Methods
The authors first construct the corpus from various datasets. The new datasets included are derived from fourteen different sources, including two that are extensions of previous datasets. They further incorporate eight more existing datasets, including a new filtered subset of Common Crawl. This subset is extracted by applying jusText to Web Archive files, obtaining higher-quality text than the widely varying quality of the original Common Crawl. Together, the Pile is composed of twenty-two constituent sub-datasets.
Findings
Compared with models trained on Raw CC and CC-100, models trained on the Pile yield significantly better results on many of the evaluated datasets, including improvements on the traditional WikiText language modeling benchmark. The authors demonstrate that models trained on the Pile have greater cross-domain generalization while maintaining high performance on traditional benchmarks.
Originality and Value
The Pile is a key advancement, providing a valuably diverse and large dataset with extensive ethical considerations. The careful examination of the data’s content distributions and legal status is an example of how data should be managed in NLP, and understanding the data used in NLP models is essential for safe and informed deployment.
54. Deduplicating Training Data Makes Language Models Better
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini
Purpose
Large-scale text corpora are a crucial part of developing large language models, yet as the size of such data increases, duplicates in the data affect model training as well as contribute to inaccurate evaluation from greater train-test overlap. As manual review of large datasets is impossible, the authors propose two methods for deduplicating large datasets to improve model training efficiency, evaluation, and performance.
Methods
The authors worked with four datasets commonly used for pretraining language models and for benchmarking: Wikipedia (Wiki-40B), the One Billion Word benchmark (LM1B), the Colossal Clean Crawled Corpus (C4), and RealNews. They propose two complementary deduplication methods: EXACTSUBSTR, which uses a suffix array to find long exact duplicate substrings, and NEARDUP, which uses MinHash-based approximate matching to identify near-duplicate documents.
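To make the near-duplicate flavor concrete, here is a toy illustration using Jaccard similarity over word n-gram shingles; the paper’s NEARDUP approximates this with MinHash to scale to billions of documents, and EXACTSUBSTR instead finds exact repeated substrings with a suffix array.

```python
# Toy near-duplicate check: Jaccard similarity over word trigram shingles.
def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bank today"
print(round(jaccard(doc1, doc2), 2))  # ~0.92, a likely near-duplicate
```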
Findings
By applying NEARDUP, the authors identified between 3.04% and 13.63% duplicates in the studied corpora, while EXACTSUBSTR found up to 19.4% duplicates. Near-duplicate deduplication is able to locate more subtle duplicates, such as sentences with slightly different formatting that are otherwise identical. The authors also find 4.6% to 14.4% overlap between the train sets and the validation and test sets.
Originality and Value
The research paper demonstrates the importance of deduplicating datasets with the increasing relevance of large corpora for growing language models and presents methods for deduplicating data effectively in ways that include more subtle duplicates.
55. GPT-NeoX-20B: An Open-Source Autoregressive Language Model
Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, Samuel Weinbach
Purpose
The work presents GPT-NeoX-20B, an autoregressive language model improving upon the architecture of GPT-3, with the goal of releasing a freely and openly available model with transparent documentation of the model and its development.
Methods
The authors develop GPT-NeoX-20B with an architecture following that of GPT-3, with modifications intended to improve the model. They use rotary embeddings instead of learned positional embeddings, parallel attention and feed-forward layers, exclusively dense layers rather than alternating dense and sparse layers, and other smaller changes. They train on the Pile. The authors provide an evaluation of their model across natural language tasks, mathematical tasks, and advanced knowledge-based tasks.
Findings
In training their model, the authors note differences between their results and those described in the literature. Despite training on data that had not been deduplicated, they observed no evidence of the performance loss previously described, and they also found that their model was exceptionally effective at few-shot learning. The model also performs particularly well on knowledge-based and mathematical tasks.
Originality and Value
The authors introduce GPT-NeoX-20B, an autoregressive Transformer language model with notable modifications from the GPT-3 architecture. The model is especially effective in few-shot learning. They provide the model, openly available, with transparent release of model weights and information.
56. TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, Owain Evans
Purpose
Models have a tendency to answer questions with false statements as a result of imitating human texts that contain false beliefs and misconceptions. This problem has implications ranging from unintentional inaccuracy to malicious misuse and fraud. To address this, the paper proposes a benchmark to quantify the truthfulness of models, determining how likely a model is to make false statements and analyzing the causes.
Methods
The authors develop a benchmark, TruthfulQA, that measures the ability of language models to generate truthful answers zero-shot, with 817 questions across 38 categories. They define truth as a claim describing a “literal truth about the real world.” They wrote questions that humans might answer falsely, tested them on GPT-3-175B to filter out questions consistently answered correctly, and then wrote additional questions that both humans and models might answer falsely (based on that filtering). They validated their questions with external researchers.
Findings
Human participants produced 94% true answers, while the best evaluated model (GPT-3-175B with helpful prompt) produced 58% true answers. Larger models performed worse in terms of truthfulness, while smaller models were less informative. The GPT-judge model was also able to predict truthfulness with 90-96% validation accuracy, with robust results across the evaluated models. Overall, TruthfulQA revealed the low truthfulness of language models and demonstrated their tendencies to mimic popular misconceptions.
Originality and Value
The research provides an evaluation method for models’ truthfulness, an important and difficult-to-measure aspect of their performance. The study highlights the tendency of large language models to generate false answers.
57. Exploring and Predicting Transferability across NLP Tasks
Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, Mohit Iyyer
Purpose
With the effectiveness of transferring large language models to downstream tasks, the authors aim to examine transferability to different tasks and the benefit of fine-tuning. In doing so, they develop task embeddings that can be used to predict the most transferable source tasks for downstream tasks.
Methods
In their exploration of task transferability, the authors work with 33 tasks across text classification/regression, question answering, and sequence labeling. They fine-tune a pretrained BERT model on an intermediate source task and then fine-tune it again on the final target task, experimenting with using full and limited sizes of the datasets while fine-tuning. The impact of transfer learning is quantified as the relative transfer gain when using particular source and target tasks.
Findings
The authors’ experiments determine that transfer gains are possible even with small, limited source datasets and that transferring across task classes can often be effective. Their evaluations find that the proposed task embeddings, TEXTEMB and TASKEMB, improve transferability prediction, with TASKEMB outperforming the other methods, demonstrating the effectiveness of task similarity as a predictor of effective transfer.
Originality and Value
The research provides a large-scale study of transferability between NLP tasks with pretrained language models, establishing the benefit of transfer learning, especially for low-data target tasks. The work further provides methods to guide the selection of the best source tasks for a given target task and offers insight into why transfer is effective.
58. Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
Albert Webson and Ellie Pavlick
Purpose
Improvements in zero-shot and few-shot performance through prompt-based fine-tuning of language models have created a rise in prompting in NLP and brought the question of the extent to which models are able to learn from meaningful prompts. In this research, the authors tested whether well-crafted prompts were more effective than misleading ones to determine if models really understand their meanings.
Methods
The authors work with the NLI task of classifying entailment, using various large pretrained language models in their baseline experiments (BERT, DistilBERT, RoBERTa, ALBERT, and T5) and additionally experiment with an instruction-tuned model and a substantially larger model (T0 and GPT-3). The authors manually wrote 5 categories of prompt templates to test their effect on the models: Instructive, Misleading-Moderate, Misleading-Extreme, Irrelevant, and Null.
Findings
The results revealed no significant difference in the performance of models trained with irrelevant templates to those trained with instructive templates, regardless of different numbers of shots. Different levels of misleading templates also showed no consistent changes in performance, although instructive templates performed better than both misleading categories. Finally, null templates performed the worst. In zero-shot results, only T0 attained a performance that was significantly above random, yet showed no practical difference between misleading-moderate and instructive templates. Models generally performed as well with some poor prompts as they did with proper ones.
Originality and Value
Despite the common claim in the literature that well-crafted prompts are needed to attain the best results from models, the research presents evidence that model performance can often be just as good with less meaningful prompts.
59. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner
Purpose
The data and large corpora used in training large language models are very important to developing and understanding models, especially as their size and diversity continues to grow. Therefore, the authors present methods towards documenting corpora through their documentation of the Colossal Clean Crawled Corpus, and demonstrate the relevance of providing such investigations through their findings. They emphasize the importance of including the following elements in documentations: metadata, included data, and excluded data.
Methods
The authors work with the English Colossal Clean Crawled Corpus (C4) to provide documentation of the large, web-scraped corpus and to demonstrate their methods and their value for documenting further datasets.
Findings
The authors use the findings of their analysis to establish the importance of documenting datasets for effective and informed language model development. They unexpectedly find a significant amount of text from patents and US military websites. They also find machine-generated text and contamination from benchmarks, raising considerations for model evaluation. Their analysis shows that blocklist filtering disproportionately removes text from and about minority individuals.
Originality and Value
The work presents some of the first documentation of C4.EN, and through their important and unexpected findings, argues for the relevance of the metadata, the included data, and the excluded data in documentation of datasets. The authors raise important considerations for the ethics, accuracy, and evaluation of language models.
60. What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers
Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Dong Hyeon Jeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, Heungsub Lee, Minyoung Jeong, Sungjae Lee, Minsub Kim, Suk Hyun Ko, Seokhun Kim, Taeyong Park, Jinuk Kim, Soyoung Kang, Na-Hyeon Ryu, Kang Min Yoo, Minsuk Chang, Soobin Suh, Sookyo In, Jinseong Park, Kyungduk Kim, Hiun Kim, Jisu Jeong, Yong Goo Yeo, Donghoon Ham, Dongju Park, Min Young Lee, Jaewook Kang, Inho Kang, Jung-Woo Ha, Woomyoung Park, Nako Sung
Purpose
The authors aim to examine areas unaddressed by GPT-3, including non-English language modeling and prompt-based and in-context learning, through the development of HyperCLOVA, a Korean variant of GPT-3 trained on a large Korean-centric corpus, and to demonstrate the effectiveness of prompt-based learning. Furthermore, the work aims to make AI more accessible to non-experts in ML through the introduction of an interactive prompt-engineering interface.
Methods
The authors first introduce HyperCLOVA, a large Korean in-context learning-based language model trained on a large Korean-centric corpus constructed from various datasets. Because of the agglutinative nature of the Korean language, the authors employ morpheme-aware byte-level BPE in their tokenization. The model architecture follows the transformer decoder architecture of GPT-3. They evaluate their model on a variety of benchmarks, including datasets from the Korean-NLU benchmark KLUE and report results from experiments with few-shot learning and prompt-based tuning.
Findings
HyperCLOVA achieves state-of-the-art in-context zero-shot and few-shot performance on various Korean downstream tasks, which is further boosted by prompt-based learning.
Originality and Value
The authors advance important concepts not addressed by GPT-3 through HyperCLOVA. Furthermore, they discuss HyperCLOVA Studio towards achieving the No Code AI paradigm and argue for the value of opportunities through such frameworks.
61. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddhartha Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi, Daniel Khashabi
Purpose
In order to address the generalizability of language models to a variety of new tasks, the authors introduce SUPER-NATURALINSTRUCTIONS (SUP-NATINST), a benchmark of diverse NLP tasks and their instructions. They further utilize this benchmark in order to develop a model that can effectively follow a variety of in-context instructions and generalize to new tasks.
Methods
The authors first present their benchmark dataset, SUP-NATINST, with diverse NLP tasks and their instructions. The instructions for each task contain a definition and examples, and the data and instructions were constructed by expert NLP practitioners. The benchmark is diverse, with 1,616 tasks across 76 task types and representation across languages and domains. Using the benchmark, they train Tk-INSTRUCT, a T5-based model fine-tuned to follow in-context instructions, and evaluate its generalization to unseen tasks.
Findings
Tk-INSTRUCT surpasses the performance of the baselines, and other instruction-tuned models also obtain better results, demonstrating the effectiveness of instruction-tuning for stronger generalization. They also identify that while more observed tasks improve the generalization, more training instances do not.
Originality and Value
The authors introduce a large-scale benchmark of diverse NLP tasks and their instructions, and demonstrate the value of such a benchmark for enabling generalization to unseen tasks through the performance of instruction-tuned models.
62. ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang, Jiaxiang Liu, Xuyi Chen, Yuxiang Lu, Weixin Liu, Xi Wang, Yangfan Bai, Qiuliang Chen, Li Zhao, Shiyong Li, Peng Sun, Dianhai Yu, Yanjun Ma, Hao Tian, Hua Wu, Tian Wu, Wei Zeng, Ge Li, Wen Gao, Haifeng Wang
Purpose
With the effectiveness of knowledge-enhanced pre-trained models presented by the ERNIE 3.0 framework, the authors aim to further the potential of such models through greatly scaling ERNIE 3.0.
Methods
The authors develop ERNIE 3.0 Titan with 260 billion parameters. Their model is based on ERNIE 3.0, pretrained with Knowledge Masked Language Modeling, which masks phrases and named entities, and Document Language Modeling, which trains on longer texts and thus larger context sizes. The authors propose a generative pretraining technique that optimizes a self-supervised adversarial loss and a controllable language modeling loss. The former learns to distinguish between original and generated texts, improving efficiency while enabling ERNIE 3.0 Titan to re-rank the credibility of generated results. The latter makes use of prompted attribute sets, including genre, topic, keywords, sentiment, and length. The training data includes both the adversarial and controllable datasets. The authors expand their data with Chinese corpora, obtaining the “largest Chinese dense knowledge-enhanced language model at the time of training”.
Findings
ERNIE 3.0 Titan outperformed state-of-the-art models on 68 NLP datasets. Furthermore, in the zero-shot setting, it achieved consistently strong performance compared to recently proposed large-scale Chinese language models and surpassed GPT-3 on the CKBQA-sub dataset.
Originality and Value
The authors present a large-scale knowledge-enhanced language model with 260 billion parameters, developing the largest Chinese dense pre-training model so far. Their methods are demonstrated to be effective to build a credible language model.
63. Ask Me Anything: A simple strategy for prompting language models
Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré
Purpose
As prompting has risen as a method to create more broadly applicable language models, the challenges of crafting “perfect” prompts effective for various tasks have become relevant. The authors propose a method of aggregating multiple imperfect prompts for more effective and high quality performance.
Methods
The authors describe a prompting method that aggregates the predictions of multiple imperfect prompts, with each prompt casting a vote for the input’s label; the votes are then combined to obtain a final prediction.
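At its simplest, the aggregation step looks like voting. The sketch below uses a plain majority vote as a stand-in; the paper’s aggregator instead uses weak supervision to model the accuracy of and dependencies between prompts.

```python
# Combine the predictions of several imperfect prompts for one input.
from collections import Counter

def aggregate(votes):
    return Counter(votes).most_common(1)[0][0]   # majority vote

votes = ["yes", "no", "yes", "yes"]   # hypothetical per-prompt predictions
print(aggregate(votes))               # "yes"
```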
Findings
AMA is evaluated on 20 language benchmarks with 14 different LLMs, covering a variety of SuperGLUE, NLI, classification, and QA tasks. The authors find that over the 20 tasks, AMA gives an “average improvement of 41% over the 6B parameter model’s few-shot (k = 3) performance”. AMA works best on tasks relying on NLU abilities where the requisite information is provided in the task input.
Originality and Value
The authors propose a method that scalably generates multiple prompts from task inputs and combines their answers using weak supervision to give the final prediction. This method lifts performance over baseline prompting methods and enables a roughly 30x smaller LM to exceed the few-shot performance of much larger models.
64. REALM: Retrieval-Augmented Language Model Pre-Training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, Ming-Wei Chang
Purpose
Language models learn from knowledge stored implicitly in their training data, which limits how effectively they can perform question-answering tasks. The authors present a more effective way to capture knowledge during pretraining with a learned textual knowledge retriever.
Methods
The authors’ framework is REALM (Retrieval-Augmented Language Model), which works in two steps: retrieval and prediction. REALM is pre-trained on predicting masked tokens in a corpus and fine-tuned on the task of open-domain question answering (Open-QA). The model first retrieves relevant documents from a knowledge corpus and then conditions on both the original input and the retrieved information to generate the output.
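A toy sketch of that retrieve-then-predict structure (random embeddings and a placeholder reader stand in for REALM’s learned BERT-style encoders and its end-to-end training):

```python
# Retrieve the highest-scoring document by inner product, then condition the
# reader on the query plus the retrieved document.
import numpy as np

rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(1000, 128))           # pre-computed document index
docs = [f"document {i}" for i in range(1000)]

def retrieve(query_emb, k=1):
    scores = doc_embs @ query_emb                 # maximum inner product search
    return [docs[i] for i in np.argsort(scores)[-k:]]

def answer(query, query_emb):
    context = retrieve(query_emb)[0]
    return f"reader(question={query!r}, context={context!r})"  # placeholder reader call

print(answer("Who wrote Hamlet?", rng.normal(size=128)))
```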
Findings
The model substantially outperforms other baseline models, establishing the value of the authors’ methods. The authors analyze the effectiveness of REALM, finding that both the encoder and retriever benefit from REALM, and perform best with both components together.
Originality and Value
The weakness of many language models on few-shot question-answering tasks is in their difficulty in capturing knowledge, and the methods proposed by the authors present an effective way to tackle these relevant issues. The implications of the work expand to applications in structured knowledge, the multilingual setting, and the multi-modal setting.
65. Recitation-Augmented Language Models
Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, Denny Zhou
Purpose
A mismatch between the language of few-shot prompts and the texts a model was trained on reduces the model’s ability to answer factual questions correctly. The research aims to improve the factual accuracy of large language models by mimicking humans’ ability to recite relevant knowledge before answering a question. The authors propose RECITation-augmented gEneration (RECITE) for effective performance on knowledge-intensive NLP tasks.
Methods
The authors first work with single-hop question answering, where questions are answered based on evidence in the corpus documents. Their method has the model recite a relevant passage about the question before answering it. To do this, they use prompt-based learning with paired example questions and evidence passages. To generate the final answer, the recited passages are combined with the example question-answer pairs into a single prompt.
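A sketch of the recite-then-answer prompt structure (the question, recitation, and wording are illustrative; the model calls are omitted):

```python
# Two-step prompting: first elicit a recitation from the model's own memory,
# then prepend it to the answer prompt.
question = "In which year did the Apollo 11 mission land on the Moon?"

recite_prompt = (
    "Recite a passage that helps answer the question.\n"
    f"Question: {question}\nPassage:"
)
# recitation = generate(recite_prompt)            # model call omitted
recitation = "Apollo 11 landed the first humans on the Moon in July 1969."

answer_prompt = f"{recitation}\nQuestion: {question}\nAnswer:"
print(answer_prompt)
```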
Findings
The recitation methods show significant improvement from standard prompting based baselines on single-hop and multiple-hop question answering on the NQ, TriviaQA, and HotpotQA benchmarks applied to the PaLM-62B, UL2-20B, and OPT-30B models.
Originality and Value
The research presents a method for accurate question-answering without requiring external corpora. The technique makes use of the full capacity of large language models and their data.
66. Entailment Semantics Can Be Extracted from an Ideal Language Model
William Merrill, Alex Warstadt, Tal Linzen
Purpose
Determining whether a sentence entails another can, mathematically, be reduced to simply modeling their probabilities in the language. Therefore, the probabilistic nature of language models suggests that an ideal language model could be utilized to extract entailment semantics. The researchers aim to prove this.
Methods
The authors’ reasoning is based on background about the logic of entailment and the structure of language models. They assume an ideal setting where sentences are generated by Gricean agents (who follow the theoretical principles of communication and pragmatics). They show that, in such data, entailment is reflected in probabilities, explaining that a sentence x and a sentence y that it entails are equally probable to occur. Thus, an ideal language model could extract entailment from the sentences’ probabilities of occurring.
Findings
The researchers determined that entailment relations could be extracted from language model predictions with greater-than-chance accuracy. The language models’ predictions distinguished between entailing and non-entailing sentences, and the trigram model showed better performance than the text-frequency model. Furthermore, they confirmed that the size of the corpus needed to extract entailment grows predictably with sentence length, in alignment with the theoretical background.
Originality and Value
The researchers’ work provides an explanation for how distributional information encodes semantic information and demonstrates the ability to extract it from language models. The work thus gives further insight into the capabilities of language models.
67. Knowledge Unlearning for Mitigating Privacy Risks in Language Models
Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, Minjoon Seo
Purpose
The memorization of data in large language models poses security threats and the potential to violate the privacy of personally identifiable information. To address these concerns, the authors aim to reduce the privacy risks of LMs post hoc through “knowledge unlearning”.
Methods
To “unlearn” specific sequences and reverse privacy risks, the authors propose negating the original training objective of the model, performing gradient ascent on the sequences to be forgotten.
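Concretely, the unlearning step amounts to gradient ascent on the sequences to be forgotten. A minimal sketch (GPT-2 and the target string are stand-ins; the paper works with larger models and specific extraction-risk metrics):

```python
# Knowledge unlearning sketch: maximize (rather than minimize) the language
# modeling loss on the sequences to forget, i.e. negate the training objective.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
opt = torch.optim.Adam(model.parameters(), lr=5e-5)

forget_text = "John Doe's phone number is 555-0100."   # hypothetical target sequence
ids = tok(forget_text, return_tensors="pt").input_ids

for step in range(3):                     # a few ascent steps for illustration
    loss = model(ids, labels=ids).loss    # standard next-token prediction loss
    (-loss).backward()                    # ascend: make the sequence less likely
    opt.step()
    opt.zero_grad()
    print(step, float(loss))
```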
Findings
While OPT’s deduplicated training data helps it attain lower Extraction Likelihood and Memorization Accuracy than GPT-NEO, GPT-NEO with the proposed knowledge unlearning method demonstrated the most effective protection against privacy risks, with the lowest scores. While previous methods degraded generation capabilities, the unlearning method shows little impairment in the larger language models.
Originality and Value
The authors present a simple method for mitigating the privacy risks posed by large language models without significantly impairing model performance. The work is valuable for protecting individuals and advancing the safe deployment of language models.