
Tackling Challenges in Language Model Training with LAS and NC State

SUPERB: Improving the Efficiency of Creating and Maintaining ML Models

Research from the Menzies Lab by Kewen Peng and Tim Menzies, NC State University, Department of Computer Science

Large language models are powerful tools, but their training and inference processes can be extremely resource intensive. As an example, training Meta’s recent Llama 2 7B model from scratch on 100 NVIDIA A100 GPUs would require nearly 3 months of continuous computation [1]. Similar challenges have been encountered in classical machine learning, albeit usually at smaller scales and with tabular rather than text data, and the field has developed a range of techniques to reduce the burden of training. One prominent family of methods first identifies the subsets of data that would be most informative for training, slicing the data set either “horizontally” by finding the most important examples, or “vertically” by finding the most important features or columns. However, while these approaches have proven effective for tabular data, their application to text data is far less straightforward. With LAS collaboration and support, Tim Menzies’ RISE lab at North Carolina State University has been working to address this challenge, and over the past year the lab has demonstrated applications across several mission-relevant problem areas.
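To make the horizontal/vertical distinction concrete, here is a minimal sketch on a small, made-up tabular data set; the column names, row scores, and feature ranking are purely illustrative and do not come from the lab’s work.

```python
import pandas as pd

# Hypothetical tabular data set: rows are training examples, columns are features.
df = pd.DataFrame({
    "age":    [34, 51, 29, 62],
    "income": [48_000, 72_000, 39_000, 91_000],
    "visits": [3, 8, 1, 12],
    "label":  [0, 1, 0, 1],
})

# "Horizontal" slicing: keep only the most informative examples (rows).
# The scores below stand in for whatever acquisition function ranks the rows.
row_scores = pd.Series([0.9, 0.2, 0.8, 0.1], index=df.index)
horizontal_subset = df.loc[row_scores.nlargest(2).index]

# "Vertical" slicing: keep only the most informative features (columns).
# The list below stands in for the output of a feature-selection step.
important_features = ["income", "visits"]
vertical_subset = df[important_features + ["label"]]
```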

One such application area is training customized sentence embedding models. Embeddings are at the core of many current AI/ML applications, particularly in information retrieval settings such as LAS’ GUESS prototype. While general-purpose embeddings often perform well across a range of tasks, embedding models fine-tuned to a specific application domain can yield meaningful improvements in performance (e.g., [2]).

Training domain-specific embedding models is one area where classical ML training optimization techniques translate well to a language modeling context. Kewen Peng, a graduate student in Prof. Menzies’ lab, demonstrated an active learning approach to embedding model training. Using the Sentences Involving Compositional Knowledge (SICK) data set [3] as a reference benchmark, Peng began training with only 20% of the available data and ran inference (a comparatively inexpensive operation) on the remaining 80%. At each subsequent training cycle, the 10% of samples whose predictions had changed the most since the previous cycle were added to the training set. Overall, Peng’s embedding model required less than 10% of the training operations used by standard methods while achieving nearly identical performance (a Spearman r, the standard evaluation metric for this data set, of 0.80 vs. 0.81).
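The selection loop can be summarized in a short sketch. The code below is a schematic of the prediction-change criterion described above, not Peng’s implementation: the training and inference functions are placeholders (in the actual experiments they would be embedding-model fine-tuning and similarity scoring on SICK), and the first batch is chosen arbitrarily since no prediction baseline exists yet.

```python
import numpy as np

def prediction_change_active_learning(X, y, train_fn, predict_fn,
                                      init_frac=0.2, add_frac=0.1,
                                      n_cycles=5, seed=0):
    """Sketch of a prediction-change active learning loop (placeholders throughout).

    train_fn(X, y)       -> a fitted model  (stand-in for embedding-model training)
    predict_fn(model, X) -> np.ndarray      (stand-in for the cheap inference step)
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_init = int(init_frac * len(X))
    labeled = list(order[:n_init])            # start with ~20% of the data
    unlabeled = list(order[n_init:])          # run inference only on the rest
    prev_preds = None
    model = None

    for _ in range(n_cycles):
        model = train_fn(X[labeled], y[labeled])
        if not unlabeled:
            break
        preds = predict_fn(model, X[unlabeled])
        k = min(len(unlabeled), max(1, int(add_frac * len(X))))
        if prev_preds is None:
            # First cycle: no baseline predictions yet, so pick an arbitrary batch.
            picked = rng.choice(len(unlabeled), size=k, replace=False)
        else:
            # Add the samples whose predictions moved most since the last cycle.
            change = np.abs(preds - prev_preds)
            picked = np.argsort(change)[::-1][:k]
        labeled.extend(unlabeled[i] for i in picked)
        unlabeled = [u for i, u in enumerate(unlabeled) if i not in set(picked)]
        prev_preds = np.delete(preds, picked)

    return model
```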

The challenge increases with generative language modeling tasks. Unlike embedding-based retrieval tasks, for which clear quantitative performance metrics are typically available, evaluating the quality of a machine-generated response often requires human judgment, making many training optimization methods difficult or impossible to apply in practice. Machine translation is one example of a mission-relevant generative language modeling task where rapid model retraining could prove impactful, but modern translation models require extensive computation to train from scratch. As a more accessible application domain, Suvodeep Majumder focused on optimizing training methods for generative question-answering models, where a user poses a question in natural language and the machine produces a natural language response. Rather than tracking changes in predictions between training cycles, as in the active learning approach used for embedding models, Majumder adopted a self-learning (self-training) paradigm: between training cycles, examples for which the partially trained model was highly confident in the accuracy of its own response were assumed to be correct and added to the training data for the next cycle. Rather than reducing the computational burden of training, self-learning reduces the amount of manually labeled, gold-standard data that humans must produce. Using this approach and starting with only 20% of the full labeled training data set, the self-learning model achieved an F1 of 0.84 on a held-out evaluation set, compared with 0.90 for a model trained on the full labeled data set.
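A schematic of the self-learning loop described above might look like the following; the training function, answer-generation function, confidence measure, and threshold are placeholders rather than details of Majumder’s implementation.

```python
def self_training_loop(gold_questions, gold_answers, unlabeled_questions,
                       train_fn, generate_fn,
                       confidence_threshold=0.9, n_cycles=3):
    """Sketch of a self-learning (pseudo-labeling) loop for generative QA.

    train_fn(questions, answers)  -> a fitted QA model (stand-in for fine-tuning)
    generate_fn(model, questions) -> (answers, confidences), where confidence
                                     could be, e.g., a mean token probability
    """
    train_q = list(gold_questions)      # start from the small gold-labeled seed set
    train_a = list(gold_answers)
    pool = list(unlabeled_questions)

    model = train_fn(train_q, train_a)
    for _ in range(n_cycles):
        if not pool:
            break
        answers, confidences = generate_fn(model, pool)
        # Treat the model's own high-confidence answers as pseudo-labels...
        keep = [i for i, c in enumerate(confidences) if c >= confidence_threshold]
        train_q.extend(pool[i] for i in keep)
        train_a.extend(answers[i] for i in keep)
        # ...and remove those questions from the unlabeled pool.
        pool = [q for i, q in enumerate(pool) if i not in set(keep)]
        # Retrain on gold labels plus the accumulated pseudo-labels.
        model = train_fn(train_q, train_a)
    return model
```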

Retraining or fine-tuning machine learning models is often thought of in the context of domain adaptation, where a model trained on a large, general data set is refined with a much smaller set of examples from a particular specialized application domain. But another area where specialized models could potentially be helpful is in tuning for the particular *user*, rather than the application domain. For instance, one could imagine a language model tuned to detect and summarize reports according to the inferred interests of a specific analyst, as in LAS’ SCADS grand challenge problem. This sort of approach would seem to imply training a customized model for each individual user, with a correspondingly high resource burden. But what if this were unnecessary?

In a 2022 publication, Jang et al. [4] described an approach to personalized question-answering and article summarization in which the generative model was adapted to an individual user not by retraining or fine-tuning, but by providing examples of that individual’s interests alongside their question as input to a single pretrained model. Despite using a small model by the standards of today’s commercial offerings, the researchers were able to produce fluent summaries that were often responsive to the stated interests and background of the model’s users, offering a potentially promising approach to effective and scalable language model personalization.
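As a rough illustration of the idea (one shared pretrained model, with personalization carried entirely by the input), a per-user input might be assembled along the following lines. The function and template below are hypothetical and are not the input format used by Jang et al.

```python
def build_personalized_input(question, user_interests, max_interests=5):
    """Pair a user's question with examples of their interests so that a single
    shared pretrained model can condition on both (illustrative template only)."""
    interests_block = "\n".join(f"- {i}" for i in user_interests[:max_interests])
    return (
        "User background and interests:\n"
        f"{interests_block}\n\n"
        f"Question: {question}\n"
        "Answer, taking the user's interests into account:"
    )

# The same model serves every analyst; only the assembled input changes per user.
prompt = build_personalized_input(
    "Summarize today's reporting on supply-chain disruptions.",
    ["maritime shipping", "semiconductor manufacturing", "Southeast Asia"],
)
```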

The state of the art in language modeling has advanced extraordinarily rapidly over the past few years, and some aspects of the Jang et al. work have since been superseded by more recent approaches. In ongoing research, Kewen Peng of the Menzies lab is updating the Jang et al. architecture, using cross-attention rather than concatenation to combine the representations of user interests and queries, and experimenting with other recent advances. The goals are to advance state-of-the-art performance in personalized question-answering and summarization, and to determine whether similar approaches might offer a viable means of achieving the aims of LAS’ SCADS grand challenge.
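To show what replacing concatenation with cross-attention looks like in the abstract, the sketch below fuses a query representation with user-interest representations using PyTorch’s built-in multi-head attention. The dimensions, module layout, and residual/normalization choices are assumptions made for illustration, not the lab’s architecture.

```python
import torch
import torch.nn as nn

class InterestQueryCrossAttention(nn.Module):
    """Fuse query tokens with user-interest representations via cross-attention,
    rather than simply concatenating the two (dimensions are illustrative)."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_repr, interest_reprs):
        # query_repr:     (batch, query_len, d_model)   tokens of the user's question
        # interest_reprs: (batch, n_interests, d_model)  encoded user-interest examples
        attended, _ = self.attn(query=query_repr,
                                key=interest_reprs,
                                value=interest_reprs)
        # Residual connection keeps the original query information alongside
        # whatever the query pulled from the user's interests.
        return self.norm(query_repr + attended)

# Toy usage with random tensors standing in for encoder outputs.
fuser = InterestQueryCrossAttention()
q = torch.randn(2, 12, 256)   # two queries of 12 tokens each
p = torch.randn(2, 5, 256)    # five encoded interest snippets per user
fused = fuser(q, p)           # shape: (2, 12, 256)
```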

LAS’ collaboration with the Menzies lab at North Carolina State University has made significant strides in addressing the resource-intensive nature of training language models. By applying classical machine learning training optimization techniques, such as active learning and self-learning, to language modeling tasks, we have demonstrated promising results in training domain-specific embedding and generative question-answering models. Our ongoing research, including updating Jang et al.’s work, offers new possibilities for effective and scalable language model personalization. As machine learning research continues to evolve at a rapid pace, our collaboration with Prof. Menzies’ group illustrates LAS’ commitment to translating the state of the art into mission-relevant applications.

Acknowledgement

This material is based upon work done, in whole or in part, in coordination with the Department of Defense (DoD). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the DoD and/or any agency or entity of the United States Government.

References

[1] Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.

[2] Rasmy, L., Xiang, Y., Xie, Z., Tao, C., & Zhi, D. (2021). Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine, 4(1), 86.

[3] Bentivogli, L., Bernardi, R., Marelli, M., Menini, S., Baroni, M., & Zamparelli, R. (2016). SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Language Resources and Evaluation, 50, 95-124.

[4] Jang, Y., et al. (2022). Call for Customized Conversation: Customized Conversation Grounding Persona and Knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 36, No. 10, pp. 10803-10812).