relation: https://openaccess.city.ac.uk/id/eprint/32651/
title: Towards Knowledge-Grounded Natural Language Understanding and Generation
creator: Whitehouse, C.
subject: P Language and Literature
subject: QA75 Electronic computers. Computer science
subject: T Technology
description: This thesis investigates how natural language understanding and generation with transformer models can benefit from grounding the models in knowledge representations. Currently, the prevailing paradigm for training language models is pre-training on abundant raw text data and fine-tuning on downstream tasks. Although language models continue to advance, especially with the recent wave of Large Language Models (LLMs) such as ChatGPT, there appear to be limits to what can be achieved with text data alone, and it is desirable to study the impact of applying and integrating richer forms of knowledge representation to improve model performance. The most widely used form of knowledge for language modelling is structured knowledge in the form of triples consisting of entities and their relationships, often in English. This thesis explores beyond this conventional approach and aims to address several key questions:

• Can knowledge of entities extend its benefits beyond entity-centric tasks such as entity linking?
• How can we faithfully and effectively extract such structured knowledge from raw text, especially noisy web text?
• How do other types of knowledge, beyond structured knowledge, contribute to improving NLP tasks?

To this end, we study various tasks, including multimodal and multilingual applications, and consider a wide spectrum of knowledge: structured knowledge, typically represented as triples of entities and their relations, and unstructured knowledge, including parametric knowledge preserved in language models and knowledge distilled from Large Language Models.

Knowledge-grounding with structured knowledge. We begin by investigating the integration of structured knowledge into language models. Knowledge of entities has shown benefits for entity-centric tasks such as entity linking and relation extraction; however, most studies have been limited to monolingual settings. We expand knowledge-grounding with structured knowledge, specifically entities, in two directions of research. Firstly, we study whether knowledge of entities can benefit real-world fake news detection. We hypothesise that the world knowledge embedded in entities can contribute to assessing the truthfulness of news statements. Evaluation of various knowledge integration approaches on distinct datasets reveals that knowledge-enhanced language models improve fake news detection when incorporated with a relevant and up-to-date knowledge base. The second direction expands beyond English and focuses on multilingual entities. We introduce EntityCS, where we first construct a code-switched (CS) training corpus from Wikipedia by switching entities in English to their counterparts in other languages (see the sketch below). We then intermediate-train a pretrained multilingual model on this corpus for joint masked language modelling and entity prediction. Subsequent fine-tuning of the model on entity-centric downstream tasks consistently improves zero-shot cross-lingual transferability, demonstrating the benefit of integrating knowledge of multilingual entities.
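A minimal sketch of this entity-level code-switching, assuming an inter-language lookup of entity names (the LANG_LINKS table, the <e>...</e> markers, and the code_switch function are illustrative stand-ins, not the thesis's actual pipeline):

    import random

    # Hypothetical inter-language entity table; in practice such a mapping
    # would be derived from Wikipedia inter-language links.
    LANG_LINKS = {
        "London": {"de": "London", "fr": "Londres", "el": "Λονδίνο"},
        "United Kingdom": {"de": "Vereinigtes Königreich", "fr": "Royaume-Uni"},
    }

    def code_switch(sentence, entities, languages, p=0.5):
        """Replace English entity mentions with counterparts in other
        languages with probability p, yielding a code-switched sentence."""
        for entity in entities:
            links = LANG_LINKS.get(entity, {})
            candidates = [lang for lang in languages if lang in links]
            if candidates and random.random() < p:
                switched = links[random.choice(candidates)]
                # Mark the switched span so an entity-prediction objective
                # can locate it during intermediate training.
                sentence = sentence.replace(entity, "<e>" + switched + "</e>")
        return sentence

    print(code_switch("London is the capital of the United Kingdom.",
                      entities=["United Kingdom", "London"],
                      languages=["de", "fr", "el"]))

Naive string replacement is only meant to convey the idea; corpus construction at scale would operate on Wikipedia's hyperlinked entity mentions rather than raw surface forms.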
Extracting structured knowledge from web text. We continue by studying effective, faithful, and robust extraction of structured knowledge from web text. Most existing information extraction (IE) datasets are constrained to Wikipedia articles, and models trained on such a rich factual text corpus perform poorly when applied to the noisier text found on the web. To address these challenges, we introduce WebIE, a new dataset that takes raw sentences as input and structured triples as output. WebIE emphasises data quality by including negative examples and through rigorous human annotation. We also propose faithful generative information extraction pipelines. Our experiments with entity-planning training and prefix-trie decoding show improved accuracy in extracting knowledge from the web.

Knowledge-grounding beyond structured knowledge. To address our last research question, we study the impact of knowledge in a broader sense, including parametric knowledge (knowledge stored in the latent parameters of a model) derived from a model's self-explanations, and knowledge distilled from LLMs via data augmentation. We expand the application to multimodal language models and study knowledge-intensive visual question answering (VQA). We introduce a unified approach for fine-tuning multimodal models to jointly generate answers and explanations. Our experiments demonstrate improvements in both answer accuracy and explanation quality. Lastly, as LLMs continue to advance in performance and size, we explore the utility of distilling common-sense knowledge from general-purpose LLMs to benefit smaller task-specific models. We prompt various LLMs to generate diverse examples for several challenging and data-scarce multilingual common-sense datasets (sketched below). This augmentation yields consistent improvements for fine-tuned smaller models, shedding light on data augmentation strategies for scenarios with limited training data.

In summary, this thesis explores the role of knowledge grounding in natural language understanding and generation across a broad spectrum of tasks. We found that incorporating relevant and up-to-date knowledge of entities benefits fake news detection, and that entity-focused code-switching significantly enhances zero-shot cross-lingual transfer on entity-centric tasks. In terms of effective and faithful approaches to extracting structured knowledge, our study found that integrating negative examples and training with entity planning significantly improve performance. Additionally, we established that other, more general forms of knowledge, such as parametric and distilled knowledge, enhance multimodal and multilingual knowledge-intensive tasks. This research shows the tangible benefits of diverse knowledge integration and motivates further exploration in this direction.
date: 2024
type: Thesis
type: NonPeerReviewed
format: text
language: en
identifier: https://openaccess.city.ac.uk/id/eprint/32651/1/Whitehouse%20thesis%202024%20PDF-A.pdf
identifier: Whitehouse, C. (2024). Towards Knowledge-Grounded Natural Language Understanding and Generation. (Unpublished Doctoral thesis, City, University of London)
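To make the LLM-based augmentation above concrete, here is a minimal sketch; the prompt template, the record fields, and the caller-supplied call_llm function are illustrative assumptions rather than the thesis's actual prompts or models:

    import json

    # Illustrative few-shot prompt for generating common-sense examples in
    # a target language; the exact wording and JSON schema are assumptions.
    PROMPT_TEMPLATE = (
        "You write training data for a common-sense reasoning task in {lang}.\n"
        "Here are some existing examples:\n{examples}\n"
        "Generate {n} new, diverse examples in the same JSON format."
    )

    def augment(seed_examples, lang, call_llm, n=10, k=3):
        """Prompt an LLM (via the caller-supplied call_llm, which should
        return a JSON string) and return seeds plus filtered generations."""
        prompt = PROMPT_TEMPLATE.format(
            lang=lang,
            examples=json.dumps(seed_examples[:k], ensure_ascii=False),
            n=n,
        )
        generated = json.loads(call_llm(prompt))
        # Cheap validity filter: keep only well-formed records whose answer
        # actually appears among the listed choices.
        keep = [ex for ex in generated
                if {"question", "choices", "answer"} <= set(ex)
                and ex["answer"] in ex["choices"]]
        return seed_examples + keep

Filtering generations for basic validity before fine-tuning a smaller model is a common precaution when distilling data from an LLM, since outputs can be malformed or internally inconsistent.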