
Text Augmentation Using Corrupted-Text Python Library

Last Updated : 05 Sep, 2024

Text augmentation is an essential technique in Natural Language Processing (NLP) that helps improve model robustness by expanding the training data. One popular method is introducing corrupted or noisy text to simulate real-world scenarios where data may not always be clean.

This article explores how to implement text augmentation using the corrupted-text Python library, a tool that makes it easy to corrupt text data at different severities.

Text Augmentation in Natural Language Processing

Text augmentation involves modifying existing text data to create new variants. This can include changes in word order, synonym replacement, or introducing spelling errors. The goal is to enhance the dataset's variability, providing models with a wider range of inputs to learn from. This process helps in building more resilient and versatile NLP systems.
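To make the idea concrete, here is a minimal, self-contained sketch of one such modification: randomly swapping adjacent characters to simulate spelling errors. This toy function is only an illustration of the general technique; it is not part of the corrupted-text library.

```python
import random

def add_typos(text, rate=0.1, seed=0):
    """Swap adjacent characters in a fraction of words to simulate typos.

    Toy illustration of character-level text augmentation; the
    corrupted-text library applies richer, more realistic corruptions.
    """
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    out = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)  # pick a position to swap
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)

print(add_typos("text augmentation improves model robustness", rate=0.5, seed=1))
```

Because the swap only reorders characters within a word, the augmented text keeps the same word count and the same character multiset per word, which keeps the noise realistic rather than destructive.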

The Corrupted-Text Python Library offers a unique approach to text augmentation by applying model-independent corruptions. These corruptions mimic real-world errors, such as bad autocorrections and typos, making the augmented data more realistic. By using this library, developers can simulate out-of-distribution text scenarios and effectively test their models' robustness against such variations.

Corrupted-Text Python Library

The Corrupted-Text Python Library is designed to generate out-of-distribution text datasets through the application of common corruptions. Unlike model-specific adversarial approaches, this library focuses on realistic outliers to help researchers study model robustness. The corruptions are applied on a per-word basis, ensuring that each modification is independent and contributes to the overall variability of the text.

Corruptions implemented in this library include bad autocorrection, bad autocompletion, bad synonym replacement, and typographical errors. These alterations are based on common words, which are extracted from a base dataset. The corruptions mimic realistic text input errors found in everyday communication, such as incorrect autocorrections on mobile keyboards or dictionary-based translations without context.

The severity of corruption can be adjusted, allowing users to control the percentage of words affected. Higher severities result in more extensive corruption, which can significantly impact model accuracy. Users can also define weights for each corruption type, tailoring the augmentation process to their specific needs. The library provides insights into how such corruptions affect model performance, as demonstrated by the accuracies table included in the documentation.
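The severity parameter can be pictured as the fraction of words that get corrupted. The following stdlib-only sketch models that behavior by replacing a `severity` fraction of words with a placeholder token; it is a simplified stand-in, not the library's actual per-word corruption logic.

```python
import math
import random

def corrupt_words(text, severity, seed=0):
    """Replace a `severity` fraction of words with a placeholder.

    Toy model of how severity scales corruption; the real library
    applies typos, bad autocorrections, and synonym swaps instead
    of a placeholder token.
    """
    rng = random.Random(seed)
    words = text.split()
    n_corrupt = math.ceil(severity * len(words))  # higher severity => more words hit
    for i in rng.sample(range(len(words)), n_corrupt):
        words[i] = "<corrupted>"
    return " ".join(words)

print(corrupt_words("the quick brown fox jumps over the lazy dog", severity=0.5, seed=1))
```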

To install the Corrupted-Text Python Library, simply run this command:

 pip install corrupted-text

Implementation: Text Augmentation Using Corrupted-Text Python Library

To implement the Corrupted-Text Python Library, we will first import the corrupted_text library and the load_dataset() function from Hugging Face's datasets library. For demonstration, we will load the AG News dataset and extract the text of both the training and test splits.

from datasets import load_dataset
import corrupted_text

train_data = load_dataset("ag_news", split="train")["text"]
test_data = load_dataset("ag_news", split="test")["text"]

Output:

Downloading readme: 100%
 8.07k/8.07k [00:00<00:00, 99.7kB/s]
Downloading data: 100%
 18.6M/18.6M [00:00<00:00, 20.6MB/s]
Downloading data: 100%
 1.23M/1.23M [00:00<00:00, 4.62MB/s]
Generating train split: 100%
 120000/120000 [00:00<00:00, 255345.38 examples/s]
Generating test split: 100%
 7600/7600 [00:00<00:00, 124199.17 examples/s]

Next, we will fit the corruptor on both the training and test datasets using the TextCorruptor class of the corrupted_text library. The base dataset supplies the common words from which the corruptions are built.

text_corruptor = corrupted_text.TextCorruptor(base_dataset=test_data + train_data, cache_dir=".mycache")

Output:

Calculating Levenshtein distances: 100%|██████████| 4000/4000 [00:07<00:00, 505.19it/s]

We will then proceed to corrupt a small sample of the test dataset with a severity of 0.5. This will generate a list of corrupted text strings, simulating realistic text errors. Finally, we will print the results in a formatted style to illustrate the differences between the original and corrupted texts, highlighting the potential impact of these corruptions on NLP models.

sample_data = test_data[:10]

corruption_severity = 0.5
corrupted_texts = text_corruptor.corrupt(sample_data, severity=corruption_severity, seed=1)

print(f"==== Corruption Results with Severity {round(corruption_severity, 2)} ====")
for index in range(len(sample_data)):
    print(f"Original Text {index + 1}:")
    print(f"{sample_data[index]}\n")
    print(f"Corrupted Text {index + 1}:")
    print(f"{corrupted_texts[index]}\n")
    print("-" * 40)

Output:

Corrupting dataset: 100%|██████████| 3/3 [00:00<00:00, 252.20it/s]
==== Corruption Results with Severity 0.5 ====
Original Text 1:
Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.

Corrupted Text 1:
Fekrs uor T N pension afternoon talks Unions representing worked at Turnbr Newall saying they are ' disappointing ' afterward talking with striker arena firms Federal Mogul.

----------------------------------------
Original Text 2:
The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.

Corrupted Text 2:
Tae Race is On : Second Private Team Sets Launch Date for Human Spaceflight ( SPACl. computer ) SPACE. com - TORnNTO, Canada -- A instant \ team hf rocketeers competing for the # 36 ; 10 zillion Ansari X Pruze, a contest yor \ pirates funded subscribers shape flights, has officially declared their first \ launch database for itk manned Eruca vesicaria sativ.

----------------------------------------
Original Text 3:
Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.

Corrupted Text 3:
yy. Company Wins Grant to Study Peptides ( AP ) AP - A company found away a chemistry researcher at theory University of Louisville wan a grant to develop a method of reducing better peptides, wyich are short chains of amino acids, nhe bidding blocks of proteins.

----------------------------------------

The output shows both the original and corrupted versions of the sample texts after applying the text augmentation process with a corruption severity of 0.5 using the corrupted-text library.

The corrupted version of the text introduces various types of errors, such as:

  • Character-level changes: Swapping or altering letters (e.g., "Fears" → "Fekrs").
  • Word-level substitutions: Replacing words with similar-sounding or nonsensical alternatives (e.g., "after" → "afternoon").
  • Random insertion: Inserting unrelated terms or letters, such as "(SPACl. computer)".
  • Grammatical and semantic changes: Altering the grammatical structure, making the text harder to understand.

The corruption severity of 0.5 introduces a moderate level of noise, making the text less readable but still somewhat understandable. This simulates real-world noisy data that can be used to improve the robustness of NLP models.
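One rough way to quantify how much noise a given severity introduces is to measure the fraction of word positions that changed between the original and corrupted texts. This position-based metric is an assumption of this sketch, not something the library provides; real robustness evaluation would compare downstream model accuracy on clean versus corrupted data instead.

```python
def change_ratio(original, corrupted):
    """Fraction of word positions that differ between two texts.

    A crude, position-based proxy for corruption intensity; insertions
    or deletions shift alignment and inflate the ratio.
    """
    a, b = original.split(), corrupted.split()
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))  # matching positions
    return 1 - same / n

print(change_ratio(
    "Fears for T N pension after talks",
    "Fekrs uor T N pension afternoon talks",
))
```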

Conclusion

In conclusion, the Corrupted-Text Python Library offers a robust solution for generating realistic out-of-distribution text datasets. By applying independent corruptions, it allows researchers to explore model vulnerabilities and improve robustness against common text errors. This library is a valuable tool for those looking to enhance their text datasets and build more resilient NLP models.

