
NLP Augmentation with nlpaug Python Library

Last Updated : 21 Aug, 2024

Data augmentation is a crucial step in building robust AI models, especially during the data preparation phase. It involves adding synthetic data to an existing dataset to increase the diversity of the training data. For textual models, such as generative chatbots and translators, text augmentation becomes essential. Since these models are often trained on extensive text corpora, augmenting text by hand is impractical. Instead, we use libraries like `nlpaug` to automate the process. By generating multiple variations of existing text, `nlpaug` increases the size and variety of the training dataset, which can make the resulting model more tolerant of noisy input.

In this article, we'll explore the nlpaug Python library, covering installation and three character-level augmenters: the OCR Augmenter, the Keyboard Augmenter, and the Random Augmenter. By the end, you'll understand how to use nlpaug for text augmentation in your projects.

To install the nlpaug library, use the following command:

pip install nlpaug

If you are using a notebook environment like Google Colab or Jupyter, prefix the command with an exclamation mark, that is, `!pip install nlpaug`.

OCR Augmenter

The Optical Character Recognition (OCR) Augmenter allows us to introduce errors into text that might occur during the OCR process. This is particularly useful when working with datasets that consist mostly of scanned or digitized text, as it helps models identify and handle common OCR mistakes.

Steps to implement OCR Augmenter:

  1. Import the Module: Import the nlpaug.augmenter.char module.
  2. Define Sample Text: Create a variable to hold your sample text.
  3. Create OCR Augmenter: Instantiate the OcrAug class to create an OCR-based augmenter.
  4. Apply the Augmenter:
    • Call the augmenter's augment() method on the text.
    • Pass n=3 to generate three augmented versions of the text.
  5. Print the Results: Use the print() function to display both the original and augmented texts.

import nlpaug.augmenter.char as nac

text = "GeeksforGeeks offers countless tutorials on various programming languages."
aug = nac.OcrAug()
augmented_text = aug.augment(text, n=3)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['GeeksforGeeks uffeks countless tutorials un various programming lan9ua9e8.', 'Geeksf0kCeers offek8 c0ontles8 tutorials on various programming languages.', 'GeeksforGeeks offers countless totokial8 on various programming lan9ua9e8.']
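To make the idea behind this output concrete, here is a minimal pure-Python sketch of OCR-style noise. It is not how nlpaug implements OcrAug: the confusion table `OCR_CONFUSIONS`, the function `ocr_noise`, and its `char_prob` and `seed` parameters are all hypothetical names chosen for illustration.

```python
import random

# Hypothetical confusion table: characters an OCR engine might misread.
# This is an illustrative assumption, not the mapping nlpaug ships with.
OCR_CONFUSIONS = {"o": "0", "l": "1", "s": "8", "g": "9", "i": "1"}


def ocr_noise(text, char_prob=0.3, seed=None):
    """Swap confusable characters for look-alikes with probability char_prob."""
    rng = random.Random(seed)  # local RNG so a seed gives repeatable output
    out = []
    for ch in text:
        replacement = OCR_CONFUSIONS.get(ch)
        if replacement is not None and rng.random() < char_prob:
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


sample = "GeeksforGeeks offers countless tutorials on various programming languages."
for seed in range(3):
    # Each seed yields a different noisy variant of the same sentence.
    print(ocr_noise(sample, seed=seed))
```

Because every substitution is one character for one character, the noisy text keeps the length of the original, which makes it easy to align the two when inspecting results.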

Keyboard Augmenter

Using the Keyboard Augmenter, we introduce typographical errors that are commonly made during typing. This augmenter is primarily used for creating robust language models by training them to handle realistic typing mistakes. By including these artificial errors, we can train our models to better manage real-world text input that often contains typos.
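As a rough illustration of what a "realistic typing mistake" means, the sketch below replaces a character with a key that sits next to it on a QWERTY layout. This is only a sketch under stated assumptions, not nlpaug's implementation: the `NEIGHBOURS` table is a small hypothetical subset of a keyboard, and `keyboard_typo` and `add_typos` are illustrative names.

```python
import random

# Hypothetical subset of QWERTY neighbours, used only for illustration;
# nlpaug uses its own, more complete keyboard model.
NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "o": "iklp", "s": "awedxz",
    "t": "rfgy", "n": "bhjm", "r": "edft", "i": "ujko",
}


def keyboard_typo(word, rng):
    """Replace one character of word with an adjacent key, if possible."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not positions:
        return word  # no character in this word has a known neighbour
    i = rng.choice(positions)
    typo = rng.choice(NEIGHBOURS[word[i].lower()])
    if word[i].isupper():
        typo = typo.upper()  # keep the original capitalisation
    return word[:i] + typo + word[i + 1:]


def add_typos(text, word_prob=0.3, seed=None):
    """Apply keyboard_typo to each word with probability word_prob."""
    rng = random.Random(seed)
    words = text.split(" ")
    return " ".join(
        keyboard_typo(w, rng) if rng.random() < word_prob else w for w in words
    )


sample = "GeeksforGeeks offers countless tutorials on various programming languages."
print(add_typos(sample, seed=0))
```

Restricting substitutions to neighbouring keys is what separates this kind of noise from purely random character replacement: the errors resemble slips a person could plausibly make.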

1 Input and 1 Output

This is the usual approach: one input string produces one augmented output. We create an augmenter by instantiating the KeyboardAug class, apply it to the sample text with augment(), and print both the original and the augmented text. Note that augment() returns a list even when there is a single output, as the result below shows.

aug = nac.KeyboardAug()
augmented_text = aug.augment(text)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['GeeksforGeeks ofBerw countless tutorials on bxTious programming lAngIagec.']

1 Input and n Output

Here one input produces multiple outputs, which is useful for creating diverse variations of the same text. We again instantiate the KeyboardAug class, but this time pass n=2 to augment() to request two variations. Finally, we print both the original and the augmented texts.

aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=2)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['GeeksforGeeks oCters countless tu4lriaOs on various progdammLBF languages.', 'G2eksforGReJc offe%D Fointl#ss tutorials on various programming languages.']

n Input and n Output

In this example, we augment a list of text samples with a keyboard augmenter. We begin by creating an instance of the KeyboardAug class from the nlpaug library. Then we pass the whole list to augment(), which returns the augmented versions. Finally, we print both the original and the augmented texts to observe the changes.

from nlpaug.augmenter.char import KeyboardAug

texts = [
    'GeeksforGeeks provides comprehensive tutorials on data structures.',
    'Mastering algorithms is key to acing coding interviews at top companies.'
]

keyboard_augmenter = KeyboardAug()
augmented_articles = keyboard_augmenter.augment(texts)

print("Original text:", texts)
print("Augmented Text:", augmented_articles)

Output:

Original text: ['GeeksforGeeks provides comprehensive tutorials on data structures.', 'Mastering algorithms is key to acing coding interviews at top companies.']
Augmented Text: [a list of two strings, one typo-injected variant of each input text; the exact characters vary from run to run]
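When augmenting a list, it is often useful to keep each augmented string next to the text it came from, for example so that labels can be carried over. The sketch below assumes the augmenter returns exactly one output per input in the same order, which you should verify for your nlpaug version. `pair_with_sources` is a hypothetical helper, and `fake_augment` is a stand-in for `keyboard_augmenter.augment(texts)` so that the example runs without nlpaug installed.

```python
def pair_with_sources(originals, augmented):
    """Return (original, augmented) tuples, failing loudly on a length mismatch."""
    if len(originals) != len(augmented):
        raise ValueError(
            f"expected {len(originals)} augmented texts, got {len(augmented)}"
        )
    return list(zip(originals, augmented))


def fake_augment(texts):
    """Stand-in augmenter (an assumption): swaps the first two characters."""
    return [t[1] + t[0] + t[2:] if len(t) > 1 else t for t in texts]


texts = [
    'GeeksforGeeks provides comprehensive tutorials on data structures.',
    'Mastering algorithms is key to acing coding interviews at top companies.'
]

for original, noisy in pair_with_sources(texts, fake_augment(texts)):
    print(original, "->", noisy)
```

The explicit length check guards against silently misaligned pairs if an augmenter ever returns more or fewer strings than it was given.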

Random Augmenter

The Random Augmenter applies one of four character-level actions at random positions in the input text: insert, substitute, swap, or delete. These random changes create diverse variations of the original text, which is useful for training robust NLP models.

Insert character randomly

In this case, we will insert random characters at random positions. To implement this, we will create an instance of RandomCharAug() with the action set to "insert". Next, we will apply this augmenter to our sample text, generating the augmented text. Finally, we will print both the original and the augmented text to compare the results.

aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['GeeksforGeeks 3offergs countless tutorials on pvwar&ious programming langPuVadges.']

Substitute character randomly

In this example, we will use a Random Augmenter that substitutes random characters in the text. To achieve this, we will create an instance of RandomCharAug() with the action set to "substitute". Next, we will apply this augmenter to our sample text, generating the augmented text. Finally, we will print both the original and the augmented text to see the differences.

aug = nac.RandomCharAug(action="substitute")
augmented_text = aug.augment(text)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['GeeksforGeeks offers countless tutorOaXs on Zario!D #rograFBin* languages.']

Swap character randomly

In this example, we will use a Random Augmenter that swaps characters within the text. To implement this, we will create an instance of RandomCharAug() with the action set to "swap". Next, we will apply this augmenter to our sample text, generating the augmented text. Finally, we will print both the original and the augmented text to see the effect of the character swapping.

aug = nac.RandomCharAug(action="swap")
augmented_text = aug.augment(text)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['EgekfsorGkees offser countless uttoraisl on various programming languages.']

Delete character randomly

In this example, we will use a Random Augmenter that deletes characters from the text. To achieve this, we will create an instance of RandomCharAug() with the action set to "delete". Next, we will apply this augmenter to our sample text, generating the augmented text. Finally, we will print both the original and the augmented text to observe the deletions.

aug = nac.RandomCharAug(action="delete")
augmented_text = aug.augment(text)

print("Original text:", text)
print("Augmented Text:", augmented_text)

Output:

Original text: GeeksforGeeks offers countless tutorials on various programming languages.
Augmented Text: ['GeeksforGeeks ffes countless utoils on vrou programming languages.']
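The four actions shown above can be summarized in a few lines of plain Python. This is a minimal sketch of the general technique, not nlpaug's RandomCharAug: the function `random_char_edit` is a hypothetical name, it edits a single position in a single word, and it draws replacement characters from ASCII letters only.

```python
import random
import string


def random_char_edit(word, action, rng):
    """Apply one random character-level edit to word."""
    if len(word) < 2:
        return word  # too short to edit meaningfully
    i = rng.randrange(len(word))
    if action == "insert":
        # Add a random letter before position i.
        return word[:i] + rng.choice(string.ascii_letters) + word[i:]
    if action == "substitute":
        # Overwrite the character at position i with a random letter.
        return word[:i] + rng.choice(string.ascii_letters) + word[i + 1:]
    if action == "swap":
        # Exchange position i with its right neighbour (or left, at the end).
        j = i + 1 if i + 1 < len(word) else i - 1
        chars = list(word)
        chars[i], chars[j] = chars[j], chars[i]
        return "".join(chars)
    if action == "delete":
        # Drop the character at position i.
        return word[:i] + word[i + 1:]
    raise ValueError(f"unknown action: {action!r}")


rng = random.Random(42)  # seeded so repeated runs give the same edits
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", random_char_edit("tutorials", action, rng))
```

Insert lengthens the word by one character, delete shortens it by one, and substitute and swap keep the length unchanged, which matches the pattern visible in the outputs above.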


