Transformer and T5 are trained on datasets arranged for teacher forcing. The dataset used in this article is only meant to help you understand the model; for a pretrained model, the data must be processed into a more complex form. Pretrained models are covered in the advanced course.
The figure below shows examples of the encoder input, decoder input, and final output of Transformer and T5. Many implementations use the decoder input and the final output without distinction, but separating the two makes it clear how the model can later be extended for pretraining or classification.
The reason for using three inputs and outputs is that the language models of Transformer and T5 are trained with teacher forcing. Please see the other articles on teacher forcing for details.
Taking the translation of the English sentence "I am a student" into the French sentence "je suis étudiant" as an example, let's look at how the tokenized data flows. Here, the English "I am a student" is called the source sentence, and the French "je suis étudiant" is called the target sentence.
Encoder Inputs
- The source sentence is converted to indexed tokens by the source tokenizer.
- Generally, the [BOS] and [EOS] tokens are excluded, but depending on the implementer, these two tokens may be added.
Decoder Inputs
- [BOS] + target sentence, converted to indexed tokens by the target tokenizer.
- Remember to prepend the [BOS] token, because we use teacher forcing.
Final Outputs
- Target sentence + [EOS], converted to indexed tokens by the target tokenizer.
- Remember to append the [EOS] token to the end of the sentence, because we use teacher forcing (a toy example follows below).
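As a concrete toy illustration, the sketch below shows what the three sequences look like for this sentence pair. The token ids are hypothetical and only the one-position shift between the decoder input and the final output matters.
# Toy sketch for "I am a student" -> "je suis étudiant".
# The ids below are made up; real ids come from the source/target tokenizers,
# and the [BOS]/[EOS]/[PAD] ids depend on the vocabulary.
encoder_input = [13, 7, 9, 25, 0, 0]    # "i am a student"          + padding
decoder_input = [1, 31, 42, 57, 0, 0]   # "[BOS] je suis étudiant"  + padding
final_output  = [31, 42, 57, 2, 0, 0]   # "je suis étudiant [EOS]"  + padding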
Dataset build case studies
There is no single fixed way to build the dataset, but you can roughly follow the sequence below.
In this article, we will look at three ways to create a dataset in TensorFlow 2.X. For PyTorch, I will write a separate article on how training and evaluation are done in the Transformer family.
Only the following three cases are introduced here. For the complete code, please refer to the GitHub repository.
- The BPEmb tokenizer is used, and the teacher forcing data form is created in the train step
- The Keras Tokenizer is used, and the teacher forcing data form is created in the train step
- The Keras Tokenizer is used, and the teacher forcing data form is created in the dataset
1. BPEmb tokenizer is used and the teacher forcing data form is created in the train step
The official TensorFlow tutorial builds a Portuguese-English translator, but in this article we are going to build an English-German translator. Basically, only the code below is my own. As I said, this is not an article on NLP, so all you have to know is that at every iteration you get a batch of shape (BATCH_SIZE, ENCODER_LEN) as the source sentences and a batch of shape (BATCH_SIZE, DECODER_LEN) as the corresponding target sentences; with the settings below that is (128, 41) for both. ENCODER_LEN and DECODER_LEN are the maximum lengths of the source and target sequences, and when a sentence is shorter than that, the remaining positions are zero padded, as you can see in the code below.
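To make the zero padding concrete before diving into the full pipeline, here is a tiny self-contained sketch, with made-up token ids, of what tf.keras.preprocessing.sequence.pad_sequences (used later in step 9) does:
import tensorflow as tf

toy_ids = [[5, 8, 3], [7, 2], [9, 4, 6, 1]]   # hypothetical token id lists of different lengths
padded = tf.keras.preprocessing.sequence.pad_sequences(toy_ids, maxlen=6, padding='post')
print(padded.shape)  # (3, 6)
print(padded)        # shorter sequences are filled with 0 on the right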
1. Tokenizer Install & import
The BPEmb tokenizer requires a separate installation step.
! pip install BPEmb
from bpemb import BPEmb
2. Copy or load raw data to Colab
ENCODER_LEN = 41
DECODER_LEN = ENCODER_LEN
BATCH_SIZE = 128
BUFFER_SIZE = 20000
import os
import re
import shutil
import unicodedata
import zipfile
import urllib3
import pandas as pd
import tensorflow as tf
pd.set_option('display.max_colwidth', None)
http = urllib3.PoolManager()
url ='http://www.manythings.org/anki/deu-eng.zip'
filename = 'deu-eng.zip'
path = os.getcwd()
zipfilename = os.path.join(path, filename)
with http.request('GET', url, preload_content=False) as r, open(zipfilename, 'wb') as out_file:
    shutil.copyfileobj(r, out_file)
with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
    zip_ref.extractall(path)
train_df = pd.read_csv('deu.txt', names=['SRC', 'TRG', 'lic'], sep='\t')
del train_df['lic']
print(len(train_df))
train_df = train_df.loc[:, 'SRC':'TRG']
train_df.head()
train_df["src_len"] = ""
train_df["trg_len"] = ""
train_df.head()
for idx in range(len(train_df['SRC'])):
    # length of the source sentence in words (default separator: space)
    text_eng = str(train_df.iloc[idx]['SRC'])
    result_eng = len(text_eng.split())
    train_df.at[idx, 'src_len'] = int(result_eng)

    # length of the target sentence in words (default separator: space)
    text_deu = str(train_df.iloc[idx]['TRG'])
    result_deu = len(text_deu.split())
    train_df.at[idx, 'trg_len'] = int(result_deu)

print('Translation Pair :', len(train_df))  # print the number of pairs
3. [Optional] Delete duplicated data
train_df = train_df.drop_duplicates(subset=["SRC"])
print('Translation Pair :', len(train_df))  # print the number of pairs
train_df = train_df.drop_duplicates(subset=["TRG"])
print('Translation Pair :', len(train_df))  # print the number of pairs
4. [Optional] Select samples
# Build a boolean mask that keeps sentence pairs whose word counts are between 8 and 20.
is_within_len = (8 < train_df['src_len']) & (train_df['src_len'] < 20) & (8 < train_df['trg_len']) & (train_df['trg_len'] < 20)
# Filter the rows that satisfy the condition.
train_df = train_df[is_within_len]

# Randomly sample 8,192 pairs (seeded for reproducibility).
dataset_df_8096 = train_df.sample(n=1024*8, random_state=1234)
print('Translation Pair :', len(dataset_df_8096))  # print the number of pairs
5. Preprocess and build list
raw_src = []
for sentence in dataset_df_8096['SRC']:
    sentence = sentence.lower().strip()

    # create a space between a word and the punctuation following it
    # e.g. "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)

    # expand contractions
    sentence = re.sub(r"i'm", "i am", sentence)
    sentence = re.sub(r"he's", "he is", sentence)
    sentence = re.sub(r"she's", "she is", sentence)
    sentence = re.sub(r"it's", "it is", sentence)
    sentence = re.sub(r"that's", "that is", sentence)
    sentence = re.sub(r"what's", "what is", sentence)
    sentence = re.sub(r"where's", "where is", sentence)
    sentence = re.sub(r"how's", "how is", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"won't", "will not", sentence)
    sentence = re.sub(r"can't", "cannot", sentence)
    sentence = re.sub(r"n't", " not", sentence)
    sentence = re.sub(r"n'", "ng", sentence)
    sentence = re.sub(r"'bout", "about", sentence)

    # replace everything except (a-z, A-Z, ".", "?", "!", ",") with a space
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    sentence = sentence.strip()
    raw_src.append(sentence)
def unicode_to_ascii(s):
    # strip accents: decompose characters and drop combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

raw_trg = []
for sentence in dataset_df_8096['TRG']:
    # call the helper implemented above
    sentence = unicode_to_ascii(sentence.lower())

    # create a space between a word and the punctuation following it
    # e.g. "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,¿])", r" \1", sentence)

    # replace everything except (a-z, A-Z, "!", ".", "?") with a space
    sentence = re.sub(r"[^a-zA-Z!.?]+", r" ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    raw_trg.append(sentence)

print(raw_src[:5])
print(raw_trg[:5])
print('Translation Pair :', len(raw_src))  # print the number of pairs
6. Tokenizer define
# Load a pretrained BPEmb model for each language.
SRC_tokenizer = BPEmb(lang='en', vs=10000, dim=100)
TRG_tokenizer = BPEmb(lang='de', vs=10000, dim=100)

# BPEmb is loaded with a fixed vocabulary size (vs=10000), so the vocabulary sizes are set directly.
n_enc_vocab = 10000
n_dec_vocab = 10000

print('Size of the encoder vocabulary :', n_enc_vocab)
print('Size of the decoder vocabulary :', n_dec_vocab)
7. Tokenizer test
lines = [
"It is winter and the weather is very cold.",
"Will this Christmas be a white Christmas?",
"Be careful not to catch a cold in winter and have a happy new year."
]
for line in lines:
    txt_2_ids = SRC_tokenizer.encode_ids(line)
    ids_2_txt = SRC_tokenizer.decode_ids(txt_2_ids)
    print("Input :", line)
    print("txt_2_ids :", txt_2_ids)
    print("ids_2_txt :", ids_2_txt, "\n")
lines = [
"Es ist Winter und das Wetter ist sehr kalt.",
"Wird dieses Weihnachten eine weiße Weihnacht?",
"Achten Sie darauf, sich im Winter nicht zu erkälten und kommen Sie gut ins neue Jahr."
]
for line in lines:
    txt_2_ids = TRG_tokenizer.encode_ids(line)
    ids_2_txt = TRG_tokenizer.decode_ids(txt_2_ids)
    print("Input :", line)
    print("txt_2_ids :", txt_2_ids)
    print("ids_2_txt :", ids_2_txt, "\n")
8. Tokenize
# tokenization / integer encoding / adding the start and end tokens / padding
tokenized_inputs, tokenized_outputs = [], []

for (sentence1, sentence2) in zip(raw_src, raw_trg):
    sentence1 = SRC_tokenizer.encode_ids(sentence1)               # source: plain token ids
    sentence2 = TRG_tokenizer.encode_ids_with_bos_eos(sentence2)  # target: ids wrapped with BOS/EOS
    tokenized_inputs.append(sentence1)
    tokenized_outputs.append(sentence2)
9. Pad sequences
# padding (post-padding with zeros up to the maximum length)
tkn_sources = tf.keras.preprocessing.sequence.pad_sequences(tokenized_inputs, maxlen=ENCODER_LEN, padding='post')
tkn_targets = tf.keras.preprocessing.sequence.pad_sequences(tokenized_outputs, maxlen=DECODER_LEN, padding='post')
10. Data type define
tkn_sources = tf.cast(tkn_sources, dtype=tf.int64)
tkn_targets = tf.cast(tkn_targets, dtype=tf.int64)
11. Check tokenized data
print('Shape of the source data :', tkn_sources.shape)
print('Shape of the target data :', tkn_targets.shape)
# print the first five samples
print(tkn_sources[0:5])
print(tkn_targets[0:5])
12. Build dataset
dataset = tf.data.Dataset.from_tensor_slices((tkn_sources, tkn_targets))
dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
In this example, the decoder input and the final output are not separated in the dataset. As mentioned before, the Transformer model uses teacher forcing, so these two must be separated; here that separation is done in the train step, as sketched below.
Since the official TensorFlow tutorial is written this way, no additional code change is needed.
For an example that separates them while building the dataset, see case 3 below and the other examples on GitHub.
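For reference, here is a minimal sketch of that split, in the style of the official TensorFlow Transformer tutorial. It assumes that a transformer model, a loss_function, and an optimizer have already been defined elsewhere, and the exact call signature of the model may differ from yours; only the slicing of tar is the point.
@tf.function
def train_step(inp, tar):
    # teacher forcing: shift the padded target batch by one position
    tar_inp = tar[:, :-1]   # decoder input : [BOS] w1 w2 ... (last position dropped)
    tar_real = tar[:, 1:]   # final output  : w1 w2 ... [EOS] (start token dropped)

    with tf.GradientTape() as tape:
        predictions = transformer([inp, tar_inp], training=True)  # model call; signature depends on your implementation
        loss = loss_function(tar_real, predictions)

    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))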
If you simply swap in a different dataset and tokenizer modules, you can build translators for other language pairs.
We are going to train a seq2seq-style Transformer model to convert one list of integers into another, i.e. a mapping from one vector sequence to another. Each word, or integer id, is encoded as an embedding vector, so in effect the Transformer learns a mapping from one sequence of vectors to another. Let's formulate this a bit more mathematically: given a pair of sequences $ \boldsymbol{X} = (\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(\tau_x)}) $ and $ \boldsymbol{Y} = (\boldsymbol{y}^{(1)}, \dots, \boldsymbol{y}^{(\tau_y)}) $, where $ \boldsymbol{x}^{(t)} \in \mathbb{R}^{|\mathcal{V}_{\mathcal{X}}|} $ and $ \boldsymbol{y}^{(t)} \in \mathbb{R}^{|\mathcal{V}_{\mathcal{Y}}|} $, drawn from the English and German corpora respectively, we learn a mapping $ f: \boldsymbol{X} \to \boldsymbol{Y} $.
In this implementation both vocabulary sizes are $10000$, so $ |\mathcal{V}_{\mathcal{X}}| = |\mathcal{V}_{\mathcal{Y}}| = 10000 $.
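To illustrate what "each integer is encoded as an embedding vector" means, here is a small standalone sketch. The dimensions mirror the BPEmb settings above (vocabulary 10000, embedding dimension 100), but the layer itself is only illustrative, not the model's actual embedding.
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=100)
batch_of_ids = tf.constant([[13, 7, 9, 25, 0, 0]])  # (batch, sequence length), hypothetical ids
vectors = embedding(batch_of_ids)
print(vectors.shape)  # (1, 6, 100): one 100-dimensional vector per token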
2. Keras Tokenizer is used and the teacher forcing data form is created in the train step
1. Tokenizer Install & import
The Keras Tokenizer is a word-level tokenizer that ships with TensorFlow 2.X, so it does not require a separate installation.
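As a quick illustration of its word-level behaviour, here is a toy sketch (not part of the pipeline below): the tokenizer is fit on raw text and then maps each word to an integer id.
import tensorflow as tf

toy_tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<unk>')
toy_tokenizer.fit_on_texts(["i am a student", "you are a teacher"])
print(toy_tokenizer.word_index)                               # word -> id mapping
print(toy_tokenizer.texts_to_sequences(["i am a teacher"]))   # each word becomes one id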
2. Copy or load raw data to Colab
ENCODER_LEN = 41
DECODER_LEN = ENCODER_LEN
BATCH_SIZE = 128
BUFFER_SIZE = 20000
N_EPOCHS = 20
import os
import re
import shutil
import unicodedata
import zipfile
import urllib3
import pandas as pd
import tensorflow as tf
pd.set_option('display.max_colwidth', None)
http = urllib3.PoolManager()
url ='http://www.manythings.org/anki/deu-eng.zip'
filename = 'deu-eng.zip'
path = os.getcwd()
zipfilename = os.path.join(path, filename)
with http.request('GET', url, preload_content=False) as r, open(zipfilename, 'wb') as out_file:
    shutil.copyfileobj(r, out_file)
with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
    zip_ref.extractall(path)
train_df = pd.read_csv('deu.txt', names=['SRC', 'TRG', 'lic'], sep='\t')
del train_df['lic']
print(len(train_df))
train_df = train_df.loc[:, 'SRC':'TRG']
train_df.head()
train_df["src_len"] = ""
train_df["trg_len"] = ""
train_df.head()
for idx in range(len(train_df['SRC'])):
    # length of the source sentence in words (default separator: space)
    text_eng = str(train_df.iloc[idx]['SRC'])
    result_eng = len(text_eng.split())
    train_df.at[idx, 'src_len'] = int(result_eng)

    # length of the target sentence in words (default separator: space)
    text_deu = str(train_df.iloc[idx]['TRG'])
    result_deu = len(text_deu.split())
    train_df.at[idx, 'trg_len'] = int(result_deu)

print('Translation Pair :', len(train_df))  # print the number of pairs
3. [Optional] Delete duplicated data
train_df = train_df.drop_duplicates(subset=["SRC"])
print('Translation Pair :', len(train_df))  # print the number of pairs
train_df = train_df.drop_duplicates(subset=["TRG"])
print('Translation Pair :', len(train_df))  # print the number of pairs
4. [Optional] Select samples
# Build a boolean mask that keeps sentence pairs whose word counts are between 8 and 20.
is_within_len = (8 < train_df['src_len']) & (train_df['src_len'] < 20) & (8 < train_df['trg_len']) & (train_df['trg_len'] < 20)
# Filter the rows that satisfy the condition.
train_df = train_df[is_within_len]

# Randomly sample 8,192 pairs (seeded for reproducibility).
dataset_df_8096 = train_df.sample(n=1024*8, random_state=1234)
print('Translation Pair :', len(dataset_df_8096))  # print the number of pairs
5. Preprocess and build list
raw_src = []
for sentence in dataset_df_8096['SRC']:
    sentence = sentence.lower().strip()

    # create a space between a word and the punctuation following it
    # e.g. "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)

    # expand contractions
    sentence = re.sub(r"i'm", "i am", sentence)
    sentence = re.sub(r"he's", "he is", sentence)
    sentence = re.sub(r"she's", "she is", sentence)
    sentence = re.sub(r"it's", "it is", sentence)
    sentence = re.sub(r"that's", "that is", sentence)
    sentence = re.sub(r"what's", "what is", sentence)
    sentence = re.sub(r"where's", "where is", sentence)
    sentence = re.sub(r"how's", "how is", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"won't", "will not", sentence)
    sentence = re.sub(r"can't", "cannot", sentence)
    sentence = re.sub(r"n't", " not", sentence)
    sentence = re.sub(r"n'", "ng", sentence)
    sentence = re.sub(r"'bout", "about", sentence)

    # replace everything except (a-z, A-Z, ".", "?", "!", ",") with a space
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    sentence = sentence.strip()
    raw_src.append(sentence)
def unicode_to_ascii(s):
    # strip accents: decompose characters and drop combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

raw_trg = []
for sentence in dataset_df_8096['TRG']:
    # call the helper implemented above
    sentence = unicode_to_ascii(sentence.lower())

    # create a space between a word and the punctuation following it
    # e.g. "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,¿])", r" \1", sentence)

    # replace everything except (a-z, A-Z, "!", ".", "?") with a space
    sentence = re.sub(r"[^a-zA-Z!.?]+", r" ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    raw_trg.append(sentence)

print(raw_src[:5])
print(raw_trg[:5])
6. Tokenizer define
df1 = pd.DataFrame(raw_src)
df2 = pd.DataFrame(raw_trg)
df1.rename(columns={0: "SRC"}, errors="raise", inplace=True)
df2.rename(columns={0: "TRG"}, errors="raise", inplace=True)
train_df = pd.concat([df1, df2], axis=1)
print('Translation Pair :', len(train_df))  # print the number of pairs

raw_src = train_df['SRC']
raw_trg = train_df['TRG']

# wrap every sentence with explicit start/end tokens before fitting the tokenizer
src_sentence = raw_src.apply(lambda x: "<SOS> " + str(x) + " <EOS>")
trg_sentence = raw_trg.apply(lambda x: "<SOS> " + x + " <EOS>")

# note that '<' and '>' are not in the filter list, so <SOS>, <EOS> and <unk> survive
filters = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
oov_token = '<unk>'

# define the tokenizers
SRC_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=oov_token)
TRG_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=oov_token)

SRC_tokenizer.fit_on_texts(src_sentence)
TRG_tokenizer.fit_on_texts(trg_sentence)

# +1 because index 0 is reserved for padding
n_enc_vocab = len(SRC_tokenizer.word_index) + 1
n_dec_vocab = len(TRG_tokenizer.word_index) + 1

print('Size of the encoder vocabulary :', n_enc_vocab)
print('Size of the decoder vocabulary :', n_dec_vocab)
7. Tokenizer test
lines = [
"It is winter and the weather is very cold.",
"Will this Christmas be a white Christmas?",
"Be careful not to catch a cold in winter and have a happy new year."
]
for line in lines:
    txt_2_ids = SRC_tokenizer.texts_to_sequences([line])
    ids_2_txt = SRC_tokenizer.sequences_to_texts(txt_2_ids)
    print("Input :", line)
    print("txt_2_ids :", txt_2_ids)
    print("ids_2_txt :", ids_2_txt[0], "\n")
lines = [
    "Es ist Winter und das Wetter ist sehr kalt.",
    "Wird dieses Weihnachten eine weiße Weihnacht?",
    "Achten Sie darauf, sich im Winter nicht zu erkälten und kommen Sie gut ins neue Jahr."
]
for line in lines:
    txt_2_ids = TRG_tokenizer.texts_to_sequences([line])
    ids_2_txt = TRG_tokenizer.sequences_to_texts(txt_2_ids)
    print("Input :", line)
    print("txt_2_ids :", txt_2_ids)
    print("ids_2_txt :", ids_2_txt[0], "\n")
8. Tokenize
# tokenization / integer encoding / adding the start and end tokens / padding
tokenized_inputs = SRC_tokenizer.texts_to_sequences(src_sentence)
tokenized_outputs = TRG_tokenizer.texts_to_sequences(trg_sentence)
9. Pad sequences
# padding (post-padding with zeros, truncating longer sequences)
tkn_sources = tf.keras.preprocessing.sequence.pad_sequences(tokenized_inputs, maxlen=ENCODER_LEN, padding='post', truncating='post')
tkn_targets = tf.keras.preprocessing.sequence.pad_sequences(tokenized_outputs, maxlen=DECODER_LEN, padding='post', truncating='post')
10. Data type define
tkn_sources = tf.cast(tkn_sources, dtype=tf.int64)
tkn_targets = tf.cast(tkn_targets, dtype=tf.int64)
11. Check tokenized data
print('Shape of the source data :', tkn_sources.shape)
print('Shape of the target data :', tkn_targets.shape)
# print the first five samples
print(tkn_sources[0:5])
print(tkn_targets[0:5])
12. Build dataset
dataset = tf.data.Dataset.from_tensor_slices((tkn_sources, tkn_targets))
dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
In this example, too, the decoder input and the final output are not separated in the dataset. As mentioned before, the Transformer model uses teacher forcing, so these two must be separated; that separation is done in the train step, exactly as sketched in case 1.
Since the official TensorFlow tutorial is written this way, no additional code change is needed.
For an example that separates them while building the dataset, see case 3 below and the other examples on GitHub.
3. Keras Tokenizer is used and the teacher forcing data form is created in the dataset
1. Tokenizer Install & import
The Keras Tokenizer is a word-level tokenizer that ships with TensorFlow 2.X, so it does not require a separate installation.
2. Copy or load raw data to Colab
ENCODER_LEN = 41
DECODER_LEN = ENCODER_LEN
BATCH_SIZE = 128
BUFFER_SIZE = 20000
N_EPOCHS = 20
import os
import re
import shutil
import unicodedata
import zipfile
import urllib3
import pandas as pd
import tensorflow as tf
pd.set_option('display.max_colwidth', None)
http = urllib3.PoolManager()
url ='http://www.manythings.org/anki/deu-eng.zip'
filename = 'deu-eng.zip'
path = os.getcwd()
zipfilename = os.path.join(path, filename)
with http.request('GET', url, preload_content=False) as r, open(zipfilename, 'wb') as out_file:
    shutil.copyfileobj(r, out_file)
with zipfile.ZipFile(zipfilename, 'r') as zip_ref:
    zip_ref.extractall(path)
train_df = pd.read_csv('deu.txt', names=['SRC', 'TRG', 'lic'], sep='\t')
del train_df['lic']
print(len(train_df))
train_df = train_df.loc[:, 'SRC':'TRG']
train_df.head()
train_df["src_len"] = ""
train_df["trg_len"] = ""
train_df.head()
for idx in range(len(train_df['SRC'])):
    # length of the source sentence in words (default separator: space)
    text_eng = str(train_df.iloc[idx]['SRC'])
    result_eng = len(text_eng.split())
    train_df.at[idx, 'src_len'] = int(result_eng)

    # length of the target sentence in words (default separator: space)
    text_deu = str(train_df.iloc[idx]['TRG'])
    result_deu = len(text_deu.split())
    train_df.at[idx, 'trg_len'] = int(result_deu)

print('Translation Pair :', len(train_df))  # print the number of pairs
3. [Optional] Delete duplicated data
train_df = train_df.drop_duplicates(subset=["SRC"])
print('Translation Pair :', len(train_df))  # print the number of pairs
train_df = train_df.drop_duplicates(subset=["TRG"])
print('Translation Pair :', len(train_df))  # print the number of pairs
4. [Optional] Select samples
# Build a boolean mask that keeps sentence pairs whose word counts are between 8 and 20.
is_within_len = (8 < train_df['src_len']) & (train_df['src_len'] < 20) & (8 < train_df['trg_len']) & (train_df['trg_len'] < 20)
# Filter the rows that satisfy the condition.
train_df = train_df[is_within_len]

# Randomly sample 8,192 pairs (seeded for reproducibility).
dataset_df_8096 = train_df.sample(n=1024*8, random_state=1234)
print('Translation Pair :', len(dataset_df_8096))  # print the number of pairs
5. Preprocess and build list
raw_src = []
for sentence in dataset_df_8096['SRC']:
    sentence = sentence.lower().strip()

    # create a space between a word and the punctuation following it
    # e.g. "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)

    # expand contractions
    sentence = re.sub(r"i'm", "i am", sentence)
    sentence = re.sub(r"he's", "he is", sentence)
    sentence = re.sub(r"she's", "she is", sentence)
    sentence = re.sub(r"it's", "it is", sentence)
    sentence = re.sub(r"that's", "that is", sentence)
    sentence = re.sub(r"what's", "what is", sentence)
    sentence = re.sub(r"where's", "where is", sentence)
    sentence = re.sub(r"how's", "how is", sentence)
    sentence = re.sub(r"\'ll", " will", sentence)
    sentence = re.sub(r"\'ve", " have", sentence)
    sentence = re.sub(r"\'re", " are", sentence)
    sentence = re.sub(r"\'d", " would", sentence)
    sentence = re.sub(r"won't", "will not", sentence)
    sentence = re.sub(r"can't", "cannot", sentence)
    sentence = re.sub(r"n't", " not", sentence)
    sentence = re.sub(r"n'", "ng", sentence)
    sentence = re.sub(r"'bout", "about", sentence)

    # replace everything except (a-z, A-Z, ".", "?", "!", ",") with a space
    sentence = re.sub(r"[^a-zA-Z?.!,]+", " ", sentence)
    sentence = sentence.strip()
    raw_src.append(sentence)
def unicode_to_ascii(s):
    # strip accents: decompose characters and drop combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

raw_trg = []
for sentence in dataset_df_8096['TRG']:
    # call the helper implemented above
    sentence = unicode_to_ascii(sentence.lower())

    # create a space between a word and the punctuation following it
    # e.g. "he is a boy." => "he is a boy ."
    sentence = re.sub(r"([?.!,¿])", r" \1", sentence)

    # replace everything except (a-z, A-Z, "!", ".", "?") with a space
    sentence = re.sub(r"[^a-zA-Z!.?]+", r" ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    raw_trg.append(sentence)

print(raw_src[:5])
print(raw_trg[:5])
6. Tokenizer define
df1 = pd.DataFrame(raw_src)
df2 = pd.DataFrame(raw_trg)
df1.rename(columns={0: "SRC"}, errors="raise", inplace=True)
df2.rename(columns={0: "TRG"}, errors="raise", inplace=True)
train_df = pd.concat([df1, df2], axis=1)
print('Translation Pair :', len(train_df))  # print the number of pairs

raw_src = train_df['SRC']
raw_trg = train_df['TRG']

# wrap every sentence with explicit start/end tokens before fitting the tokenizer
src_sentence = raw_src.apply(lambda x: "<SOS> " + str(x) + " <EOS>")
trg_sentence = raw_trg.apply(lambda x: "<SOS> " + x + " <EOS>")

# note that '<' and '>' are not in the filter list, so <SOS>, <EOS> and <unk> survive
filters = '!"#$%&()*+,-./:;=?@[\\]^_`{|}~\t\n'
oov_token = '<unk>'

# define the tokenizers
SRC_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=oov_token)
TRG_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters=filters, oov_token=oov_token)

SRC_tokenizer.fit_on_texts(src_sentence)
TRG_tokenizer.fit_on_texts(trg_sentence)

# +1 because index 0 is reserved for padding
n_enc_vocab = len(SRC_tokenizer.word_index) + 1
n_dec_vocab = len(TRG_tokenizer.word_index) + 1

print('Size of the encoder vocabulary :', n_enc_vocab)
print('Size of the decoder vocabulary :', n_dec_vocab)
7. Tokenizer test
lines = [
"It is winter and the weather is very cold.",
"Will this Christmas be a white Christmas?",
"Be careful not to catch a cold in winter and have a happy new year."
]
for line in lines:
    txt_2_ids = SRC_tokenizer.texts_to_sequences([line])
    ids_2_txt = SRC_tokenizer.sequences_to_texts(txt_2_ids)
    print("Input :", line)
    print("txt_2_ids :", txt_2_ids)
    print("ids_2_txt :", ids_2_txt[0], "\n")
lines = [
    "Es ist Winter und das Wetter ist sehr kalt.",
    "Wird dieses Weihnachten eine weiße Weihnacht?",
    "Achten Sie darauf, sich im Winter nicht zu erkälten und kommen Sie gut ins neue Jahr."
]
for line in lines:
    txt_2_ids = TRG_tokenizer.texts_to_sequences([line])
    ids_2_txt = TRG_tokenizer.sequences_to_texts(txt_2_ids)
    print("Input :", line)
    print("txt_2_ids :", txt_2_ids)
    print("ids_2_txt :", ids_2_txt[0], "\n")
8. Tokenize
# tokenization / integer encoding / adding the start and end tokens / padding
tokenized_inputs = SRC_tokenizer.texts_to_sequences(src_sentence)
tokenized_outputs = TRG_tokenizer.texts_to_sequences(trg_sentence)
9. Pad sequences
# padding (post-padding with zeros, truncating longer sequences)
tkn_sources = tf.keras.preprocessing.sequence.pad_sequences(tokenized_inputs, maxlen=ENCODER_LEN, padding='post', truncating='post')
tkn_targets = tf.keras.preprocessing.sequence.pad_sequences(tokenized_outputs, maxlen=DECODER_LEN, padding='post', truncating='post')
10. Data type define
tkn_sources = tf.cast(tkn_sources, dtype=tf.int64)
tkn_targets = tf.cast(tkn_targets, dtype=tf.int64)
11. Check tokenized data
print('Shape of the source data :', tkn_sources.shape)
print('Shape of the target data :', tkn_targets.shape)
# print the first five samples
print(tkn_sources[0:5])
print(tkn_targets[0:5])
12. Build dataset
# The start token must be removed from the decoder's ground-truth (label) sequence.
dataset = tf.data.Dataset.from_tensor_slices((
    {
        'inputs': tkn_sources,             # encoder input
        'dec_inputs': tkn_targets[:, :-1]  # decoder input: the last (padding) position is removed
    },
    {
        'outputs': tkn_targets[:, 1:]      # final output: the very first token, i.e. the start token, is removed
    },
))
dataset = dataset.cache()
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
# Check what [:, :-1] and [:, 1:] mean for an arbitrary sample.
print(tkn_targets[0])           # the original sample
print(tkn_targets[:1][:, :-1])  # the last padding position is removed, so the length becomes 40
print(tkn_targets[:1][:, 1:])   # the very first token, i.e. the start token, is removed; the length is also 40
In this example, the decoder input and the final output (= outputs) are defined in the dataset itself so that teacher forcing can be used.
As the printed result shows, the BOS token appears in the decoder input and the EOS token appears in the final output (= outputs).
There is no fixed format for defining these inputs and outputs, but keep in mind that the BOS token must appear in the decoder input and the EOS token must appear in the final output (= outputs); a quick inspection sketch follows below.
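To confirm that the teacher-forcing split is baked into the dataset itself, you can peek at one batch. This is a small inspection sketch, assuming the dataset built above; the dictionary keys are the ones defined in step 12 and are expected to match the input/output names of the model that will consume them.
for (enc_dec_inputs, labels) in dataset.take(1):
    print(enc_dec_inputs['inputs'].shape)      # (BATCH_SIZE, ENCODER_LEN)
    print(enc_dec_inputs['dec_inputs'].shape)  # (BATCH_SIZE, DECODER_LEN - 1)
    print(labels['outputs'].shape)             # (BATCH_SIZE, DECODER_LEN - 1)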
In the next chapter, we will look at how these defined inputs and outputs are used during training.