Text Retrieval Slide - Update


AI VIETNAM

All-in-One Course
(TA Session)

Text Retrieval
Project

Dinh-Thang Duong – TA

Year 2023

Outline
➢ Introduction
➢ Create Corpus
➢ Text Representation
➢ Text Normalization
➢ Ranking
➢ Optional: Semantic Search with BERT
➢ Question
Introduction
❖ Getting Started

Most famous search engines
❖ Text Retrieval

Diagram: a user sends the query “Today news” to the TR System, which returns the relevant documents (today’s news articles).

Text Retrieval (TR) (also called Document Retrieval)1: a branch of Information Retrieval (IR) in which the system matches a stated user search query against a set of texts.

Ad-hoc Retrieval2: a system that aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.

1: https://en.wikipedia.org/wiki/Document_retrieval
2: https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html
❖ Applications

Search engines: find desired documents within a very large corpus.
❖ Challenges

Text retrieval shares the same core IR challenges: document indexing, text representation, and how to satisfy the user’s information need.
❖ Basic Text Retrieval Pipeline

• Query: a text describing the user’s information need.
• Corpus: a set of documents (texts).
• Relevance: satisfaction of the user’s information need.
• Information need: the topic about which the user desires to know more.
• Terms: indexed units (usually words).
• Index: a data structure for storing documents.

Input: a search query (a text) and the corpus (collection of documents).
Output: relevant documents (a collection of documents).

Pipeline: the corpus goes through Indexing; the query goes through Query Processing → Searching → Relevant Documents, with a feedback loop back into searching.
❖ Project Statement
With MSMARCO Dataset, create a simple text retrieval program using Vector Space Model.

❖ Project Statement

Example: the query ’what is the official language in Fiji’ goes into the Text Retrieval System together with the corpus (the MSMARCO entity corpus); the system returns the relevant documents.
❖ Vector Space Model

Given a query and a collection of documents:
1. Bring raw text into vector space (vector representation).
2. Indexing.
3. Calculate the similarity between vectors in the vector space.
4. Ranking.
Our Text Retrieval Pipeline:

Input: Corpus → Preprocessing → Bag-of-Words Vectorizer (built on the Vocabulary) → Indexing.
Input: Query → Query Processing → Cosine Similarity → Ranking → Output: Relevant Documents.
Create Corpus
❖ Problem
❖ Download

Download MSMARCO via Hugging Face Datasets.
❖ Step 1: Install datasets
❖ Step 2: Load MS_MARCO

We use MS_MARCO version 1.1 and its test set.
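The loading code on this slide did not survive extraction; a minimal sketch using the Hugging Face `datasets` library (the Hub dataset id `ms_marco` with config `v1.1`, installed in Step 1, is assumed here):

```python
def load_msmarco_test():
    """Load the MS MARCO v1.1 test split from the Hugging Face Hub."""
    from datasets import load_dataset  # pip install datasets (Step 1)
    return load_dataset("ms_marco", "v1.1", split="test")
```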
❖ Step 4: Extract text

1. Only use samples with query type == “entity”.
2. Load the text (you can load only passage_text) and append it to the corpus.
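The extraction code is likewise an image in the original; the sketch below shows the two steps on hypothetical rows shaped like MS MARCO v1.1 samples (the field names `query_type` and `passages.passage_text` follow that schema):

```python
def extract_corpus(dataset):
    corpus = []
    for sample in dataset:
        if sample["query_type"] != "entity":  # 1. keep entity samples only
            continue
        # 2. collect the passage texts and append them to the corpus
        for text in sample["passages"]["passage_text"]:
            corpus.append(text)
    return corpus

# hypothetical stand-ins for loaded MS MARCO rows
rows = [
    {"query_type": "entity",
     "passages": {"passage_text": ["Fijian is an official language of Fiji."]}},
    {"query_type": "description",
     "passages": {"passage_text": ["this row is skipped"]}},
]
print(extract_corpus(rows))
# → ['Fijian is an official language of Fiji.']
```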
Text Representation
❖ Introduction

S1 = “this is a book”
S2 = “machine learning book”

Similarity? Represent each text as a vector, e.g.:

S1 = [15, 30, 14, 50]
S2 = [12, 35, 10, 49]
❖ Challenges

Word relations:
• Synonymy: <water, H2O>…
• Antonymy: <up, down>…
• Polysemy: sentence, mouse…
• Similarity: <car, truck>…
• Relatedness: <coffee, cup>…
• Connotation: great (positive), terrible (negative)…
❖ Representation Taxonomy

Word embeddings fall into two broad groups:
• Without machine learning (context-independent): Bag-of-Words, TF-IDF.
• With machine learning:
  – Context-independent: Word2Vec (Skip-Gram, CBOW), GloVe, FastText.
  – Context-dependent, RNN-based: CoVe, ELMo.
  – Context-dependent, Transformer-based: GPT, BERT family.

https://medium0.com/nlplanet/two-minutes-nlp-11-word-embeddings-models-you-should-know-a0581763b9a9
❖ Introduction to Bag-of-Words

Bag-of-Words (BoW): a text representation method that represents text as a bag of its words, disregarding grammar and even word order but keeping multiplicity.

A BoW model has two components:
1. The vocabulary
2. A method for weighting terms

https://en.wikipedia.org/wiki/Bag-of-words_model
❖ Bag-of-Words Pipeline

1. Corpus (list of paragraphs) → Text Normalization → Create Dictionary.
2. A string (text) → Text Normalization → Vectorize → new text representation, e.g. [14, 2, 9, 36, 89].
❖ Dictionary

Each document is a set of terms, e.g. doc_i = [‘book’, ‘deep’, ‘learning’].

Dictionary = […, ‘book’, ‘good’, ‘algorithm’, ‘vietnam’, …] — the unique words in the corpus.
❖ Weighting: Term-frequency

Corpus (after tokenization):
doc1 = “deep learning book” → [‘deep’, ‘learning’, ‘book’]
doc2 = “machine learning algorithm” → [‘machine’, ‘learning’, ‘algorithm’]
doc3 = “learning ai from scratch” → [‘learning’, ‘ai’, ‘from’, ‘scratch’]
doc4 = “ai vietnam” → [‘ai’, ‘vietnam’]

Vocabulary = deep, learning, book, machine, algorithm, ai, from, scratch, vietnam

👉 Given the string “vietnam machine learning deep learning book”:

             deep  learning  book  machine  algorithm  ai  from  scratch  vietnam
BoW           1       2       1      1         0       0    0      0        1
Binary BoW    1       1       1      1         0       0    0      0        1
❖ Vectorizer

Input: s = “Hello AI VIETNAM” → Text Normalization → Bag-of-Words (over an n-word vocabulary) → Output: [0, 0, 1, …, 0] (a vector of n elements).
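The vectorizer code slide is an image in the original; a minimal sketch of the pipeline above (the `normalize` helper is a simplified stand-in for the full text-normalization step described later):

```python
import re

def normalize(text):
    # lowercase, strip punctuation, and tokenize on whitespace
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def vectorize(text, vocab):
    # Bag-of-Words: count how often each vocabulary word occurs in the text
    tokens = normalize(text)
    return [tokens.count(word) for word in vocab]

vocab = ["deep", "learning", "book", "machine", "algorithm",
         "ai", "from", "scratch", "vietnam"]
print(vectorize("vietnam machine learning deep learning book", vocab))
# → [1, 2, 1, 1, 0, 0, 0, 0, 1]
```

This reproduces the BoW row of the term-frequency table above.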
❖ Indexing

In the general pipeline, the corpus goes through Indexing, while the query goes through Query Processing → Searching → Relevant Documents, with a feedback loop.
❖ Indexing

Indexing: the process of organizing and structuring a collection of documents or data to facilitate efficient retrieval of information. It involves creating an index that enables quick access to relevant documents based on search queries or specific attributes. E.g., inverted indexing.
❖ Indexing

Document-term Matrix: a mathematical matrix that describes the frequency of terms occurring in a collection of documents. Rows correspond to documents (doc_1 … doc_n) and columns to terms (word_1 … word_n).

We can use the document-term matrix as a database for indexing documents.
❖ Document-term Matrix

Vocabulary = deep, learning, book, machine, algorithm, ai, from, scratch, vietnam

Represent each raw text as doc_i = (w1, w2, w3, …, wn), where wi ∈ vocab and n = vocab size:

        deep  learning  book  machine  algorithm  ai  from  scratch  vietnam
doc1     1       1       1      0         0       0    0      0        0
doc2     0       1       0      1         1       0    0      0        0
doc3     0       1       0      0         0       1    1      1        0
doc4     0       0       0      0         0       1    0      0        1
❖ Doc-Term Matrix as Index

Normalize & tokenize:
doc1 = “Học sách học AI.” → [‘học’, ‘sách’, ‘học’, ‘ai’]
doc2 = “Sách Học Máy” → [‘sách’, ‘học’, ‘máy’]
doc3 = “Người ấy là ai?” → [‘người’, ‘ấy’, ‘là’, ‘ai’]

Create vocabulary: Vocab = [‘học’, ‘sách’, ‘ai’, ‘máy’, ‘người’, ‘ấy’, ‘là’]

        học  sách  ai  máy  người  ấy  là
doc1     2    1    1    0     0    0    0
doc2     1    1    0    1     0    0    0
doc3     0    0    1    0     1    1    1

Each row is also the document’s index.
❖ Indexing Code
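The indexing code is missing from the extracted text; a sketch that builds the document-term matrix for the tokenized example above:

```python
def build_index(tokenized_docs, vocab):
    # one row per document, one column per vocabulary term (term counts)
    return [[doc.count(word) for word in vocab] for doc in tokenized_docs]

vocab = ["học", "sách", "ai", "máy", "người", "ấy", "là"]
docs = [["học", "sách", "học", "ai"],
        ["sách", "học", "máy"],
        ["người", "ấy", "là", "ai"]]
print(build_index(docs, vocab))
# → [[2, 1, 1, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0], [0, 0, 1, 0, 1, 1, 1]]
```

Each returned row matches the corresponding row of the matrix in the slide.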
Text Normalization
❖ Motivation

Tokenize only:
doc1 = “Học sách học AI.” → [‘Học’, ‘sách’, ‘học’, ‘AI.’]
doc2 = “Sách Học Máy” → [‘Sách’, ‘Học’, ‘Máy’]
doc3 = “Người ấy là ai?” → [‘Người’, ‘ấy’, ‘là’, ‘ai?’]

Vocabulary = Học, sách, học, AI., Sách, Máy, Người, ấy, là, ai?
vocab_size = 10

“Học” and “học” both refer to the meaning of “học”; “Sách” and “sách” both refer to the meaning of “sách”.
❖ Motivation

Preprocess & tokenize:
doc1 = “Học sách học AI.” → [‘học’, ‘sách’, ‘học’, ‘ai’]
doc2 = “Sách Học Máy” → [‘sách’, ‘học’, ‘máy’]
doc3 = “Người ấy là ai?” → [‘người’, ‘ấy’, ‘là’, ‘ai’]

Vocabulary = học, sách, ai, máy, người, ấy, là
vocab_size = 7

This reduces unnecessary complexity in the text representation.
❖ Input/Output

Input: “Hello, this is AI VIETNAM!”
Steps: Lowercasing → Punctuation Removal → Stopword Removal → Stemming
(Optional: URL removal, HTML tag removal, …)
Output: “hello ai vietnam”
❖ Lowercasing

“Hello, this is AI VIETNAM!” → “hello, this is ai vietnam!”   (lowercasing)
“hello, this is ai vietnam!” → “hello this is ai vietnam”   (then punctuation removal)
❖ Stopword Removal

Stopwords are less crucial words ➔ no need to represent them.
❖ Stemming

If we do not consider the semantics of words, we shouldn’t include all forms of a word in the dictionary — only the root form.

change, changing, changes, changer → the same meaning as “change”

➢ Reduces the size of the vector representation
➢ Avoids sparse vectors
❖ Stemming

Input: change, changing, changes, changer
Stemming (via stemming rules) → Output: “chang”
❖ Stemming

Stemming method: Porter Stemmer.
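The stemmer code slide is an image in the original. NLTK's `PorterStemmer` is the usual implementation; the toy suffix-stripper below only illustrates the idea and is far cruder than the real Porter rules:

```python
def simple_stem(word):
    # illustrative suffix stripping only — not the real Porter algorithm
    for suffix in ("ing", "es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["change", "changing", "changes", "changer"]
print([simple_stem(w) for w in words])
# → ['chang', 'chang', 'chang', 'chang']
```

All four forms collapse to the single root “chang”, as in the slide's input/output example.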
❖ Final Text Normalization Function

Lowercasing → Punctuation Removal → Stopword Removal → Stemming
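The final function's code is missing from the extracted text; a self-contained sketch of the four steps (the stopword list and `simple_stem` rules are illustrative stand-ins for NLTK's stopword list and PorterStemmer):

```python
import re

STOPWORDS = {"this", "is", "a", "the", "and", "of", "to"}  # illustrative subset

def simple_stem(word):
    # illustrative suffix stripping only — not the real Porter algorithm
    for suffix in ("ing", "es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize_text(text):
    text = text.lower()                                       # 1. lowercasing
    text = re.sub(r"[^\w\s]", "", text)                       # 2. punctuation removal
    tokens = [t for t in text.split() if t not in STOPWORDS]  # 3. stopword removal
    return [simple_stem(t) for t in tokens]                   # 4. stemming

print(normalize_text("Hello, this is AI VIETNAM!"))
# → ['hello', 'ai', 'vietnam']
```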
Ranking
❖ Motivation

        read  book  ai  machine  learn  how
doc1     1     1    1     0        0     0
doc2     0     1    0     1        1     0
doc3     0     0    1     0        2     1
query    0     0    0     1        1     0

Scoring each document with cosine similarity to the query gives a ranked list:

DocID  Similarity
d2     0.8165
d3     0.5774
d1     0.0000
❖ Cosine Similarity

cosine(a, b) = (a · b) / (|a| |b|)

The dot product alone favours long vectors (higher values in each dimension); dividing by the norms corrects for this.
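The formula translates directly into code; a small sketch, checked against the d2 score in the ranked list above:

```python
import math

def cosine(a, b):
    # cosine(a, b) = (a · b) / (|a| |b|), with 0.0 for zero-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [0, 0, 0, 1, 1, 0]
doc2 = [0, 1, 0, 1, 1, 0]
print(round(cosine(query, doc2), 4))
# → 0.8165
```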
❖ Cosine Similarity

Example: the similarity between vectors plotted along the dimensions term 1 (“learning”) and term 2 (“information”).
❖ Ranking based on similarity value

        học  sách  ai  máy  người  ấy  là
doc1     2    1    1    0     0    0    0
doc2     1    1    0    1     0    0    0
doc3     0    0    1    0     1    1    1
query    0    0    2    1     1    0    1

r1 = cosine(d1, q) = 0.308
r2 = cosine(d2, q) = 0.218
r3 = cosine(d3, q) = 0.756

Descending sort:

DocID  Similarity
d3     0.756
d1     0.308
d2     0.218
❖ Ranking code & results

1. Calculate the query vector.
2. Calculate the similarity between the query vector and each document vector.
3. Sort the similarities in descending order.
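The ranking code slide is an image; a sketch of the three steps, reusing the doc-term matrix and query from the previous slide (the query vector is taken as already computed, and `cosine` is redefined so the example is self-contained):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_vec, doc_term_matrix):
    # 2. similarity between the query vector and each document vector
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_term_matrix)]
    # 3. sort in descending order of similarity
    return sorted(scores, key=lambda s: s[1], reverse=True)

index = [[2, 1, 1, 0, 0, 0, 0],   # doc1
         [1, 1, 0, 1, 0, 0, 0],   # doc2
         [0, 0, 1, 0, 1, 1, 1]]   # doc3
query = [0, 0, 2, 1, 1, 0, 1]     # 1. query vector (assumed already computed)
for doc_id, score in rank(query, index):
    print(doc_id, round(score, 3))
```

This reproduces the ordering d3 > d1 > d2 from the previous slide.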
Optional: Semantic Search with BERT
❖ Introduction

Transformer: a type of deep neural network architecture used to solve the problem of transduction, or transformation of input sequences into output sequences.
❖ BERT

BERT (Bidirectional Encoder Representations from Transformers): a pretrained model that utilizes the Transformer encoder architecture, trained on a large amount of text data to understand and generate contextual representations of words.
❖ BERT

Input: ’what is the official language in Fiji’
Output: [0.123, 0.456, 0.789, …, 0.567, 0.890] — a vector representation of dim = 384.
❖ Semantic Search

Idea: use BERT (a Sentence Transformer) to generate vector representations for the query and the documents, then score their similarity using cosine similarity.
❖ Why Semantic Search?

Semantic Search: a search technique that aims to understand the meaning (semantics) of a query and the content being searched. The BERT output vector, e.g. [0.123, 0.456, 0.789, …], can capture semantic meaning.
Semantic Search Pipeline:

Input: Corpus → BERT Encode → Indexing.
Input: Query → Query Processing → Cosine Similarity → Ranking → Output: Relevant Documents.
❖ Step 1: Import BERT and encode corpus
❖ Step 2: Define Cosine Similarity function
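A numpy version of cosine similarity for dense embedding vectors (any implementation of the same formula works; `sentence_transformers.util.cos_sim` is a library alternative):

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity between two dense vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(round(cos_sim([1.0, 0.0], [1.0, 1.0]), 4))
# → 0.7071
```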
❖ Step 3: Define Ranking function
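A sketch of the ranking function over embeddings (`cos_sim` is redefined here, matching Step 2, so the example is self-contained; the toy 2-d embeddings are illustrative):

```python
import numpy as np

def cos_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_documents(query_emb, corpus_embs, top_k=5):
    # score every document embedding, then sort descending and keep top_k
    scores = [(i, cos_sim(query_emb, d)) for i, d in enumerate(corpus_embs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(rank_documents([1.0, 0.1], embs, top_k=2))
```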
❖ Step 4: Search
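An end-to-end search sketch tying Steps 1–3 together. `DummyEncoder` (a bag-of-characters toy) stands in for the SentenceTransformer model from Step 1 so the example runs anywhere; swap in the real model for actual semantic search:

```python
import numpy as np

class DummyEncoder:
    """Hypothetical stand-in for SentenceTransformer, for illustration only."""
    def encode(self, texts):
        # a bag-of-characters "embedding" over the letters a–j
        return np.array([[text.lower().count(c) for c in "abcdefghij"]
                         for text in texts], dtype=float)

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query, model, corpus, corpus_embeddings, top_k=5):
    query_emb = model.encode([query])[0]          # encode the query
    scores = [(i, cos_sim(query_emb, d))          # score each document
              for i, d in enumerate(corpus_embeddings)]
    scores.sort(key=lambda s: s[1], reverse=True)  # rank descending
    return [(corpus[i], score) for i, score in scores[:top_k]]

model = DummyEncoder()
corpus = ["fiji language facts", "deep learning book"]
corpus_embeddings = model.encode(corpus)
results = search("language of fiji", model, corpus, corpus_embeddings, top_k=1)
print(results[0][0])
# → fiji language facts
```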
Question
