Module-5:: Network Analysis
NLP, which stands for Natural Language Processing, is a fascinating field of computer
science and artificial intelligence (AI) that deals with the interaction between computers
and human language. Here's a breakdown of what NLP is and what it does:
Goal of NLP:
The main goal of NLP is to bridge the gap between how humans use language and how
computers can process information. Humans communicate in a complex and nuanced way,
using slang, sarcasm, and context to convey meaning. NLP aims to develop techniques
that allow computers to understand the intricacies of human language and perform tasks
like:
● Understanding the meaning of text: This involves analyzing the syntax (grammar)
and semantics (meaning) of words and sentences.
● Generating human-like text: NLP can be used to create chatbots that can hold
conversations, translate languages, or write different kinds of creative content.
● Extracting information from text: NLP can be used to analyze documents, emails,
social media posts, or other forms of text data to identify important information.
NLP combines several techniques from various fields, like computer science, linguistics,
and statistics. Here are some of the key building blocks:
● Machine Learning: NLP algorithms are often trained on large amounts of text data.
This data can be used to teach the algorithms how to recognize patterns in language
and perform tasks like text classification or sentiment analysis.
● Deep Learning: A subfield of machine learning, deep learning uses artificial neural
networks to model the complexities of human language. These models can learn to
identify relationships between words, understand context, and generate more
natural-sounding text.
● Computational Linguistics: This field focuses on the application of computer
science techniques to study language. It provides NLP with the foundation for
understanding the structure and meaning of language.
Applications of NLP:
NLP has a wide range of applications across various industries. Here are some examples:
● Machine Translation: NLP is used to develop machine translation systems that can
translate text from one language to another.
● Chatbots: Virtual assistants like Siri or Alexa use NLP to understand your questions
and requests and respond in a helpful way.
● Sentiment Analysis: NLP can be used to analyze social media posts, customer
reviews, or other forms of text data to understand the sentiment or opinion being
expressed.
● Text Summarization: NLP can be used to automatically generate summaries of long
documents or articles.
● Spam Filtering: Email spam filters use NLP techniques to identify and filter out
unwanted emails.
Word Clouds
A word cloud is a visual representation of text data where the size of each word indicates
its frequency or importance.
Python code
from wordcloud import WordCloud
import matplotlib.pyplot as plt
text = "Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data."
# Generate the word cloud from the text
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
n-Gram Language Models
An n-gram is a contiguous sequence of n items from a given sample of text or speech. For
example, a bi-gram (2-gram) model considers sequences of 2 words.
N-grams are groups of consecutive words extracted from text, used to capture the context and structure of language. They provide a simple way to analyze text, allow algorithms to recognize and predict linguistic patterns, enhance feature engineering for machine learning models, and improve the performance of tasks like text classification, sentiment analysis, and language translation. By considering the order and combination of words, n-grams offer a more nuanced approach to analyzing and processing text data. For example, the sentence "data science is fun" contains the trigrams (groups of 3) "data science is" and "science is fun". An n-gram language model works as follows:
1. Training: An n-gram language model is trained on a large corpus of text data. This
data can be books, articles, websites, or any other source of written text.
2. Frequency Counting: The model then counts the frequency of each n-gram in the
training data. This tells the model how often specific sequences of words appear
together.
3. Probability Estimation: Based on the frequencies, the model estimates the
probability of a word appearing after a given sequence of n-1 words. For example, a
trigram model might estimate the probability of the word "park" following the
bigram "went to the".
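The frequency-counting and probability-estimation steps above can be sketched in plain Python. This is a minimal illustration on a made-up toy corpus, not an optimized implementation:

```python
from collections import Counter

corpus = "we went to the park and then we went to the store".split()

# Step 2: count bigram and unigram frequencies
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

# Step 3: estimate P(next_word | word) = count(word, next_word) / count(word)
def bigram_prob(word, next_word):
    return bigrams[(word, next_word)] / unigrams[word]

print(bigram_prob("went", "to"))   # 1.0 - "to" follows both occurrences of "went"
print(bigram_prob("the", "park"))  # 0.5 - "the" is followed by "park" once, "store" once
```

A real model would be trained on millions of words and would smooth the counts so that unseen n-grams do not get probability zero.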
Applications of N-gram Language Models:
● Autocorrect: They can suggest corrections for misspelled words by analyzing the
surrounding words and identifying the most likely sequence.
● Speech Recognition: They can help improve speech recognition accuracy by
considering the probabilities of different word sequences following a recognized
word.
● Machine Translation: N-gram models can be used to translate text by predicting
the most likely word sequence in the target language that corresponds to the source
language sentence.
● Text Prediction: They can be used in features like auto-completion in messaging
apps or search bars, suggesting the most likely next word based on the user's input.
Python code
from nltk import ngrams
# Extract bigrams (n=2) from a tokenized sentence
sentence = "data science is an interdisciplinary field".split()
bigrams = list(ngrams(sentence, 2))
print(bigrams)
Grammars
Grammars in computational linguistics are used to describe the syntax of languages, often using formal rules. A context-free grammar can be defined to parse sentences.
A context-free grammar (CFG) is a type of formal grammar that can describe the syntax or structure of a formal language. A CFG is defined by a 4-tuple (V, T, P, S), where:
● V - a set of non-terminal symbols (variables).
● T - a set of terminal symbols.
● P - a set of production rules.
● S - the start symbol, where S ∈ V.
Python code
import nltk
from nltk import CFG
# Define a small grammar for sentences like "the giraffe dreams"
grammar = CFG.fromstring("S -> NP VP\nNP -> Det N\nVP -> V\nDet -> 'the'\nN -> 'giraffe'\nV -> 'dreams'")
# Create a parser and parse a sentence
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the giraffe dreams".split()):
    print(tree)
Gibbs Sampling
Gibbs sampling is a Markov Chain Monte Carlo (MCMC) algorithm for obtaining a sequence of observations from a specified multivariate probability distribution.
Python code
import numpy as np
def gibbs_sampling(iterations):
    x, y = 0, 0
    samples = []
    for _ in range(iterations):
        x = np.random.normal(y, 1)
        y = np.random.normal(x, 1)
        samples.append((x, y))
    return samples
samples = gibbs_sampling(1000)
print("Samples:", samples[:10])
Gibbs sampling is commonly used for statistical inference (e.g. determining the best value
of a parameter, such as the number of people likely to shop at a particular store on a given
day, the candidate a voter will most likely vote for, etc.).
Topic Modeling
Topic modeling is a type of statistical model for discovering the abstract "topics" that
occur in a collection of documents. Latent Dirichlet Allocation (LDA) is a common
algorithm for this.
Python code
import gensim
from gensim import corpora
documents = [
    "Data science is a field that combines scientific methods",
    "Machine learning is a method of data analysis",
    "Deep learning is a subset of machine learning"
]
# Tokenize and build a dictionary and bag-of-words corpus
texts = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# Train an LDA model with 2 topics and print them
lda = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())
Word Vectors
Word vectors, or word embeddings, are dense vector representations of words that
capture their meanings.
Python code
from gensim.models import Word2Vec
sentences = [
    ["data", "science", "is", "fun"],
    ["machine", "learning", "is", "cool"],
    ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]
]
# Train a small Word2Vec model and inspect a word's vector and neighbors
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=42)
print(model.wv["learning"])
print(model.wv.most_similar("machine"))
Recurrent Neural Networks (RNNs)
RNNs are a type of neural network designed for sequence data, such as time series or natural language.
Here is a more detailed example of preparing data for, and generating text with, a character-level RNN:
Python code
import tensorflow as tf
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
import numpy as np
# Sample corpus and character-level vocabulary
text = "Data science is an interdisciplinary field that uses scientific methods."
chars = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(chars)}
idx2char = np.array(chars)
text_as_int = np.array([char2idx[c] for c in text])
seq_length = 10
examples_per_epoch = len(text) // (seq_length + 1)
# Build a dataset of (input, target) character sequences
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
vocab_size = len(chars)
embedding_dim = 256
rnn_units = 1024
# Model with batch size 1 and a stateful RNN, used here for generation
# (in practice the model would be trained first, e.g. with model.fit)
model = tf.keras.Sequential([
    Embedding(vocab_size, embedding_dim, batch_input_shape=[1, None]),
    SimpleRNN(rnn_units, return_sequences=True, stateful=True),
    Dense(vocab_size)
])
# Generate text
def generate_text(model, start_string):
    num_generate = 1000
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    temperature = 1.0
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return start_string + ''.join(text_generated)
Conclusion
Each of these concepts and techniques plays a crucial role in various natural language
processing and machine learning applications. By understanding and applying these
techniques, we can build powerful models to analyze and generate text data, recommend
products, and discover hidden topics in large document collections.
What are Word Clouds
Word clouds are visual representations of text data: the bigger a word appears, the more often it is used in the text. They are an engaging way to highlight important words in a dataset, and patterns may emerge from them. Word clouds are easy on the eyes, enabling viewers to grasp the essence of the content at a glance.
Creating a word cloud involves preparing your text data and then using software or libraries to generate the visual representation. Here's a breakdown of the process:
Text Preprocessing:
Common preprocessing steps include converting the text to lowercase, removing punctuation, and filtering out stop words (very common words like "the" and "and" that carry little meaning on their own).
Generating the Word Cloud:
Once your text data is preprocessed, you can use various tools and libraries to generate the word cloud. Here are some common options:
● Programming Languages: Python is a popular choice for NLP tasks, with libraries
like WordCloud offering functionalities to create and customize word clouds.
● Online Tools: Several online word cloud generators allow you to upload your text or
paste it directly. These tools often offer various customization options for font,
color, and layout.
Customization:
● Font and Color: You can choose fonts and color palettes that best represent the
theme of your word cloud.
● Layout: Some tools allow you to control the overall layout of the words, like a
circular or rectangular shape.
● Word Frequency and Size: You can define how the font size scales with word
frequency. A steeper slope will create a more dramatic difference between large and
small words.
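The scaling of font size with word frequency can be sketched directly. This is a toy illustration using a simple linear scaling; the minimum and maximum sizes are arbitrary choices, and real word cloud libraries offer more sophisticated scaling options:

```python
from collections import Counter

text = "data science data analysis data learning science"
counts = Counter(text.split())

# Scale font sizes linearly between min_size and max_size
min_size, max_size = 10, 40
lo, hi = min(counts.values()), max(counts.values())

def font_size(freq):
    if hi == lo:
        return max_size
    return min_size + (freq - lo) * (max_size - min_size) / (hi - lo)

for word, freq in counts.items():
    print(word, freq, font_size(freq))
```

Here "data" (frequency 3) gets the maximum size 40, while words appearing once get the minimum size 10. A steeper slope (a larger gap between min_size and max_size) exaggerates the difference between frequent and rare words.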
Additional Considerations:
● Stop Word Lists: Different tools and libraries might have built-in stop word lists,
but you can also create custom lists to remove specific words not relevant to your
analysis.
● Custom Dictionaries: For specialized fields, you might want to create custom
dictionaries to ensure relevant terms aren't filtered out as stop words.
● Background Image (Optional): Some tools allow you to overlay the word cloud on a
background image for a more visually appealing presentation.
In NLP, grammars are formal systems that define the structure of a language. They act like a set of rules that specify how words can be combined to form valid sentences. There are two main types of grammars used in NLP: constituency grammar and dependency grammar. While these provide a foundation, NLP often utilizes more advanced grammar formalisms like probabilistic grammars and tree adjoining grammars. Here are the key points:
● Humans vs. Computers: Human language is easy for humans to understand, but
computers need a structured approach. Grammar provides this structure.
● Syntax: It refers to the way words are arranged to form sentences. It defines the
sentence structure explicitly.
● Regular Languages vs. Grammar: While regular languages and parts of speech deal
with word order, they cannot handle complex relationships like grammatical roles.
Grammar helps model these relationships.
● Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P), where:
○ N (also written VN) = set of non-terminal symbols, or variables.
○ T (also written ∑) = set of terminal symbols.
○ S = start symbol, where S ∈ N.
○ P = set of production rules for terminals as well as non-terminals. Each rule has the form α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
● Types of Grammar in NLP:
○ Context-Free Grammar (CFG): It uses rules to express how symbols can be
grouped and ordered. It's powerful and efficient but limited in capturing
some language complexities.
● Formalism of rules in a context-free grammar: A sentence in the language defined by a CFG is a series of words that can be derived by systematically applying the rules, beginning with a rule that has the start symbol S on its left-hand side.
○ Use of parse tree in context-free grammar: A convenient way to describe a
parse is to show its parse tree, simply a graphical display of the parse.
○ A parse of the sentence is a series of rule applications in which a syntactic
category is replaced by the right-hand side of a rule that has that category on
its left-hand side, and the final rule application yields the sentence itself.
● Example: A parse of the sentence "the giraffe dreams" is: s => np vp => det n vp =>
the n vp => the giraffe vp => the giraffe iv => the giraffe dreams
○ Constituency Grammar: It focuses on breaking down sentences into phrases
based on their function. It's easier to understand and implement but may not
be as powerful for all languages.
○ Dependency Grammar: It focuses on the relationships between individual
words in a sentence. It can be more accurate but is also more complex to
implement.
● Strengths and Weaknesses: Each grammar type has its advantages and limitations.
The choice depends on the specific NLP task and language being processed.
● Beyond Basic Grammars: More advanced grammar formalisms like probabilistic
grammars and tree adjoining grammars are also used in NLP.
Word Vectors
Word vectors (or word embeddings) are numerical representations of words that capture
their meanings based on their context within a large corpus of text. Common techniques
for generating word vectors include Word2Vec, GloVe, and FastText. These vectors allow
words with similar meanings to have similar representations in a high-dimensional space.
RNNs are a type of neural network designed for sequential data. They maintain a hidden
state that can capture information about previous inputs in the sequence, making them
suitable for tasks like language modeling, text generation, and time series prediction.
Character-Level RNN
A character-level RNN operates on individual characters rather than words. This can be
useful for tasks like generating text, spelling correction, or any application where
understanding and generating sequences of characters is important.
Let's consider an example where we use a character-level RNN to generate text. We'll train
the RNN on a corpus of text (e.g., a book or a collection of articles) and then use it to
generate new text character by character.
Step-by-Step Implementation
Python code
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
# Sample corpus and character vocabulary
text = "Once upon a time there was a data scientist. " * 20
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
vocab_size = len(chars)
# Hyperparameters
hidden_size = 128
seq_length = 100
learning_rate = 0.01
# Model definition
class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super(CharRNN, self).__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x, hidden):
        x = self.embed(x)
        out, hidden = self.rnn(x, hidden)
        out = self.fc(out)
        # Flatten to (seq_length, vocab_size) for the loss
        return out.reshape(-1, out.size(2)), hidden
    def init_hidden(self, batch_size):
        return (torch.zeros(1, batch_size, self.hidden_size),
                torch.zeros(1, batch_size, self.hidden_size))
model = CharRNN(vocab_size, hidden_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# Training loop: predict the next character at each position
data = [char_to_idx[ch] for ch in text]
for epoch in range(50):
    i = np.random.randint(0, len(data) - seq_length - 1)
    input_seq = torch.tensor([data[i:i + seq_length]], dtype=torch.long)
    target_seq = torch.tensor(data[i + 1:i + seq_length + 1], dtype=torch.long)
    hidden = model.init_hidden(1)
    model.zero_grad()
    output, hidden = model(input_seq, hidden)
    loss = criterion(output, target_seq)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')
# Generating text
def generate_text(model, start_str, length):
    model.eval()
    hidden = model.init_hidden(1)
    input_seq = torch.tensor([[char_to_idx[char] for char in start_str]],
                             dtype=torch.long)
    generated_text = start_str
    with torch.no_grad():
        for _ in range(length):
            output, hidden = model(input_seq, hidden)
            output = output[-1]
            predicted_idx = torch.argmax(output).item()
            predicted_char = idx_to_char[predicted_idx]
            generated_text += predicted_char
            input_seq = torch.tensor([[predicted_idx]], dtype=torch.long)
    return generated_text
# Example usage
print(generate_text(model, 'Once upon a time', 100))
Network Analysis: Betweenness Centrality, Eigenvector Centrality, Directed Graphs and PageRank
Network analysis is a powerful tool for understanding the structure and relationships
within complex systems. It examines how elements (nodes) are connected and explores
the flow of information or influence between them. Here, we'll delve into some key
concepts related to network analysis:
1. Betweenness Centrality:
This measure identifies nodes that act as bridges or critical intermediaries within a
network. It calculates how often a particular node lies on the shortest paths between
other nodes. A high betweenness centrality score indicates a node with significant
control over information flow, making it potentially influential. Imagine a road network;
a bridge with high betweenness centrality would be crucial for traffic flow between two
regions.
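Betweenness centrality can be computed in a few lines with the networkx library. The small graph below is invented for illustration: two triangles joined by a single bridge node, playing the role of the bridge in the road-network analogy above:

```python
import networkx as nx

# Two clusters joined by a single bridge node
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2),   # cluster A
                  (3, 4), (3, 5), (4, 5),   # cluster B
                  (2, 6), (6, 3)])          # node 6 bridges the clusters
bc = nx.betweenness_centrality(G)
# The bridge node lies on every shortest path between the two clusters
print(max(bc, key=bc.get))  # node 6
```

Node 6 scores highest because every shortest path between cluster A and cluster B must pass through it, even though its degree (2) is lower than that of most other nodes.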
2. Eigenvector Centrality:
This measure assesses the importance of a node based on the importance of its
neighbors. It considers the quality and quantity of connections a node has. Nodes
connected to other well-connected nodes receive higher scores. Eigenvector centrality is
particularly useful for undirected graphs where connections don't have a specific
direction.
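Eigenvector centrality can likewise be computed with networkx. In this small invented graph, node 0 is connected to several nodes that are themselves interconnected, so it receives the highest score:

```python
import networkx as nx

# Node 0 is linked to well-connected neighbors
G = nx.Graph([(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4)])
ec = nx.eigenvector_centrality(G)
print(max(ec, key=ec.get))  # node 0 scores highest
```

Unlike a simple degree count, eigenvector centrality would also reward a lower-degree node whose few neighbors are themselves highly central.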
3. Directed Graphs:
In a directed graph, edges have a direction: a connection from node A to node B does not imply a connection from B to A. Directed graphs model one-way relationships, such as web pages linking to other pages or users following other users, and many centrality measures have directed variants that distinguish incoming links from outgoing ones.
4. PageRank:
PageRank is an algorithm used by Google Search to rank web pages in their search engine
results. It measures the importance of each node in a graph based on the number and
quality of links to it. PageRank can be thought of as a variant of eigenvector centrality for
directed graphs.
### Prerequisites
This example uses the `networkx` and `matplotlib` libraries (install with `pip install networkx matplotlib`).
Python code
import networkx as nx
import matplotlib.pyplot as plt
# Create a directed graph
G = nx.DiGraph()
# Add nodes
nodes = range(1, 6)
G.add_nodes_from(nodes)
# Add directed edges
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
G.add_edges_from(edges)
# Compute centrality measures
print("Betweenness:", nx.betweenness_centrality(G))
print("Eigenvector:", nx.eigenvector_centrality(G, max_iter=1000))
pagerank = nx.pagerank(G)
print("PageRank:", pagerank)
# Compute a layout for drawing
pos = nx.spring_layout(G)
# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=700)
# Draw edges
nx.draw_networkx_edges(G, pos, edgelist=edges, arrowstyle='-|>', arrowsize=20)
# Draw labels
nx.draw_networkx_labels(G, pos, font_size=20, font_family="sans-serif")
plt.show()
### Explanation
1. **Create the graph:** Build a directed graph with `nx.DiGraph()` and add its nodes and edges.
2. **Compute betweenness centrality:** Use `nx.betweenness_centrality(G)`.
3. **Compute eigenvector centrality:** Use `nx.eigenvector_centrality(G)`.
4. **Compute PageRank:** Use `nx.pagerank(G)` to compute the PageRank for each node.
5. **Visualize:** Draw the nodes, edges, and labels with matplotlib.
By running this code, you will get the centrality measures and a visualization of the directed graph. This example provides a basic understanding of how to perform network analysis with betweenness centrality, eigenvector centrality, and PageRank in directed graphs using Python's `networkx` library.
Recommender Systems, Manual Curation, Recommending What’s Popular,
User-Based Collaborative Filtering, Item-Based Collaborative Filtering, Matrix
Factorization
Recommender systems are algorithms that provide users with suggestions for products or
services. They are widely used in various applications like online shopping, streaming
services, and social media. Here, we will explore different types of recommender systems,
including manual curation, recommending what’s popular, user-based collaborative
filtering, item-based collaborative filtering, and matrix factorization.
1. Manual Curation
2. Recommending What’s Popular
3. User-Based Collaborative Filtering
4. Item-Based Collaborative Filtering
5. Matrix Factorization
Manual Curation
Manual curation involves human experts selecting and recommending items to users. This
method relies heavily on domain expertise and is commonly used in editorial contexts.
Recommending What's Popular
This method recommends items that are popular among all users. It's a simple and effective technique, especially for new users for whom there is no historical data.
Python code
import pandas as pd
# Hypothetical ratings data: recommend the most frequently rated items
ratings = pd.DataFrame({"item": ["A", "B", "A", "C", "A", "B"],
                        "rating": [5, 4, 4, 3, 5, 5]})
popular = ratings["item"].value_counts()
print(popular.head())
Python code
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Toy user-item rating matrix (rows = users, columns = items)
user_item_matrix = np.array([[5, 3, 0], [4, 0, 4], [0, 3, 5]])
# Similarity between users, based on their rating vectors
user_similarity = cosine_similarity(user_item_matrix)
print(user_similarity)
Python code
# Compute item similarity matrix (transpose so items become rows)
item_similarity = cosine_similarity(user_item_matrix.T)
print(item_similarity)
Python code
import numpy as np
from sklearn.decomposition import TruncatedSVD
# Factorize the toy user-item matrix into 2 latent factors
svd = TruncatedSVD(n_components=2)
user_factors = svd.fit_transform(user_item_matrix)
print(user_factors.shape)
Conclusion
Recommender systems are essential tools for personalizing user experiences.
Each method has its own advantages and use cases. For instance, manual curation is
effective for editorial content, while collaborative filtering methods are powerful for
leveraging user interactions. Matrix factorization techniques are particularly effective for
dealing with sparse data. Combining these methods can often lead to even better
recommendation systems.