Starbucks Review
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob
import warnings
import re
import matplotlib.dates as mdates
In [71]:
starbucks_data = pd.read_csv('/kaggle/input/starbucks-reviews-dataset/reviews_da
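[Output: first rows of starbucks_data, cut off by the page export. Cells in order: index, reviewer, location, date, rating, review text, image links.]
1  Courtney  ...          Reviewed July 16, 2023  5.0  ** at the Starbucks by the fire station on ...     ['No Images']
2  Daynelle  ... Twp, PA  Reviewed July 5, 2023   5.0  I just wanted to go out of my way to ...           ['...
3  Taylor    ...          Reviewed May 26, 2023   5.0  Me and my friend were at Starbucks and my ...      ['No Images']
4  Tenessa   ...          Reviewed Jan. 22, 2023  5.0  I’m on this kick of drinking 5 cups of warm wa...  ['...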
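The basic statistical overview of the dataset reveals the following:
Total Reviews: The dataset contains 850 reviews.
Date Range: The reported range runs from April 1, 2010 to September 9, 2017. These values are the minimum and maximum of the raw 'Date' strings, so the comparison is alphabetical rather than chronological. The sample rows include reviews from 2023, so the true range is wider and can only be read off once the column is parsed into datetimes.
Ratings: The average rating is approximately 1.87 (on a scale of 1 to 5). The minimum rating given is 1, and the maximum is 5.
This overview indicates a relatively low average rating, which points to predominantly negative reviews.
However, we should explore further to confirm this.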
In [72]:
# Basic statistical overview
basic_stats = {
    "Total Reviews": starbucks_data.shape[0],
    "Unique Locations": starbucks_data['location'].nunique(),
    # 'Date' still holds raw strings here, so min/max compare alphabetically
    "Date Range": (starbucks_data['Date'].min(), starbucks_data['Date'].max()),
    "Rating": {
        "Average Rating": starbucks_data['Rating'].mean(),
        "Min Rating": starbucks_data['Rating'].min(),
        "Max Rating": starbucks_data['Rating'].max()
    }
}
basic_stats
{'Total Reviews': 850,
'Unique Locations': 633,
'Date Range': ('Reviewed April 1, 2010', 'Reviewed Sept. 9, 2017'),
'Rating': {'Average Rating': 1.8709219858156028,
'Min Rating': 1.0,
'Max Rating': 5.0}}
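The 'Date Range' values in the output above still carry the 'Reviewed' prefix, which shows that min() and max() ran on raw strings. The following is a minimal sketch of how string order and date order can disagree. The sample list and the helper name parse_review_date are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime

# Hypothetical sample of raw 'Date' strings in the format shown above
raw_dates = [
    "Reviewed July 16, 2023",
    "Reviewed April 1, 2010",
    "Reviewed May 26, 2023",
    "Reviewed July 5, 2023",
]

def parse_review_date(raw):
    """Strip the 'Reviewed' prefix and parse a 'Month day, year' string."""
    text = raw.replace("Reviewed", "").strip()
    return datetime.strptime(text, "%B %d, %Y")

# String comparison follows alphabetical order of the month names
string_range = (min(raw_dates), max(raw_dates))

# Datetime comparison follows chronological order
parsed = [parse_review_date(d) for d in raw_dates]
date_range = (min(parsed), max(parsed))

print("string max:", string_range[1])        # the 'May' entry, since 'M' sorts after 'J'
print("date max:  ", date_range[1].date())   # the July 16 entry
```

Abbreviated months such as "Sept." or "Jan." would need to be expanded before this format string can parse them.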
We can use TextBlob for sentiment analysis. TextBlob is a simpler tool than VADER but is still effective
for basic sentiment analysis. It provides a polarity score that ranges from -1 (very negative) to 1 (very positive).
In [73]:
# Function to classify sentiment using TextBlob
def classify_sentiment_textblob(row):
    sentiment = TextBlob(row).sentiment.polarity
    if sentiment > 0:
        return 'Positive'
    elif sentiment < 0:
        return 'Negative'
    return 'Neutral'
Positive 52.000000
Negative 40.941176
Neutral 7.058824
Name: proportion, dtype: float64
The histogram shows the distribution of Starbucks ratings in the dataset. Low ratings dominate, and most of them
are 1-star ratings. This contrasts with the sentiment analysis results, where 52% of the reviews were classified
as positive. One possible explanation is that the text of a low-rated review is often nuanced or mixed, so its
polarity score ends up slightly above zero, and the classifier above labels any score greater than 0 as
positive.
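One way to examine the gap between ratings and sentiment labels is to cross-tabulate them. The sketch below uses hypothetical (rating, polarity) pairs in place of the real columns and applies the same sign-based thresholds as classify_sentiment_textblob; the names reviews, label, and crosstab are illustrative.

```python
from collections import Counter

# Hypothetical (rating, polarity) pairs standing in for the real columns
reviews = [
    (1.0, 0.12), (1.0, -0.30), (1.0, 0.05), (2.0, 0.00),
    (5.0, 0.60), (4.0, 0.25), (1.0, -0.10), (1.0, 0.20),
]

def label(polarity):
    """Same thresholds as classify_sentiment_textblob: the sign of the score."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"

# Count each (rating, sentiment label) combination
crosstab = Counter((rating, label(polarity)) for rating, polarity in reviews)

for (rating, sentiment), count in sorted(crosstab.items()):
    print(f"{rating:.0f} stars  {sentiment:<8}  {count}")

# Share of 1-star reviews whose text was still labelled positive
one_star = [label(p) for r, p in reviews if r == 1.0]
positive_share = one_star.count("Positive") / len(one_star)
print(f"1-star reviews labelled positive: {positive_share:.0%}")
```

If a large share of 1-star reviews sits just above zero polarity, a wider neutral band around zero may describe the data better than a strict sign test.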
In [75]:
# Function for manual tokenization
def manual_tokenize_for_frequency(text):
    # Splitting the text into words using regular expressions
    words = re.findall(r'\b\w+\b', text.lower())
    return words
[('the', 3374),
('i', 3346),
('to', 2366),
('and', 2363),
('a', 1832),
('my', 1124),
('it', 1057),
('starbucks', 1055),
('of', 1014),
('was', 996),
('in', 960),
('for', 806),
('that', 798),
('they', 716),
('is', 713),
('me', 633),
('at', 627),
('on', 623),
('have', 613),
('coffee', 597)]
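The counting step that produced the list above is not shown in the export. A minimal sketch of how such a frequency table could be built with collections.Counter follows; the sample reviews are hypothetical stand-ins for starbucks_data['Review'].

```python
import re
from collections import Counter

def manual_tokenize_for_frequency(text):
    # Same regular expression as the function above
    return re.findall(r'\b\w+\b', text.lower())

# Hypothetical reviews standing in for starbucks_data['Review']
sample_reviews = [
    "The coffee was cold and the service was slow.",
    "I love the coffee at this Starbucks!",
]

# Update one Counter per review instead of concatenating all token lists first
word_counts = Counter()
for review in sample_reviews:
    word_counts.update(manual_tokenize_for_frequency(review))

print(word_counts.most_common(3))
```

Updating a single Counter avoids building one very long intermediate list, which matters more as the number of reviews grows.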
In [76]:
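# Ensuring necessary NLTK resources are available
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_tokenize(text):
    # Lowercase, tokenize, and keep alphabetic tokens that are not stop words
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return tokens

# Apply the function to the review column and concatenate all tokens
all_words = sum(starbucks_data['Review'].apply(clean_tokenize), [])
# Twenty most frequent remaining words (the form of the output below)
Counter(all_words).most_common(20)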
[('starbucks', 1051),
('coffee', 595),
('customer', 286),
('one', 279),
('get', 276),
('drink', 275),
('store', 259),
('service', 255),
('time', 237),
('like', 224),
('order', 218),
('said', 217),
('go', 214),
('would', 213),
('card', 197),
('went', 180),
('asked', 164),
('back', 163),
('told', 159),
('ordered', 150)]
1. Temporal Analysis:
We'll examine how ratings and sentiments have changed over time. This can provide insights into whether
customer experiences have improved or declined. For the temporal analysis, we first need to parse the 'Date'
column into a datetime format. Then, we can analyze the trends in ratings and sentiments over time.
2. Location-based Insights:
We'll explore whether there are notable differences in ratings or sentiments across different locations.
In [77]:
import calendar
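[Output: first rows of starbucks_data, cut off by the page export. Cells in order: index, reviewer, location, date, rating, review text, image links, and the start of a further column.]
0  Helen     Wichita Falls, TX  ... Sept. 13, ...       5.0  Amber and LaDonna at the Starbucks ...             ['No Images']  S...
1  Courtney  ...                Reviewed July 16, 2023  5.0  ** at the Starbucks by the fire station on ...     ['No Images']
2  Daynelle  ... Twp, PA        Reviewed July 5, 2023   5.0  I just wanted to go out of my way to ...           ['...
3  Taylor    Seattle, WA        ... May 26, ...         5.0  Me and my friend were at Starbucks and my ...      ['No Images']
4  Tenessa   ...                Reviewed Jan. 22, 2023  5.0  I’m on this kick of drinking 5 cups of warm wa...  ['...           J...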
In [78]:
# Correcting the function for replacing abbreviated month names
def replace_month_abbr_corrected(date_str):
    for abbr, full in month_abbr_to_full.items():
        # Replace only if the abbreviation is followed by a period
        if f"{abbr}." in date_str:
            return date_str.replace(f"{abbr}.", full)
    return date_str
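[Output: rows 2 to 4 of starbucks_data, cut off by the page export. Cells in order: index, reviewer, location, date, rating, review text, image links, and the start of a further column.]
2  Daynelle  ... Twp, PA  Reviewed July 5, 2023   5.0  I just wanted to go out of my way to ...           ['...
3  Taylor    Seattle, WA  ... May 26, ...         5.0  Me and my friend were at Starbucks and my ...      ['No Images']
4  Tenessa   ...          Reviewed Jan. 22, 2023  5.0  I’m on this kick of drinking 5 cups of warm wa...  ['...           J...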
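The mapping month_abbr_to_full used by replace_month_abbr_corrected is not defined in the exported cells. The sketch below shows one way it could be built from the calendar module, assuming an English locale. It adds "Sept" by hand because the data uses "Sept." while calendar.month_abbr provides "Sep". The helper parse_review_date is an illustrative name, not the notebook's.

```python
import calendar
from datetime import datetime

# Map three-letter abbreviations to full names, e.g. 'Jan' -> 'January'
# (assumes the default English locale)
month_abbr_to_full = {
    calendar.month_abbr[i]: calendar.month_name[i] for i in range(1, 13)
}
# The data writes 'Sept.', which the three-letter 'Sep' key would not match
month_abbr_to_full["Sept"] = "September"

def replace_month_abbr_corrected(date_str):
    for abbr, full in month_abbr_to_full.items():
        # Replace only if the abbreviation is followed by a period
        if f"{abbr}." in date_str:
            return date_str.replace(f"{abbr}.", full)
    return date_str

def parse_review_date(raw):
    """Drop the 'Reviewed' prefix, expand the month name, parse to datetime."""
    cleaned = replace_month_abbr_corrected(raw.replace("Reviewed", "").strip())
    return datetime.strptime(cleaned, "%B %d, %Y")

print(parse_review_date("Reviewed Sept. 9, 2017"))
print(parse_review_date("Reviewed Jan. 22, 2023"))
print(parse_review_date("Reviewed July 16, 2023"))
```

Month names that are already written in full, such as "July" or "May", contain no trailing period and pass through unchanged.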
In [79]:
# Trim the dataset to start from 2010
trimmed_data = starbucks_data[starbucks_data['Parsed_Date'] >= '2010-01-01']
In [80]:
# Get sentiment polarity for each review
starbucks_data['sentiment'] = starbucks_data['Review'].apply(lambda x: TextBlob
In [82]:
# Check the DataFrame structure after groupby and unstack
# Now 'Parsed_Date' is the index, filter the DataFrame to exclude dates before 200
sentiment_proportions = sentiment_proportions[sentiment_proportions.index >= '20
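The groupby and unstack step that builds sentiment_proportions is not visible in the export. As a rough illustration of the idea, the sketch below computes the share of each sentiment label per year in plain Python. The sample pairs and the yearly granularity are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (parsed date, sentiment label) pairs
labelled = [
    (date(2015, 3, 1), "Negative"), (date(2015, 8, 9), "Positive"),
    (date(2016, 1, 5), "Negative"), (date(2016, 6, 2), "Negative"),
    (date(2016, 9, 30), "Neutral"), (date(2016, 11, 11), "Positive"),
]

# Count labels within each year
counts_by_year = defaultdict(Counter)
for day, sentiment in labelled:
    counts_by_year[day.year][sentiment] += 1

# Convert counts to proportions that sum to 1 within each year
sentiment_proportions = {
    year: {s: n / sum(counts.values()) for s, n in counts.items()}
    for year, counts in sorted(counts_by_year.items())
}

for year, shares in sentiment_proportions.items():
    print(year, shares)
```

Proportions from periods with very few reviews swing widely, so it helps to report the review count next to each share.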
LatentDirichletAllocation(n_components=5, random_state=0)
In [87]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
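# Tokenizing text
tokenizer = Tokenizer(num_words=5000)
# Build the vocabulary first; without this step texts_to_sequences returns empty sequences
tokenizer.fit_on_texts(starbucks_data['Review'])
X = tokenizer.texts_to_sequences(starbucks_data['Review'])
X = pad_sequences(X, maxlen=200)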
Amber and LaDonna at the Starbucks on Southwest Parkway are always so warm a
nd welcoming . There is always a smile in their voice when they greet you at
the drive-thru. And their customer service is always spot-on .
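To make the tokenizing and padding step above more concrete, here is a plain-Python sketch of the general idea: words are ranked by frequency, each review becomes a list of integer indices, and every list is left-padded or trimmed to a fixed length. It illustrates the concept only and is not the Keras implementation, which differs in details such as punctuation filtering. All function names are hypothetical.

```python
from collections import Counter

def build_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent, 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common(num_words)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(text, word_index):
    """Replace each known word with its index and drop unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad(sequence, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

texts = ["great coffee great service", "slow service"]
word_index = build_word_index(texts, num_words=10)
X = [pad(encode(t, word_index), maxlen=6) for t in texts]
print(word_index)
print(X)
```

Padding on the left keeps the final words of a review next to the end of the sequence, which is where a recurrent layer such as an LSTM produces its last state.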
In [89]:
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.tokenize import word_tokenize
# LDA model
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)
# Display topics
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
Topic: 0
Words: 0.050*"." + 0.042*"the" + 0.036*"i" + 0.027*"and" + 0.024*"to" + 0.022
*"," + 0.021*"a" + 0.014*"of" + 0.013*"it" + 0.012*"in"
Topic: 1
Words: 0.053*"." + 0.041*"i" + 0.037*"the" + 0.028*"to" + 0.027*"," + 0.025
*"and" + 0.017*"a" + 0.013*"starbucks" + 0.011*"in" + 0.010*"it"
Topic: 2
Words: 0.035*"the" + 0.027*"," + 0.023*"." + 0.021*"i" + 0.018*"a" + 0.017*"t
o" + 0.013*"and" + 0.012*"of" + 0.010*"starbucks" + 0.008*"for"
Topic: 3
Words: 0.031*"," + 0.019*"the" + 0.016*"." + 0.014*"to" + 0.011*"!" + 0.009
*"of" + 0.008*"they" + 0.007*"starbucks" + 0.007*"you" + 0.006*"and"
Topic: 4
Words: 0.053*"." + 0.040*"i" + 0.033*"the" + 0.030*"," + 0.029*"and" + 0.028
*"to" + 0.023*"a" + 0.019*"my" + 0.019*"was" + 0.013*"it"
The topics from the LDA (Latent Dirichlet Allocation) output are dominated by common English words (like "the",
"and", "to") and punctuation marks (commas and periods) rather than meaningful, topic-specific terms. This is a
common issue in topic modeling when working with raw text. Such common words are often referred to as
"stop words" and typically contribute little to understanding a text's content. The following preprocessing steps
address this:
1. Remove Stop Words: Common words such as "the", "and", and "to" should be filtered out so that they do not
dominate the topics.
2. Remove Punctuation: Punctuation marks (like commas and periods) should be removed from the text as they
don't contribute to topic modeling.
3. Lemmatization/Stemming: These processes reduce words to their base or root form, which helps consolidate
similar words.
def preprocess(text):
    # Tokenize and lower case
    tokens = word_tokenize(text.lower())
    # Remove stopwords and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in
    return tokens
# LDA model
lda_model = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)
# Display topics
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Topic: 0
Words: 0.022*"starbucks" + 0.019*"card" + 0.008*"``" + 0.008*"''" + 0.008*"wo
uld" + 0.008*"n't" + 0.007*"one" + 0.007*"said" + 0.007*"customer" + 0.006*"s
Topic: 1
Words: 0.021*"starbucks" + 0.014*"’" + 0.009*"said" + 0.009*"``" + 0.008*"''"
+ 0.008*"coffee" + 0.008*"one" + 0.008*"order" + 0.007*"get" + 0.007*"n't"
Topic: 2
Words: 0.022*"starbucks" + 0.017*"coffee" + 0.009*"like" + 0.008*"one" + 0.00
8*"'s" + 0.008*"n't" + 0.007*"time" + 0.006*"would" + 0.006*"store" + 0.006
Topic: 3
Words: 0.023*"starbucks" + 0.020*"coffee" + 0.010*"n't" + 0.009*"drink" + 0.0
09*"get" + 0.006*"'s" + 0.005*"would" + 0.005*"go" + 0.005*"service" + 0.004
Topic: 4
Words: 0.028*"starbucks" + 0.014*"coffee" + 0.011*"customer" + 0.010*"servic
e" + 0.008*"n't" + 0.007*"store" + 0.007*"drink" + 0.006*"get" + 0.006*"'s" +
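The topics above still contain tokenizer residue such as ``, '', n't, 's and ’. A check against single punctuation characters does not catch multi-character tokens like these. One possible remedy, sketched below on a hypothetical token list, is the str.isalpha() filter that clean_tokenize already uses earlier in the notebook.

```python
# Hypothetical token list resembling tokenizer output after stop-word removal
tokens = ["starbucks", "``", "card", "''", "n't", "'s", "’", "coffee", "!", "service"]

def keep_alphabetic(tokens):
    """Keep only tokens made up entirely of letters."""
    return [token for token in tokens if token.isalpha()]

clean_tokens = keep_alphabetic(tokens)
print(clean_tokens)
```

The trade-off is that this filter also drops tokens containing digits or hyphens, which may or may not matter for the reviews.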
The updated topics from the LDA model give a clearer view of the themes present in the Starbucks reviews. The
keyword lists below differ in order and content from the (truncated) output shown above, which suggests they come
from a different run; the LdaModel call sets no random seed, so topics can change between runs. Each topic is
interpreted from its highest-weighted words:
Keywords: "starbucks", "coffee", "drink", "go", "cup", "get", "'s", "ordered", "order", "time"
Interpretation: This topic seems to revolve around ordering and drinking coffee at Starbucks. Words like "go",
"cup", "get", "ordered", and "order" suggest a focus on the purchasing process and the products themselves.
Keywords: "coffee", "starbucks", "'s", "like", "n't", "store", "caramel", "time", "get", "would"
Interpretation: This topic might be related to customers' opinions on the quality and variety of coffee, including
specific preferences (like "caramel"). The presence of "like", "n't" (the negation split off from contractions such
as "don't"), and "would" indicates a mix of preferences and opinions.
Keywords: "starbucks", "’", "card", "coffee", "store", "service", "go", "like", "n't", "customer"
Interpretation: The focus here appears to be on service, with mentions of "card" (possibly related to loyalty
cards or payments), "store", and "service". The word "customer" points to the customer experience.
Keywords: "starbucks", "n't", "customer", "get", "coffee", "one", "service", "would", "store", "drink"
Interpretation: This topic also relates to customer service and the in-store experience. It includes elements
of customer interactions and service quality, with words like "customer", "service", and "store".
Keywords: "starbucks", "customer", "n't", "coffee", "said", "``", "''", "one", "drink", "time"
Interpretation: This topic may involve customer interactions, possibly including complaints or specific incidents:
"said" together with the quotation-mark tokens "``" and "''" suggests reported speech. It reflects discussions
about customer experiences and interactions at Starbucks.
In [91]:
def preprocess_date(date_str):
    # Removing the prefix 'Reviewed' and stripping any leading/trailing whitespace
    date_str = date_str.replace('Reviewed', '').strip()
# Plot
plt.title('Sentiment Over Time')
plt.ylabel('Average Sentiment')
Time Span:
The x-axis represents time, spanning from the year 2000 to just beyond 2024, indicating that the data covers a
period of approximately 25 years.
Sentiment Score:
The y-axis indicates the average sentiment score, which seems to range from -1 to 1. This is a typical range for
sentiment scores, where -1 represents a very negative sentiment, 1 represents a very positive sentiment, and 0
represents a neutral sentiment.
Volatility:
The plot shows a considerable amount of volatility in sentiment over time, with sentiment scores fluctuating
frequently between positive and negative values. This could suggest varying customer satisfaction over time or
different responses to products or services.
Trend Analysis:
There is no clear long-term trend upwards or downwards, indicating that there hasn't been a significant long-
term improvement or decline in sentiment.
Data Density:
There are periods where the data points become denser, visible by the thicker clustering of the line. This could
indicate more frequent reviews or more consistent sentiment during these periods.
When interpreting this chart, it would be important to consider external factors that could influence sentiment
scores, such as marketing campaigns, product launches, social events, or changes in customer service policies.
Additionally, the source and preprocessing of the text data, the sentiment analysis method used, and the volume of
data points collected on each date can all significantly impact the interpretation of the results.
In [93]:
import seaborn as sns
Y-Axis (Rating):
The vertical axis represents the rating associated with each review. Ratings are on a scale from 1 to 5, which
is a common rating system for reviews.
Data Distribution:
There is a cluster of data points across all rating levels with shorter review lengths, suggesting that many users
leave shorter reviews regardless of the rating they give. For longer reviews, there appears to be a concentration
of points towards the higher ratings (4 and 5). This could imply that customers who leave more detailed
feedback tend to give higher ratings, or are more engaged and possibly more satisfied.
There are a few longer reviews with lower ratings (1 and 2). These might be cases where customers provided
detailed negative feedback.
No Clear Trend:
There doesn't appear to be a clear or strong linear relationship between the length of the review and the rating
given. While there is some indication that longer reviews might correlate with higher ratings, the data points are
quite dispersed, indicating variability.
Data Density:
The chart shows a high density of reviews with fewer characters across all ratings, indicating that most
customers tend to leave brief reviews.
From this scatter plot, it is not possible to definitively conclude that longer reviews correlate with higher ratings,
although there is some suggestion of this in the data. For a more detailed analysis, statistical methods could be
used to calculate the correlation coefficient or fit a regression line to the data. Additionally, sentiment analysis on
the review text could provide further insights into the relationship between the content of the reviews, their length,
and the ratings.
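The correlation coefficient mentioned above can be computed directly. Below is a minimal sketch of the Pearson coefficient in plain Python; the review lengths and ratings are hypothetical values, not taken from the dataset.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical review lengths (characters) and ratings
review_lengths = [120, 450, 80, 900, 300]
ratings = [1, 2, 5, 1, 4]

r = pearson_correlation(review_lengths, ratings)
print(f"Pearson r = {r:.3f}")
```

Because ratings are ordinal and review lengths are typically skewed, a rank-based measure such as Spearman's coefficient may be a better fit; the Pearson value is only a first check.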
Location-Based Analysis
Analyze the ratings and sentiments by location to see if customer satisfaction varies geographically.
In [94]:
import seaborn as sns
import matplotlib.pyplot as plt
# Sort and take the top 10 locations with the highest average rating
top_locations = average_rating_per_location.sort_values(ascending=False).head(1
# Set plot title and labels with increased font sizes for readability
plt.title('Top 10 Locations with Highest Average Rating', fontsize=16)
plt.xlabel('Location', fontsize=14)
plt.ylabel('Average Rating', fontsize=14)
# Show plot
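With 633 unique locations across 850 reviews, most locations have only one or two reviews, so a single 5-star review can put a location at the top of the ranking. The sketch below shows one possible safeguard, a minimum review count per location. The threshold MIN_REVIEWS and the sample rows are assumptions.

```python
from collections import defaultdict

# Hypothetical (location, rating) pairs
rows = [
    ("Seattle, WA", 5.0), ("Seattle, WA", 1.0), ("Seattle, WA", 3.0),
    ("Wichita Falls, TX", 5.0),
    ("Austin, TX", 2.0), ("Austin, TX", 5.0),
]

MIN_REVIEWS = 2  # assumed threshold; tune to the data

# Collect all ratings per location
ratings_by_location = defaultdict(list)
for location, rating in rows:
    ratings_by_location[location].append(rating)

# Average only locations that have enough reviews
average_rating_per_location = {
    location: sum(ratings) / len(ratings)
    for location, ratings in ratings_by_location.items()
    if len(ratings) >= MIN_REVIEWS
}

# Ten best locations by average rating
top_locations = sorted(
    average_rating_per_location.items(), key=lambda item: item[1], reverse=True
)[:10]
print(top_locations)
```

Without the threshold, the location with a single 5-star review would rank first in this sample even though it carries the least evidence.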
Topic Analysis by Ratings
For topic modeling based on high and low ratings, we'll use the Latent Dirichlet Allocation (LDA) model from the
Gensim library. We'll first separate the reviews into high-rated and low-rated groups, preprocess the text, and
then apply LDA to each group to identify prevalent topics.
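The split into high-rated and low-rated groups could look like the sketch below. The cut-offs (4 and above counts as high, 2 and below as low) and the sample reviews are assumptions; the notebook's own thresholds are not visible in the export.

```python
HIGH_THRESHOLD = 4.0  # assumed: ratings of 4 and 5 count as high
LOW_THRESHOLD = 2.0   # assumed: ratings of 1 and 2 count as low

# Hypothetical (rating, review text) pairs
reviews = [
    (5.0, "Friendly staff and great coffee."),
    (1.0, "My gift card balance disappeared."),
    (3.0, "Average drink, long wait."),
    (2.0, "Order was wrong twice."),
    (4.0, "Quick drive-thru service."),
]

def split_by_rating(reviews, low=LOW_THRESHOLD, high=HIGH_THRESHOLD):
    """Return (high_rated_texts, low_rated_texts); ratings in between are left out."""
    high_rated = [text for rating, text in reviews if rating >= high]
    low_rated = [text for rating, text in reviews if rating <= low]
    return high_rated, low_rated

high_rated, low_rated = split_by_rating(reviews)
print(len(high_rated), "high-rated,", len(low_rated), "low-rated")
```

Leaving out 3-star reviews keeps the two groups clearly separated. Given the low average rating reported earlier, the high-rated group is likely to be much smaller, which limits how stable its topics will be.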
In [95]:
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string