CS363 Spring 2021 Homework 2
Context
This homework is to help familiarize you with the systems primer material from the first
lecture and to give you a sense of the scale of working with data. A reminder that homework
is optional this quarter.
Problem 1
You are designing a system to crowdsource student evaluations of college courses (very
much like Carta). As part of this system, you want to store a table of reviews containing the
following information (for each review):
• Academic Quarter - char[10]
• Course ID - char[5]
Question 1.1
What is the size of each row in bytes?
Question 1.2
Assume that this data is stored on a hard disk in disk blocks, and that the disk blocks are
grouped into DB blocks.
Question 1.3
How many rows can be stored per DB block?
Question 1.4
Relevant information:
• According to the registrar, this quarter there are 16,500 students (undergrad and
graduate) enrolled at Stanford.
• Of these, about 1500 are grad students finishing their dissertations, which will be
excluded.
• Of the remainder, some are grad students who might be doing research full-time and
not taking classes. Let’s ignore this and approximate that there are exactly 15,000
class-taking students at Stanford.
• Let’s assume that on average each class-taking student takes 3 classes per quarter
(including the summer quarter).
• Let us also assume that writing course evaluations is optional and 50% of students who
take a course will write an evaluation. (Note: If a student chooses not to write an
evaluation, no row is added to the table.)
After 10 years (40 quarters), how large in MB will the table of course reviews be? Round
your answer to 1 decimal place.
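The enrollment assumptions above fix the row count; only your row size from Question 1.1 is needed to finish. A minimal sketch of the arithmetic, where ROW_BYTES is a placeholder for your 1.1 answer:

```python
# Sketch of the Question 1.4 estimate. The student counts come from the
# problem statement; ROW_BYTES is a placeholder for your own answer.
STUDENTS = 15_000
CLASSES_PER_STUDENT = 3
EVAL_RATE = 0.5          # half of enrollments produce a review row
QUARTERS = 40            # 10 years, including summers
ROW_BYTES = 64           # placeholder: substitute your Question 1.1 row size

rows_per_quarter = STUDENTS * CLASSES_PER_STUDENT * EVAL_RATE
total_rows = rows_per_quarter * QUARTERS
table_mb = total_rows * ROW_BYTES / 1_000_000
print(f"{total_rows:,.0f} rows, {table_mb:.1f} MB")
```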
Question 1.5
How many DB blocks would be needed to store the table of course reviews?
Question 1.6
Now, let’s say you would like to read from your table and retrieve a particular evaluation
(row). Let’s make the following assumptions about the hardware that is used to store the
table:
How long would it take in hours to retrieve an evaluation (row) if the table rows are stored
randomly on disk (we must seek and scan every row)? Round your answer to 1 decimal
place.
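A hedged sketch of the seek-and-scan estimate. The 10 ms seek time below is an assumed figure, not from this handout; substitute the hardware numbers given in the instructions, and the row count from Question 1.4.

```python
# Sketch: expected time to find one row when every row sits at a random
# disk location, so each candidate row costs a full seek.
SEEK_S = 0.010       # assumed 10 ms average seek; use the course's figure
ROWS = 900_000       # total rows from Question 1.4

worst_case_s = ROWS * SEEK_S        # seek to every row
average_s = worst_case_s / 2        # on average the row is found halfway
print(average_s / 3600, "hours")
```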
Question 1.7
How long would it take in seconds if the rows are grouped in disk blocks (which are
randomly stored on disk)? Round your answer to 3 decimal places.
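The same lookup can be sketched with rows packed into DB blocks, so one seek fetches a whole block of rows. ROWS_PER_BLOCK is a placeholder for your Question 1.3 answer, and the seek time is again an assumed figure:

```python
# Sketch: block-at-a-time scan instead of row-at-a-time, so the seek
# count drops by the blocking factor.
SEEK_S = 0.010           # assumed 10 ms average seek
ROWS = 900_000
ROWS_PER_BLOCK = 1_000   # placeholder: your Question 1.3 answer

blocks = ROWS / ROWS_PER_BLOCK
average_s = blocks * SEEK_S / 2     # average scan covers half the blocks
print(round(average_s, 3), "seconds")
```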
Problem 2
You’re the leader of the Infrastructure team of a popular-enough startup. You’re concerned
with performance metrics for a particular table that is queried frequently. The table has the
following specs:
• RAM: 64GB
The system receives numerous queries each second; each query consists of fetching some
random row of the table. For the purpose of this problem, let’s assume parsing and
transferring queries take zero time. We’ll also assume seeks in RAM take zero time.
Note: Remember that the average time for finding a record during a full scan is half of the
maximum time.
Question 2.1
The current architecture involves having all the data stored on the system’s hard disk. All
rows are randomly stored on the disk. Any fetch queries need to fetch the associated data
from there.
What is the average response time in secs for a query, i.e., time to fetch a row? Assume that
all rows are equally likely to be queried.
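A sketch of the average-case fetch, using the note above that the average full-scan time is half the maximum. The row count and seek time are placeholders standing in for the table specs and hardware figures given in the instructions:

```python
# Sketch: with rows at random disk locations, a fetch scans row by row,
# paying one seek per row, and on average finds the target halfway.
ROWS = 10_000_000    # placeholder: row count from the table specs
SEEK_S = 0.010       # assumed 10 ms average seek

avg_response_s = ROWS * SEEK_S / 2
print(avg_response_s, "seconds")
```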
Question 2.2
While you’re surprised at the high response time, an Analytics expert drops by your office.
She mentions that a fixed 1% of the table rows are responsible for 90% of the query traffic.
Would you suggest any change to the current architecture given this information? State your
suggestion concisely. What’s the average response time in secs after your suggestion?
Notes:
• Assume the table rows in the described 1% are equally responsible for 90% of the
query traffic. The remaining 99% of rows have equally likely queries.
• In order to reduce calculations, assume the architecture change we’re looking for does
not involve DB Blocks. Try to think of another way to reduce response time using other
resources.
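One way to sketch the suggested change: pin the hot 1% of rows in the 64 GB of RAM (a fixed, known set that should fit comfortably) and serve the cold rows from disk as before. RAM lookups are free per the problem's assumptions; DISK_AVG_S is a placeholder for your Question 2.1 result.

```python
# Sketch: weighted average of RAM hits and disk misses after caching
# the hot 1% of rows in RAM.
HOT_TRAFFIC = 0.90        # share of queries hitting the cached 1%
RAM_S = 0.0               # seeks in RAM take zero time (problem assumption)
DISK_AVG_S = 50_000.0     # placeholder: your Question 2.1 answer

avg_s = HOT_TRAFFIC * RAM_S + (1 - HOT_TRAFFIC) * DISK_AVG_S
print(avg_s, "seconds")
```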
Problem 3
Imagine you are designing a table to store recent transactions for an online shopping
platform and there are 1 trillion transactions. You want to record the following information:
• user id
• user name
• item id
• item name
• transaction id
• amount of money ($) for the transaction (e.g. $4.11, $670.50, etc)
Assume there are 1 billion users and 1 billion items for sale on the platform. The longest
user and item name strings contain 64 characters. Choose from the data types listed below:
byte, short, int, long, float, double, boolean, char.
Question 3.1
What is the size of each row in bytes? Work out the size of each column by selecting the
most space-efficient data type that fits it.
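The type-selection logic can be sketched by comparing each required ID space against the capacity of the fixed-width integer types (shown here as unsigned capacities; the signed ranges lead to the same picks for these counts):

```python
# Sketch: pick the narrowest integer type that can hold a given number
# of distinct IDs, using the widths of byte/short/int/long.
CAPACITY = {"byte": 2**8, "short": 2**16, "int": 2**32, "long": 2**64}

def smallest_int_type(distinct_values):
    """Return the narrowest listed type that can hold this many IDs."""
    for name in ("byte", "short", "int", "long"):
        if distinct_values <= CAPACITY[name]:
            return name

print(smallest_int_type(10**9))    # 1 billion users / items
print(smallest_int_type(10**12))   # 1 trillion transactions
```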
Question 3.2
What data type should you use for each column? Answer with one of the following data
types: byte, short, int, long, float, double, boolean, char. You do not need to include the
array length; for instance, answer char rather than char[10].
What is the most appropriate data type for the following column: User ID?
Question 3.3
What is the most appropriate data type for the following column: User Name?
Question 3.4
What is the most appropriate data type for the following column: Item ID?
Question 3.5
What is the most appropriate data type for the following column: Item Name?
Question 3.6
What is the most appropriate data type for the following column: Transaction ID?
Question 3.7
What is the most appropriate data type for the following column: Amount of money?
Question 3.8
What is the size of the table in TB? (1 point)
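A one-line sketch of the total: 1 trillion rows times the per-row size, where ROW_BYTES is a placeholder for your Question 3.1 answer.

```python
# Sketch: total table size in TB for Problem 3.
ROWS = 10**12
ROW_BYTES = 100        # placeholder: substitute your Question 3.1 row size

table_tb = ROWS * ROW_BYTES / 10**12
print(table_tb, "TB")
```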
Problem 4
This question follows from question 3. For this question, assume that the size of the table is
200 TB.
Question 4.1
How long in seconds will it take to read the whole table from RAM?
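A sketch of the sequential RAM read. The bandwidth below is an assumed figure; substitute the number given in the instructions.

```python
# Sketch: time to stream 200 TB out of RAM at a fixed bandwidth.
TABLE_BYTES = 200 * 10**12
RAM_BYTES_PER_S = 20 * 10**9   # assumed 20 GB/s RAM bandwidth

seconds = TABLE_BYTES / RAM_BYTES_PER_S
print(seconds, "seconds")
```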
Question 4.2
How long in days (round to nearest integer) will it take to read the whole table from disk if
each row of the table is stored randomly in the disk?
Question 4.3
How long in days (round to nearest integer) will it take to read the whole table from disk if
the table is stored in DB blocks? (1 point)
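Questions 4.2 and 4.3 contrast nicely in one sketch: row-at-a-time reads pay one seek per row, while block-at-a-time reads pay one seek per DB block (transfer time is ignored here to keep the contrast visible). All sizes and the seek time are assumed placeholders; use the course's figures.

```python
# Sketch: disk read time for 200 TB, row by row versus block by block.
TABLE_BYTES = 200 * 10**12
ROW_BYTES = 100              # placeholder: row size from Problem 3
BLOCK_BYTES = 64 * 10**6     # assumed 64 MB DB block
SEEK_S = 0.010               # assumed 10 ms average seek
DAY_S = 86_400

random_days = (TABLE_BYTES / ROW_BYTES) * SEEK_S / DAY_S
blocked_days = (TABLE_BYTES / BLOCK_BYTES) * SEEK_S / DAY_S
print(round(random_days), "days vs", round(blocked_days), "days")
```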
Question 4.4
What is the cost in dollars for saving the table in RAM? Assume RAM costs $6000/TB.
Question 4.5
What is the cost in dollars for saving the table on disk? Assume disk space costs $100/TB.
Problem 5
You have decided to start a new e-commerce site that you anticipate will host billions of
products and you hope millions of users. You realize you will need a database system to keep
track of all your data.
Question 5.1
What tables might you need for this? For example, a table to log each order a user placed for
a product might be a good idea. List at least two other tables that you want to include in your
design. (There are many correct answers).
Question 5.2
Let’s calculate the size of one of our tables. Consider the above example of a table to log
orders. We would like to keep track of the following:
• Order ID: int64
• Product ID: see part 5.2
• User ID: see part 5.3
• Quantity: int32
• Timestamp: 4 bytes
• IP address: 4 bytes
• Mailing address: char[100]
We want the ability to host 5 billion products. How many bits should the Product ID be to
store a unique ID for each product?
Given our answer above, what data type should we use?
• tinyint (1 byte)
• smallint (2 bytes)
• int (int32 – 4 bytes)
• bigint (int64 – 8 bytes)
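The bit count is the smallest n with 2**n at least 5 billion, which the standard library can check directly:

```python
# Sketch: minimum bits for 5 billion unique product IDs.
import math

bits = math.ceil(math.log2(5 * 10**9))
print(bits, "bits")
```

Since 2**32 is about 4.3 billion, 33 bits are needed, which points past int32 to the 8-byte option.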
Question 5.3
We want the ability to account for 1 billion users. How many bits should the User ID be to
store a unique ID for each user?
Given our answer above, what data type should we use?
• tinyint (1 byte)
• smallint (2 bytes)
• int (int32 – 4 bytes)
• bigint (int64 – 8 bytes)
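The same check works for the user ID space:

```python
# Sketch: minimum bits for 1 billion unique user IDs.
import math

bits = math.ceil(math.log2(10**9))
print(bits, "bits")
```

Since 2**30 is about 1.07 billion, 30 bits suffice, which fits within the 4-byte option's range.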
Question 5.4
Given the above and your answers for 5.2 and 5.3, how big is one row of our table (one
record) in bytes?
Question 5.5
How big is the entire table in MB if we assume that we store the data for a week, and we
receive 100 million orders in a day?
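A sketch of the week's worth of orders, with ROW_BYTES as a placeholder for your Question 5.4 row size:

```python
# Sketch: table size in MB for one week of orders.
ORDERS_PER_DAY = 100 * 10**6
DAYS = 7
ROW_BYTES = 128          # placeholder: substitute your Question 5.4 answer

table_mb = ORDERS_PER_DAY * DAYS * ROW_BYTES / 10**6
print(f"{table_mb:,.0f} MB")
```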
Question 5.6
Let’s calculate the time it takes to perform operations on our database. Let’s assume that our
table is actually 10 GB (no matter what you got for an answer in part 5.5). Use the values
given in the instructions for all calculations.
How much time does it take in milliseconds to look up a record if our table is in RAM?
Question 5.7
How much time does it take in days (round to nearest day) to look up a record on disk if all
records are in random locations?
Question 5.8
Now let’s say you divide your data into blocks so that each block is 64 MB. The blocks are still
scattered randomly on disk. How long would it take in seconds (round to nearest second) to
look up a record then?
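A sketch of the block-scan lookup: 10 GB split into 64 MB blocks scattered randomly, so a lookup seeks block by block and on average scans half of them (block transfer time is ignored). The seek time is an assumed figure; use the value from the instructions.

```python
# Sketch: average lookup time with 64 MB blocks scattered on disk.
TABLE_BYTES = 10 * 10**9
BLOCK_BYTES = 64 * 10**6
SEEK_S = 0.010           # assumed 10 ms average seek

blocks = TABLE_BYTES / BLOCK_BYTES
avg_lookup_s = blocks * SEEK_S / 2   # average scan covers half the blocks
print(round(avg_lookup_s), "seconds")
```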
Question 5.9
Now, let’s think about scale. Use the values given in the instructions for all calculations. If we
had 10 machines, how would this impact the speed of looking up one record? How many
times faster would look up be?
Question 5.10
What if the data was stored on another machine? How long in milliseconds would it take to
get that data if that other machine had the data readily available in RAM (round to the
nearest millisecond)?