NoSQL-Assignment 2
ITNPBD3
ASSIGNMENT TWO
SUBMISSION DATE- 26/05/2023
Introduction
This assignment is about designing a non-relational database. First, I would like to highlight
the crucial role that data plays in every industry, whether in decision-making, implementing
changes in a business, investing in a particular project, or hiring a candidate for a job.
Data science has helped enormously in streamlining business processes, giving managers
convenient ways to keep track of business performance, enhancing research and development,
monitoring employee performance appraisals, and tracking sales records and expenses faster to
support decisions on improving profitability, managing stock, and improving the supply chain.
The possibilities are endless. Many websites, including social media, e-commerce, agency, and
education sites, use database software to manage and store data efficiently.
- Reactions Entity: Facebook tracks users’ interaction with posts using likes, reactions,
and comments. Attributes that could be involved are:
User ID
Post ID
Reaction Type
- Comments Entity: Users also interact with each other by commenting on each other’s
posts and content. The attributes involved could be:
User ID
Friend ID
Comment ID
Timestamp
- Events Entity: User events on Facebook are possibly managed by MySQL. The key
attributes involved could be:
User ID
Event ID
Event Details
Event Reactions
Participant list
- Flexibility: Each of the entities described above has a complex data structure and
should not be forced into a rigid database schema, since queries over such a schema may
become slow. Instead, Facebook can adopt a schema-less design for all these entities,
allowing a list of values to be stored within a document and adding flexibility to its
data modeling.
- Fault tolerance: The data in Facebook could be modeled so that it withstands failures
such as a network partition. This can be done through replication. The same query (for
example, a user checking the account balance on their business page) should produce the
same result when it is run from a different location. In this case, part of the data
should be replicated on more than one node (machine) so that queries return consistent
results.
- Performance and Speed: With billions of users reading and writing the same kinds of
data at the same time, Facebook needs efficient query optimization techniques for
operations such as generating codes for two-factor authentication when Facebook is used
on more than one device, quickly returning the total engagement from promotional
content, or refreshing a newsfeed.
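The flexibility requirement can be illustrated with a minimal sketch in plain JavaScript. It is not Facebook's implementation: the array below is only a stand-in for a schema-less collection, and the field names and sample values are assumptions made for the illustration.

```javascript
// Hypothetical in-memory stand-in for a schema-less "events" collection.
const flexibleEvents = [];

// Documents in the same collection may carry different fields.
flexibleEvents.push({
  userId: "Jack123",
  eventId: "E1",
  eventDetails: { description: "Art exhibition" },
  participants: ["John890", "Sara321"], // a list of values stored inside the document
});
flexibleEvents.push({
  userId: "Sara321",
  eventId: "E2",
  eventReactions: ["Like", "Heart"], // this document has no participants field
});

// Without a fixed schema, readers must tolerate missing fields.
function participantCount(event) {
  return (event.participants || []).length;
}

const participantCounts = flexibleEvents.map(participantCount);
console.log(participantCounts);
```

The second document is accepted even though it lacks a participant list, which is the kind of flexibility a rigid schema would not allow without a schema change.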
Shared Risks
NoSQL data is distributed across clusters spread over wide geographical areas. Storing data on
more than one node (machine) provides local data protection. For instance, there might be a
financial transaction to run Facebook ads, after which balances are automatically updated. It is
important to have replicas of the current balance because, if there is a network partition, a
query against a single copy may not show the updated balance. Having replicas lets Facebook
read the updated figure from other machines where it is replicated. Additionally, the wider the
geographic spread of the replicas, the safer the data becomes.
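The ad balance example can be sketched as a toy simulation. This is only an illustration under stated assumptions: the Replica class, the region names, and the page ID are hypothetical, and real NoSQL systems use far more elaborate replication protocols.

```javascript
// Hypothetical replica: one machine holding a copy of the ad balances.
class Replica {
  constructor(name) {
    this.name = name;
    this.reachable = true;
    this.balances = new Map();
  }
}

// Three assumed replicas in different regions.
const replicas = [new Replica("eu-west"), new Replica("us-east"), new Replica("ap-south")];

// Write the new balance to every replica that can currently be reached.
function writeBalance(pageId, amount) {
  for (const replica of replicas) {
    if (replica.reachable) replica.balances.set(pageId, amount);
  }
}

// Read from the first reachable replica that holds the value.
function readBalance(pageId) {
  for (const replica of replicas) {
    if (replica.reachable && replica.balances.has(pageId)) {
      return { from: replica.name, amount: replica.balances.get(pageId) };
    }
  }
  throw new Error("no reachable replica holds " + pageId);
}

writeBalance("JacksArtPage", 100);   // initial balance reaches all replicas
replicas[0].reachable = false;       // simulate a network partition
writeBalance("JacksArtPage", 75);    // ad spend recorded during the partition
const balanceRead = readBalance("JacksArtPage");
console.log(balanceRead.from, balanceRead.amount);
```

The partitioned replica still holds the old figure, while the read is served from a reachable replica that has the updated balance.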
Better Performance
NoSQL allows fast data retrieval, speedy query performance, and personalized
recommendations for Facebook, all of which are necessary for all of its business operations,
including the entities mentioned above. It can perform fast read/write operations because data
is stored in denormalized form. Thus a query such as the number of times a user visited a
Facebook page can be answered much faster using NoSQL queries. Overall, NoSQL is more
appropriate in terms of flexibility, scalability, faster read/write operations, efficient data
storage, and query optimization, all of which add to its performance, speed, and reliability.
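The page visit example can be sketched as follows. The data and function names are assumptions for illustration only. The sketch contrasts counting visits by scanning one record per visit with reading a counter that is kept, denormalized, inside the user's document.

```javascript
// Normalized form: one record per visit, so counting means scanning the records.
const visitLog = [
  { userId: "Jack123", pageId: "ArtPage" },
  { userId: "Jack123", pageId: "ArtPage" },
  { userId: "Sara321", pageId: "ArtPage" },
];

function countByScan(userId, pageId) {
  return visitLog.filter(v => v.userId === userId && v.pageId === pageId).length;
}

// Denormalized form: the counter is stored in the user's document and
// updated at write time, so reading it is a single lookup.
const userDoc = { userId: "Jack123", pageVisits: {} };

function recordVisit(doc, pageId) {
  doc.pageVisits[pageId] = (doc.pageVisits[pageId] || 0) + 1;
}

recordVisit(userDoc, "ArtPage");
recordVisit(userDoc, "ArtPage");

console.log(countByScan("Jack123", "ArtPage"), userDoc.pageVisits["ArtPage"]);
```

Both approaches give the same answer, but the denormalized read does not grow more expensive as the visit log grows. The cost is an extra update on every write.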
Given the business requirements mentioned earlier, it is recommended that Facebook use a
NoSQL database, as its benefits clearly cater to those requirements. Initially, Facebook used
MySQL, but even after compressing data, the space wasn’t enough to accommodate data at that
scale. This led to additional hardware being bought to keep up with storage, which greatly added
to the cost. Thus they switched to handling such big data with several NoSQL database systems.
Below is a list of operations that run on NoSQL software at Facebook:
- Newsfeed: Facebook uses NoSQL software to generate relevant content for its users to
match their interests and preferences.
- Logs: Logs are a way to track bugs. Because Facebook operates at a massive scale, it
cannot run without logs, which greatly help with bug fixes. To manage logs, Facebook
uses a distributed database called LogDevice.
- Analytics: Facebook uses PrestoDB, HBase, and Apache Hadoop for running analytics
and data warehousing. When refreshing a newsfeed, it uses analytics to match user
preferences in order to display the related posts instantly in the newsfeed.
- Messages and Notifications: Facebook uses Apache Cassandra to run inbox search.
The software handles message threading, retrieves chat history, and displays read/unread
statuses.
- Ads and Targeting: NoSQL software used by Facebook stores user preferences and
demographic data to suggest ads, track ad clicks, and optimize ad campaigns using
analytics.
Embedding decisions:
I have embedded Author ID, Post ID, Post Type and Timestamp within the “Posts” collection and
added references to Comment ID and Reaction Type to show which posts received which types of
reactions and comments.
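The embedding decision can be sketched in plain JavaScript as below. The Map is only a stand-in for a separate Comments collection, the comment text is invented for the example, and loadPostWithComment is a hypothetical helper rather than a MongoDB function.

```javascript
// Stand-in for a separate "Comments" collection, keyed by Comment ID.
const commentsById = new Map([
  [5658, { commentId: 5658, userId: "John890", text: "Great picture!" }],
]);

// The post embeds its own fields and only references the comment by ID.
const samplePost = {
  authorId: "Jack123",
  postId: "98765",
  postType: "Picture",
  timestamp: 30,
  reactionType: "Heart",
  commentId: 5658, // reference, resolved with a second lookup
};

// Hypothetical helper: follow the reference and attach the comment document.
function loadPostWithComment(post) {
  return { ...post, comment: commentsById.get(post.commentId) || null };
}

const loadedPost = loadPostWithComment(samplePost);
console.log(loadedPost.comment);
```

Embedded fields are available as soon as the post is read, while referenced data needs an extra lookup but can be updated in one place.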
Some more references:
- User ID (Relationships) represents Author ID (Posts)
- One of the Friend IDs (Relationships) represents one of the Friend IDs (Reactions)
- One of the Friend IDs (Relationships) represents one of the Friend IDs (Comments)
- User ID (Relationships) represents one of the Page Admins (Pages)
- One of the Friend IDs (Relationships) represents one of the Page Admins (Pages)
- User ID (Relationships) represents one of the Event Admins (Events)
- One of the Friend IDs (Relationships) represents one of the Event Admins (Events)
Document representation in MongoDB
I have included some examples of MongoDB syntax to show how the Posts and Events
collections are connected to the Relationships collection.
Relationships Collection:
// One document per user: the friend list plus the relationship type for each friend.
db.relationships.insertOne({
  'UserID': "Jack123",
  'FriendID': ["John890", "Sara321", "Donna456", "Clara673"],
  'RelationshipType': [{'John890': "Friend", 'Sara321': "Friend", 'Donna456': "Follower", 'Clara673': "Follower"}]
})
Posts Collection:
// A post embeds its author, type and timestamp, and references its comment by ID.
db.posts.insertOne({'AuthorID': "Jack123", 'PostID': "98765", 'PostType': "Picture",
  'Timestamp': 30, 'ReactionType': "Heart", 'CommentID': 5658})
Events Collection:
// The date is quoted: an unquoted 2023-4-30 would be evaluated as a subtraction.
db.events.insertOne({'EventName': "Jack's Art Exhibition", 'EventID': "Jack.Art.Exb",
  'Event Details': {'Description': "Finest Art in town", 'Date': "2023-04-30", 'Location': "Street11"},
  'Event Admin': "Jack123",
  'Event Participants': ["John890", "Sara321", "Donna456"]})
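The connection between the three collections can be sketched in plain JavaScript, with arrays standing in for the collections so the example runs without a database. The field names loosely follow the documents above, and the helpers activityFor and participantsKnownTo are hypothetical names introduced only for this illustration.

```javascript
// Arrays standing in for the Relationships, Posts and Events collections.
const relationshipDocs = [
  { UserID: "Jack123", FriendID: ["John890", "Sara321", "Donna456", "Clara673"] },
];
const postDocs = [
  { AuthorID: "Jack123", PostID: "98765", PostType: "Picture" },
];
const eventDocs = [
  {
    EventID: "Jack.Art.Exb",
    "Event Admin": "Jack123",
    "Event Participants": ["John890", "Sara321", "Donna456"],
  },
];

// Follow the User ID from Relationships into Posts (as author) and Events (as admin).
function activityFor(userId) {
  const rel = relationshipDocs.find(r => r.UserID === userId);
  return {
    friendIds: rel ? rel.FriendID : [],
    postIds: postDocs.filter(p => p.AuthorID === userId).map(p => p.PostID),
    eventIds: eventDocs.filter(e => e["Event Admin"] === userId).map(e => e.EventID),
  };
}

// Event participants who also appear in the user's friend list.
function participantsKnownTo(userId, eventId) {
  const rel = relationshipDocs.find(r => r.UserID === userId);
  const event = eventDocs.find(e => e.EventID === eventId);
  if (!rel || !event) return [];
  return event["Event Participants"].filter(id => rel.FriendID.includes(id));
}

const jackActivity = activityFor("Jack123");
const knownParticipants = participantsKnownTo("Jack123", "Jack.Art.Exb");
console.log(jackActivity, knownParticipants);
```

In MongoDB itself the same lookups would be separate queries on each collection (or an aggregation), since references between collections are not followed automatically.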
Design Decisions for a Distributed Cluster:
Sharding: Using shard keys and indexes (such as hashed indexes, which also help
optimize queries), data can be split into small portions, called shards, across
multiple machines. For instance, the User ID is linked to most of the collections,
and we can use that relationship to distribute the data across a wide network. In
the Events collection, for example, the event that I named “Jack’s Art Exhibition”,
to be held at “Street 11”, should have its shard (its part of the data) stored
closest to that location. A business page targeting multiple locations could keep
customer data specific to each location, and sharding can be done based on location.
Replicas of the event details could be placed on other nodes (machines) that are
located close to “Street 11” for local data protection. The data must have high
availability and fault tolerance.
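The two sharding ideas above, hashing the User ID and routing by location, can be sketched as follows. This is an illustration only: the shard names, the location mapping, and the hash function are assumptions, and MongoDB's own hashed indexes use a different hash function internally.

```javascript
// Hypothetical shards and a hypothetical location-to-shard mapping.
const SHARDS = ["shard-0", "shard-1", "shard-2"];
const SHARD_BY_LOCATION = new Map([["Street11", "shard-1"]]);

// Simple deterministic string hash (djb2-style), kept in the unsigned 32-bit range.
function hashKey(key) {
  let h = 5381;
  for (const ch of key) {
    h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  }
  return h;
}

// Hashed sharding: the same User ID always maps to the same shard.
function shardByUser(userId) {
  return SHARDS[hashKey(userId) % SHARDS.length];
}

// Location-based sharding: use the shard assigned to the location if one is
// configured, otherwise fall back to hashing the User ID.
function shardByLocation(location, userId) {
  return SHARD_BY_LOCATION.has(location)
    ? SHARD_BY_LOCATION.get(location)
    : shardByUser(userId);
}

console.log(shardByUser("Jack123"), shardByLocation("Street11", "Jack123"));
```

Hashing spreads users evenly but ignores geography, while the location mapping keeps an event's data near where it is held, which matches the “Street 11” example.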