RDBMS To MongoDB Migration

Schema Design
From Rigid Tables to Flexible and Dynamic BSON Documents
Other Advantages of the Document Model
Joining Collections for Data Analytics
Defining the Document Schema
Modeling Relationships with Embedding and Referencing
Embedding
Referencing
Different Design Goals
Indexing
Index Types
Optimizing Performance With Indexes
Schema Evolution and the Impact on Schema Design
Application Integration
MongoDB Drivers and the API
Mapping SQL to MongoDB Syntax
MongoDB Aggregation Framework
MongoDB Connector for BI
Atomicity in MongoDB
Maintaining Strong Consistency
Write Durability
Implementing Validation & Constraints
Foreign Keys
Document Validation
Enforcing Constraints With Indexes
Conclusion
We Can Help
Resources
Introduction

Figure 1: Case Studies (columns: Organization, Migrated From, Application)
Schema Design

The most fundamental change in migrating from a relational database to MongoDB is the way in which the data is modeled: moving from the legacy relational data model that flattens data into rigid two-dimensional tabular structures of rows and columns, to a rich and dynamic document data model with embedded sub-documents and arrays.

Figure 2: Migration Roadmap
RDBMS        MongoDB
Database     Database
Table        Collection
Row          Document
Column       Field
Index        Index

Figure 3: Terminology Translation
With sub-documents and arrays, JSON documents also align with the structure of objects at the application level. This makes it easy for developers to map the data used in the application to its associated document in the database.

By contrast, trying to map the object representation of the data to the tabular representation of an RDBMS slows down development. Adding Object Relational Mappers (ORMs) can create additional complexity by reducing the flexibility to evolve schemas and to optimize queries to meet new application requirements.

The project team should start the schema design process by considering the application's requirements. It should model the data in a way that takes advantage of the document model's flexibility. In schema migrations, it may be easy to mirror the relational database's flat schema to the document model. However, this approach negates the advantages enabled by the document model's rich, embedded data structures. For example, data that belongs to a parent-child relationship in two RDBMS tables would commonly be collapsed (embedded) into a single document in MongoDB.

Modeling the same data in MongoDB enables us to create a schema in which we embed an array of sub-documents for each car directly within the Person document:

{
  first_name: "Paul",
  surname: "Miller",
  city: "London",
  location: [45.123, 47.232],
  cars: [
    { model: "Bentley",
      year: 1973,
      value: 100000, ... },
    { model: "Rolls Royce",
      year: 1965,
      value: 330000, ... }
  ]
}
In this simple example, the relational model consists of only two tables. (In reality, most applications will need tens, hundreds, or even thousands of tables.) This approach does not reflect the way architects think about data, nor the way in which developers write applications. The document model enables data to be represented in a much more natural and intuitive way.

To further illustrate the differences between the relational and document models, consider the example of a blogging platform in Figure 5. In this example, the application relies on the RDBMS to join five separate tables in order to build the blog entry. With MongoDB, all of the blog data is contained within a single document, linked with a single reference to a user document that contains both blog and comment authors.

Figure 5: Pre-JOINing Data to Collapse 5 RDBMS Tables to 2 BSON Documents

Joining Collections for Data Analytics

Typically it is most advantageous to take a denormalized data modeling approach for operational databases, with the efficiency of reading or writing an entire record in a single operation outweighing any modest increase in storage requirements. However, there are examples where normalizing data can be beneficial, especially when data from multiple sources needs to be blended for analysis; this can be done using the $lookup stage in the MongoDB Aggregation Framework.

The Aggregation Framework is a pipeline for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. The pipeline consists of stages; each stage transforms the documents as they pass through.
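As a sketch of how stages compose (collection and field names are illustrative, reusing the earlier car data), the following pipeline filters documents and then groups them to compute an aggregate:

db.cars.aggregate([
  { $match: { year: { $gte: 1960 } } },          // stage 1: filter documents
  { $group: { _id: "$model",                     // stage 2: group by model
              avg_value: { $avg: "$value" } } }  // and compute an average
])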
While not offering as rich a set of join operations as some RDBMSs, $lookup provides a left outer equi-join, which is convenient for a selection of use cases. A left outer equi-join matches and embeds documents from the "right" collection in documents from the "left" collection. As an example, if the left collection contains order documents from a shopping cart application, then the $lookup operator can match the product_id references from those documents to embed the matching product details from the products collection.
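A minimal sketch of this pattern in the mongo shell, assuming each order document stores the referenced product's _id in a product_id field:

db.orders.aggregate([
  { $lookup: {
      from: "products",           // the "right" collection
      localField: "product_id",   // field in the order documents
      foreignField: "_id",        // field in the product documents
      as: "product_details"       // matching products are embedded here
  }}
])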
MongoDB 3.4 introduces a new aggregation stage called $graphLookup to recursively look up a set of documents with a specific defined relationship to a starting document. Developers can specify the maximum depth for the recursion, and apply additional filters to only search nodes that meet specific query predicates. $graphLookup can recursively query within a single collection, or across multiple collections.
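A sketch of a recursive lookup within a single collection, assuming a hypothetical employees collection where each document's reports_to field holds the name of its manager:

db.employees.aggregate([
  { $match: { name: "Carol" } },
  { $graphLookup: {
      from: "employees",              // recurse within the same collection
      startWith: "$reports_to",       // seed the traversal with this value
      connectFromField: "reports_to", // follow this field on each hop...
      connectToField: "name",         // ...matching against this field
      as: "reporting_chain",
      maxDepth: 5                     // cap the recursion depth
  }}
])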
Worked examples of using $lookup as well as other aggregation stages can be found in the blog Joins and Other Aggregation Enhancements.

Defining the Document Schema

As a first step, the project team should document the operations performed on the application's data, together with the life-cycle of the data and the growth rate of documents, comparing:

1. How these are currently implemented by the relational database
2. How MongoDB could implement them

Figure 6 represents an example of this exercise.

Application Action      RDBMS Action                      MongoDB Action
Create Product Record   INSERT to (n) tables (product     insert() to 1 document
                        description, price,
                        manufacturer, etc.)
Display Product Record  SELECT and JOIN (n) product       find() single document
                        tables
Add Product Review      INSERT to review table,           insert() to review collection,
                        foreign key to product record     reference to product document
More Actions...         ...                               ...

Figure 6: Analyzing Queries to Design the Optimum Schema

This analysis helps to identify the ideal document schema and indexes for the application data and workload, based on the queries and operations to be performed against it.

Modeling Relationships with Embedding and Referencing

Deciding when to embed a document or instead create a reference between separate documents in different collections is an application-specific consideration. There are, however, some general considerations to guide the decision during schema design.
Embedding

Data with a 1:1 or 1:many relationship (where the "many" objects always appear with, or are viewed in the context of, their parent documents) are natural candidates for embedding within a single document. The concept of data ownership and containment can also be modeled with embedding. Using the product data example above, product pricing, both current and historical, should be embedded within the product document since it is owned by and contained within that specific product. If the product is deleted, the pricing becomes irrelevant.

Architects should also embed fields that need to be modified together atomically. (Refer to the Application Integration section of this guide for more information.)

Not all 1:1 and 1:m relationships should be embedded in a single document. Referencing between documents in different collections should be used when:

• A document is frequently read, but contains an embedded document that is rarely accessed. An example might be a customer record that embeds copies of the annual general report. Embedding the report only increases the in-memory requirements (the working set) of the collection
• One part of a document is frequently updated and constantly growing in size, while the remainder of the document is relatively static
• The combined document size would exceed MongoDB's 16MB document limit

Referencing

Referencing enables data normalization, and can give more flexibility than embedding. But the application will issue follow-up queries to resolve the reference, requiring additional round-trips to the server.

References are usually implemented by saving the _id field of one document in the related document as a reference (the _id is a required unique field, used as the primary key within a MongoDB document, and either generated automatically by the driver or specified by the user). A second query is then executed by the application to return the referenced data.

Referencing should be used:

• With m:1 or m:m relationships where embedding would not provide sufficient read performance advantages to outweigh the implications of data duplication
• Where the object is referenced from many different sources
• To represent complex many-to-many relationships
• To model large, hierarchical data sets

The $lookup stage in an aggregation pipeline can be used to match the references with the _ids from the second collection to automatically embed the referenced data in the result set.
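A minimal sketch of the reference-and-follow-up pattern in the mongo shell (collection and field names are illustrative):

// First query: read the referencing document
order = db.orders.findOne({ _id: 1001 })

// Second query: follow the stored _id to the referenced document
product = db.products.findOne({ _id: order.product_id })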
Different Design Goals

Comparing these two design options, embedding sub-documents versus referencing between documents, highlights a fundamental difference between relational and document databases:

• The RDBMS optimizes data storage efficiency (as it was conceived at a time when storage was the most expensive component of the system).
• MongoDB's document model is optimized for how the application accesses data (as performance, developer time, and speed to market are now more important than storage volumes).

Data modeling considerations, patterns, and examples, including embedded versus referenced relationships, are discussed in more detail in the documentation.

Indexing

In any database, indexes are the single biggest tunable performance factor and are therefore integral to schema design.

Indexes in MongoDB largely correspond to indexes in a relational database. MongoDB uses B-Tree indexes, and natively supports secondary indexes. As such, it will be immediately familiar to those coming from a SQL background.

The type and frequency of the application's queries should inform index selection. As with all databases, indexing does not come free: it imposes overhead on writes and resource (disk and memory) usage.
Index Types

MongoDB offers a range of index types. For array fields, no special syntax is required for creating array indexes: if the field contains an array, it will be indexed as an array index. For text search, queries that use the text search index return documents in relevance order; each collection may have at most one text index, but that index may include multiple fields.

MongoDB's storage engines all support all index types, and indexes can be created on any part of the JSON document, including inside sub-documents and array elements, making them much more powerful than those offered by RDBMSs.
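A few illustrative index definitions in the mongo shell, reusing the earlier Person/cars example (collection and field names are assumptions, not prescriptions):

// Secondary index on a top-level field
db.people.createIndex({ surname: 1 })

// Index on a field inside the cars array of sub-documents;
// because cars is an array, MongoDB builds an array (multikey) index
db.people.createIndex({ "cars.model": 1 })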
Optimizing Performance With Indexes

MongoDB's query optimizer selects the index empirically by occasionally running alternate query plans and selecting the plan with the best response time. The query optimizer can be overridden using the cursor.hint() method.

As with a relational database, the DBA can review query plans and ensure common queries are serviced by well-defined indexes by using the explain() function, which reports on:

• The number of documents returned
• Which index was used, if any
• Whether the query was covered, meaning no documents needed to be read to return results
• Whether an in-memory sort was performed, which indicates an index would be beneficial
• The number of index entries scanned
• The number of documents read
• How long the query took to resolve, reported in milliseconds
• Alternate query plans that were assessed but then rejected

While it may not be necessary to shard the database at the outset of the project, it is always good practice to assume that future scalability will be necessary (e.g., due to data growth or the popularity of the application). Defining index keys during the schema design phase also helps identify keys that can be used when implementing MongoDB's auto-sharding for application-transparent scale-out.

MongoDB provides a range of logging and monitoring tools to ensure collections are appropriately indexed and queries are tuned. These can and should be used both in development and in production.

The MongoDB Database Profiler is most commonly used during load testing and debugging, logging all database operations or only those events whose duration exceeds a configurable threshold (the default is 100ms). Profiling data is stored in a capped collection where it can easily be searched for relevant events; it is often easier to query this collection than to parse the log files.
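A sketch of these tools in the mongo shell (collection and field names are illustrative):

// Report documents returned, index used, execution time in
// milliseconds, and rejected alternate plans
db.orders.find({ product_id: 1001 }).explain("executionStats")

// Override the optimizer's index selection for one query
db.orders.find({ product_id: 1001 }).hint({ product_id: 1 })

// Profile operations slower than 100ms; results land in the
// capped db.system.profile collection, which can itself be queried
db.setProfilingLevel(1, 100)
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 })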
Figure 7: Visual Query Profiling in MongoDB Ops Manager

Delivered as part of MongoDB's Ops Manager and Cloud Manager platforms, the new Visual Query Profiler provides a quick and convenient way for operations teams and DBAs to analyze specific queries or query families. The Visual Query Profiler (as shown in Figure 7) displays how query and write latency varies over time, making it simple to identify slower queries with common access patterns and characteristics, as well as to identify any latency spikes.

The visual query profiler will analyze the data it collects to provide recommendations for new indexes that can be created to improve query performance. Once identified, these new indexes need to be rolled out in the production system, and Ops/Cloud Manager automates that process, performing a rolling index build which avoids any impact to the application.

MongoDB Compass provides the ability to visualize explain plans, presenting key information on how a query performed, for example the number of documents returned, execution time, index usage, and more. Each stage of the execution pipeline is represented as a node in a tree, making it simple to view explain plans from queries distributed across multiple nodes.

Schema Evolution and the Impact on Schema Design

Each customer may buy or subscribe to different services from their vendor, each with their own sets of contracts. Modeling this real-world variance in the rigid, two-dimensional schema of a relational database is complex and convoluted. In MongoDB, supporting variance between documents is a fundamental, seamless feature of BSON documents.

MongoDB's flexible and dynamic schemas mean that schema development and ongoing evolution are straightforward. For example, the developer and DBA working on a new development project using a relational database must first start by specifying the database schema before any code is written. At a minimum this will take days; it often takes weeks or months.

MongoDB enables developers to evolve the schema through an iterative and agile approach. Developers can start writing code and persist the objects as they are created. And when they add more features, MongoDB will continue to store the updated objects without the need for performing costly ALTER TABLE operations or re-designing the schema from scratch.
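A sketch of this workflow (field names are illustrative): documents with the new shape coexist with the old, and no ALTER TABLE is required:

// Original object persisted by the first release
db.people.insert({ first_name: "Paul", surname: "Miller" })

// A later release adds a feature; new documents simply carry the
// extra field, while existing documents remain valid and queryable
db.people.insert({ first_name: "Anna", surname: "Jones",
                   loyalty_points: 120 })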
Application Integration

MongoDB Drivers and the API

Applications interact with MongoDB through idiomatic drivers. For instance, Java developers can simply code against MongoDB natively in Java; likewise for Ruby developers, PHP developers, and so on. The drivers are created by development teams that are experts in their given language and know how programmers prefer to work within those languages.

Mapping SQL to MongoDB Syntax

For developers familiar with SQL, it is useful to understand how core SQL statements such as CREATE, ALTER, INSERT, SELECT, UPDATE, and DELETE map to the MongoDB API. The documentation includes a comparison chart with examples to assist in the transition to MongoDB Query Language structure and semantics. In addition, MongoDB offers an extensive array of advanced query operators.

MongoDB Aggregation Framework

The Aggregation Framework enables in-database analytics, generating new data sets (for example, calculating minimums, averages, standard deviations, and related data) as documents progress through the pipeline.

Additionally, the Aggregation Framework can manipulate and combine documents using projections, filters, redaction, lookups (JOINs), and recursive graph lookups.

The SQL to Aggregation Mapping Chart shows a number of examples demonstrating how queries in SQL are handled in MongoDB's Aggregation Framework. To enable more complex analysis, MongoDB also provides native support for MapReduce operations in both sharded and unsharded collections.
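Two illustrative translations in the spirit of that chart (table, collection, and field names are assumptions):

// SQL: SELECT * FROM products WHERE price > 100
db.products.find({ price: { $gt: 100 } })

// SQL: SELECT manufacturer, COUNT(*) FROM products GROUP BY manufacturer
db.products.aggregate([
  { $group: { _id: "$manufacturer", total: { $sum: 1 } } }
])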
Figure 8: Uncover new insights with powerful visualizations generated from MongoDB

Business Intelligence Integration: MongoDB Connector for BI

Driven by growing requirements for self-service analytics, faster discovery and prediction based on real-time operational data, and the need to integrate multi-structured and streaming data sets, BI and analytics platforms are one of the fastest growing software markets.

To address these requirements, modern application data stored in MongoDB can be easily explored with industry-standard SQL-based BI and analytics platforms. Using the BI Connector, analysts, data scientists, and business users can now seamlessly visualize semi-structured and unstructured data managed in MongoDB, alongside traditional data in their SQL databases, using the same BI tools deployed within millions of enterprises.

SQL-based BI tools such as Tableau expect to connect to a data source with a fixed schema presenting tabular data. This presents a challenge when working with MongoDB's dynamic schema and rich, multi-dimensional documents. In order for BI tools to query MongoDB as a data source, the BI Connector does the following:

• Provides the BI tool with the schema of the MongoDB collections to be visualized. Users can review the schema output to ensure data types, sub-documents, and arrays are correctly represented
• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are then sent to MongoDB for processing
• Converts the returned results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements

Additionally, a number of Business Intelligence (BI) vendors have developed connectors to integrate MongoDB with their suites (without using SQL), alongside traditional relational databases. This integration provides reporting, ad hoc analysis, and dashboarding, enabling visualization and analysis across multiple data sources. Integrations are available with tools from a range of vendors including Actuate, Alteryx, Informatica, JasperSoft, Logi Analytics, MicroStrategy, Pentaho, QlikTech, SAP Lumira, and Talend.

Atomicity in MongoDB

Relational databases typically have well-developed features for data integrity, including ACID transactions and constraint enforcement. Rightly, users do not want to sacrifice data integrity as they move to new types of databases. With MongoDB, users can maintain many capabilities of relational databases, even though the technical implementation of those capabilities may be different.

MongoDB write operations are ACID at the document level, including the ability to update embedded arrays and sub-documents atomically. By embedding related fields within a single document, users get the same integrity guarantees as a traditional RDBMS, which has to synchronize costly ACID operations and maintain referential integrity across separate tables.
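A sketch of a single-document update that atomically modifies both a field and an embedded array (names reuse the earlier Person/cars example):

// $set and $push are applied together, atomically, to one document
db.people.update(
  { first_name: "Paul", surname: "Miller" },
  { $set:  { city: "Bristol" },
    $push: { cars: { model: "Jaguar", year: 1980, value: 45000 } } }
)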
Document-level atomicity in MongoDB ensures complete isolation as a document is updated; any errors cause the operation to roll back and clients receive a consistent view of the document.

Despite the power of single-document atomic operations, there may be cases that require multi-document transactions. There are multiple approaches to this, including using the findAndModify command, which allows a document to be updated atomically and returned in the same round trip. findAndModify is a powerful primitive on top of which users can build other more complex transaction protocols. For example, users frequently build atomic soft-state locks, job queues, counters, and state machines that can help coordinate more complex behaviors. Another alternative entails implementing a two-phase commit to provide transaction-like semantics. The documentation describes how to do this in MongoDB, and important considerations for its use.
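A sketch of the job-queue pattern mentioned above, using findAndModify to atomically claim the next pending job (collection and field names are illustrative):

job = db.jobs.findAndModify({
  query:  { status: "pending" },
  sort:   { created: 1 },                   // oldest job first
  update: { $set: { status: "running" } },  // claim it atomically
  new:    true                              // return the updated document
})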
Write Durability

MongoDB uses write concerns to control the level of write guarantees for data durability. Configurable options extend from simple "fire and forget" operations to waiting for acknowledgments from multiple, globally distributed replicas.

If opting for the most relaxed write concern, the application can send a write operation to MongoDB then continue processing additional requests without waiting for a response from the database, giving the maximum performance. This option is useful for applications like logging, where users are typically analyzing trends in the data, rather than discrete events.

With stronger write concerns, write operations wait until MongoDB applies and acknowledges the operation. This is MongoDB's default configuration. The behavior can be further tightened by also opting to wait for replication of the write to:

• A single secondary
• A majority of secondaries
• A specified number of secondaries
• All of the secondaries, even if they are deployed in different data centers (users should evaluate the impacts of network latency carefully in this scenario)

Figure 9: Configure Durability per Operation

The write concern can also be used to guarantee that the change has been persisted to disk before it is acknowledged.

The write concern is configured through the driver and is highly granular: it can be set per-operation, per-collection,
or for the entire database. Users can learn more about write concerns in the documentation.
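A per-operation sketch (collection and field names are illustrative):

// Wait for acknowledgment from a majority of replica set members,
// and for the write to be journaled to disk, before returning
db.orders.insert(
  { product_id: 1001, quantity: 2 },
  { writeConcern: { w: "majority", j: true } }
)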
Implementing Validation & Constraints

Document Validation

Dynamic schemas bring great agility, but it is also important that controls can be implemented to maintain data quality, especially if the database is powering multiple applications, or is integrated into a larger data management platform that feeds into upstream and downstream systems. Rather than delegating enforcement of these controls back into application code, MongoDB provides Document Validation within the database. Users can enforce checks on document structure, data types, data ranges, and the presence of mandatory fields. As a result, DBAs can apply data governance standards, while developers maintain the benefits of a flexible document model.

There is significant flexibility to customize which parts of the documents are and are not validated for any collection, unlike an RDBMS where everything must be defined and enforced. For any key it might be appropriate to check its presence, its data type, or its range of allowed values.

Adding the validation checks to a collection is very intuitive to any developer or DBA familiar with MongoDB, as Document Validation uses the standard MongoDB Query Language.

Validation rules can be managed from the Compass GUI. Rules can be created and modified directly using a simple point-and-click interface, and any documents violating the rules can be clearly presented. DBAs can then use Compass's CRUD support to fix data quality issues in individual documents.
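A sketch of adding a rule to an existing collection with the standard query language (collection, field names, and thresholds are illustrative):

// Require a string product_name and a positive price on every write
db.runCommand({
  collMod: "products",
  validator: {
    product_name: { $type: "string" },
    price: { $gt: 0 }
  }
})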
Enforcing Constraints With Indexes

As discussed in the Schema Design section, MongoDB supports unique indexes natively, which detect and raise an error for any insert operation that attempts to load a duplicate value into a collection. A tutorial is available that describes how to create unique indexes and eliminate duplicate entries from existing collections.
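A one-line sketch (collection and field names are illustrative):

// Inserts that duplicate an existing email value will raise an error
db.users.createIndex({ email: 1 }, { unique: true })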
Views

DBAs can define non-materialized views that expose only a subset of data from an underlying collection, i.e. a view that filters out specific fields. DBAs can define a view of a collection that's generated from an aggregation over another collection(s) or view. Permissions granted against the view are specified separately from permissions granted to the underlying collection(s).

Views are defined using the standard MongoDB Query Language and aggregation pipeline. They allow the inclusion or exclusion of fields, masking of field values, filtering, schema transformation, grouping, sorting, limiting, and joining of data using $lookup and $graphLookup to another collection.

You can learn more about MongoDB read-only views from the documentation.
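A sketch of defining such a view over an aggregation (names are illustrative):

// Expose only selected fields of active customers; permissions on
// the view are granted separately from the underlying collection
db.createView("customer_contacts", "customers", [
  { $match: { active: true } },
  { $project: { name: 1, email: 1, _id: 0 } }
])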
Migrating Data to MongoDB

Figure 10: Multiple Options for Data Migration

Project teams have multiple options for importing data from existing relational databases to MongoDB. The tool of choice should depend on the stage of the project and the existing environment.

Many users create their own scripts, which transform source data into a hierarchical JSON structure that can be imported into MongoDB using the mongoimport tool.
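An illustrative invocation for that scripted approach (database, collection, and file names are assumptions):

mongoimport --db shop --collection products --file products.json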
Extract Transform Load (ETL) tools are also commonly used when migrating data from relational databases to MongoDB. A number of ETL vendors including Informatica, Pentaho, and Talend have developed MongoDB connectors that enable a workflow in which data is extracted from the source database, transformed into the target MongoDB schema, staged, and then loaded into collections.

Many migrations involve running the existing RDBMS in parallel with the new MongoDB database, incrementally transferring production data:

• As records are retrieved from the RDBMS, the application writes them back out to MongoDB in the required document schema.
• Consistency checkers, for example using MD5 checksums, can be used to validate the migrated data.
• All newly created or updated data is written to MongoDB only.

Incremental migration can be used when new application features are implemented with MongoDB, or where multiple applications are running against the legacy RDBMS. Migrating only those applications that are being modernized enables teams to divide projects into more manageable and agile development sprints.

Incremental migration eliminates disruption to service availability while also providing fail-back should it be necessary to revert back to the legacy database.
Many organizations create feeds from their source systems, dumping daily updates from an existing RDBMS to MongoDB to run parallel operations, or to perform application development and load testing. When using this approach, it is important to consider how to handle deletes to data in the source system. One solution is to create A and B target databases in MongoDB, and then alternate daily feeds between them. In this scenario, Database A receives one daily feed, then the application switches the next day of feeds to Database B. Meanwhile the existing Database A is dropped, so when the next feeds are made to Database A, a whole new instance of the source database is created, ensuring synchronization of deletions to the source data.

Operational Agility at Scale

The considerations discussed thus far fall into the domain of the data architects, developers, and DBAs. However, no matter how elegant the data model or how efficient the indexes, none of this matters if the database fails to perform reliably at scale or cannot be managed efficiently. The final set of considerations in migration planning should focus on operational issues.

MongoDB Atlas: Database as a Service For MongoDB

MongoDB can run the database for you! MongoDB Atlas provides all of the features of MongoDB, without the operational heavy lifting required for any new application. MongoDB Atlas is available on-demand through a pay-as-you-go model and billed on an hourly basis, letting you focus on what you do best.

It's easy to get started: use a simple GUI to select the instance size, region, and features you need. MongoDB Atlas provides:

• Security features to protect access to your data
• Built-in replication for always-on availability, tolerating complete data center failure
MongoDB Stitch: Backend as a Service

Take advantage of the free tier to get started; when you need more bandwidth, the usage-based pricing model ensures you only pay for what you consume. Learn more and try it out for yourself.

Supporting Your Migration: MongoDB Services

MongoDB and the community offer a range of resources and services to support migrations by helping users build MongoDB skills and proficiency. MongoDB services include training, support, forums, and consulting. Refer to the "We Can Help" section below to learn more about support from development through to production.

MongoDB University

Courses are available for both developers and DBAs.

Conclusion

Following the best practices outlined in this guide can help project teams reduce the time and risk of database migrations, while enabling them to take advantage of the benefits of MongoDB and the document model. In doing so, they can quickly start to realize a more agile, scalable, and cost-effective infrastructure, innovating on applications that were never before possible.

We Can Help

We are the MongoDB experts. Over 3,000 organizations rely on our commercial products, including startups and more than half of the Fortune 100. We offer software and services to make your life easier:
MongoDB Enterprise Advanced is the best way to run MongoDB in your data center. It's a finely-tuned package of advanced software, support, certifications, and other services designed for the way you do business.

MongoDB Atlas is a database as a service for MongoDB, letting you focus on apps instead of ops. With MongoDB Atlas, you only pay for what you use with a convenient hourly billing model. With the click of a button, you can scale up and down when you need to, with no downtime, full security, and high performance.

MongoDB Stitch is a backend as a service (BaaS), giving developers full access to MongoDB, declarative read/write controls, and integration with their choice of services.

MongoDB Cloud Manager is a cloud-based tool that helps you manage MongoDB on your own infrastructure. With automated provisioning, fine-grained monitoring, and continuous backups, you get a full management suite that reduces operational overhead, while maintaining full control over your databases.

Resources

For more information, please visit mongodb.com or contact us at sales@mongodb.com.

Case Studies (mongodb.com/customers)
Presentations (mongodb.com/presentations)
Free Online Training (university.mongodb.com)
Webinars and Events (mongodb.com/events)
Documentation (docs.mongodb.com)
MongoDB Enterprise Download (mongodb.com/download)
MongoDB Atlas database as a service for MongoDB (mongodb.com/cloud)
MongoDB Stitch backend as a service (mongodb.com/cloud/stitch)

New York • Palo Alto • Washington, D.C. • London • Dublin • Barcelona • Sydney • Tel Aviv
US 866-237-8815 • INTL +1-650-440-4474 • info@mongodb.com
© 2017 MongoDB, Inc. All rights reserved.