Nosql Datawarehouse


NOSQL DATA-WAREHOUSE

I. INTRODUCTION

1.1 Data warehouse Definition

W.H. Inmon [19] defines the data warehouse as a subject-oriented, integrated, non-volatile, and
time-variant collection of data in support of management decisions.

• Subject-oriented data: A data warehouse stores data by subject and not by application.
These business subjects vary from one business to another; e.g., for a retail company, sales,
products, and customers may be critical business subjects.

• Integrated data: The integration process forms the data into a single cohesive
environment. All inconsistencies and errors are removed, and the data is transformed into
a common format before it is stored in the data warehouse.

• Non-volatile data: The data in the data warehouse is not updated in real time. Business
transactions update the data in operational databases in real time. New records are added
to the data warehouse periodically, but existing records are not modified.

• Time-variant data: Decision makers can view the data across time at
whichever level of detail they wish. This allows business analysts to view
patterns and trends over time.

1.2 Multi-Dimensional Modeling: The data warehouse is represented as a multi-dimensional model, which
makes it easy for business users to understand because the structure is divided into measurements (facts) and
context (dimensions). Facts are related to the organization’s business processes and operational systems,
whereas the dimensions surrounding them contain context about the measurements.
• Dimension: Managers and executives think of the business in terms of business dimensions.
These are the categories that allow users to specify multiple ways of looking at the
same information when an analysis is performed.
• Hierarchies: Hierarchies are logical structures that use ordered levels as a means of
organizing data. A hierarchy can be used to define data aggregation. For example, in a time
dimension, a hierarchy might aggregate data from the month level to the quarter level to
the year level. A hierarchy can also be used to define a navigational drill path and to establish
a family structure.
• Facts/Measures: The numbers that users analyze are the measurements, facts, or metrics that
measure the success of the business. An analysis has two elements: what is being
analyzed, i.e. the dimension, and the evaluation criteria for what is being analyzed, which are called
measures or facts. For example, in a retail sale, price and discount are measures used for
analysis.
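
The month-to-quarter-to-year hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration only: the fact rows, the field names (`month`, `product`, `price`, `discount`) and the helper functions are hypothetical and are not taken from any particular warehouse product.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension keys (month, product) plus measures
# (price, discount), mirroring the retail example in the text.
sales_facts = [
    {"month": "2015-01", "product": "pen", "price": 10.0, "discount": 1.0},
    {"month": "2015-02", "product": "pen", "price": 12.0, "discount": 0.0},
    {"month": "2015-04", "product": "ink", "price": 30.0, "discount": 3.0},
    {"month": "2016-01", "product": "ink", "price": 28.0, "discount": 2.0},
]


def to_quarter(month: str) -> str:
    """Map a 'YYYY-MM' month to its 'YYYY-Qn' quarter (one hierarchy level up)."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def to_year(month: str) -> str:
    """Map a 'YYYY-MM' month to its year (two hierarchy levels up)."""
    return month.split("-")[0]


def aggregate(facts, level, measure):
    """Sum one measure after mapping each fact's month to a hierarchy level."""
    totals = defaultdict(float)
    for fact in facts:
        totals[level(fact["month"])] += fact[measure]
    return dict(totals)


by_quarter = aggregate(sales_facts, to_quarter, "price")
by_year = aggregate(sales_facts, to_year, "price")
```

The same `aggregate` function serves every level of the hierarchy; only the mapping function changes, which is what makes a hierarchy usable as a drill path.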

1.3 Types of Data warehouse


Depending upon the type of data that is stored in it, a data warehouse can be
structured or unstructured.

1.3.1 Structured Data Warehouse: Traditional data warehouses were structured; they stored
data in the form of relations.
1.3.1.1 Representation of a Structured Data Warehouse: There are basically two ways of representing a
structured warehouse, ROLAP and MOLAP:

• Relational database (ROLAP)
• Multidimensional database (MOLAP)
1.3.1.1.1 ROLAP Data Warehouse: The data in the data warehouse is stored in a relational database.
It can handle a large amount of data and can leverage all the functionality of slicing and dicing.
Drawbacks of ROLAP: Because of limited SQL functionality, queries may take too long to answer,
which makes the warehouse unusable for answering business questions. Such a scenario may require
significant time and effort to be put into remodeling data structures.
Solution to ROLAP: In order to address these issues, multi-dimensional data warehousing
(MOLAP) modeling disciplines were created.
1.3.1.1.2 MOLAP Data Warehouse: Here the data is stored in a multi-dimensional cube and not in a
relational database. It is used to build fast-retrieval data marts that serve the analytical needs of
departments; these data marts are then "virtually" integrated for consistency through an Information Bus.
The disadvantage of this model is that it can handle only a limited amount of data.

1.3.1.2 Drawbacks: The traditional structured data warehouse solved many fundamental problems,
such as storing data for historical purposes, maintaining data integrity and accessibility, and enhancing
decision-making capability, but it still suffers from several problems, chiefly a general inflexibility in
integrating unstructured data such as text, email, and sound. Traditional data warehouses were
built for repetitive, transaction-based data. The issues of query contention and scalability remain
unaddressed.
The most effective way to bridge the gap between structured and unstructured data is to build a
NOSQL document warehouse.

II. NEED FOR A NOSQL DATA WAREHOUSE [2][3]

Traditional data warehouses designed over MOLAP or ROLAP cannot easily take advantage of
unstructured data. Growth in data volumes, in the number of data sources, and in the types of
data is a major concern for organizations. It is becoming impractical, or even impossible, to load
unstructured data into a traditional DW because of high storage and retrieval costs, performance, and
latency.

The solution to these problems is to use a NOSQL database in designing the data warehouse.

2.1 NOSQL Introduction:


Big Data NoSQL databases were pioneered by top internet companies such as Amazon, Google, LinkedIn and
Facebook to overcome the drawbacks of RDBMS. NoSQL is a cloud-friendly approach to
processing unstructured data dynamically. A NoSQL database, also known as "Not Only SQL", is an
alternative to an SQL database and, unlike SQL, does not require any kind of fixed table schema.
NoSQL generally scales horizontally and avoids major join operations on the data. A NoSQL database can be
regarded as structured storage of which the relational database is a subset.

2.1.1 NoSQL data store types: There are four different types of data stores in NoSQL [13].

2.1.1.1 Key-value databases: A key-value database stores combinations of a key and a value. The key is a
unique identifier for a particular data entry. The value is the data that the key points to. Key-value
databases resemble hash tables or lookup tables. In this type of database there is only one way to query,
namely by the (unique) key; keys may name any data object and are arranged in
alphabetical order [13]. For higher availability, data objects are replicated.
As an example, consider the bank database shown in Figure 1. [13]
BANK DATABASE

Key    Value

1      ID: 1
       Joining Date: 15-July-1985
       Designation: Cashier

2      ID: 2
       Joining Date: 19-March-1982
       Designation: Manager

Figure 1: Key Value (KV) databases

The figure has two columns, representing the key and the value. Each key is unique and
identifies the value (the set of attributes) stored with it. The data is organized in the form of a
ring: it is partitioned on the basis of the sorted (alphabetical) order of the keys, and it is
also replicated around the ring.
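
A minimal Python sketch of this access pattern follows. The store itself is just a dictionary keyed by the unique key. The ring is modeled under stated assumptions: the boundary letters, the node count and the `owner_nodes` helper are hypothetical and only illustrate partitioning by sorted key order with replication on the next node around the ring.

```python
import bisect

# The only access path is the key; the value is opaque to the store.
bank_db = {
    "1": {"ID": 1, "Joining Date": "15-July-1985", "Designation": "Cashier"},
    "2": {"ID": 2, "Joining Date": "19-March-1982", "Designation": "Manager"},
}


def get(store, key):
    """Look a value up by its key; return None if the key is absent."""
    return store.get(key)


def owner_nodes(key, boundaries, replicas=2):
    """Pick the nodes holding a key on a ring partitioned by sorted key order.

    boundaries[i] is the (assumed) largest key handled by node i. Keys past
    the last boundary wrap around to node 0, and each key is also copied to
    the following replicas - 1 nodes on the ring.
    """
    first = bisect.bisect_left(boundaries, key) % len(boundaries)
    return [(first + i) % len(boundaries) for i in range(replicas)]


ring = ["h", "p", "z"]  # hypothetical three-node ring
cashier = get(bank_db, "1")
```

With this ring, `owner_nodes("manager", ring)` places the key on nodes 1 and 2, while a key such as `"zebra"` wraps around to nodes 0 and 1.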

2.1.2 Document Stores


A document store is similar to a key-value store in that stored objects are associated with (and therefore
accessed via) character-string keys. Document store databases are schema-free; the structure of a
document is not fixed. Documents are addressed in the database using a unique key that represents that
document. [13]
RDBMS vs Document store

• Records do not need to have a uniform structure, i.e. different records may have
different columns.
• The types of the values of individual columns can be different for each record.
• Columns can have more than one value (arrays).
• Records can have a nested structure.

Document stores often use internal notations that can be processed directly in applications,
most commonly JSON.
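
The four differences listed above can be seen in a short Python sketch that uses the standard `json` module. The two employee records and their field names are hypothetical examples, not data from the source.

```python
import json

employee_a = {
    "id": 1,
    "name": "Bob",
    "hobbies": ["sailing", "chess"],      # one column holding several values
    "address": {"street": "5 Oak St."},   # nested structure
}
employee_b = {
    "id": "E-2",                          # same column, different value type
    "name": "Alice",
    "designation": "Manager",             # column absent from employee_a
}

# Both records can be stored side by side and round-trip through JSON.
encoded = json.dumps([employee_a, employee_b])
decoded = json.loads(encoded)
```

Because the notation is plain JSON, the application can work with `decoded` directly as dictionaries and lists, without a mapping layer.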

2.1.3 Graph Database: A graph G = (V, E) consists of a set of vertices V and a set of edges E. An
edge e ∈ E is a pair of vertices (v1, v2) ∈ V × V. If the graph is directed, these pairs are ordered.
The data models of different graph databases (GDs) generally differ in the kind of graph they support:
whether it is directed, whether self-loops (i.e., edges e = (v1, v2) where v1 = v2) are permitted,
whether multi-edges (i.e., two or more identical edges between the same vertices) are permitted,
and so on. Examples of common GDs include Neo4j, InfiniteGraph, HyperGraphDB, sones
GraphDB and InfoGrid. Each traversal step, from a vertex to its neighbours, is executed at roughly
constant speed, independent of the total size of the graph. No set operations are involved that
decrease performance, as happens with join operations in an RDBMS. [10]
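
As a rough illustration of the definition above, a directed graph can be held as an adjacency list and traversed by following edges from vertex to vertex. The edge list and function names below are hypothetical, and the sketch models the data structure only, not the storage engine of any of the products named.

```python
from collections import deque

# Directed graph G = (V, E) given as ordered pairs (v1, v2).
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "d")]  # ("d", "d") is a self-loop


def build_adjacency(edge_list):
    """Turn the edge set E into a mapping: vertex -> list of successors."""
    adjacency = {}
    for v1, v2 in edge_list:
        adjacency.setdefault(v1, []).append(v2)
        adjacency.setdefault(v2, [])
    return adjacency


def reachable(adjacency, start):
    """Breadth-first traversal; returns the vertices in the order visited."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        vertex = queue.popleft()
        order.append(vertex)
        # Each step only touches the neighbours of the current vertex,
        # so its cost does not depend on how large the whole graph is.
        for neighbour in adjacency[vertex]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


graph = build_adjacency(edges)
visited = reachable(graph, "a")
```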
2.1.4 Columnar Database: Columnar databases are column-oriented databases and are also known as
column family databases. There are two types of column-oriented databases,
detailed below:
(1) Wide-Column data stores [13]

This is one type of NoSQL database. Wide-column data stores are used for
web processing and for streaming data and documents. The structure of a wide-column data store is
depicted in Figure 2 below:

Figure 2: Structure of Wide column data stores

Meaning of each field provided in the structure of wide column data store databases is as depicted
below in Table 1.

ATTRIBUTE             MEANING

Row no                It is a key that is unique in nature. It may be a string or number.

Column Name           Data stored on the basis of column family.

Column Description    It describes the stored data item.

Time stamp            It tells the complete time of particular instance.

Data value            Value or attributes related to that corresponding key.

Table 1: Meaning of fields in structure of wide column databases

(2) Column-oriented databases:

To understand column-oriented databases, consider the bank database given in Table 2,
whose attribute fields are EmpID, Salary and Designation, with the corresponding values
shown below.

EmpID    Salary    Designation

100      10,000    Clerk

200      20,000    Assistant Manager

300      30,000    Manager

400      40,000    Zonal Head

Table 2: Example of Bank database

Representation of row-oriented and column-oriented databases:

• Row-oriented databases are those in which all the values of a row are stored together, one
row after another.
• Column-oriented databases are those in which all the values of a column
are stored together.
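
The two layouts can be compared directly on the data of Table 2. The sketch below is illustrative only; the variable names are hypothetical and Python lists stand in for on-disk storage.

```python
column_names = ("EmpID", "Salary", "Designation")
records = [
    (100, 10000, "Clerk"),
    (200, 20000, "Assistant Manager"),
    (300, 30000, "Manager"),
    (400, 40000, "Zonal Head"),
]

# Row-oriented: the values of each record are stored together, record after record.
row_layout = [value for record in records for value in record]

# Column-oriented: all the values of one column are stored together.
column_layout = {
    name: [record[i] for record in records]
    for i, name in enumerate(column_names)
}

# An aggregate over one column reads a single contiguous list.
total_salary = sum(column_layout["Salary"])
```

Summing salaries touches only the `Salary` list in the column layout, whereas the row layout would require stepping over every EmpID and Designation as well. This is why column orientation suits analytical queries.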

III. ADVANTAGES OF NOSQL

A NOSQL data warehouse offers the following advantages over a traditional data warehouse:

3.1 Managing Large Chunks of Data: NoSQL databases can handle numerous read/write cycles,
many concurrent users, and data volumes ranging into petabytes.

3.2 Complex Schema Management: A NOSQL data warehouse provides a highly scalable, schema-less
structure and a cost-effective, cloud-based storage solution. The schema-less nature of NOSQL eliminates
the need to transform schemas between the various data sources and the data warehouse during ETL.
In a traditional data warehouse this transformation consumed considerable time.

[Figure: data is loaded (LOAD) into the NOSQL DATAWAREHOUSE (cost effective, schema-less, scalable;
schema agility on change), which feeds analytics and OLAP TOOLS.]

Figure 3: NOSQL Datawarehouse

Among the various NOSQL databases, we prefer the document-oriented database.
Document store databases are schema-free; the structure of a document is not fixed. The central concept
of a document-oriented database is the notion of a document. Documents encapsulate and encode data
(or information) in some standard format: JSON, XML, BSON, YAML, or binary forms (like
PDF and MS Word). A document is similar to a row or record in a relational DB, but it is more flexible and
is retrieved on the basis of its contents. There are various document-oriented databases,
such as MongoDB and CouchDB.

Document database implementations offer a variety of ways of organizing documents, including
notions of

• Collections: groups of documents, where depending on implementation, a document may be
enforced to live inside one collection, or may be allowed to live in multiple collections
• Tags and non-visible metadata: additional data outside the document content
• Directory hierarchies: groups of documents organized in a tree-like structure, typically based
on path or URI

The following is a document, encoded in JSON:

{
    "FirstName": "Bob",
    "Address": "5 Oak St.",
    "Hobby": "sailing"
}
NOSQL documents are schema-less. This can be explained as follows. Let’s say we have a
document collection named Customers with the following data inside:

{ "Id": 1, "Name": "John Doe" },
{ "Id": 2, "Name": "Bob Smith" }

Because of the collection’s schemaless nature, nothing prevents us from adding another
document, like this:

{ "Id": 1, "Name": "John Doe" },
{ "Id": 2, "Name": "Bob Smith" },
{ "Id": 3, "FirstName": "Alice", "LastName": "Christopher" }

These documents share some structural elements with one another, but some also have unique
elements. The structure, text and other data inside a document are usually referred to as the
document's content and may be referenced via user-defined methods. Unlike a relational database,
where every record contains the same fields and unused fields are left empty (NULL), there are no empty
(NULL) fields in any document (record) in the collection. This approach allows new information to
be added to some records without requiring that every other record in the database share the same
structure or schema.
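
The contrast with a relational table can be made concrete with a short Python sketch. It assumes the Customers collection from the example above, held in memory as a list of dictionaries; `relational_rows` is a hypothetical stand-in for a fixed-schema table, with `None` playing the role of NULL.

```python
customers = [
    {"Id": 1, "Name": "John Doe"},
    {"Id": 2, "Name": "Bob Smith"},
]
# Nothing in the collection prevents a document with a different shape.
customers.append({"Id": 3, "FirstName": "Alice", "LastName": "Christopher"})

# A fixed-schema table needs every column on every row, padded with NULL.
all_fields = sorted({field for doc in customers for field in doc})
relational_rows = [{field: doc.get(field) for field in all_fields} for doc in customers]

nulls_in_documents = sum(value is None for doc in customers for value in doc.values())
nulls_in_table = sum(value is None for row in relational_rows for value in row.values())
```

For these three customers the document collection stores no empty fields, while the equivalent table carries five NULLs across four columns.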

Document databases typically provide for additional metadata to be associated with and stored along
with the document content. That metadata may be related to facilities the datastore provides for
organizing documents, providing security, or other implementation specific features.

While NoSQL databases are technically schemaless, meaning that they allow us to store documents in
any shape we want, the notion of schema itself doesn’t vanish from our domain model. Schemaless
databases just shift the responsibility for maintaining the schema to us, the developers. The use of NoSQL
storage means a move from explicitly defined data structures to implicit ones. Schema migration in
NOSQL is the responsibility of the developer.
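
One common way of carrying this responsibility in application code is to upgrade old documents when they are read and to check the fields the application relies on. The sketch below is a hedged illustration: the "new shape" (`FirstName`/`LastName` instead of `Name`), the function names and the list of required fields are all assumptions made for this example.

```python
def upgrade_customer(doc):
    """Return a copy of the document in the (hypothetical) newer shape.

    Older documents carry a single "Name" field; newer ones carry
    "FirstName" and "LastName". The database enforces neither shape.
    """
    if "FirstName" in doc:  # already in the newer shape
        return dict(doc)
    first, _, last = doc.get("Name", "").partition(" ")
    upgraded = {key: value for key, value in doc.items() if key != "Name"}
    upgraded.update({"FirstName": first, "LastName": last})
    return upgraded


def missing_fields(doc, required=("Id", "FirstName", "LastName")):
    """List the fields the application expects but the document lacks."""
    return [field for field in required if field not in doc]


old_doc = {"Id": 1, "Name": "John Doe"}
new_doc = upgrade_customer(old_doc)
```

The implicit schema now lives in `upgrade_customer` and `missing_fields` rather than in a table definition, which is the shift from explicit to implicit structure that the paragraph describes.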

The document database offers an API or query language that allows the user to retrieve documents
based on content (or metadata). For example, you may want a query that retrieves all the documents
with a certain field set to a certain value. The set of query APIs or query language features available, as
well as the expected performance of the queries, varies significantly from one implementation to
another. Likewise, the specific set of indexing options and configuration that is available varies greatly
by implementation.
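
A query of the kind mentioned, "all the documents with a certain field set to a certain value", can be sketched in plain Python. The `find` function is a hypothetical helper, not the API of any particular document database, and the sample documents beyond Bob's are invented for illustration.

```python
def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion.

    A document that lacks one of the fields simply does not match.
    """
    return [
        doc
        for doc in collection
        if all(field in doc and doc[field] == value for field, value in criteria.items())
    ]


people = [
    {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"},
    {"FirstName": "Alice", "Hobby": "sailing"},
    {"FirstName": "Carol", "Hobby": "chess"},
]

sailors = find(people, Hobby="sailing")
```

This linear scan only shows the semantics of a content-based query; a real document store would normally answer it from an index, and that is where implementations differ most.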
JSON is flexible and hence has all the benefits of schema-on-read. Each document can have a
different structure, and the schema can evolve freely when needed. JSON also supports scalability: no
ALTER TABLE is required for JSON.

With terabytes of data, running ALTER TABLE is very difficult. By combining Big Data with
NoSQL, cost is reduced both in terms of hardware and in terms of staff and development effort.

3.3 Horizontal scalability: Nodes can often be added to (or removed from) a cluster dynamically, without
any downtime, with a roughly linear effect on storage and overall processing capacity. Usually, there is no
(realistic) upper bound on the number of machines that can be added.
Figure 5: Scalability of NoSQL [15]
3.4 BASE instead of ACID: Brewer’s CAP theorem [5] states that a distributed system can have at most
two of the three properties Consistency, Availability and Partition tolerance. For a growing number of
applications, having the last two is most important. Building a database with these while providing
ACID properties is difficult, which is why Consistency and Isolation are often forfeited, resulting in
the so-called BASE approach [5]. BASE stands for Basically Available, Soft state, Eventual
consistency, which means that the application is available basically all the time and is not always
consistent, but will eventually be in some known state.
Follows CAP Theorem: The CAP theorem postulates that only two of the three different aspects of
scaling out can be achieved fully at the same time.

• Strong Consistency: all clients see the same version of the data, even on updates to the
dataset, e.g., by means of the two-phase commit protocol (XA transactions) and ACID.
• High Availability: all clients can always find at least one copy of the requested data,
even if some of the machines in a cluster are down.
• Partition tolerance: the total system keeps its characteristics even when the servers it is
deployed on are cut off from one another, transparently to the client.
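
The "not always consistent, but eventually in a known state" behaviour can be simulated with two in-memory replicas. This is a toy model under stated assumptions: the `Replica` class, the version numbers and the last-writer-wins rule are choices made for this sketch and are not prescribed by BASE or by the cited sources.

```python
class Replica:
    """One copy of the data; each key maps to a (version, value) pair."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Assumed conflict rule: the higher version wins (last writer wins).
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(source, target):
    """Background step: push every entry of one replica to the other."""
    for key, (version, value) in source.data.items():
        target.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("balance", 100, version=1)  # the write is accepted by one replica only
stale = r2.read("balance")           # the other replica has not seen it yet
sync(r1, r2)                         # the replicas exchange state some time later
fresh = r2.read("balance")
```

Both replicas stay available for reads and writes throughout; the price is the window in which `r2` returns a stale answer.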


Figure 4: NOSQL Characteristics

IV. COMPARISON OF SOME NOSQL DATABASES (FOUR CATEGORIES) WITH A MATRIX ON THE BASIS OF A FEW
ATTRIBUTES: DESIGN, INTEGRITY, INDEXING, DISTRIBUTION, SYSTEM [24]

Table 3: Comparison of some NoSQL Databases

V. BENEFITS OF NOSQL WAREHOUSE OVER TRADITIONAL WAREHOUSE

5.1 Flexible Schema: Document-oriented databases are schemaless, which means that two documents can have very
different schemas and data values, unlike the relational model, where each row in a table has the same
columns. For example, Twitter users could be stored in a document-oriented database where each document
contains the profile and some recent tweets of the corresponding user. A user profile has a set of attributes;
however, the set of attributes present might vary from one user to another (e.g., some users won't provide a DOB).
Because document-oriented databases are schemaless, all these user profiles can be stored efficiently by storing
only the relevant attributes specific to each user. A relational model, on the other hand, has one column for
each possible user profile attribute; hence many columns in a user profile row contain NULLs.

5.2 Fast Writes: Many document stores support multi-version concurrency control, thereby making writes
to documents very fast, whereas writes to an RDBMS can be slow for various reasons, such as locks.

5.3 Sharding/Partitioning: Document stores are effectively key-value stores, with the document id being the key
and the document itself being the value. Under such a setting, one can shard/partition the document store
simply by partitioning the key space. This process is rather complicated in an RDBMS setting, where there are
multiple tables and the query workload contains joins (i.e., query execution needs distribution support).
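
Partitioning the key space can be sketched as hashing the document id and taking the result modulo the number of shards. This is one possible scheme, assumed here for illustration; the function names, the sample documents and the use of SHA-256 are choices of this sketch, not something the cited sources specify.

```python
import hashlib


def shard_for(document_id: str, shard_count: int) -> int:
    """Map a document id to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(documents, shard_count):
    """Place each (id, document) pair on the shard chosen by its id alone."""
    shards = [dict() for _ in range(shard_count)]
    for doc_id, doc in documents.items():
        shards[shard_for(doc_id, shard_count)][doc_id] = doc
    return shards


documents = {f"user:{n}": {"Id": n} for n in range(1, 21)}  # hypothetical data
shards = distribute(documents, shard_count=4)
```

Since a document is found by recomputing `shard_for` on its id, no cross-shard coordination is needed for a lookup. Note that plain modulo placement moves most keys when the shard count changes; schemes such as consistent hashing are used to limit that.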
VI. OLAP ON NOSQL DATAWAREHOUSE

The most widely used form of analysis in Business Intelligence is OLAP (on-line analytical processing). Typical OLAP
operations include roll-up (increasing the level of aggregation) and drill-down (decreasing the level of aggregation or
increasing detail) along one or more dimension hierarchies, slice and dice (selection and projection), and pivot
(re-orienting the multidimensional view of data) [13]. To perform OLAP operations quickly, a special OLAP
server is usually used, either within the data warehouse or as a separate solution.
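
The operations named above can be sketched over a small list of fact records. The records, the dimension names (`year`, `quarter`, `region`) and the helper functions are hypothetical; a real OLAP server would work on precomputed cubes rather than scanning a list.

```python
from collections import defaultdict

facts = [
    {"year": 2015, "quarter": "Q1", "region": "north", "sales": 10},
    {"year": 2015, "quarter": "Q2", "region": "north", "sales": 20},
    {"year": 2015, "quarter": "Q1", "region": "south", "sales": 5},
    {"year": 2016, "quarter": "Q1", "region": "south", "sales": 7},
]


def group_sum(rows, dims, measure="sales"):
    """Aggregate a measure over the listed dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [row for row in rows if row[dim] == value]


def dice(rows, **allowed):
    """Dice: keep rows whose dimensions fall inside the allowed value sets."""
    return [row for row in rows if all(row[d] in values for d, values in allowed.items())]


detail = group_sum(facts, ("year", "quarter"))  # drill-down: finer level
rolled_up = group_sum(facts, ("year",))         # roll-up: coarser level
north_only = slice_(facts, "region", "north")
south_2015 = dice(facts, year={2015}, region={"south"})
pivoted = group_sum(facts, ("region", "year"))  # pivot: re-orient the view
```

Roll-up and drill-down differ only in how many hierarchy levels are passed to `group_sum`, and pivot only in their order.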

According to some business analysts, current Business Intelligence platforms and data warehouses do not cover all
the data necessary for decision making in today’s complex economic environment. Big volumes of data (so-called
"big data") cannot be processed by traditional warehouses and OLAP servers that are based on RDBMS solutions.
Instead, solutions that originated in the NoSQL movement, such as MapReduce (Hadoop), Hive, Pig or Jaql,
should be applied.

REFERENCES

1. Abiteboul, S., Cluet, S., Milo, T.: "Correspondence and Translation for Heterogeneous Data." Proceedings of the '97 ICDT, Delphi, Greece
(1997) 352-363
2. Banerjee, J., Kim, W., Kim, H., Korth, H.: "Semantics and Implementation of Schema Evolution in Object-Oriented Databases."
Proceedings of the '87 ACM SIGMOD, San Francisco, CA (1987) 311-322
3. Buneman, P., Davidson, S., Hillerbrand, G., Suciu, D.: "A Query Language and Optimization Techniques for Unstructured Data."
Proceedings of the '96 ACM SIGMOD, Montreal, Canada (1996) 505-516
4. Renu Kanwar, Prakriti Trivedi, Kuldeep Singh, "NoSQL, a Solution for Distributed Database Management System", International Journal
of Computer Applications (0975 – 8887) Volume 67– No.2, April 2013
5. Dr. K. Chitra, B. JeevaRani, "Study on Basically Available, Scalable and Eventually Consistent NOSQL Databases", International Journal of
Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 7, July 2013
6. http://www.allthingsdistributed.com/2007/12/eventually_consistent.
7. http://s3.amzonaws.com/AllThings/Distributed/sosp/amazon-dynamo
8. http://labs.google.com/papers/bigtable.html
9. http://www.cs.brown.edu/~ugur/osfa.pdf
10. http://www.readwriteweb.com/archives/amazon_dynamo.php
11. http://nosql-database.org
12. A B M Moniruzzaman, "NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison",
International Journal of Database Theory and Application Vol. 6, No. 4, August, 201
13. Vatika Sharma, "SQL and NoSQL Databases", International Journal of Advanced Research in Computer Science and Software
Engineering 2 (8), August- 2012, pp. 20-27
14. https://www.quora.com/What-are-the-limitations-of-NoSql-databases
15. http://smallbusiness.chron.com/disadvantages-data-warehouse-73584.html
16. http://www.loginradius.com/engineering/relational-database-management-system-rdbms-vs-nosql/
17. http://www.networkworld.com/article/2226514/tech-debates/what-s-better-for-your-big-data-application--sql-or-nosql-.html?page=2
18. Frank S.C. Tseng, Annie Y.H. Chou, "The concept of document warehousing for multi-dimensional modeling of textual-based
business intelligence", Decision Support Systems 42 (2006) 727– 744
19. W.H. Inmon, Building the Data Warehouse, John Wiley and Sons, New York, NY, 1993.
20. Martha Eliana Mendoza, Erwin Alegría, Carlos Cobos, Elizabeth Leon, "Multidimensional analysis model for a document
warehouse that includes textual measures", Decision Support Systems APRIL 2015
21. D. Zhang, C. Zhai, J. Han, A. Srivastava, N. Oza, "Topic modeling for OLAP on multidimensional text databases: topic cube and its
applications," Statistical Analysis and Data Mining, 2(5-6) (2009) 378-395.
22. Maria Indrawan, "Reflection on NoSQL", 978-0-7695-4779-4/12 $26.00 © 2012 IEEE

23. C. Cobos, J. Andrade, W. Constain, M. Mendoza, E. León, Web document clustering based on Global-Best Harmony Search, K-
means, Frequent Term Sets and Bayesian Information Criterion, in: Proceedings IEEE Congress on Evolutionary Computation (IEEE
CEC), (IEEE, Barcelona, Spain, 2010), pp. 4637-4644.

24. B. G. Tudorica and C. Bucur, “A comparison between several NoSQL databases with comments and notes”, Roedunet International
Conference (RoEduNet), 2011 10th, IEEE, (2011) June, pp. 1-5.

BIOGRAPHY

Ms. Gunjan is an Assistant Professor in the Computer Science & Technology Department at Manav Rachna
University. She is pursuing a PhD in Computer Science. Her research interests are Big Data and Information Systems.
