
HBase (Unit 4)

HBase is a distributed column-oriented database built on top of HDFS that provides Bigtable-like capabilities for the Hadoop ecosystem, with data stored in tables containing rows, columns, and versions. It uses a master-slave architecture with a single master and multiple region servers that host regions, and allows for fast random reads and writes through its data model of keys, column families, and columns.

Uploaded by

The piano guy
Copyright
© All Rights Reserved

HBase: Overview

• HBase is a distributed, column-oriented data store built on top of HDFS

• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing

• Data is logically organized into tables, rows, and columns

HBase: Part of Hadoop’s Ecosystem

HBase is built on top of HDFS

HBase files are internally stored in HDFS

HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes

• HDFS is good for batch processing (scans over big files)
• Not good for record lookup
• Not good for incremental addition of small batches
• Not good for updates

HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
• Fast record lookup
• Support for record-level insertion
• Support for updates (not in place)

• HBase updates are done by creating new versions of values

HBase vs. HDFS (Cont’d)

If the application needs neither random reads nor random writes → stick to HDFS

HBase Data Model
• HBase is based on Google’s Bigtable model
• Key-Value pairs

Figure labels: row key, column family, timestamp, value

HBase Logical View

HBase: Keys and Column Families
Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

Column family named “contents”
Column family named “anchor”
Column named “apache.com”

Row key             Time Stamp   Column “contents:”   Column “anchor:”
“com.apache.www”    t12          “<html>…”
                    t11          “<html>…”
                    t10                               “anchor:apache.com” = “APACHE”
“com.cnn.www”       t15                               “anchor:cnnsi.com” = “CNN”
                    t13                               “anchor:my.look.ca” = “CNN.com”
                    t6           “<html>…”
                    t5           “<html>…”
                    t3           “<html>…”

• Key
• Byte array
• Serves as the primary key for the table
• Indexed for fast lookup

• Column Family
• Has a name (string)
• Contains one or more related columns

• Column
• Belongs to one column family
• Included inside the row
• familyName:columnName

Version number for each row

• Version Number
• Unique within each key
• By default, the system’s timestamp
• Data type is Long

• Value (Cell)
• Byte array

(Figure: the example table repeated, with one cell labeled “value”.)
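
The logical view described so far (row key, then column, then timestamp, then value) can be pictured as a nested sorted map. The sketch below is a toy in-memory model written for illustration only; the class `ToyTable` and its methods are hypothetical names and are not part of the HBase client API.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory model of the logical view. This is NOT the HBase client API.
class ToyTable {
    // row key -> "family:column" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Store one version of a cell. Versions may arrive in any order.
    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
            .put(timestamp, value);
    }

    // Return the value with the highest timestamp, or null if the cell is empty.
    String getLatest(String rowKey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        return row.get(column).firstEntry().getValue();
    }
}
```

Because only cells that were written exist in the maps, an empty cell costs nothing in this model, and each row can carry a different set of columns.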

Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
• Columns are not part of the schema

• HBase has Dynamic Columns
• Because column names are encoded inside the cells
• Different cells can have different columns

(Figure: the “Roles” column family has different columns in different cells)

Notes on Data Model (Cont’d)
• The version number can be user-supplied
• Versions do not even have to be inserted in increasing order
• Version numbers are unique within each key

• Table can be very sparse
• Many cells are empty

• Keys are indexed as the primary key

(Figure label: has two columns [cnnsi.com & my.look.ca])

HBase Physical Model
• Each column family is stored in its own set of files (called HFiles)

• Key & Version numbers are replicated with each column family

• Empty cells are not stored
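
The three points above can be illustrated with a small sketch in which every column family gets its own list of entries, each entry repeats the row key and version, and a missing value writes nothing. This is a toy model under those assumptions; `ToyFamilyStores` and the textual entry format are invented for illustration and do not reflect the real on-disk file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of per-family storage. NOT the real HBase file format.
class ToyFamilyStores {
    // column family -> stored entries of the form "rowKey/family:column/timestamp=value"
    final Map<String, List<String>> stores = new TreeMap<>();

    void put(String rowKey, String family, String column, long timestamp, String value) {
        if (value == null) {
            return; // an empty cell takes no space: nothing is written
        }
        // The row key and the version are repeated in every entry of every family.
        String entry = rowKey + "/" + family + ":" + column + "/" + timestamp + "=" + value;
        stores.computeIfAbsent(family, f -> new ArrayList<String>()).add(entry);
    }
}
```

Writing one row with a `contents` cell and an `anchor` cell produces two entries in two separate stores, and both entries start with the same row key.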

Example

Column Families

HBase Regions
• Each table is partitioned horizontally, by row-key range, into regions; within a region, each column family has its own store
• Regions are the counterpart of HDFS blocks

(Figure label: each will be one region)
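
Horizontal partitioning by row key means that finding the region for a row is a range lookup over region start keys. The sketch below assumes regions are identified by an inclusive start key and shows that lookup with a sorted map; `ToyRegionMap` and the region names are hypothetical and are not HBase classes.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy range-partitioning lookup. NOT the HBase region location mechanism.
class ToyRegionMap {
    // region start key (inclusive) -> region name; "" is the lowest possible start key
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // The owning region is the one with the greatest start key <= rowKey.
    String locate(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return entry.getValue();
    }
}
```

With regions starting at `""` and `"com.cnn"`, the row `com.apache.www` falls into the first region and `com.cnn.www` into the second.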

HBase Architecture

Three Major Components
• The HBaseMaster
• One master

• The HRegionServer
• Many region servers

• The HBase client

HBase Components
• Region
• A subset of a table’s rows, like horizontal range partitioning
• Partitioning is done automatically

• RegionServer (many slaves)
• Manages data regions
• Serves data for reads and writes (using a log)

• Master
• Responsible for coordinating the slaves
• Assigns regions, detects failures
• Admin functions

Big Picture

ZooKeeper
• HBase depends on ZooKeeper

• By default HBase manages the ZooKeeper instance
• E.g., starts and stops ZooKeeper

• HMaster and HRegionServers register themselves with ZooKeeper

Creating a Table
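import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Legacy (pre-2.0) client API, as used on the original slide.
Configuration config = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
HBaseAdmin admin = new HBaseAdmin(config);
// A table is created from its name and its column families; columns are not declared.
HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
// Family names carry no trailing ':'; newer releases reject colons in family names.
desc.addFamily(new HColumnDescriptor("columnFamily1"));
desc.addFamily(new HColumnDescriptor("columnFamily2"));
admin.createTable(desc);
admin.close();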

Operations On Regions: Get()
• Given a key → return the corresponding record

• For each value, return the highest version

• The number of versions returned can be controlled
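
The behavior of Get() on a single cell (newest version first, with a caller-chosen limit on how many versions come back) can be sketched as follows. This is a toy model, assuming versions are plain `long` timestamps; `ToyCell` is a hypothetical class and not the HBase `Get` API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of one versioned cell. NOT the HBase Get API.
class ToyCell {
    // timestamp -> value, sorted so that the highest timestamp comes first
    private final NavigableMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.<Long>reverseOrder());

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Return at most maxVersions values, newest first.
    List<String> get(int maxVersions) {
        List<String> result = new ArrayList<>();
        for (String value : versions.values()) {
            if (result.size() == maxVersions) {
                break;
            }
            result.add(value);
        }
        return result;
    }
}
```

Calling `get(1)` corresponds to the default of returning only the highest version; a larger argument returns older versions as well.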

Operations On Regions: Scan()

Get(): Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key             Time Stamp   Column “anchor:”
“com.apache.www”    t12
                    t11
                    t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”       t9           “anchor:cnnsi.com” = “CNN”
                    t8           “anchor:my.look.ca” = “CNN.com”
                    t6
                    t5
                    t3

Scan(): Select value from table where anchor=‘cnnsi.com’

Row key             Time Stamp   Column “anchor:”
“com.apache.www”    t12
                    t11
                    t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”       t9           “anchor:cnnsi.com” = “CNN”
                    t8           “anchor:my.look.ca” = “CNN.com”
                    t6
                    t5
                    t3
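
Unlike Get(), the Scan() query above has no row key to go to, so it walks the rows in key order and keeps those that have a value in the requested column. A minimal sketch of that idea, assuming each row holds only its latest value per column; `ToyScan` is a hypothetical helper and not the HBase `Scan` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Toy full-table scan with a column condition. NOT the HBase Scan API.
class ToyScan {
    // table: row key -> ("family:column" -> latest value), iterated in row-key order
    static List<String> scan(NavigableMap<String, Map<String, String>> table, String column) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            String value = row.getValue().get(column);
            if (value != null) {
                matches.add(row.getKey() + "=" + value); // row has a cell in this column
            }
        }
        return matches;
    }
}
```

Every row is visited, which is why a lookup that can be expressed by row key is better served by Get().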
Operations On Regions: Put()
• Insert a new record (with a new key), or

• Insert a record for an existing key

Figure labels: implicit version number (timestamp); explicit version number
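
The two cases of Put() (new key versus existing key) and the two ways of versioning (implicit timestamp versus explicit version number) can be sketched with a toy table. The class `ToyPutTable` is hypothetical and not the HBase `Put` API; the implicit version is assumed here to be `System.currentTimeMillis()`.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of Put(). NOT the HBase Put API.
class ToyPutTable {
    // row key -> "family:column" -> timestamp -> value
    private final Map<String, Map<String, NavigableMap<Long, String>>> rows = new TreeMap<>();

    // Implicit version: the current system time is used as the version number.
    long put(String rowKey, String column, String value) {
        return put(rowKey, column, System.currentTimeMillis(), value);
    }

    // Explicit version: the caller supplies the version number.
    long put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>())
            .put(timestamp, value);
        return timestamp;
    }

    int rowCount() {
        return rows.size();
    }

    int versionCount(String rowKey, String column) {
        Map<String, NavigableMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) {
            return 0;
        }
        return row.get(column).size();
    }
}
```

A put on a new key adds a row; a put on an existing key with a different version leaves the row count unchanged and adds one more version to the cell.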

Operations On Regions: Delete()

• Marks table cells as deleted

• Multiple levels
• Can mark an entire column family as deleted
• Can mark all column families of a given row as deleted
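
Deleting by marking, at the column, column-family, or row level, can be sketched with a set of delete markers that reads consult before returning a value. This is a toy model: `ToyDeletes` is a hypothetical class, versions are omitted for brevity, and row keys, family names, and column names are assumed to be non-empty and free of `/` and `:`.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of delete markers. NOT the HBase Delete API.
class ToyDeletes {
    private final Map<String, String> cells = new TreeMap<>(); // "row/family:column" -> value
    private final Set<String> markers = new HashSet<>();       // what has been marked deleted

    private static String key(String row, String family, String column) {
        return row + "/" + family + ":" + column;
    }

    void put(String row, String family, String column, String value) {
        cells.put(key(row, family, column), value);
    }

    void deleteColumn(String row, String family, String column) {
        markers.add(key(row, family, column));
    }

    void deleteFamily(String row, String family) {
        markers.add(row + "/" + family + ":");
    }

    void deleteRow(String row) {
        markers.add(row + "/");
    }

    // A cell is hidden if its row, its family, or the cell itself carries a marker.
    String get(String row, String family, String column) {
        boolean deleted = markers.contains(row + "/")
                || markers.contains(row + "/" + family + ":")
                || markers.contains(key(row, family, column));
        return deleted ? null : cells.get(key(row, family, column));
    }

    // Marked cells are hidden from reads but still stored.
    int storedCells() {
        return cells.size();
    }
}
```

The stored cell count does not change when a marker is added, which mirrors the slide's point that a delete marks cells rather than removing them on the spot.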

HBase: Joins
• HBase does not support joins

• Can be done in the application layer
• Using scan() and get() operations
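
An application-layer join built from those two operations typically scans one table and, for each row, does a keyed lookup in the other. The sketch below assumes two hypothetical tables, `orders` (order id to customer id) and `customers` (customer id to name), modeled as plain maps; `ToyJoin` is not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Toy application-side join: scan one table, get from the other. NOT an HBase API.
class ToyJoin {
    static List<String> join(NavigableMap<String, String> orders, Map<String, String> customers) {
        List<String> joined = new ArrayList<>();
        // "scan": walk every order row in key order
        for (Map.Entry<String, String> order : orders.entrySet()) {
            // "get": look up the matching customer row by its key
            String customerName = customers.get(order.getValue());
            if (customerName != null) {
                joined.add(order.getKey() + ":" + customerName);
            }
        }
        return joined;
    }
}
```

Orders whose customer id has no matching row are dropped, so this behaves like an inner join; the cost is one lookup per scanned row.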

Altering a Table

Logging Operations

HBase Deployment

Figure labels: master node; slave nodes

HBase vs. HDFS

HBase vs. RDBMS

When to use HBase
