Huawei OceanStor Dorado V3 All Flash Storage Technical White Paper
Issue 1.0
Date 2017-03-30
Huawei and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective
holders.
Notice
The purchased products, services and features are stipulated by the contract made between Huawei and
the customer. All or part of the products, services and features described in this document may not be
within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements,
information, and recommendations in this document are provided "AS IS" without warranties, guarantees or
representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the
preparation of this document to ensure accuracy of the contents, but all statements, information, and
recommendations in this document do not constitute a warranty of any kind, express or implied.
Contents
4 Best Practices........................................................................................................................... 33
5 Conclusion .............................................................................................................................. 36
6 Acronyms and Abbreviations ............................................................................................... 37
Figures
Figure 3-25 Interoperability between high-end, mid-range, and entry-level storage .........................................24
Figure 3-26 Active-active arrays ....................................................................................................................25
This document describes the architecture and key features and technologies of Huawei
OceanStor Dorado V3 all-flash storage systems (OceanStor Dorado V3 for short),
highlighting the unique advantages and customer benefits.
2 Overview
To survive in increasingly fierce competition and shorten the rollout time of new services,
enterprises' IT systems must transform from a traditional cost center into a powerful tool
that helps enterprises improve their competitiveness and achieve business success. In addition
to providing high performance and robust reliability for mission-critical services, storage
systems must accommodate service growth, enhance service agility, and help services adapt
flexibly to an increasingly competitive environment.
Storage technologies have developed rapidly over the past two decades, and many
reliability-enhancing technologies have been introduced, such as RAID, RAID2.0+, remote
replication, and active-active arrays. CPU performance, which represents computing capability,
has improved by nearly 580 times, and I/O channel performance is almost 1000 times higher than
before. Storage media, however, have improved by only about 20 times. Disks have become a
bottleneck for storage systems, so the overall IT system performance cannot keep up with
fast-growing service requirements.
To address this challenge, a major change is taking place in the storage industry: solid state
drives (SSDs) are gradually replacing hard disk drives (HDDs). SSDs have clear advantages over
HDDs in terms of performance, reliability, and power consumption. However, they bring new
problems: the amount of data that can be written to an SSD is limited, and the I/O stack of
traditional storage system software is designed for HDDs and cannot bring SSD performance
into full play, resulting in a non-optimal total cost of ownership (TCO). To solve these problems,
all-flash storage systems designed specifically for SSDs have been introduced. However, many
emerging all-flash storage systems do not inherit the enterprise features of traditional enterprise
storage, so users cannot achieve an optimal IT solution that balances reliability, user habits,
and performance.
To address this issue, Huawei has launched OceanStor Dorado V3, a new-generation all-flash
storage system.
3 Solution
Huawei OceanStor Dorado V3 all-flash storage systems are dedicated to setting a new benchmark
in the enterprise storage field and providing the highest level of data services for enterprises'
mission-critical businesses. With an advanced all-flash architecture and rich data protection
solutions, OceanStor Dorado V3 delivers world-leading performance, efficiency, and reliability
that meet the storage needs of applications such as large-scale database OLTP/OLAP, VDI, and VSI.
Applicable to sectors such as government, finance, telecommunications, energy, transportation,
and manufacturing, OceanStor Dorado V3 is the best choice for mission-critical applications.
RAM: caches the Flash Translation Layer (FTL) mapping table and data to provide fast data
access.
NAND flash: the physical medium that stores data.
Disk enclosure: houses and manages the SSDs and provides interconnection and service access.
It consists of a subrack, expansion modules, and SSDs.
[Figure legend: 1 Fan module, 2 Fan module latch, 3 Fan module handle, 4 AC power module, 5 AC power module latch, 6 AC power module latch handle, 7 Data switch alarm indicator, 8 AC power module running indicator]
[Figure legend: 1 Power socket, 2 Management network port, 3 Serial port, 4 Management network port, 5 PCIe port link/speed indicator, 6 PCIe port]
[Figure: Disk enclosure configuration (with coffer disks), showing expansion module A and expansion module B]
Scale-out: Controllers of OceanStor Dorado V3 are redundantly connected using PCIe 3.0 for
data transmission, while redundant GE networks enable scale-out management.
[Figure: Scale-out networking, in which 3 U controller enclosures 0 and 1 connect to data switches 0 and 1 for data transmission and to the user's management network for management]
[Figure: Software architecture, showing engine 0 and engine 1, each with controllers A and B running the FlashLink modules (Flash Space, Flash Cache), the Cluster and Management planes, BDM, and the SSD driver on top of HSSDs, interconnected through data switches DSW0 and DSW1]
The storage controller software architecture mainly consists of the Cluster &
Management plane and service plane.
− The Cluster & Management plane provides a basic environment for system running,
controls multi-controller Scale-out logic, and manages alarms, performance, and user
operations.
− The service plane schedules storage service I/Os, realizes data Scale-out capabilities,
and implements controller software–related functions of the FlashLink technology
such as deduplication and compression, full-stripe sequential write, cold and hot data
separation, global wear leveling, and anti-wear leveling.
The HSSD software architecture covers the Huawei-developed SSDs (HSSDs) and their drivers;
it implements basic SSD functions and hardware-related FlashLink functions such as
I/O priority and multi-channel data flows.
[Figure: Write I/O flow through controller A and controller B down to block device management (BDM)]
1. Write I/Os enter Flash Space after passing through the protocol layer. The system checks
whether the I/Os belong to this controller. If not, the I/Os are forwarded to the peer controller.
2. If yes, the I/Os are written to the local Flash Cache and mirrored to the peer controller's Flash Cache.
3. A write success is returned to the host.
4. Flash Cache flushes data to Flash Pool where the data will be deduplicated and
compressed.
a. Flash Pool divides the received data into data blocks with a fixed length (8 KB).
b. Flash Pool calculates the fingerprint value of each data block and forwards the data
block to the owning controller based on the fingerprint value.
c. After the local controller receives data blocks, Flash Pool checks the fingerprint
table.
d. If the same fingerprints exist in the fingerprint table, obtain the locations where
related data is stored, and compare the data with data blocks byte by byte. If they
are the same, the system increases the reference count of the fingerprints and does
not write the data blocks to SSDs.
e. If the fingerprint table does not contain the same fingerprint, or the stored data and the
data blocks are not identical, compress the data blocks (at a 1 KB granularity).
5. Flash Pool combines the data into full stripes and writes it to SSDs.
a. Compressed I/Os are merged into write stripes of which the size is an integer
multiple of 8 KB.
b. If the I/Os are merged into full-write stripes, calculate the checksum and write the
data and checksum to disks.
c. If the I/Os are not merged into full-write stripes, add 0s to the tail before data is
written to disks (the 0s will be cleared in subsequent garbage collection).
d. Data is written to a new location every time and metadata mapping relationships are
updated.
e. After a message is returned indicating that I/Os are successfully written to disks,
Flash Cache deletes the corresponding data pages.
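The write flow above can be summarized with a short sketch. The following Python snippet is only an illustration of the described behavior, not Huawei's implementation: the Controller class, the LBA ownership rule, and the 8 KB block handling are simplified assumptions (deduplication and compression are sketched separately later in this document).

```python
BLOCK = 8 * 1024  # Flash Pool divides data into fixed 8 KB blocks

class Controller:
    """Toy two-controller model of the write path described above."""

    def __init__(self, cid):
        self.cid = cid
        self.flash_cache = []   # write cache, mirrored between controllers
        self.peer = None

    def owns(self, lba):
        # Hypothetical ownership rule: alternate 8 KB blocks between controllers.
        return (lba // BLOCK) % 2 == self.cid

    def write(self, lba, data):
        # Step 1: forward the I/O if it belongs to the peer controller.
        if not self.owns(lba):
            return self.peer.write(lba, data)
        # Step 2: write to local Flash Cache and mirror to the peer's Flash Cache.
        self.flash_cache.append((lba, data))
        self.peer.flash_cache.append((lba, data))
        # Step 3: acknowledge the host before the data is flushed to SSDs.
        return "ok"

    def flush(self):
        # Steps 4-5: split cached data into 8 KB blocks, which are then
        # deduplicated, compressed, and merged into full stripes.
        blocks = [(lba + off, data[off:off + BLOCK])
                  for lba, data in self.flash_cache
                  for off in range(0, len(data), BLOCK)]
        self.flash_cache.clear()
        return blocks

# Hypothetical usage: controller A handles one write and forwards another.
a, b = Controller(0), Controller(1)
a.peer, b.peer = b, a
assert a.write(0, b"x" * BLOCK) == "ok"       # owned by controller A
assert a.write(BLOCK, b"y" * BLOCK) == "ok"   # forwarded to controller B
```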
[Figure: Read I/O flow through controller A and controller B]
1. Flash Space analyzes received I/Os and judges whether the I/Os belong to the local
controller. If no, it forwards the I/Os to the peer controller.
2. The owning controller searches for the desired data in Flash Cache and returns the data
to the host. If it cannot find the data, Flash Pool will continue the process.
3. Flash Pool divides the I/Os into data blocks with a fixed size (8 KB), determines the
owning controller of each data block based on the LBA, and forwards the data block to
the owning controller.
a. On the owning controller of a data block, check the LBA-to-fingerprint mapping table
and obtain the fingerprint.
b. Forward the read request for the data block to the fingerprint-owning controller
according to the fingerprint forwarding rules.
c. On the fingerprint-owning controller, look up the fingerprint-to-storage-location mapping
table and read the data from the storage location.
4. The fingerprint-owning controller decompresses the data and returns it to the host.
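The read path therefore relies on two lookups: an LBA-to-fingerprint table on the data block's owning controller and a fingerprint-to-location table on the fingerprint-owning controller. The following minimal sketch illustrates that two-step lookup with assumed, illustrative data structures; it is not the actual metadata layout.

```python
# Minimal sketch of the read path described above. The LBA-to-fingerprint table
# and the fingerprint-to-location table may live on different controllers.
lba_to_fingerprint = {0: "fp_a", 8192: "fp_b"}          # owned by the LBA's controller
fingerprint_to_location = {"fp_a": 1024, "fp_b": 2048}  # owned by the fingerprint's controller
physical_store = {1024: b"compressed-A", 2048: b"compressed-B"}

def read_block(lba):
    # Step 3a: on the LBA-owning controller, map the LBA to a fingerprint.
    fp = lba_to_fingerprint[lba]
    # Steps 3b/3c: forward to the fingerprint-owning controller, look up the
    # storage location, and read the data from SSDs.
    location = fingerprint_to_location[fp]
    raw = physical_store[location]
    # Step 4: decompress (omitted here) and return the data to the host.
    return raw

print(read_block(0))      # b"compressed-A"
print(read_block(8192))   # b"compressed-B"
```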
3.2 FlashLink
3.2.1 Introduction
As stated in Chapter 2 "Overview", while SSDs have absolute advantages over HDDs in
terms of performance, reliability, and power consumption, they bring about new problems.
As the only vendor capable of developing both storage arrays and SSDs, Huawei adopts the
innovative FlashLink technology in OceanStor Dorado V3. This cost-effective all-flash storage
system resolves these problems. The following figure shows where FlashLink sits in the system
architecture.
[Figure: FlashLink in the system architecture, covering Flash Space, Flash Cache, the Cluster and Management planes, BDM, and the SSD driver on the controllers, together with the HSSDs]
As shown in the preceding figure, FlashLink contains the major I/O modules of controller
software and disk drivers. The following figure shows functions provided by FlashLink.
[Figure: Functions provided by FlashLink, including RAID2.0+ and RAID-TP]
This layer stores hot and cold data separately based on the characteristics of SSDs to
obtain better performance and reliability.
Data service layer: merges writes into full stripes before sending them to disks, completely
eliminating the write penalty. It also provides efficient I/O scheduling and improves RAID
usage. Priorities can be configured from the host interfaces down to the disks so that
latency-sensitive I/Os are granted the highest priority.
3.2.2 RAID-TP
As the capacity of individual disks grows, reconstruction after a disk failure takes longer. To
maintain system reliability, you can increase the reconstruction speed, but this affects service
performance; alternatively, you can increase redundancy to protect the system during
reconstruction, at the cost of disk utilization.

Based on redirect-on-write (ROW) technology, OceanStor Dorado V3 adopts full-stripe writes and
completely eliminates the write penalty. Write performance does not suffer in large-stripe RAID
configurations, so the cost of adding one more redundant disk is negligible. For example,
OceanStor Dorado V3 supports up to a 23+3 RAID configuration. Compared with a 23+2
configuration, capacity usage drops by only 3.5%, while reliability improves by two orders of
magnitude. In summary, the RAID-TP function lets OceanStor Dorado V3 maintain reliability,
performance, and efficiency with large-capacity disks.
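The 3.5% figure can be checked with a quick calculation, assuming capacity efficiency is simply the number of data columns divided by the total number of columns in a stripe:

```python
# Worked check of the capacity figures quoted above.
def efficiency(data_cols, parity_cols):
    return data_cols / (data_cols + parity_cols)

raid_tp = efficiency(23, 3)   # 23+3 (RAID-TP, triple parity)
raid_6  = efficiency(23, 2)   # 23+2 (double parity)
print(f"23+3: {raid_tp:.1%}, 23+2: {raid_6:.1%}, "
      f"difference: {raid_6 - raid_tp:.1%}")
# -> 23+3: 88.5%, 23+2: 92.0%, difference: 3.5%
```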
[Figure: P/E cycles of SSD 0 to SSD 3 (0% to 40% scale) under global wear leveling]
However, when the SSDs of OceanStor Dorado V3 approach the end of their service life, for
example, when the wear level exceeds 80%, multiple disks may fail at the same time and data
may be lost if global wear leveling is still used. In this case, the system enables anti-global
wear leveling to prevent SSDs from failing in batches. The system selects the most severely
worn SSD and writes data to it as long as it has free space. As a result, this SSD reaches the
end of its service life earlier than the other disks, and users are prompted to replace it. In
this way, SSDs do not fail in batches.
[Figure: P/E cycles of SSD 0 to SSD 3 (86% to 100% scale) under anti-global wear leveling]
Every SSD reserves certain space for garbage collection. During system running, one disk
may have more bad blocks than others, consuming more redundant space. As a result,
performance and reliability of the disk are affected. If the system keeps using the policy of
global wear leveling, the performance of the entire system will be affected and this disk will
be more vulnerable to damage. FlashLink obtains partial redundant space from each disk as
shared redundant space, called global capacity redundancy. Based on the remaining redundant
space on each SSD, FlashLink dynamically adjusts the data allocation algorithm to reduce
data written to the SSD that is most severely worn and scatter data to other disks, improving
SSD lifespan and storage performance.
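The placement behavior described above can be illustrated with a simple policy sketch. Only the 80% wear threshold comes from the text; the data structures, the selection rule, and the function name are assumptions for illustration.

```python
# Illustrative data-placement policy for the wear-leveling behavior described above.
ANTI_WEAR_THRESHOLD = 0.80   # from the text; other values are assumed

def pick_target_ssd(ssds):
    """ssds: list of dicts like {"id": 0, "wear": 0.35, "free": True}."""
    candidates = [d for d in ssds if d["free"]]
    worst = max(candidates, key=lambda d: d["wear"])
    if worst["wear"] >= ANTI_WEAR_THRESHOLD:
        # Anti-wear-leveling: keep writing to the most worn SSD so that only
        # one disk reaches end of life (and is replaced) at a time.
        return worst
    # Normal global wear leveling: direct new writes to the least worn SSD.
    return min(candidates, key=lambda d: d["wear"])

ssds = [{"id": i, "wear": w, "free": True}
        for i, w in enumerate([0.35, 0.30, 0.82, 0.25])]
print(pick_target_ssd(ssds)["id"])   # -> 2 (anti-wear-leveling kicks in)
```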
SSDs store data blocks that carry the same label in the same flash block. In this way, hot and
cold data are kept in different blocks, which reduces the amount of data migrated during garbage
collection and improves SSD performance and reliability.
[Figure: Full-stripe write when new data is written and existing data is modified across LUN 1 to LUN 4]
When new data is written to SSDs, the system merges the data of LUN 1 and LUN 2 into a full
stripe and then writes it to disks, as shown in the left part of the figure. When some of this
data is modified, as shown in the right part of the figure, the user changes 1B to 1b in LUN 1,
writes new data 3A and 3B to LUN 3, and writes new data 4A to LUN 4. The system merges 1b, 2B,
2C, 3A, 3B, and 4A into a new full stripe and writes it to disks, and the original 1B data block
is marked as junk data.
After the system runs for a long time, a large amount of junk data accumulates and the system
can no longer find space for full-stripe writes. When the amount of junk data reaches a certain
threshold, FlashLink performs global garbage collection to reclaim storage space, ensuring that
enough space is available for full-stripe writes at any capacity usage rate; a simplified sketch
of this behavior follows the figure below.
[Figure: Garbage collection, showing direct collection of stripes that contain only junk data and collection after valid data is copied to another location, with host I/Os and back-end I/Os scheduled by priority]
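The following sketch illustrates the garbage-collection idea described above: once the share of junk blocks crosses a threshold, valid blocks are copied out and rewritten as new full stripes. The threshold value and the data layout are illustrative assumptions, not product parameters.

```python
GC_THRESHOLD = 0.5   # assumed trigger point for illustration only

def garbage_collect(stripes):
    """stripes: list of lists; each entry is a data block or None if invalidated."""
    total = sum(len(s) for s in stripes)
    junk = sum(block is None for s in stripes for block in s)
    if junk / total < GC_THRESHOLD:
        return stripes                     # not enough junk yet, do nothing
    # Copy the surviving blocks and rewrite them as new full stripes.
    survivors = [block for s in stripes for block in s if block is not None]
    stripe_width = len(stripes[0])
    return [survivors[i:i + stripe_width]
            for i in range(0, len(survivors), stripe_width)]

old = [["1A", None, "1C", None], [None, "2B", None, None]]
print(garbage_collect(old))   # -> [['1A', '1C', '2B']]
```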
The origin volume (source LUN) and its snapshot each use a mapping table to access physical space.
The initial data of the origin volume is ABCDE and is stored sequentially in physical space. The
metadata of the snapshot is empty, so all read requests to the snapshot are redirected to the
origin volume.
When the origin volume receives a write request in which C is changed to F, the data is
directly written into new physical space P5 instead of being overwritten into physical
space P2, as shown in step 1 in the preceding figure.
After the data is written into the new physical space, mapping item L2->P2 is inserted
into the metadata of the snapshot. In this way, accesses to logical address L2 of the
snapshot are not redirected to the origin volume and data is directly read from physical
space P2, as shown in step 2 in the preceding figure.
L2->P2 in the mapping metadata of the origin volume is changed to L2->P5, as shown
in step 3 in the preceding figure.
Data in the origin volume is changed to ABFDE and data in the snapshot is still ABCDE.
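The redirect-on-write steps above can be traced with a small sketch that reproduces the ABCDE example. The dictionaries stand in for the mapping tables and are purely illustrative.

```python
# Sketch of the redirect-on-write snapshot step described above (C at logical
# address L2 is overwritten with F). Names and structures are illustrative.
physical = {"P0": "A", "P1": "B", "P2": "C", "P3": "D", "P4": "E"}
origin_map = {"L0": "P0", "L1": "P1", "L2": "P2", "L3": "P3", "L4": "P4"}
snapshot_map = {}   # empty: reads are redirected to the origin volume

def write_origin(lba, value, new_location):
    # Step 1: write the new data to fresh physical space instead of overwriting.
    physical[new_location] = value
    # Step 2: record the old mapping in the snapshot so its reads stop
    # being redirected to the origin volume for this address.
    snapshot_map[lba] = origin_map[lba]
    # Step 3: point the origin volume at the new physical location.
    origin_map[lba] = new_location

def read(mapping, lba):
    loc = mapping.get(lba, origin_map.get(lba))   # redirect if no snapshot entry
    return physical[loc]

write_origin("L2", "F", "P5")
print("".join(read(origin_map, f"L{i}") for i in range(5)))    # ABFDE
print("".join(read(snapshot_map, f"L{i}") for i in range(5)))  # ABCDE
```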
Figure 3-23 Distribution of the LUN data and metadata of the source LUN
The system has a unique metadata organization method and supports high-speed query,
deletion, insertion, and update of metadata. Snapshots using this organization method have no
performance loss.
When a snapshot is being created, IOPS and latency of the source LUN remain
unchanged.
When a snapshot is being deleted, IOPS and latency of the source LUN remain
unchanged.
When a snapshot is mapped to the host for read and write, IOPS and latency of the
source LUN remain unchanged.
The following example of deleting snapshot TP1 explains the implementation principles of
lossless snapshots.
Deleting the data exclusively occupied by a snapshot: the system checks whether mapping data for
the same LBA exists at a time point later than the snapshot's activation time. If such data
exists, the snapshot's data is exclusively occupied by that snapshot; if not, the data is shared.
In this example, the user has created three snapshots at time points TP0, TP1, and TP2, and the
latest time point is TP3. In the first figure on the left, LBA0 only has data in VERSION0, which
is shared by VERSION0, VERSION1, and VERSION2. LBA1 has no data in VERSION0; it has data in
VERSION1, VERSION2, and VERSION3, and each of these versions occupies its data exclusively.
Therefore, <LBA:1 VERSION:1 VALUE:P1> needs to be deleted.
Lossless snapshots are made possible by this time-point-based metadata management mechanism and
the high-speed random access of SSDs.
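A compact way to express this exclusivity check is sketched below, using the same example data. The mapping structure and function name are assumptions; only the rule (a version's data can be deleted if a newer version of the same LBA exists) follows the description above.

```python
# Illustrative exclusivity check when deleting snapshot TP1. The mapping data
# mirrors the example in the text; the structures themselves are assumed.
mappings = {            # (lba, version) -> physical location
    (0, 0): "P0",       # LBA0 written before TP0, shared by all snapshots
    (1, 1): "P1",       # LBA1 written between TP0 and TP1
    (1, 2): "P2",       # LBA1 rewritten between TP1 and TP2
    (1, 3): "P3",       # LBA1 rewritten after TP2 (current data)
}

def entries_exclusive_to(version):
    exclusive = []
    for (lba, ver), loc in mappings.items():
        if ver != version:
            continue
        newer = any(l == lba and v > ver for (l, v) in mappings)
        if newer:                                  # a later version exists, so only
            exclusive.append((lba, ver, loc))      # this snapshot still needs the data
    return exclusive

print(entries_exclusive_to(1))   # -> [(1, 1, 'P1')]: delete <LBA:1 VERSION:1 VALUE:P1>
```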
[Figure: Remote replication based on the primary LUN and its snapshot]
Technical advantages
− Data compression
Data compression is supported on iSCSI links, and the compression ratio varies with the type
of service data. For database services, the ratio can reach up to 4:1.
− Quick response to host requests
After a host writes data to the primary LUN at the primary site, the primary site
immediately returns a write success to the host before the data is written to the
secondary LUN. In addition, data is synchronized from the primary LUN to the
secondary LUN in the background and does not affect the access to the primary LUN.
HyperReplication/A does not synchronize incremental data from the primary LUN to
the secondary LUN in real time. Therefore, the amount of lost data is determined by
the synchronization period (ranging from 3 to 1440 minutes, 30s by default) that is
specified by the user based on site requirements.
− Splitting, switchover of primary and secondary LUNs, and rapid fault recovery
The asynchronous remote replication supports splitting, synchronization,
primary/secondary switchover, and recovery after disconnection.
− Consistency group
Consistency group functions are available, such as creating and deleting consistency
groups, creating and deleting member LUNs, splitting LUNs, synchronization,
primary/secondary switchover, and forcible primary/secondary switchover.
− Interoperability between high-end, mid-range, and entry-level storage
Developed on the OceanStor OS unified storage software platform, OceanStor
Dorado V3 is completely compatible with the replication protocols of Huawei
high-end, mid-range, and entry-level storage products. Remote replication can therefore be set up between OceanStor Dorado V3 and any of these products.
Application scenarios
Remote data disaster recovery and backup: For HyperReplication/A, the write latency of
foreground applications is independent of the distance between the primary and
secondary sites. Therefore, HyperReplication/A applies to disaster recovery scenarios
where the primary and secondary sites are far away from each other, or the network
bandwidth is limited.
[Figure: HyperMetro networking with a quorum server]
7. Compatibility with other features: HyperMetro can work with existing features such as
HyperSnap, SmartThin, SmartDedupe, and SmartCompression.
[Figure: Fingerprint distribution across controllers]
The efficiency of deduplication varies with the data type. In VDI applications, the
deduplication ratio can reach 10:1, while in database scenarios it is usually below 2:1.
Deduplication can be enabled or disabled per LUN. In scenarios that require higher performance
but yield a small deduplication ratio, for example databases, you can disable deduplication.
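The per-block deduplication decision described in the write flow (fingerprint lookup, byte-by-byte verification, reference counting) can be sketched as follows. The fingerprint function and the table layout are illustrative assumptions; the actual fingerprint algorithm is not specified in this document.

```python
import hashlib

# Illustrative inline-deduplication check for one 8 KB block.
fingerprint_table = {}   # fingerprint -> [stored block, reference count]

def dedupe(block):
    fp = hashlib.sha1(block).digest()
    entry = fingerprint_table.get(fp)
    if entry and entry[0] == block:       # duplicate confirmed byte by byte
        entry[1] += 1                     # bump the reference count,
        return False                      # nothing is written to SSD
    fingerprint_table[fp] = [block, 1]    # new unique block: write it out
    return True

print(dedupe(b"a" * 8192))   # True  (first copy is written)
print(dedupe(b"a" * 8192))   # False (duplicate, only refcount grows)
```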
As shown in the following figure, 8 KB data blocks are compressed, converged into full stripes,
and then written to disks.
[Figure: 8 KB data blocks are compressed into blocks of 1 KB to 8 KB and then merged into full stripes]
The efficiency of compression varies with the data type; the compression ratio generally ranges
from 2:1 to 3.5:1. Compression can be enabled or disabled per LUN, so you can disable it in
scenarios that require higher performance.
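A rough sketch of the compression and stripe-packing arithmetic is shown below, assuming compressed blocks are rounded up to a 1 KB granularity and stripes are padded with zeros up to a multiple of 8 KB, as described in the write flow. The packing policy shown here is a simplification.

```python
import zlib

KB = 1024

def compressed_size(block):
    raw = len(zlib.compress(block))
    return -(-raw // KB) * KB            # round up to the next 1 KB boundary

def pack_stripe(blocks, stripe_unit=8 * KB):
    used = sum(compressed_size(b) for b in blocks)
    padding = (-used) % stripe_unit      # zero-fill up to a multiple of 8 KB
    return used, padding

blocks = [bytes([i]) * 8 * KB for i in range(5)]   # highly compressible blocks
print(pack_stripe(blocks))   # e.g. (5120, 3072): 5 KB of data + 3 KB of padding
```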
eDevLUNs and local LUNs have the same properties. For this reason, SmartMigration is used
to provide online migration for heterogeneous LUNs.
SmartVirtualization applies to:
1. Heterogeneous array takeover
As customers build their data center over time, their storage arrays in the data center may
come from different vendors. How to efficiently manage and use storage arrays from
different vendors is a technical challenge that storage administrators must tackle. Storage
administrators can leverage the heterogeneous virtualization takeover function of OceanStor Dorado V3 to take over and centrally manage storage arrays from different vendors.
3.4.1.2 CLI
The CLI allows administrators and other system users to perform supported operations. You can
configure key-based SSH access so that users can run scripts from a remote host and log in to
the CLI remotely without saving passwords in the scripts.
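As an illustration of password-free scripting over SSH, the following sketch uses the Python paramiko library. The host name, user, key path, and command string are placeholders only; they are not actual OceanStor Dorado V3 CLI syntax.

```python
import paramiko

# Minimal sketch of key-based, password-free scripting against a storage CLI over SSH.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("storage.example.com", username="cli_user",
               key_filename="/home/admin/.ssh/id_rsa")   # key-based, no password

stdin, stdout, stderr = client.exec_command("show_system_status")  # hypothetical command
print(stdout.read().decode())
client.close()
```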
3.4.2.2 SNMP
SNMP interfaces can be used to report alarms and connect to northbound management
interfaces.
3.4.2.3 SMI-S
SMI-S interfaces support hardware and service configuration and connect to northbound
management interfaces.
3.4.2.4 Tools
OceanStor Dorado V3 provides diversified tools for pre-sales assessment and post-sales
delivery. These tools can be accessed through WEB, ToolKit, DeviceManager,
SystemReporter, and CloudService and effectively help users deploy, monitor, analyze, and
maintain OceanStor Dorado V3.
Third-party cloud platform vendors can obtain the OpenStack Cinder Driver and integrate it into
OpenStack so that their products support OceanStor Dorado V3.
OceanStor Dorado V3 provides four versions of OpenStack Cinder Driver: OpenStack Juno,
Kilo, Liberty, and Mitaka. In addition, OceanStor Dorado V3 supports commercial versions of
OpenStack such as Huawei FusionSphere OpenStack, Red Hat OpenStack Platform, and
Mirantis OpenStack.
This section lists part of the host compatibility information. For more information about
OceanStor Dorado V3 compatibility, visit
http://support-open.huawei.com/ready/index.jsf.
4 Best Practices
5 Conclusion
OceanStor Dorado V3, an all-flash storage array designed for critical enterprise services, adopts
a multi-controller architecture dedicated to flash storage and the FlashLink disk-controller
coordination technology to meet the requirements of unexpected service growth. Furthermore, its
1 ms, gateway-free active-active design keeps businesses always on, and inline deduplication and
compression improve efficiency and cut TCO by 90%. OceanStor Dorado V3 provides
high-performance, reliable, and efficient storage for enterprise applications such as databases
and virtualization, helping the financial industry, governments, enterprises, and carriers evolve
smoothly to the flash storage age.