Apache IoTDB: Time-series Database for Internet of Things
Chen Wang1,2, Xiangdong Huang1∗, Jialin Qiao1, Tian Jiang1, Lei Rui1, Jinrui Zhang3, Rong Kang1, Julian Feinauer4, Kevin A. McGrail5, Peng Wang6, Diaohan Luo1, Jun Yuan1, Jianmin Wang1, Jiaguang Sun1
1 School of Software, Tsinghua University, 2 EIRI, Tsinghua University, 3 Microsoft, 4 Pragmatic Industries GmbH, 5 InfraShield.com, 6 Fudan University
[email protected]
ABSTRACT
The amount of time-series data that is generated has exploded due to the growing popularity of Internet of Things (IoT) devices and applications. These applications require efficient management of the time-series data on both the edge and cloud side that supports high-throughput ingestion, low-latency queries and advanced time-series analysis. In this demonstration, we present Apache IoTDB, which manages time-series data to enable new classes of IoT applications. IoTDB has both edge and cloud versions, and provides an optimized columnar file format for efficient time-series data storage as well as a time-series database with a high ingestion rate, low-latency queries and data analysis support. It is specially optimized for time-series oriented operations such as aggregation queries, down-sampling and sub-sequence similarity search. An edge-to-cloud time-series data management application is chosen to demonstrate how IoTDB handles time-series data in real-time and supports advanced analytics by integrating with Hadoop and Spark. An end-to-end IoT data management solution is shown by integrating IoTDB with PLC4X, Calcite, and Grafana.

PVLDB Reference Format:
Chen Wang1,2, Xiangdong Huang1∗, Jialin Qiao1, Tian Jiang1, Lei Rui1, Jinrui Zhang3, Rong Kang1, Julian Feinauer4, Kevin A. McGrail5, Peng Wang6, Diaohan Luo1, Jun Yuan1, Jianmin Wang1, Jiaguang Sun1. Apache IoTDB: Time-series Database for Internet of Things. PVLDB, 13(12): 2901 - 2904, 2020.
doi:10.14778/3415478.3415504

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 12 ISSN 2150-8097.

1 INTRODUCTION
Nowadays, IoT applications are becoming increasingly popular in many areas. Examples can be found in consumer electronics including smart home devices, wearables and connected healthcare as well as in industrial applications with the rise of Industrial IoT (IIoT). Compared to traditional time-series usage for IT such as infrastructure monitoring, the major characteristics of these IoT applications are real-time data management with lower latency and more advanced analytics on the time-series datasets. Furthermore, when IoT is used in industrial applications, intelligent equipment usually produces one to two orders of magnitude more data than consumer-oriented IoT devices. This makes it even harder for analytics to produce valuable insights in a reasonable amount of time.

As an illustrative example, a single wind turbine can generate hundreds of data points every 20 ms [7] to monitor conditions, detect faults and make decisions. Future operations can then be decided by a set of sophisticated queries that data scientists run against the acquired time-series. Typical uses are signal decomposition and filtration, segmentation for different working conditions, and failure pattern matching.

Consequently, the IoT-related service market has spawned new workloads on time-series processing that blend the following requirements:

Edge computing: As edge devices have gained more computational power and edge computing has grown more popular, managing time-series data and supporting advanced analysis on the edge side is trending. It requires the time-series database to be capable of running on both edge and cloud side, while remaining well organized for data synchronization.

Long-life, large volume historical data: The volume of data in IIoT is large. For example, the sensors on a Boeing model 787 airliner produce upwards of half a terabyte of data per flight [8]. Compared with data center monitoring applications where the data is kept for a week or month, industrial users usually choose to keep all historical data for audit and statistical analysis of the whole life cycle of devices.

High throughput data ingestion: As illustrated in the wind turbine example, the database needs to handle the ingestion of tens of millions of time-series data points per second stably in a 24×7×365 manner. It becomes more challenging when the arrival of time-series data cannot be guaranteed to be in order due to various device and network problems including device failure, weak communication signal or network congestion.

Low latency and complex queries: Queries are typically used in three scenarios. (1) The value of the latest data point is required for real-time monitoring with a short on boarding interval. (2) Applications, like those for fault detection, regularly retrieve time-series data having a timestamp or time window filter for given time-series IDs, and the results are ordered by time. (3) The interactive, exploratory queries by data scientists are more complicated and unpredictable, where conditions on value and similarity of sub-sequence are applied on arbitrary lengths of historical time-series.

Advanced data analytics: Besides queries, advanced IoT data analytics like signal processing and machine learning algorithms are also necessary for data scientists to process the historical data. However, support from big data ecosystems such as Apache Spark requires ETL from the time-series database and keeping two costly copies of the huge historical data.

Time-series databases, like OpenTSDB [9] and KairosDB [2], are built on top of existing NoSQL stores but suffer from insufficient
present a unique time-series ID. The Metadata Management module manages the naming space of devices with a tree structure. For instance, Location1.Windfarm2.Manufactuer3.Turbine4 is a full path to describe a single wind turbine. The design of IoTDB chooses to store the data in an open native time-series file format for both database access with Query/Storage Engine and Hadoop/Spark access against a single copy of the data. It also serves as a distributed time-series database, where data is partitioned by grouping of time-series in Cluster Engine among different nodes while time-based

[Figure: architecture diagram. Recoverable labels: Cluster Engine, Series-based Partitioner, Raft Protocol, Single-node IoTDB, Metadata Management, Restful API, JDBC, SQL-like Language, Hadoop Ecosystems, Hive, Adaptor, Cache, Data Reader, Time Ordered Memtable, Detector.]
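The tree-structured naming space described above can be illustrated with a minimal Python sketch. It is not IoTDB's Metadata Management implementation: the class `NamespaceTree` and its methods are hypothetical names introduced here, and the two additional turbine paths are invented for the example. The sketch only shows how dot-separated full paths map onto a tree, and how every device under a shared prefix can be listed.

```python
from typing import Dict, List


class NamespaceTree:
    """Toy tree-structured namespace (hypothetical, not IoTDB's API)."""

    def __init__(self) -> None:
        # Each node is a dict mapping a child name to that child's own dict.
        self._root: Dict[str, dict] = {}

    def register(self, full_path: str) -> None:
        """Create one tree node per dot-separated segment of the path."""
        node = self._root
        for name in full_path.split("."):
            node = node.setdefault(name, {})

    def leaves_under(self, prefix: str) -> List[str]:
        """Return the full paths of all leaf nodes below the given prefix."""
        node = self._root
        for name in prefix.split("."):
            if name not in node:
                return []
            node = node[name]
        found: List[str] = []
        stack = [(prefix, node)]
        while stack:
            path, children = stack.pop()
            if not children:
                found.append(path)
            for name, child in children.items():
                stack.append((path + "." + name, child))
        return sorted(found)


tree = NamespaceTree()
# The first path is the example from the text; the others are made up.
tree.register("Location1.Windfarm2.Manufactuer3.Turbine4")
tree.register("Location1.Windfarm2.Manufactuer3.Turbine5")
tree.register("Location1.Windfarm9.Manufactuer3.Turbine1")
turbines = tree.leaves_under("Location1.Windfarm2")
print(turbines)
```

A nested dictionary keeps the sketch short; a real metadata manager would also need to attach per-series information such as data types to the leaves.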
[Figure: file structure diagram. Recoverable labels: Chunk Group (Chunk 1 … Chunk m, Group Footer); Chunk (Chunk Header: Summary Info, Page 1 … Page n); Page (Page Header: Summary Info, column for timestamp: T1, T2, …, Tn, column for values: V1, V2, …, Vn).]

Differential (D) files are used for infrequent update and deletion operations where updates are only for correction in case of data quality issues and deletion is for truncating the data older than a certain timestamp. The log-style records are appended in the versioned D-file, and will be similarly merged to TsFile as O3-TsFile. The query will sequentially scan the D-file to get the latest value, if the target time-series exists.
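The append-then-scan behaviour described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the D-file format or IoTDB code: the names `Update`, `DeleteBefore`, `DifferentialLog` and `resolve` are hypothetical, versioning and merging are not modelled, and "older than a certain timestamp" is read as "strictly smaller than a cutoff".

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Update:
    """Correction of one data point (hypothetical record type)."""
    series: str
    timestamp: int
    value: float


@dataclass(frozen=True)
class DeleteBefore:
    """Truncation of all points older than `cutoff` (assumed: timestamp < cutoff)."""
    series: str
    cutoff: int


Record = Union[Update, DeleteBefore]


class DifferentialLog:
    """Append-only log of updates and deletions, scanned sequentially on read."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, record: Record) -> None:
        # Records are only ever appended, never rewritten in place.
        self._records.append(record)

    def resolve(self, series: str, timestamp: int,
                base_value: Optional[float]) -> Optional[float]:
        """Scan the log in append order; later records override earlier ones.

        `base_value` is the value stored in the main data file.
        Returns None if the point has been deleted.
        """
        value = base_value
        for record in self._records:
            if record.series != series:
                continue
            if isinstance(record, Update):
                if record.timestamp == timestamp:
                    value = record.value
            elif timestamp < record.cutoff:
                value = None
        return value


log = DifferentialLog()
log.append(Update("Turbine4.speed", 100, 12.5))
log.append(Update("Turbine4.speed", 100, 13.0))   # later correction wins
log.append(DeleteBefore("Turbine4.speed", 50))

latest = log.resolve("Turbine4.speed", 100, base_value=11.0)
deleted = log.resolve("Turbine4.speed", 40, base_value=9.0)
untouched = log.resolve("Turbine4.temperature", 100, base_value=20.0)
print(latest, deleted, untouched)
```

A sequential scan costs time linear in the log length on every read, which is acceptable only because the text states that updates and deletions are infrequent.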
3 DEMONSTRATION
[Figure: demonstration setup. Recoverable labels: edge-side distance measuring sensor, PLC, Edge Side, Cloud Side, cloud-side file sync.]

IoTDB System: We first demonstrate IoTDB's usage on the edge side. We install a Raspberry Pi with a Mitsubishi programmable logic controller (PLC), an industrial distance measuring sensor and a gyroscope sensor as an intelligent IoT device. An IoTDB instance is deployed on the Raspberry Pi to manage the time-series data locally. The Raspberry Pi collects distance-measuring data at a frequency of 100 Hz from the PLC and the data is ingested into IoTDB locally. The edge IoTDB synchronizes the generated TsFiles to the cloud every 10 seconds. The angle changes (x, y, z, accelerated-x, accelerated-y, and accelerated-z) from the gyroscope are collected at a frequency of 5 Hz, and IoTDB JDBC is used to send the data to the cloud in real-time. Figure 3 shows the real sensors, the PLC and the Raspberry Pi with IoTDB. When the sensors are moved, we can see the visual time series being updated on the top-left screen.

On the cloud, an IoTDB instance receives both the batch TsFiles and the streaming data points in real-time. The effect of the File Sync is shown in Figure 4 (d). In the IoTDB-CLI console (Figure 4 (a)), we can see there are 7 time-series in IoTDB. Figure 4 (b) shows an aggregation query that down-samples the distance-measuring data from 100 Hz to 1 Hz. Figure 4 (c) shows the use of the KV-match index to get the most similar sub-sequence from the distance time-series curve when given a sample curve. Figure 4 (a) to (c) are all produced using IoTDB SQL, while Figure 4 (d) requires running two IoTDB instances and setting the receiver's IP in IoTDB's configuration.

To support advanced analysis, e.g., interactive data exploration and signal computing, we show how to integrate IoTDB with other systems. Figure 5 (a) shows the integration with Calcite to run a nested query against IoTDB. Figure 5 (b) shows using SparkSQL to translate the data in IoTDB to a DataFrame and leveraging the capability of DataFrame for complex analysis. In Figure 5 (c), we integrate Zeppelin with IoTDB for exploratory analysis. Figure 5 (d) shows using Grafana to visualize time-series data in IoTDB.

Figure 5: IoTDB integration for advanced functions. Panels: (a) Calcite Integration, (b) Spark Integration, (c) Zeppelin Integration, (d) Grafana Integration. Text recoverable from panels (a) and (b):

    jdbc:calcite> select count(*) from (select sensor0 from root.plc4jDemo where device = … and time >… and time <…)
    scala> spark.sql("select * from tsfile_table").show()

4 CONCLUSION
In this paper, we present Apache IoTDB, a high-performance database for time-series data management on the edge and cloud. A native time-series oriented columnar file format, TsFile, is introduced for improved query performance and storage efficiency. More write and query optimizations are also discussed. These contributions are demonstrated with a proof of concept application. This illustrates how time-series data is managed, visualized, explored and analyzed in the IoT world using Apache IoTDB.

ACKNOWLEDGMENTS
As an incubating Apache project, there are many contributors devoted to IoTDB. We thank all those who contribute to the community.

REFERENCES
[1] Apache IoTDB (incubating). [n.d.]. http://iotdb.apache.org
[2] KairosDB. [n.d.]. http://kairosdb.github.io
[3] Apache Parquet. [n.d.]. http://parquet.apache.org
[4] TSDB Comparison. [n.d.]. https://s.apache.org/tsdb-comparison
[5] InfluxDB vs OpenTSDB. [n.d.]. https://www.influxdata.com/blog/influxdb-markedly-outperforms-opentsdb-in-time-series-data-metrics-benchmark/
[6] Xiangdong Huang, Jianmin Wang, et al. 2016. PISA: An Index for Aggregating Big Time Series Data. In CIKM.
[7] IEC 61400-25-6:2016. 2016. Wind energy generation systems - Part 25-6: Communications for monitoring and control of wind power plants - Logical node classes and data classes for condition monitoring. Standard. International Electrotechnical Commission, Switzerland.
[8] Jussi Ronkainen and Antti Iivari. 2015. Designing a Data Management Pipeline for Pervasive Sensor Communication Systems. In FNC/MobiSPC.
[9] B Sigoure. 2010. OpenTSDB: The distributed, scalable time series database. Proc. OSCON 11 (2010).
[10] Jiaye Wu, Peng Wang, Chen Wang, Wei Wang, and Jianmin Wang. 2019. KV-match: A Subsequence Matching Approach Supporting Normalization and Time Warping. In ICDE.