1. Introduction
A smart grid is an electrical network that includes digital and emerging technologies to supervise and manage the operating processes for the transport of electricity from generation sources to end-users [
1]. The main aim of smart grids is to optimize operational capacity as well as integrate new technologies and renewable energy sources to improve efficiency, reliability, and safety while reducing the environmental impact with economic and social benefits. Among the new applications incorporated by smart grids are emerging technologies for energy storage, smart metering integration, distribution automation for rapid failure detection, and real-time demand response [
2,
3].
Smart grids also promote the integration of new information and communication technologies (ICT) to achieve bidirectional communications and automated control. This combination of traditional and new digital technologies adds intelligence to the network since it increases operational capacity to acquire, communicate, process, analyze, and protect data automatically. Such intelligence allows the operator to have a better situational awareness of the network to enhance control by improving the real-time response to variations that may arise on the grid, which means a contribution towards the interoperability of the grid for the safe and reliable exchange of information [
4]. However, a smart grid is technologically and economically viable only if the stakeholders involved can overcome several challenges in areas such as the evaluation of available energy resources and the improvement of control, management, and monitoring systems. The latter includes a detailed observation of the operation and electricity demand, which implies establishing advanced metering systems that allow constant monitoring of the energy supply (utility) side and demand (client) side. In this regard, the incorporation of big data systems as a technological option offers the possibility to analyze data at nearly all stages in the energy supply process [
5].
Since the smart grid incorporates major processes like generation, transmission, and distribution of electricity, the installations for each of these processes include a wide variety of equipment such as generators, turbines, transformers, switches, and current and potential transformers, among others. The final goal of using ICTs is to measure, monitor, and even control every point in the system. This means all assets, from generation to consumption, can be governed and managed by intelligent devices and optimized under environmental, social, institutional, or economic restrictions. Thus, the application of ICTs in the smart grid can collect large amounts of data, which also demands the use of big data and data analytics approaches [
6].
Experts from McKinsey Global Institute Energy evaluated 150 use cases from several companies, producing a comprehensive study of the impact of data analytics on them. Their reports indicate that the energy sector is one of the nine industries with the most positive potential implications of big data and data analytics. According to that study, utilities can benefit in many areas from incorporating big data and data analytics [
7]. For instance, work from Tu et al. suggests big data applications for wide-area situational awareness (perceiving, understanding, and projecting events in the system), state estimation, classification, and detection of events for power distribution systems [
8]. According to Schuelke-Leech et al., the main aspects in which big data research for smart grids could fit are generation, transmission, distribution, billing (interfaces with the client), markets, and regulation [
9]. In particular, Zhou et al. mention four sectors around big data applications: management on the generation side, microgrids and renewable energy management, collaborative operations and asset management, and demand-side management [
10]. They believe that the most important foci of big data applications for smart grids are energy (savings achieved with big data), exchange (integration with other sources), and empathy (satisfying needs). They call this the “Big Data 3E.” Other applications mentioned by the authors are validation and calibration of plants, short-term demand projection, demand response, estimation of parameters for distribution systems, and security and protection systems.
The arrival of big data applications for the smart grid brings benefits for both utilities and clients. Big data provides the opportunity to monitor, correct, and better integrate smart grid technologies, extract knowledge from data flowing through the grid, and further generate value and profits for utilities and customers [
9]. This can bring notable benefits such as increased grid stability and reliability, efficient use of assets, and improved customer experience and satisfaction. Likewise, the management of data generated from different components of the grid is fundamental for the successful implementation and operation of every process in the network [
8].
One of the most critical technologies in the deployment of smart grids is the advanced metering infrastructure (AMI). This technology has allowed, in recent years, the installation of a large number of smart energy meters and other measurement terminals on the end-user side [
10]. Smart meters produce data every 15 min (or less), which means that traditional databases and statistical analysis are no longer enough to extract the real value inside raw data coming from such meters. In addition to operational data, there are also other data sources to manage, such as the energy market, geographic information systems (GIS), or demographic data. This scenario reveals an increasing availability of high volumes of AMI data, the rise of advanced technologies for information analysis, and a strong need to make informed decisions to improve the planning and operation of the grid.
This work addresses two clear challenges arising from the scenario described above. The first is the use of big data techniques for handling and processing large volumes of raw data from smart meters, devices that generate data at time intervals that previous technologies could not offer. The second is the use of data analytics techniques to process such raw data and transform it into knowledge that adds value to the company or its customers.
The case study implemented big data and data analytics techniques with energy consumption information from 5567 London households. The households participated in the Low Carbon London project between 2011 and 2014. Approximately 1100 customers underwent a dynamic time of use (dToU) energy price scheme during 2013. Our work consisted of integrating a big data architecture and the use of data analytics techniques for advanced metering infrastructure (AMI) data analysis. The latter stage focused on generating value for the utility or its clients through different tasks: analyzing consumer behavior, energy consumption forecasting, and identifying correlations with exogenous data that may lead to better customer characterization and better grid planning and operation.
The following section presents a compilation of several applications developed around big data and data analytics for AMI data. The subsequent section presents the case study implemented, describing the data sources, the big data framework, and the data analytics techniques implemented. Lastly, the visualization and access stages account for how advanced information analysis can deliver, beyond a superficial description, relevant knowledge for utilities and their clients.
2. Big Data and Data Analytics for AMI
One essential component of the smart grid is the advanced metering infrastructure (AMI) technology since it provides utilities with a considerable amount of new information, which was not available with previous measurement systems. Data coming from AMI Smart Meters offer valuable information that utilities can use to optimize business operations or even customer service. AMI systems also provide a pervasive communication infrastructure for constant monitoring and remote control of the grid components [
11].
The extensive use of smart meters allows all the stakeholders involved in the operation of the smart grid to obtain benefits in the execution of their roles, and for the client to actively participate in the electric energy value chain, which generates new markets and possibilities for business in a smart grid [
12]. These features show AMI systems as a bidirectional enabler for awareness of energy consumption in the grid. On one hand, the utility can know the customers’ consumption patterns in near real-time. On the other hand, the customer becomes an active agent who knows his consumption behavior in detail and can actively impact network management, as is the case of energy efficiency or demand response programs.
With AMI, it is possible to continuously monitor customer consumption, event occurrence, and power quality, as well as open possibilities for constant interaction between users and utilities. One of the most significant consequences of this type of system is the arrival of a high volume of data that has to be processed. According to data from the US Energy Information Administration, the number of AMI devices went from 49.1 million to 150.8 million between 2014 and 2017, with most of the increase attributed to residential and commercial buildings [
13,
14].
This data growth extends the knowledge about customers’ consumption patterns, which improves demand response programs, enables better tariff schemes, and allows closer monitoring of other details of the network. The large volume of available information requires dedicated platforms and intensive algorithms for processing.
A study by Alahakoon D. presents an overview of the smart metering process, starting from AMI, going through advanced analytics, and then reaching all smart grid stakeholders [
15]. The author states two perspectives from stakeholders’ point of view: one seeks to offer some benefits from data mining and the other aims to fulfill an established need through data analytics. The same author also presents a more comprehensive study proposing the critical elements of data analysis from the smart meters: data capture, transfer and storage, technology and algorithms, and stakeholder-related applications [
16]. The work puts a particular emphasis on the challenges faced with the arrival of big data and the increase of platforms based on cloud processing such as real-time data processing, regulation of pay-per-use price models, and even security and privacy issues. Wang et al. also present a review of advanced data analytics for smart meters [
17]. The paper focuses mainly on the collection of descriptive, predictive, and prescriptive analytical work for three main applications pointed out by them: load analysis, load forecasting, and load management.
The work developed around data analytics applications for data coming from smart meters generally proposes two perspectives: the processing platforms/architectures and the methods and algorithms available to process the data.
On the one hand, regarding processing platforms, Shyam et al. present Apache Spark as a platform to store and run analytics for applications such as automatic demand response, pricing, and real-time data streaming [
18]. Liu et al. propose a hybrid solution for smart meter data analytics, combining Spark or Hive for data processing and MADlib, which is a machine learning toolkit for in-database analysis [
19]. Similarly, another study focused on the performance and efficiency of some advanced data analytics platforms [
20]. The authors proposed three stages in the study. The first one is the development of a performance benchmark to evaluate different platforms. The second stage offers a solution to the problem of data availability by taking into account privacy and security aspects. Lastly, the authors implement the benchmark for evaluation purposes. The evaluation covers five advanced data analytics platforms: Matlab, MADlib, System C, Hive, and Spark Streaming. They propose an algorithm to generate large realistic datasets from a small volume of real data. The work from Daki et al. also presents different concepts on data management in smart grids [
21]. In their review, the authors propose an architecture for customer advanced data analytics and show various stages for data processing, such as data sources, integration, storage, analytics, and visualization. Stoyanov et al. developed a study on capturing and querying smart meter data with Hadoop (Hive). The authors highlight that a centralized model is better if the volume of data is not considered high, i.e., within the limits of modern hard disks [
22].
On the other hand, several authors have focused on advanced data analytics approaches for processing smart metering data. Jha et al. show several advanced data analytics applications developed on AMI data in the Puducherry Smart Grid Pilot project. In that project, they perform data validation, identify meter tampering and missing information, carry out energy audits and accounting, and identify peak demand and consumer profiles [
23]. Work from Yu et al. shows an analysis of AMI data using fuzzy systems in the Tatung University campus. They use a combination of Cascading Style Sheets and Google chart API to support enhanced reading and real-time visualization. The result of this work is the integration of data from smart meters into a web platform for consistent visualization and the presentation of basic statistics and measurements [
24]. Although many of these analyses have been done mainly in the time domain, other authors use a frequency-domain methodology to characterize and analyze load profiles [
25].
Regarding load profile analysis, the work by Hayn et al. characterizes the consumption profiles of the clients and carries out socio-demographic studies to evaluate the influence of specific technologies and appliances on their consumption patterns [
26]. Ramos et al. developed a framework to characterize medium voltage (MV) users using knowledge discovery in databases (KDD) and identifying their load profiles. The proposed methodology includes pre-processing, clustering algorithms, selection, segmentation, and classification [
27]. Work from Kojury-Naftchali focuses on the self-organizing map (SOM) to characterize customers’ electricity consumption behavior [
28]. Besides the evaluation of customer consumption patterns, some works have increased the level of detail to reach load disaggregation, identifying individual uses of domestic appliances and evaluating their electric consumption footprint. Examples include non-intrusive load monitoring implemented with hidden Markov models, deep learning, or deep sparse coding algorithms [
29,
30,
31].
One of the most studied applications of AMI systems is load forecasting. Forecasting is a type of regression aimed at predicting the future value of a variable given its past values. Such autoregressive models may or may not consider additional or exogenous variables that evolve along the same time series [
32].
The dynamic electricity market depends on adequate load forecasting for appropriate demand-side management and planning, as well as on advanced analysis of external data that influence the behavior of customers and the electricity market. Such actions promote efficiency and savings for both utilities and customers.
There are several reviews of different linear and nonlinear models for forecasting tasks [
33,
34]. Hayes et al. suggest that nonlinear models, such as the Nonlinear AutoRegressive eXogenous (NARX) model, perform better than linear regressors for predicting this type of variable [
35]. Authors in the literature use different approaches for energy consumption forecasting, most of them based on machine learning algorithms. Among the most common algorithms for load forecasting are random forest estimation [
36,
37], autoregressive models, and neural networks [
38]. Some works use feature selection combined with wavelet transform or differential evolution algorithms for short-term load forecasting [
39,
40]. Ali et al. use singular value decomposition (SVD) to perform short term load forecasting [
41]. Aman et al. also use machine learning methods to forecast energy consumption patterns in a university campus microgrid, and mention possible applications for energy consumption planning and conservation [
42,
43]. Recent work from Taieb et al. presents a hierarchical probabilistic approach for electricity forecasting using the
MinT approach [
44].
Other authors have included external data sources for their works. For example, Liu et al. study load forecasting by implementing a Map/Reduce framework. The authors cluster geographical data according to weather conditions [
45]. The work from Cui et al. reviews the concept of prosumers (simultaneous producer and customer) and the impact of social media on their generation/consumption patterns [
46]. Chen et al. also study short-term load forecasting using deep residual networks; evaluations on two public datasets showed their model to be more accurate and robust than other state-of-the-art forecasting models [
47].
Another crucial aspect of the deployment of AMI is the possibility of active participation of customers in smart grids. For example, Kwac et al. present a methodology for demand response targeting by formulating optimization techniques to solve a stochastic knapsack problem (SKP) based on high-resolution data collection [
48]. Mogles et al. analyzed the effect of personalized messages about consumption patterns delivered through in-home displays. This work shows how, with adequate customer involvement, energy savings reached up to 22% and energy literacy rose from 0.52 to 1.28 on a scale from zero to four [
49]. Work from Tascikaraoglu presents a compilation of works focused on AMI data-driven demand response, which highlights the use of Artificial Neural Networks (ANN), online clustering, and distributed data analysis [
4].
AMI data analysis applications do not focus only on identifying demand profiles, customer classification, or forecasting tasks. For example, Gómez Lopez et al. present an overview of potential AMI data applications for power distribution loss reduction [
50]. Work from Botev et al. presents a data-driven model to identify possible sources of non-technical losses from AMI data. Their method is based on spectral analysis of periodic patterns, using features in the frequency domain. They highlight the model’s ability to perform online analysis [
51]. One of the most recent surveys on the subject reviews machine learning techniques to detect energy theft using smart metering data. The work suggests that these techniques can be simple (supervised or unsupervised learning) or hybrid, where the latter combines any of the above with more sophisticated methods like extreme learning machines, genetic support vector machines, and Boolean-rule fuzzy-logic SVM. The authors point out that several challenges in energy theft detection have not been adequately addressed yet, such as data imbalance (normal samples in the same range), Big Data’s 3Vs (volume, velocity, and variety), feature description and selection, and non-malicious factors (change of residents or appliances, or seasonality) [
52].
The privacy of customer data is also a significant challenge faced by developments around AMI data, since such data are constantly available with precise details about customers’ consumption habits. In this sense, Foreman et al. present a methodology to anonymize the data of customers with smart meters installed on their properties, while preserving billing services and automatic connection to the centralized system of a utility [
53].
Considering emerging technologies, Bereş et al. present a study of several tools based on cloud computing for data analytics. The work mentions the benefits of considering aspects like security, availability, and reliability. Moreover, they show the possibility of processing data safely in real time [
54]. Yan et al. also present a fog computing model to process AMI data. This approach addresses several challenges of cloud computing for AMI applications, such as expansion flexibility, efficiency, reliability, and high costs [
55].
Next,
Table 1 presents an overview of some works related to AMI data applications mentioned so far.
Although
Table 1 presents several applications based on AMI data, it is necessary to have a development framework with an adequate methodology to transform raw AMI data into usable knowledge according to the required application. The National Institute of Standards and Technology (NIST) proposed a reference architecture framework for the development of big data projects [
56]. Since it considers the general concepts of big data, applicable not only to AMI projects but to any application, we used the NIST reference architecture framework.
Big Data Reference Architecture
Reference architectures generally serve as a template for developing solutions in an orderly manner in a specific field and may be used for comparison and alignment purposes. The architecture proposed by NIST brings together common elements found in different documented case studies around the world. The reference architecture presented in
Figure 1 also includes general considerations on big data, its implications, and requirements [
57].
Five primary roles compose the reference architecture.
System Orchestrator: it defines and integrates the required data application activities into an operational vertical system. It provides the overarching requirements about business ownership, governance, data science, and system architecture.
Data Provider: it introduces new data or information sources into the big data system, either online or offline. It is also responsible for data persistence (hosting), data scrubbing (remove PII – personally identifiable information), metadata (for history and repurposing), policy for others’ access to data, and query without transferring (sometimes).
Big Data Framework Provider: supplies a computing infrastructure while protecting the privacy and integrity of data. Some resources or services used by the big data application provider are infrastructure framework (networking, computing, storage, environmental), data platform (physical storage, file systems, logical storage), and processing (software support for applications).
Big Data Application Provider: it executes the data life cycle to meet security and privacy requirements as well as the system orchestrator-defined requirements. It provides mechanisms for data collection, preparation, analytics (discovery of value in high-volume datasets), visualization (exploratory, explicatory, or explanatory), and access to the results of the data system.
Data Consumer: includes end-users or other systems that use the results of the big data application provider: search and retrieve, download, analyze locally, and reporting and visualization.
The data consumer role is in charge of understanding the results. For this, the tasks of visualization and access to the data developed by the application provider are of great importance. On the one hand, visualization allows the results of the analysis carried out to be communicated to an audience to facilitate interpretation and understanding [
58]. On the other hand, the access stage allows the information to be delivered to the data consumer efficiently according to their activity or job [
59,
60].
In a utility, there can be different types of data consumers including developer engineers, operational coordinators, decision-makers, and others. Depending on their role in the operation of the utility, each one requires different access to information aggregated or disaggregated at a certain level. An operation coordinator may require detailed operational data about events or clients, while an executive officer may require more comprehensive results on the performance and economy of the utility. Visualization and access to the correct type of information facilitate the communication of results in the way that each individual requires [
61].
One of the chapters of the framework presented by NIST documents 51 industry use cases referring to big data and data analytics applications [
62]. From all 51 documented use cases, the only one related to smart grid and AMI data is called “Machine Learning for Demand Forecasting in Smart Grids.” This case studied machine learning methods for forecasting energy consumption patterns in the USC campus microgrid, which could be useful for energy planning and conservation [
42,
43]. The case study presented by NIST shows an increasing need in the energy sector for implementing applications related to data processing and analysis to improve the operation of the smart grid.
So far, we presented an overview showing the importance of AMI deployment within smart grids, given the variety of applications that can potentially generate value for a utility and its customers. We also reviewed developments focused on processing the large volumes of AMI data generated by smart meters using big data and data analytics techniques. The model formulated by NIST, described above, stands out as a development framework for big data applications.
As motivators for this development, it is necessary to consider the importance of the role of AMI in the smart grid, the availability of a high volume of AMI data, and the growing need for developments focused on analyzing this raw data. In this way, the main objective of this work is to implement an exploratory case study that allows us to demonstrate the potential of big data and data analytics techniques applied to AMI data processing.
According to the roles depicted by the architecture, for the case described in this paper, there is a big data framework provider in charge of implementing the information value chain. In addition, a data provider supplies mainly AMI data coming from smart meters and some exogenous data. The objective is to develop an application acting as a big data application provider that covers the entire process, from collection to final access, so that utilities, as the data consumer, can use that data.
The next section describes the case study in detail. The case study presented is exploratory, i.e., the aim was to explore the benefits of different data analytics techniques (or at least some of them) applied to an AMI dataset to extract knowledge from raw data. This exploration includes implementing machine learning algorithms for descriptive and predictive analysis. However, a utility may require more sophisticated analytical techniques for a very particular purpose, e.g., the identification of consumption patterns of a customer sector, the financial evaluation of one of its demand response programs, or the identification of losses at different points in the grid.
4. Methods for the Big Data Analytics Application Development
The application implementation process consists of four main stages: data collection, data preparation, data analytics, and visualization and access. We describe each stage below.
4.1. Data Collection
The data collection stage used a Zeppelin notebook running a PySpark interpreter to store all collected data into Spark data frames. A Spark data frame is a distributed set of data organized into columns, similar to a relational table. These data frames allow scalability of computation in processing clusters and integration with all big data tooling and infrastructure via Spark [
71].
The raw data available in comma-separated values (CSV) files were imported into the processing cluster in eight data frames, as presented in
Table 2.
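To make the collection step concrete, the sketch below illustrates how raw CSV files could be loaded into Spark data frames from a PySpark interpreter. It is only a minimal example: the file paths, options, and data frame names are hypothetical, and the actual case study loaded eight data frames as listed in Table 2.

```python
# Minimal PySpark sketch (e.g., inside a Zeppelin/PySpark note) of loading raw CSV
# files into Spark data frames. Paths and names are illustrative, not the real ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ami-data-collection").getOrCreate()

# Half-hourly smart meter readings (hypothetical path and schema)
consumption_df = (
    spark.read
    .option("header", True)       # CSV files contain a header row
    .option("inferSchema", True)  # let Spark infer numeric/timestamp types
    .csv("hdfs:///data/raw/halfhourly_readings/*.csv")
)

# Household metadata, e.g., ACORN category and tariff (Std or dToU)
households_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("hdfs:///data/raw/household_information.csv")
)

consumption_df.printSchema()
print(consumption_df.count(), "half-hourly records loaded")
```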
4.2. Data Preparation
Data preparation is also called the ETL stage of the big data life cycle: extraction (collecting data), transformation (curating and preparing data), and loading (saving the data warehouse into HBase). The ETL process begins with raw data, which is the basis for building a data warehouse. The construction of the warehouse included two processes: filtering and data imputation. Filtering discards incomplete records that cannot be estimated or that are not suitable for a specific metric (e.g., the total monthly consumption cannot be calculated when records for several days in a month are missing). Data imputation completes missing data when possible (e.g., a missing energy value between two time stamps for a meter, when the immediately preceding and following time instants are available).
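As an illustration of these two preparation steps, the following sketch applies them with PySpark window functions, assuming a consumption data frame with hypothetical columns (meter_id, ts, kwh); it is not the exact procedure used in the case study.

```python
# Sketch of filtering and single-gap imputation over half-hourly readings.
# Assumes a data frame `consumption_df` with columns meter_id, ts (timestamp), kwh.
from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("meter_id").orderBy("ts")

prepared_df = (
    consumption_df
    # Imputation: when a single reading is missing but both neighbours exist,
    # replace it with the average of the previous and next half-hour values.
    .withColumn("prev_kwh", F.lag("kwh").over(w))
    .withColumn("next_kwh", F.lead("kwh").over(w))
    .withColumn(
        "kwh",
        F.when(
            F.col("kwh").isNull()
            & F.col("prev_kwh").isNotNull()
            & F.col("next_kwh").isNotNull(),
            (F.col("prev_kwh") + F.col("next_kwh")) / 2.0,
        ).otherwise(F.col("kwh")),
    )
    # Filtering: discard records that remain incomplete after imputation.
    .filter(F.col("kwh").isNotNull())
    .drop("prev_kwh", "next_kwh")
)
```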
In addition to data filtering and data imputation, it was also necessary to establish correlations and groupings between the different data sources and to analyze data with different granularities. Therefore, the data warehouse included different data frames, according to their type, granularity, or the kind of analysis to be performed.
Since our study was exploratory, we built a data warehouse with different time granularities based on the original raw data time base (30 min): half-hourly, hourly, daily, weekly, and monthly. We also created data frames with information corresponding to the ACORN groups and the individual characteristics of the clients.
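Continuing the preparation sketch above, the coarser granularities of the warehouse can be derived by aggregation; hourly and daily data frames are shown, and weekly and monthly follow the same pattern (column names remain hypothetical).

```python
# Sketch of deriving coarser time granularities from the prepared half-hourly data.
from pyspark.sql import functions as F

hourly_df = (
    prepared_df
    .withColumn("hour", F.date_trunc("hour", F.col("ts")))
    .groupBy("meter_id", "hour")
    .agg(F.sum("kwh").alias("kwh"))
)

daily_df = (
    prepared_df
    .withColumn("day", F.to_date("ts"))
    .groupBy("meter_id", "day")
    .agg(F.sum("kwh").alias("kwh"))
)
```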
The initial database contained around 167 million records distributed in 8 data frames. The consolidated data warehouse stored about 676 million records distributed across 29 Spark data frames of different combinations of time granularities and grouping criteria.
This data redundancy is a consequence of using big data tools and facilitates data analytics: it is possible to deal with complete datasets instead of a sample of the data, as is done in some statistical approaches.
Although Spark data frames reside in memory, it is possible to persist them on the hard disk. Nevertheless, the most appropriate approach, especially if persistent access to the data is required, is to save them permanently into the Hadoop Distributed File System (HDFS). For this purpose, the entire data warehouse was migrated to HBase using Apache Phoenix [
72].
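A possible way to perform this migration from the same PySpark session is sketched below using the Phoenix Spark connector; the table name and ZooKeeper quorum are hypothetical, and the exact format and option names depend on the connector version deployed on the cluster.

```python
# Sketch: persist one warehouse data frame to HBase through Apache Phoenix.
# Table name and ZooKeeper URL are placeholders; option names may vary by version.
(
    hourly_df.write
    .format("org.apache.phoenix.spark")
    .mode("overwrite")
    .option("table", "AMI.HOURLY_CONSUMPTION")
    .option("zkUrl", "zookeeper-host:2181")
    .save()
)
```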
4.3. Data Analytics
Once we adequately transformed and stored the necessary data with different granularities, an initial descriptive analysis was carried out before applying different algorithms for predictive analysis. We performed this descriptive visual analysis by connecting the data stored in the processing cluster to Tableau visualization tools [
73].
4.3.1. Descriptive Analytics
For this task, we built graphic descriptions of the data with Tableau to understand energy usage across different user categories and time granularities.
For example,
Figure 3 shows the energy consumption per hour of a household with standard tariff (Std) on 13 April 2013.
The case study considered five Acorn categories: affluent achievers, rising prosperity, comfortable communities, financially stretched, and urban adversity. However, the rising prosperity, comfortable communities, and financially stretched groups presented reasonably similar trends. Therefore, for this study, these categories were merged into a single comfortable category. The new classification, referred to in this paper as grouped categories, corresponds to:
Affluent, for all households classified as acorn affluent achievers
Comfortable, grouping the households from rising prosperity, comfortable communities, and financially stretched categories
Adversity, for households belonging to the urban adversity acorn category.
Figure 4 presents an example graph with the average energy consumption for each day of August 2013 for each grouped category.
We analyzed consumption patterns for each grouped category at different time horizons: hourly, daily, and monthly. We also made a differentiation between consumption on workdays and holidays. For each category and time horizon, we estimated average, maximum, and minimum consumption and compared users with Std and ToU tariffs.
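The sketch below shows how one such descriptive aggregate (average, maximum, and minimum hourly consumption per grouped category and tariff) could be computed on the warehouse data frames before handing the result to Tableau; the join key and column names are hypothetical.

```python
# Sketch of a descriptive aggregation: avg/max/min hourly consumption per
# grouped category and tariff (Std vs. ToU). Column names are illustrative.
from pyspark.sql import functions as F

descriptive_df = (
    hourly_df
    .join(households_df, "meter_id")
    .groupBy("grouped_category", "tariff", F.hour("hour").alias("hour_of_day"))
    .agg(
        F.avg("kwh").alias("avg_kwh"),
        F.max("kwh").alias("max_kwh"),
        F.min("kwh").alias("min_kwh"),
    )
    .orderBy("grouped_category", "tariff", "hour_of_day")
)
descriptive_df.show(24)
```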
In addition to the analysis of the consumption patterns of users and categories, we included information about exogenous variables such as weather and seasons. We analyzed the effect of variables such as temperature, daylight hours, and seasons of the year on the consumption habits of each grouped category.
Section 5 presents the results and some interesting findings from this descriptive analysis stage, which allowed us to identify different characteristics of electricity consumption in households and their categories, as well as relationships with other variables such as temperature and daylight hours. However, the power of data analytics extends from the descriptive to the predictive approach, as in machine learning tasks for clustering and forecasting.
4.3.2. Predictive Analytics
The main goal of predictive analytics is to use current and historical information to find future patterns or characteristics not explicit in the available information [
74]. In this work, we implemented two main tasks of predictive analytics: clustering and forecasting.
Clustering
Clustering is a machine learning task aimed to group sets of objects with similar characteristics [
32]. In this case study, the households were already grouped into acorn categories and then gathered into three global categories (affluent, comfortable, and adversity), called grouped categories. Nevertheless, such geo-demographic classification segments users according to characteristics like household income and size, wealth, or social grade, but does not consider the electricity consumption patterns of each household [
64]. We propose the implementation of a clustering algorithm to assign three new categories to the data: high, medium, and low, considering only the electricity consumption patterns of customers, leaving aside their geo-demographic characteristics. This new categorization of users is not intended to be better than or to exclude the existing segmentation, but rather to complement it.
Therefore, we now have users segmented by two different criteria: one based on their geo-demographic characteristics and the other on their consumption patterns. For example, users from the adversity category (geo-demographic based) should intuitively match the users assigned by our clustering algorithm to the low consumption category (energy consumption based). However, this match is not guaranteed; there may be, e.g., adversity users with high energy consumption. Having two user segmentation criteria that may or may not coincide is useful for identifying possible atypical behaviors, which is one of the advantages of complementary grouping criteria.
We used K-means as a clustering algorithm. K-means proposes that a set of objects has as many centroids as groups/categories required; each object is assigned to the group whose centroid is closest to the object’s coordinates. Equation (1) describes the objective function of the K-means algorithm:

$$J = \sum_{i=1}^{k} \sum_{j=1}^{n} \left\| x_j^{(i)} - c_i \right\|^2 \tag{1}$$

where $k$ is the number of clusters, $n$ is the number of samples $x_j$, and $c_i$ is the centroid of each cluster.
The selected value of $k = 3$ corresponds to the expectation of identifying three household segments, similar to the three grouped categories presented above.
There were two possible options considered in the algorithm implementation: the Spark MLlib library [
75] and the Python Scikit-learn library [
76]. While the first one requires a distributed processing environment, the second is better for applications that, by their nature, do not necessarily require distributed computing. The Scikit-learn library was more appropriate for this case study, considering that the number of elements to be grouped is small (near 5500 households); we did not need distributed computing resources, at least for the clustering task of this specific case study. The MLlib library is more appropriate for a much larger number of records. Moreover, we used TSlearn, an extension of Scikit-learn specialized in the treatment of time series data [
77].
The main input arguments for the k-means clustering algorithm in TSlearn are the number of clusters ($k = 3$) and the data set to be grouped (5509 time series, one for each household). Each time series has 24 measurements, one for each hour of the day, forming the daily consumption pattern of each client.
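A minimal sketch of this clustering step is shown below. It assumes a NumPy array of shape (5509, 24) holding one average daily profile per household (here filled with placeholder values) and uses TSlearn’s k-means; it is illustrative rather than the exact configuration used in the case study.

```python
# Sketch: k-means over 24-point daily consumption profiles with TSlearn (tslearn).
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset

daily_profiles = np.random.rand(5509, 24)   # placeholder for the real household profiles

X = to_time_series_dataset(daily_profiles)  # shape becomes (n_households, 24, 1)

km = TimeSeriesKMeans(
    n_clusters=3,        # expected segments: high, medium, and low consumption
    metric="euclidean",
    random_state=0,
)
labels = km.fit_predict(X)                  # cluster index (0, 1, 2) per household

# Rank clusters by the mean level of their centroids to label them low/medium/high
order = np.argsort(km.cluster_centers_.mean(axis=(1, 2)))
names = {order[0]: "low", order[1]: "medium", order[2]: "high"}
household_segments = [names[label] for label in labels]
```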
The identified clusters were labeled high, medium, and low, referring to the electricity consumption of each group of clients. For example,
Figure 5 presents a graph with the average energy consumption for each day of August 2013 for each TSlearn category.
For this clustering exercise, we ran 50 experiments to verify repeatability, since the initial cluster centroids were assigned randomly at the beginning of each experiment.
Once each household had an assigned group, it was possible to make comparisons between the grouped categories and those found in this work, hereafter called TSlearn categories. The results of the clustering task are presented later in
Section 5.
Forecasting
As presented in Section 2, forecasting is one of the main applications of AMI data analytics and is a critical predictive analytics task addressed in this work.
For the forecasting application, this work uses the Nonlinear AutoRegressive eXogenous (NARX) model, expressed as follows:

$$y(t) = f\big(y(t-1), \ldots, y(t-n_y),\ x(t-d), \ldots, x(t-d-n_x)\big)$$

where $y$ is the variable to forecast, $f$ is the estimation function, $n_y$ is the autoregression order, $n_x$ is the order of the exogenous inputs $x$, and $d$ represents a time delay in the exogenous inputs, if necessary.
The estimation function $f$ represents an estimator with a learning capacity. Some conventional estimators in machine learning are artificial neural networks, decision trees, and random forests [
75].
On the one hand, artificial neural networks (ANN) are assemblies of single neurons that acquire knowledge from experience with historical data. The assemblies establish their learning parameters to predict a future value with the least possible error, similar to the learning process of a real neuronal system. On the other hand, decision trees are models whose branches represent possible decisions made according to the characteristics of each record in the available data. After the decisions are taken based on the available characteristics, each record is assigned a final value or a class label, depending on the machine learning application [
78].
Due to their implementation simplicity, decision trees have evolved into more robust versions such as random forests, which are ensembles of decision trees. The random forest approach combines several weak learners into a stronger learner to increase predictive power and robustness for larger and more complex datasets. These estimators have become widely used for predicting energy consumption [
36,
37].
For this case study, the estimation function $f$ was a random forest. The variable to be forecasted $y$ was the energy consumption, and the environmental temperature was the exogenous input $x$, given its high relation with the electricity consumption identified in the previous sections. The forecasting was made recurrently with hourly granularity: the consumption of the previous 24 h defined the consumption value of the following hour, i.e., $n_y = 24$, and the same order was used for the exogenous input, $n_x = 24$.
The prediction was made recurrently for each hour of the day, each grouped category, and each meter. For this task, we used data for 2013. A total of 70% of the data was used for model training and 30% was used for testing. The library used for the forecasting algorithm was fireTS, which is an extension of Scikit-learn specialized in the prediction of time series [
79].
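The sketch below shows how such a NARX forecaster with a random forest estimator can be set up with fireTS; the hourly arrays, the 70/30 split, and the number of trees are placeholders rather than the exact values used in the case study.

```python
# Sketch: NARX forecasting of hourly consumption with a random forest (fireTS).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from fireTS.models import NARX

consumption = np.random.rand(8760)       # placeholder: one year of hourly kWh values
temperature = np.random.rand(8760, 1)    # placeholder: hourly exogenous temperature

split = int(0.7 * len(consumption))      # 70% for training, 30% for testing

model = NARX(
    RandomForestRegressor(n_estimators=100),
    auto_order=24,      # n_y: previous 24 hourly consumption values
    exog_order=[24],    # n_x: previous 24 hourly temperature values
    exog_delay=[0],     # d: no delay on the exogenous input
)
model.fit(temperature[:split], consumption[:split])

# One-step-ahead (next hour) prediction over the test period
y_pred = model.predict(temperature[split:], consumption[split:], step=1)
```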
The results of the forecasting task are presented later in
Section 5.
4.4. Visualization and Access
Data visualization and access support the analysis of the results as part of the information value chain of the NIST framework shown in
Figure 1. For this work, part of the data used and generated corresponds to development stages like ETL and data warehouse building. In contrast, another part focuses on the presentation and visualization of results.
Since this case study was exploratory, the visualization of the data focused on the presentation of the results obtained for the two types of analytics addressed (descriptive and predictive). We designed Tableau dashboards [
80] to present results as detailed as the available data allowed. Their design aimed to provide readers with global access to all obtained results rather than targeting a specific profile, as previously mentioned in
Section 2. For this case study, we developed five Tableau workbooks containing several dashboards to visualize the results.
The first two workbooks correspond to the results of descriptive analytics. They include a set of dashboards that display electricity consumption by customers with different time granularities (hour, day, month, and year) and a section to visualize the difference in consumption between workdays and holidays. Dashboards also include visualization of consumption and payments differentiated by tariff (Std or ToU).
The last three workbooks condense the results of predictive analytics, obtained from the forecasting and clustering algorithms. The dashboards with the forecasting results allow viewing information disaggregated by customers and categories. The clustering dashboards present a comparison between the groups assigned to the TSlearn categories and the original grouped categories.
Figure 6 shows an example of the dashboards developed for result visualization corresponding to descriptive graphs with information per day each month. The upper left graph shows consumption per day for Acorn categories. The lower-left graph presents the same information for grouped categories. The right side shows the average consumption values by grouped categories and tariff program, while the lower-right graph presents consumption by day for each tariff program.
Throughout
Section 5, we present other graphs and dashboards designed for displaying results. The main contribution of the data visualization stage, which is transversal to the descriptive and predictive analytics tasks, is to facilitate the interpretation of the results obtained when treating the information. End users can then uncover knowledge that is not perceptible at first glance and quickly identify trends or behaviors of interest.
Below, we detail the results obtained in the implementation of the data analytics stages.
6. Conclusions and Upcoming Developments
This paper presents the results of the application of big data and data analytics concepts in a case study with AMI data, taken from the smart energy meters in the city of London. The work developed shows the transformation of raw data into knowledge that allows rapid identification of trends, average and individual patterns, and unusual events, and, in general, provides support for data-driven decisions aimed at better planning and operation of the distribution system.
Data processing on a distributed platform allowed the consolidation of a data warehouse with more than 600 million records, including information aggregated at different temporal granularities and categorical groupings. In addition, it includes external variables that are highly related to the consumption patterns of about 5500 households.
Descriptive analysis showed how households that participated in a dynamic time of use (ToU) tariff program had lower electricity consumption, which indicates the positive impact that the implementation of a demand response program can have when supported by the installation of Advanced Metering Infrastructure (AMI).
Variables such as temperature and daylight hours are highly related to users’ electricity consumption. In addition, it is important to notice how these changes influence specific consumption habits. For example, consumption on workdays is, in general, lower than on holidays. However, this behavior differs in the summer months, which may be related to seasonal changes in temperature.
Given that the number of elements to be grouped was relatively small (5509 households) and we knew the number of clusters to group users (we aimed to get three groups), K-means was a fitting candidate as a clustering algorithm. However, the potential application of this clustering task can grow to hundreds of millions of users [
13,
14]. In those cases, there are other algorithms like batch k-means, spectral clustering, or Variational Bayesian Gaussian Mixture (VBGM), according to each case requirement [
76].
The application of clustering techniques (TSlearn categories, consumption based) allowed proposing a new household segmentation different from the one initially assigned (grouped categories, geo-demographic based). When compared, these two types of categories allowed the identification of users with atypical patterns, which might reveal industrial or commercial activities in residential sectors or technical or non-technical losses, or even help in planning better pricing schemes for demand response programs.
The computational resources used for data processing (a distributed processing cluster) facilitated the implementation of the case study from the role of a big data framework provider. For example, the consolidation of the data warehouse into a distributed framework, which involved managing millions of records, was a simple but time-consuming task. Future work might include a test benchmark to evaluate the performance of this type of application depending on the computational infrastructure available.
The forecasting task presents one of the main challenges in the analysis of AMI data. In this case study, the implemented algorithm presented significantly low error indicators, even in the case of individual meters. The specific algorithms implemented in this case study are not the only option. Nevertheless, the results show the potential of combining an autoregressive model (NARX) with an ensemble of learners (random forest) and libraries specialized in time series data (fireTS) as part of the data life cycle for this type of application.
For future work, we propose a more sophisticated development of the stages of visualization and access. As mentioned in
Section 2, a more ambitious implementation of a data analytics application for a utility may require result visualization for people with different roles within the company. Thus, visualization and access must explicitly support the work of each role. Some works from
Section 2 mention several approaches to these issues [
59,
61].
Furthermore, the inclusion of sentiment analysis might be useful, that is, the correlation of electricity consumption with events that trend on social networks: high-impact international news, global events such as the Olympic Games or the FIFA World Cup, and others.