
Realistic and Very Fast Simulation of Individual Electricity Consumptions

Alexis Bondu and Asma Dachraoui

Abstract— The incoming smart grid represents a significant change for the European utilities in terms of the volume of data to be processed. In France, one year of individual consumptions represents more than 600 billion data points. Since real data is not yet available at this scale, our objective consists in simulating realistic individual consumptions. A new generative model of time series is proposed, which combines the MODL coclustering approach with Markov chains. This approach is evaluated on a real dataset provided by the Irish CER. Our experiments demonstrate the ability of the generative model to efficiently reproduce the dynamics of the original time series.

I. INTRODUCTION

The French electrical grid is about to be modernized by exploiting information and communication technologies. The emerging "smart grid" has multiple objectives: i) the grid control and the quality of the electricity supply have to be optimized, despite the fact that power stations are highly distributed; ii) the electricity production has to be scheduled by taking into account the uncertainty related to renewable energy (e.g. wind, sun exposure); iii) the electrical demand needs to be coordinated in order to flatten the consumption peaks and to limit their environmental impact.

The "smart meters" are the first step in the deployment of the smart grid. These new digital meters are expected to be installed in all French households within a few years. Smart meters are able to record individual power consumptions in real time and to send this information to a data center through a communication network. Not all technical choices have been finalized yet, but we can reasonably assume that the recorded time series will be sampled every 30 minutes for the 35 million smart meters.

The incoming smart meters represent a significant change for the European utilities, even if the amount of processed data is not as large as in dot-com companies (e.g. search engines, e-commerce, social networks). One year of individual consumptions represents more than 600 billion recorded data points, which corresponds to 4.4 terabytes of raw data1. This volume could increase to 30 terabytes once stored in a database. EDF2 needs to anticipate how to manage such a volume of time series in terms of storage, querying and analysis.

Alexis Bondu is with EDF R&D, 1 avenue du Général de Gaulle, 92140 Clamart, France. http://alexisbondu.free.fr
Asma Dachraoui is with EDF R&D, 1 avenue du Général de Gaulle, 92140 Clamart, France.
1 We consider that each data point is coded on 8 bytes.
2 "Electricité de France" is the main French provider of electricity.


Currently, only a small subset of individual consumptions is available, through experiments which consist in deploying smart meters in small geographic areas. That is the reason why EDF needs to simulate individual consumptions, in order to test several types of distributed databases on a large scale. Previous works have been carried out on the processing of massive electrical time series using the Hadoop framework [1]. The storage and querying aspects have already been investigated. Our next goal consists in simulating more realistic individual consumptions in order to test "Data Mining" and "Machine Learning" algorithms in advance.

The individual consumptions constitute a distinctive kind of data: i) this is a very large set of time series which are progressively recorded; ii) the diversity of individual behaviors induces a wide variety of shapes; iii) the volatility of these time series is very high; iv) the sum of these time series is a smooth time series with cyclical patterns. For instance, the upper time series of Figure 1 represents the sum of individual consumptions in France for one week, and the lower time series gives an example of an individual consumption over the same period of time.

Fig. 1. Example of an individual consumption for one week (lower time series), together with the sum of individual consumptions over the same period of time (upper time series).

Our objective is to automatically build a generative model from a sample of individual consumptions. In an ideal case, this model must reproduce: i) the diversity and the volatility of the individual consumptions; ii) the aggregated consumption of electricity. We also aim to implement an efficient generative model with a very low time complexity. In this article, Section II is dedicated to the related work, with a brief overview of the modeling of individual electricity consumptions. In Section III, we propose a new simulation approach which is based on Datagrid models and Markov chains. In Section IV, an original evaluation protocol is presented, and our approach is evaluated and compared with a baseline approach.

Lastly, perspectives and future works are discussed in Section V.

II. RELATED WORK

The most pragmatic approach consists in duplicating the training set in order to simulate an arbitrarily large number of individual consumptions. This simple idea does not meet our objective, which is to reliably test "Data Mining" and "Machine Learning" algorithms on a very large simulated dataset. For instance, a similarity-based algorithm (e.g. KNN, K-means) would be biased by the fact that duplicates exist in the handled simulated dataset. EDF also needs to query large databases of time series by using similarity measures [2]; in this case, the same type of bias occurs.

In the literature, the simulation of individual consumptions is closely associated with forecasting methods: the electricity consumption can be modeled as a function of several explicative variables (e.g. temperature, calendar information) [3]. In such approaches, the simulation process generates the values of the explicative variables, which are the inputs of a forecasting model. I. Richardson [4] proposes to model the individual consumptions depending on the set of appliances within each household, the consumption pattern of each appliance, the number of occupants, and the occupancy behavior. In practice, it is difficult to simulate these inputs in a realistic way. The main challenge is to build a generative model which takes into account the correlations between explicative variables. More formally, an estimate of the joint density of all explicative variables is required. Furthermore, this approach suffers from a large number of user parameters which need to be adjusted.

End-use models [5] constitute a "bottom up" approach which models the consumption by distinguishing three nested levels of electrical usage: i) the customer classes (e.g. residential customers); ii) the end-use classes (e.g. heating water); iii) the appliance categories (e.g. solar hot water storage tank). This kind of model can be used to simulate individual consumptions. J. Paatero et al. [6] exploit public data and public statistics in order to instantiate a simplified end-use model. In this case, the need for detailed data is bypassed by using a representative data sample and statistical averages. This approach has the advantage of being easily interpretable, but the simulation of each electrical usage is learned from average behaviors. A stochastic noise is added in order to emulate the volatility of the individual consumptions. The main limitation of this approach is that the original shapes of the time series are not faithfully reproduced.

Individual consumptions can be considered as a sum of appliances that switch on and off over time, according to a stochastic process. Markov chains are an efficient way of modeling such a stochastic process, by defining the possible states of the observed system and by estimating the transition probabilities between states. This kind of model has already been exploited to simulate individual electricity consumptions. O. Ardakanian [7] proposes a simulation approach able to optimize the sizing of transformers in the distribution grid.

First of all, 4 classes of households are manually defined by considering the house size and the nature of the heating and cooling systems. The day is divided into 3 time intervals: on-peak (7am-11am and 5pm-9pm), mid-peak (11am-5pm), and off-peak periods (12am-7am and 9pm-12am). Then, Markov chains are learned within each time interval and each class, the states being set by using a clustering algorithm. O. Ardakanian highlights the necessity of discretizing time and partitioning meters to efficiently exploit Markov chains. The main difficulty is to automatically optimize these modeling choices given the data.

We consider that an ideal simulation approach should satisfy the following conditions:
• not involve user parameters;
• not require explicative variables for each household3;
• automatically catch the diversity of individual behaviors;
• reproduce the shapes of real individual consumptions;
• match the sum of real individual consumptions;
• be statistically reliable and avoid overfitting;
• be efficient in terms of time complexity.

III. A NEW APPROACH COMBINING DATAGRID MODELS AND MARKOV CHAINS

In this article, we propose a new simulation approach which combines coclustering models and Markov chains. Intuitively, the individual consumptions can be viewed as random processes which are partitioned in order to gather customers with similar behaviors. The MODL coclustering approach is used as a non-parametric density estimator, and Markov chains are exploited to learn the time dependencies from the observed time series. Section III-A presents the MODL coclustering approach on which Datagrid models are based. Section III-B highlights the weakness of the MODL approach as a generative model of time series, and shows that Markov chains can be exploited to fill this gap. Lastly, Section III-C is dedicated to the computational performance of the proposed approach.

A. MODL: clustering approach for time series

Our choice to start with MODL is motivated by the following properties:



• MODL is theoretically grounded and exploits an objective Bayesian approach [8] which turns the discretization problem into a task of model selection. The Bayes formula is applied by using a hierarchical and uniform prior distribution, and leads to an analytical criterion which represents the probability of a model given the data. This criterion is then optimized in order to find the most probable model given the data. The number of intervals and their bounds are automatically chosen.
• The best discretization model estimates the joint density of the explicative variables by a multidimensional data grid, which can be interpreted as a nonparametric piecewise constant estimator.
• MODL is a nonparametric approach according to C. Robert [8]: the number of modeling parameters increases continuously with the number of training examples. Any joint distribution can be estimated, provided that enough examples are available.

3 In this case, the same weather conditions as the training period are reproduced.

Notations for time series: the input dataset D is a collection of N time series denoted by Si (with i ∈ [1, N]). In our application area, N corresponds to the number of customers. Each time series consists of mi data points, which are couples of values X and timestamps T. The total number of data points is m = \sum_{i=1}^{N} m_i.

1) Data Grid Models: The MODL coclustering approach allows one to automatically estimate the joint density of several (numerical or categorical) variables, by using a data grid model [9]. A data grid model consists in partitioning each numerical variable into intervals, and each categorical variable into groups. The cross-product of the univariate partitions constitutes a data grid model, which can be interpreted as a nonparametric piecewise constant estimator of the joint density. A Bayesian approach selects the most probable model given the dataset, within a family of data grid models.

2) Application to time series: Each time series consists of mi data points, which are characterized by two variables: T represents the timestamps and X the values of the data points. The i-th time series is denoted by Si = (t_{ij}, x_{ij})_{j=1}^{mi}. The MODL approach handles a re-encoded dataset P which contains the m data points of D in a tabular format. Each data point is characterized by three variables: C represents the "id" of the original time series, T and X. As illustrated in Figure 5, the coclustering approach is applied to estimate the joint density P(C, T, X) by a trivariate coclustering model. As P(T, X|C) = P(C, T, X)/P(C), this model can also be interpreted as an estimator of the joint density between T and X, which is constant within each cluster of time series. In other words, this clustering approach gathers time series with respect to the joint density of their data points. This property is particularly interesting when the volatility of the handled time series is high.
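As a purely illustrative sketch (in Python, with names of our own choosing), the re-encoding of a collection of time series into the tabular dataset P handled by the coclustering can be written as follows; the MODL optimizer itself is not shown.

import numpy as np

def to_tabular(series_list):
    # Re-encode a collection of time series into the tabular dataset P: one row
    # (C, T, X) per data point, where C is the series id, T the timestamp and X
    # the value. series_list[i] is a list of (t, x) couples for the i-th series.
    rows = [(i, t, x) for i, series in enumerate(series_list)
            for (t, x) in series]
    return np.array(rows, dtype=float)

# Example: two short series re-encoded into six (C, T, X) rows.
P = to_tabular([[(0.0, 1.2), (0.5, 0.8), (1.0, 0.9)],
                [(0.0, 0.1), (0.5, 0.2), (1.0, 0.3)]])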

Fig. 2. Illustration of a trivariate coclustering model applied on time series.

The key idea is that a trivariate coclustering can be exploited as a generative model of data points. The joint density P(C, T, X) is considered as uniform within each cell. The simulation of data points can be carried out by drawing a cell according to the estimated P(C, T, X), and then by uniformly drawing values of C, T and X within the previously selected cell. However, the simulation of realistic time series is more complex, because the data points of a given time series are not independent over time. As explained in Section III-B, the main weakness of the MODL approach is the lack of modeling of temporal correlations.

3) Formalism of the MODL approach: This paragraph briefly presents the formalism4 of the MODL approach; more details can be found in [10]. A clustering model (denoted by M ∈ M) is defined by:
• a number of clusters of time series;
• a number of intervals for each data point dimension (time and values);
• the partition of the time series into the clusters;
• the distribution of the data points on the cells of the data grid;
• for each cluster, the distribution of the data points on the time series belonging to the same cluster.

Notations for trivariate coclustering:
• kC : number of clusters of time series;
• kT, kX : numbers of intervals of time and of values;
• k = kC kT kX : number of cells of the data grid;
• kC(i) : index of the cluster that contains the series Si;
• {niC} : number of time series within each cluster iC;
• {mi} : number of data points of each time series Si;
• {miC} : number of data points within each cluster iC;
• {mjT} : number of data points in the time interval jT;
• {mjX} : number of data points in the value interval jX;
• {miC jT} : number of data points in the time interval jT and within the cluster iC;
• {miC jT jX} : number of data points belonging to each cell (iC, jT, jX).

The MAP (maximum a posteriori) model is the most probable model given the data; it is selected by a Bayesian approach within M, the set of all possible coclustering models. The MAP maximizes the product of the prior distribution P(M) and the likelihood of the data given the model P(D|M). The exploited prior distribution P(M) is derived from the minimum description length principle. The prior for the parameters of a clustering model is chosen hierarchically and uniformly at each level:



• the numbers of clusters kC and of intervals kT, kX are independent from each other, and uniformly distributed between 1 and N for time series, and between 1 and m for the point dimensions;
• for a given number kC of clusters, every partition of the N time series into kC clusters is equiprobable;
• for a given model size (kC, kT, kX), every distribution of the m data points on the k cells is equiprobable;
• within a given cluster of time series, every distribution of the points on the time series belonging to the same cluster is equiprobable;
• for a given interval of T [resp. X], every distribution of the ranks of the T [resp. X] values of the points is equiprobable.

4 This technical part can be skipped without compromising the understanding of the main ideas of this article.

Taking the negative logarithm of P(M)P(D|M), a clustering model M is Bayes optimal if the value of the following criterion is minimal:

c(M) = \log N + 2\log m + \log B(N, k_C)
     + \sum_{i_C=1}^{k_C} \log \binom{m_{i_C} + n_{i_C} - 1}{n_{i_C} - 1} + \log \binom{m + k - 1}{k - 1}
     + \log m! - \sum_{i_C=1}^{k_C} \sum_{j_T=1}^{k_T} \sum_{j_X=1}^{k_X} \log m_{i_C j_T j_X}!
     + \sum_{i_C=1}^{k_C} \log m_{i_C}! - \sum_{i=1}^{N} \log m_i! + \sum_{j_T=1}^{k_T} \log m_{j_T}! + \sum_{j_X=1}^{k_X} \log m_{j_X}!     (1)

The first two lines of Equation 1 correspond to the prior distribution P(M). The terms log N and 2 log m relate to the prior distribution of the numbers of clusters and intervals: the probability of observing a particular value of kC [resp. kT and kX] is supposed to be 1/N [resp. 1/m]. The probability of a given partition of the time series into clusters is given by 1/B(N, kC), where B(N, kC) is the number of possible ways of partitioning a set of N elements into kC subsets, possibly empty5. The prior probability of observing a given multinomial distribution of the m data points on the k cells of the data grid is given by 1/\binom{m+k-1}{k-1}. Lastly, the prior term 1/\binom{m_{i_C}+n_{i_C}-1}{n_{i_C}-1} represents the probability of observing a particular multinomial distribution of the data points of the cluster iC on the time series belonging to the same cluster. The hierarchical prior distribution penalizes complex models which include a large number of groups and intervals.

The last two lines of Equation 1 correspond to the likelihood of the data P(D|M). The third line stands for the likelihood of the distribution of the data points on the cells. The last line corresponds to the likelihood of the distribution of the points of each cluster on the time series of the cluster, followed by the likelihood of the distribution of the ranks of the T values [resp. X values] in each interval. The likelihood favors complex models with a good fit of the distribution of the data.

The MODL approach is intrinsically regularized: a tradeoff is naturally reached between the complexity of the models and their generalization ability. An efficient algorithm based on a greedy heuristic and a neighborhood exploration has been implemented to optimize the evaluation criterion [9]. This algorithm is super-linear6 and finds a good coclustering model within a O(m \sqrt{m} \log m) time complexity.

5 More precisely, B(N, k) = \sum_{i=1}^{k} S(N, i), where S(N, i) is the Stirling number of the second kind [11].
6 A super-linear time complexity is less than O(m^2).
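For concreteness, the criterion of Equation 1 can be evaluated from the grid counts as in the following sketch (our own Python illustration; it only scores a given grid and does not implement the MODL optimizer, and all names are ours).

import math
import numpy as np

def log_factorial(n):
    return math.lgamma(n + 1)

def log_binom(a, b):
    return log_factorial(a) - log_factorial(b) - log_factorial(a - b)

def log_bell_bounded(n, k):
    # log B(n, k), with B(n, k) = sum_{i=1..k} S(n, i) and S the Stirling numbers
    # of the second kind, computed in log-space with the usual recurrence.
    log_s = np.full((n + 1, k + 1), -np.inf)
    log_s[0, 0] = 0.0
    for row in range(1, n + 1):
        for i in range(1, min(row, k) + 1):
            log_s[row, i] = np.logaddexp(math.log(i) + log_s[row - 1, i],
                                         log_s[row - 1, i - 1])
    return float(np.logaddexp.reduce(log_s[n, 1:k + 1]))

def coclustering_cost(m_cell, m_series, cluster_of_series):
    # Value of the criterion c(M) of Equation 1 for given grid counts (a sketch,
    # assuming the counts are consistent and every cluster is non-empty).
    # m_cell[iC, jT, jX]   : number of data points in each cell of the data grid
    # m_series[i]          : number of data points of each original time series
    # cluster_of_series[i] : index of the cluster containing the i-th time series
    k_c, k_t, k_x = m_cell.shape
    k, n_series, m = k_c * k_t * k_x, len(m_series), int(m_cell.sum())
    m_ic = m_cell.sum(axis=(1, 2))                          # points per cluster
    m_jt, m_jx = m_cell.sum(axis=(0, 2)), m_cell.sum(axis=(0, 1))
    n_ic = np.bincount(cluster_of_series, minlength=k_c)    # series per cluster
    # prior terms (first two lines of Equation 1)
    cost = math.log(n_series) + 2 * math.log(m) + log_bell_bounded(n_series, k_c)
    cost += sum(log_binom(m_ic[i] + n_ic[i] - 1, n_ic[i] - 1) for i in range(k_c))
    cost += log_binom(m + k - 1, k - 1)
    # likelihood terms (last two lines of Equation 1)
    cost += log_factorial(m) - sum(log_factorial(c) for c in m_cell.flat)
    cost += sum(log_factorial(c) for c in m_ic) - sum(log_factorial(c) for c in m_series)
    cost += sum(log_factorial(c) for c in m_jt) + sum(log_factorial(c) for c in m_jx)
    return cost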

B. Modeling temporal correlations with Markov chains

The MODL approach has several interesting properties for simulating individual consumptions. This is a parameter-free approach which estimates P(C, T, X) while avoiding overfitting. The diversity of individual behaviors is caught by partitioning the original time series with respect to the joint density of their data points. But the main weakness of the MODL approach is the lack of modeling of temporal correlations within time series.

1) Baseline approaches: Algorithm 1 describes how a coclustering model can be exploited to simulate time series. This baseline approach is referred to as "Baseline-flat" in the rest of this article. The simulation of S̃i begins by drawing a cluster ID iC from the density P(C) (step B). The probability of each cluster is estimated by p(iC) ≈ miC / m. For each time interval jT, an interval of values jX is drawn from the density P(X | C ∈ iC, T ∈ jT) (step C). The probabilities of the intervals of values are estimated by p(jX | iC, jT) ≈ miC jT jX / miC jT. For each time step, a value x is uniformly drawn from the selected interval of values jX. Lastly, x is added to the currently simulated time series S̃i as a data point.

/*Generative model based on a trivariate coclustering*/
For each simulated time series S̃i do
    (A) S̃i = ∅  /*Initialisation*/
    (B) draw a cluster ID iC from P(C)
    For each time interval jT do
        (C) draw an interval of values jX from P(X | C ∈ iC, T ∈ jT)
        For each time step within jT do
            (D) uniformly draw a value x within jX
            (E) S̃i ← Concat(S̃i, x)  /*Update*/
        end For
    end For
end For

Algorithm 1: Baseline-flat simulation approach.
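A minimal Python sketch of the Baseline-flat generator, assuming the estimated densities are given as arrays (all names and the exact data layout are ours):

import numpy as np

rng = np.random.default_rng(0)

def simulate_baseline_flat(p_c, p_x_given_ct, value_bounds, steps_per_interval):
    # Baseline-flat generator: one value interval is drawn per time interval.
    # p_c[iC]                  : estimated P(C), i.e. m_iC / m
    # p_x_given_ct[iC, jT, jX] : estimated P(X | C = iC, T in jT)
    # value_bounds[jX]         : (lower, upper) bounds of the value interval jX
    # steps_per_interval[jT]   : number of 30-minute steps in the time interval jT
    k_c, k_t, k_x = p_x_given_ct.shape
    i_c = rng.choice(k_c, p=p_c)                            # step (B): draw a cluster ID
    series = []
    for j_t in range(k_t):
        j_x = rng.choice(k_x, p=p_x_given_ct[i_c, j_t])     # step (C): one interval per time interval
        low, up = value_bounds[j_x]
        for _ in range(steps_per_interval[j_t]):
            series.append(rng.uniform(low, up))             # step (D): uniform draw within the interval
    return np.array(series)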

Fig. 3. Examples of simulated consumptions by drawing a new interval of values at each time interval (upper time series), or at each time step (lower time series), over five days.

As shown in Figure 3, the generated time series have too low a volatility compared with the original individual consumptions. The drawing of a new interval of values at each time interval is an arbitrary choice (step C of Algorithm

1). Another possible choice is to draw a new interval of values at each time step. This second baseline approach is referred to as "Baseline-noisy" in the rest of this article. In this case, the volatility of the generated time series is too high.


Fig. 4. Sum of synthetic time series simulated by a coclustering model (black curve) compared with the sum of the original time series (gray curve) over three days.

In both cases, the sum of the simulated time series is a piecewise constant curve (see Figure 4). Each piece of the curve corresponds to the expectation of the data points over all clusters during a particular time interval. As a preliminary result, the two baseline approaches seem to be unable: i) to reproduce the shapes and the volatility of the original individual consumptions; ii) to match the sum of the original individual consumptions. This intuitive result is confirmed in Section IV by an objective evaluation.

2) Proposed approach: In this article, we propose a new generative model of time series which associates the MODL approach with Markov chains. This approach is referred to as "MODL-Markov" in the rest of this article. A trivariate coclustering is first exploited to partition the original time series and discretize the values of X. Then, Markov chains are associated with each cluster in order to learn the dynamics of the time series over the intervals of values.

Fig. 6. Example of a Markov chain associated with the cluster iC.

Markov chains are often described as directed graphs, where the edges are weighted by the probabilities of going from one state to another. Figure 6 gives an example of a Markov chain associated with the cluster iC. The same three states are repeated over time, and the matrices P_1^{iC}, P_2^{iC} and P_3^{iC} contain the transition probabilities at each time step. We assume that the transitions between states follow a non-stationary stochastic process. That is the reason why we choose to exploit acyclic Markov chains.

/*Mixing coclustering model and Markov chains*/
For each simulated time series S̃i do
    (A) S̃i = ∅  /* Initialisation */
    (B) draw a cluster ID iC from P(C)
    /* ——– First data point ——– */
    (C) draw an interval of values jX from P(X | C ∈ iC, T = 1)
    (D) uniformly draw a value x within jX
    (E) S̃i ← Concat(S̃i, x)  /* Update */
    /* ——– Other data points ——– */
    For each time step t do
        (F) draw the next jX from the transition matrix P_t^{iC}
        (G) uniformly draw a value x within jX
        (H) S̃i ← Concat(S̃i, x)  /* Update */
    end For
end For

Fig. 5. Exploiting Markov chains within a Datagrid model in order to model temporal correlations.

As shown in Figure 5, the discretization of the values into kX intervals is exploited to define the states of the Markov chains, denoted by {k} with k ∈ [1, kX]. Any time series Si can be recoded as a sequence of states {k_t^i}, where the t-th data point belongs to the k-th interval of values. For each cluster iC and for each time step t, a transition matrix P_t^{iC} is estimated. These matrices contain the conditional probabilities P(k_{t+1} | k_t), given a particular cluster of time series. According to the Markov assumption, the probability of observing a particular state depends only on the previous state: P(k_{t+1} | k_t, k_{t-1} ... k_1) = P(k_{t+1} | k_t). In practice, the conditional probabilities are estimated by counting transitions within the dataset D. The MODL approach is regularized, which allows us to reliably estimate the transition probabilities to some extent (this point could be analytically studied in future works).
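The recoding into states and the counting of transitions can be sketched as follows (a Python illustration under our own naming and data-layout assumptions):

import numpy as np

def recode_into_states(series_values, value_breaks):
    # Recode one time series into its sequence of states {k_t}: the index of the
    # value interval containing each data point. `value_breaks` holds the kX - 1
    # inner bounds of the value intervals found by the coclustering.
    return np.digitize(series_values, value_breaks)

def estimate_transition_matrices(state_sequences, clusters, k_x):
    # Estimate one transition matrix per cluster and per time step by counting
    # transitions in the training set.
    # state_sequences[i][t] : state of the t-th point of series i (0 .. k_x - 1)
    # clusters[i]           : cluster ID of series i
    # Returns P[iC, t, a, b] ~ P(k_{t+1} = b | k_t = a, C = iC).
    k_c = int(max(clusters)) + 1
    length = len(state_sequences[0])
    counts = np.zeros((k_c, length - 1, k_x, k_x))
    for seq, i_c in zip(state_sequences, clusters):
        for t in range(length - 1):
            counts[i_c, t, seq[t], seq[t + 1]] += 1
    totals = counts.sum(axis=3, keepdims=True)
    # rows with no observed transition fall back to a uniform distribution
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / k_x)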

Algorithm 2: MODL-Markov simulation approach.

Algorithm 2 describes how the generative model is exploited to simulate time series. The simulation of S̃i begins with the drawing of a cluster ID iC from the density P(C) (step B). The first data point is simulated by: i) drawing an interval of values jX from the density P(X | C ∈ iC, T = 1) (step C); ii) uniformly drawing a value x within the interval jX (step D). The other data points are simulated by drawing, at each time step t, the next interval of values jX from the transition matrix P_t^{iC} (step F). As previously, a value x is uniformly drawn within the selected interval of values (step G) and x is added to S̃i as a simulated data point.

Figure 7 plots two examples of simulated and real individual consumptions belonging to the same cluster. A closer look at the simulated time series shows that our approach is able to reproduce the shapes of the original individual

consumptions. We propose an evaluation protocol in Section IV which aims at objectively assessing the quality of the simulation through a classifier.
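A compact Python sketch of the simulation loop of Algorithm 2, reusing the transition matrices estimated above (all names are ours):

import numpy as np

rng = np.random.default_rng(0)

def simulate_modl_markov(p_c, p_x_first, transitions, value_bounds, length):
    # MODL-Markov generator (sketch of Algorithm 2).
    # p_c[iC]            : estimated P(C)
    # p_x_first[iC, jX]  : estimated P(X | C = iC, T = 1), used for the first point
    # transitions[iC][t] : kX x kX transition matrix P_t^{iC}
    # value_bounds[jX]   : (lower, upper) bounds of the value interval jX
    # length             : number of data points to simulate
    k_c, k_x = p_x_first.shape
    i_c = rng.choice(k_c, p=p_c)                            # step (B): draw a cluster ID
    j_x = rng.choice(k_x, p=p_x_first[i_c])                 # step (C): first interval of values
    series = [rng.uniform(*value_bounds[j_x])]              # steps (D)-(E)
    for t in range(length - 1):
        j_x = rng.choice(k_x, p=transitions[i_c][t][j_x])   # step (F): next state
        series.append(rng.uniform(*value_bounds[j_x]))      # steps (G)-(H)
    return np.array(series)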

Fig. 7. MODL-Markov approach: examples of simulated (upper curve) and real (lower curve) individual consumptions over five days, belonging to the same cluster.

As illustrated in Figure 8, the sum of the simulated time series is close to the sum of the original time series. This means that our approach is able to simulate realistic individual consumptions which match the national electricity demand. This point is objectively evaluated in Section IV by using an appropriate criterion.

Fig. 8. MODL-Markov approach: sum of simulated time series (black curve) compared with the sum of the original time series (gray curve) over three days.

C. Computational performance

First, the generative model needs to be learned from the input dataset D. The learning stage involves the optimization of the trivariate coclustering model within a O(m \sqrt{m} \log m) time complexity (see Section III-A.3). Then, Markov chains are associated with each cluster and the transition matrices are estimated by counting data points within a O(m) time complexity. In the end, the overall time complexity of the learning stage is super-linear, which means our approach scales reasonably with the size of D. Nevertheless, the space complexity7 of our approach is significant. Let m*_i be the length of the longest time series; the transition matrices are stored in memory within a O(kC . m*_i . kX^2) space complexity. In practice, this issue can easily be fixed by splitting the time series into periods of equal length and by implementing a generative model for each period.

From an algorithmic point of view, the simulation of a single data point requires very few elementary operations. If the data point is the first one of a simulated time series, three random drawings are required to: i) choose a cluster of time series; ii) choose an interval of values; iii) generate a value within the interval. For the other data points, only two random drawings are required to: i) choose the next interval of values from the transition matrices; ii) generate a value within the interval. In both cases, the simulation of a single data point is processed within a O(1) time complexity.

We have implemented a fast simulator in Java by using the Colt8 library, an open-source project providing high-performance random engines. During our tests, data points are written to RAM in order to avoid I/O slow-downs, and a single thread powered by a 2.0 GHz Xeon CPU is used. Under these conditions, the simulation rate reaches 560,000 data points per second.

Our simulator is implemented on a parallel architecture: during the simulation, a given core9 can independently process a particular subset of smart meters and/or a particular time period. Theoretically, one year of individual consumptions composed of 35 million time series sampled every 30 minutes can be simulated in less than 24 hours by using 13 cores. In practice, one year of consumptions is simulated by a collection of 52 weekly models, due to the important space complexity. Weekly models succeed one another during the simulation, which decreases the simulation rate to 360,000 data points per second. One year of individual consumptions is effectively simulated in less than 24 hours by using 20 cores. For comparison purposes, the alternative approach described in [3] reaches 140,000 simulated data points per second under similar experimental conditions.
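As a rough order-of-magnitude check of these figures (our own arithmetic, assuming half-hourly sampling):

35 × 10^6 meters × 48 points/day × 365 days ≈ 6.13 × 10^11 data points per year
6.13 × 10^11 / (560,000 points/s × 86,400 s/day) ≈ 12.7 core-days, i.e. about 13 cores for a 24-hour run
6.13 × 10^11 / (360,000 points/s × 86,400 s/day) ≈ 19.7 core-days, i.e. about 20 cores for a 24-hour run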

7 The space complexity corresponds to the amount of RAM used.
8 The Colt library is available at http://acs.lbl.gov/software/colt/.

IV. EVALUATION

In this section, several generative models are learned from a single public dataset. These models are evaluated and compared, both at the aggregated level and at the level of individual consumptions. In the literature, most approaches use statistical tests in order to evaluate the similarity between the distributions of the simulated and real time series. For instance, in [12] a Mann-Whitney U test is exploited. In [6], different statistics on the simulated and real time series are compared to demonstrate the realism of the simulation. Our evaluation protocol is more challenging since it is based on a classifier which aims at discriminating real and simulated time series.

A. Experimental protocols

A real dataset provided by the Irish CER (Commission for Energy Regulation) [13] is exploited to learn several generative models. This dataset contains the individual consumptions of 4600 households for 500 days, sampled every 30 minutes. We chose to process only the first week (between Wednesday 07/15/2009 and 07/22/2009) in order to limit the computation time of the learning stage. In our experiments, three generative models are learned by:

• the Baseline-flat approach (see Algorithm 1);
• the Baseline-noisy approach (see Section III-B.1);
• the MODL-Markov approach (see Algorithm 2).

The generative models are first evaluated with respect to the aggregated simulated consumption. The sum of the 4600 simulated individual consumptions is compared with the sum of the real individual consumptions over the same time period. The Mean Absolute Percentage Error (MAPE) is retained as the evaluation criterion.

9 A multi-core CPU is composed of several independent processing units called "cores".
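For reference, this criterion can be computed as in the following sketch (our own Python illustration; names are ours):

import numpy as np

def mape(real_aggregate, simulated_aggregate):
    # Mean Absolute Percentage Error between the real and simulated aggregated loads,
    # given as two arrays with one value per half-hourly time step.
    real = np.asarray(real_aggregate, dtype=float)
    sim = np.asarray(simulated_aggregate, dtype=float)
    return 100.0 * np.mean(np.abs(sim - real) / real)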

The realism of the simulated individual consumptions is also evaluated by using a classifier. Our evaluation protocol is inspired by previous works which combine supervised and unsupervised approaches in order to evaluate the quality of the unsupervised task. For instance, the cascade evaluation [14] consists in enriching a supervised dataset with the cluster ID of each example. Then, the cluster ID is exploited by a classifier as an additional explicative variable. The cascade evaluation estimates the quality of the unsupervised task by measuring the improvement of the classifier when the cluster ID is used. Another example is the use of a classifier to detect changes in the distribution of a data stream [15]. In this approach, two temporal windows are defined, which are respectively dedicated to the "current" and to the "normal" behaviors of the observed system. Changes are quantified by the ability of the classifier to discriminate both classes.

We propose to learn a classifier which aims at discriminating "real" and "simulated" time series. The intuition is that the inability of the classifier to discriminate both classes reflects the realism of the simulation. In our experiments, the handled dataset is composed of 9200 time series described by 336 numerical explicative variables. The explicative variables correspond to the values of the data points over 7 days. The 4600 simulated time series are labeled with the class "0" and the 4600 real time series are labeled with the class "1". Two kinds of classifiers are exploited:

1) Naive Bayes classifier provided with a 10-bin equal-frequency discretization: a naive Bayes classifier estimates the conditional distribution of the classes P(c|v1 ... vi) by applying the Bayes rule, where the class value is denoted by c and the explicative variables are denoted by vi. The term P(v1 ... vi|c) needs to be estimated since the vi are numerical variables. In this case, the vi are discretized into 10 equal-frequency intervals in order to estimate P(v1 ... vi|c).

2) Naive Bayes classifier provided with the supervised MODL discretization: the supervised MODL approach [16] aims at finding the best discretization of the variables vi in order to estimate the conditional distribution of the classes P(c|v1 ... vi). As presented in Section III-A, the MODL approach leads to an analytical evaluation criterion which represents -log(P(M|D)). This approach is intrinsically regularized: a tradeoff is naturally reached between the complexity of the discretization models and their generalization ability. Numerous comparative experiments indicate that a naive Bayes classifier provided with the MODL discretization gives good results in practice and avoids over-fitting [16].

In all cases, the classifiers are evaluated by using the AUC (Area Under the ROC Curve) [17]. Recall that a perfect classifier reaches an AUC equal to 1, and a random classifier reaches an AUC equal to 0.5. Our objective is to approach an AUC equal to 0.5, which reflects a good quality of the simulation.
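A rough approximation of this protocol with scikit-learn (our own sketch: an ordinal equal-frequency discretization followed by a categorical naive Bayes, not the exact MODL-discretized classifier used in this article; requires a recent scikit-learn):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def discrimination_auc(real, simulated, seed=0):
    # Train a classifier to separate real from simulated series and return its AUC.
    # real, simulated : arrays of shape (n_series, 336), one column per half-hourly value.
    # An AUC close to 0.5 means the simulation is hard to distinguish from reality.
    X = np.vstack([real, simulated])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(simulated))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile").fit(X_tr)
    clf = CategoricalNB(min_categories=10).fit(disc.transform(X_tr).astype(int), y_tr)
    scores = clf.predict_proba(disc.transform(X_te).astype(int))[:, 1]
    return roc_auc_score(y_te, scores)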

B. Results

1) Coclustering model: The evaluated generative models share the same trivariate coclustering model, which is composed of: i) 160 groups of meters; ii) 26 intervals of values; iii) 22 intervals of time. The relatively large number of groups indicates that this coclustering model captures a wide variety of behaviors within the original dataset. Each group of meters can be represented by a bivariate datagrid which estimates P(T, X|C). Figure 9 plots four examples of groups. Group A gathers very common individual consumptions. Daily patterns can be clearly identified: the consumptions during the day and during the night are separated by the time intervals, and most of the values are very low during the night. Group B is an example of an energy-saving behavior: the consumptions are particularly low during working days, and normal during the weekend (right part of the datagrid). Group C gathers high consumptions during working days. Group D is the most atypical one, with permanently very high consumptions. In the end, trivariate coclustering models appear to be a good way of exploring large datasets of time series and catching typical distributions of data points over time.

Fig. 9. Examples of groups of meters represented by bivariate datagrids. The gray level of the cells denotes the probability P (T, X|C).

2) Aggregated and disaggregated evaluation: The comparison between the baseline approaches and the proposed approach aims at evaluating the interest of the Markov chains to make the simulation more realistic. The first line of Table I gives the MAPE reached by each generative model on the aggregated simulated consumption. Both baseline approaches have a similar error, around 11%. The quality of the simulation is significantly improved by the MODL-Markov approach, with a MAPE equal to 1.7%. This result is consistent with Figures 4 and 8, which plot the sums of the time series simulated by the baselines and by the proposed approach. The second line of Table I shows the AUC reached by the naïve Bayes classifier provided with the 10 equal-frequency discretization. In the case of the baseline approaches, real and simulated consumptions are reasonably well discriminated, with an AUC around 0.75. By contrast, such a weak classifier is not able to differentiate real consumptions from the synthetic consumptions simulated by the MODL-Markov approach (AUC = 0.49). This result demonstrates the realism of our approach at the level of individual consumptions. The last line of Table I gives the AUC of the naïve Bayes classifier provided with the MODL discretization. In a general way, this classifier is more efficient than the previous one.

Criterion            Baseline-noisy   Baseline-flat   MODL-Markov
MAPE                 11.3%            10.7%           1.7%
AUC (10 eq. freq.)   0.78             0.75            0.49
AUC (MODL)           0.88             0.88            0.65

TABLE I
EVALUATION OF THREE GENERATIVE MODELS

In the particular case of the MODL-Markov approach, this classifier reaches an AUC equal to 0.65. A deeper inspection of the classifier indicates how the realism of the simulated consumptions could be improved. Figure 10 plots the conditional distribution of the classes depending on the value of X at time step #249 (the most informative variable of the classifier). The proportion of real data points in the interval ]-∞, 0.0005] is 95%, and 5% in the interval ]0.0005, 0.0045]. This result shows that the distribution of real data points is not uniform within the first interval of values of the trivariate coclustering model, namely ]-∞, 0.0045]. In other words, when a real data point has a value of less than 0.0045, this value equals 0 in most cases. The same phenomenon is observed on numerous explicative variables, which explains why the AUC value is greater than 0.5. This issue could be fixed in future works by estimating the density of real data points within the first interval of values.

Fig. 10. Inspection of the naïve Bayes classifier provided with the MODL discretization and evaluating the MODL-Markov simulation approach : plot of the conditional distribution of classes depending on X at time step #249.

In the end, our experiments demonstrate the interest of mixing a trivariate coclustering model with Markov chains in order to produce realistic simulations of individual consumptions. The presented results are encouraging, and a possible improvement of the proposed approach has been identified.

V. CONCLUSION

The incoming smart grid represents a significant change for the European utilities. Our objective consists in simulating realistic individual consumptions in order to test "Data Mining" and "Machine Learning" algorithms in advance. In this article, a new generative model of time series is proposed which combines a trivariate coclustering model with Markov chains. This approach reaches high computational performance, with 560,000 simulated data points per second, by

using a single core (see Section III-C). An original evaluation protocol is proposed in Section IV, which compares real and simulated time series: i) at the aggregated level; ii) at the level of individual consumptions. This evaluation gives encouraging results and highlights the interest of the Markov chains to reproduce the dynamics of the original time series. In future works, our approach could be refined by estimating the distribution of data points within the first interval of values. The proposed experimental protocol could be extended to consider other classifiers, feature spaces and KPIs (Key Performance Indicators). Then, extensive experiments could be carried out to evaluate the other approaches from the literature. Our approach could also be associated with a forecasting model learned at the aggregated level, in order to drive the simulation by explicative variables such as calendar information or weather conditions.

REFERENCES

[1] L. Daio Pires Dos Santos, M. L. Picard, A. Gomez da Silva, D. Worms, B. Jacquin, and C. Bernard, "Massive Smart Meter Data Storage and Processing on top of Hadoop," in International Workshop on End-to-end Management of Big Data, VLDB (International Conference on Very Large Data Bases), Istanbul, 2012.
[2] A. Berard and G. Hebrail, "Searching time series with Hadoop in an electric power company," in BigMine '13, International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 2013, pp. 15–22.
[3] P. Pompey, A. Bondu, Y. Goude, and M. Sinn, "Massive-Scale Simulation of Electrical Load in Smart Grids using Generalized Additive Models," to appear in Lecture Notes in Statistics: Modeling and Stochastic Learning for Forecasting in High Dimension, 2014.
[4] I. Richardson, M. Thomson, D. Infield, and C. Clifford, "Domestic electricity use: A high-resolution energy demand model," Energy and Buildings, 2010.
[5] H. Willis, Spatial Electric Load Forecasting, ser. Power Engineering Series. Marcel Dekker, 2002.
[6] J. Paatero and P. Lund, "A model for generating household electricity load profiles," International Journal of Energy Research, vol. 30, no. 5, pp. 273–290, 2006.
[7] O. Ardakanian, S. Keshav, and C. Rosenberg, "Markovian models for home electricity consumption," in GreenNets '11, 2nd ACM SIGCOMM Workshop on Green Networking, 2010, pp. 31–36.
[8] C. Robert, The Bayesian Choice: From Decision Theoretic Foundations to Computational Implementation. Springer, 2007.
[9] M. Boullé, Hands-on Pattern Recognition. Microtome, 2010, ch. Data grid models for preparation and modeling in supervised learning.
[10] M. Boullé, "Functional data clustering via piecewise constant nonparametric density estimation," Pattern Recognition, vol. 45, no. 12, pp. 4389–4401, 2012.
[11] M. Abramowitz and I. Stegun, Handbook of Mathematical Functions. Dover, New York, 1965.
[12] M. Muratori, M. C. Roberts, R. Sioshansi, V. Marano, and G. Rizzoni, "A highly resolved modeling technique to simulate residential power demand," Applied Energy, vol. 107, no. C, pp. 465–473, 2013.
[13] "Electricity smart metering customer behavior trials findings report," Commission for Energy Regulation, Dublin, Tech. Rep., 2011.
[14] L. Candillier, I. Tellier, F. Torre, and O. Bousquet, "Cascade evaluation of clustering algorithms," in 17th European Conference on Machine Learning (ECML 2006), ser. LNCS, J. Fürnkranz, T. Scheffer, and M. Spiliopoulou, Eds., vol. LNAI 4212. Berlin, Germany: Springer Verlag, September 2006, pp. 574–581.
[15] A. Bondu and M. Boullé, "A supervised approach for change detection in data streams," in IJCNN (International Joint Conference on Neural Networks). IEEE, 2011, pp. 519–526.
[16] M. Boullé, "MODL: A Bayes optimal discretization method for continuous attributes," Machine Learning, vol. 65, no. 1, pp. 131–165, 2006.
[17] T. Fawcett, "ROC graphs: Notes and practical considerations for data mining researchers," Technical Report HPL-2003-4, HP Labs, 2003.