
ISTANBUL TECHNICAL UNIVERSITY GRADUATE SCHOOL OF SCIENCE ENGINEERING AND TECHNOLOGY

AN ADAPTIVE FORECASTING METHODOLOGY BY UTILIZING CHANGE POINT DETECTION TECHNIQUE ON TIME SERIES

M.Sc. THESIS

Ali NASER NAEIMI AVVAL

Department of Industrial Engineering Industrial Engineering Programme


Ali Naser Naeimi Avval, an M.Sc. student of the ITU Graduate School of Science, Engineering and Technology (student ID 507161134), successfully defended the thesis "AN ADAPTIVE FORECASTING METHODOLOGY BY UTILIZING CHANGE POINT DETECTION TECHNIQUE ON TIME SERIES", prepared after fulfilling the requirements specified in the associated legislation, before the jury whose signatures are below.

Thesis Advisor : Assist. Prof. Dr. Faruk BEYCA, Istanbul Technical University

Jury Members : Prof. Dr. Nizamettin BAYYURT, Istanbul Technical University

Prof. Dr. Selim ZAIM, Istanbul Sehir University

Date of Submission : 16 April 2020
Date of Defense : 20 May 2020


To all the people who encouraged me to go my own way and always supported me: especially my dear father and mother, who dedicated and sacrificed their lives in order to provide a peaceful, hopeful and happy life for our family; my dear brother, who taught me to be determined through all hardships and guided me throughout my life; and finally the love of my life, my dear wife, whose ambition and energy are the best reason to face every problem in order to succeed.

FOREWORD

First of all, I want to express my deep gratitude to my supervisor, Assist. Prof. Dr. Faruk BEYCA, who thoughtfully guided me by supporting every phase of my academic life. It is a tremendous honor and a great pleasure for me to have worked with him for such a long time. I appreciate the effort that he put into my study with his inspiring advice, valuable comments and guidance to improve the quality of this thesis. I would also like to express my deep appreciation to Istanbul Technical University, and especially the Management Department, for giving me the precious chance to broaden my horizons of knowledge at this university; I spent a very important part of my life here and gained much valuable experience.

May 2020 Ali NASER NAEIMI AVVAL Industrial Engineer

TABLE OF CONTENTS

FOREWORD
TABLE OF CONTENTS
ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1 Purpose of Thesis
1.2 Literature Review
1.2.1 Sequential CPD
1.2.2 Parametric and Nonparametric Methodologies
1.2.3 Supervised CPD Methods
1.2.4 Unsupervised CPD Methods
1.2.5 Dynamic Time Warping (DTW)
1.2.6 Electric Load Forecast
1.3 Hypothesis
2. GRAPH BASED CPD
2.1 Minimum Spanning Tree (MST)
2.2 Change Point Alternative
2.3 Significance Level
2.3.1 Skewness Correction
2.4 DTW Classification And k-means For CPD
3. HOLT-WINTERS METHOD
3.1 Exponential Smoothing
3.1.1 Single Exponential Smoothing
3.1.2 Additive Seasonal Method
3.1.3 Overall, Trend and Seasonal Smoothing
4. FORECASTING
4.1 Forecasting Using HW Method
4.2 Forecasting Using HW Method after Graph Based CPD
4.2.1 Forecast using combination of T1 and exponential smoothing values of T2
4.2.2 Forecast using combination of T2 and exponential smoothing values of T
4.2.3 Forecasting Using HW Method after Dynamic Time Warping CPD
4.2.4 Results of Forecasts
4.3 Forecast Using Autoregressive Integrated Moving Average Model (ARIMA)
5. EVALUATE THE RESULTS USING MEAN ABSOLUTE PERCENT ERROR (MAPE)
6. CONCLUSIONS AND FUTURE WORKS
6.1 Conclusions
REFERENCES
CURRICULUM VITAE

ABBREVIATIONS

CPD : Change Point Detection
MST : Minimum Spanning Tree
MAPE : Mean Absolute Percentage Error
SI : Seasonality Index
NNCA : Nearest Neighbor Classification Algorithm
SBS : Sparsified Binary Segmentation
SDLL : Steepest Drop to Low Levels
EWMA : Exponentially Weighted Moving Average
MEWMA : Multivariate Exponentially Weighted Moving Average
GA : Genetic Algorithm
ML : Maximum Likelihood
DTC : Decision Tree Classification
DTW : Dynamic Time Warping
NB : Naive Bayes
SVM : Support Vector Machine
NN : Nearest Neighbor
CRF : Conditional Random Field
AR : Auto Regression
uLSIF : unconstrained Least-Squares Importance Fitting
RuLSIF : Relative unconstrained Least-Squares Importance Fitting
SWAB : Sliding Window And Bottom-up
MDL : Minimum Description Length
STLF : Short Term Load Forecast
MTLF : Medium Term Load Forecast
LTLF : Long Term Load Forecast

LIST OF TABLES

Table 2.1 : Weeks in third cluster.
Table 4.1 : Forecast values and real consumption values for next 30 days.
Table 4.2 : Forecast values and real consumption values for next 30 days using ARIMA method.
Table 5.1 : MAPE values calculated for (Table 4.1) forecast and real consumption values.
Table 5.2 : MAPE values calculated for (Table 4.2) forecast and real consumption values.
LIST OF FIGURES

Figure 1.1 : CPD supervised methods (Aminikhanghahi and Cook, 2017).
Figure 1.2 : CPD unsupervised methods (Aminikhanghahi and Cook, 2017).
Figure 1.3 : Electric Load Forecasting methods.
Figure 1.4 : Time series methods used for modelling the time series for electric load forecasting.
Figure 2.1 : MST graph for 1096 points.
Figure 2.2 : R_G(t) and Z_G(t) for all the data points.
Figure 2.3 : R_G(t) and Z_G(t) between the points 550 and 650.
Figure 2.4 : R_G(t) and Z_G(t) between the points 850 and 950.
Figure 2.5 : Plot of skewness for Z_G values.
Figure 2.6 : DTW for weekly load consumption values.
Figure 2.7 : Minimum distance path between first and second week.
Figure 2.8 : K-means clustering results for DTW weekly load consumption.
Figure 2.9 : Daily consumption values after deleting change point candidates.
Figure 3.1 : Sequence graph of daily consumption time series.
Figure 4.1 : Holt-Winters for whole data points.
Figure 4.2 : Holt-Winters for two sample edition step B.
Figure 4.3 : Holt-Winters for two sample edition step C.
Figure 4.4 : Holt-Winters for DTW edition D.
Figure 4.5 : ARIMA forecast graph for DTW change point data.
Figure 4.6 : ARIMA forecast graph for daily load consumption values without any change point elimination.
Figure 4.7 : ARIMA forecast graph for graph based CPD data.

AN ADAPTIVE FORECASTING METHODOLOGY BY UTILIZING CHANGE POINT DETECTION TECHNIQUE ON TIME SERIES

SUMMARY

The objective of exponential smoothing forecasting is to use past observations to form a future forecast. To do this, the algorithm weights each past observation so that the most recent observations are magnified in importance compared to older ones. In most samples of past observations there are some abrupt changes lying within the time series data; these are called change points. They have a direct impact on forecast values and cause a portion of the error, called the residual, inside the estimated values. While data analysts monitor these changes with several methods in order to clarify the reasons for such outliers and to protect the operation from further change points, removing these outliers from the training sets of the forecasting algorithms can also improve the quality of the estimated values.

In this research we used the Holt-Winters and ARIMA methods to forecast the next 30 days of electricity consumption according to our data. We changed the process of the Holt-Winters (HW) exponential smoothing forecast: instead of fitting the whole set of data points with HW, we first conducted a graph-based CPD method. This method uses a two-sample test on a minimum spanning tree (MST), which gives a graphical view of the data points and finds two samples of data according to the connections between the points. As another change point detection approach we used the dynamic time warping method to cluster the data, by which we identified 9 outlier points and eliminated them from the data.

Using the outcome of the graph-based method, which searches for a single change point, the whole data set is split into two samples, one before the change point and the other after it; HW is then conducted separately on the two samples. For one of the samples the real data points are used, while the second sample is replaced by its fitted values. The new outcomes are compared with the plain HW outcomes on the real data points using the mean absolute percentage error (MAPE), and we also used the outcome of the dynamic time warping approach and its forecast error to compare it with the graph-based method. These results suggest that the new method lowers the difference between real values and forecasted values; thus it can give more accurate results in comparison with the traditional ARIMA and exponential smoothing methods.


1. INTRODUCTION

Today's world is a world of sudden and permanent changes and also a race against time: no field of technology, science or commerce wants to lose time, and they all want to adapt to these changes within the minimum portion of time. Beside the technical and operational effects of such changes, the related data are saved by computers all around the world, and scientists try to detect these changes at the very first moment of their occurrence. These points are what data monitoring experts call change points, and the process of searching for them is called change point detection (CPD).

Energy has a major role in our daily life, and for all countries it is an important political, economic and environmental issue which affects people's sense of satisfaction. For energy providers it is an intense competition, since some of the wealthiest companies in the world are energy companies, so these companies focus on all aspects of customer behavior in order to detect fluctuations as soon as possible.

Whenever something goes wrong with their market expectations, that period of time is usually not the exact point of change but the effect of a change that happened earlier; this is where we should use CPD methods, find the exact point of change as soon as possible and take proper action to control the elements of the process.

No one knows exactly what will happen tomorrow, yet everyone wants predictions about tomorrow or even the far future; if we can combine the past and the future in one method, it seems to be more practical.

1.1 Purpose of Thesis

Nowadays our planet faces one of its most important challenges involving humans and the nature around us. Modern technology has made us big consumers of energy and natural resources, but harming nature, alongside global warming, is a red alert for us and our planet. There are many reasons behind the situation we are facing, but energy is one of the most important and rudimentary factors of it. We use energy everywhere and for different purposes: transportation, lighting, warming our houses and offices, and production in factories. Among these, the form of energy that takes the largest share and plays an important role in our life is electricity, and this will remain the case unless we focus on standardizing electronic and smart living environments which control energy usage and prevent waste.

Talking about our houses and living environments, electricity is an inseparable kind of energy that we use in our daily life, while governmental and private providers spend more and more to learn the consumption pattern and the subtle reasons which affect our hourly and daily consumption values. This proposal tries to optimize the process of estimating future consumption; it will evaluate whether change point algorithms have any effect on forecast values, and if they do, what their influence is compared to the normal exponential smoothing method. As an alternative method we will try the dynamic time warping method, as a clustering approach, to detect the changes and compare them with the graph-based method, and we will also use another forecasting method, ARIMA, to compare the final results across forecasting methods.

Finally, this study proposes a new change point forecasting model which consists of detecting change points, eliminating them from the data, and preparing the data for electricity load forecasting according to our three years of daily consumption data.

1.2 Literature Review

Change point detection is the search for a sudden change, or changes, in a sample of observations called data points. A common way to deal with this problem is to find shifts happening in the mean or variance of the data in a specific process. This problem is a crucial obstacle that different fields try to solve, including audio analysis (Gillet et al, 2007), EEG segmentation (Lavielle, 2005), health monitoring (Noh et al, 2012; Hu et al, 2007) and environmental problems such as climate change (Verbesselt et al, 2010). New methods and algorithms have been developed by data scientists such as Kawahara and Sugiyama (2009), Lavielle (2005), Liu et al (2010) and Rigaill (2010), and asymptotic theory by Bertrand et al (2011), Tartakovsky et al (2006), Shao and Zhang (2010) and Levy-Leduc (2007), for a number of contexts, such as studies working on independently distributed data points. Existing research on optimal detection of shifts in the mean of temporal data with dependent data points is less common (Keshavarz et al, 2018).

Downey (2008) worked on a new algorithm which can give more precise results for prediction of network performance. This prediction is highly dependent on the stationarity of the time series, and if we consider the sudden changes caused by variations in traffic, in several periods the data could have different intervals of stationary data points. He therefore proposed to use an online change point detection algorithm: in online methods the algorithm always tracks the database, and whenever a new data point is available the calculations start over again in order to find the optimal change point. Downey (2008) proposed an online algorithm with several equations which calculate the probability of a change point occurring for each data point that enters the database. An offline algorithm, on the other hand, waits until all the data are available and computes over the whole data set to recognize the change point spots; for several scientific fields which need to work on past data points without considering any new data point, this can be the more efficient and practical approach.

Looking carefully at change point detection algorithms, we can see two main branches: batch and sequential. In batch algorithms the sample is considered fixed, and the algorithm retrospectively looks for any type of shift throughout the data points, while a sequential algorithm quickly updates itself whenever new observations are added to the sample space; some books call this online CPD.

For batch CPD algorithms there have been several approaches. For example, Auger and Lawrence (1989), Bai and Perron (2003) and Ruggieri et al (2009) proposed dynamic programming algorithms which try to lower the complexity, which is quadratic in the number of observations, and which have a high probability of finding the optimal solution. On the other hand, binary segmentation, another practical method used in many fields (Scott and Knott, 1974), is a greedy method that recursively separates the data points into samples at each identified change point until only homogeneous samples remain. Cho and Fryzlewicz (2015) proposed the Sparsified Binary Segmentation (SBS) algorithm, which aggregates the CUSUM statistics only after thresholding out irrelevant, noisy contributions, which is particularly beneficial in high dimensions. One of the more recent methods was proposed by Fryzlewicz (2014), whose wild binary segmentation algorithm calculates the cumulative sum for all data points and splits them into different groups. Fryzlewicz (2018) proposed a further change-point detection method created to work in both infrequent and frequent change-point problems; it contains two major algorithms: a recursive procedure that produces a complete solution path of candidate change-points, where T is the length of the data, and a model selection procedure called Steepest Drop to Low Levels (SDLL).

1.2.1 Sequential CPD

There are two families of methods for solving the sequential CPD problem: frequentist algorithms and Bayesian algorithms. Among frequentist algorithms there are different methods, such as different usages of the cumulative sum (CUSUM) (Ruggieri and Antonellis, 2016). Both Hidalgo (1995) and Honda (1997) suggested CUSUM tests for change points in the regression function of nonparametric time series regression models with strictly stationary and absolutely regular data; Su and Xiao (2008) extended these tests to strongly mixing and not necessarily stationary processes, allowing for heteroscedasticity, while Su and White (2010) proposed change point tests in partially linear time series models, and Vogt (2015) constructed a kernel-based L2-test for structural change in the regression function of time-varying nonparametric regression models with locally stationary regressors (Mohr and Neumeyer, 2019). The exponentially weighted moving average (EWMA) method utilizes different weights for past observations and includes them, according to their significance, in each future value; the logic behind it is that the most recent observation is the most effective one, while the oldest observation has the lowest effect (Lucas and Saccucci, 1990). Khan et al (2016) used a genetic algorithm (GA) to find an optimal parameter sample for MEWMA in order to identify multiple change points accurately. For this goal, one of the useful methods often used is the likelihood ratio test: as in this thesis, the null hypothesis is that no shift or change happens in the observations, and the alternative is to find a single or multiple change point(s) throughout the observations (see Siegmund and Venkatraman, 1995; Hawkins et al, 2003; Ross, 2013). If any observation passes out of the lower or upper bounds of the control chart, it is taken as confirmation that a single or multiple change point happened.

According to Ruggieri and Antonellis (2016), maximum likelihood (ML) is arguably the most common approach to change-point detection in the mean and/or the variance of normal data. The work of Wang and Zivot (2000) proposes a Bayesian method for the detection of changes in mean, trend and variance of a time series using the Gibbs sampler; as often encountered in multiple change-point detection, their method requires knowledge of the total number of change points in the time series. Jandhyala et al (2013) proposed Bayesian and maximum likelihood methods for offline change point detection. Bayesian change point detection methods have been used in various fields; they give good results when we need to estimate new parameters using a prior distribution of the observations. Here, after each portion of time, the sequence of observations is updated using the posterior and prior information of the data points (Keshavarz and Huang, 2014). Other sequential Bayesian change point algorithms use only an estimate of the posterior distribution for the location of the change points, and the approximations they rely on can have a great impact on their inference; Ruggieri and Antonellis (2016) therefore proposed a new simulation method for the exact posterior distribution of the different change point locations, and also of the parameters of the regression model, without any assumptions about the observations between adjacent change points.

Davis et al (2006) proposed the use of a genetic algorithm instead. The term `genetic' is founded in a link to the Darwinian evolution theory: the best segmentation is recovered by successively combining parent segmentations into (in terms of some specified optimality criterion) better fitting children segmentations, while discarding any combinations that do not improve the fit. Ultimately, by so-called crossover and mutation steps and by discarding segmentations with relatively poor fit, if used with the appropriate specifications the algorithm converges after several generations to a best possible offspring. The concept of genetic algorithms has also been applied by e.g. Battaglia and Protopapas (2011) for the segmentation of a regime-switching process (Hu et al, 2011).


1.2.2 Parametric and Nonparametric Methodologies

According to Aminikhanghahi (2017), when we talk about online algorithms the issue is not just adding new data points and finding new change point(s); another important issue is the computational time and cost, which directly affects the usefulness of the resulting change point candidates: if we cannot use them at the right time, they are not valuable for us. This computational burden is called the scalability of the algorithm. Whenever scalability is the issue, as it is for most CPD problems, it brings up the difference between parametric and non-parametric approaches, and we need to clarify which of them is suitable for our data points. To distinguish between the two, there are demonstrations showing that non-parametric methods have achieved more success when the datasets are massive, and also that their computational cost is lower than that of parametric methods.

Parametric methods formulate functions which use the data points as training samples to estimate a fixed set of parameters; once the parameters are learned, the model no longer needs the training samples. In the non-parametric approach, on the other hand, all the data points have to be maintained, which imposes a cost on the model for retaining the data in order to make all inferences about the CPD process.

To deal with the matter of model cost, several solutions have been suggested by researchers. One of them is the contract-based approach, which gives more importance to output quality than to computation time, but this method requires a time interval to be specified before execution. On the other hand, Shieh and Keogh (2010) proposed a promising method called the anytime algorithm, which proved really useful: one of the factors that makes the anytime algorithm a high quality classification method is its flexibility, because from the beginning of the process it can be interrupted whenever needed, and this property leads to accurate classification and takes the algorithm from a theory to a real-world methodology. One of the most well-known anytime classification techniques has been introduced by Ueno et al (2006) as NNCA.

Chandola et al (2011) proposed a non-parametric method which reduced the computational time of a Gaussian process algorithm that had already been proposed by Williams et al (2006); Chandola et al (2011) used the Gaussian process for finding change points for the first time.

In the nonparametric context, Muliere et al (1985), Naderi et al (2011) and Martinez et al (2014) proposed Bayesian approaches for change point models, but for this approach we need to know a prior distribution of the data, which is another obstacle: since the sample size changes in online change point detection, a prior sample distribution must be recalculated for each data point, which causes a time delay in finding abrupt changes. Chen and Zhang (2015), however, proposed a graph-based change point detection method which is a nonparametric approach and gives us the opportunity to work with different dimensions of data.

1.2.3 Supervised CPD Methods

Supervised methods are based on learning algorithms; these are machine learning algorithms which estimate the outputs of the data based on training samples given to them as input variables, which are measured or preset and have some influence on the outputs (Li et al, 2006). If we detect a shift in the mean, according to Li et al (2006) it is a sign of a shift in the mean vector values after the change point area. Thus, we need a model which can disclose the relationship between time and the characteristics of the process. In this procedure a supervised algorithm satisfies our requirement when we set the process characteristics as inputs and the time of collecting the data as the output. If there is no shift in the values of the mean vector, the supervised algorithm will not be able to estimate the time proficiently; on the contrary, when there are change points, the predicted outputs of the trained algorithm, which takes time as output and process characteristics as inputs, should be totally different.

When a supervised method is used for CPD, there are two main ways of training the machine learning algorithm: binary and multi-class classifiers. If the number of states is indicated, the CPD algorithm is trained to find the borders of each state: a sliding window shifts through the data points, considering any possible segmentation between two data points as a candidate change point. This approach has a simple training phase, but an adequate amount and variety of training samples has to be provided to illustrate all possible classes. Conversely, when there are several kinds of classes in different states, detecting these classes separately can give more information about the nature and amount of the detected change. There is a variety of classifiers which can be useful for this learning problem, for example the decision tree. According to Beucher et al (2019), decision tree classification (DTC) is a machine learning technique which uses recursive partitioning of the data points to reach a homogeneous classification of the output variable; at each split the algorithm aims at reducing the heterogeneity of the output variable in the resulting data sets by choosing the optimum partition over the independent variables. The main advantages of DTC are that it makes no assumptions about the distribution of the data set and that it is reliable when the data have missing points or unnecessary variables. The Naive Bayes (NB) method, according to Saritas and Yasar (2019), is a kind of classifier which computes a set of probabilities by counting different combinations in the given training samples; this method uses Bayes' theorem, whereby all variables are assumed to be independent, which is not realistic compared to real-world problems, but the NB algorithm is capable of quick classification using conditional probability on a variety of problems. Support Vector Machine (SVM) is another approach: according to Hartley et al (2017), SVM is rooted deep in the statistical learning theory developed by Vapnik (1979, from Hartley et al, 2017); for learning from data, SVM uses a set of labeled training samples and estimates the outcome for new data points. These new outcomes have the same distribution as the training samples, but the classifier proceeds in regular steps with precise estimations, which can be used for picture classification. While these methods try to predict outcomes according to input data, the Nearest Neighbor (NN) method checks the similarity between structures across the whole data set. Chen (2019) proposed a new CPD approach based on NN which uses a two-sample test called k-NN; this two-sample test was first proposed by Schilling (1986). In this method k is assumed to be a fixed integer and denotes the number of closest neighbors of each data point, which should be chosen carefully (normally it is a number between 2 and 20); all the data points are treated as nodes of a graph, all of them distinct from each other, and the Euclidean distance between all the nodes is calculated. Using k-NN, we first consider a training set and then for each input calculate the k nearest neighbors; whenever these nearest neighbors come from a different distribution, it is a sign of a new change point. While in these methods the prior distribution is assumed to be known, Luong et al (2013) noted that in most real-world problems the prior distribution is unknown and it takes a lot of effort and computational time to calculate it by different methods; the Hidden Markov model can be an efficient technique to obtain the prior distribution when the problem has linear complexity. For most problems working on image recognition and classification, the Conditional Random Field (CRF) is a practical graphical model; according to Alam et al (2018), in the context of image segmentation, one of the critical steps in hyperspectral remote sensing image processing, this method provides high performance in classifying the context of the images. Another method is the Gaussian Mixture Model (GMM): for mixture models, it is considered that a given training sample is a realization of a random vector in which all data points come from a mixture of different distributions. (Figure 1.1) shows all these methods.

An alternative to multi-class CPD is a binary-class method, in which all of the available state-shift (change point) sequences form one class and all of the sequences inside the states form a second class. In this case we have only two learning classes, but it can turn into a complex learning problem if the number of possible types of shifts is large (Cook, 2015, from Aminikhanghahi and Cook, 2017); a typical binary classifier of this kind is logistic regression, which is illustrated in (Figure 1.1).

Figure 1.1 : CPD supervised methods (Aminikhanghahi and Cook, 2017).

Another approach using supervised learning algorithms is called the Virtual Classifier. Hido et al (2008) proposed a method for solving an unsupervised problem using supervised learning algorithms; they worked on problems where researchers want to know the reasons behind a change more than they want to detect the change point itself. This concept, called change analysis, concentrates on labeled data and compares two window frames of data points using supervised learning algorithms, with the main goal of finding information about any shift between the two samples of data.

1.2.4 Unsupervised CPD Methods

Unsupervised learning algorithms are usually used to find the structure of unlabeled data. In the context of CPD, we can use these algorithms for time series data segmentation; in other words, it is a CPD method based on statistical features of the data points. Unsupervised segmentation is more favorable because it may handle different situations without any prior training for each situation (Aminikhanghahi, 2017). (Figure 1.2) gives brief information about the unsupervised methods used for the CPD problem.

The first method is the likelihood-ratio approach, which is a natural point of view for the CPD problem: it performs a simple hypothesis test in which the null hypothesis H0 is that there is no change point while H1 is that there is a single change point at time τ (Eckley et al, 2011). The first likelihood-based CPD approach was proposed by (Hinkley, 1970, from Eckley et al, 2011), who provided a test statistic for finding a change in the mean of normally distributed data; this likelihood-based CPD was extended to other kinds of data with different distributions, including the gamma distribution by (Hsu, 1970, from Eckley et al, 2011), the exponential distribution by (Haccou et al, 1988, from Eckley et al, 2011) and the binomial distribution by (Hinkley and Hinkley, 1970, from Eckley et al, 2011).

After the search for changes in the mean parameter, scientists came up with a likelihood method for detecting changes in the variance of normally distributed data points, proposed by (Gupta and Tang, 1987, from Eckley et al, 2011). This method requires two steps: first, the probability density of two back-to-back segments is calculated separately; second, the ratio of the probability densities is computed. The most popular CPD algorithm is the cumulative sum (Cho and Fryzlewicz, 2015, from Aminikhanghahi, 2017); this algorithm accumulates the deviations of incoming values from a specified target and signals that a change point happened when the cumulative sum exceeds a given threshold.
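As an illustration of the accumulate-and-threshold logic just described, the following is a minimal sketch of a one-sided CUSUM detector in Python; the target value, the allowance k and the threshold h are illustrative assumptions, not values taken from any of the cited papers.

```python
import numpy as np

def cusum_change_point(x, target, k=0.5, h=5.0):
    """One-sided CUSUM: accumulate deviations of x from a target value and
    flag a change point when the cumulative sum exceeds the threshold h.
    k is an allowance (slack) subtracted at every step; target, k and h
    are illustrative choices."""
    s_pos, s_neg = 0.0, 0.0
    for t, xt in enumerate(x):
        s_pos = max(0.0, s_pos + (xt - target) - k)   # accumulates upward shifts
        s_neg = max(0.0, s_neg - (xt - target) - k)   # accumulates downward shifts
        if s_pos > h or s_neg > h:
            return t          # index where the cumulative sum crossed h
    return None               # no change point signalled

# toy example: the mean shifts from 0 to 3 at index 100
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
print(cusum_change_point(x, target=0.0))
```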

Another commonly used method, which turns the CPD problem into time-series-based outlier detection, is ChangeFinder. This method fits an Auto Regression (AR) model onto the data points to disclose the statistical behavior of the time series and updates the parameter estimates rapidly, so that the effect of past examples is gradually reduced (Liu et al, 2013).

Another method which has attracted a lot of attention among scientists recently is the subspace method (Moskvina and Zhigljavsky, 2003, from Liu et al, 2013). This method uses a time series model designed a priori; by performing a principal component analysis between past and present samples, the similarity between (the distance of) two subspaces is calculated. There are different variants of the subspace method; one of them is called subspace identification (Kawahara et al, 2007, from Liu et al, 2013): in this method an extended observability matrix is created using the systematic noise of the state space model, and then the columns of the matrix are compared with each other.

However, the methods explained above depend on parametric models designed a priori, such as probability distributions (Basseville and Nikiforov, 1993, from Liu et al, 2013), AR models (Takeuchi and Yamanishi, 2006, from Liu, 2013) and state space models (Kawahara, 2007, from Liu et al, 2013), used for finding possible variations in the mean, variance and spectrum; they are not practical for other kinds of changes, which limits their usefulness in practice. To deal with this problem, non-parametric prediction methods such as kernel density estimation can be used; however, non-parametric methods are less precise in high dimensional problems because of the complexity known as the curse of dimensionality.

Since these methods are based on parametric models designed in advance, they are not flexible enough to deal with real-world CPD scenarios. Some recent research therefore proposed more flexible non-parametric methods that estimate the ratio of probability densities directly, without any need for density estimation itself; the rationale of this density-ratio estimation approach is that knowing the two densities implies knowing the density ratio, but not the other way around, so estimating the ratio directly is the easier problem (Aminikhanghahi and Cook, 2017). Thus, Liu (2013) applied a density-ratio estimation method called unconstrained Least-Squares Importance Fitting (uLSIF) (Kanamori et al, 2009, from Liu et al, 2013). This method has several advantages for finding change points in non-parametric problems, and also a high numerical stability which gives the algorithm robustness across various problems (Sugiyama et al, 2012, from Liu et al, 2013). A further extension of the uLSIF method is relative uLSIF (RuLSIF), which improved the CPD results in non-parametric problems (Yamada et al, 2011, from Liu et al, 2013).


Recently, different studies have shown that time series can be analyzed using graph theory. A common usage of a graph is to represent the distances, or a generalized dissimilarity, between the data samples: with time series, the observations are considered as nodes, and the nodes are connected based on the distances between observations (Aminikhanghahi and Cook, 2017).

There are different definitions for this graph, such as the minimum spanning tree (Graham and Hell, 1985) and the nearest neighbor graph (Rosenbaum, 2005). A graph-based method for CPD is a non-parametric approach which performs a two-sample test on such a graph to decide whether there is a change point throughout the observations or not. In this method a graph G is created over the data. For each possible change point position t, the observations are divided into two parts: data points that come before t and data points that come after t. The number of edges in the graph G, denoted R_G, that link data points from these two samples is used as a signal of a change point, so that a smaller number of cross-sample edges increases the possibility of a change point. Because the value of R_G depends on the time t, the standardized function Z_G is defined as in equation (2.2) (Aminikhanghahi and Cook, 2017).

According to these calculations, the time with the maximum value of Z_G(t) is declared a change point if that value is larger than a threshold, which is calculated using the theorem proposed by Chen and Zhang (2015).

Another perspective for finding change points is clustering, a kind of method which treats the whole data set as different clusters: inside each cluster the data points have the same distribution, and this continues until some point t; if the distribution of point t+1 is different from the earlier cluster, that point is called a change point (Aminikhanghahi and Cook, 2017). One of the simple and practical clustering methods is the Sliding Window algorithm, a segmentation algorithm in which a segment grows until it exceeds some error limit and the process then repeats with the next data points. Another segmentation method, called the Top-Down method, separates the time series data into as many partitions as it can until a stopping criterion is triggered, while the Bottom-Up method starts from the finest possible approximation and merges segments until the stopping criterion is triggered (Keogh et al, 2001). Considering these three methods, Keogh et al (2001) proposed a new alternative method for solving the clustering CPD problem which keeps the online nature of the Sliding Window method and also the superiority of the Bottom-Up method, called SWAB (Sliding Window And Bottom-up). As mentioned before, the bottom-up approach first treats each observation as a separate segment and then merges these segments according to an associated merge cost until the stopping criterion is triggered. SWAB, on the other hand, keeps a buffer of size w that stores enough data for 5 or 6 segments; the bottom-up method is applied to the data in the buffer and the resulting segment is reported. The data corresponding to the reported segment are then removed from the buffer and replaced with the next data points in the series.

Another clustering method, which groups segments based on the Minimum Description Length (MDL) (Rakthanmanon et al, 2011, from Aminikhanghahi and Cook, 2017), assigns a description length DL to a time series T of length m, where H(T) is the entropy of the data points:

DL(T) = m * H(T)        (1.1)
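As a hedged sketch of how the description length in equation (1.1) could be computed in practice, the snippet below quantizes a series into integer symbols and multiplies its empirical entropy by the series length; the quantization step and the number of bins are illustrative assumptions, not part of the cited method.

```python
import numpy as np

def entropy(symbols):
    """Empirical entropy (in bits) of a discrete symbol sequence."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def description_length(ts, n_bins=16):
    """DL(T) = m * H(T): quantize the series into n_bins symbols
    (an illustrative choice) and multiply its entropy by the length m."""
    ts = np.asarray(ts, dtype=float)
    edges = np.linspace(ts.min(), ts.max(), n_bins + 1)
    symbols = np.digitize(ts, edges[1:-1])      # discretized series
    return len(ts) * entropy(symbols)

print(description_length(np.sin(np.linspace(0, 20, 200))))
```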

MDL clustering for CPD uses a bottom-up algorithm over clusters which may include segments of various lengths, and there is no limitation on the number of clusters. Another way of clustering time series data points in order to find change points is the shapelet method (Zakaria et al, 2012, from Aminikhanghahi and Cook, 2017). An unsupervised shapelet (u-shapelet) S is a small pattern in a time series T such that the distance between S and one part of the time series is much smaller than the distance between S and the rest of the time series. Shapelet-based clustering, which attempts to cluster the data points according to the shape of the whole time series, tries to find a u-shapelet which can separate and remove a time series segment from the rest of the data sample; the algorithm repeats this search cycle among the remaining data points until no data remain to be separated. A greedy search algorithm which tries to maximize the separation gap between two segments of the time series is used to extract the u-shapelets, and this method prepares the whole data set for any clustering algorithm, such as k-means, which uses a Euclidean distance function to find the change points.

1.2.5 Dynamic Time Warping (DTW)

A common and practical classification technique for time series is the Nearest Neighbor method, which utilizes the distance between two series, measured as the Euclidean or the dynamic time warping (DTW) distance (Santos and Kern, 2016). Many authors have used this method for various purposes, such as clustering (Aach and Church, 2001, from Ratanamahatana and Keogh, 2004), anomaly detection in series (Dasgupta, 1999, from Ratanamahatana and Keogh, 2004), rule discovery (Das et al, 1998, from Ratanamahatana and Keogh, 2004) and motif discovery (Chiu et al, 2003, from Ratanamahatana and Keogh, 2004), and many authors have shown the superiority of DTW over the Euclidean metric. Suppose we have two time series, Q with length n and C with length m, such that

Q = q_1, q_2, ..., q_n        (1.2)

C = c_1, c_2, ..., c_m        (1.3)

To find the distance between these two time series using DTW, we construct an n-by-m matrix whose (i, j) element is the squared distance d(q_i, c_j) = (q_i - c_j)^2, where i and j are the row and column indices of the matrix. The next calculation is to find the minimum-cost path through this matrix,

DTW(Q, C) = min sqrt( sum_{k=1}^{K} w_k )        (1.4)

in which w_k is the k-th element of a warping path of length K, and DTW(Q, C) is the cost of the path that minimizes the total warping cost.
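The following is a minimal dynamic-programming sketch of the DTW distance defined by equation (1.4); the two example "weeks" are made-up illustrative series, not thesis data.

```python
import numpy as np

def dtw_distance(q, c):
    """Dynamic time warping distance between series q (length n) and c (length m):
    fill an n-by-m cost matrix with d(q_i, c_j) = (q_i - c_j)^2 accumulated along
    the cheapest warping path, then return the square root of the total cost."""
    n, m = len(q), len(c)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            # the warping path may move right, down, or diagonally
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

week1 = np.array([10.0, 12.0, 11.0, 13.0, 15.0, 9.0, 8.0])
week2 = np.array([11.0, 11.5, 12.0, 14.0, 14.5, 8.5, 8.0])
print(dtw_distance(week1, week2))
```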

1.2.6 Electric Load Forecast

Load forecasting is at the center of all electric power processes, such as planning and operation; it requires precise estimation of consumption values and also of the locations of electricity consumption in different periods of time. According to the forecast period, predictions up to 1 day ahead are called short term load forecasting (STLF), predictions between 1 day and 1 year ahead are called medium term load forecasting (MTLF), and predictions between 1 and 10 years ahead are called long term load forecasting (LTLF) (Alfares and Nazeeruddin, 2002). When we talk about demand prediction, which is an important side of today's world because it feeds the technology, Hagan and Behr (1987) believe that short term load forecasting has the key role. Srivastava et al (2016) proposed four main categories of STLF techniques, and according to Alfares and Nazeeruddin (2002) there are nine common methods used for electric load forecasting, which are listed in (Figure 1.3).

The multiple regression method estimates a weighted least-squares fit for electric load forecasting; in this method we can include the relationship between electricity consumption and weather conditions as statistical constraints (Alfares and Nazeeruddin, 2002), and the regression coefficients are obtained by weighted least squares using known historical data (Mbamalu and El-Hawary, 1993, from Alfares and Nazeeruddin, 2002). Exponential smoothing, on the other hand, is a classical technique commonly used for electric load forecasting; this method also uses previous data to create a new model based on a so-called fitting function (Moghram and Rahman, 1989, from Alfares and Nazeeruddin, 2002), calculated as in equation (1.5).

Figure 1.3 : Electric Load Forecasting methods: multiple regression, exponential smoothing, iterative reweighted least-squares, adaptive load forecasting, stochastic time series, ARMAX models based on genetic algorithms, fuzzy logic, neural networks, and knowledge-based expert systems.

y(t) = β(t)^T f(t) + ε(t)        (1.5)

where f(t) is the fitting function, β(t) is the coefficient vector, ε(t) is white noise and T denotes the transpose operator.

Exponential smoothing is a simple technique used to smooth and forecast a time series without the necessity of fitting a parametric model (Gelper et al, 2010); in this method the prediction of future values is calculated as a sum of decreasingly weighted past values (Kotsialos et al, 2005). For identifying model orders and parameters, Mbamalu and El-Hawary (1992, from Alfares and Nazeeruddin, 2002) proposed a method called iterative reweighted least-squares, which calculates an optimal starting point using a control operator that adjusts one variable at a time. If we make an online form of this model, which adapts the results to changing load conditions, it is called adaptive load forecasting; Pappas et al (2008) proposed using the adaptive load forecasting method to plan future network expansion. While these methods are famous and commonly used by data analysts, there are some unique patterns of energy consumption in growing areas and cities that make it hard to estimate the load with these methods, and we then have to utilize stochastic time series models (Alfares and Nazeeruddin, 2002). The stochastic method uses past data in a time series formulation to create a model of the sample; there are three main approaches for modelling this kind of time series, shown in (Figure 1.4).

Figure 1.4 : Time series methods used for modelling the time series for electric load forecasting: the autoregressive (AR) model, the autoregressive moving-average (ARMA) model, and the autoregressive integrated moving-average (ARIMA) model.

If we want to use a genetic algorithm for identifying autoregressive moving averages with exogenous variables, we use the ARMAX model for load forecasting (Alfares and Nazeeruddin, 2002). As another approach, we can treat the load values as an unknown dynamic system and use fuzzy logic with centroid defuzzification to forecast them; this method works in two steps, training and online estimating, and Ali et al (2016) proposed a new model for long term load forecasting based on fuzzy logic. While we use a training step in fuzzy logic, we have powerful alternatives in the learning context, namely neural networks (NN); thanks to their learning ability, many scientists have recently used these algorithms for forecasting. In this method there are multiple hidden layers in the algorithm, and in each layer there are many neurons; in each layer the inputs are multiplied by a weight given to each value, added to a value called the threshold, and passed on as a new output called the net function. These NN algorithms, however, cannot reason or explain according to the given information, whereas knowledge-based expert systems are the result of artificial intelligence: to build such a model, scientists extract information from domain experts and represent it as if-then rules, and rules are set for each constraint as forced condition factors, such as weather conditions or system load, with each change in these factors affecting the final result of the load forecast (Alfares and Nazeeruddin, 2002).

1.3 Hypothesis

If we denote the whole data sample by Y, the data points are

Y = y_1, y_2, ..., y_N.

Our null hypothesis is that there is no change point throughout the sample of daily mean consumption values. Under the alternative, a sudden change has been detected at a point called the change point, denoted by τ, and the data split as

Y = y_1, ..., y_τ | y_{τ+1}, ..., y_N        (1.6)

with 1 ≤ τ < N, where the observations before and after τ come from different distributions.

Sometimes more than one change can occur throughout the data points, which is written as

Y = y_1, ..., y_{τ_1} | y_{τ_1+1}, ..., y_{τ_2} | ... | y_{τ_n+1}, ..., y_N        (1.7)

Here we have multiple changes at τ_1 < τ_2 < ... < τ_n, which separate the data points into n+1 samples. Throughout this proposal we assume that all the points of Y are independent.
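To make the hypothesis concrete, the short sketch below simulates a series that satisfies the alternative (1.6), with an illustrative mean shift at τ = 700 on top of a toy weekly pattern; all numbers are assumptions for illustration and are unrelated to the thesis data.

```python
import numpy as np

rng = np.random.default_rng(42)

N, tau = 1096, 700                                     # illustrative sample size and change point
weekly = 5 * np.sin(2 * np.pi * np.arange(N) / 7)      # toy weekly seasonality

# Under H1 the observations before and after tau have different mean levels.
y = np.where(np.arange(N) < tau, 100.0, 115.0) + weekly + rng.normal(0, 3, N)

# Under H0 the whole sample would share one distribution.
y0 = 100.0 + weekly + rng.normal(0, 3, N)

print(y[:tau].mean(), y[tau:].mean())                  # the two segment means differ under H1
```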

2. GRAPH BASED CPD

Graph-based CPD, first proposed by Chen and Zhang (2015), is a method which considers each data point as a node of a graph. When a change point τ is found, the data are separated into two samples, y_1, ..., y_τ and y_{τ+1}, ..., y_N; the fact that nodes within one sample lie closer to each other than to the nodes of the other sample discloses the power of the method and serves as a sign of approval of the detected change point.

This methodology does not need any prior information or assumption about the distribution of the samples; instead, we search for similarity between the points of the graph. Thus, to find the best split of the graph into samples, we need a test to compare the data points on the graph; the next part gives a brief presentation of the minimum spanning tree.

2.1 Minimum Spanning Tree (MST)

The MST is a classical graph algorithm which is practical for problems like the travelling salesman problem, where we look for the shortest route that reaches all the destinations of a mission. In this algorithm all the routes are weighted differently, and the algorithm considers them according to their significance level or travelling cost.

In this proposal we use the MST within a two-sample test in order to find the best candidate samples with different distributions: a small number of connections between the two different groups can reject our null hypothesis, because when the two samples come from different distributions, points from the same sample should be closer to each other.

In (Figure 2.1) each node is one day's electricity consumption, and each line between two nodes represents the distance between them. The data sequence for this MST computation has n = 1096 points, one node per day.

Figure 2.1 : MST graph for 1096 points.
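A sketch of how such an MST could be computed with SciPy is given below; representing each day by its scalar consumption value, using absolute differences as distances, and the synthetic data itself are assumptions made only for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
daily = rng.normal(100, 10, 1096)            # stand-in for the 1096 daily values

# Pairwise distance matrix between observations (here: absolute difference).
dist = np.abs(daily[:, None] - daily[None, :])

# Minimum spanning tree over the complete similarity graph.
mst = minimum_spanning_tree(dist)            # sparse matrix holding the n-1 MST edges
edges = np.array(mst.nonzero()).T            # (i, j) index pairs of the MST edges
print(edges.shape)                           # -> (1095, 2)
```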


2.2 Change Point Alternative

Here we search for a statistic which can test our hypothesis against any change that occurs in the data distribution; if the statistic gives a signal and we find such a point, it divides the whole data set into two samples. For the two-sample test performed on the MST we therefore need a count of the edges that connect the two candidate samples.

Let y_1, ..., y_n be the data points, each treated as a node of the similarity graph G (here the MST), and let I(.) be the indicator function, which is 1 if its argument is true and 0 otherwise. For a candidate split time t, R_G(t) is the number of edges of G that connect observations from the two different groups, i.e. edges with one endpoint observed up to t and the other observed after t:

R_G(t) = sum over edges (i, j) in G of I( I(i <= t) != I(j <= t) )        (2.1)

Here P, E and Var denote probability, expected value and variance under the permutation null distribution; the standardized value of R_G(t) is

Z_G(t) = - ( R_G(t) - E[R_G(t)] ) / sqrt( Var(R_G(t)) )        (2.2)


The expected value and the variance of R_G(t) under the permutation null are given by equations (2.3) and (2.4). Since, under this null, every ordering of the n observations is equally likely, a given edge connects the two groups with probability 2t(n - t) / (n(n - 1)), so that

E[R_G(t)] = |G| * 2t(n - t) / (n(n - 1))        (2.3)

where |G| denotes the number of edges in the graph. The variance Var(R_G(t)) in (2.4) additionally depends on how many pairs of edges of G share a node, and the auxiliary quantities it needs are computed according to (2.5) and (2.6).

According to these test statistics we look at the values in (Figure 2.2), which recommends two candidate change point spots that can be considered as points of abrupt change or as change point intervals.

For testing the null hypothesis H0 against the alternative H1 we perform a scan statistic,

max over n0 <= t <= n1 of Z_G(t)        (2.7)

where in equation (2.7) n0 and n1 define the range of t that is scanned; if the maximum value exceeds the threshold described in Section 2.3, the corresponding t is taken as a change point.
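The sketch below illustrates this scan statistic on an MST: it computes R_G(t) for every candidate t and standardizes it into Z_G(t). Instead of the closed-form moments of equations (2.3)-(2.6), the mean and variance of R_G(t) are approximated here by randomly relabelling the nodes of the fixed MST, which is a Monte Carlo stand-in for the permutation null; the synthetic series and all parameter choices are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edges(series):
    """Build an MST over the observations (scalar distances) and return its edge list."""
    dist = np.abs(series[:, None] - series[None, :])
    return np.array(minimum_spanning_tree(dist).nonzero()).T

def r_g_curve(edges, ts):
    """R_G(t) for every t in ts: count edges joining an index <= t to an index > t."""
    lo, hi = edges.min(axis=1)[:, None], edges.max(axis=1)[:, None]
    return ((lo <= ts) & (hi > ts)).sum(axis=0)

def scan_z_g(series, n0, n1, n_perm=200, seed=0):
    """Standardized Z_G(t) over n0 <= t <= n1.  E[R_G(t)] and Var(R_G(t)) under the
    permutation null are approximated by randomly relabelling the nodes of the MST."""
    n, rng = len(series), np.random.default_rng(seed)
    edges, ts = mst_edges(series), np.arange(n0, n1 + 1)
    perm = np.array([r_g_curve(rng.permutation(n)[edges], ts) for _ in range(n_perm)])
    obs = r_g_curve(edges, ts)
    z = -(obs - perm.mean(axis=0)) / perm.std(axis=0)
    return ts, z

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(100, 3, 600), rng.normal(112, 3, 496)])
ts, z = scan_z_g(y, n0=50, n1=1045)
print(ts[np.argmax(z)])          # candidate change point, expected near index 599
```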

(Figure 2.2) suggests that around points 600 and 900 there are abnormal fluctuations: something happened on those days that makes them outliers compared with the other points, and such outliers could cause major uncertainty in the forecast algorithms, so we should analyze these points carefully.

In (Figure 2.3) we zoom in on the details of the movements between points 550 and 650; it illustrates that at point 605 both R_G(t) and Z_G(t) move radically and not in the direction of the normal trend, so we consider 605 as a change point.


In (Figure 2.4), for the interval between points 850 and 950, the statistics show some chaotic movements, but we do not take this as approval from Z_G(t), because it illustrates only normal fluctuation during that period.

Such outliers matter for the forecasting step for two reasons. First of all, the smoothed values would be affected, because they depend on the current and past values of the data points; second of all, these outliers take part in the parameter-choosing process of the recursive updating scheme, so while we keep these outliers we cannot obtain accurate estimations as forecasts.

Figure 2.2 : R_G(t) and Z_G(t) for all the data points.

2.3 Significance Level

For a single change point, after computing the scan statistic in equation (2.7) we can find the maximum value of Z_G, but we need more evidence to be confident about our decision. Thus we use the approximation below to calculate the probability that the maximum of Z_G exceeds a given level: with the significance level assumed as 0.5, if the p-value is less than 0.5 we accept the point as a change point, and we can perform this calculation on different intervals of our data.

From this formulation we can understand that the distribution of max Z_G(t) is defined as a permutation distribution. To make the probability method more applicable we regard Z_G(t), over the scanned values of t, as a family of tests, and the probability in question is a family-wise error value; equations (2.8) and (2.9) give the analytic approximation of P( max over n0 <= t <= n1 of Z_G(t) > b ) under this permutation distribution.

Figure 2.4 : R_G(t) and Z_G(t) between the points 850 and 950.

2.3.1 Skewness Correction

If the value of t/n is close to 0 or 1, the normal convergence of Z_G(t) can be poor: the distribution of Z_G(t) may be left- or right-skewed, and the p-value approximation would then overestimate the tail probabilities; a skewness correction can be effective in fixing this problem. Skewness correction in CPD problems was first carried out by (Tu and Siegmund, 1999, from Chen and Zhang, 2015), who proposed an application of a universal third-moment correction. From (Figure 2.5) it is obvious that the skewness depends on the value of time; the skewness index is -1.14 for our Z_G values, which means a negative skewness (skewness to the left), and the plot is more skewed at both ends.

In this method we adapt to the skewness of Z_G(t) at each t by applying a skewness correction that gives a better estimation of the marginal probability. For a single change point the marginal probability P( Z_G(t) in b + dx/b ) is rewritten through a change of measure based on the estimated third moment (skewness) of Z_G(t) at each t; after this correction on the marginal values, the approximation of the family-wise tail probability takes a skewness-corrected form. Equations (2.10) to (2.15) give the corrected marginal probability and the resulting approximation of P( max over n0 <= t <= n1 of Z_G(t) > b ) in terms of the estimated skewness of Z_G(t).

Figure 2.5 : Plot of skewness for Z_G values.

2.4 DTW Classification And k-means For CPD

As mentioned in the literature review, an alternative to the Euclidean metric for measuring the distance between two time series is dynamic time warping, which measures the distance between two sample sets through a distance matrix and then finds the minimum-cost alignment between the two series. This method is a significant approach for classifying time series. In this thesis we used this technique as an alternative to the Euclidean metric of the graph-based method, so that the results of the two available methods can be compared. Thus we arranged all the data points as a weekly matrix, in which the columns correspond to the week numbers and each row is a weekday; for this weekly load consumption matrix we calculate the minimum DTW distance between the consumption of consecutive weeks, as shown in (Figure 2.6).

(Figure 2.7) illustrates the minimum distance path computed by the DTW algorithm for the first and second weeks of the load consumption data.

Figure 2.6 : DTW for weekly load consumption values.

Figure 2.7 : Minimum distance path between first and second week.

For consecutive weeks the distance varies, and we can classify the load consumption data according to these variations in distance. As mentioned in the literature review, one of the common approaches to CPD for time series is clustering; in this thesis we performed the k-means clustering method on the DTW weekly distance measurements to detect the number of different clusters, and whenever there is a cluster change it can be a sign of a change point in the time series. (Figure 2.8) shows the k-means clustering result, which illustrates that the DTW of the weekly consumption data consists of three main clusters. According to the results of k-means there are nine weeks in which the consumption pattern experienced a major variation, while most of the weeks fall into the first and second clusters. According to (Table 2.1) we eliminate the weeks that belong to the third cluster, and the result looks like (Figure 2.9).

Figure 2.8 : K-means clustering results for DTW weekly load consumption.

Figure 2.9 : Daily consumption values after deleting change point candidates.

Table 2.1 : Weeks in third cluster.

Week    : 29  38  78  79  80  88  89  139  140
Cluster :  3   3   3   3   3   3   3    3    3
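A sketch of the weekly DTW plus k-means pipeline described above is given below; it reuses the simple DTW implementation from Section 1.2.5, generates synthetic consumption data in place of the thesis data, and fixes k = 3 clusters to mirror the result reported above, so which weeks end up in the outlier cluster is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def dtw_distance(q, c):
    """Same dynamic-programming DTW distance as in the Section 1.2.5 sketch."""
    n, m = len(q), len(c)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

rng = np.random.default_rng(0)
daily = 100 + 10 * np.sin(2 * np.pi * np.arange(1092) / 7) + rng.normal(0, 3, 1092)
weeks = daily.reshape(-1, 7)                       # one row per week (156 weeks)

# DTW distance between each pair of consecutive weeks
consec = np.array([dtw_distance(weeks[i], weeks[i + 1]) for i in range(len(weeks) - 1)])

# k-means on the consecutive-week distances; k = 3 follows the result reported above
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(consec.reshape(-1, 1))

# week pairs assigned to the smallest cluster are treated as change point candidates
outlier_cluster = np.argmin(np.bincount(labels))
print(np.where(labels == outlier_cluster)[0] + 1)  # 1-based indices of flagged week pairs
```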

In Section 4 of this thesis we will produce forecasts based on the new outcome shown in (Figure 2.9) and compare them with the plain Holt-Winters estimates and with the graph-based change point Holt-Winters estimates.

3. HOLT-WINTERS METHOD

Whenever we want to forecast future values very first and effective step is to see the graph of the sample space of our data which called sequence graph and it shows the series of data against time, whole idea beyond using these graphs by data analysts is to see the bigger picture and have a superficial background about the movements of the trend and seasonality of the time series, using this background, analysts can choose the best model which would be suitable for type data, (figure 3.1)illustrates the sequence graph of daily consumption values.

Time series forecasting methods consider the data as a combination of different patterns; they decompose these patterns, which makes it easier to understand the data and results in more accurate estimations.


Here in (Figure 3.1) our data is the daily electricity consumption over three years. A glance at the sequence graph shows cycles that occur repetitively, which in time series terms is called seasonality; this seasonality is measured by the seasonality index (SI).

3.1 Exponential Smoothing

Exponential smoothing is a forecasting procedure that uses information from past data. The procedure gives more weight to the most recent values: each data point receives a known weight, and the further a point lies in the past, the lower its weight, so the most recent points play the most significant role in estimating future values.

A simple form of the exponential smoothing formula can be written as:

s_t = \alpha x_t + (1 - \alpha) s_{t-1}

In this formula \alpha is the smoothing factor, with 0 < \alpha < 1. The smoothed statistic s_t is a simple weighted average of the observed value x_t and the prior smoothed statistic s_{t-1}; as \alpha goes higher, the level of smoothing is reduced. Exponential smoothing can therefore be applied as soon as we have two observations.

This simple exponential smoothing method is also known as the exponentially weighted moving average (EWMA), which can theoretically be regarded as an ARIMA(0,1,1) model (Autoregressive Integrated Moving Average) with no constant term. We will use the ARIMA method as an alternative to exponential smoothing with Holt-Winters and compare its outcomes with the current load forecast results.
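A minimal Python sketch of the smoothing recursion above; the observations and the smoothing factor \alpha = 0.3 are illustrative choices, not values taken from the thesis data.

def simple_exponential_smoothing(x, alpha=0.3):
    # s_t = alpha * x_t + (1 - alpha) * s_{t-1}, initialised with the first observation.
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t] + (1.0 - alpha) * s[-1])
    return s

observations = [31.2, 29.8, 33.1, 34.0, 32.5]        # illustrative values
print(simple_exponential_smoothing(observations))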

3.1.1 Single Exponential Smoothing

This method is suitable when the goal of the forecast is short-range estimation; it assumes that the time series has no significant trend and that the points move around a consistent mean, or with only a slight slope toward the top or bottom of the graph. To deal with the seasonality aspect of our time series


we can use the Holt-Winters (HW) method. There are two types of HW according to the type of seasonality: the multiplicative and the additive seasonal model.

The main difference between the two is the impact of time on the scale of the seasonal movement: if, as time passes, the seasonal cycles or the trend grow or shrink, the multiplicative method is more suitable; otherwise the additive method is recommended. Analysis of (Figure 3.1) shows no effect of time on the seasonality or the trend line, which justifies using the additive method.

3.1.2 Additive Seasonal Method

In this method the data points are modelled as:

X_t = b_1 + b_2 t + S_t + \varepsilon_t

Here b_1 is the permanent (base) component, b_2 is the linear trend component, S_t is the additive seasonal factor and \varepsilon_t is a random error. The length of a season is denoted by L, so for each data point X_t the seasonality index measures the deviation of that point from the deseasonalized level of the series. The smoothing weights used for each component are numbers between 0 and 1. The next procedures estimate the three components: the overall smoothing, the trend smoothing and the seasonal smoothing.

3.1.3 Overall, Trend and Seasonal Smoothing

Overall smoothing, which gives the level R_t, is calculated as:

R_t = \alpha (X_t - S_{t-L}) + (1 - \alpha)(R_{t-1} + G_{t-1})

Trend smoothing is:

G_t = \beta (R_t - R_{t-1}) + (1 - \beta) G_{t-1}


And seasonal smoothing:

S_t = \gamma (X_t - R_t) + (1 - \gamma) S_{t-L}
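A minimal Python sketch of these three recursions, assuming a season length L and illustrative smoothing constants; the initial level, trend and seasonal indices are set in a deliberately simple way and are not the initialisation used in this thesis.

def holt_winters_additive(x, L, alpha=0.3, beta=0.1, gamma=0.2, horizon=0):
    R = x[0]                                   # initial level (overall smoothing)
    G = (x[L] - x[0]) / L                      # crude initial trend
    S = [x[i] - R for i in range(L)]           # initial seasonal indices
    fitted = []
    for t in range(len(x)):
        s_old = S[t % L]
        R_new = alpha * (x[t] - s_old) + (1 - alpha) * (R + G)       # overall smoothing
        G = beta * (R_new - R) + (1 - beta) * G                      # trend smoothing
        S[t % L] = gamma * (x[t] - R_new) + (1 - gamma) * s_old      # seasonal smoothing
        R = R_new
        fitted.append(R + S[t % L])
    # m-step-ahead forecast: R_t + m * G_t + S_{t-L+m}
    forecast = [R + m * G + S[(len(x) + m - 1) % L] for m in range(1, horizon + 1)]
    return fitted, forecast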


4. FORECASTING

4.1 Forecasting Using HW Method

Our sample space is the daily consumption data for three years, 1096 points in total. Throughout this thesis we use the first 1066 data points as the sample for estimating the next 30 days (one month) and compare the estimates with the real data.

In the first stage we execute HW on the whole data without any modification, as shown in (Figure 4.1); the forecast values for this step, which we call step A, are listed in (Table 4.1). In (Figure 4.1) the black line shows the real data points, the red line is the fitted HW data, and the 30 estimated points are shown with a red line between two blue lines, which mark the standard deviation band of the future values.
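A minimal sketch of step A with the statsmodels Holt-Winters implementation; the synthetic load series, the weekly seasonal period of 7 and the additive settings are assumptions for illustration rather than the exact configuration used in this thesis.

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Placeholder series standing in for the 1096-day consumption data.
rng = np.random.default_rng(0)
days = pd.date_range("2015-01-01", periods=1096, freq="D")
load = pd.Series(33000.0 + 1500.0 * np.sin(2 * np.pi * np.arange(1096) / 7)
                 + rng.normal(0.0, 800.0, 1096), index=days)

train, test = load.iloc[:1066], load.iloc[1066:]
model_A = ExponentialSmoothing(train, trend="add", seasonal="add",
                               seasonal_periods=7).fit()
forecast_A = model_A.forecast(30)                 # step A: HW on the unmodified sample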

4.2 Forecasting Using HW Method after Graph Based CPD

According to the graph-based method, a sudden change happens at point 605. Since the HW method is sensitive to outliers, and any existing outlier can have a major effect on both the fitted data points and the estimated values, we


separate the whole data into two samples, the first one T1 = {1, 2, ..., 604} and the second one T2 = {606, ..., 1066}, and execute HW on both of them separately to see whether there is any difference between the plain HW and the change point integrated one.

4.2.1 Forecast using combination of T1 and exponential smoothing values of T2

Here we take the values of sample one, remove point 605 from the sample space and, instead of sample two, use the fitted data obtained with the HW method, then estimate new data points from this new sample. (Figure 4.2) illustrates the result of this examination.

Change point models are used broadly in different fields for detecting abrupt changes. The result of this step is given in (Table 4.1) as step B.

4.2.2 Forecast using combination of T2 and exponential smoothing values of T1

Here we take the values of T2, remove point 605 from the sample space and, instead of T1, use the fitted data obtained with the HW method, then estimate new data points from this new sample. (Figure 4.3) illustrates the result of this examination, which we call step C.
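A minimal sketch of the recombination idea for step B, reusing the names from the step A sketch above; the indexing and the use of in-sample fitted values for the T2 segment follow the description here, but the details are illustrative rather than the thesis's exact code.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Step B: keep T1 as observed, drop point 605, and stand in the HW fitted
# values for the T2 segment, then re-estimate Holt-Winters on the new sample.
fitted_A = model_A.fittedvalues                  # in-sample HW fit from step A
T1_obs = train.iloc[:604]                        # observations 1..604
T2_fit = fitted_A.iloc[605:]                     # fitted values replacing 606..1066
sample_B = pd.concat([T1_obs, T2_fit]).reset_index(drop=True)

model_B = ExponentialSmoothing(sample_B, trend="add", seasonal="add",
                               seasonal_periods=7).fit()
forecast_B = model_B.forecast(30)                # step B forecasts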


4.2.3 Forecasting Using HW Method after Dynamic Time Warping CPD

After conducting the DTW method on the weekly load consumption values, we found 9 candidate change points which can affect the final forecasting results and created a new sample from the remaining data. Using the Holt-Winters algorithm, the outcome is a new fitted graph with the estimation for the next 30 days, as shown in (Figure 4.4).

This step is called D and the forecast results are listed in (Table 4.1).

Figure 4. 3 : Holt-Winters for the two-sample edition, step C.

4.2.4 Results of Forecasts

The outcomes of all four examinations (steps A, B, C and D) are listed in (Table 4.1), and we have to test which of them gives the more accurate estimations for n days ahead.

Table 4. 1 : Forecast values and real consumption values for the next 30 days.

n    Forecast Step A   Forecast Step B   Forecast Step C   Forecast Step D   Real Consumption
1    33128.03          33156.14          33600.74          31058.68          29138.22
2    33184.90          33176.23          33691.86          30996.78          24277.88
3    32914.70          32918.13          33371.41          30286.69          31940.88
4    32997.37          33019.66          33481.95          30528.85          35073.84
5    32460.84          32475.01          32917.25          30059.62          35267.53
6    32526.99          32541.35          33094.10          30949.27          35597.23
7    32411.17          32417.12          32951.85          31337.60          34670.95
8    32331.83          32331.94          33055.02          31271.46          30783.63
9    32946.31          32945.93          33701.78          31390.48          26264.94
10   32562.67          32553.02          33281.70          31055.72          33398.46
11   32715.53          32715.73          33462.20          31378.17          33872.08
12   32766.61          32757.11          33459.75          30728.04          33598.98
13   33306.93          33335.16          34217.72          31055.41          33481.04
14   33363.80          33355.25          34308.84          30993.51          32920.70
15   33093.59          33097.14          33988.39          30283.42          29394.72
16   33176.26          33198.68          34098.93          30525.58          25391.95
17   32639.73          32654.03          33534.23          30056.34          34093.49
18   32705.89          32720.37          33711.08          30946.00          35912.46
19   32590.07          32596.13          33568.83          31334.33          36821.08
20   32510.72          32510.96          33672.00          31268.18          37400.15
21   33125.20          33124.95          34318.76          31387.21          37688.66
22   32741.56          32732.04          33898.69          31052.45          34059.07
23   32894.42          32894.75          34079.19          31374.89          28307.52
24   32945.50          32936.13          34076.73          30724.76          35664.97
25   33485.82          33514.18          34834.70          31052.13          35704.86
26   33542.69          33534.27          34925.82          30990.23          34877.12
27   33272.48          33276.16          34605.37          30280.14          34165.57
28   33355.15          33377.70          34715.91          30522.30          33784.70
29   32818.62          32833.05          34151.21          30053.07          30343.65
30   32884.78          32899.39          34328.06          30942.72          26118.54
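A minimal sketch of how the accuracy of steps A-D can be compared against the real consumption using MAPE and RMSE; the three rows below are taken from Table 4.1 purely for illustration, and the error metrics themselves are an assumed choice of comparison criterion.

import numpy as np
import pandas as pd

def mape(actual, pred):
    # Mean absolute percentage error, in percent.
    return np.mean(np.abs((actual - pred) / actual)) * 100.0

def rmse(actual, pred):
    # Root mean squared error.
    return np.sqrt(np.mean((actual - pred) ** 2))

# First three rows of Table 4.1, for illustration only.
results = pd.DataFrame({
    "A": [33128.03, 33184.90, 32914.70],
    "B": [33156.14, 33176.23, 32918.13],
    "C": [33600.74, 33691.86, 33371.41],
    "D": [31058.68, 30996.78, 30286.69],
    "Real": [29138.22, 24277.88, 31940.88],
})
for step in ["A", "B", "C", "D"]:
    print(step, round(mape(results["Real"], results[step]), 2),
          round(rmse(results["Real"], results[step]), 2))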
