Adaptive Contextual Learning for Unit Commitment in Microgrids With Renewable Energy Sources

Hyun-Suk Lee, Cem Tekin, Member, IEEE, Mihaela van der Schaar, Fellow, IEEE, and Jang-Won Lee, Senior Member, IEEE

Abstract—In this paper, we study a unit commitment (UC) problem where the goal is to minimize the operating costs of a microgrid that involves renewable energy sources. Since traditional UC algorithms use a priori information about uncertainties such as the load demand and the renewable power outputs, their performance depends strongly on the accuracy of the a priori information, especially in microgrids due to their limited scale and size. This makes the algorithms impractical in settings where the past data are not sufficient to construct an accurate prior of the uncertainties. To resolve this issue, we develop an adaptively partitioned contextual learning algorithm for UC (AP-CLUC) that learns the best UC schedule and minimizes the total cost over time in an online manner without requiring any a priori information. AP-CLUC effectively learns the effects of the uncertainties on the cost by adaptively considering context information strongly correlated with the uncertainties, such as the past load demand and weather conditions. For AP-CLUC, we first prove an analytical bound on the performance, which shows that its average total cost converges to that of the optimal policy with perfect a priori information. Then, we show via simulations that AP-CLUC achieves competitive performance with respect to the traditional UC algorithms with perfect a priori information, and that it achieves better performance than them even when the information contains small errors. These results demonstrate the effectiveness of utilizing the context information and the adaptive management of the past data for the UC problem.

Index Terms—Contextual learning, unit commitment, microgrids, renewable energy, system uncertainty.

Manuscript received September 29, 2017; revised April 9, 2018 and June 14, 2018; accepted June 15, 2018. Date of publication June 22, 2018; date of current version July 27, 2018. The work of H.-S. Lee and J.-W. Lee was supported by Midcareer Researcher Program through NRF grant funded by the MSIT, Korea (No. NRF-2017R1A2B4006908). The work of M. van der Schaar was supported in part by an ONR grant and in part by the NSF under Grants 1407712, 1524417, and 1533983. This paper was presented in part at the 5th IEEE Global Conference on Signal and Information Processing, Greater Washington, D.C., Dec. 2016. The guest editor coordinating the review of this manuscript and approving it for publication was Dr. Dipti Srinivasan. (Corresponding author: Jang-Won Lee.)

H.-S. Lee and J.-W. Lee are with the Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, South Korea (e-mail: hs.lee@yonsei.ac.kr; jangwon@yonsei.ac.kr).

C. Tekin is with the Electrical and Electronics Engineering Department, Bilkent University, Ankara 06800, Turkey (e-mail: cemtekin@ee.bilkent.edu.tr).

M. van der Schaar is with the Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA 90095 USA (e-mail: mihaela@ee.ucla.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JSTSP.2018.2849855

I. INTRODUCTION

Using renewable energy sources such as wind and solar has many advantages, e.g., low economic costs and carbon footprint reduction from fossil fuels. In general, to efficiently use renewable energy sources in power systems, uncertainties in power systems from load demands and renewable power outputs should be effectively addressed. Especially, in microgrids, properly addressing the uncertainties becomes more important due to their limited scale and size.

Recently, such uncertainties have been considered in unit commitment (UC) problems, which determine the on/off states of the thermal generation units and their power outputs, i.e., the UC schedule, to minimize operating costs. In many existing works [2]–[10], UC schedules that take into account the system uncertainties are determined by using stochastic optimization for UC (SOUC). In SOUC, the UC schedule is determined to minimize the expected operating cost over possible scenarios of the uncertainties. However, in practice, the number of scenarios considered in SOUC must be reduced due to its high computational complexity [11], which causes reliability issues since the reduced scenarios do not capture all possibilities.

To resolve the reliability issues, two different approaches, robust optimization for UC (ROUC) and interval optimization for UC (IOUC), have been proposed. In ROUC [12]–[17], the UC schedule is determined to minimize the worst-case cost using a deterministic uncertainty set defined by the worst-case realization. In IOUC [18]–[20], the UC schedule is determined considering reliability constraints defined by using intervals representing the probable realizations of the uncertainties. However, with the above approaches it is difficult to appropriately choose the scenarios in SOUC, the uncertainty set in ROUC, and the intervals in IOUC so as to trade off reliability against cost. To overcome this difficulty, in [21], a Markovian approach for UC is proposed where the UC schedule is determined without scenario analysis by representing the uncertainties using a discrete Markov process. Moreover, numerous works have considered hybrid approaches that combine the base approaches discussed above [22]–[25].

Although many prior works determine UC schedules by considering uncertainties in different ways, they all require statistical information about the realization of the uncertainties: a probability distribution over the scenarios in SOUC, the forecasted worst-case realizations of the uncertainties in ROUC, the uncertainty intervals in IOUC, and the stochastic model of the uncertainties in Markovian UC. We call such statistical information about the uncertainties a priori information. Due to this dependency, the prior works can be used only if a priori information is available. One way to tackle this issue is to form the a priori information by acquiring more data, which costs both money and time. Then, this data can be processed by appropriate methods [26], [27] to form estimates of the uncertainties. In addition to the money and time costs, this method has the following drawback: the performance of methods that are based on a priori information deteriorates sharply when the a priori information is inaccurate.

1932-4553 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

TABLE I
COMPARISONS WITH RELATED WORKS

The above approaches for UC are also widely adopted in microgrids [7]–[10], [16]–[18]. However, when adopting these approaches in microgrids, acquiring such accurate a priori information may cost too much considering their small-scale power generation. Moreover, due to the limited scale and size of microgrids, the performance deterioration from inaccurate a priori information may become severe [9], [28]. Thus, to overcome these problems, a UC algorithm that does not need any a priori information about the uncertainties is necessary, especially in microgrids. The problems related to the a priori information also arise in smart grids [29]–[32], and are addressed by using learning methods that do not require a priori information [31], [32].

To effectively determine UC schedules even without the a priori information, it is necessary to exploit side information strongly correlated with the uncertainties, such as the past load demands and the weather [33], [34]. In the literature, such side information is also referred to as the context information, and the learning methods that utilize the context information are called contextual learning methods [35]. While contextual learning methods are successfully applied in domains like recommender systems [36] and wireless communications [37], to the best of our knowledge, this paper is the first to attempt to use contextual learning for developing a UC algorithm that does not require any a priori information on the uncertainties.

In our preliminary work [1], we developed a uniformly partitioned contextual learning algorithm for UC (UP-CLUC), where the expected costs of the UC schedules are learned by fusing the past data through uniform partitioning of the context space. The partition of the context space of UP-CLUC is optimized under the condition that the contexts are uniformly distributed over the context space. This might cause significant performance degradation in real-world scenarios where the context arrivals are non-uniform or do not follow any well-defined stochastic process. To address this challenge, in this paper we propose a contextual learning algorithm for the UC problem called an adaptively partitioned contextual learning algorithm for UC (AP-CLUC). The algorithm addresses the challenge by learning the uncertainties in a completely adaptive way, forming the context space partition on-the-fly based on the context arrivals observed so far. In this way, AP-CLUC optimizes its context space partition to trade off the estimation errors and approximation errors that occur during learning. A comparison of our work with the related works is given in Table I.

The contributions of the paper are summarized as follows:

• We propose a new contextual learning UC approach without requiring any a priori information by modeling the UC problem as a sequential decision making problem and developing a contextual learning algorithm for the problem.

• The developed algorithm called AP-CLUC adaptively partitions the context space to effectively learn about the system uncertainties based on past data. Moreover, we propose methods to accelerate the learning speed of AP-CLUC.

• We prove that AP-CLUC achieves regret which is sublinear in time, and hence, is optimal in terms of the long-term average cost.

• We also show that AP-CLUC achieves competitive performance compared to the existing UC algorithms having perfect a priori information, and it achieves better performance than them even when the a priori information has only small errors.

The rest of this paper is organized as follows. Section II provides the system model. In Section III, we formulate a unit commitment problem. In Section IV, we develop an adaptively partitioned contextual learning algorithm for UC, and provide its regret bound. We provide numerical results in Section V. Finally, we conclude in Section VI.

II. SYSTEM MODEL

We consider the UC problem in an isolated microgrid system with $J$ thermal power generation units, where each unit is indexed by $j \in \mathcal{J} = \{1, 2, \ldots, J\}$.¹ The system schedules the on/off status and power outputs of its thermal power generation units, i.e., a UC schedule, over a discrete time horizon, where each time period has a fixed duration, e.g., an hour. Let $t$ be an index of time periods of the time horizon. The set of time periods is denoted by $\mathcal{T} = \{0, 1, 2, \ldots\}$. At the beginning of time period $t$, the system schedules its thermal power generation units for a single time period $t + T_{sc}$, i.e., $T_{sc}$ time periods-ahead UC scheduling, where $T_{sc}$ is the number of time periods necessary to prepare the operation of the thermal power generation units according to the UC schedule.

The on/off status of unit $j$ during time period $t$ is denoted by $u_j(t) \in \{0, 1\}$, where 1 represents the on state and 0 represents the off state. The vector of the on/off states of all thermal power generation units during time period $t$ is denoted by $\mathbf{u}(t) = \{u_j(t)\}_{j \in \mathcal{J}}$. The up time of unit $j$ at time period $t$, which represents the number of consecutive time periods that unit $j$ has been in the on state at the end of time period $t$, is denoted by $T_{j,on}(t)$, and is given by

$$T_{j,on}(t) = \begin{cases} T_{j,on}(t-1) + 1, & \text{if } u_j(t) = 1 \\ 0, & \text{if } u_j(t) = 0. \end{cases}$$

Similarly, the down time of unit $j$ at time period $t$, which represents the number of consecutive time periods that unit $j$ has been in the off state at the end of time period $t$, is denoted by $T_{j,off}(t)$, and it is obtained by

$$T_{j,off}(t) = \begin{cases} T_{j,off}(t-1) + 1, & \text{if } u_j(t) = 0 \\ 0, & \text{if } u_j(t) = 1. \end{cases}$$

We denote the vectors of the $T_{j,on}(t)$'s and $T_{j,off}(t)$'s of all thermal power generation units as $\mathbf{T}_{on}(t) = \{T_{j,on}(t)\}_{j \in \mathcal{J}}$ and $\mathbf{T}_{off}(t) = \{T_{j,off}(t)\}_{j \in \mathcal{J}}$, respectively. When a thermal power generation unit is turned on, it cannot be turned off for a specific number of time periods, i.e., for each unit $j$,

$$1 \le T_{j,on}(t-1) < MUT_j \;\Rightarrow\; u_j(t) = 1, \tag{1}$$

where $MUT_j$ is the minimum up time of unit $j$. Similarly, when it is turned off, it cannot be turned on for the next specific number of time periods, i.e., for each unit $j$,

$$1 \le T_{j,off}(t-1) < MDT_j \;\Rightarrow\; u_j(t) = 0, \tag{2}$$

where $MDT_j$ is the minimum down time of unit $j$.
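The up/down time recursions and constraints (1) and (2) can be illustrated with a minimal Python sketch for a single unit. The function names `update_times` and `allowed_states` and their parameters are hypothetical and introduced only for illustration; they are not part of the paper.

```python
# Illustrative sketch only; function and parameter names are hypothetical.

def update_times(u_t, t_on_prev, t_off_prev):
    """Return (T_on(t), T_off(t)) for one unit given its on/off state u_t."""
    if u_t == 1:
        return t_on_prev + 1, 0
    return 0, t_off_prev + 1


def allowed_states(t_on_prev, t_off_prev, mut, mdt):
    """Return the on/off states that constraints (1) and (2) permit at time t."""
    if 1 <= t_on_prev < mut:    # constraint (1): minimum up time not yet served
        return {1}
    if 1 <= t_off_prev < mdt:   # constraint (2): minimum down time not yet served
        return {0}
    return {0, 1}


# A unit that was off for 5 periods is switched on; with MUT = 3 it must stay on.
t_on, t_off = update_times(1, 0, 5)
print(t_on, t_off)                          # prints: 1 0
print(allowed_states(t_on, t_off, 3, 2))    # prints: {1}
```

Since a unit is either on or off, at most one of the two counters is positive at any time, so the two branches in `allowed_states` cannot conflict.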

The power output of unit $j$ during time period $t$ is denoted by $p_j(t)$, and it is bounded by $p_j(t) \in [p_j^{min}, p_j^{max}]$, where $p_j^{min}$ and $p_j^{max}$ are the minimum and maximum power outputs of unit $j$, respectively. The vector of the power outputs of all thermal power generation units during time period $t$ is denoted as $\mathbf{p}_{ther}(t) = \{p_j(t)\}_{j \in \mathcal{J}}$. Due to the ramp rate limit, the power output of unit $j$ at time period $t$ should satisfy the following constraint:

$$p_j(t-1) - RR_j \le p_j(t) \le p_j(t-1) + RR_j, \tag{3}$$

where $RR_j$ is the ramp rate limit of unit $j$. Moreover, we consider a spinning reserve requirement in the system. We assume that the spinning reserve is not used for the fluctuation of the load demand, but for more critical situations such as the outage of thermal units. Then, the spinning reserve requirement should be guaranteed as

$$\sum_{j \in \mathcal{J}} u_j(t) \left( p_j^{max} - p_j(t) \right) \ge SR, \tag{4}$$

where $SR$ is the spinning reserve requirement.
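As a rough illustration of how constraints (3) and (4) could be checked for a candidate schedule, consider the following Python sketch. The helper names `satisfies_ramp` and `satisfies_reserve` and the numerical values are assumptions made for this example only.

```python
# Illustrative sketch only; helper names and numbers are hypothetical.

def satisfies_ramp(p_prev, p_now, rr):
    """Constraint (3): each unit's output changes by at most its ramp rate limit."""
    return all(pp - r <= pn <= pp + r for pp, pn, r in zip(p_prev, p_now, rr))


def satisfies_reserve(u, p_now, p_max, sr):
    """Constraint (4): committed units keep at least sr of unused capacity."""
    reserve = sum(uj * (pm - pn) for uj, pn, pm in zip(u, p_now, p_max))
    return reserve >= sr


# Two committed units; the first ramps up by 10, leaving 40 + 30 = 70 in reserve.
u = [1, 1]
p_prev, p_now = [50.0, 30.0], [60.0, 30.0]
print(satisfies_ramp(p_prev, p_now, rr=[15.0, 10.0]))              # prints: True
print(satisfies_reserve(u, p_now, p_max=[100.0, 60.0], sr=50.0))   # prints: True
```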

In our system model, we use the current time, the past weather conditions, and the past load demands as the context information which the system considers. It is worth noting that any other related information can be used as the context information. To model the current time, we introduce a set of time indices for a circular time duration, e.g., a day, a month, or a year, $\mathcal{H} = \{0, 1, \ldots, H-1\}$, where each index represents an actual time in the time duration. Then, each time period $t$ is mapped to the corresponding current time index $h(t) \in \mathcal{H}$ as $h(t) = \mathrm{mod}(t, H)$. Let $w(t)$ be the weather condition which is observed by the system at the beginning of time period $t$. The set of weather conditions is denoted by $\mathcal{W}$, which can be defined by using weather information components such as wind speed, wind direction, temperature, sky cover, and precipitation potential [26], [27]. When defining it, it is necessary to consider the location of the system and the types of renewable sources of the power generation units, such as wind and solar. For example, for a wind farm, it can be defined as $\mathcal{W} = \mathcal{W}_{windspd} \times \mathcal{W}_{windir}$, where $\mathcal{W}_{windspd}$ and $\mathcal{W}_{windir}$ are the sets of wind speeds and wind directions, respectively. Note that both continuous and discrete sets can be used for the weather conditions. The load demand during time period $t$ is denoted by $M(t)$ and is assumed to lie in the bounded interval $\mathcal{M} = [M_{min}, M_{max}]$, where $M_{min}$ and $M_{max}$ are the minimum and maximum load demands, respectively. The sum of power outputs of all renewable power generation units during time period $t$ is denoted by $p_{re}(t) \le p_{re}^{max}$, where $p_{re}^{max}$ is the maximum renewable power output. At the end of each time period, the realizations of the random variables that represent the uncertain quantities, i.e., the load demand and the power outputs of the renewable power generation units, are observed by the system. We assume that the distribution of the uncertain quantities at each time period depends on the context information observed at the beginning of that time period.

Due to the system uncertainties, the load demand could be shed or the generated power could be curtailed in our system model. Thus, to ensure the power balance in the system, we define load shedding and power curtailment variables, which are determined according to the UC schedule and the realization of the uncertainties. The amount of load shedding during time period $t$, $p_{sh}(t)$, is given by

$$p_{sh}(t) = \Big[ M(t) - \sum_{j \in \mathcal{J}} p_j(t) - p_{re}(t) \Big]^+,$$

where $[\cdot]^+ = \max[0, \cdot]$. Similarly, the amount of power curtailment during time period $t$, $p_{cu}(t)$, is given by

$$p_{cu}(t) = \Big[ \sum_{j \in \mathcal{J}} p_j(t) + p_{re}(t) - M(t) \Big]^+.$$

Then, the power balance equation during time period $t$ follows from these two definitions and is given by

$$\sum_{j \in \mathcal{J}} p_j(t) + p_{re}(t) + p_{sh}(t) - p_{cu}(t) = M(t).$$
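A short Python sketch, under the definitions of $p_{sh}(t)$ and $p_{cu}(t)$ above, shows how load shedding and curtailment follow from a schedule and a realization of the uncertainties. The function name `shed_and_curtail` and the numbers are hypothetical.

```python
# Illustrative sketch only; the function name and numbers are hypothetical.

def shed_and_curtail(load, p_thermal, p_renewable):
    """Return (p_sh, p_cu) for one time period from the realized load and outputs."""
    supply = sum(p_thermal) + p_renewable
    p_sh = max(0.0, load - supply)    # unmet demand is shed
    p_cu = max(0.0, supply - load)    # surplus generation is curtailed
    return p_sh, p_cu


load, p_thermal, p_re = 100.0, [40.0, 30.0], 20.0
p_sh, p_cu = shed_and_curtail(load, p_thermal, p_re)
print(p_sh, p_cu)                                       # prints: 10.0 0.0
# Supply plus shedding minus curtailment equals the load demand.
print(sum(p_thermal) + p_re + p_sh - p_cu == load)      # prints: True
```

At most one of the two quantities is positive in any period, which is why the balance holds by construction.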

The total operating cost of the system during time period $t$, $C_{tot}(t)$, is obtained as

$$C_{tot}(t) = \sum_{j \in \mathcal{J}} \left( C_{j,fu}(t) + C_{j,su}(t) \right) + C_{sh}(t) + C_{cu}(t), \tag{5}$$

where $C_{j,fu}(t)$ is the fuel cost of unit $j$ that supplies power $p_j(t)$ during time period $t$, $C_{j,su}(t)$ is the start-up cost of unit $j$ at time period $t$, $C_{sh}(t)$ is the load shedding cost during time period $t$, and $C_{cu}(t)$ is the power curtailment cost during time period $t$. The fuel cost can be modeled as a non-linear function of the power output [38] as

$$C_{j,fu}(t) = C_{j,fu}^{(0)} \cdot u_j(t) + C_{j,fu}^{(1)} \cdot p_j(t) + C_{j,fu}^{(2)} \cdot p_j(t)^2, \tag{6}$$

where $C_{j,fu}^{(0)}$, $C_{j,fu}^{(1)}$, and $C_{j,fu}^{(2)}$ are the cost coefficients of unit $j$. The start-up cost can be modeled as follows [38], [39]:

$$C_{j,su}(t) = CM_j + CSC_j \left( 1 - e^{-T_{j,off}(t-1)/CST_j} \right), \tag{7}$$

where $CM_j$ is the start-up cost and maintenance cost of unit $j$, $CSC_j$ is the cold start-up cost of unit $j$, and $CST_j$ is the cold start-up time of unit $j$. The load shedding cost during time period $t$, $C_{sh}(t)$, is given by

$$C_{sh}(t) = LSP \cdot p_{sh}(t), \tag{8}$$

where $LSP$ is the load shedding price. The power curtailment cost during time period $t$, $C_{cu}(t)$, is given by

$$C_{cu}(t) = PCP \cdot p_{cu}(t), \tag{9}$$

where $PCP$ is the power curtailment price.
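The cost terms in (5)–(9) can be evaluated directly once the schedule and the realized uncertainties are known. The Python sketch below simply evaluates the formulas as written; the function names, the coefficient values, and the decision of when a start-up cost is charged are assumptions of this example, not details taken from the paper.

```python
import math

# Illustrative sketch only; names and coefficient values are hypothetical.

def fuel_cost(u, p, c0, c1, c2):
    """Fuel cost (6) of one unit with on/off state u and output p."""
    return c0 * u + c1 * p + c2 * p ** 2


def startup_cost(t_off_prev, cm, csc, cst):
    """Start-up cost (7) of one unit given its down time T_off(t - 1)."""
    return cm + csc * (1.0 - math.exp(-t_off_prev / cst))


def total_cost(fuel, startup, p_sh, p_cu, lsp, pcp):
    """Total operating cost (5) with shedding (8) and curtailment (9) costs."""
    return sum(fuel) + sum(startup) + lsp * p_sh + pcp * p_cu


fuel = [fuel_cost(1, 10.0, c0=5.0, c1=2.0, c2=0.5)]      # 5 + 20 + 50 = 75
start = [startup_cost(0, cm=10.0, csc=20.0, cst=4.0)]    # exp(0) = 1, so cost is 10
print(total_cost(fuel, start, p_sh=2.0, p_cu=0.0, lsp=10.0, pcp=2.0))  # prints: 105.0
```

The exponential term makes the start-up cost grow with the down time and saturate at $CM_j + CSC_j$, which is what later motivates bounding the down time.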

III. UNIT COMMITMENT PROBLEM

The context that is observed at the beginning of time period $t$ is defined by $x(t) := \{h(t), \mathbf{M}(t, T_M), \mathbf{w}(t, T_W)\}$, where $\mathbf{M}(t, T_M) = \{M(t-1), \ldots, M(t-T_M)\}$ is the vector of load demands of the past $T_M$ time periods and $\mathbf{w}(t, T_W) = \{w(t-1), \ldots, w(t-T_W)\}$ is the vector of weather conditions of the past $T_W$ time periods. The context space is defined by $\mathcal{X} = \mathcal{H} \times \mathcal{M}^{T_M} \times \mathcal{W}^{T_W}$. We denote the dimension of the context space as $D_X$.

Remark 1: We introduce a projection function $\phi$ which projects the context $x$ into a low dimensional space. For example, weighted averaging, principal component analysis (PCA), or mutual information-based dimensionality reduction [40] can be used. Note that the projection function helps our algorithm learn faster if necessary.

In addition to the context $x$, the down time of units, $\mathbf{T}_{off}$, should be considered when choosing the action since the start-up cost in (7) depends on it. For the sake of analysis, we define the bounded down time of unit $j$ at time period $t$, $\tilde{T}_{j,off}(t)$, bounded by $PDT_j$, i.e., $\tilde{T}_{j,off}(t) \in \tilde{\mathcal{T}}_{j,off} = \{0, 1, \ldots, PDT_j\}$, where $PDT_j$ is the maximum bounded down time of unit $j$. Note that since the start-up cost becomes almost constant for large down times, it is enough to consider down times in a bounded region. Then, the bounded down time space is defined by $\tilde{\mathcal{T}}_{off} = \prod_{j \in \mathcal{J}} \tilde{\mathcal{T}}_{j,off}$. We denote the vector of the $\tilde{T}_{j,off}(t)$'s of all units as $\tilde{\mathbf{T}}_{off}(t) = \{\tilde{T}_{j,off}(t)\}_{j \in \mathcal{J}}$. Then, we define an extended context at time period $t$ by $z(t) := \{x(t), \tilde{\mathbf{T}}_{off}(t + T_{sc} - 1)\}$, and define the extended context space by $\mathcal{Z} = \mathcal{X} \times \tilde{\mathcal{T}}_{off}$.

We now define the state for units at the beginning of time period $t$ as $s(t) := \{\mathbf{u}(t + T_{sc} - 1), \mathbf{p}_{ther}(t + T_{sc} - 1), \mathbf{T}_{on}(t + T_{sc} - 1), \mathbf{T}_{off}(t + T_{sc} - 1)\}$ and let $\mathcal{S}$ denote the state space. At the beginning of each time period $t$, an action, which is denoted by $a(t) = \{\mathbf{u}(t + T_{sc}), \mathbf{p}_{ther}(t + T_{sc})\}$, is chosen from a subset of the action space, which is defined as $\mathcal{A} = \{0, 1\}^J \times \prod_{j \in \mathcal{J}} [p_j^{min}, p_j^{max}]$. The set of available actions at time period $t$ is constrained by the state (unit status) $s(t)$ at time period $t$. Thus, the set of feasible actions at time period $t$ with the unit status $s(t)$, $\mathcal{A}(s(t))$, is given as

$$\mathcal{A}(s(t)) = \{\{\mathbf{u}(t + T_{sc}), \mathbf{p}_{ther}(t + T_{sc})\} \in \mathcal{A} \mid \text{(1), (2), (3) and (4) hold}\}.$$

We denote a UC policy which depends on the extended context $z(t)$ and the state $s(t)$ as $\pi : \mathcal{Z} \times \mathcal{S} \to \mathcal{A}$. For a given extended context $z(t)$ and unit status $s(t)$, the UC policy $\pi$ chooses the action denoted by $\pi_{s(t)}(t, z(t))$ from the set of feasible actions, $\mathcal{A}(s(t))$. For convenience, we denote the action for time period $t$, $\pi_{s(t)}(t, z(t))$, as $\pi(t)$. Then, the UC problem is formally defined by the following equation:

$$\operatorname*{argmin}_{\pi : \mathcal{Z} \times \mathcal{S} \to \mathcal{A}} \; \mathbb{E}\left[ \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T} C_{tot}(\pi(t), t) \right], \tag{10}$$

where $C_{tot}(\pi(t), t)$ is the total operating cost during time period $t$ given action $\pi(t)$. Note that unlike the existing UC models, the UC problem in (10) optimizes the UC over the infinite horizon.

IV. ADAPTIVELY PARTITIONED CONTEXTUAL LEARNING

In this section, we introduce an online learning algorithm called adaptively partitioned contextual learning algorithm for UC (AP-CLUC), which solves the UC problem in (10) without requiring any a priori information. We describe AP-CLUC and provide a performance bound for it.

A. How AP-CLUC Learns Effectively?

Basically, contextual learning algorithms learn the effects of the system uncertainties on the costs related to the context and the actions, which are much easier to learn than the entire probability distribution of the system uncertainties. Specifically, to learn the effects, the algorithms estimate the cost of each action with a given context arrival, i.e., an observed context, using the past observed costs of the chosen action with the given context. Then, they learn the best action with the given context arrival by using the estimates of the costs, instead of using a priori information. Hence, they do not require any a priori information. However, when the context space is an uncountable set, the algorithms cannot learn the best action for all possible context arrivals and must approximate context arrivals by merging them. Thus, in the algorithms, an approximation error from merging context arrivals occurs in addition to an estimation error on a cost from limited observations [41].

UP-CLUC in our preliminary work [1] and AP-CLUC partition the context space into multiple sets. Then, they approximate context arrivals by merging the context arrivals in each set. For such algorithms with a partition-based approximation, the approximation error of each set mainly depends on the size of the set, since a larger set implies that the context arrivals in a larger region are merged. On the other hand, the estimation error of each set depends on the number of context arrivals in the set, since a larger number of context arrivals implies a more accurate estimation. Thus, as shown in Fig. 1, as the size of each set becomes small, the approximation error of each set decreases, but at the same time, the estimation error of each set increases since the number of context arrivals in the set decreases in general. On the other hand, as the size of each set becomes large, the approximation error increases while the estimation error decreases. This results in a tradeoff between the estimation error and the approximation error. Thus, for effective learning, it is important to address the tradeoff considering the context arrivals.

Fig. 1. Illustration of a tradeoff between the estimation error and the approximation error according to the size of the sets in the partition of the context space. In the partition, each dot represents a context arrival and each square represents a set.

Fig. 2. Illustration of uniform partitioning and adaptive partitioning. In the adaptive partitioning, the filled dots denote the past arrived contexts, and the unfilled dots denote new context arrivals. As the contexts arrive, the context space is adaptively partitioned more precisely.

In UP-CLUC, the context space is uniformly partitioned before running, and the size of the sets in the context space is determined by a system parameter. This fixes the approximation error of the algorithm, and thus, UP-CLUC cannot control the tradeoff adaptively according to the context arrivals. On the other hand, when AP-CLUC learns, it optimizes its partition to address the tradeoff in an online manner by partitioning the context space on-the-fly according to the context arrivals. In AP-CLUC, for the sets in regions of the context space with a large number of context arrivals, partitioning the sets into smaller ones is favorable for reducing the total error, since the approximation error is reduced due to the smaller sets and the estimation error will soon become small due to the large number of context arrivals. Thus, as shown in Fig. 2, AP-CLUC partitions such regions into smaller sets. This results in a partition where smaller sets are concentrated around the regions of the context space with a large number of context arrivals and larger sets are scattered over the remaining parts of the context space with a small number of context arrivals. In the literature, this phenomenon is called contextual zooming [41]. Owing to such an optimization, AP-CLUC outperforms UP-CLUC regardless of the system parameter of UP-CLUC, as will be shown in the numerical results section.

For more effective learning, we can generate virtual experiences by assuming that the unselected actions had been selected, which is possible owing to the nature of the UC problem. By using such virtual experiences, we can accelerate the learning speed of AP-CLUC. In addition, when AP-CLUC partitions a region into smaller sets, the cost of learning the uncertainties anew in the smaller sets can be reduced by reusing the past experiences learned in the region. The details of such methods for effective learning will be described in the following subsection, and their effects will be shown in the numerical results section.

B. Algorithm Description of AP-CLUC

The pseudocode of AP-CLUC is given in Algorithm 1. For simplicity of the description, we assume that $T_{sc} = 0$ and normalize the extended context² space to be $\mathcal{Z} = [0, 1]^D$, where $D$ is the dimension of the context space, i.e., $D_X + J$. It is worth noting that normalizing the context is used only for the performance analyses, and the performance bound of AP-CLUC can always be achieved by a proper scaling of the context. Also, the assumption does not affect the performance bound. How the partitioning of the context space in AP-CLUC differs from that in UP-CLUC is illustrated in Fig. 2 for $D = 2$. Unlike UP-CLUC, i.e., Algorithm 1 in [1], which uniformly partitions the context space at the beginning of the algorithm, in AP-CLUC the context space is partitioned into smaller sets on-the-fly according to the context arrivals. This adaptive partitioning enables AP-CLUC to learn more precisely on the frequent context information. In AP-CLUC, every partition of the context space is composed of hypercubes with side lengths belonging to the set $\{2^0, 2^{-1}, 2^{-2}, \ldots\}$, and a $D$-dimensional hypercube which has sides of length $2^{-l}$ is called a level $l$ hypercube. The partition at a time period is composed of a set of disjoint hypercubes that cover the context space. This set of hypercubes is also referred to as the active hypercubes at that time period. The set of active hypercubes, i.e., the context partition, is denoted by $\mathcal{P}_Z$.

The adaptive partitioning mechanism of AP-CLUC operates as follows. The initial context partition for AP-CLUC is given by $\mathcal{P}_Z = \{[0, 1]^D\}$ as given in line 1 of Algorithm 1, which is the entire context space (i.e., $l = 0$). Then, according to the context arrivals, this partition is updated by the mechanism described below. Let $l(p_z)$ be the level of hypercube $p_z$ and $N(p_z)$ be the number of context arrivals to hypercube $p_z$ after $p_z$ was activated. An active hypercube $p_z$ is deactivated if $N(p_z) \ge 2^{\rho l(p_z)}$ as in line 13 of Algorithm 1, where $\rho > 0$ is a parameter of AP-CLUC. When $p_z$ is deactivated, $2^D$ level $l(p_z) + 1$ child hypercubes formed by partitioning hypercube $p_z$ become active, and the context partition is updated as $\mathcal{P}_Z \cup G_{p_z}^{l(p_z)+1} \setminus \{p_z\}$ as in line 16 of Algorithm 1, where $G_{p_z}^{l(p_z)+1}$ is the set of $2^D$ level $l(p_z) + 1$ child hypercubes created from the hypercube $p_z$. This adaptive partitioning is illustrated in Fig. 2 for $D = 2$ and $\rho = 1$.
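The splitting rule can be sketched in Python as follows. This is a minimal illustration under the stated rule $N(p_z) \ge 2^{\rho l(p_z)}$; the class `Hypercube`, the function `record_arrival`, and the list-based partition are hypothetical choices made for this example, and the cost estimates kept per hypercube are omitted.

```python
import itertools

# Illustrative sketch only; class and function names are hypothetical.

class Hypercube:
    """Axis-aligned hypercube in [0, 1]^D with side length 2**(-level)."""

    def __init__(self, corner, level):
        self.corner = tuple(corner)   # lower corner of the hypercube
        self.level = level
        self.arrivals = 0             # N(p_z): arrivals since activation

    def side(self):
        return 2.0 ** (-self.level)

    def contains(self, z):
        s = self.side()
        # Half-open cells, closed at the upper boundary of the unit cube.
        return all(c <= zi < c + s or (zi == 1.0 and c + s == 1.0)
                   for c, zi in zip(self.corner, z))

    def children(self):
        """Return the 2**D child hypercubes of the next level."""
        half = self.side() / 2.0
        dim = len(self.corner)
        return [Hypercube([c + o * half for c, o in zip(self.corner, offs)],
                          self.level + 1)
                for offs in itertools.product((0, 1), repeat=dim)]


def record_arrival(partition, z, rho):
    """Count the arrival z and split its hypercube once N(p) >= 2**(rho * l(p))."""
    cube = next(p for p in partition if p.contains(z))
    cube.arrivals += 1
    if cube.arrivals >= 2.0 ** (rho * cube.level):
        partition.remove(cube)
        partition.extend(cube.children())
    return cube


partition = [Hypercube((0.0, 0.0), 0)]      # initial partition {[0, 1]^2}
for _ in range(3):
    record_arrival(partition, (0.1, 0.1), rho=1.0)
print(len(partition))                        # prints: 7
```

In this run the level 0 hypercube splits on the first arrival (threshold $2^0 = 1$) and the level 1 hypercube containing the point splits after two further arrivals, leaving three level 1 and four level 2 hypercubes.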

²In algorithm description, we omit “extended” from the extended context for

In adaptive partitioning, the deactivation process of a hypercube depends on its level. In addition, the action space is also adaptively discretized according to the level of the hypercube that the context belongs to. To this end, the slicing parameter for the power output, which is used to discretize the power output, is determined by the level of the hypercube $l$, and hence, is denoted by $m_A(l)$. Then, the power output of unit $j$ is uniformly discretized using $m_A(l)$. The set of the discretized power outputs of unit $j$ for a level $l$ hypercube is denoted by $\bar{\mathcal{P}}_j(l) = \{p_j^{min} + p_j^{m_A(l)}, p_j^{min} + 2 p_j^{m_A(l)}, \ldots, p_j^{max}\}$, where $p_j^{m_A(l)} = (p_j^{max} - p_j^{min}) / m_A(l)$. The power output of unit $j$ during time period $t$ is denoted by $\bar{p}_j(t) \in \bar{\mathcal{P}}_j(l(p_z(z(t))))$, where $l(p_z(z))$ represents the level of the hypercube to which context $z$ belongs. The vector of the discretized power outputs of all units during time period $t$ is denoted by $\bar{\mathbf{p}}_{ther}(t) = \{\bar{p}_j(t)\}_{\forall j \in \mathcal{J}}$ and the discretized action space for a level $l$ hypercube is given by $\bar{\mathcal{A}}(l) = \{0, 1\}^J \times \prod_{j \in \mathcal{J}} \bar{\mathcal{P}}_j(l)$. Using these, the set of discretized available actions for unit status $s$ and a level $l$ hypercube is given as

$$\bar{\mathcal{A}}(s, l) := \{\{\mathbf{u}(t), \bar{\mathbf{p}}_{ther}(t)\} \mid \text{(1), (2), (3) and (4) hold}\}.$$

Remark 2: Note that in the early stages of running AP-CLUC, there might exist no available action satisfying the ramp rate limit constraint in (3) when the discretization of power outputs is too coarse, i.e., there are only a few discretized power outputs. This problem can be resolved by appropriately setting $m_A(l)$ of low levels to be large enough to satisfy the ramp rate limit constraint.
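The uniform discretization of one unit's power output can be sketched as below. The function name `discretized_outputs` is hypothetical, and the slicing parameter is passed in directly as `m` because the form of $m_A(l)$ is not specified in this section.

```python
# Illustrative sketch only; the function name is hypothetical and the slicing
# parameter m stands for m_A(l), whose dependence on l is not specified here.

def discretized_outputs(p_min, p_max, m):
    """Return the m grid points p_min + k * step, k = 1..m, ending at p_max."""
    step = (p_max - p_min) / m
    return [p_min + k * step for k in range(1, m + 1)]


print(discretized_outputs(20.0, 100.0, 4))   # prints: [40.0, 60.0, 80.0, 100.0]
```

With a coarse grid such as this one (step 20), a unit whose ramp rate limit is below the step and whose previous output lies on the grid can only keep its output unchanged, which illustrates the situation Remark 2 warns about.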

We denote the number of times that action $a$ is chosen when the context is in active hypercube $p_z$ as $N(a, p_z)$. We also define the estimated cost of action $a$ on set $p_z$, $\hat{c}(a, p_z)$, which represents the sample mean of the total operating cost observed from action $a$ on active hypercube $p_z$. At the beginning of each time period $t$, the system observes the context $z(t)$ and unit status $s(t)$. Then, it finds the corresponding active hypercube $p_z(z(t))$ that the current context belongs to and calculates the set of available actions $\bar{\mathcal{A}}(s(t), l(p_z(z(t))))$ given the unit status. Then, it chooses the action with the lowest estimated total operating cost given as

$$\hat{\pi}(t) \in \operatorname*{argmin}_{a \in \bar{\mathcal{A}}(s(t),\, l(p_z(z(t))))} \hat{c}(a, p_z(z(t))).$$

During the time period $t$, the system operates its thermal power generation units according to the chosen action $\hat{\pi}(t)$. At the end of the time period, the system observes the realization of the uncertainties with which the total operating cost during the time period, $C_{tot}(t)$, is obtained as in (5). Then, the system updates the estimated cost $\hat{c}(\hat{\pi}(t), p_z)$ by using $C_{tot}(t)$ as in line 8 of Algorithm 1.
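The selection and update steps for a single active hypercube can be sketched in Python as follows. The names `choose_action` and `update_estimate`, the dictionary-based bookkeeping, and the string action labels are assumptions of this example. One deliberate deviation is labeled in the code: the first observation overwrites the infinite initial estimate, which avoids evaluating $\infty \cdot 0$ in the sample-mean formula.

```python
import math

# Illustrative sketch only; names and data structures are hypothetical.

def choose_action(est_cost, feasible):
    """Pick the feasible action with the lowest estimated cost."""
    return min(feasible, key=lambda a: est_cost[a])


def update_estimate(est_cost, count, action, observed_cost):
    """Sample-mean update of the estimated cost of one action."""
    n = count[action]
    if n == 0:
        # Assumption of this sketch: the first observation replaces the
        # infinite initial value instead of computing inf * 0.
        est_cost[action] = observed_cost
    else:
        est_cost[action] = (est_cost[action] * n + observed_cost) / (n + 1)
    count[action] = n + 1


actions = ["a1", "a2"]
est_cost = {a: math.inf for a in actions}
count = {a: 0 for a in actions}
update_estimate(est_cost, count, "a1", 10.0)
update_estimate(est_cost, count, "a1", 20.0)
update_estimate(est_cost, count, "a2", 12.0)
print(est_cost["a1"], choose_action(est_cost, actions))   # prints: 15.0 a2
```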

The selected action does not affect the distribution of the un-certain events that happen in the current time period. This nature of the UC problem allows us to calculate the total operating cost for the actions that are not selected, i.e.,a ∈ ¯A(l(pz))\{ˆπ(t)}, from the observed cost. Note that for each unselected action, the fuel cost in (6) and the start-up cost in (7) can be simply calculated. The load shedding cost in (8) and power curtailment cost in (9) can be also calculated by using the realized load

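Algorithm 1: AP-CLUC.

1: $\mathcal{P} = \{[0, 1]^D\}$
2: $\hat{c}(a, [0, 1]^D) \leftarrow \infty$ and $N(a, [0, 1]^D) \leftarrow 0$, $\forall a \in \bar{\mathcal{A}}(0)$
3: $N([0, 1]^D) \leftarrow 0$
4: while TRUE do
5:     Observe context $z$ and unit status $s$
6:     $p \leftarrow p_z(z)$, $a \leftarrow \arg\min_{a' \in \bar{\mathcal{A}}(s, l(p))} \hat{c}(a', p)$
7:     Operate units with $a$ and observe $C_{tot}$
8:     $\hat{c}(a, p) \leftarrow \frac{\hat{c}(a, p) N(a, p) + C_{tot}}{N(a, p) + 1}$
9:     Virtually observe $C_{tot}(a')$, $\forall a' \in \bar{\mathcal{A}}(l(p)) \setminus \{a\}$
10:    $\hat{c}(a', p) \leftarrow \frac{\hat{c}(a', p) N(a', p) + C_{tot}(a')}{N(a', p) + 1}$, $\forall a' \in \bar{\mathcal{A}}(l(p)) \setminus \{a\}$
11:    $N(a', p) \leftarrow N(a', p) + 1$, $\forall a' \in \bar{\mathcal{A}}(l(p))$
12:    $N(p) \leftarrow N(p) + 1$
13:    if $N(p) \ge 2^{\rho l(p)}$ then
14:        Create $2^D$ level $l(p) + 1$ child hypercubes, $G_p^{l(p)+1}$
15:        Run INIT($G_p^{l(p)+1}$, $p$)
16:        $\mathcal{P} \leftarrow \mathcal{P} \cup G_p^{l(p)+1} \setminus \{p\}$
17:    end if
18: end while
19: procedure INIT($B$, $p$)
20:    for $p' \in B$ do
21:        $\hat{c}(a, p') \leftarrow \infty$ and $N(a, p') \leftarrow 0$, $\forall a \in \bar{\mathcal{A}}(l(p'))$
22:        $N(p') \leftarrow 0$
23:    end for
24: end procedure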

demand and renewable power outputs due to the nature of the UC problem. Thus, by using the calculated costs for the unselected actions, i.e., virtually observed costs, AP-CLUC performs virtual updates of the estimated costs of unselected actions in order to accelerate the learning as given in lines 10–11 of Algorithm 1. Therefore, the number of times that action $a$ is (virtually) chosen when the context is in the set $p_z$, i.e., $N(a, p_z)$, is updated for all $a \in \bar{\mathcal{A}}(l(p_z))$.
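The loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the class name `APCLUCSketch`, the helper `run_demo`, the fixed action list (the paper's action set depends on the unit status and the hypercube level), and the toy cost `abs(a - z[0])` are all assumptions made for this example. Estimates start at infinity as in Algorithm 1, and the first observation simply overwrites them.

```python
import math
import random
from itertools import product


class APCLUCSketch:
    """Toy AP-CLUC loop: adaptive context partition plus virtual updates."""

    def __init__(self, dim, actions, rho):
        self.dim = dim                # context dimension D
        self.actions = list(actions)  # fixed discrete action set (assumption)
        self.rho = rho                # splitting parameter rho
        root = (0, (0,) * dim)        # cube = (level, integer index per dimension)
        self.active = {root: self._fresh()}

    def _fresh(self):
        # est[a]: sample-mean cost, n[a]: (virtual) selection count.
        return {"est": {a: math.inf for a in self.actions},
                "n": {a: 0 for a in self.actions},
                "arrivals": 0}

    def find_cube(self, z):
        # Walk from the root to finer levels until an active cube contains z.
        level = 0
        while True:
            idx = tuple(min(int(x * 2 ** level), 2 ** level - 1) for x in z)
            if (level, idx) in self.active:
                return (level, idx)
            level += 1

    def choose(self, z):
        cube = self.find_cube(z)
        est = self.active[cube]["est"]
        return cube, min(self.actions, key=lambda a: est[a])

    def update(self, cube, cost_of):
        # cost_of(a) gives the realized cost of ANY action, so selected and
        # unselected actions are updated alike (virtual updates).
        stats = self.active[cube]
        for a in self.actions:
            n = stats["n"][a]
            prev = stats["est"][a] if n > 0 else 0.0  # ignore the inf placeholder
            stats["est"][a] = (prev * n + cost_of(a)) / (n + 1)
            stats["n"][a] = n + 1
        stats["arrivals"] += 1
        level, idx = cube
        if stats["arrivals"] >= 2 ** (self.rho * level):
            # Replace the cube by its 2^D children, initialised as in INIT.
            del self.active[cube]
            for offs in product((0, 1), repeat=self.dim):
                child = (level + 1, tuple(2 * i + o for i, o in zip(idx, offs)))
                self.active[child] = self._fresh()


def run_demo(steps=200, seed=0):
    rng = random.Random(seed)
    learner = APCLUCSketch(dim=1, actions=[0.0, 0.5, 1.0], rho=2)
    for _ in range(steps):
        z = (rng.random(),)
        cube, _action = learner.choose(z)
        learner.update(cube, lambda a: abs(a - z[0]))  # toy cost (assumption)
    return learner
```

Because a cube is replaced by its children as soon as its arrival count reaches $2^{\rho l}$, the active cubes always form a partition of the context space, which is the property `find_cube` relies on.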

Remark 3: Note that the virtual update allows AP-CLUC to accelerate the learning speed, but it also causes an increase in the computational complexity of AP-CLUC due to the calculation of the costs for the unselected actions. We can control the tradeoff between the learning speed and the computational complexity by performing the virtual updates for a part of the unselected actions, not for all unselected actions. We investigate the computational complexity according to the number of the virtually updated unselected actions in Section IV-D, and show the learning speed in the numerical results.

Moreover, to help AP-CLUC learn faster, when child hypercubes become active, we can reuse the a priori information provided from their parent hypercube. To this end, we initialize the parameters of activated hypercubes using Algorithm 2 instead of using the initialization procedure of AP-CLUC given in lines 21–22 of Algorithm 1. We call this experience reuse, and it helps AP-CLUC learn faster by providing guidance in the early stages of learning in the activated hypercubes. This improvement is shown in the numerical results section.


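Algorithm 2: Experience Reuse.

1: procedure EXPREUSE($B$, $p$)
2:    for $p' \in B$ do
3:        $N(a, p') \leftarrow N(a, p)/2$, $\forall a \in \bar{\mathcal{A}}(l(p'))$
4:        $N(p') \leftarrow N(p)/2$
5:        $\hat{c}(a, p') \leftarrow \hat{c}(\tilde{a}, p)$, $\forall a \in \bar{\mathcal{A}}(l(p'))$, where $\tilde{a}$ is the action in $\bar{\mathcal{A}}(l(p))$ which is nearest to $a$.
6:    end for
7: end procedure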
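As a rough illustration of Algorithm 2, the following Python sketch initialises one child hypercube from its parent's statistics. The function name `experience_reuse`, the dictionary layout, the use of scalar actions, and taking the halved count from the nearest parent action are assumptions made for this example only.

```python
def experience_reuse(parent_est, parent_n, parent_arrivals, child_actions):
    """Initialise a child hypercube from its parent's statistics.

    parent_est / parent_n map each parent action to its estimated cost and
    count; child_actions is the (possibly finer) action set of the child.
    """
    parent_actions = list(parent_est)
    child_est, child_n = {}, {}
    for a in child_actions:
        # Nearest parent action; actions are scalars here for simplicity.
        nearest = min(parent_actions, key=lambda b: abs(b - a))
        child_est[a] = parent_est[nearest]   # copy the parent's estimate
        child_n[a] = parent_n[nearest] / 2   # halve the inherited count
    return child_est, child_n, parent_arrivals / 2
```

Halving the counters keeps the inherited estimates as a starting point while letting the observations made inside the child hypercube outweigh them quickly.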

C. Regret Bound for AP-CLUC

In this subsection, to evaluate the performance of AP-CLUC in Algorithm 1, we first define the learning regret, and then provide the regret bound for AP-CLUC. For simplicity of the presentation, we normalize the total operating cost such that it lies in $[0, 1]$. Let the expected operating cost of action $a \in \mathcal{A}$ during a time period with a given context $z \in \mathcal{Z}$ be

$$c(a, z) := \mathbb{E}_{\hat{M}(x),\, \hat{p}_{re}(x)}\left[C_{tot}(a, t)\right],$$

where $\hat{M}(x)$ and $\hat{p}_{re}(x)$ are the random variables for the load demand and the renewable power output, respectively, during the time period where the context $x$ is given. The joint distribution of $\hat{M}(x)$ and $\hat{p}_{re}(x)$ is given by $F_x$. Next, we show that the expected total operating costs are similar for similar contexts, which is widely used as similarity information [42], [43]. We formalize this as a Lipschitz condition, and we prove that the expected cost of each action also satisfies the Lipschitz condition in the following lemma.

Lemma 1: There exists $L > 0$ such that for all $z, z' \in \mathcal{Z}$ and $a \in \mathcal{A}$,

$$|c(a, z) - c(a, z')| \le L \|z - z'\|,$$

and for all $a, a' \in \mathcal{A}$ and $z \in \mathcal{Z}$, where the on/off statuses are the same but the power outputs might be different,

$$|c(a, z) - c(a', z)| \le L \|a - a'\|.$$

Proof: See Appendix A.

We define the regret with respect to a complete information benchmark, which myopically selects the best available action for the current time period given perfect knowledge of the distribution $F_x$, i.e., the impact on the future costs is not considered when selecting the action. It is worth noting that from the viewpoint of the existing UC approaches, $F_x$ can be interpreted as a target distribution which their a priori information is intended to provide. Given context $z$ and unit status $s$, this benchmark is defined as

$$\pi_s^*(z) := \arg\min_{a \in \mathcal{A}(s)} c(a, z), \quad \forall z \in \mathcal{Z}. \tag{11}$$

It is worth emphasizing that the complete information benchmark is defined on the continuous action space $\mathcal{A}$. Let $\hat{\pi}$ be the UC policy obtained by AP-CLUC. Then, the expected learning regret with respect to the benchmark $\pi_s^*(z)$ in (11) by time period $T$ is given by

$$R(T) := \mathbb{E}\left[\sum_{t=0}^{T} C_{tot}(\hat{\pi}(t), t) - \sum_{t=0}^{T} c\big(\pi^*_{s(t)}(z(t)), z(t)\big)\right] \tag{12}$$

where $s(t)$ and $z(t)$ denote the unit status and context of AP-CLUC at time period $t$.
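In a simulation, the expectation in (12) is usually approximated by a single run. The sketch below, a hypothetical helper named `empirical_regret` that is not part of the paper, accumulates the difference between the learner's realized costs and the benchmark's expected costs over time, assuming both sequences were produced for the same unit statuses and contexts.

```python
def empirical_regret(realized_costs, benchmark_costs):
    """Cumulative single-run regret curve.

    realized_costs[t] is the learner's observed cost in period t and
    benchmark_costs[t] is the benchmark's expected cost for the same
    unit status and context.
    """
    if len(realized_costs) != len(benchmark_costs):
        raise ValueError("both sequences must cover the same time periods")
    total, curve = 0.0, []
    for c_learner, c_bench in zip(realized_costs, benchmark_costs):
        total += c_learner - c_bench
        curve.append(total)
    return curve
```

Dividing the last entry of the curve by the number of periods gives the average regret, which is the quantity that a sublinear bound drives to zero.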

The following theorem bounds the regret of AP-CLUC (without experience reuse) given in (12).

Theorem 1: When the parameters of AP-CLUC are set as $\rho = 2(J + 1)$ and $m_A(l) = 2^l$, we have for AP-CLUC

R(T ) ≤ l o g 2 T 2 ( J + 1 ) +1 l=1 Kl(T ) 2l(2J +1)(2L√D + 2J +12 + L¯pther √ J) + 1 ,

where $K_l(T)$ is the number of level $l$ hypercubes that are activated by time $T$ and $\bar{p}_{ther} = \max_{j \in \mathcal{J}} [p_j^{max} - p_j^{min}]$.

Proof: See Appendix B.

Note that the regret bound for AP-CLUC depends on the number of activated level $l$ hypercubes given by $K_l(T)$, which depends on the pattern of context arrivals. Using the general form of the regret bound given in Theorem 1, we next show that the regret of AP-CLUC is sublinear in $T$ for the worst possible pattern of context arrivals, i.e., the one that maximizes the number of activated hypercubes (in which the contexts arrive uniformly over the context space).

Corollary 1: When the parameters of AP-CLUC are set as $\rho = 2(J + 1)$ and $m_A(l) = 2^l$, if the context arrivals by time $T$ are uniformly distributed over the context space, we have for AP-CLUC,

$$R(T) = O\left(T^{\frac{D + 2J + 1}{D + 2J + 2}}\right).$$

Proof: See Appendix C.

The regret bound in Corollary 1 is sublinear in $T$. Thus, in theory, as the number of time periods grows without bound, it is guaranteed that the average cost of AP-CLUC converges to the average cost of the benchmark, i.e., $\lim_{T \to \infty} R(T)/T = 0$, for all possible context arrivals.
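To get a feel for how fast the average regret vanishes, the exponent in Corollary 1 can be evaluated numerically. The sketch below is only an illustration; the function names and the example values of $D$ and $J$ used with it are assumptions, not settings from the paper.

```python
def regret_exponent(D, J):
    """Exponent of T in the Corollary 1 bound, (D + 2J + 1) / (D + 2J + 2)."""
    return (D + 2 * J + 1) / (D + 2 * J + 2)


def average_regret_scale(T, D, J):
    """Order of R(T)/T, i.e. T ** (exponent - 1), ignoring constants."""
    return T ** (regret_exponent(D, J) - 1.0)
```

For example, with the hypothetical values `D = 2` and `J = 4` the exponent is 11/12, so the average regret decays only as $T^{-1/12}$. The decay slows down as the context dimension or the number of units grows.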

D. Computational Complexity of AP-CLUC

In each time period, AP-CLUC needs to perform comparison operations to identify the active hypercube that the current context belongs to. The computational complexity of such identification is given by $O(|\mathcal{P}|)$, where $|\mathcal{P}|$ is the number of active hypercubes in the partition $\mathcal{P}$. Since uniform context arrivals over the context space maximize the number of active hypercubes, $|\mathcal{P}|$ at time period $T$ is bounded by $|\mathcal{P}| < 2^{D\, l_{max}(T)}$, where $l_{max}(T) = 1 + \log_{2^{D+\rho}} T$ is the maximum hypercube level at time period $T$ with the uniform context arrivals derived in Appendix C. Then, the worst-case computational complexity of the identification at time period $T$ is given by $O(2^{D\, l_{max}(T)})$.

After identifying the active hypercube, AP-CLUC needs to perform one comparison operation for choosing the action with the lowest estimated total operating cost and update operations on the estimated total operating costs of the actions including the unselected actions, i.e., virtual updates. To virtually update the estimated total operating costs of the unselected actions, the operating costs of the unselected actions have to be computed from the observed cost of the selected action and realization of the uncertainties. Thus, the computational complexity of the virtual updates highly depends on the number of the unselected actions whose estimated total operating costs will be virtually updated in each time period. The computational complexity of the update operations has the order $O((2 m_A(l))^J)$ if the estimated costs for all unselected actions are virtually updated. On the other hand, if AP-CLUC does not virtually update any unselected actions, then the computational complexity has the order $O(1)$. Hence, we can control the computational complexity of AP-CLUC by limiting the number of the unselected actions whose estimated total operating costs will be virtually updated.
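The quantities in this subsection are easy to evaluate for concrete settings. The following sketch computes the maximum level, the bound on the partition size, and the order of the number of virtual updates per period. The function names and the example arguments used with them are illustrative assumptions.

```python
import math


def max_level(T, D, rho):
    """l_max(T) = 1 + log_{2^(D + rho)} T for uniform context arrivals."""
    return 1 + math.log2(T) / (D + rho)


def partition_size_bound(T, D, rho):
    """Upper bound 2^(D * l_max(T)) on the number of active hypercubes."""
    return 2 ** (D * max_level(T, D, rho))


def virtual_update_order(m_A, J):
    """Order (2 * m_A)^J of update operations when all actions are updated."""
    return (2 * m_A) ** J
```

For instance, with `T = 4096`, `D = 2`, and `rho = 4` the maximum level is 3 and the partition bound is 64, whereas `m_A = 8` and `J = 4` already give on the order of 65536 virtual updates per period. This suggests that the virtual updates, rather than the hypercube lookup, tend to dominate the per-period cost.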

V. NUMERICAL RESULTS

In this section, we provide simulation results to evaluate the performance of AP-CLUC.

A. Simulation Setup

The length of a time period is taken to be an hour and the circular time duration for the context is set to be a day, i.e., $\mathcal{H} = \{0, 1, ..., 23\}$. It is worth emphasizing that AP-CLUC starts without any a priori information of the uncertainties and learns them during the simulation. We consider a microgrid with wind turbines and four identical thermal power generation units. The parameters of the thermal power generation units are provided in Table II. We set the load shedding price, LSP, and the power curtailment price, PCP, to be 200 $/kWh [9]. The spinning reserve requirement is set to be 10% of the total power output of the thermal power generation units.

In our simulation, we consider a context consisting of current time, load demand context, weather context, and down time of units, where the dimensions of both load demand and weather context spaces are 1.³ The power output profile for each hour and parameters of wind turbines are adopted from [9], and their power output capacity is set to be 650 kW. For the load demand profile for each hour, we use the hourly average load shapes of residential electricity services in California [44] with 500 customers. Then, the uncertainties are generated by using their profiles and the context. In each time period $t$, the load demand context, $x_M(t)$, is non-uniformly generated in $[-1, 1]$ by a truncated normal distribution whose mean is zero and variance is 0.2. Then, the load demand is generated by a Gaussian distribution whose mean is set to be $P_M^{profile}(h(t)) + x_M(t) P_M^{uncert}$, where $P_M^{profile}(h)$ is the value of the load demand profile in time index $h$ and $P_M^{uncert}$ is an amount of load demand uncertainty. Note that the Gaussian distribution is widely used to model the forecasting error [45]. By this, the load demand is generated

³ To simply construct the simulation system with stochastic uncertainties, we assume that the dimension of each of the load demand and weather contexts is one. In the real world, the projected context can be obtained as discussed in Remark 1.

TABLE II

PARAMETERS OF THERMAL UNITS [9]

based on both its profile and its correlated context. The standard deviation of the distribution is set to be 2.5% of its mean, which is widely used to model the scenarios in day-ahead UC problems [9], [46]. Similarly, the weather context, $x_W(t)$, is non-uniformly generated in $[-1, 1]$ by the same distribution as for the load demand context. Then, the renewable power output is also generated by a Gaussian distribution whose mean is set to be $P_W^{profile}(h(t)) + x_W(t) P_W^{uncert}$, where $P_W^{profile}(h)$ is the value of the renewable power output profile in time index $h$ and $P_W^{uncert}$ is an amount of renewable power uncertainty. We set both $P_M^{uncert}$ and $P_W^{uncert}$ to be 150 kW.
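The generation procedure described above can be sketched with the Python standard library alone. The function names, the rejection-sampling approach to the truncated normal, and the example profile value of 400 kW are assumptions for illustration. The sketch also applies the 2.5% relative standard deviation to both the load demand and the renewable output, which the text states explicitly only for the load demand.

```python
import random


def truncated_normal(rng, mean, var, low, high):
    """Rejection-sample a normal variable restricted to [low, high]."""
    std = var ** 0.5
    while True:
        x = rng.gauss(mean, std)
        if low <= x <= high:
            return x


def draw_uncertainty(rng, profile_value, p_uncert, rel_std=0.025):
    """Return (context, realization) for one time period.

    The context is truncated normal on [-1, 1] with mean 0 and variance 0.2;
    the realization is Gaussian around profile_value + context * p_uncert
    with a standard deviation of rel_std times that mean.
    """
    context = truncated_normal(rng, 0.0, 0.2, -1.0, 1.0)
    mean = profile_value + context * p_uncert
    realization = rng.gauss(mean, rel_std * abs(mean))
    return context, realization
```

A call such as `draw_uncertainty(random.Random(0), 400.0, 150.0)` returns one context and one load demand sample in kW. No clipping to non-negative values is applied here, since the text does not describe one.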

To evaluate the performance of AP-CLUC, we compare it with its complete information benchmark and also with the stochastic optimization for UC (SOUC) that is one of the most representative UC approaches [2]–[9], [28]. For SOUC, we assume that perfect a priori information (PI) is given. We also compare it with several learning algorithms: AP-CLUC without experience reuse (ER), UP-CLUC, Q-learning-like algorithm for UC (QLUC), UCB1, and EXP3. The descriptions of the algorithms and their parameter settings are provided as follows. The number of power outputs, $m_A$, is set to be 8 for all algorithms unless mentioned explicitly.

• SOUC with PI chooses the best UC schedule considering all possible stochastic scenarios on the uncertainties using the perfect information. In other words, it chooses the optimal UC schedule in the continuous action space $\mathcal{A}$ as in (11). It is worth noting that it is an ideal SOUC since such perfect a priori information cannot be obtained in reality.

• AP-CLUC is given in Algorithm 1 with $\rho$ set to be 2. In AP-CLUC, we also run the complete information benchmark. The benchmark chooses the optimal UC schedule as in (11). This is similar to SOUC with PI, but when the benchmark chooses the UC schedule, it uses the unit status and context of AP-CLUC as in (12), while SOUC with PI uses its own. We also implement AP-CLUC without ER to show the improvement due to ER. In AP-CLUC without ER, when each hypercube is initialized, its initial estimated cost is set to be infinite and its counters are set to be zero.

• UP-CLUC is given in [1], and we implement two UP-CLUCs with $m_z$ set to be 5 and 10, respectively, where $m_z$ is the slicing parameter for the context space.

• QLUC is a learning algorithm which considers only the current time information, which is basic state information, while considering neither the load demand context nor the weather context. We simply implement it by neglecting both contexts

⁴ Note that the ramp rates of the units are assumed to be equal to $p_j^{max}$ since their sizes are small enough to reach their maximum power outputs within a time period [9].


TABLE III

ON/OFF STATUS OF UNITS BY AP-CLUC

TABLE IV

COMPARISON OF AVERAGE COSTS ($)

in UP-CLUC. In QLUC, we also adopt the virtual updates for fair comparison.

The following learning algorithms, i.e., UCB1, and EXP3, do not consider the context information. The parameters of each algorithm are chosen as the set of parameters for which the algorithm performs the best.

• UCB1 [47] computes an index for each action, which is a lower confidence bound of the expected cost. Then, the algorithm chooses the action that has the lowest index.

• EXP3 [48] computes and updates a weight parameter for each action by using its realized operating costs. Then, it uses the weight parameters to randomly decide the action to be taken. For EXP3, $m_A$ is set to be 4 instead of 8 for better performance.
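For reference, the cost-minimizing form of the UCB1 index described above can be sketched as follows. The function name `ucb1_choose` and the dictionary-based bookkeeping are assumptions of this example. The exploration term $\sqrt{2 \ln t / n}$ is the standard UCB1 choice for quantities normalized to $[0, 1]$, and the tuning used in the simulations may differ.

```python
import math


def ucb1_choose(mean_cost, counts, t):
    """Pick the action with the lowest lower-confidence bound on its cost.

    mean_cost[a] is the sample-mean cost of action a, counts[a] how often it
    was played, and t the current time period (t >= 1).
    """
    for a, n in counts.items():
        if n == 0:
            return a  # play every action once before using the index
    def index(a):
        return mean_cost[a] - math.sqrt(2.0 * math.log(t) / counts[a])
    return min(counts, key=index)
```

Subtracting the confidence term makes rarely played actions look optimistic, so they are still explored even if their current mean cost is high.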

To clearly show that AP-CLUC addresses the UC problem, we list the on/off status of units by AP-CLUC for 24 selected time periods during the simulation in Table III. According to the observed context that is strongly correlated with the uncertainties, AP-CLUC decides the on/off status of units to minimize the average total operating costs. From the table, we can see that AP-CLUC satisfies the minimum up/down time constraints.

B. Average Costs and Learning Speeds

We first compare the achieved average costs, which are provided in Table IV. CLUCs, i.e., AP-CLUC and UP-CLUCs, achieve better performance than other learning algorithms which do not utilize the context information. In particular, AP-CLUC achieves 27.4%, 40.1%, and 48.6% cost reduction against QLUC, UCB1, and EXP3, respectively. This result shows that using the context information is effective to achieve better performance when the system uncertainties are correlated to the context information. It is worth noting that many existing studies show that the context information, i.e., current time, weather condition, and past load demand, is highly correlated to the system uncertainties, i.e., renewable power outputs and load demand, in the real world [33], [34]. In addition, QLUC which uses only the current time context achieves better performance

Fig. 3. Average total operating costs of the algorithms.

than other learning algorithms which do not use any context information. This result also shows the effectiveness of using the context information. We can see that in general the load shedding costs of the algorithms which do not use the context information are larger than those of CLUCs, while their fuel costs are smaller. This result shows that in general they generate too little power to support the load demand compared with CLUCs since they fail to predict the uncertainties due to the lack of the context information.

In addition, from Table IV, we see that AP-CLUC achieves a performance close to the benchmark by effectively using the context information. It is worth noting that the benchmark is based on the assumption that it has perfectly accurate a priori information. The benchmark and SOUC with PI achieve a similar performance owing to the perfect information. AP-CLUC and UP-CLUC which has a fine partition of the context space, i.e., UP-CLUC with $m_z = 10$, achieve performance close to SOUC with PI by effectively using the context information. On the other hand, UP-CLUC which has a rough partition of the context space, i.e., UP-CLUC with $m_z = 5$, achieves worse performance than them due to the approximation errors from merging context arrivals.

In Fig. 3, we compare the learning speeds of the learning algorithms. The faster learning speed of a learning algorithm implies that when the statistical characteristics of the system uncertainties vary, the algorithm can adapt to it more quickly. Hence, the learning speed of the learning algorithm is important to use it in practice, since in the real world, such statistical characteristics might vary over time due to many environmental reasons such as seasonal change and economy. Note that SOUC and the benchmark are not learning algorithms. We can see that AP-CLUC,


Fig. 4. Average total operating costs of the algorithms varying the amount of uncertainties.

AP-CLUC without ER, UP-CLUC with $m_z = 5$, QLUC, and UCB1 have relatively fast learning speeds, and UP-CLUC with $m_z = 10$ and EXP3 have relatively slow learning speeds. By comparing UP-CLUCs, we can see that UP-CLUC which has a finer partition of the context space learns slower while achieving smaller average cost. In addition, by reusing the past experience, AP-CLUC learns faster than AP-CLUC without ER as shown in the figure.

C. Impact of Degree of Uncertainties

We examine the impact of uncertainties by varying the degree of uncertainties in our simulation system, i.e., $P_M^{uncert}$ and $P_W^{uncert}$. Note that the degree of uncertainty represents the maximum deviation from the profile value according to the context information. Thus, as the amount of uncertainties increases, both load demand and renewable power output fluctuate more. In general, more fluctuation of the uncertainties causes higher operating costs since the system needs more effort to address the uncertainties. From Fig. 4, we can see that the average total operating costs of all learning algorithms increase as the degree of uncertainties increases. It is worth noting that the average cost of SOUC with PI also increases even though it has perfect a priori information since more inevitable costs occur from the constraints of the thermal units such as the minimum power outputs of the units and the minimum up/down times. In the figure, we can also see that the increased amounts of the average total costs of CLUCs are similar to that of SOUC with PI. This shows that CLUCs can address the uncertainties while incurring a cost similar to that of SOUC with PI, since they can learn the uncertainties by using the context information. On the other hand, the increased amount of the average total costs of the learning algorithms which do not use the load demand and weather contexts is larger than that of other algorithms which use the context information. This also implies that using the context information is effective to address the uncertainties.

D. Impact of Inaccuracy of a Priori Information

Next, we evaluate the performance of SOUC when the a priori information is inaccurate. Specifically, we investigate the impact of the inaccuracy of a priori information on the performance of SOUC.

Fig. 5. Average total operating costs of CLUCs, QLUC and SOUC with inaccurate a priori information.

It is worth noting that perfectly accurate a priori information cannot be obtained in reality, and thus, any a priori information used in the existing UC models is inaccurate to some degree. To adjust the degree of inaccuracy in a priori information, we consider the case when the mean of the load demand is overestimated and the mean of the renewable power output is underestimated compared to their expected values. The degree of overestimation and underestimation is stated in percentages. In Fig. 5, we show the average total operating costs of CLUCs, QLUC, and SOUC varying the degree of inaccuracy in a priori information. For simple presentation, among the comparative learning algorithms, i.e., QLUC, UCB1 and EXP3, only the performance of QLUC is provided in the figure since QLUC has the best performance among them. We can see that the average total cost of SOUC increases as the degree of inaccuracy increases. On the other hand, the average total cost of other algorithms does not change since they do not use a priori information. AP-CLUC and UP-CLUC with $m_z = 10$ achieve better performance than SOUC if the degree of inaccuracy in a priori information of SOUC becomes more than 4%. Moreover, the difference between the performances of CLUCs and SOUC rapidly increases as the degree of inaccuracy increases, and if the degree of inaccuracy becomes more than 11%, SOUC has worse performance than even QLUC which uses only the current time context. This result shows that CLUCs are more effective than SOUC when given a priori information is not highly accurate.

E. Effectiveness of Adaptive Partitioning

In this subsection, we compare AP-CLUC and UP-CLUC to show the effectiveness of adaptive partitioning in AP-CLUC. From the previous results, we can infer that in UP-CLUC, there is a tradeoff between the average costs and the learning speed and the tradeoff can be controlled by the slicing parameter for the context space $m_z$. To compare the performance of AP-CLUC and UP-CLUC more clearly, in Fig. 6, we provide the average total operating costs of AP-CLUC and UP-CLUC varying $m_z$ as 2, 4, 6, and 10. From the figure, we can see that in UP-CLUC, as $m_z$ increases, the learning speed becomes slower, but the average total cost decreases with enough time periods. Thus, due to such slow learning speeds, UP-CLUC with large $m_z$


Fig. 6. Average total operating costs of AP-CLUC and UP-CLUC varying $m_z$ as 2, 4, 6, and 10.

Fig. 7. Average total operating costs of AP-CLUCs varying the ratio of virtually updated unselected actions from all unselected actions.

has a worse average total cost than UP-CLUC with smaller $m_z$ before the uncertainties are learned enough. On the other hand, AP-CLUC does not have such a tradeoff since it adaptively partitions the context space according to the context arrivals. In the figure, AP-CLUC achieves the lowest average total cost while having a relatively fast learning speed compared with UP-CLUCs. Besides, it achieves a lower average total cost than UP-CLUC in all time periods, regardless of $m_z$. This implies that AP-CLUC outperforms UP-CLUC regardless of $m_z$ owing to its adaptive partitioning.

F. Impact of Virtually Updated Actions

In Fig. 7, the impact of the number of virtually updated unselected actions in AP-CLUC is shown. The unselected actions which will be virtually updated are randomly chosen, and their numbers are determined as 20%, 40%, 60%, 80%, and 100% of all unselected actions. We can see that as more unselected actions are virtually updated, the learning speed of AP-CLUC increases. However, as investigated in Section IV-D, the computational complexity also increases. Thus, the number of virtually updated unselected actions should be carefully chosen considering the tradeoff between the learning speed and the computational complexity.

Fig. 8. Average total operating costs of the algorithms in the larger microgrid.

G. Average Total Operating Costs and Learning Speeds in Larger Microgrids

In Fig. 8, we provide the average total operating costs of the algorithms in a microgrid having a larger number of customers, wind turbines, and thermal power generation units compared with the microgrid considered in the previous results. In the microgrid, we consider six units, 750 customers, and wind turbines whose power output capacity is given by 1000 kW. Similar to the previous results, AP-CLUC achieves better performance than other learning algorithms and a performance close to the benchmark. Moreover, the learning speed of AP-CLUC is similar to that in the previous results. This clearly shows that AP-CLUC is applicable to larger microgrids.

VI. CONCLUSION AND FUTURE WORK

In this paper, we developed AP-CLUC which minimizes the average total operating cost of the microgrid with renewable energy sources by learning the system uncertainties using the context information. Then, we proved the optimality of AP-CLUC in terms of the long-term average cost. Moreover, we showed through simulations that AP-CLUC achieves performance close to the complete information benchmark which has perfect a priori information about the system uncertainties, and outperforms other learning algorithms which do not use the context information. Our results show that two key properties of AP-CLUC, the use of and the adaptive management of the context information, make it perform better than its competitors including UP-CLUC.

As future work on this subject, power flow issues can be incorporated into the UC problem for secure power flow. To this end, the transmission line capacity constraints can be considered in AP-CLUC. Moreover, AP-CLUC can be extended by incorporating power flow decisions into the actions. In addition to the power flow issues, the operational reliability of microgrids can also be considered in AP-CLUC. For example, the existing concepts to adjust the conservativeness and robustness, such as minimax regret [14] and CVaR [15], can be applied to AP-CLUC. Moreover, the reliability for load shedding or power curtailment can be addressed by incorporating stricter reserve requirement constraints or introducing different weights for each type of cost. Lastly, our learning approach can be extended to UC scenarios on microgrids using game theory, which have been widely studied recently [49]–[51]. For this, there are several promising learning methods such as game-theoretical multi-armed bandits [52], which can be used for such scenarios with game settings.

APPENDIX A

PROOF OF LEMMA 1

The proof is done by showing that all costs, including the operating cost in (5), the fuel cost in (6), the start-up cost in (7), the load shedding cost in (8), and the power curtailment cost in (9), obey the Lipschitz condition for context $z$ and action $a$. We assume that the statistical characteristics of load demand and renewable power outputs are similar for similar contexts, and formalize this as the Lipschitz condition. For simplicity of presentation, we denote $\hat{M}(x) - \hat{p}_{re}(x)$ by $\omega_x$ and denote an event $\{a < \omega_x \le b\}$ by $\Omega_a^b(x)$.

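Assumption 1: There exists $L_x > 0$ such that for all $x, x' \in \mathcal{X}$, we have

$$\left| \mathbb{E}[\omega_x I(\Omega_a^b(x))] - \mathbb{E}[\omega_{x'} I(\Omega_a^b(x'))] \right| \le L_x \|x - x'\|$$

and

$$\left| \mathbb{P}(\Omega_a^b(x)) - \mathbb{P}(\Omega_a^b(x')) \right| \le L_x \|x - x'\|$$

for any given $a \le b$, where $I(\Omega)$ is the indicator function for event $\Omega$ and $\mathbb{P}(\Omega)$ denotes the probability of event $\Omega$.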

The proof for $z$: The fuel cost always satisfies the condition since it does not depend on the extended context $z$. To prove that the start-up cost of unit $j$ satisfies the condition, we show that

$$CSC_j \left| e^{-\frac{T_{j,off}}{CST_j}} - e^{-\frac{T'_{j,off}}{CST_j}} \right| \le L \left| T_{j,off} - T'_{j,off} \right|, \tag{13}$$

for some $L > 0$, where $T_{j,off}$ and $T'_{j,off}$ are the elements representing the down times of unit $j$ in $z$ and $z'$, respectively. Since $T_{j,off} \ge 0$ and $T'_{j,off} \ge 0$, we have

$$\left| e^{-\frac{T_{j,off}}{CST_j}} - e^{-\frac{T'_{j,off}}{CST_j}} \right| \le 1.$$

Using the fact that the down time is a non-negative integer, we get $|T_{j,off} - T'_{j,off}| \ge 1$ for any $T_{j,off}$ and $T'_{j,off}$ such that $T_{j,off} \ne T'_{j,off}$. Thus, when we choose $L = CSC_j$, the start-up cost of unit $j$ satisfies the condition in (13) for any $T_{j,off}$ and $T'_{j,off}$ such that $T_{j,off} \ne T'_{j,off}$. When $T_{j,off} = T'_{j,off}$, the condition is satisfied regardless of $L$ since both sides in (13) are zero. Hence, when we choose $L = CSC_j$, the start-up cost of unit $j$ satisfies the condition for any $T_{j,off}$ and $T'_{j,off}$. From the load shedding cost in (8), we have

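$$
\begin{aligned}
& LSP \left| \mathbb{E}\big([\omega_x - \textstyle\sum_j p_j]^+\big) - \mathbb{E}\big([\omega_{x'} - \textstyle\sum_j p_j]^+\big) \right| \\
&= LSP \left| \mathbb{E}\big[(\omega_x - \textstyle\sum_j p_j) I(\Omega^{\infty}_{\Sigma_j p_j}(x))\big] - \mathbb{E}\big[(\omega_{x'} - \textstyle\sum_j p_j) I(\Omega^{\infty}_{\Sigma_j p_j}(x'))\big] \right| \\
&= LSP \left| \mathbb{E}\big[\omega_x I(\Omega^{\infty}_{\Sigma_j p_j}(x))\big] - \mathbb{E}\big[\omega_{x'} I(\Omega^{\infty}_{\Sigma_j p_j}(x'))\big] - \textstyle\sum_j p_j \big( \mathbb{P}(\Omega^{\infty}_{\Sigma_j p_j}(x)) - \mathbb{P}(\Omega^{\infty}_{\Sigma_j p_j}(x')) \big) \right| \\
&\le LSP \left| \mathbb{E}\big[\omega_x I(\Omega^{\infty}_{\Sigma_j p_j}(x))\big] - \mathbb{E}\big[\omega_{x'} I(\Omega^{\infty}_{\Sigma_j p_j}(x'))\big] \right| + LSP \textstyle\sum_j p_j \left| \mathbb{P}(\Omega^{\infty}_{\Sigma_j p_j}(x)) - \mathbb{P}(\Omega^{\infty}_{\Sigma_j p_j}(x')) \right| \\
&\le LSP \cdot L_x \big(1 + \textstyle\sum_j p_j\big) \|x - x'\|,
\end{aligned}
$$

where the last inequality follows from Assumption 1. Then, by choosing $L = LSP \cdot L_x (1 + \sum_j p_j^{max})$, we can show that the load shedding cost obeys the Lipschitz condition. Similarly, we can show the Lipschitz condition of the power curtailment cost with $L = PCP \cdot L_x (1 + \sum_j p_j^{max})$. Then, the expected cost is a Lipschitz continuous function of the extended context $z$ with $L_z = \sum_{j \in \mathcal{J}} CSC_j + L_x (LSP + PCP)(1 + \sum_j p_j^{max})$.

The proof for $a$: To prove that the fuel cost of unit $j$ satisfies the condition, we show that the following inequality is satisfied

$$c^{(1)}_{j,fu} (p_j - p'_j) + c^{(2)}_{j,fu} (p_j^2 - {p'_j}^2) \le L (p_j - p'_j),$$

for some $L > 0$, where $p_j > p'_j$. By dividing both sides of the above inequality by $(p_j - p'_j)$, we have

$$c^{(1)}_{j,fu} + c^{(2)}_{j,fu} (p_j + p'_j) \le L.$$

Thus, when we choose $L = c^{(1)}_{j,fu} + 2 c^{(2)}_{j,fu} p_j^{max}$, the fuel cost of unit $j$ satisfies the condition for any $p_j$ and $p'_j$. The start-up cost satisfies the condition since it does not depend on the action. For the load shedding cost, we have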

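$$LSP \left| \mathbb{E}\big([\omega_x - \textstyle\sum_j p_j]^+\big) - \mathbb{E}\big([\omega_x - \textstyle\sum_j p'_j]^+\big) \right|.$$

To simplify the notations, we omit $x$ in the following. Without loss of generality, we assume $\sum_j p_j > \sum_j p'_j$. Then, we can arrange the load shedding cost as

$$
\begin{aligned}
& LSP \left| \mathbb{E}\big[(\omega - \textstyle\sum_j p_j) I(\Omega^{\infty}_{\Sigma_j p_j})\big] - \mathbb{E}\big[(\omega - \textstyle\sum_j p'_j) I(\Omega^{\infty}_{\Sigma_j p'_j})\big] \right| \\
&= LSP \left| \int_{\Omega^{\infty}_{\Sigma_j p_j}} (\omega - \textstyle\sum_j p_j)\, dF - \int_{\Omega^{\infty}_{\Sigma_j p'_j}} (\omega - \textstyle\sum_j p'_j)\, dF \right| \\
&= LSP \left| \int_{\Omega^{\infty}_{\Sigma_j p_j}} (\omega - \textstyle\sum_j p_j)\, dF - \int_{\Omega^{\infty}_{\Sigma_j p_j}} (\omega - \textstyle\sum_j p'_j)\, dF - \int_{\Omega^{\Sigma_j p_j}_{\Sigma_j p'_j}} (\omega - \textstyle\sum_j p'_j)\, dF \right| \\
&= LSP \left[ \int_{\Omega^{\infty}_{\Sigma_j p_j}} (\textstyle\sum_j p_j - \textstyle\sum_j p'_j)\, dF + \int_{\Omega^{\Sigma_j p_j}_{\Sigma_j p'_j}} (\omega - \textstyle\sum_j p'_j)\, dF \right] \\
&\le LSP \left[ \int_{\Omega^{\infty}_{\Sigma_j p_j}} (\textstyle\sum_j p_j - \textstyle\sum_j p'_j)\, dF + \int_{\Omega^{\Sigma_j p_j}_{\Sigma_j p'_j}} (\textstyle\sum_j p_j - \textstyle\sum_j p'_j)\, dF \right] \\
&= LSP \int_{\Omega^{\infty}_{\Sigma_j p'_j}} (\textstyle\sum_j p_j - \textstyle\sum_j p'_j)\, dF \\
&= LSP \cdot \mathbb{P}(\Omega^{\infty}_{\Sigma_j p'_j}) (\textstyle\sum_j p_j - \textstyle\sum_j p'_j) \\
&\le LSP \left| \textstyle\sum_j p_j - \textstyle\sum_j p'_j \right| = LSP \left| \textstyle\sum_j (p_j - p'_j) \right| \\
&\le LSP \sqrt{J} \, \| p_{ther} - p'_{ther} \|,
\end{aligned}
$$

where $F$ is the joint distribution of $\hat{M}$ and $\hat{p}_{re}$ and the last inequality follows from the Cauchy-Schwarz inequality, i.e., $(\sum_j 1 \cdot x_j)^2 \le J \sum_j x_j^2$. Then, by choosing $L = LSP \sqrt{J}$, the load shedding cost satisfies the condition. Similarly, we can show that the power curtailment cost satisfies the condition by choosing $L = PCP \sqrt{J}$. Then, the expected cost is a Lipschitz continuous function of the action $a$ with $L_a = \sum_{j \in \mathcal{J}} \big( c^{(1)}_{j,fu} + 2 c^{(2)}_{j,fu} p_j^{max} \big) + \sqrt{J} (LSP + PCP)$.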

Finally, we conclude that the expected cost is a Lipschitz continuous function for the context $z$ and the action $a$, respectively, with $L = \max(L_z, L_a)$.

APPENDIX B

PROOF OF THEOREM 1

We first introduce some notations and definitions. For each set (hypercube) $p \in \mathcal{P}_Z$, let $\bar{c}_{a,p} := \sup_{z \in p} c(a, z)$ and $\underline{c}_{a,p} := \inf_{z \in p} c(a, z)$. For notational brevity, we denote the expected operating cost $c(a(t), z(t))$ by $c_{a,z}(t)$. We denote the estimated cost of action $a$ on set $p$ at time period $t$ by $\hat{c}_{a,p}(t)$. Let $\hat{a}(t)$ be the action selected by AP-CLUC at time period $t$, $a^*(t) = \pi^*_{s(t)}(z(t))$ be the best myopic action given unit status $s(t)$ and context $z(t)$, and $\bar{a}^*(t)$ be the best myopic action in $\bar{\mathcal{A}}(s(t), l)$ given unit status $s(t)$ and context $z(t)$.

The upper bound on the highest level hypercube that is active at any time $t$ is given by the following lemma.

Lemma 2: In AP-CLUC, all the active hypercubes $p \in \mathcal{P}(t)$ at time $t$ have at most a level of $\frac{\log_2 t}{\rho} + 1$.

Proof: Let $l' + 1$ be the level of the highest level active hypercube. We must have $\sum_{l=0}^{l'} 2^{\rho l} < t$, otherwise the highest level active hypercube's level will be less than $l' + 1$. We have for $t > 1$,

$$\frac{2^{\rho(l'+1)} - 1}{2^{\rho} - 1} < t \;\Rightarrow\; 2^{\rho l'} < t \;\Rightarrow\; l' < \frac{\log_2 t}{\rho}.$$
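The bound in Lemma 2 can be checked numerically for the extreme case in which every context falls into the same nested chain of hypercubes, so that each level $l$ is split after $2^{\rho l}$ arrivals. The function name `deepest_level_after` is an assumption of this sketch, which counts a split as soon as the threshold is reached.

```python
import math


def deepest_level_after(t, rho):
    """Deepest active level when all t arrivals fall into one nested chain."""
    level, used = 0, 0
    # A level-l cube is replaced by its children after 2^(rho * l) arrivals.
    while used + 2 ** (rho * level) <= t:
        used += 2 ** (rho * level)
        level += 1
    return level


def lemma2_bound(t, rho):
    """Level bound log2(t) / rho + 1 stated in Lemma 2."""
    return math.log2(t) / rho + 1
```

For example, with `rho = 2` the thresholds are 1, 4, and 16, so 21 arrivals are needed before level 3 becomes active, which stays within the bound $\log_2(21)/2 + 1 \approx 3.2$.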

With the introduced notations, the one-step regret in time period $t$ is defined as

$$r(t) := c_{\hat{a},z}(t) - c_{a^*,z}(t).$$

Consider time period $t$ in which a context $z(t)$ arrives to a level $l$ hypercube denoted by $p$. Suppose that $m_A(l) = 2^{l\xi}$ and the number of previous context arrivals to this hypercube is $\tau$. Note that in AP-CLUC, the estimated costs of all actions in $\bar{\mathcal{A}}(l)$ are virtually updated for every context arrival. Thus, all actions in $\bar{\mathcal{A}}(l)$ are updated $\tau$ times. From the one-step regret, we have

$$
\begin{aligned}
r(t) &= c_{\hat{a},z}(t) - c_{\bar{a}^*,z}(t) + c_{\bar{a}^*,z}(t) - c_{a^*,z}(t) \\
&\le c_{\hat{a},z}(t) - c_{\bar{a}^*,z}(t) + L \bar{p}_{ther} \sqrt{J}\, 2^{-l\xi},
\end{aligned}
\tag{14}
$$

where $\bar{p}_{ther} = \max_{j \in \mathcal{J}} [p_j^{max} - p_j^{min}]$ and the inequality follows from Lemma 1. Note that there always exists a discretized action $\bar{a}^*$ within the distance $L \bar{p}_{ther} \sqrt{J}\, 2^{-l\xi}$ from the optimal action $a^*$, since the power outputs are uniformly discretized using $m_A(l)$ and the set of feasible power outputs at each time period is a convex set for any given on/off states. Also, we have

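$$\hat{c}_{\hat{a},p}(t) \le \hat{c}_{\bar{a}^*,p}(t) \quad \text{a.s.}$$

by the action selection rule of AP-CLUC. Then, from (14), we obtain

$$
\begin{aligned}
r(t) &\le c_{\hat{a},z}(t) - \hat{c}_{\hat{a},p}(t) + \hat{c}_{\bar{a}^*,p}(t) - c_{\bar{a}^*,z}(t) + L \bar{p}_{ther} \sqrt{J}\, 2^{-l\xi} \\
&\le 2 \max_{a \in \bar{\mathcal{A}}(l)} |c_{a,z}(t) - \hat{c}_{a,p}(t)| + L \bar{p}_{ther} \sqrt{J}\, 2^{-l\xi}.
\end{aligned}
$$

Let $\Delta_t := \max_{a \in \bar{\mathcal{A}}(l)} |c_{a,z}(t) - \hat{c}_{a,p}(t)|$. Then, since the total operating cost is bounded in $[0, 1]$, we have

$$
\begin{aligned}
\mathbb{E}[r(t)] &\le 2 \mathbb{E}[\Delta_t] + L \bar{p}_{ther} \sqrt{J}\, 2^{-l\xi} \\
&= 2 \int_0^1 \mathbb{P}(\Delta_t \ge y)\, dy + L \bar{p}_{ther} \sqrt{J}\, 2^{-l\xi}.
\end{aligned}
\tag{15}
$$

We also have for all $a \in \bar{\mathcal{A}}(l)$,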

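$$\mathbb{E}[\hat{c}_{a,p}(t)] - L \sqrt{D}\, 2^{-l} \le \underline{c}_{a,p} \le \mathbb{E}[\hat{c}_{a,p}(t)]$$

and

$$\mathbb{E}[\hat{c}_{a,p}(t)] \le \bar{c}_{a,p} \le \mathbb{E}[\hat{c}_{a,p}(t)] + L \sqrt{D}\, 2^{-l}$$

from Lemma 1. Thus, we have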

{Δt ≥ y} = a∈ ¯A(l) {|ca,z(t) − ˆca,p(t)| ≥ y} = a∈ ¯A(l)

{ca,z(t) − ˆca,p(t) ≤ −y}

∪ a∈ ¯A(l) {ca,z(t) − ˆca,p(t) ≥ y} ⊂ a∈ ¯A(l)

ca,z− ˆca,p(t) ≤ −y ∪ a∈ ¯A(l) {¯ca,z− ˆca,p(t) ≥ y} ⊂ a∈ ¯A(l) E[ˆca,p(t)] − L √ D2−l− ˆca,p(t) ≤ −y ∪ a∈ ¯A(l) E[ˆca,p(t)] + L √ D2−l− ˆca,p(t) ≥ y ⊂ a∈ ¯A(l) ˆ