Quantile optimization for heavy-tailed distributions ... - CASTLE Lab
QUANTILE OPTIMIZATION FOR HEAVY-TAILED DISTRIBUTIONS USING ASYMMETRIC SIGNUM FUNCTIONS

JAE HO KIM∗, WARREN B. POWELL†, AND RICARDO A. COLLADO‡

Abstract. In this article, we present a provably convergent algorithm for computing the quantile of a continuous random variable that requires neither the existence of the expectation nor the storage of all of the sample realizations. We then present an algorithm for optimizing the quantile of a random function satisfying some strict monotonicity and differentiability properties. This random function may be characterized by a heavy-tailed distribution where the expectation is not defined. The algorithm is illustrated in the context of electricity trading in the presence of storage, where electricity prices are known to be heavy-tailed with infinite variance.

Key words. quantile optimization, stochastic gradient, signum functions, energy systems
1. Introduction. For the past several decades, academic research in stochastic control and optimization has focused mainly on maximizing or minimizing the expectation of a random process [12, 19, 21, 33, 34, 37, 38]. A separate line of research has focused on worst-case outcomes under the umbrella of robust optimization [3]. However, when we are dealing with a non-stationary and complex or heavy-tailed system, the expectation is difficult or impossible to compute, and it is often meaningless to even try to approximate it. In those cases, it is rational to optimize our policy based on quantiles, which are zeroth-order statistics derived directly from the probability distribution. The cumulative distribution function (CDF) and the inverse CDF, also known as the quantile function, can almost always be computed in a practical manner. Optimizing expectations is especially difficult in various financial markets including equities, commodities, and electricity markets, many of which are notorious for being non-stationary, extremely volatile, and heavy-tailed [2, 4, 5, 8]. For example, daily volatilities of 20-30% are common in electricity markets, while the equity market also exhibits occasional heavy-tailed crashes, as happened during the 1987 meltdown, the LTCM crisis in 1998, the bursting of the dot-com bubble in 2001, and, of course, the 2008 financial meltdown.
A heavy-tailed distribution is defined by the structure of the decline in probabilities for large deviations. In a heavy-tailed environment, the usual statistical tools at our disposal can be tricked into producing erroneous results from a finite sample of observations, leading to wrong conclusions. For example, suppose we want to compute the variance from a finite number of i.i.d. samples whose underlying distribution is heavy-tailed, so that the second and higher-order moments are not finite. If we use the standard empirical formula for computing the variance, we will get "a number," giving the illusion of finiteness, but the estimates typically show a high level of instability and grow steadily as we add more and more samples [43]. In this article, we present a framework for optimizing the median or any other quantile of a stochastic process subject to some strict monotonicity properties and show how it can be applied to optimizing a trading policy in the electricity markets. Since the quantile function always exists and is relatively stable even in a highly volatile and non-stationary environment, optimizing the quantile is a robust procedure, and almost always a reasonable thing to do in a complex, volatile, and heavy-tailed environment or in a situation where the worst-case scenario is unknown or too extreme.

∗ email: jaek@princeton.edu
† email: powell@princeton.edu. Author supported in part by a grant from SAP and AFOSR contract FA9550-08-1-0195.
‡ email: rcollado@princeton.edu
For the past decade, many countries have deregulated their electricity markets. Electricity prices in deregulated markets are known to be very volatile, and as a result participants in those markets face large risks. Recently, there has been growing interest among electricity market participants in the possibility of using storage devices to mitigate the effect of the volatility in electricity prices. However, storage devices have capacity limits and significant conversion losses, and it is unclear how one should use a storage device to maximize its value. Fortunately, unlike stock prices, electricity prices in deregulated markets are well known to exhibit reversion to the long-term "average," allowing us to make directional bets [6, 7, 26, 32, 39]. The value of a storage device is determined by how good we are at making such bets: buying electricity when we believe the price will go up and selling electricity when we believe the price will go down. When using a storage device, we automatically lose 10-30% of the energy through conversion losses. The conversion loss can be seen as a transaction cost, and hence the storage is valuable only if the price process is volatile enough that we can sell at a price sufficiently higher than the price at which we buy to make up for the loss due to conversion.
The statistics community has long recognized that, in the context of linear regression, minimizing the mean-squared or absolute-value error is an arbitrary choice. For certain applications, minimizing some quantile of the error terms instead can be a viable alternative. This approach is generally referred to as quantile regression, which minimizes the empirical expectation of a loss function on a given finite sample set via deterministic optimization. A comprehensive review of quantile regression and the relevant literature can be found in [20]. Meanwhile, the signal processing community has recognized that when the data is non-stationary and highly volatile, applying median filters [1] based on order statistics yields better results than applying traditional filters based on minimizing empirical expectations.

The method described in this article is different from the ones described in [20] or [1]. Instead of applying order statistics and deterministic algorithms to a given data set, we use an adaptation of the recursive algorithm pioneered by Robbins and Monro. In their seminal 1951 paper [36], Robbins and Monro addressed the problem of estimating the α-quantile of a distribution function using response/no-response data and a recursive algorithm that looks quite similar to Newton's method. Just as in the classical Newton's method, Robbins and Monro's method depends on a stepsize sequence with a set of properties that the authors named type 1/n. Crucial to this method is the existence of the expectation of the related random variables used in the model. This quantile estimation problem is a simple application of a fundamental recursive method developed in [36] for stochastic root-finding, which is considered by many to be the foundation of stochastic approximation. The Robbins-Monro recursive algorithm has spawned a large body of research on stepsizes, stopping conditions, and related root-finding methods; see [15, 16, 23, 24, 27]. Later, the method was generalized to the use of adaptive stepsizes with stochastic properties. See [22, 41] for an overview of the development of adaptive stepsizes and extensions of the Robbins-Monro method.
While general algorithms for estimating quantiles require storing the history of observations, our adaptation of the Robbins-Monro method performs quantile optimization without this requirement, making our implementation much more compact. The first contribution of this article is the development of an algorithm which avoids the need to store the history of our process by replacing the stochastic gradient with the asymmetric signum function. Unlike the stochastic approximation algorithms that are derived from Robbins and Monro [36], we do not assume that the expectation of the random variable or random function exists. This is a fundamental difference that makes our algorithm applicable to heavy-tailed random variables and random functions whose mean and variance are not necessarily well-defined.

Our next contribution is the development of a quantile optimization problem with an objective function that satisfies some strict monotonicity and differentiability properties. We use the quantile algorithm to develop a recursive method with a stochastic stepsize to solve our quantile optimization problem. Then we apply our optimization problem to trading electricity in the spot market in the presence of storage. This application is in a heavy-tailed environment where the expectation is not necessarily well defined and the application of other methods would fail.

This article is organized as follows. In §2, we present a provably convergent algorithm that allows us to find any quantile of a continuous random variable. In §3, we present a framework for optimizing the quantile of a random function satisfying some strict monotonicity and differentiability properties that are defined in §3.1. In §4, we apply our quantile optimization method to trading electricity in the spot market in the presence of storage. In §5, we compare the trading strategy for maximum profit based on quantiles to the standard trading policy based on the hour of the day. In §6, we summarize our conclusions.
2. Computing the Quantile of a Random Variable. In this section, we provide a simple, provably convergent algorithm that allows us to compute the quantile of a continuous random variable without storing the history of observations. Let (Ω, F) be a sample space with set of all possible outcomes Ω. Let X₁, X₂, . . . be a sequence of continuous i.i.d. random variables on Ω such that −∞ < X_n < ∞, ∀n ≥ 1. For z ∈ ℝ and n ≥ 1, let F_n(z) = P[X_n ≤ z] be the right-continuous CDF of the random variable X_n. The i.i.d. property of (X_n)_{n≥1} guarantees that all the CDFs F_n are identical, so we can simply denote them by F. For any α ∈ (0, 1), the α-quantile of X_n is defined by

    q_α := inf{z ∈ ℝ : P[X_n ≤ z] ≥ α} = inf{z ∈ ℝ : F(z) ≥ α}.

We can see from the last equality that q_α is well defined due to the right-continuity of F.

We consider a filtration F₀ ⊆ · · · ⊆ F_n ⊆ · · · ⊆ F where each F_n is generated by the information given up to time n:

    F_n = σ(x₀, . . . , x_n),

where x_i is the outcome of the random variable X_i, 0 ≤ i ≤ n. We assume that for each n ≥ 0, X_n is F_n-measurable, but we do not assume that E[X_n] exists. For example, (X_n)_{n≥0} can be i.i.d. Cauchy random variables.

Given F_n, the most common and straightforward method for approximating q_α is to sort (X_i)_{0≤i≤n} in increasing order and pick the ⌈nα⌉-th smallest number. However, in order to find the exact quantile using this method, we would need an infinite amount of memory to store all (X_i)_{0≤i≤n} as n → ∞, which is not practical. We present an algorithm that only requires us to store one estimate of the quantile at any given time and show that it converges to the true quantile as n → ∞.
For α ∈ (0, 1) define the asymmetric signum function as

    sgn_α(u) = { 1 − α, if u ≥ 0,
               { −α,    if u < 0.

Theorem 2.1. Let Y₀ ∈ ℝ be some finite number and

    Y_n = Y_{n−1} − γ_{n−1} sgn_α(Y_{n−1} − X_n),

where γ_{n−1} ≥ 0 a.s. and (γ_n)_{n=0}^∞ is a stochastic stepsize sequence adapted to the filtration F₀ ⊆ · · · ⊆ F_n ⊆ · · · ⊆ F satisfying

    Σ_{n=0}^∞ γ_n = ∞ a.s.

and

    Σ_{n=0}^∞ (γ_n)² < ∞ a.s.

Then,

    lim_{n→∞} Y_n = q_α a.s.
Proof. We know that Y_n is F_n-measurable and

    (Y_n − q_α)² = (Y_{n−1} − q_α)² − 2γ_{n−1}(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) + (γ_{n−1})² λ_n,

where

    λ_n := sgn_α²(Y_{n−1} − X_n),

and hence 0 ≤ λ_n ≤ 1. Since

    E[sgn_α(Y_{n−1} − X_n) | F_{n−1}] = (1 − α) P[X_n ≤ Y_{n−1} | F_{n−1}] − α P[X_n > Y_{n−1} | F_{n−1}]
                                      = (1 − α) P[X_n ≤ Y_{n−1} | F_{n−1}] − α (1 − P[X_n ≤ Y_{n−1} | F_{n−1}])
                                      = P[X_n ≤ Y_{n−1} | F_{n−1}] − α
                                      ≤ 0 if and only if Y_{n−1} ≤ q_α,

we know that

    E[(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) | F_{n−1}] ≥ 0.    (2.1)

Therefore,

    E[(Y_n − q_α)² | F_{n−1}] ≤ (Y_{n−1} − q_α)² + (γ_{n−1})², ∀n ≥ 1.    (2.2)

Next, define

    W_n := (Y_n − q_α)² + Σ_{m=n}^∞ (γ_m)², ∀n ∈ ℕ₊.    (2.3)
Then, from (2.2),

    E[W_{n+1} | F_n] ≤ W_n, ∀n ∈ ℕ₊,

showing that (W_n)_{n≥1} is a positive supermartingale. Then, by the martingale convergence theorem,

    lim_{n→∞} (Y_n − q_α)² = lim_{n→∞} W_n = W_∞ a.s.,    (2.4)

where

    0 ≤ W_∞ < ∞.

Moreover, since

    0 ≤ E[W_n | F₀] ≤ W₀ < ∞, ∀n ∈ ℕ₊,

by the dominated convergence theorem,

    E[W_∞ | F₀] = E[lim_{N→∞} W_N | F₀] = lim_{N→∞} E[W_N | F₀].    (2.5)

Define

    β_{n−1,n} := E[(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) | F_{n−1}].
From (2.1), β_{n−1,n} ≥ 0, and hence

    β_{0,n} := E[(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) | F₀]
             = E[E[(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) | F_{n−1}] | F₀]
             = E[β_{n−1,n} | F₀] ≥ 0.

Next, from (2.3),

    W_N − (Y₀ − q_α)² = (Y_N − q_α)² − (Y₀ − q_α)² + Σ_{m=N}^∞ (γ_m)²
                      = Σ_{n=1}^N [(Y_n − q_α)² − (Y_{n−1} − q_α)²] + Σ_{m=N}^∞ (γ_m)²
                      = −2 Σ_{n=1}^N γ_{n−1}(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) + Σ_{n=1}^N (γ_{n−1})² λ_n + Σ_{m=N}^∞ (γ_m)².

Therefore, from (2.5),

    E[W_∞ | F₀] − (Y₀ − q_α)² = lim_{N→∞} E[W_N − (Y₀ − q_α)² | F₀]
        = lim_{N→∞} [ Σ_{n=1}^N (γ_{n−1})² λ_n + Σ_{m=N}^∞ (γ_m)² − 2 Σ_{n=1}^N γ_{n−1} E[(Y_{n−1} − q_α)·sgn_α(Y_{n−1} − X_n) | F₀] ]
        = Σ_{n=1}^∞ (γ_{n−1})² λ_n − 2 Σ_{n=1}^∞ γ_{n−1} β_{0,n}.
Since the left-hand side is finite and

    Σ_{n=1}^N (γ_{n−1})² λ_n + Σ_{m=N}^∞ (γ_m)² ≤ Σ_{m=0}^∞ (γ_m)² < ∞,

we must have

    Σ_{n=1}^∞ γ_{n−1} β_{0,n} < ∞.

Since β_{0,n} ≥ 0 and

    Σ_{n=1}^∞ γ_{n−1} = ∞,

we conclude that

    lim inf_{n→∞} β_{0,n} = lim inf_{n→∞} E[β_{n−1,n} | F₀] = 0.
Therefore there is a subsequence (n*) ⊆ (n)_{n=0}^∞ such that

    lim_{n*→∞} β_{0,n*} = lim_{n*→∞} E[β_{n*−1,n*} | F₀] = 0.

However, since β_{n*−1,n*} is a non-negative random variable, this implies that β_{n*−1,n*} goes to 0 with probability 1 as n* → ∞. In other words,

    lim_{n*→∞} β_{n*−1,n*} = lim_{n*→∞} (Y_{n*−1} − q_α) E[sgn_α(Y_{n*−1} − X_{n*}) | F_{n*−1}] = 0 a.s.

Since

    E[sgn_α(Y_{n*−1} − X_{n*}) | F_{n*−1}] = 0 if and only if Y_{n*−1} = q_α,

    lim_{n*→∞} Y_{n*} = q_α a.s.    (2.6)

This proves our goal, but only for the subsequence (Y_{n*}). In order to get the convergence of the whole sequence (Y_n)_{n=0}^∞ we have to look back at equation (2.4), from which we obtain that

    lim_{n→∞} |Y_n − q_α| = √W_∞ a.s.,    (2.7)

where 0 ≤ W_∞ < ∞. Let Y_n⁺, Y_n⁻ be the random variables

    Y_n⁺ = { Y_n,       if Y_n ≥ q_α,
           { Y_{n−1}⁺,  otherwise,

and

    Y_n⁻ = { Y_n,       if Y_n < q_α,
           { Y_{n−1}⁻,  otherwise.

We can see from (2.7) that

    lim_{n→∞} (Y_n⁺ − q_α) = √W_∞ a.s.

and

    lim_{n→∞} (q_α − Y_n⁻) = √W_∞ a.s.,

which implies that lim_{n→∞} Y_n⁺ = q_α + √W_∞ a.s. and lim_{n→∞} Y_n⁻ = q_α − √W_∞ a.s. By (2.6) we can conclude that at least one of Y_n⁺ or Y_n⁻ converges almost surely to q_α, which implies that W_∞ = 0 and so

    lim_{n→∞} Y_n⁺ = q_α = lim_{n→∞} Y_n⁻ a.s.

Clearly,

    Y_n = { Y_n⁺, if Y_n ≥ q_α,
          { Y_n⁻, otherwise,

and so

    lim_{n→∞} Y_n = q_α a.s.,

as we set out to prove. □
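The recursion of Theorem 2.1 is easily implemented. The following Python sketch is illustrative: the stepsize γ_n = a/n (which satisfies both stepsize conditions) and the sample distributions are assumptions made for the demonstration, not prescriptions from the theorem.

```python
import math
import random

def sgn_alpha(u, alpha):
    # Asymmetric signum function: 1 - alpha for u >= 0, -alpha for u < 0.
    return 1.0 - alpha if u >= 0 else -alpha

def quantile_recursion(stream, alpha, y0=0.0, a=1.0):
    # Y_n = Y_{n-1} - gamma_{n-1} * sgn_alpha(Y_{n-1} - X_n) with gamma_n = a/n,
    # so that sum gamma_n diverges while sum gamma_n^2 stays finite.
    y = y0
    for n, x in enumerate(stream, start=1):
        y -= (a / n) * sgn_alpha(y - x, alpha)
    return y

random.seed(7)

# 0.9-quantile of Uniform(0, 1): the true value is 0.9.
y_unif = quantile_recursion((random.random() for _ in range(200_000)), alpha=0.9)

# Median of a standard Cauchy stream: the true value is 0,
# even though E[X_n] does not exist.
y_cauchy = quantile_recursion(
    (math.tan(math.pi * (random.random() - 0.5)) for _ in range(200_000)),
    alpha=0.5, a=2.0)

print(y_unif, y_cauchy)
```

Note that only the current estimate is stored; the history of observations is never kept, and the heavy-tailed Cauchy outliers cause no instability because the update magnitude is bounded by γ_n max(α, 1 − α).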
3. Optimizing the Quantile of a Random Function. In this section, we show how to optimize the quantile of an almost everywhere differentiable random function. We show that a monotonicity condition relating a random function and its derivative with respect to a sample realization allows us to optimize the quantile of the random function using a simple, provably convergent algorithm. We also show that the well-known newsvendor problem has the aforementioned monotonicity.

3.1. Monotonicity Assumptions. Let Ω be the set of all possible outcomes and let X ⊆ ℝ be a compact control space. We start by defining our control variable x ∈ X as a scalar for a clear and concise presentation of our concepts and the proof of convergence for our algorithm. Then, we generalize our process to the vector-valued case. Let f : X × Ω → ℝ be a function satisfying the following conditions:
(C1) f(·, ω) is convex and continuous for all ω ∈ Ω;
(C2) f(·, ω) is differentiable almost everywhere for all ω ∈ Ω;
(C3) f(x, ·) is a real-valued random variable for all x ∈ X.
For example, f(x, ω) may be piecewise-linear with respect to x ∈ X with a finite number of break-points for all fixed ω ∈ Ω. Let f(x) denote the random variable whose specific sample realization is f(x, ω). Also, let Q_α denote the operator such that for any random variable X,

    Q_α X = inf{z ∈ ℝ : P[X ≤ z] ≥ α}
gives the α-quantile of the random variable X.

For x ∈ X let F_x(z) = P[f(x) ≤ z] be the right-continuous CDF of the random variable f(x). Then the α-quantile of f(x) is given by

    Q_α f(x) = inf{z ∈ ℝ : P[f(x) ≤ z] ≥ α} = inf{z ∈ ℝ : F_x(z) ≥ α},

where, as we saw before, Q_α f(x) is guaranteed to exist due to the right-continuity of F_x.

Let z* := Q_α f(x). If F_x is not continuous at z*, then the fact that F_x is nondecreasing and bounded helps us see that

    P[f(x) < z*] = lim_{z→z*−} F_x(z) < lim_{z→z*+} F_x(z) = F_x(z*) = P[f(x) ≤ z*].    (3.1)

From (3.1) and the definition of the distribution of a random variable we conclude that z* is in the image of f(x). Unfortunately, if F_x is continuous at z*, we cannot readily conclude that z* is in the image of f(x). This is an important property for us and thus we add it as a requirement for the function f:
(C4) Q_α f(x) is in the image of f(x) for every x ∈ X.
The following propositions show the monotonicity assumptions that allow us to interchange the derivative ∂/∂x and the quantile operator Q_α.

Proposition 3.1. Assume that

    f(x, ω₁) ≤ f(x, ω₂) if and only if ∂f(x, ω₁)/∂x ≤ ∂f(x, ω₂)/∂x,    (3.2)

∀ω₁, ω₂ ∈ Ω and ∀x ∈ X. Then,

    Q_α [∂f(x)/∂x] = ∂[Q_α f(x)]/∂x, ∀x ∈ X.

Proposition 3.2. Assume that

    f(x, ω₁) ≤ f(x, ω₂) if and only if ∂f(x, ω₁)/∂x ≥ ∂f(x, ω₂)/∂x,    (3.3)

∀ω₁, ω₂ ∈ Ω and ∀x ∈ X. Then,

    Q_{1−α} [∂f(x)/∂x] = ∂[Q_α f(x)]/∂x, ∀x ∈ X.

Proof. By condition (C4) of f(x) there is a sample realization ω* ∈ Ω such that

    Q_α f(x) = f(x, ω*).

Let

    A := {ω ∈ Ω | f(x, ω) ≤ f(x, ω*)}

and

    B := {ω ∈ Ω | f(x, ω) > f(x, ω*)}.

Then, if (3.2) holds,

    ∂f(x, ω₁)/∂x ≤ ∂f(x, ω*)/∂x < ∂f(x, ω₂)/∂x, ∀ω₁ ∈ A, ω₂ ∈ B,

and hence

    Q_α [∂f(x)/∂x] = ∂f(x, ω*)/∂x = ∂[Q_α f(x)]/∂x.

On the other hand, if (3.3) holds,

    ∂f(x, ω₁)/∂x ≥ ∂f(x, ω*)/∂x > ∂f(x, ω₂)/∂x, ∀ω₁ ∈ A, ω₂ ∈ B,

and hence

    Q_{1−α} [∂f(x)/∂x] = ∂f(x, ω*)/∂x = ∂[Q_α f(x)]/∂x. □
We can illustrate Proposition 3.2 through the following example. Suppose X = [0, 1] and

    f(x) = (1 − x) U, ∀x ∈ X,

where U ∼ U(0, 1) is a uniform random variable. Then, ∀ω₁, ω₂ ∈ Ω and ∀x ∈ X,

    f(x, ω₁) = (1 − x) U(ω₁) ≤ (1 − x) U(ω₂) = f(x, ω₂)

if and only if U(ω₁) ≤ U(ω₂), and thus

    ∂f(x, ω₁)/∂x = −U(ω₁) ≥ −U(ω₂) = ∂f(x, ω₂)/∂x,

satisfying (3.3). Next, Q_α f(x) = α(1 − x) and hence

    ∂[Q_α f(x)]/∂x = −α, ∀0 < α < 1.

On the other hand,

    ∂f(x)/∂x = −U and hence Q_{1−α}[∂f(x)/∂x] = −α,

satisfying Proposition 3.2.
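A quick Monte Carlo check of this example (an illustrative sketch; the sample size and the value of α are arbitrary assumptions): the empirical (1 − α)-quantile of realizations of ∂f/∂x = −U should sit near ∂[Q_α f(x)]/∂x = −α.

```python
import random

random.seed(0)
alpha = 0.3
n = 100_000

# Realizations of the derivative: d/dx f(x, omega) = -U(omega) for every x.
derivs = sorted(-random.random() for _ in range(n))

# Empirical (1 - alpha)-quantile of -U; by Proposition 3.2 this should be
# close to d/dx Q_alpha f(x) = -alpha.
q_deriv = derivs[int((1 - alpha) * n)]
print(q_deriv)
```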
Note that the classical stochastic gradient algorithm for optimizing the expectation of a function works because the expectation operator can be interchanged with the differential operator as long as the expectation is finite. We do not need this condition when optimizing quantiles. Our goal is to develop an algorithm, analogous to the classical stochastic gradient algorithm, that allows us to optimize the quantile of a function.
3.2. Quantile Optimization. We begin by stating the algorithm for quantile optimization and proving its convergence.

From this point on we assume that the function f and α ∈ (0, 1) are such that

    argmin_{x∈X} Q_α f(x)    (3.4)

is a real number. For example, one way to accomplish (3.4) is to assume that there is a metric on Ω that makes the CDF F_x(z) : X × ℝ → ℝ uniformly continuous. Then we can conclude that Q_α f(·) is a lower semi-continuous function and that (3.4) is well defined.
Theorem 3.3. Let f : X × Ω → ℝ be a function as defined before. Define

    x^α := argmin_{x∈X} Q_α f(x).

Let x₀ ∈ X be fixed. For n ∈ ℕ₊ and x_{n−1} ∈ X, let f_n(x_{n−1}) be the random variable whose distribution is that of f(x_{n−1}). Define F_n as the sigma-algebra on Ω generated by the information given up to time n:

    F_n = σ(f₁(x₀), . . . , f_n(x_{n−1})).

Denote

    f′_n(x_{n−1}) := ∂f_n(x, ω)/∂x |_{x=x_{n−1}}.

Let

    x_n = { x_{n−1} − γ_{n−1} sgn_α(f′_n(x_{n−1})),     if (3.2) holds for f(x),
          { x_{n−1} − γ_{n−1} sgn_{1−α}(f′_n(x_{n−1})), if (3.3) holds for f(x),

∀n ∈ ℕ₊, where γ_{n−1} ≥ 0 a.s. and (γ_n)_{n=0}^∞ is a stochastic stepsize sequence adapted to the filtration F₁ ⊆ F₂ ⊆ · · · ⊆ F satisfying

    Σ_{n=0}^∞ γ_n = ∞ a.s.    (3.5)

and

    Σ_{n=0}^∞ (γ_n)² < ∞ a.s.    (3.6)

Then,

    lim_{n→∞} x_n = x^α a.s.

Proof. Notice first that by construction the sequence of random variables (f_n(x_{n−1}))_{n=1}^∞ is adapted to the filtration F₁ ⊆ F₂ ⊆ · · · ⊆ F. Assume (3.2) holds, let x₀ ∈ X and

    x_n = x_{n−1} − γ_{n−1} sgn_α(f′_n(x_{n−1})).
Then,

    (x_n − x^α)² = (x_{n−1} − x^α)² − 2γ_{n−1}(x_{n−1} − x^α)·sgn_α(f′_n(x_{n−1})) + (γ_{n−1})² λ_n^α,

where

    λ_n^α := sgn_α²(f′_n(x_{n−1})).

From assumption (3.2), if x_{n−1} ≤ x^α,

    Q_α f′_n(x_{n−1}) = ∂[Q_α f(x)]/∂x |_{x=x_{n−1}} ≤ 0,

and if x_{n−1} > x^α,

    Q_α f′_n(x_{n−1}) = ∂[Q_α f(x)]/∂x |_{x=x_{n−1}} > 0.

Therefore,

    E[sgn_α(f′_n(x_{n−1})) | F_{n−1}] = (1 − α) P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}] − α P[f′_n(x_{n−1}) < 0 | F_{n−1}]
                                      = (1 − α) P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}] − α (1 − P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}])
                                      = P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}] − α
                                      ≤ 0 if and only if x_{n−1} ≤ x^α,

and hence

    E[(x_{n−1} − x^α)·sgn_α(f′_n(x_{n−1})) | F_{n−1}] ≥ 0.

Then,

    E[(x_n − x^α)² | F_{n−1}] ≤ (x_{n−1} − x^α)² + (γ_{n−1})², ∀n ≥ 1.

Using the same proof technique used in §2, we can show that

    lim_{n→∞} x_n = x^α a.s.
Similarly, if assumption (3.3) holds instead, we can let x₀ ∈ X and

    x_n = x_{n−1} − γ_{n−1} sgn_{1−α}(f′_n(x_{n−1})).

Then,

    (x_n − x^α)² = (x_{n−1} − x^α)² − 2γ_{n−1}(x_{n−1} − x^α)·sgn_{1−α}(f′_n(x_{n−1})) + (γ_{n−1})² λ_n^{1−α},

where

    λ_n^{1−α} := sgn_{1−α}²(f′_n(x_{n−1})).

Under assumption (3.3), if x_{n−1} ≤ x^α,

    Q_{1−α} f′_n(x_{n−1}) = ∂[Q_α f(x)]/∂x |_{x=x_{n−1}} ≤ 0,

and if x_{n−1} > x^α,

    Q_{1−α} f′_n(x_{n−1}) = ∂[Q_α f(x)]/∂x |_{x=x_{n−1}} > 0.

Therefore,

    E[sgn_{1−α}(f′_n(x_{n−1})) | F_{n−1}] = α P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}] − (1 − α) P[f′_n(x_{n−1}) < 0 | F_{n−1}]
                                          = α P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}] − (1 − α)(1 − P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}])
                                          = P[f′_n(x_{n−1}) ≥ 0 | F_{n−1}] − (1 − α)
                                          ≤ 0 if and only if x_{n−1} ≤ x^α,

and hence

    E[(x_{n−1} − x^α)·sgn_{1−α}(f′_n(x_{n−1})) | F_{n−1}] ≥ 0.

Again,

    E[(x_n − x^α)² | F_{n−1}] ≤ (x_{n−1} − x^α)² + (γ_{n−1})², ∀n ≥ 1,

and we can show that

    lim_{n→∞} x_n = x^α a.s. □
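The recursion of Theorem 3.3 can be illustrated on the example f(x) = (1 − x)U from §3.1, which satisfies (3.3) and hence uses the sgn_{1−α} update. Since X = [0, 1] is compact and the minimizer of Q_α f(x) = α(1 − x) is the boundary point x = 1, the sketch below projects each iterate back onto X; this projection step is an implementation assumption, not part of the theorem statement.

```python
import random

def sgn_alpha(u, a):
    # Asymmetric signum function.
    return 1.0 - a if u >= 0 else -a

random.seed(3)
alpha = 0.3
x = 0.0                                # x_0 in X = [0, 1]
for n in range(1, 2001):
    u = random.random()                # sample realization of U
    deriv = -u                         # f'(x, omega) = -U(omega), for any x
    x -= (1.0 / n) * sgn_alpha(deriv, 1.0 - alpha)   # (3.3) holds: sgn_{1-alpha}
    x = min(max(x, 0.0), 1.0)          # project back onto the compact X
print(x)                               # drifts to the minimizer x = 1
```

Because the derivative realizations are negative almost surely, every step pushes x up by γ_{n−1}(1 − α), and the projected iterates reach and remain at x = 1.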
While we have shown the proof for scalar x, the algorithm can be readily extended to the multivariate case as long as the monotonicity assumption holds in each of the variables and corresponding partial derivatives. For a simple illustration, suppose f(x⃗, ω) is convex in x⃗ = (x¹, x², . . . , xᵐ) ∈ X for some m-dimensional compact space X ⊂ ℝᵐ and fixed ω ∈ Ω.

Theorem 3.4. For each 1 ≤ i ≤ m, assume that either

    ∂f(x⃗, ω₁)/∂xⁱ ≤ ∂f(x⃗, ω₂)/∂xⁱ if and only if f(x⃗, ω₁) ≤ f(x⃗, ω₂)    (3.7)

holds true or

    ∂f(x⃗, ω₁)/∂xⁱ ≥ ∂f(x⃗, ω₂)/∂xⁱ if and only if f(x⃗, ω₁) ≤ f(x⃗, ω₂)    (3.8)

holds true, ∀ω₁, ω₂ ∈ Ω and ∀x⃗ ∈ X. For n ∈ ℕ₊ and x⃗_{n−1} ∈ X, let f_n(x⃗_{n−1}) be the F_n-measurable random variable whose distribution is that of f(x⃗_{n−1}). Then, if we let

    x⃗₀ = (x¹₀, x²₀, . . . , xᵐ₀) for some x⃗₀ ∈ X

and

    xⁱ_n = { xⁱ_{n−1} − γⁱ_{n−1} sgn_α(∂f_n(x⃗)/∂xⁱ |_{x⃗=x⃗_{n−1}}),     if (3.7) holds true,
           { xⁱ_{n−1} − γⁱ_{n−1} sgn_{1−α}(∂f_n(x⃗)/∂xⁱ |_{x⃗=x⃗_{n−1}}), if (3.8) holds true,

while the stochastic stepsizes γⁱ_n ≥ 0 satisfy

    Σ_{n=0}^∞ γⁱ_n = ∞ a.s.

and

    Σ_{n=0}^∞ (γⁱ_n)² < ∞ a.s.,

∀1 ≤ i ≤ m, then

    lim_{n→∞} x⃗_n = x⃗^α a.s.,

where

    x⃗^α = argmin_{x⃗∈X} Q_α f(x⃗).
3.3. Initialization and the Scaling of Stepsizes. We have shown that the algorithms for finding a quantile of a random variable and for optimizing a quantile of a random function converge as long as the stepsize γ_n ≥ 0 a.s. satisfies (3.5) and (3.6). Stepsizes for optimizing the expectation of a random function have been studied extensively; a comprehensive review of the literature can be found in [9, 33]. The typical approach for finding the optimal stepsizes in the case of optimizing expectations is as follows. Suppose (X_i)_{1≤i≤n} are i.i.d. random variables with finite variance and we want to find the stepsizes (γ_{i−1})_{1≤i≤n} that give the best estimate of E[X_i]. Then, the mean-squared error

    min_{γ₀,...,γ_{n−1}} E[ ( Σ_{i=1}^n γ_{i−1} X_i − E[X_i] )² ]

is minimized when γ_{n−1} = 1/n, ∀n. When (X_i)_{1≤i≤n} are not i.i.d., different stepsizes can give better results.

However, when computing quantiles using asymmetric signum functions, it is difficult to determine what the optimal stepsizes should be, even in the i.i.d. case, because it is unclear what we should try to minimize: we cannot minimize the expectation or variance in a heavy-tailed environment. Thus, instead of trying to identify stepsizes that are optimal in some sense, we present an ad-hoc method for practical use. Let (X_i)_{1≤i≤n} be i.i.d. random variables and assume we want to compute some α-quantile. Suppose, for example, that the α-quantile of X_i is 10⁶. Then, if we use the algorithm

    Y_n = Y_{n−1} − γ_{n−1} sgn_α(Y_{n−1} − X_n)

with Y₀ = 0 and γ_{n−1} = 1/n, it will take prohibitively long for the algorithm to give us a reasonable estimate. Thus, finding a reasonable initialization point Y₀ and stepsizes γ_{n−1} is critical. We propose initially taking sample realizations (X_i)_{1≤i≤k} for some small k, sorting them in increasing order, picking the ⌈kα⌉-th smallest number, and letting it be Y₀. Next, we let

    γ_{n−1} := σ/n,    (3.9)

where

    σ = (1/2) | (.75-quantile of the set (X_i)_{1≤i≤k}) − (.25-quantile of the set (X_i)_{1≤i≤k}) |

is a scaling factor.

For the quantile optimization problem, first pick points (x_m)_{1≤m≤M} evenly dividing the compact space X for some small M. For each x_m, we can generate (F_i(x_m))_{1≤i≤k} to find an approximation to the α-quantile of F(x_m). Then, we let our initialization point x₀ be the one that gives the minimum α-quantile. That is, initialization requires generating M × k samples. Next, we let the stepsize be given by equation (3.9), where we replace X_i with F_i(x₀). When optimizing expectations using the standard Robbins-Monro algorithm, the scaling factor has to strike a balance between the units of x ∈ X and the scale of the stochastic gradient, which can be highly volatile, ranging between large negative values and large positive values. However, when optimizing the quantile using our algorithm, the stepsizes have to be scaled based on x alone. If we have a rough idea of where the optimal x may fall, we can scale the stepsize accordingly.
3.4. Newsvendor Problem. To develop an understanding of the properties of the algorithm, we begin by illustrating it in the context of a simple newsvendor problem [31]. Given a random demand D ∈ R_+, we usually want to compute the optimal stocking quantity x ∈ R_+ that maximizes the expected profit

G(x) := E[p min{x, D} − cx],

where p > 0 is the unit price, c > 0 is the unit cost, and p > c. For this simple problem, the optimal solution

x* = Q_{(p−c)/p}(D)

is the (p−c)/p-quantile of the random variable D, which can be computed using the algorithm shown in §2. However, instead of maximizing the expected profit, suppose a manager wants to maximize some α-quantile of the profit,

q_α := Q_α [p min{x, D} − cx].

This is equivalent to minimizing

q_{1−α} := Q_{1−α} [cx − p min{x, D}].
Let Ω be the set of all possible outcomes. For each ω ∈ Ω, let D(ω) be the sample realization of the demand and denote

f(x, ω) := cx − p min{x, D(ω)} = { x(c − p) if x ≤ D(ω); cx − pD(ω) if x > D(ω) }.
In order to enforce compactness of the control set, we restrict x ∈ [0, M], where M is some number big enough to capture any feasible stocking quantity. For example, in the newsvendor problem we could set M to be an upper bound on the number of newspapers produced on any given day.
Let X = [0, M]. Then,

∂/∂x f(x, ω) = { c − p if x < D(ω); c if x > D(ω) }.

In general, the derivative ∂/∂x f(x, ω) is not defined when x = D(ω), but this still agrees with condition (C2) of f(x, ·) being differentiable almost everywhere.
Next, ∀ω_1, ω_2 ∈ Ω and ∀x ∈ X, we know that f(x, ω_1) ≤ f(x, ω_2) if and only if D(ω_1) ≥ D(ω_2), which again is true if and only if

∂/∂x f(x, ω_1) ≤ ∂/∂x f(x, ω_2).
Let (D_n)_{n≥1} be i.i.d. random variables. If we let x_0 = 0 and

x_n = x_{n−1} − (1/n) sgn_α (f′_n (x_{n−1}))
    = { x_{n−1} + α/n if x_{n−1} ≤ D_n; x_{n−1} − (1−α)/n if x_{n−1} > D_n },   (3.10)

where D_n is F_n-measurable, then

lim_{n→∞} x_n = argmax_{x∈X} Q_α [p min{x, D} − cx]   a.s.

from Theorem 3.3. This implies

argmax_{x∈X} Q_α [p min{x, D} − cx] = Q_α D.
For illustration, suppose p = 10, c = 1, and (D_n)_{n≥1} is i.i.d. with a Pareto distribution whose tail index is 1/2:

P[D_n ≤ y] = { 1 − (1/y)^{1/2} if y ≥ 1; 0 if y < 1 }

for all n ≥ 1. For given x ≥ 1, the expected profit is

E[p min{x, D_n}] − cx = E[10 min{x, D_n}] − x
                      = 5 ∫_{y=1}^{x} y^{−1/2} dy + 10x P[D_n > x] − x
                      = 10 y^{1/2} |_{y=1}^{x} + 10 x^{1/2} − x
                      = 20 x^{1/2} − 10 − x.
Then, we obtain a maximum expected profit of 90 if

x* = Q_{(p−c)/p} D_n = Q_{9/10} D_n = 100.

However, while x* = 100 maximizes the expected profit, the risk that we lose a lot of money is very high. For our newsvendor example, the probability that we lose at least $50 while following our optimal policy is given by

P[p min{x*, D_n} − cx* ≤ −50] = P[10 min{100, D_n} ≤ 50]
                              = P[D_n ≤ 5] = 1 − (1/5)^{1/2} ≈ .55.
A policy that has a 55% chance of losing 50% or more of the initial investment of 100 is a disastrous policy, even if the expected return on investment is 90%. In a heavy-tailed environment, focusing on the expected profit leads one to take on imprudent risks. The above policy has only a 30% chance of making a 10% return or more. Instead of trying to maximize the expectation, suppose we want to maximize some quantile. Since

argmax_{x∈X} Q_α [p min{x, D_n} − cx] = Q_α D_n = 1/(1−α)^2,
we maximize the median profit by choosing x* = 4. Then, the median profit is 36 while the expected profit is 26, and the minimum profit we can make is 6. That is, we have a 100% chance of earning at least a 150% return on the initial investment, our expected return on investment is 650%, and the median return is 900%. Again, when x* = 4, we have a 50% chance of making a profit of 36 or more, while we only have a 30% chance of making a profit of 10 or more when x* = 100. In fact, x* = 100 maximizes the 0.9-quantile of the profit. This example illustrates that it is not always reasonable to try to maximize the expectation, even when we can compute it. The merit of the more conservative policy derived from quantile optimization becomes even more conspicuous if we conduct independent experiments multiple times and observe what happens to the cumulative profit. Let
profit. Let<br />
Ui = 10 min {100, Di} − 100<br />
be the profit of a business that follows the expectation-maximizing policy (x∗ = 100)<br />
at the ith iteration. Let<br />
n<br />
Vn =<br />
be the cumulative profit of a business after the n th iteration. Left plot on the Figure<br />
3.1 shows ten different sample paths of Vn as well as their average.<br />
Next, let<br />
i=1<br />
Ui<br />
X_i = 10 min{4, D_i} − 4

be the profit of a business that follows the median-maximizing policy (x* = 4) at the i-th iteration. Let

Y_n = Σ_{i=1}^{n} X_i

be the cumulative profit. The right plot of Figure 3.1 shows several different sample paths following the median-maximizing policy. Comparing the two plots, we see that while V_n grows much faster than Y_n, and is thus much more profitable on average, it is also far more volatile. We can observe how risky the expectation-maximizing policy can be: one sample path of V_n falls below −$700 after the 11th iteration, implying that the business would have gone bankrupt if it did not have initial capital in excess of $700.

Fig. 3.1. Cumulative profit obtained by following the expectation-maximizing policy (left) and by following the median-maximizing policy (right), for ten sample paths.

For the above example, we happened to know closed-form expressions for the quantiles, but there are many applications where we do not, and often we do not even know which distribution we are sampling from. In such cases, we must rely on the algorithm (3.10). Figure 3.2 shows the convergence of the algorithm (3.10) for α = 0.5, 0.684, and 0.8, where x_n → 4, 10, and 25, respectively. The horizontal axis is shown on a logarithmic scale.

Fig. 3.2. Convergence of x_n when α = 0.5, 0.684, and 0.8.

Table 3.1 shows the expected profit and the probability of loss for different quantiles.
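Recursion (3.10) can be run directly on the Pareto example. The sketch below uses the σ/n stepsize scaling of §3.3 with an illustrative scale of 7 (roughly half the interquartile range of this demand distribution) rather than raw 1/n steps, which would move far too slowly at this scale; the starting point and projection bound are also our own choices:

```python
import numpy as np

def newsvendor_quantile(alpha, n_iters=500_000, scale=7.0, seed=2):
    """Recursion (3.10): x_n = x_{n-1} - gamma_n * sgn_alpha(f'_n(x_{n-1})).
    Since f'_n < 0 when x <= D_n and f'_n > 0 otherwise, the update raises x by
    alpha*gamma_n on under-stocking and lowers it by (1-alpha)*gamma_n on over-stocking."""
    rng = np.random.default_rng(seed)
    demand = rng.uniform(size=n_iters) ** -2.0      # Pareto demand, tail index 1/2
    x = 1.0
    for n, d_n in enumerate(demand, start=1):
        gamma = scale / n
        x += gamma * alpha if x <= d_n else -gamma * (1.0 - alpha)
        x = min(max(x, 0.0), 1000.0)                # project onto a compact set X
    return x

print(newsvendor_quantile(0.5))   # should approach Q_{0.5} D = 4
```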
           x*    Expected Profit    Probability of Loss
α = 0.5     4         26                   0
α = 0.684   10        43.2                 0
α = 0.8     25        65                   0.37
α = 0.9     100       90                   0.68

Table 3.1. Optimizing the quantile of the profit for the newsvendor problem.
4. Trading Electricity in the Presence of Storage. Trading on the electricity spot market introduces the challenge of dealing with spot prices, which are well known to be heavy-tailed [2, 4, 8]. In this section, we show how quantile optimization can be used to derive a trading strategy in the presence of storage. At each time t, we sign a contract and determine whether to buy or sell electricity during the time interval [t, t+1). We propose a trading strategy and a performance metric that can be assumed to be approximately i.i.d., thus allowing us to apply our quantile optimization algorithm. We use electricity spot market price data from Ercot and PJM West Hub. Ercot is the grid operator that covers most of Texas; PJM West covers the Chicago metropolitan area. Ercot and PJM represent competitive electricity markets where the price is settled by the market based on supply and demand in real time. The electricity price in Ercot is particularly volatile because of the significant role wind energy plays in the market. While the demand for electricity is always random, reliance on intermittent energy such as wind or solar makes the supply side of the equation random as well, making a large mismatch between supply and demand more likely and resulting in violent price swings.
4.1. Model. Throughout this section, we assume the amount of electricity we can deliver to the market by discharging our storage for a one-hour period is one megawatt-hour. We assume that the round-trip efficiency of our storage is 0 < ρ < 1. For most commercially available storage devices, the value of ρ is between .65 and .80 [40]. We must import 1/ρ megawatt-hours of electricity when charging our storage device for every megawatt-hour of energy that we want to release. Throughout this section, we assume that we have an infinite-capacity storage device. Moreover, we assume that we are sufficiently small compared to the overall market, so that we can always discharge our storage and sell one megawatt-hour of electricity to the market, and we can always buy one megawatt-hour of electricity and charge our storage if we choose to do so.

State Variables:
Let t ∈ N_+ be a discrete time index corresponding to the decision epoch.

p_t = price of electricity per megawatt-hour traded during the time interval [t, t+1), ∀t.

When we discharge our storage one unit to sell one megawatt-hour of electricity, we earn p_t. When we buy electricity to charge our storage, we are buying 1/ρ megawatt-hours of electricity, so we have to pay p_t/ρ.

S_t = the electricity price history over the last 100 hours = (p_{t′})_{t−99≤t′≤t}. The history S_t is needed to compute the quantile of the price process. We interpret S_t as the state of the system at time t.
Decision (Action) Variable:
x_t = decision that tells us to discharge (sell), hold, or charge (buy) our storage at time t, with x_t ∈ {−1, 0, 1}, where x_t = −1 indicates discharge, x_t = 0 indicates hold, and x_t = 1 indicates charge.

Exogenous Process:
p̂_t = p_t − p_{t−1} = random variable that captures the evolution of the price process (p_t)_{t≥0}.

Let Ω be the set of all possible outcomes and let F be a σ-algebra on this set, with the filtration F_t generated by the information given up to time t:

F_t = σ (S_0, x_0, p_1, S_1, x_1, p_2, S_2, x_2, ..., p_t, S_t, x_t).

P is the probability measure on the measure space (Ω, F). We have defined the state of our system at time t as all variables that are F_t-measurable and needed to compute our decision at time t.
Storage Transition Function:

R_{t+1} = R_t + x_t.

If x_t = 1, we import electricity from the market and store it with some conversion factor ρ. That is, we purchase 1/ρ megawatt-hours of electricity from the market, but it only fills one megawatt-hour's worth of electricity in our storage due to the conversion loss. If x_t = −1, the potential energy in the storage is converted into electricity with a conversion factor of 1 and sold to the market.
4.2. Electricity Price Behavior. At time t, let p_t^{(i)} denote the order statistics of the past hundred hours of data (p_{t′})_{t−99≤t′≤t}, sorted in increasing order:

p_t^{(1)} ≤ p_t^{(2)} ≤ ... ≤ p_t^{(100)}.

Then, let

q_t := min{ b ∈ {1, 2, ..., 100} : p_t^{(b)} ≥ p_t }   (4.1)

be the index of p_t in the order statistics. For the hourly electricity spot market prices from Texas Ercot and PJM West Hub,

P[q_t ≤ b] ≈ b/100, ∀b ∈ {1, 2, ..., 100}.   (4.2)
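The statistic (4.1) and the empirical relation (4.2) can be sketched as follows. Lacking the Ercot/PJM data here, we use synthetic i.i.d. lognormal prices as a stand-in; for a continuous i.i.d. series the rank of the newest observation in a 100-hour window is uniform by exchangeability, so (4.2) holds in expectation:

```python
import numpy as np

def rank_index(window):
    """q_t from (4.1): the smallest b with p_t^{(b)} >= p_t, where p_t is the most
    recent price in the window and p_t^{(1)} <= ... <= p_t^{(100)} are its order statistics."""
    order = np.sort(window)
    return int(np.searchsorted(order, window[-1], side="left")) + 1   # 1-based index

rng = np.random.default_rng(3)
prices = rng.lognormal(mean=3.5, sigma=0.5, size=20_000)              # stand-in spot prices
q = np.array([rank_index(prices[t - 99 : t + 1]) for t in range(99, len(prices))])

# Empirical counterpart of (4.2): P[q_t <= b] is roughly b/100
for b in (10, 50, 90):
    print(b, (q <= b).mean())
```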
Next, Table 4.1 shows the behavior of the electricity price for Texas Ercot. One can see that for lower values of q_t, the probability that the price will increase grows, and vice versa. The results are similar for PJM West Hub prices, as shown
b =                                   10    20    30    40    50    60    70    80    90
P[p_{t+1} ≥ (1/0.7) p_t | q_t ≤ b]   .41   .30   .24   .20   .17   .15   .14   .13   .12
P[p_{t+1} ≥ (1/0.75) p_t | q_t ≤ b]  .44   .33   .27   .23   .20   .18   .16   .15   .14
P[p_{t+1} ≥ (1/0.8) p_t | q_t ≤ b]   .47   .37   .32   .28   .25   .22   .21   .19   .18
P[p_{t+1} ≥ (1/0.9) p_t | q_t ≤ b]   .55   .47   .44   .40   .38   .35   .33   .32   .30
P[p_{t+1} ≤ p_t | q_t > b]           .52   .53   .54   .56   .57   .59   .62   .66   .74

Table 4.1. Dependence of electricity price behavior on q_t for Texas Ercot.
b =                                   10     20     30     40     50     60     70     80     90
P[p_{t+1} ≥ (1/0.7) p_t | q_t ≤ b]   .173   .132   .121   .119   .115   .110   .106   .102   .098
P[p_{t+1} ≥ (1/0.75) p_t | q_t ≤ b]  .211   .174   .153   .151   .148   .144   .139   .134   .129
P[p_{t+1} ≥ (1/0.8) p_t | q_t ≤ b]   .249   .205   .193   .191   .190   .186   .181   .176   .171
P[p_{t+1} ≥ (1/0.9) p_t | q_t ≤ b]   .393   .339   .323   .320   .317   .310   .304   .297   .289
P[p_{t+1} ≤ p_t | q_t > b]           .52    .54    .55    .56    .58    .60    .65    .72    .83

Table 4.2. Dependence of electricity price behavior on q_t for PJM West Hub.
in Table 4.2. This is because we may assume that the electricity price is median-reverting, as shown in [18]. According to [18], the electricity price is heavy-tailed and non-stationary, so the empirical mean cannot be used to characterize the behavior of the price. Instead of using standard Gaussian jump-diffusion processes with mean-reversion, [18] shows that the price behavior is more appropriately modelled as median-reverting with an underlying heavy-tailed process. Thus, if the current price level is at the higher end of the prices realized in the past one hundred hours, there is a high probability that the price will go down due to the median-reversion, and vice versa.
4.3. A Robust Trading Policy. We believe that a robust policy for trading electricity is one whose performance is consistent across different sample paths of the price process. A policy optimized on past data should remain optimal going into the future. In other words, if we have a parameterized policy, the parameters that were optimal for year 2008 and the parameters that are optimal for year 2009 should be almost identical. This implies that by the end of year 2008, we would know what policy we will use to trade in year 2009, and we would know that the policy will perform well. Our goal is to devise a robust trading policy that exploits the fact that the electricity price is median-reverting, as shown in the previous section.
In prior research, we have shown that the electricity price is heavy-tailed [18], which implies that our trading policy should only depend on zeroth-order statistics, i.e., the quantile function. Since we do not want negative exposure to heavy-tailed price movements, we must always buy low and sell high, which is possible because the price process is median-reverting.
In the previous section, we showed that we can ascertain how high or low the current price level is by comparing it to the percentile constructed from the order statistics. Given a pair of thresholds 1 ≤ x^L ≤ 50 and 51 ≤ x^H ≤ 100, our policy is

X_t^π (S_t) = { 1 if q_t ≤ x^L; −1 if q_t ≥ x^H; 0 otherwise },   (4.3)

where

π := (x^L, x^H).
x^L determines when to buy electricity. Of course, the lower the threshold x^L, the higher the probability that the price will increase by more than a factor of 1/ρ, as shown in Tables 4.1 and 4.2. On the other hand, for smaller values of x^L we will buy electricity less frequently, and thus have fewer opportunities to make money, as implied by (4.2). Similarly, the higher the threshold x^H, the higher the probability that the price will decrease, but we will be able to sell electricity less frequently, reducing the opportunities to make money.
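Policy (4.3), combined with the storage transition R_{t+1} = R_t + x_t, can be sketched as below. The thresholds and the synthetic prices are illustrative, and unlike the idealized model, this sketch refuses to discharge an empty store:

```python
import numpy as np

def policy(window, x_low, x_high):
    """Policy (4.3): charge when the current price ranks at or below x_low among the
    last 100 hourly prices, discharge when it ranks at or above x_high, else hold."""
    q_t = int(np.searchsorted(np.sort(window), window[-1], side="left")) + 1
    if q_t <= x_low:
        return 1      # buy / charge
    if q_t >= x_high:
        return -1     # sell / discharge
    return 0          # hold

rng = np.random.default_rng(4)
prices = rng.lognormal(3.5, 0.5, size=5_000)   # stand-in hourly spot prices
rho = 0.75
storage, cash = 0.0, 0.0
for t in range(99, len(prices)):
    x_t = policy(prices[t - 99 : t + 1], x_low=37, x_high=70)
    if x_t == -1 and storage < 1:              # cannot discharge an empty store
        x_t = 0
    storage += x_t                              # R_{t+1} = R_t + x_t
    cash += prices[t] if x_t == -1 else (-prices[t] / rho if x_t == 1 else 0.0)
print(storage, cash)
```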
We define the objective function in the following way:

f_{t+1}(x^L, x^H, ω) = { x^L · (p_{t+1} − (1/ρ) p_t)(ω) if q_t ≤ x^L; 0 if x^L < q_t < x^H; (100 − x^H) · (p_t − p_{t+1})(ω) if x^H ≤ q_t }.   (4.4)

Note that

P[q_t ≤ x^L] ≈ x^L/100  and  P[q_t ≥ x^H] ≈ (100 − x^H)/100,
as shown in (4.2). Notice that when q_t ≤ x^L, we buy electricity, and hence we would make a profit (in a mark-to-market sense) if p_{t+1} − (1/ρ) p_t ≥ 0 and have a loss if p_{t+1} − (1/ρ) p_t < 0. On the other hand, when x^H ≤ q_t, we would have correctly sold electricity if p_{t+1} − p_t ≤ 0, and incur an opportunity-cost loss if p_{t+1} − p_t ≥ 0. Clearly, (4.4) reflects these observations and allows us to calculate the one-step contribution of buying, selling, or holding electricity in our storage device.
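The one-step contribution (4.4) transcribes directly into code (the numerical inputs in the example are made up):

```python
def one_step_objective(x_low, x_high, q_t, p_t, p_next, rho):
    """f_{t+1}(x^L, x^H, omega) from (4.4): mark-to-market gain of buying when
    q_t <= x^L, selling when q_t >= x^H, and zero in the hold region."""
    if q_t <= x_low:
        return x_low * (p_next - p_t / rho)
    if q_t >= x_high:
        return (100 - x_high) * (p_t - p_next)
    return 0.0

# Buy branch: q_t = 35 <= x^L = 40, so f = 40 * (32 - 20/0.75) ≈ 213.33
print(one_step_objective(40, 70, 35, 20.0, 32.0, 0.75))
# Sell branch: q_t = 85 >= x^H = 70, so f = 30 * (20 - 18) = 60.0
print(one_step_objective(40, 70, 85, 20.0, 18.0, 0.75))
```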
Proposition 4.1. Given F_t, define the objective function f_{t+1}(x^L, x^H, ω) as in (4.4). Then x^L satisfies the monotonicity condition (3.2) while x^H satisfies (3.3).

Proof. Let f_{t+1}(x^L, x^H) denote the random variable of which a specific sample realization is f_{t+1}(x^L, x^H, ω). Then, for ∀ω ∈ Ω, f_{t+1}(x^L, x^H, ω) is a concave function of (x^L, x^H). When q_t ≤ x^L, we buy electricity, and hence we would make a profit (in a mark-to-market sense) if p_{t+1} − (1/ρ) p_t ≥ 0 and have a loss if p_{t+1} − (1/ρ) p_t < 0. When x^H ≤ q_t, we would have correctly sold electricity if p_{t+1} − p_t ≤ 0, and incur an opportunity-cost loss if p_{t+1} − p_t ≥ 0.
Next, we know that

∂/∂x^L f_{t+1}(x^L, x^H, ω) = { (p_{t+1} − (1/ρ) p_t)(ω) if q_t ≤ x^L; 0 else }

and
∂/∂x^H f_{t+1}(x^L, x^H, ω) = { (p_{t+1} − p_t)(ω) if x^H ≤ q_t; 0 else }.

For some ω^1, ω^2 ∈ Ω,

f_{t+1}(x^L, x^H, ω^1) ≤ f_{t+1}(x^L, x^H, ω^2)

if and only if one of the following three conditions holds:
1. q_t ≤ x^L and (p_{t+1} − (1/ρ) p_t)(ω^1) ≤ (p_{t+1} − (1/ρ) p_t)(ω^2),
2. x^L < q_t < x^H,
3. x^H ≤ q_t and (p_t − p_{t+1})(ω^1) ≤ (p_t − p_{t+1})(ω^2).
These conditions hold true if and only if

∂/∂x^L f_{t+1}(x^L, x^H, ω^1) ≤ ∂/∂x^L f_{t+1}(x^L, x^H, ω^2)

and

∂/∂x^H f_{t+1}(x^L, x^H, ω^1) ≥ ∂/∂x^H f_{t+1}(x^L, x^H, ω^2).

Therefore, x^L satisfies (3.2) while x^H satisfies (3.3).
Then, our goal is to compute (x^{L*}, x^{H*}) such that

(x^{L*}, x^{H*}) := argmax_{1 ≤ x^L ≤ 50, 50 < x^H ≤ 100} Q_α [f_{t+1}(x^L, x^H)].
Fig. 4.1. x^L_t computed from (4.5) as t → ∞, for α = 0.5 and ρ = 0.7.

           α = 0.3   α = 0.4   α = 0.5
ρ = 0.7     36.2      43.6      49.4
ρ = 0.75    37.1      44.6      50.5
ρ = 0.8     39.0      46.3      51.6

Table 4.3. x^L computed from PJM West Hub data.
Table 4.3 shows x^L for different values of α and of the round-trip conversion factor ρ. The lower the value of ρ, the greater the conversion loss, and hence the more conservative our decision to buy must be, leading to a lower x^L. And of course, a larger value of α leads to more risk taking and hence a higher x^L. We show results for α ≤ 0.5 because for larger α, we end up with policies that take too much risk, buy electricity too readily, and fail to make a reasonable profit. Similarly, Table 4.4 shows x^L computed from the above algorithm using Texas Ercot price data. Next, Table 4.5 shows x^H computed from PJM West and Texas Ercot data for α ≤ 0.6. For α > 0.6, we end up with policies that take too much risk, sell electricity too easily, and fail to make a reasonable profit.
5. Maximum Profit Trading Policy. In the previous section, we have shown how to compute x^L and x^H based on our risk appetite, determined by α. If we want to find out exactly how much risk we would have had to accept in the past to maximize
           α = 0.3   α = 0.4   α = 0.5
ρ = 0.7     32.6      39.6      46.0
ρ = 0.75    33.7      40.8      47.2
ρ = 0.8     35.5      42.2      49.0

Table 4.4. x^L computed from Texas Ercot data.
              α = 0.3   α = 0.4   α = 0.5   α = 0.6
PJM West       87.9      75.2      62.4      50.1
Texas Ercot    85.4      72.1      59.7      45.5

Table 4.5. x^H computed from Texas Ercot and PJM West data.
our profit, we can first find the x^L and x^H that maximize the cumulative profit

C(x^L, x^H) := − Σ_{t=0}^{T} p_t x_t,
through deterministic search. Then, the α that comes closest to producing this x^L and x^H tells us what our risk appetite should have been in order to maximize profit. Assuming ρ = .75, for year 2006, when the median price of the year was $47.1 in Texas Ercot, x^L = 39 and x^H = 71 would have given us the highest profit of $7.6 per hour, and these correspond to α ≈ 0.36. In other words, we should have been fairly conservative in controlling the probability of losses in order to maximize our profit. This is because the heavy-tailed spikes in electricity prices have a disproportionate impact on profits and losses. Instead of trading often and trying to make a profit as often as possible, we must be patient and wait long enough for large price deviations to occur. These large deviations occur often enough that we can eventually buy and sell electricity at a good profit. For year 2007, when the median price of the year was $48.3, x^L = 36 and x^H = 68 would have been optimal, giving us a profit of $7.2 per hour. This again corresponds to α ≈ 0.36. Thus, even though the electricity price is highly volatile and non-stationary, with wildly differing price distributions over time, the risk appetite that would have been optimal for year 2006 was still optimal for year 2007. Results are similar for years 2008 and 2009 and for the data from PJM West Hub. The trading policy based on optimizing the quantiles is truly robust: it is not sensitive to the particular sample paths of the data. This is because the policy captures the core characteristics of the data: the hourly electricity spot price is median-reverting and heavy-tailed.
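The deterministic search can be sketched as a brute-force grid scan of the backtested hourly profit. Synthetic prices again stand in for the Ercot data; following the storage model, buys cost p_t/ρ and sells earn p_t (and leftover inventory is left unvalued in this sketch):

```python
import numpy as np

def rank_series(prices):
    # q_t for every t: rank of the current price in the trailing 100-hour window
    return np.array([
        int(np.searchsorted(np.sort(prices[t - 99 : t + 1]), prices[t], side="left")) + 1
        for t in range(99, len(prices))
    ])

def hourly_profit(prices, q, x_low, x_high, rho=0.75):
    """Backtest policy (4.3): pay p_t / rho on each buy, earn p_t on each sell."""
    storage, cash = 0.0, 0.0
    for q_t, p_t in zip(q, prices[99:]):
        if q_t <= x_low:
            storage += 1
            cash -= p_t / rho
        elif q_t >= x_high and storage >= 1:
            storage -= 1
            cash += p_t
    return cash / len(q)

rng = np.random.default_rng(5)
prices = rng.lognormal(3.5, 0.5, size=3_000)
q = rank_series(prices)
grid = [(x_l, x_h) for x_l in range(5, 51, 5) for x_h in range(55, 101, 5)]
best = max(grid, key=lambda th: hourly_profit(prices, q, *th))
print(best, hourly_profit(prices, q, *best))
```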
Table 5.1 shows the performance of the above trading policy from year 2007 to 2009 in Ercot. We can see that the difference between the profit we can make using the optimal parameters (x^L, x^H) and the profit we can make using the parameters computed from the previous year's data is negligible, implying that the quantile-based trading policy is robust.
year   median price   optimal (x^L, x^H)   optimal profit   last year's (x^L, x^H)   profit
2007   $48.3          (36, 68)             $7.2/hour        (39, 71)                 $7.1/hour
2008   $48.7          (39, 68)             $17.0/hour       (36, 68)                 $16.8/hour
2009   $23.5          (39, 69)             $6.3/hour        (39, 68)                 $6.2/hour

Table 5.1. Performance of the quantile-based trading policy in Ercot.
6. Conclusion. In this article, we have presented a simple, provably convergent algorithm for optimizing the quantile of a continuous random function that may not have a finite expectation. The algorithm follows the spirit of Robbins and Monro [36] and does not require us to store and sort the data. In the real world, we must often deal with non-stationary and heavy-tailed data. In that case, a policy based on higher-order statistics such as the expectation and the variance can be unstable at best, or enormously risky, as we have shown with our newsvendor example. We also showed that in the electricity spot market, a simple trading policy based on quantile optimization is robust and generates good profit.
REFERENCES

[1] G. R. Arce and J. L. Paredes, Recursive weighted median filters admitting negative weights and their optimization, IEEE Trans. Signal Processing, 48 (3), 2000, pp. 768–779.
[2] J. P. Bouchaud, Power laws in economics and finance: some ideas from physics, Quantitative Finance, 1 (1), 2001, pp. 105–112.
[3] D. Bertsimas and A. Thiele, A robust optimization approach to inventory theory, Operations Research, 54 (1), 2006, pp. 150–168.
[4] W. Breymann, A. Dias, and P. Embrechts, Dependence structure for multivariate high-frequency data in finance, Quantitative Finance, 3 (1), 2003, pp. 1–14.
[5] H. N. E. Bystrom, Extreme value theory and extremely large electricity price changes, International Review of Economics & Finance, 14 (1), 2002, pp. 41–55.
[6] A. Cartea and M. G. Figueroa, Pricing in electricity markets: a mean-reverting jump-diffusion model with seasonality, Applied Mathematical Finance, 12 (4), 2005, pp. 313–335.
[7] A. Eydeland and K. Wolyniec, Energy and Power Risk Management, John Wiley & Sons, New Jersey, 2003.
[8] J. D. Farmer and F. Lillo, On the origin of power law tails in price fluctuations, Quantitative Finance, 4 (1), 2004, pp. C7–C11.
[9] A. P. George and W. B. Powell, Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning, 65 (1), 2006, pp. 167–198.
[10] B. M. Hill, A simple general approach to inference about the tail of a distribution, Annals of Statistics, 3 (5), 1975, pp. 1163–1174.
[11] P. Heidelberger and P. A. W. Lewis, Quantile estimation in dependent sequences, Operations Research, 32 (1), 1984, pp. 185–209.
[12] D. P. Heyman and M. J. Sobel, Stochastic Models in Operations Research, Dover Publications, New York, 2003.
[13] T. Hsing, On tail index estimation using dependent data, Annals of Statistics, 19 (3), 1991, pp. 1547–1569.
[14] R. Huisman and R. Mahieu, Regime jumps in electricity prices, Energy Economics, 25 (5), 2003, pp. 425–434.
[15] D. Hutchinson and J. C. Spall, Stopping small-sample stochastic approximation, in American Control Conference, 2009, AAC'09, IEEE, pp. 26–31.
[16] V. R. Joseph, Y. Tian, and C. F. Wu, Adaptive designs for stochastic root-finding, Statistica Sinica, 17 (4), 2007, pp. 1549–1565.
[17] A. I. Kibzun and V. Y. Kurbakovskiy, Guaranteeing approach to solving quantile optimization problems, Annals of Operations Research, 30 (1), 1991, pp. 81–93.
[18] J. H. Kim and W. B. Powell, An hour-ahead prediction model for heavy-tailed spot prices, submitted to Energy Economics, 2011.
[19] A. J. Kleywegt, A. Shapiro, and T. Homem-de-Mello, The sample average approximation method for stochastic discrete optimization, SIAM Journal on Optimization, 12 (2), 2002, pp. 479–502.
[20] R. Koenker, Quantile Regression, Cambridge University Press, New York, 2005.
[21] H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, Springer-Verlag, New York, 2003.
[22] T. L. Lai, Stochastic approximation: invited paper, Annals of Statistics, 31 (2), 2003, pp. 391–406.
[23] T. L. Lai and H. Robbins, Adaptive design and stochastic approximation, The Annals of Statistics, 7 (6), 1979, pp. 1196–1221.
[24] T. L. Lai and H. Robbins, Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes, Probability Theory and Related Fields, 56 (3), 1981, pp. 329–360.
[25] B. LeBaron, Stochastic volatility as a simple generator of apparent financial power laws and long memory, Quantitative Finance, 1 (6), 2001, pp. 621–631.
[26] J. J. Lucia and E. S. Schwartz, Electricity prices and power derivatives: Evidence from the Nordic power exchange, Review of Derivatives Research, 5 (1), 2002, pp. 5–20.
[27] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19 (4), 2009, pp. 1574–1609.
[28] L. Peng, Estimating the mean of a heavy-tailed distribution, Statistics and Probability Letters, 52 (3), 2001, pp. 255–264.
[29] L. Peng, Empirical-likelihood based confidence interval for the mean with a heavy-tailed distribution, Annals of Statistics, 32 (3), 2004, pp. 1192–1214.
[30] L. Peng and Y. Qi, Confidence regions for high quantiles of a heavy-tailed distribution, Annals of Statistics, 34 (4), 2006, pp. 1964–1986.
[31] N. C. Petruzzi and M. Dada, Pricing and the newsvendor problem: A review with extensions, Operations Research, 47 (2), 1999, pp. 183–194.
[32] D. Pilipovic, Energy Risk: Valuing and Managing Energy Derivatives, McGraw-Hill, New York, 2007.
[33] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, John Wiley and Sons, New York, 2007.
[34] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley and Sons, New York, 1994.
[35] S. Resnick and C. Starica, Tail index estimation for dependent data, Annals of Applied Probability, 8 (4), 1998, pp. 1156–1183.
[36] H. Robbins and S. Monro, A stochastic approximation method, Annals of Mathematical Statistics, 22 (3), 1951, pp. 400–407.
[37] A. Ruszczyński and A. Shapiro, Stochastic Programming, Handbooks in Operations Research and Management Science, 10, Elsevier, 2003.
[38] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, MPS-SIAM Series on Optimization, 9, MPS-SIAM, Philadelphia, 2009.
[39] E. S. Schwartz, Stochastic behavior of commodity prices: Implications for valuation and hedging, The Journal of Finance, 52 (3), 1997, pp. 923–974.
[40] R. Sioshansi, P. Denholm, T. Jenkin, and J. Weiss, Estimating the value of electricity storage in PJM: Arbitrage and some welfare effects, Energy Economics, 31 (2), 2009, pp. 269–277.
[41] P. Sunehag, J. Trumpf, S. V. N. Vishwanathan, and N. Schraudolph, Variable metric stochastic approximation theory, arXiv preprint arXiv:0908.3529v1, 2009.
[42] N. N. Taleb, Black swans and the domains of statistics, The American Statistician, 61 (3), 2007, pp. 198–200.
[43] N. N. Taleb, Errors, robustness, and the fourth quadrant, International Journal of Forecasting, 25 (4), 2009, pp. 744–759.
[44] L. Yin, R. Yang, M. Gabbouj, and Y. Neuvo, Weighted median filters: A tutorial, IEEE Transactions on Circuits and Systems, 43 (3), 1996, pp. 157–192.