Drift Plus Penalty
This article describes the drift-plus-penalty method for optimization of queueing networks and other stochastic systems.

Introduction to the Drift-Plus-Penalty Method
The drift-plus-penalty method refers to a technique for stabilizing a queueing network while also minimizing the time average of a network penalty function. It can be used to optimize performance objectives such as time average power, throughput, and throughput utility. In the special case when there is no penalty to be minimized, and when the goal is to design a stable routing policy in a multi-hop network, the method reduces to backpressure routing. The drift-plus-penalty method can also be used to minimize the time average of a stochastic process subject to time average constraints on a collection of other stochastic processes. This is done by defining an appropriate set of virtual queues. It can also be used to produce time averaged solutions to convex optimization problems.

Methodology
The drift-plus-penalty method applies to queueing systems that operate in discrete time with time slots t in {0, 1, 2, ...}. First, a non-negative function L(t) is defined as a scalar measure of the state of all queues at time t. The function L(t) is typically defined as the sum of the squares of all queue sizes at time t, and is called a Lyapunov function. The Lyapunov drift is defined:

$$ \Delta(t) = L(t+1) - L(t) $$

Every slot t, the current queue state is observed and control actions are taken to greedily minimize a bound on the following drift-plus-penalty expression:

$$\Delta(t) + Vp(t), $$

where p(t) is the penalty function and V is a non-negative weight. The V parameter can be chosen to ensure the time average of p(t) is arbitrarily close to optimal, with a corresponding tradeoff in average queue size. Like backpressure routing, this method typically does not require knowledge of the probability distributions for job arrivals and network mobility.

Origins and Applications
When V=0, the method reduces to greedily minimizing the Lyapunov drift. This results in the backpressure routing algorithm originally developed by Tassiulas and Ephremides (also called the max-weight algorithm). The Vp(t) term was added to the drift expression by Neely, and by Neely, Modiano, and Li, to stabilize a network while also maximizing a throughput utility function. For this, the penalty p(t) was defined as -1 times a reward earned on slot t. This drift-plus-penalty technique was later used to minimize average power and to optimize other penalty and reward metrics.

The theory was developed primarily for optimizing communication networks, including wireless networks, ad-hoc mobile networks, and other computer networks. However, the mathematical techniques can be applied to optimization and control for other stochastic systems, including renewable energy allocation in smart power grids and inventory control for product assembly systems.

How it works
This section shows how to use the drift-plus-penalty method to minimize the time average of a function p(t) subject to time average constraints on a collection of other functions. The analysis below is based on material from the references.

The Stochastic Optimization Problem
Consider a discrete time system that evolves over normalized time slots t in {0, 1, 2, ...}. Define p(t) as a function whose time average should be minimized, called a penalty function. Suppose that minimization of the time average of p(t) must be done subject to time-average constraints on a collection of K other functions:

$$ p(t) = \text{penalty function whose time average should be minimized} $$

$$ y_1(t), y_2(t), ..., y_K(t) =\text{other functions whose time averages must be non-positive} $$

Every slot t, the network controller observes a new random event. It then makes a control action based on knowledge of this event. The values of p(t) and y_i(t) are determined as functions of the random event and the control action on slot t:

$$ \omega(t) = \text{random event on slot t (assumed i.i.d. over slots)} $$

$$ \alpha(t) = \text{control action on slot t (chosen after observing } \omega(t) \text{)} $$

$$ p(t) = P(\alpha(t), \omega(t)) \text{ (a deterministic function of } \alpha(t), \omega(t) \text{)} $$

$$ y_i(t) = Y_i(\alpha(t), \omega(t))  \text{ }  \forall i \in {1, ..., K} \text{ (deterministic functions of }  \alpha(t), \omega(t) \text{)} $$

The lowercase notation p(t), y_i(t) and uppercase notation P, Y_i distinguish the realized penalty and constraint values from the functions that determine these values based on the random event and control action for slot t. The random event $$\omega(t)$$ is assumed to take values in some abstract event space $$\Omega$$. The control action $$\alpha(t)$$ is assumed to be chosen within some abstract control space $$A$$.

As an example in the context of communication networks, the random event $$\omega(t)$$ can be a vector that contains the slot t arrival information for each node and the slot t channel state information for each link. The control action $$\alpha(t)$$ can be a vector that contains the routing and transmission decisions for each node. The functions P and Y_i can represent power expenditures or throughputs associated with the control action and channel condition for slot t.
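For concreteness, the following minimal sketch defines one such instance in code; the event distribution, action set, and the particular P and Y_1 functions are purely illustrative and not taken from the source.

```python
import random

# Hypothetical instance of the abstract model: the random event is a channel
# quality in [0, 1], the action is a transmit-power level, the penalty P is
# the power spent on slot t, and requiring Y_1 <= 0 on average enforces that
# the time-average throughput alpha(t)*omega(t) is at least 0.3.
ACTIONS = [0.0, 0.5, 1.0]          # abstract action space A

def sample_omega():
    return random.random()          # omega(t): i.i.d. channel quality

def P(alpha, omega):
    return alpha                    # penalty: power used on slot t

def Y_1(alpha, omega):
    return 0.3 - alpha * omega      # constraint function y_1(t)

# One slot of the model: observe omega(t), pick some action, evaluate values.
omega = sample_omega()
alpha = random.choice(ACTIONS)      # placeholder choice; not the greedy rule
p_t, y1_t = P(alpha, omega), Y_1(alpha, omega)
```

The drift-plus-penalty algorithm described below replaces the placeholder choice with a greedy minimization that weighs P against the queue-weighted Y_i values.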

For simplicity of exposition, assume the P and Y_i functions are bounded. Further assume the random event process $$\omega(t)$$ is independent and identically distributed (i.i.d.) with some possibly unknown probability distribution. The goal is to design a policy for making control actions over time to solve the following problem:

$$ \text{Minimize: } \lim_{t\rightarrow\infty} \frac{1}{t}\sum_{\tau=0}^{t-1}E[p(\tau)] $$

$$ \text{Subject to: } \lim_{t\rightarrow\infty} \frac{1}{t}\sum_{\tau=0}^{t-1}E[y_i(\tau)] \leq 0 \text{  } \forall i \in {1, ..., K} $$

It is assumed throughout that this problem is feasible.

The Drift-Plus-Penalty Expression
For each constraint i in {1, ..., K}, define a virtual queue with dynamics over slots t in {0, 1, 2, ...} as follows:

$$ (Eq. 1) \text{ } Q_i(t+1) = \max[Q_i(t) + y_i(t), 0] $$

The queues can be initialized to 0 for slot t=0. Intuitively, stabilizing these virtual queues ensures the time averages of the constraint functions are non-positive, so that the desired constraints are satisfied. Indeed, (Eq. 1) implies $$Q_i(t+1) \geq Q_i(t) + y_i(t)$$, and summing this over slots gives $$\frac{1}{t}\sum_{\tau=0}^{t-1}y_i(\tau) \leq \frac{Q_i(t)-Q_i(0)}{t}$$, which vanishes as $$t \rightarrow \infty$$ whenever $$Q_i(t)/t \rightarrow 0$$. To stabilize these queues, define the Lyapunov function L(t) as a measure of the total queue backlog on slot t:

$$ L(t) = \frac{1}{2}\sum_{i=1}^KQ_i(t)^2 $$

Squaring the queueing equation results in the following bounds:

$$ Q_i(t+1)^2 \leq (Q_i(t) + y_i(t))^2 $$

$$ \Delta(t) = L(t+1) - L(t) \leq \frac{1}{2}\sum_{i=1}^Ky_i(t)^2 + \sum_{i=1}^K Q_i(t)y_i(t) $$

$$ \Delta(t) \leq B + \sum_{i=1}^K Q_i(t)y_i(t) $$

where B is a positive constant that upper bounds the term with the sum of squares of the y_i(t) values. Such a constant exists because these values are bounded: for example, if $$|y_i(t)| \leq y_{max}$$ for all i and t, then $$B = Ky_{max}^2/2$$ suffices. Adding Vp(t) to both sides of the above inequality results in the following bound on the drift-plus-penalty expression:

$$ (Eq. 2) \text{ } \Delta(t)  + Vp(t) \leq B + Vp(t) + \sum_{i=1}^K Q_i(t)y_i(t) $$

The drift-plus-penalty algorithm (defined below) makes control actions every slot t that greedily minimize the right-hand-side of the above inequality. Intuitively, taking an action that minimizes the drift alone would be beneficial in terms of queue stability but would not minimize time average penalty. Taking an action that minimizes the penalty every slot would not necessarily stabilize the queues. Thus, taking an action to minimize the weighted sum incorporates both objectives of queue stability and penalty minimization. The weight V is chosen to place more or less emphasis on penalty minimization, which results in a performance tradeoff.

Drift-Plus-Penalty Algorithm
Let $$A$$ be the abstract set of all possible control actions. Every slot t, observe the random event and the current queue values:

$$ \text{Observe: } \omega(t), Q_1(t), ..., Q_K(t) $$

Given these observations, greedily choose a control action $$\alpha(t) \in A$$ to minimize the following expression:

$$ Vp(t) + \sum_{i=1}^KQ_i(t)Y_i(\alpha(t), \omega(t)) $$

Then update the queues for each i in {1, ..., K} according to (Eq. 1).
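To illustrate, the following sketch simulates the algorithm on a hypothetical single-constraint problem (the model and all parameter values are illustrative, not from the source): $$\omega(t)$$ is uniform on [0,1], the action is $$\alpha(t) \in [0,1]$$, the penalty is $$P = \alpha(t)^2$$, and the constraint function is $$Y_1 = \omega(t) - \alpha(t)$$, forcing the time average of $$\alpha(t)$$ to be at least that of $$\omega(t)$$. The optimal time average penalty is $$p^* = 0.25$$, achieved by the constant action $$\alpha = 0.5$$.

```python
import random

def drift_plus_penalty(V, T=200000, seed=0):
    """Simulate drift-plus-penalty: minimize the time average of a(t)**2
    subject to the time-average constraint E[w(t) - a(t)] <= 0,
    where w(t) ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    Q = 0.0                          # virtual queue for the single constraint
    sum_penalty = 0.0
    sum_queue = 0.0
    for _ in range(T):
        w = rng.random()             # observe the random event w(t)
        # Greedy step: minimize V*a**2 + Q*(w - a) over a in [0, 1].
        # The unconstrained minimizer is a = Q/(2V); clip it to [0, 1].
        a = min(max(Q / (2.0 * V), 0.0), 1.0)
        sum_penalty += a * a
        sum_queue += Q
        Q = max(Q + (w - a), 0.0)    # virtual queue update (Eq. 1)
    return sum_penalty / T, sum_queue / T

# Larger V drives the average penalty toward p* = 0.25, while the
# average virtual queue size grows roughly linearly in V.
for V in (1, 10, 100):
    avg_p, avg_q = drift_plus_penalty(V)
    print(f"V={V}: avg penalty={avg_p:.3f}, avg queue={avg_q:.1f}")
```

The printed values exhibit the O(1/V) penalty gap and O(V) queue growth derived in the performance analysis below.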

Performance Analysis
This section shows that the algorithm yields a time average penalty within O(1/V) of optimality, with a corresponding O(V) tradeoff in average queue size.

Average Penalty Analysis
Define an $$\omega$$-only policy to be a stationary and randomized policy for choosing the control action $$\alpha(t)$$ based on the observed $$\omega(t)$$ only. That is, an $$\omega$$-only policy specifies, for each possible random event $$\omega \in \Omega$$, a conditional probability distribution for selecting a control action $$\alpha \in A$$. Such a policy makes decisions independent of current queue backlog. Assume there exists an $$\omega$$-only policy $$\alpha^*(t)$$ that satisfies the following:

$$ (Eq. 3) \text{ } E[P(\alpha^*(t), \omega(t))] = p^* = \text{ optimal time average penalty} $$

$$ (Eq. 4) \text { } E[Y_i(\alpha^*(t), \omega(t))] \leq 0 \text{ } \forall i \in {1, ..., K} $$

The expectations above are with respect to the random variable $$\omega(t)$$ for slot t, and the random control action $$\alpha(t)$$ chosen on slot t after observing $$\omega(t)$$. Such a policy $$\alpha^*(t)$$ can be shown to exist whenever the desired control problem is feasible and the event space for $$\omega(t)$$ and action space for $$\alpha(t)$$ are finite, or when mild closure properties are satisfied.

Let $$\alpha(t)$$ represent the action taken by the drift-plus-penalty algorithm of the previous section, and let $$\alpha^*(t)$$ represent the $$\omega$$-only decision:

$$ \alpha(t) = \text{ action taken by drift-plus-penalty algorithm} $$

$$ \alpha^*(t) = \omega\text{-only action that satisfies (Eq.3)-(Eq.4)} $$

By (Eq. 2), the drift-plus-penalty expression under the $$\alpha(t)$$ policy satisfies:

$$ \Delta(t) + Vp(t) $$

$$ \leq B + Vp(t) + \sum_{i=1}^KQ_i(t)y_i(t) $$

$$ = B + VP(\alpha(t), \omega(t)) + \sum_{i=1}^K Q_i(t)Y_i(\alpha(t), \omega(t)) $$

$$ \leq B + VP(\alpha^*(t), \omega(t)) + \sum_{i=1}^KQ_i(t)Y_i(\alpha^*(t), \omega(t)) $$

where the last inequality follows because $$\alpha(t)$$ is defined to minimize the second-to-last expression over all other decisions in the action space A, including the (randomized) decision $$\alpha^*(t)$$. Taking expectations of the above inequality gives:

$$ E[\Delta(t) + Vp(t)] $$

$$ \leq B + VE[P(\alpha^*(t), \omega(t))] + \sum_{i=1}^KE[Q_i(t)Y_i(\alpha^*(t), \omega(t))] $$

$$ = B + VE[P(\alpha^*(t), \omega(t))] + \sum_{i=1}^KE[Q_i(t)]E[Y_i(\alpha^*(t), \omega(t))] $$

$$ \leq B + Vp^* $$

where the second-to-last equality follows because $$\alpha^*(t), \omega(t)$$ are independent of $$Q_i(t)$$, and the last inequality follows by (Eq.3)-(Eq.4). Summing the above inequality over the first t>0 slots gives:

$$ \sum_{\tau=0}^{t-1} E[\Delta(\tau) + Vp(\tau)] \leq (B+Vp^*)t $$

Using the fact that $$\Delta(\tau) = L(\tau+1)-L(\tau)$$ together with the law of telescoping sums gives:

$$ E[L(t)] - E[L(0)] + V\sum_{\tau=0}^{t-1}E[p(\tau)] \leq (B + Vp^*)t $$

Using the fact that L(t) is non-negative and assuming L(0) is identically zero gives:

$$ V\sum_{\tau=0}^{t-1}E[p(\tau)] \leq (B + Vp^*)t $$

Dividing the above by t and rearranging the terms yields the following result, which holds for all slots t>0:

$$ \frac{1}{t}\sum_{\tau=0}^{t-1} E[p(\tau)] \leq p^* + B/V $$

Thus, the time average expected penalty can be made arbitrarily close to the optimal value p* by choosing V suitably large.

Average Queue Size Analysis
Assume now there exists an $$\omega$$-only policy $$\alpha^*(t)$$, possibly different from the one that satisfies (Eq. 3)-(Eq.4), that satisfies the following for some $$\epsilon>0$$:

$$ (Eq. 5) \text { } E[Y_i(\alpha^*(t), \omega(t))] \leq -\epsilon \text{ } \forall i \in {1, ..., K} $$

An argument similar to the one in the previous section shows:

$$ \Delta(t) + Vp(t) \leq B + VP(\alpha^*(t), \omega(t)) + \sum_{i=1}^KQ_i(t)Y_i(\alpha^*(t), \omega(t)) $$

Now assume there are upper and lower bounds on the penalty function P, so that:

$$ p_{min} \leq P(\cdot) \leq p_{max} $$

Then the above inequality reduces to:

$$ \Delta(t) + Vp_{min} \leq B + Vp_{max} + \sum_{i=1}^KQ_i(t)Y_i(\alpha^*(t), \omega(t)) $$

Taking expectations of the above and using (Eq. 5) gives:

$$ E[\Delta(t)] + Vp_{min} \leq B + Vp_{max} + \sum_{i=1}^KE[Q_i(t)](-\epsilon) $$

A telescoping series argument similar to the one in the previous section can thus be used to show the following for all t>0:

$$ \frac{1}{t}\sum_{\tau=0}^{t-1} \sum_{i=1}^KE[Q_i(\tau)] \leq \frac{B + V(p_{max} - p_{min})}{\epsilon} $$

This shows that average queue size is indeed O(V).

Treatment of Queueing Systems
The above analysis considers constrained optimization of time averages in a stochastic system that has no explicit queues. Each time average inequality constraint is mapped to a virtual queue according to (Eq. 1). In the case of optimizing a queueing network, the virtual queue equations in (Eq. 1) are replaced by the actual queueing equations.

Delay Tradeoffs and Related Work
The mathematical analysis in the previous section shows that the drift-plus-penalty method produces a time average penalty that is within O(1/V) of optimality, with a corresponding O(V) tradeoff in average queue size. This method, together with the O(1/V), O(V) tradeoff, was developed by Neely, and by Neely, Modiano, and Li, in the context of maximizing network utility subject to stability.

A related algorithm for maximizing network utility was developed by Eryilmaz and Srikant. Their work resulted in an algorithm very similar to the drift-plus-penalty algorithm, but used a different analytical technique based on Lagrange multipliers. A direct use of the Lagrange multiplier technique results in a worse tradeoff of O(1/V), O(V^2). However, the Lagrange multiplier analysis was later strengthened by Huang and Neely to recover the original O(1/V), O(V) tradeoffs, while showing that queue sizes are tightly clustered around the Lagrange multiplier of a corresponding deterministic optimization problem. This clustering result can be used to modify the drift-plus-penalty algorithm to achieve improved O(1/V), O(log^2(V)) tradeoffs. The modifications can use either place-holder backlog or Last-in-First-Out (LIFO) scheduling.

When implemented for non-stochastic functions, the drift-plus-penalty method is similar to the dual subgradient method of convex optimization theory, with the exception that its output is a time average of primal variables, rather than the primal variables themselves. A related primal-dual technique for maximizing utility in a stochastic network is developed by Stolyar using a fluid model analysis. The Stolyar analysis does not provide analytical results for a performance tradeoff between utility and queue size. A later analysis of the primal-dual method for stochastic networks provides a limited form of utility and queue size tradeoffs, and also shows local optimality results for minimizing non-convex functions of time averages, under an additional convergence assumption.
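To sketch this connection concretely (with a hypothetical scalar program, not one from the source), the same greedy rule can be applied to a deterministic convex problem; the time average of the primal variable then approaches the solution.

```python
def dpp_convex(V, T=10000):
    """Apply the drift-plus-penalty rule, dual-subgradient style, to the
    deterministic program: minimize (x - 2)**2 subject to x <= 1, x in [0, 3].
    The constraint x - 1 <= 0 is given a virtual queue Q."""
    Q = 0.0
    x_sum = 0.0
    for _ in range(T):
        # Minimize V*(x - 2)**2 + Q*(x - 1) over x in [0, 3]:
        # setting the derivative 2*V*(x - 2) + Q to zero gives x = 2 - Q/(2V).
        x = min(max(2.0 - Q / (2.0 * V), 0.0), 3.0)
        x_sum += x
        Q = max(Q + (x - 1.0), 0.0)   # virtual queue update
    return x_sum / T                  # time average of the primal variable

# The time average converges toward the optimizer x = 1; the individual
# iterates x(t) play the role of primal variables in a dual subgradient method.
print(dpp_convex(V=10.0))
```

Note that it is the time average, not the final iterate, that solves the problem, matching the distinction drawn above.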

Extensions to Non-I.I.D. Event Processes
The drift-plus-penalty algorithm is known to ensure similar performance guarantees for more general ergodic processes $$\omega(t)$$, so that the i.i.d. assumption is not crucial to the analysis. The algorithm is robust to non-ergodic changes in the probabilities for $$\omega(t)$$, and provides desirable analytical guarantees, called universal scheduling guarantees, for arbitrary $$\omega(t)$$ processes.

Extensions to Variable Frame Length Systems
The drift-plus-penalty method can be extended to treat systems that operate over variable-size frames. In that case, the frames are labeled with indices r in {0, 1, 2, ...} and the frame durations are denoted {T[0], T[1], T[2], ...}, where T[r] is a non-negative real number for each frame r. The extended algorithm takes a control action over each frame r to minimize a bound on the following ratio of conditional expectations:

$$ \frac{E[\Delta[r] + Vp[r] \mid Q[r]]}{E[T[r] \mid Q[r]]} $$

where Q[r] is the vector of queue backlogs at the beginning of frame r, Δ[r] is the change in the Lyapunov function over frame r, and p[r] is the total penalty incurred over frame r. In the special case when all frames are the same size and are normalized to 1 slot length, so that T[r]=1 for all r, the above minimization reduces to the standard drift-plus-penalty technique. This frame-based method can be used for constrained optimization of Markov decision problems (MDPs) and for other problems that experience renewals.