ECSE 506: Stochastic Control and Decision Theory

Aditya Mahajan
Winter 2016



Assignment 1 (due Feb 8, 2016)

\(\def\PR{\mathbb P} \def\EXP{\mathbb E} \def\reals{\mathbb R} \def\integers{\mathbb Z}\)

Note: You may write code in any programming language for the problems that require a numerical solution.

Problem 1 (Exploration-exploitation tradeoff)

A hunter is interested in optimal scheduling of hunting during an expedition that may last up to \(T\) days. On each day, the hunter has two options: he may go out to hunt, or he may stop the expedition and go back home. If he hunts on a day when the current animal population is \(n\), he captures an animal with probability \(p(n)\); each capture earns a unit reward, and each day of hunting costs \(c\).

Assume the following:

  1. The hunter knows the initial population \(N\).
  2. The time scale is such that the population does not change due to natural causes (birth or death).
  3. The probability that the hunter makes more than one capture a day is negligible.
  4. The function \(p(n)\) is monotonically increasing in \(n\).

For the above system:

  1. Write a dynamical model with the state as the current population. Argue that the above system is an MDP and write the dynamic program for this system. (A minimal numerical sketch appears after this problem.)

    Note: You can write the dynamic program either to maximize reward or to minimize cost; you may use either alternative.

  2. Let \(λ\) be the smallest integer such that \(p(λ) > c\). Show that it is optimal to stop hunting when the current population falls below \(λ\).

    Hint: The above result is equivalent to saying that the value function is zero if and only if the current population is less than \(λ\).
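
The following is a minimal numerical sketch of the backward induction for this problem, using the reward structure described above (a unit reward per capture and a cost \(c\) per day of hunting); the values of \(N\), \(T\), \(c\), and the capture probability \(p(n)\) below are illustrative placeholders.

```python
# Backward induction for the hunting problem (illustrative sketch).
def hunting_dp(N, T, c, p):
    """V[t][n] is the optimal reward-to-go on day t when the current population is n."""
    V = [[0.0] * (N + 1) for _ in range(T + 2)]        # V[T+1][.] = 0 (expedition over)
    policy = [[None] * (N + 1) for _ in range(T + 1)]
    for t in range(T, 0, -1):
        for n in range(N + 1):
            stop = 0.0
            # Hunting: pay c, capture an animal with probability p(n) (population drops
            # by one and a unit reward is earned), then continue from day t + 1.
            hunt = -c + p(n) * (1.0 + V[t + 1][max(n - 1, 0)]) + (1.0 - p(n)) * V[t + 1][n]
            V[t][n] = max(stop, hunt)
            policy[t][n] = "hunt" if hunt > stop else "stop"
    return V, policy

if __name__ == "__main__":
    # Illustrative parameters: p(n) is increasing in n and capped at 1 (assumption 4).
    N, T, c = 20, 15, 0.3
    p = lambda n: min(1.0, 0.05 * n)
    V, policy = hunting_dp(N, T, c, p)
    # Here the smallest integer with p(n) > c is 7, so the printed list should start at 7.
    print([n for n in range(N + 1) if policy[1][n] == "hunt"])
```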

Problem 2 (Exploration-exploitation tradeoff)

Consider a discrete-time time-homogeneous Markov chain with state space \(\integers _ {> 0}\) and transition probabilities \(\{ P(i,j) : i, j \in \integers _ {> 0} \}\).

At each time instant, a decision maker perfectly observes the state of the system and decides either to: (i) explore further, which incurs a per-step cost of \(c\); or (ii) exploit the system in its current state, which gives a terminal reward of \(r(X_t)\). The exploit decision is terminal in the sense that once an exploit decision is made, no further costs or rewards occur. At time \(T\), the decision maker must choose the exploit option.

  1. Write down the dynamic programming equation for the above model. (Expand the expectation in the DP equation in terms of the variables defined above).

  2. Show that for a fixed \(x\), the value function is non-increasing in \(t\).

  3. Define for \(t = 1, \dots, T-1\) the set \[H_t = \{ x \in \integers_{ > 0 } : g^*_t(x) = \text{exploit} \}.\] Assume that \(H_{T-1}\) has the following property (P1): whenever \(i \in H_{T-1}\) and \(j \not\in H_{T-1}\), then \(P(i,j) = 0\).

    Show that under property (P1): \[H_{T-1} = H_{T-2} = \cdots = H_1.\]

  4. Suppose that for every \(x \in \integers _ {> 0}\): \[ P(x,x) = P(x,x+1) = \frac 12 \quad\text{and}\quad r(x) = k ( 1 - 2^{-x})\] where \(k\) is a positive constant. Prove that in this case it is optimal to stop at the first instant of time the Markov chain enters the set \[ M = \{ m, m + 1, m + 2, \dots \}\] where \[ m = \min \{ x \ge 0 : 2^x \ge k/(4c) \}\]
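
As a numerical sanity check for part 4, the sketch below runs the finite-horizon dynamic program for this chain on a truncated state space and compares the resulting stopping set with the claimed threshold \(m\). The values of \(k\), \(c\), and \(T\), and the truncation level `X_MAX`, are illustrative choices; the truncation introduces a small approximation near the upper boundary.

```python
# Stopping sets H_t for P(x, x) = P(x, x+1) = 1/2 and r(x) = k (1 - 2**(-x)).
def exploration_dp(k, c, T, X_MAX=60):
    r = lambda x: k * (1.0 - 2.0 ** (-x))
    # V[x] holds the value-to-go at the current stage; at t = T exploiting is forced.
    V = [r(x) for x in range(X_MAX + 2)]
    stop_sets = {}
    for t in range(T - 1, 0, -1):
        V_next = V[:]
        H_t = []
        for x in range(1, X_MAX + 1):
            explore = -c + 0.5 * (V_next[x] + V_next[x + 1])
            exploit = r(x)
            V[x] = max(exploit, explore)
            if exploit >= explore:
                H_t.append(x)
        stop_sets[t] = H_t
    return stop_sets

if __name__ == "__main__":
    k, c, T = 8.0, 0.25, 30
    H = exploration_dp(k, c, T)
    # The claimed threshold is m = min { x : 2**x >= k / (4 c) }, which is 3 here.
    print("H_1 starts at", min(H[1]), "; k/(4c) =", k / (4 * c))
```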

Problem 3 (Forest management)

Consider a forest that is grown for the purpose of selling wood. Assume that the price of the wood is $1 (in some normalized units) if the age of the forest is less than \(L\) years, and $\(R\) if the age is \(L\) years or more.

Each year, there is a probability \(p\) that the forest may be destroyed by wildfire or disease. If the forest is destroyed, it is replanted (so its age restarts from 1 year).

Each year, the forest may either be cut (earning a reward that depends on its age) or be allowed to grow (increasing its age by one). The system runs for a finite horizon \(T\).

  1. Write down the probabilistic model for managing the forest optimally.

  2. Identify the optimal strategy when \(p = 0.1\), \(R = 100\), \(L = 10\), and \(T = 20\). (Submit your code with your solution; an illustrative sketch appears after this problem.)

  3. Play around with the values of \(p\), \(R\), \(L\), and \(T\), and guess the structure of optimal strategies.

  4. Prove the optimality of the structure that you identified in the previous step.
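
A minimal backward-induction sketch for part 2 is given below. The within-year timing is not fully pinned down by the statement above, so the sketch assumes that the cut/grow decision is made at the start of each year, that a forest left to grow is destroyed during that year with probability \(p\) and replanted at age 1, and that the terminal value after year \(T\) is zero. The parameter values are those of part 2.

```python
# Backward induction for the forest-management problem (illustrative sketch).
def forest_dp(p, R, L, T, MAX_AGE=None):
    MAX_AGE = MAX_AGE or (L + T)                 # ages beyond this cannot be reached
    reward = lambda a: R if a >= L else 1        # reward for cutting a forest of age a
    V = [0.0] * (MAX_AGE + 2)                    # value-to-go after the horizon ends
    policy = {}
    for t in range(T, 0, -1):
        V_next = V[:]
        for a in range(1, MAX_AGE + 1):
            cut = reward(a) + V_next[1]          # cut now, replant at age 1
            grow = p * V_next[1] + (1 - p) * V_next[min(a + 1, MAX_AGE)]
            V[a] = max(cut, grow)
            policy[(t, a)] = "cut" if cut >= grow else "grow"
    return V, policy

if __name__ == "__main__":
    V, policy = forest_dp(p=0.1, R=100, L=10, T=20)
    # Print, for each year, the ages at which cutting is optimal (to see the threshold structure).
    for t in range(1, 21):
        print(t, [a for a in range(1, 31) if policy[(t, a)] == "cut"])
```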

Problem 4 (Sleep scheduling in datacenters)

Jobs arrive at a server according to a second-order time-homogeneous binary-valued Markov process \(\{X_t\} _ {t=1}^T\), i.e., \(X_t \in \{0, 1\}\) and for any \(t > 2\), \[\PR(X _ {t+1} = x _ {t+1} \mid X _ {1:t} = x _ {1:t} ) = \PR(X _ {t+1} = x _ {t+1} \mid X _ t = x _ t, X _ {t-1} = x _ {t-1} ).\] Such a process may be characterized by parameters \((p_1, p_2, p_3, p_4)\) where

| \(x _ t\), \(x _ {t-1}\) | \(\PR( X _ {t+1} = 0 \mid x _ t, x _ {t-1})\) | \(\PR( X _ {t+1} = 1 \mid x _ t, x _ {t-1})\) |
|---|---|---|
| 0, 0 | \(p_1\) | \(1 - p_1\) |
| 0, 1 | \(p_2\) | \(1 - p_2\) |
| 1, 0 | \(p_3\) | \(1 - p_3\) |
| 1, 1 | \(p_4\) | \(1 - p_4\) |

Assume that \(x _ {-1} = x _ { 0 } = 0\) for the purpose of computing the distribution of \(X _ 1\) and \(X _ 2\).

The server may be in one of two states: "on" or "off". Assume that the server starts in the "off" state at time \(1\).

At the beginning of each time slot, the server scheduling strategy must decide whether the server will be in the "on" or "off" state for that slot. The jobs for that time slot arrive after the scheduling decision has been made.

If the server is "on" when a job arrives, the job is served and earns a reward of \(r\); if the server is "off" when the job arrives, the job is lost and the server is charged a penalty \(d\).

When the server is in the "on" state, it incurs an operational cost of \(c\) per time slot; when it is in the "off" state, no operational cost is incurred. There is a cost \(s\) for turning the server on from the "off" state.

  1. Write a dynamical model for the above system. Identify an information state of the system and write down the dynamic programming decomposition.

  2. Consider a particular instance with \[(p_1, p_2, p_3, p_4) = (0.9, 0.4, 0.7, 0.1), \quad T = 100 \quad\text{and}\quad (r,d,c,s) = (1,1,0.5,2).\]

    Characterize the optimal strategy at \(t=10\). (Submit your code along with the solution).
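
A minimal sketch of the dynamic program for part 2 is shown below. It uses (last arrival, second-to-last arrival, previous server state) as the information state, which is one natural choice for this model; since the on/off decision for a slot is made before that slot's arrival is observed, the stage reward is an expectation over the arrival. The code prints the optimal decision at \(t = 10\) for each information state.

```python
# Backward induction for the sleep-scheduling problem (illustrative sketch).
def p0(last, second_last, p1, p2, p3, p4):
    """P(next arrival = 0 | most recent arrival, second most recent arrival)."""
    return {(0, 0): p1, (0, 1): p2, (1, 0): p3, (1, 1): p4}[(last, second_last)]

def sleep_dp(params, T, r, d, c, s):
    p1, p2, p3, p4 = params
    # V[(a, b, m)]: value-to-go before a slot's decision, where a, b are the last two
    # arrivals and m is the previous server state (0 = off, 1 = on).
    V = {(a, b, m): 0.0 for a in (0, 1) for b in (0, 1) for m in (0, 1)}
    policy = {}
    for t in range(T, 0, -1):
        V_new, V_old = {}, V
        for a in (0, 1):
            for b in (0, 1):
                q1 = 1.0 - p0(a, b, p1, p2, p3, p4)   # P(arrival in this slot = 1 | a, b)
                for m in (0, 1):
                    best, best_u = None, None
                    for u in (0, 1):                  # server decision for this slot
                        switch = s if (m == 0 and u == 1) else 0.0
                        run = c if u == 1 else 0.0
                        val = -switch - run
                        # Expected stage reward plus continuation, over this slot's arrival x.
                        for x, px in ((0, 1.0 - q1), (1, q1)):
                            stage = (r if u == 1 else -d) if x == 1 else 0.0
                            val += px * (stage + V_old[(x, a, u)])
                        if best is None or val > best:
                            best, best_u = val, u
                    V_new[(a, b, m)] = best
                    policy[(t, a, b, m)] = best_u
        V = V_new
    return V, policy

if __name__ == "__main__":
    V, policy = sleep_dp((0.9, 0.4, 0.7, 0.1), T=100, r=1, d=1, c=0.5, s=2)
    # Optimal decision at t = 10 for every (last arrival, second-to-last arrival, prev state).
    for key, u in sorted(policy.items()):
        if key[0] == 10:
            print(key, "->", "on" if u else "off")
```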