Metropolis-Hastings Algorithm

The first time I read about the Metropolis-Hastings (MH) algorithm I found it all very confusing. The reason is that I was (am?) used to think in terms of Markov chains and transition matrices, and then the goal is to find the stationary distribution \(\pi\). In the Metropolis-Hastings algorithm, the situation is precisely reversed: for a given distribution \(\eta\), say, construct a Markov chain whose stationary distribution is \(\eta\). My first point of confusion with this was: why do this, if we already have \(\eta\)? Then I read on, to discover that the MH algorithm is used to compute the normalizing constant for some distribution \(\eta\). Ok, but why? If I start with a Markov chain, then I have no clue about the functional form of \(\pi\). What then does it help to try to find its normalizing constant when the only thing that I have available is the transtion matrix?

So far my misconceptions. After reading yet more, I finally discovered that the MH algorithm is used for distributions whose functional form is completely known, but only the normalizing constant is too hard to compute. In other words, we just use the MH algorithm to estimate the normalizing constant by means of simulation.

An example for which the functional form of the stationary distribution is known is the distribution of jobs in a closed queueing network consisting of \(N\) \(M/M/1\) stations.

Here is an example in python code to demonstrate how all this works. I learned it here.

We start with the regular

import matplotlib.pyplot as plt
import numpy as np
import scipy.integrate as integrate

# from latex_figures import *

gen = np.random.default_rng(3)

Suppose we are given as unnormalized density the function \(p(x) = \sqrt{1-x^{2}}\) with a support on \([-1,1]\). Clearly, this is the half circle, and its normalization factor is \(Z=\pi/2\). We define \(Z\) here, but don't use it in the simulation because there we are supposed not to know this normalization factor.

def p(x):
    return np.sqrt(1 - x**2)


Z = np.pi / 2

We next run the MH algorithm. I implicitly assume that the Markov chain that is used is symmetric, so that it falls out of the equations. Observe, it is not my aim to explain the details of the MH algorithm, but just how it works. Once you understand that the MH algorithm is meant to be used to sample from a given unnormalized density function, the rest is quite easy to understand.

Since the support of \(p\) is \([-1,1]\), I use a uniform rv on \([-1,1]\) to provide a candidate solution \(x'\). When the system is in state \(x\), the acceptance probability becomes \(p(x')/p(x)\). As explained in the literature, it is best to drop the first samples, i.e., to wait for a certain burn-in time.

N = 10000
samples = np.zeros(N)
xt = 0.0
for i in range(len(samples)):
    xt_candidate = gen.uniform(-1, 1)
    if gen.uniform() < p(xt_candidate) / p(xt):
        xt = xt_candidate
    samples[i] = xt
burn_in = len(samples) // 10
samples = samples[burn_in:]

So, now we have a set of samples that we can use for estimation purposes. Let us first compare the probabilities we obtained with the theoretical values by making a histogram. The conversion of a bin heights in a histogram to estimates of the pmf is a bit tricky; at least, I did not directly see how to do it, so here is the reasoning in full. Suppose we have samples \(\{X_{i}\}_{i=1}^{N}\), then the height of a bin, with the interval \(A\) as support, is just the number of samples that hit \(A\), i.e.,

\begin{equation*} c(A) = \sum_{i=1}^{N} \1{X_{i}\in A}. \end{equation*}

If the length of \(A\) is small, then, \(c(A) \approx N p(x) |A|\) where \(p(x)\) is the density \(p\) computed at the midpoint \(x\) of \(A\). Thus, if we want to estimate \(p(x)\) based on the histogram, then

\begin{equation*} p(x) \approx \frac{c(A)}{N} \frac{1}{|A|} = \frac{c(A)}{N} \frac{n}{2}, \end{equation*}

where \(|A| = 2/n\) when the interval \([-1,1]\) is chopped up into \(n\) bins.

n_bins = 30
counts, bins = np.histogram(samples, bins=n_bins)
dx = 2 / n_bins
pmf = counts / sum(counts) / dx
midpoints = (bins[:-1] + bins[1:]) / 2

Let's compare our estimates to the theoretical values. Here we use \(Z\) to scale \(p\) to the proper values; once again, note that we did not use \(Z\) in the simulations.

x_vals = np.linspace(-1, 1, 1000)
y_vals = p(x_vals)

plt.figure(figsize=(3, 1))
plt.plot(x_vals, y_vals / Z, 'r', label='P(x)')
plt.stairs(pmf, bins)
plt.tight_layout()
plt.savefig("../images/MH-half-circle.png", dpi=300)

nil

We can use the midpoints and the histogram to estimate the normalization factor \(Z\) by comparing the counts to the probabilities at the midpoints.

true_probs = np.array([p(k) for k in midpoints])
Z_estimates = true_probs / pmf
print(f"{Z_estimates.mean()=:2.3f}, {Z=:2.3f}")
Z_estimates.mean()=1.589, Z=1.571

This is not bad at all.

Finally, we actually don't need \(Z\) at all to estimate the expected values of functions.

f = lambda x: x**2
E_f = integrate.quad(lambda x: f(x) * p(x) / Z, -1, 1)[0]
E_f_estimated = np.mean(f(samples))
print(f"{E_f_estimated=:2.3f}, {E_f=:2.3f}")
E_f_estimated=0.247, E_f=0.250

And this is also a nice result.

Perhaps I add at a later stage how to estimate the distribution of jobs in a closed queueing network.

ECDF and Kernels

When analysing data obtained from simulation or measurements we often want to make a histogram of the data. In more general terms, we want to plot an (estimate of) the probability mass function (pmf). With numerical tools this is always a bit awkward because we have to specify, for instance, the number of equal-width bins in which to `throw the data', or, if we don't want equal-width bins, we need to provide the edges of the bins. In other words, we have to choose one or more hyper parameters before we can plot the pmf.

The reason that I find this akwward is because before knowing anything about the data, I have to make a sensible choice for the hyper parameters. Morever, this situation appears somewhat strange to me in view of the fact that making the empirical distribution function (ecdf) from the data does not require any intervention on my part. Can't I then just use the ecdf to make a empirical proability mass function (epmf)?

Here we study two methods to make a epmf from the ecdf. The first is simple, and is based on some interesting ideas of (Lebesgue's) integration theory. The other uses function fitting and is based on machine learning and neural networks. We will test the accuracy of each graphically.

1. ECDF and EPMF

The emperical cumulative distribution function (ecdf) is defined for a given set of real-valued measurements \(x_{1}, \ldots, x_N\) as

\begin{align*} F(x) = \frac{\#\{i : x_{i} \leq x\}}{N} = \frac{1}{N}\sum_{i=1}^N \1{x_i \leq x}, \end{align*}

that is, for each \(x\) we count the number of observations that lie in the set \((-\infty, x]\). Despite the mathematical elegance of this definition, we should stay clear from using this to compute the ecdf; the numerical performance is bad. A much better approach is this:

  1. Sort the measurements \(\{x_i\}\).
  2. Count how often each value occurs.
  3. Then take the cumulative sum.
  4. Normalize, so as to get a real cdf, i.e., a non-decreasing function with \(\lim_{x\to -\infty} F(x) = 0\), and \(\lim_{x\to\infty} F(x) = 1\).

In one function it looks like this.

def ecdf(x):
    support, values = np.unique(x, return_counts=True)
    X = values.cumsum()
    return support, X / len(x)

Read the numpy documentation on np.unique to understand what this function does. Why don't we have to sort X before calling np.unique?

We could also divide by F[-1] instead of len(X); why is that?

Solution

Reading the documentation is homework. As for \(F[-1]\), recall that the cumsum adds up to all counted values. So, \(F[-1]\) contains the number of observations, which is equal to len(X).

We can make a nice plot of the ecdf with a bit of work. 1 shows the result of the code we discuss next. We start with the modules.

import matplotlib.pyplot as plt
import numpy as np
import scipy

For the plot we add filled and open dots to make explicit that the ecdf is a right-continuous function.

X = np.array([2, 5, 2, 1, 9, 5, 5, 5])
x, F = ecdf(X)

fig, ax = plt.subplots(1, 1, figsize=(6, 3))

for i in range(len(x) - 1):
    ax.hlines(y=F[i], xmin=x[i], xmax=x[i + 1], color='k')
    ax.plot(x[i], F[i], 'o', c='k', mfc='k', ms=3)  # closed circles
    ax.plot(x[i + 1], F[i], 'o', c='k', mfc='white', ms=3)  # open circles

# left boundary
ax.hlines(y=0, xmin=x[0] - 1 / 2, xmax=x[0], color='k')
ax.plot(x[0], 0, 'o', c="k", mfc='white', ms=3)
# right boundary
ax.hlines(y=1, xmin=x[-1], xmax=x[-1] + 0.5, color='k')
ax.plot(x[-1], 1, 'o', c="k", mfc='k', ms=3)

ax.set_xlabel('$x$')
ax.set_ylabel('$F$')
ax.set_title('ecdf')

fig.tight_layout()
fig.savefig("../images/ecdf.png")

nil

Figure 1: The ecdf of X = np.array([2, 5, 2, 1, 9, 5, 5, 5]).

With the ecdf we can make an empirical probability mass function (epmf) of the measurements \(x_1, \ldots, x_{N}\) in various ways. For instance, we can specify a \emph{bandwidth} \(h>0\) and then set

\begin{align*} f(x) = \frac1{Nh} \sum_{j=1}^N \1{x-\frac{h}{2} < x_j \leq x+ \frac{h}{2}} = \frac{F(x+h/2) - F(x-h/2)}h, \end{align*}

where \(F\) is the ecdf.

We can refine this idea by saying that points nearby \(u_{i}\) should contribute a bit more than points far away. To achieve this, use a weight function \(w(\cdot)\geq 0\), and set

\begin{equation} \label{org1837fbe} f(x) = \frac1{Nh} \sum_{j=1}^N w(x-x_{j}) \1{x-\frac{h}{2} < x_j \leq x+ \frac{h}{2}}. \end{equation}

A yet more general form is to replace the summands by a \emph{kernel} function \(K(\cdot)\) and take

\begin{equation*} \label{org622db21} f(x) = \frac{1}{N}\sum_{j=1}^N K_{h}(x-x_j), \end{equation*}

where \(K_h(x) := K(x/h)/h\). Such a function is called a Kernel Density Estimate (KDE). (Often \(K\) is symmetric, i.e., \(K(-x) = K(x)\), however this is not necessary.)

There are a number of different suitable functions for the kernel \(K\). Below we will use the choice \(\rho(x) = 1/(1+(x/\sigma)^{2})\).

Why is \(K(x) = \1{-0.5 < x < 0.5}\) in \eqref{org1837fbe}?

Solution \begin{align*} \frac{1}{h}\1{x-\frac{h}{2} < x_j \leq x + \frac{h}{2}} = \frac{1}{h}\1{-\frac{1}{2} < \frac{x_j-x}{h} \leq \frac{1}{2}}. \end{align*}

However, rather than setting the width for all bins equal to \(h\), we can instead find bin edges such that each bin has the same number of observations. So, supposing we want to have \(n\) bins, then we should take as bin edges for \(i=0, \ldots, n\),

\begin{align} v_i &= \min\{x_j : F(x_j) \geq i/n\}, & u_{i} &= (v_{i+1} + v_{i})/2, \end{align}

where \(x_{1}\leq x_{2}\leq \cdots \leq x_{n}\) are the sorted values of \(\{x_{i}\}\), and \(u_i\) are the midpoints of the bins.

Why should the height \(h\) of a bin with width \(\Delta\) be taken as \(h = (n \Delta)^{-1}\)?

Solution

We want each of the \(n\) boxes with height \(h\) and width \(\Delta\) to have the same probability to be hit. That is, we want that \(h \Delta = 1/n\) for all bins \(i=1, 2, \ldots, n\).

This is one way to compute the bin edges and heights.

def is_sorted_asc(arr):
    return np.all(arr[:-1] <= arr[1:])


def epmf(x, n=10):
    if not is_sorted_asc(x):
        x.sort()
    num = min(len(x), n + 1)
    v = x[np.linspace(0, len(x) - 1, num, dtype=int)]
    u = (v[1:] + v[:-1]) / 2
    delta = v[1:] - v[:-1]
    return u, 1 / num / delta

Why do we add \(1\) to the number of bins n? Why does linspace run up to len(x) - 1 instead of len(x)?

Solution
  1. To make n bins, we need n+1 bin edges. 2. linspace includes the boundaries, so up to and including \(100\) if x contains 100 elements. However, the last element of x has index \(99\).

To see whether this works well, we do one experiment with \(\Exp{\lambda}\) and a second experiment with \(\Unif{[0,1]}\). The results are in Fig.2.

gen = np.random.default_rng(3)

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(6, 3))

labda = 3
x, F = ecdf(gen.exponential(scale=1 / labda, size=10000))
u, f = epmf(x, n=30)
ax1.vlines(x=u, ymin=0, ymax=f, color='k', lw=0.2)
ax1.plot(u, f, 'ko', ms=2)
ax1.plot(x, labda * np.exp(-labda * x), color='k')
ax1.set_xlabel('x')
ax1.set_ylabel('epmf')
ax1.set_title('epmf exponential')

x, F = ecdf(gen.uniform(size=10000))
u, f = epmf(x, n=30)
ax2.vlines(x=u, ymin=0, ymax=f, color='k', lw=0.2)
ax2.plot(u, f, 'ko', ms=2)
ax2.plot(x, np.ones_like(x), color='k')
ax2.set_xlabel('x')
ax2.set_title('epmf uniform')

fig.tight_layout()
fig.savefig("../images/epmf.png")

nil

Figure 2: The epmf for samples from \(\Exp{\lambda}\) and from \(\Unif{[0,1]}\).

In view of how little work it is to code the ecdf and epmf, and how simple these methods are to understand, I find the results surprisingly good. However, there is a drawback of this procedure to estimate the pdf: the epmf contracts the probablity masses onto single points. What if we would like a smooth density function instead of a probability mass function? For that we need to interpolate between the observations; this will be the topic of the rest of this chapter.

2. Interpolation with Function Approximations

An attractive method to compute the epmf is to approximate the ecdf \(F\) by some smooth function \(x\to\phi(x)\). The specification as a function provides an estimate for the cdf in between observations, and the smoothness allows to use the derivative \(\phi'\) as a (hopefully) good approximation for the epmf. Here we'll implement this and see how it works. We note in passing that this idea is of general interest; in machine learning, the technique is known as using radial basis functions, c.f., wikipedia.

We start with a basis function ρ that has a peak at \(x=0\) and then tapers off to the left and right. The pmf of the normal distribution satisfies these constraints, but here we take

\begin{alignat}{2} \rho(x) = \frac{1}{1+(x/\sigma)^{2}} &\implies & \rho'(x) = -\frac{2 x/\sigma^2}{(\sigma^2+x^2)^{2}}. \end{alignat}

The number σ is a smoothing factor: the larger, the flatter ρ. Note that this is a hyper parameter that needs to be tuned to the data. (In all honesty, I find searching for hyper parameters very boring. There must be some theory on how to obtain reasonable values for σ, but I have not investigated this.)

With as centers the (sorted) support \(\bar x = \{x_i\}\) of the ecdf \(F\), we take as approximation for the ecdf

\begin{equation} \phi(x | a, \bar x) = \sum_{i=0}^{N} a_i \rho(|x-x_i|), \end{equation}

where the coefficients \(a = \{a_i\}\) are to determined such that \(\phi\) lies close to \(F\) in some sense. Note that we write \(\phi(x | a, x_{i})\) to make explicit that \(\phi\) depends on the parameter vector \(a\) and the observations \(\bar x\).

One way to find \(a\) is by means of a minimization problem. Writing \(y_i=F(x_i)\) for the values of the ecdf at the point \(x_{i}\), we can find \(a\) by solving the problem

\begin{align} \text{min}_{a} \sum_{j=1}^N (\phi(x_j|a, \bar x) - y_j)^{2}. \end{align}

By substituting the definition of ρ and introducting the (symmetric) matrix \(G\) with elements \(G_{ij} = \rho(|x_i-x_j|)\), we have that

\begin{align} \sum_{j=1}^N (\phi(x_j| a, \bar x) - y_j)^{2} = \sum_{j=1}^N \sum_{i=1}^{N} (a_{i}\rho(|x_j-x_i|) - y_j)^{2} = (a' G' - y')(G a - y). \end{align}

where \(a'\) means the transpose of \(a\). With this notation, it is clear that the best \(a\) must be the least squares minimizer of \(G a - y\).

This is the code for ρ and its derivative which we call rho_p because p stands for `prime'.

def rho(x):
    return 1 / (1 + (x / sigma) ** 2)


def rho_p(x):  # derivative of rho
    return -2 * x * sigma**2 / (sigma**2 + x**2) ** 2

Here is the code to solve the above problem. We start with drawing some random numbers. (Note that our code depends on all the regular python imports.) For no particular reason I take σ to be equal to 20 times the largest bin size in the support of the ecdf; it seems to work reasonable.

rv = scipy.stats.expon()
N = 1000
x, F = ecdf(rv.rvs(size=N, random_state=gen))
sigma = 20 * np.diff(x).max()

To save some numerical work, we don't take the full support \(\{x_i\}\) of \(F\) for the centers for \(\rho\), but just a subset. Second, observe that \(G\) is a matrix. To get the data \(x\) and the centers in the right dimensions, we have to augment these arrays with None at the right places.

centers = x[::10]
G = rho(np.abs(x[:, None] - centers[None, :]))
a = np.linalg.lstsq(G, F, rcond=None)[0]  # rcond silences a warning

xi = np.linspace(x.min(), x.max(), 30)  # points to test the approximation
phi = rho(np.abs(xi[:, None] - centers[None, :])) @ a
phi_p = rho_p(xi[:, None] - centers[None, :]) @ a

It remains to make a plot to see how things look.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(6, 3))
ax1.plot(x, F, 'k-', linewidth=1, label="ecdf")
ax1.plot(xi, phi, 'ko', ms=2, label="fit")
ax1.legend()
ax2.plot(xi, rv.pdf(xi), 'k-', linewidth=1, label="pdf")
ax2.plot(xi, phi_p, 'k.', ms=3, label="fit")
ax2.vlines(xi, 0, phi_p, colors='k', linestyles='-', lw=1)
ax2.legend()
fig.tight_layout()
fig.savefig('../images/radial_epmf_expon.png')

In Fig.3 we see that φ lies closely to the ecdf \(F\) in the left part of the graph, but the \(\phi'\) becomes negative around 6.

nil

Figure 3: The epmf for an \(\Exp{\lambda}\) rv.

All in all, for my simple goals I am not happy with this method. First, it is difficult to understand, at least quite a bit more than the previous method. Second, I need to choose and/or tune hyper parameter. This is something I don't like because I don't know how to do it, and if I choose not to know the theory, I have to search empiricallyfor good values, and that I find painfully boring. Third, numerically it is much more demanding than my earlier method to compute the epmf. Fourth, and finally, the observation that \(\phi'\) can become negative is a killer for me.

Perhaps it's possible to repair all this, by playing with \(\sigma\), using more points as centers, and finally, including checks that \(\phi'\) cannot become negative. However, I prefer the simpler method.

Solving some Integrals using Indicators and Fubini's Theorem

My students sometimes (often?) find it hard to see why, for example, for a positive continuous rvs

\begin{align*} \E X &= \int_0^{\infty} G(x) \d x, \\ \E{X^2} &= 2 \int_{0}^\infty x G(x) \d x. \end{align*}

Here I show how to do this. We start with a discrete positive rv.

For a nonnegative discrete random variable \(X\), use an indicator function to prove that

\begin{align*} \E X = \sum_{k=0}^\infty G(k), \end{align*}

where \(G(k) = \P{X > k}\).

Solution \begin{align*} \sum_{k=0}^\infty G(k) &= \sum_{k=0}^\infty \P{X > k} = \sum_{k=0}^\infty \sum_{m=k+1}^\infty \P{X=m} \\ & = \sum_{k=0}^\infty \sum_{m=0}^\infty \1{m > k} \P{X=m} = \sum_{m=0}^\infty \sum_{k=0}^\infty \1{m > k} \P{X=m} \\ &= \sum_{m=0}^\infty m\P{X=m} = \E X. \end{align*}

For a nonnegative discrete \(X\), use an indicator function to prove that

\begin{align*} \sum_{i=0}^\infty i G(i) = \frac{\E{X^2}}2 - \frac{\E{X}}2. \end{align*}
Solution \begin{align*} \sum_{i=0}^\infty i G(i) &= \sum_{i=0}^\infty i \sum_{n=i+1}^\infty \P{X=n} = \sum_{n=0}^\infty \P{X=n} \sum_{i=0}^\infty i \1{n\geq i+1} \\ &= \sum_{n=0}^\infty \P{X=n} \sum_{i=0}^{n-1}i = \sum_{n=0}^\infty \P{X=n} \frac{(n-1)n}{2} \\ &= \sum_{n=0}^\infty \frac{n^2}{2} \P{X=n} - \frac{\E X}{2} = \frac{\E{X^2}}{2} - \frac{\E X}{2}. \end{align*}

For general non-negative \(X\), use an indicator function to prove that

\begin{align*} \E X = \int_0^\infty x \d F(x) = \int_0^\infty G(y) \d y, \end{align*}

where \(G(x) = 1 - F(x)\).

Solution \begin{align*} \E{X} &= \int_0^\infty x \d F(x) = \int_0^\infty \int_0^x \d y \d F(x) \\ & = \int_0^\infty \int_0^\infty \1{y\leq x} \d y \d F(x) = \int_0^\infty \int_0^\infty \1{y\leq x} \d F(x) \d y\\ & = \int_0^\infty \int_y^\infty\d F(x) \d y = \int_0^\infty G(y) \d y. \end{align*}

Use an indicator function to prove that for a general non-negative random variable \(X\)

\begin{align*} \E{X^2} = 2 \int_0^\infty y G(y) \d y, \end{align*}

where \(G(x) = 1 - F(x)\).

Solution \begin{align*} \int_0^\infty y G(y) \d y &= \int_0^\infty y \int_y^\infty f(x)\, \d x \d y = \int_0^\infty y \int_0^\infty \1{y\leq x}f(x)\, \d x \d y\\ &= \int_0^\infty f(x) \int_0^\infty y \1{y \leq x}\, \d y \d x = \int_0^\infty f(x) \int_0^x y\, \d y \d x\\ &= \int_0^\infty f(x) \frac{x^2}2 \d x =\frac{\E{X^2}}2. \end{align*}

Use integration by parts to see that \(\E{X^2} = 2 \int_0^\infty y G(y) \d y\).

Hint

In \(\int_0^\infty y G(y) \d y\) integrate first the \(y\) to \(y^2/2\).

Solution \begin{equation*} \int_0^\infty y G(y) \d y = \frac{y^2}2 G(y) \bigg|_0^\infty - \int_0^\infty \frac{y^2}2 g(y)\d y = \int_0^\infty \frac{y^2}2 f(y)\d y = \frac{\E{X^2}}2, \end{equation*}

since \(g(y) = G'(y) = - F'(y) = - f(y)\). Note that we used \(\frac{y^2}2 G(y) \bigg|_0^\infty = 0 - 0 = 0\), which follows from our assumption that \(\E{X^2}\) exists, implying that \(\lim_{y \to \infty} y^2G(y) = 0\).


Use substitution to see that \(\E{X^2} = 2\int_0^\infty y G(y) \d y\).

Hint

Substitute \(y = \sqrt x \implies \d y = \d x/ 2 \sqrt x \implies \d x = 2 y \d y\).

Solution \begin{align*} \int_{0}^{\infty} y^2 \d F(y) &=\int_{0}^{\infty} \int_{0}^{\infty}\1{x\leq y^2} \d x \d F(y) = \int_{0}^{\infty} G(\sqrt x) \d x \\ &= 2\int_{0}^{\infty} y G(y) \d y, \end{align*}

Memoryless Excursions: A confusing problem with memoryless rvs

1. Introduction

BH give a quick argument to compute \(\E M\) and \(\E L\) where \(M=\max\{X,Y\}\) and \(L=\min\{X,Y\}\) are the maximum and minimum of two iid exponential rvs \(X\) and \(Y\). Since \(X\) and \(Y\) have the same distribution,

\begin{equation*} \E L + \E M = \E{L+M}=\E{X+Y} = 2\E X. \end{equation*}

Therefore,

\begin{equation} \label{org8619836} \E M = 2\E X - \E L. \end{equation}

Next, by the fact that \(X\) and \(Y\) are memoryless,

\begin{equation} \label{orgf33074a} \E M = \E L + \E X. \end{equation}

An interpretation can help to see this. There are two machines, each working on a job in parallel. Let \(X\) and \(Y\) be the production times at either machine. The time the first job finishes is evidently \(L=\min\{X, Y\}\). Then, due to memorylessness, the service time of the remaining job `restarts'; this takes an expected time \(\E X\) to complete. Adding these two equations and noting that \(\E L\) cancels we get \(2\E M = 3 \E X \), hence:

\begin{align} \label{org17bdd80} \E M &= \frac 3 2 \E X, & \E L &= \E M - \E X = \frac 1 2 \E X. \end{align}

This argument seems general enough, so it must hold for discrete memoryless rvs too, i.e., when \(X, Y \sim \Geo{p}\). But that is not the case: it is only true when \(X, Y \sim \Exp{\lambda}\) and independent. To see what is wrong I tried as many different approaches to this problem I could think of, which resulted in this text.1 In 2 we'll derive \eqref{org8619836} for geometric rvs in multiple different ways. Hence, the culprit must be \eqref{orgf33074a}. Then, in 3 we'll show that both equations are true for exponential rvs. Finally, in 4 we find a formula that is similar to \(\E M = \E L + \E X\) but that holds for both types of memoryless rvs, whether they are discrete or continuous.

The analysis of the above problem illustrates many general and useful probability concepts such as joint CDF, joint PMF/PDF, the fundamental bridge, integration over 2D areas, 2D LOTUS, conditional PMF/PDF, MGFs, and the change of variables formula. It pays off to do the exercises yourself and then study the hints and solutions carefully.

You'll notice, hopefully, that I use many different methods to solve the same problem, and that I take pains to see how the answers of these methods relate. There are at least two reasons for this. Often, a problem can be solved in multiple ways, and one method is not necessarily better than another; in fact, different methods may augment the understanding of the problem. The second reason is that it is easy to make a mistake in probability. If different methods give the same answer, the chances of having made a mistake becomes smaller.

2. Discrete memoryless rvs

Before embarking on a problem, it often helps to refresh our memory. This is what we do first. Let \(X\) be \(\sim \Geo{p}\) and write \(q = 1-p\).

What is the domain of \(X\)?

Solution

Of course \(X \in \{0, 1, 2, \ldots\}\).

With some fun tricks with recursions it is possible to quickly derive the most important expressions for geometric rvs:

\begin{align*} \P{X > 0} &= \P{\textrm{failure}} = q\\ \P{X > j} &= q \P{X > j-1} \implies \P{X > j}=q^{j}\P{X > 0} = q^{j+1}.\\ \P{X\geq j} &= \P{X > j-1} = q^{j}.\\ \P{X=j} &= \P{X > j-1} - \P{X > j} = q^{j}-q^{j+1} = (1-q)q^{j} = p q^{j}.\\ \E X &= p\cdot 0 + q(1+\E X) \implies \E X = q/(1-q) = q/p. \end{align*}

Mind that, even though this is neat, it only work for geometric rvs.

Explain the above.

Solution

For \(X > 0\), the first outcome should be a failure. Then, for \(j\) failures, we need to fail \(j-1\) times and then once more. For \(\E X\), if there is a success, we don't need another another experiment. However, in case of a fail, we need another experiment, and we start again. Thus, \(\E X = q(1+\E X) \implies (1-q)\E X = q.\)

Clearly, such tricks are nice and quick, but they are not general. We should also practice with the general method.

Simplify \(\P{X > j}= \sum_{i=j+1}^{\infty} \P{X=i}\) to see that this is equal to \(q^{j+1}\). Realize that from this, \(\P{X\geq j} = \P{X > j-1} = q^{j}\).

Solution

With the regular method:

\begin{align*} \P{X > j} &= \sum_{i=j+1}^{\infty} \P{X=i} \\ &= p\sum_{i=j+1}^{\infty} q^{i} \\ &= p\sum_{i=0}^{\infty} q^{j+1+i} \\ &= pq^{j+1}\sum_{i=0}^{\infty} q^{i} \\ &= pq^{j+1}\frac 1 {1-q} = p q^{j+1} \frac 1 p = q^{j+1}. \end{align*}

Use indicator variables to show that

\begin{equation} \label{orgb0ec031} \E X = \sum_{i=0}^{\infty} i\P{X=i} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \1{j < i} \P{X=i} =p/q. \end{equation}
Solution \begin{align*} \E X &= \sum_{i=0}^{\infty} i\P{X=i} \\ &= \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \1{j < i} \P{X=i} \\ &= \sum_{j=0}^{\infty} \sum_{i=0}^{\infty} \1{j < i} \P{X=i} \\ &= \sum_{j=0}^{\infty} \sum_{i=j+1}^{\infty} \P{X=i} \\ &= \sum_{j=0}^{\infty} \P{X > j} \\ &= \sum_{j=0}^{\infty} q^{j+1}\\ &= q \sum_{j=0}^{\infty} q^{j}\\ &= q/(1-q) = q/p. \end{align*}

Look up the definition of a memoryless rv, and check that \(X\) is memoryless.

Solution \begin{align*} \P{X\geq n+m\given X\geq m} &= \frac{ \P{X\geq n+m, X\geq m}}{\P{X\geq m}} \\ &= \frac{ \P{X\geq n+m}}{\P{X\geq m}} \\ &= \frac{ q^{n+m}}{q^m} \\ &= q^n= \P{X\geq n}. \end{align*}

With this refresher we can derive some useful properties of the minimum \(L=\min\{X, Y\}\), where \(X, Y\sim \Geo{p}\) and \(X, Y\) independent. For this we use that

\begin{equation*} \P{g(X,Y}\in A\} \stackrel1= \E{\1{g(X, Y) \in A}} \stackrel2= \sum_{i}\sum_j \1{g(i, j)\in A} \P{X=i, Y=j}. \end{equation*}

Step 1 is known as the fundamental bridge and step 2 as 2D LOTUS.

What is the domain of \(L\)? Then, show that

\begin{equation*} \P{L\geq i}=q^{2i} \implies L\sim\Geo{1-q^{2}}. \end{equation*}
Hint

For 2D LOTUS, take \(g(i,j) = \min\{i, j\}\).

Solution \begin{align*} \P{L\geq k} &= \sum_{i}\sum_{j} \1{\min\{i,j\}\geq k} \P{X=i, Y=j}\\ &= \sum_{i\geq k} \sum_{j\geq k} \P{X=i}\P{Y=j}\\ &= \P{X\geq k}\P{Y\geq k} = q^{k} q^{k} = q^{2k}. \end{align*}

\(\P{L > i}\) has the same form as \(\P{X > i}\), but now with \(q^{2i}\) rather than \(q^{i}\).


Show that

\begin{equation} \label{org347686d} \E L=q^{2}/(1-q^{2}). \end{equation}
Hint

\(X\sim \Geo{1-q} \implies \E X = q/(1-q)\). Now use that \(L\sim\Geo{1-q^{2}}\).

Solution

Immediate from the hint and \eqref{orgb0ec031}.


Show that

\begin{equation} \label{org526c9dd} \E L + \E X = \frac q{1-q}\frac{1+2q}{1+q}. \end{equation}
Hint

Use that \(1-q^2=(1-q)(1+q)\).

Solution \begin{align*} \E L + \E X &= \frac{q^{2}}{1-q^2} + \frac{q}{1-q} \\ &= \frac q{1-q}\left(\frac{q}{1+q} + 1\right)\\ &= \frac q{1-q} \frac{1+2q}{1+q} \end{align*}

Now we can combine these facts with the properties of the maximum \(M=\max\{X, Y\}\).

Show that

\begin{equation} \label{orgb930347} 2\E X -\E L = \frac q{1-q} \frac{2+q}{1+q}. \end{equation}
Solution \begin{align*} 2\E X - \E L &= 2\frac q{1-q} - \frac{q^2}{1-q^2}\\ &= \frac q{1-q}\left(2 - \frac{q}{1+q}\right)\\ &= \frac q{1-q}\left(\frac{2+2q}{1+q} - \frac{q}{1+q}\right)\\ &= \frac q{1-q}\frac{2+q}{1+q}. \end{align*}

Clearly, unless \(q=0\), \(\E L + \E X \neq 2\E X - \E L\), hence, \(\E M\) can only be one of the two and \eqref{org8619836} and \eqref{orgf33074a} cannot be both true.

To convince ourselves that \eqref{orgb930347}, hence \eqref{org8619836}, is indeed true, we pursue three ideas.

Here is the first idea.

Show for the PMF of \(M\) that

\begin{align*} p_M(k) &= \P{M=k} = 2 pq^{k}(1-q^{k}) + p^2q^{2k}. \end{align*}
Hint

Use 2D LOTUS on \(g(x,y) = \1{\max\{x, y\} = k}\).

Solution \begin{align*} \P{M=k} &= \P{\max\{X, Y\} = k} \\ &=p^{2}\sum_{ij}\1{\max\{i,j\} =k} q^i q^j\\ &=2 p^{2}\sum_{ij}\1{i=k}\1{j < k} q^i q^j + p^{2}\sum_{ij}\1{i=j=k} q^{i} q^j \\ &=2 p^{2}q^{k}\sum_{j < k} q^j + p^{2}q^{2k}\\ &=2 p^{2}q^{k}\frac{1-q^{k}}{1-q} + p^{2}q^{2k}\\ \end{align*}

With the previous exercise, show now that \(p_M(k) =2 \P{X=k} - \P{L=k}\).

Solution

It's just algebra

\begin{align*} \P{M=k} &=2 p^{2}q^{k}\frac{1-q^{k}}{1-q} + p^{2}q^{2k}\\ &=2 pq^{k}(1-q^{k}) + p^{2}q^{2k}\\ &=2 pq^{k} +(p^{2}-2p) q^{2k}\\ &=2 \P{X=k} - \P{L=k}, \end{align*}

where I use that \(p^{2}-2p = p(p-2) = (1-q)(1-q-2)=-(1-q)(1+q)=-(1-q^{2})\).


Finally, show that \(\E M = 2 \E X - \E L\).

Hint
Solution \begin{align*} \E M &= \sum_k k \P{M=k} \\ &= \sum_{k}k (2\P{X=k} - \P{L=k}) \\ &= 2 \E X - \E L. \end{align*}

The second idea.

First show that \(\P{M \leq k} = (1-q^{k+1})^{2}.\)

Solution \begin{align*} \P{M\leq k} &= \P{X\leq k, Y\leq k} \\ &= \P{X\leq k}\P{Y\leq k} \\ &= (1-\P{X > k})(1-\P{Y > k}) \\ &= (1-q^{k+1})^{2}. \end{align*}

Simplify \(\P{M=k} = \P{M\leq k} - \P{M\leq k-1}\) to see that \(p_M(k) = 2\P{X=k} - \P{L=k}\).

Solution \begin{align*} \P{M=k} &= \P{M\leq k} - \P{M\leq k-1}\\ &= 1-2 q^{k+1} + q^{2k+2} - (1 - 2q^{k} + q^{2k})\\ &= 2 q^{k}(1-q) + q^{2k}(q^{2}-1) \\ &= 2 \P{X=k} - q^{2k}(1-q^{2}) \\ &= 2 \P{X=k} - \P{L=k}. \end{align*}

And the third third idea.

Explain that

\begin{equation*} \P{L=i, M=k} = 2p^{2}q^{i+k}\1{k > i}+ p^{2}q^{2i}\1{i=k}). \end{equation*}
Solution \begin{align*} \P{L=i, M=k} &= 2\P{X=i, Y=k}\1{k > i} + \P{X=Y=i}\1{i=k} \\ &= 2p^{2}q^{i+k}\1{k > i}+ p^{2}q^{2i}\1{i=k}. \end{align*}

Use 1 and marginalization to compute the marginal PMF \(\P{M=k}\).

Hint

Marginalize out \(L\).

Solution \begin{align*} \P{M=k} &= \sum_{i} \P{L=i, M=k} \\ &= \sum_{i} (2p^{2}q^{i+k}\1{k > i}+ p^{2}q^{2i}\1{i=k}) \\ &= 2p^2q^k\sum_{i=0}^{k-1} q^{i}+ p^{2}q^{2k} \\ &= 2pq^k (1-q^{k})+ p^{2}q^{2k} \\ &= 2pq^k + (p^{2}-2p) q^{2k}, \end{align*}

Use 1 to compute \(\P{L=i}\).

Hint

Marginalize out \(M\).

Solution \begin{align*} \P{L=i} &= \sum_{k} \P{L=i, M=k} \\ &= \sum_{k} (2p^{2}q^{i+k}\1{k > i}+ p^{2}q^{2i}\1{i=k}) \\ &= 2p^2q^i\sum_{k=i+1}^{\infty} q^{k}+ p^{2}q^{2i} \\ &= 2p^{2}q^{2i+1} \sum_{k=0}^{\infty}q^{k}+ p^{2}q^{2i} \\ &= 2pq^{2i+1} + p^{2} q^{2i}\\ &= pq^{2i}(2q+p)\\ &= (1-q)q^{2i}(q + 1), \quad p+q=1,\\ &= (1-q^{2})q^{2i}. \end{align*}

In conclusion we verified the correctness of \(\E M = 2\E X - \E L\) in three different (useful) ways. Let us now focus on exponential rvs rather than geometric rvs.

3. Continuous memoryless rvs

In this section we analyze the correctness of \eqref{org8619836} and \eqref{orgf33074a} for continuous memoryless rvs, i.e., exponentially distributed rvs. I decided to analyze this in as much detail as I could think of, hoping that this would provide me with a lead to see how to generalize the equation \(\E M = \E L + \E X\) such that it covers also the case with geometric rvs.

First we need to recall some basic facts about the exponential distribution.

Show that \(X\) is memoryless.

Solution

With \(t \geq s\),

\begin{equation*} \P{X > t|X > s} = \frac{\P{X > t}}{\P{X > s}} = e^{-\lambda t} e^{\lambda s} = e^{-\lambda(t-s)} = \P{X > t - s}. \end{equation*}

Show that \(\E X = 1/\lambda\).

Hint

Use partial integration.

Solution

It is essential that you know both methods to solve this integral.

\begin{align*} \E X &= \int_{0}^{\infty} x f_{X}(x) \d x \\ &= \int_{0}^{\infty} x \lambda e^{-\lambda x} \d x \\ &= - xe^{-\lambda x} \biggr|_{0}^{\infty} + \int_{0}^{\infty} e^{-\lambda x} \d x \\ &= - \frac 1 \lambda e^{-\lambda x} \biggr|_{0}^{\infty} = \frac 1 \lambda. \end{align*}

Substitution is also a very important technique to solve such integrals. Here we go again:

\begin{align*} \E X &= \int_{0}^{\infty} x f_{X}(x) \d x \\ &= \int_{0}^{\infty} x \lambda e^{-\lambda x} \d x \\ &= \frac 1 \lambda \int_{0}^{\infty} u e^{- u} \d u, \end{align*}

by the substitution \(u=\lambda x \implies \d u = \d (\lambda x) \implies \d u = \lambda \d x \implies \d x = \d u / \lambda\). With partial integration (do it!), the integral evaluates to \(1\).

Now we can shift our attention to the rvs \(L\) and \(M\).

Show that \(F_L(x) = 1-e^{-2\lambda x}\).

Hint

Using independence and the specific property of the r.v. \(L\) that \(\{L > x\} \iff \{X > x, Y > x\}\):

Solution

With the hint,

\begin{align*} G_L(x) = \P{L > x} &= \P{X > x, Y > x} = G_X(x)^{2} = e^{-2\lambda x}. \end{align*}

The result follows since \(F_{L}(x) = 1-G_{L}(x)\).

Clearly, this implies that \(L\sim \Exp{2\lambda}\) and \(\E L = 1/(2\lambda) = \E X /2\). Hence, we see that \eqref{org17bdd80} holds now. Moreover, with the same trick we see that the distribution function \(F_{M}\) for the maximum \(M\) is given by

\begin{equation*} F_{M}(v) = \P{M\leq v} = \P{X \leq v, Y \leq v} = (F_{X}(v))^{2} = (1-e^{-\lambda v})^{2}. \end{equation*}

Of course this is a nice trick, but it is not a method that allows us to compute the distribution for more general functions of \(X\) and \(Y\). For more general cases, we have to use the fundamental bridge and LOTUS, that is, for any set2 \(A\) in the domain of \(X \times Y\)

\begin{equation*} \begin{split} \P{g(X, Y) \in A} &= \E{\1{g(X,Y)\in A}} = \iint \1{g(x,y) \in A} f_{X,Y}(x,y) \d x \d y \\ &= \iint_{g^{-1}(A)} f_{X,Y}(x, y) \d x \d y = \iint_{\{(x,y): g(x,y) \in A\}} f_{X,Y}(x,y) \d x \d y. \end{split} \end{equation*}

The joint CDF \(F_{X,Y}\) then follows because \(F_{X,Y}(x,y) = \P{X\leq x, Y\leq y} = \E{\1{X\leq x, Y\leq y}}\). A warning is in place: conceptually this approach is easy, but doing the integration can be very challenging (or impossible). However, this expression is very important as this is the preferred way to compute distributions by numerical methods and simulation.

Use the fundamental bridge to re-derive the above expression for \(F_M(v)\).

Solution \begin{align*} \P{M\leq v} &= \E{\1{M\leq v}} \\ &=\int_0^{\infty} \int_0^{\infty} \1{x\leq v, y\leq v} f_{XY}(x,y) \d x \d y \\ &= \int_0^{\infty} \int_0^{\infty} \1{x\leq v, y\leq v} f_{X}(x) f_{Y}(y) \d x \d y \\ &= \int_0^{v} f_{X}(x) \d x \int_0^{v} f_{Y}(y) \d y \\ &= F_{X}(v) F_Y(v) = (F_X(v))^2. \end{align*}

Show that the density of \(M\) has the form \(f_{M}(v)=\partial_{v} F_M(v) = 2(1-e^{-\lambda v}) \lambda e^{-\lambda v}\).3

Solution \begin{equation*} f_M(v) = F_M'(v) = 2 F_X(v) f_X(v) = 2(1-e^{-\lambda v}) \lambda e^{- \lambda v}. \end{equation*}

Use the density \(f_{M}\) to show that \(\E M = 2\E X - \E L\).

Solution \begin{align*} \E M &= \int_{0}^{\infty} v f_M(v) \d v = \\ &= 2 \lambda \int_{0}^{\infty} v (1-e^{-\lambda v}) e^{-\lambda v} \d v = \\ &= 2 \lambda \int_{0}^{\infty} v e^{-\lambda v} \d v - 2 \lambda \int_{0}^{\infty} v e^{-2 \lambda v} \d v \\ &= 2 \E X - \E L, \end{align*}

where the last equality follows from the previous exercises.

Recalling that we already obtained \(\E L = \E X/ 2\), we see that \(\E M = 2 \E X - \E L = 3\E X/2\), which settles the truth of \eqref{org17bdd80}.

We can also compute the densities \(f_{M}(y)\) (and \(f_{L}(x)\) by marginalizing the joint density \(f_{L,M}(x,y)\). However, for this, we first need the joint distribution \(F_{L,M}\), and then we can get \(f_{L,M}\) by differentiation, i.e., \(f_{X,Y} = \partial_{x}\partial_y F_{X,Y}\). Let us try this approach too.

Use the fundamental bridge to show that for \(u\leq v\),

\begin{equation*} F_{L,M}(u,v) = \P{L\leq u, M\leq v} = 2\int_0^u (F_Y(v)- F_Y(x)) f_X(x) \d x. \end{equation*}
Solution

First the joint distribution. With \(u\leq v\),

\begin{align*} F_{L,M}(u,v) &= \P{L\leq u, M \leq v} \\ &= 2\iint \1{x \leq u, y\leq v, x\leq y} f_{X,Y}(x,y)\d x \d y \\ &= 2\int_0^u \int_x^v f_Y(y) \d y f_X(x) \d x & \text{independence} \\ &= 2\int_0^u (F_Y(v)- F_Y(x)) f_X(x) \d x. \end{align*}

Take partial derivatives to show that

\begin{equation*} f_{L,M}(u,v) = 2f_X(u) f_Y(v)\1{u\leq v}. \end{equation*}
Solution

Taking partial derivatives,

\begin{align*} f_{L,M}(u,v) &=\partial_v\partial_{u}F_{L,M}(u,v) \\ &=2 \partial_v\partial_{u} \int_0^u (F_Y(v)- F_Y(x)) f_X(x) \d x \\ &=2 \partial_v \left\{(F_Y(v)- F_Y(u)) f_X(u) \right \} \\ &=2 f_X(u)\partial_v F_Y(v) \\ &=2 f_X(u)f_Y(v). \end{align*}

In 1 marginalize out \(L\) to find \(f_M\), and marginalize out \(M\) to find \(f_L\).

Solution \begin{align*} f_M(v) &= \int_0^{\infty} f_{L,M}(u,v) \d u \\ &= 2 \int_0^{v} f_{X}(u) f_Y(v) \d u \\ &= 2 f_Y(v) \int_0^{v} f_{X}(u) \d u \\ &= 2f_Y(v) F_X(v), \\ f_L(u) &= \int_0^{\infty} f_{L,M}(u,v) \d v \\ &= 2 f_{X}(u) \int_u^{\infty} f_Y(v) \d v \\ &= 2 f_{X}(u) G_{Y}(u). \end{align*}

We did a number of checks for the case \(X, Y, \iid, \sim\Exp{\lambda}\), but I have a second way to check the consistency of our results. For this I use the idea that the geometric distribution is the discrete analog of the exponential distribution. Now we study how this works, and that by taking proper limits we can obtain the results for the continuous setting from the discrete setting.

First, let's try to obtain an intuitive understanding of how \(X\sim \Geo{\lambda/n}\) approaches \(Y\sim\Exp{\lambda}\) as \(n\to\infty\). For this, divide the interval \([0, \infty)\) into many small intervals of length \(1/n\). Let \(X\sim \Geo{\lambda/n}\) for some \(\lambda > 0\) and \(n\gg 0\). Then take some \(x\geq 0\) and let \(i\) be such that \(x\in[i/n, (i+1)/n)\).

Show that

\begin{equation} \label{org01835a1} \P{X/n \approx x} \approx \frac{\lambda} n \left(1-\frac \lambda n\right)^{x n}. \end{equation}
Solution

First,

\begin{align*} \P{X/n \approx x} &= \P{X/n \in [i/n, (i+1)/n]} = \P{X \in [i, i+1]} = p q^{i} \\ &\approx p q^{n x} = \frac{\lambda} n \left(1-\frac \lambda n\right)^{xn}, \end{align*}

since \(p=\lambda/n\), \(q=1-p=1-\lambda/n\), and \(i=nx\).

Next, introduce the highly intuitive notation4 \(\d x = \lim_{n\to \infty} 1/n\), and use the standard limit5 \((1-\lambda/n)^n\to e^{-\lambda}\) as \(n\to\infty\) to see that \eqref{org01835a1} converges to

\begin{equation*} \P{X/n \approx x} \to \lambda e^{-\lambda x} \d x = f_{X}(x) \d x,\quad\text{ as } n \to \infty. \end{equation*}

If you don't like this trick with \(\d x\), here is another method, based on moment-generating functions.

Derive the moment-generating function \(M_{X/n}(s)\) of \(X/n\) when \(X\sim \Geo{p}\). Then, let \(p = \lambda/n\), and show that \(\lim_{n\to\infty}M_{X/n}(s) = M_{Y}(s)\), where \(Y \sim \Exp{\lambda}\).

Hint

If you recall the Poisson distribution, you know that \(e^{\lambda} = \sum_{i=0}^{\infty}\lambda^{i}/i!\). In fact, this is precisely Taylor's expansion of \(e^{\lambda}\).

Solution \begin{align*} M_{X/n}(s) &= \E{e^{s X/n}} = \sum_{i} e^{s i/n} p q^{i} = p\sum_{i} (qe^{s/n})^{i} = \frac{p}{1-qe^{s/n}}. \end{align*}

With \(p=\lambda/n\) this becomes

\begin{align*} M_{X/n}(s) &= \frac{\lambda/n}{1-(1-\lambda/n) (1+s/n + 1/n^{2}\times(\cdots))} \\ &= \frac{\lambda/n}{\lambda/n - s/n + 1/n^{2}\times (\cdots)} \\ &= \frac{\lambda}{\lambda - s + 1/n\times(\cdots)} \\ &\to \frac{\lambda}{\lambda - s}, \quad\text{as } n\to \infty, \end{align*}

where we write \(1/n^{2}\times(\cdots)\) for all terms that will disappear when we take the limit \(n\to \infty\). This is just handy notation to hide details in which we are not interested.

With these limits in place, we can relate the minimum \(L=\min\{X, Y\}\) for the discrete and the continuous settings.

Suppose that \(X, Y\sim \Geo{\lambda/n}\), then check that \(\lim_{n\to\infty}\E{L/n} = 1/2\lambda\).

Hint

Use \eqref{org347686d}.

Solution \begin{align*} \E{L/n} &= \frac 1 n \E L = \frac 1 n \frac{q^2}{1-q^{2}} \\ &= \frac 1 n \frac{(1-\lambda/n)^2}{1-(1-\lambda/n)^{2}} \\ &= \frac 1 n \frac{1-2 \lambda/n + (\lambda/n)^2}{2 \lambda /n + (\lambda/n)^{2}} \\ &= \frac{1-2 \lambda/n + (\lambda/n)^2}{2 \lambda + \lambda^{2}/n} \\ &\to \frac 1{2\lambda}. \end{align*}

Clearly, \(1/2\lambda=\E X/2\) when \(X\sim\Exp{\lambda}\).

Here is yet another check on the correctness of \(f_M(x)\).

Show that the PMF \(\P{M=k}\) for the discrete \(M\) in 1 converges to \(f_M(x)\) of 1 when \(n\to \infty\). Take \(k\) suitable.

Solution

Take \(p=\lambda/n\), \(q=1-\lambda/n\), and \(k \approx x n\), hence \(k/n \approx x\). Then,

\begin{align*} \P{M/n=k/n} &= 2 p q^{k/n}(1-q^{k/n}) + p^2q^{2k/n} \\ &= 2 \frac \lambda n \left(1-\frac \lambda n\right)^{k/n}\left(1-\left(1-\frac \lambda n\right)^{k/n}\right) + \frac{\lambda^2}{n^2}\left(1-\frac \lambda n \right)^{2k/n} \\ &\to 2 \lambda \d x e^{-\lambda x} \left(1 - e^{-\lambda x}\right) + \lambda^{2} \d x^{2}e^{-2\lambda}. \end{align*}

Now observe that the second term, proportional to \(\d x^{2}\) can be neglected.

Finally a third way to check the above results, namely by verifying \eqref{orgf33074a}, i.e. \(\E{M-L} = \E X\). For this, we compute the joint CDF \(f_{L, M-L}(x, y)\). With this, you'll see directly how to compute \(\E{M-L}\).

Use the fundamental bridge to obtain

\begin{equation*} F_{L,M-L}(x,y) = (1- e^{-2\lambda x}) (1-e^{-\lambda y}) = F_L(x) F_Y(y). \end{equation*}
Hint

The idea is not difficult, but the technical details require attention, in particular the limits in the integrations.

Solution \begin{align*} F_{L, M-L}(x, y) &=\P{L\leq x, M-L\leq y}\\ &= 2\P{X\leq x, Y-X \leq y, X \leq Y} \\ &= 2\int_0^{\infty} \int_0^{\infty} \1{u\leq x, v-u\leq y, u \leq v} f_{X,Y}(u, v) \d u \d v\\ &= 2\int_0^{\infty} \int_0^{\infty} \1{u\leq x, v-u\leq y, u\leq v} \lambda^2e^{-\lambda u}e^{-\lambda v}\d u \d v\\ &= 2\int_0^{x} \int_0^{\infty} \1{u\leq v\leq u+y} \lambda^2e^{-\lambda u}e^{-\lambda v}\d v \d u\\ &= 2\int_0^{x} \lambda e^{-\lambda u} \int_u ^{u + y} \lambda e^{-\lambda v}\d v \d u\\ &= 2\int_0^{x} \lambda e^{-\lambda u} (-e^{-\lambda v}) \biggr|_u^{u+y}\d u \\ &= 2\int_0^{x} \lambda e^{-\lambda u} (e^{-\lambda u} - e^{-\lambda (u+y)})\d u \\ &= 2\lambda\int_0^{x} e^{-2 \lambda u}\d u - 2\lambda \int_{0}^{x} e^{-\lambda (2u+y)}\d u \\ &= 2\lambda\int_0^{x} e^{-2 \lambda u}\d u - 2\lambda e^{-\lambda y }\int_{0}^{x} e^{-2 \lambda u}\d u \\ &= (1-e^{-\lambda y}) 2 \lambda \int_0^{x} e^{-2 \lambda u} \d u\\ &= (1-e^{-\lambda y}) ( -e^{-2\lambda u}) \biggr|_0^{x} \\ &= (1-e^{-\lambda y})(1- e^{-2\lambda x}). \end{align*}

Conclude that \(M-L\) and \(L\) are independent, and \(M-L\sim Y\).

Solution

As \(F_{L,M-L}(x,y) = F_Y(y) F_L(x)\). So the CDF factors as a function of \(x\) only and a function of \(y\) only. This implies that \(L\) and \(M-L\) are independent, and moreover that \(F_{M-L}(y) = F_Y(y)\), so \(M-L \sim Y\). We can also see this from the joint PDF:

\begin{equation*} f_{L, M-L}(x,y) = \partial_x \partial_y (F_Y(y) F_L(x)) = f_Y(y) f_L(x), \end{equation*}

so the joint PDF (of course) also factors. The independence now follows from BH 7.1.21. #but it is also okay to conclude it directly from the CDF factoring as a function of \(x\) only and a function of \(y\) only. Because \(L\) and \(M-L\) are independent, the conditional density equals the marginal density:

\begin{equation*} f_{M-L|L}(y|x) = \frac{f_{L, M-L}(x, y)}{f_L(x)} = \frac{f_Y(y) f_L(x)}{f_L(x)} = f_Y(y). \end{equation*}

By the above exercise, we find that \(\E{M-L} = \E Y = \E X\), as \(X\) and \(Y\) are iid.

This makes me wonder whether \(M-L\) and \(L\) are also independent for the discrete case, i.e., when \(X, Y\) iid and \(\sim \Geo{p}\). Hence, we should check that for all \(i, j\)

\begin{equation} \label{org0679a78} \P{L=i, M-L=j} = \P{L=i} \P{M-L=j}. \end{equation}

Use 1 to see that

\begin{equation*} \P{L=i, M-L=j} = 2p^2q^{2i+j} \1{j\geq 1} + p^2q^{2i}\1{j=0}. \end{equation*}
Hint

Realize that \(\P{L=i, M-L=j}= \P{L=i, M=i+j}\). Now fill in the formula of 1.

Now for the RHS.

Derive that

\begin{equation*} \P{M-L=j} = \frac{2p^2q^{j}}{1-q^2}\1{j\geq 1} + \frac{p^2q^{j}}{1-q^2}\1{j=0}. \end{equation*}
Solution

Suppose \(j > 0\) (for \(j=0\) the maths is the same). Then,

\begin{align*} \P{M-L=j} &=2\sum_{i=0}^{\infty}\P{X =i, Y=i+j } = &=2\sum_{i=0}^{\infty}p q^i pq^{i+j} = &=2p^{2}q^{j}\sum_{i=0}^{\infty}q^{2i} = &=2p^{2}q^{j}/(1-q^2). \end{align*}

Recalling that \(\P{L=i} = (1-q^2)q^{2i}\), it follows right away from \eqref{org0679a78} that \(L\) and \(M-L\) are independent. Interestingly, from 1 we see \(M-L\sim Y\) for the continuous case. However, here, for the geometric case, \(\P{M-L=j} \neq pq^{j} = \P{Y=j}\). This explains why \(\E M \neq \E L + \E X\) for geometric rvs: we should be more careful in how to split \(M\) in terms of \(L\) and \(X\).

All in all, we have checked and double checked all our expressions and limits for the geometric and exponential distribution. We had success too: the solution of the last exercise provides the key to understand why \eqref{org8619836} and \eqref{orgf33074a} are true for exponentially distributed rvs, but not for geometric random variables. In fact, in the solutions we can see the term corresponding to \(X=Y=i\) for \(X,Y\sim \Geo{p}\) becomes negligibly small when \(n\to 0\). In other words, \(\P{X=Y}>0\) when \(X\) and \(Y\) are discrete, but \(\P{X=Y}=0\) when \(X\) and \(Y\) are continuous. Moreover, by 1, \(\E M = \E{L + M-L} = \E L + \E{M-L}\), but \(\E{M-L} \neq \E X\). So, to resolve our leading problem we should reconsider \(\E{M-L}\).

4. The solution

Let us now try to repair \eqref{orgf33074a}, i.e., \(\E M = \E L + \E X\), for the case \(X,Y\sim\Geo{p}\). We should be careful about the non-negligible case that \(M=L\), so we move, carefully, step by step.

Why is the following true:

\begin{align} \label{org31970a6} \E M &= \E L + \E{(M-L) \1{M > L}} = \E L + 2\E{(Y-X) \1{Y > X}}. \end{align}
Solution

Because either \(M=L\) or \(M > L\). Recall from earlier work that the factor 2 in the second equality follows from the fact that \(X,Y\) iid.


Show that

\begin{equation} \label{orgc328e15} 2\E{(Y-X) \1{Y > X}} = \frac{2q} {1-q^{2}}. \end{equation}
Hint

Reread BH.7.2.2. to realize that \(\E{M-L} = \E{|X-Y|}\). Relate the latter expectation to the expression in the problem.

Solution \begin{align*} \E{(Y-X) \1{Y > X}} &= p^2\sum_{ij} (j-i)\1{j > i} q^iq^j\\ &= p^2\sum_{i} q^i\sum_{j=i+1}^{\infty}(j-i)q^j\\ &= p^2\sum_{i} q^i q^{i}\sum_{k=1}^{\infty}kq^k, \quad k = j-i\\ &= p\sum_{i} q^{2i}\E X \\ &= \frac{p}{1-q^{2}}\E X \\ &= \frac{p}{1-q^{2}}\frac q p \\ &= \frac{q}{1-q^{2}}. \end{align*}

Combine the above with the expression for \(\E L\) of \eqref{org347686d} to obtain \eqref{orgb930347} for \(\E M = 2\E X - \E L\), thereby verifying the correctness of \eqref{org31970a6}.

Solution \begin{align*} \E L + 2 \E{(Y-X) \1{Y > X}} = \frac{q^{2}}{1-q^2} + \frac{2q}{1-q^{2}} = \frac q{1-q}\frac{q+2}{1+q}, \end{align*}

where I use that \(1-q^{2}=(1-q)(1+q)\).

While \eqref{org31970a6} is correct, I am still not happy with the second part of \eqref{org31970a6} as I find it hard/unintuitive to interpret. Restarting again from scratch, here is another attempt to rewrite \(\E M\) by using \(Z\sim \FS{p}\), i.e., \(Z\) has the first success distribution with parameter \(p\), in other words, \(Z \sim X+1\) with \(X\sim\Geo{p}\).

Explain that

\begin{equation} \label{org459bf1d} \E M = \E L + \E{ Z \1{M > L}}, \end{equation}
Solution

To see why this might be true, I reason like this. After `seeing' \(L\), we want to restart. Let \(Z\) be the time from the restart to \(M\). When \(Z\sim \Geo{p}\), it might happen that \(Z=0\) (with positive probability \(p\)). But if \(Z=0\), then \(M=L\), and it that case, we should not restart. Hence, if \(Z\sim \Geo{p}\) we are `double counting' when \(Z=0\). By including the condition \(M > L\) and by taking \(Z\sim\FS{p}\) (so that \(Z > 0\)) I can prevent this.


Show that

\begin{equation*} \E{Z\1{M > L}} = \frac{2q} {1-q^{2}}, \end{equation*}

i.e., the same as \eqref{orgc328e15}, hence \eqref{org459bf1d} is correct.

Hint

Observe that \(Z\) is independent from \(X\) and \(Y\), hence from \(M\) and \(L\)

Solution

With the hint:

\begin{align*} \E{Z\1{M > L}} = \E Z \E{\1{M > L}} = \E Z \P{M > L} = \frac 1 p \frac{2pq}{1-q^{2}} = \frac{2q}{1-q^{2}}, \end{align*}

We know that \(\E Z = 1 + \E X = 1 + q/p = 1/p\), while

\begin{align*} \P{M > L} &= 1-\P{X=Y} = 1-\sum_{i=0}^{\infty} \P{X=Y=i} \\ &= 1-\frac{p^{2}}{1-q^{2}} = \frac{1 - q^{2} + p^{2}}{1-q^{2}} = \frac{2pq}{1-q^{2}}. \end{align*}

I am nearly happy, but I want to see that \eqref{org459bf1d}, which is correct for discrete rvs, has also the correct limiting behavior.

Show that \(\E{Z/n\1{M > L}} \to 1/\lambda\), which is the expectation of an \(\Exp{\lambda}\) rv!

Solution

By independence,

\begin{equation*} \E{Z/n\1{M > L}} = \E{Z/n} \P{M > L}. \end{equation*}

Then,

\begin{equation*} \P{M > L} = \frac{2pq}{1-q^{2}} = \frac{2\lambda/n (1-\lambda/n)}{1-(1-\lambda/n)^{2}} = \frac{2\lambda/n (1-\lambda/n)}{2\lambda/n -\lambda^{2}/n^{2}} = \frac{2 (1-\lambda/n)}{2 -\lambda/n} \to 1, \end{equation*}

and

\begin{equation*} \E{Z/n} = \frac 1 n \E Z = \frac 1 n \frac 1 p = \frac 1 {n \lambda/ n} = 1/\lambda. \end{equation*}

Finally I understand why \(\E M = \E L + \E X\) for \(X,Y\sim \Exp{\lambda}\) and why this is not true for discrete \(X, Y\). For discrete rvs, \(L\) and \(M\) can be equal, while for continuous rvs, this is impossible6 It took a long time, and a lot of work, to understand how to resolve the confusing problem, but I learned a lot. In particular, I find \eqref{org459bf1d} a nice and revealing equation.

Finally, if you like to train with the tools you learned, you can try your hand at analyzing the same problem, but now for uniform \(X, Y\).

Footnotes:

1

Part of this material was born out of annoyance. The book uses one of those typical probability arguments: slick, half complete, and wrong as soon as one tries it in other situations. In other words, the type of argument beginner books should stay clear of. I admit that I was quite irritated about the argument offered by the book.

2

If you like maths, you should be quite a bit more careful about what type of set \(A\) is acceptable. Here such matters are of no importance.

3

We write \(\partial_{v}\) as a shorthand for \(\d/\d v\) in the 1D case, and for \(\partial/\partial_{v}\) the partial derivative in the 2D case.

4

In your math classes you learned that \(\lim_{n\to \infty} 1/n = 0\). Doesn't this definition therefore imply that \(\d x = 0\)? Well, no, because \(\d x\) is not a real number but an infinitesimal. Infinitesimals allow us to consider a quantity that is so small that it cannot be distinguished from 0 within the real numbers.

5

This is not entirely trivial to prove. If you like mathematics, check the neat proof in Rudin's Principles of mathematical analysis.

6

A bit more carefully formulated: the event \(\{L=M\}\) has zero probability for continuous rvs.

Tweaks and Tricks for Org mode Files

Including the \(\LaTeX\) Command

For some reason org mode does not support the \(\LaTeX\) command. In standard \(\LaTeX\) files, the command is \LaTeX, without dollar signs. Now, when exporting an org mode file to html and get the \(\LaTeX\) symbol, I could enclose the \(\LaTeX\) symbol in dollars, because that mathjax seems to understand that I want to get the \(\LaTeX\) symbol. However, putting dollars around \(\LaTeX\) breaks the export of the org mode file to \(\LaTeX\). To solve this, I added this macro to my list of macros and to macros.org in the orgmode plugin for Nikola.

#+MACRO: latex @@latex:\LaTeX{}@@@@html:\(\LaTeX\)@@

With this macro I can now write {{{latex}}} in my org files to get the correct symbol in tex and orgmode files.

Including Python Code by Line Numbers

1. Intro

For quite some problems I have a separate python program, which I want to keep separate, and a \(\LaTeX\) or org mode file that explains or contains part of the python file. It is easy to include part of the code by using line numbers, like so

\inputminted[firstline=12, lastline=23]{python}{python_file.py}

However, if I add code above line 12 or add some code between lines 12 and 23, or move the block of code altogether, I have to update the line numbers. As I don't like this type of work, I decided to write some code to solve this problem, for \(\LaTeX\) and orgmode files.

2. Plan given to ChatGPT

As a first step, I drafted a list of requirement that my python program had to satisfy. Then I gave this list to ChatGPT, and got a program that did not work completely, but 80% or so was ok. I was amazed! Here are my initial requirements.

  1. In a \(\LaTeX\) file look up lines with strings that contain inputminted.
  2. Look up a tag that appears in such lines as a comment at the end of a line. For instance, tictoc is the comment at the end of this line \inputminted[firstline=30, lastline=35]{python}{python_file.py} % tictoc.
  3. Also look up the name of the python file after the \inputminted command, here python_file.py.
  4. Open the python file, and look up the line numbers of the code between lines tag with the comments # block tictoc.
  5. Update the firstline= and the lastline= in the \(\LaTeX\) file accordingly.

BTW, as it's easy to copy and move complete lines, I use the same string to demarkate the starting and termining lines; in other words, I don't use comments as # begin tictoc and # end tictoc.

It took a few of additional roundes with ChatGPT, and some additional work on my own, but the final result works nicely for my goals. Once the version worked for \(\LaTeX\), I updated it so that it can work with org mode files.

So, for a \(\LaTeX\) file, tag like this:

\inputminted[firstline=12+, lastline=23]{python}{python_file.py} % tictoc

and for an orgmode file, like this:

#+INCLUDE: "python_file.py" src python:lines "84-110" ## tictoc

Note the intentional double ## to comment the tag.

3. The Code

3.1. The modules

import argparse
import glob
import re

3.2. Finding the tagged line numbers in the python file

This function looks up the line numbers. It strips trailing white space between the python code and the terminating comment tag.

def find_block_lines(python_lines, marker):
    """Return the linenumbers of blocks of python code morked like this:

    # block marker
    import numpy as np

    # block marker

    Strip the lines that end with # block marker.
    Mind that inputminted starts counting at 1 while python starts at 0.
    Therefore we need to add +1 when returning the line numbers.
    """
    indices = [
        idx
        for idx, line in enumerate(python_lines)
        if f"# block {marker}" in line
    ]
    if len(indices) != 2:
        return -1, -1

    # remove the lines with  # block marker string
    start_idx = indices[0] + 1
    end_idx = indices[1] - 1

    # remove trailing empty lines
    while end_idx > indices[0] and python_lines[end_idx].strip() == "":
        end_idx -= 1

    # Add one to convert python index to inputminted index
    return start_idx + 1, end_idx + 1

3.3. Updating a \(\LaTeX\) file

This function looks up the tags mentioned in the \(\LaTeX\) file. In the for loop, it looks up the line numbers of the tagged code in the python file, then updates the \(\LaTeX\) file.

def update_latex_file(latex_file):
    with open(latex_file, "r") as fp:
        latex_content = fp.read()

        pattern = (
            r"\\inputminted\[firstline=\d+, lastline=\d+\]"
            r"{python}{(.*?\.py)}\s*\% ([\w]+)"
        )
        matches = re.findall(
            pattern,
            latex_content,
        )
        for python_file, marker in matches:
            with open(python_file, "r") as py_file:
                lines = py_file.readlines()
                begin_index, end_index = find_block_lines(lines, marker)
                print(begin_index, end_index)
                if begin_index != -1 and end_index != -1:
                    pattern = (
                        fr'\\inputminted\[firstline=\d+, lastline=\d+\]{{python}}'
                        fr'{{{python_file}}}\s*\%\s*{marker}'
                    )
                    replacement = (
                        fr'\\inputminted[firstline={begin_index}, '
                        fr'lastline={end_index}]{{python}}{{{python_file}}} \% {marker}'
                    )
                    latex_content = re.sub(pattern, replacement, latex_content)

    with open(latex_file, "w") as fp:
        fp.write(latex_content)

3.4. Updating an orgmode file

Updating the org file works similarly. However, in the org mode I write the tag after two hashes, like ## tag. I noticed that org mode changes the % in the code for the \(\LaTeX\) files above.

def update_org_file(org_filename):
    with open(org_filename, "r") as fp:
        org_content = fp.read()

        matches = re.findall(
            r'#\+INCLUDE: \"(.*?)\" src python:lines \"\d+-\d+\" ## ([\w]+)',
            org_content,
        )
        for python_file_name, marker in matches:
            with open(python_file_name, "r") as py_file:
                python_lines = py_file.readlines()
                begin_index, end_index = find_block_lines(python_lines, marker)
                print(begin_index, end_index)
                if begin_index != -1 and end_index != -1:
                    pattern = (
                        fr'#\+INCLUDE: "{python_file_name}" src '
                        fr'python:lines "\d+-\d+" ## {marker}'
                    )
                    replacement = (
                        fr'#+INCLUDE: "{python_file_name}" src '
                        fr'python:lines "{begin_index}-{end_index+1}" ## {marker}'
                    )
                    org_content = re.sub(pattern, replacement, org_content)

    with open(org_filename, "w") as fp:
        fp.write(org_content)

3.5. Last steps

I want to be able to call the function on multiple files at ones.

def process_files(pattern):
    for file_path in glob.glob(pattern):
        if file_path.endswith('.org'):
            update_org_file(file_path)
        elif file_path.endswith('.tex'):
            update_latex_file(file_path)
        else:
            print(f"{file_path} is not an .org nor a .tex file")

The main reads the filenames as arguments and has the files updated.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Update latex or org files based on Python markers."
    )
    parser.add_argument(
        "file_pattern",
        type=str,
        help="Provide a .tex or .org file, or *.tex/*.org",
    )
    args = parser.parse_args()

    process_files(args.file_pattern)

Simulating Arrival Times for a Non-homogeneous Poisson Process

1. Intro

We provide an algorithm to generate arrival times of a non-homogeneous Poisson process (nhpp) when the rate function \(\lambda(s)\) is piecewise constant. Specifically, we assume given two sequences \(\{\lambda_{j}\}_{j=0}^{\infty}\) and \(\{s_{j}\}_{j=0}^{\infty}\), with \(s_{0}=0\), the \(s_{j}\) strictly increasing, and \(s_{k}\to \infty\) as \(k\to\infty\), and then we take \(\lambda(s) = \lambda_{j}\) for \(s\in [s_{j}, s_{j+1}]\). An example would be a daily scheme of arrival rates of patients per hour at a hospital, where \(\lambda_{0} = 1\) from \([0, 8)\), \(\lambda_{1} = 10\) from \([8, 12)\), \(\lambda_{2} = 8\) from \([12, 18)\) and \(\lambda_{4}=3\) from \([18, 24)]\), update \(\{s_{j}\}\) accordingly).

2. The Maths

One way to simulate a non-homogenous Poisson process with piece-wise constant rates is as follows.

  1. For each interval \([s_j, s_{j+1}]\) compute the duration \(t_j = s_{j+1}-s_j\).
  2. Generate a random deviate \(R_j \sim \Pois{\lambda_j t_j}\).
  3. Generate \(R_j\) uniform deviates \(U_i \sim \Unif{[s_j, s_{j+1}]}\).
  4. The uniform deviates form the arrivals times on \([s_j, s_{j+1}]\).

This procedure is easy, but the computation of a Poisson distributed random deviate is somewhat costly.

Perhaps the next procedure is more efficient–I have not tested it, so I don't know. The paper Generating Nonhomogeneous Poisson Processes by Raghu Pasupathy explains that we need to find the smallest \(t\) such that \(\Lambda (s, t) = \int_{s}^{t} \lambda(y)\d y =x \), where \(x = -\log u\) and \(u\sim\Unif{[0.1]}\).

2.1. A constant rate throughout

If the rate \(\lambda(x) \equiv \lambda\) is contant throughout, finding such \(t\) is simple because \(\Lambda(s, t) = \lambda (t-s)\) so that \(t= s + s/\lambda\). This gives the following algorithm:

  1. \(s = 0\).
  2. \(\mineq \log(u)/\lambda, \quad u \sim \Unif{[0,1]}\).
  3. Yield \(s\) and return to step 2.

2.2. Piecewise Constant

Let us generalize this proceduce so that we can deal with piecewise constant \(\lambda(\cdot)\). Noting that the sequence \(\{s_{j}\}\) places a grid on \(\R^{+}\), we define

\begin{align*} j &= \max\{i: s_{i} \leq s\}, & k &=\min\left\{i : \Lambda(s, s_{i}) >x\right\}, \end{align*}

i.e., \(s_{j}\) is the point on the grid just below \(s\), and \(s_{k}\) is the smallest point on the grid such that the integral \(\Lambda(s, s_{k})\) exceeds \(x\). For the algorithm, if \(s+x/\lambda_{j}< s_{j+1}\), then we can set \(s += x/\lambda_{j}\), and return \(s\) as the next arrival time. However, if if \(s+x/\lambda_{j}\geq s_{j+1}\), we use the following idea:

\begin{equation*} x = \Lambda(s, t) = \Lambda(s_{j}, s_{k}) - \Lambda(s_{j}, s) - \Lambda(t, s_{k}). \end{equation*}

Since \(\lambda(\cdot)\) is \(\lambda_{j}\) on \([s_{j}, s]\) and \(\lambda(\cdot) = \lambda_{k-1}\) on \([t, s_{k})\),

\begin{align*} \Lambda(s_{j}, s) &= \lambda_{j}(s-s_{k}), & \Lambda(t, s_{k}) = \lambda_{k-1} (s_{k}-t). \end{align*}

Thus, if we have \(s_{j}\) and \(s_{k}\), the next arrival time \(t\) after \(s\) is easily solved from the equation

\begin{equation*} \lambda_{k-1}(s_{k} -t ) = x + \Lambda(s_{j}, s) - \Lambda(s_{j}, s_{k}). \end{equation*}

That is,

\begin{equation}\label{eq:nh8} t = s_{k } - (\Lambda(s_{j}, s_{k}) - x - \Lambda(s_{j}, s) )/\lambda_{k-1}. \end{equation}

3. The Algorithm

With this idea, we obtain the following algorithm. Assume that \(\lambda_{0} > 0\) (Otherwise search for the first \(\lambda_{j}>0\), and start from there).

  1. \(s=0\), \(j=0\)
  2. \(s \mineq \log(u)/\lambda_{j}\)
  3. If \(s < s_{j+1}\), yield \(s\) and return to step 2, else
  4. \(T = \lambda_{j}(s_{j+1}-s)\).
  5. While \(T<0\): \(j \pluseq 1\), \(T \pluseq \lambda_{j}(s_{j+1}-s_{j})\).
  6. \(s = s_{j+1} - T/\lambda_{j}\).
  7. Yield \(s\) and return to step 2.

4. The Python <Code

from typing import Iterator

import numpy as np


def nhpp_range(
    start: float,
    end: float,
    times: Iterator[float],
    rates: Iterator[float],
    gen: np.random.Generator = np.random.default_rng(3),
) -> Iterator[float]:
    rate = next(rates)
    down, up = next(times), next(times)
    while np.isclose(rate, 0):
        rate = next(rates)
        down, up = up, next(times)
    now = max(start, down)
    while True:
        now += gen.exponential() / rate
        T = rate * (up - now)
        while T < 0:
            rate = next(rates)
            down, up = up, next(times)
            T += rate * (up - down)
        now = up - T / rate
        if now > end:
            break
        yield now

5. The Tests

from datetime import datetime, timedelta
from itertools import cycle, accumulate
from collections import Counter
import numpy as np
from nhpp import nhpp_range


def test_floats():
    gen = np.random.default_rng(3)

    start = 0
    end = 40
    periods = [1, 1, 3]
    rates = [0, 1, 10]

    times = []
    counter = Counter()
    counter[0] = 0
    for time in nhpp_range(
        start, end, accumulate(cycle(periods), initial=start), cycle(rates)
    ):
        times.append(time)
        counter[int(time) % 5] += 1
    print(counter)

    A = np.array(times)
    X = A[1:] - A[:-1]
    print(X.mean(), X.std())
    theoretical_mean_rate = sum(p * r for r, p in zip(rates, periods)) / sum(
        periods
    )
    print(1 / theoretical_mean_rate)


def test_datetimes():
    start = datetime(2023, 8, 9, 10, 0)
    end = datetime(2023, 8, 10, 18, 0)
    rates = list(range(13))
    periods = [2] * len(rates)

    time = start
    for delta in nhpp_range(
        0,
        (end - start).total_seconds() / 3600,
        accumulate(cycle(periods), initial=0),
        cycle(rates),
    ):
        time += timedelta(hours=delta)
        print(time)
    print(start, time)


if __name__ == "__main__":
    test_floats()
    test_datetimes()

Mathjax tricks

1. Mathjax in Nikola

It took a bit of time to set up Mathjax for Nikola. This is what I had to add.

The equationNumbers part adds equation numbers. The Macros part contain the equivalent of the \(\LaTeX\) newcommand~s. I'll add more commands over time, but the list below shows how to do that. Mind that ~\\\\ is necessary to escape a single backslash.

MATHJAX_CONFIG = """
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  TeX: {
    equationNumbers: { autoNumber: "AMS" },
    tagSide: "right",
    tex2jax: {
        inlineMath: [ ['$','$'], ["\\\(","\\\)"] ],
        displayMath: [ ['$$','$$'], ["\\\[","\\\]"] ],
        processEscapes: true
    },
    Macros: {
      P: ["\\\\textrm{P}\\\\left\\\\{#1\\\\right\\\\}", 1],
      E: ["\\\\textrm{E}\\\\left\\\\{#1\\\\right\\\\}", 1],
      F: "\\\\mathcal{F}",
      mineq: "\\\\mathrel{-}=",
      pluseq: "\\\\mathrel{+}=",
      R: "\\\\mathbb{R}",
      1: ["\\\\mathbf{1}\\\\{#1\\\\}",1],
      d: ["\\\\,\\\\textrm{d}#1", 1],
      Norm: ["\\\\mathrm{Norm}(#1)", 1],
      Unif: ["\\\\mathrm{Unif}(#1)", 1],
    },
  }
});
</script>
<script type="text/javascript" charset="utf-8"
  src="https://cdn.jsdelivr.net/npm/mathjax@2/MathJax.js?config=TeX-AMS_CHTML">
</script>
"""

2. Single Shot Remarks

Sometimes I want to use mathjax macros in just file. Put these lines on top of the file so that the macros apply just this document.

\(
   \def\RR{{\bf R}}
    \def\bold#1{{\bf #1}}
\)

(Very) Complex Graphs form (Very) Simple Ideas

1. Introduction

In this chapter we will explore how very simple ideas and lots of computations lead to astonishingly intricate graphs. There are no exercises this time, but you should study the graphs carefully. Except for the first example we will use recursion again, but this time to make graphs.

2. You always thought you understood multiplication

We all know that \(2\times 3 = 6\), and that \(2\times 8 = 16\). We also know that \(2\times 3 \bmod 10 = 6\) and \(2 \times 8 \bmod 10 = 6\). As a variation on this theme, let's lay out the numbers \(0, 1, \ldots, 9\) on a circle at equally spaced intervals, like the hours on a clock, and draw lines between points that relate through multiplication by \(2\) modulo 10. Thus, we draw a line from \(1\) to \(2\), from \(2\) to \(4\), from \(3\) to \(6\), and so on. Realize that \(2 \times 9 \bmod 10 = 18 \bmod 10 = 8\), so from \(9\) we draw a line to \(8\). In more formal terms, we place the remainders \((2a) \bmod b\) on the circle where \(2\) is the multiplier, \(2a\) the divident, and \(b\) the divisor.

To actually draw the line from \(1\) to \(2\), we need a bit of geometry. The unit circle can be cut into \(10\) pieces of equal length by using an angle \(\phi= 2\pi/10\). Then, by setting

\begin{align*} x(n) &= \sin \phi n, & y(n) &= \cos \phi n, \end{align*}

we find that \((x(0), y(0)) = (0, 1)\), \((x(1), y(1)) = (\sin \phi, \cos \phi)\), \ldots. It's best to show how all this works by means of a figure, so we need to set up that first.

In the left most circle of the next figure, we draw a line from \(1\) to \(2\). In the second cicle we add the line that corresponds to multiplying \(2\) by \(2\), in the third \(3\times 2\). In the last we see that \(5\times 2\bmod 10 = 0\). If we would continue, we would obtain a line from \(6\) to \(2\), and continuing until \(9\) we obtain a line from \(9\) to \(8\), as explained above. Then we are full circle, and the lines will start to overlap with previously drawn lines.

Here is the code to make the figure below. (If you want to code along, load the standard modules first.)

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(6, 1.5))
divisor = 10
phi = 2 * np.pi / divisor
multiplier = 2
for i, ax in enumerate(axes):
    ax.set_xticks([])
    ax.set_yticks([])
    ax.add_patch(Circle((0.0, 0.0), 1, ec='k', fc='none'))
    ax.set_frame_on(False)
    r1, r2 = 0.98, 1.02
    for j in range(10): # put the ticks on the circle
        x, y = np.sin(j * phi), np.cos(j * phi)
        ax.plot([r1 * x, r2 * x], [r1 * y, r2 * y], color="k", lw=1)
    for j in range(i + 2): # draw the lines
        x = [np.sin(j * phi), np.sin(j * multiplier * phi)]
        y = [np.cos(j * phi), np.cos(j * multiplier * phi)]
        ax.plot(x, y, color="k", lw=1)
fig.tight_layout()
# fig.savefig("../figures/circle-multiply.pdf")
fig.savefig("../images/circle-multiply.png", dpi=300)
Figure for above code.

nil

All this seems pretty dull and simple. However, the figures on the next page show what happens if we stick to using 2 as the multiplier but vary the divisor over the numbers \(10\), \(20\), \(50\), \(75\), \(100\). We made the graphs with the next pieces of code. The next function plots the lines from a point \(x\) on the circle boundary to \(2x\), as shown in the figure above, for all points equally spaced on the circle boundary with distance \(2\pi/n\), where \(n\) is the divisor.

def draw_lines(multiplier, divisor, ax, lw=0.1, ls="-"):
    phi = 2 * np.pi / divisor
    for i in range(1, divisor):
        x = [np.sin(i * phi), np.sin(i * multiplier * phi)]
        y = [np.cos(i * phi), np.cos(i * multiplier * phi)]
        ax.plot(x, y, color="k", ls=ls, lw=lw)

This is the loop to plot the circles with the lines corresponding to multiplication with \(2, 3, \ldots, 7\). The next set of figures contains the result; I admit that I was astonished by this result.

fig, axes = plt.subplots(nrows=6, ncols=5, figsize=(6, 10))

for i, multiplier in enumerate([2, 3, 4, 5, 6, 7]):
    for j, modulo in enumerate([10, 20, 50, 75, 100]):
        axes[i, j].set_xticks([])
        axes[i, j].set_yticks([])
        draw_lines(multiplier, modulo, axes[i, j], lw=0.3)
        axes[i, j].add_patch(Circle((0.0, 0.0), 1, ec='k', fc='none'))

fig.tight_layout()
fig.savefig(f"../images/circle-tables.png", dpi=300)
# fig.savefig(f"../figures/circle-tables.pdf")
Figure for the code above.

nil

But this is not the end of matter. The graph can become much more interesting. We can fix the modulo to \(200\), and vary the multiplier from \(2\) to \(7\) in small steps. The code below contains yet some further variations. The results are amazing in my opinion; I did not know that simple multiplication could lead to such beautiful graphs. If you run the code on your screen you can see yet more detail. The graphs are wonderfully complex: reading from left to right, we can see that the cusps are `woven into the graphs', and the larger the divisor, the more cusps.

def vary_multiplier(fr, to, modulo, ls="-", lw=0.1):
    fig, axes = plt.subplots(nrows=7, ncols=6, figsize=(6, 10))
    axes = axes.flatten()

    for i, multiplier in enumerate(np.linspace(fr, to, len(axes))):
        axes[i].set_xticks([])
        axes[i].set_yticks([])
        draw_lines(multiplier, modulo, axes[i], lw=lw, ls=ls)
        axes[i].add_patch(Circle((0.0, 0.0), 1, ec='k', fc='none'))

    fig.tight_layout()
    fig.savefig(f"../images/circle-modulo-{modulo}-{fr}-{to}.png", dpi=300)
    # fig.savefig(f"circle-modulo-{modulo}-{fr}-{to}.pdf")

vary_multiplier(2, 7, 200)
vary_multiplier(22, 23, 200)
vary_multiplier(50, 52, 200)
vary_multiplier(380, 381.5, 855, ls=":")
Figure for case 2, 7, 200

nil

Figure for case 22, 23, 200

nil

Figure for case 50, 52, 200

nil

Figure for case 380, 381.5, 800

nil

The figures can be turned into a movie with the next code.

import os
import numpy as np
import matplotlib.pyplot as plt
import imageio.v2 as imageio


divisor = 855
filenames = []

for multiplier in np.linspace(380, 381.5, 300):
    print(multiplier)
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 10))
    draw_lines(multiplier, divisor, ax)
    ax.set_title(f"{multiplier}")
    filename = f"frame_{multiplier:.4f}.png"
    fig.savefig(filename)
    plt.close(fig)
    filenames.append(filename)

with imageio.get_writer('circle_movie.gif', mode='I') as writer:
    for filename in filenames:
        image = imageio.imread(filename)
        writer.append_data(image)

# remove the temporary files
for filename in filenames:
    os.remove(filename)

3. You always thought you understood division

Draw an equilateral triangle on a unit circle with its tip on \(z_{3} = x(0,1)\), and \(z_1\) and \(z_2\) as the two corners at the base. Start a walk at \(x_{0}\) somewhere in the triangle, and throw a die. Suppose the outcome lies in \(1, 2\), then select point corner \(1\) and plot a point half way \(x_0\) and \(z_1\), i.e., at \(x_1 = (x_{0}+z_{1})/2\). If the outcome lies in \(3, 4\), draw \(x_1\) halfway between \(x_0\) and \(z_2\), and otherwise between \(x_0\) and \(z_3\). Throw the die again to select one of the three corners, and draw \(x_2\) halfway \(x_1\) and the seleced corner. Keep on repeating this until bored, and then ask the computer to take over to generate an enormous sequence of points \(\{X_k\}\).

First we make the three corners of the equilateral triangle with its tip on \((0,1)\).

import numpy as np
import gnuplotlib as gp

n_angles = 3 # make the triangle
Z = np.zeros((n_angles, 2))
for j in range(n_angles):
    Z[j, 0] = np.sin(j * 2 * np.pi / n_angles)
    Z[j, 1] = np.cos(j * 2 * np.pi / n_angles)

Next, we make the iterated sef of points half way a uniformly selected corner and the current point.

num = 50000

X = np.zeros((num, 2))
U = np.random.randint(0, n_angles, size=num)

X[0] = Z[2]
for j in range(1, num):
    X[j] = (X[j - 1] + Z[U[j]]) / 2

It remains to plot the points. In the code below I use gnuplot, rather than matplotlib, because gnuplot works faster with many points and it's also better at plotting many single points. The pointsize can be tuned with ps 0.1, and the point type by pt.1

gp.plot(
    (X[:2000, 0], X[:2000, 1]),
    (X[:5000, 0], X[:5000, 1]),
    (X[:50000, 0], X[:50000, 1]),
    _with="points pt 0 lc 'black'",
    unset=['xtics', 'ytics'],
    xlabel="",
    title="",
    multiplot='layout 1,3',
    hardcopy='../images/sierpinsky.png',
)
Figure of the Sierpinksy triangle.

nil

What works for a triangle might also work for a hexagonal. For \(j=0, 1, \ldots, 5)\), take \(z_j = (\cos (2\pi j/ 6), \sin(2\pi j/6))\), so that \(z_j\) corresponds to the \(j\) th corner of a hexagon. A bit of experimentation shows that the update rule \((x_k + z_{i})/3\), i.e., dividing by \(3\) instead of \(2\), gives nice results. For the rest, the code is the same as above.

Figure of the hexagonal triangle.

nil

These are suprising results, aren't they? The first example is known as Sierpinsky's triangle, and, interestingly, there exists a direct relation with Pascal's triangle. For further background and explanations, you should consult the book `Chaos and Fractals' by H.O.Peitgen, H.Jürgens and D.Saupe.

4. You always thought you understood stability in 1D

Time series appear to be relatively simple objects: let \(X_{k+1} = a X_{k} + U_{k+1}\), with \(a \in (-1, 1)\) and \(U_{k}\sim\Unif{\{-1, 1\}}\). It may seems that this will lead to simple stochastic process: since \(|a|<1\), it stays in the neighborhood of \(0\), and that's it. To see whether this is indeed true, let's apply this rule many, many times, first with \(a=0.8\), then with \(a=0.61\).

We make bins such that each bin contains \(1\%\) of the observed valued of \(\{X_k\}\). It's easy to np.histogram, but there is one small detail: this function returns the (scaled) number of observations per bin as hist, and the boundaries of the bins as bins. Now there is one more boundary than bins, that is, to make one bin, we need two `walls', to make two bins, we need three `walls'. Thus, to plot the points, we remove the left most boundary, so that we have just as many values for the $xS coordinate as for the \(y\) coordinate. Here is the code.

def simulate(a):
    num = 2000000
    X = np.zeros(num)
    U = 2 * np.random.randint(2, size=num) - 1

    for j in range(1, num):
        X[j] = a * X[j - 1] + U[j]

    n_bins = int(num / 100)
    hist, bins = np.histogram(X, bins=n_bins, density=True)
    X = np.zeros((n_bins, 2))
    X[:, 0] = bins[1:]
    X[:, 1] = hist
    return X


X1 = simulate(a=0.8)
X2 = simulate(a=0.61)

gp.plot(
    (X1[:, 0], X1[:, 1]),
    (X2[:, 0], X2[:, 1]),
    _with="points pt 0 lc 'black'",
    unset=['xtics', 'ytics'],
    xlabel="",
    title="",
    multiplot='layout 1,2',
    hardcopy=f"../images/time_series.png",
)

These are the densities of the time series, the left is for \(a=0.8\), the rigth for \(a=0.61\). Perhaps a bit unexpected, but the densities turn out to be very complicated objects. If you're interested in a more profound mathematical analysis, consult Promenade Aleatoire by M.Benaim and N.El Karoui.

Figure of the density for time series.

nil

5. You always thought you understood stability in 2D

This final example is just a straightforward generalization of the above time series, but now in 2D. (This is again one these ideas that are straightforward once you see it, but they are often hard to invent.) We have two matrices \(A_0\) and \(A_1\), and two vectors \(B_0\) and \(B_1\). Then, let

\begin{align*} X_{k+1} = A_{U_{k+1}} X_k + B_{U_{k+1}}, \end{align*}

where \(U_k \sim \Unif{\{0, 1\}}\).

In the code below we call the numba package to speed up the computation; it saves about a factor \(10\) while it costs us nothing except loading the package, and typing @jit on top of a function.

import numpy as np
from numba import jit
import gnuplotlib as gp


A = np.zeros([2, 2, 2])
A[0] = np.array([[0.839, -0.303], [0.383, 0.924]])
A[1] = np.array([[-0.161, -0.136], [0.138, -0.182]])

B = np.zeros([2, 1, 2])
B[0] = np.array([0.232, -0.080]).reshape(2)
B[1] = np.array([0.921, 0.178]).reshape(2)


num = 500 * 1000

gen = np.random.default_rng(3)
U = gen.binomial(1, 0.2, size=num)

X = np.zeros((num, 2))


@jit(nopython=True)
def do_run(X):
    for j in range(1, num):
        X[j, :] = A[U[j]] @ X[j - 1, :] + B[U[j]]


do_run(X)


gp.plot(
    (X[:, 0], X[:, 1], dict(_with="points pt 7 ps 0.02 lc 'black'")),
    xlabel="",
    title="",
    hardcopy="../images/spiral.png",
)

This is the spectacular result. By the way, you might try to make the same result with matplotlib, but I am underwhelmed about its performance on making such plots.

Figure of the spiral.

nil

Footnotes:

1

Gnuplot is a great plotting tool. It loads super fast and deals easily with a million points.

MMM vs MUM: First Step Analysis

1. Introduction

In the previous chapter we developed several algorithms to use multiple throws of a die to give a biscuit to one child out of 19 such that each child has the same probability to win. One algorithm to accomplish this worked like this. Continue throwing the die until the last two outcomes lie for the first time somewhere in the set \(\{(1,1), (1,2),\ldots, (1,6), (2, 1), \ldots, (2,6), (3,1),\ldots, (3,6), (4,1)\}\). (Note that this set has 19 elements.) Child \(1\) wins when the last two outcomes are \( (1,1)\), child \(2\) wins in case when \( (1, 2) \) , and so on. This idea, while promising, proved wrong: some children are more likely to win than others.

In this section we will develop the tools to understand why this algorithm is not fair, as it based on the famous monkey typing problem, which reads like this. A monkey types with equal probability, and independent of previous key strokes, any key on a keyboard with 26 letters. The monkey problem is find out which of the two words \(mmm\) and \(mum\) has the largest probability to be typed first. What is the relation between the monkey and the cookie problem? Well, the monkey keeps on typing until the last three key strokes lie in the set \(\{mum, mmm\}\), and it turns out that the word \(mum\) has a larger probability to get hit first. In a similar way, the sequences related to the children do not have equal expected hitting times, hence unequal winning probabilities.

To understand this interesting problem, we will develop a few useful concepts: first-step analysis, stopping times and hitting times.

In the typing monkey case, the probability to hit a certain key is \(p=1/26\). Below we will see that using such small probabilities in simulation requires long simulation times. For this reason, we simplify our alphabet often to just three symbols \(m, u, n\). Consequently, the success probability becomes \(p=1/3\), and \(q=2/3\).

2. First-step analysis

We first develop a model for the \(mmm\) sequence. We start in some state \(0\); let \(X_{i}\) be the \(i\)th letter typed. If \(X_{1}\neq m\), which happens with probability \(q=25/26\), the monkey doesn't make any progress and it remain in state \(0\). If, on the other hand, \(X_{1}=m\), which occurs with probability \(p=1-q=1/26\), the monkey moves to state \(m\). When in state \(m\), and \(X_{2}\neq m\), it returns to state \(0\); if \(X_{2}=m\), it moves to state \(mm\). Finally, in state \(mm\), if \(X_{3}\neq m\), it returns to state \(0\); otherwise it moves to state \(mmm\) and the game stops. Thus, more generally, the monkey is interested in the stopping time \(\tau = \inf\{n \geq 3 : X_{n-2} = X_{n-1} = X_n = m\}\).

As we only have to check the last three key strokes, it suffices to take as state space \(S = \{0, m, mm, mmm\}\). For notational ease, we often label these states as \(\{0, 1, 2, 3\}\), where the number represents the number of consecutive \(m\)'s, and when the monkey reaches state \(3\) it achieves its goal.

When we are in state \(i\), \(i\in \{0, 1, 2, 3\}\), let \(T(i)\) be the expected time to hit the final state state. Obviously, \(T(3)=0\) since state \(3\) corresponds to \(mmm\). What can we say about \(T(0), T(1)\) and \(T(2)\)? 1

Explain that the expected times have to satisfy the system of equations:

\begin{align*} T(0) &= 1 + q T(0) + p T(1), \\ T(1) &= 1 + q T(0) + p T(2),\\ T(2) &= 1 + q T(0) + p T(3). \end{align*}
Solution

To move from one state to the next, the monkey always types one key. Then, depending on the outcome \(X\), we move to a certain state. Consider the first equation. When in state \(0\), and the monkey does not hit \(m\) we have to start again, hence, we remain in state \(0\). Otherwise, when the monkey types the good key, i.e., \(X=m\), we progress with one extra \(m\). Similar reasoning gives the other expressions for the other states. Once state \(3\) is hit, we stop.

Clearly, we obtain a set of equations by considering for each state what can happen in the next step; this process is called first-step analysis. Once we have this system, it remains to solve it by hand or numerically. Now this system can be solved by hand with a little effort, but knowing how to tackle this with numerical tools is much more useful.

Explain that the above system can be written in matrix form as \(x = b + P x\), with

\begin{align*} P &= \begin{pmatrix} q & p & 0\\ q & 0 & p\\ q & 0 & 0\\ \end{pmatrix}, & b &= \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, & x &= \begin{pmatrix} T(0) \\ T(1) \\ T(2) \end{pmatrix}. \end{align*}

The matrix \(P\) is known as the transition matrix.

Solution

Since \(T(3)=0\), we don't have to include that in the system of equations. The other equations are a one-to-one copy of the the system.

Numerical solvers work with linear systems of the type \(A x = b\) for a matrix \(A\) and a vector \(b\). Thus, we rewrite \(x = b + P x\) to \((I-P) x = b\), where \(I\) is the identity matrix.

import numpy as np
from numpy.linalg import solve

p = 1.0 / 3  # Probility to hit m
q = 1 - p  # Probability to hit u of n

b = np.ones(3)
P = np.array([[q, p, 0], [q, 0, p], [q, 0, 0]])
A = np.eye(3) - P  # Eye gives the identity matrix
T = solve(A, b)
print(f"{T=}")
T=array([39., 36., 27.])

We conclude that it takes \(T(0)=39\) keys for the monkey to arrive at \(mmm\) when \(p=1/3\).

We follow the same line of reasoning for the \(mum\) case.

Now the states are \(\{0, m, mu, mum\}\). Why?

Solution

We move from \(0\) to \(m\), just like earlier. In state \(m\), if the next key hit gives a \(u\), we move to \(mu\). Once in \(mu\), hitting \(m\) brings us to \(mum\).

We use the same notation for the expected times \(T(0), T(1)\) and \(T(2)\) as in the \(mmm\) case; \(T(3)=0\) is both problems.

Explain that for the \(mum\) sequence the expected times have to satisfy:

\begin{align*} T(0) &= 1 + q T(0) + pT(1),\\ T(1) &= 1 + (1-2p)T(0) + p T(1) + p T(2),\\ T(2) &= 1 + q T(0). \end{align*}
Solution

Consider the second equation, corresponding to state \(0\), i.e., state \(m\). Hitting the key \(m\) brings us back to state \(m\), key \(u\) brings us to \(mu\); other keys make us return to \(0\). The probability to hit \(m\) or \(u\) is \(p\), thus the probability to return to \(0\) is \(1-2p\). In state \(mu\), if the monkey hits \(m\), we move to \(mum\), otherwise, we go back to \(0\) again.

While the vector \(b\) is the same for both problems, the (transition) matrix \(P\) for the \(mum\) case becomes:

\begin{align*} P &= \begin{pmatrix} q & p & 0\\ 1-2p & p & p\\ q & 0 & 0\\ \end{pmatrix}. \end{align*}
P = np.array([[q, p, 0], [1 - 2 * p, p, p], [q, 0, 0]])
A = np.eye(3) - P
T = solve(np.eye(3) - P, np.ones(3))
print(f"{T=}")
T=array([30., 27., 21.])

Apparently the number of keys is \(T(0)=30\) to get to \(mum\). This is quite a bit smaller than the \(39\) expected strokes to reach \(mmm\).

Compare the two matrices \(P\) between the two cases to provide an intuitive explanation for this difference, i.e., \(30\) versus \(39\).

Solution

For the \(mum\) case, when in state \(m\) we can move on to \(mu\), but also move to state \(m\) again (by typing \(m\)). Thus, with probability \(p\) we don't fall back entirely to state~\(0\). In the \(mmm\) case, we either return to \(0\) or move on.


In the numerical example, our monkey had just three keys to hit. Adapt the code to show that \(T(0)=17602\) when the monkey has the original keyboard with 26 keys at its disposal.

Solution

Just change \(p=1/3\) to \(p=1/26\) in the code and run it.

Here is a related fun problem to analyze with first-step analyis. A mouse sits on one corner of a cube whose edges are made of wire. There is some chese diagonally opposite the mouse. One any corner, the mouse chooses at uniformly any edge, and moves to the next corner. When traversing an edge takes one minute, what is the expected time for the mouse to reach the cheese? 2

Show that the expected time is \(10\) minutes.

Solution

The mouse starts at the far side of the cheese. Thus, it takes at minimum three edges to get to the cheese.. Let the expected time to hit the cheese be given by \(T(3)\). Now the mouse takes one edge, and then it has at least two edges to travel, which takes expected time \(T(2)\). \(T(1)\) is defined similarly. A tiny bit of though shows that the times should satisfy the relations:

\begin{align*} T(3) &= 1 + T(2), & T(2) &= 1 + T(1) \frac 2 3 + T(3)\frac 13, & T(1) &= 1 + T(2)\frac 23. \end{align*}

Solving gives \(T(1)= 7, T(2)=9, T(3) = 10\).

3. Estimating hitting times with simulation

We can also use simulation to estimate the hitting times for words \(mmm\) and \(mum\). We do this in two steps. First we consider some ideas that seem correct, but turn out to be wrong. The explanation of why this is so, is interesting in itself, and provides another view on why it takes more keys to get to \(mmm\) than to \(mum\). Once we figured out what is wrong, we'll develop the correct code. Note from the last exercise above that it takes \(17602\) keys on average to reach \(mmm\) when \(p=1/26\). As this requires much more simulation effort than to estimate the same hitting time with \(p=1/3\), we consider just the keys \(m, n, u\)$ instead of the full alphabet.

Here is an initial idea to estimate \(T(0)\). Generate a large string of \(N\) iid rvs on \(\{0, 1, 2\}\). (Once again, we use just three outcomes to speed up the computation.) Then, count how often the sequences \((0,0,0)\), which codes for \(mmm\), and \((0, 1, 0)\), which codes for \(mum\), occur. If \(N_{m}\) counts \(mmm\) and \(N_{u}\) counts \(mum\), then it might seems that \(N/N_{m}\) must be a good estimator for \(T(0)\) for the \(mmm\) problem, and likewise, \(N/N_{u}\) for the \(mum\) problem.

It is not hard to write code to count the number occurrences of sub string within a string. However, this is such a common problem in computer science that people wrote very efficient algorithms for this. So we use regular expressions instead of doing the work ourselves. (I used ChatGPT to construct the regular expressions.)

import re

gen = np.random.default_rng(1)

N = int(1e6)
L = ''.join(gen.choice(list("mnu"), size=N))
Nm = len(re.findall(r'(?=mmm)', L))
Nu = len(re.findall(r'(?=mum)', L))
print(f"{N/Nm=:2.2f}, {N/Nu=:2.2f}")
N/Nm=27.03, N/Nu=26.89

This results is strange at first sight. Why are both results nearly equal to 27, while from the above we know that \(T(0)\) is 39 for the \(mmm\) case and \(30\) for the \(mum\) case?

Why is doesn't the above counting algorithm provide an estimator for \(T(0)\) for the \(mmm\) case?

Solution

The probability to see \(mmm\) at an arbitrary location in a string with characters \(m\), \(n\) and \(n\) is \((1/3)^3=1/27\); and this is the same for \(mum\). Note that this explains the simulation results above: when success occurs with probability \(1/27\), the expected time between two successes is \(27\).

But then, if the probabilities to see \(mmm\) and \(mum\) in a long strong of letters are the same, why then does it take longer to see \(mmm\) than \(mum\) when we start from state \(0\)? This is somewhat puzzling; we need a new idea to make progress.

Suppose that in the long array of letters, we saw the pattern \(mmm\), at positions \(i-2, i-1, i\). Let the monkey type the key for the \(i+1\)th time. When \(X_{i+1}\neq m\), we have to start all over again, while if \(X_{i+1}=m\) we get the string \(mmmm\), i.e., \(4\) times an \(m\) in row, of which the last three also forms\(mmm\). Hence, when \(X_{i+1}=m\) we immediately have another occurrence of the pattern \(mmm\).

Write \(T\) for the expected time between two occurrences of the pattern \(mmm\). Why does\(T\) satisfy the relation

\begin{equation} \label{org31730ec} T = 1 + q T(0), \end{equation}

where \(T(0)\) is the expected time to hit \(mmm\)?

Solution

Suppose we are in state \(mmm\), and the monkey hits a key \(X\). Clearly \(\E{T|X=m} = 1\), that is, if the next key is an \(m\), the time \(T\) to see \(mmm\) again is just this one key. However, \(\E{T | X \neq m} = 1 + T(0)\). To see this last equation, if \(X_{i}\neq m\) we start anew and it takes an expected number of keys \(T(0)\) keys to hit \(mmm\) again, and the 1 counts the key hit by the monkey. Combining this with the law of total expectation, \(T = p\E{T|X=m} + q \E{T|X\neq m}\), gives the result.


Check the numerical results we obtained above to see that \eqref{org31730ec} indeed is satisfied for \(mmm\).

Solution

Recall that \(T=27\), \(q=2/3\) and \(T(0)= 39\) for \(mmm\) string. Indeed, \(27 = 1 + 39\cdot 2/3 = 1+ 26\).

We next consider the \(mum\) case.

For the \(mum\) sequence, explain that \(T = 1+(1-2p) T(0) + p T(1) + p T(2)\), and check the numbers for the example.

Solution

Starting from \(mum\), when \(X_{}=m\) we move to string \(mumm\) which corresponds to state \(m\). When \(X_{}=u\), we get string \(mumu\) which is state \(mu\). Only after \(X_{i} = n\) we return to \(0\).

To check, note that \(T=27, p = 1/3, T(0)=30, T(1)=27, T(2)=21\). Filling it in: \(27=1+30/3 + 27/3 + 21/3 = 1 + 10+9+7\).

And now it is time to reconcile all the above.

Explain why stopping to throw a die as soon as a winning pattern emerges does not necessarily lead to equal winning chances?

Solution

The pattern \(mum\) has more potential to build up from previous outcomes. Therefore it can appear sooner. However, once we have seen \(mmm\), it can appear within one extra throw again, while \(mum\) can never recur after one key stroke. These two effects balance out for \(mum\) and \(mmm\), so that \(T=27\) for both sequences.

Now we know this, we also understand why the children need not have the same winning probabilities if we use hitting times of different types of outcomes (compare \((1,6)\) to \((1,1)\) for example) to determine which child should win the cookie.

4. Summary

By studying the above toy problem we covered many nice and useful ideas. With the typing monkey, we observed the subtle fact that the expected time to first hit a pattern need not be equal to the average time between such a pattern occuring in the long run. The model and methods of this section can be formulated in terms Markov chains, which form a very useful and powerful framework to analyze probability problems.

Footnotes:

1

There is a subtle point. For the argument to work the expected times \(T_{i}\) should be finite. We will assume that this is the case.

2

We assume that the cheese does not move. A really interesting extension is to suppose that the cheese is also moving.