#+BEGIN_COMMENT
.. title: Weighted random shuffling
.. slug: weighted-random-shuffling
.. date: 2022-03-14 16:44:53 UTC+01:00
.. tags:
.. category:
.. link:
.. description:
.. type: text
.. has_math: true
#+END_COMMENT
#+TITLE: Weighted random shuffling
#+AUTHOR: Nicky van Foreest
#+EMAIL: vanforeest@gmail.com
#+DATE: 2022-03-14
#+setupfile: ../preamble.org
#+PROPERTY: header-args:python :session :eval no-export :exports both
#+STARTUP: indent
#+STARTUP: showall
* Weighted random shuffling
For some application I needed a random shuffle $r$ of a list of numbers
$a$. The shuffle, however, should be weighted, such that, given a list
of weights $w$, the element $a_i$ with weight $w_i$ should be the first
element of the random shuffle $r$ with probability $w_i/\sum_k w_k$. I
searched for a python implementation of such an algorithm, but I
couldn't find one. Also, ~numpy~ doesnt' provide it, and
~numpy.random.shuffle~ only offers shuffling with uniform weights. Thus,
this code.
* Weighted shuffling in pure python
Weighted random shuffling is the same as weighted random sampling from a
list $a$ without replacement. In steps:
1. Choose with probability $w_{i}/\sum_{i} w_{i}$ element $a_{i}$ from $a$.
2. Add this element to a list $r$.
3. Remove element $a_{i}$ from $a$ and $w_{i}$ from $w$.
4. Stop if $a$ is empty, else,
5. Return to step 1 with the updated lists $a$ $w$.
However, I found the code below for weighted random sampling on [[https://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python][this page]].
#+begin_src python
from random import random
def weighted_choice(weights):
rnd = random() * sum(weights)
for i, w in enumerate(weights):
rnd -= w
if rnd < 0:
return i
#+end_src
Thus, ~weighted_choice()~ gives an index to the chosen element of an
array. With this index I can fill a random shuffle $r$ with elements of
$a$.
As removing elements from lists is a bit expensive, I chose to set the
weight $w_{i}$ of the chosen element to zero. This ensures that the
element $i$ cannot be selected the next time.
#+begin_src python
def weighted_shuffle(a,w):
w = list(w) # make a copy of w
if len(a) != len(w):
print("weighted_shuffle: Lenghts of lists don't match.")
return
r = [0]*len(a) # contains the random shuffle
for i in range(len(a)):
j = weighted_choice(w)
r[i]=a[j]
w[j] = 0
return r
#+end_src
* Fast shuffling
Once I had the above code, I went to Gerlach van der Heide. He managed
to speed it up by several orders of magnitude.
#+begin_src python
from random import random
from bisect import bisect_right
import numpy as np
def weighted_shuffle(a,w):
r = np.empty_like(a)
cumWeights = np.cumsum(w)
for i in range(len(a)):
rnd = random() * cumWeights[-1]
j = bisect_right(cumWeights,rnd)
#j = np.searchsorted(cumWeights, rnd, side='right')
r[i]=a[j]
cumWeights[j:] -= w[j]
return r
a = np.arange(1,1000)
w = np.arange(1,1000)
r = weighted_shuffle(a,w)
#+end_src
Here is the result
#+begin_src python
print(r[:2])
#+end_src
Some testing shows, for a reason unknown to me, that ~np.searchsorted~ is slower than ~bisect_right~.
* A wrong idea
While searching the web I came across the one-liner below. It is
elegant, but wrong. The weights according to which the elements of $a$
are chosen are not in proportion to $w$. Some basic probability theory
shows why this is the case. It comes down to proving that
$$P\{w_1 U_1 \leq w_2 U_2\} \neq \frac{w_1}{w_1+w_1},$$
where $w_i$ are the weights and $U_i$ uniform random deviates on
$[0,1]$.
#+begin_src python
def wrong(a,w):
aa = range(len(a))
aa.sort(key=lambda i: w[i]*random()*beta[i])
return [a[i] for i in aa]
#+end_src