Talk about the principle and optimization of std::uniform_int_distribution

Normally, the Pseudo Random Number Generator generates integers in the interval [0, 2^N). Each integer in this interval has an equal chance of being generated. However, the random numbers we need usually belong to a smaller interval (e.g., the dice simulator needs random numbers in the interval [1, 6]), so we must use std::uniform_int_distribution to map [0, 2^N) to the interval [min, max]. This article follows Daniel Lemire’s paper Fast Random Integer Generation in an Interval and introduces several different mapping methods in order.

Specifications and Design Considerations

The specification of std::uniform_int_distribution is that given a random number generator that generates [0, 2^N) interval integers, it generates an integer between [min, max]. The mapping of the interval [0, 2^N) to the interval [min, max] must ensure that the probability of each integer in each [min, max] interval is equal.

During the calculation std::uniform_int_distribution can discard the random numbers previously provided by the random number generator and extract the new random numbers from the random number generator again. For example, if the random number generator can produce [0, 2^32) integers, and the value domain of std::uniform_int_distribution is [1, 6], a simple implementation would be to discard all random numbers that do not belong to the [1, 6] region. However, a good mapping should avoid rounding random numbers as much as possible; after all, pseudo-random number generators also require computation time. The Entropy Pool-based pseudohash generator has to wait for the operating system to collect enough Entropy.

Another consideration is the computation needed for the mapping itself. Some numerical simulators require a large number of random numbers, so the mapping must be as fast as possible. The last two algorithms in this paper are different ways to save the residual operator (Modulo, %).

OpenBSD Algorithms

The concept of the first algorithm is very simple: let s be max - min + 1, segment [0, 2^N) by s from the back, and if the random number belongs to these segments, we pass back x % s. If the length of the first paragraph is less than s, we discard the random numbers in that interval. Also, to calculate 2^64 % s using uint64_t, a constant equation is used here: 2^64 % s == (2^64 - s) % s. The complete code is as follows.

class uniform_int_distribution_openbsd {
private:
  const uint64_t min_value;
  const uint64_t s;

public:
  uniform_int_distribution_openbsd(uint64_t min, uint64_t max)
      : min_value(min), s(max - min + 1) {}

  template <typename UniformRandomBitGenerator>
  uint64_t operator()(UniformRandomBitGenerator &urbg) const {
    // 2^64 % s = (2^64 - s) % s = uint64_t(-s) % s
    uint64_t discard_bound = static_cast<uint64_t>(-s) % s;
    uint64_t x = urbg();
    while (x < discard_bound) {
      x = urbg();
    }
    return x % s + min_value;
  }
};

If the random number generation interval is [0, 2^3) and the value domain you want to map is [0, 2], the resultant comparison table is as follows.

x	results
0	Discard
1	Discard
2	2
3	0
4	1
5	2
6	0
7	1

Note: The author of the paper, Daniel, refers to this algorithm as the OpenBSD algorithm because he can find it in the OpenBSD source code library. Although I did not look for an earlier source, I think there is an earlier implementation.

OpenJDK Algorithm

The second algorithm is the algorithm in OpenJDK. It also segments the [0, 2^N) interval in s units. The difference is that it starts from the front. If the random number belongs to the last group, we discard the random number. Unlike the first algorithm, this method uses x - (x % s) > 2^N - s as the last group. The idea is that x % s is the operator that must be used to calculate the result. We can get the beginning of each segment by x - (x % s). If the beginning is larger than 2^N - s, the segment does not have the full [0, s) interval, so the random number is discarded.

The advantage of this algorithm is that if s is small enough with respect to 2^N, there is a good chance that this algorithm does not need to enter the loop, and only a remainder operator is needed. The drawback of this algorithm is that in the worst case this algorithm may require “multiple” remainder operators, and the number of remainder operators is not a constant. In contrast, the OpenBSD algorithm at the beginning of this paper guarantees that only two remainder operators are always needed in the worst case.

The complete code of the OpenJDK algorithm is as follows.

class uniform_int_distribution_openjdk {
private:
  const uint64_t min_value;
  const uint64_t s;

public:
  uniform_int_distribution_openjdk(uint64_t min, uint64_t max)
      : min_value(min), s(max - min + 1) {}

  template <typename UniformRandomBitGenerator>
  uint64_t operator()(UniformRandomBitGenerator &urbg) const {
    uint64_t x = urbg();
    uint64_t r = x % s;
    while (x - r > static_cast<uint64_t>(-s)) {  // 2^64 - s
      x = urbg();
      r = x % s;
    }
    return r + min_value;
  }
};

If the random number generation interval is [0, 2^3) and the value domain you want to map is [0, 2], the resultant comparison table is as follows.

x	r	x - r	results
0	0	0	0
1	1	0	1
2	2	0	2
3	0	3	0
4	1	3	1
5	2	3	2
6	0	6	Discard
7	1	6	Discard

Daniel Lemire’s Algorithm

Finally, the algorithm proposed by the author of the paper, Daniel Lemire, takes a different approach by mapping [0, 2^N) to [0, s * 2^N) using 128-bit integer multiplication, and then dividing the 128-bit integer into an upper half (i.e., dividing by 2^64) and a lower half (i.e., taking the remainder of 2^64). The upper half will be between [0, s) and will be the return value if other conditions are met. If the lower half is smaller than s we have to make additional judgments.

This method divides [0, 2^N) into s intervals. If s can divide 2^N, each interval will have floor(2^N / s) values. If it is not divisible, the first 2^N % s interval will be divided by 1 more value. If we discard the first number of each interval, we can compare the relationship between the lower half and 2^N % s. If the bottom half is smaller than 2^N % s, the random number belongs to the first number of the first 2^N % s interval and is therefore discarded.

Also, since 2^N % s < s, we can start with lower < s as a fast filter. The remainder operator is executed only when lower < s. As with the OpenBSD algorithm, the result of this remainder operator can be reused, so the Daniel Lemire algorithm will only use one remainder operator in the worst case.

But this algorithm raises the question: Is 128-bit integer multiplication better? As of this writing 128-bit integer multiplication is not a standard built-in integer type in C/C++; the GCC or Clang compilers each define 128-bit integers. We can do 128-bit integer multiplication with __uint128_t. Many 64-bit computer architectures (e.g., x86-64, aarch64, riscv64) define instructions to calculate the upper half of the result of multiplication (e.g., mulh), so in practice 128-bit integer multiplication is faster than remainder operators.

The complete program code is as follows.

class uniform_int_distribution_lemire {
private:
  const uint64_t min_value;
  const uint64_t s;

public:
  uniform_int_distribution_lemire(uint64_t min, uint64_t max)
      : min_value(min), s(max - min + 1) {}

  template <typename UniformRandomBitGenerator>
  uint64_t operator()(UniformRandomBitGenerator &urbg) const {
    uint64_t x = urbg();
    __uint128_t m = static_cast<__uint128_t>(x) * s;
    uint64_t lower = static_cast<uint64_t>(m);
    if (lower < s) {
      // (2^64 - s) % s = (2^64 - s) % s = uint64_t(-s) % s
      uint64_t discard_bound = static_cast<uint64_t>(-s) % s;
      while (lower < discard_bound) {
        x = urbg();
        m = static_cast<__uint128_t>(x) * s;
        lower = static_cast<uint64_t>(m);
      }
    }
    uint64_t upper = static_cast<uint64_t>(m >> 64);
    return upper + min_value;
  }
};

If the random number generation interval is [0, 2^3) and the value domain you want to map is [0, 2], the resultant comparison table is as follows.

x	x*s	upper	lower	results
0	0	0	0	Discard
1	3	0	3	0
2	6	0	6	0
3	9	1	1	Discard
4	12	1	4	1
5	15	1	7	1
6	18	2	2	2
7	21	2	5	2

Summary

The motivation for writing this article was my dissatisfaction with the Abseil random number library. The Abseil random number library deliberately messes up the random number seeds, so that the same program, the same random number seeds, will generate different random number sequences each time it is executed. The authors of the Abseil random number library explicitly state that they do not provide any API to turn off this behavior because if they don’t mess with random seeds, users of the library will start relying on specific sequences, which will make it impossible to improve the random number library. Because I just want fixed sequences, and that makes me very unhappy. However, after some research, I realized that the C++ standard does not require the result of std::uniform_int_distribution either. Then I found Daniel Lemire’s paper and saw the various optimizations. My opinion about Abseil’s random number library changed from dissatisfaction to approval. I think this “not specifying the result, but only the nature of the result” API design is very special, so I spent some time to document the idea.

Reference

https://zh-blog.logan.tw/2022/08/14/cxx-std-uniform-int-distribution/
Daniel Lemire, Fast Random Integer Generation in an Interval
The Abseil Random Library, Stability of Generated Sequences

Table of Contents

Specifications and Design Considerations

OpenBSD Algorithms

OpenJDK Algorithm

Daniel Lemire’s Algorithm

Summary

Reference