Monday, December 22, 2014

More on Observing All Outcomes in M Events

I wrote about the probability of observing all $N$ equally likely outcomes over $M$ events here. I was chatting with Neal, who served as my PhD advisor, about this problem, and he reminded me that he had given me a similar problem in the random processes class I took from him 5 years ago (which I had forgotten). The problem was a "gotta collect them all" problem: if you are trying to collect all the toys in the set offered in, for example, McDonald's Happy Meals, and you assume each toy you receive can be modeled as an iid event, how long will it take to collect them all? In the class we modeled the process as a Markov chain. In the first state you have a probability of 1 of getting a toy you don't already have (because you start with none). In the second state you have a probability of $\frac{N-1}{N}$ of getting a new toy (moving on to the next state) and a probability of $\frac{1}{N}$ of getting a toy you already have (staying in the same state). This continues all the way to the final state, where you have collected them all.
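To make the setup concrete, here's a quick Monte Carlo sketch of the collecting process in Julia (the function name and trial count are just illustrative, not from my original code):

```julia
# Simulate one run of the "gotta collect them all" process:
# draw iid uniform toys from 1:N until every toy has been seen at least once,
# and return how many draws it took.
function draws_to_collect_all(N)
    seen = falses(N)          # which toys we have so far
    remaining = N             # toys still missing
    draws = 0
    while remaining > 0
        draws += 1
        toy = rand(1:N)       # each toy is iid uniform on 1:N
        if !seen[toy]
            seen[toy] = true
            remaining -= 1
        end
    end
    return draws
end

# Average over many runs to estimate the expected collection time.
mean_draws = sum(draws_to_collect_all(12) for _ in 1:10_000) / 10_000
```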

After finally publishing my previous post I began to think more about this. Could I get a distribution from that as well? Short answer: yes. Longer answer: yes and let's talk about it more!

The reason you can (easily) model the process as a Markov chain is that the outcomes are uniformly distributed. That's important because it means the state transition probabilities depend only on how many distinct outcomes you've already seen, not on which particular ones. If the outcomes were non-uniformly distributed, the state transition probabilities would change depending on which particular events had occurred. That's not impossible to model, but it's not simple, either.

So the Markov chain we use to model this process is a relatively simple one. For a process with $N$ possible outcomes, we have a Markov chain with $N+1$ states, where state $k$ means $k$ distinct outcomes have been observed so far. The transition probabilities are as I described above: from state $k$ you move to state $k+1$ with probability $\frac{N-k}{N}$ and stay put with probability $\frac{k}{N}$, so it progressively gets harder (less likely) to move from the current state to the next.
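Here's a sketch of how that matrix might be built in Julia (indexing is 1-based, so row $k+1$ corresponds to having collected $k$ distinct outcomes; transition_matrix is just an illustrative name):

```julia
# Build the (N+1)x(N+1) state transition probability matrix described above.
# Row k+1 is the state "k distinct outcomes collected so far".
function transition_matrix(N)
    P = zeros(N + 1, N + 1)
    for k in 0:N-1
        P[k + 1, k + 1] = k / N        # draw a repeat: stay in the same state
        P[k + 1, k + 2] = (N - k) / N  # draw something new: advance one state
    end
    P[N + 1, N + 1] = 1.0              # final state is absorbing
    return P
end
```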

A really nice property of Markov chains is that you can take the state transition probability matrix $\mathbf{P}$ and use it to quickly calculate the probability of being in any given state after $T$ steps by raising the matrix to the $T$th power as $\mathbf{P}^T$. We'll also assume we always start in the first state of the Markov chain, so the distribution over states after $T$ steps is $\mathbf{e}_1^\top \mathbf{P}^T$, i.e. the first row of $\mathbf{P}^T$.
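In code, reusing the transition_matrix sketch from above, that's just a matrix power and a lookup:

```julia
# Distribution over states after T steps, starting from state 1:
# row 1 of P^T is exactly e1' * P^T.
P = transition_matrix(12)
T = 75
state_probs = (P^T)[1, :]
p_collected_all = state_probs[end]  # probability the final state is reached by step T
```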

The probability of being in any given state after $T$ steps... that's the CDF! It's the CDF for our final state, anyway, since that probability goes to 1 as $T$ goes to infinity for our process. A discrete derivative, again, gives the pmf. Even better, calculating the distribution this way is way, way, way faster than using the integer counting method I described in my previous post. (Much of that gap is probably down to the BigInt type in Julia: counting the huge integers involved requires arbitrary-precision arithmetic, whereas the Markov chain gives good results with native 32- and 64-bit floating point numbers.) But both approaches work and give the same answers.
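Here's one way you might compute the whole CDF and pmf; instead of recomputing $\mathbf{P}^T$ from scratch for each $T$, this sketch just multiplies by $\mathbf{P}$ one step at a time:

```julia
# CDF and pmf of the number of events needed to observe all N outcomes,
# evaluated out to M events.
function collect_all_distribution(N, M)
    P = transition_matrix(N)
    state = zeros(1, N + 1)
    state[1] = 1.0                # start in the first state
    cdf = zeros(M)
    for T in 1:M
        state = state * P         # advance the chain one step
        cdf[T] = state[end]       # P(all N outcomes seen by step T)
    end
    pmf = [cdf[1]; diff(cdf)]     # discrete derivative of the CDF
    return cdf, pmf
end
```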

To show the difference in speed, I'll give an example. Calculating the distribution for $N=12$ out to $M=75$ using my outcome counting method took my Julia code about 200 seconds. Doing the same thing with a Markov chain state transition probability matrix took around 380 microseconds. It was also a heck of a lot easier to think through and code up. But boy do I love a good counting problem. (As part of a poster presentation I made before graduation, I offered a \$5 cash prize to anyone who could solve a difficult counting problem related to the research I was presenting. I'll have to give that problem its own post sometime.)
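If you want to reproduce the comparison on your own machine, here's a rough timing sketch using the function above (exact numbers will of course vary):

```julia
N, M = 12, 75
@time collect_all_distribution(N, M)  # first call includes JIT compilation
@time collect_all_distribution(N, M)  # second call reflects the steady-state runtime
```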

Calculations that are orders of magnitude faster mean I can give distributions for outcome spaces that are orders of magnitude bigger. Here's the distribution for $N=512$ (Hey! You think twelve is big? How about five hundred and twelve?!):
And again, but with a log scale on the x-axis:

Hmm... That actually looks right-tailed, even after scaling the x-axis, so I guess it's not a log-normal. What is this distribution? Maybe skew-normal? Or log-skew-normal?

Let me give another example of where you could apply this problem. If you have a communication system using $N$ symbols to represent data transmitted across your medium, you could measure how often each symbol occurs to make sure the distribution is close to uniform. But you could also count how many symbol transmissions you need before you observe all $N$ symbols at least once. The distribution of that transmission count would follow the log-normal-like distribution I've described.

Similarly, you could use it as one test in a suite of tests for uniformity of a random number generator. You would use $B$ bits to form $2^B$ possible outcomes; the number of $B$-bit sequences needed to observe all $2^B$ possible outcomes would follow this distribution, whatever it is. It should also hold for overlapping sequences.
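As a rough sketch of what that test might look like (nextbits here is a hypothetical stand-in for whatever bit source is under test):

```julia
# Count how many B-bit draws it takes before every one of the 2^B patterns
# has appeared at least once. nextbits(B) should return an integer in 0:2^B-1.
function draws_to_see_all_patterns(B, nextbits)
    N = 2^B
    seen = falses(N)
    remaining = N
    draws = 0
    while remaining > 0
        draws += 1
        pattern = nextbits(B)
        if !seen[pattern + 1]
            seen[pattern + 1] = true
            remaining -= 1
        end
    end
    return draws
end

# Example: test Julia's built-in generator with B = 8 (so 256 patterns).
draws_to_see_all_patterns(8, B -> rand(0:2^B - 1))
```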

But the question still is: what is this distribution?