Tuesday, December 2, 2014

Random Assignment Success Probabilities in Sunday School

A few Sundays ago I was sitting in Sunday School during church. The teacher put 9 scripture references on the chalkboard that would be discussed during the class hour (from the Old Testament, Lesson 37). She then said something to the effect of, "Will everyone please look up one of these scriptures."

Hmm.... what are the odds that every scripture will be chosen by someone in the room? Assumptions: everyone in the room actually looks up a scripture, no one in the room collaborates with anyone else to decide which scripture to choose, and all scriptures are equally likely to be chosen. Under those assumptions we can model each person's choice as independent and identically distributed (iid).

Oh, I'm supposed to be listening to a lesson about.... what's it about? Okay, we're reading from Isaiah. Let's see...

There are nine scriptures and (about) thirty people. How do I figure this out?

We can think of a 30-digit number where only the digits 1 through 9 are used. As long as each digit from 1 to 9 shows up at least once in the 30-digit number, we're good! Dang, that's complicated. Let's simplify first.

We can say we have $M$ events (30 people) and $N$ outcomes (9 scriptures). To simplify, let's consider 2 events and 2 outcomes. The first event will end up in one of the two outcomes with probability 1 (it doesn't matter which). The second event has a probability of 0.5 of landing in the outcome opposite the first. Because the events are independent, we multiply: $1 \times 0.5 = 0.5$. So with two events (people) and two possible outcomes (scriptures), the probability that both outcomes occur is 0.5.

So, intuitively, if we have more events, we should have a greater probability of having both outcomes occur. The more times we flip a coin, the more likely we are to see both a head and a tail at least once. How much more likely? For three events, let's just list out the possible outcomes:
000
001
010
011
100
101
110
111
where 0 and 1 are the two different outcomes. (That's a 3-digit number with 2 possible values for each digit.) For three events and two possible outcomes, there are 8 ways things can shake out. Of those, 6 are successful (at least one 1 and at least one 0). Therefore, $p=\frac{6}{8}=0.75$. The probability went up, as we expected.
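
For small cases we can verify counts like these by brute force. Here's a quick sketch in modern Julia (the name count_successes is mine, not anything standard) that enumerates all $N^M$ assignments of events to outcomes and counts the successful ones:

```julia
# Brute-force check: enumerate all N^M assignments of M events to N outcomes
# and count those in which every outcome appears at least once.
# Only feasible for small M and N.
count_successes(m, n) = count(a -> length(unique(a)) == n,
                              Iterators.product(fill(1:n, m)...))

count_successes(3, 2)   # 6, matching the list above
count_successes(4, 2)   # 14
```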

We can easily calculate 8 as $2^3$ or $N^M$. How do we calculate 6? I'll say right now it's NOT $8-2$*.

(*You most certainly can calculate it as $N^M - X$. But $X$ is not necessarily 2 as a general rule; in general it's the complement of the count we'll develop below.)

Rather than think about long numbers using odd bases, perhaps it's better to think of buckets and balls, where $N$ is the number of buckets and $M$ is the number of balls. Each ball randomly lands in one of the $N$ buckets. What's the likelihood of each bucket having at least one ball after all is said and done?

So when $N=9$ and $M=30$, there are over $4 \times 10^{28}$ possible outcomes; that is, we can't just list out all the outcomes and count the ones that work. We have to count smarter than that, which we can do.

For the example above, we calculate 6 as $\binom{3}{2} + \binom{3}{1} = 6$. If we had four events, we'd calculate the number of successful outcomes as $\binom{4}{1} + \binom{4}{2} + \binom{4}{3} = 14$. Using $n$-choose-$k$ notation is correct but perhaps not the most instructive as written. Writing the binomial coefficient out using factorials will help in what follows.
\[
\binom{4}{1} = \frac{4!}{1!3!}
\]
The denominator is $1!$ times $3!$, where 1 and 3 are the numbers of events resulting in the different outcomes: one event for outcome 0 (or A) and three events for outcome 1 (or B, or whatever). In the buckets-and-balls example, that would be one ball in bucket A and three balls in bucket B.

With multiple possible outcomes (multiple scriptures to choose from), we have the product of $N$ factorials in the denominator, which makes it a multinomial coefficient. The $N$ numbers (not their factorials) sum to $M$. An example: $M=10$, $N=3$ (where the outcomes are A, B, and C), and we ended up seeing A twice, B five times, and C three times. How many ways can we have this outcome? Because order doesn't matter (it's a combination, not a permutation), we can calculate it as $\frac{10!}{2!5!3!}$. Or if we saw A five times, B zero times, and C five times, we would have $\frac{10!}{5!0!5!}$.
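
Since these multinomial coefficients do all the work in what follows, here's a tiny helper in modern Julia (a sketch; multinomial_coef is my name for it), using BigInt arithmetic so that $M=30$ doesn't overflow 64-bit integers:

```julia
# Multinomial coefficient: (k1 + k2 + ... + kN)! / (k1! k2! ... kN!).
# BigInt keeps 30! and friends from overflowing.
multinomial_coef(ks) = factorial(big(sum(ks))) ÷ prod(factorial(big(k)) for k in ks)

multinomial_coef([2, 5, 3])   # 2520 ways to see A twice, B five times, C three times
multinomial_coef([5, 0, 5])   # 252 ways to see A five times, B never, C five times
```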

So to guarantee all $N$ outcomes are seen (all nine scriptures chosen, or all nine buckets each holding at least one ball) after $M$ events, every number in the denominator must be greater than or equal to 1. So now we must also count all the ways we can add $N$ numbers between $0$ and $M$ to get $M$, and then use only those where all $N$ numbers are at least 1. Note here that the ways to add to $M$ are ordered: rearrangements count separately. For example, consider again $\frac{10!}{5!0!5!}$. Is that the same as $\frac{10!}{0!5!5!}$ or $\frac{10!}{5!5!0!}$? No. They give the same count, but they are not the same outcomes for our $M$ events. But if we only have one of the three, we can quickly calculate that we need three of them as $\binom{3}{2}$ or $\binom{3}{1}$ (two 5s and one 0).
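
That count of 3 is itself a multinomial coefficient of the multiplicities (two 5s and one 0), so the helper above already handles it:

```julia
# Distinct orderings of the parts (5, 0, 5): 3! / (2! 1!) = 3.
multinomial_coef([2, 1])   # 3
```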

Funny enough, I had already written a Python script just a few months ago to enumerate all the ways you can add $N$ numbers between $0$ and $M$ to get $M$. It gets really slow as the numbers get big (there's lots of room for optimization in my algorithm, I'm sure). Lucky for me, this is not a new problem. The different ways you can add positive integers to get $M$ are called partitions, and an efficient algorithm for generating partitions is given by Donald Knuth in Volume 4A of his "The Art of Computer Programming" series. And this algorithm is already implemented in Julia's standard library. Is Julia's implementation of Knuth's algorithm better than my Python implementation of my own algorithm? Yes. Absolutely, yes. Not only is it faster, but I can call either partitions(m), which generates all partitions that sum to m, or partitions(m,n), which generates all partitions of size n that sum to m. How convenient! (I've reversed the Julia documentation's n,m notation to match ours.)
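
A quick demonstration (in modern Julia, partitions lives in the Combinatorics.jl package rather than the standard library):

```julia
using Combinatorics

collect(partitions(10, 5))
# 7 partitions of 10 into exactly 5 positive parts:
# [6,1,1,1,1], [5,2,1,1,1], [4,3,1,1,1], [4,2,2,1,1],
# [3,3,2,1,1], [3,2,2,2,1], [2,2,2,2,2]
```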

Now, for each partition given by partitions(m,n), we'll calculate its multinomial coefficient. Each of those is multiplied by the multinomial coefficient of the counts of the distinct values in the partition. An example: partitions(10,5) will give seven different partitions, one of which is [4,3,1,1,1]. The multinomial coefficient for this is $\binom{10}{4,3,1,1,1}$, or $\frac{10!}{4!3!1!1!1!}$. This is multiplied by $\binom{5}{1,1,3}$, or $\frac{5!}{1!1!3!}$, which comes from the fact that there are 5 values in the partition, where 1 is a four, 1 is a three, and 3 are ones; this is the number of ways we can uniquely rearrange our partition. This product of multinomials is calculated for each partition, and the products are summed. That sum is the number of ways we can successfully observe all $N$ outcomes over $M$ trials. Divide it by $N^M$ to get the probability of observing all $N$ outcomes over $M$ or fewer trials. That's critical: or fewer. We mindlessly perform $M$ trials and don't care when we succeeded in observing all $N$ outcomes. Maybe it happened right at trial $M$, but maybe it happened before that (if $M>N$). In other words, we are calculating values of the CDF. The whole computation is sketched below.
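
Here's a minimal sketch of that computation in modern Julia, reusing multinomial_coef from above (the name p_all_outcomes is mine, not an established API):

```julia
using Combinatorics   # for partitions (it was in the standard library in 0.3-era Julia)

# CDF: probability that all n outcomes are observed within m trials.
function p_all_outcomes(m::Integer, n::Integer)
    m < n && return 0.0            # can't see n distinct outcomes in fewer than n trials
    total = big(0)
    for part in partitions(m, n)   # partitions of m into exactly n positive parts
        # Multiplicities of the distinct values, e.g. [4,3,1,1,1] -> one 4, one 3, three 1s.
        mult = Dict{Int,Int}()
        for k in part
            mult[k] = get(mult, k, 0) + 1
        end
        # Unique rearrangements of the partition, times the ways to arrange
        # the m trials among those bucket sizes.
        total += multinomial_coef(collect(values(mult))) * multinomial_coef(part)
    end
    return Float64(total // big(n)^m)
end

p_all_outcomes(30, 9)   # ≈ 0.756 for the Sunday School class
```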

So for our Sunday School teacher, the probability that all 9 of the scriptures will be chosen by the class of 30 is about 75.6% (again, under the assumptions we made earlier). So if she employs this method of scripture assignment, most weeks she won't have to wait while someone looks up a scripture that no one had chosen ahead of time.

A discrete derivative of this CDF will give us the pmf. But what does a value from the pmf mean? It is the probability of requiring exactly $M$ trials to observe all $N$ outcomes. So intuitively we would expect the probability to be low when $M$ is close to $N$: you would need to be very lucky to roll each of the six values on a standard six-sided die in just six rolls (about a 1.5% chance). But you would also expect very large values of $M$ to be unlikely. You probably won't need to roll that die 100 times to see all six sides at least once; for that to happen, only five of the six outcomes could appear in the first 99 rolls, with the sixth occurring on exactly the 100th roll. That seems like too many rolls. So somewhere in the middle is our most likely outcome.
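
As a sketch, the pmf is just the difference of consecutive values of the p_all_outcomes CDF from above (pmf_exact is my name for it):

```julia
# pmf: probability that exactly m trials are needed to observe all n outcomes.
pmf_exact(m, n) = p_all_outcomes(m, n) - p_all_outcomes(m - 1, n)

pmf_exact(6, 6)   # ≈ 0.0154: every face of a die in exactly six rolls
```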

Let's look at some examples. Below is a plot of the pmf when $N=6$. Again, these are the probabilities that we will complete our observation of all $N$ outcomes in exactly $M$ events. The maximum probability (the mode) occurs at $M=11$. The median is approximately $M=13$. The expected value is approximately $14.7$ (using values calculated out to $M=50$).
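
Before the plot, here's a quick numerical check of those summary statistics using pmf_exact from above (a sketch; the truncation at $M=50$ follows the text):

```julia
n, ms = 6, 6:50
pm = [pmf_exact(m, n) for m in ms]   # pmf values out to M = 50

ms[argmax(pm)]                       # mode: 11
ms[findfirst(cumsum(pm) .>= 0.5)]    # median: 13
sum(ms .* pm)                        # mean: ≈ 14.7
```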

(And these plots were made in Julia 0.3.2 using Gadfly 0.3.9 and are zoomable. Neat!)

[Plot: pmf for N=6. x-axis: number of events M; y-axis: p.]

That's an interesting shape. Let's look at some more for $N=8$ and $N=12$.

[Plot: pmf for N=8. x-axis: number of events M; y-axis: p.]

[Plot: pmf for N=12. x-axis: number of events M; y-axis: p.]

Hmm... that reminds me of a log-normal distribution. But if a Poisson distribution can approximate a normal distribution, is there a similar discrete distribution that approximates a log-normal (for large $N$)?

The following is a plot for $N=12$ again, but plotted against the natural log of the number of events.

[Plot: pmf for N=12. x-axis: log of the number of events M; y-axis: p.]

I like it. Maybe this discrete distribution could approximate a log-normal?

Voice of my ego: Poisson got a cool distribution and a process named after him. This is your chance to have something named after you!
Voice of reason: No one has ever thought of this before? I doubt it. There's no way I'm the first. (I'll hedge by saying the probability is close to 0 that I'm the first.)

Strangely enough, out on the seas of Google and Bing I'm having a hard time finding anything about a discrete distribution that approximates a log-normal distribution the way a Poisson distribution approximates a normal distribution. Anyone know something I don't and want to share?