Friday, September 26, 2014

Am I a better Gin Rummy player than the AI?

When I have a moment of downtime I like to play Gin Rummy on my cell phone. (I play the one made by AI Factory on an Android phone.) The game offers a number of different opponents to play against, though unfortunately not another human. (My wife and I tried one app that lets us play each other over Bluetooth, but its interface just doesn't match AI Factory's. We have also played with actual cards but, after playing on a phone for too long, shuffling and dealing become annoyingly slow.) The opponents are given a difficulty rating between 1 and 5 stars. I lost most of the stats I had accumulated after switching to a new phone, but I always thought "Jane" was the most challenging. On my new phone she's the only one I've played against. Right now I've got a 40-46 win/loss record against Jane. That's a 0.465 win rate. So does that mean I'm worse than Jane?

Don't answer that question! Give me a minute to explain myself before you blurt out "yes!" ... So would you believe me if I said my wife took my phone, lost a bunch of games and ruined my stats? No? I've never been good at lying so let's think about this. My ego is at stake here.

Shoulder devil: You stink at Gin Rummy. You're losing to some freaking electrons zipping around in your phone! Pathetic.
Shoulder angel: You don't stink. 46.5% is good! And this is a game with a lot of random variation. Even a skilled player will lose to an amateur sometimes. And you're getting better each time you play!
(We will ignore the comment on skill and assume for today that my skill was the same over all games played.)
Self: Hmmm... random variation. So maybe it was just a string of bad luck?
Shoulder angel: Yes... I mean maybe... Yes to maybe.

I have played 86 games so far. If I flipped a fair coin 86 times, how many times would it come up heads? The single most likely outcome is 43. But it is even more likely that I won't get exactly 43. I'll say that again: you're more likely to get some number other than 43 heads in 86 flips. How much more likely?

The pmf (probability mass function) for the binomial distribution is
$$\Pr(K=k) = \binom{n}{k} p^{k} (1-p)^{n-k}$$
where $n$ is the number of trials, $k$ is the number of successes and $p$ is the probability of success. For our problem we get
$$\Pr(K=43) = \binom{86}{43} (0.5)^{43} (1-0.5)^{86-43} \approx 0.086.$$
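If you'd rather not punch that into a calculator, here is a minimal Python sketch of the same formula. The function name `binom_pmf` is my own invention for illustration; only the standard library's `math.comb` is used. It evaluates the pmf for 43 heads and, since my record stands at 40 wins, for 40 as well.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_43 = binom_pmf(43, 86, 0.5)  # exactly 43 heads in 86 fair flips
p_40 = binom_pmf(40, 86, 0.5)  # exactly 40 heads, matching my 40 wins

print(f"Pr(K=43)  = {p_43:.3f}")      # 0.086
print(f"Pr(K=40)  = {p_40:.3f}")      # 0.070
print(f"Pr(K!=43) = {1 - p_43:.3f}")  # 0.914
```

The last line is the "some other number" probability: one minus the chance of landing on exactly 43.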

So if you flipped a fair coin 86 times every day and did this for thousands of days, you'd get 43 heads on 8.6% of those days. The other 91.4% of the days would result in some other number of heads. (The fact that our rounded probability of 0.086 is 86/1000 and 86 is the number of trials is a coincidence. Don't get hung up on that.)
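Nobody is going to flip a coin 86 times a day for thousands of days, but a computer will. The following is a rough simulation sketch, not a proof. The seed and the number of days are arbitrary choices of mine, and each random bit stands in for one fair coin flip.

```python
import random

rng = random.Random(2014)  # fixed, arbitrary seed so the run is repeatable
days = 20_000              # "thousands of days"
flips = 86

# Draw 86 random bits per day and count the 1s as heads.
hits = sum(
    bin(rng.getrandbits(flips)).count("1") == 43
    for _ in range(days)
)
frac_43 = hits / days
print(f"Fraction of days with exactly 43 heads: {frac_43:.3f}")
```

The printed fraction should hover near 0.086, and it will wobble a little if you change the seed. That wobble is the same random variation this whole post is about.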

Shoulder devil: And...
Self: So if you figure out the probability for 40 successes out of 86, you get 0.070. Also small, but not so different from 0.086 that you could say the losses were entirely due to skill.
Shoulder devil: That sounds kind of subjective to me. I think 0.070 and 0.086 are very different. Loser.
Self: Let's quantify it!
(This is when "QUANTIFY" would zoom in and zoom out with a spinning background a-la the Batman symbol.)
Shoulder angel: So what exactly are we quantifying?

What we'd ultimately like to know is the true probability $p$ of success. If it's greater than 0.5 then I am a better player. If it's less, then Jane is better. But we really can't know what that value is so we'll find a confidence interval that makes us comfortable.

At this point we'll try to find the "binomial proportion confidence interval." (I know Wikipedia isn't the best reference. But it is sometimes a good source of references.) The binomial proportion is the probability $p$. The confidence interval gives a range of values of $p$ that are consistent with the data; a 95% interval comes from a procedure that would capture the true value in about 95% of repeated experiments. I won't go into the many methods for calculating this (this is left as an exercise to the reader (I hate reading that in books)). I will give the Clopper-Pearson interval for today. A 95% confidence interval using this method for 40 successes in 86 trials gives $0.35678 \le p \le 0.57592$.
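For the curious, here is one way the Clopper-Pearson interval can be computed with nothing but the standard library. This is a sketch under my own assumptions, not necessarily how the numbers above were produced: the names `binom_cdf` and `clopper_pearson` are hypothetical, and the bounds are found by plain bisection. The lower bound is the $p$ at which seeing $k$ or more successes has probability 0.025, and the upper bound is the $p$ at which seeing $k$ or fewer has probability 0.025.

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr(K <= k) for a binomial(n, p) variable."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) interval for k successes in n trials."""
    alpha = 1 - conf

    def bisect(g):
        # g is increasing in p and changes sign once on (0, 1)
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: Pr(K >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else bisect(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: Pr(K <= k | p) = alpha / 2
    upper = 1.0 if k == n else bisect(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

lower, upper = clopper_pearson(40, 86)
print(f"{lower:.5f} <= p <= {upper:.5f}")
```

The printed bounds should land close to the interval quoted above. Statistics libraries usually get the same bounds from beta distribution quantiles, which is faster, but the bisection version shows what the interval actually means.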

Shoulder angel: Look at that! It could be as high as 0.576! Maybe you are better than Jane.
Shoulder devil: Or as low as 0.357. Don't celebrate too much, chump.

So am I better than Jane? Maybe... Yes to maybe.
(My gut tells me that right now I am not better. Recently, however, I've thought of some new strategies that may help. So, again, maybe.)

Kevin McCallister: You guys give up? Or ya thirsty for more?

A while ago I read this article at Grantland by Bill Barnwell. In the section titled "The Best of the Best" he gives a table of win/loss records of NFL quarterbacks in games decided by one touchdown or less (in footnote 2 of the article he indicates that this means games ending with a point differential of 7 or less). His table includes ties but for my purposes I'll leave those out. His table is sorted, logically, by win percentage. But the number of games ending in this situation for each quarterback ranges from 50 to 117. Is Terry Bradshaw's 0.593 in 59 games really better than Brett Favre's 0.581 in 117 games?

Shoulder devil: Yes. 0.593 is greater than 0.581. Greater is better. Yet you ask the question as if there's some reason for it not to be better. And if you have something in mind, why are you asking a question and not just telling me what you have in mind?

There's a lot of random variation in football games and the quarterback doesn't always control the outcome of a game, even if he is on the field. I submit that we should sort this list of quarterbacks by the lower bound of a 95% confidence interval on their win percentage, i.e., their binomial proportion.

Shoulder angel: Don't forget to be completely honest about the assumptions you're making.

I didn't say this when talking about Gin Rummy games but the assumption with a binomial process is that all trials are iid (independent and identically distributed). Since I assumed my skill level remained the same over all games and that the AI's skill doesn't change either, this assumption seems fair. In football games this assumption is shakier (but we're going to make it anyway). For example, Brett Favre has 117 games here. It would take at least 8 seasons to play that many games (ignoring the playoffs). A quarterback's skill can vary over time. I don't have the raw data but how the game plays out matters, too. Was the team ahead before giving up a meaningless end-game touchdown? Were they behind near the end and taking more risks to win? Were they playing at home? Who were they playing? The point is that the iid assumption is not as well justified here as it was in Gin Rummy. But, like I said, we're going to make it anyway.

So if we add some new columns to the table and sort it by the lower bound of our 95% confidence interval we get the following.

| Player | GP | W | L | Win% | Win% LB | Win% UB | Win% Rank | LB Rank | Change |
|---|---|---|---|---|---|---|---|---|---|
| Tom Brady | 76 | 54 | 22 | 0.711 | 0.595 | 0.809 | 1 | 1 | 0 |
| Peyton Manning | 103 | 66 | 37 | 0.641 | 0.540 | 0.733 | 3 | 2 | 1 |
| Jim Kelly | 69 | 44 | 25 | 0.638 | 0.513 | 0.750 | 4 | 3 | 1 |
| Jay Schroeder | 52 | 34 | 18 | 0.654 | 0.509 | 0.780 | 2 | 4 | -2 |
| Dan Marino | 107 | 64 | 43 | 0.598 | 0.499 | 0.692 | 8 | 5 | 3 |
| Brett Favre | 117 | 68 | 49 | 0.581 | 0.486 | 0.672 | 11 | 6 | 5 |
| Matt Hasselbeck | 65 | 40 | 25 | 0.615 | 0.486 | 0.733 | 5 | 7 | -2 |
| Ken Stabler | 67 | 41 | 26 | 0.612 | 0.485 | 0.729 | 6 | 8 | -2 |
| John Elway | 112 | 64 | 48 | 0.571 | 0.474 | 0.665 | 14 | 9 | 5 |
| Jake Plummer | 53 | 32 | 21 | 0.604 | 0.460 | 0.735 | 7 | 10 | -3 |
| Brian Sipe | 57 | 34 | 23 | 0.596 | 0.458 | 0.724 | 9 | 11 | -2 |
| Terry Bradshaw | 59 | 35 | 24 | 0.593 | 0.457 | 0.719 | 10 | 12 | -2 |
| Joe Montana | 69 | 40 | 29 | 0.580 | 0.455 | 0.698 | 13 | 13 | 0 |
| Phil Simms | 71 | 40 | 31 | 0.563 | 0.440 | 0.681 | 17 | 14 | 3 |
| Dave Krieg | 80 | 44 | 36 | 0.550 | 0.435 | 0.662 | 20 | 15 | 5 |
| Ben Roethlisberger | 64 | 36 | 28 | 0.563 | 0.433 | 0.686 | 18 | 16 | 2 |
| Eli Manning | 60 | 34 | 26 | 0.567 | 0.432 | 0.694 | 16 | 17 | -1 |
| Dan Pastorini | 58 | 33 | 25 | 0.569 | 0.432 | 0.698 | 15 | 18 | -3 |
| Joe Theismann | 50 | 29 | 21 | 0.580 | 0.432 | 0.718 | 12 | 19 | -7 |
| Fran Tarkenton | 55 | 31 | 24 | 0.564 | 0.423 | 0.697 | 19 | 20 | -1 |
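To show how a re-ranking like this can be done, here is a small sketch using five of the quarterbacks from the table above. The helper `cp_lower` is a hypothetical name of mine. It finds the Clopper-Pearson lower bound by bisection, which I am assuming is close enough to whatever tool built the table, and the win/loss numbers are copied from the table with ties left out.

```python
from math import comb

def cp_lower(wins, games, conf=0.95):
    """Clopper-Pearson lower bound: the p where Pr(K >= wins) = (1 - conf) / 2."""
    target = (1 - conf) / 2

    def upper_tail(p):
        return sum(comb(games, i) * p**i * (1 - p) ** (games - i)
                   for i in range(wins, games + 1))

    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; upper_tail is increasing in p
        mid = (lo + hi) / 2
        if upper_tail(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# (player, wins, losses) copied from the table
records = [
    ("Tom Brady", 54, 22),
    ("Jay Schroeder", 34, 18),
    ("Terry Bradshaw", 35, 24),
    ("Brett Favre", 68, 49),
    ("Joe Theismann", 29, 21),
]

by_win_pct = sorted(records, key=lambda r: r[1] / (r[1] + r[2]), reverse=True)
by_lower_bound = sorted(records, key=lambda r: cp_lower(r[1], r[1] + r[2]),
                        reverse=True)

for name, w, l in by_lower_bound:
    print(f"{name:15s} {w / (w + l):.3f}  LB {cp_lower(w, w + l):.3f}")
```

In this subset, sorting by raw win percentage puts Bradshaw ahead of Favre, while sorting by the lower bound flips them. Favre's 117 games pin down his win rate more tightly than Bradshaw's 59 do.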

Congratulations to Brett Favre and John Elway. Sorry, Joe Theismann. Interesting also is that only four quarterbacks have lower bound numbers higher than 0.500. That's not to say the rest of the quarterbacks weren't any better than some average replacement but it does go to show just how hard it is to be a really dominant quarterback in close games over a long period of time. Again that comes back to just how much other factors come into play in the outcome of a football game.