Set up

In this post I want to write about a striking probability fact that I came across.

Consider this game: A coin will be flipped 10 times in a row, Alice wins a point if heads happens 3 times in a rows during the game. Bob wins if the flip sequence heads tails tails occurs.

Which player is more likely to win? My intuition was that the probability would be equal, since every random sequence is equally likely. In fact Bob is more likely to win! Alice has a 50% chance of winning and Bob has a 75% chance of winning (roughly).

Simulation

Here is a script that counts simulates flipping a coin 10 times and counts how many times HHH and how many times HTT occurs. The result is this

checked 1024 strings
HHH occured 520 times (50.78125%)
HTT occured 792 times (77.34375%)

import itertools

def generate(symbols, length):
    return ["".join(seq) for seq in itertools.product(symbols, repeat=length)]

# print(generate("HT", 3))
# ['HHH', 'HHT', 'HTH', 'HTT', 'THH', 'THT', 'TTH', 'TTT']

def find_subsequence(haystack, needle):
    if needle in haystack:
        return True
    return False

hhh_counter = 0
htt_counter = 0
sequences = generate("HT", 10)
total = len(sequences)
for flips in sequences:
    if find_subsequence(flips, "HHH"):
        hhh_counter += 1
    if find_subsequence(flips, "HTT"):
        htt_counter += 1

print("checked %a strings" % total)
print("HHH occured %a times (%a%%)" % (hhh_counter, 100*hhh_counter/total))
print("HTT occured %a times (%a%%)" % (htt_counter, 100*htt_counter/total))

Why is this true

We have seen that this really is true based on a computer simulation. But it's still a mystery! Conceptually, why is this true? It is at odds with the fact that every coin flip sequence is equally likely.

The sequence HHH occurs just as many times as HTT does over the whole space of sequences of 10 coin flips. The tricky part is that it bunches up together more. For example, the sequence HHHHHHHHHH contains HHH 8 times, while the best you can do for HTT is HTTHTTHTTH only 3 times.

The sequence HHH overlaps with itself better. This is called autocorrelation.

https://en.wikipedia.org/wiki/Autocorrelation_(words)

Mathematics of counting subsequences

I assume you are familiar with finite state automata.

To 'recognize' the sequence HTT within a string you can have states and transitions between them like this:

(1) --H-> (2) --T-> (3) --T-> (4)
    <-T--     <-H--

we might write this as a PEG (generalized regular expression) grammar:

L1 <- 'H' L2 + 'T' L1
L2 <- 'H' L2 + 'T' L3
L3 <- 'H' L2 + 'T' L4
L4 <- 'H' L2 + 'T' L1

i.e if we see H (then T, then T) we advance our state to the accept state 4, and we see something else we jump back to the start.

This setup enables us to count occurrences using a technique called generating functions. The generating function will tell us exactly how many of the length "n" coin flip sequences our subsequence occurs in - it packages all this information up for us in a single object that we can do algebra with.

Let

L1 = x L2 + x L1
L2 = x L2 + x L3
L3 = x L2 + x L4
L4 = 1 + 2 x L4 # every length bigger than this gets a match whether we do H or T next

and we can try to eliminate all of these except L_1 and get a rational expression:

L1 = x/(1-x) L2
L2 = x/(1-x) L3
L3 = x L2 + x/(1-2x)

L1 = x/(1-x) * x^2/((1-x)*(1-2*x)*(1-x^2/(1-x))) = x^3/(-2*x^4 + x^3 + 4*x^2 - 4*x + 1)
= x^3 + 4*x^4 + 12*x^5 + 31*x^6 + 74*x^7
+ 168*x^8 + 369*x^9 + 792*x^10 + 1672*x^11
+ 3487*x^12 + 7206*x^13 + 14788*x^14 + 30185*x^15
+ 61356*x^16 + 124308*x^17 + 251199*x^18
+ 506578*x^19 + O(x^20)

as you can see the number 792 that came up earlier is next to x^10, meaning it is the number of times this subsequences occurs in length 10 sequences.

You can also use the autocorrelation polynomial instead of working out the state transition system manually.

For more on this see [Philippe Flajolet and Robert Sedgewick, Analytic Combinatorics]

#math #probability