Wikipedia:Reference desk/Archives/Mathematics/2020 April 4

= April 4 =

== John Marlowe and the other John Marlowe ==
Being stuck indoors (that's my excuse; I probably would have done this anyway), I chose to watch a couple of old b/w movies on TV today, ones I'd never seen before.
 * Trent's Last Case (1952) had a major character, played by John McCallum, named John Marlowe.
 * Then came State Secret (1950), whose major character, played by Douglas Fairbanks Jr., was ... John Marlowe.

What a weird coincidence, I said to myself. So, naturally, I got to wondering how likely this would have been, assuming the films weren't chosen for broadcast deliberately because of the names of the characters, which I think would be extremely unlikely.

I don't know what sorts of assumptions one would need to make to have a stab at this, but let me phrase it thus:
 * How likely would it be that two films chosen more-or-less randomly would have major characters with identical names? I guess one could narrow it down to English-language films, British films, black and white films, films made in the early 1950s, etc. --   Jack of Oz   [pleasantries]  07:50, 4 April 2020 (UTC)
 * Was The Horse Soldiers also shown by any chance? In this 1959 war film (not b/w) John Wayne's character is one Colonel John Marlowe. --Lambiam 11:05, 4 April 2020 (UTC)
 * Not today. --   Jack of Oz   [pleasantries]  11:26, 4 April 2020 (UTC)
 * Our article Coincidence notes that "[f]rom a statistical perspective, coincidences are inevitable". I dare to state that they are even more inevitable when you are getting bored. In case you are fed up watching old movies, here are three not too long reads about the odds of coincidences:
 * "Coincidences: What are the chances of them happening?" (BBC Future)
 * "Coincidences and the Meaning of Life" (The Atlantic)
 * "The strangest coincidences of your life probably aren’t that strange at all" (The Washington Post)
 * --Lambiam 13:56, 4 April 2020 (UTC)
 * There's no way of knowing for sure, but the writers of both movies might have been influenced by the Raymond Chandler character Philip Marlowe, especially since the radio program The Adventures of Philip Marlowe was airing at about the same time. There's also the Joseph Conrad character Charles Marlow; perhaps not as well known at the time but a professional writer worthy of the title would be familiar with him. Put that together with the fact that John is a very common first name in English and it's not too surprising that there would be two movies from the early 50's with characters named John Marlowe. The real coincidence is that you happened to pick those two same movies as a double feature, but as pointed out above such coincidences aren't always as unlikely as they may seem. You might be interested in the Stanisław Lem novel The Chain of Chance, which explores the nature of coincidence in the guise of a futuristic detective story. --RDBury (talk) 19:12, 4 April 2020 (UTC)


 * A database with film character names would not be of help in getting a precise value unless we also know the likelihood of a pair of films being chosen in succession. It is not very likely that a channel will programme The Texas Chain Saw Massacre to follow a broadcasting of The Sound of Music. But Earth vs. the Flying Saucers, although rarely shown, is more likely to be shown right after The Day the Earth Stood Still than after most other flicks. Below I follow an entirely different "armchair statistics" approach. I would not dream of submitting this to a peer-reviewed journal – in real life I have a reputation to uphold.


 * OK, here we go. Assume that the screen writer (or book author if the film is adapted from a book) creates a character's name by picking the given name of someone reminiscent of the character and the surname of someone else also reminiscent of the character. So for a serial killer they might combine Leonard Fraser with Alexander Pearce to name a character "Leonard Pearce". There is a non-zero chance that this procedure results in a name that must be rejected for obvious reasons, such as "Tony Abbott", but I think this can be disregarded, as the chance is still fairly small. I believe that any given name is as likely for a serial killer as for a bookkeeper, so we can disregard the character of the character. I'll confine myself, though, to English-language male names. Not all names have an equal prevalence. Let us assume that both given names and surnames independently follow (the simplest case of) Zipf's law. While this assumption is not founded on evidence, it is not unreasonable as an approximation.


 * Before moving on to applying this model to the question, let us first examine a more general question. Given is a discrete probability distribution over a set of $$N$$ items, numbered $$1$$ through $$N$$, where the probability (relative frequency) of the $$i$$-th item is denoted by $$p_i.$$ Consider a pair of random draws (with replacement) according to the given distribution from these $$N$$ items. If the first one drawn is item $$i$$, the probability that the second draw yields the same item equals $$p_i$$. To find the overall probability of a matching pair, we need to take the weighted sum, where the weights are the probabilities of the first item. This results in
 * $$P_\mathrm{match} = \sum_{i=1}^N p_i^2.$$
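 * The sum-of-squares formula above can be checked on small cases. What follows is only a minimal sketch: the helper names `match_probability` and `zipf` are made up for illustration, and the six-item examples are arbitrary.

```python
from fractions import Fraction


def match_probability(probs):
    """Probability that two independent draws from `probs` yield the same item."""
    return sum(p * p for p in probs)


def zipf(n):
    """Zipf distribution (exponent 1) on items 1..n: p_i = 1 / (i * H_n)."""
    h_n = sum(Fraction(1, i) for i in range(1, n + 1))  # n-th harmonic number
    return [1 / (i * h_n) for i in range(1, n + 1)]


# A fair six-sided die: two throws agree with probability 1/6.
uniform = [Fraction(1, 6)] * 6
print(match_probability(uniform))

# A skewed (Zipf) distribution over six items matches more often than the uniform one.
print(float(match_probability(zipf(6))))
```

 * Exact fractions are used here only so that the small cases come out without rounding; floats would do just as well for a rough check.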


 * Zipf's law corresponds to the distribution given by
 * $$p_i = (i\cdot H_N)^{{-}1},$$
 * in which the notation $$H_n$$ denotes the $$n$$-th harmonic number, so that the probabilities sum up to $$1$$ as they should. Let $$M$$ be the number of given names and $$N$$ the number of surnames, so that there are $$M\cdot N$$ given-name–surname combinations in total. Each name can be indexed by a pair $$(i,j)$$ and then has probability $$p_{i,j} = (i\cdot H_M \cdot j\cdot H_N)^{{-}1}$$. Now we find
 * $$P_\mathrm{match} = \sum_{i,j} p_{i,j}^2 = \sum_{i,j} (i\cdot H_M \cdot j\cdot H_N)^{{-}2} = \left( \sum_{i=1}^M i^{{-}2} \right) \cdot \left( \sum_{j=1}^N j^{{-}2} \right) \cdot (H_M \cdot H_N)^{{-}2} .$$
 * The two sums are partial sums of a convergent series with limit $$\frac{\pi^2}{6}$$ (for which see the Basel problem). Since the series converges quickly, we can approximate both sums for large values of $$M$$ and $$N$$ by the limit $$\frac{\pi^2}{6}$$. The harmonic numbers can be approximated by the leading term of their well-known asymptotic expansions: $$H_M \approx \ln M$$, $$H_N \approx \ln N$$. Combining all this gives us the approximation
 * $$P_\mathrm{match} \approx \frac{\pi^4}{36} \cdot (\ln M \cdot \ln N)^{{-}2}.$$
 * It remains to supply numbers for $$M$$ and $$N$$. For this we use the numbers of entries (as of 19:58, 4 April 2020 (UTC)) in the Wikipedia categories English-language masculine given names and English-language surnames. This gives us $$M = 214$$ and $$N = 1769$$. Plugging this in and taking numeric values results in
 * $$P_\mathrm{match} \approx 0.0016$$.
 * This estimate is for a name match between one character from the first film and one from the second, say the two main characters. If more characters from each cast are considered, say $$A$$ from movie number one and $$B$$ from movie number two, where both numbers are fairly limited, the chance of a match increases by almost a factor of $$A\cdot B$$. If both equal $$10$$, we get $$100 \times 0.0016 = 0.16 = 16\%.$$ I agree that this seems implausibly high.
 * Concluding thought. If character names were really distributed as in real life, occasionally two characters should happen to coincidentally have the same name without this being relevant to the plot. Why do we never see this? So many questions remain. --Lambiam 19:58, 4 April 2020 (UTC)
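 * For anyone wanting to replay the armchair arithmetic above, here is a rough sketch. It plugs the quoted category sizes into the logarithmic approximation, which comes to roughly 0.00168 (the post quotes this as 0.0016), and compares the simple "factor of A·B" scaling with the probability of at least one match. Treating the A·B pairs as independent is an extra assumption made here, and `at_least_one_match` is a hypothetical helper name.

```python
import math

M, N = 214, 1769  # category sizes quoted in the post

# Approximation: both sums replaced by pi^2/6, harmonic numbers by ln M and ln N.
p_approx = math.pi**4 / 36 / (math.log(M) * math.log(N)) ** 2


def at_least_one_match(p, pairs):
    """Chance of one or more matches among `pairs` pairs, ASSUMING the pairs are independent."""
    return 1 - (1 - p) ** pairs


A = B = 10  # characters considered per film
print(f"single pair:           {p_approx:.5f}")
print(f"A*B times single pair: {A * B * p_approx:.3f}")
print(f"at least one match:    {at_least_one_match(p_approx, A * B):.3f}")
```

 * The last figure stays a little below A·B times the single-pair value, which is the "almost a factor of A·B" effect.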


 * Double wow! Thanks for all that. I'm very surprised that N is as low as 1769. --   Jack of Oz   [pleasantries]  00:23, 5 April 2020 (UTC)


 * I have computed $$P_\mathrm{match}$$ using exact rational arithmetic instead of approximations. It makes a considerable difference for the harmonic numbers. With the values of $$M$$ and $$N$$ as before, I then find $$P_\mathrm{match} = 0.00117579450802408\ldots$$. For a match in 100 possible pairs, the probability goes up to $$11\%.$$
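 * One possible way to reproduce a figure like this, not necessarily how it was obtained here, is Python's `fractions` module. The sketch below evaluates the unapproximated product formula with exact rationals and converts to a float only at the end. The function names are invented for illustration, and the 100-pair figure again assumes the pairs are independent.

```python
from fractions import Fraction


def harmonic(n):
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, i) for i in range(1, n + 1))


def sum_inverse_squares(n):
    """Exact partial sum 1 + 1/4 + ... + 1/n^2."""
    return sum(Fraction(1, i * i) for i in range(1, n + 1))


def p_match_exact(m, n):
    """Match probability under independent Zipf laws for m given names and n surnames."""
    return (sum_inverse_squares(m) * sum_inverse_squares(n)
            / (harmonic(m) * harmonic(n)) ** 2)


p = p_match_exact(214, 1769)
print(float(p))                      # single pair of characters
print(1 - (1 - float(p)) ** 100)     # at least one match in 100 pairs (independence assumed)
```

 * Most of the gap between this and the logarithmic approximation comes from the harmonic numbers: $$H_n$$ exceeds $$\ln n$$ by roughly the Euler–Mascheroni constant, about $$0.577$$, and that difference gets squared in the denominator.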


 * Thanks for that. Most intriguing. --   Jack of Oz   [pleasantries]  07:55, 8 April 2020 (UTC)