Wikipedia:Reference desk/Archives/Mathematics/2022 July 24

= July 24 =

Confidence levels in Bayesian analysis
Hi! I've been editing some articles where there is new information from a radiocarbon dating study that uses Bayesian analysis in its dating model:

The topic is Chinese oracle bones excavated from Yinxu, but the question I have pertains only to statistics and what can be said in Wikipedia's voice. Only pages 160–165, which include nearly two full pages of non-prose tables, are really relevant to my question. No knowledge of radiocarbon dating, early Chinese history, or Bronze Age archaeology is required.

The cited study includes a lot of things I don't quite understand: it seems to train the model on the same dataset it uses to test the model; it removes from the dataset individual samples whose "agreement index" is unacceptably low (attributed to sample contamination, p 161); and its overall agreement index exceeds 100% (p 165). I can only assume that, since the study was published in a peer-reviewed journal and conducted by professionals who know what they're doing, these are all acceptable methods and outcomes. (There is one methodological issue, regarding the uncritical acceptance of the reign length of Wu Ding, that is outside of scope.)

My question relates to confidence levels and uncertainty as measured in absolute years.

The study states (p 165): "The calibrated age within the 68% range is assigned to every sample in the model, and then the range, which superposed the 68% ranges of all the samples in a phase, is taken as the 68% range of that phase. Actually, the probability of true age falling in 68% range is higher than 80% or even reach 90%."

In the article Wu Ding, I changed text stating the confidence level was "between 80 and 90 percent" to state that the confidence level "exceeded" 80, and introduced the ±10 year uncertainty, stated by the study's authors in numerous places, including contexts apparently outside the Wu Ding period samples specifically (pages 165 and 168). Later, the editor who astutely first cited the study in question in March changed the text to read that the confidence level "corresponds to" an uncertainty of ±10 years.

My questions are these: given the quote above, does the math stand up? Can it really be said in Wikipedia's voice that the confidence level falls between 80 and 90 percent? Or even that it exceeds 80 percent? Or is the conservative number of 68% more appropriate? And does the confidence level "correspond to" an uncertainty of ±10 years, or is this a separate piece of information produced by the analysis? Folly Mox (talk) 16:41, 24 July 2022 (UTC)


 * The wording "the probability of true age falling in 68% range" makes some alarm bells go off – unless they mean, "among the samples for which the true age is known, the probability of their known true ages falling in 68% range". The writing is fuzzy; without thoroughly studying the details (which I don't feel like doing) I cannot work out what they mean – and not just in this sentence. The value 68% is of course quite arbitrary. I suggest keeping it at the level of detail of the article's abstract. One more thing, in the context of the statement, "Twenty-six oracle bone divinations of his reign have been radiocarbon dated to 1254 to 1197 BC" (a range of 57 years), who cares at this level what the confidence level is for a 20-year confidence interval? Why mention it at all? --Lambiam 18:43, 24 July 2022 (UTC)
 * For those who haven't read the Wu Ding article, the idea is to determine when he was king of China from the radiocarbon dating of oracle bones that were used for divinations in his reign. Wu Ding, who ruled around 1200 BC, is the earliest Chinese ruler whose reign can be confirmed by contemporary material.


 * User Folly Mox added this to the article: "Radiocarbon dating on a sample of these oracle bones has yielded results closely according with dates derived from the literary record." What does this mean? There is no citation. Sima Qian's chronology goes back only to the Gonghe Regency of 841-828 BC. "For Xia, Shang, and Zhou Dynasties before 841 BC, the Shiji recorded only lists of kings with their genealogy, without the years of their reign," according to Liu's article.


 * Liu's article could certainly have used a better copy editor. But as I interpret it, it gives two alternative ways of expressing the margin of error. The first way is, "the probability of true age falling in [to the given date range] ...is higher than 80% or even reach 90%." The second way is, "the uncertainty of about 10 years possibly exists for the calibrated age of a phase." In Phase I, which corresponds to the reign of Wu Ding, 26 divinations were radiocarbon dated. Fairnesscounts (talk) 20:40, 24 July 2022 (UTC)
 * What is the relevance of the interval and the confidence level? Why use 10 years, why not 100 years? Then you have a 99.99% confidence level of something or other. I can do the maths for the given data, using as a model the sum of two real-valued random variables, one uniformly distributed over the interval [−1260, −1182] and the other having a normal distribution with μ = 0, σ = 10. Then I get 97.9%, but I do not know what it means in relation to the reign of Wu Ding. Also, this no longer has anything to do with Bayesian analysis. --Lambiam 21:47, 24 July 2022 (UTC)
 * Lambiam, thank you for your directness! I haven't taken a statistics course since 2009 and I didn't know what importance to attach to the confidence level, but it appears the answer is: none at all. It sounds like including the ±10 year uncertainty should suffice, without mention of confidence levels?
 * Fairnesscounts, regarding the text I added about the radiocarbon dates closely according with the literary record: apparently I should have been much clearer that I was referring to the dates given by the Cambridge History of Ancient China, based on work by Pankenier and Nivison amongst others and relying on received texts such as the Yi Zhou Shu and epigraphic evidence such as the Li gui, which is cited in the first paragraph after the lead; and to the results of the Xia–Shang–Zhou chronology project, whose methodology I'm unfamiliar with, also cited in the following paragraph. It's been my understanding of MOS:LEADCITE that facts cited in the body of the article do not require a separate citation in the lead, and I was making the assumption that readers could look at the 1189 end date, the 1250–1192 calculation, and the 1254–1197±10 results and conclude that they closely accord, without weighing down the lead paragraph with a bunch of numbers. Happy for a rewrite. Folly Mox (talk) 00:19, 25 July 2022 (UTC)
 * Their (Liu et al.) simulations suggested the calibration has a possible uncertainty of 10 years ("and for an individual boundary ... up to 20 years in an extreme case"). Nevertheless, it is because their calculated probabilities of the true age falling within one standard deviation of the calibration are so high (>80–90%) that they only presented the 1σ results. Be warned I don't know much applied statistics, much less the software, and I can't see from the documentation why the agreement indices are so much greater than 1. But as you can no doubt infer, the agreement indices are not the same as the probability of the measurement agreeing with the model, which they seem not to provide for each sample -- I'm not sure if that's in a supplemental dataset or if that is something that's supposed to be derived from what's given.
 * Regardless, my limited understanding of radiocarbon dating is that the results of any individual study should be reported with great caution, and it's especially recommended to report details only from reviews and metastudies, and only provide bare-bones general summaries of the more recent primary literature, if you cover it at all. You'd know the state of the scholarship of this field better than I would of course. SamuelRiv (talk) 22:46, 24 July 2022 (UTC)
 * I must thank Fairnesscounts for finding the study in the first place: I'm not at all familiar with radiocarbon dating or the state of the field. My edits have been an attempt to urge caution in overspecifying our understanding of the dating of Shang dynasty monarchs based on this study. Folly Mox (talk) 00:19, 25 July 2022 (UTC)
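
Lambiam's 97.9% figure above can be reproduced under one reading of the model. The reading is an assumption here, not something stated in the thread: that the figure is the probability that the sum of the two (independent) variables falls within the interval [−1260, −1182] widened by 10 years at each end. A minimal Python sketch using the closed-form distribution of the sum follows; the function names are illustrative.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sum_cdf(s, lo, hi, sigma):
    """P(U + N <= s) for U ~ Uniform[lo, hi] and independent N ~ Normal(0, sigma^2)."""
    # z*Phi(z) + phi(z) is an antiderivative of the standard normal CDF Phi.
    def big_g(z):
        return z * norm_cdf(z) + norm_pdf(z)
    return sigma / (hi - lo) * (big_g((s - lo) / sigma) - big_g((s - hi) / sigma))

def prob_within_widened(lo, hi, sigma, margin):
    """P(lo - margin <= U + N <= hi + margin). The widening is an assumed reading."""
    return sum_cdf(hi + margin, lo, hi, sigma) - sum_cdf(lo - margin, lo, hi, sigma)

# Years BC written as negative numbers, as in the comment above.
p = prob_within_widened(-1260.0, -1182.0, 10.0, 10.0)
print(round(100 * p, 1))  # 97.9
```

That the number matches suggests this is the calculation intended, but it says nothing about how the result relates to the reign itself, which is the point being made above.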

The paper has a number of odd features.

Paleographers studying oracle bone inscriptions from the late Shang period have divided them on the basis of internal evidence into five periods (called phases in the paper), each corresponding to the reign of one or two kings, e.g. period I corresponds to the reign of Wu Ding, while each of the others corresponds to two kings. The paper aims to assign absolute dates to these periods, and thus to reigns.

The method was to perform C14 dating on samples of bones that had each been assigned a period on textual grounds. Here are the 68% confidence intervals for individual samples (Table 3):

[Chart not reproduced: 68% intervals for individual samples, keyed by period I–V.]

It is surprising that these intervals (especially periods II-IV) cluster so neatly: one might expect the midpoints to range continuously across the interval. Part of the explanation is that the authors discarded about a third of their results due to having too low an "agreement index", presumably between the dating and the expected period sequence (bottom of p161).

They then "superpose" these 68% confidence intervals for individual samples to obtain what they call a 68% range for each period (Table 2):

For example, period I, and thus the reign of Wu Ding, is assigned the range 1254–1197 BC, the union of the 68% ranges for the samples assigned to period I. The authors state that the period boundaries have an uncertainty of about 10 years and up to 20 years in extreme cases, and that "the probability of true age falling in 68% range is higher than 80% or even reach 90%". I don't see how these numbers follow from the methodology, and as User talk:Lambiam says, the focus on 68% intervals cannot be justified.

I agree with User:SamuelRiv that we should be very cautious about reporting an individual study of this nature. Kanguole 10:40, 26 July 2022 (UTC)
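
The "superposing" step described above can be sketched as taking the smallest range that contains every sample's interval, assuming the individual ranges overlap so that their union is a single range. This is a minimal Python sketch only: the function name and the three intervals are made up for illustration and are not values from the paper's tables. Years BC are written as negative numbers.

```python
def superpose(intervals):
    """Smallest (start, end) range containing every (start, end) interval."""
    starts, ends = zip(*intervals)
    return min(starts), max(ends)

# Hypothetical 68% ranges for three samples of one period (not real data).
samples = [(-1250, -1215), (-1240, -1200), (-1235, -1210)]
phase = superpose(samples)
print(phase)  # (-1250, -1200)

# The superposed range is never narrower than any individual range.
assert all(phase[0] <= s and e <= phase[1] for s, e in samples)
```

Because the superposed range contains each sample's own 68% range, it covers any one sample's true age at least as often as that sample's own range does, so there is no reason for its coverage to be exactly 68%.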


 * In assigning an estimated numerical interval Lo–Hi to a historical period P, there are at least three different kinds of association one may seek to assert: (1) P is contained in [Lo, Hi]; (2) [Lo, Hi] is contained in P; (3) neither is (necessarily) contained in the other, but there is some overlap between the two. Each kind requires different methods for determining confidence intervals, given a desired confidence level. Which of the three kinds are the authors of the study seeking to assert? It is not clear to me. A methodological issue: discrepancies between ages obtained by dating and true ages may be systematic across specimens from the same site and period, so they may not be assumed to be statistically independent. --Lambiam 11:41, 26 July 2022 (UTC)
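
The three kinds of association listed above can be told apart mechanically once both ranges are known. The following is a hedged Python sketch; the function name and labels are illustrative, and the choice to treat the 1250–1192 calculation quoted earlier in the thread as the "period" is purely for the sake of an example, since the true period is the unknown being estimated.

```python
def association(estimate, period):
    """Classify how an estimated range [Lo, Hi] relates to a period P.

    Both arguments are (start, end) pairs with start <= end.
    Containment is reported in preference to plain overlap.
    """
    lo, hi = estimate
    p_start, p_end = period
    if lo <= p_start and p_end <= hi:
        return "period contained in estimate"   # kind (1)
    if p_start <= lo and hi <= p_end:
        return "estimate contained in period"   # kind (2)
    if lo <= p_end and p_start <= hi:
        return "partial overlap"                # kind (3)
    return "disjoint"

# Years BC as negative numbers; both ranges are figures quoted in the thread.
print(association((-1254, -1197), (-1250, -1192)))  # partial overlap
```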