Wikipedia:Reference desk/Archives/Computing/2019 December 18

= December 18 =

Solving for a pattern of numbers
Given an old scanned book, for example De re anatomica libri XV, physical leaf number 10 corresponds to printed page number 3. Thus https://archive.org/details/BIUSante_08734/page/n10 ("n10" is the physical leaf number) shows printed page 3. If we run OCR on the book and build a table mapping physical leaf numbers to printed page numbers, we get the following for leaves 0 through 27:

Page[0].ppagei = 0
Page[1].ppagei = 0
Page[2].ppagei = 0
Page[3].ppagei = 0
Page[4].ppagei = 0
Page[5].ppagei = 0
Page[6].ppagei = 0
Page[7].ppagei = 0
Page[8].ppagei = 0
Page[9].ppagei = 0
Page[10].ppagei = 0
Page[11].ppagei = 4
Page[12].ppagei = 5
Page[13].ppagei = 0
Page[14].ppagei = 7
Page[15].ppagei = 8
Page[16].ppagei = 5
Page[17].ppagei = 0
Page[18].ppagei = 0
Page[19].ppagei = 0
Page[20].ppagei = 13
Page[21].ppagei = 14
Page[22].ppagei = 0
Page[23].ppagei = 16
Page[24].ppagei = 17
Page[25].ppagei = 0
Page[26].ppagei = 0
Page[27].ppagei = 20

Due to OCR errors, some of the printed page numbers can't be determined (" = 0") and some are wrong ("Page[16].ppagei = 5"). Is there a suggested method or algorithm for discovering runs of sequential numbers, and from those runs filling in the blank or incorrect pages? This is a general question for many scanned books, not just this example. -- Green  C  16:24, 18 December 2019 (UTC)
 * If you plot a graph, this will be very easy to spot by eye, as the line of correct points will stand out. It is also similar to some time-domain astronomy problems, where you have partial observations of something that repeats. Are you allowing for extra inserted sheets or totally missing pages? If not, this should be a very easy computation. The method I would choose would be to subtract the index from the value for each leaf, and then find the difference that occurs most commonly. You can rule out the 0s from the start. If the pages are in Roman numerals, then I suppose you have to convert them to "Arabic" numerals and count this as another sequence. If the book has letter-number or number-number page labels, then you will also have to subdivide the sequences. Graeme Bartlett (talk) 21:52, 18 December 2019 (UTC)
 * OK, this gives me some new ideas for things to try. The complications are with extra inserted pages, such as photos, that don't have page numbers and show up as "0"; it is not easy to tell whether a "0" is an OCR error or an extra page. Inserted pages cause the offset to change. Counting the most common offsets is interesting. I will think about how that might work, for example within smaller blocks of pages where the offset won't drift too far. -- Green  C  05:45, 19 December 2019 (UTC)
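
The offset-counting idea discussed above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a tested tool: the function names `dominant_offset` and `fill_pages` and the `block` parameter are made up for this example, the offset is taken as leaf minus printed page (the opposite sign to "value minus index", chosen so that it comes out positive), and it assumes printed numbering starts at 1 with no unnumbered leaves inside a block.

```python
from collections import Counter

# OCR results from the question: index = physical leaf, value = printed page
# number as read by OCR (0 means no number was recognised).
ocr_pages = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 5, 0, 7, 8, 5, 0, 0, 0,
             13, 14, 0, 16, 17, 0, 0, 20]

def dominant_offset(pages, start=0, stop=None):
    """Most common (leaf - printed page) difference in pages[start:stop],
    ignoring unread leaves (value 0). Returns None if nothing was read."""
    stop = len(pages) if stop is None else stop
    diffs = Counter(leaf - pages[leaf]
                    for leaf in range(start, stop)
                    if pages[leaf] != 0)
    if not diffs:
        return None
    return diffs.most_common(1)[0][0]

def fill_pages(pages, block=None):
    """Rebuild printed page numbers from the dominant offset.
    With block=None a single offset is used for the whole book; otherwise
    the offset is re-estimated for each block of `block` leaves."""
    block = len(pages) if block is None else block
    result = []
    for start in range(0, len(pages), block):
        stop = min(start + block, len(pages))
        offset = dominant_offset(pages, start, stop)
        for leaf in range(start, stop):
            # Keep 0 (unknown) when the block has no readable numbers or
            # the leaf would fall before printed page 1.
            page = leaf - offset if offset is not None else 0
            result.append(page if page > 0 else 0)
    return result

fixed = fill_pages(ocr_pages)
print(dominant_offset(ocr_pages))  # 7
print(fixed[10], fixed[16])        # 3 9
```

Passing a value such as `block=10` re-estimates the offset for each block, which is one possible way to cope with inserted plates that shift the numbering. A block with no readable numbers is left as 0 rather than guessed.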

I expect they are sequential, and Page[16] = 5 is an OCR error where a "9" got interpreted as a "5". 173.228.123.190 (talk) 11:29, 21 December 2019 (UTC)
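
A small Python check of that guess, as a sketch only: it assumes the offset of 7 implied by the question (leaf 10 is printed page 3), and the helper name `digit_substitutions` is invented for this example. It compares the expected page number with the OCR reading digit by digit.

```python
def digit_substitutions(expected, observed):
    """Return (expected_digit, observed_digit) pairs where two page numbers
    of equal length differ; None if the lengths differ."""
    e, o = str(expected), str(observed)
    if len(e) != len(o):
        return None
    return [(a, b) for a, b in zip(e, o) if a != b]

offset = 7                 # assumed: leaf 10 is printed page 3
leaf, observed = 16, 5     # the suspicious entry from the table
expected = leaf - offset   # page number predicted by the sequence
print(expected, digit_substitutions(expected, observed))  # 9 [('9', '5')]
```

A single substituted digit of the same length is consistent with an OCR misread, whereas a result of `None` or several differing digits would suggest something else, such as a genuinely different page.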