User:GreenC/software/pages

Making a page assertion with "V1"


 * Download the djvu file; these are typically large, 10-50MB.
 * Parse the XML file:
 * For each physical page in the book (records bounded by OBJECT), check the first and second LINE at the top of the page, looking at the first and last WORD of each. Also check the last LINE at the bottom of the page, looking at its last WORD.
 * Match those words against the regex "[>][0-9]{1,4}[<]". If a match is found, add it to an array, for example Pages[490]=452: in this case the physical page number is 490 and the discovered OCR page number is 452. If no page number is found, set the array value to 0.
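
The parsing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual implementation. The tag names OBJECT, LINE and WORD come from the description above; the function names, the use of a parsed XML tree instead of a regex on the raw markup, counting physical pages from 0, and taking the first matching candidate are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Equivalent of "[>][0-9]{1,4}[<]" once the WORD text has been extracted:
# the whole word must be 1 to 4 digits.
PAGE_NUMBER = re.compile(r"[0-9]{1,4}")


def candidate_words(obj):
    """Return the words worth checking on one physical page (one OBJECT)."""
    lines = [line.findall("WORD") for line in obj.iter("LINE")]
    lines = [words for words in lines if words]  # ignore empty lines
    candidates = []
    for words in lines[:2]:            # first and second LINE at top of page
        candidates += [words[0], words[-1]]
    if lines:                          # last WORD of the last LINE
        candidates.append(lines[-1][-1])
    return [(w.text or "").strip() for w in candidates]


def build_pages(xml_text):
    """Map physical page index -> printed page number (0 if none found)."""
    root = ET.fromstring(xml_text)
    pages = {}
    # Assumption: physical pages are counted from 0 in document order.
    for index, obj in enumerate(root.iter("OBJECT")):
        printed = 0
        for word in candidate_words(obj):
            if PAGE_NUMBER.fullmatch(word):
                printed = int(word)    # assumption: first candidate wins
                break
        pages[index] = printed
    return pages
```

Real OCR output is much noisier than this suggests, so a page whose first word happens to be a year or a footnote number would be recorded as a page number here.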


 * Parse the Pages[] array:
 * Iterate over Pages[] looking for the target page number in the value portion. If it is found, the array index number is the n-page number and you are done.
 * The array will contain garbage due to OCR errors or missing page numbers, and thus will often not contain the target page number. In that case, search for nearby page numbers and, if one is found, apply an offset to reach the n-page. For example:
 * Pages[400]=490
 * Pages[401]=0
 * Pages[402]=0
 * Pages[403]=493
 * Thus if you're looking for page 492, it doesn't exist in the array. Search for 491, which isn't found either. Search for 490, which is found at index 400, and add a +2 offset to derive the correct index (400 + 2 = n-page 402 = printed 492).
 * My algo searches backwards up to 10 pages, unless the target page is < 10, in which case it searches forward and subtracts rather than adds the offset.
 * This system does not (yet) account for Roman numeral pages, but it should be possible by changing the regex search pattern above to include the Roman numeral set. Needs testing.
 * Once the Pages[] array is created it can be saved in a database so future checks of the book will not require downloading the XML file.
 * When the algo makes a mistake it is often apparent, because there is a large difference between the target page and the resolved page, e.g. page 40 = /page/n450. There might be room for some additional error correction here.
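
The lookup with the nearby-page offset can be sketched like this. Again this is only an illustration in Python under stated assumptions, not the actual code: the names find_n_page and looks_plausible are hypothetical, Pages[] is represented as a dict, the first occurrence of a printed number wins when OCR produced duplicates, and the max_gap threshold of 100 is an arbitrary value picked for the example.

```python
def find_n_page(pages, target, max_steps=10):
    """Resolve a printed page number to an n-page index, or None."""
    # Reverse index: printed number -> first n-page where it was seen.
    # Entries with value 0 mean "no page number found" and are skipped.
    first_seen = {}
    for n_page, printed in pages.items():
        if printed > 0 and printed not in first_seen:
            first_seen[printed] = n_page

    # Exact hit: the array index is the n-page.
    if target in first_seen:
        return first_seen[target]

    # Otherwise look at nearby printed numbers: backwards by default,
    # forwards when the target is < 10 (as described above).
    step = 1 if target < max_steps else -1
    for distance in range(1, max_steps + 1):
        neighbour = target + step * distance
        if neighbour < 1:
            break
        if neighbour in first_seen:
            # Backwards search adds the offset, forwards search subtracts it.
            return first_seen[neighbour] + (target - neighbour)
    return None


def looks_plausible(target, n_page, max_gap=100):
    """Crude sanity check: flag results far away from the target page.

    max_gap is an assumed threshold, not a tested value.
    """
    return abs(n_page - target) <= max_gap
```

With the example array above, find_n_page({400: 490, 401: 0, 402: 0, 403: 493}, 492) resolves to 402, and looks_plausible(40, 450) flags the page 40 = /page/n450 case as suspect.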