User:Tom de Neef/Loops in genealogical data

Loops in genealogical data

A loop, or circular reference, in a genealogical database is often hard to locate. It is clearly an error: nobody can be his or her own ancestor or descendant. Yet that is exactly what a loop implies. Loops result from linking a person to the wrong family. In particular, when a given name recurs in a family tree, the wrong parent is easily selected when linking a child. The illustration shows such a case, with the pedigree represented as a graph.

Incorrect links may be discovered through inconsistencies in the dates (see Gedcom data validation). If a child was born after the death of the mother, for instance, the dates may be wrong or the link may be erroneous. But when dates are absent, the detection and correction of loops is not trivial.
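As a minimal sketch of such a date check, the function below compares a child's birth year with the mother's death year. The function name and the use of plain years are assumptions made for illustration; real GEDCOM dates are often partial or approximate and need more careful parsing.

```python
from typing import Optional


def born_after_mother_died(child_birth_year: Optional[int],
                           mother_death_year: Optional[int]) -> Optional[bool]:
    """Return True if the dates are inconsistent, False if they are
    consistent, and None if either date is absent (no conclusion)."""
    if child_birth_year is None or mother_death_year is None:
        return None
    return child_birth_year > mother_death_year
```

A result of None corresponds to the case described above: without dates, this kind of check cannot say anything about the link.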

A naive approach would be to trace all ancestors of every person in the database and test whether the starting person reappears. But the trace will never terminate if the person's pedigree contains a loop that does not include the person himself. And for large databases the computation time may be excessive. A better approach is to color all ancestors of a person. If a parent is encountered who is already colored, there is evidently more than one path from the proband to this ancestor. But that is also true, without any loop, in families where the parents are blood relatives.
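The coloring idea can be sketched as follows. The data layout is an assumption: `parents` is a hypothetical dictionary mapping each person's identifier to a list of parent identifiers. The sketch terminates even when a loop is present, because a colored person is never expanded twice, but it cannot tell a loop from a marriage between relatives.

```python
def repeated_ancestors(proband, parents):
    """Color all ancestors of `proband` and return those that were
    reached along more than one path."""
    colored = set()
    repeated = set()
    stack = list(parents.get(proband, []))
    while stack:
        person = stack.pop()
        if person in colored:
            # Already colored: a second path leads to this ancestor.
            repeated.add(person)
            continue
        colored.add(person)
        stack.extend(parents.get(person, []))
    return repeated


# A marriage between cousins, with no loop in the data:
cousins = {
    "child": ["father", "mother"],
    "father": ["pf"],
    "mother": ["pm"],
    "pf": ["gg"],
    "pm": ["gg"],
}
# repeated_ancestors("child", cousins) gives {"gg"}
```

In this example the shared great-grandparent is flagged although the data are correct, which is why coloring alone is not sufficient.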

An algorithm to identify loops must therefore handle intermarriage, possibly across many generations. The figure presents such a case. The following steps will locate the loop, if one is present:
 * 1) Remove (mark) all persons without children. They cannot be part of a loop. Repeat until no more persons can be removed, counting only children who have not been removed themselves.
 * 2) Remove all persons without a remaining parent. They cannot be part of a loop either. Repeat until exhausted.
 * 3) Any remaining person will be on a loop or on a chain of persons connecting loops. For every remaining person: color his or her parents and their parents (provided they are not colored yet), recursively. If, at the end of this coloring, the initial person is still uncolored, he or she is not on a loop and can be removed.
 * 4) All remaining persons belong to a loop. They can be reported.
 * 5) If there is positive evidence that a child's birth date is consistent with the age of a remaining parent, then that link is most likely correct. The wrong link will be among the child-parent links that cannot be ruled out in this way.
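
Steps 1 to 4 above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `parents` is a hypothetical dictionary mapping each person's identifier to a list of parent identifiers, and the function name is made up. Step 5 depends on dates and is not covered.

```python
def find_loop_members(parents):
    """Return the set of persons that lie on a loop."""
    # Everyone mentioned in the data, as a child or as a parent.
    remaining = set(parents)
    for parent_list in parents.values():
        remaining.update(parent_list)

    def parents_of(person):
        # Only parents that have not been removed yet.
        return [p for p in parents.get(person, ()) if p in remaining]

    # Step 1: repeatedly remove persons without remaining children.
    while True:
        has_child = {p for person in remaining for p in parents_of(person)}
        childless = remaining - has_child
        if not childless:
            break
        remaining -= childless

    # Step 2: repeatedly remove persons without remaining parents.
    while True:
        parentless = {person for person in remaining if not parents_of(person)}
        if not parentless:
            break
        remaining -= parentless

    # Step 3: color the ancestors of each remaining person and check
    # whether the person turns up among his or her own ancestors.
    def is_own_ancestor(start):
        colored = set()
        stack = parents_of(start)
        while stack:
            person = stack.pop()
            if person == start:
                return True
            if person in colored:
                continue
            colored.add(person)
            stack.extend(parents_of(person))
        return False

    # Step 4: whoever passes the test in step 3 is on a loop.
    return {person for person in remaining if is_own_ancestor(person)}


# Two loops (a-b and c-d) connected by "m", who is on neither loop:
example = {"a": ["b"], "b": ["a"], "m": ["a"], "c": ["d", "m"], "d": ["c"]}
# sorted(find_loop_members(example)) gives ['a', 'b', 'c', 'd']
```

In the example, "m" survives steps 1 and 2 because he or she has both a remaining parent and a remaining child, and is only eliminated by the coloring in step 3.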

Loop detection is not among the standard validation services of genealogical programs. A researcher must therefore export the data in GEDCOM format, so that a specialized program can analyze it.