User:Tazerdadog/Regex reccommendations

The previous base regex was: (?<=(. a | an ))(.*?)(?=(\.|,| which | known | found | describe|<ref|\(| native | grow| that | within | from | cause))

This produced a list located at: User:Galobtter/Organism_descriptions.

My comments/recommendations are as follows:


 * 1) Pull the first three sentences instead of the first sentence. Many of the samples returned as the first sentence were not really the first sentence, but were instead hatnotes, abbreviations, etc. I believe that this is as easy as changing a number in the mw:Extension:TextExtracts query (presumably the exsentences parameter).
 * 2) Implement Galobtter's change from starting on ". a " or " an " to starting on ".. is a " or ". is an ". This will help eliminate some of the worst false positives. (See Bombay duck on Galobtter's list for a particularly bad example.)
 * 3) Add " is the " to the start codes.
 * 4) Figure out how to add " is one " to the start codes while maintaining the "one" in the short description (nested lookahead?).
 * 5) Strip out parentheses, and anything inside them, BEFORE running the regex. (Testing workaround: remove \( from the stop codes, and remove parentheticals mentally when examining the results. This is suboptimal because other things inside the parentheses could stop the regex.) (See Crepis for an example of this.)
 * 6) Potential additional stop codes: " whose " (10/10 should be in there), " used " (not sure), " and ", " with ", ";"
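
For recommendation 5, a minimal sketch of the pre-processing step, assuming the extraction is done in Python. The function name strip_parentheticals and the sample sentence in the note below are made up for illustration and are not taken from any existing script. It removes the innermost parentheses first and repeats until nothing changes, so nested parentheticals are handled too.

```python
import re

# Matches one innermost parenthetical (no nested parentheses inside it),
# together with any whitespace immediately before the opening parenthesis.
PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text: str) -> str:
    """Remove parentheses and everything inside them (hypothetical helper)."""
    previous = None
    # Repeat until the text stops changing, so that nested parentheses
    # are peeled away from the inside out.
    while previous != text:
        previous = text
        text = PAREN.sub("", text)
    return text
```

Run on a made-up lead such as "Examplea (from Greek, meaning example) is a genus of plants.", this leaves "Examplea is a genus of plants.", so the comma inside the parenthetical can no longer act as a stop code.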

Trying to write a new first regex (warning, untested):

(?<=(.. is a |. is an | is the |.... is (?=one )))(.*?)(?=(\.|,| which | known | found | describe|<ref| whose | native | grow| that | within | from | cause| used | and | with |;))
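
A rough Python sketch for trying the regex above on a few sample sentences. Everything here is an assumption on my part: that the regex is run with Python's re module, the function name short_description, and the sample sentences, which are invented rather than taken from Galobtter's list. Two small changes from the regex as written: the capturing group inside the lookbehind is dropped, and the " is one " case is written as a separate lookbehind followed by a lookahead. That should match at the same positions as the nested form without depending on how an engine treats a lookahead nested inside a lookbehind.

```python
import re

# Start codes. All four alternatives are eight characters wide, which
# fixed-width lookbehind engines such as Python's re require.
START = r"(?:(?<=.. is a |. is an | is the )|(?<=.... is )(?=one ))"

# Stop codes, copied from the proposed regex (no "\(" entry, since
# parentheticals are meant to be stripped before this step).
STOP = (
    r"(?=\.|,| which | known | found | describe|<ref| whose | native "
    r"| grow| that | within | from | cause| used | and | with |;)"
)

PATTERN = re.compile(START + r"(.*?)" + STOP)


def short_description(text):
    """Return the first candidate description, or None (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = [
        "Foo is a genus of beetles, which was described in 1900.",
        "Examplea is an annual herb with yellow flowers.",
        "Quux is the only species in its genus and lives in rivers.",
        "Examplea is one of the largest moths in Europe.",
    ]
    for sentence in samples:
        print(repr(short_description(sentence)))
```

One side effect of the padding: because ".... is " needs four characters before " is ", a lead such as "Bar is one of many." is not matched at all, and ".. is a " similarly needs at least two characters before " is a ".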