User:Tazerdadog/sandbox 2

Hello everyone. I have developed a way to procedurally generate short descriptions for the 387,816 articles in Category:Articles with 'species' microformats. My method, while not perfect, generally produces a better short description than the Wikidata description. It operates by copying a snippet from the first sentence of the lede that is suitable as the short description. This method relies on the relatively systematic way articles in this category are written, so use caution before expanding it beyond this category.

Pseudocode
The purpose of this is to explain exactly how a program would generate these summaries:

This produces the basic short description; now we need to clean it up.

Run the above loop three more times, replacing the regex lines with the lines below, to strip out links, bold, and italics.

How it works
The regex looks for a string that is immediately preceded by either "any character, space, lowercase a, space" or "space, lowercase a, lowercase n, space". The "any character" is there because both lookbehind alternatives must be the same length (four characters in this case). The regex then matches any number of characters (the short description) until the string immediately in front of it is one of several stop codes.
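As a rough illustration of the lookbehind idea, here is a minimal Python sketch. The stop codes in `STOP_CODES` are placeholders I made up for the example, not the actual list, and the names `PATTERN` and `extract_snippet` are hypothetical. It runs on plain text for simplicity; the real input would be wikitext.

```python
import re
from typing import Optional

# Hypothetical stop codes, for illustration only; the real list may differ.
STOP_CODES = [". ", ", ", " that ", " which ", " found "]

# Both lookbehind alternatives are exactly four characters wide,
# which Python's re module requires for a lookbehind.
PATTERN = re.compile(
    r"(?<=. a | an )"   # preceded by "<any char> a " or " an "
    r"(.+?)"            # the candidate short description, matched lazily
    r"(?=" + "|".join(re.escape(code) for code in STOP_CODES) + r")"
)


def extract_snippet(first_sentence: str) -> Optional[str]:
    """Return the text between the first "a "/"an " and the first stop code."""
    match = PATTERN.search(first_sentence)
    return match.group(1) if match else None


sentence = (
    "The red fox (Vulpes vulpes) is a species of mammal in the family "
    "Canidae, found across the Northern Hemisphere."
)
print(extract_snippet(sentence))
```

Because the match is lazy, the snippet ends at the earliest stop code, so adding more stop codes can only make the result shorter.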

All links and bolding/italics are then stripped out of the short description. If the short description is longer than 70 characters, it is left for a human. If the short description contains no spaces, it's left for a human. Otherwise it is posted at the top of the article in the shortdescription template.
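The cleanup and the two "leave it for a human" checks could be sketched like this in Python. This is only a sketch under assumptions: it strips links, bold, and italics in one pass rather than in three separate loops, the function names are hypothetical, and the exact template spelling in the output string is my assumption.

```python
import re
from typing import Optional

MAX_LENGTH = 70  # longer snippets are left for a human


def strip_markup(text: str) -> str:
    """Remove wiki links and bold/italic quote marks from a snippet."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Runs of two or more apostrophes mark italics ('') and bold (''').
    text = re.sub(r"'{2,}", "", text)
    return text


def finalize(snippet: str) -> Optional[str]:
    """Return the template wikitext, or None if a human should handle it."""
    cleaned = strip_markup(snippet).strip()
    if len(cleaned) > MAX_LENGTH or " " not in cleaned:
        return None
    # Template spelling is an assumption made for this example.
    return "{{short description|" + cleaned + "}}"


print(finalize("[[species]] of [[mammal]] in the family [[Canidae|dog family]]"))
print(finalize("[[insect]]"))
```

Returning `None` rather than a best guess keeps the doubtful cases out of the automated run, which is the point of the two filters.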

Results
I generated a sample of articles. The article, my procedurally generated short description, and the Wikidata description are included.

Decisions
The things we need to decide:


 * 1) Are the short descriptions generated in this way good enough for semiautomatic posting? For automatic posting? Semi-automatic posting at one per second is still a job of roughly 100 hours.
 * 2) Do we bias towards shorter summaries by adding " in " and " belonging " to the ending criteria?
 * 3) What is an acceptable "Fail Rate" wherein the bot posts something in the short description that is inappropriate for the short description?

Moving Forward
To move this forward, we need to do a couple of things:


 * 1) Develop a strong consensus that adding these summaries is a good thing, and that the occasional mistakes are worth it.
 * 2) Refine this process more. There's still some low-hanging fruit for improvement.
 * 3) Find a bot operator willing to implement this and make the runs.  I could probably do it, but it will be difficult for me, as this would be my first bot.

I'm inviting comments on this now. If it looks good, we can get a consensus for it and I'll start refining it.

Cheers, Tazerdadog (talk) 05:38, 3 June 2018 (UTC)