User talk:JL-Bot/Journal Testing

Run 1
The current contents (22:34, 23 June 2011) consist of a subset of trial output. I included the first 100 rows as an example for review and then added some additional rows for which I had particular comments/questions (listed below). Let me know your thoughts on these cases, and also if you see any issues with the other ones. -- JLaTondre (talk) 22:53, 23 June 2011 (UTC)
 * 414: This is a single citation with two targets. This is a case where the displayed text is a redirect to one target, but it has a piped link to a different target. The second piped link (BBC History (magazine)) is actually a redirect to the first. On the principle of keeping it simple, I'd rather not have to parse the targets and figure out if they are redirects or not.
 * 592: The journal field is  which is an invalid link. It gets parsed as an interwiki link, but fails to display anything (I'm guessing it's not a valid interwiki prefix). Can I just list without linking anything that looks like an interwiki link (starts with two letters and is followed by a colon)?
 * 1464: This journal has an invalid character (a character not supported in Wikipedia page titles). I plan on listing any invalid titles as non-links.
 * 1734 & 1889: These are similar to 414 in that the displayed text is a redirect to one target, but it has a piped link to a different target. However in these cases, the redirect goes to a different target than the piped link.
 * 1905: This is also similar to the 414, 1734, & 1889 cases. However in this case, it's not a redirect and the same journal name shows up with two different piped links.
 * 2025: This is a result of a parsing error that I need to fix. I listed it just to remind myself.
 * 2849: This is another journal that has invalid characters (brackets are not supported in Wikipedia page titles). I could assume anything in brackets is not part of the title, but I'm not sure if that is the right approach.
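
As a rough illustration of the two checks raised in the list above (interwiki-looking prefixes and characters that page titles do not support), here is a minimal Python sketch. The function names and the exact character set are assumptions made for illustration; this is not the bot's actual code, and MediaWiki's real title rules are more involved.

```python
import re

# Characters treated as invalid in page titles for this sketch (an assumption;
# the real title rules have more cases than this).
INVALID_TITLE_CHARS = set("#<>[]|{}")

# "Looks like an interwiki link": starts with two letters followed by a colon.
INTERWIKI_LIKE = re.compile(r"^[A-Za-z]{2}:")


def looks_like_interwiki(title: str) -> bool:
    """True if the title starts with two letters and a colon, e.g. 'CA: ...'."""
    return bool(INTERWIKI_LIKE.match(title))


def has_invalid_chars(title: str) -> bool:
    """True if the title contains any character from INVALID_TITLE_CHARS."""
    return any(ch in INVALID_TITLE_CHARS for ch in title)


def should_link(title: str) -> bool:
    """List the entry as plain text (no link) if either check fails."""
    return not looks_like_interwiki(title) and not has_invalid_chars(title)
```

With this sketch, `should_link("CA: A journal for clinicians")` and `should_link("Foobar [barfoo]")` are both false, so those entries would be listed without a link.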

Headbomb {talk / contribs / physics / books} 10:40, 24 June 2011 (UTC)
 * 414: BBC History is the target, so that's the one that should be mentioned as a target.
 * 592: That's.... weird? I guess in those cases you should probably replace the colon with a spaced en dash (CA – A journal for clinicians), but pipe the correct version. Aka CA: A journal for clinicians.
 * 1464 Cool.
 * 1734 & 1889: Screw the Foobar part of Barfoo. We're interested in Barfoo in this case, so linking to Barfoo would be the desired behaviour, regardless of what the citation did.
 * 1905: If you find a "Foobar" that has a corresponding, non-redirected, Foobar (journal/magazine), just behave as if the citation was to Foobar (journal/magazine). AKA link to Foobar
 * 2849: Just strip the brackets. Foobar [barfoo] &rarr; Foobar barfoo
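
The display-side suggestions for 592 and 2849 could look roughly like this in Python. This is only a sketch: `display_for_interwiki_like` and `strip_brackets` are hypothetical helper names, and how the piped target is built is left out.

```python
def display_for_interwiki_like(title: str) -> str:
    """Replace the first colon with a spaced en dash: 'CA: X' -> 'CA \u2013 X'."""
    prefix, sep, rest = title.partition(":")
    if not sep:
        # No colon at all, nothing to replace.
        return title
    return f"{prefix.strip()} \u2013 {rest.strip()}"


def strip_brackets(title: str) -> str:
    """Drop square brackets but keep their contents: 'Foobar [barfoo]' -> 'Foobar barfoo'."""
    return title.replace("[", "").replace("]", "")
```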


 * Also, I've been thinking... The current setup handles existing pages (bold), redirects (italics), non-existing pages (redlinks) but disambiguation pages are left unmarked. I suggest  bold underlined  for direct links to disambiguation pages, and  italics underlined  for redirects to disambiguation pages.


 * So in the table, you would have things like  AAP  Headbomb {talk / contribs / physics / books} 18:55, 24 June 2011 (UTC)
 * Okay, I'll look into adding that. -- JLaTondre (talk) 20:15, 24 June 2011 (UTC)
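
For reference, the formatting scheme proposed above (bold, italics, redlink, plus the two underlined variants for disambiguation pages) amounts to a small lookup table. The sketch below is Python with made-up names (`PageKind`, `style_for`); it returns a style label rather than wikitext, since the exact underline markup is a separate question.

```python
from enum import Enum


class PageKind(Enum):
    ARTICLE = "article"                  # existing page
    REDIRECT = "redirect"                # redirect to an article
    MISSING = "missing"                  # non-existing page
    DAB = "dab"                          # direct link to a disambiguation page
    REDIRECT_TO_DAB = "redirect_to_dab"  # redirect to a disambiguation page


# Style labels as described in the discussion above.
STYLE = {
    PageKind.ARTICLE: "bold",
    PageKind.REDIRECT: "italics",
    PageKind.MISSING: "redlink",
    PageKind.DAB: "bold underlined",
    PageKind.REDIRECT_TO_DAB: "italics underlined",
}


def style_for(kind: PageKind) -> str:
    """Return the style label for a given kind of link target."""
    return STYLE[kind]
```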

Your above response has me a bit confused. I need to boil these down to general purpose "rules" so that I can actually code them. This is what I am gathering should be for the Journal field (ignoring formatting for redirects, etc.):
 * Case 1: No links
 * Journal column = Text
 * Case 2: Link with no piping
 * Journal column = Optional Prefix Text Optional Suffix
 * Case 3: Link with piping
 * Journal column = Optional Prefix Text Optional Suffix
 * Case 4: Link with piping where piping contains "(journal)" or "(magazine)"
 * Journal column = Optional Prefix Text Optional Suffix
 * Journal column = Optional Prefix Text Optional Suffix
 * Case 4: Link with piping where piping contains "(journal)" or "(magazine)"
 * Journal column = Optional Prefix Text Optional Suffix
 * Journal column = Optional Prefix Text Optional Suffix

Where the Target column will only be filled in if the Journal column is a redirect. Is that all correct? Are there any other cases (other than the invalid characters)? -- JLaTondre (talk) 20:15, 24 June 2011 (UTC)

The general rule is take what the reader sees, and that's what we care about.
 * Article has...
 * journal = Whatever barfoo boofar random stuff RAndom TeXt!1!!1


 * Compilation should treat it as...
 * journal = Whatever barfoo boofar random stuff RAndom TeXt!1!!1


 * And place it in the "Journal" column as
 * Whatever barfoo boofar random stuff RAndom TeXt!1!!1


 * Unless there is a corresponding (journal/magazine) page which exists [and which is not a redirect], in that case place it in the column as
 * Whatever barfoo boofar random stuff RAndom TeXt!1!!1

In all cases the "target" column should be given, regardless of whether it's a redirect or not. Don't bother piping the (journal/magazine) link in this one. AKA

Headbomb {talk / contribs / physics / books} 20:42, 24 June 2011 (UTC)


 * Okay, I think I got it. I'll update the code and post a new run. It will take a day or two as I have some commitments. -- JLaTondre (talk) 21:03, 24 June 2011 (UTC)
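
A minimal Python sketch of the general rule described above: take what the reader sees, then prefer a matching "(journal)" or "(magazine)" page if one exists and is not a redirect. The `pages` lookup (title mapped to `"article"` or `"redirect"`) and both function names are assumptions for illustration, not the bot's real data model.

```python
import re

# Matches [[Target]] and [[Target|Label]].
WIKILINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def visible_text(journal_field: str) -> str:
    """Return what the reader sees: [[A|B]] -> B, [[A]] -> A, other text kept."""
    def repl(match: re.Match) -> str:
        target, label = match.group(1), match.group(2)
        return label or target
    return WIKILINK.sub(repl, journal_field).strip()


def link_title(journal_field: str, pages: dict) -> str:
    """Pick the title to link: 'Text (journal)' or 'Text (magazine)' if that
    page exists and is not a redirect, otherwise the visible text itself."""
    text = visible_text(journal_field)
    for suffix in ("journal", "magazine"):
        candidate = f"{text} ({suffix})"
        if pages.get(candidate) == "article":
            return candidate
    return text
```

For example, `visible_text("[[Foobar|Barfoo]]")` gives `"Barfoo"`, and the Foobar part of the citation is ignored from then on.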

Run 2
Here is an updated list consisting of the first 10 (by alphabetical order) citations starting with each letter of the alphabet and also numbers. It includes the handling for dab pages that you requested above. Please review and let me know if you see any issues. There are still some additional cases I need to properly handle (like correctly parsing citations that are external links) and then I need to work on the logic for outputting the results to the proper pages. Once that's done, I'll file the BRFA. -- JLaTondre (talk) 02:06, 29 June 2011 (UTC)


 * Looks all good with minor tweaks.
 * "D-Lib Magazine [online]" which should probably be given as "D-Lib Magazine online" or "D-Lib Magazine"
 * Spaces after commas ("1, 2, 3, 4, 5" rather than "1,2,3,4,5")
 * For compatibility with Anomie's link classifier, underlines should be written as Foobar  rather than  Foobar  (at least for now)
 * "W" and "W (magazine)" should be treated as the same thing.
 * Headbomb {talk / contribs / physics / books} 02:46, 29 June 2011 (UTC)

Run 3
Another formatting question:  - This code displays "J. Algorithms" in the citation text, but if you hover your mouse over it, it will then display "Journal of Algorithms" (see Sylow theorems for an example). Which one do you want displayed in the output? -- JLaTondre (talk) 20:13, 3 July 2011 (UTC)

It is pretty much complete. I've had it update this page with the first output page (A1). I'm sure when the whole list is up, there will be some cases that need special handling. I'm going to go ahead and file the bot request. I've updated the number of records per page to 250 (it was 100) to cut down on the number of page writes. Let me know if you think that will be an issue or not. -- JLaTondre (talk) 16:54, 9 July 2011 (UTC)

Bot request filed at Bots/Requests for approval/JL-Bot 7. -- JLaTondre (talk) 17:24, 9 July 2011 (UTC)

It was approved for a trial run, which has been completed. Please look through the output at WP:JCW and let me know if you want any changes or if you see any issues. A couple of items in particular: -- JLaTondre (talk) 23:53, 9 July 2011 (UTC)
 * 1) If you are fine with the 250 records per page, I'll delete the old, extra pages. If you prefer it remain at 100, I'll re-run once the trial is approved.
 * 2) If the journal field starts with a lowercase letter, it will not correctly recognize whether a matching page exists, as page titles always start with a capital letter. I could put in a workaround, but the easiest thing would be to always convert the first letter of the journal field to uppercase. Would this be acceptable, or do you want it displayed as lowercase if that was the way it was in the article? Always uppercasing the first letter has the added benefit of not getting two different entries for "Title" and "title".
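
The always-uppercase option in item 2 can be sketched in a few lines of Python. `ucfirst` and `merge_counts` are hypothetical names used only for this illustration; note that only the first character is changed, so the rest of the title keeps its original case.

```python
from collections import Counter


def ucfirst(title: str) -> str:
    """Uppercase only the first character; leave the rest untouched."""
    return title[:1].upper() + title[1:]


def merge_counts(raw_counts: dict) -> Counter:
    """Merge entries that differ only in first-letter case ('Title' / 'title')."""
    merged = Counter()
    for title, n in raw_counts.items():
        merged[ucfirst(title)] += n
    return merged
```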

Headbomb {talk / contribs / physics / books} 04:45, 10 July 2011 (UTC)
 * Who in the world uses crap like ??? Anyway, it should be handled as "J. Algorithms" since that's what the reader sees.
 * 250 is fine. Feel free to delete (or tag with db-g6) whatever's unused or redundant or whatever.
 * Always upper-casing should be fine. The alternative would be to check if the Foobar has the "lowercase" template (lowercase title) or its redirects, and have it display as "Foobar" or "foobar" based on that. But that might introduce some sorting issues (although you could work around them with sort)... Anyway it's up to you, but defaulting to uppercasing is good enough.
 * Gave a quick look, everything looks fine and dandy! Awesome work. I'll give it a more detailed look early next week.
 * Something weird going on with rank 64 in WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Missing2 (Oxford Dict. of Biographies).
 * Something weird going on with rank 67 in WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Missing2 (Yeast).
 * Something weird going on with rank 67 in WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/Missing2 (Zzap!64).


 * : That was my guess for how to handle it and what is already implemented so no change needed.
 * Extra Pages: I deleted them.
 * First Letter Case: I've changed it to always uppercase.
 * Oxford Dict. of Biographies: This entry consists of  where ODNBsub generates an external link and causes the internal link to fail. Higher up on the page, you requested that templates be kept, which works in most cases. However, it will fail in some cases like this. I can put in special handling to take care of specific issues. There are several ways to approach it. Let me know if you prefer one of these or something else:
 * Remove it: Oxford Dictionary of National Biography
 * Pipe it: Oxford Dictionary of National Biography,
 * Place it outside the link: Oxford Dictionary of National Biography,
 * Pipe it with  :  Oxford Dictionary of National Biography,
 * Yeast & Zzap!64: I wasn't properly excluding existing pages in my comparison logic for missing popular pages. It's been fixed.
 * -- JLaTondre (talk) 13:48, 10 July 2011 (UTC)


 * Ah, I see. I don't really see what logic you could use to determine the pipe target, but as far as I'm concerned, something like "Oxford Dictionary of National Biography, ODNBsub" would be fine (target = Invalid). Headbomb {talk / contribs / physics / books} 15:14, 10 July 2011 (UTC)
 * I'll add a special case for this template. I'm sure there will be others that come up over time, but it's easy enough to add more cases if needed. -- JLaTondre (talk) 19:25, 10 July 2011 (UTC)

I notice that (e.g.) Nature is cited 22044 times in 22044 articles. I know for a fact that many articles cited Nature multiple times, so by all logic, the citation count should be superior to the article count. Headbomb {talk / contribs / physics / books} 15:28, 10 July 2011 (UTC)


 * Silly mistake. It was displaying the citation count when the article count was over 5. Fixed and I should have an updated run posted later today. -- JLaTondre (talk) 19:25, 10 July 2011 (UTC)


 * Update posted. -- JLaTondre (talk) 02:11, 11 July 2011 (UTC)
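
To make the distinction between the two counts concrete, here is a small Python sketch. It assumes the input is one `(article, journal)` pair per citation; that data shape and the name `tally` are assumptions, not how the bot stores things.

```python
from collections import defaultdict


def tally(citations):
    """Return {journal: (citation_count, article_count)}.

    citations: iterable of (article, journal) pairs, one per citation.
    """
    cite_count = defaultdict(int)
    articles = defaultdict(set)
    for article, journal in citations:
        cite_count[journal] += 1        # every citation counts
        articles[journal].add(article)  # each article counts once
    return {j: (cite_count[j], len(articles[j])) for j in cite_count}
```

An article that cites the same journal several times raises the citation count each time but the article count only once, so the citation count can never be lower than the article count.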


 * Cool. That seems to be working fine. Although I did notice some more weirdness. For example in WikiProject_Academic_Journals/Journals_cited_by_Wikipedia/A1 Academic Medicine is reported twice, once bolded with target, the other unbolded, without target. I think it has to do with the number of spaces (aka some articles have Academic_Medicine and others have Academic__Medicine). Obviously those spaces should be stripped before processing. Headbomb {talk / contribs / physics / books} 03:18, 11 July 2011 (UTC)
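
One way to normalise the spacing before comparison, as a Python sketch. Treating underscores the same as spaces is an assumption here (it matches how page titles are usually handled), and `normalize_spaces` is a hypothetical name.

```python
import re


def normalize_spaces(title: str) -> str:
    """Collapse runs of whitespace/underscores to one space and trim the ends."""
    return re.sub(r"[\s_]+", " ", title).strip()
```

Both `"Academic_Medicine"` and `"Academic__Medicine"` then reduce to the same key, `"Academic Medicine"`.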

I also notice that the "rank" is a bit odd. Like you'll have 1/2/3/3/3/4/4/4/4/4/5/6/6... A more "standard" way to rank them would be 1/2/3/3/3/6/6/6/6/6/11/12/12. Headbomb {talk / contribs / physics / books} 06:41, 11 July 2011 (UTC)
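
The two ranking schemes mentioned above differ only in what happens after a tie. A short Python sketch of both, assuming the counts are already sorted in descending order (function names are made up for illustration):

```python
def dense_ranks(counts):
    """1/2/3/3/3/4/... : the next distinct value gets the next integer."""
    ranks = []
    for i, c in enumerate(counts):
        if i > 0 and c == counts[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1 if ranks else 1)
    return ranks


def competition_ranks(counts):
    """1/2/3/3/3/6/... : ties share a rank, then the rank skips to the position."""
    ranks = []
    for i, c in enumerate(counts):
        if i > 0 and c == counts[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks
```

With counts `[50, 40, 30, 30, 30, 20, 20, 20, 20, 20, 10, 5, 5]`, `dense_ranks` gives 1/2/3/3/3/4/4/4/4/4/5/6/6 and `competition_ranks` gives 1/2/3/3/3/6/6/6/6/6/11/12/12.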


 * Sidenote The most recent dump is from June 20 I think. Don't know which you are using, but the fresher the data, the sexier the JCW compilation is. Headbomb {talk / contribs / physics / books} 16:33, 12 July 2011 (UTC)


 * It is the June 20 dump. I fixed the spaces issue and the rank. I don't plan on posting a new version until more issues are found or the next dump is available (or if I find more changes that make a critical mass, as I still need to finish going through the pages myself). Let me know if that's an issue. -- JLaTondre (talk) 21:13, 12 July 2011 (UTC)


 * Usually [as in when stuff is fleshed-out], one run per dump is all it takes. The rank/spacing issues were the only "common" ones remaining, so AFAICT, any run after the spacing/ranking fixes should be the last for this dump. There may be some other tweaks to be made, but nothing warranting a new run before the next dump.


 * As a point of curiosity, how hard would it be to do something similar for publishers (using |publisher=) instead of journals? Headbomb {talk / contribs / physics / books} 21:23, 12 July 2011 (UTC)
 * If the logic is the same, just showing the contents of the publisher field instead of the journal field, it would be easy. I may have some free time next week to look at it. I assume it would be WikiProject Academic Journals/Publishers cited by Wikipedia paralleling WP:JCW? -- JLaTondre (talk) 20:30, 14 July 2011 (UTC)


 * It might be better hosted at WikiProject Books, since these would also include cite book in addition to the usual cite journal and citation, and the majority of publishers deal in books rather than journals. I'll ask them what they think about it (see Wikipedia talk:WikiProject Books).
 * As for the rest, the only code updates I can think of with regards to the crunching is to take publisher instead of journal, add cite book (and redirects) to the list of templates considered, and instead of the (magazine)/(journal) disambiguator fancy logic, it should be done using the (publisher) disambiguator. Some limited trial would be needed to make sure that we're not overlooking anything. I never really dealt with publishers much, so I don't know the idiosyncrasies that come with them. Headbomb {talk / contribs / physics / books} 21:00, 14 July 2011 (UTC)


 * Ah, I thought you were interested in publishers of journals. Broadening it is not a technical issue (other than taking longer to run, but that's fine). The publisher field is used more widely than just books and journals. For example, cite web has a publisher field. A lot would be driven by what project wants the list and what they intend to do with it. -- JLaTondre (talk) 22:51, 19 July 2011 (UTC)

Found something in need of a tweak. "Most popular" list shouldn't depend on whether the corresponding article exists or not. For example, the most popular missing journal was "Ap J." with 483 citations, which should put it somewhere around #102 on the list of "most popular". Headbomb {talk / contribs / physics / books} 08:01, 16 July 2011 (UTC)
 * Okay, I'll change that. -- JLaTondre (talk) 22:51, 19 July 2011 (UTC)