Help talk:Citation Style 1/Archive 66

Request to add Semantic Scholar IDs to the citation template
I’m reaching out from Semantic Scholar, a free, non-profit academic search and discovery engine developed by the Allen Institute for AI (AI2), which was first launched in 2015 and now indexes 180 million research papers from all scientific domains. From a content perspective, we have licensing agreements to index scientific content from 550+ publishers, pre-print servers and academic societies, and are integrated with multiple data partners including PubMed, Microsoft Academic, Unpaywall and others that provide us with high-quality metadata for our results (all of our content is publicly and freely available and we do not generate any revenue). We’ve been actively working with Citation Bot to add Semantic Scholar as a source for outbound links for licensed content and, based on the discussion here, the Wikipedia community recommended that we submit a request to add links to Semantic Scholar IDs as a new identifier type in the Citation Template, which can then be used by the Citation Bot.

For additional context, our goal in incorporating links to Semantic Scholar in Wikipedia citations is to provide an additional discovery entry point for Wikipedia users to explore our open literature graph and find additional relevant information for scientific articles that they are unlikely to find elsewhere. For example, in addition to citations/references, figures and tables we provide AI-based features such as citation classifications and high-quality supplemental content like videos, presentation slides, and links to code libraries (you can see an example here).

We are proposing to add our persistent Paper IDs in the following format: semanticscholar=1fa190b60988a4ad272e39e132bcc12b00429464 (with a persistent link in this format: http://api.semanticscholar.org/1fa190b60988a4ad272e39e132bcc12b00429464), but are open to suggestions (if the IDs are too long we can use our persistent corpus ID instead which looks like this: 134350433 - note: these are currently not shown on our website, but will be made available in our API within the next 2 weeks). Once these IDs are made available we plan to work with the Citation Bot to integrate API calls using our DOI resolver to generate corresponding links to Semantic Scholar pages (for example, DOI=10.1038/nrn3241 resolves to semanticscholar=da82f8e6ff009432896730061247fa6653bed1f0). Please let us know what additional information we can provide for this request to be considered by the Wikipedia community or if anyone has any questions or feedback! Sebaskohl (talk) 20:18, 18 February 2020 (UTC)
 * "1fa190b60988a4ad272e39e132bcc12b00429464" is indeed too long. ID codes should be designed so humans can make some sort of sense of them and actually use them. A 9 digit code can handle a billion documents. That should be plenty. There is no need for a 40-hex long code of random gibberish to encode 16^40 = 2^160 documents. Headbomb {t · c · p · b} 20:52, 18 February 2020 (UTC)
 * Also note that a URL of the type " " that resolves to " " is probably better than one that resolves to " ". For a bot, one format to another is very likely trivial, but for humans, it's both easier and more accessible to truncate after the identifier, rather than cut the middle part of the url. Headbomb {t · c · p · b} 21:01, 18 February 2020 (UTC)
 * Thank you for the feedback Headbomb! Good news is that we just made a change to our API to support shorter persistent IDs for papers in our index (we will be making the same IDs available on our website shortly to support manual edits). semanticscholar=37220927 now maps to 26efc3b4216a117a01975be4dfffa5267bfff64d and the new IDs will be 9 digits or less in length. Linking to these IDs is also straightforward and URLs can be constructed in the following format: https://api.semanticscholar.org/CorpusID:37220927. Let me know what you think and if this satisfies the recommendations/requirements that you've outlined. If yes, do you know what the next steps are to (1) get approval and (2) if approval is granted add this new identifier to the Citation Template? Appreciate your help! Sebaskohl (talk) 16:49, 19 February 2020 (UTC)
 * For the parameter name, what will SS display? CorpusID:37220927, which should be displayed as CorpusID:37220927? Or will you have stronger branding, like 'SemanticID:37220927' or 'SSCID:37220927' (short for Semantic Scholar CorpusID) or whatever? I personally like the latter, but that's just me. Headbomb {t · c · p · b} 17:42, 19 February 2020 (UTC)
 * The new IDs will be 9 digits or less in length. Does that mean that these IDs are randomly assigned?  Sequentially assigned?  What about leading zeros; permitted; not permitted? (modifying https://api.semanticscholar.org/CorpusID:37220927 to https://api.semanticscholar.org/CorpusID:037220927 suggests that leading zeros are permitted)  Is there a minimum value?  If   is 'valid' then the only rationality checks that cs1|2 can do are a max length check and a check to be sure that the ID is only digits.
 * —Trappist the monk (talk) 22:26, 19 February 2020 (UTC)
 * Thank you for the feedback! How about using S2CID as the abbreviation for Semantic Scholar CorpusID since S2 is already used on Wikidata? With regard to formatting, I can confirm that IDs are sequentially assigned and start with a minimum value of 1 with a current max value of 211133348 (to be safe, a max length of 10 digits should provide sufficient extensibility into the future). Leading zeros are supported with our API, but disallowing them is probably a good idea just in case. Sebaskohl (talk) 00:11, 21 February 2020 (UTC)
 * Ultimately up to you what you decide to call/present the identifier. The Wikidata property can easily be named whatever, so if you feel SSCID is clearer/rolls off the tongue more easily than S2CID (this is my position), go for that. If you like S2CID better, because Semantic Scholar = S2, go for that, although I can't say I ever encountered that abbreviation personally. Headbomb {t · c · p · b} 00:32, 21 February 2020 (UTC)
 * Sounds good! Let's go with S2CID since that aligns with the abbreviation that we already use for Semantic Scholar. Sebaskohl (talk) 00:48, 21 February 2020 (UTC)
 * This is fantastic. "(to be safe a max length of 10 digits should provide sufficient extensibility into the future)" <-- save this to look back on it in 10 years ;) – SJ +  14:33, 21 February 2020 (UTC)
 * first hack:
 * I have set the upper limit bounds to.
 * I notice that there are s2 pages like this one where the publisher keeps the article behind a paywall but s2 has a link to alternate sources at infekt.ch (is that a pirated copy?) and to pubmed which is not a link to the article. Is either of those appropriate?  When I noticed this, I wondered, because it is early days, if the s2cid couldn't somehow encode the access status (paywalled, free-to-read, open access, ...) of an article so that editors here don't have to bother with adding free (not yet implemented).  S2 already knows when articles are open access (see the linked page in the example citation above) so adding something to the s2cid ought not be an onerous endeavor.  With that info encoded, access icons would come automatically according to the value in s2cid (for identifiers, cs1|2 only cares about free-to-read) but other consumers of s2 via s2cid may want more/better granularity.
 * —Trappist the monk (talk) 14:55, 21 February 2020 (UTC)
 * Looks great! Do you have suggestions for how to best encode open access information in the ID? Should we use a query parameter in the URL to denote when a paper is open access (e.g. https://api.semanticscholar.org/CorpusID:37220927?open-access=Y - this already works) or use a different type of delimiter? With regard to the examples that you've highlighted, we will only add open-access=Y in cases when we know that the resource that we link to has been published with an open access license. This means that cases like the example you highlighted, where we link out to third-party websites with an unclear license like infekt.ch (a Swiss hospital website), will have a value of open-access=N (let us know if that works or if you have alternate suggestions). One additional question: Would it be possible to increase the upper limit bounds beyond  to give us room to grow (right now we are adding anywhere between 5-10 million new scientific papers and IDs per year)? Sebaskohl (talk) 17:00, 21 February 2020 (UTC)
 * cs1|2 templates cannot see the url unless editors or some other tool puts the url in url which, as I understand it, en.wiki finds to be undesirable. What I meant to say and upon rereading what I wrote, apparently failed to say, is that the access-status might be encoded into the s2cid as a suffix; perhaps: 37220927.oa or some such.  cs1|2 can then apply the free-to-read icon according to the suffix. You may have a use for more than one suffix; cs1|2 would only need whatever suffixes you choose that equate to free-to-read.
 * I chose the upper limit so that typos (or stray keystrokes that add an additional digit) might be detected.  This value will get bumped up as the need arises.  Were it up to me, s2cid would have a check-digit because as it stands right now, any sequence of digits that evaluate to less than  is the only checking that can be done.  This upper-level limit is a wholly inadequate way of verifying the integrity of an identifier but, for some identifiers, s2cid included, it's all we've got.
 * —Trappist the monk (talk) 18:18, 21 February 2020 (UTC)
 * Sounds good! We'll add support for the .oa suffix on our end in cases where a link to an open access PDF is available (IDs will be in this format: 37220927.oa). If the .oa suffix is missing then that's an indicator that no link to an open access PDF is available. I will let you know when that work is complete and thank you also for the clarification with regard to the upper limit! Sebaskohl (talk) 21:16, 21 February 2020 (UTC)
 * With .oa suffix:
 * I notice that adding the suffix to the generic url returns an error message:
 * https://api.semanticscholar.org/CorpusID:37220927.oa →
 * —Trappist the monk (talk) 00:37, 22 February 2020 (UTC)
 * We have implemented the fix for the above internal error. The link appears to be working properly now. —Jgorney (talk) 19:05, 26 February 2020 (UTC)
 * Good, thank you. I guess that leaves, at minimum, some way for Wikipedia editors, and automated tools like Citation bot, to discover the value of an s2 page's s2cid.
 * —Trappist the monk (talk) 23:55, 26 February 2020 (UTC)
 * Today we implemented a Wikipedia "W" button as a paper-sharing option in the upper right corner for editors. https://www.semanticscholar.org/paper/Exercise-of-Human-Agency-Through-Collective-Bandura/b1f74216506e3a35e3c56d5ada91d4a7112616dc
 * —Jgorney (talk) 20:28, 27 February 2020 (UTC)
 * Thank you. Those urls could be used by anyone, right?  No real need to qualify them as 'wikipedia compatible'.  For the purposes of this discussion, all that editors here would need / want is the string of digits that follow CorpusID: from the url; the rest of the url would be discarded.  Any way to just get that?
 * —Trappist the monk (talk) 21:27, 27 February 2020 (UTC)
 * Sorry I should have been more explicit. When that button is used it copies the api.semanticscholar.org:CorpusID link to the user's clipboard. So it is working as requested.
 * —Jgorney (talk) 20:28, 2 March 2020 (UTC)
 * Yeah, I understand what it does, but I don't think that what it does is what it should do. If the point here is to make it dead simple for editors to add an s2cid to a cs1|2 citation template then giving the editor the whole url when all they need is the s2cid number portion of the url isn't getting all the way to the goal because now the editor has to paste the url into something and then remove everything that isn't the s2cid.
 * We asked for a special form of url so that our editors don't have to deal with a 40-digit hex number which is unintelligible to normal humans. We want to be able to have editors add a parameter s2cid=<identifier number> to a cs1|2 template.  The code that renders the template will concatenate https://api.semanticscholar.org/CorpusID: and <identifier number> to get a working link into s2.
 * This url to a wholly unrelated website has just about what it is that I think we want: the ID is listed at the top of the image details list; double-click, copy and paste into my citation template and I'm done; no chance that I left on an extraneous character or deleted a character that was part of the s2cid.
 * The W button still isn't Wikipedia-exclusive so ought not be marked as Wikipedia-exclusive; if you want to keep it, use a generic chain-link icon.
 * —Trappist the monk (talk) 00:11, 4 March 2020 (UTC)
 * Makes sense and appreciate the feedback on the W button. I've queued up a change to update it to a more generic chain-link icon for the URL. To address the other issue you raised we'll create a new W button that makes it easy for editors to just copy the ID so that it can be added to the citation template. I will post an update when that's done. Sebaskohl (talk) 00:16, 7 March 2020 (UTC)
 * I will still argue for a plain text representation of the s2cid on the s2 page. Sure, it can have a tooltip and under-the-bonnet javascript to copy the value portion of the s2cid to the reader's clipboard.  When the s2cid is visible as plain text, readers can see and get the s2cid; even those readers who, for whatever reason, don't have js enabled.  Hidden behind a fancy button, those js-less readers cannot get the s2cid.
 * —Trappist the monk (talk) 14:49, 7 March 2020 (UTC)
 * Just because I was curious, and because you said that s2cid begins at 1, I looked: https://api.semanticscholar.org/CorpusID:1. Why is that page so dramatically different from https://api.semanticscholar.org/CorpusID:2?
 * And, while on the subject of 2, the open access link there links to https://eprints.soton.ac.uk/377196/1/Lessmann_Benchmarking.pdf; clearly a preprint. Shouldn't s2 identify such documents as preprints instead of giving readers the impression that the non-publisher link links to an open access copy of the article of record?
 * —Trappist the monk (talk) 14:11, 22 February 2020 (UTC)
 * Hello, let me first introduce myself. I am Joe Gorney and I work at Semantic Scholar for Sebastian. While he is out on vacation, I will be managing the S2 workflow regarding this string.  To address your query: the difference between the two pages is due to the source and amount of metadata we have received to build out the paper detail page (PDP) for page #2.  We have not received metadata of sufficient quality to flesh out the presentation of content for page #1.  Additionally, the question around preprint identification is being discussed internally.
 * —Jgorney (talk) 17:36, 24 February 2020 (UTC)
 * It occurred to me that it's easy to zero-pad an id to the left to 12 digits. Were we to do that then we could easily calculate a check-digit using the same algorithm as isbn-13 for which we already have a validation function in Module:Citation/CS1/Identifiers.  So I hacked a sandbox to do that: Module:Sandbox/trappist the monk/check digit.
 * Were we to do this, the need to periodically adjust an upper limit value goes away. The isbn-13 check-digit isn't perfect (there are certain digit transpositions that are undetected), but the occasionally undetected transposition is better than never detecting a transposition or typo except when the transposition or typo causes the s2cid to go out of bounds.  The adjacent undetectable transposed digit pairs that I know about are 16, 27, 38, and 49.
 * —Trappist the monk (talk) 16:29, 22 February 2020 (UTC)
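The zero-padding idea above can be illustrated with a minimal sketch. This is not the sandbox module's code (which is not reproduced in this thread); it is a Python approximation that assumes the standard ISBN-13 weighting (alternating 1 and 3 from the left) applied to an id left-padded with zeros to 12 digits. The function name `s2cid_check_digit` is hypothetical.

```python
def s2cid_check_digit(s2cid: str) -> int:
    """Return an ISBN-13-style check digit for a numeric id.

    Assumption: the id is left-padded with zeros to 12 digits and the
    usual ISBN-13 weights (1, 3, 1, 3, ...) are applied from the left.
    """
    if not (s2cid.isascii() and s2cid.isdigit()) or len(s2cid) > 12:
        raise ValueError(f"expected at most 12 ASCII digits, got {s2cid!r}")
    padded = s2cid.zfill(12)
    # Even positions (0-based) get weight 1, odd positions get weight 3.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(padded))
    # The check digit brings the weighted sum up to a multiple of 10.
    return (10 - total % 10) % 10


if __name__ == "__main__":
    print(s2cid_check_digit("37220927"))
    # Adjacent digits that differ by 5 can be swapped without detection.
    print(s2cid_check_digit("16") == s2cid_check_digit("61"))
```

Under these assumptions, swapping two adjacent digits a and b changes the weighted sum by 2(b - a), which is a multiple of 10 exactly when the digits differ by 5. That matches the pairs listed above, and by the same arithmetic the pair 0 and 5 behaves the same way.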

I'm not sure that's a good idea, if the journal gets acquired and either gains or loses open access status, that would mean the identifier changes as well. That's not good. The best way would simply be to have an open access flag that can be accessed via API. Headbomb {t · c · p · b} 21:36, 21 February 2020 (UTC)
 * Definitely agree that this is likely to happen. As part of this change we are also adding an "is_open_access" boolean flag to our API to ensure that the most up-to-date open access indicator is always available so that the IDs can be updated easily by the Citation Bot and Wikipedia editors. For example, if a paper changes from "open access" to "not open access" then the .oa suffix will be removed when the Citation Bot calls our API or a Wikipedia editor is adding a link manually from one of our pages. Note that from a linking perspective links with and without .oa will always link to the same resource on our side (we are adding logic so that those links will not break regardless of what the suffix is). Would that work? Sebaskohl (talk) 21:57, 21 February 2020 (UTC)
 * Modules / templates do not have the ability to access an api.
 * —Trappist the monk (talk) 00:37, 22 February 2020 (UTC)

Where do we stand with s2cid parameter support? Are we content to keep support for s2cid or, since the Semantic Scholar representatives Sebaskohl and Jgorney appear to have abandoned this discussion, delete support for this parameter from the sandboxen? —Trappist the monk (talk) 14:25, 17 March 2020 (UTC)
 * Apologies for the delayed response (we've been working on releasing a COVID-19 dataset that we just made public as part of a call to action from the White House). We would definitely still like to see the s2cid added to the citation template and are proceeding to make changes to our page design (scheduled to go out this week) based on your recommendations. This includes surfacing the IDs directly on the page as requested and changing the "W" icon to a "chain link" icon for the button. Please let us know if anything else is needed to add the s2cid. Sebaskohl (talk) 17:37, 17 March 2020 (UTC)
 * Good, thank you.
 * —Trappist the monk (talk) 18:11, 17 March 2020 (UTC)
 * As promised we've made the change to prominently display the s2cid on our paper detail pages and have updated the share icon. Please let us know if anything else is needed to add the s2cid to the citation template! Sebaskohl (talk) 21:37, 19 March 2020 (UTC)
 * I just created S2CID, e.g. . It will automatically strip the .oa if it is appended to the base identifier. Headbomb {t · c · p · b} 23:26, 19 March 2020 (UTC)
 * As far as CS1/2 is concerned, it should probably strip the .oa, but automatically set free if it's found. Headbomb {t · c · p · b} 00:01, 20 March 2020 (UTC)
 * It already does strip the .oa flag.  See the examples earlier in this conversation.
 * —Trappist the monk (talk) 00:09, 20 March 2020 (UTC)
 * I can't think of anything else that we want for this identifier. Others may have a different opinion.  I'd like to finish off some of the remaining open topics on this page and update the module suite in early April.
 * —Trappist the monk (talk) 00:14, 20 March 2020 (UTC)
 * Sounds great! We'll keep an eye out for the update to the module suite and please let us know if any additional questions come up in the meantime. Sebaskohl (talk) 20:30, 24 March 2020 (UTC)
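The suffix handling discussed up to this point (strip a trailing .oa from the identifier and treat its presence as a free-to-read marker) can be sketched as follows. This is an illustrative Python approximation, not the template or module code; the function name `split_s2cid` and the returned tuple shape are assumptions made for the example.

```python
from typing import Tuple

OA_SUFFIX = ".oa"  # suffix form quoted in the thread, e.g. 37220927.oa


def split_s2cid(value: str) -> Tuple[str, bool]:
    """Split an identifier into (base id, free-to-read flag).

    Hypothetical helper: a trailing ".oa" is removed and reported as True;
    anything else is returned unchanged with False.
    """
    if value.endswith(OA_SUFFIX):
        return value[: -len(OA_SUFFIX)], True
    return value, False


if __name__ == "__main__":
    print(split_s2cid("37220927.oa"))
    print(split_s2cid("37220927"))
```

Keeping the base id separate from the flag means the link target is the same with or without the suffix, which is the behaviour both sides describe above.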

revisiting s2cid free-to-read annotation
Sebaskohl: I chanced upon this citation:

Following the doi link shows that the publisher has that article behind a paywall. But, following the title link shows that there is an apparently free-to-read copy of the article hosted at s2. Wikipedia should not be linking to copyrighted works where it is not clear that the distributor (s2 in this case) has been properly licensed by the copyright owner (WP:ELNEVER). It isn't clear to me that the s2 copy of this journal article is properly licensed. If it is, then en.wiki is allowed to link to it and the s2cid rendering should show the free-to-read access icon.

Right now, the only way to display that icon is with the .oa flag at the end of the s2cid. But, since this article is not open access, that flag is inappropriate. This suggests that if s2 may legitimately host some articles that the publisher has behind a paywall but are not open access, it is necessary to have some sort of other flag to indicate that the article is free-to-read (and appropriately licensed?). Or, we drop the whole notion of the .oa flag altogether and require that editors here add free when the linked article is OA or s2 is properly licensed to host the article (this latter requires that s2 make it obvious that the copy of the article that they host is properly licensed).

—Trappist the monk (talk) 14:26, 25 March 2020 (UTC)
 * And just as an aside, there is this, purportedly Medicine and Surgery of South American Camelids. I used to have a copy of that book, the pdf linked from  is not that book.
 * —Trappist the monk (talk) 22:01, 25 March 2020 (UTC)
 * Thank you for highlighting these two links! In both cases the S2 Corpus ID did not have the .oa suffix because we only append the .oa suffix in cases where we are 100% certain that the open access license is current and up-to-date (we have a regular process running to keep licensing information up-to-date). Unfortunately, in the case of the first article that was no longer the case and I've removed the PDF. The second case was an unfortunate instance where the PDF was low quality (I've also removed this PDF). From a linking perspective you can be 100% confident that everything with a .oa suffix has a current and up-to-date open access license. I believe in other cases without the .oa suffix we decided to show a "closed access" symbol. Will that be sufficient or do you have alternative suggestions? Sebaskohl (talk) 20:04, 30 March 2020 (UTC)
 * Let me try to be sure I understand what you are saying. I think that you are saying that s2 does not / will not host copies of articles that are not properly licensed.  Is that right?  What about links to articles hosted at other locations?  For example, what about article copies that appear to be used at various universities for academic course work but are hosted at the professors' web pages?  See the Alternate sources dropdown at .  The publisher at  has that article behind a paywall.  Are those copies that s2 is linking properly licensed?  Should you and, through you, we be linking to them?
 * In writing this response, I note that using that same s2cid in this sandbox citation, I can add .oa to the s2cid and the link works:
 * Should that s2cid work? An editor can mark a non-OA s2cid with the .oa suffix to give our citation the free-to-read icon.  The editor might ask, "Why not?  The Alternate sources dropdown shows that copies of the article are readily available and I can link to them.  They must be free-to-read, right?"  Am I making sense?  I'm thinking that if we are to retain this .oa suffix mechanism, s2 should intercept s2cids that have the .oa suffix but s2 doesn't link to known OA hosts or s2 doesn't host an OA copy itself.  When intercepted, perhaps s2 can put up a banner that says something like "We don't have an open access copy of the article you are requesting, redirecting to ..."  You know what I mean, I think.  That intercept should be readily identifiable by a bot, perhaps Citation bot, so that the bot can modify the s2cid in the template where it is used.  Equally, for OA s2cids without the .oa suffix, s2 should immediately redirect to the OA landing page as if the suffix were present – you do this already I think.  As before, a bot should be able to easily identify redirected s2cids so that it can adjust the citation template here.
 * I don't know how much of this is possible. Pinging AManWithNoPlan for insights on the bot perspective.
 * —Trappist the monk (talk) 23:04, 30 March 2020 (UTC)
 * Citation Bot has special code that calls a special API (added at our request) that only returns licensed versions, instead of the “hey I scraped this off the web, so it must be free” versions.  BUT!!! Drum roll..... does S2 have any freely available content that is also not available from the publisher?  I doubt it, other than non-journal stuff.  Which is why we don’t add such links at this time.   Also, S2 supports DOI based URLS, so perhaps a S2-is-free=true could enable a DOI based link?  But, not all S2 stuff has a DOI.  Unlike CiteSeerX, which can probably legally host copyright infringing stuff because of crown immunity (both state and federal), S2 has no such immunity.  Their immunity does not mean we should link to it though. Those are my random thoughts.   AManWithNoPlan (talk) 23:19, 30 March 2020 (UTC)
 * Thank you for looping back! The .oa flag is only set in cases where we have clear and up-to-date open access information from either Unpaywall, the publisher or a repository like PubMed Central. In those cases we also link out to open access articles using alternate sources with an open access symbol. From a bot perspective, we added an "is publisher licensed" flag to our API to ensure that Wikipedia bots only link to publisher licensed content on S2 (see example). Our proposal when working with the Citation Bot is to only add/confirm S2 links if the content "is publisher licensed" or confirmed to be open access. Would that work? Sebaskohl (talk) 16:29, 31 March 2020 (UTC)
 * I think that you have neatly avoided answering my questions about articles linked from the alternate sources list when s2 have not marked them as open access. My example was the articles linked through the alternate sources dropdown at  when the publisher's article at  is behind a paywall.  Our editors will see the purportedly free-to-read articles in the alternate sources dropdown as free-to-read (there is the unlocked-lock icon) and will add the .oa flag to the s2cid to get the free-to-read icon to render in the en.wiki-published citation.  Because there is no apparent indication that the article copies linked from the alternate sources dropdown are properly licensed, en.wiki should not link them indirectly through s2 just as en.wiki should not link them directly.  The free-to-read .oa s2cid should only link to an s2 landing page that contains OA material or properly licensed OA links.
 * I would like to have this issue settled because it is delaying the next module-suite update.
 * —Trappist the monk (talk) 13:10, 2 April 2020 (UTC)
 * Thank you for highlighting this also! We are following up on our end to look into options to further improve the alternate source links beyond the metadata that we get from Unpaywall, publishers and other sources (which we use to set the .oa indicator). Unfortunately that won't be an easy change/improvement to make and I'm wondering if the best path forward is to just remove the .oa indicator altogether for now until we have a better solution in place (we don't want to block your release)? Let me know what you think and sorry about the back-and-forth on this. Sebaskohl (talk) 23:16, 3 April 2020 (UTC)
 * I'm inclined to agree. So, .oa gone; new s2cid-access created; and our favorite example:
 * Farther up in this thread is a template with .oa showing that it is no longer recognized.  Here it is again without the .oa:
 * At this point, I don't think that we should resurrect .oa.  To do so would only cause confusion – there might have been confusion had we retained it because turning on the free-to-read for s2cid would have been different from how it is turned on for other parameters.  If this experiment created anything beneficial at your end for Citation bot, that should be retained.
 * —Trappist the monk (talk) 23:48, 3 April 2020 (UTC)
 * Sounds great and glad to hear that this will work! I will ask my team to remove the .oa suffix from the Corpus ID on our pages to clean things up on our end. Let me know if there's anything else that you need from us to get the release unblocked. Otherwise we'll go back to the Citation bot to move forward from that perspective. Thanks again! Sebaskohl (talk) 20:59, 7 April 2020 (UTC)
 * I'm curious to see when you expect the citation template update to go live to include the s2cid so that we can follow up with the CitationBot and other collaborators? Thank you and appreciate the update! Sebaskohl (talk) 16:19, 28 April 2020 (UTC)
 * —Trappist the monk (talk) 16:26, 28 April 2020 (UTC)

It is live. See also User_talk:Citation_bot. Headbomb {t · c · p · b} 16:37, 28 April 2020 (UTC)

make ref=harv the default for CS1
This would have many benefits and very few if any drawbacks. No one raised substantial objections in that previous proposal, and now this is a blocker for Bots/Requests_for_approval/AntiCompositeBot. harv should be made default in CS1 just as it is in CS2. Headbomb {t · c · p · b} 00:06, 27 February 2020 (UTC)
 * As I understand it, the original rationale for not doing ref=harv for CS1 was that it creates invalid html when two references have the same authors and same publication year, and that CS1 is used so often in articles that don't also use the harv templates that this is likely to go undetected. Has that changed? —David Eppstein (talk) 01:03, 27 February 2020 (UTC)
 * Those are very much corner cases, and this is also a problem that already exists within CS2. Headbomb {t · c · p · b} 01:10, 27 February 2020 (UTC)
 * I probably misunderstand David Eppstein's remark, but isn't the standard practice to add a distinguishing mark to the date in same author/year refs, as in 2020a, 2020b etc. These render properly. 98.0.246.242 (talk) 02:58, 27 February 2020 (UTC)
 * Both render properly, but if there's a citation with the same authors and year, you have two citations emitting the same anchor, so there's a collision. However, that's already the current behaviour in CS2, and is really not a big issue, especially compared to the scores of broken anchors caused by CS1 not emitting those anchors to begin with. Headbomb {t · c · p · b} 03:11, 27 February 2020 (UTC)
 * It is standard practice *if you're using the harv templates* to disambiguate them so that they link correctly. The issue is that CS1 is mostly used by itself, not in conjunction with the harv templates, so there is less reason to disambiguate them, and (because they look ok) editors don't realize that under the hood they are generating bad html. —David Eppstein (talk) 05:03, 27 February 2020 (UTC)
 * Most people who use CS2 templates don't use harv templates, and many who use harv templates don't know that they only work out of the box with CS2. Whatever collision this would cause, it would problematically affect an extreme minority of articles, which is far fewer than the problems it would actually solve. The point is CS1 should also emit anchors, and whatever problems it causes do not outweigh the benefits this brings. As I wrote back then "If they used CS1, nothing changes. If they used CS2, then either they don't use harv or already have harv enabled. Or they use a mix of CS1 and CS2 that needs to be fixed anyway." Headbomb {t · c · p · b} 05:15, 27 February 2020 (UTC)
 * I have to admit that despite my arguing here, I think changing to ref=harv everywhere probably solves more problems than it causes. The html errors caused by duplicate ids are mostly invisible to readers, they are present in any case with CS2, and they are easily fixable (even in the case that you're not using harv links and don't want to add letters to the years) by using a custom ref in those cases. Finding and fixing these will no doubt give the gnomes plenty to gnaw on, making them non-problematic in the long term. —David Eppstein (talk) 21:58, 27 February 2020 (UTC)
 * My view, FWIW, is that adding ref=harv to CS1 references will solve more problems than it creates, and that it is the best proposal I have seen in many years of trying to make these errors more visible. It would be great if these broken refs would create an error category, or if they could be tagged by a bot, and if adding ref as a default to CS1 templates allows for bot tagging, let's try it. Javascript-wielding gnomes have been our best answer so far, and I come across short ref errors all the time, so that answer is not very good. – Jonesey95 (talk) 22:51, 27 February 2020 (UTC)
 * I don't think that I've changed my opinion about this. But, perceiving that this is almost a fait accompli, and because I have an (old) copy of Obama's article in my user-space, I edited that article to add harv to all of the cs1 templates that it holds (there are no cs2 templates).  You should look: User:Trappist the monk/Barack Obama.
 * The results suggest to me that there will be a plethora of false positives from User:Ucucha/HarvErrors.js. The example article has 111 harv warnings including everything in ; the example article does not have -family or  templates.
 * —Trappist the monk (talk) 00:19, 28 February 2020 (UTC)
 * I've trained my eyes to ignore the brown errors (full refs without matching short refs) from Ucucha's script, since they are almost always false positives. In articles with red errors (short refs without matching full refs), the brown messages sometimes help me find the one full reference that is not being cited. Maybe we could get a version of Ucucha's script that shows only the red errors. – Jonesey95 (talk) 00:37, 28 February 2020 (UTC)
 * Warnings are not problems. They're simply unused anchors. Errors are problems. That's what this is aiming to fix, as well as making the templates much friendlier to use to begin with. And it also allows for the removal of cluttering/pointless harv in CS1. Headbomb {t · c · p · b} 01:02, 28 February 2020 (UTC)
 * If this change causes the script to change to not show these as errors, or it causes editors to stop using the script because it shows too many false positives, and as a result they stop trying to add ref=none to articles that use CS2 but do not use harv linking, then I will count that outcome as a net positive. —David Eppstein (talk) 01:25, 28 February 2020 (UTC)
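
The collision concern discussed in this thread can be illustrated with a minimal Python sketch. It assumes that the anchor is the string CITEREF followed by up to four author surnames and the year, which is how the anchor is usually described; the function names and the sample citations are hypothetical and are not taken from the cs1|2 module.

```python
from collections import Counter

def citeref(surnames, year):
    """Build an anchor ID from up to four surnames and a year (assumed scheme)."""
    return "CITEREF" + "".join(surnames[:4]) + str(year)

def duplicate_anchors(citations):
    """Return the anchor IDs that more than one citation would emit."""
    counts = Counter(citeref(names, year) for names, year in citations)
    return sorted(anchor for anchor, n in counts.items() if n > 1)

# Hypothetical citations: (author surnames, year)
citations = [
    (["Smith", "Jones"], "2020"),
    (["Smith", "Jones"], "2020"),   # same authors and year: collides
    (["Smith", "Jones"], "2020a"),  # disambiguated with a letter suffix
    (["Brown"], "2019"),
]

print(duplicate_anchors(citations))
```

Only the two undisambiguated Smith and Jones citations share an anchor; the copy with the 2020a suffix gets its own ID, which matches the disambiguation practice mentioned above.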


 * Question: How possible would it be to do something like use dmy dates but for harv (like use harv cites)? I imagine that if editors could just transclude a single template that changes the output of all the cite templates, they'd be just as fine with that as they would if it was the default. – MJL ‐Talk‐☖ 16:52, 1 March 2020 (UTC)
 * Don't really know if it's feasible, but it would be relatively undesirable and mostly pointless clutter. Headbomb {t · c · p · b} 17:12, 1 March 2020 (UTC)

Interesting that you should ask that question. Over the past couple of days I have been messing about in my sandbox; more about that in a moment. When we first added support for the templates, I speculated that we could do something similar to unify rendering of the cs1|2 templates. The example I used was the mode parameter but ref is another that could be added to such a template. The discussion is buried in.

In response to comments elsewhere, I've created some code in my sandbox that reads raw cs1|2 citation templates and builds a table of CITEREFs that the and  families of templates (using Module:Footnotes/sandbox) can read to determine if there is a matching target citation in the article. When the or  template finds its CITEREF in the table, no error message:

But, if the CITEREF isn't in the table:

When there are multiple cs1|2 citations that produce the same CITEREF:
 * – here is the sfn and two same-name / same-date cs1|2

And it works with the family:
 * – here is the sfnmp

Downsides? Inevitably. This scheme does not work for wrapped templates because those kinds of templates hide a lot of parameters (author parameters, editor parameters, contributor parameters, ref, date, year) under the bonnet so they aren't visible in an article's wiki source. Does not play well with ve because ve does not preview in the same way that the wiki-source editor previews (same reason the auto-date-formatting doesn't work while editing with ve). Benefits? Error messages are visible to editors who don't have User:Ucucha/HarvErrors.js; the experiment detects both errors in the  example; the script only finds one at a time;  the experiment doesn't shout. Enhancements still to be done are support for error categories and help text. Another possible enhancement might add CITEREFs to the table when none so that harv templates without a target but that match the citation template where none could be annotated. Also, the templates support their own ref parameter. The content of that parameter overrides the normal CITEREF in the same way the cs1|2 templates with ref assigned some other text than,  ,   (as plain text or as created by ) overrides the automatic CITEREF anchor creation. The table can hold that 'ref' text for comparison to 'ref' text in templates. I don't know how common this custom ref use is; I have seen it used to just hold what looks like notes which misuses the parameter but I guess I would expect negative pushback for this enhancement.

—Trappist the monk (talk) 19:38, 1 March 2020 (UTC) 21:05, 1 March 2020 (UTC)
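
The lookup described above can be sketched as follows. This is a Python illustration of the idea only, not the code in Module:Footnotes/sandbox; the function name, the status strings, and the sample anchors are all hypothetical.

```python
from collections import Counter

def check_short_refs(full_anchors, short_anchors):
    """Classify each short-reference target as ok, missing, or ambiguous."""
    table = Counter(full_anchors)  # anchor -> number of citations emitting it
    report = {}
    for anchor in short_anchors:
        if table[anchor] == 0:
            report[anchor] = "no target"
        elif table[anchor] > 1:
            report[anchor] = "multiple targets"
        else:
            report[anchor] = "ok"
    return report

# Hypothetical anchors collected from the full citations in an article
full_anchors = ["CITEREFSmith2020", "CITEREFSmith2020", "CITEREFBrown2019"]
# Hypothetical anchors that the short references point at
short_anchors = ["CITEREFBrown2019", "CITEREFSmith2020", "CITEREFLee2018"]

for anchor, status in check_short_refs(full_anchors, short_anchors).items():
    print(anchor, status)
```

A table built this way only sees citations whose parameters are visible in the wiki source, which is the limitation noted above for wrapped templates.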

In the sandbox:

New maint cat to identify cs1|2 templates with harv. When that category is cleared, the code supporting harv should be rewritten.

—Trappist the monk (talk) 14:15, 2 April 2020 (UTC)

broken harv link reporting
Please see the discussion at where the above broken harv-link reporting scheme is proposed.

—Trappist the monk (talk) 17:46, 16 March 2020 (UTC)
 * Well, harv would still need to be the default option for the number of errors to go down drastically. Headbomb {t · c · p · b} 18:29, 16 March 2020 (UTC)
 * Can we please have this in the April update? Headbomb {t · c · p · b} 18:05, 29 March 2020 (UTC)


 * I was rather surprised to see this change today, as I hadn't been aware it was on the cards. The page GirlsDoPorn has several "further reading" entries wrapped in cite web template, which are being flagged by the harv ref checker as "Harv warning: There is no link pointing to this citation". If ref=harv is now default, how do I turn it off? Thanks — Amakuru (talk) 13:48, 28 April 2020 (UTC)
 * First, consider updating your script with one of the options at the top of User:Ucucha/HarvErrors. User:Trappist the monk/HarvErrors.js will in particular suppress all warnings from all Further reading sections, and will also suppress all warnings on articles that don't make use of shortened footnotes. But to bypass the warning on a specific citation, the method that worked with citation now works with all templates: just put none on the citation. Headbomb {t · c · p · b} 15:30, 28 April 2020 (UTC)
 * OK great, thanks for the info. — Amakuru (talk) 22:00, 28 April 2020 (UTC)

Make replacement characters their own category


Emits code to have it populate Category:CS1 errors: invisible characters. This is not an invisible character, and the way to fix those is very different than with invisible characters. This should instead populate Category:CS1 errors: replacement characters. Headbomb {t · c · p · b} 15:28, 29 April 2020 (UTC)
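
The distinction requested here can be illustrated with a short Python sketch that reports U+FFFD separately from other flagged characters. Treating Unicode categories Cc and Cf as "invisible" is an assumption made for this example and is not the module's actual character list; the function name and labels are hypothetical.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def classify_problem_chars(text):
    """Return (index, code point, label) for characters worth flagging."""
    found = []
    for i, ch in enumerate(text):
        if ch == REPLACEMENT:
            found.append((i, "U+FFFD", "replacement character"))
        elif unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\t\n":
            # Assumed definition of "invisible": control and format characters
            found.append((i, f"U+{ord(ch):04X}", "invisible character"))
    return found

# Hypothetical title containing one replacement character and one zero-width space
sample = "Title\ufffd here\u200b"
for hit in classify_problem_chars(sample):
    print(hit)
```

Keeping the two labels apart would let each one feed its own tracking category.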

Citing a forum post
How would one cite something like a forum post (assuming that it’s an acceptable source)? Use cite web with “Thread title”. Website forums.? Or is there a more fitting template? Or do it by hand? —96.8.24.95 (talk) 03:12, 29 April 2020 (UTC)
 * Cite Web would be the one. I assume that you mean non-anonymous postings, or posts by handles proven to belong to specific individuals (whose real name should also appear in the citation). There are 2 templates in cs1 that are similar (cite mailing list & cite newsgroup). Holdovers from the days when most people concerned with templates were computer-oriented. 172.254.241.58 (talk) 13:47, 29 April 2020 (UTC)
 * Yes, from a verified primary source. How should cite web be used? In the way I said above? Because otherwise, I definitely have the wrong idea. —96.8.24.95 (talk) 21:29, 29 April 2020 (UTC)
 * Depends on how the particular forum indexes the posts. This may not be initially obvious. For instance, the proper way to cite your post above should include the revision id. But obviously this means nothing to the average reader. To make it more readable, one would have to include the date and time, the in-source location and other info, including the forum publisher who may not necessarily be the same as the website publisher. 98.0.246.242 (talk) 00:09, 30 April 2020 (UTC)

Are there really nbsp characters in Les Frangines?
I am seeing nbsp invisible character errors in Les Frangines, but when I copy and paste one of the relevant title parameter values to https://r12a.github.io/app-conversion/ or to a text editor and show all of the invisible characters, a regular space or an HTML %20 is shown where an nbsp is indicated. Here's a sample cite web template that is giving an error, copied directly from that article:

It does not show me an error here. I am confused. – Jonesey95 (talk) 15:13, 29 April 2020 (UTC)
By the way, the Wikipedia citation tool for Google Books outputs "AuthorYYYY", which is absolutely uninformative. --Moscow Connection (talk) 19:21, 30 April 2020 (UTC)
 * I don't see an error in that template. =) --Izno (talk) 15:28, 29 April 2020 (UTC)
 * There were several errors, User:Citation bot fixed those here, alongside a few other things. They were hardcoded non-breaking spaces, which the bot replaced with regular spaces. Headbomb {t · c · p · b} 15:32, 29 April 2020 (UTC)
 * I think no-break spaces may be converted to ordinary spaces when you try to copy them out of the edit window. Using "Edit Source" on Les Frangines and then exploring the edit window using the Firefox built-in "Inspector" tool, I could see the no-break spaces. But, yes, they were shown as ordinary spaces when I copied from the edit window into the r12a.github.io window. That is confusing. -- John of Reading (talk) 15:44, 29 April 2020 (UTC)
 * Because I was curious about this, I looked at the bot fix. OK, the NBSP was between Apple and Music; I guess not requiring them to be on the same line is acceptable. But the bot also changed «DONNEZ-MOI» to "DONNEZ-MOI", although the former is in the actual title of the citation. Is that really a policy to change the text of a citation's title? That's quite apart from the resulting ugly repeated " character (see footnote 7). David Brooks (talk) 16:49, 29 April 2020 (UTC)
 * I believe that's in line with MOS:QUOTEMARKS, or was at one point. For discussion about that, you should try WT:MOS instead of here. Headbomb {t · c · p · b} 17:15, 29 April 2020 (UTC)
 * Per this very help page, we typically use Wikipedia-standard punctuation, capitalization, and other typographical elements in citations, and ignore or convert anything exotic (like foreign punctuation) or florishy. Barring case-by-case exceptions, of course. Edit: And that includes changing nested quotemarks from double to single and vice-versa. While taking care of that, I noticed that several of the reftag names in that article are truly awful in terms of ease of typing (which is a major factor in choosing a ), even entirely using non-Latin scripts. Might have been automated? Anyway, they’re (hopefully) improved now. —96.8.24.95 (talk) 21:50, 29 April 2020 (UTC)
 * Reference names should be meaningful and unambiguous. I won't revert your edit now. But when I decide to update the article, I will have to. Cause the names you chose are confusing, they don't tell me anything. --Moscow Connection (talk) 07:14, 30 April 2020 (UTC)
 * In my experience, refs are usually named for the website or the author’s last name (and numbered if there are multiple, though I took a different tack here), and kept short for ease of re-use. What best practices have you seen? —96.8.24.95 (talk) 17:46, 30 April 2020 (UTC)
 * I haven't seen many practices at all, so I don't know. I prefer "News article title - Site name". A ref like that is easily recognizable and 100 % unambiguous. And when you are writing an article, it really helps if ref names are recognizable. (When I was mostly active in the Japanese Wikipedia, where I would create stubs about upcoming songs etc., I used "websiteYYYYMMDD", but that format wasn't unambiguous cause sometimes there were two articles about the same band on the same website on the same day.) --Moscow Connection (talk) 19:12, 30 April 2020 (UTC)
 * Anyway, your "album" referring to an Apple Music album profile is about the worst practice I've ever seen. :-)
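
On the original question in this thread: because copying out of the edit window may turn no-break spaces into ordinary spaces, it can help to inspect the raw string programmatically. The following is a minimal Python sketch; the sample title is hypothetical and only mirrors the Apple Music case mentioned above.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def find_nbsp(text):
    """Return the indexes of every no-break space in text."""
    return [i for i, ch in enumerate(text) if ch == NBSP]

def show_nbsp(text):
    """Make no-break spaces visible by replacing them with a marker."""
    return text.replace(NBSP, "[NBSP]")

# Hypothetical title value with a hardcoded no-break space
title = "Les Frangines on Apple\u00a0Music"

print(find_nbsp(title))
print(show_nbsp(title))
print(title.replace(NBSP, " "))  # the kind of fix the bot applied
```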

Does parameter order matter?
I'm wondering whether the order of parameters such as "url=" or "date=" in a citation matters. The citation templates given in the editing bar and those given on here are in a different order. Is there a particular way I should be writing the parameters in a reference? Heartfox (talk) 20:56, 30 April 2020 (UTC)
 * Order doesn't matter, but it's a good practice to more or less match the output order to make things easier to locate. There's an example here on an unrelated page Bots/Dictionary. Headbomb {t · c · p · b} 21:04, 30 April 2020 (UTC)

Invalid date
Hello, there are a few thousand incorrect dates of 1970-01-01, 1 January 1970 and January 1, 1970 around which is some marker rather than the actual date of publication. Maybe we could track this, probably in a separate category to the Category:CS1 errors: dates which is already large. Keith D (talk) 11:49, 17 April 2020 (UTC)
 * What do you mean around which is some marker rather than the actual date of publication? Show me a page where you have seen this 'marker'?
 * —Trappist the monk (talk) 11:58, 17 April 2020 (UTC)
 * I made the assumption it was a marker of some form, but see 2020 coronavirus pandemic in India and 2020 coronavirus pandemic in Singapore as a couple of examples that use that date. Keith D (talk) 13:47, 17 April 2020 (UTC)
 * For 2020 coronavirus pandemic in Singapore, changed this:
 * to this:
 * For 2020 coronavirus pandemic in India, changed this:
 * to this:
 * Both of these edits were made by WP:REFLINKS. I don't know how Reflinks gets the publication date but, the 1970-01-01 date is suspicious given the currency of the topic and looks remarkably like a Unix epoch date of 0.
 * Pinging Editor Dispenser into this conversation because there may be a bug in Reflinks.
 * —Trappist the monk (talk) 14:29, 17 April 2020 (UTC)
 * Oof, 1,300 hits in article space for date=1970-01-01. Most of them will be bogus unix zero dates. A tiny fraction of them will probably be valid data for books or journal articles published in January 1970, like this note at B. F. Skinner. Anything using cite web with a date of 1970 is suspicious, though. – Jonesey95 (talk) 15:25, 17 April 2020 (UTC)
 * Good catch, likewise (although to a much lower extent), 1980-01-01 and 1900-01-01 are also bogus dates in many citations.
 * --Matthiaspaul (talk) 16:05, 17 April 2020 (UTC)
 * It really helps if you link to example articles. – Jonesey95 (talk) 16:49, 17 April 2020 (UTC)
 * 1980 and 1900. --Izno (talk) 17:17, 17 April 2020 (UTC)
 * That is helpful. As with 1970, the cite book usages look valid (though overspecified, since we don't typically cite month and day for book publication dates), but the cite web usages appear to be invalid. There may be a couple of edge cases, like a scanned report from 1980 posted on the web using cite web. – Jonesey95 (talk) 21:35, 17 April 2020 (UTC)
 * Independent of fixing identified tools causing this I suggest to generally test for these three "start of epoch" dates and add a maintenance category for them. In the case of cite web we could even treat it as a potential error and ask editors to use the ((...)) syntax to override the error checking in case this is really a valid date. --Matthiaspaul (talk) 19:35, 2 May 2020 (UTC)
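
The check suggested in this thread can be sketched in Python. The first lines show that Unix time 0 corresponds to 1970-01-01 (UTC); the rest is a hypothetical test for the three dates named above, limited to cite web and skipped when the value is wrapped in ((...)). The function names and the three accepted date formats are assumptions for illustration, not the cs1|2 date validation code.

```python
from datetime import datetime, timezone

# Unix time 0 rendered as a calendar date in UTC
EPOCH_ZERO = datetime.fromtimestamp(0, tz=timezone.utc).date().isoformat()

# Dates named in the discussion as frequently bogus
SUSPECT_DATES = {"1970-01-01", "1980-01-01", "1900-01-01"}

def normalize(date):
    """Convert ymd, dmy, or mdy text to ISO form; return None if unparsable."""
    for fmt in ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y"):
        try:
            return datetime.strptime(date, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def is_suspect(template, date):
    """Flag placeholder-looking dates in cite web unless wrapped in ((...))."""
    if date.startswith("((") and date.endswith("))"):
        return False  # editor explicitly accepted the value as written
    return template == "cite web" and normalize(date) in SUSPECT_DATES

print(EPOCH_ZERO)
print(is_suspect("cite web", "January 1, 1970"))
print(is_suspect("cite book", "1970-01-01"))
```

Restricting the check to cite web reflects the observation above that cite book and cite journal uses of these dates are often valid.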

Wikidata sitelink optimization
I would like to get the following: changed to: I would have used the sandbox but it seems to be in the middle of being used for something else and this is a simplistic and straightforward change that avoids bumping the expensive parser function count. Thank you, —Uzume (talk) 17:44, 4 May 2020 (UTC)
 * Not done: Please make your changes in the sandbox. They should be relatively trivial there even if there are other changes being made in and around.

This change probably also doesn't need special handling. Izno (talk) 18:49, 4 May 2020 (UTC)
 * Okay, I put it in the sandbox on top of whatever else was there: Special:Diff/954872230/prev. Thank you, —Uzume (talk) 18:56, 4 May 2020 (UTC)
 * cs1|2 is updated only from the sandboxen. Since this change is in Module:Citation/CS1/Identifiers/sandbox, it will make it to the live module when the suite is next updated.
 * —Trappist the monk (talk) 20:38, 4 May 2020 (UTC)
 * You are stipulating anything that makes it into the sandbox eventually makes it into the main module and such requests are needless? That hardly seems right. —Uzume (talk) 00:33, 5 May 2020 (UTC)
 * But indeed correct. The sandbox accumulates changes and functions more or less as a live article for that purpose, rather than how most template sandboxes work on Wikipedia. --Izno (talk) 00:37, 5 May 2020 (UTC)

Question about page parameter in Template:Cite journal
That parameter is habitually filled by auto-generate tools with the entire page range for the journal. Ex. >. But recently another editor requested that I add a specific page to the citation of a journal publication. Ok, best practices and such, narrow ranges are good but - what parameter to use? I thought it is part of the MoS for journal citations to report their page ranges (it is commonly done in academic citations). But where to add the specific page? I cannot use both page= and pages= in the template, they clash and generate an error. --Piotr Konieczny aka Prokonsul Piotrus| reply here 07:56, 1 May 2020 (UTC)
 * Seems like you're looking for WP:REFPAGE which informs about different systems that can be used to define precise page numbers for verifiability, while keeping the page range of a journal article mentioned too. --Francis Schonken (talk) 08:09, 1 May 2020 (UTC)

37–58 [53] will work for a one time use. Alternatively, something like Blah blah, blah blah... but also blah! Blah blah, blah blah... but also blah!

will also work. WP:REFPAGE above also has other methods. Headbomb {t · c · p · b} 11:31, 1 May 2020 (UTC)
 * I usually just add a text note after the template, something like "See in particular p. 53." —David Eppstein (talk) 18:47, 1 May 2020 (UTC)
 * , I usually do this if I'm not using sfn short citations:


 * Or you can do it the other way round:


 * We have the same problem with cite book when citing book chapters, where we need a page range and a particular page number. SarahSV (talk) 21:50, 1 May 2020 (UTC)
 * It seems like the citation templates have a problem (missing parameter) that should be fixed. For now I have used RP templates but the SFN ones you use seem like an elegant solution that I will try to learn. --Piotr Konieczny aka Prokonsul Piotrus| reply here 02:25, 2 May 2020 (UTC)
 * Wouldn't call that a problem of the citation templates. This is how short works (or specified short parts of works such as book chapters or journal articles) have been cited since... well, the beginning. Short works were not normally paginated, in the old days. What is cited is the work (the journal) and an in-work location (the article). The citation's page/s parameter should reflect that, which also agrees with the table of contents (a discovery aid). Adding a third-level location (the specific page) may be too much trouble, especially since the alternatives offered above work. 98.0.246.242 (talk) 16:37, 2 May 2020 (UTC)
 * As works are increasingly being digitized we use them in new ways. For example in the future you may be able to mouse-hover over the page number and a pop-up digital version of that page appears. That's one example, but having the exact page number allows for new ways of using books, or more useful ways of navigating the 100s of millions of books in existence. -- Green  C  17:49, 2 May 2020 (UTC)
 * I don't think that the request for a parameter to indicate context is wrong. I was just pointing out some of the rationale for the current scheme, and the possible existing resolutions. 98.0.246.242 (talk) 19:06, 2 May 2020 (UTC)
 * Since this problem pops up repeatedly (f.e.
 * Help_talk:Citation_Style_1/Archive_16
 * Help_talk:Citation_Style_1/Archive_62
 * Help_talk:Citation_Style_1,
 * but also in many other places) and everyone is forced to either ignore it or "invent" his/her own notation how to specify both page ranges (and readers need to guess-decode it), I would appreciate if we could address this within the citation templates, so that we can establish a standard notation for such page numbers and centrally maintain the rendering in the future. This would improve consistency, and might, in a later step, even lead to new improved functionality, like individually linked pages, backlinks, etc.
 * For this, I suggest to introduce new, more specific parameters:
 * cite-page/cite-pages to specify the individual page(s) used to support a statement in the article. The output would be like for the existing page/pages parameters, f.e..
 * span-page/span-pages to specify the page(s) defining a chapter or journal / magazine article. This must be a superset of the individual pages used in the citation. Since the old page/pages parameters were used to specify both individual and span pages, we have to treat them as aliases to span-page/span-pages. If cite-page/cite-pages is not specified or defines the same page range, the output would just display what was defined by span-page/span-pages/page/pages, f.e. . If cite-page/cite-pages defines a subset, the output would be like f.e..
 * total-page/total-pages to specify the total page range of the book / work, or the count of pages. This would be a free-text parameter without error checking. If defined this would be appended at the very end of the citation output like " ". Examples:,  ,.
 * quote-page/quote-pages to specify the page/page range used for the quote in the quote parameter. This information, if present, could be added in front of the quote, f.e. . Internet Archive Bot is adding links to pages of archived works, so it should be possible to specify page links. If quote-page/quote-pages would be specified without quote parameter, this would be treated as an error. The value of quote-page/quote-pages could either be treated individually or be added to the pages defined by cite-page/cite-pages. The latter case, however, would require some more sophisticated page list management in the template. In the minimal version, it would just append the string defining the pages after removing those subpatterns that are already in the list defined by cite-page/cite-pages. I guess, this would be too complicated for the first implementation, so I suggest to just treat the quote-page/quote-pages independently.
 * Example renderings:
 * This could look like:
 * Last, First (1970). "Title". Journal. Location: Publisher. 25 (5): 263, 270, 282–285 (261–336). Page 283: "Quoted text" (76 pages)
 * Even without volume and number parameters, this could still be decoded as pages (rather than issue numbers) due to the preceding colon ":" following the publisher.
 * As a magazine article this could look like:
 * Last, First (1970). "Title". Magazine. Vol. 25 no. 5. Location: Publisher. pp. 263, 270, 282–285 (261–336). Page 283: "Quoted text" (76 pages)
 * As a book the rendering could be similar to:
 * Last, First (1970). "Chapter". Title. Location: Publisher. pp. 263, 270, 282–285 (261–336). Page 283: "Quoted text" (563 pages)
 * --Matthiaspaul (talk) 19:11, 2 May 2020 (UTC)
 * That is absolutely atrocious and unparsable. Headbomb {t · c · p · b} 14:00, 3 May 2020 (UTC)
 * I have added some formatting and reduced the examples to (hopefully) be more readable. --Matthiaspaul (talk) 16:00, 3 May 2020 (UTC)
 * There is no need to distinguish quote-pages from cite-pages, and citations do not need total-pages. Kanguole 14:11, 3 May 2020 (UTC)
 * As only excerpts of text supporting statements in an article may be quoted and a reference may be used to support multiple statements in an article quote-pages may be a subset of cite-pages. If both parameters would specify the same page range, the  prefix in front of the quote would be redundant information and could be silently muted.
 * Regarding total-pages, it is certainly not essential, but some users repeatedly asked for it, and there are valid uses of this. So it is best to have a proper place for this so that we can control how it is rendered rather than everyone inventing his own conventions. This will improve consistency. --Matthiaspaul (talk) 16:00, 3 May 2020 (UTC)
 * I see no reason for total-pages, and I put little weight on user requests unless the user demonstrates he/she understands citations, both inside and outside Wikipedia.
 * I also am dubious that a mass of pages can be displayed in one citation without confusing the reader. Short endnotes, where the pages that support a statement are given in the endnote, and all the pages in the article are specified in the bibliography entry, are easier to understand.
 * The suggestion, and the term "span-pages", belie some ignorance of publishing practices. Some magazines and journals put all the pages of an article next to each other, but many magazines, articles, and especially newspapers will print part of an article on one page and continue it later in the issue. So the page list for an entire article might look like 11, 14–16, 102–107. Jc3s5h (talk) 16:42, 3 May 2020 (UTC)
 * Yeah, and that's a span, although a fragmented one. Alternative names could be chapter-pages or article-pages etc., but I tried to avoid becoming too specific. I'm happy for suggestions.
 * The span-pages parameter was meant to carry things like  as well, if necessary. And cite-pages could then hold f.e.   pointing to the actual pages supporting the (one or more) statements in an article. Combined this could look like.
 * Some people prefer short footnotes, others don't, and some make it dependent on the article and the citations used. The fact that we have sfn does not mean we shouldn't further improve cite templates.
 * --Matthiaspaul (talk) 17:10, 3 May 2020 (UTC)
 * I agree that it would be helpful to have a  parameter for book chapters and journal articles, and leave   or   for specific page numbers needed to satisfy WP:V. If it looks too confusing (e.g. a newspaper article spread over many pages), then simply leave it out. SarahSV (talk) 17:59, 3 May 2020 (UTC)
 * Sarah, since you proposed the (round-bracket) notation above, would you also be happy with the swapped [square-bracket] notation discussed in the parallel thread at ?
 * --Matthiaspaul (talk) 13:16, 6 May 2020 (UTC)


 * --Matthiaspaul (talk) 16:39, 4 May 2020 (UTC)

Google books example
https://en.wikipedia.org/wiki/Help:Citation_Style_1#Online_sources Google books is moving to a new format (book in the URL, not the hostname), so the examples suggesting changing "books.google.com/books" to just "books.google.com" are probably not a good idea anymore. AManWithNoPlan (talk) 23:45, 5 May 2020 (UTC)
 * Linking to a page number looks similar to the old way: https://www.google.com/books/edition/The_Old_Way/rtHR8_gK_WwC?hl=en&gbpv=1&pg=PA3 -- Green  C  00:13, 6 May 2020 (UTC)
 * Comparable URLs:
 * Old: https://books.google.com/books?id=vh8xDQAAQBAJ&pg=PT207
 * New: https://www.google.com/books/edition/_/vh8xDQAAQBAJ?gbpv=1&pg=PT207
 * -- Green  C  00:28, 6 May 2020 (UTC)
 * If I may make a highly pedantic observation, the actual identifier always was and is part of the url, and it was never part of the "hostname" (actually, domain name). Originally "books" had its own subdomain. The subdomain was moved directly under the main domain name hierarchy, as a subpage. Sorry about the geek storm. 98.0.246.242 (talk) 01:18, 6 May 2020 (UTC)
 * Please read subdomain and hostname they are interchangeable terms in this context. They eliminated the books hostname/subdomain, added a "www" hostname/subdomain, and added a "/books/" path to the URL. -- Green  C  01:56, 6 May 2020 (UTC)
 * Beg to differ. Hostnames are singular device identifiers and are rarely published in large networks like Google's. They are in no case interchangeable with subdomains, which are subnetworks described in a distinct zone file. Hosts are components of network domains. The books path is only a fork of the main index, that contains its own pages. 172.254.241.58 (talk) 13:19, 6 May 2020 (UTC)
 * Please read hostname. "Any domain name can also be a hostname, as long as the restrictions mentioned below are followed. So, for example, both en.wikipedia.org and wikipedia.org are hostnames because they both have IP addresses assigned to them. A hostname may be a domain name, if it is properly organized into the domain name system. A domain name may be a hostname if it has been assigned to an Internet host and associated with the host's IP address." --  Green  C  13:48, 6 May 2020 (UTC)
 * That is a poor article. It mixes up elements of physical networks, network domains, and directory systems like DNS. Confusion. I am fairly certain that there was no host named "books." @ google. There were probably dozens of hosts under the "books.google." subdomain, sharing the load of Google Books. 172.254.241.58 (talk) 14:25, 6 May 2020 (UTC)


 * Do the older forms continue to work into the future?
 * Given Google Books' widespread use in citations, perhaps we should add some special support for them and treat them similar to identifiers, so that editors would only have to specify the volume ID "vh8xDQAAQBAJ" and a number of optional parameters like for pages ("PT207") and keywords, and let our citation template create the actual url. This way, we could centrally update the url when Google once again changes their scheme in the future.
 * The parameter could be named gbooks, and perhaps the sub-parameters for pages etc. could just be appended to the volume ID, like in.
 * We already have a similar template gbooks which can also be used for url in citation templates like in  (currently still) resulting in  . Built-in support for this would help to free the url parameter for other uses. (Although I don't support this RFC, if Village_pump_(proposals) succeeds, the title could be auto-linked to the gbooks link (or one of the other identifiers discussed there), if url is not present.)
 * There's one more unresolved problem: The changes at Google indicate that Google Books links are (unfortunately) far from being "permanent links" as claimed by some users, who habitually (and against consensus) remove url-status, archive-url and archive-date parameters from citations when url points to Google Books. Moving Google Books links into gbooks would encourage them even more to do this. This also applies to a number of other identifier links, as we have multiple links which should better be archived to avoid potential link rot in the long-term future, but only one parameter to hold the archived link. I have some ideas how to possibly solve this but since this would be a general solution (not only related to Google Books), it better belongs into another thread, I guess.
 * --Matthiaspaul (talk) 15:22, 6 May 2020 (UTC)
 * You are right Google Books are unstable. We can expect they will change any time, for any purpose that aligns with the primary goal of being a commercial book seller. I personally am not a fan of abstracting URLs into templates because it limits tools, most tools are not programmed for the literally thousands of custom URL templates we have making them invisible to maintenance tasks. A plain URL is easy for everyone to understand and work with. If/when we modify all the Google Books links to the new format, each new URL has to be tested, because they never all just work, so having templates would actually increase link rot because we would never go through and test they are working - and if we did, it might as well just make the change which negates the benefit of the template. Those custom URL templates are seductive but hide a lot of problems that create link rot. The archiving of Google Books is sometimes a problem because Google has anti-theft techniques so the archive providers have trouble creating usable snapshots. -- Green  C  15:58, 6 May 2020 (UTC)
 * I was a bit reluctant to bring this to the table as I see some of this as well. I didn't know that Google Books doesn't switch all links to a new format, though - this would defeat the idea. Or is this down to incomplete documentation and therefore unknown url parameters accepted in old links which are no longer supported in a newer format? In this case, reducing the links to the essential parameters as early as possible would be even more desirable (not only to get rid of tracking parameters and other junk) - and if the template would help an editor by highlighting unneeded url parameters in edit preview, this could actually improve the quality and reduce the risk for link rot. Either way, the archiving problem stands. --Matthiaspaul (talk) 19:08, 6 May 2020 (UTC)
 * I do the WP:URLREQ board and have gained a lot of experience with URL switches. In my experience there has never been a URL switch where every URL is migrated on the remote side. There are always stragglers, by mistake or intentionally - the site admins use the opportunity to clean up, or they neglect things. With close to a million Google URLs on enwiki even a small percentage would be a lot. Because it's Google they could be better, but Google Books URLs have a high 404 rate so they are not known for link reliability. Google Books URLs are complicated and undocumented; the examples given above are the least complex for linking to a single page. There are different types of pages like multi-page short-snippet views, etc. --  Green  C  20:59, 6 May 2020 (UTC)
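
For illustration only, the idea raised above of generating the link centrally from a volume ID and an optional page could be sketched roughly as follows. This is a minimal sketch, not existing template or module code: the function name `gbooks_url` and its parameters are hypothetical, and the only URL patterns assumed are the "Old" and "New" forms quoted earlier in this thread. Google Books URLs are undocumented, so other page types may need different handling.

```python
from urllib.parse import urlencode


def gbooks_url(volume_id: str, page: str = "", new_style: bool = False) -> str:
    """Build a Google Books link from a volume ID and an optional page token.

    Hypothetical helper: it only reproduces the two URL shapes quoted in
    the discussion and makes no claim about other Google Books parameters.
    """
    if new_style:
        # New form: www.google.com/books/edition/_/<id>?gbpv=1&pg=<page>
        params = {"gbpv": "1"}
        if page:
            params["pg"] = page
        return f"https://www.google.com/books/edition/_/{volume_id}?{urlencode(params)}"
    # Old form: books.google.com/books?id=<id>&pg=<page>
    params = {"id": volume_id}
    if page:
        params["pg"] = page
    return f"https://books.google.com/books?{urlencode(params)}"
```

Calling `gbooks_url("vh8xDQAAQBAJ", "PT207")` yields the old form and `gbooks_url("vh8xDQAAQBAJ", "PT207", new_style=True)` the new one. As noted above, each generated URL would still have to be tested individually, since not every link survives a format change.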

spam black list and archive urls
There is a discussion:. Apparently, the spam blacklist can be triggered by a url embedded in an archive.org snapshot url (and presumably in other archive urls that include the original url). This presents a problem to editors who try to fix cs1|2 template citations. One solution described at the aforementioned discussion is to percent encode the original url in the archive url; this:
 * https://web.archive.org/web/20091002033137/http://www.example.com/

becomes this:
 * https://web.archive.org/web/20091002033137/http%3A%2F%2Fwww.example.com%2F

I have hacked on Module:Citation/CS1/sandbox and implemented this solution. Here for url and title:

and here for chapter-url and chapter:

This code looks for the original url (url) in the archive url (archive-url). If found, the archive url is split at the beginning of the embedded original url. The embedded original url is then percent encoded and the two parts rejoined to make a new archive url. The same is true when chapter and chapter-url are set, and unfit (or ).
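
The split, encode and rejoin steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the sandbox module's code (which is Lua): the function name `encode_embedded_url` is hypothetical, only the first occurrence of the original url is handled, and every reserved character is encoded so that the result matches the example given earlier.

```python
from urllib.parse import quote


def encode_embedded_url(archive_url: str, original_url: str) -> str:
    """Percent-encode original_url where it is embedded in archive_url.

    Returns archive_url unchanged when original_url is empty or not found.
    """
    if not original_url:
        return archive_url
    start = archive_url.find(original_url)
    if start == -1:
        return archive_url
    end = start + len(original_url)
    # safe="" makes quote() encode ':' and '/' as well as other reserved characters
    encoded = quote(original_url, safe="")
    # Rejoin: untouched prefix + encoded original url + any trailing remainder
    return archive_url[:start] + encoded + archive_url[end:]
```

With the example urls above, `encode_embedded_url("https://web.archive.org/web/20091002033137/http://www.example.com/", "http://www.example.com/")` gives `https://web.archive.org/web/20091002033137/http%3A%2F%2Fwww.example.com%2F`.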

For now this applies to all 'unfit' and 'usurped' urls. Presuming we keep this, I wonder if we ought not have another keyword for url-status; perhaps. A separate maintenance category might also be in order.

Keep? Discard? Opinions?

—Trappist the monk (talk) 17:00, 3 October 2019 (UTC)


 * I think this is as much an acceptable solution as any, at least as long as archive services do not disallow percent-encoding referrals for whatever weird reason. A social rather than technical issue may arise from editors who may wonder why a blacklisted url displays in the first place. 72.43.99.130 (talk) 18:37, 3 October 2019 (UTC)
 * ... editors who may wonder why a blacklisted url displays in the first place. I think that's not an issue because the title is not linked to the blacklisted url but to a (presumably) good snapshot of the website page before it was blacklisted.  I presume here that the editor who chose the archive url did so in good faith and that the archived source does, indeed, support the Wikipedia article's text.  I suppose that the argument might be made that a blacklisted url is a blacklisted url whether it's archived or not.  Still, to your point, using unfit or usurped disables the link to the original url in the rendered citation.
 * —Trappist the monk (talk) 19:12, 3 October 2019 (UTC)

Never mind. I have reverted this change per the linked discussion.

—Trappist the monk (talk) 22:30, 3 October 2019 (UTC)


 * Regarding this:
 * I think that shouldn't be an issue. We should distinguish between these two cases:
 * The url (or domain) was always malware/spam; it was never suitable for a reference, and still is not.
 * The url (or domain) started off as a good source, but is malware/spam now.
 * One strength of having an archive in the first place, is that it can help us deal with case #2, and provide a good copy of an url back before it changed. This may be an argument for different handling of the two cases above, which may imply different values for.
 * I am not certain what your expectations were about how editors should employ the values unfit and usurped, given that the CS doc for  has little to say about them. But we could, I suppose, assign (or reassign) the usurped value to case #2: that is, "The url was good once (and the archive may still retain a copy), but it isn't good anymore", which goes along with one set of display possibilities including a displayable  . That might leave unfit to cover case #1, with a different set of display characteristics (including forbidding  , if it was always bad). Or, if that's not what you intended unfit to be, then perhaps some new value (forbidden, blacklist, or whatever) to indicate that this was never a usable url and the   should be suppressed if there is one.
 * Whatever the case (and even if nothing changes wrt to those two values), the documentation should be updated to clearly explain these two values, and how they should be used. I'm okay with not having it updated now, especially if the usage or meaning of these values is in flux, but once things shake out, there should be a clear and thorough explanation. (If you want help editing some doc for it when the time is right, feel free to issue a request on my Talk page, and I'll be happy to help.) Mathglot (talk) 02:44, 5 October 2019 (UTC)


 * Original discussions about parameter values  and   are at:
 * Neither of those discussions consider blacklisted urls.
 * There were subsequent discussions with regard to parameter values:
 * – mentions blacklisted urls
 * With regard to your statement:
 * The url (or domain) was always malware/spam; it was never suitable for a reference, and still is not.
 * It has been pointed out that percent-encoding the original url in an archive url may be used to mask a site that has always been malicious. That is also true of archive sites that support url shortening – create an archive copy of the malicious site at archive.today, use the shortened url to avoid the blacklist (until one of the bots that lengthens shortened urls arrives to lengthen it). As an aside, when these lengthening bots attempt to save an article that now has a blacklisted url embedded in an archive url, what happens?
 * I suppose that when archive urls link to malicious archives, the whole archive url can be blacklisted (presumably with sufficient flexibility that such blacklisting catches all archive urls regardless of timestamp). If there is a specific archive timestamp that can be shown to not be malicious, then an editor could possibly petition whomever does this sort of thing to white-list that particular archive.  The question then becomes, how do we mark such white-listed archive urls?
 * For me, I understand  and   to mean that the url links to:
 * – link farm or advertising or phishing or porn or other generally inappropriate content
 * – new domain owner with legitimate content; original owner with legitimate content unrelated to the originally cited url's content
 * Yep, there is no bright line separating the two but, as can be seen from the original discussions of these parameter values, we struggled to get even these because the waters, they are muddy.
 * And I repeat myself yet again: if you can see how the documentation for these templates can be improved, please do so.
 * —Trappist the monk (talk) 14:34, 5 October 2019 (UTC)
 * - I believe there is a flag to exempt bot accounts from being blocked on save. I prefer to get blocked to manually fix. My bot also decodes encoded schemes in the path/query portion so the filters are not bypassed. IMO re whitelisting, it is often a matter of judgement/opinion and also double jeopardy since the original blacklisting presumably had a consensus discussion, it opens every blacklisted URL up to a new potential consensus discussion. This is a loophole for users to get past blacklists and overhead to manage. -- Green  C  22:10, 5 October 2019 (UTC)
 * – new domain owner with legitimate content; original owner with legitimate content unrelated to the originally cited url's content
 * I assumed to be closer to ? If there is a new, properly registered owner (publisher) did any usurpation take place? 72.43.99.138 (talk) 15:42, 6 October 2019 (UTC)
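
As a rough illustration of the decoding step mentioned above (decoding percent-encoded urls before checking them, so that encoding alone does not bypass a filter), something like the following could be used. This is a sketch under assumptions, not the code of any actual bot or of the spam blacklist: the names `fully_decode` and `hits_blacklist` are hypothetical, and the blacklist is modelled as a plain list of domain substrings rather than the regular expressions a real filter would use.

```python
from typing import Iterable
from urllib.parse import unquote


def fully_decode(url: str, max_rounds: int = 5) -> str:
    """Undo percent-encoding repeatedly, to catch doubly encoded urls."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:
            break  # nothing left to decode
        url = decoded
    return url


def hits_blacklist(url: str, blacklisted_domains: Iterable[str]) -> bool:
    """Return True if any (hypothetical) blacklisted domain appears in the decoded url."""
    decoded = fully_decode(url).lower()
    return any(domain.lower() in decoded for domain in blacklisted_domains)
```

Under these assumptions, an archive url whose embedded original url has been percent encoded is still matched, because the check runs on the decoded text.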




 * I think that these definitions of usurped, unfit, and possibly other values of  need solid, agreed-upon definitions.  Just from the point of view of English usage, never mind specialized wiki vocabulary, usurped is much more like what IP 72 stated.  The sense of a new domain owner with legit content is nothing like most native English speakers would imagine, I don't think, when seeing the word usurped.
 * To me, your definition is a bit more like what would apply to a word like, repurposed, or reassigned, or repositioned or perhaps some word from marketing vocab when one company buys another's superannuated property, if there is such a word. The term usurped does not seem appropriate for the meaning you assume for it. This all needs further airing out, before the spam blacklist wrinkle, which is an edge case of the broader problem, can even be discussed. I have a feeling that there may be a need for at least one, perhaps two more values for  to cover the different meanings that we seem to be alluding to for it, and trying to cram into too few values. Mathglot (talk) 23:12, 9 October 2019 (UTC)
 * Just wanted to be clear about one point: I don't think we need new values, just for the sake of new values; there's no need to distinguish every possible thing that could happen with an url. But, when they should be handled differently by the software, then, yes: we do need values for those cases. When the confusion surrounding the current meanings of usurped and unfit is settled, I suspect we will find that we will need at least one more value, in order to assign it to different handling in the software, and I think the spam blacklist case may be one such example. Mathglot (talk) 23:19, 9 October 2019 (UTC)
 * If you don't like the definitions that I offered above, write better definitions. I did write above: ...as can be seen from the original discussions of these parameter values, we struggled to get even these...  Yeah, we know that these parameter keywords are less than optimal so there is no real need to spend a lot of words telling us what we already know.  Suggest better definitions and / or suggest better keywords.
 * —Trappist the monk (talk) 12:43, 10 October 2019 (UTC)
 * For domain names that are not trademarked, reassigned would be imo a good option to clarify there is a new registrant. Obviously trademarked domains (like say, newyorktimes.com) would not normally lapse, so in these cases usurped would be more accurate. 72.43.99.138 (talk) 13:55, 11 October 2019 (UTC)
 * I agree with 72.43.99.138. — UnladenSwallow (talk) 17:25, 11 October 2019 (UTC)

Dynamic original vs. static archive
I have faint recollection that this may have been covered before. If so forgive my laziness in searching.

The problem is with preemptively archived sources whose originals are subject to update, therefore causing the respective original/archived cited versions to differ. In general terms, any updatable database of information (take your pick) could fall in this category, as the archived snapshot could conceivably become stale with the next update. What to do in this case? Having just edited one such entry (from the Worldcat Identities database) I was forced to use a link note to explain the discrepancy. A native solution imo would be one of the following: 1. Make url-status context-sensitive to archive changes (with an additional option superseded archive or similar) 2. Add a new archive-url-status.

Is there any traction on this? 98.0.246.242 (talk) 02:39, 8 May 2020 (UTC)