Wikipedia:Bots/Requests for approval/WikiTransBot


 * The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Denied.

WikiTransBot
Operator: Tino Didriksen

Time filed: 21:30, Friday September 9, 2011 (UTC)

Automatic or Manual: Manual, but will not make any edits.

Programming language(s): PHP

Source code available: Not finished yet.

Function overview: Once per day, using the API:
 * Fetch all revision numbers for articles edited in the last 24 hours.
 * Fetch the most recent version of each edited article with API action=parse; this comes out to around 60,000 articles per day.
 * Translate the fetched articles into Esperanto and Danish, then store them at http://wikitrans.net/
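
For illustration, a minimal PHP sketch of the daily cycle described above, using the API calls discussed further down in this request; the translateAndStore() function and the contact address in the User-Agent string are placeholders, not part of the actual bot:

 <?php
 // Sketch only: collect titles edited in the last 24 hours, then pull the current
 // rendered HTML of each one via action=parse. Translation/storage is stubbed out.
 $api = 'https://en.wikipedia.org/w/api.php';
 $ua  = 'WikiTransBot/0.1 (http://wikitrans.net/; contact@example.org)'; // placeholder contact address
 
 function apiGet($url, $ua) {
     $ctx = stream_context_create(['http' => ['header' => "User-Agent: $ua\r\n"]]);
     return json_decode(file_get_contents($url, false, $ctx), true);
 }
 
 // 1. Collect unique titles edited in the last 24 hours (namespace 0, edits only).
 $since  = gmdate('Y-m-d\TH:i:s\Z', time() - 86400);
 $titles = [];
 $cont   = '';
 do {
     $data = apiGet($api . '?action=query&format=json&list=recentchanges'
         . '&rcdir=newer&rcstart=' . urlencode($since)
         . '&rclimit=max&rcnamespace=0&rctype=edit' . $cont, $ua);
     foreach ($data['query']['recentchanges'] as $rc) {
         $titles[$rc['title']] = true; // deduplicate: one fetch per article
     }
     $cont = isset($data['continue']) ? '&' . http_build_query($data['continue']) : '';
 } while ($cont !== '');
 
 // 2. Fetch the current rendered HTML and revision id of each changed article.
 foreach (array_keys($titles) as $title) {
     $page = apiGet($api . '?action=parse&format=json&prop=text|revid&page=' . urlencode($title), $ua);
     if (!isset($page['parse'])) {
         continue; // page was deleted or cannot be parsed
     }
     $html  = $page['parse']['text']['*'];
     $revid = $page['parse']['revid'];
     // translateAndStore($title, $revid, $html); // hypothetical wikitrans.net step
 }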

Edit period(s): None. Will not edit anything.

Estimated number of pages affected: None. Will not edit anything.

Exclusion compliant (Y/N): Y

Discussion

 * Was told by User:Wifione to request approval here even if the bot won't be editing, so I did. Tino Didriksen (talk) 21:30, 9 September 2011 (UTC)

It sounds like you're planning to run a fork the wrong way. From Mirrors and forks: "The appropriate way to run a mirror is to download a dump of the compressed 'pages-article' file and the images from http://download.wikimedia.org/, and then use a modified instance of MediaWiki to generate the required HTML, along with above mentioned copyrights information. Please use Articles, templates, image descriptions, and primary meta-pages for mirroring purposes."

Is there any reason you cannot do this instead? Anomie⚔ 00:21, 10 September 2011 (UTC)
 * We did that already, and it remains a fallback. But it takes far too long and doesn't replicate the pages exactly. People were wondering how they could get their edits to show up on our site faster. Doing it the dump way would mean 2-3 months between the time a person edits enwiki and the time the change shows on our site (import dump, render articles, translate, build the Lucene index, etc.), so we really want to do it the rolling-update way. And from what I can see, using the API for it will hit the cache. Tino Didriksen (talk) 10:05, 10 September 2011 (UTC)
 * I've found Robot policy, which appears to have been written by someone in a position to authoritatively write such a document. According to that document, if it's really necessary to get updates more frequently than monthly via database dumps,
 * You may fetch recent changes wikitext using prop=revisions&rvprop=content if you do this very shortly after the edit. You may not do it on a batched daily basis.
 * You may download HTML data for recently changed pages by fetching the standard  URL, without any cookies and including the   HTTP header. This will minimize front-end cache misses. You may not use the API or any other URL scheme to fetch parsed output.
 * Note that, unless you need the apihighlimits permission, neither of these options requires an account, much less an approved bot. And in any case, you must be sure to specify a custom User-Agent string including information by which the sysadmins can contact you. Anomie⚔ 16:59, 10 September 2011 (UTC)
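
For the first of those options, fetching the wikitext of a single just-edited revision could look roughly like the PHP sketch below; the revision id would come from whatever recent-changes feed is being watched, and the example revision id and User-Agent contact address are placeholders:

 <?php
 // Sketch of fetching the wikitext of one revision shortly after the edit occurs,
 // per the first option quoted above.
 function fetchWikitext($revid, $ua) {
     $url = 'https://en.wikipedia.org/w/api.php?action=query&format=json'
          . '&prop=revisions&rvprop=ids|content&revids=' . (int)$revid;
     $ctx = stream_context_create(['http' => ['header' => "User-Agent: $ua\r\n"]]);
     $data = json_decode(file_get_contents($url, false, $ctx), true);
     foreach ($data['query']['pages'] as $page) {
         if (isset($page['revisions'][0]['*'])) {
             return $page['revisions'][0]['*']; // wikitext of that revision
         }
     }
     return null; // revision hidden, deleted, or not found
 }
 
 $text = fetchWikitext(12345678, 'WikiTransBot/0.1 (http://wikitrans.net/; contact@example.org)'); // example revision id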

Technical
From Robot policy: Pulling this out for technical discussion. Will ask on IRC about this stuff. Tino Didriksen (talk)
 * "Some api.php queries are very expensive"
 * I make use of:
 * Loop through the last 24 hours of api.php?action=query&format=json&list=recentchanges&rcdir=newer&rclimit=max&rcnamespace=0&rctype=edit, which takes 3-4 minutes to complete.
 * Once per unique article title, do api.php?action=parse&format=json&page=Title&prop=text OR /Title?action=render&redirect=no, depending on which is cheaper for you. Which is cheaper? (See the comparison sketch after this list.)
 * "Downloading every new revision shortly after the edit occurs is acceptable if you have a strong need for this data."
 * I don't want this data. I have no need for every revision. That would be around 250000 revisions, for only 60000 changed articles, from what I saw. I just want the most recent revision of the articles changed in the last 24 hours.
 * "Downloading every new revision on a batched daily basis is not acceptable."
 * Is that what I would be doing? Mincing words here, but I am not downloading every new revision - only one revision per article. Tino Didriksen (talk) 19:15, 10 September 2011 (UTC)
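
For reference, the two rendered-HTML fetch styles being weighed here, sketched side by side in PHP; this does not answer which one is cheaper for the servers, and the example title and User-Agent contact address are placeholders (the /Title?action=render form is written out as the explicit index.php URL):

 <?php
 // The two ways of getting rendered HTML that are being compared above.
 $title = 'Esperanto'; // example title
 $ua    = 'WikiTransBot/0.1 (http://wikitrans.net/; contact@example.org)';
 $ctx   = stream_context_create(['http' => ['header' => "User-Agent: $ua\r\n"]]);
 
 // (a) API: action=parse returns the article body HTML wrapped in JSON, plus the revision id.
 $apiUrl  = 'https://en.wikipedia.org/w/api.php?action=parse&format=json'
          . '&prop=text|revid&page=' . urlencode($title);
 $parsed  = json_decode(file_get_contents($apiUrl, false, $ctx), true);
 $apiHtml = $parsed['parse']['text']['*'];
 $revid   = $parsed['parse']['revid'];
 
 // (b) index.php: action=render returns bare rendered HTML with no JSON wrapper and
 //     no revision id, so the revision number has to be recovered some other way.
 $renderUrl  = 'https://en.wikipedia.org/w/index.php?title=' . urlencode($title)
             . '&action=render&redirect=no';
 $renderHtml = file_get_contents($renderUrl, false, $ctx);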


 * Re: queries, please let me butt in on this: the former sends you the HTML code encapsulated in JSON, and the latter sends you the rendered article, even including MediaWiki deletion notes and such. Neither will probably do what you want. —  Kudu ~I/O~ 00:38, 11 September 2011 (UTC)
 * They both seem to deliver the article as it is shown to viewers, in HTML. That is what I want. The API seems to give the same output as action=render, but with the revision number added; we also need to store the revision number so we can link to the exact original. Tino Didriksen (talk) 07:50, 11 September 2011 (UTC)
 * Note that the full page HTML also contains the revision number. You can extract it from the setting of the Javascript variable wgCurRevisionId in the page header, or from the "Permanent link", "Cite this page", or "Download as PDF" links in the sidebar. Anomie⚔ 13:44, 11 September 2011 (UTC)
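
A small PHP sketch of that extraction, assuming the page HTML contains the wgCurRevisionId variable described above; the fallback relies on the oldid parameter in the "Permanent link" URL:

 <?php
 // Pull the current revision id out of a full article HTML page.
 function extractRevisionId($pageHtml) {
     // Matches both  var wgCurRevisionId = 12345;  and  "wgCurRevisionId":12345
     if (preg_match('/["\']?wgCurRevisionId["\']?\s*[:=]\s*(\d+)/', $pageHtml, $m)) {
         return (int)$m[1];
     }
     // Fallback: the "Permanent link" in the sidebar carries oldid=<revision id>
     // (the & is HTML-escaped as &amp; in the page source, hence the ; in the class).
     if (preg_match('/[?&;]oldid=(\d+)/', $pageHtml, $m)) {
         return (int)$m[1];
     }
     return null; // not found
 }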


 * I guess I don't fully understand the point of the website in general, since anyone can already get machine-translated versions of any WP page from http://translate.google.com, and that doesn't require maintaining a massive real-time mirror of the entirety of article space. But anyway, can't you set up your website so that when someone requests a translated version of an article, it immediately goes and fetches that one specific article, translates it, and then displays it to the user? It might add an extra second or two to the whole process, but it sure seems a lot more bandwidth-efficient and always guarantees that they're looking at the very latest version of the article (see the sketch after these comments). —SW— confabulate 00:58, 29 September 2011 (UTC)
 * "Mincing words here, but I am not downloading every new revision - only one revision per article." You realize there are nearly 4 million articles, right? Also, you say you want to download about 60,000 new revisions on an average day; that translates to downloading one article every 1.44 seconds, continuously, 24 hours a day. To me, that seems like a relatively large, unnecessary drain on the servers. —SW— soliloquize 01:05, 29 September 2011 (UTC)
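
A rough PHP sketch of that on-demand approach; translateToEsperanto(), the cache directory, and the one-hour cache lifetime are hypothetical placeholders, not anything wikitrans.net actually uses:

 <?php
 // On-demand variant: fetch, translate, and cache a single article only when a
 // reader asks for it, instead of mirroring everything in advance.
 function getTranslatedArticle($title, $ua, $cacheDir = '/tmp/wikitrans', $ttl = 3600) {
     $cacheFile = $cacheDir . '/' . md5($title) . '.html';
     if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
         return file_get_contents($cacheFile); // recent enough, serve from cache
     }
     $url = 'https://en.wikipedia.org/w/api.php?action=parse&format=json'
          . '&prop=text|revid&page=' . urlencode($title);
     $ctx = stream_context_create(['http' => ['header' => "User-Agent: $ua\r\n"]]);
     $data = json_decode(file_get_contents($url, false, $ctx), true);
     if (!isset($data['parse'])) {
         return null; // page missing or deleted
     }
     $translated = translateToEsperanto($data['parse']['text']['*']); // hypothetical translator
     if (!is_dir($cacheDir)) {
         mkdir($cacheDir, 0777, true);
     }
     file_put_contents($cacheFile, $translated);
     return $translated;
 }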

If you want to mirror, then use a dump or make arrangements with the Foundation for a feed of some sort. Downloading 60000 pages per day via the API is really just excessive. Anomie⚔ 03:59, 14 October 2011 (UTC)
 * The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.