Wikipedia:Wikipedia Signpost/2011-08-15/Technology report

Making Wikimedia more forkable
The question of how easy it is to "fork" Wikimedia wikis, or, indeed, merely to mirror their content on another site, was posed this week on the wikitech-l mailing list by Wikimedian David Gerard. The concept is also related to that of backups, since a Wikipedia fork could provide a useful restore point if several Wikimedia server areas were affected by simultaneous technical failure, such as that caused by a potent hacking attempt.

During the discussion, Lead Software Architect Brion Vibber suggested that the Wikimedia software setup could be easily recreated, as could page content. Instead, he said, the major challenge would lie in "being able to move data around between different sites (merging changes, distributing new articles)", potentially allowing users of other sites to feed back improvements to articles whilst also receiving updates from Wikimedia users. So far, at least one site (http://wikipedia.wp.pl/) has been successful in maintaining a live copy of Wikimedia wikis, lagging behind the parent wiki it tries to mirror by only minutes. No site has yet implemented an automated procedure for pushing edits made by its users upstream to its parent wiki, however. Other contributors suggested that few external sites would have the facilities to host their own copy of images or to keep in line with Wikimedia's strict policy on attribution.

In unrelated news, there were also discussions about making pageview statistics more accessible to operators of tools and apps (also wikitech-l). In particular, the current reliance on the external site http://stats.grok.se to collate data was noted. As MZMcBride wrote, "currently, if you want data on, for example, every article on the English Wikipedia, you'd have to make 3.7 million individual HTTP requests to [the site]".
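To give a sense of the scale MZMcBride describes, the following back-of-the-envelope sketch estimates how long 3.7 million sequential requests would take. Only the article count comes from the quotation; the request rates are illustrative assumptions, not figures from the discussion or limits published by the site.

```python
# Rough estimate of the wall-clock time needed to fetch per-article data
# one HTTP request at a time. Only ARTICLE_COUNT comes from the quoted
# figure; the request rates below are assumptions chosen for illustration.

ARTICLE_COUNT = 3_700_000
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_fetch(requests: int, requests_per_second: float) -> float:
    """Return the number of days needed to issue `requests` sequentially."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return requests / requests_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for rate in (1, 10, 100):  # assumed rates, in requests per second
        days = days_to_fetch(ARTICLE_COUNT, rate)
        print(f"{rate:>3} req/s: {days:.1f} days")
```

At an assumed one request per second, a single pass over every article would take roughly six weeks, which illustrates why a bulk data source would suit tool operators better than per-article requests.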

Uploading was slower than it used to be, but that's fixed, says bugmeister
Although hampered by a lack of data points, anecdotal evidence collected over the past fortnight pointed to a slowdown in the speed of uploading files to Wikimedia wikis. The problem made mass API uploading very difficult, and, as a result, a bug was opened. "An upload that should take minutes is taking hours", wrote one commenter. Another pinpointed Wikimedia servers as the bottleneck: during a test, uploads to the Internet Archive had been over ten times quicker. As it became clear that the problem was affecting a large number of users and data collected seemed to show a dramatic decrease in upload speeds earlier this year, significant resources were devoted to the issue. WMF technicians Chad Horohoe, Roan Kattouw, Sam Reed, Rob Lanphier and Asher Feldman have all worked on the problem.

Once the upload chain was established as "User → Europe caching server → US caching server → Application server (Apache) → Network File System → Wikimedia server MS7", members of the operations team worked to profile where the bottleneck was occurring. Unfortunately, an error introduced by the profiling meant that uploads were in fact blocked for several minutes. Then, on 12/13 August, the problem was pinpointed and fixed: a module intended to help optimise network connections, Generic Receive Offload (GRO), had in fact been slowing them down. According to WMF bugmeister Mark Hershberger, smaller data packets were being collated into much larger ones. The new packets were then too large to be handled effectively by other parts of the network infrastructure. Although there are still some reports of slowness, test performance has increased by a factor of at least three. In the future, more data on upload speed is likely to be collected to provide a benchmark against which efficiency can be tested.
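The effect Hershberger describes can be illustrated with a toy model. The sketch below is not the kernel's GRO implementation; it simply merges consecutive small segments into larger buffers and counts how many of the results exceed the size that a hypothetical downstream component handles well. The segment size, merge ceiling and downstream limit are all assumed values chosen for illustration.

```python
from typing import List


def coalesce(segment_sizes: List[int], max_merged: int) -> List[int]:
    """Merge consecutive segments into buffers of at most `max_merged` bytes.

    A toy stand-in for receive-side coalescing: it tracks sizes only and
    ignores flows, headers and timing.
    """
    merged: List[int] = []
    current = 0
    for size in segment_sizes:
        # Flush the current buffer if adding this segment would overflow it.
        if current and current + size > max_merged:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged


def oversized(packet_sizes: List[int], limit: int) -> int:
    """Count packets larger than `limit` bytes."""
    return sum(1 for size in packet_sizes if size > limit)


if __name__ == "__main__":
    SEGMENT = 1448          # assumed payload of one small packet, in bytes
    MERGE_CEILING = 65536   # assumed maximum size of a merged buffer
    DOWNSTREAM_LIMIT = 1500 # assumed size a downstream hop handles well

    incoming = [SEGMENT] * 100
    merged = coalesce(incoming, MERGE_CEILING)

    print("before merging:", oversized(incoming, DOWNSTREAM_LIMIT), "oversized")
    print("after merging: ", oversized(merged, DOWNSTREAM_LIMIT), "oversized")
```

In this model none of the incoming segments exceeds the assumed downstream limit, whereas every merged buffer does, which mirrors the reported behaviour in which an optimisation on one host caused trouble elsewhere in the chain.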

In brief
Not all fixes may have gone live to WMF sites at the time of writing; some may not be scheduled to go live for many weeks.


 * There was a brief incident on Wednesday where users were being inappropriately identified as mobile users and redirected to the mobile version of Wikipedia following a software deployment (discussion). The deployment was aimed at improving levels of redirection ahead of the launch of an improved mobile browsing experience (set to be trialled later this month). Estimates for the amount of time the redirection was in place stand at around 6 minutes. In unrelated news, WMF Data Analyst Erik Zachte this week upgraded his figure for the percentage of Wikimedia page views originating on mobile devices to fifteen per cent.
 * On the English Wikipedia this week, bots were approved for a number of tasks including mass TfD tagging and tagging valid files as being eligible for transfer to Wikimedia Commons. BRFAs that are still open cover a number of other tasks, including the import of expert comment from an external site.
 * Mark Hershberger has suggested that efforts to get 1.18 released on time had significant "momentum" but that this needed to be sustained to achieve success. The bugmeister explained that while approximately 160 revisions had been reviewed in the last week, another 210 were still left to review (wikitech-l mailing list). The figures include certain core extensions, and are consequently higher than previously published figures, which did not.
 * A MediaWiki hackathon has been announced for 14–16 October. Held in the American city of New Orleans, it will include discussion of Wikimedia Labs (a project that will integrate and extend the functionality available to tool developers) and a bugsmash (wikitech-l mailing list).
 * As is now becoming a regular event, developers reviewed the list of bugs currently marked as "blocking" the 1.18 release, or otherwise proving particularly problematic for users. Those attending noted their thoughts down in an Etherpad collaborative report.
 * A question raised at Wikimania – why the Chinese Wikipedia was getting so much more traffic than it used to – turned out to have a technical answer. The robots.txt file for the Chinese Wikipedia was written in both traditional and simplified Chinese, causing problems for bots from search engines and the like, a Chinese Wikimedian explained.