Wikipedia:Wikipedia Signpost/2021-01-31/Technology report


 * Legoktm is a site reliability engineer for the Wikimedia Foundation. He wrote this in his volunteer capacity.

Wikipedia has seen incredible success thanks to the efforts of thousands of dedicated contributors, and the technical side is no exception. Thanks to everyone from backend and infrastructure developers to translators to bot and script writers, Wikipedia grew from a hobby project sitting on a single server into a robust platform operating out of datacenters on multiple continents.

Erik Möller, former Deputy Director of the Wikimedia Foundation and an early developer, noted in an email, "What always amazed me working in this environment is how many brilliant people found a niche in which they individually made incredibly impactful contributions, and that seems as true today as it ever was."

Starting with trust
At the very beginning, getting server access was "pretty loosey-goosey", Brion Vibber, an early developer and the first employee of the WMF, said in an email. "If you showed up and put in good work helping out, you might well hear 'yes' to getting some fairly direct access to things because that was the only way we were going to get anyone to do it!"

Assume good faith applied here as well. Möller said, "As I recall, in the very early years, these decisions were made based largely on trust by folks like [Jimmy Wales] or the people he had already delegated access to (e.g., [Vibber]), with a high assumption of good faith that the people who showed up to help had no darker motives."

The source code was maintained and developed in a CVS repository on Sourceforge.net. Gabriel Wicke, a developer and later Principal Software Engineer at the WMF, said in an email, "Getting revision control access (CVS at the time) basically was about winning trust with whoever set up accounts, which I strongly suspect was [Vibber]."

Tim Starling, an early developer and current Principal Software Architect at the WMF, said in an email he got CVS access from Lee Daniel Crocker as soon as he said he was interested. "There was no pre-commit review, but the code on the server (just one or two servers back then) was not automatically updated, so commits were theoretically reviewed before they went live," he said.

Getting root (also known as superuser access) was a bit harder.

"I remember there being a ridiculously long and painful period between getting shell access and getting root, like six months," Starling said. "I had read/write access to the database, I could edit the code, I could view the server access logs, but for some reason root was a big deal."

Domas Mituzas, a former system administrator and WMF board member, said in an email, "It took me a bus trip to Berlin and sleeping on a German Wikipedian's couch and meeting everyone (we all were meeting each other for the first time!) that eased everyone into the idea of giving me root."

Scaling up
By 2003, significant changes needed to be made on the software side to accommodate the quickly increasing traffic. At the time, Wikipedia was running on only two servers and was rendering every page view from scratch, Wicke said.

"A then-fancy 64-bit Opteron DB server upgrade helped briefly, until it started crashing," he said. "The site was often down, and it was clear that any further growth would quickly consume whatever hardware had just been added."

After posting a proposal to add caching using Squid, and receiving feedback from Wales, Vibber and Jens Frank, Wicke "...went ahead and prototyped a basic Squid integration with active cache purging, initially on [his] own servers," which would be serving the main site by February 5, 2004.

"There were issues of course, like missing purges for images or transclusions," he said. "Those were fairly quickly resolved or worked around, and there was a lot of tolerance given the preceding phase of poor site availability."
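The active cache purging Wicke describes can be sketched roughly: when a page is edited, the application tells each cache server to drop its stale copy, typically via an HTTP PURGE request. A minimal illustration, with hypothetical hostnames and a pluggable send function so nothing here is presented as the actual setup:

```python
# Sketch of active cache purging for Squid-style HTTP caches.
# Hostnames and paths are hypothetical, for illustration only.

def build_purge_request(host: str, path: str) -> bytes:
    """Build a raw HTTP/1.1 PURGE request for one cached URL."""
    return (
        f"PURGE {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    ).encode("ascii")

def purge_on_edit(cache_hosts, host, path, send):
    """On page edit, ask every cache server to drop its stale copy."""
    for cache in cache_hosts:
        send(cache, build_purge_request(host, path))

# Example: collect the requests instead of sending them over a network.
sent = []
purge_on_edit(
    ["cache1.example", "cache2.example"],
    "en.wikipedia.org",
    "/wiki/Example",
    lambda cache, req: sent.append((cache, req)),
)
```

The key design point is the trade it makes: pages can be cached aggressively because the application takes on the responsibility of invalidating them the moment they change.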

A few weeks later on February 25, Wikipedia was featured in German news program Tagesthemen (watch on YouTube). Watching the live traffic stats, Wicke said they got "...all excited on IRC when the site briefly went from ~25 [requests per second] to around 1,500 without falling over."

More servers
Wikipedia originally ran on servers managed by Bomis, a dot-com startup. Starling said, "Before February 2004, Jason Richey, a Bomis employee, managed the hardware and would occasionally log in and restart things or otherwise try to fix downtime."

Sometimes this involved literally going the extra mile, as he lived in Los Angeles, while the servers were in San Diego. "I remember [Richey] having to drive 4 hours to San Diego to fix downtime caused by a really simple problem, like a broken hard disk," Starling said.

Some tasks required his intervention that today seem unthinkable. "My favorite memory from the very early Bomis days is that if you wanted to upload an image, you emailed a guy named Jason who would helpfully place it on the server for you," Möller said.

In 2004, Wikipedia moved to a datacenter in Tampa, Florida, to be closer to Wales's new home. "I believe Wales helped to rack the first batch of servers in Tampa," Starling said.

A year later, the Board named Mituzas the Hardware Officer, putting him in charge of, as he describes it, placing servers in a shopping cart and then asking Wales to pay for them.

"Instead of ordering servers one by one I tried a more exponential approach (buy 20, then 40, then 80, ...) – and each time we'd land those batches the site would get much snappier and within a few weeks we'd have more users to fill all the capacity," Mituzas said. "We bought cheap servers that needed hands in datacenters to do anything with them, but we had the capacity to survive the growth."

When others in the WMF's leadership wanted to use some funding to pay other bills, he said he pointed out that if the site wasn't up, there wouldn't be any other bills to pay.

"The challenging part of the role was being the first one to grab [Wales] when he got online – so paying attention to IRC notifications was key, otherwise we would not get our servers," Mituzas said.

Growing pains
Between rapid growth in traffic and not having enough technical resources while constantly implementing new features, Wikipedia was down rather frequently.

Mituzas recalled one outage from when nearly all of the developers met up for the first time in Berlin. "Kate, who wasn't at the meeting, deployed Lucene-based search that nobody knew about and thus we were trying to understand why is Java running on our servers and why is it taking everything down."

Other times, developers created contingency plans in response to real world events; in one case it was because of Hurricane Charley.

"I didn't really know much about hurricanes or what to expect, but the local media was talking up the threat," Starling said (he lived in Australia at the time). "There was a risk that power and/or network access would be temporarily lost. We were making off-site backups of our data as if there was a chance of Tampa being flattened like a modern-day Galveston, which was maybe a bit of an overreaction, although I guess it's nice that we were making off-site backups of private data for the first time ever."

In 2005, The Signpost reported how tripped circuit breakers took the site down. It took developers a full day to restore editing after the primary database became corrupt.

"[That] outage caused Wales to quip that downtime is our most profitable product," Starling said.

Changes were often queued up for months at a time before being deployed to wikis in a single release. But eventually the breaking point was hit, Roan Kattouw, a current Principal Software Engineer at the WMF, explained in an email.

"In early 2011, we scheduled a 6-hour window to attempt to deploy the 1.17 upgrade and fix any resulting issues, and decided that if we couldn't fix things within those 6 hours, we would roll back to 1.16," Kattouw said. "This time window was from the morning to the early afternoon for me in the Netherlands, from the late afternoon into the evening for [Starling] in Australia, and from the late evening into the night for our US-based colleagues. The first two times we tried it, a lot went wrong, the site went down at times and we had major issues that we couldn't fix quickly enough, so we rolled back."

It took about three tries for them to get it working "successfully," he said.

"For some definition of success, that is: the site was up and was stable, without any critical issues. After the window ended, I spent the rest of the day fixing various outstanding issues while the others slept, and I passed the baton to Tim when he woke up," Kattouw said. "One of the issues I remember us encountering was all redirects being broken on French-language wikis. Today, that kind of issue would be considered a major problem and a train blocker, but that day it was so far down the priority list that we left it broken for about 12 hours."

Soon after, developers began working on "heterogeneous deployment", allowing for progressive deployments.

"This way we can deploy a new version to only a few small wikis first, and work out the kinks before deploying it to larger wikis," Kattouw said. "We were able to accelerate this over time, and nowadays the deployment train runs every week, with major wikis getting new changes only two days after the first test wikis get them."
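The staged rollout Kattouw describes can be modeled as ordered groups of wikis, each receiving the new version on a later day. The group contents below are illustrative (loosely following the real group0/group1/group2 scheme, but not an actual configuration):

```python
# Sketch of a progressive ("train") deployment: wikis are split into
# ordered groups, and a new version reaches each group a day later
# than the previous one. Group membership here is illustrative only.

ROLLOUT_GROUPS = [
    ["test.wikipedia.org", "mediawiki.org"],      # day 0: test wikis
    ["commons.wikimedia.org", "wiktionary.org"],  # day 1: smaller projects
    ["en.wikipedia.org", "de.wikipedia.org"],     # day 2: large Wikipedias
]

def version_for(wiki: str, days_since_start: int, new: str, old: str) -> str:
    """Return which version a wiki runs N days into the rollout."""
    for day, group in enumerate(ROLLOUT_GROUPS):
        if wiki in group:
            return new if days_since_start >= day else old
    raise KeyError(wiki)
```

For example, one day into a rollout the test wikis already run the new version while the large Wikipedias still run the old one, which is what lets problems surface where the blast radius is small.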

Expanding functionality
Wikipedia originally ran on the UseModWiki engine, which was written in Perl. Magnus Manske, a biochemistry student at the time, wrote a new wiki engine in PHP to allow for adding more Wikipedia-specific functionality. The "PHP script", as it was known, added features like namespaces, user preferences, and watchlists. It would be officially named "MediaWiki" when it was rewritten by Lee Daniel Crocker.

Other features taken for granted today like an autogenerated table of contents and section editing were controversial when initially introduced.

"As I recall the table of contents feature was a bit more contentious (no pun intended), mostly because of the automagic behavior (hence flags like __NOTOC__ were created to control it)," Möller said. "With section editing, the first visual design was a bit cluttered (and of course there were still kinks to iron out, e.g., in the interaction with protected pages), but I think most people could fairly quickly see the appeal."

In other cases, editors forced developers to add features to the software. Carl Fürstenberg, a Wikipedia administrator, created the Qif template, which allowed for conditional logic in templates.

"At one point I realized that the parameter expansion logic could be 'misused' to create a way to inject boolean logic into the at-the-time limited template syntax, which could be helpful for creating more generic templates which didn't have to call chains of helper templates to create the same as it was before," he said in an email. "Thus I created Qif, Switch, and different boolean templates."

The developers weren't pleased. Starling wrote to the wikitech-l mailing list that he "...caved in and [had] written a few reasonably efficient parser functions...[that] should replace most uses of [Qif], and improve the efficiency of similar templates."
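The conditional logic at stake is simple to see outside wikitext: an {{#if:}}-style parser function picks one branch depending on whether its test string is non-empty (MediaWiki ignores surrounding whitespace in the test). A Python sketch of the idea:

```python
# Python sketch of what an {{#if:...}}-style parser function does:
# pick a branch based on whether the test string is non-empty.
# (MediaWiki treats a whitespace-only test as empty.)

def parser_if(test: str, then: str, otherwise: str = "") -> str:
    return then if test.strip() else otherwise
```

The template-based Qif achieved the same effect indirectly, by abusing default values of undefined template parameters, which is why a native parser function was far cheaper for the servers to evaluate.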

Fürstenberg said he never expected Qif to be used so widely. "I think I first realized it had become widely used when it had to be protected as any edit to it halted Wikipedia for a while," he said.

In his 2006 mailing list post, Starling blamed the 2003 introduction of templates and the MediaWiki namespace and said he didn't understand "what a Pandora's box" he opened. But that functionality was key to enabling one of MediaWiki's greatest strengths: localization, allowing users to use the software in their preferred language.

Niklas Laxström, the founder of translatewiki.net and a WMF Staff Software Engineer, said in an email he originally submitted translations via Bugzilla, and then worked up his courage to ask Vibber to deploy them for him, sometimes breaking the Finnish Wikipedia because he forgot a semicolon.

"It was no wonder then, that many opted to do translations in the Wikipedias themselves using Special:AllMessages. There was no risk of syntax errors and changes were live immediately, as opposed to potentially taking many months as deployments were few and far between," Laxström said. "By the way, this is a unique feature; I have not seen other websites which allow translation and customization of the whole interface by the users of the site using the site itself."

Scratching his own itch, Laxström started modifying Special:AllMessages to make translation easier, but didn't feel those changes were acceptable to go back into MediaWiki, so he hosted them on his own wiki. Today, nearly all localization of Wikipedia's interface is done via translatewiki.net, rather than on individual wikis.
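The on-wiki customization Laxström describes rests on a layered lookup: a wiki-local override (a page in the MediaWiki namespace) wins over the shipped translation for the wiki's language, which in turn falls back to English. A rough sketch of that resolution order, with the dictionaries standing in for the real message stores:

```python
# Sketch of layered interface-message lookup, mirroring the order in
# which MediaWiki resolves a message:
#   local on-wiki override -> shipped translation -> English fallback.

def resolve_message(key, local_overrides, translations, english):
    if key in local_overrides:   # edited on-wiki (MediaWiki namespace)
        return local_overrides[key]
    if key in translations:      # shipped translation for the language
        return translations[key]
    return english[key]          # final fallback
```

The fallback chain is what makes partial translations workable: an untranslated message degrades to English instead of disappearing.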

He credits Raimond Spekking with having managed MediaWiki's localization process for over a decade now.

"[Spekking] checks the changes to mark the translations outdated where necessary, he renames messages and performs other maintenance activities. He exports translation updates multiple times per week," Laxström said. "He does this so well that it can feel like magic."

Divesting power
Early versions of Wikipedia's software gave immense power to developers. Only developers could block users, promote new administrators, rename users, and so on.

Seeing this as a problem, in 2004 Starling wrote an email to the Wikipedia-l mailing list titled "Developers should mind their own business", proposing that certain user rights be split into a separate group.

Today, Starling describes that shift in power as having been a big deal at the time. "I was very conscious of the fact that I was designing a social system," he said. "As you can guess from that email, I was uncomfortable with the fact that the power to do so had somehow fallen to me, but I wanted to get it right."

He credits Sunir Shah, the founder of MeatballWiki, for discussing "...that change with me at length, as well as other changes at the interface of social policy and technical design."

It's unclear how much lasting impact this change had, given the WMF's rise to the top of the Wikimedia power structure, in large part because it controls the majority of developers and servers. In 2014, Möller instituted "superprotect", which allowed the WMF to protect a page from even administrators editing it.

"[Möller's] idea was that it would be used in cases of conflict between the Foundation and the community, as a softer alternative to de-sysopping," Starling said. "When that conflict came, [Möller] asked me to make the necessary group rights changes. I said that I was uncomfortable putting my name to that on the wiki, so he found someone else to press the button."

Starling has a simple conclusion as to why the WMF has risen to the top of the power structure: Wikipedia lacks leadership.

"I would like to see an elected editorial board with the mandate and courage to make major policy changes," he said. "Without such a body, WMF necessarily fills in the power vacuum, although it is too timid to do so effectively, especially on any question relating to the content."

"The most controversial move in Wikipedia history"
Derek Ramsey originally wanted to create an article about a town he knew, but couldn't come up with anything more than a couple of sentences.

"So I came up with a solution: find a public domain mass data set that allowed a (somewhat) useful stub article to be created," he said in an email. "I've always had an interest in mass data processing, so this was something I knew I could do. I imported a number of census data tables into a MySQL database running on my own Linux computer. I correlated the data with other geographic data sources."

After cleaning up and performing other validation steps, Ramsey said he generated more than 3,000 text files for articles about United States counties and began adding them to Wikipedia by hand.

"This was extremely tedious and slow, but effective," he said. "However, there were 33,832 cities, and that would have taken an order of magnitude longer to complete."

He first wrote a Java program to read each article and make HTTP requests to post it to Wikipedia, later coding "... in features like error checking, error correction, throttling, pauses for human verifications, and other features."
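Ramsey's description, posting each generated article over HTTP with throttling and pauses, corresponds to a loop like the following. This is a sketch in Python rather than his original Java, and the function names, delay, and callbacks are illustrative, not his actual code:

```python
import time

# Sketch of a throttled mass-upload loop, after Ramsey's description.
# The post callback, delay, and structure are illustrative assumptions,
# not Rambot's real implementation.

def upload_articles(articles, post, delay_seconds=10, sleep=time.sleep):
    """Post each (title, text) pair, pausing between requests."""
    posted = []
    for title, text in articles:
        post(title, text)     # e.g. an HTTP POST to the wiki's edit form
        posted.append(title)
        sleep(delay_seconds)  # throttle so the site isn't overloaded
    return posted
```

Injecting the sleep function keeps the loop testable; in production it would simply default to `time.sleep`, spacing requests out so tens of thousands of edits don't arrive at once.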

The uploads increased the number of articles in Wikipedia by 40%; Andrew Lih later called it "the most controversial move in Wikipedia history" in his 2009 book The Wikipedia Revolution.

"I was bold and ignored all rules. You could still do that back then," he said. "After all, if I could edit articles manually, what difference did it make if I did the same thing automatically? It saved me time, but the end result was identical."

Out of all the controversy around mass article creation came two key things that Wikipedia still uses today.

First, Ramsey created a citation template, which most references on Wikipedia today use (and which was featured in xkcd).

"I wanted a generic way for Wikipedians to cite their sources easily, since prior to this the only citations made were manual and inconsistent," he said. "This was necessary because it was before the devs created native reference support in the Wikimedia software."

Second, Ramsey worked with other Wikipedians to develop an early bot policy. The initial version contained the contradictory statement, "In general bots are generally frowned [upon]."

"I thought the concern was overblown, but the consensus was demanding that something be done to address the perceived issues," Ramsey said. "It was my desire to get ahead of the issue before draconian measures shut it all down, so I created bot policy as a sort of compromise. I figured it was better to do that than to have all bots banned wholesale."

Soon after, users started running bots and scripts under administrator accounts, dubbed "adminbots", to significant controversy.

"There was a lot of hysteria surrounding adminbots on the English Wikipedia but a few people quietly ran them, as far back as like 2005," Max McBride, a bot operator who previously ran adminbots, said in an email. "Some of these scripts were admittedly kind of terrifying and there weren't as many tools to mass undo unintentional damage."

McBride described people's attitudes on adminbots as "beyond reason" and suggested they were based on some sort of jealousy. "Like a random script gets admin rights and an admin gets two admin accounts, but not lots of regular users," he said. "I think that bred and fed some opposition."

Strong security
Unlike many other prominent websites, Wikipedia hasn't suffered an embarrassing public security incident that exposed its users' private data. Some of this is because it collects little private data in the first place, but there has also been a strong culture of focusing on security since the beginning.

"In the early days, the worst case scenario was irreversible destruction of large amounts of user work, since we didn't have the resources to make frequent backups," Starling said. "I spent a lot of time doing security reviews, and informed by that work, I wrote policies and improved our APIs and conventions."

He also credited Vibber with making key policy decisions for other MediaWiki installations (disabling uploads by default and having a non-web writable source tree), that ensured MediaWiki didn't become a "...constant source of botnet nodes like some other PHP web applications."

But until native HTTPS support was rolled out as an opt-in option in 2011, readers and editors who wanted to visit the site securely had to use a special secure.wikimedia.org gateway.

Then in 2013, whistleblower Edward Snowden revealed that the NSA was targeting Wikipedia users visiting the site over the default, unencrypted HTTP protocol.

Ryan Lane, a former WMF Operations Engineer, said in an email the Snowden leaks made switching to HTTPS by default a priority. "We knew some governments were spying on their users (the great firewall of China was well known for this, and they were sharing this tech with other governments), but the Snowden leaks showed that the government was explicitly targeting Wikipedia users," he said.

Kattouw worked on the MediaWiki side of the HTTPS change, allowing for protocol-relative URLs to be used. "I think the [Site Reliability Engineering] people who worked on the HTTPS migration deserve more credit," he said. "That was a much more difficult migration than many people thought."
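Protocol-relative URLs let the same generated (and cached) page work over both HTTP and HTTPS: links are written without a scheme, so the browser reuses whichever protocol the page itself was loaded with. A small sketch of the rewriting involved:

```python
# Sketch of protocol-relative URL rewriting: drop the scheme so one
# cached page works over both HTTP and HTTPS.

def to_protocol_relative(url: str) -> str:
    """Turn http(s)://host/path into //host/path; leave others alone."""
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return "//" + url[len(scheme):]
    return url
```

This mattered because Wikipedia's page cache would otherwise have needed separate HTTP and HTTPS copies of every page, or a rewrite of links at serve time.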

The politics involved in making the switch weren't the regular WMF-versus-community ones; they were actual global politics.

"For example, Russian Wikipedia asked us to implement HTTPS only (for all users, not just signed-in users) as soon as possible, as they wanted to head off Russian legislation that would have enabled per-page censorship, and it would have forced the government to choose between blocking all of Wikipedia, which was politically difficult, or dropping their aim at per-page censorship," Lane said. "This is why Russian Wikipedia got support before any other wiki (and more extensive support, at that). Chinese Wikipedia, on the other hand, asked us to delay rollout, as the Chinese government was already doing per-page censorship, and had previously blocked all of Wikipedia a number of times."

There's one large exception to this focus on security: the ability for users to create custom scripts and styles and share them with other users, on the wiki. In web development, this is typically known as a cross-site scripting vulnerability, but for Wikipedia it was a feature.

Fürstenberg created one of the most popular user scripts, Twinkle. He said it started as a helper for himself to "...conduct anti-vandalism and maintenance easier, from the point of reverting quickly, to the tedious task of filing reports to different sections. It pretty much boiled over from there."

Looking back, Vibber thinks the idea of user scripts is great, but implemented incorrectly. He said there are two primary problems:
 * 1) Running someone else's malicious code can lead to your account being taken over.
 * 2) Script code accesses internal data and methods that aren't going to stay stable, potentially breaking over time.

"Both can be solved by using a sandboxed environment (probably a suitable iframe)," Vibber said. "I think there's a lot of cool stuff that can be built on top of this method, with full-on APIs for accessing an editor state as a plugin, or whatever."

Missed opportunities
At Wikimania 2012 and then in a Signpost op-ed, then-WMF Senior Designer Brandon Harris presented the "Athena Project", outlining a vision for what Wikipedia should look like in 2015.

Suffice it to say, that vision was never fully implemented, and Harris said in an email he could write a book as to what went wrong. "I'd say the primary reason was the fact that [the] Foundation had a stellar lack of focus and a muddled leadership direction which allowed for lower-level political infighting to thrive," he said.

The reaction to Harris's proposal was generally mixed to negative, but that's what Harris was hoping for. "A thing a lot of people – even professional designers – don't understand about the design process is that only 10% of it is actually 'designing,'" he said. "Most of it is marketing: You have to understand the market you're designing for to know what to design and you have to convince folk that your design solves the problem. You may have to sell the idea that the problem even exists!"

Part of the purpose of his proposal was an exercise in re-examining the entire interface, something he said neither the WMF nor community do enough of. "Look at what happens, every day! Nothing has changed since 2015," Harris said. "The Foundation still doesn't know how to sell its ideas and it keeps trying to fix the same problems with the same tepid changes to the toolchain. The community still doesn't know how to govern itself and still keeps using the same broken processes to inadequately deal with the same issues."

McBride wrote an op-ed response to Harris, titled Wikimedians are rightfully wary, expressing concerns about previous software deployments that didn't live up to their promise, like FlaggedRevs, which was supposed to solve the BLP problem.

"It wasn't a proposed solution as much as it was the only 'solution,'" he said. "And a lot of people had pinned their hopes on it being successful, but I was more interested in it failing fast so we could move on and try other solutions."

After various trials, and years of RfCs, Flagged Revisions (now rebranded on the English Wikipedia as "Pending Changes") is barely used on BLP pages, unmaintained and no longer enabled on new wikis. (It's worth noting that some communities like the German Wikipedia view it as a success.)

"The BLP problem is definitely not fixed," McBride said. "And there's an enormous gap between current tech and what could be implemented to alleviate the problem."

In his op-ed, McBride questioned whether upcoming projects like VisualEditor would end up with a similar fate as FlaggedRevs. As it turned out, the rollout of VisualEditor and Media Viewer a few months later were both extremely controversial among Wikipedians (not to mention the separate but related issue of superprotect), something that Möller acknowledges in hindsight.

"In both cases, a more gradual rollout (probably adding at least 1–2 years to the release timeline for VE, and 6–12 months for MV) could have prevented a lot of pain and frustration," Möller said. "I take on my share of responsibility for that."

Both Möller and McBride independently brought up the same saying: "change moves at the speed of trust" (Möller credited Lydia Pintscher for teaching it to him).

"For that principle to work in practice, an organization has to be prepared to let go of overly rigid timelines and commitments, because its commitments must always first and foremost be to the people whose trust it seeks to earn and keep," Möller said. "That doesn't mean it's impossible to make radical, transformative changes, but it can certainly feel that way."

McBride put it more bluntly. "Wikimedians don't like shitty software, they quickly embrace good software (think @pings or mass messages or...)," he said. "A lot of software is bad and is imposed on the communities without consultation or input. Of course people will dislike that and reject it."

Harris doesn't disagree. "...I think the primary reason is that editors are rightly concerned about impacts to their workflows, and the Foundation has been historically terrible about thinking about this and accounting for it," he said. "This is why I designed the New Pages Feed to work independently of the existing workflows and scripts that people had developed themselves."

Recognition
Early on, Wales recognized key development milestones by giving developers their own holidays: Magnus Manske Day (January 25), Tim Starling Day (October 31) and Brion Vibber Day (June 1).

"It isn't really clear who gets the credit now – whenever you step away, very few people remember what you did," Mituzas said. "Being recognized and rewarded by the community was definitely part of the motivation to keep on working."

Mituzas himself is remembered on the blame wheel, where he's responsible for 25% of Wikipedia's problems. "Sometimes it feels that the blame wheel is the only part that is left of any fame I had," he said.

Harris is likely the best known Wikimedia developer, having appeared on fundraising banners in 2011.

"We had three 'storytellers' who interviewed a lot of us about why we were working there and they liked what I had to say. They took photos," he said. "Later one of them ended up being used as a test and performed fairly well. This became popular and weird because the internet is weird."

Unsurprisingly, there's a direct parallel to how credit operates on Wikipedia itself.

"It's always fascinated me how much wiki editing has mirrored open source software contributions," McBride said. "In both, a lot of people making small suggestions and improvements are the ones who push the project forward."

Some of those names can be found on Special:Version or in the credits. Others might be in mailing list archives, forgotten bugs and long lost IRC logs, but their contributions nonetheless built Wikipedia into what it is today.