User:Matt Crypto/RandomArticles

While Wikipedia has a Random page feature, it selects pages uniformly at random from the database. As an alternative, I wrote a script that chooses pages at random, weighted by their hit counts for a month; such a set might give a more representative picture of how Wikipedia looks to visitors. The hit data for, say, September 2004 can be found here (warning: very large file). Below is an example drawn from the hits so far this month (to 22 September 2004). If you would like a set, just send me a message naming a Wikipedia page, and I'll run the script for you and paste in the output. &mdash; Matt 15:06, 21 Sep 2004 (UTC)
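To make the difference between the two kinds of selection concrete, here is a minimal sketch. It is not the script itself: it uses the standard library's `random.choices` in place of the script's own selection loop, and the three-entry `hit_counts` table and the helper names `pick_uniform` and `pick_weighted` are made up for illustration (the hit counts are copied from the list on this page).

```python
import random

# Illustrative data only: three (article, hits) pairs from the list on this page.
hit_counts = [
    ("Beslan hostage crisis", 14675),
    ("Snake", 870),
    ("Ambisonics", 4),
]

def pick_uniform(pairs, k, rng):
    """Every article is equally likely, as with the Random page feature."""
    return [rng.choice(pairs)[0] for _ in range(k)]

def pick_weighted(pairs, k, rng):
    """Each article is drawn with probability proportional to its hit count."""
    names = [name for name, _ in pairs]
    weights = [hits for _, hits in pairs]
    # random.choices samples with replacement, so repeats are possible.
    return rng.choices(names, weights=weights, k=k)

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
uniform_sample = pick_uniform(hit_counts, 1000, rng)
weighted_sample = pick_weighted(hit_counts, 1000, rng)

# Compare how often the least-visited article turns up in each sample.
print(uniform_sample.count("Ambisonics"), weighted_sample.count("Ambisonics"))
```

In the uniform sample the rarely visited article shows up about as often as the popular one, while in the weighted sample it almost never appears, which is roughly what a typical visitor would see.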

100 randomly-selected articles (weighted by popularity)

 * Embryophyte &mdash; (51 hits)
 * CRAF &mdash; (36 hits)
 * Congressional committee &mdash; (35 hits)
 * Snake &mdash; (870 hits)
 * Linux distribution &mdash; (687 hits)
 * Plate Carrée Projection &mdash; (20 hits)
 * Place Manner Time &mdash; (77 hits)
 * Sestertius &mdash; (124 hits)
 * Stargate SG-1 &mdash; (661 hits)
 * Moorhead, Minnesota &mdash; (24 hits)
 * MOHAA &mdash; (26 hits)
 * Ian Stuart &mdash; (6 hits)
 * Readelf &mdash; (42 hits)
 * Sidney James &mdash; (38 hits)
 * Jacques Derrida &mdash; (808 hits)
 * Edgar Degas &mdash; (1668 hits)
 * Strategic bombing &mdash; (270 hits)
 * The Kingston Trio &mdash; (77 hits)
 * Zoophilia &mdash; (11612 hits)
 * United States Senate &mdash; (2472 hits)
 * Women's Social and Political Union &mdash; (73 hits)
 * Prostaglandin &mdash; (596 hits)
 * Painters &mdash; (208 hits)
 * Archeology of Algeria &mdash; (16 hits)
 * Nyota &mdash; (10 hits)
 * Nikkei Index &mdash; (17 hits)
 * Norway &mdash; (2809 hits)
 * Coefficient &mdash; (86 hits)
 * Chinese mantis &mdash; (46 hits)
 * Triple &mdash; (107 hits)
 * Minor characters from The Hitchhiker's Guide to the Galaxy &mdash; (780 hits)
 * History of Seattle &mdash; (79 hits)
 * Dawes Rolls &mdash; (66 hits)
 * John Stewart &mdash; (40 hits)
 * Puberty &mdash; (1573 hits)
 * Electrical resistance &mdash; (806 hits)
 * Sophia &mdash; (225 hits)
 * Hydroponic (album) &mdash; (5 hits)
 * Biafran War &mdash; (218 hits)
 * Halloween documents &mdash; (170 hits)
 * Squad Automatic Weapon &mdash; (47 hits)
 * Carl Wayne &mdash; (210 hits)
 * British Forces Germany &mdash; (53 hits)
 * Beslan hostage crisis &mdash; (14675 hits)
 * Craigieburn &mdash; (12 hits)
 * Spot (Star Trek) &mdash; (131 hits)
 * Smart (automobile) &mdash; (941 hits)
 * Microscope &mdash; (3498 hits)
 * Time value of money &mdash; (117 hits)
 * George Jackson &mdash; (50 hits)
 * Clarence &mdash; (21 hits)
 * Communication with submarines &mdash; (789 hits)
 * Macaulay Culkin &mdash; (597 hits)
 * Jade Emperor &mdash; (194 hits)
 * Jimbo Wales &mdash; (514 hits)
 * Round Table &mdash; (146 hits)
 * Arizona State University &mdash; (606 hits)
 * List of regions of the United States &mdash; (1173 hits)
 * King's College, Cambridge &mdash; (165 hits)
 * Rhythmic gesture &mdash; (17 hits)
 * Longest word in English &mdash; (1405 hits)
 * Condorcet method &mdash; (806 hits)
 * Total Recall &mdash; (214 hits)
 * Shawn Michaels &mdash; (334 hits)
 * Conjunction fallacy &mdash; (142 hits)
 * 2004 Summer Olympics medal count &mdash; (1747 hits)
 * Pizza &mdash; (695 hits)
 * Ambisonics &mdash; (4 hits)
 * Paul Neil Milne Johnstone &mdash; (212 hits)
 * HMS Albion (1802) &mdash; (4 hits)
 * Contagious magic &mdash; (5 hits)
 * Phase velocity &mdash; (124 hits)
 * IWW &mdash; (120 hits)
 * Vegetarian &mdash; (355 hits)
 * Schlong &mdash; (26 hits)
 * Auschwitz Album &mdash; (3970 hits)
 * GameFAQs &mdash; (1317 hits)
 * Meteorology &mdash; (554 hits)
 * Connotation &mdash; (537 hits)
 * Oral sex &mdash; (7430 hits)
 * 1969 &mdash; (1749 hits)
 * Nucleic acid &mdash; (452 hits)
 * Alcohol &mdash; (1846 hits)
 * Uluru &mdash; (376 hits)
 * EMac &mdash; (136 hits)
 * Montagu Island &mdash; (30 hits)
 * Black Panther &mdash; (153 hits)
 * Orlando Letelier &mdash; (192 hits)
 * Godwin's law &mdash; (6776 hits)
 * Tybee Bomb &mdash; (2609 hits)
 * Spaced &mdash; (78 hits)
 * BAC 1-11 &mdash; (61 hits)
 * 1974 in film &mdash; (234 hits)
 * Relational model &mdash; (609 hits)
 * Property &mdash; (508 hits)
 * Glasgow &mdash; (704 hits)
 * Nicotine &mdash; (408 hits)
 * Rear Window &mdash; (177 hits)
 * Texas Air National Guard controversy &mdash; (166 hits)
 * Football World Cup 1974 &mdash; (85 hits)

Script
import re
from random import random

logFile = "/tmp/url_200409.html"
maxEntries = None  # e.g. 10000 to stop reading early
numberOfArticles = 100

# Matches lines of the form: <hits> <n>% <n> <n>% /wiki/<name>
r1 = re.compile(r'^(\d*)\s*([0-9.]*)%\s*([0-9]*)\s*([0-9.]*)%\s*/wiki/(\S*)$')


class ArticlePicker:
    def __init__(self, logFile, maxEntries=False):
        self.logFile = logFile
        self.hitList = []
        self.count = 0
        self.hitSum = 0
        self.maxEntries = maxEntries

    def readLogFile(self):
        F = open(self.logFile)
        count = 0
        self.hitSum = 0
        for l in F:
            # Stop once maxEntries articles have been accepted
            if self.maxEntries and count >= self.maxEntries:
                break
            try:
                hits, name = self.parseLine(l)
            except ValueError:
                continue  # Skip lines that do not parse or are filtered out
            count = count + 1
            self.hitList.append((hits, name))
            self.hitSum += hits
        self.count = count
        F.close()
        # Most popular articles first
        self.hitList.sort()
        self.hitList.reverse()

    def parseLine(self, line):
        l = line.strip()
        m = r1.match(l)
        if m is None:
            raise ValueError("No matches found")
        (hits, t1, t2, t3, name) = m.groups()
        self.filterOut(hits, name)
        spaceName = re.sub('_', ' ', name)
        return int(hits), spaceName

    def filterOut(self, hits, name):
        if name == "": raise ValueError                    # Exclude blank
        if re.match(r'^\w*:', name): raise ValueError      # Exclude namespaces
        if re.match(r'Main_Page', name): raise ValueError  # Exclude main page
        # Exclude popular oddities
        if re.match(r'_vti_bin/owssvr.dl|MSOffice/cltreq.asp', name):
            raise ValueError

    def selectRandomly(self, N=1):
        # Draw N random points in [0, hitSum), then walk the running hit
        # total; each point selects the first article whose running total
        # reaches it, so an article's chance is proportional to its hits.
        rHits = [random() * self.hitSum for i in range(N)]
        outputs = [None] * N
        numberOfOutputs = 0
        totalSoFar = 0
        for hits, name in self.hitList:
            totalSoFar += hits
            for index in range(N):
                if not outputs[index] and totalSoFar >= rHits[index]:
                    outputs[index] = hits, name
                    numberOfOutputs += 1
            if numberOfOutputs == N:
                return outputs
        return outputs


H = ArticlePicker(logFile, maxEntries)
H.readLogFile()
randomArticles = H.selectRandomly(numberOfArticles)
print("==%d randomly-selected articles (weighted by popularity)==" % numberOfArticles)
for hits, name in randomArticles:
    print("* %s &mdash; (%d hits)" % (name, hits))
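The parsing step is easier to follow on a concrete line. The sketch below reuses the script's regular expression unchanged, but everything else is an assumption for illustration: the sample lines are invented (the real layout of the hit-data file is not shown on this page, and the percentage columns here are placeholders), `parse_line` is a hypothetical helper that returns `None` where the script raises `ValueError`, and the filter for "popular oddities" is left out for brevity.

```python
import re

# The pattern from the script above, unchanged.
LINE_PATTERN = re.compile(
    r'^(\d*)\s*([0-9.]*)%\s*([0-9]*)\s*([0-9.]*)%\s*/wiki/(\S*)$'
)

def parse_line(line):
    """Return (hits, title) for an article line, or None if it is skipped."""
    m = LINE_PATTERN.match(line.strip())
    if m is None:
        return None
    hits, _pct1, _count2, _pct2, name = m.groups()
    if hits == "" or name == "":
        return None                      # nothing usable on this line
    if re.match(r'^\w*:', name):
        return None                      # namespaced page, e.g. Talk:...
    if re.match(r'Main_Page', name):
        return None                      # main page
    return int(hits), name.replace('_', ' ')

# Invented lines in the shape the pattern accepts; the percentages and the
# second count are placeholders, not real data.
samples = [
    "14675  0.05%  0  0.00%  /wiki/Beslan_hostage_crisis",
    "500  0.01%  0  0.00%  /wiki/Talk:Snake",
    "not a log line",
]
for s in samples:
    print(parse_line(s))
```

Only the first sample survives: the second is dropped by the namespace filter and the third does not match the pattern at all.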
 * 1) Dump the articles