Wikipedia:Reference desk/Archives/Computing/2010 October 9

= October 9 =

Version control recommendations for large data sets
I am looking for recommendations for a version control system (both server and client) for a large and fairly unusual workload.

Workload description:
 * ~1 million files
 * >99% of files are ASCII text, but a small fraction are binary.
 * Mean file size of ~50 kB (so ~50 GB total under version control). Wide range of sizes with ~50% at less than 10 kB and a couple files > 1 GB.
 * Content is batch updated, so that ~10% of files are changed every two weeks in one big update. (Not necessarily the same files every update.)  Off schedule changes can be assumed to be negligible.
 * On average, only a few percent of each file's contents changes during each update. So the total diff size might be ~150 MB every two weeks.
 * Server must run on Linux, be web-accessible, support random file / revision access, etc.
 * Clients must be available for both Windows and Linux.

Obviously, any established version control system will support a wide range of features / platforms. Personally, I'm familiar with Subversion. However, I have little experience using any version control system as applied to very large data volumes and was wondering if some might be better at that than others. The little experience I do have with this suggests that some Subversion clients may perform quite poorly if you try to place very large numbers of files under version control.

Any suggestions / feedback would be appreciated. Dragons flight (talk) 03:38, 9 October 2010 (UTC)
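
The workload figures quoted in the question can be sanity-checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the 3% edit fraction is an assumed stand-in for "a few percent", and decimal units (1 kB = 1000 bytes) are assumed throughout.

```python
# Back-of-envelope check of the workload figures quoted in the question.
KB = 1_000                      # decimal kilobyte (assumption)

n_files = 1_000_000             # ~1 million files
mean_size = 50 * KB             # mean file size of ~50 kB
changed_fraction = 0.10         # ~10% of files touched per two-week update
edit_fraction = 0.03            # "a few percent" of each file; 3% is an assumed value

total_bytes = n_files * mean_size
diff_bytes = total_bytes * changed_fraction * edit_fraction

print(f"total under version control: {total_bytes / 1e9:.0f} GB")
print(f"diff per two-week update:    {diff_bytes / 1e6:.0f} MB")
```

With these assumptions the totals come out at the ~50 GB and ~150 MB stated in the question.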


 * On the web you can find plenty of discussions about the commercial version control system that used to be used for the Linux source code, and the open source one that Linus Torvalds started when they had to change due to licensing problems. This was mainly about performance. Hans Adler 12:19, 9 October 2010 (UTC)


 * Git and Mercurial seem problematic due to the lack of partial checkout functionality. When your managed set is 50 GB, most users are only going to want to see a small portion of that.  I'm not familiar enough with BitKeeper yet to have an opinion, though using a proprietary system would be an uphill sell.  Dragons flight (talk) 20:12, 9 October 2010 (UTC)


 * The intended git solution to this problem is to have lots of small repositories, corresponding to the partial check-outs you're interested in. I find that this works better than I thought, but I'd still really like to have partial checkout.  Paul (Stansifer) 12:35, 10 October 2010 (UTC)


 * I am interested in the answer, too, and my only contribution is a partial workaround to the performance issue: TortoiseSVN is indeed pretty slow when you order an Update or Commit to a directory that has many subdirectories and files in them; the speedup is to drill down lower in the folder hierarchy where you do your "Update" or "Commit", so that TortoiseSVN has fewer ".svn" folders to dig through and do compares on.  Comet Tuttle (talk) 18:13, 9 October 2010 (UTC)


 * Yes. In addition to TortoiseSVN, I've also looked at a couple other SVN clients and they all had similar performance problems.  (One client outright crashed when asked to do an add operation on 30000 files.)  My suspicion is that this is a consequence of SVN being designed to manage status through lots of little files.  If I'm right, the required disk IO is going to make all SVN clients rather slow for many kinds of operations on large data sets (though some clients might use caching more effectively than others).  Dragons flight (talk) 20:12, 9 October 2010 (UTC)


 * Perforce is a commercial version control system that claims to be faster than the competition. ("Perforce can effortlessly handle millions of changes and terabytes of versioned data across multiple sites", says the web site.) I've used it, but not enough to form a useful opinion. It's worth considering at least. -- BenRG (talk) 04:22, 11 October 2010 (UTC)
 * I've used Perforce and I like it, but it isn't cheap. I think it was like $700 per seat license per year, about four years ago. --Trovatore (talk) 04:51, 11 October 2010 (UTC)
 * Maybe I got that wrong -- maybe it was a permanent license rather than a one-year license. Or maybe they've dropped their prices since then, don't know.  Anyway, see here for pricing. --Trovatore (talk) 06:07, 11 October 2010 (UTC)


 * You can always try it out; 2 clients are free, if I remember correctly. Perforce's reputation is high.  I've used it, but not for a million files under source control.  Comet Tuttle (talk) 18:38, 13 October 2010 (UTC)

Dragons flight, if I understand you correctly you are looking for something like the "sparse checkout" functionality that was added in git 1.7. So the functionality appears to be there now, except that you can't check out something like a/b/c/d/e/f/g/what_I_really_want/ as just what_I_really_want/ in your current working directory. Instead, git will create all the parent directories in the repository directory. See here. Hans Adler 06:59, 11 October 2010 (UTC)
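
A minimal sketch of the sparse checkout procedure referred to above, as it is documented for git 1.7: set `core.sparseCheckout`, list the wanted paths in `.git/info/sparse-checkout`, then run `git read-tree -mu HEAD`. The Python wrapper and the helper names `sparse_patterns` and `enable_sparse_checkout` are hypothetical and only for illustration; the same three steps can be typed directly in a shell.

```python
import subprocess
from pathlib import Path


def sparse_patterns(paths):
    """Return the text of .git/info/sparse-checkout for the given directories."""
    # One pattern per line; a trailing slash restricts the pattern to directories.
    return "".join(p.strip("/") + "/\n" for p in paths)


def enable_sparse_checkout(repo, paths):
    """Restrict the working tree of an existing clone to the given directories."""
    repo = Path(repo)
    subprocess.run(["git", "config", "core.sparseCheckout", "true"],
                   cwd=repo, check=True)
    info_dir = repo / ".git" / "info"
    info_dir.mkdir(exist_ok=True)
    (info_dir / "sparse-checkout").write_text(sparse_patterns(paths))
    # Re-read the index so the working tree matches the new patterns.
    subprocess.run(["git", "read-tree", "-mu", "HEAD"], cwd=repo, check=True)
```

As noted above, the parent directories of each listed path are still created in the working tree.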

please help
We have a system where we maintain a tracker, and 5 users enter their respective data on the same page of an Excel sheet. Unfortunately, I have a colleague who deletes data at crucial times (it can be a single cell or two), which skews the entire average. She is doing this to defame me and have me held responsible, keeping me away from important tasks. A colleague caught her red-handed once but could not do much. Is there a way I can devise or implement a logging system where the login ID of whoever deletes something is logged to a file, so that I can prove it and have her reprimanded? Without that I am helpless. Please help me. —Preceding unsigned comment added by 203.122.36.6 (talk) 10:01, 9 October 2010 (UTC)
 * Track Changes might be the simplest solution, depending on how "smart" your adversary is. Using Excel as a time tracker for multiple persons is generally a bad idea, though. I'll leave it to the rest of the RD/C volunteers to suggest better alternatives. :-) -- 78.43.71.155 (talk) 14:58, 9 October 2010 (UTC)
 * Of course, a low-tech approach would be a regular printout of your sheet, when you've verified it to be in a non-tampered condition, and have a trustworthy co-worker compare the printout with the file and sign it (with date and time), so it's not simply your word against that other person's word. -- 78.43.71.155 (talk) 15:01, 9 October 2010 (UTC)


 * A high tech approach would be to set up a source control server, like Subversion, and host the Excel sheet there. Every change is meticulously tracked with the user name, time, and date.  You can rewind to previous versions of the sheet to see what happened with each "commit" of changes.  The excuse you give for setting this up would be to reduce the number of conflicts that will occur if multiple people are trying to modify the sheet at the same time (because that's what source control systems were designed for).  Comet Tuttle (talk) 17:58, 9 October 2010 (UTC)
 * Excellent idea! Even better, keep the spreadsheet in comma-separated ASCII format (.csv).  Then you can see line-by-line diffs for each checkin. --Trovatore (talk) 06:48, 11 October 2010 (UTC)
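
To illustrate why the CSV suggestion helps, here is a small sketch using Python's standard `difflib` module to produce the kind of line-by-line diff a version control system would show between two check-ins. The file name, revision labels, user names and values are all made up for the example.

```python
import difflib

# Two hypothetical revisions of the tracker, exported as CSV (all values are made up).
old = """user,date,hours
alice,2010-10-04,7.5
bob,2010-10-04,8.0
carol,2010-10-04,6.0
""".splitlines(keepends=True)

new = """user,date,hours
alice,2010-10-04,7.5
bob,2010-10-04,
carol,2010-10-04,6.0
""".splitlines(keepends=True)

# unified_diff marks removed lines with "-" and added lines with "+".
diff = list(difflib.unified_diff(old, new,
                                 fromfile="tracker.csv@r1",
                                 tofile="tracker.csv@r2"))
print("".join(diff), end="")
```

The deleted cell shows up as a single changed row, and in a real repository the commit log would attach a user name and timestamp to that change.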

Embedding live wikipedia page on an external website (perhaps in an i-frame?)
Hello all, thanks for reading. I'm working (veeeeery early stages) on a project to build a website something like a network of community-based blogs, articles, creative writing etc.

In any case, I am aware that some websites re-produce the content of wikipedia articles on their site (some credit it, some don't). This sort of thing might be useful for the project that I am working on, but I am also very aware that the articles on wikipedia are all 'living' things, insofar as they get updated, expanded etc.

My question (as per the title really): Is it possible to create an i-frame on a page of my site (say a page about Barking & Dagenham, for example) and have the Wikipedia Barking and Dagenham article appear there live? (Does that make sense? I'm still getting to grips with some of the terms and how some of these things work.)

Cheers all, Darigan (talk) 12:58, 9 October 2010 (UTC)


 * It's possible, but it's probably not a good idea. You want to present the content, but you'll end up presenting the whole page, including the editing interface, sidebars, etc.  Much better to use the MediaWiki API to pull the article text, format it to HTML yourself (there's code around to do that, from MediaWiki and other places), and place that within the pages you're building. Better yet, a bit of smarts in using the MediaWiki API can limit the times you present vandalised info (you would, for example, not recover the latest version, but the last version that had stood for say 3 hours without being reverted - a "revert" being an edit with a summary that matches the general admin, twinkle etc. revert strings, or contains the words "rv" or "vandalism"). It's probably sensible for you to retrieve article contents only occasionally (say every day). -- Finlay McWalter ☻ Talk 13:36, 9 October 2010 (UTC)
 * Thanks Finlay McWalter - The MediaWiki suggestion you made sounds like a really good way to handle what I have in mind. I was worried that there might be an issue with the i-frame pulling in the entire interface rather than just the article content, and you confirmed that. Thanks as well for the tips about avoiding pulling in vandalised versions of articles. I will certainly follow-up your tips. Thanks again, Darigan (talk) 14:12, 9 October 2010 (UTC)
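
The "last version that stood for 3 hours without being reverted" idea can be sketched as follows. This assumes the revision list has already been fetched (for instance with the MediaWiki API's `action=query`, `prop=revisions` and `rvprop=ids|timestamp|comment`) and is ordered newest first. The function name, the regular expression and the sample revisions are all hypothetical; real revert summaries vary more than this pattern covers.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed heuristic: these words in an edit summary mark the edit as a revert.
REVERT_RE = re.compile(r"\b(rvv?|revert\w*|undid|vandalism)\b", re.IGNORECASE)


def parse_ts(ts):
    """Parse a MediaWiki-style UTC timestamp such as 2010-10-09T13:36:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def pick_stable_revision(revisions, now, min_age=timedelta(hours=3)):
    """Return the newest revision that stood for min_age and was not then reverted.

    revisions: dicts with 'revid', 'timestamp', 'comment', newest first.
    """
    next_time, next_comment = now, ""
    for rev in revisions:
        stood_for = next_time - parse_ts(rev["timestamp"])
        if stood_for >= min_age and not REVERT_RE.search(next_comment):
            return rev
        next_time, next_comment = parse_ts(rev["timestamp"]), rev.get("comment", "")
    return None


# Made-up history: a vandal edit (3) that was reverted half an hour later (4).
example = [
    {"revid": 4, "timestamp": "2010-10-09T17:30:00Z", "comment": "rv vandalism"},
    {"revid": 3, "timestamp": "2010-10-09T17:00:00Z", "comment": "lol"},
    {"revid": 2, "timestamp": "2010-10-09T12:00:00Z", "comment": "expand history section"},
    {"revid": 1, "timestamp": "2010-10-09T09:00:00Z", "comment": "create page"},
]
now = datetime(2010, 10, 9, 18, 0, tzinfo=timezone.utc)
chosen = pick_stable_revision(example, now)
print(chosen["revid"])
```

Running this once a day, as suggested above, and caching the chosen revision keeps the load on Wikipedia's servers low.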

Another option would be to use the printable version of the page in your iframe, such as http://en.wikipedia.org/w/index.php?title=London_Borough_of_Barking_and_Dagenham&printable=yes This displays the live version of the page while removing the editing interface, and is a lot simpler than delving into mediawiki api. 82.44.55.25 (talk) 18:11, 9 October 2010 (UTC)
 * Thanks IP guy/girl - The project I'm working on involves me learning a lot from scratch, anything to ease that process is very much appreciated. Cheers Darigan (talk) 13:11, 10 October 2010 (UTC)

Java segment help request
I have a problem with the following Java segment; please help me out. In this student class I want to add another method, void sum, which gives the total of subject1 from internal1 and internal2, and similarly for subject2. What should the parameter list of void sum contain? Please also help me define it. Avril6790 (talk) 13:05, 9 October 2010 (UTC)


 * You can define it to be whatever you want. You could define the student class to be stateful and define instance variables to store the values of subject1 and subject2; in that case, sum needs no parameters; it could store its result in another internal variable, or print the value of the sum of internal variables.  This is entirely a design choice on your part.  It is my opinion that this would be an incomprehensible design choice; while stateful programming is acceptable, in this trivial example it seems unnecessary and unintuitive.  (We don't know what "internal1" or "internal2" are supposed to do, let alone what you want the "sum" of, so how can we design its interface?).  I would also point out that your code snippet does not comply with the official recommended Java Code Conventions - class names should be capitalized (class Student implements Internal1, Internal2 {  ... ), and your interface names should be more meaningful than "internal" (this does not help you or anybody else know what the interface is or why you need it).  If you use more meaningful names in your program, it will help you and others evaluate the best design choices.  For example, in Java, if you want to set the value of an internal variable, you should use a get or set method so it is clear that you are modifying the internal state of the Student (i.e., setting his score in subject1 or subject2).  Then, you could have a method called "printSum" - it will be obvious what that method does and when it should be used.  I have also formatted your snippets with source tags for readability.  Consider:

Nimur (talk) 16:09, 9 October 2010 (UTC)

Finding a Mario game for DOS
I am looking for an old Mario game for DOS. I often played it in 2000 and 2001.

I can check the reference desk regularly to answer questions about it. I reached the fourth stage and can remember details about the first three stages.

I got the game free, as an email attachment. One special thing I remember about the game is the phrase "back from the death, to rule Frisia again". —Preceding unsigned comment added by Kampong Longkang (talk • contribs) 18:33, 9 October 2010 (UTC)


 * There is a list of free Mario clones here: http://compactiongames.about.com/od/freegames/tp/supermario_clones_and_remakes.htm 92.15.17.139 (talk) 18:44, 9 October 2010 (UTC)


 * I have this game, Mario.exe .. I believe it was a fan-made clone of the original NES Super Mario Brothers, probably as a coding demonstration (it was only 64 KB but looked amazing!). It runs in DOS, and as far as I recall it only had a limited number of levels. Inside the .exe is the text string “Done by Utter Chaos [DFF]” which was probably some demo or cracking group. This mini-game actually may have started out as a trojan horse or a virus but I've never detected anything with any scanner on any of the copies I've seen. I can email you a copy if you want, I'm pretty sure it's freeware. -- œ ™ 21:38, 9 October 2010 (UTC)
 * This http://www.trendmicro.co.jp/vinfo/virusencyclo/default5.asp?VName=HLLP.YAI.A&VSect=T says it's malware. 92.24.177.4 (talk) 23:00, 9 October 2010 (UTC)
 * The game was actually made by a developer called Mike Wiering. The copy you have is a hacked beta version, the full version (with six levels in total) can be downloaded as freeware from this link. --CalusReyma (talk) 00:09, 10 October 2010 (UTC)


 * Ahh, interesting! How did you come about this information? Are the "...Frisia" and "Utter Chaos DFF" versions both hacked betas? Also, if it was once malware I'm sure it is no longer, probably completely eradicated. -- œ ™ 00:23, 10 October 2010 (UTC)
 * Yes, I would think both are. I first found the game (the beta version) on a shareware compilation disc. The fourth stage in the beta is unwinnable; there's no exit pipe at the end, so you get stuck. I forget exactly how I came across the finished version (this was years ago); probably just through a search engine. --CalusReyma (talk) 09:15, 10 October 2010 (UTC)

I found a game with very similar levels at http://www.dosgamesarchive.com/download/mario/ except the fourth level is now the sixth level and levels four and five are new to me. The remaining levels are very similar, with a few tiny differences (I don't remember seeing a star in the previous game). But thanks for the help! —Preceding unsigned comment added by Kampong Longkang (talk • contribs) 10:32, 10 October 2010 (UTC)

Does this MS game use a Messenger protocol?
I asked a similar question recently, but now I have more details. Does the game described here www.ehow.com/how_2331394_play-othello-online.html use Messenger as a protocol? When I run it in XP, the process it uses is described as "zclientm.exe".

I now often get an error message saying that the server has not responded - I wonder if this is due to me only having version 4 of Messenger, when the latest version is version fourteen. I dislike Messenger and will only update if that is likely to be the reason for the error message. Thanks 92.15.17.139 (talk) 19:07, 9 October 2010 (UTC)


 * If all else fails, run Wireshark and watch the traffic between the game and its server. This Microsoft article lists the ports used by messenger; for games it says it uses naked TCP on ports 80, 443, 1863 and UDP on just about any unprivileged port. Unfortunately 80 is also used for http and 443 for secure-http. -- Finlay McWalter ☻ Talk 19:15, 9 October 2010 (UTC)

Two related questions about device recognition: Linux and Windows
I have two related questions, about computers automatically recognising devices. One is about Linux, the other about Windows.


 * 1) My own computer runs Fedora 12. When I plug my DSLR camera (Olympus E-520) in to a USB port, the system doesn't do anything. I have to manually mount the camera into the file system, allowing me to access the memory card through a mount point. In contrast, when I plug my mobile phone (Nokia 6303i) in to a USB port, the system automatically recognises it, and offers to launch GThumb to download photographs. If it can do so for one device, why not another? How can I make it do so for the DSLR camera as well?
 * 2) My father's computer runs Windows Vista. When I insert a CF card into the memory card reader, the system automatically recognises it, and offers to launch Explorer or Windows image viewer. Then, after I select "safely remove device", the system stops recognising the CF card at all. If I take it out and put it back in, the memory card reader's light comes on, but Windows acts as if the card wasn't even there. Only rebooting makes it recognise it again. How can I fix this?

Can anyone help me with either of these problems? J I P | Talk 19:55, 9 October 2010 (UTC)


 * I experienced a similar issue with card readers on Windows XP. Apparently windows sees the card reader as a drive whether there's a card in it or not (hence 3 or 4 drive letters being constantly taken up in 'My computer' by empty card reader slots), so when you "safely remove device" it disables the card reader altogether. 82.44.55.25 (talk) 20:26, 9 October 2010 (UTC)
 * When you have finished using the card, make sure no programs are accessing it, then remove the card. This happened on my XP machine as well. Sir Stupidity (talk) 22:12, 9 October 2010 (UTC)


 * I use an SD card in a USB/SD card adaptor. When I remove the card it behaves as you state. But if I remove the adaptor, I can plug it in again with a new card and all is recognised. -- SGBailey (talk) 22:13, 9 October 2010 (UTC)


 * On modern Linux installations, udev is responsible for detecting new devices, assigning them a /dev name, and (sometimes) for mounting them (sometimes that's done instead by Nautilus). You can watch this in progress by running udevadm monitor and tail -f /var/log/messages as you add (and remove) devices. In this case it sounds like there's a problem in the udev rules (which on my Ubuntu system are in /lib/udev/rules.d) which recognise devices. This article discusses how to write these rules - crucially the "USB Camera" section discusses an idiosyncrasy about how cameras report their "partitioning"; if yours is the same, then he gives a solution. -- Finlay McWalter ☻ Talk 23:40, 9 October 2010 (UTC)


 * To explain a bit more (mostly for the benefit of the next person who asks, which hopefully will show up in the RD/C search function). When a new USB device is detected, the USB stack signals the kernel, which sends a message into userspace on a netlink socket. udevd listens to that, examines the details of the device, and acts according to its rules. You can configure udev to directly mount a device (by having it run the mount command); that's the case for some of the example content of udev tutorials you'll find, but it's not how Ubuntu at least works. News of the new device is then propagated around using D-Bus (in some systems by HAL, in others by udevd itself). This is received by the GVFS daemon; you'll notice that if a USB disk is inserted before you log in to GNOME, the disk appears in the "Places" menu, but hasn't been mounted (as reported by mount). It's also reported to Nautilus (I honestly don't know if GVFS does that, or if Nautilus is a D-Bus client of the relevant stream itself).  When Nautilus sees a rising-edge (a new insertion) for a disk, it may automount it (for the setting that controls that, run gconf-editor and navigate to /apps/nautilus/preferences/media_automount).  I don't know the procedure for a KDE based system, but I imagine it's generally much the same idea.  All of this is wonderfully flexible, but it's clearly complex and sometimes a little fragile.  If you just can't get it to work, here's a downright hack: run a cron job (say every 30 seconds) that runs lsusb, searches that output for the ID of your camera, and runs the mount command you've been running manually (and umount if it's removed). -- Finlay McWalter ☻ Talk 00:39, 10 October 2010 (UTC)
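
A rough sketch of the polling hack described above. The USB ID `1234:abcd` and the mount point `/mnt/camera` are placeholders (read the real vendor:product ID from the output of `lsusb`), and the script assumes a matching `/etc/fstab` entry so that `mount /mnt/camera` works on its own. The function names are hypothetical.

```python
import os
import subprocess

CAMERA_ID = "1234:abcd"   # hypothetical vendor:product ID; take the real one from lsusb


def camera_present(lsusb_output, usb_id=CAMERA_ID):
    """Return True if any line of lsusb output reports the given vendor:product ID."""
    for line in lsusb_output.splitlines():
        fields = line.split()
        # Typical line: "Bus 001 Device 004: ID 1234:abcd Some Camera"
        if "ID" in fields:
            idx = fields.index("ID")
            if idx + 1 < len(fields) and fields[idx + 1].lower() == usb_id.lower():
                return True
    return False


def poll_once(mount_point="/mnt/camera"):
    """Mount the camera if it has appeared; unmount it if it has gone away."""
    out = subprocess.run(["lsusb"], capture_output=True, text=True, check=True).stdout
    present = camera_present(out)
    mounted = os.path.ismount(mount_point)
    if present and not mounted:
        subprocess.run(["mount", mount_point], check=False)
    elif mounted and not present:
        subprocess.run(["umount", mount_point], check=False)
```

A scheduled job would call `poll_once()` on each run; fixing the udev rules remains the cleaner solution.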


 * I haven't tested this, but I believe that cameras have more capabilities than merely mass storage, and therefore many of them have special Linux device drivers. There is a program called gtkam that is specially designed for interfacing with digital cameras. Looie496 (talk) 01:17, 10 October 2010 (UTC)


 * That's generally Picture Transfer Protocol. Some cameras are capable of being remote controlled (to take pictures) over USB, but there seems to be no standard protocol for that (for still cameras). gphoto (the software that underpins gtkam) has a list of those cameras that it knows how to do remote capture to here. -- Finlay McWalter ☻ Talk 02:40, 10 October 2010 (UTC)

Hard drive size
I bought a computer and it was advertised to have 250 GB of HDD space. However, Windows reports it to have like 232 GB. Why is that? —Preceding unsigned comment added by 71.224.49.81 (talk) 20:38, 9 October 2010 (UTC)


 * Packaging is often in gigabytes (10^9 bytes), but computers tend to think in gibibytes (2^30 bytes). It is recommended that the former be written GB and the latter GiB, but in many cases people and machines use GB without specifying what they really mean.  In any event, 250 GB = 232.8 GiB.  Dragons flight (talk) 21:44, 9 October 2010 (UTC)


 * Some people recommend the abomination "GiB", you mean. Comet Tuttle (talk) 07:03, 10 October 2010 (UTC)


 * Computers don't "tend to think in gibibytes". Microsoft chose to make Windows Explorer report disk sizes in units of 2^30 bytes. They could have chosen units of 10^9 bytes instead. Everybody would be better off if they had. -- BenRG (talk) 07:58, 10 October 2010 (UTC)


 * It's a sneaky way that disk sellers rip off consumers. 92.24.177.4 (talk) 23:02, 9 October 2010 (UTC)


 * Here's some more info 82.44.55.25 (talk) 23:46, 9 October 2010 (UTC)


 * Hard disk drive. ---— Gadget850 (Ed)  talk 02:45, 10 October 2010 (UTC)


 * (Your hard drive will actually store more than 250,000,000,000 bytes, but some of these are used during formatting to define sectors for quick reading. If you had a single file of exactly 250 GB of data (232.83 GiB), you could easily store it on your drive using a specially-written operating system, with some space to spare, but this would seldom be useful in practice, so real operating systems use a "wasteful" format of the drive that enables quick and easy access to each small file.)  The main reason for the apparent discrepancy is as explained by Dragons flight and others above.  It looks like a rip-off, but it is really just confusion over units.    D b f i r s   07:08, 10 October 2010 (UTC)


 * Yes, it's just confusion over units. Not a rip-off. -- BenRG (talk) 07:58, 10 October 2010 (UTC)
 * No, it's both, but it's a large-scale organised rip-off entered into by most (all?) hard drive manufacturers over the last 10 years or so. ;-) --Stephan Schulz (talk) 08:14, 10 October 2010 (UTC)
 * ... so how many bytes would you expect a 250 GB drive to hold? They are already being generous in giving you more than 250,000,000,000 bytes.  Would you expect a 250 GHz oscillator to run faster than 250,000,000,000 cycles per second? I have no shares or interest in hard drive manufacturers, just an interest in SI units.    D b f i r s   16:30, 10 October 2010 (UTC)
 * It's not just normal SI/traditional confusion. In this particular case, it's an SI unit given the name of an existing traditional unit. (which, of course was given prefixes from other SI units.) Imagine if the SI length unit was called a "yard". People would forever be complaining about 'extra inches' and 'missing centiyards'. APL (talk) 19:18, 12 October 2010 (UTC)
 * So how many extra bits are there in your byte? The mis-naming is the other way round.  Early computers had multiples of 1024 bytes that they called Kilobytes to save inventing a new prefix.    D b f i r s   01:21, 13 October 2010 (UTC)
 * If I had a penny for every time I've seen this question, I would be retired in the Bahamas by now... Sandman30s (talk) 11:31, 11 October 2010 (UTC)
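
The unit conversion discussed in this thread can be checked in a couple of lines. The only assumption is that the advertised capacity is exactly 250 × 10^9 bytes; as noted above, real drives hold slightly more raw bytes and lose some to formatting.

```python
GB = 10**9     # gigabyte (decimal, as used on drive packaging)
GiB = 2**30    # gibibyte (binary, what Windows Explorer labels "GB")

advertised_bytes = 250 * GB
size_in_gib = advertised_bytes / GiB

print(f"{advertised_bytes:,} bytes = {size_in_gib:.2f} GiB")
```

The difference between the two units is about 7% at this scale, which accounts for the "missing" space.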

Harmful computer monitor radiation?
Surprisingly, I could not find any article on computer monitor radiation. This, however, was helpful in answering my question of whether it was a myth that it is harmful, but I'm still wondering about older computer monitors, CRTs specifically, before standards such as MPR II and III (could not find any article on these either) or any other standards: could there in fact have been enough radiation emitted from these monitors to be not just harmful, but lethal given enough exposure? Is it even plausible, under extreme circumstances, for computer monitor radiation to be deadly? I mean, just the fact that these 'Low Radiation Emission' standards do exist means that there was indeed enough harmful radiation emitted from these older monitors that it actually warranted putting these standards in place, so I'm wondering just how much radiation was reduced. By what percent? How much safer are we now? (I mean those that still use CRTs of course, which is rare these days ;) -- œ ™ 22:05, 9 October 2010 (UTC)


 * The suspected danger was radio frequency EM fields emitted by the CRT, with a particular worry about its effects on the unborn. This and this (both from the Health Physics Society) cover this. It seems they set a standard that they were confident was safe.  That doesn't mean it was a myth, or that it wasn't, but rather that it was easy enough to set a generally low level - it's often difficult, and ethically very problematic, to empirically demonstrate that such standards are unnecessarily low. -- Finlay McWalter ☻ Talk 23:28, 9 October 2010 (UTC)


 * See Cathode ray tube. -- Wavelength (talk) 23:41, 9 October 2010 (UTC)

example.com
How much traffic does http://example.com receive daily? 82.44.55.25 (talk) 23:53, 9 October 2010 (UTC)
 * It would probably be hard to come across such information without owning the website. What can be found is its Alexa rank. example.com's rank (which is calculated using both the number of visitors and the number of pages those visitors view, which for one-page example.com doesn't matter that much) over the past three months is 9937, and for today it was 11651. For comparison, the website of Kingston Technology was ranked 11653 for today, PC Pro's website was 11670, that of Radio France was one spot below PC Pro, and MTV's German website (mtv.de) was 11695. (Google, Facebook, and YouTube were the top 3 for today.) So while hard statistics are difficult to come by, at least it can be determined that example.com, on this particular day, gained more traffic than some websites of fairly large companies. Xenon54 (talk) 03:03, 10 October 2010 (UTC)
 * I disagree with the last conclusion. The Alexa article which you linked to makes it clear the Alexa rankings are far from perfect, and given the way they are derived, this is hardly surprising. The IMO proper conclusion is "according to Alexa, example.com, on this particular day, gained more traffic than some websites of fairly large companies". To use an example I like to use, the main www wikipedia page lists some top languages according to the number of visitors to each language. For whatever strange reason, when this was first implemented Alexa rankings were used. However, someone pointed out that these were different from the WMF statistics, not extremely so but enough to change the order of at least one or two languages IIRC. Another good example (as mentioned in the article) is the fact that Alexa themselves have changed their ranking system in the past in an attempt to improve the accuracy, and these changes have had a clear effect on the rankings. Nil Einne (talk) 03:50, 10 October 2010 (UTC)