Talk:Web scraping/Archive 1

web scraping only for generation of new web pages?
Is it not web scraping if the data extracted from the web page does not end up in another web page? What if it's just stored in a file or database? —Fleminra 00:07, 19 April 2006 (UTC)

Yes, that definition strikes me as strange. It's certainly not how I use the phrase. I would have thought the definition was something like "Using custom-built software to access a site via HTTP, and parse the retrieved data in order to extract embedded information – not for rendering it like a normal user agent." Part of the definition would be, I think, that the software knows something specific about the data, and doesn't work against any random site. But—I have no references, just a general feeling. JöG 21:57, 5 February 2007 (UTC)

I also note that the Screen scraping article doesn't restrict the term "web scraping" like this article does. JöG 22:02, 5 February 2007 (UTC)

Self-reference
ILike2BeAnonymous seems to want to add the comment "Ironically, one of the most heavily 'scraped' sites is this one, Wikipedia.". I've reverted twice now; I won't revert again without something further happening; I hope ILike2BeAnonymous will come here for discussion. Per WP:SELF, "To ease reusability, never allow the text of an article to assume that the reader is viewing it at Wikipedia, and try to avoid even assuming that the reader is viewing the article at a website." So the statement being added clearly goes against that style guide. It's also not "ironic" that Wikipedia gets web scraped; it's by design. This might be considered original research -- I suspect this is ILike2BeAnonymous's personal opinion. I also don't see what it adds to the article from an encyclopedic standpoint. In short, I don't think it belongs here. ILike2BeAnonymous: What's your rationale for adding it? Others: What do you think? --DragonHawk 22:42, 8 August 2006 (UTC)


 * Well, all I can say is that it is both ironic and by design, as you say (and, if you don't know it, a hot issue that's only going to get hotter as the practice increases, with moral implications that rebound on Wikipedia). So far as the "reusability" issue goes (basically a technical one), that could easily be remedied by making the statement refer explicitly to Wikipedia. +ILike2BeAnonymous 23:09, 8 August 2006 (UTC)


 * Wikipedia has always been free content. The intent, right from the inception, was that the content would be available to and contributed by the community at large.  See Wikipedia and History of Wikipedia.  The GFDL explicitly permits commercial redistribution.  So anyone who "web scrapes" Wikipedia is doing exactly what the GFDL exists to enable: Reusing free content.  Given that irony is defined as "[meaning] the opposite of" or "being both coincidental and contradictory", I don't see how one can call that ironic.  Can you cite a source on the irony claim?  I do agree that suitable rewording may fix the self-reference issue, but that still leaves the question of "Why should this be in the article at all?".  Regarding your statement that this is "a hot issue"; are you referring to something about Wikipedia in particular, or the practice of web scraping of sites which are not free content?  --DragonHawk 00:10, 9 August 2006 (UTC)

The Wikipedia reference can be added back, as long as it's not a self reference, so it can say "one of the heaviest scraped websites is wikipedia.org" rather than "one of the heaviest scraped websites is this one, wikipedia". There are legitimate reasons for talking about wikipedia, as it IS the most scraped website on the net, so people are scraping the wikipedia content, replacing the wiki links with their own ad links and have pages full of ads, while just scraping as many pages as they can, not structuring the content in any way or scraping particular pages, they just scrape anything to make money off their ads. And these scraped pages bump down some of the real results on google, meaning that sometimes you need to look at the 2nd page of results to get the answer you are looking for, as the first page is full of wikipedia results and wikipedia scraped results. Also people have commented that there could be a conflict of interest, as many of the ads on the scraped pages are actually generating money for google (google ads), so why would google want to show any independent pages when Wikipedia and the scrapers are willing to pay to get their pages shown instead of the independent ones? And no, wikipedia was NOT created to build easy advertising revenue for the spammers of the world. Wikipedia has been turned into spam and people in the field of technology are complaining about it in the media JayKeaton 23:01, 22 June 2007 (UTC)

Wikipedia
Wikipedia, in the world of scraping, is the most common source of information that spammers use, which makes it the most common source of scrape spam. The mentioning of wikipedia being used as scrape spam does NOT fail WP:SELF. Please READ WP:SELF before citing it as your reason for deleting. Someone above a while ago deleted "ironically wikipedia is one of the most common..." when he SHOULD have actually read WP:SELF and reworded it. So please do not remove the reference to wikipedia in this page any more JayKeaton 19:24, 2 September 2007 (UTC)
 * Sure. But it doesn't need to repeat the reason why someone would scrape a site. Also, it needs a citation for 'most common' - that implies some clear research. In my experience you're probably right, but my guess doesn't belong in an encyclopedia. peterl 22:45, 2 September 2007 (UTC)
 * There is also the issue of scraped sites not showing the latest version of the real wikipedia page. Scraper sites, like answers.com, only scrape once, so until the next scrape all the information is old, and that scraped information could be missing important information that was added later, or it could contain spam or misleading information that was deleted after the article was scraped. You don't even need a source for that, just compare a scraped wikipedia article and the same wikipedia article and you will see things that have been taken out or added that the answers.com page does not show. I don't understand what is so bad about talking about wikipedia on this page. Are you all too sensitive to mention wikipedia on a controversial topic like web scraping? Even the answers.com article on webscraping that was scraped from wikipedia lacks updates made in the last few weeks! And wikipedia advertises itself as an advertisement free wikipedia, when other web sites are scraping out of date articles and plastering them with ads. How does this common web scraping practice not warrant a mention on the web scraping article!? JayKeaton 11:08, 3 September 2007 (UTC)


 * It seems pretty basic, no offense, but I agree with peterl. If you don't have a reference to support this "common practice" it's just not a very helpful contribution to the article, which could use better substantiation, copy-editing and reinforcement to begin with. It's not a matter of being "too sensitive" ... it's just too easy for "personal observations" to get out of hand in an article like this. Not to mention there are other "minor" details that I am sure you've already considered by yourself. dr.ef.tymac 16:11, 3 September 2007 (UTC)

Is this all about references then? Why didn't the person just put a fact tag up instead of deleting it? Ok, a quick google, hmmm, aha, here's one. Do you remember the John Seigenthaler, Sr. thing, where someone said that he was suspected of killing JFK? USA Today and Seigenthaler himself talked about it in an interview [Which can be found here], where he says that even after a moderator called "Jimmy Wales" fixed the offending text, it still remained on the scrapes at Answers.com and Reference.com "for three more weeks". Seigenthaler was probably one of the most famous wikipedia controversies of them all, I'm not sure that there is possibly a better example in existence for what I'm trying to say, but Google is full of them if you want me to post more. JayKeaton 17:11, 3 September 2007 (UTC)


 * Better yet, why not gather up those references, gather up some citation templates, fill in the blanks and edit the article, placing the references accordingly. If the edits represent a clear and well-substantiated improvement, you'll probably see little or no objection from anyone. If the bridge fits, cross it. dr.ef.tymac 17:41, 3 September 2007 (UTC)


 * If the bridge fits... cross it? Did you just make that up? In any case, I will do just that. I don't know why it keeps getting deleted whenever someone puts stuff like that on this article, it assumes bad faith when you just delete something with literally not even five minutes to think about it. JayKeaton 19:01, 3 September 2007 (UTC)


 * To answer the first question(s), yeah I made it up. Meaningless? Sure. Problem? Not really, you got the point. Mixed metaphors are the spice of life.


 * As far as imputing "bad faith" to anyone, you're free to do so if you want, but I don't see how that really helps anything here. Frankly this entire article needs work. Not to mention the fact that it's not entirely clear who or what you are referring to. If you've found fault with any of my edits, feel free to show me a diff. I'll be happy to discuss it with you. The same can probably be said of the other recent contributors to the article. dr.ef.tymac 21:01, 3 September 2007 (UTC)

I don't actually look at who edited what, I only look at what was edited. It doesn't actually help anything at all to know who changed what, unless they keep changing it back. As for the whole page, scraping has other uses besides simply recreating a page on another website on your own, but the only thing that gets mentioned by notable people and analysts is that web scraping has brought complaints in the past and now, I have a copy of american scientific that goes into a little detail about why scraping entire web sites is damaging to causes like open source and I even have another quote that predicts that, for wikipedia, a scraped wikipedia page will one day soon end up being the source for another wikipedia page. Web scraping has a lot of other applications and it seems to be a growing business, but I am concentrating on the observations and criticisms of scraping at the moment such as how a few web sites scraping something like wikipedia can cause havoc with Google search results. As for other applications of scraping, there are loads of web pages either selling scraping software or devoted to people trying to make their own. It can be used for simply gathering small pockets of information right up to copying entire commercial websites like eBay or Amazon on your own top level domain name. eBay and Amazon are the highest profile examples of scraped data offending the owners of that same data JayKeaton 07:55, 7 September 2007 (UTC)

General spin on article
I checked this article out as part of my third year computer science degree project/dissertation. I will be creating a web mashup which uses data from one site with an API, and another site with no API - but as far as I can tell it is completely legal, and all above-board.

The thing which I didn't like about this entry is that it talks about web scraping entirely in a negative sense and associates it mostly with spamming and "scraper sites". I would have loved to see some details of how web scraping is used in a legal, useful way, and perhaps even some information on how it can be done, appropriate programming languages, and the various techniques used. I know the thing to do here would be to add that information myself - however, the reason I was reading about web scraping is that I am indeed clueless about it at the moment! —Preceding unsigned comment added by 85.211.70.253 (talk) 18:29, 8 November 2007 (UTC)

Hear hear! I'm also doing a project in this field. Although it's presented in a very negative light on this article, isn't web/html scraping essentially what search engines do? They have to read html files (through http presumably) and turn that raw code into at the very least, a title, a brief sample of the main body text and read links to see what pages link to what etc. If that isn't "various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context" then I don't know what is. —Preceding unsigned comment added by 194.82.118.105 (talk) 13:24, 22 November 2007 (UTC)
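To make the search-engine comparison above concrete, here is a minimal sketch in Python of pulling a title and the outgoing links out of raw HTML. It uses only the standard library's html.parser; the class name TitleAndLinkParser, the helper summarize and the SAMPLE page are made up for illustration and are not taken from any real crawler. Fetching the page over HTTP is left out, so the sketch works on a string that is assumed to have been retrieved already.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every <a href=...> target (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; keep only links with an href
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only recorded while we are inside the <title> element
        if self._in_title:
            self.title += data


def summarize(html):
    """Return (title, links) for an HTML document given as a string."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Made-up page standing in for something already downloaded over HTTP
SAMPLE = (
    "<html><head><title>Example page</title></head>"
    "<body><p>See <a href='/about'>about</a> and "
    "<a href=\"http://example.org/\">elsewhere</a>.</p></body></html>"
)

title, links = summarize(SAMPLE)
print(title)
print(links)
```

A real indexer would also need to resolve relative links, respect character encodings and cope with badly broken markup, none of which this sketch attempts.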

List of libraries
A list of libraries for web scraping in different languages would be really useful, don't you think? —Bromskloss (talk) 13:25, 18 April 2008 (UTC)

Software
This section seems out-of-place. It's a list, for one thing, and those are usually better re-written in another format. However, I don't see a need to include information about specific programs in the article anyway. They're not notable. Anyone care to defend it before I remove it? Peppergrower (talk) 03:17, 20 July 2008 (UTC)

monetized
I can only guess at what this word means. It's not in the Wiktionary. Nick Levine 11:28, 3 March 2006 (UTC)
 * Look it up on dictionary.com. There's an entry. Phoenixrod 17:38, 3 April 2006 (UTC)
 * I've changed the wording to the substantially clearer 'generate revenue'. Monetization has more the concept of becoming legal tender, or becoming a substitute for money. Here we are more talking about just plain old profit. The use of the word 'monetization' made the entry less clear. peterl 01:15, 20 July 2007 (UTC)

Oxford: monetize or monetise (also monetarize) verb 1 convert into or express in the form of currency. 2 [usually as adjective monetized] adapt (a society) to the use of money. Bellthorpe (talk) 11:53, 29 October 2008 (UTC)

external link points to commercial site
The external link aliased "Know About Web Scraping" points to a commercial - not informational - site. I'm not a devoted wikipedant but I don't imagine Wikipedia is looking to become the internet yellow pages. —Preceding unsigned comment added by 74.221.242.122 (talk) 20:50, 23 July 2009 (UTC)

Privacy issues
It would be great for the Legal Issues section to include a discussion of pertinent privacy laws that generally require notice and consent to collect and use individuals' personal information. Some privacy laws and regulations have a very narrow definition of what constitutes "publicly-available information," such as published telephone directories. These laws and regs may not include identifiable comments posted on websites within their narrow definitions of publicly-available information. Thus, if a web scraper harvests content from one site, including identifiable individuals' names, comments and other unstructured data, which is then mined, analyzed and sold, must the web scraper provide notice to the affected individuals and obtain their consent?

Consider this excerpt from the Office of the Privacy Commissioner of Canada's draft report from its consumer privacy consultations held in 2010:

''Moreover, in Canada, although personal information may appear in the public domain that does not necessarily mean it can be used for any purpose. For example, PIPEDA provides that some publicly available personal information (as defined in PIPEDA's Regulations) can be collected, used and disclosed without an individual's consent; however, the purposes for which that information may be collected, used or disclosed, are nonetheless limited. '' Source: http://www.priv.gc.ca/resource/consultations/report_2010_e.cfm

And here is the text from the PIPEDA Regulations regarding publicly-available information. Note that information posted on websites is absent:

Regulations Specifying Publicly Available Information

INFORMATION

1. The following information and classes of information are specified for the purposes of paragraphs 7(1)(d), (2)(c.1) and (3)(h.1) of the Personal Information Protection and Electronic Documents Act:
(a) personal information consisting of the name, address and telephone number of a subscriber that appears in a telephone directory that is available to the public, where the subscriber can refuse to have the personal information appear in the directory;
(b) personal information including the name, title, address and telephone number of an individual that appears in a professional or business directory, listing or notice, that is available to the public, where the collection, use and disclosure of the personal information relate directly to the purpose for which the information appears in the directory, listing or notice;
(c) personal information that appears in a registry collected under a statutory authority and to which a right of public access is authorized by law, where the collection, use and disclosure of the personal information relate directly to the purpose for which the information appears in the registry;
(d) personal information that appears in a record or document of a judicial or quasi-judicial body, that is available to the public, where the collection, use and disclosure of the personal information relate directly to the purpose for which the information appears in the record or document; and
(e) personal information that appears in a publication, including a magazine, book or newspaper, in printed or electronic form, that is available to the public, where the individual has provided the information.

Source: http://laws.justice.gc.ca/eng/SOR-2001-7/page-1.html

64.208.23.76 (talk) 19:02, 27 January 2011 (UTC)

Legal Issues Expanded
The Legal Issues section of the entry seems to implicitly assume that screen scraping is generally harmless, and done primarily for personal consumption.

Perhaps this entry should point out that screen scraping is against the Terms of Use of many -- perhaps most -- commercial websites, which leads to legal liability for the scraper. Indeed, the Digital Millennium Copyright Act in the USA and European Union Copyright Directive specifically address "Circumvention of Copyright Protection Schemes", which would impact anyone scraping commercial sites -- whether for commercial gain or not -- especially when the scraped data is then redistributed.

Commercial sites will aggressively protect their intellectual property, and often have little tolerance for screen scraping, especially where it impacts their commerce. As most legal force is exerted out of the public eye (and also outside of any official lawsuit) it may not be readily apparent just how vigorously commercial websites can act to protect their IP. Those considering screen scraping a commercial site should study its Terms of Use, and also consider the consequences should the site become aware that the scraping is occurring.

It may make sense to update the Legal Issues with some of these sentiments. Thoughts?

Dracogen 16:55, 21 March 2007 (UTC)

No objections in over a week, so I proceeded in adding two paragraphs along these lines to the Legal Issues section.

Dracogen 16:12, 30 March 2007 (UTC)

RSS feeds are also under attack by these scraping sites, because they are another largely unregulated form of media. These feeds can be stolen as well as embedded into the code of scraping sites.

GravitationalAnomaly (talk) 10:51, 2 October 2011 (UTC) The last two 'sentences' regarding the Ryanair case are incomplete, and unfortunately as a result do not inform as to the conclusion or general direction of the court case. Neither are they sentences at the moment. Could someone with some knowledge, perhaps the original contributor, please correct. Thanks

Patronanejo (talk) 23:30, 19 April 2012 (UTC) I concur with the need to revise this woefully inarticulate and uninformative passage:
 * In February, 2010, the Irish High Court in the case of Ryanair Limited v Billigfluege.de GmbH. This case established a precedent by acknowledging that the Terms of Use on Ryanair’s website, including the prohibitions contained in those Terms of Use against screen scraping. The case is currently under appeal in the Supreme Court.

I have made the following corrections; certainly the original author sees the need for them: While Cúirt Uachtarach na hÉireann is useful in discriminating the Irish Supreme Court from its more familiar American counterpart, use of the Gaelic is greatly favoured by the Irish in referring to their independent offices of state--hence the added reference to An Ard-Chúirt, the High Court that rendered the decision.
 * Lack of predicate in the first sentence corrected by describing the actions of the An Ard-Chúirt;
 * Lack of predicate for Terms of Use in second sentence corrected by describing the court's view of the Terms' function;
 * Disambiguated location of appeal in third sentence to be Irish Cúirt Uachtarach na hÉireann

It looks awkward to American eyes, but I assure you Mr. Justice Michael Hanna is the proper way to address a sitting judge of the An Ard-Chúirt.

The remainder of my edits have been corrections to citations. Mistakes of my own were compounded by formatting behaviour that appears to cut off hyperlinks that contain a close bracket ]. — Preceding unsigned comment added by Patronanejo (talk • contribs) 23:30, April 19, 2012 (UTC)

Section "Techniques"
The section Techniques starts as follows:
 * "Web scraping is the process of automatically collecting information from the world wide web. Web scraping, instead, favors practical solutions based on existing technologies that are often entirely ad hoc."

This is clearly an error. I investigated and found out that the original statement was:
 * "The process of automatically collecting Web information shares many common properties of the semantic Web envision, which is a more ambitious goal that still requires breakthroughs in text processing, semantic understanding, artificial intelligence, and human-computer interactions. Web scraping, instead, favors practical solutions based on existing technologies even though some solutions are entirely ad hoc."

In my opinion, the original statement was much better, and I don't understand why it has been removed. If there is a problem with the style, please improve the style. If there is a problem with the content, improve / correct the content. If you don't want to improve the text yourself, wikipedia provides the possibility to tag the text as "needs to be improved". But please don't remove something and leave the remaining text in an incorrect state that only confuses the reader. I will now re-insert the latest version of the removed text. -- GGShinobi (talk) 23:07, 20 February 2012 (UTC)

Patronanejo (talk) 23:50, 19 April 2012 (UTC) Thank you for taking the time to investigate and re-rationalise this passage. Simple deletion is not improvement--it is vandalism. Would that you had the time to identify and shame the offending party. No, I'm not the author.


 * The deletions were made on 16 February 2012, 14:16 by Stylecustom. His comment was: "Removing peacock language"... :-/ --GGShinobi (talk) 11:11, 3 July 2012 (UTC)

Preserve "Web harvesting" article
Per wp:preserve, it should be noted that there is some content that was deleted at the article Web harvesting when it was redirected here. Diego (talk) 21:51, 1 October 2012 (UTC)

There's also this edit that removed internal links to non-free web scraping tools. Diego Moya (talk) 15:17, 7 August 2013 (UTC)

Web scraping considered harmful
As an author of Python code so far to scrape ~20 various financial and other sites, I feel this article is entirely too neutral on the topic. My view on the web scraping data lifecycle:


 * 1) company holds data in structured database
 * 2) company converts data to crappy, rarely valid HTML soup, and arbitrarily tweaks its structure every six months
 * 3) user writes a ton of code to try to reconstruct the structure in step 1

I think this article should point out that web scraping is no more than a workaround for companies' unwillingness to publish structured data (RDF/XML/JSON/YAML/ASN.1/OFX/HDF5/protobuf/CSV/whatever). History will judge this whole industry as a make-work job. I cringe to think that anyone might read this article and think that this is a responsible and future-proof way for any organization to publish data. Thoughts? Fleminra (talk) 04:27, 2 March 2015 (UTC)
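As an illustration of step 3 in the lifecycle above, here is a minimal Python sketch of rebuilding structured records from an HTML table and emitting them as JSON. Everything in it is assumed for the example: the TableParser class, the table_to_records helper, and the SAMPLE markup with its symbol and price columns are invented, and it relies on the first row being a header row, which real sites do not guarantee.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of every <tr> into a list of rows (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as column names and return one dict per remaining row."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up markup standing in for a page the site generated from its database
SAMPLE = """
<table>
  <tr><th>symbol</th><th>price</th></tr>
  <tr><td>ABC</td><td>12.50</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
print(json.dumps(records, indent=2))
```

The fragility complained about above shows up immediately: if the site renames a column, nests a table or drops the header row, the mapping in table_to_records silently produces wrong records, which is exactly the maintenance burden that published structured data would avoid.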

scrapesentry.com as a source
Anyone want to argue that it's something other than WP:REFSPAM, added for promotional purposes, and not a reliable source? If not, then it doesn't belong. It might be worth considering for blacklisting if such additions continue. --Ronz (talk) 15:17, 21 October 2014 (UTC)
 * Furthermore, the information is misleading. It seems to imply in general 23% of web traffic is scraping, when in fact that is their claimed amount just for their own clients. Of course their clients are going to have high levels of scraping, that's why they're using anti-scraping services! Daniel Flint (talk) 00:25, 17 July 2015 (UTC)

'Notable Tools' are not directly related to web scraping
I don't understand the criteria used to list software tools not directly related to web scraping, e.g. 'Firebug' or 'Apache Camel'. I understand they may be used to perform web scraping but are only auxiliary tools, and not web scraping software per se. I suggest the deletion of such tools or their inclusion in another list, explaining why they are useful (e.g. firebug allows the user to inspect html tags(?)). — Preceding unsigned comment added by 146.155.14.186 (talk) 13:10, July 22, 2015 (UTC)


 * Agreed. Tools such as wireshark, firebug, and the like aren't really web-scrapers in any sense of the word. Someone who knows more about the other tools mentioned could clean it up. Suspender guy (talk) 20:22, 12 January 2016 (UTC)

Kantu
Kantu refers to a page about a Music Style. That can't be intentional, can it? Makan.tayebi (talk) 11:31, 25 July 2017 (UTC)

External links modified
Hello fellow Wikipedians,

I have just modified one external link on Web scraping. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20071012005033/http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf to http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 05:41, 9 January 2018 (UTC)

Split section
I think that the section "Web scraping § United States" should be split into a new article called Web scraping in the United States. This is clearly an independently notable subject, and splitting it out would alleviate the geographic bias by freeing up editors to expand on non-US jurisdictions without feeling intimidated by the huge US section, as well as freeing up editors to contribute US content (at its own article) since they are no longer constrained by section guidelines and thus can contribute comprehensive content. --Mr. Guye (talk) (contribs) 21:12, 23 July 2018 (UTC)

examples
maybe you should also mention services like yahoo pipes, dapper.net, and .... something similar to dapper.net. I know I used it before a few months ago but I can't remember what it's named. — Preceding unsigned comment added by 65.75.97.173 (talk) 18:41, December 23, 2007 (UTC)

merge to screen scraping
See Talk:Screen_scraping for discussion — Preceding unsigned comment added by DragonHawk (talk • contribs) 20:46, July 1, 2006 (UTC)

legal issues
I don't think this article's discussion of the legalities of scraping is correct, and I'm disputing its neutrality. The DMCA prohibits technical measures to bypass an effective access control measure. A robot acting like a browser bypasses no effective measures in doing so, and thereby doesn't fall afoul of the DMCA. Also, redistributing copyrighted material is illegal regardless of whether the DMCA is invoked.

Furthermore, not all material gotten through screen-scraping is copyrighted. Consider the case of a site that displayed film showtimes. The showtimes themselves are not copyrighted any more than the numbers in a phone book are, and therefore can be used by whoever scrapes them without fear of copyright infringement. Wholesale copying of content is illegal, yes, but it's not an issue specific to "web scraping."

Also, performing an action that violates a site's terms of use is not illegal. It merely violates the terms of use, not any law. It's not even a breach of contract, since the user doesn't even have to read, much less agree to the terms to use the site.

Also, I demand a citation for the "courts have held" claim. I find it unlikely, though not entirely impossible. — Preceding unsigned comment added by Quotemstr (talk • contribs) 03:26, July 27, 2007 (UTC)

legal issues section reworked
The legal issues section made several bold and unsourced claims that could be interpreted as scare-mongering. Can someone check out the reworked section? —The preceding unsigned comment was added by Quotemstr (talk • contribs).

Legal issues again
I'm not starting an edit war, I swear. :-)

First of all, I cleaned up and normalized the references a bit, and made some minor phrasing changes that shouldn't be controversial.

I removed the section about legal action occurring out of the public eye. That information isn't only unsourced: it's unverifiable.

The court cases cited in the article hardly count as defeats. In the Ticketmaster case, the court held that the particular instance of scraping mentioned was not a trespass. In the other cases listed, the claim was for a preliminary injunction only. As I understand it, a preliminary injunction does not set case law, and should not be considered with the same weight as a final decision.

As for the aggregate damage section -- is there a specific source? Maybe I just missed it.

I don't see how the DMCA is relevant here either; the cases mentioned in the previous version seem to be covered by normal copyright law. A scraper doesn't necessarily have to circumvent any access restrictions in place on a site, considering that one can act like just a browser. Also, doesn't the DMCA specifically allow circumvention for interoperability? —The preceding unsigned comment was added by Quotemstr (talk) (contribs) 02:30, August 21, 2007 (UTC)