Talk:Apache Hadoop

Cleanup
I started taking a crack at cleaning this page up. Started with the description. Please let me know if this looks okay. If so, I'll add more citations and proceed to fixing the main body. Vinod (talk) 16:12, 28 October 2013 (UTC)

I still have no idea WHAT this thing does!
And I'm in IT! What is the (at least intended) reading audience for this article? If it's other Hadoop experts then it needs a total rewrite. — Preceding unsigned comment added by 98.23.29.8 (talk) 19:11, 25 January 2013 (UTC)
 * Haha my thoughts exactly! This article is a huge pile of buzzwords, none of which is understandable for the general public. A translation from Marketing to English would be much appreciated. 213.112.197.68 (talk) 10:27, 25 August 2013 (UTC)
 * This is an important part of IT infrastructure that underpins many emerging technologies. Sadly, the whole article is so badly written that many readers are struggling to understand Hadoop from this explanation. Vote +1 rewrite. Andmark (talk) 11:56, 1 September 2013 (UTC)


 * Uh... Yes... But... I can clarify that I've read lots of little bits about Hadoop over the last few years, but my proximate cause for reading the article was some corporate presentations about so-called big data. Overall the article has failed to provide me with the kind of insight that I was hoping for, but I admit that part of my framing was that I expected to see more on the position of Amazon vis a vis Hadoop, insofar as the company that employs me is mentioned several places and I had somehow reached the conclusion that Amazon was the big barrier to overcome here... Should I conclude that Amazon is much less relevant to Hadoop than I thought, or that the article has a PoV against Amazon? I sort of agree with the comment about too many buzzwords, too, but mostly I'm just kind of disappointed in the lack of enlightenment after spending the time to read the entire thing fairly carefully. I actually feel I do have a small idea of "WHAT this thing [Hadoop] does", but it wasn't helped or refined by the article. Shanen (talk) 07:45, 9 December 2013 (UTC)

Visit their website; and I will too. -- Charles Edwin Shipp (talk) 15:42, 30 May 2014 (UTC)


 * I have a degree in Computing Science and I still don't know what Hadoop is after reading this article. There needs to be an explanation of what it does, and what it is for, before any talk about its components. If describing a car I would start with it being a vehicle that is driven by a person and that cars typically transport between 1 and 7 people. I would not begin by describing it as a framework consisting of an engine, gearbox and body. The article needs to be written as an encyclopedia article. FreeFlow99 (talk) 13:15, 6 June 2014 (UTC)


 * Another request for a plain explanation of what Hadoop is. The fog arrived for me when "data set" was used where I expected "data".  Seriously, this topic is important enough to deserve the attention of a subject matter expert to explain what this magical blend of project, framework, distributed file system, distributed data base, distributed operating system, etc. is. I echo the above comments. patsw (talk) 14:55, 19 August 2014 (UTC)

What Hadoop is Not
It is not a cluster. Distributed computing and cluster computing are two very different things.

A 100-computer Hadoop system with 100 CPUs can never have all 100 CPUs working on the data held on one system. Each system with one CPU can have precisely one CPU working on the data on that node. You own 100 CPUs but get the benefit of only the CPUs that happen to be local to the data being worked on. A true cluster (Slurm, MPI, Ganglia, etc.) allows all the CPUs you own to work with whatever data you have, to whatever extent they can access the data and share the computation. In Hadoop, data needing more processor attention must be duplicated as many times as you need processors to work on it. In practice, 10 or more copies of the same data may be needed. If the data is already massive, this duplication can be costly and prohibitive. It is possible that data-aware clustering has made Hadoop and similar poor man's cluster technologies obsolete. Rocks clusters and Beowulf-style clusters with a data-aware Slurm implementation can outperform it at a lower cost and with less duplication of data. In any case, published Hadoop data from government users reveals that the cost of electricity is often so high that no savings are realized over traditional data warehousing. Scottprovost (talk) 18:52, 1 September 2013 (UTC)

With the new versions of Hadoop applications out, it may be possible to describe the ecosystem more clearly. The Berkeley Data Analytics Stack and a multitude of add-ons and replacements have changed the landscape so drastically that the term Hadoop has come to refer to everything and nothing. An article about this word that pretends to be about a computer software application is a falsity and should be removed. A disambiguation page with over 1,000 links to applications and systems once known as the Hadoop ecosystem would be more appropriate. As fast as the word Hadoop rose to popularity, it has now fallen, with the word Hadoop being synonymous with a massive software debacle or administrative failure. Wikipedia now needs a page for the word Hadooped. Scottprovost (talk) 22:22, 10 April 2015 (UTC)


 * The article already mentions the ecosystem in the intro, so the issue has not been ignored. But perhaps now a new separate article should be created just about the Hadoop Ecosystem? Michaelmalak (talk) 23:59, 10 April 2015 (UTC)

What Hadoop Is
Over the years, Hadoop has come to mean many things and applications, most of which can be run without HDFS or even any core Hadoop components. It would be a good addition to this article to provide a list of, and links to, the 40-plus components that have become known as part of, or in themselves, "Hadoop", sometimes referred to as the alphabet soup of the "Hadoop Ecosystem": 2. Apache Pig 3. Apache Hive 4. Apache HCatalog 5. Apache HBase 6. Apache ZooKeeper 7. Apache Oozie 8. Apache Sqoop 9. Apache Flume 10. Apache Mahout ... Scottprovost (talk) 16:58, 15 March 2014 (UTC)

Hadoop is central to some vendors' Big Data solutions (Dell/EMC as an example). Big data implementations provide a way to move and aggregate large amounts of data and without redundant bits. This is one example that is far from Apache; it is a use of Hadoop but almost completely out of context of the original implementation. I do not work for EMC but I've had experience with their solutions. For reference (not to be included in the article) : https://www.dellemc.com/en-us/storage/unstructured-data-analytics/solutions-use-case.htm?CID=314887&VEN1=sP0BFNI8u%2C268143709895%2C901qz26673%2Cc%2C%2C&VEN2=b&LID=5957906&DGC=ST&ACD=1230921248720564&VEN3=823148740449067458 — Preceding unsigned comment added by 144.160.98.94 (talk) 16:09, 8 June 2018 (UTC)

Yahoo
The article says "On February 19, 2008, Yahoo! launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query."

I thought that Bing was powering Yahoo search??? Kitplane01 (talk) 18:54, 26 August 2010 (UTC)


 * Y! are switching/have switched to Bing for index and search; I don't know what they use those same clusters for now, but as of August they were running a 4000-machine cluster, as mentioned on the Hadoop general mailing list. That cluster has the largest number of machines in a single Hadoop cluster, though it is believed that Facebook have a bigger filestore in a cluster with fewer machines. (Newer servers have more higher-capacity disks in them.) I have a photo of Arun and Owen from Y! running Terasort on one of Y!'s clusters at ApacheCon 2009; this includes a screenshot of the laptop as they set the then record for the petasort benchmark; this might make a good addition to the article.
 * Yahoo! runs more than 38,000 nodes across its various Hadoop clusters, the largest of which are 4,000 nodes. Even after the Bing switch-over, the clusters are used for analytics, machine-learning, ad targeting, content customization, etc.  Yahoo! is still by far the largest user of Hadoop.  —Preceding unsigned comment added by 99.23.190.196 (talk) 07:30, 28 September 2010 (UTC)

Untitled
This feels a little too much like promotional literature to me.


 * I don't think that's the case, but it is just fairly minimal right now. What we need is some information on the underlying architecture, some discussion of its strengths (scales) and weaknesses (Name node is a single point of failure, base performance not great, can be tricky to nurture if you don't know how to manage a cluster). Are you volunteering to add these? SteveLoughran (talk) 21:46, 23 June 2008 (UTC)
 * I agree that it looks more like a marketing brochure than a real wikipedia entry. 14:00, 30 October 2009 (UTC) — Preceding unsigned comment added by 193.109.175.80 (talk)


 * Added an architecture section, including coverage of limitations and specifics of the filesystems. Better?

I think this is a good overview of Hadoop ... concise ... relates the project and product well to the Who What Where and Why you'd be looking for in an Encyclopedia entry. The only thing I'd add is comparative discussion of other ways similar problems are solved to anchor context (FreddyMack (talk) 14:02, 14 April 2009 (UTC))


 * I would like to know what is involved in implementing it. What sort of limitations are imposed on a developer writing data processing code for this system? What sort of techniques can be used to make code more efficient for such a setup? Chillum 03:41, 21 May 2009 (UTC)

Google patents Hadoop?
Excerpt from http://www.theregister.co.uk/2010/02/22/google_mapreduce_patent/

In mid-January, Google won a patent for MapReduce, the distributed data crunching platform that underpins its globe-spanning online infrastructure. And that means there's at least a question mark hanging over Hadoop, the much-hyped open source platform that helps drive Yahoo!, Facebook, Microsoft's Bing, and an ever-expanding array of other web services and back-end business applications.

66.192.121.51 (talk) 17:35, 23 February 2010 (UTC)
 * Oh yeah? So they want to forbid anyone else from slicing an SQL query over several servers within a cluster? Doesn't make any sense to me... --178.197.236.109 (talk) 12:25, 12 January 2014 (UTC)

Podcast with Hadoop
A recent Software Engineering Radio podcast was about Hadoop:

Episode 157: Hadoop with Philip Zeyliger. Released 2010-03-08. Direct download URL for MP3. Length: 51 minutes 04 seconds.

It could be included in the article, e.g. in External Links, as in the article "Aspect-oriented programming".

--Mortense (talk) 12:00, 9 March 2010 (UTC)

Belatedly done. Ross Fraser (talk)

Hadoop Podcast Focused On All Things Hadoop
http://allthingshadoop.com/podcast

Perhaps this can be put into the main Hadoop page as a resource. —Preceding unsigned comment added by Omniomega (talk • contribs) 04:42, 5 September 2010 (UTC)

What is the problem that Apache Hadoop is trying to solve
I read the article, but was unable to separate out the problem that the system seeks to solve from the implementation details. As far as I can tell, it seems to be useful wherever there is a large quantity of file-based data which can be processed independently from other data, but is expensive to transfer. This seems to read as if the problem is to create an index (hashmap?) that can direct you to an appropriate node to compute on. Is this right? Can someone splice in a section after the lede to aid understanding this? 129.67.86.189 (talk) 11:46, 19 April 2011 (UTC)
 * Actually, it's solving very different things. E.g. MapReduce is about accessing different clusters containing different data (where a cluster consists of several servers containing the exact same data). So it's basically distributing the SQL query, then asking each server for the result of a different subset, and finally merging the data to create one data set. However, this can be easily done and probably any large-scale DB developer already does it. Finally, I think that Hadoop is great as a distributed file server, but only that, since distributed DB queries can easily be done without Hadoop. Anyway, it's basically a Java query implementation; the question is, do we need it or shouldn't we just implement our own map-reducing systems? --178.197.236.109 (talk) 12:31, 12 January 2014 (UTC)
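
To make the "distribute the query, ask each server for its subset, then merge" idea in the reply above concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API: the shard data, the function names and the use of threads to stand in for separate servers are all assumptions made for illustration. Each "server" returns a partial sum and count per category, and only those small partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each list stands in for the rows held by one server.
SHARDS = [
    [("books", 12.0), ("music", 5.0)],
    [("books", 8.0), ("games", 20.0)],
    [("music", 15.0), ("games", 10.0)],
]

def partial_totals(rows):
    """Run the 'query' on one shard: (sum, count) of amounts per category."""
    totals = {}
    for category, amount in rows:
        s, n = totals.get(category, (0.0, 0))
        totals[category] = (s + amount, n + 1)
    return totals

def merge(partials):
    """Combine the per-shard partial results into one result set."""
    merged = {}
    for part in partials:
        for category, (s, n) in part.items():
            ms, mn = merged.get(category, (0.0, 0))
            merged[category] = (ms + s, mn + n)
    return merged

def average_by_category(shards):
    # Threads stand in for servers here; a real system would ship the
    # query to the machines that already hold each shard.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_totals, shards))
    return {c: s / n for c, (s, n) in merge(partials).items()}
```

Calling `average_by_category(SHARDS)` gives the average amount per category. Sums and counts are merged rather than per-shard averages, because averages of averages would be wrong when shards hold different numbers of rows.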

Hadoop inspired by Google's GFS and MapReduce
The introduction erroneously says that Hadoop inspired Google's MapReduce and GFS. It is the other way around. Sanjay Ghemawat et al. published the GFS paper in 2003, and Jeffrey Dean and Sanjay Ghemawat published the MapReduce paper in 2004. Hadoop developers have clearly stated that they used these works as inspiration to solve their scalability problems. 96.250.77.130 (talk) 13:44, 1 June 2011 (UTC)


 * Well spotted! Someone edited the page last week and flipped the credits. Reverted and added another warning to the IP address. SteveLoughran (talk) 20:38, 1 June 2011 (UTC)

Current Hadoop Versions are wrong
The current Hadoop versions rendered in the infobox are wrong. The 1.0.0 is the current beta version for the 1.0X branch and 0.20.203.X is the current stable version from the 0.20 branch. — Preceding unsigned comment added by Aalexand85 (talk • contribs) 15:07, 2 February 2012 (UTC)

HDFS Not Mountable?
The section on HDFS contains the following paragraph, "Another limitation of HDFS is that it cannot be directly mounted by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems."

That's a pretty big contradiction: with a FUSE-based filesystem for HDFS, it can be mounted by an existing operating system. Also, what's the deal with the phrase "existing operating system"? Is that opposed to an operating system that doesn't even exist? Onlynone (talk) 17:22, 20 April 2012 (UTC)
 * A version of Microsoft Windows which can mount FUSE-based filesystems would be an example of an operating system that doesn't even exist?
 * Of course, Linux, and other "UNIX-like" operating systems have been able to use FUSE for quite some time. Because of the involved metadata, direct mounting and accessing it like it was a directory of photos could have detrimental effects much like digging into your favorite relational database with a text/hex editor...

Copyvio?
This website has some of the same content as the article:

Do we think it's someone copying Wikipedia, or could it be a copyvio? Andrew327 07:57, 2 April 2013 (UTC)

Data nodes can talk to themselves?

"Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. " This is wrong!

Jargon and techno-babble
The intro claims that Hadoop provides "reliability and data motion to applications". Data motion? Is this like interpretive dance? A bit ballet? The term "data motion" is undefined elsewhere in WP (thank heaven). It is also the name of a company and product line (that has nothing to do with Hadoop). As well, a previous entry in this talk page draws attention to text in the article where process nodes "talk to each other". Do they do this via Twitter? Or do they use couriers on cyber bikes like in the movie Tron? The intro also refers to "computation-independent computers". Nice to see computers finally moving away from being dependent on computation...

This whole article needs a re-write to avoid sloppy writing, breezy jargon, and dubious techno-babble. Ross Fraser (talk) 22:12, 15 July 2013 (UTC)

A glossary and advisory statement at the beginning of the article would go a long way toward demystifying it. 2601:2:8D00:1E3:E986:AB21:7172:AA44 (talk) 19:40, 22 July 2014 (UTC) John Beale

Stratosphere extends Hadoop
http://stratosphere.eu/

There is no mention of Stratosphere. — Preceding unsigned comment added by 179.234.179.107 (talk) 09:28, 26 March 2014 (UTC)

Unbelievably bad bad bad article
The beginning of this article reads as follows:

"Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. "

This is fine, but the crucial word "processing" is one of the vaguest verbs in the English language and requires immediate elaboration.

Unfortunately, the word is nowhere elaborated. As a result, readers are left with the impression that Hadoop does nothing whatsoever.

Instead, all we get is innumerable paragraphs about its underlying architecture.

This is totally unacceptable. Hadoop is above all defined by what it does, not by how it is built. So: if the architecture paragraphs are left in the article, they belong only after a good description of what Hadoop does.

As currently constituted, this article is exactly as if an article about Facebook mentioned in its first sentence that it was "social software" and then, with no elaboration on that description, proceeded to discuss for many, many paragraphs the software architecture of Facebook. That is how utterly ridiculous this article is.

I strongly urge that this article either be fixed immediately to explain what Hadoop does, or that it be removed, lest it give other unknowledgeable editors the wrong idea about what an encyclopedia article should be. Daqu (talk) 16:12, 30 October 2014 (UTC)


 * OK, I rewrote the introduction. I apologize, though, for the WP:SELFPUBLISH -- I couldn't find any other good source. Michaelmalak (talk) 17:17, 30 October 2014 (UTC)

The criticism that this article contains too many buzzwords is simply not true. It does contain a lot of technical terms (apparently mistaken for buzzwords) that make this a very useful article. I have read it not knowing what Hadoop was before now and I have a clear understanding now of what it is and where it fits into Big Data. — Preceding unsigned comment added by 94.193.190.1 (talk) 14:27, 13 May 2015 (UTC)


 * Probably written originally by an Apache Foundation documentation writer. Worst documentation anywhere, ever. — Preceding unsigned comment added by 208.81.212.222 (talk) 21:13, 24 June 2015 (UTC)


 * ASF documentation is all open source: contributions to improve it are welcome. One aspect of the Apache docs is that they generally assume some foundation of knowledge of, or interest in, the area. This article can't make so many assumptions about the audience. Even so, it's hard to do it without assuming some level of knowledge. SteveLoughran (talk) 17:00, 4 September 2015 (UTC)

Too Many Buzzwords Criticism is Wrong
I have just read this article (13/05/2015) and there are no buzzwords (as in marketing spin speak), but there are a lot of well-known technical terms. I had heard of Hadoop before now but had no idea what it was until I read this article. Now I have a clear picture of what it is (i.e. how it is built, what it does and how it does it) and where it fits into the Big Data ecosystem. (Processing is not a vague word; it is a specific Information Technology term.)

94.193.190.1 (talk) 14:33, 13 May 2015 (UTC) 25 years experience in the IT industry at large enterprise level

Non-relevant reference in Other users under the Prominent users section.
The "Other users" section says: "As of 2013, Hadoop adoption is widespread. For example, more than half of the Fortune 50 use Hadoop.[42]" Reference 42 does not provide any information about Hadoop use in Fortune 50 companies. — Preceding unsigned comment added by 205.178.86.245 (talk) 21:11, 18 June 2015 (UTC)

What and Why does it exist?
This page is like explaining how a car works without explaining what it is or why it exists. This leads to the Anekantavada problem (the three blind men examining an elephant, each of whom will tell you the elephant is something different). What is it? It has some pistons and is powered by gasoline. Yes, but what is it and why did they build it? The gasoline is vaporized and a spark introduced to cause it to explode. WHAT IS IT AND WHY DOES IT EXIST?! — Preceding unsigned comment added by 208.81.212.222 (talk) 21:10, 24 June 2015 (UTC)

hdfs thrift support questionable
After digging deep for hdfs thrift support, I came to this conclusion: there isn't any (anymore). You _can_ implement it yourself (which is what I started), but it is not part of hdfs. (There is thrift support for other parts of Hadoop, though). — Preceding unsigned comment added by 212.116.17.20 (talk) 12:24, 15 October 2015 (UTC)

Hadoop 1.x versus 2.x architectures
The article seems to only talk about Hadoop 1.x architecture. Hadoop 2.x has been available since 2013, and there is a big emphasis on Apache Yarn now which is not mentioned in this article. See "What is Apache Hadoop" section on http://hadoop.apache.org/ — Preceding unsigned comment added by 199.76.28.197 (talk) 20:45, 13 November 2015 (UTC)

Request to add Oracle as Hadoop cloud service provider.
Disclosure: I am an employee of Oracle and thus have a WP:COI. Request permission to add Oracle to the comma-separated list of vendors at the end of the first paragraph of https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_hosting_in_the_Cloud. Citation would be. I do not plan to add a separate subsection -- just to the comma-separated list. Michaelmalak (talk) 15:46, 8 April 2016 (UTC)
 * ✅ Seems reasonable, Oracle is a big player and this is simply an appropriate mention, not blatant advertising. Martin of Sheffield (talk) 16:01, 8 April 2016 (UTC)

Suggestions for improvements...
Hadoop is one of the open source frameworks for distributed processing. Hadoop is finding applications in many fields, such as health insurance, banking, medicine, artificial intelligence and machine learning. Wikipedia readers interested in Hadoop are likely to be a mix of technical and non-technical users, so this article has to be improved in several ways to meet their expectations and increase readability:


 * 1) The lead section is very important and needs to be improved a lot; naive users who are new to Hadoop will find it difficult to understand the basics of Hadoop.
 * 2) The structure of the article can be improved to achieve better readability.
 * 3) New sections can be added to say more about HDFS, which is the most important component of Hadoop.
 * 4) There are different Hadoop supporting frameworks such as Sqoop, Hive, HBase, Oozie and ZooKeeper. A small introduction to each of these topics would be highly appreciated.

Thanks, Lathivik (talk) 03:37, 17 October 2016 (UTC)

Lead
User:AtlasDuane: In what way do you think the lead is weak? In my opinion, it is good (and actually the only good thing about this article). Michaelmalak (talk) 15:38, 16 May 2017 (UTC)


 * In the absence of a timely response, I am removing the "lead rewrite" template. Michaelmalak (talk) 15:03, 19 May 2017 (UTC)

So what is it about?
Despite a lifetime in IT, I still don't get from the article the sort of thing Hadoop is intended to achieve.

Somewhere near the top of the article, could someone provide a simple, clear example of an issue that would be quite tricky to express or do in traditional computing (e.g. procedural language plus single processor plus traditional filesystem) but that is easily expressible in Hadoop? Thanks. Feline Hymnic (talk) 16:42, 20 September 2019 (UTC)
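
One commonly used illustration of the kind of example requested above is counting words across a very large set of text files. The sketch below is plain single-process Python, not Hadoop's Java API; the function names `map_phase`, `shuffle` and `reduce_phase` are made up for illustration. It only shows the shape of the computation: the programmer writes a per-line "map" step and a per-key "reduce" step, and a framework of the Hadoop kind would be what runs those steps across many machines and groups the intermediate pairs by key.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key; in a real cluster the framework does this step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For instance, `word_count(["the cat", "The dog the end"])` counts "the" three times and each other word once. The single-machine version is trivial; the hard parts in practice (splitting the input across machines, moving pairs between them, coping with machine failures) are what the framework is meant to handle, and they are deliberately left out of this sketch.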

Proposed removal of "Timeline" table
The placement of that huge "timeline" table at the start of the article is inappropriate. It is relatively unimportant to the reader seeking information about the topic, so should not be so prominent. If present at all, it should be much later. Is it needed at all? I propose removing it in about a week (late Sept 2019), but am persuadable to keep it in a place lower down the article. Feline Hymnic (talk) 21:48, 21 September 2019 (UTC)


 * There being no objection, I have removed it. If it is believed that this detailed timeline should be present, it ought to be near the end of the article, as an appendix-like piece. Feline Hymnic (talk) 13:42, 3 October 2019 (UTC)