Talk:MapReduce/Archive 1

Request clarification
gee wizz who wrote this article... the explaination of reduce in the 2nd paragraph escapes me, i don't think i quite understand it... clarification and a motion to edit would be much appreciated --Htmlism 18:13, 10 February 2006 (UTC)

Needs to define terms before using them.
Indeed this is poorly written. Look for the first use of the word "key" ... it appears in the basic definition without itself being defined, let alone how a key plays a role in the algorithm.

Also look at the use of "domain" for the key-value pair. What does it mean that a key-value pair is "from a different domain"? Why is it important that it cannot be from the same domain? How does a key-value pair relate to a search task as manifest in Google? What is meant by the "type" of the key? Sorry to be harsh, but the use of undefined terms is rendering this prose practically useless. 192.18.43.225 (talk) 17:52, 6 October 2009 (UTC)

I'd also add "frozen" in the "Dataflow" section. The section starts with "The frozen..." as if the word frozen was defined and had some kind of contextual meaning.173.195.59.162 (talk) 18:37, 14 November 2014 (UTC)

Whether or not MapReduce's map and reduce is really like map and reduce in Lisp ...
I would think that explaination of MapReduce could be improved by looking a bit deeper into the sort of recursion schemes that MapReduce's map and reduce are about. Even though it has been said many times (also in this Wikipedia article) that these functions were inspired by the same primitives in functional languages like Lisp, this comparison confuses matters quite a bit; see my paper on that subject. --Ralf Laemmel


 * Ralf, I would like to do that, but (speaking as the guy who largely wrote the article), what's there is already pushing the limits of my expertise- I can barely follow your paper, if at all. --maru (talk) contribs 17:42, 21 February 2006 (UTC)
 * I have barely parsed Ralf's paper up to the point where I can confirm his claim. I have added to the lead paragraph a paragraph to address this distinction. Diego (talk) 13:44, 5 June 2008 (UTC)

I modified the page to discuss Map/Reduce as a general framework, since there are two independent implementations (Hadoop's and Google's). Starfish seems to be a distributed work queue with no sorting or maps or reduces and it might make sense to remove the link. Owen O'Malley

Credit for putting the page up, but I find Ralf's article much clearer. Thanks Ralf. -- Tamsin

I agree with removing the Starfish link. Although Starfish claimes to be MapReduce, it in fact isn't when you look at the code.

I would like to comment on the sentence "Thus the MapReduce framework transform a list of (key, value) pairs into a list of values." I don't think this is correct. Even though the Reduce function emits only a list, it is clear that the framework knows for which key it called the function and adds the key to the function's result. So the total output of the framework is not a list but a map, and from the user's perspective the framework transforms one map into another map. Just take a look at the basic example: word counting. Obviously it doesn't make sense to produce a list of counts as a result: it must be a map from word to count. WhAt (talk) 14:06, 5 September 2008 (UTC)

I agree with the above comment. The result of MapReduce is usually a map. In my opinion, it makes more sense if the reduce step is expressed as: Reduce(k2, list (v2)) -> (k2, v3) On looking at the Google paper again, 5 of the 6 examples shows Reduce returning a key value pair. Also they talk about the output of a MapReduce step being used as the input to another step, so the output must be a map.75.35.114.124 (talk) 05:36, 14 November 2008 (UTC)

I made an exception over the criticism paragraph, referencing an article stating the visible comprehension flaws by the experts on the controversional critic. —Preceding unsigned comment added by LuzGamma (talk • contribs) 20:09, 9 December 2009 (UTC)

== 'receding unsigned comment added by 70.21.118.113 (talk) 14:28, 23 October 2007 (UTC)

How 'bout mentioning what language that example is in... python I think? ruby maybe? help a brotha out! —Preceding unsigned comment added by 24.89.158.2 (talk) 20:38, 20 September 2008 (UTC)


 * Python. SteveLoughran (talk) 13:17, 14 November 2008 (UTC)

Needs improvement
The description of MapReduce is bad. I found it difficult to read even being familiar with map and reduce functions in lisp/python/ruby. I suggest taking something from the google framework summary (http://labs.google.com/papers/mapreduce.html)(dead link) or related ACM paper (http://portal.acm.org/citation.cfm?doid=1327452.1327492) as starters, for anyone with time to improve. --JRavn talk —Preceding undated comment was added at 22:11, 24 November 2008 (UTC).
 * Well it looks like a significant part of the article was lifted verbatim from the ACM article. However it leaves out the basic explanation of what map and reduce do. The summaries at the links I provided do a good job, I think. --JRavn talk 22:18, 24 November 2008 (UTC)

Yes, this is WikiPedia, not a technical journal. Someone should walk thru an example application written in laymans terms - and very slowly!--173.69.135.105 (talk) 01:01, 3 November 2011 (UTC)
 * I agree! — Preceding unsigned comment added by 174.79.189.10 (talk) 18:18, 11 April 2012 (UTC)

This article does not described MapReduce for what it is: a programming pattern instead of a set of frameworks (frameworks implement the pattern, is more correct). Looking at the ACM article suggested above, and it's introduction this could be described perfectly without building the case for one specific (Google) implementation: "MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks." I'd rewrite it if I had the time... Angeloh (talk) 20:46, 4 May 2012 (UTC)

Specific improvements?
I don't think it's that bad, but a simple dataflow diagram would help immensely. Also, it would really helpful to say, for each of the "hot spots" discussed, whether it runs on the master or on workers. For example, it's not clear from the (current state of the) page, whether a worker doing a reduce polls other workers for items with a given key (though I suspect that's not it), or whether all the maps send their output back to the master node, which partitions them by key and dispatches to the appropriate worker node (or something more complicated). I'd be happy to improve this myself (and provide a diagram), but I don't yet know the right answers....

If the partition is happening on the master node, doesn't that lead to bad consequence if a reducing node goes down (unless somebody keeps all the data -- does somebody keep track of all map output until the corresponding reduce is done? Expensive...). — Preceding unsigned comment added by Sderose (talk • contribs) 17:02, 19 March 2012 (UTC)

Request to add some historical dates
Maybe someone could add some historical dates to the article. When did the need for the framework become apparent? When did it get productive? The reference "Dean, Jeffrey & Ghemawat, Sanjay (2004)" suggests that it was in or before 2004. Jacosi (talk) 21:26, 22 March 2009 (UTC)

the example blows
Some arguments are never used, no description for some of the functions used makes it hard to read. 12.8.224.6 (talk) 19:07, 8 January 2010 (UTC)

Agree. First of all, don't use function parameters like k1, k2 etc, because they don't make sens in math or computational notation. Instead I would use someting like this: map (in_key, in_value) -> list(out_key, intermediate_value) Also explanation of MapReduce is vague and completely incomprehensible for novices. I've forwarded few people to wiki to read basics, but all came back to me for explanations as they did not understand a bit from this article. Even mathematicians and experienced software engineers. Good job on writing a bad article. :) 98.210.248.73 (talk) 19:10, 28 February 2010 (UTC)

The example doesn't make any sense. In "reduce", the argument "word" is not used. So it cannot possibly be correct. 147.188.192.41 (talk) —Preceding undated comment added 11:16, 25 October 2010 (UTC).

Citations need cleanup
Citation 7 and 9 link to the same article and it seems there is no reference to the critique of Map/Reduce by DeWitt and Stonebraker anywhere -- it appears that citation 7 was ment to be just that. I have searched for, but have not been able to find, any copy of the critique -- only the blog post linked to by citation 7 and 9 appears to be easy to find. —Preceding unsigned comment added by 80.167.145.223 (talk) 02:35, 12 January 2010 (UTC)

Localization
Should mention that the patent is only valid in the US. —Preceding unsigned comment added by 91.47.187.226 (talk) 21:10, 25 January 2010 (UTC)

Implementations
This list is misleading as it mixes in RDBMS systems that have been extended to include some MapReduce features (Like Greenplum) with actual MapReduce engines (like Hadoop). It might be a good idea to break the list of implementations down into two or more categories. Rdryfoos (talk) 20:12, 14 March 2010 (UTC)


 * go for it. '''Some illustrations might help. 17:55, 15 March 2010 (UTC)

'''

Comparisons to Oracle Parallel Execution
the description of the process sounds precisely like Oracle's Parallel Execution (PX), which features a Query Coordinator breaking down, say, a large full table scan into multiple ranges of blocks to be read and sending those ranges to slave processes to perform. If you add RAC to the mix, you could have slaves running across multiple servers. Could someone add a compare and contrast? —Preceding unsigned comment added by 216.99.185.50 (talk) 22:27, 15 April 2010 (UTC)

Criticism
Someone just deleted a criticism in which someone says that alternatives are better if you can fit the data into memory. While I think the deletion was valid, I think it is a point to emphasise: MR is potentially overkill/inefficient if your data fits into RAM, where RAM can be 64-192GB in an off the shelf server. Indeed, Hadoop servers often come with that much RAM to keep their 6-12 cores busy.

I propose that we add some criticisms, we just need valid references

1. If the data fits into RAM you can do other algorithms, don't need to go through data linearly.

2. If the data fits into a database you can afford and its rate of change means the indexes work, you can get away with SELECT statements. That is, DB lookups are efficient on stable datasets. Stonebraker is probably the citation here.

Nobody who works in MR code is going to dispute these, the key point being once your data is measured in hundreds of GB and are things where you can scan through rather than index, MapReduce works well. We just need the citations. — Preceding unsigned comment added by SteveLoughran (talk • contribs) 11:53, 27 February 2011‎

Patented?
The first sentence of an article should be intended to introduce the reader to the concept as quickly as possible. The very first descriptive terms are key. In this article, the very first descriptive term is, "patented."

This is not helpful to the vast majority of readers. Please consider an introduction that describes what MapReduce is to the widest possible spectrum of readers and then gets more specific, and finally, if the patenting of the algorithms involved is important for some reason, then introduce the fact that it's patented with information on why that's so important that it needs to be in the lead section. -Miskaton (talk) 16:50, 5 April 2011 (UTC)

Dataflow unclear
The "Dataflow" section first refers to "a map function", implying one, and later to "each map function", implying more than one. Which is it? This is unclear. --beefyt (talk) 03:43, 25 March 2012 (UTC)

Multiple Issues
I noticed a few problems with this article, as compared to the MapReduce paper titled "MapReduce Simplified Data Processing on Large Clusters".

For e.g. "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

In the paper, the map worker does not pass an answer to its master node. It only notifies the master of the location of its processed data while the data is left on the worker's disk.

Secondly, the master does not "collect the subproblems and combines them". That's the task of the reduce worker nodes. The master only allocates nodes.

Also, the example in the article, as mentioned by those above me, is no good. I propose to re-write a few of the sections of this article to better reflect the paper I pointed to above. Is this ok?

Shierro (talk) 03:17, 25 May 2012 (UTC)

Alternative example
The code sample seems illogical and may be incorrect. Here is an alternative (which needs verification because I am not expert).

The implied `submit` function dispatches the document as work-elements to processing-nodes and the corresponding `emit` function transmits the work-result back. Thus the expensive string comparisons are done in a distributed fashion and the cheaper counting functions are centralized (although they *could* also be distributed). — Preceding unsigned comment added by 198.152.70.2 (talk) 14:01, 2 November 2012 (UTC)


 * I for one would love to see a more practical, even layman's, explanation to how map and reduce fits in to the framework/system. I'd love some boxes and arrows. --Popoi (talk) 05:06, 29 November 2012 (UTC)

Huaijun Wu (talk) 07:18, 13 December 2012 (UTC) Personally I think the `submit` function is also confusing. The `map` function takes an input of `key`， while `submit` function only takes one input `document`. So somewhere within `submit`, a list must be built to contain all distinctive words in the document. And then `map` function can be invoked with each word in the list and the document itself as a pair. But the building of this key list already go through all tokens of the document. Why bother to invoked the `map` function? We are in a paradox.Huaijun Wu (talk) 07:18, 13 December 2012 (UTC)

Considering that the example in the main article is basically copied line-for-line from the example slide of Google presentation on MapReduce, I'm thinking it's probably right.Dtanders (talk) 20:08, 20 February 2013 (UTC)

When I read this article, I was wondering, what on earth does the emit function do. It should be explained, although it might be obvious for most of the guys participating this discussion. 2014-09-07 02:26 (EET) — Preceding unsigned comment added by 82.128.191.80 (talk)

The example shown here is absolutely wrong!. The proposed solution fails in distributing locally, does not do the work expected in the first example, and is bad defined since 'key' in Map is not known where it comes from (-must be provided in main?- all wrong). The only thing this does is -with some major fixes- count one expected word in a document, potencially several different in different documents but only one on each document. This is doing all wrong... — Preceding unsigned comment added by 94.126.240.2 (talk) 11:54, 13 March 2013 (UTC)

Agree with the above user. The proposed example misrepresents the MapReduce paradigm. Do not switch to the above unless it is revised. key is not a word to be counted. In general in MR, the key given to map is the filename that contains the current value. The example used on the MR page produces a count of *all words* in input files to the MR job. That said, the OP is correct that the example actually used on the MR page is unclear... Plikarish (talk) 18:56, 18 April 2013 (UTC)

An actual implementation of the example pseudocode is in http://static.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf — Preceding unsigned comment added by 132.183.4.2 (talk) 17:32, 24 May 2013 (UTC)

There is a much clearer example, that's far easier to follow, here: http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ 165.193.168.6 (talk) 10:40, 14 August 2013 (UTC)

IBM version example is very clear and easy to understand. — Preceding unsigned comment added by 12.34.36.250 (talk) 21:16, 22 November 2013 (UTC)