Talk:Data mining/Archive 1

=2006=
Why does the article start by giving the abbreviation DMM for "Data mining"? Why the extra M?

Data Mining vs Data Farming
Data Mining and Data Farming are Different & Distinct Activities

I think the topics "data mining" and "data farming" should be kept fully separate from each other. There should (at most) be a link from one to the other suggesting a possible relationship in that data farming may use data mining after a simulation process.

Data mining on databases in most cases does not require simulations at all. In fact the data that is being analyzed often comes from the real world, e.g. sales transactions, patient records, manufacturing data, etc. And data mining is a far more widespread activity than data farming. Although data farming may use some data mining as one component, it is a different activity. One might say that:
 * Datafarming = Model design + Simulation + Datamining

In the same sense, it does not make sense to merge the topic "simulation" with data farming, but one can mention a possible relationship.

Enabling data mining
Before you can mine data, you have to transform it into a data-minable form. At work, we've spent many years and considerable money trying to put structured and semi-structured XML data into a minable form using an ETL process.

Are there people who agree that a section on "Enabling data mining" might be a reasonable section to add to this Data mining article?
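
To make the idea concrete, here is a minimal sketch of the kind of transformation described above: flattening semi-structured XML into uniform rows that a mining tool could consume. The element names (`order`, `customer`, `total`), the sample data, and the helper `flatten_orders` are all invented for illustration; a real ETL process would involve far more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the second order has no <total> element.
SAMPLE_XML = """
<orders>
  <order id="1"><customer>Ann</customer><total>12.50</total></order>
  <order id="2"><customer>Bob</customer></order>
</orders>
"""


def flatten_orders(xml_text, fields=("customer", "total")):
    """Parse the XML (extract) and emit one flat dict per <order> (transform)."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for order in root.iter("order"):
        row = {"id": order.get("id")}
        for name in fields:
            # findtext returns None when the child element is missing,
            # so every row ends up with the same set of keys.
            row[name] = order.findtext(name)
        rows.append(row)
    return rows


rows = flatten_orders(SAMPLE_XML)
for row in rows:
    print(row)
```

The point of the sketch is only that every record ends up with the same columns, with missing values made explicit, which is what most mining algorithms expect as input.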

Organization of this entire CS field?
This portion of Computer Science is a field almost all by itself. As a CS student that also works in the business department, data mining is covered as part of the business school coursework as well. I think this specific page should do better at introducing data mining as more than just a random topic in CS. DM is one of the hottest topics in computing today and there is very little organized info on it on wikipedia. For example where is info on apriori and RIPPER for rule based classifiers? I will try to help where I can, but I am no expert. I can help propose a structure however. -- blainegarrett (garr0154@umn.edu)


 * I agree that more details need to be written. But how do you mean "random topic in CS"?  Being a computational field, it is certainly closely related to CS, and is usually taught by CS departments.  As mentioned in the article, it is also related to other ways of data analysis, like statistics.  Again as mentioned, it is also related to artificial intelligence, because extracting knowledge from data is an AI problem.  Of course, it has applications in business, and other areas, as the article says.  So what more should be said about that?  RIPPER is a rule-learning algorithm.  It is not covered on WP at all right now, and probably should be. -Pgan002 01:25, 10 August 2006 (UTC)

General Comments
Very weird --- long introduction, then a table of contents, followed by brief history, some links and some references.

I didn't find this entry academic enough, there aren't enough examples of theory, applications and methods, it mostly talks about the dangers of data mining as if it is a dangerous thing. Needs content. --Exa 15:53, 20 Apr 2004 (UTC)

The Diapers and Beer example is somewhat apocryphal and has little basis in fact. See http://www.dba-oracle.com/oracle_tips_beer_diapers_data_warehouse.htm for more detail.

I've removed the Motley Fool Foolish Four reference as an example of retrospective data mining. Funnily enough, I worked at the Motley Fool at the time -- the UK site -- and the statement as it stood was factually incorrect.

Without wanting to digress too wildly: the Foolish Four was abandoned because at a time of rather huge gains with practically every other investment strategy (this was when the FTSE 100 had recently hit 7,000 and the Dow 11,000) it was showing a small loss and this was considered embarrassing. Had it in fact been continued it would likely have outperformed most other strategies (and indeed the market) over the next year or so at least.

It's true that data mining is a particular danger with such "mechanical" investing strategies and I've tried to reflect that in the paragraph I've inserted. Mswake 22:43, 26 Feb 2004 (UTC)

Data mining is becoming more than just conventional processing; apparently it is now expanding into other fields, including multimedia data mining. However, I'm not sure whether extended data mining styles/methods can be covered in the data mining article or should be covered in another article.

I added a simple example of what data mining offers (8:13 EST 21 Nov. 2004)

Except for maybe some very specific subfields, data mining seems to be managementspeak-ish. I mean, in most cases where the word is used, it looks like it is just a synonym for the scientific method in its full generality. Discovering human-useful strategies from oracles in combinatorial games (e.g. from endgame tablebases for chess) [See the addition I've made in the page] is called data mining, when it is called anything at all. (I published a paper once, in the early days of this thing - it was at first rejected, but when I resubmitted it with the only change being the title - I inserted the magic term "data mining" in the title, and the same journal promptly published it.)

The article reads well as an introduction to Data Mining, until the line mentioning the A Priori algorithm, at which point the technical level of the article shoots up and the reader is bombarded with terminology unfamiliar to anyone without a mathematical background. For example, no explanation of an oracle is given (the linked article deals with ancient mythological beings) and the writer makes many other assumptions about the reader's level of knowledge. Perhaps this paragraph could be put as a separate "technical details" section, or at least be re-written in a less dense (no pun) more patient manner.

As a newcomer to the field, I found the article very lightweight. Some of the better introductions to the topic are available in the links that are cited.


 * I, too, am a newbie to the field but have an electrical engineering background. I agree with the above post re the introduction of jargon and the spike in technical complexity of the article.  The style and format of IEEE Spectrum magazine -- where subjects are discussed and expounded upon in three increasingly complex stages -- might be a useful model here.  Easier said than done, I know. -- Joe Lombardi 16:45, 17 March 2006 (UTC)

I found this section confusing/inadequate:

"As is the case for economic models which successfully predict 10 of the last 3 recessions, one must of course know which other names came up on the "possible members" list before being confident this was not an exercise in data dredging."

What is this about? 10 of the last 3 recessions? Clarify please


 * This refers to a criticism of data mining: if it is not performed correctly, it becomes "data dredging", which has a high Type 1 error rate. A humorous example is a model which correctly predicted the last three recessions, but which also predicted seven other recessions that didn't occur (type-1 errors).  Depending on your field, you may also hear this referred to as predicting 10 out of the last 3 bear markets, predicting 10 out of the last 3 California earthquakes, predicting 10 out of the last 3 rainstorms, etc.  capitalist 03:25, 2 August 2006 (UTC)
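
The arithmetic behind the "10 of the last 3 recessions" joke can be sketched in a few lines. The data below are made up to match the example in the reply above (10 predicted recessions, 3 of which occurred), and the helper name `alarm_stats` is hypothetical.

```python
def alarm_stats(predicted, occurred):
    """Count hits and false alarms (type-1 errors) for yes/no predictions."""
    pairs = list(zip(predicted, occurred))
    hits = sum(1 for p, o in pairs if p and o)
    false_alarms = sum(1 for p, o in pairs if p and not o)
    alarms = hits + false_alarms
    # Share of predicted recessions that actually happened.
    precision = hits / alarms if alarms else 0.0
    return hits, false_alarms, precision


# Made-up data: the model cries "recession" 10 times; only 3 really occur.
predicted = [True] * 10
occurred = [True] * 3 + [False] * 7

hits, false_alarms, precision = alarm_stats(predicted, occurred)
print(hits, false_alarms, precision)  # 3 7 0.3
```

So the model "catches" every recession, yet 7 of its 10 alarms are false, which is why the list of all predictions matters and not just the successes.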

The section "Combinatorial game data mining" needs cleaning up.

I found several things unclear or missing in the article.

First, there is no reference to the important field of pre-processing, which almost always needs to be done on the data before one can start the mining process. How often are your data sets/data stores/data collections in the form of matrices or ready-to-mine graphs?

Second, the term data mining usually refers to four techniques:

1. Prediction. This one is mentioned.

2. Clustering.

3. Finding outliers.

4. Finding local patterns.

I don't think this article sheds light on these techniques.

Oct 2007, Fred Flint. (Ref. Skillicorn: Understanding complex Datasets) —Preceding unsigned comment added by 193.216.65.76 (talk) 09:23, 13 October 2007 (UTC)

Merge - Tech mining
There is nothing in Tech mining. Ripper234 17:40, 4 April 2006 (UTC)

See also TOC entry should be categorized
The see also entries should be categorized. At the moment this is a wild mixture of algorithms, general categories, programming APIs and so on. It's a mess. JKW 12:33, 22 April 2006 (UTC)

Reference SPAM policy
I have removed the following articles:
 * Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining Association Rules between Sets of Items in Large Databases (1993). Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, May 26–28, pp.207–216.
 * Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules (1994). Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), September 12–15, pp.487–499.

Why? I appreciate of course the scientific value, but not here! Those articles are too specific, and should be part of Database mining or better the subcategory Relational data mining.

Any other comments on this? JKW 20:15, 23 April 2006 (UTC)

Software categorization and avoiding SPAM
Since there exists tons of potential SPAM publishers, we should also categorize the software part to identify earlier link SPAM.

JKW 20:18, 23 April 2006 (UTC)


 * Maybe we need something analogous to List of PDF software. But is Wikipedia now a link collection or not? JKW 22:45, 23 April 2006 (UTC)
 * I think this article should only list notable software - that is software with its own viable Wikipedia article. That would remove the external links->software section completely. -- zzuuzz (talk) 20:30, 2 May 2006 (UTC)
 * Agree with zzuuzz - lists are among the many things WP is NOT - and I'm not sure the PDF software list fits either.... -- MarcoTolo 20:39, 2 May 2006 (UTC)

(Merge - Tech mining) I don't think merge is needed
Tech mining is not related to Data mining. As a CS student I studied that topic in university. I think Data mining is related to data warehousing and Customer Relationship Management (CRM) used in business databases; text mining is also an example of Data mining.
 * Text mining and tech mining are not equivalent, are they? And, please post some more comment based on review articles about tech mining. I think we should ensure that Wikipedia is not misused to invent new terms. So, even if some people have tried to create this term, we need references for it. JKW 20:11, 3 May 2006 (UTC)
 * Tech mining is an application of data (and text) mining which is used to reveal new and emerging technologies out of an existing body of literature. It's an established concept, but I think this term is only used by a few authors (ISBN 047147567X ). I would consider it a sub-field of this article - an application of data mining, but even though the article is currently small, I see no reason why it shouldn't have an article of its own - it is different. Then again, I have no objections to the merge. -- zzuuzz (talk) 21:22, 3 May 2006 (UTC)

Maybe the author meant "Text Mining" not "Tech Mining". Text mining is an offshoot of data mining that focuses on insights culled from unstructured content. I'm an IT analyst in the field, covering all the players, and have never heard of "tech mining." —The preceding unsigned comment was added by 206.41.49.27 (talk • contribs).

Data farming merge
This merge makes no sense... data farming is used for simulations across large clusters, data mining is a learning technique for large data sets. What do these two have in common? --Beefyt 04:12, 7 July 2006 (UTC)

Agree that a merge is a bad idea - though data farming may utilize data mining techniques when analyzing the generated data, it is a very different process. -Lawton

Data farming should be possibly merged with grid computing, but merging with data mining is a strange proposition.

Database Mining merge
As far as I can tell, "Database Mining" is just a different way of saying "data mining". If there is a difference, it should be explained by a single sentence in the main article. --DavidLeeLambert 18:15, 18 July 2006 (UTC)

There's a big difference: if you have a database to mine, you have some common format, so you can use that structure to make the mining easier and/or faster. There's a really good video at Google in which nonlinear data mining is explained. However, that video shows an example of general-purpose data mining, where the different datasets don't share any common format, which means that the algorithm in use has to be much more sophisticated. The video also shows that a naive approach leads to an n^4 algorithm, while the one then presented needs only n^1.4.

Online Poker Datamining
The term "datamining" is also used by online poker players to refer to the practice of using a program to watch and record hands on many different poker tables at once in order to get detailed information on how a variety of players play in order to devise strategies to counteract them. This usage is pretty common and should be mentioned somewhere in this article IMO. AlexMc 22:24, 29 July 2006 (UTC)

Data mining not data analysis?
The introduction says: "'Although the term is usually used in relation to analysis of data, like artificial intelligence, it is an umbrella term and is used with varied meaning in a wide range of contexts.'"

The sentence is at best uninformative, and I think incorrect. What else can data mining be, other than analysis of data?? -Pgan002 01:37, 10 August 2006 (UTC)


 * One possibility is that Data Mining is a form of exploration of the data as opposed to analysis. Analysis implies that we're breaking down the data into its constituent parts somehow and examining each separately.  Exploration would imply a study of the qualities of the entire dataset as a whole...more of a synthesis than an analysis.  I'm not really making that argument at this point, but I present it as a possible example to answer your question.  capitalist 02:53, 10 August 2006 (UTC)


 * I think it's more of a question of specialistic jargon: "data analysis" assumes very specific meanings, although somewhat different in different fields, and I understand that the same happens for "data mining"; but it's true that they both are ways to "analyse the data". I would say that you're welcome to try to reformulate that phrase... -- Sergio Ballestrero 06:07, 10 August 2006 (UTC)


 * I encourage debate with regard to these questions. In my experience, after several years of data warehousing, it is indeed along the lines of specialist jargon, as cited by Sergio above.  I agree that the introduction sentence is far too inclusive to convey any meaning fit for a wiki topic.  We are trying to define a phrase (jargon?), not debate the literal English meaning of the words "analysis" and "mining" with their relationship to "data".  Instead of trying to avoid being wrong, I assert that we define these topics in a narrow fashion in order to avoid confusion.  This is a wiki - and represents a source of definition!  With that said, I believe my experience can add to those definitions.


 * "Data Mining" - is usually reserved as a way to describe doing pattern analysis on extremely large volumes of data. Taking from the "Data Mining" definition, this involves "non-trivial" means.  It is something that must be automated, and usually has at least one algorithm asserted, and implemented in code.   In other words: it is the effective declaration of a pattern, which when applied will produce results that will either confirm or deny the assertion.  It is in this definition that "data mining" is taken seriously as a discipline, so that the idea of "data dredging" is avoided - where people simply look for any kind of pattern, and then try to explain it.  The act of "data dredging" can indeed be helpful, but the validity must be more rigorously questioned, as in the case of Schwed's roulette wheel: "Sadly enough, they have usually found it.", where  "it" refers to the explanation of a relationship, rather than the confirmation of an assertion which is supported by the data.


 * "Data Analysis" - is of course, as a literal statement, similar to "data mining". However, as a way to convey meaning that is exclusive to that of "data mining", I have found that "data analysis" usually entails what Capitalist cites above.  It is a way of breaking down relationships and values in order to generate meaning.  This is usually a far more encompassing phrase, and is far more often found to be used in the literal meaning of the English: "data analysis", rather than referring to a specific discipline.


 * "Data Discovery" - I add to the mix, in order to account for an unmentioned area. "Data Discovery" is a "trivial" form of "data mining", that usually involves "data analysis".  It is done to become familiar with data.  One way to become familiar with data is to query it in various ways in order to analyze the relationships.  It is rare that this form of analysis involves programmatic algorithms which must be automated and may iterate over several volumes of data.  This term is asserted in order to avoid confusion with "Data Mining".


 * I would, of my own opinion, consider the following. I have worked at a DM company for about 8 years, my comments are my own and do not represent my company. Data mining is a more hip slang term to keep money coming into the statistical software market (or business intelligence market or decision support systems market, whatever). We just pulled the wool over the eyes of a lot of biz people (read: algorithmically illiterate) to make it all sound more high tech and promising. Given the failures of AI and statistical expert systems over the previous decades (the false promises of neural nets and expert rule systems around the 1970s), data mining in this regard is an attempt to reinvent and reintroduce the same stuff that has been pushed for 50 years: decision support software. I think it is a negative viewpoint, and I don't even like to say it, but it rings true time and again. Data mining is a marketing term with a marketing purpose. It differentiates itself from data analysis because data mining software vendors want people to believe they are doing something that is not plain old data analysis. At an extreme level, one could say that data mining is solely a marketing term layered on top of data analysis and is not even encyclopaedic (that is too extreme for me though). So that is my marketing rant, but don't hold me to it please. My boss would disagree with me anyways. My second comment would be data mining's primary emphasis on unsupervised learning or applied AI. Yes, it is data analysis, but it is relatively more central to the concept of data mining. Really this is all just a spectrum, or continuous scale, of how many inputs are introduced into the question (or query) being asked/answered/modeled. In other words, the level of human bias and background knowledge. 
Data analysis covers the full spectrum but resides primarily on the side with a higher level of human bias, where the question is already known, and data analysis is performed to confirm curiosity (T tests are fun!). Data mining falls into the other side of the scale. The trend is to reduce the inputs or parameters to the algorithm, down to the point where the interface between the analyst and the computer centers around asking a question and getting results, without caring about what machine learning tool found it (black box stuff). For that matter, you could say that data mining does not center around having preconceived beliefs about the nature or patterns of a dataset. A data mining query is more abstract than a data analysis query in that the question becomes "is there anything in there that suggests a common, underlying pattern or trend or explanation of events" as opposed to dealing with "confirm what the weather channel said". My experience has led me to be a big fan of Gregory Piatetsky-Shapiro and his definition of data mining and his website. Mr. Shapiro says: "Data Mining is the process of finding new and potentially useful knowledge from data". He then references this article, which is a wonderful resource for those who want to learn a little bit more. Greg also points to this article by Susan Imberman, which those who find the ambiguity interesting may find an OK read. I could go on, but hopefully I made some kind of point, hope it helps. Josh Froelich 18:59, 8 January 2007 (UTC)
 * I liked this table from http://www.dmreview.com/article_sub.cfm?articleId=2367. First question is OLAP, second is DM, as an example of one component of data analysis in comparison to DM's component of predictive modeling (forecasting/regression analysis). Josh Froelich 02:10, 9 January 2007 (UTC)


 * What was the response rate to our mailing? What is the profile of people who are likely to respond to future mailings?
 * How many units of our new product did we sell to our existing customers? Which existing customers are likely to buy our next new product?
 * Who were my 10 best customers last year? Which 10 customers offer me the greatest profit potential?
 * Which customers didn't renew their policies last month? Which customers are likely to switch to the competition in the next six months?
 * Which customers defaulted on their loans? Is this customer likely to be a good credit risk?
 * What were sales by region last quarter? What are expected sales by region next year?
 * What percentage of the parts we produced yesterday are defective? What can I do to improve throughput and reduce scrap?

Data dredging and large data sets
The article says "'...large data sets invariably happen to have some exciting relationships peculiar to that data'"

This problem is not due to large data sets. It is more pronounced in small data sets. For example, given one smoker wearing red and one non-smoker not wearing red, we have the relationship "smokers wear red". This spurious pattern is much less likely to occur in a randomly selected data set of 100 people. -Pgan002 02:50, 10 August 2006 (UTC)


 * Technically what you say is true, but the problem is perception. In my experience in physics data analysis, students without much experience and training in statistics can easily identify and discard as such the random coincidences of few events, but tend not to consider them as random (and compare them to the probabilities of random correlations) when the numbers become large. So it does sound very believable to me that the problem becomes more relevant with large datasets, especially since many data mining users will probably have less statistics background than a physics student. -- Sergio Ballestrero 06:19, 10 August 2006 (UTC)


 * It depends on what you mean by "large." If by "large," you mean "a large number of observations," then (ceteris paribus) increasing the number of observations decreases the probability of a spurious relationship existing between the outcome variable and a given explanatory variable. However, if by "large," you mean "a large number of explanatory variables," then (ceteris paribus) increasing the number of (potential) explanatory variables *increases* the probability of a spurious relationship existing between the outcome variable and at least one of the explanatory variables. Wikiant 18:32, 14 August 2006 (UTC)


 * Large requires more nuanced definition(s), perhaps along the lines of my final three bullet points. (This is an issue more widely in the data mining literature.)


 * Note first "large" as measured by number of bytes required for storage. This is an issue for database managers, affecting also the ease with which users can manipulate the data. If the interests of most users are best served by replacing minute by minute temperature readings by hourly or daily or monthly averages, then perhaps the more summarized data should be provided as the default.


 * For making generalizations from the data (patterns that persist), size in bytes is not per se important. Three aspects are important
 * Increasing the number of variables increases the opportunity for spurious association, as has been mentioned.
 * Groupings within the data are important, if the aim is to generalize to other groups. Suppose for example that data, although in terabytes, are from 5 branches out of 50 within an organization, and that the search is for patterns that may have implications across all 50 branches. For such purposes there are, at best (assuming that the 5 branches can be treated as a random sample) 5 observations only at the level that matters.  A bytewise large data set may, having in mind an intended use, be hopelessly small.
 * In general, patterns, if they are to be of interest, must persist in time (what is the relevance of 2006 to 2007?) as well as in space (other branches of the same organization).
 * Jhmaindonald 08:02, 17 June 2007 (UTC)
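
The point made in this thread about the number of explanatory variables can be illustrated with a small simulation. This is only a sketch under stated assumptions: the outcome and all candidate variables are independent random noise, so any correlation found is spurious by construction; the sample size (30), the number of candidates (200), the seed, and the helper names `pearson` and `strongest_spurious` are arbitrary choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def strongest_spurious(outcome, candidates):
    """Largest |correlation| between the outcome and any candidate variable."""
    return max(abs(pearson(c, outcome)) for c in candidates)


rng = random.Random(0)   # fixed seed so the run is repeatable
n_obs = 30               # number of observations (rows)
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
# 200 candidate explanatory variables, all pure noise.
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(200)]

few = strongest_spurious(outcome, candidates[:5])   # search among 5 variables
many = strongest_spurious(outcome, candidates)      # search among all 200
print(f"best of 5: {few:.2f}, best of 200: {many:.2f}")
```

Because the 5 variables are a subset of the 200, the best correlation found can only stay the same or grow as more candidates are searched. Searching more columns therefore gives chance more opportunities to produce an "exciting" relationship, independently of how many bytes the data set occupies.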

Web Directory or Encyclopedia (revisited)
Are all the external links on this article really necessary? It seems like one gets added every day?--Timdew 19:56, 25 August 2006 (UTC)
 * 1. Feel free to reduce the list, I have done the same a while ago. Especially external links from people without Wikipedia account are suspicious and people like to abuse Wikipedia to push their own Web-sites.
 * 2. We should define what is a notable software and web link according to Notability (web). If this does not fit, an extension of the rules should be discussed there.
 * JKW 20:06, 25 August 2006 (UTC)
 * Personally I don't believe any external software links are necessary, as notable software should have a Wikipedia page and therefore be in the "see also" category on this page (but I may be being overly harsh).--Timdew 20:40, 25 August 2006 (UTC)

So what's the consensus? external links - software (external) do they stay or go??? --Timdew 09:37, 29 August 2006 (UTC)

I say if the software isn't award-winning or distinguished in the field, then it should definitely go. Otherwise, the entries are just advertisements whose informational value, while non-zero, is trivial with respect to the standards of an encyclopedia. -ERI employee 21:25, 20 September 2006 (UTC)

Beer and diapers story
Hoping this comes out formatted correctly (I'm not a wikipedian, only trying to bring this to the attention of editors who can act on it). The "data mining" entry features an unattributed "fact" about the well-known beer-and-diapers example of data mining. The wording of the text states this as fact, whereas in BI circles it is regarded as a parable (albeit an interesting one). Please see the following reference for a brief background on this issue: http://www.dssresources.com/newsletters/66.php
 * Hello there. First of all, I moved this comment down to the bottom of the discussion page because these pages move chronologically from top to bottom.  I know that's not generally the convention on the web, but for some odd reason Wikipedia has it set up that way.  :0)


 * Regarding the diapers and beer example, the original wording in the article stated that this was a hypothetical example that was widely used. The article did not present the correlation as a fact.  What happened was that the word "hypothetical" was part of a dead link that was removed by a user.  At that point, since the word "hypothetical" disappeared, it made it look like the example was being presented as a fact.  Apparently the same user then added a citation tag to the sentence.  I have re-inserted the caveat that the example is only hypothetical, and have removed the citation tag.   The diaper/beer example can be found by using Google Scholar and searching for something like "diaper beer data mining" or whatever.  There are dozens of papers that use this example, but all the ones I looked at refer to it as a hypothetical example.  capitalist 03:11, 30 August 2006 (UTC)

Sorry, I removed the dead link as there was a 404 error when I clicked it. I forgot to reinstate the word "hypothetical". It was my intention to state in talk that we should have a better example with a verifiable source and not a "widely known" one, as this adds weasel words to an otherwise good article.--Timdew 10:37, 30 August 2006 (UTC)


 * What about if we just replaced the "widely known example" phrase with something like "a hypothetical example used in many papers on data mining"? That would at least be a little more specific and would not require one of us to go searching for a different citeable example. capitalist 05:51, 31 August 2006 (UTC)

Minor change note
• I added the category "Unstructured data mining" for an emerging subfield of data mining, removed "text mining" from "Structured data mining," and added it to "Unstructured data mining." Text is not a structured form. I also added "image mining" to the "Unstructured data mining." -ERI employee 16:16, 20 September 2006 (UTC)
 * I agree, with a note that text is not traditionally considered structured, but one of the underlying drivers of natural language processing is to extract structure from text, based on the assumption that there is inherent structure within textual data, just not to the same standard as RDBMS and two-dimensional datasets. In that regard text is multi-dimensional, depending on one's approach to its facets or features. I mean, look at part-of-speech tagging: what else is it doing other than extracting the syntactical structure inherent within? Just opinion, I agree with the separation and use it myself, but it is somewhat of a simplification and can mislead less informed readers. Josh Froelich 02:03, 9 January 2007 (UTC)

Software section = Spam
The "Software (external)" section seems like little more than spam and contrary to external links policy. I suggest its removal, as it is only a magnet for further spamming. Thoughts? --ZimZalaBim (talk) 11:08, 16 October 2006 (UTC)
 * No feedback; thus, I'm removing the section per WP:EL. --ZimZalaBim (talk) 09:35, 31 October 2006 (UTC)
 * I agree, since I already cleaned it up a while ago. And still people are drawn to add products rather than information.