Talk:Data mining/Archive 2

=2007=

Misuse of the term
The comment "as the application performs no analysis itself" makes what might be a distinction without a difference. Surely a macro is, logically, a computer program. Is the comment "[A template or pre-defined macro] performs no analysis" saying in effect that most macros are so simple that they cannot be thought of as performing analysis?

Perhaps "data analysis not data mining" should be "data summary not data mining"? The present wording needs changing. We learn that "A key defining factor ... is that the application itself is performing some real analysis". However the final sentence implies that data analysis is not data mining. Hence either data analysis is not real analysis or else data analysis is [a part of] data mining.

The discussion here needs substantial tightening and clarification. Jhmaindonald 09:02, 17 June 2007 (UTC)

Statistical methods
I've removed the following from the page as I don't think it's true. Often? What do you do with the 2^40 models once you have fit them all? What does "smartly" mean? What about interactions and non-linearities? This may be appropriate to include as one form of data mining, but it should not be at the top of the page.

In statistical analysis where there is no underlying theoretical model, data mining is often achieved via stepwise regression methods wherein the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is smartly searched. With the advent of parallel computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of clinical data.
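For readers following this thread, the all-subsets procedure described in the removed text can be sketched roughly as below. This is only an illustrative sketch, not the method of any particular statistical package: the function names (`solve`, `rss`, `all_subsets`) are hypothetical, the fit is plain least squares with an intercept, and the score is the raw residual sum of squares. It enumerates every one of the 2^k subsets of k candidate variables, which is why the approach stops being practical around k = 40 (2^40 is roughly 1.1 trillion models).

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(X, y, subset):
    """Residual sum of squares of a least-squares fit of y on an intercept
    plus the columns of X whose indices are listed in subset."""
    rows = [[1.0] + [row[j] for j in subset] for row in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations; assumes no exact collinearity
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def all_subsets(X, y):
    """Fit every one of the 2^k subsets of the k columns of X and
    return a list of (subset, rss) pairs."""
    k = len(X[0])
    return [(subset, rss(X, y, subset))
            for size in range(k + 1)
            for subset in combinations(range(k), size)]


# Made-up illustration data: y depends only on the first column (y = 2 + 3*x0).
X = [[1, 2, 0], [2, 1, 1], [3, 5, 0], [4, 3, 1], [5, 8, 1], [6, 4, 0]]
y = [5, 8, 11, 14, 17, 20]
results = all_subsets(X, y)
for subset, score in results:
    print(subset, round(score, 6))
```

Raw residual sum of squares never increases when a variable is added, so on its own it always favours the largest model. Choosing among the fitted models therefore needs some penalised criterion, which bears on the question above of what one does with the models once they are all fit.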


 * The "smartly" part is well beyond the current scope of this article, as the explanation requires a substantial background in statistics. The interested reader would click on the link to stepwise regression for more information. As to what is done with the 2^40 models, again the answer is beyond the scope of this article. As to the use of "often," stepwise regression procedures are included in all major statistical packages for the purpose of automating data mining (e.g., SAS, SPSS, eViews). I have returned the deleted text to the article. Wikiant 03:44, 15 November 2007 (UTC)

Stepwise regression is an outdated and poorly performing technique. I would hope that no one is using it for data mining, but is instead using more modern and more statistically well-founded shrinkage-based techniques like the lasso or elastic net. I have removed the text from the article. Hadleywickham (talk) 23:32, 20 November 2007 (UTC)


 * I encourage you to include the critique in the article along with supporting reference(s). As stepwise is a standard procedure included in major statistical packages, its inclusion in this article (regardless of its performance) is appropriate. Once more, I'm putting the deleted text back in. Wikiant (talk) 00:27, 21 November 2007 (UTC)

I don't think it is appropriate to include a technique that is well known to be suboptimal. Just because it is included in major statistical packages does not mean it should be included here. The burden of proof is on you to show that it is useful, not on me to prove that it is not. Hadleywickham (talk) 17:14, 21 November 2007 (UTC)


 * I disagree that the burden of proof is to show that the procedure is useful. The burden of proof (which has been met) is to show that the procedure is relevant to the topic of data mining. For example, someone who is interested in data mining is going to run across stepwise techniques -- if nowhere else, then in the stat package he's using. What do we then gain by not discussing the procedure here? If the technique is well known to be suboptimal, then let's include that discussion also. Wikiant (talk) 18:06, 21 November 2007 (UTC)

Terminology Should Be Brief
I hate removing content, I really do.... Fortunately the old content is saved and can be placed elsewhere. I deleted the extra details regarding knowledge and statistics from the terminology section because they went into descriptive detail and detracted from the focus on data mining. A simple link would suffice, and that information should be woven into the respective wiki pages. Regards... —Preceding unsigned comment added by 67.177.186.96 (talk) 03:30, 4 December 2007 (UTC)