Talk:Statistical classification

Focus
It seems that this page deals primarily with two related topics:

1 - statistical methods for performing classification (e.g. linear regression)

2 - statistical methods for training machine learning algorithms to perform classification.

This is not nearly the entirety of the involvement of statistics in classification. For example, neither of these concerns is relevant to a pregnancy test, and yet a pregnancy test is developed with statistical methods (e.g. sensitivity and specificity).
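
For readers unfamiliar with the two measures mentioned above, here is a minimal sketch of how sensitivity and specificity are computed from validation counts. The function name and the counts are hypothetical, chosen only for illustration; they do not come from any real test.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of actual positives the test detects
    specificity = tn / (tn + fp)  # share of actual negatives the test rejects
    return sensitivity, specificity


# Hypothetical validation counts for a binary test (illustrative numbers only).
sens, spec = sensitivity_specificity(tp=95, fn=5, tn=180, fp=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Note that nothing here involves training an algorithm; the statistics only evaluate an existing test.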

So it seems to me a better title and focus for the page would be "Algorithmic classification" or "Computer classification".

Now, there is a nod to the more general role of statistics in the evaluation section. But this generality is not reflected in the rest of the page, except in the title and the very first line (which I therefore find misleading). So I would suggest moving the evaluation section to the Classification page, or starting a new page that links all the stuff scattered around Wikipedia on this topic. The Classification page is in very poor shape at the moment but it does address the generality of classification and is the appropriate place for the basic concept.

Willbown (talk) 10:50, 11 June 2024 (UTC)

Link to peer reviewed paper
Hi, I recently added some new information regarding the comparison of various classification techniques, with a reference to a peer-reviewed article. There seems to be some controversy on this subject: the link has been removed several times. I am currently doing my PhD on this topic and I know the information is very relevant.

Is the reference and the external link http://www.pattenrecognition.co.za suitable for this site? If not, what can I do so that this information is not repeatedly removed?

cvdwalt —The preceding unsigned comment was added by 155.232.128.10 (talk) 07:03, 9 March 2007 (UTC).

Wikipedia is not a platform for shameless self-promotion. Adding your own paper on this topic is clearly a conflict of interest (see this: http://en.wikipedia.org/wiki/Wikipedia:Conflict_of_interest).

Your paper has been removed from the Pattern recognition article (again, because of self-promotion). See here: http://en.wikipedia.org/wiki/Talk:Pattern_recognition#Link_to_peer_reviewed_paper. So you just added it to this article.

Furthermore, the journal that the paper was published in is fairly obscure and unnotable. If you really want to cite a paper on this subject, cite one from the topline journals (such as JMLR). —Preceding unsigned comment added by 97.107.142.93 (talk) 13:07, 2 October 2010 (UTC)

Merging/reorganizing Pattern Recognition and Statistical classification
Regarding Statistical classification/temp, I'm puzzled how to integrate it into existing articles. Check out Classification: there are two types of classification, Taxonomic classification and Statistical classification. I think you may be talking about taxonomic classification, but I'm not sure. We've made a distinction: taxonomic classification is based on human decision-making, while statistical classification is based on algorithmic decision-making.

-- hike395 July 1, 2005 17:55 (UTC)

I checked out the existing pages again. I'm talking about algorithmic decision making in the temp article--the classification of items into groups based on numerical/statistical analysis using some algorithm. My issue is that under the category of algorithmic decision making, the topic is discussed only in terms of pattern recognition/machine learning and there is no general explanation of what statistical classification is and does. Algorithmic (or computational, or numerical, or statistical--I use them synonymously) classification can be, and is, applied to all kinds of things. So I think an overall summary of what stat. classif. is and does--how it works, its underlying ideas, types of approaches, specific algorithms etc.--is needed. Particular applications can then be discussed after that. To jump to a pattern recognition application immediately is just getting too specific too fast on just one of many, many applications.

Also, the applications listed under taxonomic classification are not necessarily based on human decision making. Some of them can be algorithmic as well, such as phenetics-based classifications of organisms.

I don't disagree with your idea to break the topic into human vs algorithmic-based procedures, but I think some work needs to be done to make everything clearer. What I wrote needs to be expanded on for sure, but is a basic intro which can be built on I hope.

Jeeb 2 July 2005 00:18 (UTC)


 * What a conundrum. I've thought about it, and I agree with you: Statistical classification should be the main article about classification in statistics. The problem is: what to do about Pattern recognition? Here are three issues: 1) lots of pages link to pattern recognition. If it turns into a redirect, it would surprise a lot of people; 2) if it is too similar to statistical classification, they will slowly evolve to have different/conflicting information (given that Wikipedians are not thorough about checking for redundant articles); and 3) ...


 * Issue 3 is a doozy, and it goes back to the sociology of AI research. AI research goes through boom/bust cycles that seem to last 10-20 years. Each cycle generates a new name. In the 1950s and 1960s, the statistical AI approach was called pattern recognition (especially applied to computer vision tasks). In the 1980s, it was called neural networks (and it was vaguely neuromorphic). In the 1990s, it was called machine learning. In each cycle (except for machine learning?), the researchers overpromised and their area fell into disrepute. The name fell out of favor, except for those die-hard people who stayed with the same techniques. Thus, we still have pattern recognition conferences (ICPR), neural network conferences (IJCNN), and machine learning conferences (ICML) that all co-exist.


 * So, I think that we should rewrite pattern recognition to be a more historical/sociological article about statistical AI, rather than a listing of techniques.


 * The problem is, it's an enormous undertaking, and people may not fully agree. I can take a stab at making a stubby start of the article. The problem is that, without a lot of meat in the article, it may drift into replicating statistical classification. Also, we would need to find sources for the history of pattern recognition, which is somewhat tricky.


 * -- hike395 July 7, 2005 06:05 (UTC)


 * More data! Check out the FOLDOC definition of Pattern Recognition. They distinguish PR from statistical classification by 1) claiming that PR is a subfield, 2) PR systems solve the whole problem (including pre-processing), and 3) there are non-statistical classification approaches to PR (including syntactic classification, which I had forgotten about). -- hike395 July 7, 2005 15:38 (UTC)


 * ...and I realize, on re-reading pattern recognition (PR), that I had been thinking of it as synonymous with image analysis when I made my initial comments and wrote the temp article, but the article makes it clear that PR is broader than just image analysis, which I agree with. Nevertheless, I think PR and statistical classification (SC) are different because of (1) your comment that PR can involve non-statistical (e.g. syntactical) approaches, and (2) SC (and PR) can be unsupervised (the PR article as written focuses on training sets and mapping a set of items onto an appropriate classification label using such sets--which means it is talking only of supervised classification procedures. But classification can also be unsupervised, with the labeling of classification groups coming later via some independent, non-mathematical procedure). So in some respects PR seems to me broader than SC, and in other ways narrower, so I'm not so sure that PR is a subfield of SC; I'm prone now to think it's actually broader, but at any rate, I think they're certainly different enough to warrant separate articles.


 * Including the historical evolution of PR sounds like a good idea, but I think some info on methods and techniques should be included as well, because PR seems to me to have important and distinguishing elements (like the incorporation of syntactic or contextual information that you mention). (It is in that respect especially that I think PR is broader than stat. classif., which never, to my knowledge, deals with syntactical information or the whole concept of topological relationships among items or groups).


 * How about two separate articles without any redirects, justified by clear distinctions between the two in the articles--simply remove the redirect from SC to PR that now exists, put the existing SC-temp article where SC now is, and then continue to edit the two articles using this (and future) discussion as a basis for it? No links to PR would be affected that way, and any existing links to SC would not redirect a reader to PR.  As for the enormous undertaking, I think this minimizes it because we can just slowly continue to revise the two existing articles as we discuss the relationship between the two topics...
 * Jeeb


 * Re-reading the temp page material, I realize that it uses terminology that is not standard in either machine learning or pattern recognition. The temp page is fundamentally about clustering, not statistical classification. It assumes that there are no fixed, pre-defined classes. Instead, the data guides the creation of the classes. The temp page describes both partitional clustering and agglomerative clustering.


 * Remember, the essence of classification algorithms is that they take a training set that has both input data and labels. A clustering algorithm only takes input data (no labels). What we have here is a good introduction to clustering. I'll look over there and see if it fits in. -- hike395 06:46, July 26, 2005 (UTC)


 * Later: went to . This makes sense: Jeeb is a community ecologist, so he was using the terminology from that field. I made a note in the main article. We've been using opposite definitions of classification for a month now :-(. -- hike395 07:10, July 26, 2005 (UTC)


 * Ah, good work! I agree that what I wrote in "temp" falls into what many would call "cluster analysis". However, I don't think it's a clear-cut distinction, because, for example, I believe it is common for satellite imagery analysts to use the terms "supervised" and "unsupervised" classification in their work, the former corresponding to your definition of classification, the latter to your definition of clustering. The statsoft online textbook, which I use as a main reference, follows your line of argument in that they have cluster analysis as a chapter separate from classification, and the description therein supports your idea. On the other hand, their definition of classification in their glossary (http://www.statsoft.com/textbook/glosfra.html) mentions nothing about putting things into a priori labeled classes, and their def of cluster analysis says it's a "classification algorithm". I would like to see this definitional fuzziness cleared up, but I'm willing to go with your ideas in the meantime.


 * I think the statistical classification article still needs a bit of elaboration on the intro before the math and details that follow, although I see you worked on it some. Also, I think "numerical classification" is used as a synonym, particularly in taxonomic applications, and I think that term is actually more accurate in some ways.


 * As for the temp article, I don't know if some of that can be useful in the clustering article or not, but I think so. Jeeb 04:39, 3 August 2005 (UTC)
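
The distinction this thread converged on (a classifier is fitted to inputs plus labels, a clustering algorithm sees inputs only) can be sketched as follows. This is an illustrative toy on one-dimensional data, assuming a nearest-centroid classifier and a two-group k-means-style clusterer; the function names and data are hypothetical and not taken from either article.

```python
from statistics import mean


def fit_nearest_centroid(xs, labels):
    """Supervised: uses the labels to compute one centroid per predefined class."""
    classes = sorted(set(labels))
    return {c: mean([x for x, lab in zip(xs, labels) if lab == c]) for c in classes}


def predict(centroids, x):
    """Assign x to the predefined class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(xs, iters=10):
    """Unsupervised: split xs into two groups without labels (1-D k-means, k=2).

    Assumes xs contains at least two distinct values.
    """
    lo, hi = min(xs), max(xs)  # initial centres at the extremes of the data
    groups = []
    for _ in range(iters):
        groups = [0 if abs(x - lo) <= abs(x - hi) else 1 for x in xs]
        lo = mean([x for x, g in zip(xs, groups) if g == 0])
        hi = mean([x for x, g in zip(xs, groups) if g == 1])
    return groups


xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["a", "a", "a", "b", "b", "b"]

centroids = fit_nearest_centroid(xs, labels)  # needs xs AND labels
print(predict(centroids, 4.0))                # a named, predefined class

print(two_means(xs))                          # needs xs only; groups are unnamed
```

The classifier returns one of the labels it was trained on, whereas the clusterer returns anonymous group indices that would have to be named afterwards by some separate procedure.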

"Machine learning" ???
The opening sentence is:

"Statistical classification is a supervised machine learning procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled items."

To my way of thinking, this fails to distinguish between two kinds of properties of statistical classification. First and most important is what the goal of statistical classification is. Second, and less important, is by what means this goal is achieved.

By mixing together a) the goal and b) how it is achieved, this definition of the article's subject is more confusing than helpful to anyone trying to learn what statistical classification is. Daqu (talk) 09:06, 16 February 2010 (UTC)

Serious problems
This article has serious problems. The biggest issue is that it's confusing statistical classification with classification in general. There are plenty of non-statistical classification techniques, such as decision trees and support vector machines, some of which are described or referenced in this article. I'm getting this article renamed to classification (machine learning) and when this is done I will rewrite the article to fix these problems. Benwing (talk) 03:24, 5 October 2010 (UTC)
 * OK, the intro is now rewritten. More to come. Benwing (talk) 04:01, 5 October 2010 (UTC)
 * You should get agreement before renaming the article. You seem to think that "statistical classification" covers only cases where a formal statistical model is involved. But classification has been treated by a wide variety of methods within statistics for a long time, and many techniques treated have been empirical rather than being model-based. "Machine learning" may be a term familiar to some, but possibly only those dedicated to "machine learning", and I doubt if anyone in a real applied field would use it. There certainly needs to be an article on "statistical classification" in order for it to be a pair with something on "statistical clustering" (possibly under other names). If there is something to be said that pertains to "machine learning" specifically, then there can be another article. Melcombe (talk) 09:12, 5 October 2010 (UTC)
 * I doubt your statement that no one in any "real applied field" knows what "machine learning" is, but whether that's the case or not, it's no excuse for having an article that's misnamed. There is nothing statistical about many of the most commonly used classification algorithms, e.g. support vector machines (SVMs), decision trees, k-nearest neighbor classifiers. Things like perceptrons, neural networks (multi-layer perceptrons), and radial basis function classifiers can potentially be statistical, although in their most basic incarnation they're not. "Statistical" means "generates and works with probabilities", nothing more or less. If some people who don't have a good hold on statistics abuse "statistical" to mean "empirical" or "making soft decisions using weights" then that's not our problem -- we can make a note of this misuse of terminology but we don't have to and shouldn't continue it ourselves. And classifiers like decision trees aren't even remotely similar to anything statistical. The former version of this article was a mish-mash of description of the general classification problem and specifically statistical classification, and formerly there was no article at all on classification as a general machine learning problem. What is your alternative? Do you seriously think the status quo is OK? If you refuse to allow the renaming, I'll do some cutting and pasting to put this article (which I have almost completely rewritten, anyway) where it belongs and put something or other in "statistical classification", but I'd rather preserve the history. We certainly can't have the current status quo remain. Benwing (talk) 10:45, 5 October 2010 (UTC)
 * In any case, there isn't an article on "statistical clustering" so I have zero idea what you're talking about when you say there "needs to be an article on 'statistical classification'". Have you actually read through the article as it stands?  It has a description of exactly what statistical classification is and how it differs from non-statistical classification.  Do you object to this?  Do you consider it insufficient?  Why is it necessary to have a separate page on statistical classification, given that there's no such page for clustering? Benwing (talk) 10:55, 5 October 2010 (UTC)
 * It is certainly interesting that you pretend to have a definition of "Statistical" that does not include "data analysis" in some form. And of course "machine learning" implies that it is something that can only be done with computers, whereas statistical classification was being done before the invention of computers. Melcombe (talk) 14:43, 5 October 2010 (UTC)
 * I could reverse this comment and observe that your personal definition of "computer" is strictly linked to the physical object and not the algorithmic/computational/mathematical research behind it, which has unarguably predated statistical science by many-many centuries. The main point however is that "statistical classification" refers to "statistical learning" (already existing article) and cannot be generalised to include all data classification, which includes techniques that don't use any statistical concepts at all. Delafé (talk) 10:56, 13 January 2013 (UTC)


 * I looked through the literature and the usage of "statistical classification" and "statistical classifier" is inconsistent, ranging anywhere from meaning "probabilistic" to "uses weights to make a soft decision" to "anything algorithmic". I have chosen to use "probabilistic classification" instead in the article, which hopefully should be clear. (Note, this is also the usage in Bishop "Pattern Recognition and Machine Learning"; Bishop does not use the term "statistical classification" at all, probably due to the confusion about this term.) Given that "statistical classification" has such an inconsistent usage, it's not suitable as the canonical title.  I moved the page back to "classification (machine learning)", with redirects from "statistical classification" and a discussion in the article of the inconsistent usage.  Hopefully that should satisfy your concerns. Benwing (talk) 23:46, 5 October 2010 (UTC)
 * Obviously it does not. Not only is there the question of what an appropriate title should be, there is the matter of the Wikipedia convention that consensus must be sought on major changes such as radical renaming of articles. Melcombe (talk) 08:41, 6 October 2010 (UTC)
 * Look, I'm happy to try and find a consensus, but you haven't actually responded to any of the concerns I've brought up. Benwing (talk) 03:37, 9 October 2010 (UTC)

In view of the attempted hijacking of the previously existing topic of this article, I have replaced the content with something more appropriate, with the intention that it should be the "main article" for the existing category Category:Statistical classification. The major part of the version created by Benwing is at classification in machine learning. This restores a situation similar to that existing before the undiscussed renaming by Benwing. The opinions added by Benwing, and now in classification in machine learning, still have no sources backing them up. The slightly changed name is to allow for the working of automatic procedures for replacing multiple redirects. Melcombe (talk) 12:48, 23 December 2010 (UTC)

Fisher a frequentist?
How can Fisher's linear discriminant be a frequentist method if he assumes that the data has a multivariate normal distribution? Doesn't assuming a distribution make the method Bayesian? Mrdthree (talk) 00:11, 24 May 2011 (UTC)

Requested move back to Statistical classification - Done

 * The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section. 

The result of the move request was: page moved per request. - GTBacchus(talk) 14:39, 23 June 2011 (UTC)

Statistical classification (machine learning) → Statistical classification – There is an existing article Classification in machine learning that covers the topic of application in machine learning. The name statistical classification is used as the main article for Category:Statistical classification and this is the role this article had before its un-discussed name change. Meanwhile the wikilink statistical classification redirects to a poor and inappropriate article. The proposal is that this article should replace the redirect, restoring the previous status. Melcombe (talk) 10:58, 15 June 2011 (UTC)


 * Support: It's counterintuitive for an undisambiguated page to redirect while the disambiguated page is a separate article. –CWenger ( ^ •  @ ) 15:19, 15 June 2011 (UTC)
 * The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Robotics attention needed
Chaosdruid (talk) 12:23, 28 February 2012 (UTC)
 * Check after merge-move
 * Reassess if necessary

Rename/move (again)
As I see, this topic has already been brought up in the past but no action was ever taken. The fact is that the title of this article is problematic as well as inconsistent with mainstream usage in the literature. Statistical classification can only refer to statistical learning, which is only a subfield of classification theory in general, as there are many classification techniques (e.g. KNN) that have little or nothing to do with statistics. In addition, there is already an article on Statistical learning theory, so that topic is fully covered. When the vote against the move was made, there used to be an article called Classification in machine learning, but it appears that the article was later deleted and redirected here. I strongly believe that the current article should be renamed to something like Data classification or Classification (data analysis) or Classification (machine learning) or something similar. Delafé (talk) 10:36, 13 January 2013 (UTC)

Terminology: "classification" is supervised, "clustering" is unsupervised -- Really?
A search on google books demonstrates there is no consensus on this terminology. Therefore, I intend to rename the two articles to Supervised classification and clustering and Unsupervised classification and clustering. Fgnievinski (talk) 23:37, 3 May 2014 (UTC)


 * I do think this is by far the most common terminology, and I do not think renaming the articles to almost exactly the same title is beneficial. IMHO your search terms are unfair, and do not represent the common terminology. But indeed, there are books called “unsupervised classification” that then talk about clustering. So you have convinced me that a disambiguation from classification to clustering is sensible. I'm not so sure about the other way around: almost every hit on “supervised clustering” is for semi-supervised methods, and one of the first book matches (2005) claims novelty on this term. So at least “supervised clustering” does not seem to exist. 188.98.199.93 (talk) 05:58, 4 May 2014 (UTC)


 * Google books for "supervised clustering" -"semi-supervised clustering" yields 75 results; half of which still seem to be "partially supervised". Clustering is quite different from classification; and many researchers doing classification have a hard time understanding the mindset of clustering. "When you have a hammer, everything starts to look like a nail." - I know that many people in classification consider clustering only as the "ugly duckling" of classification. But when you talk to researchers focusing on clustering, they will consider it as a fundamentally different challenge (finding something new, as opposed to modeling a predefined structure).
 * I therefore oppose this rename. Let's stick with "Classification" and "Clustering"; they are two different (but related) things. The most obvious difference is the availability of labels (=supervision), but the underlying difference is much more fundamental: when doing clustering, you don't exactly know what you are looking for (which is why most clustering methods are published in the Knowledge Discovery in Databases community), whereas in classification you have a clearly defined objective function: minimize the mis-predicted labels, making it a Machine learning task. There is some overlap, but these two communities are not the same (e.g. Microsoft academic: ML vs. DM are indeed two different subdomains). Closely related is the discussion Data Mining != Machine Learning (but data mining uses machine learning).
 * No need to open this can of worms - there is plenty of literature supporting the "classification is (semi-)supervised, clustering is unsupervised" distinction. --Chire (talk) 08:30, 5 May 2014 (UTC)


 * OK, I'm withdrawing the renaming proposal. But outsiders to both the ML and DM fields do not immediately recognize the distinction, which is why I maintain that the hatnotes at the top of the articles should be retained. Which brings me to another contention, this one about the "problems" section of the ML banner; please see Template_talk:Machine_learning_bar. Thanks. Fgnievinski (talk) 12:31, 5 May 2014 (UTC)

The hatnote should be maintained, because "(unsupervised) classification" is actually an old synonym among statisticians for what is now more often called clustering (at least in the ML community). Q VVERTYVS (hm?) 14:54, 5 May 2014 (UTC)

Also, I have trouble with "statistical classification", as it implies a distinction with "non-statistical classification", which is nonsensical; I think it should be called "automatic classification". Fgnievinski (talk) 15:17, 5 May 2014 (UTC)


 * I don't like the title much either, but automatic classification can also mean applying hand-crafted rules and patterns. The present article really discusses the statistical/machine learning approach, where an algorithm is used to generate the rules. Q VVERTYVS (hm?) 16:47, 5 May 2014 (UTC)


 * I don't think "automatic classification" is better, and there is non-statistical classification: manual taxonomies, as common in biology. See also the disambiguation page: classification. For us it is only "Classification", but that term is too generic. We could however rename it to Classification (statistics) or Classification (machine learning); there are many articles following this naming scheme, for example Active learning (machine learning) and Feature (machine learning). But we should check history, this has been considered before (see above). --Chire (talk) 08:25, 7 May 2014 (UTC)

And the following section needs to be brought in alignment with the agreements above: Pattern_recognition; e.g., it starts with "This article contains an extensive list of statistical classifiers for supervised and unsupervised classification tasks, clustering and general regression prediction." (shivers) Fgnievinski (talk) 15:17, 5 May 2014 (UTC)


 * I've removed that entire section. While it contained some truth, it was a blatant violation of WP:NOTHOWTO and completely unreferenced. Q VVERTYVS (hm?) 16:47, 5 May 2014 (UTC)

Evaluation
A plethora of classification performance indicators have been proposed in the scientific literature. Nobody can say that one is better than another; that depends on what people do with the indicators. In this article, there is only a discussion of the uncertainty coefficient and its advantage over simple accuracy. Why is this particular indicator singled out?

Also, the advantage that is given is clearly a mistake, even if it is written in the cited paper. The uncertainty coefficient cannot be insensitive to the relative sizes of the different classes. If we follow the link to the page devoted to the uncertainty coefficient, we see that it depends on H(x) which depends on P(x). But P(x) is the prior probabilities, or in other words, the relative sizes of the different classes. — Preceding unsigned comment added by 149.154.235.63 (talk) 01:30, 24 December 2017 (UTC)
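
The dependence described above can be checked numerically. The following is a rough sketch, assuming the usual definition U(X|Y) = (H(X) - H(X|Y)) / H(X) and a hypothetical classifier that is right 90% of the time on each class; the function names and numbers are illustrative only and are not taken from the cited paper.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def uncertainty_coefficient(joint):
    """U(X|Y) = (H(X) - H(X|Y)) / H(X) for a table joint[x][y] of probabilities."""
    px = [sum(row) for row in joint]          # class priors P(x)
    py = [sum(col) for col in zip(*joint)]    # distribution of predictions P(y)
    h_x = entropy(px)
    h_xy = entropy([p for row in joint for p in row])
    h_x_given_y = h_xy - entropy(py)          # chain rule: H(X|Y) = H(X,Y) - H(Y)
    return (h_x - h_x_given_y) / h_x


def joint_for_prior(p, acc=0.9):
    """Joint table for two classes with prior p and the same per-class accuracy."""
    return [[p * acc, p * (1 - acc)],
            [(1 - p) * (1 - acc), (1 - p) * acc]]


u_balanced = uncertainty_coefficient(joint_for_prior(0.5))
u_skewed = uncertainty_coefficient(joint_for_prior(0.9))
print(u_balanced, u_skewed)
```

With the per-class accuracy held fixed, the two values differ once the class proportions change, which is consistent with the objection that the coefficient depends on P(x) through H(X).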

Is there a distinction from machine learning
Is there a distinction from machine learning in that SC is about classification into a small, or at least finite, number of categories/classes, whereas ML might output a set or vector of reals? - Rod57 (talk) 19:08, 21 March 2018 (UTC)

"Decision Stream" Editing Campaign
This article has been targeted by an (apparent) campaign to insert "Decision Stream" into various Wikipedia pages about machine learning. "Decision Stream" refers to a recently published paper that currently has zero academic citations. The number of articles that have been specifically edited to include "Decision Stream" within the last couple of months suggests conflict-of-interest editing by someone who wants to advertise this paper. They are monitoring these pages and quickly reverting any edits to remove this content.

Known articles targeted:
 * Artificial intelligence
 * Statistical classification
 * Deep learning
 * Random forest
 * Decision tree learning
 * Decision tree
 * Pruning (decision trees)
 * Predictive analytics
 * Chi-square automatic interaction detection
 * MNIST database  — Preceding unsigned comment added by ForgotMyPW (talk • contribs) 17:37, 2 September 2018 (UTC)

BustYourMyth (talk) 19:16, 26 July 2018 (UTC)

Dear BustYourMyth,

Your activity is quite suspicious: registering a user account just to delete the mention of one popular article. People from different countries with a positive history of Wikipedia improvement are taking part in reverting your changes as well as in providing information about "Decision Stream".

Kind regards, Dave — Preceding unsigned comment added by 62.119.167.36 (talk) 13:35, 27 July 2018 (UTC)

I asked for partial protection at WP:ANI North8000 (talk) 17:06, 27 July 2018 (UTC)

The short description
The short description suffers from "infinite recursion"; perhaps we can choose a better description. Thatsme314 (talk) 18:08, 7 June 2023 (UTC)


 * I rephrased it, but I didn't think about it very much; feel free to adjust. Suriname0 (talk) 00:48, 8 June 2023 (UTC)

merge
I propose merging into Pattern recognition. There is nothing distinct about the two topics; they are just synonyms. DMH43 (talk) 01:38, 12 December 2023 (UTC)