Talk:Perceptron

Article about Perceptron can't even decide what a perceptron is
From the introduction, it is an algorithm but also a neuron:
 * the perceptron (or McCulloch–Pitts neuron) is an algorithm

But it is also an abstract version of neurons using directed graphs and temporal logic:
 * The perceptron was invented in 1943 by Warren McCulloch and Walter Pitts.[5]

(There is no learning algorithm in the paper.) It is also a machine implementing the algorithm:
 * Mark I Perceptron machine, the first implementation of the perceptron algorithm.

But the machine is also an artificial neural network:
 * Rosenblatt called this three-layered perceptron network the alpha-perceptron

It is also any of the individual nodes in such a network:
 * In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function.

And just to be specific, it is actually an algorithm about linear algebra in R^n:
 * a function that maps its input 𝑥 (a real-valued vector) to an output value 𝑓(𝑥) (a single binary value)

Look, I understand if these ambiguities and contradictions already exist in the machine-learning literature. But it is our job as editors to present things clearly. Even just acknowledging the issue in the introduction would be helpful. And I don't think the article is complete without explaining the link between the neural nets on the one hand, and the linear-algebra optimization problem on the other. 2406:3003:2001:2ADD:330D:ABF8:1865:DE5B (talk) 10:20, 18 July 2024 (UTC)

Not understandable for normal people
Is there a prize to make this completely impossible to understand for normal people or students in school? Why is it immediately written with matrices and not first explained on a simple easy to visualize example? 2003:E5:1F18:54DB:99:C38D:A80B:311D (talk) 11:52, 31 May 2024 (UTC)

Example issues
As suggested by Igogo3000, I have removed the learning rate. As mentioned by others earlier, there is no need for a learning rate in the perceptron, as scaling the learning rate simply scales the weights but does not change any predictions. Having a learning rate in the definition is misleading and confusing -- in fact one of the perceptron's greatest advantages compared to similar methods is that it has no learning rate that has to be optimized on a held-out set. By having a learning rate in the article, people implementing it will waste hours tuning the learning rate only to find out that it doesn't matter. Unfortunately, this rendered the table that was there incorrect, so I removed it as suggested by Igogo3000.


 * This seems wrong. The "steps" section below refers to the learning rate, and I have no idea how you would execute those steps without a learning rate.  — Preceding unsigned comment added by 2620:15C:170:110:B5F8:3AC9:8F0F:F878 (talk) 22:05, 21 March 2019 (UTC)

Is Linearly Separable Equation Correct?
In the part where it says the following (see below), if $$d_j=0$$ how can $$( \mathbf{w}\cdot\mathbf{x}_j + b ) d_j > \gamma $$ be greater than $$\gamma$$?

The training set $$D$$ is said to be linearly separable if the positive examples can be separated from the negative examples by a hyperplane; that is, if there exists a positive constant $$\gamma $$ and a weight vector $$\mathbf{w}$$ such that $$( \mathbf{w}\cdot\mathbf{x}_j + b ) d_j > \gamma $$ for all $$1 < j < m$$.

24.177.56.139 (talk) 19:57, 12 October 2013 (UTC)


 * The output values are set to either 1 or -1 to make this equation work out. Q VVERTYVS (hm?) 22:20, 12 October 2013 (UTC)
 * This section is inaccurate in some more ways:
 * It is said that linear separability implies the existence of a positive $$\gamma$$, while in fact it does not. The existence of $$\gamma$$ is an additional requirement for the perceptron to converge. It is nontrivial because, in the case of on-line learning, the training set may be considered infinite.
 * $$\gamma$$ is then used in an estimate for the number of updates, but that formula is only valid if $$||w|| = 1$$. This requirement hasn't been mentioned.
 * Igogo3000 (talk) 16:46, 6 January 2014 (UTC)
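
To make the points in this thread concrete, here is a minimal sketch, not taken from the article. It assumes labels $$d_j \in \{-1, +1\}$$, a bias folded in as a constant-1 input, a unit-norm separating vector (the hypothetical `w_star`), and $$R$$ taken as the largest input norm. The toy data and all names are made up for illustration. It computes the margin $$\gamma$$, trains a perceptron from zero weights, and compares the number of updates with the $$(R/\gamma)^2$$ estimate.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy data (assumed): each input ends with a constant 1 so that the bias b
# is simply the last weight. Labels d_j are +1 or -1, never 0.
X = [(2.0, 1.0, 1.0), (1.0, 3.0, 1.0), (-1.0, -2.0, 1.0), (-2.0, -1.0, 1.0)]
d = [1, 1, -1, -1]

# A known separating direction, normalised so that ||w_star|| = 1.
w_raw = (1.0, 1.0, 0.0)
norm = math.sqrt(dot(w_raw, w_raw))
w_star = tuple(c / norm for c in w_raw)

# Margin gamma: smallest value of d_j * (w_star . x_j) over the training set.
gamma = min(dj * dot(w_star, xj) for xj, dj in zip(X, d))
# Radius R: largest Euclidean norm among the inputs.
R = max(math.sqrt(dot(xj, xj)) for xj in X)

def train(X, d, max_epochs=1000):
    """Classical perceptron with +1/-1 labels, starting from zero weights."""
    w = [0.0] * len(X[0])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for xj, dj in zip(X, d):
            if dj * dot(w, xj) <= 0:  # misclassified or on the boundary
                w = [wi + dj * xi for wi, xi in zip(w, xj)]
                updates += 1
                changed = True
        if not changed:
            break
    return w, updates

w, updates = train(X, d)
bound = (R / gamma) ** 2
print(f"gamma={gamma:.3f} R={R:.3f} updates={updates} bound={bound:.3f}")
```

With labels of 0 the product $$( \mathbf{w}\cdot\mathbf{x}_j + b ) d_j$$ would vanish, which is why the sketch uses -1 for the negative class.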

as discussed
As discussed on Talk:Artificial neuron, there is a lot of overlap between articles on this and various related topics. See that page for some suggestions on how they might best be separated. - IMSoP 18:29, 11 Dec 2003 (UTC)

illustrating images for gaussian data don't differ
Hello, the three pictures (gaussian data in 2d, gaussian data with linear classifier, gaussian data in higher space) do not differ. Is it intended? —The preceding unsigned comment was added by 87.160.196.168 (talk) 19:45, 26 February 2007 (UTC).

Still not changed. This is confusing and seems like an error Benjamin Good (talk)  —Preceding undated comment added 23:56, 29 March 2013 (UTC)

I agree, I'll probably remove them and try to rewrite it in the near future... --Falcorian (talk) 21:03, 11 April 2016 (UTC)

I don't think there was anything to save, so I've removed the images and paragraph. The technique described is useful, but without the images it doesn't really fit and they are (as far as I can tell) completely wrong. Hopefully someone can make better images in the future! --Falcorian (talk) 03:00, 12 April 2016 (UTC)

Is my Pic ok
Please see the discussion page for the XOR perceptron net

pocket ratchet algorithm
Having never heard of this algorithm, I'm disappointed it doesn't have more discussion, either here or on its own page.

McCulloch-Pitts redirect
I'll try and tidy up this article when I get the chance, but McCulloch-Pitts neuron should NOT redirect here. MP neurons are threshold units, whereas neurons in a perceptron model are linear functions of their summed inputs. An MP neuron's activation is essentially u(x) + b, where u(x) is the Heaviside step function, x is the sum of inputs (inputs can either be +1 excitatory or -1 inhibitory), and b is a bias or threshold. DaveWF 07:33, 4 April 2006 (UTC)


 * According to chapter 2 of Rojas' book, the only difference between the classical Rosenblatt perceptron and the McCulloch-Pitts neuron is that the perceptron has weighted inputs. This concurs with Russell & Norvig's AI: A Modern Approach, which says that perceptrons also use the Heaviside step function. Also, McCulloch-Pitts activation isn't a straight sum, because the inhibitory inputs are absolute - a single active inhibitory connection will force the whole output to 0. There's no such thing as a -1 input for MP neurons. 129.215.37.38 17:25, 15 August 2007 (UTC)


 * There should be a separate page for the McCulloch-Pitts neuron, as this is an important first attempt at making an artificial neuron, yet it is fundamentally different from later attempts. Note that several texts mix up McCulloch-Pitts neurons and perceptrons, for example Introduction to the Theory of Neural Computation by Hertz, Krogh, and Palmer. On page 3 an MP neuron is given weights. Jeblad (talk) 15:05, 14 January 2020 (UTC)
 * It is also strictly speaking incorrect that McCulloch/Pitts invented the Perceptron. This sentence should be removed or altered. 194.230.144.205 (talk) 12:38, 19 March 2024 (UTC)
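
A small sketch of the distinction drawn in this thread, following the description attributed above to Rojas: unweighted binary inputs with absolute inhibition for the MP unit, weighted inputs for the perceptron unit. The function names and example values are hypothetical and only meant to illustrate the difference.

```python
def mp_neuron(excitatory, inhibitory, threshold):
    """McCulloch-Pitts unit: binary, unweighted inputs with absolute inhibition."""
    if any(inhibitory):
        # A single active inhibitory line forces the output to 0.
        return 0
    return 1 if sum(excitatory) >= threshold else 0

def perceptron_unit(x, w, b):
    """Perceptron unit: weighted sum plus bias, passed through a step function."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s > 0 else 0

# Two active excitatory inputs reach the threshold of 2 ...
print(mp_neuron([1, 1], [0], threshold=2))
# ... but one active inhibitory input vetoes the output entirely.
print(mp_neuron([1, 1], [1], threshold=2))
# In the perceptron unit a negative weight only subtracts from the sum.
print(perceptron_unit([1, 1, 1], [1.0, 1.0, -0.5], b=-1.0))
```

In the last call the negatively weighted input lowers the sum to 0.5 but does not block the unit from firing, which is the behaviour an MP unit cannot reproduce with a single inhibitory line.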

What do b and R represent?
My lay understanding of scientific, computer, and math topics is good but my understanding of jargon and formulae is poor...

In the definition section we have f(x) = ⟨w, x⟩ + b; all terms are defined except b. What am I missing? -- Hippietrail 16:53, 10 April 2006 (UTC)
 * I've tried to clear this up. Hope that helps. DaveWF 06:22, 5 September 2006 (UTC)

We still have: "if there exists a positive constant γ and a weight vector w such that ...", with the following formula containing a b. Since this is nowhere mentioned to be trained, shouldn't it be "if there exists ... and an offset b such that"? Even more confusing: what's the R in Novikoff's formula? -- dejj

Learning a linear MLP that does XOR
Multiple layers don't help unless they have non-linear activation functions, since any number of linear layers will still give you a linear decision boundary. I'll fix this up later tonight. DaveWF 22:26, 4 September 2006 (UTC)
 * Okay, I've fixed some bias-related stuff but I'm going to do a major rewrite over the next couple of days. The superficial discussion is fine but some of the other stuff is inaccurate or misleading. DaveWF 06:22, 5 September 2006 (UTC)
 * Where does the α < 1 limitation come from? I bet any positive α will work for training. 91.124.83.32
 * If α >= 1, the values oscillate (or so I've read; I haven't worked it out for myself). -AlanUS (talk) 20:03, 20 April 2011 (UTC)
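
The opening remark of this section, that stacked linear layers still give a linear map, can be checked with a short sketch. The weights below are arbitrary integers chosen for illustration; the helper names are hypothetical.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary example weights: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1, -2], [3, 1], [0, 2]]
b1 = [1, 0, -1]
W2 = [[2, 1, -1]]
b2 = [3]

def two_layer(x):
    """Two layers with no non-linearity in between."""
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)

# The same map collapsed into one layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, two_layer(x), one_layer(x))
```

Because the collapsed network is a single affine map, its decision boundary is a hyperplane, so it cannot represent XOR any better than a single unit can.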

Learning rule & bias?
The article suggests that the bias is not adjusted by the perceptron learning rule, but AFAIK this is usually not the case. Can someone who knows more about NNs confirm this? Neilc 23:48, 15 October 2006 (UTC)
 * The bias can be learned just like any other weight, and often is by means of introduction of another input which is always '1'. In this way it's quite similar to linear regression. DaveWF 04:03, 18 October 2006 (UTC)
 * But what is confusing in this article is that there still is an _extra_ bias variable b. To quote the article: "b is the bias term, which in the example below we take to be 0." This is only half-true, because what we set to zero is a _fixed_, non-learnable bias term b. Whereas the learnable bias, introduced through the extra input which always has the value 1 (explanation missing in the text), is very much needed to solve the problem! If nobody objects, I will change this during the next week to be more understandable. 140.105.239.162 (talk) 11:05, 12 November 2011 (UTC)
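
As a sketch of the equivalence discussed above: a separate bias b and a weight on an extra input that is always 1 give the same outputs. The weights here are hand-picked for illustration (they happen to implement NAND) and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_separate_bias(w, b, x):
    """Step function applied to w . x + b, with b kept as its own variable."""
    return 1 if dot(w, x) + b > 0 else 0

def predict_augmented(w_aug, x):
    """Same unit, with the bias stored as the weight of a constant-1 input."""
    return 1 if dot(w_aug, [1] + list(x)) > 0 else 0

w, b = [-2, -1], 3   # hand-picked example weights
w_aug = [b] + w      # the bias becomes weight number 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs_separate = [predict_separate_bias(w, b, x) for x in inputs]
outputs_augmented = [predict_augmented(w_aug, x) for x in inputs]
print(outputs_separate, outputs_augmented)
```

With the augmented form, the ordinary weight update also adjusts the bias, so no separate rule for b is needed.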

Content difficult to understand
I think the article should be updated a little. Even the second paragraph is a little confusing when it calls the perceptron a 'binary classifier'. I will try to draw some pictures, and I think some of the mathematical text should either be cleaned up or replaced with pictures. Paskari 17:37, 29 November 2006 (UTC)

I updated the 'learning algorithm' section, as I found it a little difficult to follow. My version is better in that it is tabular, but I still think it's too difficult to follow. Could someone look it over and update it as need be? Paskari 19:28, 29 November 2006 (UTC)

Multi Layer Perceptron
Could someone who is knowledgeable enough create a page for multi-layer perceptrons? There is a good section on it under Artificial Neural Networks, but it isn't detailed enough. Paskari 19:34, 29 November 2006 (UTC)


 * ✅ We now have a multilayer perceptron article.

Running Time
I am going to create a section outlining the running time and tractability of the algorithm, I hope this is OK with everyone. Paskari 16:53, 1 December 2006 (UTC)


 * First we consider a network with only one neuron. In this neuron, if we assume the length of the largest element in x or in w to be of size n then the running time is simply the running time of the dot product which is $$O(n^3)$$.  The reason for this is that the dot product is the rate determining step.  If we extend the example further, we find that, in a network with k neurons, each with a running time of $$O(n^3)$$, we have an overall running time of $$O(kn^3)$$


 * I've removed this section as it seems completely wrong to me and is unsourced. The number of inputs to a single perceptron defines the O time. Since all the inputs can be inputs to a single perceptron, then the runtime is $$O(n)$$. There are n multiplications, I'm not sure where the power of 3 comes from. 172.188.190.67 17:33, 20 August 2007 (UTC)


 * While a software *simulator* of a single neuron with n inputs may require $$O(n)$$ runtime (since it does n multiplies, one at a time), my understanding is that the original perceptron (as well as some neural nets) was built with dedicated hardware for each input, so it runs much faster, closer to $$O(1)$$ runtime, since all multiplies are done simultaneously. --DavidCary (talk) 20:08, 19 July 2019 (UTC)
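
For the sequential software case mentioned in this thread, a minimal sketch with an explicit counter shows that the weighted sum of a single unit uses one multiplication per input. The function name and the counter are made up for illustration.

```python
def weighted_sum(w, x):
    """Return the dot product of w and x and the number of multiplications used."""
    total = 0.0
    multiplications = 0
    for wi, xi in zip(w, x):
        total += wi * xi
        multiplications += 1
    return total, multiplications

for n in (1, 10, 1000):
    _, count = weighted_sum([1.0] * n, [2.0] * n)
    print(n, count)
```

The count grows linearly with the number of inputs, which matches the $$O(n)$$ figure for a sequential simulator.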

Initial weight vector
What is the weight vector set to in the first run through the training set? Is it equal to your x vector?
 * It doesn't matter, random would work just as well 91.124.83.32

Training set definition
In the training set defined as ⟨x1, y1⟩, ..., ⟨xm, ym⟩, it should be clarified that the vector y1...ym is the desired output set, with values being either 1 or 0. (This is my assumption, and I'm completely new to this field, so I'm not editing the page myself.) —Preceding unsigned comment added by 123.236.188.135 (talk) 11:57, 15 August 2009 (UTC)

Suspected Excessive Promotion of Herve Abdi
Another reference to Herve Abdi, inserted by an anonymous user with ip address 129.110.8.39 which seems to belong to the University of Texas at Dallas. Apparently the only editing activity so far has been to insert excessive references to publications by Herve Abdi, of the University of Texas at Dallas. The effect is that many Wikipedia articles on serious scientific topics currently are citing numerous rather obscure publications by Abdi et al, while ignoring much more influential original publications by others. I think this constitutes an abuse of Wikipedia. A while ago, as a matter of decency, I suggested to 129.110.8.39 to remove all the inappropriate references in the numerous articles edited by 129.110.8.39, before others do it. For several months nothing has happened. I think the time has come to delete the obscure reference. Truecobb 21:37, 15 July 2007 (UTC)

Minsky and Grossberg
I am looking throughout Wikipedia for inaccurate claims regarding Minsky and Papert's Perceptrons book, and I intend to correct all of them. One interesting fact regarding the contents of this page here is that one of the exaggerated claims was inserted together with a reference to this Grossberg article, in an anonymous edit from an IP that never did anything else. Here is the edit in question:

http://en.wikipedia.org/w/index.php?title=Perceptron&oldid=61149477

Does anybody know who inserted this text?...

I don't know what this article has to do with that book. First of all, not only does Perceptrons give proofs of how to implement the XOR (parity) function for any number of inputs, but many other books before it had already said that (for example, Rosenblatt's book). This article seems to deal with more complex networks, with feedback. I believe that this reference to this article is just something that people hear and go on repeating without questioning.

My doubt is: should we simply remove any reference to articles claiming to have “countered” the Perceptrons book, or should we keep them to show that research in ANNs never really stopped? -- NIC1138 (talk) 22:32, 24 March 2008 (UTC)

Very important confusion of names
It seems to me that this article confuses a perceptron (a three-layered network) with a single neuron. The “associative” neurons, as Rosenblatt called them, reside in an intermediate layer (what today has the strange name of “hidden” layer). The output is a single summation followed by a limitation in the case of what Rosenblatt called a “simple perceptron”, with more outputs in more complex perceptrons. We will need some intense rewrites to reflect this in the article...

I believe the training Rosenblatt did in the beginning was just using randomly assigned weights in the first layer, and then adjusting only the output weights. Perhaps this is what leads people to the confusion, believing that the early perceptron was just a simple linear classifier, when it was already a second-order structure. -- NIC1138 (talk) 17:16, 25 March 2008 (UTC)


 * The concept 'perceptron' may be a synonym of 'neuron', so in this regard the article is fine. See http://encyclopedia2.thefreedictionary.com/Perceptron -- 90.156.88.124 (talk) 19:38, 22 November 2009 (UTC)

NAND learning perceptron with all zero inputs
Note that in the given example the inputs 0,0,0 are not given. With a desired output of 1 and all zero inputs, the sum of the products would not yield 1 as output on any epoch. Is there an example of this in practical use? Should this be mentioned in the article? Shiggity (talk) 07:51, 13 August 2008 (UTC)
 * Basically, the example only shows the process used for learning; it is not a practical example. A single "neuron" cannot even learn some functions on two inputs (XOR and identity), though NAND on three inputs should be possible (with a negative threshold and all possible inputs being used to "teach" the perceptron); the example just doesn't reflect that. 82.231.41.7 (talk) 19:07, 14 August 2008 (UTC)
 * The first input is the constant 1 input whose weight is in effect the bias. The example is a fine practical example.  130.60.5.218 (talk) 17:59, 24 November 2009 (UTC)
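
To illustrate the last reply, here is a minimal sketch of a perceptron learning two-input NAND when the first input is a constant 1 whose weight acts as the bias. This is not the article's example: it assumes a threshold of 0, integer weights starting at zero and an update step of 1, and the names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(w, x):
    return 1 if dot(w, x) > 0 else 0

# Each sample is (constant 1, x1, x2) with the desired NAND output.
data = [
    ((1, 0, 0), 1),
    ((1, 0, 1), 1),
    ((1, 1, 0), 1),
    ((1, 1, 1), 0),
]

def train(data, max_epochs=100):
    w = [0, 0, 0]
    for _ in range(max_epochs):
        errors = 0
        for x, d in data:
            err = d - predict(w, x)
            if err != 0:
                # Update every weight, including weight 0 (the bias).
                w = [wi + err * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # a full pass without mistakes: done
            break
    return w

w = train(data)
print(w, [predict(w, x) for x, _ in data])
```

The all-zero pattern (0, 0) is handled through the constant input: its weight alone produces the positive sum needed for an output of 1.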

Kernel trick and structured inputs and outputs
I removed this statement


 * The kernel-perceptron not only can handle nonlinearly separable data but can also go beyond vectors and classify instances having a relational representation (e.g. trees, graphs or sequences).

because I believe that the kernel trick is not necessary for going beyond vectors. Perceptrive (talk) 20:35, 16 March 2009 (UTC)

Why binary classifier?
Why does the "definition" section define a perceptron as a binary classifier? It can also be used for e.g. linear regression. It seems perhaps more appropriate to define it as a function that linearly maps inputs to an output, and according to the application (classification, regression), a different transfer function is used on the output node. Msnel (talk) 08:44, 16 January 2011 (UTC)

Doesn't correspond to the known facts
I don't know English, so please excuse the machine translation. The article contains obvious errors. The History section does not correspond to the facts at all. And the main point: the perceptron invented by Rosenblatt is not a linear classifier. See the Russian version of the article (Перцептрон), where everything is verified against primary sources in more detail.

I have therefore added the Hypothesis template and ask that the article be brought in line with the facts, although I know that what is described here is a common, traditional error. --SergeyJ (talk) 08:51, 1 June 2011 (UTC)

The initial weights of the next iteration
In the example "the final weight of one iteration become the initial weights of the next iteration". When I look at the final weights w0, w1 and w2 of the first iteration, they are {(0.1, 0, 0),(0.2, 0, 0.1),(0.3, 0.1, 0.1),(0.3, 0.1, 0.1)} whereas the initial weights of the second iteration seem to be different: {(0.3, 0.1, 0.1),(0.4, 0.1, 0.1),(0.5, 0.1, 0.2),(0.5, 0.1, 0.2)}. Am I missing something here? — Preceding unsigned comment added by 193.66.174.253 (talk) 20:05, 26 March 2013 (UTC)

I noticed that the next iteration cycle is not for the whole sample set but for the next sample of the sample set. — Preceding unsigned comment added by Christe56 (talk • contribs) 04:50, 27 March 2013 (UTC)

computational geometry?!
"In computational geometry, the perceptron is an algorithm..." - the perceptron is not related to geometry... it is a general machine learning algorithm --Erel Segal (talk) 10:17, 27 May 2013 (UTC)

Error Conversation
For step 3 in "Steps", why would we only sum from j to s? Why not sum from 1 to s? A citation would be nice.

Additionally for step 3, why would the error not be squared as it usually is ? It seems that errors resulting from a false negative (+1) and a false positive (-1) would balance out.

Clarkatron (talk) 17:00, 26 February 2014 (UTC)


 * The intended sum ranges over j=1 to s. The error is wrong, it should be absolute (fixed that). Q VVERTYVS (hm?) 17:52, 26 February 2014 (UTC)


 * Thanks. I think that's a huge improvement. Although I understand that what you outlined would accomplish the same thing, why not use mean squared error? It seems to be standard in cases like these. Clarkatron (talk) 19:22, 27 February 2014 (UTC)


 * Mean squared error is a regression loss. In classification, it doesn't matter how far the decision function is from -1 or +1, just whether it's bigger than zero or not, i.e. whether the model makes the right prediction. (E.g. when the prediction is 10 for a positive example, the squared error is still (10 - 1)² = 81 even though the prediction is correct.) Q VVERTYVS (hm?) 22:45, 27 February 2014 (UTC)
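
The contrast in the last reply can be sketched in a few lines. It compares the squared error with a zero-one classification loss for a decision value of 10 on a positive example, assuming labels of +1 and -1. The function names are hypothetical.

```python
def squared_error(score, label):
    """Regression-style loss: penalises distance from the label."""
    return (score - label) ** 2

def zero_one_loss(score, label):
    """Classification loss: 0 if the sign of the score matches the label, else 1."""
    return 0 if score * label > 0 else 1

score, label = 10, 1
print(squared_error(score, label), zero_one_loss(score, label))
```

The squared error is large although the sign of the score is right, while the zero-one loss only reacts when the sign is wrong.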

ThinkQuest has been discontinued
So we should remove that reference. (Not the link that wikipedia chose to place after this comment :))

Who solved the XOR-problem first?
Please compare the history section with the article on artificial neural networks and other sources. In the before-mentioned article it is not said that Minsky and Papert solved the XOR-problem with respect to artificial neural networks. Rather it sounds like Werbos 1974 was the first to solve this problem. Here the contrary is said. Garrafao (talk) 14:25, 11 June 2015 (UTC)

Python code incorrect?
I think there's a bug in the Python code. It seems to be updating the weights with each training example for use with the next training example, but shouldn't the weights be updated based on all training examples before the new weights are used? Otherwise you seem to just enter into a limit cycle. I'm saying the code should be:

27.33.60.133 (talk) 16:24, 30 August 2015 (UTC)


 * Both are valid options. Updating after each training example is the "classical" perceptron, which works in a true online setting (each example is shown exactly once to the algorithm and discarded thereafter). The convergence proof by Novikoff applies to the online algorithm. Q VVERTYVS (hm?) 18:10, 30 August 2015 (UTC)
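
The two options named in the reply can be sketched side by side. This is not the code from the article or from the original comment (which is not preserved above). It is a hypothetical illustration, assuming a step activation with threshold 0, a constant-1 first input and an update step of 1.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(w, x):
    return 1 if dot(w, x) > 0 else 0

def online_epoch(w, data):
    """Classical perceptron: update after every example, use new weights at once."""
    w = list(w)
    for x, d in data:
        err = d - predict(w, x)
        w = [wi + err * xi for wi, xi in zip(w, x)]
    return w

def batch_epoch(w, data):
    """Batch variant: all errors use the weights from the start of the epoch."""
    delta = [0] * len(w)
    for x, d in data:
        err = d - predict(w, x)
        delta = [di + err * xi for di, xi in zip(delta, x)]
    return [wi + di for wi, di in zip(w, delta)]

# NAND with a constant-1 first input, used only as example data.
data = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

print(online_epoch([0, 0, 0], data))
print(batch_epoch([0, 0, 0], data))
```

After one pass from zero weights the two variants already disagree, because the online version sees the effect of its first correction before it reaches the later examples.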

No permission to use collectively
Who gave permission to use perceptrons collectively on the same vector? Whose idea was it? https://discourse.processing.org/t/flaw-in-current-neural-networks/11512 Also there is no mention of the linear associative memory aspect of perceptrons. Which is rather important as there are under capacity, capacity, and over capacity cases with very different consequences. — Preceding unsigned comment added by 14.177.144.110 (talk) 23:35, 2 June 2019 (UTC)

Learning rate does not change predictions?
The article says "Unlike other linear classification algorithms such as logistic regression, there is no need for a learning rate in the perceptron algorithm. This is because multiplying the update by any constant simply rescales the weights but never changes the sign of the prediction.[9]".

Multiplying the update by a constant does not rescale the weights per se, but only the update. If the learning rate is infinitely small the training process will not change the weights at all and the perceptron will not converge. It might not change the sign of the previous prediction (obviously) but it will affect the sign of the next one by affecting how much the weights are changed by training. This can be easily proved empirically by implementing a perceptron with weights that are very small in relation to the inputs and random weight initialisations.

I will be removing those lines from the article, especially because the source used is a small comment in a Professor's lecture notes that is even smaller than the text on the article itself without any (empirical or theoretical) proof of its own. Yalesrios (talk) 20:17, 28 March 2020 (UTC)

The answer is yes and no. Note that, as long as nonzero updates occur, the weights vector is growing at a rate proportional to $$r$$. The longer the training goes on, the larger the weights vector becomes compared to the update. The changes in the direction of the weights vector become smaller and smaller. If you start with zero weights, the rate $$r$$ has indeed no influence on the results whatsoever (it just rescales the weights vector). If you start with an initial guess for $$\mathbf{w}$$, a small update rate makes the changes smaller. 77.11.67.138 (talk) 05:51, 10 October 2022 (UTC)
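
The zero-initialisation case described in the last reply can be checked with a small sketch. It assumes a step activation with threshold 0 and a constant-1 first input, and it uses rates that are powers of two so that the floating-point scaling is exact. The data and names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(w, x):
    return 1 if dot(w, x) > 0 else 0

def train(data, r, w0, epochs=20):
    """Perceptron rule w <- w + r * (d - y) * x, applied example by example."""
    w = list(w0)
    for _ in range(epochs):
        for x, d in data:
            y = predict(w, x)
            w = [wi + r * (d - y) * xi for wi, xi in zip(w, x)]
    return w

# NAND with a constant-1 first input, used only as example data.
data = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
zero = [0.0, 0.0, 0.0]

w_one = train(data, 1.0, zero)
w_half = train(data, 0.5, zero)
w_four = train(data, 4.0, zero)

for w in (w_one, w_half, w_four):
    print(w, [predict(w, x) for x, _ in data])
```

Starting from zero, every intermediate weight vector is just r times the one obtained with r = 1, so the sign of each weighted sum, and hence each prediction and each update decision, is identical. With a nonzero starting vector this scaling argument no longer applies.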

Core definition issue
I think the core definition requires revision. It currently states: "In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers." The small correction is that the algorithm works for both supervised and unsupervised learning. Musides (talk) 01:47, 24 May 2023 (UTC)


 * I feel like the entire article has to be redone. Some other language versions of it define the Perceptron as a network, not just a model of one neuron (and don't conflate it with the McCulloch-Pitts neuron). Vkardp (talk) 14:57, 6 July 2023 (UTC)

Step 2.a - Isn't f only defined for vectors?
I think the function f was defined only for vectors in the section "Definition". Here it is acting like the Heaviside step function. Sr cricri (talk) 01:05, 23 September 2023 (UTC)