Talk:Multilayer perceptron

MLP solve problems stochastically?...not true
From the Applications section: "They are useful in research in terms of their ability to solve problems stochastically, which often allows one to get approximate solutions for extremely complex problems like fitness approximation." MLPs do not have any stochastic processes. In other words, there isn't a random element to an MLP. This is either an error or someone intended something else. -- Joseagonzalez (talk) 04:21, 9 August 2010 (UTC)

Only difference is non-linear activation... what?!
XOR can be easily represented by a linear activation function multilayer perceptron.

It is just (X1 OR X2) AND NOT (X1 AND X2). All of these can easily be represented by perceptrons, and putting them together simply requires more layers. What is this nonsense about non-linear activation functions being required to make them any different? The difference is that you can compose multiple functions by having each input go to more than one middle node. —Preceding unsigned comment added by 96.32.175.231 (talk • contribs)


 * What exactly do you mean by "linear activation function"? A function f(x) = x? With such a function the network's output becomes a linear function of inputs, which is what a single-layer perceptron also represents. -- X7q (talk) 16:20, 13 June 2010 (UTC)


 * XOR can be easily represented by a linear activation function multilayer perceptron. - So represent it! And post the resulting network here. -- X7q (talk) 16:23, 13 June 2010 (UTC)

Agree with you. Matlab's nntool Network Manager seems to confirm this: it is possible to train a multilayer network with purelin (a linear activation function). Dzid (talk) 15:56, 13 June 2010 (UTC)


 * You must have made a mistake somewhere. -- X7q (talk) 16:20, 13 June 2010 (UTC)


 * The confusion here is in the difference between representation and training. One cannot start with some given linear perceptron and train it to represent XOR. One can, however, create a linear perceptron that represents XOR, but the functional space is disjoint. When training a network, that kind of disjoint result does not occur (unless you specifically separate the topology such that there are two separately-trained pieces).
 * A linear perceptron is always reducible to a hyperplane - that is, $$0 = a_0 x_0 + a_1 x_1 + a_2 x_2 + \dots$$ but it can also be a disjoint set of hyperplanes, though once you allow the network to fluctuate (as in training) the disjointness is unstable and XOR is not possible. So OP is correct - you can represent XOR with a linear perceptron, but you cannot train XOR from a given starting point (unless the network is specifically separated beforehand). SamuelRiv (talk) 18:09, 13 June 2010 (UTC)


 * One can, however, create a linear perceptron that represents XOR. - so please create it, if you can. Then we can discuss what's wrong with it. -- X7q (talk) 18:38, 13 June 2010 (UTC)


 * You mean like this? I'm fairly certain you guys are confusing the hammered-in "hyperplane rule" with the first day of class where they show that an infinite linear neural net is a Turing Machine. Anyway, as I note above, this net is unstable when perturbed and trained. SamuelRiv (talk) 18:41, 13 June 2010 (UTC)


 * The example there uses a step activation function, not a linear activation function. That's why it was capable of representing nonlinear functions like XOR. -- X7q (talk) 18:47, 13 June 2010 (UTC)


 * That's what a linear activation function is. But if you want a continuous line for the second layer function, you can have one - the inputs will still be 1, 0, or -1, so that if the second layer function is y=x, the output is the same. If you want to say that the inputs have to be continuous, then it is simply a matter of turning that into binary (which is another 1st-week class exercise). SamuelRiv (talk) 18:57, 13 June 2010 (UTC)


 * No, it's not. Let me appeal to a reliable source: C. Bishop, Neural Networks for Pattern Recognition. On page 121 it says "Note that, if the activation function of all hidden units in a network are taken to be linear, then for any such network we can always find an equivalent network without hidden units.". Later on the same page Bishop calls it a "Heaviside, or step, activation function", and gives it as an example of a non-linear activation function. -- X7q (talk) 19:23, 13 June 2010 (UTC)


 * Okay then, I see your point, for which that definition of "nonlinear" would be what I meant as disjoint planes. This is then collapsible but requires a certain single-variable preprocessing or post-processing to do so (in the case of collapsing the XOR example, one can take the absolute value of a final output value to get the desired result). That's the whole point of collapsing - if a neural network is nothing but linear vectors, then of course it collapses, like any vector-spanning space. The point of the collapse theorem is to take it a step further such that a simple integrate-and-fire model can collapse completely provided one can take a unit-step or something of the final output, which doesn't change the computational process within the system. By the way, of the major operators (and, or, not, xor, imp), the only one that can be created by the above definition of "linear" is "not". So again, I fail to see the usefulness of that definition. SamuelRiv (talk) 22:36, 13 June 2010 (UTC)


 * To answer OP, the XOR can be reduced to a 2-layer linear perceptron, but the activation function has to be tweaked to get rid of the hidden layer, though it remains linear - basically the step function changes, which is easy to do by preprocessing the input - though there's probably a clearer trick that I'm not aware of. The straight collapse requires, IIRC, all the hidden layer weight matrices to be invertible, which in the XOR example they are not, so this tweak is required, but I am not sure whether or not that is an actual theorem. I should look that up. SamuelRiv (talk) 19:16, 13 June 2010 (UTC)
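
For readers following the thread above, here is a minimal sketch of the point in dispute. It is an illustration, not anything posted by the participants. The hand-picked weights, the function names `step`, `identity`, and `two_layer_net`, and the convention that the step function returns 0 at exactly 0 are all assumptions made for the example. The same two-layer network is evaluated twice: once with a step (Heaviside) activation, which is the nonlinear case in Bishop's terminology, and once with the identity activation f(x) = x.

```python
from itertools import product

def step(z):
    """Heaviside step activation (assumed convention: 0 at z == 0)."""
    return 1 if z > 0 else 0

def identity(z):
    """Linear activation f(z) = z."""
    return z

def two_layer_net(x1, x2, act):
    """Two hidden units and one output unit, with hand-picked weights."""
    h1 = act(x1 + x2 - 0.5)      # behaves like OR under the step activation
    h2 = act(x1 + x2 - 1.5)      # behaves like AND under the step activation
    return act(h1 - h2 - 0.5)    # "h1 AND NOT h2" under the step activation

inputs = list(product([0, 1], repeat=2))   # (0,0), (0,1), (1,0), (1,1)

step_outputs = [two_layer_net(a, b, step) for a, b in inputs]
linear_outputs = [two_layer_net(a, b, identity) for a, b in inputs]

def affine(a, b, c, x1, x2):
    """Any network with only linear activations reduces to this form."""
    return a * x1 + b * x2 + c

def parity_gap(a, b, c):
    """f(0,0) + f(1,1) - f(0,1) - f(1,0); XOR would need this to be -2."""
    return (affine(a, b, c, 0, 0) + affine(a, b, c, 1, 1)
            - affine(a, b, c, 0, 1) - affine(a, b, c, 1, 0))

print("step activation:  ", step_outputs)
print("linear activation:", linear_outputs)
print("parity gaps:      ", [parity_gap(a, b, c) for a, b, c in [(1, 2, 3), (-4, 5, 0)]])
```

With the step activation the network reproduces XOR. With the identity activation the hidden layer cancels algebraically and the output is the constant 0.5. More generally, any affine function of two inputs satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), whereas XOR would require 0 on the left-hand side and 2 on the right, which is consistent with the Bishop quotation above.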

Another application
See this ref 1, where they describe using an MLP in marine energy conversion work.--Billymac00 (talk) 03:13, 2 July 2011 (UTC)


 * MLPs and, more generally, other supervised learning methods are used in many areas of science for data analysis and prediction. The specific application you mention looks to me like a very narrow topic. I don't think we should mention it unless it has attracted sufficient attention from other researchers. -- X7q (talk) 05:34, 3 July 2011 (UTC)

Correction needed in activation function section
There is a discussion on this talk page (Only difference is non-linear activation... what?!) which demonstrates that an activation function is a linear combination of the input nodes, **not an on-off mechanism**. The on-off mechanism is only used on the final output node, not the hidden layer nodes.

As described in the other discussion, an on-off mechanism (in the hidden layer) is an example of a nonlinear activation. However, since the binary function is not differentiable, a neural net using the binary function for activation can't be trained with back-propagation.

These are very important distinctions, and obviously a source of confusion (as demonstrated by the referenced discussion). Unfortunately, I don't have time to fix it at the moment. — Preceding unsigned comment added by Qiemem (talk • contribs) 02:47, 19 February 2013 (UTC)
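
A small numerical sketch of the differentiability point raised in this comment, offered only as an illustration. The function names, the sample points, and the use of a central finite difference in place of an analytic derivative are assumptions of the example. Strictly speaking, the step function is differentiable everywhere except at 0, but its derivative there is zero, so a gradient-based method such as backpropagation would receive no training signal through it. A sigmoid is shown for comparison.

```python
import math

def step(z):
    """Binary (on-off) activation; assumed convention: 0.0 at z == 0."""
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    """Smooth activation commonly used with backpropagation."""
    return 1.0 / (1.0 + math.exp(-z))

def central_difference(f, z, eps=1e-6):
    """Finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Sample points away from the discontinuity at z = 0.
points = [-2.0, -0.5, 0.5, 2.0]

step_grads = [central_difference(step, z) for z in points]
sigmoid_grads = [central_difference(sigmoid, z) for z in points]

print("step gradients:   ", step_grads)
print("sigmoid gradients:", sigmoid_grads)
```

The step gradients are all zero at these points, while the sigmoid gradients are strictly positive, which is why smooth activations are the usual choice when training with backpropagation.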

Error in citation
MLP utilizes a supervised learning technique called backpropagation for training the network.[1][2]

The source [1] was written in the 1960s, whereas backpropagation was published in the context of AI by Rumelhart et al. (PDP group): Rumelhart, D., and J. McClelland (1986), Parallel Distributed Processing, MIT Press, Cambridge, MA.

Suggested solution: Remove reference [1]. — Preceding unsigned comment added 15:04, 6 May 2014 (UTC)

Picture (figure) needed
The current picture shows MLP behavior. But there is no MLP image.--Bojan PLOJ (talk) 09:04, 17 April 2020 (UTC)

Why is the citation for the transformer architecture not "Attention Is All You Need?"
I believe it is the standard citation; see, for example, the second sentence of the transformer article. Pogogreg101 (talk) 04:44, 18 August 2023 (UTC)

27 October 2023: I added the citation. — Preceding unsigned comment added by Pogogreg101 (talk • contribs) 03:51, 28 October 2023 (UTC)

Heaviside Step Function is Nonlinear
Hi,

There is a mistake in the first paragraph. It says that modern neural networks use a nonlinear activation function while the original perceptron uses the Heaviside step function. The Heaviside step function is definitely nonlinear. The difference is that there is now a hidden layer, not that the original perceptron didn't have nonlinear outputs. It did. I would also say it is not a misnomer to call it a multi-layer perceptron as a result. CireNeikual (talk) 19:09, 4 January 2024 (UTC)


 * Agree with this - was very confused by the description of the Heaviside function as not a non-linear function. 67.161.2.182 (talk) 21:35, 8 May 2024 (UTC)
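
A quick check of the claim in this section, as a sketch rather than a proof. A linear function must be additive, meaning f(a + b) = f(a) + f(b) for all a and b. The helper name `is_additive_on`, the sample pairs, and the convention that the step function returns 0 at exactly 0 are assumptions of the example.

```python
def heaviside(z):
    """Heaviside step function (assumed convention: 0 at z == 0)."""
    return 1 if z > 0 else 0

def is_additive_on(f, pairs):
    """True if f(a + b) == f(a) + f(b) for every sampled pair (a, b)."""
    return all(f(a + b) == f(a) + f(b) for a, b in pairs)

pairs = [(1, 1), (2, 3), (-1, 2)]

heaviside_additive = is_additive_on(heaviside, pairs)
identity_additive = is_additive_on(lambda z: z, pairs)

print("Heaviside additive on samples:", heaviside_additive)
print("Identity additive on samples: ", identity_additive)
```

The pair (1, 1) already breaks additivity for the Heaviside function, since heaviside(2) is 1 while heaviside(1) + heaviside(1) is 2, so it cannot be linear.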