Talk:Transformer (deep learning architecture)

Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 5 September 2019 and 10 December 2019. Further details are available on the course page. Student editor(s): Iliao2345.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 04:23, 18 January 2022 (UTC)

Suggestions for the "Background" section
The first sentence mentions "attention mechanisms" without explaining what they are. Unfortunately, no article by that name exists, and a reader looking at the RNN, LSTM, and GRU pages will find no mention of them. I think this paragraph needs to be explicit about *which* specific models introduced attention mechanisms, with adequate citation. --Ninepoints (talk) 19:25, 21 July 2020 (UTC)


 * For what it's worth, there's this now:
 * Attention (machine learning)
 * – AndyFielding (talk) 11:14, 18 April 2024 (UTC)

Feedback from Logan Paterson on Isaac Liao's article
Logkailp (talk) 14:41, 22 October 2019 (UTC)

Praise:
 * The article does a very good job of laying the groundwork for what Transformers are and giving details on their inner workings.
 * It doesn't repeat things too often.
 * It links to other articles for applications of transformers instead of unnecessarily writing them out all over again.

Changes suggested:
 * I would put a little more background information in the background portion. I came into the essay knowing nothing about transformers or the way that RNNs or CNNs work, and therefore couldn't grasp the information as well as I could have had I known some background information in the beginning.
 * You might want to separate the training section from the Architecture section, as they seem to be slightly different topics that could be more clearly distinguished from one another.
 * Add a little more information in the section on CNNs.

Most important improvement:
 * More background information, as I put above. This may just be a problem with my background knowledge, but since the article is meant to be written for "everyone", you may want to add more to give the reader a groundwork of the topic.

Applicable to mine:
 * I really like your layout of the article and how it builds from background information, to explaining the workings of the topic and how each individual part of a transformer functions, to the overall uses and applications of transformers.
 * It transitions smoothly from topic to topic within each subsection.

Logkailp (talk) 14:41, 22 October 2019 (UTC) Logan Paterson

"Autoregressive" link points to wrong page
Someone linked the "Autoregressive" part of "Autoregressive Convolutional Neural Network" to "Autoencoder". Yes, they both start with "Auto", but this is clearly wrong. I'd fix it, but Wiki has rules these days where you can't fix a mistake unless you log in and then specify why you made a change, sign it, and have some understanding of how the "rules for editing" work? — Preceding unsigned comment added by 65.158.32.123 (talk) 14:05, 13 January 2020 (UTC)

I've made that change now, thanks. --aricooperdavis (talk) 22:14, 20 January 2020 (UTC)

Diagrams and simple explanations
Perhaps this is a stupid question, but what do people think of adding diagrams to the article? Also what do people think of adding dummies are us explanations? Daniel.Cardenas (talk) 18:32, 18 October 2020 (UTC)


 * Yes, diagrams are a good idea. However, one must ensure that they aren't misleading, because then they do more harm than good. I don't know what "dummies are us explanations" means. Im The IP (talk) 19:00, 18 October 2020 (UTC)

AlphaFold, transformers, and attention mechanisms
Given the recent "milestone scientific breakthrough" being hailed for AlphaFold for its results in the protein structure prediction problem at CASP 14, and also the use of transformers in computer vision (see also Image GPT), I think it would be useful if we could try to present what they are trying to do in a framing that is wider and more general than their use in NLP.

(AlphaFold 2 is believed to use two transformer networks as the key core of its design).

In AlphaFold I've written that the transformers "'effect a mathematical transformation of [the elements of two feature-vs-feature matrices]. These transformations have the effect of bringing relevant data together and filtering out irrelevant data for these two relationships, in a context-dependent way (the 'attention mechanism'), that can itself be learnt from training data.'" I'd be grateful for input as to whether I've got this more or less right?

Transformers therefore seem to be maybe doing a similar job to bottleneck networks, autoencoders, latent variable extractors, and other forms of nonlinear input transformation and dimensional reduction techniques -- but there's obviously more to it than that. It might be useful to identify the similarities and differences.
 * (added): cf Transformers as Variational Autoencoders, found on github

Finally, it's clear that we could use an article on attention (machine learning), aka attention networks, aka attention mechanisms. Some of the following, found by Google, look like they may be relevant, but it would be good to get at least a stub created by someone who knows a bit about it.
 * Attention and Memory in Deep Learning
 * Lilian Weng, Attention? Attention!
 * Attention mechanism, FloydHub
 * Buomsoo Kim, Attention mechanism
 * Prodip Hore, Sayan Chatterjee A Comprehensive Guide to Attention Mechanism in Deep Learning for Everyone
 * also Giuliano Giacaglia, How Transformers Work, which puts attention etc in context.

Pinging as recent editors here, in case you can help. Jheald (talk) 15:06, 2 December 2020 (UTC)


 * I agree with everything you say. Please incorporate this into the article. And yes, we should have an article on attention (machine learning), aka attention networks, aka attention mechanisms. I'll create a stub for it now. -- The Anome (talk) 09:11, 3 December 2020 (UTC)


 * Any idea on how to find reliable sources in this area? Most of my knowledge in the area comes from github, random blog posts, and YouTube and those sources don't count. Would ArXiv do? Im The IP  (talk) 09:25, 3 December 2020 (UTC)
 * Well, we're not under WP:MEDRS, or Israel/West Bank restrictions, so sourcing can be a little more permissive. Obviously, the usual hierarchy applies, with major textbooks, and reviews and survey articles and tour-d'horizon commentary pieces from the leading journals in the field near the top of the tree, and other sources falling somewhere below that. A key criterion is always: does the source have a reputation for knowing what they're talking about? (Also: how mainstream, or introductory, is what they're saying? They maybe get more latitude reviewing the foundations of the field, vs playing up their latest project.) My understanding is that ML is a field that very much talks to itself through preprints and conference papers, so arXiv papers should certainly have their place. I also think there is a place for more informal pieces like blogs or videos, which can give more accessible treatments that can be useful to readers. Videos from authoritative sources can certainly be worth adding as External links. With luck, most of this area shouldn't be controversial, so IMO it's a question of finding the balance of references that are most useful to readers. And of course, we're a wiki: so there's always a lot to be said for going with what we've got, establishing a framework or a structure for the topic, then incrementally finding what we can add to the topic. People can always retire old references and ELs, if they have sources that are better.


 * Incidentally, the paper from Google Research on transformers in computer vision that I linked above (An image is worth 16x16 words: transformers for image recognition at scale) looks very helpful (and also the tutorial based on it). One nice thing about vision examples is that they can be so visual -- I love the pictures showing the examples of attention.
 * I've also seen a reference to this paper as being of interest, in applying the transformer model to molecular-biological domains with 3d symmetries.
 * Nice quote too, from the start of that Google paper, on Transformers vs CNNs: "Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias."
 * -- if I'm reading that right, it's saying that with enough data, transformers can learn the symmetries and adjacencies of 1D, 2D, and 3D spaces, even when they have not been hard-coded in.
 * I don't want to be editing before I feel I've got a proper grasp and perspective of the subject, so I'd really appreciate if the shape of it could be laid down by those who do. But it does look very interesting!  Jheald (talk) 16:28, 3 December 2020 (UTC)

The name "Transformer"
It would be great to have an explanation for the name "Transformer" included in the article, if one exists, or otherwise a clarification that the name is arbitrary. — Preceding unsigned comment added by AVM2019 (talk • contribs) 20:57, 5 December 2020 (UTC)

Vanilla Transformer Code: Incomplete
The "Pseudocode" section may be doing more to confuse than to help, because many of the terms are undefined (copy it into Python to see what I mean). So here is what I suggest:


 * 1) Temporarily remove it.
 * 2) Update the code to include the relevant imports from PyTorch or TensorFlow, or make custom definitions, so that all terms are well defined in the code.
 * 3) Post it again. — Preceding unsigned comment added by 103.118.46.204 (talk • contribs)

This was PSEUDO Code not CODE. Why not just leave it. If one is able to program, one will find the right layers in pytorch or tensorflow.... — Preceding unsigned comment added by Nico Hambauer (talk • contribs)


 * I would argue that it is not pseudo-code, but rather an incomplete Python implementation. Pseudocode, by its very definition, should not be as language-specific as this code snippet. Python operations such as "embedding", "multi_head_attention", etc., should not appear in pseudocode; rather, the pseudocode should be readable by programmers in any language, whether or not they are familiar with the operation of these specific Python operations. WikiDan61 ChatMe!ReadMe!! 13:53, 27 July 2021 (UTC)
 * I agree with WikiDan61. Stuff like "multi_head_attention(x, x, x, None)" is completely unreadable for those not already familiar with Python and the framework this is written in. intforce (talk) 14:06, 27 July 2021 (UTC)

OK, maybe I am wrong then: I have used all the frameworks for so long that I no longer notice the differences. But I was kind of sad to see it go, as it was kind of helpful, even if one had to look up all the implementations when not used to the libraries. Thanks for the note! I will revert my change then :) — Preceding unsigned comment added by Nico Hambauer (talk • contribs)
 * The pseudocode can be made useful if the functioning of the framework functions can be explained, rather than just assuming that the reader knows what they do. The best way to do this would be to include in the pseudocode a declaration of the function with a pseudocode description of its operation. Then the function can be invoked within the pseudocode, since the reader will now have the knowledge required to understand it. WikiDan61 ChatMe!ReadMe!! 14:55, 27 July 2021 (UTC)
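As a rough illustration of the approach suggested above (declare each operation with a plain description before invoking it), here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not the removed snippet: the names `softmax` and `attention` are chosen here for illustration, and the learned projections, multiple heads, and masking of a full multi-head attention layer are deliberately left out.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors (lists of floats). For every query:
    score it against every key, normalise the scores with softmax, and
    return the correspondingly weighted average of the value vectors.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Self-attention: queries, keys and values all come from the same sequence x.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
```

With the helper declared first, a later line such as `attention(x, x, x)` can be read without knowing any particular framework.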

There is a readable XLNet publication ...
... at arxiv: https://arxiv.org/abs/1906.08237

Is there a compelling reason to cite it at OCLC, rather than in the place where people will be able to read it? 222.154.128.36 (talk) 09:14, 2 April 2023 (UTC)

Suggestion to increase the "Importance" to "Mid" or "High"
With recent progress within AI, transformers are entering more conversations with non-experts. Also, this topic is relevant to a growing number of fields outside of linguistics. Cscangarella (talk) 04:34, 10 April 2023 (UTC) Cscangarella


 * This is already oversimplified. It should never devolve even further into an article for people who can't even understand the current form. That would make it useless to the only people whom knowledge of Transformers could possibly serve. It needs to become more technical, not less. Someone lacking WP:COMPETENCE might be similarly offended by the articles on specific topics in Pure Mathematics. "Linguistics"... 76.188.120.7 (talk) 18:27, 12 April 2023 (UTC)

NPOV history
Please don't be like Schmidhuber.

Especially nefarious is retroactively naming "linear Transformer" to the 1993 model without explaining it is a retroactive naming, or just quoting old passages where "attention" is used metaphorically as if it is a direct originator of attention mechanism.

I think the fast weight controller is not a hushed-up origin of modern Transformers, but rather an attempt to apply high-order neural networks, or pi-sigma networks (1991), to the problem of processing sequential data. It failed to gain traction and plain LSTM dominated until 2014 when seq2seq introduced attention mechanism to LSTM, and 2017 purified attention mechanism into the Transformer. pony in a strange land (talk) 01:06, 24 April 2023 (UTC)

Relies too much on primary refs?
I think the notice on relying too much on primary references is not correct. The article has nearly 90 references. The primary reference here would be the 2017 paper ("Attention Is All You Need") and possibly some work leading up to that paper. However, most of the cited papers came after that, by different authors. Those are academic references, but not primary to the transformer architecture. Bquast (talk) 15:12, 13 May 2023 (UTC)


 * I suggest removing the notice. Maybe an inline notice about needing more non-academic sources could be added lower down. Bquast (talk) 15:13, 13 May 2023 (UTC)

Did Jürgen Schmidhuber invent Transformers?
Conflicting edits have added/removed statements such as: "In 1992, the first kind of Transformer was published by Jürgen Schmidhuber under the name 'fast weight controller.'"

Schmidhuber has been involved in multiple controversies over what he terms credit assignment. He holds a minority but not fringe view, regarding the proper attribution of ideas in the field of AI.

The paper "Attention is All You Need" by Vaswani et al describes the Transformer as follows: "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention."

The paper Learning to Control Fast-Weight Memories by Jürgen Schmidhuber describes the Fast Weight Controller as: "This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: the first net learns to produce context dependent weight changes for the second net whose weights may vary very quickly."

There is not an immediate resemblance between the two methods: Transformers are a sequence-to-sequence model using self-attention, and Fast-Weight Controllers sound more like a predecessor to Hypernetworks ("an approach of using one network...to generate the weights for another network") or Memory Networks.

But, in the years after the Transformer gained popularity, several modified and altered systems based on the Transformer were proposed. One such system was the Linear Transformer by Katharopoulos et al. which "[expresses] the self-attention as a linear dot-product of kernel feature maps... We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks."

The Linear Transformer is not the same as the Transformer, but in the paper Linear Transformers Are Secretly Fast Weight Programmers Schmidhuber proves that it is mathematically equivalent to the Fast-Weight Controller, apart from its normalization scheme.

To cover Jürgen Schmidhuber's contributions without violating either WP:NPOV or WP:UNDUE, I propose that the article should make clear the following:
 * Schmidhuber invented the Fast Weight Controller
 * The FWC was mathematically almost identical to Katharopoulos' Linear Transformer, but not to Vaswani's Transformer
 * The FWC did not have the language-processing capabilities of a modern Transformer
 * The FWC is a notable historical contribution to the line of research that produced the Transformer (along with other forms of recurrent neural networks in the 80s, 90s, and 2000s).

Lwneal (talk) 18:06, 12 August 2023 (UTC)
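For readers trying to see why the Linear Transformer and the fast-weight formulation line up, here is a small numerical sketch under simplifying assumptions that do not come from the exact formulations in the cited papers: the kernel feature map is taken to be the identity, and the normalisation term (which the thread notes is where the two differ) is dropped entirely. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention(queries, keys, values):
    """Attention view: output t is the sum over i <= t of (q_t . k_i) * v_i."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            weight = dot(q, k)  # unnormalised similarity, no softmax
            out = [o + weight * vj for o, vj in zip(out, v)]
        outputs.append(out)
    return outputs

def fast_weight_form(queries, keys, values):
    """Fast-weight view: add the outer product of v_t and k_t to a weight
    matrix W at each step, then apply the updated W to q_t."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for r in range(d_v):
            for c in range(d_k):
                W[r][c] += v[r] * k[c]
        outputs.append([dot(row, q) for row in W])
    return outputs

q = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
k = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
same = causal_linear_attention(q, k, v) == fast_weight_form(q, k, v)
```

Both functions compute the same sums in a different order: the first recomputes a weighted sum over past positions for every query, while the second carries the running matrix W forward, which is the recurrent reading the Katharopoulos et al. quote refers to.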

"Decoder only" is ill defined
Description of decoder block lists the original three sub-layers (a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network) but later in the Terminology section "decoder only" is defined as autoregressive encoder, autoregressive decoder.

The words "decoder only" imply the lack of an encoder, yet nothing in the article addresses how this "autoregressive encoding" happens without an encoder, or what the decoder block looks like in that case, since "an attention mechanism over the encodings" is confusing when the source of the encodings is not given. 101.178.0.181 (talk) 01:15, 4 September 2023 (UTC)
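On the question above: in decoder-only models of the GPT family, the attention sub-layer over encoder outputs is simply dropped, leaving masked self-attention followed by the feed-forward network. What makes the block autoregressive is the causal mask. A minimal sketch of that mask in plain Python follows; the helper names are hypothetical and this is not taken from the article.

```python
import math

def causal_mask(n):
    """mask[t][i] is True when position t may attend to position i (i <= t)."""
    return [[i <= t for i in range(n)] for t in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; the others get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0
            for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Position 1 sees positions 0 and 1, but not the future position 2.
weights = masked_softmax([0.5, 0.5, 0.5], mask[1])
```

Because each position only ever sees earlier tokens, the same stack both reads the prompt and generates the continuation, so no separate encoder is needed.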

references 33 and 35 seem unhelpful

 * Why is there a need to cite a paper from the arXiv, published a year after the paper which made a scientific leap?
 * What is the point of the Ithaca example? Ladypine (talk) 09:02, 31 October 2023 (UTC)

Loss function
Seemingly not covered in the article: when creating a transformer, what is the loss function to be minimized? I see in the article that once trained, a transformer can be used with a post-processing layer (or layers) to be trained, which enable a specific task such as classification. I understand a loss function for the transformer-plus-classification task, but what is the loss function used on the raw transformer before a specific task is chosen to be appended?

Or putting it another way, I can't be the only person who is looking for mention of a loss function. I would very much appreciate a sentence along the lines of one of these:
 * 1) The loss function is, in effect, ....
 * 2) In lieu of a loss function, ....

Thanks — Quantling (talk | contribs) 20:44, 1 November 2023 (UTC)


 * You have to attach a task head later, and the task head uses some loss function suitable for your task. Biggerj1 (talk) 21:58, 6 March 2024 (UTC)
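For what it's worth, the pretraining stage also has a loss of its own. GPT-style models are typically pretrained with cross-entropy on next-token prediction, and BERT-style models with cross-entropy on masked-token prediction. A minimal sketch of the next-token case in plain Python is below; the function name and the toy inputs are made up for illustration, and the logits are assumed to come from some model that is not shown.

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at
    position t. Assumes at least two tokens, one logit vector per position."""
    total = 0.0
    count = 0
    for logits, target in zip(logits_per_position[:-1], token_ids[1:]):
        peak = max(logits)
        # log of the softmax normaliser, computed stably
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[target]  # negative log-probability of target
        count += 1
    return total / count

# A model that guesses uniformly over a 4-token vocabulary scores log(4).
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_loss(uniform_logits, [0, 1, 2])
```

Minimising this quantity over a large corpus is what "training the raw transformer" usually means for language models; a task head with its own loss, as described in the reply above, would come afterwards.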

Wiki Education assignment: Research Process and Methodology - FA23 - Sect 202 - Thu
— Assignment last updated by HELLOEXTRACREDIT (talk) 20:51, 11 November 2023 (UTC)

Wiki Education assignment: Linguistics in the Digital Age
— Assignment last updated by Fedfed2 (talk) 00:54, 9 December 2023 (UTC)

Transformers transform what?
I came to this article to learn what a "Transformer" is or does. After reading it twice, I still haven't determined much of anything about why it would be called a "transformer" or where it fits in an A.I. system. According to Wikipedia tradition, and probably the MOS, the answer should have been in the first few sentences. Instead, I have dug through a word salad of gobbledygook and have only faint impressions of the underlying technology involved, but no clear, top-level understanding of what it does. —EncMstr (talk) 22:08, 29 March 2024 (UTC)


 * The name isn't of much importance to be honest. Researchers like naming things any which way. 80.2.247.44 (talk) 20:19, 6 July 2024 (UTC)