Talk:Long short-term memory

Clarifying dimensions of vectors
First, I am fairly confident that the outputs of the sigmoid functions in the LSTM gates are actually vectors, since this is how TensorFlow implements the sigmoid function (as a componentwise sigmoid), but this is not clear from the content currently on this page. A note should be added that sigma is a vectorized or componentwise function that produces a vector result rather than a scalar. Currently, sigma is linked to the Wikipedia page for the sigmoid function, which is described as producing only a scalar. This is misleading in the context of the LSTM.

Secondly, the statement "contains $$h$$ LSTM cell's units" is misleading because it suggests that the dimensionality of the LSTM's memory units is 1 and that $$h$$ is instead the number of LSTM units. I believe that in fact $$h$$ is the size of the memory within each LSTM unit, not the number of LSTM cells. — Preceding unsigned comment added by 2603:6080:6600:A65E:C71:5F89:961:240D (talk) 19:54, 30 November 2022 (UTC)
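To make the dimensionality point concrete, here is a minimal sketch in plain Python (no libraries) of a single gate computed with a componentwise sigmoid. The sizes `d` and `h`, the weight values, and the helper names `matvec` and `gate` are all made up for illustration; they are not taken from the article or from TensorFlow.

```python
import math

def sigmoid(z):
    """Scalar logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gate(W, U, b, x_t, h_prev):
    """sigma(W x_t + U h_{t-1} + b), with sigma applied to each component."""
    pre = [wx + uh + bi
           for wx, uh, bi in zip(matvec(W, x_t), matvec(U, h_prev), b)]
    return [sigmoid(z) for z in pre]

d, h = 3, 2  # hypothetical sizes: input dimension d, hidden dimension h
W_f = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]  # h x d
U_f = [[0.2, 0.0], [-0.3, 0.4]]             # h x h
b_f = [0.0, 0.0]                            # length h
x_t = [1.0, 2.0, 3.0]                       # length d
h_prev = [0.0, 0.0]                         # length h

f_t = gate(W_f, U_f, b_f, x_t, h_prev)
print(len(f_t))  # the gate output has h components, not 1
```

Under these assumptions the gate output is a vector with one entry per hidden dimension, each strictly between 0 and 1, which is the reading of $$h$$ argued for above.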

Convolutional LSTM
In this section, new variables $$ V_{f}, V_{i}, V_{o} $$ are used. These variables should be introduced and described before they are used.

194.39.218.10 (talk) 09:19, 29 November 2016 (UTC)

Checking of equations needed
In the equations for the peephole LSTM, the last nonlinearity for $$h_t$$ is applied before multiplying by the gate. In the 2001 paper by Gers and Schmidhuber (LSTM Recurrent Networks Learn Simple Context-Free and Context-Sensitive Languages) it is applied after. I think someone who knows how it is implemented in practice should double-check this.

Just changed it to reflect the paper. Also, there is no U_{c} term in the paper. Atilaromero (talk) 16:12, 7 June 2019 (UTC)
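For readers following this thread, the two orderings being compared can be written out as below. This is only a restatement of the comments above, assuming $$o_t$$ denotes the output gate, $$c_t$$ the cell state, and $$\sigma_h$$ the final nonlinearity; which form matches the paper is as reported by the commenters and has not been re-checked here.

```latex
% Nonlinearity applied before multiplying by the output gate:
h_t = o_t \odot \sigma_h(c_t)

% Nonlinearity applied after multiplying by the output gate:
h_t = \sigma_h(o_t \odot c_t)
```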

On a related subject, the paper says that "To keep the shield intact, during learning no error signals are propagated back from gates via peephole connections to CEC". Does this mean that automatic backpropagation frameworks such as TensorFlow would derive different equations from those in the paper? If so, then the peephole LSTM would require a customized gradient function, and this would deserve a warning on the page. A first look at the backpropagation algorithm in the paper suggests that this is indeed the case, since the equations make no reference to the previous cell's state error; all the deltas come only from the present output error. Atilaromero (talk) 16:12, 7 June 2019 (UTC)

Introduction for non-experts
The article is not very helpful for the average Wikipedia reader with limited expertise in recurrent ANNs. It should have an introductory section giving examples of typical time-series data and explaining roughly why standard recurrent networks run into difficulty. In particular, in what sort of situation do input vectors at a given time have strong relationships to vectors at much earlier times? And why do conventional recurrent ANNs fail in these circumstances? (It is not sufficient to merely state that error signals vanish.) I would improve accessibility for the interested general reader and sacrifice descriptions of more recent developments. Paulhummerman (talk) 14:26, 5 December 2016 (UTC)
 * Agree. Read this, understood nothing. Searched Google for a tutorial and understood. Daniel.Cardenas (talk) 18:05, 30 July 2017 (UTC)

External links modified
Hello fellow Wikipedians,

I have just modified one external link on Long short-term memory. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
 * Added archive https://web.archive.org/web/20120522234026/http://etd.uwc.ac.za/usrfiles/modules/etd/docs/etd_init_3937_1174040706.pdf to http://etd.uwc.ac.za/usrfiles/modules/etd/docs/etd_init_3937_1174040706.pdf

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

Cheers.— InternetArchiveBot  (Report bug) 21:35, 5 January 2018 (UTC)

Proposed change
Hello friends,

After reading this page, I elect that the following lines be removed:

"A problem with using gradient descent for standard RNNs is that error gradients vanish exponentially quickly with the size of the time lag between important events."

"This is due to $$\lim_{n\to\infty} W^{n} = 0$$ if the spectral radius of $$W$$ is smaller than 1."

I make such a proposition for the following reasons:

- Regarding the first line: that certainly is a problem if your gradient descents aren't proportional to the time lag between important events. In other words, sure, that's a problem, but it's not difficult to fix, and for that reason merits no mention.

- Regarding the second line: this is overly-complicated to an extent no less than silly, and is absolutely superfluous in this context. This one has to go if the first one goes anyway, but seriously friends, I'm calling you out on this one. To quote Albert Einstein's reply to Franz Kafka's draft of The Castle, "Life is not this hard."

TheLoneDeranger (talk) 05:38, 26 August 2018 (UTC)


 * The vanishing gradient problem should be mentioned because it is the whole motivation behind LSTMs. But probably the note about the spectral radius could be left to the vanishing gradient problem page to explain. Themumblingprophet (talk) 01:31, 1 May 2020 (UTC)
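For anyone weighing whether the spectral-radius remark is worth keeping, the claim itself is easy to check numerically. The sketch below uses a made-up 2x2 matrix `W` (not from the article) whose spectral radius is below 1 and tracks the largest entry of $$W^n$$. It illustrates only the quoted limit, not a full backpropagation-through-time computation.

```python
import math

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_abs_entry(M):
    """Largest absolute value among the entries of M."""
    return max(abs(v) for row in M for v in row)

# Hypothetical 2x2 recurrent weight matrix.
W = [[0.5, 0.4],
     [0.1, 0.3]]

# Spectral radius of a 2x2 matrix from its trace and determinant
# (this W has real eigenvalues, so the discriminant is non-negative).
trace = W[0][0] + W[1][1]
det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
disc = math.sqrt(trace ** 2 - 4 * det)
spectral_radius = max(abs((trace + disc) / 2), abs((trace - disc) / 2))

# Largest entry of W^n for n = 1 .. 50.
sizes = []
P = [[1.0, 0.0], [0.0, 1.0]]
for n in range(1, 51):
    P = matmul(P, W)
    sizes.append(max_abs_entry(P))

print(f"spectral radius = {spectral_radius:.3f}")
print(f"max |W^1| = {sizes[0]:.3e}, max |W^10| = {sizes[9]:.3e}, "
      f"max |W^50| = {sizes[49]:.3e}")
```

With this particular matrix the entries of $$W^n$$ shrink roughly geometrically in $$n$$, which is the behaviour the quoted sentence refers to.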

Section "Future" is difficult to read
The section "Future" should partially be rewritten imo., as it contains lots of repetitive words such as "system", "most" and "more":

"more and more complex and sophisticated, and most of the most advanced neural network frameworks" ... "mixing and matching" ... "Most will be the most advanced system LSTMs into the system, in order to make the system"... — Preceding unsigned comment added by MakeTheWorldALittleBetter (talk • contribs) 16:49, 26 January 2019 (UTC)

Finite precision
Since my edit (removal of the irrelevant and thus confusing "which use finite-precision numbers") was reverted by Nbro without comment, would Nbro care to explain what finite precision has to do with the motivation for LSTMs and why it has to be mentioned? As my edit summary already explained, gradients vanish/explode in deep networks irrespective of the amount of precision in your calculations if you don't counteract it. Finite precision makes it slightly worse, but increasing precision (say from single precision floats to double precision floats) does not solve the problem - otherwise that is exactly what would have been done to solve the problem (instead of inventing LSTMs and countless other more complicated ideas). It might be instructive to read the original paper by Hochreiter and Schmidhuber. It mentions vanishing/exploding gradients as motivation, but does not mention higher numerical precision as a possible solution. The only mention of higher precision explains that it would prevent a gradient-independent optimization strategy, namely randomly guessing the parameters, from working very well. Catskineater (talk) 23:08, 11 February 2019 (UTC)


 * If you used real numbers (infinite precision numbers) instead of floating-points (either single or double precision, or any arbitrary finite precision), every calculation performed during back-propagation to train a RNN would be mathematically correct. It would not matter if numbers became very big or small: mathematically, that would be correct.


 * "Vanishing and/or exploding gradient problems" is just a fancy name to denote the general and abstract problem which is basically due to the use of finite-precision numbers, even though nobody really mentions it (because, I suppose, there aren't many well-educated computer scientists in the ML community). To be more concrete, the exploding gradient problem is when the gradients explode (i.e. become very large). This explosion of the gradients is often due to inexact computations (e.g., using finite-precision numbers). Even if this is not the case, very big (or small) numbers will likely cause problems anyway: the more inexact computations you perform, the more likely you will be performing mathematically wrong computations (i.e., in simple terms, you will not be performing the computations that you think you are performing), or, at some point, you will not be able to represent such big (small) numbers.


 * Is it clearer now? If not, have a look at how back-propagation works, and read this article: https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html.


 * Whoops, I completely forgot about this discussion for a few years. The fact that with infinite precision the calculations would be mathematically correct is also irrelevant, as the vanishing gradient problem is about convergence problems that are present in the original, infinite-precision formulation of the problem (the problematic training result IS the mathematically correct result). It would be correct to assume that the vanishing gradient problem can be solved by increased precision if the gradient vanished/exploded in every network layer by the same magnitude (one can then solve it by increased precision and changing the learning rate to an appropriate value, which happens automatically for adaptive optimizers like Adam), BUT THAT IS NOT THE VANISHING GRADIENT PROBLEM. The vanishing gradient problem is only a problem because the gradient vanishes with a layer-dependent magnitude that grows exponentially with layer number. If the practitioner ignores that, then there is no useful choice of learning rate. Every choice of learning rate will lead to either divergence in layers with a large gradient magnitude or extremely slow convergence in layers with a small gradient magnitude. Since infinite precision does not resolve that, finite precision is not the cause, and increasing precision contributes only insignificantly to solving the problem (lack of precision makes the vanishing gradient problem very slightly worse). Catskineater (talk) 05:53, 16 March 2024 (UTC)
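The argument above can be illustrated with exact rational arithmetic, where no rounding occurs at all. The sketch below is a toy model, not a real network: it assumes the gradient shrinks by the same hypothetical factor of 1/4 at every layer (1/4 happens to be the maximum slope of the logistic sigmoid) and uses Python's `fractions.Fraction` for exact values.

```python
from fractions import Fraction

factor = Fraction(1, 4)  # hypothetical per-layer shrink factor
depth = 30               # hypothetical number of layers

# Exact gradient magnitude reaching each layer, counted back from the output.
grads = [factor ** k for k in range(depth)]

# Ratio between the largest and smallest per-layer gradient, computed exactly.
ratio = grads[0] / grads[-1]
print(ratio == 4 ** (depth - 1))
print(float(grads[-1]))
```

Every number here is exact, yet the gradients at the two ends of the stack still differ by a factor of 4^29, so under these assumptions no single learning rate suits both ends. That is the layer-dependent scaling described in the comment above, and it is present without any finite-precision effect.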

Introducing reams of mathematical equations without defining the terms employed should be considered a capital offense
This article suffers from a common problem in many Wikipedia articles in the computer-science domain: the author/s have no idea of the readership the article (or, for that matter, Wikipedia itself) is targeting, and mechanically reproduce equations from other sources without bothering with the effort of introducing the variables employed. Here, anyone able to understand (more accurately, recognize) the equations would already have a deep understanding of the topic, and wouldn't be wasting time trying to educate oneself with Wikipedia. I'm reminded of Niklaus Wirth's Pascal User Manual and Report (1974), described as "The only introductory book on Pascal intended for expert Pascal programmers".

Prakash Nadkarni (talk) 18:05, 17 February 2020 (UTC)

Reworked History Section
I made substantial changes to the history section. Let me know if you have any concerns about it. The big things were
 * group by year
 * add some more/better sources

I tried not to remove anything that was already there. It did mention Microsoft using LSTMs before and I have taken that out because the source (a Wired article) did not mention them. It would be nice if someone would track some of their big usages down.

If I were writing it from scratch I probably would not have included so much information on the usages of the LSTM and would have focused more on its academic history. E.g. when were bidirectional LSTMs invented? But there is some of that. Would like to see more if someone has the time.

Checking the dates for when each company actually started using LSTMs for their various products took some time. Apple was the hardest and I eventually just opted for noting when they said they would use them. Both sourced articles saying Apple started using them in 2016 were tech news blogs. And both of them (three including the Wired article) said that Apple would soon deploy LSTMs in Quicktype. It had not yet. They were all reporting on a single talk by Apple's Vice President of software engineering, Craig Federighi, at a 2016 Apple developers conference. I couldn't find the talk but apparently he said that Apple would use LSTMs to improve Siri and Quicktype. The blog post announcing Apple's use of LSTMs in Quicktype came out in September of 2018. And the blog post about LSTMs in Siri came out in August of 2017. The paper about the model they used was published earlier in 2017. Also, Apple uses the LSTM for language identification (2019).

Finally, there was a claim about an LSTM getting "record results" in natural language text compression. I had a look at that. The model cited in the reference was marked "CM" in the benchmark which stands for "(Context Mixing): bits, modeled by combining predictions of independent models." Looking at the author's website I could not find a mention of LSTM. But there was an LSTM from late 2019 which is now third on the benchmark so I changed the wording and replaced the citation. The big question in my mind is whether this is really a significant event in the history of LSTMs. — Preceding unsigned comment added by Themumblingprophet Themumblingprophet (talk) 01:38, 1 May 2020 (UTC)

Inconsistency in activation functions
Under the "LSTM with a forget gate" category, $$ \sigma_{c} $$ is defined as an activation function but it is never used. From the peephole LSTM equations, it seems $$ \sigma_{h} $$ should be replaced with $$ \sigma_{c} $$ in the equation for $$ \tilde{c_{t}} $$. However, I am not sure that this is correct. Can someone more knowledgeable on the topic confirm this? --ZaneDurante (talk) 20:19, 2 June 2020 (UTC)


 * I do not understand the peephole LSTM well enough to answer your question (yet). But looking at the original LSTM paper and the following two which introduced the forget gate and the peephole connections, then looking through some blog posts around the internet and the section on LSTMs in the Deep Learning book, I'm not sure where this notation is coming from... $$\sigma_g$$ makes sense as the activation function of the gates. I can maybe make sense of $$\sigma_c$$ as the activation function of the cell. And possibly $$\sigma_h$$ as the hyperbolic tangent activation function. Or maybe the activation function of the hidden layer. But something must be wrong because, as you point out, $$\sigma_c$$ is not used. Do you think this notation is being copied from somewhere? Themumblingprophet (talk) 22:40, 3 June 2020 (UTC)


 * I found the original edit which introduced the equations for the 'traditional' and peephole LSTM. There $$\sigma_h$$ is for the hidden layer's activation function, $$\sigma_c$$ is for the cell's activation function, and $$\sigma_g$$ is for the gates' activation function. There was a note about what the original functions were for each of them (e.g. $$\tanh$$). But nothing indicating that these symbols stood for any particular activation function themselves. And, yes, $$\sigma_c$$ was being used. Themumblingprophet (talk) 00:11, 4 June 2020 (UTC)


 * I would suggest something like this for the activation function section instead:


 * $$\sigma_g$$: activation function for the gated units, often (originally?) the logistic sigmoid function.
 * $$\sigma_c$$: activation function for the cell units, often (originally?) the hyperbolic tangent function.
 * $$\sigma_h$$: activation function for the hidden units, often the hyperbolic tangent function or, as the peephole LSTM paper suggests, $$\sigma_h(x) = x$$.


 * And replacing $$\sigma_h$$ with $$\sigma_c$$ where you suggested. Themumblingprophet (talk) 00:25, 4 June 2020 (UTC)
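Written out in full, the suggestion above would give something like the following for the LSTM with a forget gate. This is a sketch of the proposal, not a quotation from any of the papers: the weight and bias names $$W$$, $$U$$, $$b$$ and the use of $$\odot$$ for the elementwise product are assumed from the usual presentation of these equations.

```latex
\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \sigma_c(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h(c_t)
\end{aligned}
```

With this assignment each of $$\sigma_g$$, $$\sigma_c$$ and $$\sigma_h$$ appears at least once, which addresses the original complaint that $$\sigma_c$$ was defined but unused.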

Question
Is LSTM a term which can be described as "long-term and short-term memory", or as a "short-term memory" which lasts for a long time? Parsehos (talk) 08:42, 21 May 2023 (UTC)


 * I think the use of "memory" is more of an analogy to the psychological concept. It's better to say that it models our intuitive knowledge of linguistic context by learning, propagating, and then "forgetting" information through a sentence. For example, in the sentences "Marcia likes berries. She eats lots of them.", "Marcia" is typically a woman's name, so the model "remembers" the grammatical gender (in the context vector) until it encounters "she", then it forgets it. The same goes for "berries" and plurality. At least that's a theoretical motivation. See https://aclanthology.org/N19-1002/ (esp. page 5) for more. too_muchcuriosity (talk) 14:38, 26 October 2023 (UTC)
 * To answer the original question, it is a blob of "short-term" (localized) data that persists for a "long time". An LSTM has two channels to pass data: the "regular one", $$h_t$$, from the base concept of an RNN, which passes data from cell to cell, and a second channel, $$c_t$$, a kind of pass-through channel. The second channel passes these "short" memories over "long" distances. Why "short"? Because the creation of the "short" memory was localized in time & place: it is created/built in only one cell. After that, it gets passed down this long pipe, this pass-through channel. Most cells let the data on the pass-through pipe ... (you guessed it!) pass-thru, without changes. Anywhere down the pipe, that blob of data might get used, to generate output. Sometimes (rarely? .. depends...) a portion of it might be cut-out, discarded & forgotten, replaced by a new chunklet of data. This forget&replace happens in just one cell, so it's again "short", and this new "memory" gets passed onwards on the "long" pass-thru pipe. Hope that's clear. The article sort-of says this, but maybe not very clearly?


 * The regular channel is $$h_t$$ and has the formal name of "the hidden state vector" aka "the output vector." The pass-thru $$c_t$$ is called a "cell state vector". My description applies only to the regular LSTM, not the peephole LSTM; the peephole thing drops the "regular" path and uses only the pass-thru pipe. 67.198.37.16 (talk) 03:31, 1 June 2024 (UTC)
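The pass-through picture described above can be sketched in a few lines of plain Python. The gate values are supplied by hand rather than computed from weights, and they are set to exactly 0 or 1, which is an idealization (real sigmoid gates only approach those values). The function name `lstm_state_update` and all numbers are hypothetical.

```python
import math

def lstm_state_update(c_prev, f_t, i_t, o_t, c_tilde):
    """One update of the two channels, with gate values given directly.

    c_t = f_t * c_prev + i_t * c_tilde   (elementwise)
    h_t = o_t * tanh(c_t)                (elementwise)
    """
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

c = [0.7, -0.2]  # hypothetical stored cell state
ones, zeros = [1.0, 1.0], [0.0, 0.0]

# Gates set to "keep everything, write nothing": c passes through unchanged.
for _ in range(5):
    h, c = lstm_state_update(c, f_t=ones, i_t=zeros, o_t=ones,
                             c_tilde=[0.9, 0.9])
passed_through = list(c)

# One step that forgets the first component and writes a new value there.
h, c = lstm_state_update(c, f_t=[0.0, 1.0], i_t=[1.0, 0.0], o_t=ones,
                         c_tilde=[0.9, 0.9])
print(passed_through, c)
```

In this toy run the cell state survives five steps untouched, and then a single step overwrites one component while leaving the other alone, which is the "forget and replace in one cell" behaviour described in the reply.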

What is the t / time subscript in the equations?
What is the t subscript in the equations supposed to mean? It's ambiguous and should be clarified. A diagram showing how many units connect together into a network is required.
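One common reading, offered here as a sketch rather than as the article's definition, is that $$t$$ indexes the position in the input sequence (the time step), and that the same unit with the same weights is applied once per step, taking $$h_{t-1}$$ and $$c_{t-1}$$ from the previous step. The scalar weights in `params` and the toy sequence `xs` are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar weights (w: input, u: recurrent, b: bias) for each gate.
params = {
    "f": (0.5, 0.1, 1.0),
    "i": (0.6, -0.2, 0.0),
    "o": (0.4, 0.3, 0.0),
    "c": (1.0, 0.5, 0.0),
}

def step(x_t, h_prev, c_prev):
    """Compute (h_t, c_t) from x_t and the previous step's (h_{t-1}, c_{t-1})."""
    def pre(name):
        w, u, b = params[name]
        return w * x_t + u * h_prev + b
    f_t, i_t, o_t = sigmoid(pre("f")), sigmoid(pre("i")), sigmoid(pre("o"))
    c_t = f_t * c_prev + i_t * math.tanh(pre("c"))
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

xs = [0.5, -1.0, 0.25, 2.0]  # x_1 .. x_4, a toy input sequence
h, c = 0.0, 0.0              # h_0 and c_0
hs = []
for t, x_t in enumerate(xs, start=1):
    h, c = step(x_t, h, c)   # the same `params` are reused at every t
    hs.append(h)
    print(f"t={t}: h_t={h:.4f}, c_t={c:.4f}")
```

Under this reading there is one output $$h_t$$ per input $$x_t$$, and "unrolling" the network over a sequence is simply this loop.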