User:JPxG/Bi-LSTM

Bidirectional long short-term memory (bi-LSTM) is an artificial neural network architecture used for deep learning.

Background
Artificial intelligence has been an object of study since the origin of computing, but during the second half of the 20th century, processing power became more easily accessible and computer-based research became more commonplace. The term "machine learning", used as early as 1959 by IBM researcher Arthur Samuel, currently encompasses a broad variety of statistical learning, data science and neural network approaches to computational problems (often falling under the aegis of artificial intelligence). The first neural network, the perceptron, was introduced in 1957 by Frank Rosenblatt. This machine attempted to recognize pictures and classify them into categories. It consisted of a network of "input neurons" and "output neurons"; each input neuron was connected to every single output neuron, with "weights" (set with potentiometers) determining the strength of each connection's influence on output. The architecture of Rosenblatt's perceptron is what would now be referred to as a fully connected single-layer feed-forward neural network (FFNN). Since then, many different innovations have occurred, the most significant being the development of deep learning models in which one or more "layers" of neurons exist between the input and output.
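The fully connected single-layer arrangement described above can be sketched in a few lines of Python. This is only an illustration, not a reconstruction of Rosenblatt's hardware: the function name perceptron_forward, the bias terms, the threshold at zero and the example weights are all assumptions chosen for the example.

```python
def perceptron_forward(inputs, weights, biases):
    """Single fully connected layer: every input feeds every output neuron.

    weights[j][i] is the (assumed) strength of the connection from input i
    to output neuron j; biases[j] is an assumed per-neuron offset.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Weighted sum of all inputs for this output neuron.
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        # Step activation: the neuron fires only if the sum is positive.
        outputs.append(1 if total > 0 else 0)
    return outputs


# Hypothetical network with 3 input neurons and 2 output neurons.
example_weights = [[1.0, -1.0, 0.5], [-0.5, 0.25, 0.25]]
example_biases = [0.0, 0.1]
example_output = perceptron_forward([1, 0, 1], example_weights, example_biases)
```

A deep model would stack several such layers, feeding the outputs of one layer in as the inputs of the next.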

Neural networks are typically initialized with random weights, and "trained" to give consistently correct output for a known dataset (the "training set") using backpropagation to perform gradient descent, in which the gradient of the output error with respect to every weight in the network is computed for a given input/output example and each weight is adjusted in the direction that reduces the error. In traditional feed-forward neural networks (like Rosenblatt's perceptron), each layer processes output from the previous layer only. Information does not flow backwards, so the network's structure contains no "cycles". In contrast, a recurrent neural network (RNN) has at least one "cycle" of activation flow, where neurons can be activated by neurons in subsequent layers.
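As a rough illustration of training by gradient descent, the sketch below fits a single linear neuron to a tiny training set. Everything in it is assumed for the example: the function name train_linear_neuron, the squared-error loss, the learning rate and the data (which follow y = 2x + 1). With only one neuron, the gradients can be written out directly; in a multi-layer network, backpropagation computes the same kind of gradient for every weight.

```python
import random


def train_linear_neuron(examples, learning_rate=0.1, epochs=200, seed=0):
    """Fit prediction = weight * x + bias by per-example gradient descent."""
    rng = random.Random(seed)
    # Random initialization, as is typical for neural network weights.
    weight = rng.uniform(-1.0, 1.0)
    bias = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        for x, target in examples:
            prediction = weight * x + bias
            error = prediction - target
            # Gradients of the loss 0.5 * error**2 with respect to each parameter.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
    return weight, bias


# Hypothetical training set generated from y = 2x + 1.
training_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_linear_neuron(training_set)
```

After training, weight and bias should be close to 2 and 1 respectively, regardless of the random starting values.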

RNNs, unlike FFNNs, are suited to processing sequential data, since they are capable of producing different output for the same input based on previous activation states. That is to say, a text-prediction model using recurrence could process the string "The dog ran out of the house, down the street, loudly" and produce "barking", while producing "meowing" for the same input sequence featuring "cat" in the place of "dog". Achieving the same output from a purely feed-forward neural network, on the other hand, would require separate activation pathways to be trained for both sentences in their entirety.
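The dependence on previous activation states can be shown with a toy recurrent unit. This is a minimal sketch under stated assumptions: a single scalar hidden state, a tanh activation, and hand-picked weights (w_in, w_rec) that are not taken from any trained model. The two sequences stand in for the "dog" and "cat" sentences; they differ in their first item and share the same final item.

```python
import math


def rnn_step(x, h_prev, w_in=1.0, w_rec=2.0):
    """One recurrent step: the new state mixes the input with the old state."""
    return math.tanh(w_in * x + w_rec * h_prev)


def run_rnn(sequence):
    """Feed a sequence through the unit and return the final hidden state."""
    h = 0.0
    for x in sequence:
        h = rnn_step(x, h)
    return h


# Hypothetical encodings: the first item differs, the last item is identical.
after_dog = run_rnn([1.0, 0.5])
after_cat = run_rnn([-1.0, 0.5])
```

Although the last input is 0.5 in both cases, after_dog is positive and after_cat is negative, because the state carried over from the first item differs. A function of the last input alone could not tell the two sequences apart.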

However, RNNs and FFNNs are both vulnerable to the "vanishing gradient problem": since gradients (stored as numbers of finite precision) must be backpropagated over every layer of a model to train it, a model with a large number of layers tends to see gradients "vanish" to zero or "explode" to infinity before getting all the way across. To resolve this problem, long short-term memory (LSTM) models were introduced by Sepp Hochreiter and Jürgen Schmidhuber between 1995 and 1997, featuring a novel architecture of multiple distinct "cells" with "input" and "output" gates; a "forget" gate was added in subsequent work. LSTMs would find use in a variety of tasks that conventional RNNs performed poorly at, like learning fine distinctions between rhythmic pattern sequences.
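Both halves of this paragraph can be illustrated numerically. The sketch below first treats backpropagation as repeated multiplication by an assumed constant per-layer factor, which is a simplification, and then implements one step of a scalar LSTM cell in the now-common formulation with input, forget and output gates. The function names, the parameter layout and the hold_params values are all assumptions made for this example rather than anything taken from the original papers.

```python
import math


def backpropagated_gradient(per_layer_factor, num_layers):
    """Gradient magnitude after being scaled by the same factor at every layer."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= per_layer_factor
    return gradient


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))      # how much new content to write
    f = sigmoid(pre_activation("forget"))     # how much old cell state to keep
    o = sigmoid(pre_activation("output"))     # how much of the cell to expose
    g = math.tanh(pre_activation("candidate"))
    c = f * c_prev + i * g                    # additive cell-state update
    h = o * math.tanh(c)
    return h, c


# Assumed parameters: forget gate held open, input gate held shut.
hold_params = {
    "input": (0.0, 0.0, -20.0),
    "forget": (0.0, 0.0, 20.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}


def carry_cell_state(initial_cell, inputs, params):
    """Run the cell over a sequence and return the final cell state."""
    h, c = 0.0, initial_cell
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return c
```

With a per-layer factor of 0.5, backpropagated_gradient shrinks to almost nothing over 100 layers, and with 1.5 it grows enormously. By contrast, carry_cell_state(0.7, [1.0] * 100, hold_params) returns a value very close to 0.7, because the gated, additive update lets the cell hold its contents across many steps.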

While LSTMs proved useful for a variety of applications, like handwriting recognition, they remained limited in their ability to process context; a unidirectional RNN or LSTM's output can only be influenced by previous sequence items. Similar to how the history of the Roman Empire is contextualized by its decline, earlier items in a sequence of images or words tend to take on different meanings based on later items. One example is the following sentence: "He loved his bird more than anything, and cared for it well, and was very distraught to find it had a broken propeller." Here, "bird" is being used as a slang term for an airplane, but this only becomes apparent upon parsing the last word ("propeller"). While a human reading this sentence can update their interpretation of the first part after reading the second, a unidirectional neural network (whether feed-forward, recurrent, or LSTM) cannot. To provide this capability, bidirectional LSTMs were created. Bidirectional RNNs were first described in 1997 by Schuster and Paliwal as an extension of RNNs.
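The bidirectional idea can be sketched by running one recurrent pass left to right and a second pass right to left, then pairing the two states at each position. For brevity the example uses a plain tanh recurrent unit rather than a full LSTM cell, and the function names and weights are assumptions; a real bi-LSTM would use LSTM cells with separately trained parameters for each direction.

```python
import math


def run_direction(sequence, w_in, w_rec):
    """Run a scalar tanh recurrent unit over a sequence, keeping every state."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(sequence, forward_params=(1.0, 0.5), backward_params=(1.0, 0.5)):
    """Return one (forward_state, backward_state) pair per sequence position."""
    forward = run_direction(sequence, *forward_params)
    # The backward pass reads the sequence in reverse; its states are then
    # flipped back so that index t lines up with position t.
    backward = run_direction(sequence[::-1], *backward_params)[::-1]
    return list(zip(forward, backward))


# Hypothetical sequences that differ only in their final item
# (standing in for "propeller" versus some other last word).
states_a = bidirectional_states([0.2, -0.4, 0.9])
states_b = bidirectional_states([0.2, -0.4, -0.9])
```

At the first position, the forward states of the two sequences are identical, since nothing seen so far differs, while the backward states differ because they have already absorbed the final item. This is the sense in which a bidirectional model lets later items influence the representation of earlier ones.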

Natural language processing
Bidirectional algorithms have long been used in domains outside of deep learning. In 2011, state-of-the-art part-of-speech (POS) taggers consisted of classifiers trained on windows of text, whose output was then fed into bidirectional decoding algorithms during inference. Collobert et al. cited examples of high-performance POS tagging systems in which this bidirectionality was implemented using dependency networks and Viterbi decoders.

In 2015, Wang et al. described a unified tagging solution using a bi-LSTM RNN with word embedding.

In 2015, Ling et al. described compositional character models for open-vocabulary word representation.

In 2015, Kawakami et al. described representing words in context with multilingual supervision.

In 2016, Li et al. described sentence relation modeling with auxiliary character-level embedding.

Speech / handwriting
In a 2005 paper, Graves et al. used bidirectional LSTMs for improved phoneme classification and recognition.

In a 2013 paper, Graves et al. used deep bidirectional LSTM for hybrid speech recognition.

In 2014, Zhang et al. described distant speech recognition using Highway LSTM RNNs.

In 2016, Zayats et al. described disfluency detection with bi-LSTM.

In a 2007 paper, Liwicki et al. presented a novel approach to on-line handwriting recognition based on bi-LSTM.

Sequence tagging
In 2015, Huang et al. described bi-LSTM-CRF models for sequence tagging.

Other applications
In 2016, Kiperwasser et al. described dependency parsing using bi-LSTM feature representations.

In 2016, Zhang et al. described a driving behavior recognition model with a multi-scale CNN and bi-LSTM.

In 2021, Deligiannidis et al. analyzed the performance versus complexity of bi-RNNs and Volterra nonlinear equalizers in digital coherent systems.

In 2021, Oluwalade et al. studied human activity recognition using smartphone and smartwatch sensor data.

In 2021, Dang et al. studied LSTM models for malware classification (are they bidirectional?).

In 2016, Lample et al. described neural architectures for named entity recognition.

In 2016, Wang et al. described image captioning with bi-LSTM.