When you read words, you don’t think from scratch; you understand each word based on your understanding of the previous words. Traditional neural networks can’t do this: they cannot use their reasoning about earlier events (say, earlier scenes in a film) to inform later ones.

LSTM

LSTMs are a special type of RNN, designed for learning long-term dependencies. All RNNs have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer:

LSTMs have this same chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four, interacting with each other.

In the above diagram:

  • Each line carries an entire vector, from the output of one node to the inputs of others.
  • Pink circles are pointwise operations, like vector addition.
  • Yellow boxes are learned neural network layers.
  • Lines merging denote concatenation.
  • A line forking denotes its content being copied to different locations.

Core Idea: Cell State

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. It’s like a conveyor belt running straight down the whole chain, with only some minor linear interactions. Information can easily just flow along unchanged.

The LSTM does have the ability to add or remove information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through; they are composed of a sigmoid neural net layer and a pointwise multiplication operation (a toy example follows the list below). The LSTM has three of these gates to protect and control the cell state.

  • Sigmoid layer: outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means let nothing through; a value of 1 means let everything through.
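As a minimal toy example of the gating idea (the numbers here are made up purely for illustration), a sigmoid layer produces one value per component, and a pointwise multiplication scales the signal by those values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A gate = a sigmoid layer plus a pointwise multiplication.
# Each sigmoid output lies in (0, 1) and says how much of the
# corresponding component of the signal to let through.
signal = np.array([2.0, -1.0, 0.5])
gate = sigmoid(np.array([10.0, -10.0, 0.0]))  # roughly 1, 0, and 0.5
print(gate * signal)                          # approximately [2.0, -0.0, 0.25]
```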

LSTM Walkthrough

Forget gate

First, we decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer”. It looks at $h_{t-1}$ and $x_t$, and outputs a number between $0$ and $1$ for each number in the cell state $C_{t-1}$.

For a language model trying to predict the next word based on all the previous ones, the cell state might include the gender of the present subject, so that the correct pronouns are used. When we see a new subject, we want to forget the gender of the old subject.
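A minimal NumPy sketch of the forget gate, assuming the standard parameterization $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (the weight names $W_f$, $b_f$ are the conventional ones, not taken from this text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forget gate: f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f)
# Each entry of f_t lies between 0 and 1 and says how much of the
# corresponding entry of the old cell state C_{t-1} to keep.
def forget_gate(W_f, b_f, h_prev, x_t):
    return sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
```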

Input to Cell State

The next step is to decide what information we’re going to store in the cell state.

  1. A sigmoid layer called the “input gate layer” decides which values we’ll update.
  2. A tanh layer then creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state (see the sketch after this list).
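A sketch of these two layers under the same assumed parameterization, $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input gate i_t: which cell-state entries to update (values in 0..1).
# Candidate values C~_t: what to write there (values in -1..1).
def input_and_candidate(W_i, b_i, W_C, b_C, h_prev, x_t):
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)
    c_tilde = np.tanh(W_C @ concat + b_C)
    return i_t, c_tilde
```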

In the example of a language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

Cell state update

Here, the old cell state, $C_{t-1}$, is updated into the new cell state, $C_t$. The previous steps already decided what to do; we just need to actually do it.

  • We multiply the old state by $f_t$, forgetting the things we decided to forget earlier.
  • We then add $i_t * \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value (see the sketch below).
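In code, the whole update is one line of pointwise arithmetic (a sketch, continuing the same assumed notation):

```python
# New cell state: keep what the forget gate allows of the old state,
# then add the input-gated candidate values (all products pointwise).
#   C_t = f_t * C_{t-1} + i_t * C~_t
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    return f_t * c_prev + i_t * c_tilde
```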

In the case of the language model, this is where we would actually drop the information about the old subject’s gender and add the new information, as we decided in previous steps.

Output

Finally, we decide the output, which is basically a filtered version of our cell state.

  • First, a sigmoid layer decides which parts of the cell state we’re going to output.
  • Then, the cell state is put through $\tanh$ (to push the values to be between $-1$ and $1$) and multiplied by the output of the sigmoid gate, so that we only output the parts we decided to (see the sketch below).
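A sketch of the output step, again assuming the standard parameterization $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ and $h_t = o_t * \tanh(C_t)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Output gate o_t chooses which parts of the (tanh-squashed) cell state
# become the new hidden state h_t, which is also the cell's output.
def output_step(W_o, b_o, h_prev, x_t, c_t):
    o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, o_t
```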

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what’s coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into.

Variants on LSTM