An RNN (Recurrent Neural Network) processes sequential data by maintaining a hidden state, a summary of everything seen so far, that is passed forward through the sequence. This gives RNNs a form of memory. The LSTM and GRU variants mitigated the vanishing gradient problem that limited vanilla RNNs. Transformers eventually replaced RNNs for most tasks by processing entire sequences in parallel rather than one step at a time.

Category: Deep Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read


RNN — What It Is, How Sequential Memory Works & Why Transformers Eventually Replaced It

What is an RNN?

A standard feedforward neural network processes each input independently: the output for input 5 does not depend on what inputs 1-4 were. For tasks such as classifying individual images, this is fine. For sequences (sentences, time series, audio), order matters enormously. “Dog bites man” and “Man bites dog” contain identical words but mean completely different things.

RNNs were designed for this. At each step of the sequence, an RNN processes the current input and combines it with its hidden state — a vector summarising everything the network has seen so far. This hidden state is passed to the next step, updated there, and passed again — all the way through the sequence. The final hidden state (or all of them) is then used for the task: classify this sentence, predict the next word, translate this sequence.

The hidden state is the RNN’s memory. It is how the model remembers that “bank” appeared after “river” six words ago — and uses that context to interpret what comes next.

How an RNN works

  1. Input the first element of the sequence (e.g. the first word token).
  2. Combine it with the initial hidden state (usually zeros) using learned weight matrices: one applied to the input, one applied to the hidden state, plus a bias.
  3. Apply an activation function (typically tanh) to produce the new hidden state.
  4. Pass this hidden state to the next step alongside the next input element.
  5. Repeat for every element in the sequence.
  6. Use the final hidden state (or all hidden states) as input to the output layer.
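
The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the names rnn_forward, W_xh, W_hh and b_h, the tanh activation, the layer sizes, and the random untrained weights are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence.

    inputs: array of shape (seq_len, input_size)
    Returns every hidden state, shape (seq_len, hidden_size).
    """
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)            # step 2: initial hidden state of zeros
    hidden_states = []
    for x_t in inputs:                   # steps 1, 4, 5: one element at a time
        # steps 2-3: mix the input with the previous state, then apply tanh
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 5, 3, 4
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

inputs = rng.normal(size=(seq_len, input_size))
states = rnn_forward(inputs, W_xh, W_hh, b_h)
final_state = states[-1]                 # step 6: would feed an output layer
print(states.shape)                      # (5, 4)
```

Because each hidden state depends on the one before it, feeding the same elements in a different order generally yields a different final state, which is what lets the network distinguish “Dog bites man” from “Man bites dog”.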

LSTM and GRU

Vanilla RNNs suffer from the vanishing gradient problem: gradients can shrink exponentially as backpropagation travels through many time steps, which makes long-range dependencies very hard to learn.
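
The effect can be seen numerically in a toy scalar RNN. The sketch below assumes a one-dimensional recurrence h_t = tanh(w * h_{t-1} + u * x_t) with hypothetical values for w, u and the inputs. By the chain rule, each step multiplies the gradient by w * (1 - h_t^2), so with |w| < 1 the product can only shrink.

```python
import math

def gradient_through_time(w, inputs, u=1.0):
    """Return |d h_t / d h_0| for each step t of a scalar RNN.

    The recurrence is h_t = tanh(w * h_{t-1} + u * x_t), so
    d h_t / d h_{t-1} = w * (1 - h_t ** 2), and the gradient back to h_0
    is the running product of those per-step factors.
    """
    h, grad, magnitudes = 0.0, 1.0, []
    for x_t in inputs:
        h = math.tanh(w * h + u * x_t)
        grad *= w * (1.0 - h * h)        # one more chain-rule factor
        magnitudes.append(abs(grad))
    return magnitudes

inputs = [0.5] * 50                      # hypothetical constant input sequence
mags = gradient_through_time(w=0.9, inputs=inputs)
for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: {mags[t - 1]:.3e}")
```

In a real RNN the per-step factor is a Jacobian matrix rather than a scalar, but the same repeated multiplication is what drives the gradient toward zero.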

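LSTM (1997, Hochreiter & Schmidhuber) addresses this with a separate cell state controlled by learned gates. The formulation used today has three: a forget gate (how much of the previous cell state to discard; added in a 2000 follow-up by Gers, Schmidhuber & Cummins rather than in the original paper), an input gate (how much new information to add), and an output gate (how much of the cell state to expose as the hidden state). Because the cell state is updated additively, gradients can survive across far more time steps than in a vanilla RNN.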
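
A single LSTM step in the common forget-gate formulation might look like the sketch below. The function name lstm_step, the packing of all four weight blocks into one matrix W, the block order, and the sizes are assumptions made for brevity, and the weights are random and untrained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W has shape (4 * hidden_size, input_size + hidden_size); its four row
    blocks produce the forget, input and output gates and the candidate.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f = sigmoid(z[0 * n:1 * n])          # forget gate: how much old cell state to keep
    i = sigmoid(z[1 * n:2 * n])          # input gate: how much new information to add
    o = sigmoid(z[2 * n:3 * n])          # output gate: how much cell state to expose
    g = np.tanh(z[3 * n:4 * n])          # candidate values
    c = f * c_prev + i * g               # additive cell-state update
    h = o * np.tanh(c)                   # new hidden state
    return h, c

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.5, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)                  # (4,) (4,)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is copied forward almost unchanged, which is the path that lets information and gradients persist.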

GRU (2014, Cho et al.) is a simplified gated unit with only two gates (reset and update) and no separate cell state. It has fewer parameters than an LSTM, performs similarly on many tasks, and is typically faster to train.
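
For comparison, here is a rough sketch of one GRU step, followed by a parameter count. The names gru_step and gated_param_count, the sizes, and the random weights are hypothetical. The count assumes one weight matrix and one bias vector per gate or candidate block, so numbers reported by a particular library may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step; each W_* has shape (hidden_size, input_size + hidden_size)."""
    xh = np.concatenate([x_t, h_prev])
    z = sigmoid(W_z @ xh + b_z)                  # update gate
    r = sigmoid(W_r @ xh + b_r)                  # reset gate
    # The candidate sees the previous state only through the reset gate.
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r * h_prev]) + b_h)
    return z * h_prev + (1.0 - z) * h_tilde      # blend old state and candidate

def gated_param_count(num_blocks, input_size, hidden_size):
    """Weights plus biases for num_blocks gate/candidate blocks."""
    return num_blocks * (hidden_size * (input_size + hidden_size) + hidden_size)

# Hypothetical sizes and random (untrained) weights, for illustration only.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
shape = (hidden_size, input_size + hidden_size)
W_z, W_r, W_h = (rng.normal(scale=0.5, size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)

gru_params = gated_param_count(3, input_size, hidden_size)    # 3 * (4*7 + 4) = 96
lstm_params = gated_param_count(4, input_size, hidden_size)   # 4 * (4*7 + 4) = 128
print(gru_params, lstm_params)
```

Under these assumptions a GRU needs three blocks of parameters where an LSTM needs four, so it is about a quarter smaller at the same input and hidden sizes.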

Real-world examples

Not theory — what real teams actually shipped using this technique.

  • Google Translate moved to an LSTM-based neural system (GNMT) in 2016 and later adopted transformer-based models. The LSTM system produced markedly better translations than the earlier phrase-based approach, but its sequence-to-sequence architecture was complex to train.
  • Time series forecasting — RNNs and LSTMs remain widely used for financial time series, demand forecasting, and IoT sensor data where sequence length is manageable and transformers would be overkill.
  • Music generation — early music generation models used LSTMs to predict note sequences, maintaining a hidden state that encoded the musical context developed over many previous notes.

Common pitfalls

  • Sequential processing bottleneck: an RNN cannot be parallelised across the sequence length, because each step needs the hidden state from the step before. Processing a 1000-word document requires 1000 sequential steps. Transformers process all 1000 positions in parallel.
  • Limited long-range memory: even LSTMs struggle with very long-range dependencies (hundreds of tokens apart). Transformers can attend directly to any position in a single step regardless of distance, although the total cost of attention grows quadratically with sequence length.
  • Largely replaced for NLP: for text tasks, transformers generally outperform RNNs and are faster to train at scale. RNNs persist for specific applications: real-time streaming (where processing one step at a time is advantageous), edge deployment (smaller memory footprint), and some time series tasks.
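
To make the contrast with the step-by-step RNN loop concrete, the sketch below computes single-head scaled dot-product self-attention for all 1000 positions with a few matrix products and no loop over time. It is a simplified illustration: the names, sizes and random weights are assumptions, and a real transformer adds multiple heads, positional information, masking and feedforward layers.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over all positions at once."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # shape (seq_len, seq_len)
    weights = softmax(scores)                    # each row sums to 1
    return weights @ V, weights

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model = 1000, 16
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                 # (1000, 16)
print(weights.shape)                             # (1000, 1000)
```

Every position reaches every other position in one step, but the attention weight matrix has seq_len × seq_len entries, so memory and compute grow with the square of the sequence length.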

Frequently asked questions

QUESTION 1 What is an RNN in simple terms?

ANSWER 1 A neural network with memory — it processes sequences one step at a time and passes a hidden state summary forward, allowing it to use context from earlier in the sequence.

QUESTION 2 What is the vanishing gradient problem?

ANSWER 2 Gradients shrink exponentially as backpropagation passes through many time steps, which makes it very hard to learn dependencies between positions far apart in a sequence.

QUESTION 3 What are LSTM and GRU?

ANSWER 3 RNN variants with gating mechanisms that let gradients survive across much longer sequences, which largely mitigates vanishing gradients. They dominated NLP before transformers.

QUESTION 4 Why did transformers replace RNNs?

ANSWER 4 Parallel processing (no sequential bottleneck), direct long-range attention at any distance, and better scaling with data and parameters.


Sources & further reading

  • Hochreiter & Schmidhuber (1997). Long Short-Term Memory. Neural Computation — the original LSTM paper.
  • Cho et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 — introduced GRU.
  • Karpathy (2015). The Unreasonable Effectiveness of Recurrent Neural Networks — famous blog post demonstrating RNN capabilities. karpathy.github.io/2015/05/21/rnn-effectiveness/
  • Goodfellow, Bengio & Courville (2016). Deep Learning. Chapter 10: Sequence Modelling. deeplearningbook.org
