Q: What is the vanishing gradient problem in RNNs?

The vanishing gradient problem is why vanilla RNNs struggle with long sequences. Backpropagation through an RNN multiplies the gradient by one factor per time step. If those factors consistently have magnitude below 1, the gradient shrinks exponentially as it propagates backwards, and by the time it reaches the early time steps it is effectively zero. The model then cannot learn dependencies between words that are far apart in a sequence.
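The shrinking product can be written out for a vanilla RNN. This is a sketch under assumed notation that does not come from the article: a tanh activation, recurrent weights W_{hh}, input weights W_{xh}, bias b_h, and the spectral norm. The square h_t^2 is elementwise, and the product is ordered with later time steps on the left.

```latex
\begin{align}
h_t &= \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\right) \\
\frac{\partial h_T}{\partial h_k}
  &= \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}
   = \prod_{t=k+1}^{T} \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh} \\
\left\lVert \frac{\partial h_T}{\partial h_k} \right\rVert
  &\le \prod_{t=k+1}^{T}
     \left\lVert \operatorname{diag}\!\left(1 - h_t^{2}\right) \right\rVert
     \left\lVert W_{hh} \right\rVert
   \le \gamma^{\,T-k},
  \qquad \gamma = \lVert W_{hh} \rVert
\end{align}
```

Because the derivative of tanh never exceeds 1, each factor has norm at most γ. With γ = 0.9 and 100 steps between the two positions, the bound is 0.9^100 ≈ 2.7 × 10⁻⁵, so almost none of the gradient signal reaches the early step.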

RNN (Recurrent Neural Network)

Q: What is an RNN in simple terms?

An RNN is a neural network with memory. When processing a sequence — a sentence, a time series, an audio clip — it processes one element at a time and passes a summary of everything seen so far (the hidden state) to the next step. It remembers context. Reading the word 'bank' in 'river bank', the hidden state carries 'we have been talking about rivers' — helping interpret the ambiguous word correctly.

Q: What are LSTM and GRU?

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) are RNN variants designed to mitigate the vanishing gradient problem. They use gate mechanisms: learned functions that control how much of the previous state to keep, forget, and update. This lets gradients flow more freely across many time steps, so models can learn longer-range dependencies. LSTM and GRU dominated NLP from roughly 2015 to 2018, before transformers replaced them.

Q: Why did transformers replace RNNs?

Three reasons. Parallelism: RNNs process sequences step by step, so you cannot process step 5 until you have finished step 4. Transformers process all positions of a training sequence at once, which makes far better use of parallel hardware and allows training on much more data in the same time. Long-range dependencies: transformer attention directly connects any position to any other, regardless of distance. Scale: transformers scale better with data and parameters than RNNs, producing better models at frontier scale.

⚡ An RNN (Recurrent Neural Network) processes sequential data by maintaining a hidden state — a summary of everything seen so far — passed forward through the sequence. This gives RNNs memory. LSTM and GRU variants solved the vanishing gradient problem that limited vanilla RNNs. Transformers eventually replaced RNNs for most tasks by processing entire sequences in parallel rather than one step at a time.

Category: Deep Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 5 min read

RNN — What It Is, How Sequential Memory Works & Why Transformers Eventually Replaced It

What is an RNN?

A standard neural network processes each input independently — the output for input 5 does not depend on what inputs 1-4 were. For images, this is fine. For sequences — sentences, time series, audio — order matters enormously. “Dog bites man” and “Man bites dog” have identical words but completely different meanings.

RNNs were designed for this. At each step of the sequence, an RNN processes the current input and combines it with its hidden state — a vector summarising everything the network has seen so far. This hidden state is passed to the next step, updated there, and passed again — all the way through the sequence. The final hidden state (or all of them) is then used for the task: classify this sentence, predict the next word, translate this sequence.

The hidden state is the RNN’s memory. It is how the model remembers that “bank” appeared after “river” six words ago — and uses that context to interpret what comes next.

How an RNN works

Input the first element of the sequence (e.g. the first word token).
Combine it with the initial hidden state (usually zeros) using learned weight matrices: one applied to the input, one applied to the hidden state.
Apply an activation function — producing a new hidden state.
Pass this hidden state to the next step alongside the next input element.
Repeat for every element in the sequence.
Use the final hidden state (or all hidden states) as input to the output layer.
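The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation. The names `rnn_forward`, `W_xh`, `W_hh` and `b_h`, the tanh activation, and the random weights standing in for learned ones are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state.

    inputs: array of shape (seq_len, input_size)
    W_xh:   input-to-hidden weights, shape (hidden_size, input_size)
    W_hh:   hidden-to-hidden weights, shape (hidden_size, hidden_size)
    b_h:    hidden bias, shape (hidden_size,)
    """
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)            # initial hidden state (zeros)
    hidden_states = []
    for x_t in inputs:                   # one sequence element at a time
        # Combine the current input with the previous hidden state,
        # then apply the activation to get the new hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)       # shape (seq_len, hidden_size)

# Toy data: random values stand in for token embeddings and trained weights.
rng = np.random.default_rng(0)
seq_len, input_size, hidden_size = 5, 3, 4
inputs = rng.normal(size=(seq_len, input_size))
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

states = rnn_forward(inputs, W_xh, W_hh, b_h)
final_state = states[-1]                 # what an output layer would consume
print(states.shape)                      # (5, 4)
```

The same weights are reused at every step, which is why the loop body never changes. Only the hidden state `h` carries information from one step to the next.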

LSTM and GRU

Vanilla RNNs suffer from the vanishing gradient problem — gradients shrink exponentially as backpropagation travels through many time steps, preventing learning of long-range dependencies.
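A small numerical experiment makes the shrinkage visible. The sketch below is built on assumptions rather than a trained model: a tanh RNN with random recurrent weights `W_hh` rescaled so their largest singular value is 0.9, and random vectors standing in for the input contribution. It tracks the norm of the Jacobian of the current hidden state with respect to the initial one.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, seq_len = 16, 50

# Hypothetical recurrent weights, rescaled so the spectral norm is 0.9.
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

h = np.zeros(hidden_size)
jacobian_product = np.eye(hidden_size)   # accumulates d h_t / d h_0
norms = []
for _ in range(seq_len):
    x_term = rng.normal(size=hidden_size)          # stands in for W_xh @ x_t
    h = np.tanh(W_hh @ h + x_term)
    step_jacobian = np.diag(1.0 - h**2) @ W_hh     # d h_t / d h_{t-1}
    jacobian_product = step_jacobian @ jacobian_product
    norms.append(np.linalg.norm(jacobian_product, 2))

print(f"after 1 step:   {norms[0]:.3e}")
print(f"after {seq_len} steps: {norms[-1]:.3e}")
```

Each step multiplies in one more factor whose norm is at most 0.9, so after 50 steps the norm cannot exceed 0.9^50 (about 0.005). In practice it is far smaller, because the tanh derivative also shrinks the signal.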

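LSTM (1997, Hochreiter & Schmidhuber) addresses this with gates. The standard formulation has a forget gate (how much of the previous cell state to discard), an input gate (how much new information to add), and an output gate (how much of the cell state to expose as the hidden state). The forget gate was added to the original 1997 design a few years later. Because the cell state is updated additively under the control of these learned gates, gradients can survive across much longer sequences than in a vanilla RNN.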
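One time step of a standard LSTM cell can be sketched as follows. This is an illustrative NumPy version. The function name `lstm_step`, the stacked weight layout (forget, input, output, candidate) and the random weights are assumptions for the example, not the layout of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * hidden, input), U: (4 * hidden, hidden), b: (4 * hidden,)
    Rows are stacked in the order: forget, input, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: keep how much of c_prev
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: admit how much new content
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: expose how much of the cell
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell values
    c_t = f * c_prev + i * g                # additive cell-state update
    h_t = o * np.tanh(c_t)                  # hidden state passed to the next step
    return h_t, c_t

# Toy run: random values stand in for inputs and trained weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)                     # (4,) (4,)
```

The line `c_t = f * c_prev + i * g` is the key design choice. When the forget gate is close to 1, the cell state, and the gradient flowing back through it, passes from step to step almost unchanged.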

GRU (2014, Cho et al.) is a simplified gated unit with only two gates (reset and update) and no separate cell state. It has fewer parameters than an LSTM, performs similarly on many tasks, and is usually faster to train.
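A GRU step, along with the parameter saving, can be sketched like this. The helper names `gru_step`, `init_gru` and `count_params` are hypothetical. The update rule follows the convention where the update gate `z` weights the previous state, and some libraries swap the roles of `z` and `1 - z`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step with separate weights per block."""
    W_r, U_r, b_r, W_z, U_z, b_z, W_h, U_h, b_h = params
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)             # reset gate
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)             # update gate
    h_cand = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                  # blend old and new

def init_gru(input_size, hidden_size, rng):
    """Random stand-ins for trained weights: reset, update, candidate blocks."""
    params = []
    for _ in range(3):
        params += [rng.normal(scale=0.1, size=(hidden_size, input_size)),
                   rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
                   np.zeros(hidden_size)]
    return params

def count_params(input_size, hidden_size, n_blocks):
    """Each block has an input matrix, a recurrent matrix and a bias."""
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return n_blocks * per_block

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = init_gru(input_size, hidden_size, rng)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(x_t, h, params)

gru_params = count_params(input_size, hidden_size, n_blocks=3)
lstm_params = count_params(input_size, hidden_size, n_blocks=4)
print(gru_params, lstm_params)   # 96 128
```

A GRU has three weight blocks where an LSTM has four, so at the same sizes it carries three quarters of the parameters.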

Real-world examples

Not theory — what real teams actually shipped using this technique.

Google Translate switched to LSTM-based neural machine translation in 2016 and only later moved to transformer-based models. LSTMs produced dramatically better translations than the previous phrase-based approach, but the sequence-to-sequence architectures they required were complex to train.
Time series forecasting — RNNs and LSTMs remain widely used for financial time series, demand forecasting, and IoT sensor data where sequence length is manageable and transformers would be overkill.
Music generation — early music generation models used LSTMs to predict note sequences, maintaining a hidden state that encoded the musical context developed over many previous notes.

Common pitfalls

Sequential processing bottleneck — RNNs cannot be parallelised across the sequence length. Processing a 1000-word document requires 1000 sequential steps. Transformers process all 1000 in parallel.
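Limited long-range memory: even LSTMs struggle with very long-range dependencies (hundreds of tokens apart), because all context must be squeezed through a fixed-size state one step at a time. Transformer attention connects any two positions in a single step, regardless of how far apart they are.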
Largely replaced for NLP — for text tasks, transformers consistently outperform RNNs and are faster to train at scale. RNNs persist for specific applications: real-time streaming (where processing one step at a time is advantageous), edge deployment (smaller memory footprint), and some time series tasks.
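For contrast, here is a minimal sketch of the attention operation that transformers use in place of a recurrent loop. It is a single-head, unmasked version with no positional encoding. The names `self_attention`, `W_q`, `W_k` and `W_v` and the random weights are assumptions for illustration only.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k)
    Returns the attended outputs and the (seq_len, seq_len) weight matrix.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every position vs every other
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(scale=0.3, size=(d_model, d_k))
W_k = rng.normal(scale=0.3, size=(d_model, d_k))
W_v = rng.normal(scale=0.3, size=(d_model, d_k))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)   # (6, 8) (6, 6)
```

There is no loop over time steps. Row i of `weights` says how much position i draws on every other position, so the first and last tokens are linked as directly as neighbours. The cost is a weight matrix that grows with the square of the sequence length.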

Frequently asked questions

Q: What is an RNN in simple terms?

A neural network with memory. It processes sequences one step at a time and passes a hidden state summary forward, allowing it to use context from earlier in the sequence.

Q: What is the vanishing gradient problem?

Gradients shrink exponentially during backpropagation over many time steps, which makes it extremely hard to learn dependencies between positions far apart in a sequence.

Q: What are LSTM and GRU?

RNN variants with gate mechanisms that let gradients flow more freely across long sequences, mitigating vanishing gradients. They dominated NLP before transformers.

Q: Why did transformers replace RNNs?

Parallel processing (no sequential bottleneck), direct long-range attention at any distance, and better scaling with data and parameters.

Sources & further reading

Hochreiter & Schmidhuber (1997). Long Short-Term Memory. Neural Computation — the original LSTM paper.
Cho et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 — introduced GRU.
Karpathy (2015). The Unreasonable Effectiveness of Recurrent Neural Networks — famous blog post demonstrating RNN capabilities. karpathy.github.io/2015/05/21/rnn-effectiveness/
Goodfellow, Bengio & Courville (2016). Deep Learning. Chapter 10: Sequence Modelling. deeplearningbook.org
