Have you ever wondered how AI models make sense of sequential data like text, time series, or even medical images treated as time series? Besides the “traditional” recurrent neural network, two powerful machine learning architectures have recently emerged as frontrunners in this field: Transformers and State Space Models (SSMs). But what sets them apart, and how do they compare? Let’s delve into sequence modeling and examine the nuanced differences between these two “newcomers”.
Rise of Transformers: Attention is All You Need
In 2017, Transformers burst onto the scene [1] by revolutionizing natural language processing. Their main innovation is the attention mechanism, which enables each element in a sequence to interact with every other element, thereby mastering the contextual connections within the sequence. In the transformer architecture, each element follows simple rules based on its relationship with all other elements. Language models such as GPT are built on transformers, showcasing the transformer architecture’s strength in handling highly contextual language.
In addition to excelling at tasks requiring global context understanding, Transformers have shown remarkable adaptability across various domains with surprising emergent capabilities such as reasoning and even math. However, their power and robustness come with a cost: computational complexity that scales quadratically with sequence length, limiting their efficiency on very long sequences.
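To make the mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (illustrative only; it omits the multi-head projections, masking, and other machinery of GPT-style models). The score matrix has shape (n, n), one entry per pair of positions, which is exactly where the quadratic cost in sequence length comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: every position attends to every other.

    Q, K, V: arrays of shape (n, d) for a sequence of n tokens.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n): one score per pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                 # (n, d): context-aware representations

# Toy usage: 6 tokens with 4-dimensional embeddings (self-attention, so Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (6, 4)
```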
State Space Models: The Efficient Challengers
While not new, state space models (SSMs) have recently surged in popularity for deep learning, particularly with innovations like the Mamba architecture [2]. SSMs operate on a different principle: they maintain an internal state that evolves over time as they process a sequence.
In an SSM, each time step follows simple rules to update its state based on the previous state and new input. Collectively, these state transitions create a dynamic system capable of modeling complex temporal patterns. There’s no global view of the entire sequence simultaneously, but rather a continuously evolving representation.
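A minimal sketch of this idea, using a plain discretized linear state-space recurrence with made-up matrices (not the selective, learned parameterization used by Mamba): each step folds the new input into a fixed-size hidden state, and the output is read from that state.

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Run a discrete linear state-space model over a sequence.

    A: (d_state, d_state)  state transition
    B: (d_state, d_in)     input projection
    C: (d_out, d_state)    output readout
    inputs: (n, d_in)      the sequence, processed one step at a time
    """
    state = np.zeros(A.shape[0])
    outputs = []
    for x_t in inputs:
        state = A @ state + B @ x_t   # update: previous state plus new input
        outputs.append(C @ state)     # read the output from the current state
    return np.stack(outputs)

# Toy usage: a 10-step sequence of scalars with a 4-dimensional hidden state
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                   # a simple, stable transition matrix
B = rng.normal(size=(4, 1))
C = rng.normal(size=(1, 4))
y = ssm_scan(A, B, C, rng.normal(size=(10, 1)))
print(y.shape)  # (10, 1)
```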
SSMs handle very long sequences efficiently because their computational complexity scales linearly with sequence length. This makes them particularly attractive for tasks involving extensive temporal dependencies or limited computational resources. Recent studies [3] have shown that while SSMs excel at many tasks, they struggle with certain operations, such as exact copying, that require precise information retrieval.
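A back-of-the-envelope comparison makes the scaling difference concrete (illustrative counts only): attention scores every pair of positions, while the recurrence above touches each position once and keeps only a fixed-size state.

```python
def attention_pair_scores(n):
    # self-attention compares every position with every other: O(n^2)
    return n * n

def ssm_state_updates(n):
    # the recurrence performs one fixed-cost update per position: O(n)
    return n

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: attention scores={attention_pair_scores(n):>15,}  "
          f"ssm updates={ssm_state_updates(n):>9,}")
```

The flip side is that the entire history must be squeezed into that fixed-size state, whereas attention can look back at the original tokens directly; this is one intuition for why exact copying is harder for SSMs.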
Emerging Applications and Future Directions
The strengths and limitations of each approach have led to exciting developments in various domains. For instance, MedMamba [4], an SSM-based architecture, has shown promising results in medical image classification, offering competitive performance with fewer parameters and lower computational requirements.
Meanwhile, Transformers continue to dominate in natural language processing tasks, leveraging their strong ability to handle context and perform information retrieval.
Looking ahead, the most exciting prospects may lie in hybrid approaches that combine the strengths of both Transformers and SSMs. Researchers are exploring ways to improve SSMs’ performance on copying tasks and investigating hybrid architectures that could offer the best of both worlds.
If you’re interested in how models handle information, see Modeling and Compression for a look at why all models are essentially sophisticated compressors.
References
1. Vaswani, A., et al. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30, 5998-6008.
2. Gu, A., et al. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. International Conference on Learning Representations.
3. Jelassi, S., et al. (2024). Repeat After Me: Transformers are Better than State Space Models at Copying. arXiv preprint arXiv:2402.01032.
4. Yue, Y., et al. (2024). MedMamba: Vision Mamba for Medical Image Classification. arXiv preprint arXiv:2403.03849.