Transformer architectures have emerged as the preferred choice among deep-learning models, finding applications in diverse fields such as natural language processing, computer vision, and time-series forecasting. Notably, they form the core of large language models like ChatGPT. What sets Transformers apart from conventional models such as fully connected networks (FCNs), convolutional neural networks (CNNs), and residual neural networks (ResNets) is the incorporation of the Attention mechanism. While Attention has driven significant progress, our understanding of its fundamentals remains in its infancy: questions about its ability to memorize long sequences, model long-range dependencies, optimize effectively, and generalize well are still being explored.
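To make the central object of the tutorial concrete, the following is a minimal sketch of scaled dot-product Attention, the operation at the heart of the Transformer; the function and variable names here are our own illustrative choices, not part of any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are (sequence_length, d_k) arrays; each output row is a
    weighted average of the rows of V, with weights derived from the
    similarity between a query and all keys.
    """
    d_k = K.shape[-1]
    # Similarity scores between every query and every key, scaled to
    # keep the softmax in a well-conditioned range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (subtracting the max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: a toy sequence of 4 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
```

In a full Transformer this operation is applied in parallel across multiple heads and composed with learned linear projections, but the core computation is the one above.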
Yet recent developments have opened exciting avenues for future research, aiming to establish fundamental limits, enhance architectures, and improve algorithms and techniques such as prompt-tuning, thereby fostering a sustainable and reliable future for large language models (LLMs). Motivated by these developments in understanding and exploiting phenomena in learning with the Transformer architecture, this tutorial offers a friendly, guided tour of Transformer architectures while emphasizing their core principles. We review recent breakthroughs that demystify Transformers and highlight promising directions for future research, with the aim of inspiring the research community to take a closer, more systematic look at Transformers.
Tutorial outline:
- Motivation: The Transformer Revolution
- Part I: Primer on Transformers
- Part II: Fundamentals
  - Feature selection
  - Optimization fundamentals
  - Approximation and memorization
- Part III: Modern Applications
  - In-context learning
  - Efficient fine-tuning
  - Scaling and emergent abilities
- Part IV: Future Directions