PROJECT: SMALL LLM
This project is a small Large Language Model (LLM) that I built from scratch by implementing the Transformer architecture described in the "Attention Is All You Need" paper. The goal was to deeply understand every part of a modern sequence-to-sequence model: multi-head self-attention, positional encodings, layer normalization, residual connections, and the full encoder–decoder training loop. I trained the model on a custom dataset built from the complete works of William Shakespeare so it would learn to generate text in an Elizabethan style.
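To make those pieces concrete, here is a minimal PyTorch sketch of the kind of components described above: scaled dot-product multi-head self-attention with a causal mask (decoder-style, since the model generates text autoregressively), sinusoidal positional encodings, and residual connections with layer normalization, assembled into a small language model. The class names and hyperparameters (TinyTransformerLM, d_model, n_head, block_size, and so on) are illustrative placeholders, not the repo's actual code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos positional encodings, as in the original paper."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)             # (T, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                         # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                                                 # (T, d_model)


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_head: int, block_size: int):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head, self.d_head = n_head, d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)        # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)
        # Causal mask: position t may only attend to positions <= t.
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, self.d_head).transpose(1, 2)            # (B, h, T, d_head)
        k = k.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)              # scaled dot product
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)            # merge heads
        return self.proj(out)


class TransformerBlock(nn.Module):
    """Attention and feed-forward sublayers, each wrapped in residual + layer norm."""
    def __init__(self, d_model: int, n_head: int, block_size: int):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, n_head, block_size)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ln1(x + self.attn(x))      # residual connection around attention
        x = self.ln2(x + self.ff(x))        # residual connection around feed-forward
        return x


class TinyTransformerLM(nn.Module):
    """Token embedding + positional encoding + stacked blocks + linear head."""
    def __init__(self, vocab_size: int, d_model: int = 128, n_head: int = 4,
                 n_layer: int = 4, block_size: int = 128):
        super().__init__()
        self.block_size = block_size
        self.embed = nn.Embedding(vocab_size, d_model)
        self.register_buffer("pos", sinusoidal_encoding(block_size, d_model))
        self.blocks = nn.Sequential(*[TransformerBlock(d_model, n_head, block_size)
                                      for _ in range(n_layer)])
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        T = idx.size(1)
        x = self.embed(idx) + self.pos[:T]  # add positional information to token embeddings
        return self.head(self.blocks(x))    # (B, T, vocab_size) logits
```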
I implemented the model starting from a blank file: tokenization, embeddings, stacked attention blocks, feed-forward layers, and autoregressive decoding. I also wrote the full training pipeline: dataset preprocessing, batching, the optimizer and learning-rate schedule, and evaluation and sampling utilities for generating text.
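A hedged sketch of what such a pipeline can look like, reusing the hypothetical TinyTransformerLM from the sketch above: character-level tokenization of the corpus, random batching with shifted targets, AdamW with a cosine learning-rate schedule, and temperature sampling. The file name shakespeare.txt and every hyperparameter here are assumptions for illustration, not the project's actual settings.

```python
import torch
import torch.nn.functional as F

# Character-level tokenization of the corpus ("shakespeare.txt" is an assumed filename).
text = open("shakespeare.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

block_size, batch_size = 128, 32
model = TinyTransformerLM(vocab_size=len(chars), block_size=block_size)

def get_batch():
    """Sample random windows; the target is the input shifted by one character."""
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5000)

for step in range(5000):
    x, y = get_batch()
    logits = model(x)                                            # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
    if step % 500 == 0:
        print(f"step {step}: loss {loss.item():.3f}")

@torch.no_grad()
def sample(prompt: str, max_new_tokens: int = 200, temperature: float = 0.8) -> str:
    """Autoregressive decoding: repeatedly feed the model its own last prediction."""
    idx = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(idx[:, -block_size:])                     # crop to the context window
        probs = F.softmax(logits[:, -1, :] / temperature, dim=-1)
        idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
    return "".join(itos[int(i)] for i in idx[0])

print(sample("ROMEO:"))
```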
Beyond reproducing the architecture, I spent time profiling and optimizing performance, paying attention to tensor shapes, masking, and memory usage to make the model trainable on limited hardware. This project significantly strengthened my understanding of how modern LLMs work under the hood, from the math of attention to the practical engineering needed to train and serve them.
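The kind of checks this involves can be illustrated in a few lines, again using the hypothetical TinyTransformerLM from the first sketch: asserting output shapes, verifying that the causal mask really blocks information from future tokens, and measuring peak GPU memory for a forward/backward pass.

```python
import torch

# Assumes the TinyTransformerLM sketch above; vocab size 65 and the shapes below
# are placeholder numbers, not the project's real configuration.
model = TinyTransformerLM(vocab_size=65, block_size=128)
x = torch.randint(0, 65, (4, 128))                        # (batch, time) token ids

with torch.no_grad():
    logits = model(x)
    assert logits.shape == (4, 128, 65)                   # (batch, time, vocab)

    # Causal masking check: perturbing the last token must not change any earlier
    # position's logits, since position t only attends to tokens <= t.
    x2 = x.clone()
    x2[:, -1] = (x2[:, -1] + 1) % 65
    assert torch.allclose(model(x2)[:, :-1], logits[:, :-1], atol=1e-5)

# Peak GPU memory for one forward/backward pass, useful when tuning batch size
# and context length for limited hardware.
if torch.cuda.is_available():
    model = model.cuda()
    torch.cuda.reset_peak_memory_stats()
    model(x.cuda()).sum().backward()
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
```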
→ Check the repo here.