Research Project

Building an LLM
From Scratch

A quest to understand Transformers by building one. Trained on poetry, powered by PyTorch, and built from the ground up.

Kaggle
01

The Dad Question

Why torture myself with backpropagation?

One evening, my dad casually asked: "So, what exactly does this LLM thing do?"

I panicked. I threw out buzzwords—"transformers", "attention", "tokenization". I realized if I couldn't explain it simply, I didn't understand it.

So I decided to build one. Not fine-tune GPT-4. Not call an API. But build a Transformer from empty Python files. To peek under the hood, sweat through the math, and finally say: "Dad, I got this."

Why Scratch?

Research
Reinvent the wheel to understand the road.
Sovereignty
Own the pipeline, control the data.

"You don’t just want wheels — you want monster truck wheels."

02

The Curriculum

Raising a baby poet on a diet of 223k characters.

Poems Dataset
A small, expressive corpus for quick experimentation.
Total Characters: 223,504
Unique Tokens (Vocab): 79

We used Character-Level Tokenization. Every letter, space, and punctuation mark gets a unique ID. Simple, effective, and perfect for learning the "atoms" of language.

Tokenization Logic

Input Text: "love this"
Encoder (stoi): [62, 65, 72, 55, ...]
Model Processing: Tensor Operations

Dataset Size

0.2MB

Tiny but Mighty

Tokenizer.py
# `text` is the full poems corpus loaded as a single string
chars = sorted(list(set(text)))                  # every unique character becomes a vocab entry
stoi = { ch:i for i,ch in enumerate(chars) }     # character -> integer ID
itos = { i:ch for i,ch in enumerate(chars) }     # integer ID -> character
encode = lambda s: [stoi[c] for c in s]          # text -> list of token IDs
decode = lambda l: ''.join([itos[i] for i in l]) # list of token IDs -> text
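
A quick usage check of the snippet above, plus the handoff to PyTorch shown in the pipeline: the variable names here (sample, ids, data) are mine, and the exact IDs depend on which characters the corpus happens to contain.

import torch

sample = "love this"
ids = encode(sample)            # a list of integer IDs; exact values depend on the corpus vocab
assert decode(ids) == sample    # encode and decode are exact inverses

# For training, the whole corpus becomes one long tensor of token IDs
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape)               # torch.Size([223504]) for this 223,504-character dataset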
03

The Brain

Evolution from Bigram to Transformer.

V1

Bigram Model

Looks at one character to guess the next. No context, just probability tables. Result: Total gibberish.
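
A minimal sketch of what V1 looks like in PyTorch, assuming the 79-token character vocab from section 02; the class name BigramLM is mine, not the project's. The "probability table" is literally a vocab-by-vocab embedding of next-character logits.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLM(nn.Module):
    """Each token ID indexes a row of logits for the next token: a learned lookup table."""
    def __init__(self, vocab_size):
        super().__init__()
        self.logits_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.logits_table(idx)            # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

model = BigramLM(vocab_size=79)
x = torch.randint(0, 79, (4, 8))                   # dummy batch: 4 sequences of 8 character IDs
logits, loss = model(x, targets=torch.randint(0, 79, (4, 8)))
print(loss.item())                                  # roughly 4.4 before training, i.e. random guessing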

V2

Self-Attention

Tokens start "talking" to each other. Queries, Keys, and Values allow the model to focus on relevant past information.
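
A sketch of a single attention head to make the Query/Key/Value idea concrete; the head size and variable names are illustrative, and the causal mask keeps each character from peeking at the future.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One self-attention head: each position queries the past and mixes in the values it finds relevant."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask so a token only attends to itself and earlier tokens
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                      # x: (B, T, n_embd)
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T) scaled dot products
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)    # how much each token "listens" to each earlier token
        return weights @ v                     # (B, T, head_size)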

V3

Multi-Head Attention

Multiple attention heads run in parallel, capturing different nuances of the text simultaneously.
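
Multi-head attention is then just several of those heads run side by side and stitched back together; a sketch assuming the Head class and imports from the snippet above, with a projection layer to mix the concatenated heads.

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, each free to track a different nuance of the text."""
    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        head_size = n_embd // num_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(n_embd, n_embd)   # mix the concatenated heads back into one representation

    def forward(self, x):                        # x: (B, T, n_embd)
        out = torch.cat([h(x) for h in self.heads], dim=-1)   # (B, T, n_embd)
        return self.proj(out)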

04

The Output

From gibberish to semi-coherent poetry.

Loss Curve

Started at ~4.5, which is essentially random guessing over the 79-character vocabulary. After 2000 iterations it dropped to ~2.0, showing significant learning of structure and syntax.
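
That starting point checks out against the math: with 79 equally likely characters, the expected cross-entropy of pure guessing is -ln(1/79). A quick sanity check, with the vocab size taken from section 02:

import math

vocab_size = 79
random_guess_loss = -math.log(1 / vocab_size)   # cross-entropy if every character is equally likely
print(round(random_guess_loss, 2))              # 4.37, close to the untrained model's ~4.5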

Structure Emergence

The model learned to form words, use punctuation, and mimic the stanza structure of poems, even if the meaning is abstract.

Output Generation
# Step 0 (Untrained)
Eum[boRFXHR)!Nk'Gwqw;ei(3sVtCLU...
# Step 2000 (Trained)
To the bitnessabs, I witcap in syecrietss to siow,
there were.
A whake I pack in chicad Shateing...
It's not Shakespeare, but it's ours.