Building the First ML Model for Grade 2 Braille — And Why It Took Three Tries
- 🏷 braille
- 🏷 ocr
- 🏷 byt5
- 🏷 transformer
- 🏷 accessibility
- 🏷 machine-learning
- 🏷 python
Here’s a scenario that plays out in classrooms every day. A visually impaired student finishes a written exam — on paper, in Braille, the way they’ve been taught. The teacher collects it. And then… stares at it. Because the teacher can’t read Braille.
This isn’t some edge case. Most teachers of visually impaired students cannot read Braille. They need a way to convert that embossed page into readable English text. So they reach for one of the Braille OCR apps on their phone. They snap a photo. And they get garbage.
“Eight of us scanned 28 documents multiple times… not one word was translated.”
That’s an actual app store review. And it’s not an outlier. I started digging into why these apps fail so consistently — and found a gap so big I decided to build a solution myself.

What Is Grade 2 Braille?
If you’ve ever seen Braille, you probably think of it as a one-to-one code. Each cell (a 2x3 grid of raised dots) maps to one letter. “A” is dot 1. “B” is dots 1-2. Simple lookup.
That’s Grade 1 Braille. And almost nobody uses it.
Within the first few months of learning Braille, students move to Grade 2 — contracted Braille. It’s like the shorthand version. Common words and letter combinations get compressed into fewer cells. “The” becomes a single cell. “And” becomes a single cell. “ing” at the end of a word? One cell.
There are over 180 contraction rules, and here’s what makes it genuinely hard: the same cell can mean completely different things depending on context.
Take the cell with dots 1-2 (⠃). It could mean:
- The letter “b” (when it’s part of a word)
- The word “but” (when it stands alone between spaces)
- Part of the contraction “bl” (in specific positions)
And it gets trickier. You can’t use the “can” contraction inside the word “Duncan.” The “th” contraction applies in “think” but not in “pothole.” These are position-dependent, pronunciation-dependent rules that require understanding the surrounding text.
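For the curious, the cell-to-codepoint mapping itself is mechanical: Unicode's Braille Patterns block (U+2800–U+283F for six-dot cells) assigns dot *n* to bit *n−1* of the offset. A minimal helper:

```python
def dots_to_char(dots):
    """Map a set of raised-dot numbers (1-6) to its Unicode Braille character.

    The Braille Patterns block starts at U+2800; dot n sets bit n-1 of the
    offset, so all 2^6 = 64 six-dot patterns fit in U+2800-U+283F.
    """
    offset = 0
    for d in dots:
        offset |= 1 << (d - 1)
    return chr(0x2800 + offset)

print(dots_to_char({1}))     # dot 1, the letter "a" -> ⠁
print(dots_to_char({1, 2}))  # dots 1-2, the ambiguous cell above -> ⠃
```

The encoding is trivially invertible; the hard part lives entirely in what the resulting cell *means*.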
90-95% of all real Braille documents use Grade 2. Every book, every magazine, every exam paper from a student past their first semester. Grade 1 is training wheels.

Why Every Existing App Fails
Here’s the punchline: every existing Braille OCR app only supports Grade 1.
That means they only work on 5-10% of real-world Braille. When a Grade 2 document hits a Grade 1 reader, the result isn’t just inaccurate — it’s incomprehensible.
A student writes: “The children will go shopping”
A Grade 1 reader sees something like: “t ch w g shop”
The contractions for “the,” “children,” “will,” and “go” get interpreted as individual letters, producing nonsense. The teacher is no closer to reading the exam.
And it’s not like the technology for generating Braille doesn’t exist. Libraries like liblouis can translate English text into Grade 2 Braille perfectly well. Screen readers use it every day. But going the other direction — from a Braille image back to English text — is where everything breaks down.
The tools exist to write Braille. Nothing exists to read it back from a photograph.
The Problem Is Interpretation, Not Detection

When I started researching this, the first insight that changed my approach was realizing that reading Braille from images is really two separate problems — and one of them is already solved.
Stage 1: Cell Detection — Look at the image, find each Braille cell, determine which dots are raised. This is computer vision. And it’s a solved problem. Models like YOLOv8 achieve 98-99% accuracy on cell detection. Each Braille cell is a binary structure — six positions, each either raised or flat. That’s 2^6 = 64 possible patterns. Compared to handwriting recognition (infinite variation in how people write the letter “a”), detecting Braille dots is straightforward.
Stage 2: Interpretation — Take the sequence of detected cell patterns and figure out what English text they represent. For Grade 1, this is a simple lookup table. For Grade 2, this is an unsolved research problem.
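To make the asymmetry concrete: a Grade 1 reader really is just a dictionary. A toy sketch (only a few letters shown; the real table also covers digits, punctuation, and indicators):

```python
# A Grade 1 reader in miniature -- pure cell-by-cell lookup, no context.
GRADE1 = {
    "⠁": "a", "⠃": "b", "⠉": "c", "⠙": "d", "⠑": "e",
    "⠋": "f", "⠛": "g", "⠓": "h", "⠊": "i", "⠚": "j",
}

def read_grade1(cells):
    # Cells it doesn't know -- e.g. Grade 2 contractions -- fall through as "?"
    return "".join(GRADE1.get(c, "?") for c in cells)

print(read_grade1("⠃⠁⠙"))  # -> bad
print(read_grade1("⠮"))     # the Grade 2 "the" cell -> ?
```

No state, no surrounding text — which is exactly why this approach collapses on Grade 2, where ⠮ means "the" and ⠃ might mean "b", "but", or part of "bl" depending on context.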

This separation matters for a practical reason: don’t re-solve what’s already solved. I didn’t need to build a better dot detector. I needed to build a better interpreter — one that understands Grade 2 contractions in context.
The architecture decision was to use existing YOLOv8 models for Stage 1, and focus all research effort on Stage 2: building a seq2seq model that takes a sequence of Unicode Braille characters and outputs English text.
No Data, No Models, No Baselines
Here’s what makes this a genuine research problem rather than an engineering exercise: there is no training data.
Every existing Braille dataset is Grade 1 only. Nobody has published a parallel corpus of Grade 2 Braille paired with English text. No model has been trained on Grade 2 interpretation. There are no benchmarks, no baselines, no published results to compare against.
The closest thing to a “baseline” is using liblouis in reverse — its back-translation mode. But liblouis was designed for forward translation (English to Braille), and its reverse mode produces garbled output:
```text
Input:    "Way down south where the jungle grows,"
Liblouis: "WAY D\2467/N S\12567/th where the JUNGLE GR\2467/S,"
```
Those backslash-number sequences are escape codes for contractions that liblouis can’t resolve in reverse. On a test set, liblouis back-translation achieves about 10% exact match with a character error rate of 0.26. Not even close to usable.
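(Character error rate here is edit distance normalized by reference length. For anyone who wants to reproduce the numbers, a minimal implementation:)

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits needed, normalized by reference length."""
    return levenshtein(reference, hypothesis) / len(reference)

print(cer("jungle", "jangle"))  # one substitution over six characters, ~0.167
```

A CER of 0.26 means roughly one in four characters is wrong — far past the point of readability.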
Creating Training Data From Nothing
Since no parallel corpus of Grade 2 Braille and English exists, I needed to create one from scratch.
The key insight: while no Braille-to-English dataset exists, the reverse direction is well-supported. Liblouis — the open-source Braille translation library used by every major screen reader — can translate English text into Grade 2 Braille with near-perfect accuracy.
So I flipped the pipeline. Take English text from Project Gutenberg books, run it through Liblouis to generate Grade 2 Braille, and use the pairs as training data — with the Braille as input and the English as the target.
```python
# Simplified version of the data generation, using the Python bindings
# for liblouis (the `louis` module)
import louis

english = "The children will go shopping"
braille = louis.translateString(["en-ueb-g2.ctb"], english)
# braille → "⠠⠮⠀⠡⠝⠀⠺⠀⠛⠀⠩⠕⠏⠏⠬"
# Training pair: (braille, english)
```
Five Gutenberg books produced 25,138 sentence pairs. I also hand-collected 42 real-world samples from a children’s Braille book (Jellybean Jungle) — manually transcribing the Braille and its English equivalent. These 42 samples were held out entirely from training. They’d be the real test: can a model trained on synthetic Liblouis data generalize to actual human-written Braille?
v1: Custom Tokens Meet Random Embeddings (2.9%)
My first attempt used T5-small — Google’s 60-million parameter seq2seq model. Solid choice for translation tasks. The question was how to represent Braille input.
Braille has 64 possible cell patterns (2^6 dots). I created 64 custom tokens — c0 through c63 — and added them to T5’s vocabulary. Each token maps to one Braille cell pattern. Clean, explicit encoding.
The problem? Those 64 new tokens got randomly initialized embeddings. T5’s existing 32,000 tokens have embeddings refined through massive pre-training. My 64 Braille tokens were random noise dropped into a carefully tuned embedding space.
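For reference, the vocabulary surgery in v1 looked roughly like this (a sketch using the transformers API; the training loop and data pipeline are omitted):

```python
# Sketch of the v1 setup -- add 64 custom Braille tokens to T5's vocabulary.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One custom token per Braille cell pattern: c0 ... c63
tokenizer.add_tokens([f"c{i}" for i in range(64)])
model.resize_token_embeddings(len(tokenizer))
# resize_token_embeddings() initializes the 64 new embedding rows randomly --
# they carry none of the structure of the pre-trained rows around them.
```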
The result: the model memorized fragments of the training corpus. It would output the same few phrases regardless of input. Every Braille sequence produced something like “the project gutenberg ebook of” — a phrase it had seen repeatedly in training headers.
2.9% accuracy. And that 2.9% was just lucky overlaps, not real understanding.
Lesson: Randomly initialized embeddings in a pre-trained model don’t learn well during a small fine-tune. The gradient signal from 25K examples isn’t enough to drag 64 tokens from random noise into meaningful positions in embedding space.
v2: The Tokenizer Was Blind (0%)
OK, so custom tokens don’t work. What if I skip them entirely and feed Unicode Braille characters directly into T5? Braille characters live at Unicode code points U+2800 through U+283F. They’re real characters. T5 should be able to handle them.
I reformatted the training data to use raw Unicode Braille and kicked off training. The loss curves looked plausible. Surely this would work better.
0% accuracy. Worse than random.
I dug into the predictions and found the model repeating a single phrase for every input — same symptom as v1 but even more extreme. Then I checked what T5’s tokenizer actually does with Braille characters:
```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
braille_text = "⠠⠮⠀⠡⠝"
tokens = tokenizer.tokenize(braille_text)
# tokens = ['▁', '<unk>', '<unk>', '<unk>', '<unk>', '<unk>']
```
Every single Braille character mapped to <unk>. T5’s SentencePiece tokenizer was trained on Common Crawl text — it has never seen Braille Unicode characters. They’re all unknown tokens. The model received the same input for every training example: a sequence of identical <unk> tokens. It was literally blind to its input.
Lesson: Always check what your tokenizer actually does with your input before training. A quick tokenizer.tokenize() call would have saved me hours of GPU time. This is especially important when working with non-Latin scripts, specialized Unicode ranges, or any text outside the tokenizer’s training distribution.

v3: ByT5 — When Bytes Beat Tokens
Two failures, same root cause: the model couldn’t properly see Braille input. Custom tokens had random embeddings. Standard tokenization collapsed everything to <unk>. I needed a model that could handle arbitrary Unicode characters natively.
Enter ByT5.
ByT5 is a variant of T5 that operates on raw UTF-8 bytes instead of subword tokens. Its vocabulary is just 259 entries: 256 byte values plus 3 special tokens. Every UTF-8 byte has a pre-trained embedding. There is no tokenizer to collapse unknown characters — because there are no unknown characters. Everything is just bytes.
Each Unicode Braille character encodes as 3 UTF-8 bytes. All three bytes exist in ByT5’s vocabulary with pre-trained embeddings. No random initialization. No <unk>. The model can see every part of the input.
```text
T5 tokenizer: "⠠⠮" → [<unk>, <unk>]                       # blind
ByT5:         "⠠⠮" → [0xE2, 0xA0, 0xA0, 0xE2, 0xA0, 0xAE] # sees everything
```
This was the key insight: for non-standard scripts and symbols, byte-level models sidestep the entire tokenizer problem.
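You can see the whole picture with nothing but Python's built-in UTF-8 encoder:

```python
# Each six-dot Braille character (U+2800-U+283F) encodes to three UTF-8 bytes,
# and every possible byte value already has a pre-trained embedding in ByT5.
for ch in "⠠⠮":
    print(f"U+{ord(ch):04X} -> {[hex(b) for b in ch.encode('utf-8')]}")
# U+2820 -> ['0xe2', '0xa0', '0xa0']
# U+282E -> ['0xe2', '0xa0', '0xae']
```

(ByT5's token IDs are just these byte values offset by its three special tokens, so there is no vocabulary lookup that can fail.)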
The Infrastructure Gauntlet
Switching to ByT5 solved the representation problem. But byte-level processing means sequences are roughly 3x longer than subword sequences — and that creates GPU headaches.
Attempt on T4 (16GB), fp16:
- Out-of-memory at batch size 8. Byte sequences are long.
- Reduced batch to 4, got it running… with a training loss of 10^17 and validation loss of NaN.
The NaN wasn’t a bug in my code. It’s a fundamental limitation of fp16 precision. fp16’s maximum value is ~65,504. ByT5’s attention matrices over long byte sequences produce intermediate values that exceed this during softmax computation. The numbers overflow. Training literally cannot work in fp16 with ByT5.
This is a concrete, practical takeaway: ByT5 requires bf16 or fp32. bf16 has the same exponent range as fp32 (max ~3.4 × 10^38) but with reduced mantissa precision — enough for training, and it avoids the overflow. But bf16 requires an A100 or newer GPU.
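The gap is easy to compute from the formats themselves — fp16 has 5 exponent bits, while bf16 and fp32 both have 8:

```python
def max_finite(exp_bits, frac_bits):
    """Largest finite value representable in an IEEE-754-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias  # top exponent code is reserved for inf/NaN
    return (2 - 2 ** -frac_bits) * 2.0 ** max_exp

print(f"fp16 max: {max_finite(5, 10):.0f}")   # 65504
print(f"bf16 max: {max_finite(8, 7):.3e}")    # 3.390e+38
print(f"fp32 max: {max_finite(8, 23):.3e}")   # 3.403e+38
```

bf16 trades mantissa precision for fp32's exponent range — exactly the trade you want when the failure mode is overflow, not rounding.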
Final setup — A100 80GB with bf16:
| Parameter | Value |
|---|---|
| Model | google/byt5-small (300M params) |
| Hardware | A100 80GB (Colab Pro) |
| Precision | bf16 |
| Batch size | 8 (effective 32 with grad accumulation) |
| Learning rate | 1e-4, cosine decay with 10% warmup |
| Epochs | 10 |
| GPU memory used | ~10GB of 80GB |
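In transformers terms, the table above corresponds to roughly this configuration (a sketch — dataset preparation and the Trainer call are omitted, and the output path is illustrative):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration; names other than the hyperparameters
# in the table (e.g. output_dir) are illustrative.
args = Seq2SeqTrainingArguments(
    output_dir="byt5-grade2-braille",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size 32
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=10,
    bf16=True,                       # not fp16 -- ByT5 overflows in fp16
    predict_with_generate=True,
)
```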
The Moment It Clicked
Training started. Epoch 1: training loss 10.6, val loss 2.25. Epoch 2: training loss 6.4, val loss 1.40. Standard early descent.
Then epoch 3 hit.
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 10.637 | 2.254 |
| 2 | 6.404 | 1.402 |
| 3 | 0.691 | 0.072 |
| 4 | 0.220 | 0.024 |
| 5 | 0.116 | 0.015 |
| … | … | … |
| 10 | 0.049 | 0.008 |

Validation loss went from 1.40 to 0.07 in a single epoch. A 20x drop. The model went from confused to competent between epoch 2 and 3 — it “got” the mapping from Braille byte sequences to English text. The remaining 7 epochs were refinement.
No overfitting either. Validation loss kept decreasing through all 10 epochs, ending at 0.008. On a seq2seq task, that’s remarkably low.
Results: The First Working Grade 2 Model
Time for the real test. Remember those 42 hand-transcribed Jellybean Jungle samples the model had never seen?
| Dataset | Samples | Exact Match | Character Error Rate |
|---|---|---|---|
| Jellybean (real-world) | 42 | 92.9% | 0.004 |
| Synthetic test set | 1,396 | 89.8% | 0.019 |
A character error rate of 0.004 means the average prediction is 99.6% character-accurate on real-world Braille. And this model was trained entirely on synthetic Liblouis-generated data — it had never seen human-transcribed Braille during training.
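Exact match is the stricter of the two metrics — a prediction scores only if it reproduces the reference string byte for byte:

```python
def exact_match(references, predictions):
    """Fraction of predictions identical to their reference string."""
    hits = sum(ref == pred for ref, pred in zip(references, predictions))
    return hits / len(references)

# A near-miss like a dropped space counts as a full miss:
refs  = ["the jungle grows", "my mouth", "if I hadn't"]
preds = ["the jungle grows", "my mouth", "ifI hadn't"]
print(exact_match(refs, preds))  # 2 of 3 exact
```

That strictness is why the 92.9% figure matters: with a CER of 0.004, even most of the misses are near-perfect at the character level.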
The synthetic data generalized. That was the gamble, and it paid off.
Versus the Baseline
How does liblouis back-translation compare?
| | ByT5-small | Liblouis |
|---|---|---|
| Exact match (real-world) | 92.9% | 9.5% |
| Exact match (synthetic) | 89.8% | 10.3% |
| CER (real-world) | 0.004 | 0.236 |
| CER (synthetic) | 0.019 | 0.260 |
Roughly 9x better on exact match. Here’s what the difference looks like in practice:
```text
Input:    "Sherlock Holmes sat moodily at one side of the fireplace"
Liblouis: "Shall\124567/LOCK HOLMES SAT MOODILY AT \5/O SIDE of the FIREPLACE"
ByT5:     "Sherlock Holmes sat moodily at one side of the fireplace"

Input:    "Way down south where the jungle grows,"
Liblouis: "WAY D\2467/N S\12567/th where the JUNGLE GR\2467/S,"
ByT5:     "Way down south where the jungle grows,"
```
Those \124567/ sequences in the liblouis output are escape codes — contractions it can’t resolve in reverse. The ByT5 model handles them cleanly.
And critically: zero hallucinations. Every prediction is input-dependent and grounded in the actual Braille sequence. v1 would hallucinate memorized corpus text regardless of input. v3 produces correct, contextual translations.
Across All Three Versions
| | v1 | v2 | v3 |
|---|---|---|---|
| Model | T5-small (60M) | T5-small (60M) | ByT5-small (300M) |
| Braille encoding | Custom tokens | Unicode direct | Unicode as bytes |
| Tokenizer result | Random embeddings | All → <unk> | Native byte processing |
| Real-world match | 0% | 0% | 92.9% |
| Prediction quality | Hallucinated corpus text | Single repeated phrase | Correct, input-dependent |
What the Errors Reveal
Of the 3 misses on the real-world Jellybean set (after normalizing smart quotes to ASCII), all are minor:

| Expected | Predicted | Issue |
|---|---|---|
| sides-- covering | sides--covering | Missing space after dash |
| hanging on the trees, my mouth | hanging on the tree jes, so, my mouth | Contraction decoding error |
| if I hadn't | ifI hadn't | Missing space |
On the synthetic test set, the dominant error source is output truncation — the model’s max output length was set to 256 bytes, and long sentences simply got cut off. The predicted text is correct up to the truncation point. This is a configuration fix, not a model problem.
The other error patterns:
- Number encoding: The model knows number indicators exist but sometimes maps to wrong digits. Likely cause: too few number examples in training data (mostly prose, few chapter headings).
- Special characters: Ligatures like “æ” and currency symbols like “£” — rare in training data, no reliable mapping learned.
None of these are fundamental limitations. More training data and a higher max length would address most of them.
What I Learned Building This
Tokenizer choice can make or break your model. This is the single biggest takeaway. Two failed attempts, both caused by how the model’s tokenizer handled Braille input. If I’d checked tokenizer.tokenize() on Braille characters before v2, I would have caught the <unk> problem immediately.
ByT5 is the right tool for non-standard scripts. If your input contains characters outside a model’s tokenizer vocabulary — specialized Unicode ranges, mathematical symbols, emoji, rare scripts — ByT5 sidesteps the problem entirely. It’s under-discussed in the ML community for these use cases.
bf16 is not fp16. They are not interchangeable. fp16 has a max value of ~65,504 and will overflow on ByT5’s long attention sequences. bf16 has the same range as fp32. This distinction cost me a failed training run and several hours of debugging.
Synthetic data can generalize to real-world data. The model was trained entirely on Liblouis-generated pairs and achieved 92.9% exact match on human-transcribed Braille it had never seen. The quality of the synthetic generation pipeline matters — Liblouis is a well-maintained, standards-compliant library — but the result validates the approach.
Decouple solved problems from unsolved ones. The two-stage architecture meant I never needed to touch cell detection. Stage 1 (YOLOv8, 98-99% accurate) just works. All effort went into Stage 2 — the part that actually needed solving.
What’s Next
The model is published on Hugging Face — the first publicly available ML model for Grade 2 contracted Braille.
Next steps:
- More training data — expanding from 5 books to 20+ to improve coverage of rare contractions and numbers
- ByT5-base (580M params) — more model capacity for edge cases
- End-to-end pipeline — connecting Stage 1 (YOLOv8 image detection) directly to Stage 2 (this model) for a photo-in, text-out workflow
- Nemeth Code — mathematical Braille, which adds 2D spatial layout parsing on top of everything. That’s a whole other problem.
The goal: a teacher photographs a student’s Braille exam, and gets readable English text back in seconds. We’re not there yet on the full pipeline — but the hardest piece, Grade 2 interpretation, now works.
The model and code are open source. If you work with Braille, accessibility tools, or non-standard OCR — I’d love to hear from you.