The Gift Of Gab

About

The Gift Of Gab is a small language model — about 100 million parameters — that runs entirely inside your browser. Nothing you type leaves your machine.

Choose either the full F32 weights, about 380 MiB, or the smaller Q8 weights, about 100 MiB. F32 is higher quality; Q8 is faster to download and usually runs better on phones. The selected model and tokenizer are stored locally in IndexedDB, so subsequent visits load from your device.

Links

Model: Gab100M on Hugging Face

Training data: Gab100MPretrain

Fine-tuning data: Gab100MFinetune

Blog/site: gabormakesgames.com

Notes

This is a research project running a small local model. Expect quirks. Refresh to start over; there's no chat history kept between sessions.

Architecture

Decoder-only transformer with pre-normalization, rotary position embeddings, exact GeLU feed-forward blocks, and tied input/output embeddings.

Specs

parameters: 99,711,744

layers: 12

hidden size: 768

MLP size: 3456

attention heads: 12

head dim: 64

context length: 4096 active, rolling window

vocab size: 10,000

RoPE theta: 100,000

norm epsilon: 1e-5

model files: Q8 rowwise or full F32
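The parameter count can be cross-checked against the shapes this page gives for the embeddings, attention, MLP, and norms. The short tally below is only a sanity check; the variable names are illustrative and not taken from the model code.

```python
# Tally parameters from the published specs (every matrix is bias-free).
VOCAB, HIDDEN, MLP, LAYERS = 10_000, 768, 3456, 12

embedding = VOCAB * HIDDEN        # tied: also used as the output projection
attention = 4 * HIDDEN * HIDDEN   # Wq, Wk, Wv, Wo
mlp = 2 * HIDDEN * MLP            # Wup, Wdown
norms = 2 * HIDDEN                # two RMSNorm scale vectors per block
per_block = attention + mlp + norms

total = embedding + LAYERS * per_block + HIDDEN  # plus the final RMSNorm scale
f32_mib = total * 4 / 2**20       # 4 bytes per float32 weight

print(total)               # 99711744
print(round(f32_mib, 1))   # 380.4
```

The total matches the listed 99,711,744 parameters, and the float32 size agrees with the "about 380 MiB" download.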

Forward Pass

For token ids x, look up embeddings h = E[x], where E has shape [10000, 768]. Each of the 12 blocks applies:

h = h + Attention(RMSNorm(h))

h = h + MLP(RMSNorm(h))

After the final block:

logits = RMSNorm(h) @ E.T

There is no separate output head; logits use the tied embedding matrix.
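A minimal sketch of what weight tying means in practice: the same matrix E serves for the input lookup and for the output logits. The helper names and the toy 3-token vocabulary are hypothetical, and the final norm is passed in as a callable so the example stays small.

```python
def embed(token_ids, embedding):
    """h = E[x]: pick one embedding row per token id."""
    return [list(embedding[t]) for t in token_ids]

def tied_logits(h_last, embedding, final_norm):
    """logits = final_norm(h) @ E.T: one dot product per vocabulary row."""
    x = final_norm(h_last)
    return [sum(a * b for a, b in zip(x, row)) for row in embedding]

# Toy vocabulary of 3 tokens with hidden size 2 (the real E is [10000, 768]).
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = embed([2], E)[0]
logits = tied_logits(h, E, final_norm=lambda v: v)  # identity norm for brevity
print(logits)  # [1.0, 1.0, 2.0]
```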

RMSNorm

Each norm has one learned scale vector g of length 768 and no bias:

RMSNorm(x) = g * x / sqrt(mean(x^2) + 1e-5)
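The formula translates directly into a few lines of plain Python. This is a sketch for a single hidden vector, not the WebGPU kernel the app runs; the function name is illustrative.

```python
import math

def rms_norm(x, g, eps=1e-5):
    """g * x / sqrt(mean(x^2) + eps), applied to one hidden vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [gi * xi * inv_rms for gi, xi in zip(g, x)]

x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, g=[1.0] * len(x))
# With g = 1 the output has a root-mean-square of roughly 1.
rms_out = math.sqrt(sum(v * v for v in y) / len(y))
```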

Attention

For each block, normalized hidden states are projected with bias-free matrices:

q = x @ Wq.T, k = x @ Wk.T, v = x @ Wv.T

Each projection matrix has shape [768, 768]. Each of q, k, and v is then reshaped from [seq, 768] to [12 heads, seq, 64]. RoPE is applied to q and k before the causal attention scores are computed per head:

scores = (q @ k.T) / sqrt(64)

attn = softmax(causal_mask(scores)) @ v

Attention(x) = concat(attn_heads) @ Wo.T

Wo is also [768, 768] and bias-free.
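The per-head computation can be sketched in plain Python as follows. This covers one head only, takes q and k as already rotated by RoPE, and leaves out the head concatenation and the Wo projection; the function name is illustrative.

```python
import math

def causal_attention_head(q, k, v):
    """q, k, v: lists of seq vectors of width head_dim, for a single head."""
    head_dim = len(q[0])
    out = []
    for t in range(len(q)):
        # Causal mask: position t only attends to positions 0..t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(head_dim)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[s][d] for s, w in enumerate(weights))
                    for d in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = causal_attention_head(q, k, v)
```

The first position can only see itself, so its output equals v[0]; the second position scores both keys equally and returns their average.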

RoPE

Rotary position embeddings use split-half rotation, matching Llama-style RoPE. For each pair index i = 0, ..., 31 in a 64-wide head:

freq_i = 1 / theta^(2i / 64), with theta = 100000

angle = position * freq_i

Channel i in the first 32 is paired with channel i + 32 in the last 32, and each pair [a, b] is rotated by that pair's angle:

[a, b] -> [a*cos(angle) - b*sin(angle), b*cos(angle) + a*sin(angle)]
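A minimal sketch of the split-half rotation for one head vector at one position, written from the formulas above. The function name is illustrative; the actual runtime applies the same rotation on the GPU.

```python
import math

def rope_split_half(x, position, theta=100_000.0):
    """Rotate one head vector (even length) using split-half pairing."""
    half = len(x) // 2
    out = [0.0] * len(x)
    for i in range(half):
        freq = 1.0 / theta ** (2 * i / len(x))
        angle = position * freq
        a, b = x[i], x[i + half]  # channel i pairs with channel i + half
        out[i] = a * math.cos(angle) - b * math.sin(angle)
        out[i + half] = b * math.cos(angle) + a * math.sin(angle)
    return out

x = [float(i % 7) for i in range(64)]
at_zero = rope_split_half(x, position=0)   # angle 0 leaves the vector unchanged
rotated = rope_split_half(x, position=5)
```

Because each pair is rotated without scaling, the vector's length is preserved at every position.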

MLP

The feed-forward path is a two-matrix GeLU MLP with no gate and no biases:

u = x @ Wup.T, where Wup is [3456, 768]

m = GeLU(u)

MLP(x) = m @ Wdown.T, where Wdown is [768, 3456]

GeLU is the exact form:

GeLU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
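The MLP path, sketched in plain Python with each weight matrix stored as a list of rows. The tanh approximation is included only for comparison, since it is what "exact" is being contrasted with; the model itself uses the erf form. Function names and the toy shapes are illustrative.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common approximation, shown for comparison only; not used by this model.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def mlp(x, w_up, w_down):
    u = [sum(a * b for a, b in zip(x, row)) for row in w_up]       # x @ Wup.T
    m = [gelu_exact(ui) for ui in u]
    return [sum(a * b for a, b in zip(m, row)) for row in w_down]  # m @ Wdown.T

# Toy shapes: hidden size 2, MLP size 3 (the real ones are 768 and 3456).
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
y = mlp([1.0, 0.0], w_up, w_down)
```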

Tokenizer

The tokenizer is byte-level BPE with 10,000 ids. Special tokens occupy ids 0-16: <|end|>, <|user|>, <|assistant|>, <think>, </think>, and related reserved tokens. Normal text is split by the GPT-style regex pre-tokenizer, converted through the GPT-2 byte-to-unicode map, then BPE merges are applied.
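For readers unfamiliar with the byte-to-unicode step, here is the well-known GPT-2 construction in plain Python. It maps every byte to a printable character so that BPE merges never have to deal with raw control bytes or spaces. The regex pre-tokenizer and the merge table are not shown.

```python
def bytes_to_unicode():
    """GPT-2 style map from each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own code point.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes are moved to code points 256 and up.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

byte_map = bytes_to_unicode()
encoded = "".join(byte_map[b] for b in "hi there".encode("utf-8"))
print(encoded)  # hiĠthere
```

The space byte becomes "Ġ", which is why that character shows up at the start of many tokens in the raw token stream.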

Runtime

Inference uses a local JavaScript/WebGPU runtime with a float32 KV cache. Weights live in IndexedDB; Q8 keeps weights quantized for smaller/faster mobile runs, while F32 keeps the original fine-tuned precision. Q8 uses symmetric rowwise int8 weights with float32 scales:

weight[row, col] = int8[row, col] * scale[row]
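A minimal sketch of symmetric rowwise int8 quantization and the matching dequantization. Only the dequantization formula comes from this page; choosing scale = max|w| / 127 and rounding to the nearest integer are common conventions assumed here, and the function names are illustrative.

```python
def quantize_rowwise(weights):
    """Assumed scheme: per row, scale = max|w| / 127 and q = round(w / scale)."""
    q_rows, scales = [], []
    for row in weights:
        peak = max(abs(w) for w in row)
        scale = peak / 127.0 if peak > 0 else 1.0  # avoid dividing by zero
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_rowwise(q_rows, scales):
    """weight[row, col] = int8[row, col] * scale[row]"""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

weights = [[0.3, -1.0, 0.7], [0.0, 0.0, 0.0]]
q_rows, scales = quantize_rowwise(weights)
restored = dequantize_rowwise(q_rows, scales)
```

Under this scheme each restored weight is within half a scale step of the original, and a row with one large outlier loses precision on its smaller entries.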

Chat template

Turns are concatenated with no whitespace or separators between special tokens:

<|user|>…<|end|><|assistant|><think>…</think>…<|end|>

Thinking is optional and lives inside the assistant turn. Turn off Chat view to see the actual token stream as the model sees it, sub-word BPE tokens included.
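The template can be written as plain string concatenation. The helper below is hypothetical and only shows the layout; in the real tokenizer each special token maps to a single reserved id rather than being split as text.

```python
def render_turn(role, text, thinking=None):
    """Render one turn; thinking is only meaningful for the assistant."""
    if role == "user":
        return f"<|user|>{text}<|end|>"
    think = f"<think>{thinking}</think>" if thinking is not None else ""
    return f"<|assistant|>{think}{text}<|end|>"

def render_chat(turns):
    """Concatenate turns with no whitespace or separators between them."""
    return "".join(render_turn(*turn) for turn in turns)

chat = render_chat([
    ("user", "Hi"),
    ("assistant", "Hello!", "greet back"),
])
print(chat)
# <|user|>Hi<|end|><|assistant|><think>greet back</think>Hello!<|end|>
```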

Reset

This clears the locally cached model files from IndexedDB. Use it if the download was interrupted, the model files changed, or you want to switch between Q8 and F32.

Your current chat will be lost and the page will reload. After reload, the app will show the download screen again.