Architecture
Decoder-only transformer with pre-normalization, rotary position embeddings,
exact GeLU feed-forward blocks, and tied input/output embeddings.
Specs
parameters99,711,744
layers12
hidden size768
MLP size3456
attention heads12
head dim64
context length4096 active, rolling window
vocab size10,000
RoPE theta100,000
norm epsilon1e-5
model filesQ8 rowwise or full F32
Forward Pass
For token ids x, look up embeddings h = E[x], where E has shape [10000, 768]. Each of the 12 blocks applies:
h = h + Attention(RMSNorm(h))
h = h + MLP(RMSNorm(h))
After the final block:
logits = RMSNorm(h) @ E.T
There is no separate output head; logits use the tied embedding matrix.
RMSNorm
Each norm has one learned scale vector g of length 768 and no bias:
RMSNorm(x) = g * x / sqrt(mean(x^2) + 1e-5)
Attention
For each block, normalized hidden states are projected with bias-free matrices:
q = x @ Wq.T, k = x @ Wk.T, v = x @ Wv.T
Each projection has shape [768, 768]. The result is reshaped to [12 heads, seq, 64]. RoPE is applied to q and k before the causal attention scores:
scores = (q @ k.T) / sqrt(64)
attn = softmax(causal_mask(scores)) @ v
Attention(x) = concat(attn_heads) @ Wo.T
Wo is also [768, 768] and bias-free.
RoPE
Rotary position embeddings use split-half rotation, matching Llama-style RoPE. For each pair index i in a 64-wide head:
freq_i = 1 / theta^(2i / 64), with theta = 100000
angle = position * freq_i
The first 32 channels and last 32 channels form rotation pairs:
[a, b] -> [a*cos(angle) - b*sin(angle), b*cos(angle) + a*sin(angle)]
MLP
The feed-forward path is a two-matrix GeLU MLP with no gate and no biases:
u = x @ Wup.T, where Wup is [3456, 768]
m = GeLU(u)
MLP(x) = m @ Wdown.T, where Wdown is [768, 3456]
GeLU is the exact form:
GeLU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
Tokenizer
The tokenizer is byte-level BPE with 10,000 ids. Special tokens occupy ids 0-16:
<|end|>, <|user|>, <|assistant|>,
<think>, </think>, and related reserved tokens.
Normal text is split by the GPT-style regex pre-tokenizer, converted through the GPT-2
byte-to-unicode map, then BPE merges are applied.
Runtime
Inference uses a local JavaScript/WebGPU runtime with a float32 KV cache.
Weights live in IndexedDB; Q8 keeps weights quantized for smaller/faster
mobile runs, while F32 keeps the original fine-tuned precision. Q8 uses symmetric
rowwise int8 weights with float32 scales:
weight[row, col] = int8[row, col] * scale[row]
Chat template
Turns are concatenated with no whitespace or separators between special tokens:
<|user|>…<|end|><|assistant|><think>…</think>…<|end|>
Thinking is optional and lives inside the assistant turn. Turn off Chat view to see the actual token stream as the model sees it, sub-word BPE tokens included.