UNAVision
Neural Image Codec & Visual Tokenizer
UNAVision is a compact neural vision codec and visual tokenizer. It compresses arbitrary RGB imagery into a dense latent at a fixed 16:1 spatial ratio and reconstructs with 1–4% fidelity loss; unlike typical codecs, the loss shrinks as resolution grows. It batches six 40 MP images on a single RTX 4090, uses under 150K trainable parameters, achieves 100% codebook utilization (zero dead codes), and exposes a dual continuous/discrete bottleneck on the same weights with a gap below 0.10%.
- Client
- Independent · Eval repo public
- Role
- Sole author · Architecture, training, evals
- Duration
- Ongoing
- Team
- Solo
- 16:1 spatial compression ratio
- 97.69% average reconstruction fidelity
- Under 150K trainable parameters
- Batches 6× 40 MP images on a single RTX 4090
- Loss decreases with resolution (inverse of typical codecs)
- Dual continuous/discrete bottleneck
- 100% codebook utilization (zero dead codes)
- UNA Audio prototype also developed
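To make the sub-150K parameter budget concrete, here is a back-of-envelope count for a hypothetical fully convolutional encoder/decoder pair. The channel widths and the reading of 16:1 as a per-axis ratio (four stride-2 stages) are illustrative assumptions, not UNAVision's actual architecture:

```python
def conv2d_params(c_in, c_out, k=3):
    """Weights plus biases of a single 2-D convolution layer."""
    return c_in * c_out * k * k + c_out

# Hypothetical channel schedule: four stride-2 stages give 16x
# downsampling per spatial axis. Widths are illustrative only.
channels = [3, 16, 32, 64, 64]

encoder = sum(conv2d_params(a, b) for a, b in zip(channels, channels[1:]))
decoder = encoder  # a mirrored decoder has the same parameter count

total = encoder + decoder
print(total)  # 121024 -- comfortably under a 150K budget
```

Even with a mirrored decoder, a schedule like this stays well under 150K parameters, which is what makes the claim plausible at face value.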


What it is
A compact neural vision codec and visual tokenizer. It compresses arbitrary RGB imagery into a dense, well-structured latent at a fixed 16:1 spatial ratio and reconstructs with 1–4% fidelity loss on natural imagery.
Fidelity loss shrinks as input resolution grows: 4K–6K photos land in the 1–2% band, and 40 MP cases hold there comfortably. The visual vocabulary is 100% utilized: zero dead codes.
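The discrete half of a dual continuous/discrete bottleneck, and the codebook-utilization metric, can be sketched with plain nearest-neighbor vector quantization. This is a minimal illustration under assumed sizes (512 codes, 16-dim latents), not UNAVision's actual mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not UNAVision's real dims.
K, D = 512, 16                      # codebook entries, latent channel dim
codebook = rng.normal(size=(K, D))

def quantize(z):
    """Discrete path: snap each latent vector to its nearest codebook entry.

    The continuous path uses z unchanged; both paths share the same encoder
    output, which is one way to realize a dual continuous/discrete
    bottleneck on the same weights (a sketch, not the project's method).
    """
    # Pairwise squared distances between latents and codebook entries.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

z = rng.normal(size=(1024, D))       # continuous latents from an encoder
zq, idx = quantize(z)

utilization = np.unique(idx).size / K  # fraction of codebook entries used
gap = np.abs(zq - z).mean()            # continuous-vs-discrete discrepancy
```

With random latents a codebook rarely reaches full coverage; the 100% utilization and <0.10% gap reported above are properties of the trained system, which this sketch only measures, not reproduces.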
Memory envelope
A batch of half a dozen 40 MP images fits in a single forward pass on one RTX 4090: no tiling, no sharding, no gradient-checkpointing acrobatics, no OOM.
This is possible because activation memory is dominated by the 16:1 bottleneck and the network holds under 150K trainable parameters.
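The memory claim can be sanity-checked with simple arithmetic. Assuming fp16 activations, a 16:1 per-axis bottleneck (so 1/256 of the spatial positions), and a hypothetical 16-channel latent, the batch sizes involved are small relative to a 24 GB card:

```python
# Back-of-envelope activation memory for six 40 MP RGB images.
# fp16 activations and a 16-channel latent are assumed for illustration.
batch, pixels, channels, bytes_fp16 = 6, 40e6, 3, 2

input_bytes = batch * pixels * channels * bytes_fp16

# At 16:1 per axis, the latent grid has 1/256 as many positions.
latent_bytes = batch * (pixels / 256) * 16 * bytes_fp16

print(f"input  activations: {input_bytes / 1e9:.2f} GB")   # 1.44 GB
print(f"latent activations: {latent_bytes / 1e9:.3f} GB")  # 0.030 GB
```

The full-resolution input tensors total about 1.4 GB, and the latents are negligible; with a sub-150K-parameter network, the remaining headroom on a 24 GB RTX 4090 goes to intermediate feature maps.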