UNAVision
Neural Image Codec & Visual Tokenizer
UNAVision is a compact neural vision codec and visual tokenizer. It compresses arbitrary RGB imagery into a dense latent at a fixed 16:1 spatial ratio and reconstructs with 1–4% fidelity loss; unlike typical codecs, the loss shrinks as resolution grows. It batches six 40 MP images on a single RTX 4090, uses under 150K trainable parameters, achieves 100% codebook utilization (zero dead codes), and exposes a dual continuous/discrete bottleneck on the same weights with a gap below 0.10%.
- Client
- Independent · Eval repo public
- Role
- Sole author · Architecture, training, evals
- Duration
- Ongoing
- Team
- Solo
- 16:1 spatial compression ratio
- 97.69% average reconstruction fidelity
- Under 150K trainable parameters
- Batches 6× 40 MP images on a single RTX 4090
- Loss decreases with resolution (inverse of typical codecs)
- Dual continuous/discrete bottleneck
- 100% codebook utilization (zero dead codes)
- UNA Audio prototype also developed
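To make the sub-150K parameter budget concrete, here is a back-of-envelope count for a hypothetical fully convolutional encoder/decoder pair. The channel widths and the reading of 16:1 as a per-axis ratio (four stride-2 stages) are illustrative assumptions, not UNAVision's actual architecture:

```python
def conv2d_params(c_in, c_out, k=3):
    """Weights plus biases of a single 2-D convolution layer."""
    return c_in * c_out * k * k + c_out

# Hypothetical channel schedule: four stride-2 stages give 16x
# downsampling per spatial axis. Widths are illustrative only.
channels = [3, 16, 32, 64, 64]

encoder = sum(conv2d_params(a, b) for a, b in zip(channels, channels[1:]))
decoder = encoder  # a mirrored decoder has the same parameter count

total = encoder + decoder
print(total)  # 121024 -- comfortably under a 150K budget
```

Even with a mirrored decoder, a schedule like this stays well under 150K parameters, which is what makes the claim plausible at face value.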


What it is
A compact neural vision codec and visual tokenizer. It compresses arbitrary RGB imagery into a dense, well-structured latent at a fixed 16:1 spatial ratio and reconstructs with 1–4% fidelity loss on natural imagery.
Fidelity loss shrinks as input resolution grows: 4K–6K photos land in the 1–2% band, and 40 MP cases hold there comfortably. The visual vocabulary is 100% utilized: zero dead codes.
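The discrete half of a dual continuous/discrete bottleneck, and the codebook-utilization metric, can be sketched with plain nearest-neighbor vector quantization. This is a minimal illustration under assumed sizes (512 codes, 16-dim latents), not UNAVision's actual mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not UNAVision's real dims.
K, D = 512, 16                      # codebook entries, latent channel dim
codebook = rng.normal(size=(K, D))

def quantize(z):
    """Discrete path: snap each latent vector to its nearest codebook entry.

    The continuous path uses z unchanged; both paths share the same encoder
    output, which is one way to realize a dual continuous/discrete
    bottleneck on the same weights (a sketch, not the project's method).
    """
    # Pairwise squared distances between latents and codebook entries.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

z = rng.normal(size=(1024, D))       # continuous latents from an encoder
zq, idx = quantize(z)

utilization = np.unique(idx).size / K  # fraction of codebook entries used
gap = np.abs(zq - z).mean()            # continuous-vs-discrete discrepancy
```

With random latents a codebook rarely reaches full coverage; the 100% utilization and <0.10% gap reported above are properties of the trained system, which this sketch only measures, not reproduces.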
Memory envelope
A batch of half a dozen 40 MP images fits in a single forward pass on one RTX 4090: no tiling, no sharding, no gradient-checkpointing acrobatics, no OOM.
This is possible because activation memory is dominated by the 16:1 bottleneck and the network holds under 150K trainable parameters.
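The memory claim can be sanity-checked with simple arithmetic. Assuming fp16 activations, a 16:1 per-axis bottleneck (so 1/256 of the spatial positions), and a hypothetical 16-channel latent, the batch sizes involved are small relative to a 24 GB card:

```python
# Back-of-envelope activation memory for six 40 MP RGB images.
# fp16 activations and a 16-channel latent are assumed for illustration.
batch, pixels, channels, bytes_fp16 = 6, 40e6, 3, 2

input_bytes = batch * pixels * channels * bytes_fp16

# At 16:1 per axis, the latent grid has 1/256 as many positions.
latent_bytes = batch * (pixels / 256) * 16 * bytes_fp16

print(f"input  activations: {input_bytes / 1e9:.2f} GB")   # 1.44 GB
print(f"latent activations: {latent_bytes / 1e9:.3f} GB")  # 0.030 GB
```

The full-resolution input tensors total about 1.4 GB, and the latents are negligible; with a sub-150K-parameter network, the remaining headroom on a 24 GB RTX 4090 goes to intermediate feature maps.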