## Benchmarks
Linux, NVMe SSD; median of 5 runs after 2 warmup runs; cold reads forced with `posix_fadvise(POSIX_FADV_DONTNEED)`. Both zero-copy and copy modes are shown where applicable.
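For reference, the cache eviction can be done from Python with `os.posix_fadvise`; below is a minimal sketch of the technique (the `drop_page_cache` helper is illustrative, not part of the benchmark scripts).

```python
import os

def drop_page_cache(path: str) -> None:
    """Evict a file's pages so the next read is cold (Linux; advisory only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush pending writeback; dirty pages are not evicted
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```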
### Cross-format reading
zTensor reads .safetensors, .pt, .gguf, .npz, .onnx, .h5, and .zt through a single mmap-backed API. The results below measure throughput when loading a Llama 3.2 1B-shaped model (~2.8 GB) from each format, compared against each format's native library.
| Source format | zTensor | zTensor (zero-copy off) | Reference impl. |
|---|---|---|---|
| .zt | 2.19 GB/s | 1.37 GB/s | n/a |
| .safetensors | 2.19 GB/s | 1.46 GB/s | 1.33 GB/s / 1.35 GB/s† (safetensors) |
| .pt | 2.04 GB/s | 1.33 GB/s | 0.89 GB/s (torch) |
| .npz | 2.11 GB/s | 1.41 GB/s | 1.04 GB/s (numpy) |
| .gguf | 2.11 GB/s | 1.38 GB/s | 1.39 GB/s / 2.15 GB/s† (gguf) |
| .onnx | 2.07 GB/s | 1.29 GB/s | 0.76 GB/s (onnx) |
| .h5 | 1.96 GB/s | 1.30 GB/s | 1.35 GB/s (h5py) |
ONNX measured at 1 GB because of protobuf's 2 GB message limit. †Native zero-copy where available (GGUF mmap views, SafeTensors `safe_open`).
**Zero-copy vs. copy.** By default (`copy=False`), zTensor returns mmap-backed arrays with no memory copy. Setting `copy=True` reads into owned arrays. Some reference implementations also support zero-copy (GGUF mmap, SafeTensors `safe_open`); their numbers are marked with a dagger (†). Formats with serialization overhead (pickle for .pt, zip for .npz, protobuf for .onnx) are slower in both modes. For formats that also use mmap internally, copy-mode throughput converges because both implementations perform the same mmap-then-copy sequence.
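A usage sketch of the two modes, with the caveat that `Reader` and `read_tensor` are assumed names for illustration; only the `copy=` flag is specified above, so check the zTensor docs for the actual entry points.

```python
import ztensor  # hypothetical usage sketch; API names below are assumptions

reader = ztensor.Reader("model.safetensors")  # same call shape for .pt, .gguf, ...

# Default copy=False: an mmap-backed view, no memcpy.
view = reader.read_tensor("lm_head.weight")

# copy=True: pay one memcpy for an owned array that is safe to mutate
# and outlives the reader.
owned = reader.read_tensor("lm_head.weight", copy=True)
```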
**Safety.** For .pt files, zTensor uses a restricted pickle VM in Rust that only recognizes tensor-reconstruction opcodes and extracts metadata without executing arbitrary code, unlike `torch.load()`, which invokes `pickle.load()`.
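The Rust VM itself isn't shown here, but the core idea, resolving only an allow-list of globals so the pickle REDUCE opcode can never call into arbitrary code, can be sketched in pure Python. This is a simplified analogue, not zTensor's implementation, and the allow-list entries are illustrative:

```python
import pickle

# Simplified analogue of a restricted unpickler. A real .pt loader would
# also need to override persistent_load() to resolve torch storages.
class AllowlistUnpickler(pickle.Unpickler):
    ALLOWED = {
        ("torch._utils", "_rebuild_tensor_v2"),  # tensor reconstruction
        ("collections", "OrderedDict"),          # state_dict container
    }

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        # os.system, builtins.eval, etc. are rejected before REDUCE runs them.
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
```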
### Format comparison
The benchmarks below compare .zt against other formats, with each format read and written by its own reference implementation.
**Read throughput.** Three workloads, 512 MB each: Large (a few big matrices), Mixed (realistic model shapes), Small (many ~10 KB parameters).
| Format | Large | Mixed | Small |
|---|---|---|---|
| ztensor | 2.08 GB/s | 2.02 GB/s | 1.76 GB/s |
| ztensor (zero-copy off) | 1.25 GB/s | 1.31 GB/s | 1.46 GB/s |
| safetensors | 1.23 GB/s | 1.32 GB/s | 1.35 GB/s |
| pickle | 1.25 GB/s | 1.36 GB/s | 1.40 GB/s |
| npz | 1.05 GB/s | 1.06 GB/s | 0.22 GB/s |
| gguf | 2.32 GB/s | 2.31 GB/s | 0.21 GB/s |
| gguf (zero-copy off) | 1.40 GB/s | 1.40 GB/s | 0.20 GB/s |
| onnx | 0.73 GB/s | 0.75 GB/s | 0.65 GB/s |
| hdf5 | 1.28 GB/s | 1.33 GB/s | 0.16 GB/s |
With copy enabled, all mmap-based formats converge to similar throughput since the bottleneck is the memory copy itself. In zero-copy mode, ztensor maintains ~2 GB/s across all workloads. GGUF's native mmap is fast on large tensors (2.32 GB/s) but has high per-tensor overhead on small tensors (0.21 GB/s); ztensor avoids this overhead, sustaining 1.76 GB/s even with many small parameters.
**Write throughput.** For large and mixed workloads, ztensor, GGUF, pickle, and HDF5 all write at near-memcpy speed (3.6-3.9 GB/s). SafeTensors is notably slower (~1.7 GB/s). With many small tensors, per-tensor overhead reduces throughput across all formats.
| Format | Large | Mixed | Small |
|---|---|---|---|
| ztensor | 3.62 GB/s | 3.65 GB/s | 1.42 GB/s |
| safetensors | 1.72 GB/s | 1.77 GB/s | 1.48 GB/s |
| pickle | 3.62 GB/s | 3.68 GB/s | 2.00 GB/s |
| npz | 2.40 GB/s | 2.40 GB/s | 0.51 GB/s |
| gguf | 3.85 GB/s | 3.86 GB/s | 1.06 GB/s |
| onnx | 0.28 GB/s | 0.29 GB/s | 0.32 GB/s |
| hdf5 | 3.67 GB/s | 3.69 GB/s | 0.27 GB/s |
**Compression.** .zt supports optional per-component zstd compression. Effectiveness varies by workload: random float32 weights are nearly incompressible (8% reduction), while structured data compresses dramatically. Pruned weights (73% reduction) and ternary quantization (75%) compress well because their byte patterns are highly redundant; the sketch after the table reproduces the effect.
| Workload | Description | Compressed size (% of original) | Reduction |
|---|---|---|---|
| Dense fp32 | Random float32 weights | 92% | 8% |
| Quantized int8 | 4-bit values in int8 storage | 52% | 48% |
| Pruned 80% | Float32 with 80% zero weights | 27% | 73% |
| Ternary | Weights quantized to {-1, 0, +1} | 25% | 75% |
All results use zstd level 3, the recommended default.
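The effect is easy to reproduce outside zTensor with the `numpy` and `zstandard` packages; this standalone snippet compresses synthetic buffers shaped like the table's workloads (ratios will differ slightly from the table):

```python
import numpy as np
import zstandard

compressor = zstandard.ZstdCompressor(level=3)

def compressed_fraction(arr: np.ndarray) -> float:
    raw = arr.tobytes()
    return len(compressor.compress(raw)) / len(raw)

rng = np.random.default_rng(0)

dense = rng.standard_normal(1_000_000, dtype=np.float32)  # random mantissas

pruned = dense.copy()
pruned[rng.random(pruned.shape) < 0.8] = 0.0              # 80% exact zeros

ternary = rng.integers(-1, 2, 1_000_000).astype(np.int8)  # values in {-1, 0, 1}

print(f"dense fp32: {compressed_fraction(dense):.2f}")   # ~0.9: nearly incompressible
print(f"pruned 80%: {compressed_fraction(pruned):.2f}")  # long zero runs compress well
print(f"ternary:    {compressed_fraction(ternary):.2f}") # entropy floor ~log2(3)/8 ≈ 0.2
```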
**Compression throughput.** Compression trades throughput for disk savings. Among compressed reads, more compressible data is faster because less I/O is needed.
| Workload | Read (raw) | Read (zstd-3) | Write (raw) | Write (zstd-3) |
|---|---|---|---|---|
| Dense fp32 | 1.31 GB/s | 0.45 GB/s | 3.65 GB/s | 0.72 GB/s |
| Quantized int8 | 1.31 GB/s | 0.73 GB/s | 3.65 GB/s | 0.24 GB/s |
| Pruned 80% | 1.31 GB/s | 0.59 GB/s | 3.65 GB/s | 0.39 GB/s |
| Ternary | 1.31 GB/s | 0.90 GB/s | 3.65 GB/s | 0.45 GB/s |
### Reproducing
All benchmarks can be reproduced with the scripts in `benchmark/`:
```bash
pip install ztensor safetensors torch numpy gguf onnx h5py

# Cross-format reading (Llama 3.2 1B shapes)
python benchmark/bench.py run --dist llama-1b --runs 5 --warmup 2

# Format comparison (512 MB, three workloads)
python benchmark/bench.py run --size 512 --dist large --runs 5 --warmup 2
python benchmark/bench.py run --size 512 --dist mixed --runs 5 --warmup 2
python benchmark/bench.py run --size 512 --dist small --runs 5 --warmup 2

# Full sweep (all sizes, distributions, scenarios)
python benchmark/bench.py sweep --runs 5 --warmup 2
```