
Benchmarks

Environment: Linux, NVMe SSD; median of 5 runs after 2 warmup runs; cold reads forced by evicting the page cache with posix_fadvise(POSIX_FADV_DONTNEED). Both zero-copy and copy modes are shown where applicable.
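The cold-read setup can be sketched with Python's stdlib, which exposes posix_fadvise directly (Linux-specific; the helper name and file path here are illustrative, not part of the benchmark scripts):

```python
import os

def drop_page_cache(path: str) -> None:
    """Ask the kernel to evict this file's cached pages so the next read is cold."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush dirty pages first; DONTNEED cannot evict dirty pages
        # len=0 means "from offset to end of file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

This is advisory: the kernel may keep pages that are still in use elsewhere, so benchmarks typically combine it with fresh files per run.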

Cross-format reading

zTensor reads .safetensors, .pt, .gguf, .npz, .onnx, .h5, and .zt through a single mmap-backed API. The results below measure throughput when loading a Llama 3.2 1B-shaped model (~2.8 GB) from each format, compared against each format's native library.

| Source format | zTensor | zTensor (zc off) | Reference impl. |
| --- | --- | --- | --- |
| .zt | 2.19 GB/s | 1.37 GB/s | n/a |
| .safetensors | 2.19 GB/s | 1.46 GB/s | 1.33 GB/s / 1.35 GB/s† (safetensors) |
| .pt | 2.04 GB/s | 1.33 GB/s | 0.89 GB/s (torch) |
| .npz | 2.11 GB/s | 1.41 GB/s | 1.04 GB/s (numpy) |
| .gguf | 2.11 GB/s | 1.38 GB/s | 1.39 GB/s / 2.15 GB/s† (gguf) |
| .onnx | 2.07 GB/s | 1.29 GB/s | 0.76 GB/s (onnx) |
| .h5 | 1.96 GB/s | 1.30 GB/s | 1.35 GB/s (h5py) |

ONNX measured at 1 GB due to protobuf's 2 GB message limit. †Native zero-copy where available (GGUF mmap views, SafeTensors safe_open).

Zero-copy vs. copy. By default (copy=False), zTensor returns mmap-backed arrays with no memory copy. Setting copy=True reads into owned arrays. Some reference implementations also support zero-copy (GGUF mmap, SafeTensors safe_open); their numbers are shown with a dagger (†). Formats with serialization overhead (pickle for .pt, zip for .npz, protobuf for .onnx) are slower in both modes. For formats that also use mmap internally, copy-mode throughput converges because both implementations perform the same mmap-then-copy sequence.
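The two modes can be illustrated without zTensor's own API, using plain stdlib mmap (the file and sizes below are illustrative): a zero-copy read is a view over the file's mapped pages, while copy mode performs one explicit memcpy into an owned buffer.

```python
import mmap
import os
import tempfile

# Write a small stand-in tensor payload to disk.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as out:
    out.write(bytes(range(256)) * 16)  # 4096 bytes

f = open(path, "rb")
mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Zero-copy: slicing a memoryview of the mapping copies nothing;
# pages are faulted in lazily as they are touched.
view = memoryview(mapped)[0:4096]

# Copy mode: materialize an owned bytes object (one memcpy).
owned = bytes(view)
```

In numpy terms, the zero-copy path corresponds to wrapping such a view with np.frombuffer, which is why copy-mode throughput converges across mmap-based implementations: the copy itself becomes the bottleneck.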

Safety. For .pt files, zTensor uses a restricted pickle VM in Rust that only recognizes tensor reconstruction opcodes and extracts metadata without executing arbitrary code, unlike torch.load(), which invokes pickle.load().
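zTensor's restricted VM is implemented in Rust; the core idea, walking pickle opcodes without ever executing the ones that invoke callables, can be sketched in Python with the stdlib pickletools module (the deny-list below is illustrative, not zTensor's actual opcode set):

```python
import io
import pickle
import pickletools

# Opcodes that cause code execution when pickle.load() interprets them.
DANGEROUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "BUILD", "INST", "OBJ", "NEWOBJ"}

def scan_pickle(data: bytes):
    """Yield (opcode_name, arg) pairs without executing anything."""
    for opcode, arg, _pos in pickletools.genops(io.BytesIO(data)):
        yield opcode.name, arg

def is_plain_data(data: bytes) -> bool:
    """True if the stream never invokes a callable (no code execution)."""
    return all(name not in DANGEROUS for name, _ in scan_pickle(data))

# Pure-data metadata scans clean; anything referencing a callable does not.
meta = pickle.dumps({"name": "model.layers.0.weight", "shape": (4096, 4096)})
```

A real .pt reader additionally has to recognize torch's tensor-reconstruction opcodes and map their arguments to storage offsets, but the safety property is the same: opcodes are inspected, never interpreted.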


Format comparison

The benchmarks below compare .zt against other formats, where each format uses its own reference implementation.

Read throughput. Three workloads at 512 MB: Large (few big matrices), Mixed (realistic model shapes), Small (many ~10 KB parameters).
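The three workload shapes can be sketched as follows (scaled down to 1 MB here; the shapes and the helper are illustrative, the benchmark's exact generators live in benchmark/):

```python
import numpy as np

def make_workload(kind: str, total_bytes: int = 1 << 20) -> dict:
    """Build a name -> array dict mimicking the three benchmark distributions."""
    rng = np.random.default_rng(0)
    if kind == "large":        # a few big square float32 matrices
        n = int((total_bytes / 4 / 4) ** 0.5)
        return {f"w{i}": rng.standard_normal((n, n), dtype=np.float32)
                for i in range(4)}
    if kind == "small":        # many ~10 KB parameters
        per = 10 * 1024 // 4   # float32 elements per ~10 KB tensor
        count = total_bytes // (per * 4)
        return {f"p{i}": rng.standard_normal(per, dtype=np.float32)
                for i in range(count)}
    # "mixed": a model-like mix of matrices and vectors (shapes are made up)
    return {
        "embed": rng.standard_normal((2048, 64), dtype=np.float32),
        "attn.q": rng.standard_normal((512, 512), dtype=np.float32),
        "norm.bias": rng.standard_normal(512, dtype=np.float32),
    }
```

The "small" distribution is the interesting one: per-tensor overhead (header parsing, allocation, syscalls) dominates once each payload is only a few pages.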

| Format | Large | Mixed | Small |
| --- | --- | --- | --- |
| ztensor | 2.08 GB/s | 2.02 GB/s | 1.76 GB/s |
| ztensor (zc off) | 1.25 GB/s | 1.31 GB/s | 1.46 GB/s |
| safetensors | 1.23 GB/s | 1.32 GB/s | 1.35 GB/s |
| pickle | 1.25 GB/s | 1.36 GB/s | 1.40 GB/s |
| npz | 1.05 GB/s | 1.06 GB/s | 0.22 GB/s |
| gguf | 2.32 GB/s | 2.31 GB/s | 0.21 GB/s |
| gguf (zc off) | 1.40 GB/s | 1.40 GB/s | 0.20 GB/s |
| onnx | 0.73 GB/s | 0.75 GB/s | 0.65 GB/s |
| hdf5 | 1.28 GB/s | 1.33 GB/s | 0.16 GB/s |

With copy enabled, all mmap-based formats converge to similar throughput since the bottleneck is the memory copy itself. In zero-copy mode, ztensor maintains ~2 GB/s across all workloads. GGUF's native mmap is fast on large tensors (2.32 GB/s) but has high per-tensor overhead on small tensors (0.21 GB/s); ztensor avoids this overhead, sustaining 1.76 GB/s even with many small parameters.

Write throughput. For large and mixed workloads, ztensor, GGUF, pickle, and HDF5 all write at near-memcpy speed (3.6-3.9 GB/s). SafeTensors is notably slower (~1.7 GB/s). With many small tensors, per-tensor overhead reduces throughput across all formats.

| Format | Large | Mixed | Small |
| --- | --- | --- | --- |
| ztensor | 3.62 GB/s | 3.65 GB/s | 1.42 GB/s |
| safetensors | 1.72 GB/s | 1.77 GB/s | 1.48 GB/s |
| pickle | 3.62 GB/s | 3.68 GB/s | 2.00 GB/s |
| npz | 2.40 GB/s | 2.40 GB/s | 0.51 GB/s |
| gguf | 3.85 GB/s | 3.86 GB/s | 1.06 GB/s |
| onnx | 0.28 GB/s | 0.29 GB/s | 0.32 GB/s |
| hdf5 | 3.67 GB/s | 3.69 GB/s | 0.27 GB/s |

Compression. .zt supports optional per-component zstd compression. Effectiveness varies by workload: random float32 weights are nearly incompressible (8% reduction), but structured data compresses dramatically. Pruned weights (73%) and ternary quantization (75%) compress well because their byte patterns are highly redundant.

| Workload | Description | Compressed size | Reduction |
| --- | --- | --- | --- |
| Dense fp32 | Random float32 weights | 92% | 8% |
| Quantized int8 | 4-bit values in int8 storage | 52% | 48% |
| Pruned 80% | Float32 with 80% zero weights | 27% | 73% |
| Ternary | Ternary quantized weights | 25% | 75% |

Zstd level 3, the recommended default.
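The compressibility gap is easy to reproduce. Zstd itself requires a third-party package, so this sketch substitutes stdlib zlib at level 3 (absolute ratios differ from the table, but the ordering holds):

```python
import os
import zlib

def reduction(payload: bytes) -> float:
    """Fraction of bytes saved by compressing at level 3."""
    return 1 - len(zlib.compress(payload, 3)) / len(payload)

n = 1 << 20
# Stand-in for dense random fp32: high-entropy bytes, nearly incompressible.
dense = os.urandom(n)
# Stand-in for pruned weights: ~80% of bytes zeroed, highly redundant.
pruned = bytes(b if i % 5 == 0 else 0 for i, b in enumerate(os.urandom(n)))
```

Running reduction() on the two payloads shows near-zero savings for the dense buffer and large savings for the pruned one, which is the same mechanism behind the 8% vs. 73% figures above.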

Compression throughput. Compression trades throughput for disk savings. More compressible data reads faster because less I/O is needed.

| Workload | Read | Read (zstd-3) | Write | Write (zstd-3) |
| --- | --- | --- | --- | --- |
| Dense fp32 | 1.31 GB/s | 0.45 GB/s | 3.65 GB/s | 0.72 GB/s |
| Quantized int8 | 1.31 GB/s | 0.73 GB/s | 3.65 GB/s | 0.24 GB/s |
| Pruned 80% | 1.31 GB/s | 0.59 GB/s | 3.65 GB/s | 0.39 GB/s |
| Ternary | 1.31 GB/s | 0.90 GB/s | 3.65 GB/s | 0.45 GB/s |

Reproducing

All benchmarks can be reproduced using the scripts in benchmark/:

```shell
pip install ztensor safetensors torch numpy gguf onnx h5py

# Cross-format reading (Llama 3.2 1B shapes)
python benchmark/bench.py run --dist llama-1b --runs 5 --warmup 2

# Format comparison (512 MB, three workloads)
python benchmark/bench.py run --size 512 --dist large --runs 5 --warmup 2
python benchmark/bench.py run --size 512 --dist mixed --runs 5 --warmup 2
python benchmark/bench.py run --size 512 --dist small --runs 5 --warmup 2

# Full sweep (all sizes, distributions, scenarios)
python benchmark/bench.py sweep --runs 5 --warmup 2
```