## Benchmarks
Linux, NVMe SSD; median of 5 runs after 2 warmup runs; cold reads forced with `posix_fadvise(POSIX_FADV_DONTNEED)`. Both zero-copy and copy modes are shown where applicable.
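For reference, the cache eviction can be done from Python with `os.posix_fadvise`; below is a minimal sketch of the technique (the `drop_page_cache` helper is illustrative, not part of the benchmark scripts).

```python
import os

def drop_page_cache(path: str) -> None:
    """Evict a file's pages so the next read is cold (Linux; advisory only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush pending writeback; dirty pages are not evicted
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```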
### Cross-format reading
zTensor reads .safetensors, .pt, .gguf, .npz, .onnx, .h5, and .zt through a single mmap-backed API. The results below measure throughput when loading a Llama 3.2 1B-shaped model (~2.8 GB) from each format, compared against each format's native library.
| Source format | zTensor | zTensor (zero-copy off) | Reference impl. |
|---|---|---|---|
| .zt | 2.19 GB/s | 1.37 GB/s | n/a |
| .safetensors | 2.19 GB/s | 1.46 GB/s | 1.33 GB/s / 1.35 GB/s† (safetensors) |
| .pt | 2.04 GB/s | 1.33 GB/s | 0.89 GB/s (torch) |
| .npz | 2.11 GB/s | 1.41 GB/s | 1.04 GB/s (numpy) |
| .gguf | 2.11 GB/s | 1.38 GB/s | 1.39 GB/s / 2.15 GB/s† (gguf) |
| .onnx | 2.07 GB/s | 1.29 GB/s | 0.76 GB/s (onnx) |
| .h5 | 1.96 GB/s | 1.30 GB/s | 1.35 GB/s (h5py) |
ONNX measured at 1 GB because of protobuf's 2 GB message limit. †Native zero-copy where available (GGUF mmap views, SafeTensors `safe_open`).
**Zero-copy vs. copy.** By default (`copy=False`), zTensor returns mmap-backed arrays with no memory copy. Setting `copy=True` reads into owned arrays. Some reference implementations also support zero-copy (GGUF mmap, SafeTensors `safe_open`); their numbers are marked with a dagger (†). Formats with serialization overhead (pickle for .pt, zip for .npz, protobuf for .onnx) are slower in both modes. For formats that also use mmap internally, copy-mode throughput converges because both implementations perform the same mmap-then-copy sequence.
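A usage sketch of the two modes, with the caveat that `Reader` and `read_tensor` are assumed names for illustration; only the `copy=` flag is specified above, so check the zTensor docs for the actual entry points.

```python
import ztensor  # hypothetical usage sketch; API names below are assumptions

reader = ztensor.Reader("model.safetensors")  # same call shape for .pt, .gguf, ...

# Default copy=False: an mmap-backed view, no memcpy.
view = reader.read_tensor("lm_head.weight")

# copy=True: pay one memcpy for an owned array that is safe to mutate
# and outlives the reader.
owned = reader.read_tensor("lm_head.weight", copy=True)
```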
**Safety.** For .pt files, zTensor uses a restricted pickle VM in Rust that only recognizes tensor-reconstruction opcodes and extracts metadata without executing arbitrary code, unlike `torch.load()`, which invokes `pickle.load()`.
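The Rust VM itself isn't shown here, but the core idea, resolving only an allow-list of globals so the pickle REDUCE opcode can never call into arbitrary code, can be sketched in pure Python. This is a simplified analogue, not zTensor's implementation, and the allow-list entries are illustrative:

```python
import pickle

# Simplified analogue of a restricted unpickler. A real .pt loader would
# also need to override persistent_load() to resolve torch storages.
class AllowlistUnpickler(pickle.Unpickler):
    ALLOWED = {
        ("torch._utils", "_rebuild_tensor_v2"),  # tensor reconstruction
        ("collections", "OrderedDict"),          # state_dict container
    }

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        # os.system, builtins.eval, etc. are rejected before REDUCE runs them.
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
```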
### Format comparison
The benchmarks below compare .zt against other formats, with each format read and written by its own reference implementation.
**Read throughput.** Three workloads, 512 MB each: Large (a few big matrices), Mixed (realistic model shapes), Small (many ~10 KB parameters).
| Format | Large | Mixed | Small |
|---|---|---|---|
| ztensor | 2.08 GB/s | 2.02 GB/s | 1.76 GB/s |
| ztensor (zero-copy off) | 1.25 GB/s | 1.31 GB/s | 1.46 GB/s |
| safetensors | 1.23 GB/s | 1.32 GB/s | 1.35 GB/s |
| pickle | 1.25 GB/s | 1.36 GB/s | 1.40 GB/s |
| npz | 1.05 GB/s | 1.06 GB/s | 0.22 GB/s |
| gguf | 2.32 GB/s | 2.31 GB/s | 0.21 GB/s |
| gguf (zero-copy off) | 1.40 GB/s | 1.40 GB/s | 0.20 GB/s |
| onnx | 0.73 GB/s | 0.75 GB/s | 0.65 GB/s |
| hdf5 | 1.28 GB/s | 1.33 GB/s | 0.16 GB/s |
With copy enabled, all mmap-based formats converge to similar throughput since the bottleneck is the memory copy itself. In zero-copy mode, ztensor maintains ~2 GB/s across all workloads. GGUF's native mmap is fast on large tensors (2.32 GB/s) but has high per-tensor overhead on small tensors (0.21 GB/s); ztensor avoids this overhead, sustaining 1.76 GB/s even with many small parameters.
**Write throughput.** For large and mixed workloads, ztensor, GGUF, pickle, and HDF5 all write at near-memcpy speed (3.6-3.9 GB/s). SafeTensors is notably slower (~1.7 GB/s). With many small tensors, per-tensor overhead reduces throughput across all formats.
| Format | Large | Mixed | Small |
|---|---|---|---|
| ztensor | 3.62 GB/s | 3.65 GB/s | 1.42 GB/s |
| safetensors | 1.72 GB/s | 1.77 GB/s | 1.48 GB/s |
| pickle | 3.62 GB/s | 3.68 GB/s | 2.00 GB/s |
| npz | 2.40 GB/s | 2.40 GB/s | 0.51 GB/s |
| gguf | 3.85 GB/s | 3.86 GB/s | 1.06 GB/s |
| onnx | 0.28 GB/s | 0.29 GB/s | 0.32 GB/s |
| hdf5 | 3.67 GB/s | 3.69 GB/s | 0.27 GB/s |
**Compression.** .zt supports optional per-component zstd compression. Effectiveness varies by workload: random float32 weights are nearly incompressible (8% reduction), while structured data compresses dramatically. Pruned weights (73% reduction) and ternary quantization (75%) compress well because their byte patterns are highly redundant; the sketch after the table reproduces the effect.
| Workload | Description | Compressed size (% of original) | Reduction |
|---|---|---|---|
| Dense fp32 | Random float32 weights | 92% | 8% |
| Quantized int8 | 4-bit values in int8 storage | 52% | 48% |
| Pruned 80% | Float32 with 80% zero weights | 27% | 73% |
| Ternary | Weights quantized to {-1, 0, +1} | 25% | 75% |
All results use zstd level 3, the recommended default.
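The effect is easy to reproduce outside zTensor with the `numpy` and `zstandard` packages; this standalone snippet compresses synthetic buffers shaped like the table's workloads (ratios will differ slightly from the table):

```python
import numpy as np
import zstandard

compressor = zstandard.ZstdCompressor(level=3)

def compressed_fraction(arr: np.ndarray) -> float:
    raw = arr.tobytes()
    return len(compressor.compress(raw)) / len(raw)

rng = np.random.default_rng(0)

dense = rng.standard_normal(1_000_000, dtype=np.float32)  # random mantissas

pruned = dense.copy()
pruned[rng.random(pruned.shape) < 0.8] = 0.0              # 80% exact zeros

ternary = rng.integers(-1, 2, 1_000_000).astype(np.int8)  # values in {-1, 0, 1}

print(f"dense fp32: {compressed_fraction(dense):.2f}")   # ~0.9: nearly incompressible
print(f"pruned 80%: {compressed_fraction(pruned):.2f}")  # long zero runs compress well
print(f"ternary:    {compressed_fraction(ternary):.2f}") # entropy floor ~log2(3)/8 ≈ 0.2
```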
**Compression throughput.** Compression trades throughput for disk savings. Among compressed reads, more compressible data is faster because less I/O is needed.
| Workload | Read (raw) | Read (zstd-3) | Write (raw) | Write (zstd-3) |
|---|---|---|---|---|
| Dense fp32 | 1.31 GB/s | 0.45 GB/s | 3.65 GB/s | 0.72 GB/s |
| Quantized int8 | 1.31 GB/s | 0.73 GB/s | 3.65 GB/s | 0.24 GB/s |
| Pruned 80% | 1.31 GB/s | 0.59 GB/s | 3.65 GB/s | 0.39 GB/s |
| Ternary | 1.31 GB/s | 0.90 GB/s | 3.65 GB/s | 0.45 GB/s |
### Reproducing
All benchmarks can be reproduced with the scripts in `benchmark/`:
```bash
pip install ztensor safetensors torch numpy gguf onnx h5py

# Cross-format reading (Llama 3.2 1B shapes)
python benchmark/bench.py run --dist llama-1b --runs 5 --warmup 2

# Format comparison (512 MB, three workloads)
python benchmark/bench.py run --size 512 --dist large --runs 5 --warmup 2
python benchmark/bench.py run --size 512 --dist mixed --runs 5 --warmup 2
python benchmark/bench.py run --size 512 --dist small --runs 5 --warmup 2

# Full sweep (all sizes, distributions, scenarios)
python benchmark/bench.py sweep --runs 5 --warmup 2
```