GPU Acceleration
RISC Zero supports hardware acceleration for the proving process using NVIDIA CUDA and Apple Metal. GPU acceleration can provide 5-20x speedups for proof generation compared to CPU-only proving.
Overview
The zkVM proving process is computationally intensive, involving:
- Polynomial evaluations
- Fast Fourier Transforms (FFTs)
- Merkle tree construction
- Cryptographic hashing
GPU acceleration affects only the proving process, not guest execution. Guest code runs identically on CPU and GPU-accelerated systems.
CUDA Support (NVIDIA GPUs)
Prerequisites
Building with CUDA Support
Using Pre-built Binaries
The RISC Zero toolchain includes CUDA support by default when CUDA is detected:
Building from Source
To build the zkVM with CUDA support:
Cargo.toml
Running with CUDA
CUDA acceleration is enabled automatically when available:
CUDA Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| `RISC0_FORCE_CPU` | Disable GPU acceleration | 0 (use GPU) |
| `CUDA_VISIBLE_DEVICES` | Select GPU devices | All devices |
| `RISC0_CUDA_MAX_THREADS` | Max CUDA threads per block | Auto-detect |
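As a rough illustration of how the first two variables in the table combine, the sketch below interprets their raw values in plain Rust. The `Backend` type and both functions are hypothetical names introduced here, and the parsing rule (any value other than unset, empty, or `0` forces the CPU) is an assumption based on the table's default column, not a description of how RISC Zero itself reads these variables.

```rust
use std::env;

/// Hypothetical result type for this sketch; not part of any RISC Zero API.
#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    /// `None` means "all devices", matching the table's default.
    Gpu { devices: Option<Vec<u32>> },
}

/// Interpret raw values of RISC0_FORCE_CPU and CUDA_VISIBLE_DEVICES.
/// Assumption: unset, empty, or "0" leaves the GPU enabled.
fn backend_from_values(force_cpu: Option<&str>, visible: Option<&str>) -> Backend {
    let forced = matches!(force_cpu, Some(v) if !v.is_empty() && v != "0");
    if forced {
        return Backend::Cpu;
    }
    // Parse a comma-separated index list such as "0,2"; non-numeric entries are skipped.
    let devices: Option<Vec<u32>> = visible.map(|list| {
        list.split(',')
            .filter_map(|s| s.trim().parse::<u32>().ok())
            .collect()
    });
    Backend::Gpu { devices }
}

/// Same decision, reading the two variables from the process environment.
fn backend_from_env() -> Backend {
    let force = env::var("RISC0_FORCE_CPU").ok();
    let visible = env::var("CUDA_VISIBLE_DEVICES").ok();
    backend_from_values(force.as_deref(), visible.as_deref())
}
```

For example, `backend_from_values(Some("0"), Some("0,2"))` selects the GPU path restricted to devices 0 and 2, while `backend_from_values(Some("1"), None)` selects the CPU.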
Multi-GPU Support
To use specific GPUs:
Metal Support (Apple Silicon)
Prerequisites
Building with Metal Support
Using Pre-built Binaries
Metal support is automatically enabled on macOS:
Building from Source
Cargo.toml
Running with Metal
Metal acceleration is enabled automatically on supported systems:
Performance Benchmarks
Typical Speedups
Speedups vary based on program complexity and hardware:
| Hardware | Speedup vs CPU | Notes |
|---|---|---|
| NVIDIA RTX 3090 | 10-15x | High-end consumer GPU |
| NVIDIA RTX 4090 | 15-20x | Latest consumer GPU |
| NVIDIA A100 | 12-18x | Data center GPU |
| Apple M1 Max | 8-12x | Integrated GPU |
| Apple M2 Ultra | 12-16x | High-end integrated GPU |
Example: Fibonacci Proof
Speedups are most significant for larger programs with more cycles. Small programs may see less improvement due to GPU initialization overhead.
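The effect of initialization overhead can be made concrete with a simple linear model: proving time is a fixed startup cost plus cycles divided by throughput. The sketch below is only that model. The function names are hypothetical, the CPU is assumed to have no startup cost, and the rates in the usage note are illustrative numbers, not measurements.

```rust
/// Estimated proving time in seconds: fixed overhead plus cycles / throughput.
fn proving_time_secs(cycles: u64, overhead_secs: f64, cycles_per_sec: f64) -> f64 {
    overhead_secs + cycles as f64 / cycles_per_sec
}

/// Cycle count at which GPU and CPU proving take equal time under this model.
/// Returns `None` when the GPU is not faster per cycle, so it never catches up.
fn break_even_cycles(gpu_overhead_secs: f64, cpu_rate: f64, gpu_rate: f64) -> Option<f64> {
    if gpu_rate <= cpu_rate {
        return None;
    }
    // Solve c / cpu_rate = overhead + c / gpu_rate for c.
    Some(gpu_overhead_secs * cpu_rate * gpu_rate / (gpu_rate - cpu_rate))
}
```

With an assumed 2-second GPU startup, a CPU at 10,000 cycles/s, and a GPU ten times faster, `break_even_cycles(2.0, 10_000.0, 100_000.0)` lands at roughly 22,000 cycles; below that the CPU finishes first, above it the GPU does.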
Dual Mode (Experimental)
RISC Zero supports a “dual” mode that uses both CPU and GPU simultaneously for certain operations:
Cargo.toml
Optimizing GPU Performance
Memory Considerations
GPU memory is often the limiting factor:
    use risc0_zkvm::ProverOpts;
    let opts = ProverOpts::default()
        .with_segment_limit_po2(20); // Reduce from default 22
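To see what lowering the limit trades away, the arithmetic can be sketched in plain Rust. This assumes a segment limit of `po2` means at most 2^po2 cycles per segment and ignores any per-segment padding or overhead the zkVM adds; both helper names are hypothetical.

```rust
/// Maximum cycles in one segment for a given power-of-two limit.
fn segment_cycles(po2: u32) -> u64 {
    1u64 << po2
}

/// Number of segments needed to cover `total_cycles` (ceiling division).
fn segments_needed(total_cycles: u64, po2: u32) -> u64 {
    let per_segment = segment_cycles(po2);
    (total_cycles + per_segment - 1) / per_segment
}
```

Under these assumptions a 10,000,000-cycle program needs 3 segments at a limit of 22 and 10 segments at a limit of 20: each segment is a quarter of the size, so less work is resident on the GPU at once, at the price of proving more segments.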
CPU/GPU Work Distribution
For optimal performance:
- Use GPU for: Large polynomial operations, FFTs, Merkle trees
- Use CPU for: Small operations, sequential tasks, I/O
- Balance: Avoid CPU waiting on GPU (queue work ahead)
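The "queue work ahead" advice can be illustrated with a bounded channel: a CPU thread prepares items while the consumer (standing in for the GPU) processes earlier ones. This is a generic std-only sketch, not RISC Zero code; `run_pipeline` is a hypothetical name and the two arithmetic steps are placeholders for real preparation and accelerated work.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Two-stage pipeline: a producer thread prepares inputs, the caller consumes them.
/// `queue_depth` bounds how far the producer may run ahead of the consumer.
fn run_pipeline(inputs: Vec<u64>, queue_depth: usize) -> Vec<u64> {
    let (tx, rx) = sync_channel::<u64>(queue_depth);

    let producer = thread::spawn(move || {
        for x in inputs {
            let prepared = x + 1; // placeholder for CPU-side preparation
            if tx.send(prepared).is_err() {
                break; // consumer went away; stop producing
            }
        }
        // `tx` is dropped here, which closes the channel and ends the consumer loop.
    });

    // Placeholder for the accelerated stage; items arrive in submission order.
    let results: Vec<u64> = rx.iter().map(|p| p * 2).collect();

    producer.join().expect("producer thread panicked");
    results
}
```

A depth of 2 or 3 is usually enough to keep the consumer from idling while limiting how much prepared data sits in memory; `run_pipeline(vec![1, 2, 3], 2)` returns `[4, 6, 8]`.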
Troubleshooting
CUDA Out of Memory
Problem: `CUDA error: out of memory`
Solutions:
- Close other GPU-intensive applications
- Reduce segment size with `with_segment_limit_po2()`
- Use a GPU with more memory
- Process segments in smaller batches
CUDA Not Detected
Problem: GPU not being used despite CUDA installation
Solutions:
- Verify CUDA installation: `nvcc --version`
- Check GPU is visible: `nvidia-smi`
- Ensure feature flag is enabled: `cargo build --features cuda`
- Check `RISC0_FORCE_CPU` is not set
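The checks above can also be scripted. The sketch below runs the same `nvcc --version` and `nvidia-smi` commands through `std::process::Command` and reports whether `RISC0_FORCE_CPU` is present in the environment. The function names are hypothetical, and a passing checklist only shows that the tools are callable, not that the prover was built with the `cuda` feature.

```rust
use std::env;
use std::process::{Command, Stdio};

/// True if `program` can be spawned with `args` and exits with status 0.
/// A missing binary counts as a failed check rather than an error.
fn tool_runs(program: &str, args: &[&str]) -> bool {
    Command::new(program)
        .args(args)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Run the checks from the list above and return (label, passed) pairs.
fn cuda_checklist() -> Vec<(&'static str, bool)> {
    vec![
        ("nvcc --version succeeds", tool_runs("nvcc", &["--version"])),
        ("nvidia-smi succeeds", tool_runs("nvidia-smi", &[])),
        (
            "RISC0_FORCE_CPU is not set",
            env::var_os("RISC0_FORCE_CPU").is_none(),
        ),
    ]
}
```

Printing each pair from `cuda_checklist()` gives a quick pass/fail summary to attach to a bug report.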
Metal Build Failures
Problem: Build fails with Metal-related errors on macOS
Solutions:
- Install the Xcode Command Line Tools: `xcode-select --install`
- Check macOS version: `sw_vers` (need 12.0+)
- Verify Metal availability: `system_profiler SPDisplaysDataType | grep Metal`
Slower with GPU
Problem: GPU proving is slower than CPU
Possible Causes:
- Small programs: GPU overhead exceeds benefit
- Old GPU: Compute capability too low
- Memory transfer: Too much CPU ↔ GPU data transfer
- Thermal throttling: GPU overheating
Solutions:
- Use CPU for small programs
- Upgrade GPU hardware
- Batch operations to reduce transfers
- Improve GPU cooling
Production Deployment
Cloud Providers with GPU Support
| Provider | GPU Options | Notes |
|---|---|---|
| AWS | P3, P4, G5 instances | V100, A100, A10G GPUs |
| Google Cloud | A2 instances | A100 GPUs |
| Azure | NC, ND series | V100, A100 GPUs |
| Lambda Labs | GPU instances | Cost-effective option |
Docker with CUDA
Example Dockerfile for CUDA support:
Monitoring GPU Usage
Monitor GPU utilization in production:
Cost-Benefit Analysis
When to Use GPU Acceleration
✅ Use GPU when:
- Proving large programs (>1M cycles)
- High throughput requirements
- Cost of GPU < cost of CPU time
- Latency is critical
❌ Use CPU when:
- Programs are very small (under 100K cycles)
- Proving is occasional (GPU setup overhead dominates)
- GPU not available or too expensive
- Development/testing only
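The cycle-count thresholds above can be encoded as a small decision helper. This sketch uses only the two numbers given in the lists (100K and 1M cycles). Treating the range in between as "either" is an assumption made here, and `Recommendation` and `recommend` are hypothetical names.

```rust
/// Hypothetical recommendation type for this sketch.
#[derive(Debug, PartialEq)]
enum Recommendation {
    Cpu,
    Gpu,
    /// Between the two thresholds: measure both before deciding.
    Either,
}

/// Apply the guidelines: under 100K cycles favors the CPU, over 1M favors the GPU.
fn recommend(cycles: u64, gpu_available: bool) -> Recommendation {
    if !gpu_available || cycles < 100_000 {
        Recommendation::Cpu
    } else if cycles > 1_000_000 {
        Recommendation::Gpu
    } else {
        Recommendation::Either
    }
}
```

Throughput, latency, and cost requirements are not modeled here and may override the cycle-count rule in either direction.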
Cost Example (AWS)
| Instance Type | GPU | Price/hour | Proofs/hour | Cost/proof |
|---|---|---|---|---|
| c6i.8xlarge | None | $1.36 | ~80 | $0.017 |
| g5.xlarge | A10G | $1.006 | ~500 | $0.002 |
| p4d.24xlarge | 8x A100 | $32.77 | ~4000 | $0.008 |
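The last column is the hourly price divided by the hourly proof count. A minimal helper for redoing that calculation with your own prices and measured throughput might look like this (the function name is hypothetical):

```rust
/// Cost of a single proof in dollars, or `None` if no proofs are produced.
fn cost_per_proof(price_per_hour: f64, proofs_per_hour: f64) -> Option<f64> {
    if proofs_per_hour > 0.0 {
        Some(price_per_hour / proofs_per_hour)
    } else {
        None
    }
}
```

Using the table's figures, `cost_per_proof(1.36, 80.0)` gives $0.017 and `cost_per_proof(1.006, 500.0)` gives about $0.002, so in this example the smaller GPU instance is the cheapest per proof even though the largest instance has the highest throughput.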
Next Steps
- Learn about Performance Optimization for guest code
- Explore Recursive Proving for proof aggregation
- Use Profiling to identify bottlenecks
- Check out Precompiles for cryptographic acceleration