GPU Acceleration

RISC Zero supports hardware acceleration for the proving process using NVIDIA CUDA and Apple Metal. GPU acceleration can provide 5-20x speedups for proof generation compared to CPU-only proving.

Overview

The zkVM proving process is computationally intensive, involving:

Polynomial evaluations
Fast Fourier Transforms (FFTs)
Merkle tree construction
Cryptographic hashing

These operations are highly parallelizable and benefit significantly from GPU acceleration.

GPU acceleration affects only the proving process, not guest execution. Guest code runs identically on CPU and GPU-accelerated systems.

CUDA Support (NVIDIA GPUs)

Prerequisites

Supported Hardware

NVIDIA GPU with compute capability 7.0 or higher

Recommended: RTX 3000 series or newer, or Tesla/A100 for production

Minimum 8GB GPU memory (16GB+ recommended for large programs)

CUDA Toolkit

Install CUDA Toolkit 12.0 or later:

Linux: CUDA Toolkit Downloads

Windows: CUDA Toolkit for Windows

Verify Installation

nvcc --version
nvidia-smi

Building with CUDA Support

Using Pre-built Binaries

The RISC Zero toolchain includes CUDA support by default when CUDA is detected:

cargo risczero install

Building from Source

To build the zkVM with CUDA support:

Cargo.toml

[dependencies]
risc0-zkvm = { version = "5.0", features = ["cuda"] }

Then build:

cargo build --release --features cuda

Building with CUDA requires the CUDA Toolkit to be installed on the build machine. The build process will fail if CUDA is not found.

Running with CUDA

CUDA acceleration is enabled automatically when available:

cargo run --release

To force CPU-only execution even with CUDA available:

RISC0_FORCE_CPU=1 cargo run --release

CUDA Environment Variables

Variable	Purpose	Default
`RISC0_FORCE_CPU`	Disable GPU acceleration	`0` (use GPU)
`CUDA_VISIBLE_DEVICES`	Select GPU devices	All devices
`RISC0_CUDA_MAX_THREADS`	Max CUDA threads per block	Auto-detect

Multi-GPU Support

To use specific GPUs:

# Use GPUs 0 and 1
CUDA_VISIBLE_DEVICES=0,1 cargo run --release

# Use only GPU 2
CUDA_VISIBLE_DEVICES=2 cargo run --release

Metal Support (Apple Silicon)

Prerequisites

Supported Hardware

Apple Silicon: M1, M1 Pro, M1 Max, M1 Ultra, M2, M2 Pro, M2 Max, M3 series

macOS 12.0 (Monterey) or later

Xcode Command Line Tools

Install Xcode Command Line Tools

xcode-select --install

Building with Metal Support

Using Pre-built Binaries

Metal support is automatically enabled on macOS:

cargo risczero install

Building from Source

Cargo.toml

[dependencies]
risc0-zkvm = { version = "5.0", features = ["metal"] }

Build:

cargo build --release --features metal

Running with Metal

Metal acceleration is enabled automatically on supported systems:

cargo run --release

Metal provides excellent performance on Apple Silicon, often matching or exceeding CUDA performance on comparable hardware.

Performance Benchmarks

Typical Speedups

Speedups vary based on program complexity and hardware:

Hardware	Speedup vs CPU	Notes
NVIDIA RTX 3090	10-15x	High-end consumer GPU
NVIDIA RTX 4090	15-20x	Latest consumer GPU
NVIDIA A100	12-18x	Data center GPU
Apple M1 Max	8-12x	Integrated GPU
Apple M2 Ultra	12-16x	High-end integrated GPU

Example: Fibonacci Proof

# CPU-only (AMD Ryzen 9 5950X)
time cargo run --release
# Proving time: ~45 seconds

# With NVIDIA RTX 4090
time cargo run --release
# Proving time: ~3 seconds

Speedups are most significant for larger programs with more cycles. Small programs may see less improvement due to GPU initialization overhead.

Dual Mode (Experimental)

RISC Zero supports a “dual” mode that uses both CPU and GPU simultaneously for certain operations:

Cargo.toml

[dependencies]
risc0-zkvm = { version = "5.0", features = ["dual"] }

Dual mode is experimental and currently only supported for recursion circuits. Use with caution in production.

Optimizing GPU Performance

Memory Considerations

GPU memory is often the limiting factor:

Check GPU Memory

nvidia-smi
# Look for "Memory-Usage" line

Reduce Segment Size

For large programs that exceed GPU memory:

use risc0_zkvm::ProverOpts;

let opts = ProverOpts::default()
    .with_segment_limit_po2(20); // Reduce from default 22

Batch Proving

Process multiple segments in batches to fit in GPU memory:

let segments = executor.run()?;
for chunk in segments.chunks(10) {
    // Prove chunks separately
}

CPU/GPU Work Distribution

For optimal performance:

Use GPU for: Large polynomial operations, FFTs, Merkle trees
Use CPU for: Small operations, sequential tasks, I/O
Balance: Avoid CPU waiting on GPU (queue work ahead)

Troubleshooting

CUDA Out of Memory

Problem: CUDA error: out of memory Solutions:

Close other GPU-intensive applications
Reduce segment size with with_segment_limit_po2()
Use a GPU with more memory
Process segments in smaller batches

CUDA Not Detected

Problem: GPU not being used despite CUDA installation Solutions:

Verify CUDA installation: nvcc --version
Check GPU is visible: nvidia-smi
Ensure feature flag is enabled: cargo build --features cuda
Check RISC0_FORCE_CPU is not set

Metal Build Failures

Problem: Build fails with Metal-related errors on macOS Solutions:

Update Xcode: xcode-select --install
Check macOS version: sw_vers (need 12.0+)
Verify Metal availability: system_profiler SPDisplaysDataType | grep Metal

Slower with GPU

Problem: GPU proving is slower than CPU Possible Causes:

Small programs: GPU overhead exceeds benefit
Old GPU: Compute capability too low
Memory transfer: Too much CPU ↔ GPU data transfer
Thermal throttling: GPU overheating

Solutions:

Use CPU for small programs
Upgrade GPU hardware
Batch operations to reduce transfers
Improve GPU cooling

Production Deployment

Cloud Providers with GPU Support

Provider	GPU Options	Notes
AWS	P3, P4, G5 instances	T4, V100, A100 GPUs
Google Cloud	A2 instances	A100 GPUs
Azure	NC, ND series	V100, A100 GPUs
Lambda Labs	GPU instances	Cost-effective option

Docker with CUDA

Example Dockerfile for CUDA support:

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04

# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Install RISC Zero
RUN cargo install cargo-risczero
RUN cargo risczero install

# Build with CUDA support
WORKDIR /app
COPY . .
RUN cargo build --release --features cuda

CMD ["cargo", "run", "--release"]

Run with:

docker build -t risc0-cuda .
docker run --gpus all risc0-cuda

Monitoring GPU Usage

Monitor GPU utilization in production:

# Real-time monitoring
watch -n 1 nvidia-smi

# Log GPU metrics
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used --format=csv -l 5 >> gpu_metrics.log

Cost-Benefit Analysis

When to Use GPU Acceleration

✅ Use GPU when:

Proving large programs (>1M cycles)
High throughput requirements
Cost of GPU < cost of CPU time
Latency is critical

❌ Skip GPU when:

Programs are very small (under 100K cycles)
Occasional proving (GPU setup overhead)
GPU not available or too expensive
Development/testing only

Cost Example (AWS)

Instance Type	GPU	Price/hour	Proofs/hour	Cost/proof
c6i.8xlarge	None	$1.36	~80	$0.017
g5.xlarge	T4	$1.006	~500	$0.002
p4d.24xlarge	A100	$32.77	~4000	$0.008

For high-volume production workloads, GPU acceleration typically provides 5-10x cost savings despite higher per-hour instance costs.

Next Steps

Learn about Performance Optimization for guest code
Explore Recursive Proving for proof aggregation
Use Profiling to identify bottlenecks
Check out Precompiles for cryptographic acceleration

Documentation Index

​GPU Acceleration

​Overview

​CUDA Support (NVIDIA GPUs)

​Prerequisites

​Building with CUDA Support

​Using Pre-built Binaries

​Building from Source

​Running with CUDA

​CUDA Environment Variables

​Multi-GPU Support

​Metal Support (Apple Silicon)

​Prerequisites

​Building with Metal Support

​Using Pre-built Binaries

​Building from Source

​Running with Metal

​Performance Benchmarks

​Typical Speedups

​Example: Fibonacci Proof

​Dual Mode (Experimental)

​Optimizing GPU Performance

​Memory Considerations

​CPU/GPU Work Distribution

​Troubleshooting

​CUDA Out of Memory

​CUDA Not Detected

​Metal Build Failures

​Slower with GPU

​Production Deployment

​Cloud Providers with GPU Support

​Docker with CUDA

​Monitoring GPU Usage

​Cost-Benefit Analysis

​When to Use GPU Acceleration

​Cost Example (AWS)

​Next Steps

GPU Acceleration

Overview

CUDA Support (NVIDIA GPUs)

Prerequisites

Building with CUDA Support

Using Pre-built Binaries

Building from Source

Running with CUDA

CUDA Environment Variables

Multi-GPU Support

Metal Support (Apple Silicon)

Prerequisites

Building with Metal Support

Using Pre-built Binaries

Building from Source

Running with Metal

Performance Benchmarks

Typical Speedups

Example: Fibonacci Proof

Dual Mode (Experimental)

Optimizing GPU Performance

Memory Considerations

CPU/GPU Work Distribution

Troubleshooting

CUDA Out of Memory

CUDA Not Detected

Metal Build Failures

Slower with GPU

Production Deployment

Cloud Providers with GPU Support

Docker with CUDA

Monitoring GPU Usage

Cost-Benefit Analysis

When to Use GPU Acceleration

Cost Example (AWS)

Next Steps