GPU Acceleration

RISC Zero supports hardware acceleration for the proving process using NVIDIA CUDA and Apple Metal. GPU acceleration can provide 5-20x speedups for proof generation compared to CPU-only proving.

Overview

The zkVM proving process is computationally intensive, involving:
  • Polynomial evaluations
  • Fast Fourier Transforms (FFTs)
  • Merkle tree construction
  • Cryptographic hashing
These operations are highly parallelizable and benefit significantly from GPU acceleration.
GPU acceleration affects only the proving process, not guest execution. Guest code runs identically on CPU and GPU-accelerated systems.

CUDA Support (NVIDIA GPUs)

Prerequisites

1. Supported Hardware
  • NVIDIA GPU with compute capability 7.0 or higher
  • Recommended: RTX 3000 series or newer, or Tesla/A100 for production
  • Minimum 8GB GPU memory (16GB+ recommended for large programs)
2. CUDA Toolkit
    Install CUDA Toolkit 12.0 or later.
3. Verify Installation
    nvcc --version
    nvidia-smi
    

    Building with CUDA Support

    Using Pre-built Binaries

    The RISC Zero toolchain includes CUDA support by default when CUDA is detected:
    cargo risczero install
    

    Building from Source

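    To build the zkVM with CUDA support, enable the cuda feature in Cargo.toml:
    [dependencies]
    risc0-zkvm = { version = "5.0", features = ["cuda"] }
    
    Then build. The --features cuda flag applies to your own package, so it only works if your crate declares a cuda feature (for example, one that forwards to risc0-zkvm/cuda). If the feature is set directly on the dependency as above, a plain cargo build --release is sufficient:
    cargo build --release --features cuda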
    
    Building with CUDA requires the CUDA Toolkit to be installed on the build machine. The build process will fail if CUDA is not found.

    Running with CUDA

    CUDA acceleration is enabled automatically when available:
    cargo run --release
    
    To force CPU-only execution even with CUDA available:
    RISC0_FORCE_CPU=1 cargo run --release
    

    CUDA Environment Variables

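    | Variable               | Purpose                    | Default     |
    |------------------------|----------------------------|-------------|
    | RISC0_FORCE_CPU        | Disable GPU acceleration   | 0 (use GPU) |
    | CUDA_VISIBLE_DEVICES   | Select GPU devices         | All devices |
    | RISC0_CUDA_MAX_THREADS | Max CUDA threads per block | Auto-detect |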

    Multi-GPU Support

    To use specific GPUs:
    # Use GPUs 0 and 1
    CUDA_VISIBLE_DEVICES=0,1 cargo run --release
    
    # Use only GPU 2
    CUDA_VISIBLE_DEVICES=2 cargo run --release
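
A host program may want to log which devices it has been restricted to before proving starts. The following is a minimal sketch using only the Rust standard library; the helper names `parse_visible_devices` and `visible_devices_from_env` are hypothetical and not part of the RISC Zero API. It handles numeric indices only and returns `None` for anything else, such as GPU UUIDs.

```rust
use std::env;

/// Parse a CUDA_VISIBLE_DEVICES-style value such as "0,1" into device indices.
/// Returns None if any entry is not a plain non-negative integer
/// (for example a GPU UUID, which this sketch does not handle).
fn parse_visible_devices(value: &str) -> Option<Vec<u32>> {
    value
        .split(',')
        .map(|entry| entry.trim().parse::<u32>().ok())
        .collect()
}

/// Read CUDA_VISIBLE_DEVICES from the environment.
/// None means the variable is unset or not a list of numeric indices.
fn visible_devices_from_env() -> Option<Vec<u32>> {
    env::var("CUDA_VISIBLE_DEVICES")
        .ok()
        .and_then(|value| parse_visible_devices(&value))
}
```

Calling `visible_devices_from_env()` at startup and logging the result makes it easier to confirm that a job was pinned to the intended GPUs.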
    

    Metal Support (Apple Silicon)

    Prerequisites

    1. Supported Hardware
      • Apple Silicon: M1, M1 Pro, M1 Max, M1 Ultra, M2, M2 Pro, M2 Max, M3 series
      • macOS 12.0 (Monterey) or later
      • Xcode Command Line Tools
    2. Install Xcode Command Line Tools
    xcode-select --install
    

    Building with Metal Support

    Using Pre-built Binaries

    Metal support is automatically enabled on macOS:
    cargo risczero install
    

    Building from Source

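    Enable the metal feature in Cargo.toml:
    [dependencies]
    risc0-zkvm = { version = "5.0", features = ["metal"] }
    
    Then build. The --features metal flag applies to your own package, so it only works if your crate declares a metal feature (for example, one that forwards to risc0-zkvm/metal). If the feature is set directly on the dependency as above, a plain cargo build --release is sufficient:
    cargo build --release --features metal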
    

    Running with Metal

    Metal acceleration is enabled automatically on supported systems:
    cargo run --release
    
    Metal performs well on Apple Silicon: the speedups listed under Performance Benchmarks for M-series chips are in the same range as those for discrete NVIDIA GPUs.

    Performance Benchmarks

    Typical Speedups

    Speedups vary based on program complexity and hardware:
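    | Hardware        | Speedup vs CPU | Notes                   |
    |-----------------|----------------|-------------------------|
    | NVIDIA RTX 3090 | 10-15x         | High-end consumer GPU   |
    | NVIDIA RTX 4090 | 15-20x         | Latest consumer GPU     |
    | NVIDIA A100     | 12-18x         | Data center GPU         |
    | Apple M1 Max    | 8-12x          | Integrated GPU          |
    | Apple M2 Ultra  | 12-16x         | High-end integrated GPU |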

    Example: Fibonacci Proof

    # CPU-only (AMD Ryzen 9 5950X)
    time cargo run --release
    # Proving time: ~45 seconds
    
    # With NVIDIA RTX 4090
    time cargo run --release
    # Proving time: ~3 seconds
    
    Speedups are most significant for larger programs with more cycles. Small programs may see less improvement due to GPU initialization overhead.
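
To make the overhead effect concrete, here is a minimal sketch of a simple timing model. The model itself is an assumption, not something measured from RISC Zero: GPU time is treated as a fixed initialization overhead plus the CPU proving time divided by the speedup. The function names and the 2-second overhead used in the note below are hypothetical.

```rust
/// Estimated wall-clock time on the GPU under a simple model (an assumption):
/// a fixed initialization overhead plus the CPU proving time divided by the speedup.
fn gpu_total_secs(cpu_secs: f64, speedup: f64, init_overhead_secs: f64) -> f64 {
    init_overhead_secs + cpu_secs / speedup
}

/// CPU proving time above which the GPU is faster overall under that model.
/// Derived from: overhead + cpu / speedup < cpu. Requires speedup > 1.
fn break_even_cpu_secs(speedup: f64, init_overhead_secs: f64) -> f64 {
    init_overhead_secs / (1.0 - 1.0 / speedup)
}
```

With the Fibonacci numbers above (45 seconds on CPU, 15x speedup) and no overhead, `gpu_total_secs(45.0, 15.0, 0.0)` gives the 3-second figure. With a hypothetical 2-second overhead, `break_even_cpu_secs(15.0, 2.0)` is a little over 2 seconds, so under this model any job that proves faster than that on the CPU gains nothing from the GPU.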

    Dual Mode (Experimental)

    RISC Zero supports a “dual” mode that uses both CPU and GPU simultaneously for certain operations. Enable it in Cargo.toml:
    [dependencies]
    risc0-zkvm = { version = "5.0", features = ["dual"] }
    
    Dual mode is experimental and currently only supported for recursion circuits. Use with caution in production.

    Optimizing GPU Performance

    Memory Considerations

    GPU memory is often the limiting factor:
    1. Check GPU Memory
    nvidia-smi
    # Look for "Memory-Usage" line
    
    2. Reduce Segment Size
    For large programs that exceed GPU memory:
    use risc0_zkvm::ProverOpts;
    
    let opts = ProverOpts::default()
        .with_segment_limit_po2(20); // Reduce from default 22
    
    3. Batch Proving
    Process multiple segments in batches to fit in GPU memory:
    let segments = executor.run()?;
    for chunk in segments.chunks(10) {
        // Prove chunks separately
    }
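
The trade-off between segment size and segment count, and the batching pattern, can be sketched with the standard library alone. This assumes the po2 value is the base-2 logarithm of the maximum cycles per segment and ignores any per-segment overhead cycles. Both helper names are hypothetical, and the `prove` closure stands in for real proving work rather than any RISC Zero call.

```rust
/// Number of segments needed for `total_cycles` when each segment holds at most
/// 2^po2 cycles (ceiling division). Ignores per-segment overhead cycles.
fn segments_needed(total_cycles: u64, po2: u32) -> u64 {
    let segment_cycles = 1u64 << po2;
    (total_cycles + segment_cycles - 1) / segment_cycles
}

/// Apply `prove` to consecutive batches of at most `batch_size` items and
/// collect one result per batch. `batch_size` must be non-zero, because
/// slice::chunks panics on a chunk size of 0.
fn prove_in_batches<T, R>(
    items: &[T],
    batch_size: usize,
    mut prove: impl FnMut(&[T]) -> R,
) -> Vec<R> {
    items.chunks(batch_size).map(|batch| prove(batch)).collect()
}
```

Under these assumptions, a 3,000,000-cycle program fits in one segment at po2 22 but needs three at po2 20: each segment is smaller and uses less GPU memory, at the cost of more segments to prove.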
    

    CPU/GPU Work Distribution

    For optimal performance:
    1. Use GPU for: Large polynomial operations, FFTs, Merkle trees
    2. Use CPU for: Small operations, sequential tasks, I/O
    3. Balance: Avoid CPU waiting on GPU (queue work ahead)

    Troubleshooting

    CUDA Out of Memory

    Problem: CUDA error: out of memory
    Solutions:
    1. Close other GPU-intensive applications
    2. Reduce segment size with with_segment_limit_po2()
    3. Use a GPU with more memory
    4. Process segments in smaller batches

    CUDA Not Detected

    Problem: GPU not being used despite CUDA installation
    Solutions:
    1. Verify CUDA installation: nvcc --version
    2. Check GPU is visible: nvidia-smi
    3. Ensure feature flag is enabled: cargo build --features cuda
    4. Check RISC0_FORCE_CPU is not set

    Metal Build Failures

    Problem: Build fails with Metal-related errors on macOS
    Solutions:
    1. Install or update the Xcode Command Line Tools: xcode-select --install
    2. Check macOS version: sw_vers (need 12.0+)
    3. Verify Metal availability: system_profiler SPDisplaysDataType | grep Metal

    Slower with GPU

    Problem: GPU proving is slower than CPU
    Possible Causes:
    1. Small programs: GPU overhead exceeds benefit
    2. Old GPU: Compute capability too low
    3. Memory transfer: Too much CPU ↔ GPU data transfer
    4. Thermal throttling: GPU overheating
    Solutions:
    • Use CPU for small programs
    • Upgrade GPU hardware
    • Batch operations to reduce transfers
    • Improve GPU cooling

    Production Deployment

    Cloud Providers with GPU Support

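    | Provider     | GPU Options          | Notes                 |
    |--------------|----------------------|-----------------------|
    | AWS          | P3, P4, G5 instances | V100, A100, A10G GPUs |
    | Google Cloud | A2 instances         | A100 GPUs             |
    | Azure        | NC, ND series        | V100, A100 GPUs       |
    | Lambda Labs  | GPU instances        | Cost-effective option |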

    Docker with CUDA

    Example Dockerfile for CUDA support:
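    FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
    
    # Install curl and a C toolchain (needed to fetch rustup and to link Rust binaries)
    RUN apt-get update && apt-get install -y --no-install-recommends \
        curl ca-certificates build-essential \
        && rm -rf /var/lib/apt/lists/*
    
    # Install Rust
    RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
    ENV PATH="/root/.cargo/bin:${PATH}"
    
    # Install RISC Zero
    RUN cargo install cargo-risczero
    RUN cargo risczero install
    
    # Build with CUDA support
    WORKDIR /app
    COPY . .
    RUN cargo build --release --features cuda
    
    # Use the same feature set as the build step so cargo does not rebuild without CUDA
    CMD ["cargo", "run", "--release", "--features", "cuda"]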
    
    Run with:
    docker build -t risc0-cuda .
    docker run --gpus all risc0-cuda
    

    Monitoring GPU Usage

    Monitor GPU utilization in production:
    # Real-time monitoring
    watch -n 1 nvidia-smi
    
    # Log GPU metrics
    nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used --format=csv -l 5 >> gpu_metrics.log
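
The logged metrics can then be post-processed. The following is a minimal sketch of a parser for one data row of that log, assuming the usual nvidia-smi CSV layout in which values carry their units (for example `87 %` and `10240 MiB`). The `GpuSample` type and `parse_sample` function are hypothetical names; rows that do not match, such as the CSV header, yield `None`.

```rust
/// One row of the metrics log. Field names are illustrative.
#[derive(Debug, PartialEq)]
struct GpuSample {
    timestamp: String,
    gpu_util_pct: u32,
    mem_util_pct: u32,
    mem_used_mib: u64,
}

/// Parse a row such as "2024/01/15 10:30:00.000, 87 %, 42 %, 10240 MiB".
/// Returns None for the header row or any row that does not match this layout.
fn parse_sample(line: &str) -> Option<GpuSample> {
    let mut fields = line.split(',').map(str::trim);
    let timestamp = fields.next()?.to_string();
    let gpu_util_pct: u32 = fields.next()?.strip_suffix('%')?.trim().parse().ok()?;
    let mem_util_pct: u32 = fields.next()?.strip_suffix('%')?.trim().parse().ok()?;
    let mem_used_mib: u64 = fields.next()?.strip_suffix("MiB")?.trim().parse().ok()?;
    Some(GpuSample {
        timestamp,
        gpu_util_pct,
        mem_util_pct,
        mem_used_mib,
    })
}
```

Applying `parse_sample` to each line of gpu_metrics.log and keeping the `Some` values gives a series that can be checked for sustained low utilization or memory pressure.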
    

    Cost-Benefit Analysis

    When to Use GPU Acceleration

    Use GPU when:
    • Proving large programs (>1M cycles)
    • High throughput requirements
    • Cost of GPU < cost of CPU time
    • Latency is critical
    Skip GPU when:
    • Programs are very small (under 100K cycles)
    • Occasional proving (GPU setup overhead)
    • GPU not available or too expensive
    • Development/testing only
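
The cycle-count thresholds above can be encoded as a simple rule of thumb. This sketch covers only the cycle-count criterion, not throughput, latency, or cost; the `Recommendation` type and `recommend` function are hypothetical, and the middle range is left as "benchmark both" because the guidance above does not cover it.

```rust
/// Outcome of the cycle-count rule of thumb.
#[derive(Debug, PartialEq)]
enum Recommendation {
    Gpu,
    Cpu,
    /// Between the two thresholds: measure both before deciding.
    Benchmark,
}

/// Apply the thresholds from the lists above:
/// more than 1M cycles favors the GPU, under 100K cycles favors the CPU.
fn recommend(cycles: u64) -> Recommendation {
    const LARGE_PROGRAM_CYCLES: u64 = 1_000_000;
    const SMALL_PROGRAM_CYCLES: u64 = 100_000;
    if cycles > LARGE_PROGRAM_CYCLES {
        Recommendation::Gpu
    } else if cycles < SMALL_PROGRAM_CYCLES {
        Recommendation::Cpu
    } else {
        Recommendation::Benchmark
    }
}
```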

    Cost Example (AWS)

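    | Instance Type | GPU     | Price/hour | Proofs/hour | Cost/proof |
    |---------------|---------|------------|-------------|------------|
    | c6i.8xlarge   | None    | $1.36      | ~80         | $0.017     |
    | g5.xlarge     | A10G    | $1.006     | ~500        | $0.002     |
    | p4d.24xlarge  | 8x A100 | $32.77     | ~4000       | $0.008     |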
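    For high-volume production workloads, GPU acceleration can cut the cost per proof substantially despite higher per-hour instance costs: in the table above, by about 8x on g5.xlarge and about 2x on p4d.24xlarge relative to the CPU-only instance.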
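
The Cost/proof column in the table above is the hourly price divided by the hourly proof throughput. A minimal sketch of that arithmetic, with hypothetical helper names, so the comparison can be redone with current prices and measured throughput:

```rust
/// Cost of one proof: hourly instance price divided by proofs completed per hour.
fn cost_per_proof(price_per_hour: f64, proofs_per_hour: f64) -> f64 {
    price_per_hour / proofs_per_hour
}

/// How many times cheaper per proof the candidate instance is than the baseline.
fn savings_ratio(baseline_cost: f64, candidate_cost: f64) -> f64 {
    baseline_cost / candidate_cost
}
```

For example, `cost_per_proof(1.36, 80.0)` reproduces the $0.017 figure for c6i.8xlarge, and `savings_ratio` applied to the c6i.8xlarge and g5.xlarge costs gives the per-proof saving for that pair.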
