Local LLM Execution: llama.cpp Internals and Metal/CUDA GPU Offloading

Analyzing CPU/GPU memory bandwidth bottlenecks when running quantized open-weights models.

Written by Shyank

The landscape of artificial intelligence has undergone a fundamental shift. While the early days of LLM deployment were dominated by closed-source API endpoints, developers now routinely run multi-billion parameter models directly on local hardware. The engine behind this local revolution is llama.cpp, a highly optimized C/C++ inference runtime originally created by Georgi Gerganov and now maintained under the ggml-org GitHub repository. The significance of this framework was further highlighted by the landmark partnership between Hugging Face and GGML in February 2026, which consolidated local AI deployment infrastructure.

However, executing massive LLMs on consumer-grade workstations or server clusters is not a simple task. It requires navigating physical hardware bottlenecks, specifically the memory bandwidth limitations of the host system. This guide explores the deep internals of llama.cpp, analyzes the architectural differences between CUDA and Metal GPU acceleration, and provides a mathematical framework for diagnosing and mitigating memory bottlenecks when running quantized open-weights models.


What Is It?

llama.cpp is a lightweight, dependency-free inference engine written in pure C and C++. It is designed to run Large Language Models with high efficiency across heterogeneous computing architectures. Unlike server-native inference engines like vLLM, which are built primarily for data-center GPUs using Python wrappers, llama.cpp focuses on running models with minimal system overhead.

At the core of the llama.cpp ecosystem is the GGUF (GGML Universal File) format. GGUF is a binary container format designed to store both the model's weight tensors and its configuration metadata in a single file. GGUF replaced the older GGML format to solve problems related to backward compatibility, metadata extensibility, and ease of deployment.

+-------------------------------------------------------------+
|                        GGUF File Structure                  |
+-------------------------------------------------------------+
|  Header (Magic Number, Version, Tensor Count, KV Count)     |
+-------------------------------------------------------------+
|  Metadata Key-Value Pairs (Model Name, Tokenizer Config)    |
+-------------------------------------------------------------+
|  Tensor Information (Offsets, Shapes, Quantization Types)   |
+-------------------------------------------------------------+
|  Tensor Binary Data (Quantized Weight Arrays)               |
+-------------------------------------------------------------+

GGUF allows the engine to parse all model parameters and run-time parameters (such as the context window size, token vocabulary, and special token IDs) without needing external JSON or YAML files. Furthermore, it supports memory-mapping (via the mmap system call), meaning that the operating system can load weights directly from disk into virtual memory on-demand, reducing memory consumption and initial load times.
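The header layout in the diagram above can be inspected with a few lines of Python. This is a minimal sketch, assuming a little-endian GGUF file of version 2 or later, where the tensor count and metadata key-value count are stored as 64-bit integers; the function name read_gguf_header is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"

def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor count, metadata KV count."""
    with open(path, "rb") as f:
        raw = f.read(24)  # 4 (magic) + 4 (version) + 8 + 8 (the two counts)
    if len(raw) < 24 or raw[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # Assumption: GGUF v2+ layout, little-endian, both counts are uint64.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Only the first 24 bytes are read, so the check is cheap even for multi-gigabyte model files. The metadata pairs and tensor descriptors that follow the header are variable-length and are not parsed here.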


Why It Matters

Local LLM execution is not just a preference for hobbyists; it is a critical requirement for modern software engineering. The commercial and practical benefits include:

  1. Data Privacy and Regulatory Compliance: Many enterprises are prohibited by law (e.g., GDPR, HIPAA) from transmitting proprietary code, financial logs, or customer conversations to third-party APIs. Local execution keeps the data within the organization's firewall.
  2. Deterministic Cost Control: API costs scale linearly with usage. Running local models shifts the cost structure from variable operational expenditures (OPEX) to fixed capital expenditures (CAPEX), making high-throughput tasks like batch processing or document scanning highly cost-efficient.
  3. Zero Network Latency: Running models locally eliminates the network round-trip time (RTT) associated with cloud endpoints, which can reduce time-to-first-token (TTFT) and make interactive user interfaces more responsive.
  4. Offline Resilience: Applications built on local models can run in disconnected environments, such as marine vessels, secure offline facilities, or during network outages.

How It Works

The secret to running large models on consumer hardware lies in quantization. Quantization is the process of converting a model's weights from high-precision floating-point formats (like FP16 or BF16) to lower-bit representations (like 8-bit, 4-bit, or even 2-bit integers).

When a model is quantized to a 4-bit GGUF representation, it is not simply rounded. It uses block-based quantization, where weights are grouped into blocks (typically 32 or 256 weights per block). Each block carries its own scaling factor and, in some formats, an offset (minimum), stored at higher precision than the weights (typically FP16). The individual weights within the block are stored as low-bit integers that are multiplied by the block scale to recover approximate values.

In llama.cpp, several quantization schemes are implemented, such as:

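  • Q4_K_M: A mixed-precision 4-bit format that stores most tensors as Q4_K and keeps a subset of the more sensitive tensors (part of the attention value and feed-forward down projections) as Q6_K to preserve accuracy.
  • IQ4_XS: A roughly 4.25-bit-per-weight format from the "i-quant" family that maps 4-bit indices through a non-linear lookup table. It is typically produced with an importance matrix so that quantization error is weighted by each weight's impact on model output, minimizing perplexity loss.
  • Q8_0: An 8-bit quantization format that offers negligible accuracy degradation but requires roughly double the memory of 4-bit quants.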

For a deeper dive into the underlying algebra and statistics of quantization, read our article on Quantization Mathematics.

During execution, llama.cpp does not dequantize the entire model back into FP16 in memory. Instead, it reads the quantized integer weights and dequantizes them just-in-time inside the processor registers or CPU cache line during the matrix-vector multiplication. This approach minimizes memory bandwidth usage by keeping the data in its compressed 4-bit form until the moment of computation.
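The block-scale idea and the just-in-time dequantization described above can be sketched in plain Python. This is a simplified illustration, assuming a symmetric 4-bit scheme with one scale per block; it is not the exact bit layout of any ggml format, and the function names are hypothetical.

```python
def quantize_block(weights):
    """Quantize one block of floats to signed 4-bit integers plus a single scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 7.0 if amax > 0 else 1.0
    # Signed 4-bit range is [-8, 7]; clamp after rounding.
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate float weights from the block scale and integers."""
    return [scale * q for q in quants]

def dot_quantized(scale, quants, activations):
    """Dot product on the compressed block: integers are multiplied by the
    activations and the block scale is applied once at the end, so the
    full-precision weights are never materialized in memory."""
    return scale * sum(q * x for q, x in zip(quants, activations))

# Usage: a 32-weight block, as in the smaller block size mentioned above.
block = [0.1 * i - 1.5 for i in range(32)]
scale, quants = quantize_block(block)
approx = dequantize_block(scale, quants)
```

Because the scale is chosen from the largest magnitude in the block, the rounding error of each weight is bounded by half the scale, which is why smaller blocks (with their own scales) preserve more accuracy at the cost of extra metadata.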


Architecture

The execution pipeline of an LLM in llama.cpp is divided into two distinct phases: Prefill (Prompt Processing) and Decode (Token Generation). These phases place opposite demands on computer hardware.

+---------------------------------------------------------------------------------+
|                                LLM Inference Phases                             |
+---------------------------------------------------------------------------------+
|  Phase: PREFILL (Prompt Processing)     |  Phase: DECODE (Token Generation)     |
+---------------------------------------------------------------------------------+
|  Workload: Parallel batch processing    |  Workload: Sequential token generation|
|  Bottleneck: Compute-Bound (TFLOPS)     |  Bottleneck: Memory-Bandwidth-Bound   |
|  Goal: Saturate GPU/CPU cores           |  Goal: Maximize memory transfer speed |
+---------------------------------------------------------------------------------+

1. Prefill (Prompt Processing)

During this phase, the entire user prompt is ingested at once. The engine computes the Key-Value (KV) cache for all input tokens in parallel. This phase is characterized by dense GEMM (General Matrix Multiply) operations. Because the matrix sizes are large, the processor's compute units (such as Tensor Cores on NVIDIA GPUs or the GPU cores on Apple Silicon, which llama.cpp drives through Metal) can be fully saturated. Prefill performance is compute-bound; it is limited by the raw mathematical operations per second (FLOPS) the hardware can execute.

2. Decode (Token Generation)

Once the prompt is processed, the model generates output tokens one by one. For each generated token, the model must execute a series of matrix-vector multiplications. This means that the system must load every weight of the model from memory (VRAM or system RAM) into the processor's registers, use it in a single multiply-add against the current token's activations, and move on to the next weight.

Because the amount of calculation per weight is tiny, the processor spends most of its time waiting for the weights to be fetched from memory. The decode phase is memory-bandwidth bound. Performance is limited not by how fast the GPU can compute, but by how fast the memory bus can stream the model's weights.
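The compute-bound versus bandwidth-bound split can be checked with a small roofline-style calculation. This is a rough sketch, assuming one multiply-add (2 FLOPs) per weight per token, weights read once per batch, and about 4.85 effective bits per weight (an assumption derived from a 42.5 GB file for a 70B model). The hardware figures in the usage lines are the RTX 4090 numbers quoted in this article; real quantized kernels will not reach the FP32 peak.

```python
def arithmetic_intensity(tokens_in_batch, bits_per_weight):
    """FLOPs performed per byte of weights read from memory."""
    bytes_per_weight = bits_per_weight / 8
    return 2 * tokens_in_batch / bytes_per_weight

def ridge_point(peak_tflops, bandwidth_gbs):
    """FLOPs per byte at which a device shifts from bandwidth-bound to compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)

def bound_by(tokens_in_batch, bits_per_weight, peak_tflops, bandwidth_gbs):
    """Classify a workload as 'compute' or 'memory-bandwidth' bound."""
    ai = arithmetic_intensity(tokens_in_batch, bits_per_weight)
    if ai >= ridge_point(peak_tflops, bandwidth_gbs):
        return "compute"
    return "memory-bandwidth"

# Usage with the article's RTX 4090 figures (83 TFLOPS, 1,008 GB/s):
decode = bound_by(1, 4.85, 83, 1008)      # one token per pass over the weights
prefill = bound_by(512, 4.85, 83, 1008)   # a 512-token prompt processed in one batch
```

With a single token per pass, each byte of weights supports only a few FLOPs, far below the device's ridge point, which is the quantitative form of the statement that decode waits on memory.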

The Memory Bottleneck Hierarchy

The following table compares the physical memory bandwidth of various hardware platforms:

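+-------------------------+----------------+------------------+------------------+----------------------+
| Hardware Platform       | Memory Type    | Bus Width (bits) | Memory Bandwidth | Raw Compute Power    |
+-------------------------+----------------+------------------+------------------+----------------------+
| NVIDIA RTX 4090         | GDDR6X         | 384              | 1,008 GB/s       | 83 TFLOPS (FP32)     |
| Apple M3 Ultra (Est.)   | Unified LPDDR5 | 1,024            | 800 GB/s         | 38 TFLOPS (FP32)     |
| Apple M3 Max            | Unified LPDDR5 | 512              | 400 GB/s         | 18 TFLOPS (FP32)     |
| Dual Channel DDR5 (CPU) | DDR5-5600      | 128              | 89.6 GB/s        | Variable (CPU Cores) |
| PCIe Gen 4 x16 Bus      | System Link    | N/A              | 32 GB/s          | N/A                  |
+-------------------------+----------------+------------------+------------------+----------------------+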
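The bandwidth column follows directly from bus width and transfer rate. The sketch below shows the arithmetic; the 5600 MT/s figure comes from the DDR5-5600 row, while the 21,000 MT/s per-pin rate for the RTX 4090's GDDR6X is an assumption not stated in this article.

```python
def peak_bandwidth_gbs(bus_width_bits, mega_transfers_per_sec):
    """Peak bandwidth in GB/s = bus width in bytes * mega-transfers per second / 1000."""
    return (bus_width_bits / 8) * mega_transfers_per_sec / 1000

# Usage:
ddr5_dual_channel = peak_bandwidth_gbs(128, 5600)    # two 64-bit channels of DDR5-5600
rtx_4090 = peak_bandwidth_gbs(384, 21000)            # assumed 21 Gbps per pin
```

These are theoretical peaks; sustained bandwidth measured by a benchmark is usually lower.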

GPU Offloading and the "Bandwidth Cliff"

When running a model that is too large to fit entirely in the VRAM of a discrete GPU, llama.cpp allows you to offload only a subset of the layers to the GPU using the --n-gpu-layers (-ngl) flag. The remaining layers are processed by the CPU.

       +---------------------------------------------+
       |             llama.cpp Split Execution       |
       +---------------------------------------------+
       |                                             |
       |  +-------------------+                      |
       |  |     GPU VRAM      | [Fast Bandwidth]     |
       |  |  Layers 1 to 24   |                      |
       |  +---------+---------+                      |
       |            |                                |
       |   [PCIe Bus Transfer] (Overhead Bottleneck) |
       |            |                                |
       |  +---------v---------+                      |
       |  |    System RAM     | [Slow Bandwidth]     |
       |  |  Layers 25 to 32  |                      |
       |  +-------------------+                      |
       +---------------------------------------------+

While partial offloading is incredibly flexible, it introduces a severe performance bottleneck known as the bandwidth cliff:

  • At each token generation step, the activation tensors must cross the PCIe bus at the boundary between the GPU-resident and CPU-resident layers.
  • Per-token latency is the sum of the time spent in each memory domain, so the slowest link dominates. In a partial offload configuration, the CPU-resident layers are streamed at system RAM bandwidth (89.6 GB/s for dual-channel DDR5), and that time, plus the PCIe transfer overhead, quickly outweighs the time the GPU spends on its own layers at 1,008 GB/s.
  • To minimize attention overheads, llama.cpp implements optimizations like Flash Attention. This helps reduce the memory transfer overhead by computing attention on-chip. Read more about attention bottlenecks in our analysis of Mitigating Attention Bottlenecks.
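The shape of the cliff can be estimated by adding up the time each memory domain needs to stream its share of the weights. This is a back-of-the-envelope sketch, assuming decode is purely bandwidth-bound and ignoring PCIe transfers, KV cache reads, and kernel overheads; the function name is hypothetical.

```python
def split_tokens_per_sec(model_gb, gpu_fraction, gpu_bw_gbs, cpu_bw_gbs):
    """Upper bound on decode speed when a fraction of the weights sits in VRAM."""
    gpu_time = model_gb * gpu_fraction / gpu_bw_gbs        # seconds per token on GPU layers
    cpu_time = model_gb * (1 - gpu_fraction) / cpu_bw_gbs  # seconds per token on CPU layers
    return 1.0 / (gpu_time + cpu_time)

# Usage: a 42.5 GB model split between an RTX 4090 and dual-channel DDR5.
for fraction in (0.0, 0.5, 0.8, 1.0):
    rate = split_tokens_per_sec(42.5, fraction, 1008, 89.6)
    print(f"{fraction:.0%} of weights on GPU: up to {rate:.1f} tokens/sec")
```

The estimate shows why the last few layers matter most: even a small share of weights left in system RAM contributes most of the per-token time.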

Production Deployment Considerations

Deploying llama.cpp in production requires moving beyond basic command-line execution. It involves optimizing memory access patterns and hardware utilization.

1. Memory Configuration (mmap and --mlock)

By default, llama.cpp memory-maps the GGUF file into the process's virtual address space (this behavior can be turned off with --no-mmap). Mapping allows the OS to load parts of the model lazily. However, in production environments with high concurrency, the OS may page out parts of the model weights if system memory is tight, leading to catastrophic disk-read latencies.

To prevent this, enable the --mlock flag. This locks the model's memory pages in RAM, preventing the kernel from swapping them to disk. Note that using --mlock on Linux requires raising the user's memlock resource limit (ulimit -l) to at least the size of the model.
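Before relying on --mlock, it can help to confirm that the process limit is large enough for the model. The sketch below uses Python's standard resource module (Unix only) and assumes the locked region is roughly the size of the model file; privileged processes may bypass the limit, which this check does not account for.

```python
import resource

def current_memlock_limit():
    """Soft RLIMIT_MEMLOCK for this process, in bytes."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft

def mlock_will_fit(model_bytes, soft_limit):
    """Return True if a process with this soft limit may lock model_bytes of memory."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes

# Usage: check a 42.5 GB model against the current limit.
model_bytes = int(42.5e9)
if not mlock_will_fit(model_bytes, current_memlock_limit()):
    print("memlock limit is too low for this model; raise it before using --mlock")
```

Note that bash reports ulimit -l in KiB, while getrlimit returns bytes.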

2. VRAM Allocation Math

Before deploying, you must calculate the exact VRAM footprint of your configuration. The total VRAM required is the sum of:

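  1. Model Weights: Model Size (GB) = Parameters (B) * Quantization Bits / 8, where Quantization Bits is the effective bits per weight including block scales (about 4.85 for Q4_K_M, which is how a 70B model comes to roughly 42.5 GB).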
  2. KV Cache: The KV cache grows with the context window size and the number of active sequences.
  3. Execution Buffers: Temporary workspace buffers required for intermediate matrix multiplication kernels.

The mathematical model for calculating the KV Cache size is:

KV Cache Size (Bytes) = 2 * B * C * L * H_kv * D * P

Where:

  • 2 = One tensor for keys and one for values.
  • B = Batch size (number of concurrent sequences).
  • C = Context length in tokens allocated per sequence.
  • L = Number of layers in the model.
  • H_kv = Number of Key-Value attention heads (using GQA or MQA).
  • D = Head dimension (typically 128).
  • P = Precision in bytes per element (e.g., 2 for FP16, approximately 1 for Q8_0, approximately 0.5 for Q4_0).

If this total exceeds the physical VRAM capacity, the system will experience Out-Of-Memory (OOM) crashes or fall back to system memory, causing performance to drop significantly. For a deeper study of batch scheduling and KV memory optimization, see our detailed guide on Continuous Batching vs PagedAttention.
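The VRAM budget can be put into a short calculator. This is a minimal sketch that includes a context-length term, since the cache holds one key and one value vector per token position. The Llama-3-70B shape used in the usage lines (80 layers, 8 KV heads, head dimension 128) is taken from the public model configuration and is an assumption relative to this article, as is the 1 GB allowance for execution buffers.

```python
def kv_cache_bytes(batch, context, layers, kv_heads, head_dim, bytes_per_elem):
    """Keys plus values (factor of 2) for every sequence, position, layer, and KV head."""
    return 2 * batch * context * layers * kv_heads * head_dim * bytes_per_elem

def weights_gb(params_billions, bits_per_weight):
    """Model size in GB from parameter count (billions) and effective bits per weight."""
    return params_billions * bits_per_weight / 8

def fits_in_vram(vram_gb, weights, kv_bytes, buffers_gb):
    """Return (fits, total_gb) for weights + KV cache + execution buffers."""
    total_gb = weights + kv_bytes / 1e9 + buffers_gb
    return total_gb <= vram_gb, total_gb

# Usage: one 8,192-token sequence with an FP16 cache.
kv = kv_cache_bytes(batch=1, context=8192, layers=80, kv_heads=8,
                    head_dim=128, bytes_per_elem=2)
ok, total = fits_in_vram(vram_gb=24, weights=weights_gb(70, 4.85),
                         kv_bytes=kv, buffers_gb=1.0)
```

Doubling either the context length or the number of concurrent sequences doubles the cache, which is why long-context, multi-user deployments run out of VRAM well before the weights alone would suggest.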


Common Mistakes

Here are the most frequent operational errors developers make when setting up llama.cpp:

  1. Incorrect Thread Allocation: Running llama.cpp on CPU with a thread count that matches the total logical processors (including hyperthreads and E-cores) rather than the physical performance cores (P-cores). The extra threads contend for the same execution units and memory bandwidth, and slower E-cores can hold up each synchronized step, which slows matrix execution.
  2. Spillover to CPU via Partial Offloading: Assuming that offloading 80% of layers to a GPU yields 80% of the GPU's speed. In reality, the time spent on the CPU-resident layers (bounded by system RAM bandwidth) dominates each token, so throughput falls far below that figure.
  3. Leaving Flash Attention Disabled: Running without Flash Attention (--flash-attn) on builds that do not enable it by default, which leads to larger attention buffers in VRAM and higher attention latency as the context window grows.
  4. Context Window Oversubscription: Setting a context window size (-c) of 32k or 128k without allocating sufficient VRAM for the KV cache, leading to runtime crashes.
  5. Ignoring NUMA Architecture: On multi-socket CPU servers (e.g., dual AMD EPYC), failing to set a NUMA policy with the --numa flag lets memory accesses cross NUMA nodes, which raises memory latency and lowers effective bandwidth.

Lessons From Production Deployments

Operating local LLMs at scale exposes hardware and OS behaviors that do not appear in short-term tests.

Case Study 1: The High-Concurrency Latency Spike

In an enterprise customer support bot deployed using the llama.cpp server, latency spiked by over 400% under high concurrent load. The cause was identified as head-of-line blocking in the inference queue. Because the deployment processed requests sequentially or in static batches, a single user requesting a long generation blocked the prefill step of other incoming requests. The solution was migrating to dynamic (continuous) batching and allocating multiple model instances across separate GPU cards.

Case Study 2: Thermal Throttling on Workstations

During an overnight document classification batch job on an RTX 4090 workstation, performance dropped from 35 tps to 14 tps after 2 hours of continuous execution. Diagnostic logs showed that the GPU hit its thermal limit (83°C) and throttled core clocks to prevent damage. This highlights the importance of adequate cooling and power limit adjustments (e.g., using nvidia-smi -pl) in sustained execution environments.

Case Study 3: Unified Memory Contention on Mac Studio

On a Mac Studio with an M2 Ultra running Llama-3-70B-Q8, token generation degraded whenever the system was under high CPU load. Because Apple Silicon uses unified memory, CPU tasks (such as log compression and disk writes) competed directly with the GPU for the same physical memory bus bandwidth. This shows that unified memory is a shared resource; background system activity can degrade inference throughput.


What Most Articles Miss

Most high-level tutorials focus on command-line flags while ignoring the fundamental physics of computer memory.

The Quantized Generation Equation

To estimate an upper bound on the token generation throughput of any system, you can use the Memory Bandwidth Bottleneck Equation:

Max Tokens per Second = Memory Bandwidth (GB/s) / Model Size (GB)

Let's calculate the theoretical limits for running Llama-3-70B Quantized models:

  • Model configuration: Llama-3-70B-Q4_K_M (approximately 42.5 GB).
  • Hardware 1: NVIDIA RTX 4090 (24GB VRAM). The model does not fit in VRAM, so part of it must stay in DDR5 system RAM. In the worst case, where every weight is streamed from dual-channel DDR5 at 89.6 GB/s, the theoretical speed limit is:
    89.6 GB/s / 42.5 GB = 2.11 tokens/sec
    
  • Hardware 2: Apple M3 Max (400 GB/s Unified Memory). The model fits entirely in unified memory, provided the machine is configured with enough RAM. The theoretical speed limit is:
    400 GB/s / 42.5 GB = 9.41 tokens/sec
    
  • Hardware 3: Mac Studio M3 Ultra (800 GB/s Unified Memory). The model fits entirely in unified memory. The theoretical speed limit is:
    800 GB/s / 42.5 GB = 18.82 tokens/sec
    

In practice, real-world speeds are somewhat lower (typically 75% to 85% of this theoretical limit) due to the overhead of reading the KV cache, activation transfers, and thread scheduling latency.
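The worked examples above reduce to a one-line function. The sketch below reproduces them and applies the 75% to 85% efficiency range quoted in the text; the function names are illustrative only.

```python
def max_tokens_per_sec(bandwidth_gbs, model_gb):
    """Theoretical decode ceiling: every weight is read once per generated token."""
    return bandwidth_gbs / model_gb

def realistic_range(bandwidth_gbs, model_gb, low=0.75, high=0.85):
    """Apply the empirical efficiency range to the theoretical ceiling."""
    bound = max_tokens_per_sec(bandwidth_gbs, model_gb)
    return bound * low, bound * high

# Usage: the three configurations discussed above, with a 42.5 GB model.
for name, bandwidth in [("Dual-channel DDR5", 89.6), ("M3 Max", 400), ("M3 Ultra", 800)]:
    lo, hi = realistic_range(bandwidth, 42.5)
    print(f"{name}: ceiling {max_tokens_per_sec(bandwidth, 42.5):.2f}, "
          f"expected {lo:.1f} to {hi:.1f} tokens/sec")
```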

GGUF vs. GPU-Native Formats (AWQ & GPTQ)

Developers often ask why they should use GGUF instead of GPTQ or AWQ. The answer lies in execution architecture:

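+--------------------------+-------------------------+--------------------------------+--------------------------------+
| Feature / Format         | GGUF                    | AWQ                            | GPTQ                           |
+--------------------------+-------------------------+--------------------------------+--------------------------------+
| Primary Execution Target | Hybrid CPU & GPU        | Pure GPU                       | Pure GPU                       |
| Memory Architecture      | Split Memory Allocation | Unified VRAM Allocation        | Unified VRAM Allocation        |
| Inference Engines        | llama.cpp, Ollama       | vLLM, TensorRT-LLM             | vLLM, AutoGPTQ                 |
| Quantization Method      | Grouped Block-Scale     | Activation-Aware               | Layer-Wise Calibration         |
| Loading Latency          | Low (via mmap)          | High (requires full VRAM copy) | High (requires full VRAM copy) |
+--------------------------+-------------------------+--------------------------------+--------------------------------+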

Best Practices

To extract the maximum performance from your hardware, follow this system configuration checklist:

For NVIDIA (CUDA) Platforms

  1. Enable CUDA Graphs: Ensure your build includes CUDA Graph support, which batches GPU kernel launches and reduces CPU launch overhead.
  2. Pin Memory: Use --mlock to keep weights pinned in physical RAM.
  3. Maximize Layer Offloading: Set -ngl to a value higher than the model layers (e.g., -ngl 99) to ensure all layers, including the embeddings and final normalization layers, reside in VRAM.
  4. Reserve VRAM Headroom: Leave at least 1.5 GB of VRAM free for operating system display processes and runtime compute buffers.

For Apple Silicon (Metal) Platforms

  1. Thread Control: Match thread count (-t) exactly to the physical performance cores (P-cores) of your system. For example, use -t 12 on an M3 Max with 12 P-cores.
  2. Unified Memory Limit: By default, macOS caps how much unified memory the GPU may keep wired and reserves the rest for the system. Use the sysctl setting to increase the GPU allocation limit if running massive models (on recent macOS releases the key is iogpu.wired_limit_mb and the value is in megabytes):
    sudo sysctl iogpu.wired_limit_mb=98304
    
  3. Keep the KV Cache in Floating Point: While a Q8_0 KV cache saves memory, quantized cache types add dequantization work to the attention kernels. Benchmark on your Apple Silicon machine before moving away from the default F16 cache.

FAQ

1. Why does llama.cpp utilization look low (20-30%) in task manager during generation?

This is normal. During the decoding phase, the GPU cores spend most of their cycles idle, waiting for the memory bus to deliver the model weights. The bottleneck is the memory bus bandwidth, not the processing core capacity.

2. Can I use multiple different GPUs to run llama.cpp?

Yes. llama.cpp supports splitting models across multiple GPUs. You can specify the split ratio using the --tensor-split (-ts) flag. However, note that if the GPUs have different architectures or memory speeds, the execution will be limited by the slowest GPU.

3. How does context length affect VRAM consumption?

Context length increases the size of the KV cache. While model weight size remains constant, the KV cache grows linearly with context size. For long-context models (e.g., 32k or 128k), the KV cache can approach or exceed the size of the quantized model weights, especially with several concurrent sequences.

4. What is the benefit of speculative decoding?

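Speculative decoding uses a smaller, faster "draft" model to predict several tokens ahead, which are then verified in a single parallel step by the larger "target" model. Because one pass over the target model's weights can confirm several tokens, memory-bound generation steps are replaced by a more compute-dense verification step. Speedups of up to 2x are possible, depending on how often the draft tokens are accepted.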
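How the acceptance rate drives the gain can be sketched with the standard expected-value model for speculative decoding. This is an idealized estimate, assuming each draft token is accepted independently with the same probability and that the draft model costs a fixed fraction of a target-model pass; the parameter names are illustrative and do not correspond to llama.cpp flags.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens produced per target-model pass (accepted drafts plus one)."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

def expected_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Speedup over plain decoding; draft_cost_ratio is draft time / target time per token."""
    step_cost = draft_len * draft_cost_ratio + 1  # draft passes plus one verification pass
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost

# Usage: 4 drafted tokens per step, draft model at 10% of the target's cost.
for rate in (0.5, 0.7, 0.9):
    print(f"acceptance {rate:.0%}: about {expected_speedup(rate, 4, 0.1):.2f}x")
```

When the acceptance rate is low, the time spent on rejected drafts can make generation slower than plain decoding, so the draft model should be chosen to match the target model's output distribution closely.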

5. Why is my prompt processing speed (prefill) so slow compared to token generation (decode)?

If your prefill speed is slow, you are likely compute-bound. Ensure your build has SIMD optimizations enabled (AVX2/AVX-512 on x86, NEON on ARM) or that you are compiling with GPU acceleration (Metal/CUDA) rather than running in pure CPU mode.

6. Should I choose Q4_K_M or Q5_K_M for general use?

For most tasks, Q4_K_M offers the optimal balance between performance and memory usage. It reduces model size by roughly 70% relative to FP16 while keeping perplexity loss minimal. Upgrade to Q5_K_M only if you observe reasoning degradation in complex programming or mathematical tasks.

7. What does the error "Could not allocate workspace buffer" mean?

This error occurs when the GPU does not have enough contiguous VRAM to allocate temporary execution buffers. To fix this, reduce your context size (-c), decrease the batch size (-b), or offload fewer layers to the GPU.

8. Does llama.cpp support Hugging Face models directly?

Yes, llama.cpp can load GGUF models directly from Hugging Face. For raw safetensors or PyTorch models, you must first convert them to GGUF format using the convert_hf_to_gguf.py script provided in the llama.cpp repository.

9. How do I enable GPU acceleration on Linux?

Ensure you compile llama.cpp with the CUDA compiler (NVCC) using the CMake command:

cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

Then verify that your model runs with active GPU layers by checking the console output for CUDA initialization messages.

10. Can I run llama.cpp on a server without a GPU?

Yes. llama.cpp was originally built for CPU inference. It runs efficiently on server processors using multi-threaded execution and SIMD instructions. However, expect token generation speeds to be limited by your system RAM bandwidth (typically below 5 to 10 tokens/sec for large models).


Key Takeaways

  • Memory Bandwidth is King: LLM token generation (decode phase) is overwhelmingly memory-bandwidth bound. Performance is dictated by memory transfer speed, not processing core counts.
  • Prefill vs. Decode: The prefill stage is compute-bound and benefits from high GPU core counts, while the decode stage is memory-bound and benefits from wider memory buses.
  • Beware the Bandwidth Cliff: Partial GPU offloading introduces latency overheads. Crossing the PCIe bus to process the remaining layers on the CPU drags overall performance toward system RAM speeds.
  • Optimize Thread Mapping: When running on CPU, bind thread count to the physical performance cores (P-cores) only. Avoid hyperthreading and efficiency cores.
  • Verify Memory Pinning: Always use the --mlock flag in production environments to lock model weights in RAM and prevent the operating system from swapping them to disk.

About & Technical Stack

Shyank Akshar

Hi! I'm Shyank, a full-stack Software Developer and a Call of Duty enthusiast. I help businesses scale by engineering robust technology solutions that automate complex tasks, save hundreds of hours, and delight users. Over the years, I've partnered with leading global startups and government organizations to deliver high-performance, secure applications at scale.
