CPU-Only LLM Inference on Threadripper: Sergiu Nagailic Benchmarks llama.cpp Performance
How much AI can you run on just a CPU? That’s the question Sergiu Nagailic set out to answer using a powerful desktop with no GPU. In his latest blog post, he shows how careful tuning of CPU threads, memory layout, and runtime parameters can raise large language model throughput substantially, all without a graphics card.
His testbed is a second-hand AMD Threadripper 1950X workstation he’s nicknamed “QuietBee,” and his goal was to benchmark large, open-weight AI models on the CPU alone. Working with a modified llama.cpp and a fork called ik_llama.cpp, Sergiu explored different build strategies, memory layouts, and batch sizes to extract maximum inference throughput.
Using a custom benchmarking wrapper, he ran both exploratory and deep test sweeps across configurations such as BLAS/BLIS optimization, NUMA-aware thread pinning, and variations in kernel flags. His highest-performing setup, an "ik_vanilla" build with tuned flags, achieved over 92 tokens per second in mixed prompt-and-response workloads, more than 60% faster than the standard build.
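To make the idea of a configuration sweep concrete, here is a minimal Python sketch of how such a harness might be organized. It is not Sergiu’s wrapper: the names build_grid, sweep, and fake_run are hypothetical, the parameter names "threads" and "batch" are illustrative, and the stub runner returns made-up timings rather than real measurements. In practice the runner would launch the actual benchmark binary and parse its output.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

Config = Dict[str, int]

def build_grid(threads: Iterable[int], batch_sizes: Iterable[int]) -> List[Config]:
    """Cartesian product of the parameters to sweep (hypothetical parameter names)."""
    return [{"threads": t, "batch": b} for t, b in product(threads, batch_sizes)]

def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput: tokens processed divided by wall-clock time."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / seconds

def relative_gain(candidate: float, baseline: float) -> float:
    """Percentage by which candidate throughput exceeds baseline throughput."""
    return (candidate - baseline) / baseline * 100.0

def sweep(
    grid: List[Config],
    run_once: Callable[[Config], Tuple[int, float]],
) -> List[Tuple[Config, float]]:
    """Run every configuration and return (config, tokens/s), fastest first."""
    results = []
    for cfg in grid:
        tokens, seconds = run_once(cfg)
        results.append((cfg, tokens_per_second(tokens, seconds)))
    return sorted(results, key=lambda r: r[1], reverse=True)

def fake_run(cfg: Config) -> Tuple[int, float]:
    """Stub runner with invented timings; a real one would invoke the benchmark."""
    return 1000, 100.0 / cfg["threads"]

if __name__ == "__main__":
    ranked = sweep(build_grid([8, 16], [256, 512]), fake_run)
    best_cfg, best_tps = ranked[0]
    worst_cfg, worst_tps = ranked[-1]
    print(best_cfg, best_tps, f"{relative_gain(best_tps, worst_tps):.0f}% over slowest")
```

Separating the runner from the sweep logic keeps the ranking and percentage calculations testable without hardware, and makes it easy to swap in different builds as the thing being measured.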
Beyond performance gains, the real contribution is reproducibility. Sergiu’s wrapper script and benchmarking framework are available via HumanFace Tech’s GitHub, enabling others to run similar tests, even on modest home lab hardware. His work is especially relevant to developers fine-tuning LLMs for niche domains like Drupal or Laravel, where GPU access may be limited but domain knowledge is deep.


