CPU-Only LLM Inference on Threadripper: Sergiu Nagailic Benchmarks llama.cpp Performance

Staff Reporter

How much AI can you run on just a CPU? That’s the question Sergiu Nagailic set out to answer using a powerful but GPU-less desktop setup. In his latest blog post, he shows how careful tuning of CPU threads, memory, and runtime parameters can substantially raise large language model throughput, all without a graphics card.

His testbed is a second-hand AMD Threadripper 1950X workstation he has nicknamed “QuietBee,” and his goal was to benchmark large, open-weight AI models using only the CPU. By modifying llama.cpp and experimenting with a fork called ik_llama.cpp, Sergiu explored different build strategies, memory layouts, and batch sizes to extract maximum inference throughput.

Using a custom benchmarking wrapper, he ran both exploratory and deep test sweeps across configurations such as BLAS/BLIS optimization, NUMA-aware thread pinning, and variations in kernel flags. His highest-performing setup, an "ik_vanilla" build with tuned flags, achieved over 92 tokens per second in mixed prompt-and-response workloads, outperforming the standard build by more than 60%.
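To make the idea of a sweep concrete, here is a minimal Python sketch of how a configuration grid and a relative speedup figure might be computed. It is not Sergiu's wrapper: the function names, the thread counts and batch sizes, and the 57.5 tokens-per-second baseline are all hypothetical, chosen only so that the arithmetic lands on the roughly 60% gain mentioned above.

```python
from itertools import product


def sweep_grid(builds, threads, batch_sizes):
    """Return one config dict per combination of the given parameter values."""
    return [
        {"build": build, "threads": t, "batch_size": b}
        for build, t, b in product(builds, threads, batch_sizes)
    ]


def speedup_percent(baseline_tps, tuned_tps):
    """Percentage gain in tokens per second of a tuned run over a baseline run."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (tuned_tps / baseline_tps - 1.0) * 100.0


# Hypothetical sweep values; the 1950X exposes 16 cores / 32 hardware threads.
grid = sweep_grid(
    builds=["vanilla", "ik_vanilla"],
    threads=[8, 16, 32],
    batch_sizes=[128, 512],
)
print(len(grid))  # 2 builds x 3 thread counts x 2 batch sizes = 12 configs

# Placeholder baseline (assumed, not a measured result from the blog post).
print(round(speedup_percent(57.5, 92.0), 1))  # 60.0
```

In a real harness, each config in the grid would be passed to the benchmark binary and the measured tokens per second recorded alongside it, so the best configuration can be compared against the standard build.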

Beyond performance gains, the real contribution is reproducibility. Sergiu’s wrapper script and benchmarking framework are available via HumanFace Tech’s GitHub, enabling others to run similar tests, even on modest home lab hardware. His work is especially useful to developers fine-tuning LLMs for niche domains like Drupal or Laravel, where GPU access may be limited but domain knowledge is deep.

Reference: CPU-only LLM Inference by Sergiu Nagailic (nikro) (17 October 2025)

Disclosure: This content is produced with the assistance of AI.

Disclaimer: The opinions expressed in this story do not necessarily represent those of TheDropTimes. We regularly share third-party blog posts that feature Drupal in good faith. TDT recommends reader discretion when consuming such content, as the veracity and authenticity of the story depend on the blogger and their motives.

Note: The vision of this web portal is to help promote news and stories around the Drupal community and to promote and celebrate the people and organizations in the community. We strive to create and distribute our content based on our content policy. If you see any omission or variation, please reach out to us at the #thedroptimes channel on Drupal Slack and we will try to address the issue as best we can.