Nichebench Released to Benchmark AI Models on Drupal 10/11 Code and Knowledge
Sergiu Nagailic, Co-founder and CTO of HumanFace Tech, has released Nichebench, a benchmarking framework designed to evaluate how well large language models (LLMs) understand and generate Drupal 10/11 code. The tool assesses both factual knowledge and practical coding ability, two key metrics for fine-tuning AI models tailored to Drupal’s evolving ecosystem.
Unlike generic benchmarks, Nichebench focuses on niche, domain-specific tasks. It runs two evaluation tracks: quiz-based knowledge tests using multiple-choice questions and code generation tasks that require models to produce Drupal implementation code. Each result is assessed using a GPT-5-based LLM-as-a-Judge setup via DeepEval, allowing for scalable and consistent evaluation.
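To make the two-track setup concrete, here is a minimal Python sketch of how per-track accuracy could be computed when every model output is passed to a judge. It is not Nichebench's code and does not use the DeepEval API: the `Case` type, the `accuracy_by_track` function, and the `model` and `judge` callables are all hypothetical stand-ins, with the judge reduced to a function that returns pass or fail.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Case:
    """One benchmark item (hypothetical shape, not Nichebench's schema)."""
    track: str     # "quiz" or "code"
    prompt: str    # question text or coding task
    expected: str  # correct choice, or criteria the judge should check


# Hypothetical callables: a model maps a prompt to its output, and a judge
# decides whether that output passes for the given case.
Model = Callable[[str], str]
Judge = Callable[[Case, str], bool]


def accuracy_by_track(model: Model, judge: Judge, cases: Iterable[Case]) -> Dict[str, float]:
    """Run every case through the model, let the judge grade it, and
    return the fraction of passing cases for each track."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        output = model(case.prompt)
        total[case.track] += 1
        if judge(case, output):
            passed[case.track] += 1
    return {track: passed[track] / total[track] for track in total}
```

In a real run the `judge` callable would wrap a call to the judging LLM; keeping it as a plain function makes the scoring loop easy to test with stubs.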
Initial findings show promising accuracy from some open models on the quiz track, including GPT-OSS-120B (90%), Phi-4 (14B) (88%), and Qwen3-Coder (86%). However, the code generation tests revealed a much wider gap between GPT-5 and the open models: GPT-5 achieved 75% accuracy, while the best open models, such as GPT-OSS-120B, peaked at 40%. Most open models struggled with structured data output in JSON or YAML, a known weakness in smaller architectures.
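As an illustration of what a structured-output failure can look like, the sketch below checks whether a model's reply parses as a JSON object containing a set of required keys. The article does not describe how Nichebench validates structured output, so the function name and the keys used in the example are assumptions chosen only for demonstration. YAML is left out because it would need a third-party parser.

```python
import json
from typing import Set


def is_valid_json_object(text: str, required_keys: Set[str]) -> bool:
    """Return True only if `text` parses as a JSON object that contains
    every key in `required_keys` (hypothetical check, for illustration)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Truncated or malformed output, e.g. a missing closing brace.
        return False
    return isinstance(data, dict) and required_keys.issubset(data)
```

A reply such as `{"name": "my_module", "type": "module"}` passes a check for the keys `name` and `type`, while the same text cut off after the first comma does not.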
Based on the results, Nagailic is now preparing to fine-tune a Drupal-specific LLM, starting from high-performing open models such as GPT-OSS-20B and Qwen3-Coder-30B-A3B. The goal is an open-weight LLM optimized for modern Drupal practices and capable of supporting real-world development tasks such as module creation, architecture advice, and implementation support.
The benchmark’s test cases and datasets are not public, to keep them from contaminating future training data, but contributors may request access via the project’s GitHub repository. A companion explainer video is also available on YouTube.


