Add comprehensive benchmark report: Allmos vs nano-vLLM performance analysis

  • Documented the complete GCP GPU setup process (NVIDIA L4, CUDA 12.7, PyTorch)
  • Benchmarked Allmos: 22.81 tokens/sec on Qwen3-0.6B
  • Compared against the nano-vLLM baseline: 1434 tokens/sec (a 62.8x gap)
  • Identified bottlenecks: no KV cache reuse, Python loop overhead, and lack of batching
  • Provided detailed optimization recommendations (KV cache, CUDA graphs, batching)
  • Included full technical specifications, methodology, and reproducibility details
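One reason the gap can grow this large: without KV cache reuse, each decode step re-attends over the entire prefix from scratch, so total attention work grows quadratically with sequence length instead of linearly. A toy sketch of the cost difference (illustrative only, not code from Allmos or nano-vLLM):

```python
def attention_ops_no_cache(seq_len: int) -> int:
    """Total attention 'work units' when every decode step
    recomputes keys/values for the whole prefix from scratch."""
    # step t re-attends over positions 1..t, all recomputed
    return sum(t for t in range(1, seq_len + 1))

def attention_ops_with_cache(seq_len: int) -> int:
    """Total work when cached keys/values are reused:
    each step only computes one new position."""
    return seq_len

# At 512 generated tokens, the uncached decode does roughly
# 256x more attention work than the cached one.
print(attention_ops_no_cache(512) // attention_ops_with_cache(512))
```

Batching and CUDA graphs attack the remaining gap (GPU utilization and Python launch overhead), which is why the three recommendations are complementary rather than alternatives.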

This establishes baseline metrics for the research project on AI coding assistant effectiveness in systems software development.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
