Add comprehensive benchmark report: Allmos vs nano-vLLM performance analysis

  • Documented the complete GCP GPU setup process (NVIDIA L4, CUDA 12.7, PyTorch)
  • Benchmarked Allmos: 22.81 tokens/sec on Qwen3-0.6B
  • Compared against the nano-vLLM baseline: 1434 tokens/sec (a 62.8x gap)
  • Identified bottlenecks: no KV cache reuse, Python loop overhead, and lack of batching
  • Provided detailed optimization recommendations (KV cache, CUDA graphs, batching)
  • Included full technical specifications, methodology, and reproducibility details
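One reason the gap can grow this large: without KV cache reuse, each decode step re-attends over the entire prefix from scratch, so total attention work grows quadratically with sequence length instead of linearly. A toy sketch of the cost difference (illustrative only, not code from Allmos or nano-vLLM):

```python
def attention_ops_no_cache(seq_len: int) -> int:
    """Total attention 'work units' when every decode step
    recomputes keys/values for the whole prefix from scratch."""
    # step t re-attends over positions 1..t, all recomputed
    return sum(t for t in range(1, seq_len + 1))

def attention_ops_with_cache(seq_len: int) -> int:
    """Total work when cached keys/values are reused:
    each step only computes one new position."""
    return seq_len

# At 512 generated tokens, the uncached decode does roughly
# 256x more attention work than the cached one.
print(attention_ops_no_cache(512) // attention_ops_with_cache(512))
```

Batching and CUDA graphs attack the remaining gap (GPU utilization and Python launch overhead), which is why the three recommendations are complementary rather than alternatives.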

This establishes baseline metrics for the research project on AI coding assistant effectiveness in systems software development.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
