The reported tables also don't match the screenshots. And the baseline and test results are too close to tell apart (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + "use skill" prompt, 33/33 agent.md.
Good catch on the numbers. 29/33 vs 33/33 is the kind of gap that could easily be noise with that sample size. You'd need hundreds of runs to draw any meaningful conclusion about a 4-point difference, especially given how non-deterministic these models are.
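A quick back-of-the-envelope check, if anyone wants to sanity-check the "could be noise" point: this is a minimal sketch in standard-library Python that computes a Fisher exact p-value by hand for the two extreme conditions (29/33 vs 33/33, taken from the parent comment; everything else is just arithmetic, not anything from the article).

```python
from math import comb

n = 33                              # runs per condition (from the article's setup)
a_pass, b_pass = 29, 33             # baseline vs best condition
a_fail, b_fail = n - a_pass, n - b_pass
total = 2 * n
failures = a_fail + b_fail          # 4 failures across 66 runs

def p_table(fail_in_a: int) -> float:
    """Hypergeometric probability that exactly `fail_in_a` of the failures land in condition A."""
    return comb(n, fail_in_a) * comb(n, failures - fail_in_a) / comb(total, failures)

observed = p_table(a_fail)
# One-sided: tables at least as lopsided as observed (all 4 failures in the baseline).
p_one_sided = sum(p_table(k) for k in range(failures + 1) if k >= a_fail)
# Two-sided: all tables no more probable than the observed one.
p_two_sided = sum(p_table(k) for k in range(failures + 1) if p_table(k) <= observed)

print(f"one-sided p ~= {p_one_sided:.3f}")   # ~0.057
print(f"two-sided p ~= {p_two_sided:.3f}")   # ~0.114
```

Neither p-value clears 0.05, and that's for the widest gap in the table; the 29 vs 31 and 29 vs 32 comparisons are even weaker. (This should match what scipy.stats.fisher_exact reports, if you'd rather not roll it by hand.)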
This is a recurring problem with LLM benchmarking — small sample sizes presented with high confidence. The underlying finding (always-in-context > lazy-loaded) is probably directionally correct, but the specific numbers don't really support the strength of the claims in the article.