Hacker News

How well does such LLM research hold up as new models are released?


Most model research decays because the evaluation harness isn’t treated as a stable artefact. If you freeze the tasks, acceptance criteria, and measurement method, you can swap models and still compare apples to apples. Without that, each release forces a reset and people mistake novelty for progress.
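A minimal sketch of what "freezing the harness" might look like, assuming a model is just a callable from prompt to output. The task set, acceptance criteria, and the stub models below are hypothetical illustrations, not any real benchmark:

```python
# Hypothetical frozen evaluation harness: tasks, acceptance criteria,
# and the metric are fixed artefacts; only the model swaps between runs.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Task:
    prompt: str
    accept: Callable[[str], bool]  # frozen acceptance criterion

# Fixed task set (illustrative examples only).
TASKS: Tuple[Task, ...] = (
    Task("2+2=", lambda out: out.strip() == "4"),
    Task("Capital of France?", lambda out: "paris" in out.lower()),
)

def evaluate(model: Callable[[str], str], tasks: Tuple[Task, ...] = TASKS) -> float:
    """Same tasks, same criteria, same metric across model releases."""
    passed = sum(task.accept(model(task.prompt)) for task in tasks)
    return passed / len(tasks)

# Stub "models" standing in for two releases.
def model_v1(prompt: str) -> str:
    return {"2+2=": "4", "Capital of France?": "London"}.get(prompt, "")

def model_v2(prompt: str) -> str:
    return {"2+2=": "4", "Capital of France?": "Paris"}.get(prompt, "")

print(evaluate(model_v1))  # 0.5
print(evaluate(model_v2))  # 1.0
```

Because `evaluate` and `TASKS` never change, the two scores are directly comparable across releases; only the model callable is swapped.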



