
N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

N-Day-Bench measures the ability of frontier language models to find real-world vulnerabilities, or "N-days," disclosed after each model's knowledge cutoff date. All models run with the same harness and the same context, with no leeway for reward hacking. The benchmark exists to measure a real cybersecurity capability of large language models (LLMs): vulnerability discovery.
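The core selection criterion — a vulnerability only qualifies if its public disclosure postdates the model's knowledge cutoff — can be sketched as a simple filter. This is a hypothetical illustration, not the benchmark's actual code; the model names, cutoff dates, and CVE entries below are invented for the example.

```python
from datetime import date

# Invented example data: per-model knowledge cutoffs and CVE disclosure dates.
MODEL_CUTOFFS = {
    "model-a": date(2024, 6, 1),
    "model-b": date(2024, 10, 1),
}

DISCLOSURES = [
    ("CVE-2024-0001", date(2024, 3, 15)),
    ("CVE-2024-0002", date(2024, 8, 20)),
    ("CVE-2025-0003", date(2025, 1, 5)),
]

def eligible_cases(model: str) -> list[str]:
    """Return vulnerabilities disclosed strictly after the model's cutoff,
    i.e. ones the model cannot have seen during training."""
    cutoff = MODEL_CUTOFFS[model]
    return [cve for cve, disclosed in DISCLOSURES if disclosed > cutoff]

print(eligible_cases("model-a"))  # ['CVE-2024-0002', 'CVE-2025-0003']
print(eligible_cases("model-b"))  # ['CVE-2025-0003']
```

Because cutoffs differ across models, a monthly refresh of the test set (as described below) keeps a non-empty pool of post-cutoff vulnerabilities for every model in the comparison.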

This benchmark is adaptive: test cases are refreshed on a monthly cadence, and each model in the set is upgraded to its latest version and checkpoint.

All traces are publicly browsable.

A project from Winfunc Research