Author here.

I am an independent researcher (originally from a med background, later moved to CS/Physics). I spent the last few weeks manually grading GPQA-Diamond and Humanity's Last Exam (HLE) because my experimental models (DeepSeek-Overclock) were deriving "wrong" answers that looked logically sound.

I conducted a forensic audit of the datasets. I suspect these benchmarks are currently "gaslighting" foundation models.

*Findings:*

* GPQA-Diamond: Inherent error lower bound *26.8%*.
* HLE (Sampled): Inherent error lower bound *~58%*.

Visual Summary of Error Rates: https://i.postimg.cc/nV5hskX2/image1.png
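
On the "lower bound" framing: these figures come from manually grading samples, and one standard way to turn such an audit into a conservative bound is a one-sided Clopper-Pearson limit. A minimal sketch of that kind of calculation (not necessarily the exact method in the PDF, and the counts below are placeholders, not my actual audit numbers):

    # One-sided Clopper-Pearson lower bound on a benchmark's true error
    # rate, given a manual audit of n questions with `bad` confirmed errors.
    # Placeholder counts, not the audit numbers behind the figures above.
    from scipy.stats import beta

    def error_lower_bound(bad: int, n: int, confidence: float = 0.95) -> float:
        """Lower confidence limit for the fraction of broken questions."""
        if bad == 0:
            return 0.0
        return float(beta.ppf(1 - confidence, bad, n - bad + 1))

    # Hypothetical: 60 confirmed-broken questions in a 198-question set.
    print(f"{error_lower_bound(60, 198):.1%}")  # around 25% (illustrative)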

The most shocking finding is in *HLE*, which appears to be riddled with OCR errors from handwritten content rather than genuinely "hard" problems. I reverse-engineered these errors by treating each gold answer like a cryptographic hash and searching for the original question that produces it.
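
Mechanically, the search looks roughly like the sketch below (a minimal illustration, not my actual verification scripts; `solve` is a hypothetical per-problem solver):

    # Sketch of the "gold answer as hash" search: generate OCR-plausible
    # readings of the damaged text, solve each, and accept a reading if it
    # reproduces the published gold answer.
    from itertools import product

    # A few confusion pairs typical of OCR on handwriting (illustrative).
    CONFUSABLE = {"k": "k4", "4": "4k", "l": "l1", "1": "1l", "O": "O0", "0": "0O"}

    def readings(text):
        """Yield every variant obtained by swapping confusable glyphs.
        In practice you restrict this to the few suspect tokens, or the
        search space explodes combinatorially."""
        options = [CONFUSABLE.get(ch, ch) for ch in text]
        for combo in product(*options):
            yield "".join(combo)

    def invert_gold(text, gold, solve, tol=1e-2):
        """Return the first reading whose computed answer matches `gold`.
        `solve` maps a question string to a float, or None if unsolvable."""
        for variant in readings(text):
            ans = solve(variant)
            if ans is not None and abs(ans - gold) <= tol:
                return variant
        return None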

*Exhibit A: The "Phantom Parameter" (Physics)* In a lattice adsorption problem (`66fecb...`), the text is broken. I reverse-engineered the "Gold Answer" (4.61) and found that it corresponds to a specific physical setup in which the handwritten digit `4` was misread as `k` and a strikethrough was interpreted as a deletion. *See the forensic reconstruction:* https://i.postimg.cc/nhfV2hY9/image2.png

*Exhibit B: The Visual Counterfeit (Math)* In a complex projective space problem, the benchmark penalizes the correct formula because the transcriber likely misread `(n+1)(n+1)` (Rank × Dimension) as `(n+1)^(n+1)` due to slanted handwriting. *See the visual comparison:* https://i.postimg.cc/6TJKMMZR/image3.png
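
Note the two readings even coincide at n = 1 (both give 4), which may be how the misreading survived spot checks; they diverge violently after that. A quick arithmetic check:

    # The intended reading vs. the misread one.
    for n in range(1, 6):
        intended = (n + 1) * (n + 1)   # (n+1)(n+1), rank x dimension
        misread = (n + 1) ** (n + 1)   # (n+1)^(n+1), the OCR reading
        print(n, intended, misread)
    # n=1: 4 vs 4 is the only agreement; by n=3 it is 16 vs 256.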

*Conclusion:* Because of these errors, valid reasoning from models is being assigned a zero score. We are seemingly optimizing for typo-compatibility, not intelligence.
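
To make the failure mode concrete, here is a toy grader (a sketch, not any specific harness): a tolerance check rescues formatting differences, but nothing rescues a gold key that itself encodes the typo.

    def grade_exact(answer: str, gold: str) -> int:
        """Exact string match, the strictest common grading rule."""
        return int(answer.strip() == gold.strip())

    def grade_numeric(answer: str, gold: str, tol: float = 1e-2) -> int:
        """Numeric comparison with tolerance; falls back to string match."""
        try:
            return int(abs(float(answer) - float(gold)) <= tol)
        except ValueError:
            return grade_exact(answer, gold)

    print(grade_exact("4.610", "4.61"))    # 0: penalizes formatting alone
    print(grade_numeric("4.610", "4.61"))  # 1: tolerance fixes formatting
    print(grade_numeric("3.98", "4.61"))   # 0: a corrupted gold key fails everything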

Full PDF is on Zenodo (linked above). Verification code (~139 scripts) will be open-sourced once I sanitize the repo (having some git access issues atm). Happy to answer questions.

This feels a bit like a bombshell given other recent work on emergent misalignment. How long have we been lying to models?

This is a deeply unsettling thought. I hope this work gets the visibility it deserves. We truly have no idea how many resources have been wasted here.


