Author here.

I am an independent researcher (originally from a med background, later moved to CS/Physics). I spent the last few weeks manually grading GPQA-Diamond and Humanity's Last Exam (HLE) because my experimental models (DeepSeek-Overclock) were deriving "wrong" answers that looked logically sound.

I conducted a forensic audit of the datasets. I suspect these benchmarks are currently "gaslighting" foundation models.

*Findings:*

* GPQA-Diamond: Inherent error lower bound *26.8%*.
* HLE (Sampled): Inherent error lower bound *~58%*.

Visual Summary of Error Rates: https://i.postimg.cc/nV5hskX2/image1.png
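
On the "lower bound" framing: these figures come from manually grading samples, and one standard way to turn such an audit into a conservative bound is a one-sided Clopper-Pearson limit. A minimal sketch of that kind of calculation (not necessarily the exact method in the PDF, and the counts below are placeholders, not my actual audit numbers):

    # One-sided Clopper-Pearson lower bound on a benchmark's true error
    # rate, given a manual audit of n questions with `bad` confirmed errors.
    # Placeholder counts, not the audit numbers behind the figures above.
    from scipy.stats import beta

    def error_lower_bound(bad: int, n: int, confidence: float = 0.95) -> float:
        """Lower confidence limit for the fraction of broken questions."""
        if bad == 0:
            return 0.0
        return float(beta.ppf(1 - confidence, bad, n - bad + 1))

    # Hypothetical: 60 confirmed-broken questions in a 198-question set.
    print(f"{error_lower_bound(60, 198):.1%}")  # around 25% (illustrative)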

The most shocking finding is in *HLE*, which appears to be riddled with OCR errors from handwritten content rather than genuinely "hard" problems. I reverse-engineered these errors by treating each gold answer like a cryptographic hash and searching for the original question that produces it.
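
Mechanically, the search looks roughly like the sketch below (a minimal illustration, not my actual verification scripts; `solve` is a hypothetical per-problem solver):

    # Sketch of the "gold answer as hash" search: generate OCR-plausible
    # readings of the damaged text, solve each, and accept a reading if it
    # reproduces the published gold answer.
    from itertools import product

    # A few confusion pairs typical of OCR on handwriting (illustrative).
    CONFUSABLE = {"k": "k4", "4": "4k", "l": "l1", "1": "1l", "O": "O0", "0": "0O"}

    def readings(text):
        """Yield every variant obtained by swapping confusable glyphs.
        In practice you restrict this to the few suspect tokens, or the
        search space explodes combinatorially."""
        options = [CONFUSABLE.get(ch, ch) for ch in text]
        for combo in product(*options):
            yield "".join(combo)

    def invert_gold(text, gold, solve, tol=1e-2):
        """Return the first reading whose computed answer matches `gold`.
        `solve` maps a question string to a float, or None if unsolvable."""
        for variant in readings(text):
            ans = solve(variant)
            if ans is not None and abs(ans - gold) <= tol:
                return variant
        return None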

*Exhibit A: The "Phantom Parameter" (Physics)* In a lattice adsorption problem (`66fecb...`), the text is broken. I reverse-engineered the "Gold Answer" (4.61) and found that it corresponds to a specific physical setup in which the handwritten digit `4` was misread as `k` and a strikethrough was interpreted as a deletion. *See the forensic reconstruction:* https://i.postimg.cc/nhfV2hY9/image2.png

*Exhibit B: The Visual Counterfeit (Math)* In a complex projective space problem, the benchmark penalizes the correct formula because the transcriber likely misread `(n+1)(n+1)` (Rank × Dimension) as `(n+1)^(n+1)` due to slanted handwriting. *See the visual comparison:* https://i.postimg.cc/6TJKMMZR/image3.png
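
Note the two readings even coincide at n = 1 (both give 4), which may be how the misreading survived spot checks; they diverge violently after that. A quick arithmetic check:

    # The intended reading vs. the misread one.
    for n in range(1, 6):
        intended = (n + 1) * (n + 1)   # (n+1)(n+1), rank x dimension
        misread = (n + 1) ** (n + 1)   # (n+1)^(n+1), the OCR reading
        print(n, intended, misread)
    # n=1: 4 vs 4 is the only agreement; by n=3 it is 16 vs 256.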

*Conclusion:* Because of these errors, valid reasoning from models is being assigned a zero score. We are seemingly optimizing for typo-compatibility, not intelligence.
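
To make the failure mode concrete, here is a toy grader (a sketch, not any specific harness): a tolerance check rescues formatting differences, but nothing rescues a gold key that itself encodes the typo.

    def grade_exact(answer: str, gold: str) -> int:
        """Exact string match, the strictest common grading rule."""
        return int(answer.strip() == gold.strip())

    def grade_numeric(answer: str, gold: str, tol: float = 1e-2) -> int:
        """Numeric comparison with tolerance; falls back to string match."""
        try:
            return int(abs(float(answer) - float(gold)) <= tol)
        except ValueError:
            return grade_exact(answer, gold)

    print(grade_exact("4.610", "4.61"))    # 0: penalizes formatting alone
    print(grade_numeric("4.610", "4.61"))  # 1: tolerance fixes formatting
    print(grade_numeric("3.98", "4.61"))   # 0: a corrupted gold key fails everything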

Full PDF is on Zenodo (linked above). Verification code (~139 scripts) will be open-sourced once I sanitize the repo (having some git access issues atm). Happy to answer questions.

This feels a bit like a bombshell given other recent work on emergent misalignment. How long have we been lying to models?

This is a deeply unsettling thought. I hope this work gets the visibility it deserves. We truly have no idea how many resources have been wasted here.


