
The Xerox copier incident mentioned was not an OCR failure; rather, the copier altered the numbers in the scanned image because of its image compression algorithm.


Here's some of the context: www.dkriesel.com/blog/2013/0810_xerox_investigating_latest_mangling_test_findings

Learn More: https://www.dkriesel.com/start?do=search&id=en%3Aperson&q=Xe...

Brief: Xerox machines used template matching to recycle the scanned images of individual digits that recur in the document. In 2013, Kriesel discovered this procedure was faulty.

Rationale: This method can create smaller PDFs, advantageous for customers that scan and archive numerical documents.

Prior art: https://link.springer.com/chapter/10.1007/3-540-19036-8_22

Tech Problem: Xerox's template matching procedure was not reliable, sometimes "papering over" a digit with the wrong digit!

PR Problem: Xerox press releases initially claimed this issue did not happen in the factory default mode. Kriesel showed this was not true by replicating the issue in all of the factory default compression modes, including the "normal" mode. He gave a 2015 FrOSCon talk, "Lies, damned lies and scans".

Interesting work!
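
To make the failure mode concrete, here is a minimal sketch of symbol-dictionary compression with a too-loose match threshold. This is not Xerox's actual implementation; the 3x5 glyph bitmaps, the Hamming-distance test, and the names `encode`, `decode`, and `max_diff` are all assumptions made up for illustration.

```python
# Hypothetical 3x5 glyph bitmaps ('#' = ink). SIX and EIGHT differ by one pixel.
SIX = ("###", "#..", "###", "#.#", "###")
EIGHT = ("###", "#.#", "###", "#.#", "###")
ONE = ("..#", "..#", "..#", "..#", "..#")


def hamming(a, b):
    """Count differing pixels between two equally sized glyph bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(glyphs, max_diff):
    """Store each distinct glyph once; reuse a stored template if it is 'close enough'."""
    templates, indices = [], []
    for glyph in glyphs:
        for i, template in enumerate(templates):
            if hamming(glyph, template) <= max_diff:
                indices.append(i)  # reuse an existing template
                break
        else:
            templates.append(glyph)  # no match: store a new template
            indices.append(len(templates) - 1)
    return templates, indices


def decode(templates, indices):
    """Rebuild the page by pasting the stored template at each position."""
    return [templates[i] for i in indices]


page = [SIX, EIGHT, ONE, EIGHT]

# Exact matching only: the page survives the round trip.
exact = decode(*encode(page, max_diff=0))

# Tolerating one differing pixel: every EIGHT is replaced by the stored SIX.
loose = decode(*encode(page, max_diff=1))
```

With `max_diff=1` the dictionary shrinks from three templates to two, which is the size win, but the decoded page now contains a clean, sharp, wrong digit rather than a visibly degraded one.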


Any lossy compressor changes the original image to achieve better compression at the expense of perfect accuracy.


Exactly. In practice the alternatives are blocky artifacts (JPEG and most other traditional codecs), blurring everything (learned codecs optimised for MSE), or "hallucinating" patterns when using models like GANs. However, even the generative side of compression models is evaluated against the original image rather than only on output quality, so the outputs tend to be passable.

To see what a lossy generator hallucinating patterns means in practice, I recommend viewing HiFiC vs original here: https://hific.github.io/


Traditional lossy compressors have well-understood artifacts. In particular, they provide guarantees such that you can confidently say that an object in the image could not be an artifact.


The word "perfect" is misplaced; the trade-off is size vs. fidelity (aka accuracy).
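
A toy illustration of that trade-off, assuming a plain uniform quantizer on a synthetic 8-bit ramp (real codecs are far more elaborate; `quantize`, `tradeoff`, and the sample data are made up for this sketch). Larger steps need fewer bits per sample, and the worst-case error stays bounded by half the step, which is the kind of guarantee a simple scheme can offer.

```python
import math


def quantize(samples, step):
    """Lossy step: map each sample to the midpoint of its bin of width `step`."""
    return [(s // step) * step + step // 2 for s in samples]


def tradeoff(samples, step):
    """Return (bits per sample for a fixed-length code, worst-case absolute error)."""
    decoded = quantize(samples, step)
    levels = len(set(decoded))
    bits = math.ceil(math.log2(levels)) if levels > 1 else 0
    max_err = max(abs(a - b) for a, b in zip(samples, decoded))
    return bits, max_err


samples = list(range(256))  # synthetic 8-bit ramp, an assumption for illustration

for step in (1, 4, 16, 64):
    bits, max_err = tradeoff(samples, step)
    print(f"step={step:2d}  bits/sample={bits}  max_error={max_err}")
```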



