The complexity of many file formats is staggering and I try (usually with no success) to convince people to rely only on simpler formats that can be audited easily.
Another example: I seem to recall a case a while back where Word change-tracking contained sensitive information even though “the document” didn’t seem to.
Sadly it used to be easy to rely on plain text implicitly but then we made Unicode so complicated that you have to be careful there too. Fingerprinting through zero-width sequences, etc. is essentially unnoticeable if you do not actively look for it.
Word docs used to contain a UUID of the variety that included the MAC address of the machine that created it, allowing a doc to be traced back to the machine on which it was created.
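Added example, not part of the thread: one way to actively look for the zero-width sequences mentioned above is to list every run of invisible code points in a text and check whether two copies differ only in those characters. The function names, the extras list and both sample strings are constructed for this illustration.

```python
import unicodedata

# Assumed extras: code points outside category Cf that also render as nothing
# (combining grapheme joiner, variation selectors). Extend to suit your audit.
EXTRA_INVISIBLE = {0x034F} | set(range(0xFE00, 0xFE10))

def is_invisible(ch: str) -> bool:
    # Category Cf (format) covers ZWSP, ZWNJ, ZWJ, word joiner, BOM,
    # soft hyphen and the bidi controls. Some scripts and emoji use these
    # legitimately, so a hit is a prompt for review, not proof of marking.
    return unicodedata.category(ch) == "Cf" or ord(ch) in EXTRA_INVISIBLE

def invisible_runs(text: str):
    """Return [(offset, ["U+200B", ...]), ...], one entry per run."""
    runs, current = [], None
    for i, ch in enumerate(text):
        if is_invisible(ch):
            if current is None:
                current = (i, [])
                runs.append(current)
            current[1].append(f"U+{ord(ch):04X}")
        else:
            current = None
    return runs

def strip_invisible(text: str) -> str:
    return "".join(ch for ch in text if not is_invisible(ch))

def same_visible_text(a: str, b: str) -> bool:
    """True when two copies differ only by invisible characters."""
    return a != b and strip_invisible(a) == strip_invisible(b)

# Constructed samples: the same sentence, the second with zero-width
# characters inserted between words.
CLEAN = "Quarterly numbers are attached."
MARKED = "Quarterly\u200b\u200c numbers\u200d are\u2060 attached.\ufeff"

if __name__ == "__main__":
    for offset, cps in invisible_runs(MARKED):
        names = ", ".join(unicodedata.name(chr(int(c[2:], 16)), "UNNAMED")
                          for c in cps)
        print(f"offset {offset}: {' '.join(cps)} ({names})")
    print("differs only invisibly:", same_visible_text(CLEAN, MARKED))
```
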