Video streams are generally compressed in blocks called groups of pictures (GOPs) as well. That's how seeking in a video stream works: jump to the start of the GOP containing the desired timestamp, start decoding from there, and display frames once the target timestamp is reached.
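That seek logic can be sketched as a toy simulation (the `Frame`/`seek` names here are illustrative, not any real decoder API — in a real container the keyframe positions come from the index):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    ts: float        # presentation timestamp
    keyframe: bool   # True at the start of each GOP

def seek(frames, target_ts):
    """Return (decode_from, display_from): decoding must start at the
    keyframe, but frames before the target are decoded and discarded."""
    # 1. Find the last keyframe at or before the target timestamp.
    start = 0
    for i, f in enumerate(frames):
        if f.keyframe and f.ts <= target_ts:
            start = i
    # 2. Decode forward from that keyframe until the target is reached.
    for i in range(start, len(frames)):
        if frames[i].ts >= target_ts:
            return start, i
    return start, len(frames) - 1

# 30 frames, one per second, with a keyframe every 10 frames (GOP size 10)
frames = [Frame(ts=t, keyframe=(t % 10 == 0)) for t in range(30)]
print(seek(frames, 17))  # decode from frame 10, display from frame 17
```

The frames between the keyframe and the target still get decoded (they're needed as prediction references), they just never get shown — which is why seeking to a timestamp deep inside a long GOP is slower than seeking to one right after a keyframe.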
Well, that was kind of my point: why assume the above poster would go for the most naive possible implementation, when thinking about it for more than a minute yields several obvious solutions, and there are known implementations that have already solved the problem?
zstd also has some fun "dictionary" operations: even if you're chunking your data, you can still take advantage of cross-chunk redundancy by "training" a shared dictionary across all your chunks before the compression stage.
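The real thing is `zstd --train` (or `train_dictionary()` in the `zstandard` Python package), which mines common substrings across all the sample chunks. As a dependency-free sketch of the same idea, zlib's preset-dictionary support shows how a shared dictionary recovers redundancy that per-chunk compression can't see — here I just naively use one representative chunk as the "dictionary" instead of a trained one:

```python
import zlib

chunks = [b'{"user": "alice", "action": "login"}',
          b'{"user": "bob", "action": "logout"}',
          b'{"user": "carol", "action": "login"}']

# Naive stand-in for training: use one representative sample as the
# dictionary; zstd's trainer would instead build it from all samples.
dictionary = chunks[0]

def compress(chunk, zdict=b""):
    c = zlib.compressobj(zdict=zdict)
    return c.compress(chunk) + c.flush()

plain = sum(len(compress(c)) for c in chunks)
shared = sum(len(compress(c, dictionary)) for c in chunks)
print(plain, shared)  # the dictionary-backed total comes out smaller
```

The decompressor needs the same dictionary (`zlib.decompressobj(zdict=dictionary)`), so the dictionary effectively becomes part of your storage format — same deal with zstd's trained dictionaries.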