Brotli-G: A GPU compression/decompression standard for digital assets

peter_d_sherman · on Nov 22, 2022

There's an interesting side point here: There exist (and there will be more of these in the future) data compression algorithms which are in general, too slow for specific software use-cases. (related: https://superuser.com/questions/263335/compression-software-...).

The thing is -- they typically run too slow for their intended applications on current, consumer-grade CPU's...

But, could some of them be optimized to take advantage of GPUs (as Brotli is here) -- and would that then increase their performance to a level such that applications which previously could not use them because the algorithm previously took so long -- can now make use of them IF the software end-user has the proper GPU?

?

I think there's a huge amount of possibilities here...

Especially when you get to compression algorithms that include somewhat esoteric stuff, like Fourier Transforms, Wavelet Transforms -- and other weird and esoteric math algorithms both known and yet-to-be-discovered...

In other words, we've went far beyond Huffman and Lempel-Ziv for compression when we're in this territory...

(In fact, there should be a field of study... the confluence/intersection of all GPUs and all known compression algorithms... yes, I know... something like that probably already exists(!)... but I'm just thinking aloud here! <g>)

In conclusion, I think there's a huge amount of interesting future possibilities in this area...

miohtama · on Nov 22, 2022

See also Blosc, with their tag line ”faster than memcpy”

https://www.blosc.org/pages/blosc-in-depth/

Blosch it optimised to have the work buffers that fit into L1 cache, so it can outperform memcpy for certain workflows, e.g. numeric arrays. Because the bottleneck is not CPU, but RAM and slower caches.

peter_d_sherman · on Nov 23, 2022

Excellent link!

I had not heard of blosc before -- but that looks pretty cool!

the-alchemist · on Nov 22, 2022

Actually, there is a huge amount of work in this area. Take a look at https://encode.su/forums/2-Data-Compression

MuffinFlavored · on Nov 22, 2022

How do you get around the performance hit of having to buffer from disk/network -> CPU -> RAM -> GPU and back or whatever?

jessermeyer · on Nov 22, 2022

There is always a minimum cost of moving data from one place to another. If you're computing on the GPU, the data must arrive there. The problem is that PCIE bandwidth is often a bottleneck, and so if you can upload compressed data then you essentially get a free multiplier of bandwidth based on the compression ratio. If the decompression time is faster than having sent the full uncompressed dataset, then you win.

But yeah, direct IO to the GPU would be great but that's not feasible right now.

peter_d_sherman · on Nov 23, 2022

>The problem is that PCIE bandwidth is often a bottleneck, and so if you can upload compressed data then you essentially get a free multiplier of bandwidth based on the compression ratio.

Agreed! The history of computers is sort of like, at any given point of historical time, there's always a bottleneck somewhere...

It's either with the speed of a historical CPU running a specific algorithm, with a historical type of RAM, with a historical storage subsystem, or with a historical type of bus or I/O device... once one is fixed by whatever novel method or upgrade -- then we always invariably run into another bottleneck! <g>

>But yeah, direct IO to the GPU would be great but that's not feasible right now.

Agreed! For consumers, a "direct direct" (for lack of better terminology!) CPU-to-GPU completely dedicated I/O path (as opposed to the use of PCIe as an intermediary) isn't (to the best of my knowledge) generally available at this point in time...

If we are looking towards the future, and/or the super high end business/workstation market, then we might wish to consider checking out Nvidia's Grace (Hopper) CPU architecture: https://www.nvidia.com/en-us/data-center/grace-cpu/

>"The fourth-generation NVIDIA NVLink-C2C delivers 900 gigabytes per second (GB/s) of bidirectional bandwidth between the NVIDIA Grace CPU and NVIDIA GPUs."

Or, we could check out the Cerebras WSE-2:

https://www.cerebras.net/product-chip/

>"Unlike traditional devices, in which the working cache memory is tiny, the WSE-2 takes 40GB of super-fast on-chip SRAM and spreads it evenly across the entire surface of the chip. This gives every core single-clock-cycle access to fast memory at extremely high bandwidth – 20 PB/s. This is 1,000x more capacity and 9,800x greater bandwidth than the leading GPU."

Unfortunately it's (again, to the best of my limited knowledge!) not available for the consumer market at this point in time! (Boy, that would be great as a $200 plug-in card for consumer PC's, wouldn't it? -- but I'm guessing it might take 10 years (or more!) for that to happen!)

I'm guessing in 20+ years we'll have unlimited bandwidth, infinitely low latency optical fiber interconnects everywhere... we can only dream, right? <g>

peter_d_sherman · on Nov 23, 2022

I didn't know about those forums! Awesome link!

eklitzke · on Nov 22, 2022

All the things you're talking about are widely used in audio and video codecs, which typically do have hardware acceleration support.

peter_d_sherman · on Nov 24, 2022

True!

But, there's a huge difference in say, the hardware available to a well-funded Silicon Valley company in the richest place in the richest country in the world, and hardware that a poor person in Sub-Saharan Africa or Rural India might be able to scrap together (if they can even acquire or afford electricity and an Internet connection...)

Around the world, globally -- some people might not have access to hardware which is accelerated enough (much less a GPU!) -- to run the latest and greatest algorithms or the applications that depend on those algorithms...

Conversely, in places like Silicon Valley -- you might have all sorts of new-fangled software running on (and requiring!) super high-end hardware -- that just isn't available or affordable for even the middle class in America!

The democratization of information (on the one hand!) requires that computers, no matter how old, no matter how incapable or unaccelerated (i.e., no GPU) -- run the 'best in class' algorithm -- for whatever their hardware is capable of.

But progressivism, at least in the technological sphere, that is, advancing technology, making progress, "pushing the envelope", "working at the cutting edge", progressivism -- requires the opposite! It requires that we run the absolute newest, latest, most experimental, most "bleeding edge" algorithms on the newest, latest, most experimental hardware!

What emerges from these two countervailing goals -- is a matrix.

A matrix of "hardware capability" X and "best in class algorithm for" Y.

For any given algorithm/application.

If we were to go through the process (er, pain, er process! <g>) to create such a matrix, such a table... then we'd see (much like with incomplete historical snapshots of the Periodic Table of elements) that there are missing / "not well-defined" entries!

We'd also see that there are yet additional combinations, er, elements, er, combinations <g> -- to be discovered in the future!

So, there you have it... (technological!) progressivism vs. (technological) democratization of information (AKA, "equity" as applied to information).

If both countervailing goals are to be pursued (and I claim both are equally virtuous!) -- then what we need is matrices of what is currently "best in class" for any given hardware/algorithmic combination!

And... we might have a whole lot of great hardware and algorithms -- but I strongly believe there is still a lot to be discovered (and improved upon!)...

But yes, in general, I agree that many audio and video codecs do have hardware acceleration support!

corysama · on Nov 22, 2022

There is a new interest in GPU-side general-purpose decompression due to Microsoft pushing DirectStorage.

https://devblogs.microsoft.com/directx/directstorage-1-1-com...

nevi-me · on Nov 22, 2022

To digress and talk about DirectStorage, reading their documentation and announcements still leaves me with an unanswered question. Maybe someone knows the answer.

Does DirectStorage only work for games, or can one use it for compute workloads?

Context: I've been learning some basic GPU programming (via rust-gpu though, not CUDA), and one of the things that sound easy to implement are offloading compute kernels to the GPU (e.g. arrow-rs).

Being able to load datasets via DirectStorage could be great, but as I'm still really learning the basics, I can't figure out whether I could leverage this for my work/learning.

mmozeiko · on Nov 22, 2022

You can use it for any workload that can use D3D12 buffers or textures as input. All the API does for you is transfer data from disk to ID3D12Resource object. After that it is up to you to do whatever you want - use it for fragment shader or compute shader input, etc.. If you use other api's like cuda or vulkan, then you'll need to use interop to create their resources using D3D12 backed resource (or do a copy, whatever is possible there).

telendram · on Nov 22, 2022

Interesting, nVidia selected to go with zstd instead: https://developer.nvidia.com/nvcomp

rektide · on Nov 22, 2022

Im interested to see what image format specific compressors go gpu. JpegXL, AVIF, WebP... who wants to show up & throw down? Or even just fastpng?

Meanwhile we dont really hear or regard many of the gpu-oriented compression techs. TIL Basis/KTX2 is itself zstd compressed (formerly LZ apparently?). https://github.com/BinomialLLC/basis_universal

darrinm · on Nov 22, 2022

Benchmarks?

pstuart · on Nov 22, 2022

It requires the source asset to be compressed using brotli-g as well, so it's dependent on the hosting servers to be of any value.

the-alchemist · on Nov 22, 2022

I don't think so, if I read the article correctly.

> Existing optimized Brotli decompression functions (CPU implementations) should be able to decompress the Brotli-G bitstream, while more optimal data-parallel implementations on hosts or accelerators can further improve performance.

pstuart · on Nov 22, 2022

Pre-coffee non-clarity on my part. I was referring to this part:

> One thing for developers to note is that assets that have already been compressed with Brotli cannot be decompressed with Brotli-G decompressor implementations

efortis · on Nov 23, 2022

Static content can be uploaded already compressed. For example, using `brotli_static` in Nginx you can upload `foo.js` and `foo.js.br`

ttoinou · on Nov 22, 2022

Anyone knows if there is a way to tune this algorithm to be lossy ?

miohtama · on Nov 23, 2022

Brotli is based on Huffman and LZ99 generic data compression algorithms.

Any lossy compression step needs to come before Brotli.

Brotli alone makes little sense for image compression, but is ideal for binary files.