This demo extends a Parquet file by embedding a Tantivy full-text search index inside it. A custom DataFusion TableProvider implementation uses the embedded full-text index to optimize wildcard LIKE predicates.
> Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes
The one downside of this approach, which is likely obvious, but I haven't seen mentioned is that the resulting parquet files are larger than they would be otherwise, and the increased size only benefits engines that know how to interpret the new index
> That is, start with Wild West and define specs as needed
Yes this is my personal hope as well -- if there are new index types that are widespread, they can be incorporated formally into the spec
However, changing the spec is a non trivial process and requires significant consensus and engineering
Thus the methods used in the blog can be used to use indexes prior to any spec change and potentially as a way to prototype / prove out new potential indexes
I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user defined functions (which are quite easy to write in DataFusion and are very fast), but things like
* custom file formats (e.g. Spiral or Lance)
* custom query languages / sql dialects
* custom catalogs (e.g. other than a local file or prebuilt duckdb connectors)
* custom indexes (read only parts of parquet files based on extra information you store)
* etc.
If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat
BTW here is a fun exercise that takes this idea to the extreme. Who can build a custom file format that gets the best ClickHouse performance (on DataFusion):
In general, if you can partition your datasets on your predicate column, sorting is likely the best option
For example when you have a predicate like, `where id = 'fdhah-4311-ddsdd-222aa'` sorting on the `id` column will help
However, if you have predicates on multiple different sets of columns, such as another query on `state = 'MA'`, you can't pick an ideal sort order for all of them.
People often partition (sort) on the low cardinality columns first as that tends to improve compression signficantly