Hacker News | alamb's comments

This demo extends a Parquet file by embedding a Tantivy full-text search index inside it. A custom DataFusion TableProvider implementation uses the embedded full-text index to optimize wildcard LIKE predicates.
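For anyone curious what this looks like mechanically, here is a minimal sketch of the write side, assuming a recent version of the Rust `parquet` crate. The metadata key name and the offset value are made up for illustration, and the actual index serialization is not shown: the idea is simply that the index bytes live inside the file and a footer key/value entry records where to find them, so readers that don't know about the index ignore the extra bytes.

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, StringArray};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::format::KeyValue;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Some text data that would also be fed to the full-text indexer
        let batch = RecordBatch::try_from_iter([(
            "body",
            Arc::new(StringArray::from(vec!["hello world", "full text search"])) as ArrayRef,
        )])?;

        let file = File::create("/tmp/indexed.parquet")?;
        let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
        writer.write(&batch)?;

        // Hypothetical: the serialized index bytes would be written into the file
        // ahead of the footer; here we only record a (made up) pointer to them as
        // footer key/value metadata, which readers unaware of the index ignore
        writer.append_key_value_metadata(KeyValue::new(
            "example_fulltext_index_offset".to_string(),
            "12345".to_string(),
        ));

        writer.close()?;
        Ok(())
    }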


> Note that the readers of Parquet need to be aware of any metadata to exploit it. But if not, nothing changes

The one downside of this approach, which is likely obvious but which I haven't seen mentioned, is that the resulting Parquet files are larger than they would be otherwise, and the increased size only benefits engines that know how to interpret the new index.

(I am an author)


> That is, start with Wild West and define specs as needed

Yes, this is my personal hope as well -- if new index types become widespread, they can be formally incorporated into the spec.

However, changing the spec is a non-trivial process and requires significant consensus and engineering.

Thus the methods described in the blog can be used to take advantage of indexes prior to any spec change, and potentially as a way to prototype / prove out new potential indexes.

(note I am an author)


We are actively working on supporting extension types. The mechanism will likely be the Arrow extension type mechanism (a logical annotation on top of existing Arrow types: https://arrow.apache.org/docs/format/Columnar.html#format-me...)

I expect this to be used to support the Variant type (https://github.com/apache/datafusion/issues/16116) as well as geometry types.
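To make the annotation concrete, here is a minimal sketch of what an extension type looks like in arrow-rs: it is an ordinary Arrow type plus two well-known metadata keys on the field, so engines that don't recognize the annotation still see the underlying storage type (the "geoarrow.wkb" name is just illustrative).

    use std::collections::HashMap;

    use arrow::datatypes::{DataType, Field};

    fn main() {
        // The storage type is plain Binary; the extension name / metadata
        // ride along as field-level key/value metadata
        let field = Field::new("location", DataType::Binary, true).with_metadata(HashMap::from([
            ("ARROW:extension:name".to_string(), "geoarrow.wkb".to_string()),
            ("ARROW:extension:metadata".to_string(), "{}".to_string()),
        ]));

        println!("{:#?}", field);
    }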

(note I am an author)



Specifically, DataFusion is faster when querying Parquet directly.

Most of the ClickBench leaderboard entries use database-specific file formats (that you first have to load the data into).


I think you would pick DataFusion over DuckDB if you want to customize it substantially. Not just with user defined functions (which are quite easy to write in DataFusion and are very fast -- see the sketch below), but things like:

* custom file formats (e.g. Spiral or Lance)
* custom query languages / SQL dialects
* custom catalogs (e.g. other than a local file or the prebuilt DuckDB connectors)
* custom indexes (read only parts of Parquet files based on extra information you store)
* etc.
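For example, here is a minimal sketch of a scalar UDF. The exact `create_udf` signature has shifted a bit across DataFusion versions, so treat this as illustrative of a recent release rather than a definitive recipe:

    use std::sync::Arc;

    use datafusion::arrow::array::{ArrayRef, StringArray};
    use datafusion::arrow::datatypes::DataType;
    use datafusion::common::cast::as_string_array;
    use datafusion::error::Result;
    use datafusion::logical_expr::{create_udf, ColumnarValue, Volatility};
    use datafusion::prelude::SessionContext;

    // A scalar UDF that upper-cases its string argument
    fn shout_impl(args: &[ColumnarValue]) -> Result<ColumnarValue> {
        let arrays = ColumnarValue::values_to_arrays(args)?;
        let input = as_string_array(&arrays[0])?;
        let output: StringArray = input.iter().map(|v| v.map(|s| s.to_uppercase())).collect();
        Ok(ColumnarValue::Array(Arc::new(output) as ArrayRef))
    }

    #[tokio::main]
    async fn main() -> Result<()> {
        let shout = create_udf(
            "shout",
            vec![DataType::Utf8],
            DataType::Utf8,
            Volatility::Immutable,
            Arc::new(shout_impl),
        );

        let ctx = SessionContext::new();
        ctx.register_udf(shout);
        ctx.sql("SELECT shout('datafusion')").await?.show().await?;
        Ok(())
    }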

If you are looking for the nicest "run SQL on local files" experience, DuckDB is pretty hard to beat

Disclaimer: I am the PMC chair of DataFusion

There are some other interesting FAQs here too: https://datafusion.apache.org/user-guide/faq.html


BTW here is a fun exercise that takes this idea to the extreme: who can build a custom file format that gets the best ClickBench performance (on DataFusion)?

https://github.com/apache/datafusion/issues/13448

Disclaimer: I am on the PMC of Apache DataFusion, so I am totally a fanboy.


In general, if you can partition (sort) your datasets on your predicate column, sorting is likely the best option.

For example, when you have a predicate like `where id = 'fdhah-4311-ddsdd-222aa'`, sorting on the `id` column will help.

However, if you have predicates on multiple different sets of columns, such as another query on `state = 'MA'`, you can't pick an ideal sort order for all of them.

People often partition (sort) on the low-cardinality columns first, as that tends to improve compression significantly.
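If it helps, here is a minimal sketch of one way to do that re-sort with DataFusion. The table and column names ("events", "state", "id") are hypothetical, and it assumes DataFusion's `COPY ... TO ... STORED AS PARQUET` SQL support:

    use datafusion::error::Result;
    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();
        ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default())
            .await?;

        // Rewrite the data ordered by the low-cardinality column first, then by id,
        // so row group / page min-max statistics become selective for both predicates
        ctx.sql(
            "COPY (SELECT * FROM events ORDER BY state, id) \
             TO 'events_sorted.parquet' STORED AS PARQUET",
        )
        .await?
        .collect()
        .await?;

        Ok(())
    }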


It would be amazing if the code for working with Arrow on GPUs could be made open source -- I think that would drive a significant amount of adoption.

