
Something that surprised me is how good GPT-4 is at extracting tables from unstructured data, or data with weird or messed-up formatting.

For example, I copy-pasted some text from an HTML page where each table cell ended up on its own line, like so:

    Enable feature?
    Yes
    Expand things?
    No
This could have been fixed with a multi-line regex, except that the lines were interleaved with headings that messed up the key-value pairings.

I fed this into GPT-4 and it correctly surmised which rows were headings, which were keys, and which were values. This is an easy task for a human, but a shocking thing to see a computer solve without years of programming effort dedicated to this specific problem!
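For anyone who wants to reproduce this programmatically rather than through the chat UI, here's a minimal sketch using the OpenAI Python client (the prompt wording and output format are my own choices, nothing special about them):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # The copy-pasted mess: one table cell per line, headings interleaved
    flattened = "Enable feature?\nYes\nExpand things?\nNo"

    prompt = (
        "The text below is a table that got flattened to one line per cell, "
        "with section headings mixed in. Reconstruct it as JSON: a list of "
        "sections, each with a heading and a list of {key, value} pairs.\n\n"
        + flattened
    )

    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(resp.choices[0].message.content)

Asking for JSON rather than free-form prose is the main trick; it forces the "is this row a heading, a key, or a value" decision to be explicit and easy to check.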

You've got to wonder how many of these single-purpose AI models like Table Transformer are going to be subsumed into LLMs. For example, Table Transformer comes with a bunch of labelled training data. Just point the vision-capable variant of GPT-4 at that training data to make a fine-tuned version. That should outperform a "small" model because it both understands what it's seeing and has been tuned on the special-purpose data set.

What I'm saying is: if you have a terabyte of task-specific training data and train a model on it from scratch, that's fine, but you end up with a model holding "1 TB of knowledge" at most. If you instead start with a pre-trained LLM that has petabytes of knowledge crammed into it and add that terabyte, you get the benefits of both, and the petabyte of general knowledge vastly outweighs the extra terabyte!

With LLMs being quantized down to just a few gigabytes and able to run on mobile devices, I wonder if this is what the future of AI will look like. No more training models from scratch...



Indeed! I built a system just last year with - count 'em - three parsers to deal with PDF table extraction, including one built on TableTransformer. And then when GPT-4 came out I just copy-pasted a PDF into it as-is, and darned if it didn't do at least as good a job.

Now I can't do this in earnest because of document privacy issues, but I've been diving down the rabbit hole of how small we can go and still get decent results. Spoiler: GPT-2 is too small. :-)
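For anyone who hasn't tried it, the TableTransformer side is easy to reproduce from Hugging Face; a rough sketch of the detection step on a rendered page image ("page.png" and the 0.7 threshold are just placeholders):

    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, TableTransformerForObjectDetection

    processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
    model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

    image = Image.open("page.png").convert("RGB")  # a rendered PDF page
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Convert raw logits into bounding boxes in image coordinates
    target_sizes = torch.tensor([image.size[::-1]])
    detections = processor.post_process_object_detection(
        outputs, threshold=0.7, target_sizes=target_sizes
    )[0]
    for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
        print(model.config.id2label[label.item()], round(score.item(), 2), box.tolist())

That only gets you the table bounding boxes, though; you still need structure recognition and text extraction on each crop, which is where pipelines like this end up needing the extra parsers.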


If you were asked to extract lists or tables from HTML pages only, how would you go about it?

I was thinking: a) use the detection approach from TableTransformer to find the structured data, b) use the MarkupLM model, maybe combined with TableTransformer, or c) find a way to work directly with GPT-4.
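For (c), I imagine the starting point is something like the sketch below: strip the page down to text and ask GPT-4 for CSV back (the prompt, model name, and BeautifulSoup pre-cleaning are just my assumptions):

    from bs4 import BeautifulSoup
    from openai import OpenAI

    client = OpenAI()

    def extract_tables(html: str) -> str:
        # Drop scripts and styles so the prompt stays small
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()
        text = soup.get_text("\n", strip=True)

        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Extract every table or list from the following page "
                           "text and return each one as CSV, separated by blank "
                           "lines:\n\n" + text,
            }],
            temperature=0,
        )
        return resp.choices[0].message.content

But I have no feel for how that compares to (a) or (b) on cost and accuracy, hence the question.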



