It really is not that hard or time-consuming to write a CSV sanitization tool, one that actually manages to recover data from many kinds of bad (but consistent) formatting. I'd say my team spent around 6 man-weeks on it.
The real cost is that the fight is never over. Whenever a new customer comes in with a CSV so badly formatted that it is rejected by your system, further analysis reveals that there's actually some sanity behind the madness, and that you could in fact add new rules to detect and correct the situation.
We've seen so many horrors. European dates? Try a file with three different date formats over six date columns. Quotes that should be parsed in some columns but kept verbatim in others. Columns that contain raw binary data, without any escaping (but you can still infer the boundaries from context). UTF-16 with a byte skipped in the middle. Tab-separated files where the first line is space-separated instead. Cells that are 8MB of white space.
There is no end to the... creativity of CSV file authors.
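For the curious, here's roughly what one of those detection rules can look like. This is only a sketch in Python (standard library only); the candidate formats and the column handling are my own illustration, not a description of our actual tool:

    import csv
    from datetime import datetime

    # Candidate formats to try per column; a real rule set grows with every new customer.
    # Ambiguous values (e.g. 01/02/2020) will simply match the first candidate that fits.
    CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y"]

    def detect_column_format(values):
        """Return the first format that parses every non-empty value, else None."""
        for fmt in CANDIDATE_FORMATS:
            try:
                for v in values:
                    if v.strip():
                        datetime.strptime(v.strip(), fmt)
                return fmt
            except ValueError:
                continue
        return None

    def detect_date_formats(path, date_column_indexes):
        """Map each date column's header to the format its values consistently use."""
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        return {
            header[i]: detect_column_format([row[i] for row in body if i < len(row)])
            for i in date_column_indexes
        }

Multiply that by every quirk in the list above and you get a sense of where the six man-weeks went.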
> There is no end to the... creativity of CSV file authors.
Indeed, it never ceases to amaze. That's why it's so hard to imagine a "general" solution to this problem. You can come up with something that handles the 90% most likely business use cases, maybe... but not the fat tail.
Sounds like it was painful. I actually laughed out loud at 8MB of white space. Mind if I learn more about your experiences? I don't know how to reach you, but please msg me at [email protected].
How trustworthy is Sidehelm in its handling of the data? Do they ensure that the pipelined data never comes into contact with human eyes and is never copied?
I cannot really think of many services off the top of my head that would be so willing to give significant chunks of data to another pipelining service.
Sidehelm operates the same way many ETL or logging services do. Everything it does is completely automated, and any data retained is for the end user to inspect, not us. If any domain-specific data error occurs, we notify you so you can fix it and replay it through the pipeline.
Bad "architecture smell", generally (except for niche cases).
Basically you don't want to rely on something that (1) requires a whole new service / monitoring layer, (2) intrinsically exposes your data to untrusted sources and (3) can of course go "poof!" at any time -- unless you absolutely have to, or it's for something mostly ancillary to your business and your data pipeline. (Or you're fairly small and just don't have a lot of people.)
CSVLint is useful for identifying potential parsing issues, though I think the larger issue is that very few applications actually generate valid CSV files - even Excel has its issues with certain encodings.
In most cases programmers seem to have an incorrect set of assumptions around encodings, escaping, etc.
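For what it's worth, most of those escaping problems disappear when the writer lets an actual CSV library handle the quoting. A minimal Python sketch (the utf-8-sig BOM is just one common convention for keeping Excel happy, not a universal fix):

    import csv

    rows = [
        ["id", "comment"],
        [1, 'He said "hello", then left'],   # embedded quotes and a comma
        [2, "a value\nspanning two lines"],  # embedded newline
    ]

    # utf-8-sig writes a BOM so Excel detects the encoding; QUOTE_MINIMAL quotes only
    # the fields that need it and doubles any embedded quotes, per RFC 4180.
    with open("out.csv", "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)

The trouble, of course, is all the exporters that were written without anything like this.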
I'm a software tester, but the company I work for gets a lot of third-party data from CSV files, which frequently do not match the requirements set out for those files. Using a CSV validation tool (ideally a web service controlled by the company I work for) would allow our clients to validate the files before they sent them to us.
I'm slowly putting together my own solution for this, but if suitable open-source software already exists I'd be happy to make use of or modify that.
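In case it's useful as a starting point, the core of such a check can be quite small. A rough Python sketch (the expected header is made up for illustration; a real service would layer encoding, type and business-rule checks on top):

    import csv

    EXPECTED_HEADER = ["id", "name", "date", "amount"]  # hypothetical schema

    def validate(path):
        """Return a list of human-readable problems found in the file."""
        problems = []
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header != EXPECTED_HEADER:
                problems.append(f"unexpected header: {header!r}")
            for rowno, row in enumerate(reader, start=2):
                if len(row) != len(EXPECTED_HEADER):
                    problems.append(
                        f"row {rowno}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}"
                    )
        return problems

Wrapping something like this in a small web endpoint would let clients check files before sending them over.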
Yes. Part of the reason a seemingly simple CSV is a pain is that it's often malformed, badly formatted, or has odd encodings in the file. We take care of detecting that for you, so you can fix the file or throw it out.
non sequitur: I find it interesting that they're using the Google Cloud blue hex style in their icons, but don't seem to support BigQuery or BigTable/HBase.