Sidehelm: a pipeline to validate, test, and pull CSV data (sidehelm.com)
51 points by iamwil on Jan 19, 2017 | 23 comments


It really is not that hard or time-consuming to write a CSV sanitization tool, one that actually manages to recover data from many kinds of bad (but consistent) formatting. I'd say my team spent around 6 man-weeks on it.

The real cost is that the fight is never over. Whenever a new customer comes in with a CSV so badly formatted that it is rejected by your system, further analysis reveals that there's actually some sanity behind the madness, and that you could in fact add new rules to detect and correct the situation.

We've seen so many horrors. European dates? Try a file with three different date formats over six date columns. Quotes that should be parsed in some columns but kept verbatim in others. Columns that contain raw binary data, without any escaping (but you can still infer the boundaries from context). UTF-16 with a byte skipped in the middle. Tab-separated files where the first line is space-separated instead. Cells that are 8MB of white space.
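
Here's a minimal Python sketch of the kind of try-each-format rule I mean for date columns (illustrative, not our actual code):

    import datetime

    # Candidate formats, tried in order. Real files mixed several of
    # these across (and even within) the same column.
    DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y")

    def parse_date(cell):
        """Return a date if any known format matches, else None."""
        cell = cell.strip()
        for fmt in DATE_FORMATS:
            try:
                return datetime.datetime.strptime(cell, fmt).date()
            except ValueError:
                continue
        return None  # reject, or flag for a human to add a new rule

Note that %d/%m/%Y and %m/%d/%Y both accept a cell like 03/04/2017, so the order of the list silently decides the answer -- which is exactly why European dates make the horror list.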

There is no end to the... creativity of CSV file authors.


> There is no end to the... creativity of CSV file authors.

Indeed, it never ceases to amaze. That's why it's so hard to imagine a "general" solution to this problem. You can think of something to handle the 90% most likely business use cases, maybe... but not the fat tail.


Sounds like it was painful. I actually laughed out loud at 8MB of white space. Mind if I learn more about your experiences? I don't know how to reach you, but please msg me at [email protected].


> There is no end to the... creativity of CSV file authors.

Eh, commas and newlines are common in the data, so let's use pipes. And count fields.

:'(
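
For what it's worth, the field-count check is about the only defense left once you've committed to a delimiter. A rough Python sketch (deliberately no quoting rules -- that's the whole point of pipes, until a pipe shows up in the data):

    def check_field_counts(path, delimiter="|"):
        """Flag lines whose field count differs from the header's."""
        with open(path, encoding="utf-8") as f:
            expected = f.readline().count(delimiter) + 1
            for lineno, line in enumerate(f, start=2):
                found = line.count(delimiter) + 1
                if found != expected:
                    print(f"line {lineno}: expected {expected} fields, got {found}")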


How trustworthy is Sidehelm in its handling of data? Do they ensure that the pipelined data never comes into contact with human eyes, and is not copied?

I cannot really think of many services off the top of my head that would be so willing to give significant chunks of data to another pipelining service.


Sidehelm operates the same way many ETL or logging services do. Everything it does is completely automated, and any data retained is for the end user to inspect, not us. If a domain-specific data error occurs, we notify you so you can fix it and replay it through the pipeline.


Related to working with CSV data, there is an interesting standardization project by Open Knowledge: http://frictionlessdata.io/guides/tabular-data-package/

It already includes some tools for working with and displaying CSV data.
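
For reference, a Tabular Data Package is essentially the CSV plus a datapackage.json descriptor that pins down the schema. A minimal sketch in Python (the file and field names here are made up; see the spec for the full set of properties):

    import json

    descriptor = {
        "name": "sales",
        "resources": [{
            "path": "sales.csv",
            "schema": {
                "fields": [
                    {"name": "order_id", "type": "integer"},
                    {"name": "order_date", "type": "date"},
                    {"name": "amount", "type": "number"},
                ],
            },
        }],
    }

    with open("datapackage.json", "w") as f:
        json.dump(descriptor, f, indent=2)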


Please tell me this is not a third party web service?


What is your concern about a 3rd party web service here?


Bad "architecture smell", generally (except for niche cases).

Basically you don't want to rely on something that (1) requires a whole new service / monitoring layer, (2) intrinsically exposes your data to untrusted sources, and (3) can of course go "poof!" at any time -- unless you absolutely have to, or it's for something mostly ancillary to your business and data pipeline. (Or you're fairly small and just don't have a lot of people.)

Especially when it's just CSV parsing, after all.


Dear HN users,

Could you please link to alternative solutions?

Does AWS Glue solve the same problems? https://aws.amazon.com/glue/


I could do with those suggestions too; we have a strong need for a CSV validation tool.

After a brief search, this seems promising:

https://theodi.org/blog/introducing-csvlint


CSVLint is useful for identifying potential parsing issues, though I think the larger issue is that very few applications actually generate valid CSV files - even Excel has its issues with certain encodings.

In most cases programmers seem to have an incorrect set of assumptions around encodings, escaping, etc.
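
The classic wrong assumption being that you can just join fields with commas. A quick Python illustration of why a real CSV writer matters:

    import csv, io

    row = ['Smith, John', 'He said "hi"', 'two\nlines']

    # Naive join: the embedded comma and newline corrupt the row.
    print(",".join(row))

    # csv.writer quotes the fields and doubles the inner quotes.
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    print(buf.getvalue(), end="")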


What do you do for work that you have a strong need for a CSV validation tool?


I'm a software tester, but the company I work for gets a lot of third-party data from CSV files, which frequently do not match the requirements set out for those files. Using a CSV validation tool (ideally a web service controlled by the company I work for) would allow our clients to validate the files before they send them to us.

I'm slowly putting together my own solution for this, but if suitable open-source software already exists I'd be happy to make use of or modify that.
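
The core checks are small, at least. Roughly where I'm starting from (the expected column names below are placeholders, not our real spec):

    import csv

    EXPECTED_COLUMNS = ["id", "name", "email"]  # placeholder spec

    def validate(path):
        """Yield (record_number, message) for each problem found."""
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header != EXPECTED_COLUMNS:
                yield 1, f"header mismatch: {header}"
            for n, row in enumerate(reader, start=2):
                if len(row) != len(EXPECTED_COLUMNS):
                    yield n, f"expected {len(EXPECTED_COLUMNS)} fields, got {len(row)}"
                elif any(not cell.strip() for cell in row):
                    yield n, "empty field"

    for n, msg in validate("incoming.csv"):  # hypothetical file
        print(n, msg)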


I'm guessing you'd want the validation tool to be controlled by the company you work for, because of the sensitivity of the data being processed?

What type of data/industry is it? It'd help us to know where to look for people that might need us.


If you don't deal with terabytes of new data every day then my EasyMorph might be of help: http://easymorph.com


The 'Parse', 'Transform', and 'Validate' icons look a lot like Google Cloud Platform icons.

https://docs.google.com/presentation/d/1vjm5YdmOH5LrubFhHf1v...


I've been looking for this icon set for my presentations for a while and never found it; thanks!


Does this support unicode characters? Other encodings?


Yes. Part of the reason seemingly simple CSVs are a pain is that they're often malformed, badly formatted, or in odd encodings. We take care of that for you, so you can fix the data or throw it out.
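
Roughly speaking, the first step looks something like this Python sketch (heavily simplified; the fallback list here is illustrative, not our actual chain):

    import codecs

    def decode_csv_bytes(raw):
        """Best-effort decode: honor BOMs, try UTF-8, then fall back."""
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return raw.decode("utf-16")      # the BOM gives the byte order
        try:
            return raw.decode("utf-8-sig")   # also strips a UTF-8 BOM
        except UnicodeDecodeError:
            # cp1252 covers common Windows exports; deliberately not
            # latin-1, which accepts any bytes and hides real garbage.
            return raw.decode("cp1252")

    with open("data.csv", "rb") as f:        # hypothetical file
        text = decode_csv_bytes(f.read())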


Is there a functional demo or video demonstration? All I see is a marketing page ...


Non sequitur: I find it interesting that they're using the Google Cloud blue hex style in their icons, but don't seem to support BigQuery or Bigtable/HBase.



