It really is not that hard or time-consuming to write a CSV sanitization tool, one that actually manages to recover data from many kinds of bad (but consistent) formatting. I'd say my team spent around 6 man-weeks on it.
The real cost is that the fight is never over. Whenever a new customer comes in with a CSV so badly formatted that it is rejected by your system, further analysis reveals that there's actually some sanity behind the madness, and that you could in fact add new rules to detect and correct the situation.
We've seen so many horrors. European dates? Try a file with three different date formats over six date columns. Quotes that should be parsed in some columns but kept verbatim in others. Columns that contain raw binary data, without any escaping (but you can still infer the boundaries from context). UTF-16 with a byte skipped in the middle. Tab-separated files where the first line is space-separated instead. Cells that are 8MB of white space.
There is no end to the... creativity of CSV file authors.
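For the curious, here's roughly what one of those detection rules can look like. This is only a sketch in Python (standard library only); the candidate formats and the column handling are my own illustration, not a description of our actual tool:

    import csv
    from datetime import datetime

    # Candidate formats to try per column; a real rule set grows with every new customer.
    # Ambiguous values (e.g. 01/02/2020) will simply match the first candidate that fits.
    CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y"]

    def detect_column_format(values):
        """Return the first format that parses every non-empty value, else None."""
        for fmt in CANDIDATE_FORMATS:
            try:
                for v in values:
                    if v.strip():
                        datetime.strptime(v.strip(), fmt)
                return fmt
            except ValueError:
                continue
        return None

    def detect_date_formats(path, date_column_indexes):
        """Map each date column's header to the format its values consistently use."""
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            rows = list(csv.reader(f))
        header, body = rows[0], rows[1:]
        return {
            header[i]: detect_column_format([row[i] for row in body if i < len(row)])
            for i in date_column_indexes
        }

Multiply that by every quirk in the list above and you get a sense of where the six man-weeks went.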
> There is no end to the... creativity of CSV file authors.
Indeed, it never ceases to amaze. That's why it's so hard to imagine a "general" solution to this problem. You can come up with something that handles the 90% most likely business use cases, maybe... but not the fat tail.
Sounds like it was painful. I actually laughed out loud at 8MB of white space. Mind if I learn more about your experiences? I don't know how to reach you, but please msg me at [email protected].
How trustworthy is Sidehelm in its handling of the data? Do they ensure that the pipelined data never comes into contact with human eyes and is never copied?
I cannot really think of many services off the top of my head that would be so willing to give significant chunks of data to another pipelining service.
Sidehelm operates the same way many ETL or logging services do. Everything it does is completely automated, and any data retained is for the end user to inspect, not us. If any domain-specific data error occurs, we notify you so you can fix it and replay it through the pipeline.
Bad "architecture smell", generally (except for niche cases).
Basically you don't want to rely on something that (1) requires a whole new service / monitoring layer, (2) intrinsically exposes your data to untrusted sources and (3) can of course go "poof!" at any time -- unless you absolutely have to, or it's for something mostly ancillary to your business and your data pipeline. (Or you're fairly small and just don't have a lot of people.)
CSVLint is useful for identifying potential parsing issues, though I think the larger issue is that very few applications actually generate valid CSV files - even Excel has its issues with certain encodings.
In most cases programmers seem to have an incorrect set of assumptions around encodings, escaping, etc.
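For what it's worth, most of those escaping problems disappear when the writer lets an actual CSV library handle the quoting. A minimal Python sketch (the utf-8-sig BOM is just one common convention for keeping Excel happy, not a universal fix):

    import csv

    rows = [
        ["id", "comment"],
        [1, 'He said "hello", then left'],   # embedded quotes and a comma
        [2, "a value\nspanning two lines"],  # embedded newline
    ]

    # utf-8-sig writes a BOM so Excel detects the encoding; QUOTE_MINIMAL quotes only
    # the fields that need it and doubles any embedded quotes, per RFC 4180.
    with open("out.csv", "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)

The trouble, of course, is all the exporters that were written without anything like this.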
I'm a software tester, but the company I work for gets a lot of third-party data from CSV files, which frequently do not match the requirements set out for those files. Using a CSV validation tool (ideally a web service controlled by the company I work for) would allow our clients to validate the files before they sent them to us.
I'm slowly putting together my own solution for this, but if suitable open-source software already exists I'd be happy to make use of or modify that.
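In case it's useful as a starting point, the core of such a check can be quite small. A rough Python sketch (the expected header is made up for illustration; a real service would layer encoding, type and business-rule checks on top):

    import csv

    EXPECTED_HEADER = ["id", "name", "date", "amount"]  # hypothetical schema

    def validate(path):
        """Return a list of human-readable problems found in the file."""
        problems = []
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header != EXPECTED_HEADER:
                problems.append(f"unexpected header: {header!r}")
            for rowno, row in enumerate(reader, start=2):
                if len(row) != len(EXPECTED_HEADER):
                    problems.append(
                        f"row {rowno}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}"
                    )
        return problems

Wrapping something like this in a small web endpoint would let clients check files before sending them over.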
Yes. Part of the reason a seemingly simple CSV is a pain is that it's often malformed, badly formatted, or has odd encodings in the file. We take care of detecting that for you, so you can fix the file or throw it out.
non sequitur: I find it interesting that they're using the Google Cloud blue hex style in their icons, but don't seem to support BigQuery or BigTable/HBase.