
I always wondered why Mawk did not implement Gawk's extensions - or at least those that could be done without any major penalty for the rest of the code.

Do you know why?


With --results, or with --joblog and --resume-failed, GNU Parallel can do this, too.
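A minimal sketch (jobs.txt is a hypothetical file with one shell command per line):

    parallel --joblog state.log < jobs.txt
    # fix whatever caused the failures, then rerun only those jobs:
    parallel --joblog state.log --resume-failed < jobs.txt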


I have done that more than once. I often end up with a solution that works on the test set but breaks after 10 TB just because <" "@example.com> is a valid email address according to RFC 822 (Who the f* thought it was a good idea to allow spaces in email addresses?). Or some other exception that was not part of the test set and was not identified before starting.
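For instance, the kind of naive pattern most of us would start with silently drops the quoted form (an illustrative sketch, not a real validator):

    printf '%s\n' 'user@example.com' '" "@example.com' |
      grep -E '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+$'
    # only user@example.com matches; the RFC-822-valid quoted
    # address with a space is silently filtered out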

Dealing with exceptions is extremely error-prone if they have not been mapped out beforehand, and thus it can be very costly.

Similarly, doing stuff in parallel is extremely error-prone due to race conditions: what does not happen when running on your 1 GB test set may very well happen when running on your 25 TB production data.


I get your point, but the same error-handling problems can appear in scripts and pipelines, no?

In a program I'd try/catch defensively "just in case", if missing one line out of 25 TB is not a big deal.

For parallel processing I'd reach for the nearest standard library at hand in the language of choice.


> For parallel processing I'd reach for the nearest standard library at hand in the language of choice.

That is a good example of what I mean: The nearest standard library is likely to either buffer output in memory or not buffer at all (in which case output from different jobs can interleave, so the start of one line ends up inside another). Buffering in memory means you cannot deal with output bigger than physical RAM, and your test set will often be so small that neither problem shows up.

GNU Parallel buffers on disk. It checks whether the disk runs full during a run and exits with a failure if that happens. It also removes the temporary files immediately, so if GNU Parallel is killed, you do not have to clean up any mess left behind.
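A quick way to see the difference (a sketch; each job emits one long line consisting of its own digit):

    seq 4 | parallel 'head -c 200000 /dev/zero | tr "\0" {}; echo'
    # default: output is buffered on disk and printed per job, lines intact
    seq 4 | parallel -u 'head -c 200000 /dev/zero | tr "\0" {}; echo'
    # -u (--ungroup): no buffering; lines from different jobs can mix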

You _could_ do all that yourself, but then we are not talking 50 lines of code. Parallelizing is hard to get right for all the corner cases - even with a standard library.

And while you would not have to look up how to use command-line parameters on Stack Overflow, you _would_ be doing exactly the same for the standard libraries.

Better performance is also not a given: GNU sort has built-in parallel sorting, so you clearly would not want to use a standard non-parallelized sort from a library.
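For example (bigfile is hypothetical; --parallel and -S are standard GNU sort options):

    sort --parallel=8 -S 2G bigfile > sorted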

Basically I see you have two choices: build it yourself from libraries, or build it as a shell script from commands.

In both cases you would have to spend time understanding how to use the libraries or the commands, and in both cases you are limited by whatever they can do.

I agree that if you need tighter control than a shell script will give you, then you need to switch to another language.


I agree with everything you said; as always, everything is a trade-off. Good point about the trickiness of memory management with parallel processing! You would have to be extra careful to avoid hoarding RAM.


But that approach is potentially dangerous. See:

https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_w...
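The failure mode described there looks roughly like this (a sketch assuming GNU xargs; each worker writes one long line to shared stdout):

    seq 4 | xargs -n1 -P4 sh -c 'head -c 200000 /dev/zero | tr "\0" "$0"; echo'
    # once the writes exceed the pipe buffer, the four workers'
    # output can interleave mid-line, corrupting the result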


