
I made a tool like this for my company in Ruby (it wasn't nearly as mature as this). The largest challenge I struggled with (and never really solved) is that ultimately, there's no way to generate data as useful as real data. The value of real data comes from the fact that it's messy. Real data is different sizes than you expect[1], collides with your sentinel values[2], and comes in with unexpected encodings[3]. And sometimes people will enter data that is intended to break your system[4][5].

The value of testing with real data is that it doesn't conform to your assumptions.

As far as I can tell, this benefit is impossible to fake with a system that generates fake data algorithmically. Generated data conforms to the assumptions of the system that generated it and therefore can only be used to test that a system conforms to those assumptions.

Fake data is still useful. Volume is often important (does your database slow down or crash when there are 10 billion records?). And if your fake data has very few assumptions, you can use that to reduce the assumptions made by the system you're testing.
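One cheap way to get fake data with fewer assumptions baked in is to seed the generator with values known to break naive code. Here's a hypothetical Python sketch (not the Ruby tool mentioned above; the schema and edge cases are purely illustrative):

```python
import random

# Hypothetical sketch: a fake-data generator seeded with values known to
# violate common assumptions (lengths, sentinels, encodings, injection).
# The field names and edge cases here are illustrative only.

EDGE_CASE_NAMES = [
    "A",                                    # shorter than expected
    "Keihanaikukauakahihuliheekahaunaele",  # longer than many column limits
    "NULL",                                 # collides with a sentinel value
    "Robert'); DROP TABLE Students;--",     # injection-shaped input
    "名前",                                  # non-ASCII characters
    "",                                     # empty string
]

def fake_record(i: int) -> dict:
    """Mix plausible and adversarial values in a single record."""
    return {
        "id": i,
        "name": random.choice(EDGE_CASE_NAMES),
        "age": random.choice([0, -1, 200, None]),  # boundary and missing values
    }

records = [fake_record(i) for i in range(1000)]
```

Crank the range up for the volume tests mentioned above; the point is that every record has a chance of carrying a value the system under test didn't plan for.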

Nevertheless, I'd really like to see a system like this which integrates data from some sort of general-purpose real dataset. Ideally it would be configurable so that people can document and choose a 99% use case they want to support (for example, a US company might want to support long names, but might not get a ton of value from supporting names with Chinese characters).

[1] http://jalopnik.com/this-hawaiian-womans-name-is-too-long-fo...

[2] http://www.snopes.com/autos/law/noplate.asp

[3] http://www.joelonsoftware.com/articles/Unicode.html

[4] http://en.wikipedia.org/wiki/Buffer_overflow

[5] http://xkcd.com/327/



I'd suggest looking into fuzzers. Short version: tools designed to feed messy, non-conforming data into your inputs to make sure nothing breaks, that everything is sanitized correctly, and so on. At this point they are a mature technology, with improvements constantly being researched. They are generally thought of as security tools[1], but they are very useful for basic development too.

[1] The common use of fuzzers in a security context is to send malformed packets to protocol parsers to see if they fall over, cause buffer overruns, or otherwise do fun things in the context of exploiting a system. Another common example is automatic SQL-injection discovery tools.
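The core idea can be sketched in a few lines of Python. This is a crude mutation fuzzer (illustrative, not any real tool): flip random bits in a valid seed input, then collect the variants the target fails to handle. Coverage-guided fuzzers build corpus management and instrumentation feedback on top of this same loop.

```python
import random

def mutate(data: bytes, n_flips: int = 8) -> bytes:
    """Return a copy of `data` with a few randomly chosen bits flipped."""
    buf = bytearray(data)
    for _ in range(n_flips):
        i = random.randrange(len(buf))
        buf[i] ^= 1 << random.randrange(8)
    return bytes(buf)

def fuzz(target, seed: bytes, iterations: int = 1000) -> list:
    """Feed mutated variants of `seed` to `target`; collect inputs it rejects."""
    failures = []
    for _ in range(iterations):
        sample = mutate(seed)
        try:
            target(sample)
        except Exception:
            failures.append(sample)
    return failures
```

For example, `fuzz(json.loads, b'{"a": 1}')` turns up byte sequences the JSON parser rejects with an exception; a target that segfaults or hangs instead of raising cleanly is exactly what a fuzzer is meant to expose.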


A quick search gave me this list: http://www.infosecinstitute.com/blog/2005/12/fuzzers-ultimat... Is there a notable fuzzer missing? It's a pretty long list; does anyone know which of these tools are really worth checking out?


That's a pretty old list. Just to name one, I would recommend taking a look at Radamsa

https://www.ee.oulu.fi/research/ouspg/Radamsa

...from the University of Oulu. It's more like a framework for generating intelligent fuzzers than a shrink-wrapped product, though.

The OUSPG guys are really good at fuzzing. There is also a commercial spin-off, Codenomicon, whose tools are quite widely used.


Crude fuzzing can be done on the command line using dd.

The command "dd if=/dev/urandom bs=1000 count=1" will spit out 1 KB of pseudorandom data you can pipe, POST, or otherwise send to your application. (GNU's implementation also accepts "1K", which means 1024 bytes.)
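Sketching that out (the target URL below is hypothetical; substitute your application's endpoint):

```shell
# Emit 1000 pseudorandom bytes; 2>/dev/null suppresses dd's transfer summary.
dd if=/dev/urandom bs=1000 count=1 2>/dev/null | wc -c

# Piped into an HTTP endpoint (hypothetical URL) as a crude fuzz probe:
# dd if=/dev/urandom bs=1000 count=1 2>/dev/null | \
#     curl -s --data-binary @- http://localhost:8080/input
```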



You might be interested in something like https://github.com/buger/gor



