I made a tool like this for my company in Ruby (it wasn't nearly as mature as this). The largest challenge I struggled with (and never really solved) is that ultimately, there's no way to generate data as useful as real data. The value of real data comes from the fact that it's messy. Real data comes in sizes you don't expect[1], collides with your sentinel values[2], and arrives in unexpected encodings[3]. And sometimes people will enter data that is intended to break your system[4][5].
The value of testing with real data is that it doesn't conform to your assumptions.
As far as I can tell, this benefit is impossible to fake with a system that generates fake data algorithmically. Generated data conforms to the assumptions of the system that generated it and therefore can only be used to test that a system conforms to those assumptions.
Fake data is still useful. Volume is often important (does your database slow down or crash when there are 10 billion records?). And if your fake data has very few assumptions, you can use that to reduce the assumptions made by the system you're testing.
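To make the "very few assumptions" idea concrete, here's a rough sketch (not my original tool) of a generator whose records have deliberately erratic lengths and contents, and which streams lazily so volume tests don't need the whole dataset in memory. All names here are hypothetical.

```ruby
# Generates a single "low-assumption" fake record: field lengths and
# contents are deliberately erratic rather than realistic, so the data
# bakes in as few assumptions as possible.
def low_assumption_record(rng = Random.new)
  name_len = [0, 1, rng.rand(1..64), rng.rand(1..4096)].sample(random: rng)
  {
    # Arbitrary bytes, not guaranteed to be valid UTF-8.
    name: rng.bytes(name_len).force_encoding('BINARY'),
    # Includes boundary-ish values alongside plausible ones.
    age:  [0, -1, rng.rand(0..150), 2**31 - 1].sample(random: rng)
  }
end

# Stream records lazily so a volume test (millions of rows) doesn't need
# to hold everything in RAM at once.
def record_stream(count, rng = Random.new)
  Enumerator.new do |y|
    count.times { y << low_assumption_record(rng) }
  end
end
```

The point isn't realism; it's that the generator encodes almost no beliefs about what a "name" or an "age" looks like, so it can't quietly share your system's assumptions.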
Nevertheless, I'd really like to see a system like this which integrates data from some sort of general-purpose real dataset. Ideally it would be configurable so that people can document and choose a 99% use case they want to support (for example, a US company might want to support long names, but might not get a ton of value from supporting names with Chinese characters).
I'd suggest looking into fuzzers. Short version: tools designed to feed messy, non-conforming data to a program, to ensure malformed inputs don't cause problems, that everything is sanitized correctly, and so on. At this point they are a mature technology, with improvements constantly being researched. They are generally thought of as security tools[1], but they are very useful for basic development too.
[1] The common use of fuzzers in a security context is to send malformed packets to protocol parsers to see if they fall over, cause buffer overruns, or otherwise do fun things in the context of exploiting a system. Another common example is automatic SQL-injection discovery tools.
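The core loop of a mutation fuzzer is small enough to sketch. This is a toy illustration (not any particular tool): take a valid "seed" input, flip a few random bytes, feed each mutant to the code under test, and keep the inputs that made it raise.

```ruby
# Flip a few random bytes of a seed input (toy mutation strategy).
def mutate(seed, flips: 4, rng: Random.new)
  bytes = seed.b                      # binary copy of the seed
  flips.times do
    i = rng.rand(bytes.bytesize)
    bytes.setbyte(i, rng.rand(256))   # overwrite one byte at random
  end
  bytes
end

# Run many mutants through the code under test (the block) and collect
# every input that raised, so failures can be reproduced later.
def fuzz(seed, trials: 1000, rng: Random.new)
  crashes = []
  trials.times do
    input = mutate(seed, rng: rng)
    begin
      yield input
    rescue StandardError => e
      crashes << [input, e]
    end
  end
  crashes
end
```

Real fuzzers add coverage feedback, corpus management, and grammar awareness on top of this loop, but the principle is the same: generate inputs the author of the parser never imagined.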
Crude fuzzing can be done on the command line using dd.
The command "dd if=/dev/urandom bs=1000 count=1" will spit out 1,000 bytes of pseudorandom data you can pipe, POST, or otherwise send to your application. (GNU's implementation lets you write "bs=1K" as well.)
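The same trick works from inside a script. A sketch of the Ruby equivalent, using the standard library's SecureRandom:

```ruby
require 'securerandom'

# Generate 1,000 pseudorandom bytes, like "dd if=/dev/urandom bs=1000 count=1".
blob = SecureRandom.random_bytes(1000)

# Write them to stdout in binary mode so the script can be piped like dd.
$stdout.binmode
$stdout.write(blob)
```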
[1] http://jalopnik.com/this-hawaiian-womans-name-is-too-long-fo...
[2] http://www.snopes.com/autos/law/noplate.asp
[3] http://www.joelonsoftware.com/articles/Unicode.html
[4] http://en.wikipedia.org/wiki/Buffer_overflow
[5] http://xkcd.com/327/