I guess I should clarify that - performance sucks. There are obviously various ways around it, but you just can't write a lot of performance-critical Python that way, and I guess this is one of the reasons why Julia exists in the first place.
But if you look at the actual cutting-edge research work in the HPC and scientific space, they are working on languages that allow domain experts to express computation using higher-level primitives, not on fancy compiler techniques to make general-purpose imperative languages like C or Python "automagically" run faster on single cores.
The general consensus, if you look at languages like Chapel and Fortress and X10, seems to be that most scientific codes shouldn't be written using for-loops. That is the low-level control flow construct dating from the age of assembler. Instead, what scientists generally want to say is, "Apply this kernel across this domain, with these windowing conditions", or "Reduce values from this computation along these keys in my dataset". As software developers, our job is to provide the language runtime to allow them to do that; such a runtime will be the most robust, correct, maintainable, and performant.
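To make those two phrasings concrete, here is a rough sketch in plain NumPy. The helper names `smooth` and `sum_by_key` are made up for illustration, and `mode="same"` is just one possible choice of windowing condition. It is not how Chapel, Fortress, or X10 express this.

```python
import numpy as np

def smooth(signal, kernel):
    # "Apply this kernel across this domain": mode="same" is the windowing
    # condition here (output keeps the input's length, edges are zero-padded).
    return np.convolve(signal, kernel, mode="same")

def sum_by_key(keys, values, num_keys):
    # "Reduce values along these keys": sum every value that shares a key.
    return np.bincount(keys, weights=values, minlength=num_keys)

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 1.0, 1.0]) / 3.0   # 3-point moving average
smoothed = smooth(signal, kernel)

keys = np.array([0, 1, 0, 2, 1])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
totals = sum_by_key(keys, values, num_keys=3)
```

Neither function contains an explicit for-loop; the iteration order is left to the library.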
Right. I think we actually violently agree with each other on that :) Note that in general it's OK to not write much Python and live very happily just using it for the non-performance-critical parts. No doubt we both know a lot of people who are quite happy with that.
The problem with NumPy's performance is twofold:
* NumPy expressions might not be fast enough. I believe you guys at Continuum are trying to address that one way or another. In general, a kernel expressed using high-level constructs in Python should not be slower than an equivalent loop in C.
* Sometimes you actually want to write a for loop, because you don't care, because it's faster, because it's a single run, because the data is manageable, etc. You should not be punished for doing that with a 100x performance drop. You can still be punished for that with a 2x performance drop.
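If you want to see the gap on your own machine, something like the sketch below works. The function names and the array size are arbitrary choices of mine, and the ratio it prints will vary with hardware, Python version, and NumPy build, so treat it as a rough probe rather than a benchmark.

```python
import timeit
import numpy as np

def sum_of_squares_loop(values):
    # The "just write a for loop" version, run by the Python interpreter.
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_numpy(values):
    # The same computation as a single NumPy expression.
    return float(np.dot(values, values))

data = np.arange(10_000, dtype=np.float64)
loop_result = sum_of_squares_loop(data)
numpy_result = sum_of_squares_numpy(data)

# Time a few repetitions of each version; absolute numbers are machine-specific.
loop_time = timeit.timeit(lambda: sum_of_squares_loop(data), number=5)
numpy_time = timeit.timeit(lambda: sum_of_squares_numpy(data), number=5)
print(f"loop: {loop_time:.4f}s  numpy: {numpy_time:.4f}s  "
      f"ratio: {loop_time / numpy_time:.0f}x")
```

Both versions return the same number; only the printed ratio tells you how hard the loop is being punished.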