I would really love to have the story of PyMC told, especially its technical evolution: how it was first implemented and how it changed over the years.
If you are looking for an explanation of MCMC that focuses on intuitive understanding to complement more mathematical introductions, I wrote a blog post trying to explain things in simple terms here: https://twiecki.io/blog/2015/11/10/mcmc-sampling/
No, back-propagation would not give you full Bayesian inference (although there are some tricks [0]). Instead they use variational inference [1], which allows for fast inference in continuous PGMs.
Brian Moriarty gave a great talk in 2015 telling the story of Loom, with many interesting behind-the-scenes details: https://www.youtube.com/watch?v=z1aVDael-KM
Towards the end he also says he'd be excited to do a Loom sequel with the right studio.
There are packages that are just Python libraries, like PyMC3: http://pymc-devs.github.io/pymc3/ It lets you specify a Probabilistic Program in a few lines of Python code (x ~ Normal(0, 1) becomes x = pymc3.Normal('x', 0, 1)) and then run automatic inference on whatever model you dream up; it's very powerful. I always found the approach of inventing a new language to specify Probabilistic Programs, like Stan does, a bit odd. There, you specify the model as a string and can't interact with it or debug it if something goes wrong. Disclaimer: I'm a PyMC3 developer.
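To give a flavor of the Python-library approach, here is a minimal sketch (the data is made up, and exact keyword names like sd vs. sigma shift a bit between PyMC3 versions):

    import numpy as np
    import pymc3 as pm

    data = np.random.randn(100)  # made-up observations

    with pm.Model():
        x = pm.Normal('x', mu=0, sd=1)                # prior: x ~ Normal(0, 1)
        pm.Normal('obs', mu=x, sd=1, observed=data)   # likelihood tied to the data
        trace = pm.sample(1000)                       # automatic inference (NUTS by default)

Because the model is just Python objects, you can inspect the variables and drop into a debugger if something looks off.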
I don't think it has to be either/or. For example, you can implement a deep net in a probabilistic programming framework (http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learn...) like PyMC3. The inference there is not sampling but ADVI (http://pymc-devs.github.io/pymc3/api.html#advi), which is almost as general but much, much faster, and can be run on sub-samples of the data (similar to the stochastic gradient descent used in deep learning). Once we bridge these two domains we can get the best of both worlds, like a deep-net HMM.
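As a rough sketch of what the ADVI workflow looks like on a toy model with made-up data (the exact API has moved around a bit between PyMC3 versions):

    import numpy as np
    import pymc3 as pm

    data = np.random.randn(1000)

    with pm.Model():
        mu = pm.Normal('mu', mu=0, sd=1)
        pm.Normal('obs', mu=mu, sd=1, observed=data)
        # Fit an approximate posterior with ADVI instead of MCMC sampling.
        v_params = pm.variational.advi(n=10000)
        # Draw samples from the fitted approximation for downstream analysis.
        trace = pm.variational.sample_vp(v_params, draws=500)

For sub-sampling the data you would use the minibatch variant of ADVI, which is what lets it scale in the same spirit as stochastic gradient descent.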
PyMC3 (http://pymc-devs.github.io/pymc3/) has all the powerful samplers that Stan has (e.g. NUTS) as well as support for discrete priors. It allows you to specify your models in pure Python, supports matrix algebra, and also has variational inference, which allows for large-ish scale machine learning (e.g. here is a blog post where I train a Bayesian neural net on MNIST: http://twiecki.github.io/blog/2016/07/05/bayesian-deep-learn...). Under the hood, it uses Theano for JIT compilation to C or to the GPU.
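For example, here is a toy sketch (entirely made up, just to show the mechanism) of a model mixing a discrete and a continuous prior; pm.sample() assigns NUTS to the continuous variable and a Metropolis-style step to the discrete one automatically:

    import numpy as np
    import pymc3 as pm

    data = np.random.randn(200)

    with pm.Model():
        on = pm.Bernoulli('on', p=0.5)     # discrete prior -- HMC alone can't sample this
        mu = pm.Normal('mu', mu=0, sd=5)
        pm.Normal('obs', mu=on * mu, sd=1, observed=data)
        trace = pm.sample(2000)            # mixed NUTS + Metropolis steps assigned under the hood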
Thanks for your comment! I did split the data in two, but it's easy to miss. X_test and X_train are the two sets. ann_input.set_value(X_test) then switches in the test values. That's identical to running make_moons() again.
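In case it helps, here is the shape of that pattern, stripped down to a tiny logistic regression instead of the neural net from the post (the names mirror the post, but the model is simplified):

    import theano
    import pymc3 as pm
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split

    X, Y = make_moons(noise=0.2, n_samples=1000)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)

    # Shared variables let us swap the data in and out after fitting.
    ann_input = theano.shared(X_train)
    ann_output = theano.shared(Y_train)

    with pm.Model() as model:
        w = pm.Normal('w', mu=0, sd=1, shape=2)
        p = pm.math.sigmoid(pm.math.dot(ann_input, w))
        pm.Bernoulli('out', p=p, observed=ann_output)
        trace = pm.sample(1000)

        # Switch in the held-out data and generate predictions on it.
        ann_input.set_value(X_test)
        ann_output.set_value(Y_test)
        ppc = pm.sample_ppc(trace, samples=500)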
PyMC3 uses Theano to create a compute graph of the model, which then gets compiled to C. Moreover, Theano gives us the gradient for free, so that HMC and NUTS can be used, which work well on models of high complexity.
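Roughly what "gradient for free" means, as a hand-rolled sketch of the Theano mechanism (not PyMC3's actual internals):

    import theano
    import theano.tensor as tt

    # Build a symbolic graph of a log density; Theano compiles it to C.
    x = tt.dvector('x')
    logp = -0.5 * tt.sum(x ** 2)      # unnormalized standard-normal log density
    dlogp = tt.grad(logp, x)          # the gradient falls out of the graph

    f_logp = theano.function([x], logp)
    f_dlogp = theano.function([x], dlogp)   # exactly what HMC/NUTS need to propose moves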
I use it in production, despite it still being in beta. We're close to the first stable release, but there are still some small kinks to iron out.
What I find more interesting is that you felt the need to point it out and "correct" it, yet failed to actually do so. In fact, the original did something from PEP 8, albeit in improper form, that you omitted entirely.