> At the core RL is just updating a table of values, and then using function approximation (aka, machine learning) for more complex cases.
this seems to be a common assertion about ml. other refrains include "ml is just matrix multiplication" and "ml is just affine transformations followed by nonlinearity".
while technically true, it is an unhelpful comment as it doesn't shed any light on the salient questions, such as:
* how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?
* how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?
* how do you know your policy network is exploring new states?
it is the equivalent of learning how to draw an owl: draw a few circles, then draw the fucking owl.
It's funny reading these comments. The thing is, if you go through AI ideas and principles as a mathematician, there will be a lot of "hold on", "you can't do that", "that's wrong", and so on. AI simply isn't correct and shouldn't work. There are also many, many cases of "if A works, B should work as well" where A works and B doesn't.
AI is semi-random. That's the truth.
> how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?
You don't. You just try stuff and some of it works.
> how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?
You don't. You just try stuff and some of it works.
> how do you know your policy network is exploring new states?
You don't. You just try stuff and some of it works.
Now there are intuitive answers (that are bullshit) to your questions. The problem is they don't stand up to scrutiny.
> how do you reduce bias/variance, since you're sampling a minuscule slice of the state/action space?
You assume the structure of the AI computation itself will have a "sufficiently" correct bias/variance. So the structure of the network and the structure of the updates encode something that works in the state/action space.
In other words: you try dozens of architectures until one works (also known as "hyperparameter search"). To put it another way, any AI model is a huuuuuge parameterized function, say Ax + By + C. Backprop learns A, B and C; the "hyperparameters" encode that it's Ax + By + C that gets calculated in the first place. For instance, NEAT is a good hyperparameter search algorithm.
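To make the split concrete, here's a toy sketch (all numbers made up): the "architecture" hard-codes the form Ax + By + C, and gradient descent only fills in A, B, and C.

```python
import numpy as np

# the structure A*x + B*y + C is fixed by hand ("hyperparameters");
# gradient descent on a squared loss fits only the coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                  # inputs (x, y)
target = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5   # true A=3, B=-2, C=0.5

params = np.zeros(3)                           # [A, B, C], to be learned
lr = 0.1
for _ in range(500):
    pred = params[0] * X[:, 0] + params[1] * X[:, 1] + params[2]
    err = pred - target
    # gradient of 0.5 * mean(err^2) w.r.t. each coefficient
    grad = np.array([
        (err * X[:, 0]).mean(),
        (err * X[:, 1]).mean(),
        err.mean(),
    ])
    params -= lr * grad

print(np.round(params, 2))  # ≈ [3, -2, 0.5]
```

Searching over the *form* itself (Ax + By + C versus something else) is the part backprop can't do; that's what architecture/hyperparameter search, NEAT included, is for.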
> how do you construct a value function (or something like it) in a sparse reward environment, with possibly thousands of time steps?
This is the problem of reinforcement learning. Literally all the theory is about this very question, just don't expect to read math that holds up. But it's easy to see how, for instance, Q-learning and TD(λ) (including the special cases TD(0) and TD(1)) all do this.
How about I put down the intuition behind Q-learning. You have states, actions, and rewards. You are in a state, you pick an action, and you get rewarded (or not). At some point the game ends, so we have "rounds" (episodes): a sequence of moves from the start of the game to the end. You calculate the Q-value, which is the discounted reward given optimal play. Wait, what? You calculate the expected reward given that you play optimally in the future. So Q(state(t), action(t)) = R + γ · max over all possible action(t+1) of Q(state(t+1), action(t+1)), where γ is the discount factor. By playing you constantly recalculate those Q-values, meaning they become more accurate over time, and then you play by simply picking the action with the highest Q-value.
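That update can be sketched as tabular Q-learning on a made-up 5-state corridor (environment and hyperparameters are illustrative, not from any real benchmark). Reward is sparse: only reaching the last state pays.

```python
import numpy as np

# states 0..4, actions {0: left, 1: right}; only state 4 gives reward 1.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.5   # learning rate, discount, exploration
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r, s2 == n_states - 1

for _ in range(300):                # 300 "rounds" (episodes)
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly pick the best-known action, sometimes explore
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))
        target = r + (0.0 if done else gamma * Q[s2].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2

print(Q.argmax(axis=1)[:4])  # -> [1 1 1 1]: go right in every non-terminal state
```

The Q-values propagate backwards from the single rewarded transition, which is exactly the "constantly recalculate" part.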
> how do you know your policy network is exploring new states?
One way to do it: when you calculate the gradient, you can see that the gradient is a big honking set of matrices. You stick them all together into one vector, and then use the norm of that vector (which is intuitively a measure of how much the network "learned" in the last experiment) as a reward. Alternatively, you sum the per-matrix norms together and do the same.
There are other metrics. For instance you train a separate network to predict the future game state, then you add how wrong that network is to the reward function. Assuming both networks learn the same (one of those "but that's ... wrong" things), that should give you an indication of how new such a state is. The newer it is, the more the prediction network gets it wrong.
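A toy sketch of that prediction-error bonus, using a linear forward model (all names, sizes, and constants are made up):

```python
import numpy as np

# a tiny linear "forward model" predicts the next state; its error is
# added to the reward. novel transitions are predicted badly, so they
# earn a bigger bonus; revisited ones shrink as the model catches up.
W = np.zeros((4, 4))          # forward model: s_next ≈ W @ s
lr, beta = 0.1, 0.5           # model learning rate, bonus scale

def curious_reward(s, s_next, extrinsic):
    global W
    err = s_next - W @ s
    bonus = beta * np.linalg.norm(err)   # big error -> novel -> big bonus
    W += lr * np.outer(err, s)           # train the model on this transition
    return extrinsic + bonus

rng = np.random.default_rng(0)
s, s_next = rng.normal(size=4), rng.normal(size=4)
r1 = curious_reward(s, s_next, extrinsic=0.0)
r2 = curious_reward(s, s_next, extrinsic=0.0)  # same transition: smaller bonus
```

After one update on the same transition, `r2 < r1`: the state is no longer "new", so the bonus decays, which is the intended behaviour.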
Another way is to have 2 instances of your AI solver fight each other, while also training a network to predict how the other will do, and using that same network to judge your own play.
You can also have long-term cycles where you switch between reward functions. For instance, sometimes your network just tries to terminate the game quickly, regardless of the outcome. Sometimes surprise is what you're looking for. Sometimes it just wants to see the world burn (lots of rapid action). Sometimes it wants everything to be quiet. Though, ideally, if you go down this path, you need to somehow tell the network which (combination of) rewards you're going for.
> 3) Test that the solution works on toy examples, like MNIST, simple block worlds, simulated data, etc.
you're right: mnist, imagenet, etc. are toy examples that do not extend into the real world. but the point of reproducible research is to experiment on agreed-upon, existing benchmarks.
> What are the use cases for adding yet another layer to the stack?
in my limited experience with horovod, it's most useful when you're running large clusters of workers/parameter servers. in those situations, you typically have to manually find the appropriate balance of workers to parameter servers (otherwise you'd run into blocking or network saturation issues). horovod sidesteps this with its ring-allreduce implementation.
having said all of that, i'm sticking with distributed tf for now.
> Usually you throw everything and see what sticks.
most practitioners start with the simplest possible learner, then gradually, and thoughtfully, increase model complexity while paying attention to bias/variance. this is far from a "kitchen sink" approach.
Certainly that's what most sensible practitioners do. I somewhat doubt that most people follow this to the letter every time.
It's a nice theory, and it works intuitively with models where you can ramp up complexity easily (like neural nets). It's less obvious if you have a "simple" problem that might be solved with a number of techniques. In that situation I don't see why you would be criticised for trying, say, any of SVM, random forest, logistic regression, or naive Bayes and comparing. Pretty much the only way you can categorically say that your method is better than the others is by trying those other methods.
The simple approach actually came up in the iceberg challenge. The winning team won because they bothered to plot the data. It turned out that the satellite incidence angle was sufficient to segment a large number of icebergs with basically 100% reliability. So they simply thresholded that angle range and trained a bunch of models to solve the more complicated case when there might be a ship.
This was one of the things Andrew Ng really hammers on in the Coursera course. This, alongside separating out the training set from the cross-validation set for tuning parameters, went a long way toward dispelling some of the "magic" in how you iterate towards a sensible model.
They aren't common in deep learning, but if you look to estimation problems like odometry, optimal control, and calibration, the typical approach is to build a least-squares estimator that optimizes with a Gauss-Newton approximation to the Hessian, or other quasi-Newton methods. Gradient descent comparatively exhibits very slow convergence in these cases, especially when there is a large condition number. In the case of an actual quadratic loss function, it can (by definition) be solved in one iteration if you have the Hessian. However, getting the Hessian efficiently within most learning frameworks is difficult, as they primarily only compute VJPs or HVPs (vector-Jacobian and Hessian-vector products).
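A small numpy sketch of that contrast (matrix and minimizer chosen arbitrarily for illustration): for a linear residual r(x) = Ax - b, the loss 0.5·||r||² is exactly quadratic, J = A, and J^T J is the exact Hessian, so one Gauss-Newton step lands on the minimizer while gradient descent crawls along the small-curvature direction.

```python
import numpy as np

A = np.array([[1000.0, 999.0],
              [999.0, 1000.0]])   # ill-conditioned: cond(A^T A) ~ 4e6
x_star = np.array([1.0, 0.0])    # chosen minimizer
b = A @ x_star                   # so r(x) = A @ x - b vanishes at x_star

# one gauss-newton step from zero: solve (J^T J) dx = -J^T r
x_gn = np.zeros(2)
r = A @ x_gn - b
x_gn -= np.linalg.solve(A.T @ A, A.T @ r)   # exact Hessian -> one step

# gradient descent from zero: step size capped by the largest eigenvalue
# of A^T A, so progress along the small eigendirection is glacial
x_gd = np.zeros(2)
lr = 1.0 / np.linalg.eigvalsh(A.T @ A).max()
for _ in range(1000):
    x_gd -= lr * A.T @ (A @ x_gd - b)

print(np.linalg.norm(x_gn - x_star))  # ≈ 0: solved in one step
print(np.linalg.norm(x_gd - x_star))  # ≈ 0.7: barely moved after 1000 steps
```

The gap is entirely the condition number: gradient descent's per-step contraction along the flat direction is 1 - λ_min/λ_max, which is essentially 1 here.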
not exactly what you asked for, but if you're looking for a gentle academic introduction to the intersection of ai and games, i recommend [AI Researchers, Video Games Are Your Friends!](https://arxiv.org/abs/1612.01608)