You can reduce it via PCA, one of the many techniques in multivariate statistics.
You can use ANOVA to select your predictors.
In general, you can work with a subset of the data using the tools that statistics has provided.
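Something like this rough scikit-learn sketch, say (the data, component count and feature count are just placeholders):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))        # placeholder predictors
    y = rng.integers(0, 2, size=500)      # placeholder binary target

    # PCA: project onto the 10 directions of largest variance.
    X_pca = PCA(n_components=10).fit_transform(X)

    # ANOVA F-test: keep the 10 predictors most associated with y.
    X_anova = SelectKBest(f_classif, k=10).fit_transform(X, y)

    print(X_pca.shape, X_anova.shape)     # (500, 10) (500, 10)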
Complaining about messy data... welcome to the real world. As for complaining about non-reproducible models, choose reproducible ones. I've mostly done statistical models and forest-based algorithms, and they're all reproducible.
All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
> The results were consistent with asymptotic theory (central limit theorem) that predicts that more data has diminishing returns
CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.
>> All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
The solution is to direct research effort towards learning algorithms that generalise well from few examples.
Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
>> You can reduce it via PCA one of the many techniques in multivariate statistic.
PCA is a dimensionality reduction technique. It reduces the number of features required to learn. It doesn't do anything about the number of examples that are needed to guarantee good performance. The article is addressing the need for more examples, not more features.
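To make the distinction concrete (the array sizes here are arbitrary), PCA shrinks the feature axis while the number of examples stays the same:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(1000, 50)                      # 1000 examples x 50 features
    X_reduced = PCA(n_components=5).fit_transform(X)  # still 1000 examples, now 5 features
    print(X.shape, X_reduced.shape)                   # (1000, 50) (1000, 5)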
>>>Don't expect the industry to lead this effort, though. The industry sees the reliance on large datasets as something to be exploited for a competitive advantage.
This is only true for the Facebooks and Googles of the world. There are definitely small companies (like the one I work for) trying very hard to figure out how to build models that use less data because we don't have access to those large datasets.
Btw, if you have relational data and a few good people with strong computer science backgrounds, rather than statisticians or mathematicians, have a look at Inductive Logic Programming. ILP is a family of machine learning techniques that learn logic programs from examples and background knowledge that are themselves given as logic programs. Its sample efficiency is in a class of its own and it generalises robustly from very little data [1].
I study ILP algorithms for my PhD. My research group has recently developed a new technique, Meta-Interpretive Learning. Its canonical implementation is Metagol:
Please feel free to email me if you need more details. My address is in my profile.
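To give a flavour of what "learning logic programs" looks like, here is a toy Python rendering of the classic grandparent example. To be clear, this is not Metagol's interface (Metagol itself is a Prolog program); it only illustrates the kind of short, general rule an ILP system induces from a couple of examples plus background facts:

    # Background knowledge: a small 'parent' relation, given as facts.
    parent = {("alice", "bob"), ("bob", "carol"), ("dave", "erin"), ("erin", "frank")}

    # Positive examples of the target concept 'grandparent'.
    positive = {("alice", "carol"), ("dave", "frank")}

    # The induced hypothesis is a single general rule:
    #   grandparent(X, Y) :- parent(X, Z), parent(Z, Y).
    def grandparent(x, y):
        return any((x, z) in parent and (z, y) in parent
                   for z in {child for _, child in parent})

    # Both positive examples are covered by the one general rule,
    # rather than by memorising each example as a special case.
    assert all(grandparent(x, y) for x, y in positive)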
___________________
[1] As a source of this claim I always quote this DeepMind paper where Metagol is compared to the authors' own system (which is itself an ILP system, but using a deep neural net):
"ILP has a number of appealing features. First, the learned program is an explicit symbolic structure that can be inspected, understood, and verified. Second, ILP systems tend to be impressively data-efficient, able to generalise well from a small handful of examples. The reason for this data-efficiency is that ILP imposes a strong language bias on the sorts of programs that can be learned: a short general program will be preferred to a program consisting of a large number of special-case ad-hoc rules that happen to cover the training data. Third, ILP systems support continual and transfer learning. The program learned in one training session, being declarative and free of side-effects, can be copied and pasted into the knowledge base before the next training session, providing an economical way of storing learned knowledge."
You're absolutely right and I appreciate that very much. On the other hand, there's an incredible amount of hype around Big Data and deep learning, exactly because the large corporations are doing it. So now everyone wants to do it, whether they have the data for it or not, whether it really adds anything to their products or not.

As to the Big N (good one), what I meant to say is that I don't see them trying very hard to undo their own advantage by spending much effort developing machine learning techniques that rely on, well, little data. That would truly democratise machine learning, much more so than the release of their tools for free, etc. But then, if everyone could do machine learning as well as Google and Facebook et al., where would that leave them?
>CLT talks about sampling from the population infinitely. It doesn't say anything about diminishing returns. I don't get how you go from sampling infinitely to diminishing returns.
Yes it does. It even implies it in the name: 'limit'. In the limit of infinitely many samples, the sample mean approaches a normal distribution around the true value, and its standard error shrinks like 1/sqrt(n), so each additional sample buys you less than the one before. That's the diminishing return.
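A quick simulation of that point, assuming i.i.d. draws from a fixed distribution (the sample sizes and repetition count are arbitrary):

    import numpy as np

    rng = np.random.default_rng(42)
    for n in (100, 1_000, 10_000, 100_000):
        # Estimate the mean 200 times with n samples each; the spread of those
        # estimates is the standard error, which shrinks like 1/sqrt(n).
        means = [rng.normal(loc=0.0, scale=1.0, size=n).mean() for _ in range(200)]
        print(f"n={n:>7}  std error ~= {np.std(means):.4f}  (theory: {1 / np.sqrt(n):.4f})")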
>All I see in this post is complaints and no real solutions. The solution that's given is what? Have less data?
It's fine to point out problems without giving solutions. You seem very aggravated.
PCA has specific use cases. It's not a catch-all dimensionality reduction technique; you can't use it effectively, for example, if things are not linearly correlated. There are, of course, many tools for addressing many problems, but as the title states, this is often a grind. For any practical problem, excluding huge black-box neural nets where you don't need to understand the model, you are probably better off starting with a smaller set of reasonable-sounding features and then slowly growing your model to incorporate others.
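A small synthetic illustration of the linear-correlation caveat (the class layout and noise level are made up): two classes on concentric circles overlap along PCA's top component, while the nonlinear radius feature separates them cleanly.

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.decomposition import PCA

    # Two classes on concentric circles: no single linear direction separates them.
    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    pc1 = PCA(n_components=1).fit_transform(X).ravel()   # best linear 1-D projection
    radius = np.sqrt((X ** 2).sum(axis=1))               # nonlinear feature: distance from origin

    # Along pc1 the class ranges overlap; along the radius they don't.
    for name, feat in (("pc1", pc1), ("radius", radius)):
        print(name,
              "class 0:", round(feat[y == 0].min(), 2), "to", round(feat[y == 0].max(), 2),
              "| class 1:", round(feat[y == 1].min(), 2), "to", round(feat[y == 1].max(), 2))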
Also, if by forests you meant random forests... those aren't especially reproducible. Understanding what's going on is not always easy, and most people seem to misinterpret the idea of "variable importance" when you have a mix of categorical and numeric features. Decision trees and linear regressions are nice and reproducible.
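A rough synthetic sketch of that variable-importance trap (the feature names and sizes are invented): a high-cardinality, ID-like column with no signal can still pick up a noticeable share of the impurity-based importance, which is one common way mixed categorical/numeric data misleads people; permutation importance on held-out data is a less misleading check.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 2_000
    signal = rng.normal(size=n)                   # genuinely predictive numeric feature
    noise_id = rng.integers(0, 1_000, size=n)     # high-cardinality ID-like column, pure noise
    y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)
    X = np.column_stack([signal, noise_id])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    # Impurity-based importance can credit the noise column; permutation
    # importance on held-out data puts it back near zero.
    print("impurity importances:   ", rf.feature_importances_)
    print("permutation importances:",
          permutation_importance(rf, X_te, y_te, random_state=0).importances_mean)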
> Complaining about messy data... welcome to the real world.
I mean, that's the crux: if you have bad data, you will have bad results. Data cleanup/transformation is key for anything (reporting, etc.), not just for ML because it's the sexy thing these days.