
Bias, according to Google:

> prejudice in favor of or against one thing, person, or group compared with another, usually in a way considered to be unfair.

Data can't be biased. It can be a misrepresentation or inaccurate, but not biased. If I have an accurate dataset that says ethnic group X in area Y has a higher rate of violence, and because of that I choose to charge higher premiums for group X, is that bias? No.

You can't scream bias just because you don't like the conclusions the data supports. Drawing conclusions from data isn't biased, but using only that data might be.



Technically data can be biased because what you call a misrepresentation is termed bias:

https://en.wikipedia.org/wiki/Sampling_bias

https://en.wikipedia.org/wiki/Selection_bias
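To see why, here's a minimal Python sketch (all numbers invented): every individual record is accurate, yet an estimate from a selection-biased sample misrepresents the population, because the sampling procedure only ever reaches one group.

    import random

    random.seed(0)

    # Hypothetical population of drivers in two groups with different
    # accident rates (0.15 vs. 0.05). Every record is "accurate".
    population = [("A", random.random() < 0.15) for _ in range(200_000)]
    population += [("B", random.random() < 0.05) for _ in range(130_000)]

    def accident_rate(rows):
        return sum(had_accident for _, had_accident in rows) / len(rows)

    print(f"true population rate: {accident_rate(population):.3f}")  # ~0.11

    # Selection bias: the survey only ever reaches group A.
    only_group_a = [row for row in population if row[0] == "A"][:5_000]
    print(f"biased sample rate:   {accident_rate(only_group_a):.3f}")  # ~0.15

No single record is wrong; the sampling procedure is.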


Don't forget treatment bias, also known as "nature versus nurture". One can totally remove sampling and selection bias by tallying the whole population in the measurement, say in antiquity. I predict the data (which can't have sampling or selection bias) would show that slaves have a higher injury and mortality rate: after being whipped, overworked, or fed inferior food, the injured are last in line when the food is dispensed, etc. This is real-world treatment bias. How just is it to increase, say, insurance rates for the slaves? (I know they probably didn't have insurance in antiquity...)

Edit:

In other words, statistics done correctly (i.e. representatively) on the real world can tell us what the real world (and its status quo) looks like, but tells us nothing about the ethics of the situation.


True. Inaccurate data can be biased, but I think I made that distinction. If you have a representative, statistically accurate dataset, you can't call the data biased if it points one way or another.


So if I gave you a dataset that only had women in it, and from that you concluded that men never have accidents and should therefore have a $0 insurance rate, you wouldn't say that the data was biased against women?


I would say the data is inaccurate.


Given that it is impossible to have a 100% accurate dataset for real world events, wouldn't every dataset be inaccurate to some measure? And wouldn't the level of inaccuracy represent a bias of some sort?


There is a whole field of mathematics on how to figure this out.

https://en.wikipedia.org/wiki/Sample_size_determination

And yes, nothing is 100% accurate. The census isn't 100% accurate, but it's good enough. A 95% confidence level is the generally accepted target, which is what makes a result statistically significant. If you have a target population of 330 million people, you only need about 40k respondents to hit a 95% confidence level with roughly a ±0.5% margin of error. Again, this has all been figured out, formulated, and settled.
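For reference, the formula behind that claim is Cochran's sample-size formula with a finite-population correction. A minimal sketch in Python, plugging in the 95% / ±0.5% figures above:

    import math

    def sample_size(population, z=1.96, margin=0.005, p=0.5):
        """Cochran's sample-size formula with finite-population correction.

        z:      z-score for the confidence level (1.96 ~ 95%)
        margin: desired margin of error (0.005 = +/-0.5%)
        p:      assumed proportion (0.5 is the most conservative choice)
        """
        n0 = (z ** 2) * p * (1 - p) / margin ** 2
        # The finite-population correction barely matters at this scale.
        return math.ceil(n0 / (1 + (n0 - 1) / population))

    print(sample_size(330_000_000))  # 38412, roughly the 40k cited above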

You keep using the word bias and I don't think you know what it means. If the data is inaccurate, it's inaccurate. If I wanted to sample the US population but only used people from Long Island and only polled 10 people, that's not a biased dataset, it's an inaccurate one. Conclusions drawn from inaccurate data are inaccurate.

Biased data would be statistics cherry-picked to showcase one viewpoint over another. The dataset itself isn't biased, but maybe the presentation is. As I said before, good (and this is important) data can't be biased; what you do with it can be. Saying "men are more likely to be bad drivers" is a factual statement. "All men are bad drivers" is a biased one. See the difference?


You said: Saying "men are more likely to be bad drivers" is a factual statement. "All men are bad drivers" is a biased one.

Let's change it up a little. Take the statement "Black men are more likely to be criminals than white men". Is that factually true? A dataset with a statistically significant sample would say yes, in fact black men are incarcerated at a higher rate per capita than white men.

But look how that data came to be -- by police making arrests. And there is also plenty of data out there showing that black men get arrested more often than white men for the same crimes, and get longer sentences for the same crimes than white men.

So clearly, the dataset that shows black men are more likely to be criminals is biased, even though it is accurate.

Now imagine building an AI (or actuarial table) based on that data. It would necessarily identify black men as "more likely to be a criminal" simply based on the fact that they are black. So now you've magnified the bias in the data.

So yes, the data is accurate but that doesn't mean it isn't biased too.
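To put a number on the "magnified bias" claim, here's a toy Python simulation (all rates invented): two groups with identical true offense rates but different arrest probabilities produce "accurate" arrest records with a threefold gap, which is exactly what a model trained on those records would learn.

    import random

    random.seed(1)

    TRUE_OFFENSE_RATE = 0.05            # identical for both groups, by assumption
    ARREST_PROB = {"X": 0.9, "Y": 0.3}  # group X is policed more heavily

    # An offense only enters the dataset if an arrest happens, so the
    # records reflect enforcement intensity, not underlying behavior.
    records = []
    for group, p_arrest in ARREST_PROB.items():
        for _ in range(100_000):
            offended = random.random() < TRUE_OFFENSE_RATE
            arrested = offended and random.random() < p_arrest
            records.append((group, arrested))

    for group in ARREST_PROB:
        rows = [arrested for g, arrested in records if g == group]
        print(f"group {group}: arrest rate in the data = {sum(rows)/len(rows):.3f}")
    # group X: ~0.045, group Y: ~0.015 (a 3x gap in accurate records,
    # despite identical true offense rates)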


> But look how that data came to be -- by police making arrests.

Yes. And rape statistics come from people getting raped. I'm going to assume you are talking about convictions here, as arrests and convictions are different.

> And there is also plenty of data out there showing that black men get arrested more often than white men for the same crimes

Ok? Then add that as another dataset to your model. You will also need to add in crime relative to the population, the ethnicity of that area, et cetera. You can't just make this statement and say it affects your conclusion without adding in everything else.

This fact doesn't change the initial statement that "black men are more likely to be criminals than white men." It exposes issues in the society we live in, not in the data reflecting it. Any data scientist worth their salt would be able to take this into account.

The white population of the US is 62%; the black population is 12.6%. White arrests in the US were 70%, black arrests were 27%. Whites are overrepresented by 8 percentage points, blacks by 14.4. Asians are underrepresented by about 4. FYI.

https://en.wikipedia.org/wiki/Demography_of_the_United_State... https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-...
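For what it's worth, those over/underrepresentation figures are just differences of shares, in percentage points; a trivial sketch using the numbers quoted above:

    # Population share vs. arrest share, in percent (figures quoted above).
    shares = {
        "White": {"population": 62.0, "arrests": 70.0},
        "Black": {"population": 12.6, "arrests": 27.0},
    }

    for group, s in shares.items():
        gap = s["arrests"] - s["population"]
        print(f"{group}: {gap:+.1f} percentage points vs. population share")
    # White: +8.0, Black: +14.4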

> and get longer sentences for the same crimes than white men.

This is irrelevant to your point. Ignoring.

> Now imagine building an AI (or actuarial table) based on that data. It would necessarily identify black men as "more likely to be a criminal" simply based on the fact that they are black.

A machine learning model, not AI (a term that gets misused everywhere), would come to that conclusion, yes. The problem with your argument is that you want to bring your bias into a system that needs to be based on facts. Do black men get shafted? Yes. But approaching the problem at the end of the pipeline doesn't solve anything.

Going back to the initial discussion point on insurance rates. Men are more likely to die behind the wheel. Ok. We don't fix the problem by saying the data is biased, wrong, whatever. We fix it at the source. Your entire argument boils down to "I'm not comfortable with what the data is telling me and I want the data to say something else." Which is emotional, and I get that.

> So now you've magnified the bias in the data.

You've built a system that reflects the realities of the world around you. There was an article somewhere about a robbery on a BART train. The police wouldn't release the ethnicity of the suspects for some stupid reason about not wanting to feed into stereotypes. That does nothing to fix the problem; it just tries to cover it up. But your reaction is the right one, you just aren't focusing it correctly. Instead of asking why our models look like this and trying to manipulate them to make you feel better, you need to ask why the data comes out the way it does.

My point is this. The conclusion that "black men are more likely to be criminals than white men" should make you want to fix the reason why, not try to manipulate the data or the system that produces that result. If you have a machine learning model that draws that conclusion, you have two options:

1. Fix the source of the data. Isolate why blacks in the US are overrepresented among arrests by roughly 14 percentage points, and solve that.

2. Add in other data sources to further refine your model.
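As a sketch of option 2 (everything here is synthetic and hypothetical): if the toy gap above were driven by, say, precinct-level policing intensity, adding that variable to the dataset makes the group-level difference vanish once you condition on it.

    import random

    random.seed(2)

    # Hypothetical setup: two precincts with different arrest probabilities;
    # group X is concentrated in the heavily policed precinct. The true
    # offense rate is identical everywhere. All numbers are invented.
    ARREST_PROB = {"heavy": 0.9, "light": 0.3}
    SHARE_IN_HEAVY = {"X": 0.8, "Y": 0.2}

    records = []
    for group, share in SHARE_IN_HEAVY.items():
        for _ in range(100_000):
            precinct = "heavy" if random.random() < share else "light"
            offended = random.random() < 0.05
            arrested = offended and random.random() < ARREST_PROB[precinct]
            records.append((group, precinct, arrested))

    def rate(rows):
        return sum(r[2] for r in rows) / len(rows)

    for group in SHARE_IN_HEAVY:
        by_group = [r for r in records if r[0] == group]
        print(f"{group} overall: {rate(by_group):.3f}")
        for precinct in ARREST_PROB:
            cell = [r for r in by_group if r[1] == precinct]
            print(f"  {group} in {precinct}: {rate(cell):.3f}")
    # Overall rates differ (~0.039 vs ~0.021), but within a precinct the
    # groups match (~0.045 heavy, ~0.015 light): the added variable, not
    # group membership, explains the gap.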

The fix is not to run around screaming bias, as that's an incorrect characterization of the situation and it actively hurts your cause.



