Of course data contains biases. But again, please read the article I linked; alg...

wrsh07 · on Oct 3, 2016

Aha - I think I see our miscommunication. When you say bias you mean statistical bias.

Yes, machine learning is able to correct for that kind of bias - 538's polls forecast is a good example of that.

But you don't get to redefine racial bias to be something innocuous. Yes, black names are more likely to have arrest records, but that "fact" is super misleading [1].

Finally, you're talking past me. I'm not saying that statistics is broken. I'm saying that we should be especially mindful of the OPs point when they say this:

> So what’s your data being fried in? These algorithms train on large collections that you know nothing about. Sites like Google operate on a scale hundreds of times bigger than anything in the humanities. Any irregularities in that training data end up infused into in the classifier.

I think the OP author also has a related post about the kind of bias I'm talking about: http://idlewords.com/talks/sase_panel.htm

[1]: http://www.huffingtonpost.com/kim-farbota/black-crime-rates-...

yummyfajitas · on Oct 3, 2016

Without getting into a dispute about the definition of "bias", I'm saying that algorithms can accurately measure reality even if input(x=white, all else equal) != input(x=black, all else equal).

You are saying that algorithms are accurately measuring a reality you wish were different. I don't disagree with this.

The right thing to do is to actually answer unpleasant moral questions like "if blacks are 4x more likely to be dangerous criminals, what should we do about it?" But I guess overloading the word "bias" is a nice substitute for clearly thinking things through.

lilyball · on Oct 3, 2016

The problem is you're modeling a biased reality. And accurately modeling a biased reality may in many cases accentuate the bias. Take for example the previously-mentioned case of using an algorithm to determine where to focus your policing efforts. If the data you have says that more arrests are done in a particular part of the city, then you'll want to put more police there, right? But areas where there are more police will tend to see more arrests. So the fact that you're putting more police in an area where you see more arrests is just going to make the bias more extreme, causing even more arrests there. This causes a feedback loop. So you may be accurately modeling reality, but you're modeling a pre-existing bias and making it worse. And who knows why that pre-existing bias was even there? The fact that there were more arrests there may not be because that area actually has more crime committed, it could be due to other factors, such as racial profiling by police, and in that case your algorithm is now accidentally racist because it's perpetuating racial profiling.

dragonwriter · on Oct 3, 2016

The problems are really twofold:

(1) Defining the proper goals, and

(2) Measuring the right things (such as the real goals of interest rather than biased proxies.)

With police deployments, you are assuming the solution (rather than letting your algorithm optimize it) by saying "I want to put more police where more arrests occur". What you really want is probably something more like (the exact goal may be different, of course) "I want to deploy police resources where it will most effectively reduce the incidence of crime, weighted by some assigned measure of severity." Then let your ML algorithm crunch the various measurable factors and produce an optimum deployment to do that.

(But, then again with that goal -- and similar problems exist with many likely real goals -- you run into the other problem, which is measuring the incidence of crime -- measuring crime reports may be the obvious approach, but there's plenty of evidence that lots of factors can bias crime reports, including communities having bad experience with police being less likely to report crimes.)

wrsh07 · on Oct 3, 2016

Thank you. This is so much clearer than what I was saying.

As you say, proper goals and measurement can fix a lot of these problems, and I don't think it's obvious that ml algorithms solve either of those

yummyfajitas · on Oct 3, 2016

I directly addressed this critique two posts up. Why don't you go read that post?

https://news.ycombinator.com/item?id=12627359

lilyball · on Oct 3, 2016

I did read it, but you're talking about correcting for measurement biases in order to recover an accurate view of reality. But what I'm saying is that accurately measuring reality may in fact be how you get bias, because the very thing you're measuring may be biased. If you're aware the bias exists and have tools that can measure the bias itself then maybe you can correct for the bias, but you can't just expect your algorithm to automatically correct itself in the presence of bias because its goal is to model reality, not to figure out whether there's inherent bias in the thing it's modeling.

yummyfajitas · on Oct 3, 2016

Here's my concrete claim. Let pp = police presence, then P(crime detected) = r(pp).

Measured crime = crimes x r(police presence).

As long as your model is expressive enough to capture r(pp), bias should be detected.

Fundamentally you are making the claim that there are certain types of variable correlations that are just so evil that no statistical model can possibly understand them. That's a very bold claim; it's essentially the claim that science doesn't work.

lilyball · on Oct 3, 2016

No, I'm claiming that P(crime detected) != r(pp). More police in an area typically means more crime is detected, but that's not the only factor. If you have two areas with identical police presences and identical actual crime rates (as opposed to reported crime rates), the rate of crime detection (as measured by arrests and whatnot) may be higher in one area due to other factors such as racial bias (not just racial profiling, but also things like police letting white people off with a warning where the equivalent black person would be arrested). So you cannot simply correct for this by accounting for the police presence.

What's more, your data may not even have the necessary info to figure out if there's a bias. For example, what if police are more likely to arrest someone wearing a red shirt than someone wearing any other color shirt? Unless the color of the person's shirt is part of the arrest report, there's no way your statistical model is going to figure out that red shirts affect arrest rate.

yummyfajitas · on Oct 3, 2016

Your function r = r(pp, red shirts, race of offender, etc) exists. A model of the form a x r + b x something_else + ... will detect the bias you've described, assuming of course the biasing variable is either present or redundantly encoded in the data set.

We've now established the existence of a statistical model which can detect this bias.

Now, any other model which is capable of expressing your specific r(p) can do the same thing. The entire purpose of fancy models like random forests is that they can express lots of functions while also being reasonably generalizable.

If you want to claim that this bias is much more difficult to encode in an SVM than all the other typical hidden patterns, you need to establish that your specific r(...) is somehow vastly more complicated than all the other things that machine learning models regularly detect. That's a pretty strong claim.

Interestingly, you are now arguing the exact opposite of what most "machine learning is racist" people claim. They typically claim machine learning is racist because algorithms actually learn hidden factors they wish it wouldn't; e.g., a lending algorithm might "redline" blacks who don't pay back their debts. I take it you believe this is highly unlikely, and algorithms can't possibly distinguish between men and women and then show high paying job ads to more men than women?

srean · on Oct 4, 2016

>Your function r = r(pp, red shirts, race of offender, etc) exists. A model of the form a x r + b x something_else + ... will detect the bias you've described, assuming of course the biasing variable is either present or redundantly encoded in the data set.

No no no. Had to respond to this because this such a common confusion (not to say that you personally have this).

That such a model exists within the class of models being used says absolutely nothing about whether the statistical/ML algorithm will find it with any degree of confidence from a sample. The science is still grappling with the question of how to do model selection. There are two, sort of, equivalent class of methods, regularization (this can be a regularization over the dependency structure too, not just a simple penalty) and prior. Its only when you get those right that you have decent chance of estimating well, from reasonable amount of data.

Short answer: universal approximation property of a class of models says nothing about learnability.

lilyball · on Oct 3, 2016

Regarding your last paragraph, there's two different angles here. The "machine learning is racist" angle I think is quite valid, but covers a different topic than what we've been discussing here. To be more specific, there's two different ways in which we can have racist models:

1. The algorithm is biased in a way that reflects reality but does not reflect how we wish it to behave. This is the "machine learning is racist" angle. A lending algorithm might quite rightly think that black people are a higher risk, but this is ethically problematic to act on, because denying loans to black people only serves to compound the social problem (even though it may make financial sense for your bank).

2. What I'm arguing is that we can have racist algorithms due to the fact that the data itself may be biased in a way you're not aware of. To take the red shirt example, something I forgot to say before was that if, say, a fad spreads among the black community of wearing red shirts, then you're going to see an uptick in arrests of black people, but your algorithm won't be able to figure out that this is actually due to arresting red-shirted people, which means it will believe that black people in general are more likely to be arrested.

yummyfajitas · on Oct 3, 2016

Here's my point.

(1) is only possible if your data provides access to the biasing variable, perhaps via redundant encoding. This is the standard critique folks make.

As per (1), the biasing variable is available. Now if the algorithm is expressive enough to describe the functional form of the bias (e.g. the bias is quadratic, and the model includes quadratic terms), it will fix that bias.

You're right that there are lots of hidden variables that we can't use in a predictor. Murderous intent and mafia membership are also not available as predictive factors. You could build a more accurate model if you had that data. So what?

lilyball · on Oct 3, 2016

The problem with (2) isn't just that your model isn't as precise as it could be, it's that your model may be inadvertently biased because all of the data that it was fed was biased. This comment (https://news.ycombinator.com/item?id=12625917) gives a good example of that one. No amount of expressivity in the algorithm will account for the fact that the Friendface model (read the comment) was trained on a predominately white userbase versus FaceSpace's model which is trained on a predominately urban black userbase.

srean · on Oct 3, 2016

Are you saying that it can form a good estimate of the conditional probability ? I can believe that if the sampling process preserves the conditional.

Otherwise one would have to make assumptions about (or in other words, model) the corruption process. The bias compensation machinery then has to be deliberate, wont happen on its own.

Some sampling processes do not modify the conditional. In those cases no special machinery would be required.

yummyfajitas · on Oct 3, 2016

tOne approach is to directly model the corruption process. Being the model-based-Bayesian guy I am, this is something I like to do.

But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process. In the example in my linked blog post, test scores might be biased against blacks. But race is also redundantly encoded, so the algorithm has enough information to fix the bias completely by accident.

Fundamentally what I'm saying here is that bias is a statistics problem and has a statistics solution. Insofar as your complaint is algorithms finding the wrong answer, the solution is better stats.

And nothing whatsoever that I've said here would be remotely controversial if the topic were remote sensing.

srean · on Oct 3, 2016

> But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process

This is the claim that I am having trouble with.

Say I have two random variable X,Y with some joint distribution. If a corruption process can mess with the samples drawn from it, I cannot see how it could possibly recover either the joint or the conditional.

Are you saying that the corruption is benign like missing at random or missing completely at random ? Then its much more believable.

yummyfajitas · on Oct 3, 2016

So we both agree that if the bias is linear, and your model is linear, you capture it. Similarly if the model involves interaction (score x is_black), and you include linear interaction terms, you'll also capture it.

Now the question arises; what if things are more complex?

In real life they always are; both your biasing factor and the rest of the model. So we've cooked up all sorts of fun models like SVMs, random forests and neural networks to analyze such complicated models and find hidden features and relations that we didn't think of. Bias is one such feature.

If I built an algorithm that learned to display different ads to mobile and desktop people (i.e., treat mobile "time on site" differently from desktop "time on site"), would you be surprised by this?

srean · on Oct 4, 2016

That makes it clearer. I got thrown off by the claim that a standard algorithm will be able to de-bias if no de-biasing machinery has been built into it. BTW the machinery may be implicit in the choice of the model.

Simple toy example: say Y is a threshold function of X + high variance noise. I draw samples from this and scale down all y_i's that exceed the (unknown) threshold. In other words my corruption process is dependent on X. We can make it depend on Y too. These would require explicit modeling. Just throwing a uniformly rich class of P(X,Y) wont by itself fix this. We have to carve that space of P(X,Y) with the knowledge of possible corruption process to get a good model of the behavior before the corruption is applied.

BTW we have gone way off tangent, but that was a good conversation.