Algorithms can be a force for evil, as well as good


As we throttle onwards in our quest to build an importance classifier, a few cautionary tales:

1. Admissions of Guilt

In the late 70s, a doctor on the staff of St. George’s Hospital Medical School developed an algorithm to assist with the school’s admission process. Between 1982 and 1986, 100% of the interview decisions were made by this algorithm. And in 1987, the Commission for Racial Equality found St. George’s guilty of practicing racial and sexual discrimination in its admissions policy. From the British Medical Journal’s writeup of the incident:

“As many as 60 applicants each year among 2000 may have been refused an interview purely because of their sex or racial origin.” [1]

This didn’t happen because the algorithm was faulty: indeed, at the end of its testing phase, its gradings had a 90-95% correlation with those of the (human) selection panel.

This happened because the algorithm’s training and testing data--that is to say, the school’s admissions records-- were already biased against women and racial minorities. And it happened because nobody at St. George’s questioned the algorithm’s decisions (and why would they, when the decisions corresponded so neatly with their own?).
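
To make that mechanism concrete, here is a minimal sketch on purely synthetic data. Nothing in it comes from the St. George’s records: the groups "A" and "B", the score range, and both thresholds are made up for illustration, and the "model" is a toy that just learns the lowest score the panel accepted in each group.

```python
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible

def panel_decision(score, group):
    """Hypothetical biased panel: group "B" must clear a higher bar."""
    return score >= (60 if group == "A" else 70)

# Synthetic applicants: both groups draw scores from the same distribution.
history = []
for _ in range(2000):
    score, group = random.uniform(40, 100), random.choice("AB")
    history.append((score, group, panel_decision(score, group)))

def fit_thresholds(records):
    """Toy training step: learn, per group, the lowest score the panel accepted."""
    thresholds = {}
    for score, group, accepted in records:
        if accepted:
            thresholds[group] = min(score, thresholds.get(group, float("inf")))
    return thresholds

model = fit_thresholds(history)

def model_decision(score, group):
    return score >= model[group]

# How often the model matches the panel's past decisions.
agreement = sum(model_decision(s, g) == d for s, g, d in history) / len(history)

def interview_rate(target):
    """Share of applicants in one group that the model would interview."""
    decisions = [model_decision(s, g) for s, g, _ in history if g == target]
    return sum(decisions) / len(decisions)

print(f"agreement with panel: {agreement:.0%}")
print(f"interview rate A: {interview_rate('A'):.0%}, B: {interview_rate('B'):.0%}")
```

In this toy the model agrees with the panel on every past decision, and that is exactly the problem: the agreement figure says nothing about fairness, while the per-group interview rates expose the gap the model inherited.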


2. Bad Ads

Let’s conduct a little experiment:

  1. Create nine Google accounts: 3 associated with white ethnic groups, 3 associated with African-American ethnic groups, 3 associated with Hispanic ethnic groups

  2. In each account’s Gmail, put the name and a shared set of terms (e.g., “Conor Erickson - Arrested; need a lawyer” and “DeShawn Washington - Arrested; need a lawyer”) into the subject line.

  3. Do all the names see the same/similar Google ads?

Not under Nathan Newman’s watch. The Tech-Progress founder, who chronicled this experiment on the Huffington Post, found certain terms yielded significantly different results across the three ethnic groups. In particular, for the term “Buying Car,” the white names yielded ads for car buying sites, while each of the African-American names yielded one or more ads “related to bad credit card loans and included other ads related to non-new car purchases, such as auto insurance or purchasing ‘car lifts’ for home repairs.”

Newman also found that the location of the name came into play: in the South Bronx, the “Jake Yoder” who was interested in buying a car saw car lift and car warranty ads; his Upper West Side counterpart saw multiple Lexus ads.

Rebuttal: Google responded to Newman’s post saying that they “do not select ads based on sensitive information, including ethnic inferences from names.”

But: In a blog post about the experiment and response, Cathy O’Neil writes: “it doesn’t matter what Google says it does or doesn’t do, if statistically speaking the ads change depending on ethnicity.”


3. Quantifiably Random 

At the beginning of each school year in public schools across the country, student information including attendance, race, gender, socioeconomic status, and past performance is fed into a value-added model (VAM). The model uses this information to calculate what a class’s year-end standardized math and English test scores should be, assuming an average teacher. At the end of the year, the students take their tests, and math and English teachers receive a value-added score between 0 and 100, based on how their classes perform in relation to the VAM’s calculation.
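
Here is a rough sketch of that last step, under stated assumptions. It assumes the 0 to 100 number is a rank of each teacher's gap between actual and predicted class scores, which the description above does not spell out. The function name `value_added_scores`, the teacher labels, and all the numbers are hypothetical, and the prediction step itself (the hard part of a real VAM) is taken as given.

```python
def value_added_scores(predicted, actual):
    """Rank teachers by (actual - predicted) class score, scaled to 0-100.

    Assumption: the score is a simple rank of the residuals. A real VAM is
    far more involved; this only illustrates the general shape of the idea.
    """
    residuals = {teacher: actual[teacher] - predicted[teacher] for teacher in predicted}
    ordered = sorted(residuals, key=residuals.get)  # worst residual first
    top = len(ordered) - 1  # needs at least two teachers
    return {teacher: round(100 * rank / top) for rank, teacher in enumerate(ordered)}

# Hypothetical class averages: what the model predicted vs. what happened.
predicted = {"T1": 70, "T2": 65, "T3": 80}
actual = {"T1": 74, "T2": 60, "T3": 81}

scores = value_added_scores(predicted, actual)
print(scores)
```

Note that T3 has the highest raw class average but lands in the middle: the score rewards beating the prediction, not scoring high, so everything hinges on how good the prediction is.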

In New York, value-added ratings make up 20-25% of a new teacher’s evaluation. In Florida, it’s 50%, for all teachers. Value-added scores are critical to teachers’ job security, but should they be?

After the New York Times released the 2007-2010 value-added data for 18,000 New York City teachers, Gary Rubenstein attempted to answer that question. He reasoned that if value-added metrics are a useful benchmark for evaluating teacher performance, they should be consistent with the following assumptions:

1) A teacher’s quality does not change by a huge amount in one year, the exception being between the first and second years

2) A teacher in her second year >>> same teacher in her first year

3) Teachers generally improve each year

Rubenstein took the teachers who were rated in both 2008-2009 and 2009-2010, and plotted their scores from 2008-2009 on the x-axis and their scores from 2009-2010 on the y-axis.

You'd expect decent correlation between the two score sets, with points clustered on an upward sloping line.

Here’s what it actually looked like:

[Figure: scatter plot of 2008-2009 scores (x-axis) against 2009-2010 scores (y-axis)]


The correlation coefficient between the 2008-2009 scores and the 2009-2010 scores was 0.35. With that kind of correlation, you might as well only hire teachers for a year.
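
For a sense of scale, squaring a correlation coefficient gives the share of variance in one set of scores that a straight-line fit on the other accounts for. The sketch below computes a Pearson correlation by hand and then squares the reported figure. The `pearson` helper is an illustration, not Rubenstein's code, and the only number taken from his analysis is the coefficient itself.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Sanity check on a perfectly linear toy series: the correlation is 1.
assert abs(pearson([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9

r = 0.35  # the year-to-year correlation reported above
explained = r ** 2
print(f"share of variance explained: {explained:.1%}")
```

So last year's score accounts for roughly an eighth of the variation in this year's score, and the rest comes from something else.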

Still, there are a number of reasons why a teacher’s ability might change drastically from one year to another -- maybe there’s an illness in the family, maybe they’re dealing with some sort of trauma, maybe they’re thirty years in and ready to retire. So Rubenstein plotted a different set of scores: those for teachers whose first year was 2008-2009, and second year was 2009-2010.

The first-year teachers plot looks … about the same as the total teachers’ plot.


According to the value-added scores, 52% of the first-year teachers were better in their first year than in their second. 52%! Now you really might as well only hire teachers for a year.

Except, should those teachers happen to teach multiple grades, you might not want to hire them at all: Rubenstein also found that among 665 teachers who taught the same subject to different grade levels, the average difference between scores was almost 30 points.

Vis-a-vis the above hiring advice: ignore 100% of it. Rubenstein’s argument, of course, is that the VAM is scarcely better than a random number generator, and using it to decide whether or not to keep a teacher means a district will likely lose a lot of good eggs and gain a lot of rotten ones. [2]
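
For comparison, here is what an actual random number generator does with the first-year question. This is a simulation with made-up teachers, not Rubenstein's data: each hypothetical teacher gets two independent uniform scores between 0 and 100, and the function name `share_worse_in_second_year` is just a label for this sketch.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

def share_worse_in_second_year(n_teachers):
    """Fraction of simulated teachers whose random year-one score beats year two."""
    worse = 0
    for _ in range(n_teachers):
        first_year = random.uniform(0, 100)
        second_year = random.uniform(0, 100)
        if first_year > second_year:
            worse += 1
    return worse / n_teachers

share = share_worse_in_second_year(10_000)
print(f"{share:.1%} scored higher in year one than in year two")
```

Pure noise puts about half of the teachers on each side, which is uncomfortably close to the figure the value-added data produced for real first-year teachers.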


4. To Conclude

My point with these examples isn’t that models are inherently bad, but rather that models are reflections of the data used to train them, and that data is a reflection of the people who collect it. The best way to prevent bias is to look carefully at the training data, and understand how it was collected, before you feed it to your model. That, and always, always check your model’s work.



1. H/T to the Guardian, which wrote about the St. George's affair

2. Or, ya know, get its teachers to cheat