How Cambridge Analytica Mined Data for Voter Influence

Why soliciting personal data with misinformation is a big deal.

Posted Mar 21, 2018

If we look back on the use of social media analytics in Obama's elections, we might well ask: How is what the Trump campaign tried to do with the research firm Cambridge Analytica any different?  Is this about Cambridge Analytica's violation of Facebook policy, or is it a bigger deal than that?

In January 2013, I wrote about how President Obama effectively used social media in the 2008 and 2012 presidential campaigns, comparing his team’s social media savvy to Kennedy’s ability to use television.  Where Kennedy had a lot of innate talents such as charisma and good hair that allowed him to project well across the lens onto the TV screens at home, Obama’s team put social psychology to work using social media.  In 2016, Trump's people turned to data.

Source: Pamela Rutledge/Shutterstock

We are now in the age of data science.  The ability to scrape data from across multiple social media platforms, capturing user behavior patterns and comments, is unprecedented.  It has spawned a huge demand for top-notch data scientists, who are figuring out how to harvest and analyze vast quantities of data, create algorithms that cull and respond, and build predictive models.  Their toolbox is an impressive mix of machine learning, statistics, robust programming skills and both artificial and natural intelligence, and they are all trying to capture and influence human behavior in ever more nuanced and targeted ways.

Whatever comparisons between Obama and Trump may arise, they are red herrings.  It is the access to and use of data that is at the center of this very public debate.  This stuff isn't going away.  It will only get more sophisticated and ubiquitous.  It's neither all bad, nor all good.  This is a key teaching moment—an opportunity to better understand some of the key ethical and legal issues around data mining—if we don't fall into the rabbit hole of political finger-pointing.

Still, many will want to draw comparisons.  Four years is practically a lifetime in the evolution of data science.  The capabilities and the social climate are both different from what was going on in 2008, and certainly 2012.  The tools and the ability to scrape and evaluate data are far more sophisticated now, both technologically and theoretically, than what Obama's team was able to use.

But more importantly, the social climate has changed, and with it the awareness of data violations and the understanding that data can be used to violate privacy, along with increased ethical guidelines and regulations.  People are increasingly aware of how data algorithms act on our online behaviors, from Amazon recommendations to targeted ads that follow us from site to site.  Transparency, permission and maintaining privacy, both for safety and to avoid manipulation, have all been major topics of whistle-blowers and social discourse.

One of the big issues with the Cambridge Analytica controversy centers on how the data was collected.  According to reports in the New York Times and elsewhere, Cambridge solicited personal information through an app with misleading disclosure as to its purpose and intent.  The app solicited various types of information, some of which seems innocuous, such as college majors and political affiliation, but it also included personality assessment questions to generate personality profiles.

Now, why is this a big deal?  We already know that it is possible to estimate a personality profile from a bunch of text data or by coding someone's Facebook profile, as researchers have shown.  The problem is that it's hard to do at scale.  You have to have a significant amount of text from each participant, which becomes extremely costly and labor-intensive for a group of any size.  Where psych researchers can look at a participant group of 200 and be happy as clams with their generalizability, this doesn't cut it for voter persuasion.  The desire to psychologically profile target audiences, however, has great appeal because it provides valuable information that is not currently publicly available.  Various research firms are working on solutions, using analytic techniques like Natural Language Processing or harnessing the power of IBM's Watson, but these are currently either used in small groups for HR purposes (with participant permission) or done in the aggregate, "blind" to individual identities.  More importantly, however, these are estimates (admittedly, some better than others), and they are not the same as the personality profile you get from having people take validated psychological testing measures.  (FYI: some argue that since personality tests are self-report, they are actually less accurate than profiles estimated from data, but I leave that to those with more experience in the assessment trenches to battle out.)
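To see why text-based estimation is so labor-intensive per person, here is a toy sketch (not any firm's actual method, and the word lists are invented) of the simplest lexicon-based approach: count trait-associated words in a user's writing.  A score like this only stabilizes once you have a substantial text sample from each individual, which is exactly the scaling problem described above.

```python
import re
from collections import Counter

# Invented, illustrative trait lexicons.  Real research models use
# thousands of weighted features tied to validated inventories.
TRAIT_LEXICON = {
    "extraversion": {"party", "friends", "fun", "excited", "love"},
    "openness": {"art", "ideas", "imagine", "curious", "novel"},
}

def estimate_traits(text):
    """Return a crude per-trait score: lexicon hits per 100 words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {trait: 0.0 for trait in TRAIT_LEXICON}
    counts = Counter(words)
    return {
        trait: 100.0 * sum(counts[w] for w in lexicon) / len(words)
        for trait, lexicon in TRAIT_LEXICON.items()
    }

sample = "I love a good party with friends, always fun and excited energy."
print(estimate_traits(sample))
```

With only one sentence the scores swing wildly on a single word, which is why this kind of estimate needs large writing samples and still remains an estimate rather than a validated test result.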

Needless to say, it's much, much easier to get personality profiles of a whole lot of people if a company can get the people themselves to take a personality test.  If it doesn't tell people what the test is for, the company doesn't have to worry about participants skewing their answers to "look good."  A few more key questions, plus access to social media handles (which the app has, since participants entered them in order to use it), let the developer scrape data from those social media accounts, making it easy for a crack data scientist to link personality profiles with likes, dislikes and policy positions, identify friends, and build predictive models.
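The linking step can be sketched in a few lines.  This is a hypothetical illustration with invented names and scores, not Cambridge Analytica's actual model: once an app holds real quiz scores and scraped page likes for the same participants, each page can be scored by the average trait level of the people who liked it, and a user who never took the quiz can then be profiled from likes alone.

```python
from collections import defaultdict
from statistics import mean

# Invented data: participants who took the in-app personality quiz,
# with a self-reported extraversion score (0-100) and scraped page likes.
participants = [
    {"extraversion": 80, "likes": {"SkydivingClub", "PartyPlaylists"}},
    {"extraversion": 75, "likes": {"PartyPlaylists", "TravelDeals"}},
    {"extraversion": 20, "likes": {"ChessPuzzles", "QuietReading"}},
    {"extraversion": 30, "likes": {"QuietReading", "TravelDeals"}},
]

def fit_like_scores(people, trait):
    """Score each page as the mean trait value of everyone who liked it."""
    buckets = defaultdict(list)
    for person in people:
        for page in person["likes"]:
            buckets[page].append(person[trait])
    return {page: mean(scores) for page, scores in buckets.items()}

def predict(like_scores, likes, default=50.0):
    """Predict an unseen user's trait from the pages they like."""
    known = [like_scores[p] for p in likes if p in like_scores]
    return mean(known) if known else default

model = fit_like_scores(participants, "extraversion")
# A user who never took the quiz, profiled from likes alone:
print(predict(model, {"SkydivingClub", "TravelDeals"}))
```

Even this crude nearest-average approach shows the leverage involved: the quiz-takers are the expensive, labeled sample, and every other user with visible likes becomes cheap to score.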

Now they have the ability to target individuals based on psychological traits, not just "lifestyle variables" like movie preferences.  In all honesty, marketers would love to do that, but they don't.  Not only is it hard to link personal preference data with targeted user data without violating privacy regulations and the social media companies' ethical standards, but marketers also don't have the output of legitimate personality tests.  Companies like Twitter, for example, zealously guard individual user identification in data-matching requests from marketing and political campaigns.  Cambridge Analytica took advantage of the fact that most of us will sign into an app and give away private information if we feel safe.  Soliciting on Facebook and telling people it was for academic research made them feel safe.  Thus the legal question: does permission count if it was given under false pretenses?

Back in 2008 and 2012, during Obama's campaigns, his team used publicly available profiles.  If you friended Obama on Facebook, you gave him your data and showed him who your friends were.  It's in the fine print.  Read it some time.  The user-supplied data allowed the team to identify likely predispositions toward policy, cross-matched with other available data, like zip codes.

People make predictions from information all the time.  When we're using our own experience, it's called a heuristic.  With a bunch of information and math, it's called data science.  The question is accuracy.  Even with the data Cambridge collected, their ability to influence people is not a sure thing, nor does it necessarily have the power to disrupt culture, as some have claimed.  But personal targeting makes persuasion more likely, and targeting without permission is, let's face it, kind of creepy.  Obama's team was quite sophisticated at the time, but no more so than Google, Amazon or any other data-driven commercial concern.  Obama's campaign was just the first time that social media marketing techniques had been applied to politics.  The attention came not from the sophistication of their targeting as much as from Obama's ability to use his personality on social networks to activate grassroots enthusiasm and effectively crowdsource the campaign coffers.

What we still don't know is what the Trump campaign asked Cambridge to do with the data.  This story will unfold.  However, people don't like to be manipulated.  You may recall the public reaction to Facebook experimenting with the balance of positive and negative posts in news feeds to see if the valence of content changed users' overall tone and "mood" (and that was largely in the aggregate, not individual targeting).  It will be interesting to see if people uniformly find the prospect of manipulation offensive or if it gets labeled different things along party lines.

Every politician looks for what will resonate with voters.  The use of social media data and voter profiling in the 2012 election seems nearly wholesome in comparison with Cambridge Analytica's data exploitation.  But this is the first time (that we know of) that data has been solicited for political purposes using misinformation to trick people into disclosure.  This triggers a serious hot button for many, given the amount of misinformation that has swirled around during and since the Trump-Clinton election.  Everyone is hyper-sensitive to fake news, no matter their political persuasion.  Knowing that misinformation was at the root of this data gathering will make the violation seem even more egregious to many, especially given the cognitive bias that leads us to attribute behaviors or intentions based on past experience.  (If they cheated on A, they will likely cheat on B; or if they cheated on B, they must have cheated on A.)  This is not rational, but being manipulated inherently moves people from a trusting position to a defensive and suspicious one.

Sadly, all this manifests in a lot of finger-pointing, and we always like to have someone to blame.  In this case, it looks like Facebook's feet will be held to the fire along with Cambridge Analytica's.  I seriously doubt there's anything Facebook could feasibly have done to keep an organization from misrepresenting its intent.  However, ironically for Facebook, the strength of its brand implicitly validated the app.