You Are What You 'Like'

What your actions on social media say about you.

Posted Mar 19, 2018

On Friday, Facebook banned Cambridge Analytica (CA). We have been talking about the role CA's uniquely targeted advertising approach played in the 2016 US presidential election since just after the election. This much more recent ban occurred because of a breach of data management protocol (which broadly covers how data are obtained, transferred, and stored), NOT because of the way those data were used. An academic researcher (Aleksandr Kogan) obtained the data by asking users to opt in to an app designed to estimate users' personalities from their pattern of behavior on Facebook. The problem began when Dr. Kogan chose to provide the data to someone else. CA has been banned from Facebook not because they accessed and used the data, but because they didn't go through the proper channels to do it. Facebook found out about the breach of data management protocol and requested that CA delete the data. CA agreed, but then Facebook found out from a whistleblower that CA had lied about deleting it, and so now CA is banned.

Source: Blogtrepreneur/flickr

But what's receiving the most attention is HOW those data were used. The extent to which seemingly innocuous online behaviors can be used to predict users' characteristics is shocking to most people. Such prediction and targeting happens every day, any time you engage in a behavior that can be linked to your identity (either online, through social media profiles that track individuals across websites by comparing email addresses or site cookies, or in the 'real world', with purchases made at different stores using different bank and credit cards being matched up by credit reporting agencies). Most of this prediction happens in the background, with consumers rarely thinking about it, and consent for the collection and use of data exists in the fine print of user agreements that most of us click through without reading.

What your 'likes' say about you

We easily understand that something like political orientation may be guessed by seeing that a person likes or follows certain politicians or organizations. If a researcher were to infer political orientation by politicians a person supports, we would call that face valid data. That is, the measure (politicians supported) is clearly related to the thing we're trying to predict (political orientation).

What's less intuitive is that most, if not all, of your personal attributes can be guessed (even if imperfectly) from ANY information that is known about you. Measures do not need to be face valid to provide accurate estimates. If we can establish that one thing is consistently related to another, it doesn't matter if that link is obvious or causal. All that matters is that the link does exist, and now we can use it to make predictions. This is commonly referred to as an empirical, or bottom-up, or data-driven approach to measurement. Putting together a LOT of these weak (but non-zero) pieces of information allows us to make valid inferences. This is an example of the principle of aggregation: more data is better, even if some or all of that data is of poor quality, as long as each piece carries at least a little real signal. Of course, you need less high-quality data to get the same accuracy of prediction; but if high-quality data might be suspect (for example, concerns about lying in direct, face-valid measures) or just flat out aren't available (for example, in-depth measures of millions of internet users), lots of low-quality data will do just fine.
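To make the principle of aggregation concrete, here is a small simulation. It is only a sketch under made-up assumptions, not anyone's actual method: every simulated person has one hidden trait, and each of 200 hypothetical "signals" is mostly random noise with a small dose of that trait mixed in. The function name, sample sizes, and signal strength are all invented for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_aggregation(n_people=2000, n_signals=200, strength=0.1, seed=0):
    """Compare one weak signal against the average of many weak signals.

    All parameters are hypothetical. Each signal equals
    strength * trait + standard normal noise, so any single
    signal is only weakly related to the trait.
    """
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n_people)]
    signals = [
        [strength * t + rng.gauss(0, 1) for _ in range(n_signals)]
        for t in trait
    ]
    single = pearson([row[0] for row in signals], trait)
    combined = pearson([sum(row) / n_signals for row in signals], trait)
    return single, combined


single_r, combined_r = simulate_aggregation()
print(f"one weak signal:         r = {single_r:.2f}")
print(f"average of 200 signals:  r = {combined_r:.2f}")
```

Any one signal barely tracks the trait, but the average of all 200 tracks it closely, because the random noise tends to cancel out while the small shared signal adds up.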

A paper from a few years ago led by Michal Kosinski (summarized quite nicely by Stephen Colbert) demonstrated how such non-face-valid measures could be constructed from Facebook likes. By having a computer test how well each like predicts each personality trait or demographic outcome, researchers were able to efficiently estimate users' personality, sexual orientation, political affiliation, and more. Once these algorithms are developed on a group of people where the researchers know the actual status of the outcomes they're interested in (often referred to as the training or development sample), they can be applied to new people where the outcomes are unknown. You can try it out using your own data from either Facebook or Twitter. (This website is NOT AFFILIATED with the researcher implicated in the CA scandal, and there's no reason to suspect these folks have done or will do anything untoward with your information; but still consider that any time you give someone access to your data, they have your data.)
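The develop-then-apply idea can be sketched in a few lines. This is a deliberately simplified illustration and NOT the statistical procedure used in the paper: it just records, for each like, how far the average known score of people with that like sits from the overall average in the development sample, then uses those differences to score someone new. The page names, scores, and function names are all hypothetical.

```python
from collections import defaultdict


def fit_like_weights(development_sample):
    """Learn one weight per like from people whose scores are known.

    development_sample: list of (set_of_likes, known_score) pairs.
    Returns the overall mean score and a dict mapping each like to
    (mean score of people with that like) minus (overall mean).
    """
    overall = sum(score for _, score in development_sample) / len(development_sample)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for likes, score in development_sample:
        for like in likes:
            totals[like] += score
            counts[like] += 1
    weights = {like: totals[like] / counts[like] - overall for like in totals}
    return overall, weights


def predict_score(likes, overall, weights):
    """Estimate a new person's score from likes seen during development."""
    seen = [weights[like] for like in likes if like in weights]
    if not seen:
        return overall  # nothing informative, fall back to the average
    return overall + sum(seen) / len(seen)


# Hypothetical development sample: likes plus a known trait score (0 to 1).
development_sample = [
    ({"page_a", "page_b"}, 0.8),
    ({"page_a"}, 0.6),
    ({"page_c"}, 0.2),
    ({"page_b", "page_c"}, 0.4),
]

overall, weights = fit_like_weights(development_sample)

# Apply to new people whose scores are unknown.
print(predict_score({"page_a"}, overall, weights))
print(predict_score({"page_c"}, overall, weights))
print(predict_score({"page_z"}, overall, weights))  # a like never seen before
```

Notice that nothing here requires the likes to be obviously related to the trait. The weights come entirely from patterns in the development sample, which is what makes the approach data-driven.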

Running my Facebook profile through the prediction algorithm accurately shows that I'm female (one of my top predictors: my liking of Vin Diesel), competitive (because I like Sephora), and really quite smart (which I agree with; because I like Will Smith). But it's not perfect. The algorithm incorrectly guesses that I'm unhappy (I swear I'm not; because I like Rob Zombie). Also interesting is how such an approach leads to the same predictors being used to inform multiple traits: my liking of Starbucks and Barack Obama shows up as contributing factors in almost all of the predictions about me. The goal of these algorithms isn't the perfect prediction for each person, though. It's about gathering and using data on a massive scale, so that, on average, political and corporate ads can be targeted more efficiently (saving money and maximizing impact) and, from an academic/scientific perspective, we can save our participants time by not asking them hundreds of questions that could be estimated from their existing data, as long as they're willing to share it.


Kosinski, Stillwell, & Graepel (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences.