Evidence and Error: psychology

At the sight of that skull, I seemed to see all of a sudden ... the nature of the criminal – an atavistic being who reproduces in his person the ferocious instincts of primitive humanity and the inferior animals. Thus were explained anatomically the enormous jaws, high cheek-bones, prominent superciliary arches ... found in criminals, savages, and apes ... and the irresistible craving for evil for its own sake.
–Cesare Lombroso

The idea that facial metrics can provide information about a person’s propensity for unethical behaviour is making something of a comeback in the scientific literature. Recent research has correlated facial width-to-height ratios with:

(a) time spent in the penalty box among Canadian hockey players (Carré & McCormick 2008),
(b) propensity to punish an opponent at a cost to oneself (Carré & McCormick 2008), and
(c) the propensity to selfishly exploit an opponent’s trust in a game context (Stirrat & Perrett 2010).

Not only can these “unethical” or “aggressive” tendencies be correlated with facial metrics via meticulous research – it turns out that casual observers are also able to predict a complete stranger’s propensity for aggression with surprising accuracy based only on a picture of that person’s face (Carré & McCormick 2009). The ethical implications of these findings are interesting and perhaps controversial (think racial and criminal profiling), but I will resist the temptation to go down that road at the moment, and instead focus on a recent study by Michael Haselhuhn and Elaine Wong titled “Bad to the bone: facial structure predicts unethical behaviour”.

Three things in particular attracted me to this paper: (i) the authors’ use of the term ‘unethical’, which in my mind is subjective and imprecise; (ii) their reference to William March’s classic novel The Bad Seed and its conclusion that some people are just ‘born evil’; and (iii) the final sentence of the discussion: “Perhaps some men truly are bad to the bone.” In truth, it was the ridiculousness that drew me in, but nonetheless, let us proceed with the science!

Haselhuhn and Wong summarize their main result as follows:

...we show that genetically determined physical traits can serve as reliable predictors of unethical behaviour ... Specifically, we identify a key physical attribute, the facial width-to-height ratio, which predicts unethical behaviour in men.

In a first experiment, the authors arranged for 96 pairs of Business Admin students to participate in a negotiation exercise (conducted via email). Each pair consisted of a ‘buyer’ and a ‘seller’. Sellers were told that the property they were selling must not be commercially developed whereas buyers were instructed to obtain the property specifically for commercial development. The researchers quantified ‘unethical behaviour’ based on whether or not a buyer explicitly misstated his or her intentions at any point during the email negotiations.

In a second study, 103 undergraduates completed an online survey designed to assess their own ‘sense of power’ and then were allowed to enter a lottery for a chance to win a $50 gift card. The number of times that a student could enter the lottery was to be determined by a single roll of two dice (simulated at random.org). Because there was no oversight when students inputted the result of their dice roll at the end of the survey, there was nothing stopping participants from cheating and entering a higher number than was actually rolled (thereby increasing their chance of winning the gift card). Furthermore, because the dice roll was random, researchers were able to quantify cheating (at the group level, but not the individual level) based on deviations from the expected average roll of 7. For both experiments, two research assistants measured facial width-to-height ratios (hereafter ‘facial WHR’) of all participants based on school photographs (inter-rater agreement was high, r = 0.758).

So, what happened? In the first study, 18/96 buyers engaged in explicit deception during the email negotiation. Based on logistic regression, the probability of deception significantly increased with increasing facial WHR for men, while women’s facial WHR was not significantly related to deception. In the second study, the average reported dice roll from 103 participants was 7.76, significantly greater than what would be expected by random chance (i.e. 7.00), and therefore indicative of cheating. The reported dice roll (and therefore the incidence of cheating) did not differ between men and women. Just as before, ‘unethical behaviour’ (reported dice roll) was significantly and positively related to facial WHR for men, but not significantly influenced by facial WHR for women.
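For the curious, here is a rough back-of-the-envelope check of that group-level comparison. This is my own sketch, not the authors' analysis: it assumes honest rolls of two fair dice and uses their theoretical variance to approximate the standard error, whereas the paper worked from the reported data. Only the sample size (103) and the mean reported roll (7.76) come from the study.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# All 36 equally likely sums of two fair six-sided dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]

# Exact mean and variance of an honest roll.
expected_mean = Fraction(sum(sums), len(sums))
variance = Fraction(sum((s - expected_mean) ** 2 for s in sums), len(sums))

n = 103               # participants (from the study)
reported_mean = 7.76  # average reported roll (from the study)

# Assumption: honest-roll variance stands in for the sample variance.
standard_error = sqrt(float(variance) / n)
z = (reported_mean - float(expected_mean)) / standard_error

print(f"expected mean = {expected_mean}, variance = {variance}, z = {z:.2f}")
```

Under these assumptions the group average sits a bit more than three standard errors above 7, which is why the inflation can be detected for the group as a whole even though no individual roll can be labelled a cheat.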

Recall that in the second study, participants also completed a survey about their own sense of power. It turns out that sense of power was positively related to both facial WHR and reported dice roll for men but unrelated to both variables among women. Some statistical voodoo (similar to path analysis) allowed the researchers to conclude that sense of power mediated the relationship between facial WHR and cheating behaviour among men.

Holy eff, right? I’m pretty amazed by these data. Before reading the paper, I expected that I would take issue with their analyses or conclusions, but the study was in fact well done. The only issue I take is with the authors’ claim (referring to the second experiment) that

...our approach introduces potential noise to the data as, for example, men with smaller facial WHRs may legitimately roll and report higher dice totals. Thus, testing for cheating behaviour using this paradigm represents a conservative test of our hypotheses.

This is simply untrue. It is equally likely that men with larger facial WHRs might legitimately roll and report higher dice totals. Their approach is imprecise, but not ‘conservative’. A more precise way to test the relationship between facial WHR and cheating would be to use the difference between reported dice rolls and actual dice rolls as a dependent variable. This would require an experimental modification allowing the researchers to know with certainty what each participant actually rolled (e.g. using cameras, software tracking, etc.). Nonetheless, the results seem robust.
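To illustrate why the alternative dependent variable would be more precise, here is a small simulation. Everything in it is hypothetical: the range of facial WHR values, the rule that the chance of bumping a roll up by one rises with WHR, and the sample size are all invented for illustration and are not taken from the paper. It compares the correlation of WHR with the reported roll against its correlation with the inflation (reported minus actual).

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    whr, reported, inflation = [], [], []
    for _ in range(n):
        w = rng.uniform(1.6, 2.2)                # hypothetical WHR range
        actual = rng.randint(1, 6) + rng.randint(1, 6)
        p_cheat = 0.5 * (w - 1.6) / 0.6          # hypothetical: 0 to 0.5
        bump = 1 if rng.random() < p_cheat else 0
        rep = min(12, actual + bump)             # cannot report above 12
        whr.append(w)
        reported.append(rep)
        inflation.append(rep - actual)
    # Correlation with the noisy reported roll vs. with the inflation itself.
    return pearson(whr, reported), pearson(whr, inflation)

r_reported, r_inflation = simulate()
print(f"r(WHR, reported roll) = {r_reported:.3f}")
print(f"r(WHR, reported - actual) = {r_inflation:.3f}")
```

The luck of the dice is unrelated to WHR in this toy model, so it dilutes the correlation with the reported roll without pushing it in either direction. Subtracting the actual roll removes that noise and leaves a much clearer signal.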

So then, how could this relationship have evolved? Intuition would suggest that physical signals that reliably predict unethical behaviour should be selected against. If you’re trying to deceive or cheat someone, you don’t want to tell them up front. Haselhuhn and Wong suggest that a relationship between certain facial features and unethical behaviour could evolve through pleiotropic associations with sexually selected traits such as aggression and dominance. If females like to mate with dominant males, and dominance is correlated both with certain facial features and unethical behaviour, then unethical behaviour may come to be associated with those same facial features.

Importantly, the correlation between facial WHR and unethical behaviour demonstrated by Haselhuhn and Wong does not necessarily imply that some people are ‘born evil’. For starters, the effect sizes they reported were fairly small – facial WHR explained a relatively small proportion of the variance in propensity to cheat and deceive. Second, as the authors point out:

...it is important to recognize that other developmental processes may play an important part in forming these links ... one possibility is that men with greater facial WHRs are perceived and treated by others in ways that encourage unethical action (i.e. a self-fulfilling prophecy).

Again, although the reported relationships seem to be robust, I see no data here or elsewhere supporting the idea that some men are “bad to the bone”.

____________________________________

Haselhuhn, M., & Wong, E. (2011). Bad to the bone: facial structure predicts unethical behaviour. Proceedings of the Royal Society B: Biological Sciences. DOI: 10.1098/rspb.2011.1193

CBC News recently reported on a study published in the journal Emotion that attempted to determine how body language influences perceived sexual attractiveness. I take issue with some of the methods and interpretations. You can read the actual paper here, and the CBC article here, but the gist of it is as follows:

A large sample of men and women were shown photographs of members of the opposite sex and asked to rate their sexual attractiveness.
Each photo depicted a ‘model’ displaying one of four emotions – happiness, pride, shame, or neutral.
In the first study, all participants were asked to rate a single photograph. All male participants rated the same female model in one of the four possible poses. Likewise, all female participants rated the same male model in one of the four poses.

In a second study, three large groups of participants rated a bunch of photographs that were viewed online. Again the photographs displayed a member of the opposite sex expressing one of the four emotions. This time, however, the photographs (over 400 of them) were obtained online (e.g. from Google Images) and sorted into their respective categories (2 genders × 4 emotions = 8 categories) by trained assistants according to published guidelines. So, unlike the first study, each category here contained pictures of many models, and different models were used to depict each emotion.

The general result that held across both studies was that males expressing happiness were rated the least attractive and males expressing pride were the most attractive. The trend was essentially reversed for female models, such that happy females were rated the most attractive whereas females expressing pride were among the least attractive.
There were other interesting results and many details I have left out for the sake of brevity. Check out the original paper for more information.

So what are the shortcomings of this study? My problem with the first study (which in fairness the authors do acknowledge) is that the sample size for each gender is one. All female participants rated the same male model, and all male participants rated the same female model. This study provides great evidence that this particular woman and this particular man are respectively more and less attractive when smiling, but we have no evidence that this trend exists in the population at large. It is entirely plausible that for different subjects the trend would be reversed.

To really hammer this point home, consider the question – are songs in the key of C minor more enjoyable than those played in the key of D minor? What the authors have essentially done is ask the London Philharmonic to record two versions of Beethoven’s Symphony No. 5 – one version in the original C minor, and the other transposed into D minor. They then asked 184 participants to rate the enjoyability of one of the versions, and concluded that songs in C minor are more enjoyable than songs in D minor because participants on average gave the C minor version of Beethoven’s Symphony No. 5 a higher enjoyability score. Crazy, right!? Maybe Beethoven’s other symphonies actually sound better in D minor, or maybe his symphonies sound better in C minor but his sonatas are more enjoyable in D minor, or maybe Beethoven’s compositions are generally more enjoyable in C but Bach’s are consistently more enjoyable in D, etc. Point is, you can’t make generalizations with a sample size of one. Again, the authors do actually acknowledge this point, and claim that the second study addresses this shortcoming.

Problem number deux. In the second study, where photographs were obtained from the internet and many models were used in each category, I believe there were systematic differences between categories apart from just emotional expression. Admirably, the authors have posted all of the photos used in their study here. There are a few trends that really stuck out for me. One is that photographs in the pride samples mostly showed athletes in their race or match apparel, whereas few or no athletes appeared in the other three categories. Another trend is that most neutral photographs tended to show only the face and sometimes shoulders, whereas hands, upper bodies, and even full bodies appeared in the other categories. A third issue is that neutral faces were almost always facing directly toward the camera with no angle or tilt, whereas faces and bodies in other categories were much more likely to be angled. There also seem to be differing proportions of professional-looking photographs between the different categories (the authors did partially control for the number of models that appeared to be professional models, but only in two of the three samples). In sample A, all of the shame photographs appear to be professionally taken, whereas most of the neutral photographs appear to have been taken by a kid at the DMV.

Going back to the music analogy, the authors have essentially downloaded a bunch of songs from iTunes, half in C minor and half in D minor, but for whatever reason most of their C minor songs happen to fall into the Hip-Hop & Reggae genre, and most of the songs in D minor happen to belong to the Country & Western genre. Even if we have a large and random sample of the population rating the enjoyability of these different songs, any average differences observed between songs in C and D minor are not necessarily due to the different key signatures, but could just as easily be due to any of the myriad differences that (on average) distinguish Hip-Hop music from Country music. Of course the same is true for the different sets of photographs in the study described above, except the confounding variables in this case were photograph quality, angle of head from camera, proportion of body appearing in the photograph, clothing and location of the model, etc.
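The confounding problem can also be shown with a toy simulation. All of the numbers here are made up for illustration and none come from the paper: emotion is given no effect on ratings at all, professional-looking photos get a one-point boost, and the pride category is assumed to contain far more professional photos than the neutral category.

```python
import random
from statistics import mean

def simulate_ratings(n_per_category=20000, seed=1):
    rng = random.Random(seed)
    # Hypothetical share of professional-looking photos in each category.
    share_professional = {"pride": 0.8, "neutral": 0.2}
    rows = []
    for emotion, share in share_professional.items():
        for _ in range(n_per_category):
            professional = rng.random() < share
            # Emotion has NO effect here; only photo quality shifts the rating.
            rating = 5.0 + (1.0 if professional else 0.0) + rng.gauss(0, 1)
            rows.append((emotion, professional, rating))
    return rows

def mean_rating(rows, emotion, professional=None):
    """Mean rating for one emotion, optionally restricted by photo quality."""
    return mean(
        rating
        for e, p, rating in rows
        if e == emotion and (professional is None or p == professional)
    )

rows = simulate_ratings()
raw_gap = mean_rating(rows, "pride") - mean_rating(rows, "neutral")
matched_gap = (mean_rating(rows, "pride", True)
               - mean_rating(rows, "neutral", True))

print(f"pride minus neutral, all photos: {raw_gap:.2f}")
print(f"pride minus neutral, professional photos only: {matched_gap:.2f}")
```

In this toy setup the pride photos appear to be rated higher overall, yet the gap all but disappears once only professional-looking photos are compared. That is the pattern one would expect if photo quality, rather than emotional expression, were driving the difference.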

To conclude (finally!), I don’t really doubt the claims made in this study, I just don’t think they necessarily follow from the obtained results. There are logistical limitations to any study, and we can rarely design studies that will definitively test a hypothesis of interest while controlling for every possible confounding factor. I do however think that it is reasonable and possible to more conclusively and meticulously test the hypothesis that emotional expression influences perceived attractiveness by members of the opposite sex.

____________________________________

Tracy, J. L., & Beall, A. T. (2011). Happy guys finish last: the impact of emotion expressions on sexual attraction. Emotion 11:1379-1387. DOI: 10.1037/a0022902

Tuesday, December 13, 2011

Facial metrics predict unethical behaviour

Monday, December 12, 2011

Too sexy for my smile