A Consultant's Casebook
about me: Andres Inn, Ph.D.
For a number of reasons, I regard Assessment Centers as overrated. In general, Assessment Centers are an incredibly expensive way to determine someone's capability to perform a job. The ratio of trained staff (read: expensive staff) to participants is at least 1:1 for each exercise. When there are many candidates and Assessment Center exercises run for multiple days, the costs multiply quickly. When there are more than a few candidates, the logistics of measurement become confounding. And because most Assessment Centers operate across a number of days, I worry that later candidates are better prepared, having received information about the exercises from colleagues who attended earlier. Then there is the problem of validity; that is, can rated performance on artificially contrived exercises really predict future job performance?
Problems with Scoring
Performance ratings, whether at Assessment Centers or elsewhere, have been much studied. To trust ratings, we must be sure that they are both reliable and valid. Ratings are reliable when they are repeatable; that is, the same rater will rate similarly in similar circumstances. Ratings are valid when they measure what they are supposed to measure and not something else. Reliability is, of course, much easier to achieve than validity. Even anecdotally we remember teachers who graded leniently, or harshly. The fact that they graded consistently did not mean that they graded fairly!
The best Assessment Centers avoid rating errors due to rater leniency by having each rater rate each candidate. When scores are averaged together, the differences in leniency disappear, because each candidate had the "benefit" of the easier raters as well as the "disadvantage" of the harsher raters. With many candidates, this little trick becomes expensive and complicated. Recall from above that it requires approximately 3 days to process 9 candidates with 3 raters. Bringing in an additional 3 raters will cut the Assessment Center time in half, but it adds the scheduling complications of (1) allowing all 6 raters to view the performance of the 9 candidates an equal number of times, and (2) deciding whether the two teams of raters are indeed equivalent, or whether all 20 combinations of 6 raters taken 3 at a time must be used - again a difficult scheduling problem when there are only 9 candidates. Consequently, less rigorous Assessment Centers pretend to avoid differences in leniency by having groups of raters discuss their ratings for each candidate and work out a consensual rating. Of course, this ploy only muddies the water. When individual ratings are not maintained and only group ratings are collected, we can't be sure whether they reflect the candidate's performance, rater biases, or some combination of the two.
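The panel arithmetic above is easy to verify. As a minimal sketch (the rater labels are hypothetical), Python's itertools enumerates every distinct 3-rater panel drawn from a pool of 6:

```python
from itertools import combinations
from math import comb

raters = ["R1", "R2", "R3", "R4", "R5", "R6"]  # six hypothetical raters
panels = list(combinations(raters, 3))          # every distinct panel of 3

# 6 raters taken 3 at a time yields C(6,3) = 20 possible panels
print(len(panels), comb(6, 3))
```

A fully balanced schedule would have to cycle through all 20 of those panels - or else convincingly demonstrate that two fixed teams are equivalent - with only 9 candidates to spread them across.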
A further result of allowing raters to develop their ratings consensually is that we cannot even begin to assess the effects of the most pervasive problem in measuring human performance through ratings. Sometimes called the rater-by-ratee interaction, it refers to a common characteristic of everyone - we tend to like people who share similarities with us, and to dislike people who are different. These differences can be as obvious as race and sex, or as subtle as regional dialects and accents, or cultural experiences. Because "liking more" is so easily mistaken for "performing better," even the most rudimentary research designs consistently find rater-by-ratee interactions.
To find these interactions, Assessment Center candidates from different ethnic backgrounds and both genders are paired with all combinations of raters of the same ethnic backgrounds and both genders. Statistically, it is then possible to determine what percentage of the variation in rating scores is due to differences among the candidates, differences in race, differences in sex, and differences among raters. In a university research setting, these analyses make sense. Practically, however, such analyses usually remain undone.
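The decomposition such a researcher would run can be sketched with a toy two-way layout. The numbers below are invented purely for illustration; in a real study the cells would be actual ratings in a fully crossed candidate-by-rater design:

```python
# Hypothetical ratings[candidate][rater] for a fully crossed design.
ratings = {
    "A": {"R1": 8, "R2": 4, "R3": 6},
    "B": {"R1": 9, "R2": 7, "R3": 8},
    "C": {"R1": 4, "R2": 4, "R3": 4},
}
cands = list(ratings)
raters = list(ratings["A"])
n_c, n_r = len(cands), len(raters)

grand = sum(ratings[c][r] for c in cands for r in raters) / (n_c * n_r)
cand_mean = {c: sum(ratings[c].values()) / n_r for c in cands}
rater_mean = {r: sum(ratings[c][r] for c in cands) / n_c for r in raters}

# Classical sum-of-squares partition: candidates, raters, and the rest.
ss_total = sum((ratings[c][r] - grand) ** 2 for c in cands for r in raters)
ss_cand = n_r * sum((cand_mean[c] - grand) ** 2 for c in cands)
ss_rater = n_c * sum((rater_mean[r] - grand) ** 2 for r in raters)
# With one rating per cell, the residual bundles the rater-by-ratee
# interaction together with error - they cannot be separated here.
ss_resid = ss_total - ss_cand - ss_rater

print(ss_cand / ss_total, ss_rater / ss_total, ss_resid / ss_total)
```

The residual share is exactly the quantity that disappears from view once raters are allowed to negotiate a single consensual score.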
Again, the reason is expense. It is frequently impractical to recruit and train raters merely to fill a quota for race and gender. Also, Assessment Centers are frequently run with only a few candidates, so a complete crossing of candidates' personal characteristics with raters' characteristics is even more impractical. But worst of all, who decides which personal characteristics are important and will yield rater-by-ratee interactions?
Much of the research on ratings focuses precisely on race, age, and gender because discrimination on these characteristics is prohibited by law. Perhaps in controlled university research settings such obvious and illegal biases can be minimized, but one only needs to look at the race and sex characteristics of typical business settings to wonder whether rater biases in terms of similarities can ever be eliminated. And I don't refer only to race, gender, and age. Consider also international oil companies where senior as well as junior staff speak with a common Southern drawl, or New York camera shops where the counters are lined with young men in white shirts, black pants, and skullcaps, or Hong Kong, where the senior salesmen of Groz-Beckert speak in German accents.
People seek affinity among their own kind - however defined. I know, for example, that I prefer the company of well-traveled people. I enjoy trading stories about places and people, and generalizing about cultures. Others may look for affinities in other characteristics such as facial features, accents, dress, social status, whatever. We can never know what characteristics are important to whom. Consequently, we can never be sure we have accounted for rater-by-ratee interactions. There is no way to pair all possible combinations of social, appearance, physical, dress, and other characteristics among even a small group of raters and candidates. We simply can't know a priori what characteristics will be important for whom. And it is unlikely that our raters will even be able to tell us what characteristics they focus on when forming "like-dislike" opinions.
When I advise my students on job interviews, I remind them that we can never know which of our personal attributes an interviewer will find important or memorable. I remember that in the early 1970s, Ross Perot's outfit, Electronic Data Systems (EDS), was notorious for interviewing job candidates as well as their wives! EDS personnel office interviewers were known to focus not only on candidates' qualifications but also on dress (blue suits with white shirts were company standard) as well as the candidate's wife's social skills (entertaining was encouraged). But the same can be true in any interview setting, even when the focus is not so well known.
So, to approach an interview, it is frequently best to reveal as little personal background as possible. Some interviewers may prefer married candidates, others unmarried people. Some interviewers may regard cat-lovers as peculiar; others may find a kinship. Some interviewers may be sports fanatics; others might prefer reading the dictionary to watching the Chicago Bulls. And when your competitor for the job is a philandering divorcee and known cat breeder whose current job is with Random House publishing, you may prefer to leave your personal details unrevealed and to approach the interview with a genuine interest in the job and the company - topics about which the interviewer can demonstrate competence. Keep the interview focused on the job, and you are less likely to be eliminated from consideration because of an extraneous bias. (And, since people enjoy showing their competence, you will have made a hit with the interviewer as a bonus.)
The common response to these criticisms is that Assessment Center staff are trained professionals - trained to eliminate biases from their ratings. I find this completely improbable. The most professional and experienced raters I can think of are teachers. My experience, as well as the complaints I have heard from students, suggests that teacher biases are always manifest.
When performance measurement is based on observations of individuals working within groups, there is a further potential source of error. This might be termed a ratee-by-ratee interaction. Here I refer to the fact that performance judgments are generally comparative, and rarely made against a "standard." So, if I am paired with weaker candidates, and together we suffer through a group exercise in an Assessment Center, I am likely to be rated better than if I were paired with much stronger candidates.
Again, the solution to avoiding ratee-by-ratee interactions is to pair every candidate participating in an Assessment Center with every other candidate an equal number of times, so that each candidate has the "benefit" of demonstrating his capabilities in comparison to weaker candidates, and the "disadvantage" of demonstrating them in comparison to stronger candidates. Of course, if I am paired with weaker candidates more often than with stronger ones, I will still have an advantage when scores are averaged or otherwise accumulated.
This advice is easier said than done. I defy you to develop a simple design in which all possible combinations of candidates work together under the eyes of all raters, much less under the eyes of all possible combinations of raters.
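To appreciate the scale of the problem, consider just the candidate-pairing half on its own. It amounts to a full round-robin tournament, which the classic "circle" scheduling method can generate; the candidate names below are hypothetical. Note that this sketch says nothing about also balancing rater panels on top of it:

```python
from itertools import combinations

def round_robin(players):
    """Circle method: every pair meets exactly once across the rounds.
    With an odd head-count, one player sits out ("BYE") each round."""
    ps = list(players)
    if len(ps) % 2:
        ps.append("BYE")
    n = len(ps)
    rounds = []
    for _ in range(n - 1):
        rounds.append([(ps[i], ps[n - 1 - i]) for i in range(n // 2)])
        ps.insert(1, ps.pop())  # hold ps[0] fixed, rotate everyone else
    return rounds

candidates = list("ABCDEFGHI")  # 9 hypothetical candidates
schedule = round_robin(candidates)
pairs = [p for rnd in schedule for p in rnd if "BYE" not in p]

# 9 rounds cover all C(9,2) = 36 pairings, with one sit-out per round
print(len(schedule), len(pairs))
```

Nine rounds just to balance 9 candidates against each other - and crossing that schedule with the 20 possible 3-rater panels is where any "simple design" collapses.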
All material is copyrighted! Not to be used without the written permission of Andres Inn. ©2000.
Send mail to Andres.Inn@mail.ee with questions or comments about this web site.