The Concept of Validity

Much of my difficulty appreciating Assessment Centers stems from their lack of fit with traditional testing and measurement approaches.  Neither fish nor fowl, they escape the standards by which we evaluate the usefulness of a measuring instrument.  Talk to an Assessment Center proponent and try to get her to give you an indication of reliability and validity.  For the most part, we are asked to accept these on faith because the Assessment Center model simply doesn't address these issues.

The construction and development of psychological measuring instruments can follow from two theoretically different approaches.  The first we might term the conceptual approach; the second the empirical approach.  Both approaches yield a kind of validity.  But neither approach applies to Assessment Centers.

Conceptually Constructed Tests

Binet's test of intelligence followed a conceptual construction approach.  Binet collected a series of tasks which were successfully completed by 6-year-olds, 7-year-olds, 8-year-olds, etc.  Average intelligence, by definition, was the ability to complete the (in principle infinite) set of tasks appropriate for one's age group.  In practice, Binet began with many hundreds of tasks, administered these to his target age groups, and calculated the total score for each child across the large number of tasks.  Keep in mind that these total scores were approximations of each child's actual intelligence as defined by the ability to successfully complete age-specific tasks.  But, to simplify his test and reduce the number of items (tasks) necessary, he calculated which tasks best related to the total score.  In other words, he discarded the tasks that were least predictive of the total score, his working measure of intelligence, and kept the tasks that were most predictive.
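
The item-reduction step just described can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not a reconstruction of Binet's actual calculations: the function names, the 0/1 scoring of each task, and the five-child data set are all made up for the example, and the Pearson correlation between each task and the total score stands in for "best related to the total score."

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / sqrt(var_x * var_y)


def select_items(responses, keep):
    """Return the indices of the `keep` tasks that best track the total score.

    responses: one row per child, one 0/1 entry per task (hypothetical layout).
    """
    totals = [sum(row) for row in responses]
    ranked = []
    for j in range(len(responses[0])):
        item_scores = [row[j] for row in responses]
        ranked.append((pearson(item_scores, totals), j))
    # Highest item-total correlation first; ties broken by task index.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(j for _, j in ranked[:keep])


# Made-up data: five children, four tasks (1 = completed, 0 = not completed).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
kept = select_items(responses, keep=2)
```

With this toy data the first two tasks are retained, because they rise and fall with the total score most closely.  Note that each task is itself part of the total it is correlated with, so the correlations are somewhat inflated; a more careful version would correlate each task with the total of the remaining tasks.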

The conceptual approach relies initially on face validity.  That is, if subject matter experts agree that the tasks Binet chose measured "intelligence" and were representative of the tasks that children can be expected to complete, then the test has face validity.  As it happens, the construction methodology also guarantees internal consistency, which is one form of reliability.  Further, intelligence measured in this fashion also has external validity: intelligence test scores correlate well with success in school as well as success in life.
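
To see why selecting tasks by their relationship to the total score tends to build in internal consistency, consider one common index of it, Cronbach's alpha.  The text above does not name a particular index, so the choice of alpha is my assumption for illustration, as are the function name and the made-up data.  The sketch compares alpha for a four-task test with alpha after dropping the task least related to the total.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per person, one column per item."""
    k = len(responses[0])
    item_variances = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Made-up data: five children, four tasks (1 = completed, 0 = not completed).
# The last task is only weakly related to the total score.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

alpha_all = cronbach_alpha(responses)

# Drop the weakest task (the last column) and recompute.
trimmed = [row[:3] for row in responses]
alpha_trimmed = cronbach_alpha(trimmed)
```

In this toy case alpha rises from roughly 0.52 to roughly 0.79 once the weakly related task is removed, which is the sense in which the item-selection procedure favors internal consistency.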

Correlations with constructs external to the measurement device are sometimes called construct validity.  That is, the construct of intelligence demonstrates predictable relationships with other constructs.  Intelligence correlates positively with school achievement, for example, and negatively with job terminations.

Lest you think all conceptually developed tests meet the standards of Binet's efforts, let me introduce the Picture Preference Test.  I inherited a copy of the Picture Preference Test as a young professor of testing and measurement at Wayne State University.  The test consists of a box with multiple series of photographs of men.  The idea was for a subject to take each series and rank the photographs according to how much he felt he might like the person pictured.

The theory or concept behind this test is quite simple.  It is reasonably well established that we like people who are similar to us.  The photographs in the Picture Preference Test were photographs of mental patients, each with a different diagnosis.  If I were to rank the photographs in order of preference, and each time placed a Schizophrenic at the top of the pile, then the researcher should conclude I was a Schizophrenic.  But there were no "normal" people pictured.  So, regardless of how I ranked the photographs, I would demonstrate a pathology.  Like Binet's intelligence test, the Picture Preference Test has a kind of "face" validity, but because the test results don't relate to any external criterion, the instrument is of no practical value.

Empirically Constructed Tests

I present the Minnesota Multiphasic Personality Inventory (MMPI) as an example of an empirically constructed test.  To me, it is a bizarre collection of statements.  I am stretching my recollection, but the tone of the statements, if not the exact wording, is approximated below:

  •     "The sight of blood excites me."
  •     "Sometimes the top of my head feels soft."

The total number of statements originally evaluated may have exceeded several thousand.  The point, though, is that the researchers did not need a theory to guide them.  They collected several thousand statements, administered them to a population of mental patients, and recorded the patterns of responses characteristic of various pathologies.  If a new patient exhibits a response pattern characteristic of Schizophrenics, then the appropriate diagnosis is made.  

The construction of the test was guided not by theory but by practical results.  Only statements that statistically demonstrated a relationship with a pathology were retained for the final version of the test; statements that had no bearing on patient pathologies were discarded.  Thus, an empirically constructed test also has a validity by definition: predictive validity.  That is, the MMPI was constructed to predict patient pathologies, and its statements were chosen precisely because they relate to those pathologies.  Significantly, the MMPI demonstrates predictive validity even in patient populations different from the one on which it was developed.
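
The retention rule for an empirically constructed test can also be sketched in Python.  Again, this is an illustration under assumptions rather than the MMPI's actual procedure: the function names, the true/false (1/0) coding, the two tiny groups, and the cutoff `min_gap` are all hypothetical, and a simple gap in endorsement rates stands in for a proper statistical test.

```python
def endorsement_rate(group, j):
    """Share of people in `group` who answered statement j as true (1)."""
    return sum(row[j] for row in group) / len(group)


def retain_statements(criterion_group, comparison_group, min_gap=0.4):
    """Keep the statements whose endorsement rates differ between the groups.

    min_gap is an arbitrary cutoff chosen for this example only.
    """
    kept = []
    for j in range(len(criterion_group[0])):
        gap = endorsement_rate(criterion_group, j) - endorsement_rate(comparison_group, j)
        if abs(gap) >= min_gap:
            kept.append(j)
    return kept


# Made-up data: one row per patient, one 0/1 answer per statement.
criterion_group = [   # patients with the diagnosis of interest
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]
comparison_group = [  # patients without that diagnosis
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
kept = retain_statements(criterion_group, comparison_group)
```

Here only the first statement survives, because it alone is answered differently by the two groups.  Nothing in the rule asks whether a statement makes theoretical sense, which is the point of the empirical approach.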

Assessment Centers Again

These two approaches describe reasonably well the construction processes associated with professionally developed measuring instruments.  You see, of course, that the two approaches are not mutually exclusive.  A hybrid approach could select test items using a conceptual approach, and subject these to empirical validation.  Likewise, one might look for the unifying concepts behind an empirically developed instrument such as the MMPI by investigating the patterns of patient responses for theories that explain them.  But I see no correspondence between the Assessment Center approach and either traditional measurement approach.

Consider how we would construct an Assessment Center using the conceptual approach.  We would identify the set of cohesive traits necessary for successful performance.  We might include verbal ability, organizing ability, situational awareness, leadership, among others.  We would develop many hundreds of tasks or test items that tap these dimensions.  Then, we would administer these to a sample of subjects and look to see which tasks or test items best relate to the total score.  The subset of "best" tasks would be those that we define as "leadership," or "managerial ability," or whatever it is that Assessment Centers purport to measure.

To construct an Assessment Center using the empirical approach, we would identify criterion groups of successful and unsuccessful managers or leaders.  To these we would administer a series of test items or tasks regardless of their theoretical importance, and we would retain those test items or tasks that differentiate between the successful and unsuccessful managers, discarding the remainder.

Both methods require the development of numerous tasks, and the evaluation of the tasks and measurements to determine their utility.  In the Assessment Center literature, I find no references to such "item development."  Instead, we are asked to accept on faith that the procedures yield valid results.  Well, to me it appears an arrogant conceit to claim that performance on arbitrarily chosen tasks in contrived settings predicts future performance.  Recall that one of the candidates in the OSS assessment centers flunked the stress interview but had actually passed an interrogation by the Gestapo!  The eminent psychologists working for the OSS were proud that they had identified a potential failure as a field agent.  I remain amazed that they screened out the one candidate who had already experienced success in the field!  Having one exemplary candidate fail makes me question the validity of the assessment technology.  Instead, the failure to accurately predict (even past) behavior was considered a positive outcome.

Put yourself in the position of "Wild Bill" Donovan hiring a spy.  Given both assessment center information plus background information, who would you predict to pass a Gestapo interrogation?  Would you place your bets on the guy who passed an Assessment Center stress interview or the guy who passed an actual interrogation by the Gestapo?  As I like to say, the best predictor of future performance is past performance.  I would unhesitatingly choose the candidate who had actually misled the Gestapo and hidden his true identity during a real interrogation.

