The Concept of Validity
Much of my difficulty appreciating Assessment Centers stems from their
poor fit with traditional testing and measurement approaches. Neither
fish nor fowl, they avoid the standards by which we evaluate the usefulness of a
measuring instrument. Talk to an Assessment Center proponent and try to
get her to give you an indication of reliability and validity. For the
most part, we are asked to accept these on faith because the
Assessment Center model simply doesn't address these issues.
The construction and development of psychological measuring instruments can follow
either of two theoretically different approaches. The first we might term the conceptual
approach; the second, the empirical approach. Both approaches yield
a kind of validity. But neither approach applies to Assessment
Centers.
Conceptually Constructed Tests
Binet's test of intelligence followed a conceptual construction
approach. Binet collected a series of tasks which were successfully
completed by 6-year-olds, 7-year-olds, 8-year-olds, etc. Average
intelligence, by definition, was the ability to complete the (infinite)
set of tasks appropriate for one's age group. Practically, Binet began with many
hundreds of tasks, administered these to his target age groups, and
calculated the total score for each child across the large number of tasks. Keep in
mind that these total scores were approximations of each child's actual
intelligence as defined by the ability to
successfully complete age-specific tasks. But, to simplify his test and
reduce the number of items (tasks) necessary, he calculated which tasks best
related to the total score. In other words, he discarded tasks which were
least predictive of intelligence, and kept the tasks which were most
predictive.
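The item-reduction step described above can be sketched in a few lines of Python. This is a minimal illustration, not Binet's actual procedure: the function names, the pass/fail scoring, and the tiny data set are all assumptions made up for the example, and the correlation used is the simple uncorrected item-total correlation (each task is still counted in the total it is compared against).

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a task everyone passes (or fails) tells us nothing
    return cov / sqrt(var_x * var_y)

def select_items(responses, keep):
    """Return the indices of the `keep` tasks that best relate to the total score.

    responses: one row per child, one column per task (1 = passed, 0 = failed).
    """
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    item_total = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        item_total.append((pearson(item, totals), j))
    # Highest correlations first; report the retained tasks in original order.
    best = sorted(item_total, reverse=True)[:keep]
    return sorted(j for _, j in best)

# Made-up results for five children on four tasks (illustration only).
responses = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
print(select_items(responses, keep=2))
```

In this toy data the first two tasks track the total score closely while the last two do not relate to it at all, so the first two are the ones retained.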
The conceptual approach relies initially on
face validity. That is, if subject matter experts agree that the tasks
Binet chose measured "intelligence" and were representative of the tasks that children can be
expected to complete, then the test has face validity. As it
happens, the construction methodology also guarantees internal consistency, or
reliability. Further, intelligence measured in this fashion has external validity:
intelligence test scores correlate well with success in school as
well as success in life.
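Internal consistency is usually reported as a single coefficient. One widely used coefficient is Cronbach's alpha, which the discussion above does not name; it is brought in here only as an illustration. The sketch below assumes one row per person and one column per item, and the data are made up.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a table with one row per person, one column per item.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Made-up pass/fail results for four people on three items (illustration only).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(cronbach_alpha(responses))
```

Alpha rises when the items vary together, which is what retaining only the items that relate to the total score tends to produce.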
Correlations with constructs external to the measurement device are sometimes called construct validity. That
is, the construct of intelligence demonstrates predictable relationships with
other constructs. Intelligence correlates positively with school achievement, for
example, and negatively with job terminations.
Lest you think all conceptually developed tests meet the standards of Binet's
efforts, let me introduce the Picture Preference Test. I inherited a copy
of the Picture Preference Test as a young professor of testing and measurement
at Wayne State University. The test consists of a box with multiple series
of male photographs. The idea was for a subject to take each series and
rank the photographs according to how much he felt he might like the person
pictured.
The theory or concept behind this test is quite simple. It is
reasonably well established that we like people who are similar to us. The
photographs in the Picture Preference Test were photographs of mental patients,
each with a different diagnosis. If I were to rank the photographs in
order of preference, and each time placed a Schizophrenic at the top of the
pile, then the researcher should conclude I was a Schizophrenic. But,
there were no "normal" people pictured. So, regardless of how I
ranked the photographs, I would demonstrate a pathology. Like Binet's
intelligence test, the Picture Preference Test has a kind of "face"
validity, but because the test results don't relate to anything, the instrument
is of no practical value.
Empirically Constructed Tests
I present the Minnesota Multiphasic Personality Inventory (MMPI) as an
example of an empirically constructed test. To me, it is a bizarre
collection of statements. I am stretching my recollection, but the tone of
the statements, if not the exact wording, is approximated below:
- "The sight of blood excites me."
- "Sometimes the top of my head feels soft."
The total number of statements originally evaluated may have exceeded several
thousand. The point, though, is that the researchers did not need a theory
to guide them. They collected several thousand statements, administered
them to a population of mental patients, and recorded the patterns of responses
characteristic of various pathologies. If a new patient exhibits a
response pattern characteristic of Schizophrenics, then the appropriate
diagnosis is made.
The construction of the test was not guided by theory, but by practical
results. Only statements that statistically demonstrated a
relationship with a pathology were retained for the final version of the
test. Thus, an empirically constructed test also has a validity by
definition: predictive validity. That is, the MMPI was constructed
to predict patient pathologies. Every retained statement was chosen because
it related to a pathology, and statements that had no bearing on patient
pathologies were discarded.
Significantly, the MMPI demonstrates predictive validity even in patient
populations different from the one it was developed on.
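The retain-or-discard logic of empirical construction can be sketched as follows. This is a simplified illustration and not the MMPI's actual scoring procedure: the function names, the use of patients with other diagnoses as the comparison group, and the fixed gap threshold (standing in for a proper statistical test) are all assumptions, and the answers are made up.

```python
def endorsement_rate(group, item):
    """Fraction of people in `group` who answered True to statement `item`."""
    return sum(person[item] for person in group) / len(group)

def key_items(criterion_group, comparison_group, min_gap):
    """Keep the statements whose endorsement rates differ by at least `min_gap`.

    No theory is consulted: a statement survives only if the two groups
    answer it differently.
    """
    n_items = len(criterion_group[0])
    kept = []
    for item in range(n_items):
        gap = (endorsement_rate(criterion_group, item)
               - endorsement_rate(comparison_group, item))
        if abs(gap) >= min_gap:
            kept.append(item)
    return kept

# Made-up True/False answers to three statements (illustration only).
diagnosed = [
    [True, False, True],
    [True, True, True],
    [True, False, False],
    [True, True, True],
]
other_patients = [
    [False, False, True],
    [False, True, True],
    [True, False, True],
    [False, True, False],
]
print(key_items(diagnosed, other_patients, min_gap=0.5))
```

Only the first statement separates the two groups in this toy data, so it is the only one retained, whatever its content happens to be.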
Assessment Centers Again
These two approaches describe reasonably well the construction processes
associated with professionally developed measuring instruments. You
see, of course, that the two approaches are not mutually exclusive. A
hybrid approach could select test items using a conceptual approach, and subject
these to empirical validation. And likewise, one might try to find the
unifying concepts or theory behind an empirically developed instrument such as
the MMPI and begin to investigate the patterns of patient responses to find
theories that explain these patterns. But I see no correspondence between
the Assessment Center approach and either traditional measurement approach.
Consider how we would construct an Assessment Center using the conceptual
approach. We would identify the set of cohesive traits necessary for
successful performance. We might include verbal ability, organizing
ability, situational awareness, and leadership, among others. We would
develop many hundreds of tasks or test items that tap these dimensions.
Then, we would administer these to a sample of subjects and look to see which
tasks or test items best relate to the total score. The subset of
"best" tasks would be those that we define as "leadership,"
or "managerial ability," or whatever it is that Assessment Centers
purport to measure.
To construct an Assessment Center using the empirical approach, we would
identify criterion groups of successful and unsuccessful managers or
leaders. To these we would administer a series of test items or tasks
regardless of their theoretical importance, and we would retain those test
items or tasks that differentiate between the successful and unsuccessful
managers, discarding the remainder.
Both methods require the development of numerous tasks, and the
evaluation of the tasks and measurements to determine their utility. In
the Assessment Center literature, I find no references to such "item
development." Instead, we are asked to accept on faith that the
procedures yield valid results. Well, to me it appears an arrogant conceit to claim that performance
on arbitrarily chosen tasks in contrived settings predicts future performance. Recall that one of the candidates in the OSS
assessment centers flunked his stress interview, but had actually passed an
interrogation by the Gestapo! The eminent psychologists working for
the OSS were proud that they had identified a potential failure as a field agent. I
remain amazed that they screened out the one candidate who experienced success
in the field! Having one exemplary candidate fail makes me question
the validity of the assessment technology. Yet the failure to
accurately predict (even past) behavior was considered a positive outcome.
Put yourself in the position of "Wild Bill" Donovan hiring a
spy. Given both assessment center
information and background information, who would you predict to pass a
Gestapo interrogation? Would you place your bets on the guy who passed an Assessment Center stress
interview or the guy who passed an actual interrogation by the Gestapo? As
I like to say, the best predictor of future performance is past
performance. I would unhesitatingly choose the candidate who had actually
misled the Gestapo and hidden his true identity during a real interrogation.