This is the second
post in a series on the topic of validity in educational
assessment. In my first post, I described the traditional
characterization of content-related, criterion-related, and construct-related evidence
as they are relevant to educators and credentialing innovators who use and
design assessments. This post summarizes traditional characterizations of “consequential
validity.” This aspect of validity concerns
the broader consequences of administering assessments to learners and using the
resulting scores. It is a complex idea that is really crucial to many
assessment and credentialing innovators (because broader change is their goal).
Many measurement professionals have long argued that it is an
"unsanctioned" aspect of validity. Before I write about how
that is changing, I want to describe how consequential validity has
traditionally been written about and why I have long disagreed.
As I will describe in a subsequent
post, most assessment and measurement professionals now agree that assessment
developers should consider both the intended and unintended consequences of the
way their assessments are used. It is worth noting that the large scale K-12
assessment reforms in the US collapsed swiftly around 1995. This was in part
because they failed to deliver promised improvements in teaching and learning;
arguably, the same thing is happening right now with new assessments developed
in the massive Race to the Top assessment initiative. First, however, let’s
consider how assessment and measurement specialists have traditionally thought
about the consequences of assessment.
As I introduced in the first post, I
have been teaching a course called Assessment in Schools for about 15 years,
using Jim Popham’s
book Classroom Assessment: What Teachers
Need to Know. Because
it is the most widely used textbook on the topic in the English language, it
gives a good idea of what many teachers learn about educational assessment
(typically in a graduate-level course). Because Popham updates the text about
every three years, the book is a good indicator of how mainstream views of assessment
are evolving. Over many years, I have found that tracking this evolution helps
practitioners understand the complex-but-crucial issues that make for
effective assessment practices. This evolution is also particularly helpful
with validity, which is the most important and most complex assessment topic of
them all.
Why
Disagreement is Good for Learning and Innovation
Over the years,
I have also found that letting students see where I disagree with a textbook is
a great way to help them make sense of these complex issues. Textbooks
necessarily avoid some of the biggest debates in a discipline because such debates
overwhelm students who lack the prior knowledge and context they need to make
sense of such things. Once they make sense of the straightforward characterizations
in the text, however, we can really dig into those issues. I have found
that the same holds true of credentialing innovators with whom I work. When
I raise these sorts of issues with people who are just starting to
develop a badge system, it overwhelms them. But something magical happens when
a group of people come together and struggle through (a) the claims they want
to make, (b) the evidence needed to support those claims, and (c) the
assessments that would generate this evidence. Once they are in the trenches,
they have enough experience to appreciate why the issues I am raising in this series are
crucial to the success of their innovation.
[Photo: James Popham]
I have always disagreed with two
aspects of Popham’s validity chapter in his most recent editions. One aspect
is “consequential
validity,” which concerns
the extent to which the uses of a test are valid. The other is “face validity,” which concerns
the extent to which the evidence from an assessment appears to
be consistent with the assessment’s
purpose.
In the seventh edition (2014), Popham characterized both as “unsanctioned forms of
validity evidence.” He states:
If teachers can’t draw valid score-based inferences about students, there is simply no reason to measure students in the first place. However, because validity is such a central notion in educational assessment, some folks have attached specialized meanings to it that, although helpful at some level, also may introduce confusion (p. 118)
I am definitely one
of the “some folks” he is referring
to here. While I agree that both concepts are confusing, I have argued for
years that it is a mistake for anyone who uses or designs assessments to ignore
these aspects of validity. This argument reflects my embrace of newer
situative/sociocultural perspectives. These perspectives lead to a much broader
view of learning, and therefore assessment, including the social context of
classrooms and institutional climate of schools. As I will show across this
series of posts, Popham and many other assessment and measurement scholars are
gradually coming to embrace this perspective as well.
Why does Consequential Validity Matter when Using Assessments?
To elaborate,
consequential validity concerns the impact of taking and administering
assessments and tests, and then using the resulting evidence. Importantly, this
includes both the intended and unintended consequences of
doing so. The idea was introduced and vigorously debated in the context of
large scale testing reforms that spread rapidly in the US in the early 1990s.
As I introduced above, Popham has long argued that concern with consequential
validity is confusing and distracts from the central goal of making valid
inferences about what students know. On one hand, he admits in the 2014
edition that “educators should
obviously be attentive to the consequences of test use” (such as
inappropriately holding students back from advancing to the next grade). But he
goes on to say that
the notion of consequential validity is apt to confuse the central thrust of validity--namely to confirm or disconfirm the defensibility of the score-based inferences we make about our students. If we make accurate inferences about students based on a test, yet rely on those inferences to make terrible decisions, our tests will have negative consequences. But it was the use to which we put our valid score-based inferences that was deplorable (p. 118).
While I let Popham
have his say in my classes, I still let my students know that I have argued
against this position in my research for many years. In general, I think that
educators need to be much more sensitive to the way their assessment practices
transform the culture of their classrooms. In particular, I think that
educators should appreciate that the reason students often resist discussing
content in personally meaningful ways in class is that students know that
they will primarily be held accountable for knowing the abstract
decontextualized representations of the content, which teachers typically use
in their assessments.
My core beliefs about
what people who use assessments need to know about assessment consequences
emerged across a decade of assessment research around the GenScope software for
learning introductory inheritance (summarized here). It took that whole
decade to fully wrap my head around the broader notion of systemic validity as
raised in 1989 by Frederiksen and Collins. They describe
systemic validity as assessment practices that
[Photo: John Frederiksen]
induce curricular and instructional changes in educational systems (and learning strategy changes in students) that foster the development of cognitive traits that the tests are designed to measure (p. 27).
While they couched
their ideas in very cognitive terms to appeal to measurement specialists, I
definitely could read between the lines. What really influenced me was
reading that article alongside a groundbreaking initial treatise on situated
cognition that Brown, Collins, and Duguid published the same year. One of the
things that struck me about measurement specialists at ETS was that back in
1995-97, they dismissed any consideration of
consequences. I was thrilled when John Frederiksen came to Princeton to
do a talk on this topic in 1997 and I was the only person who showed up!
Since I had already gone a long way down the situated cognition
path, I began to realize that my path as an assessment researcher was going to be
challenging and sometimes lonely.
Why
does Consequential Validity Matter when Developing Assessments?
In the early 1990s,
Samuel Messick (a leading validity theorist at Educational Testing Service)
convinced some (but certainly not all) in the educational testing community to
consider the consequences of assessment and testing. More specifically, he
argued that consequences were one of “six distinguishable aspects of
construct validity” in an
influential 1995 paper in American Psychologist.
In
the ensuing years, consequential validity emerged as one of the most
contentious issues in the history of educational assessment and testing. Some
measurement scholars agreed with Mark Reckase’s arguments in his 1998 paper that
test developers were not responsible for assessment consequences. Other
assessment scholars like Pam Moss and Lorrie Shepard argued that Messick
actually downplayed assessment consequences by arguing that test developers
could consider consequences after assessments had been
developed. I was immediately convinced by their argument that doing so often
excuses measurement professionals from responsibility for some horrific consequences.
[Photo: Lorrie Shepard]
One of
the examples that Lorrie Shepard used in her 1993 article in the Review of Educational Research concerned
kindergarten “readiness” tests. These tests became widely used in the early
1970s during the heyday of Piagetian developmental psychology. These
tests provided “scientific” evidence that large cohorts of young children
(mostly disadvantaged ethnic minorities) needed to wait one or two years until they were “developmentally
ready” to benefit from kindergarten. Ten years later this resulted in
large cohorts of those same students starting high school at age 15 and
16. Shockingly large numbers of these students then went on to drop out of high
school when they turned 18.
Conformative, Deformative, and Transformative Assessment
Consequences
I have now published several
papers arguing that both educators and test developers should think primarily
about the consequences of their assessment practices and secondarily about the
other aspects of validity. But I think the UK assessment theorist Harry Torrance captured
it most clearly for higher education in a 2012 paper in the Oxford Review
of Education, where he introduced the notions of conformative,
deformative, and transformative assessment. He points
out that the increased attention to formative assessment has increased the
consequences of assessment practices for classroom culture.
[Photo: Harry Torrance]
He argues
that much of the seemingly helpful definition of content standards (in the name
of “transparency”) and implementation of formative assessment
towards those standards is actually very “conformative.” Specifically,
he argues that efforts to define specific competencies and standards, and then
align formative and summative assessment accordingly, work to narrow the
curriculum and standardize content. He goes on to argue that such trends
discourage the development of “self-regulated” and “lifelong” learning,
the ostensible goal of most efforts to derive content standards and reform
assessment:
The issue then, is that we have moved towards transparency of objectives and assessment criteria, coupled with clear feedback being provided in relation to these criteria, but such a combination of transparency and feedback may not really be considered sufficient to the purpose of higher education. What we have here is not so much formative assessment, but conformative assessment (p. 332).
Torrance goes on to make
the crucial point that all assessment is “formative,” in that the process
of getting performance feedback will always impact learners, but that this
impact is not necessarily positive and may well be negative:
It is probably the case that most learners fail most of the time, and many students will come to think of themselves as failures if they do not attain the grades they expect or need, whether they are ‘failures” in any real sense or not. This may not be the formative impact we might aspire to, but it happens none the less. Perhaps we should call this de-formative assessment. A key issue must be how we minimize such impact if we can. A key starting point must be to acknowledge the nature of the problem (p. 332).
He also makes the important point
that most assessment theorists (and particularly proponents of formative
assessment) embrace very cognitivist views of learning and motivation that
focus entirely on individual change. In particular, he critiques the widely known Black
and Wiliam model of formative assessment because it frames motivational
consequences in terms of goal orientation and attribution theory. Like me,
Torrance argues that the motivational consequences of assessment should primarily
be considered in terms of ethnographic accounts of the social construction
of failure in the context of assessment. He argues that
attention must be given to the prior social knowledge and expectations that learners bring with them and which will mediate their experiences of assessment, as well as the social construction of particular assessment incidents themselves (p. 334).
The point that many
educators and assessment innovators often overlook is that all
assessment is formative, but not necessarily in a positive way. This is what brings Torrance to the key issue of “recognizing the nature of
deformative assessment and how we might better understand and attempt to
address the affective impact of assessment” (p.
335).
The final section of
Torrance’s article introduces the notion of transformative assessment. I found
this to be a current and concise
characterization of systemic validity as originally envisioned by Frederiksen and
Collins 23 years earlier. Torrance takes issue with several key theorists
in the formative assessment literature. Particularly relevant to the digital
badges community is the need to recognize that the most important consequence of
their assessment practices is that
those practices define a trajectory that leads learners
into a disciplinary or professional community of practice. Indeed, the most
successful digital badge systems we studied in the Design Principles
Documentation Project did just that. Torrance goes on to outline very specific
questions that assessment users and designers can ask to help develop more
positively transformative assessment practices. In the interest of space I will
not replicate them here, but I will let Torrance have the last word with his
own conclusion:
Ultimately, however, the solution to the problem is probably less located in answering such questions directly, less located in programme design and the search for a perfectly understood and functioning system, and more located in the vitality and authenticity of the relationships between teachers and students. We need to understand our task as one of collaborating with students to bring about learning, to be alert to the generation of unpredictable outcomes and indeed to regard the production of unpredictable and unintended outcomes as an indication of success, not lack of compliance with the programme. We need to make the rules of the game as apparent as possible, but we also need to try to communicate that we would be happy to see the rules of the game change, if someone comes up with better ones.