re-mediating assessment: Traditional Approaches to Validity in Assessment Innovation (Part 2: Consequential Validity)

This is the second post in a series on the topic of validity in educational assessment. In my first post, I described the traditional characterization of content-related, criterion-related, and construct-related evidence as they are relevant to educators and credentialing innovators who use and design assessments. This post summarizes traditional characterizations of “consequential validity.” This aspect of validity concerns the broader consequences of administering assessments to learners and using the resulting scores. It is a complex idea that is really crucial to many assessment and credentialing innovators (because broader change is their goal). Many measurement professionals have long argued that it is an "unsanctioned" aspect of validity. Before I write about how that is changing, I want to describe how consequential validity has traditionally been written about and why I have long disagreed.

As I will describe in a subsequent post, most assessment and measurement professionals now agree that assessment developers should consider both the intended and unintended consequences of the way their assessments are used. It is worth noting that the large-scale K-12 assessment reforms in the US collapsed swiftly around 1995. This was in part because they failed to deliver promised improvements in teaching and learning; arguably, the same thing is happening right now with new assessments developed in the massive Race to the Top assessment initiative. First, however, let’s consider how assessment and measurement specialists have traditionally thought about the consequences of assessment.

As I introduced in the first post, I have been teaching a course called Assessment in Schools for about 15 years, using Jim Popham’s book Classroom Assessment: What Teachers Need to Know. Because it is the most widely used textbook on the topic in the English language, it gives a good idea of what many teachers learn about educational assessment (typically in a graduate-level course). Because Popham updates the text about every three years, the book is a good indicator of how mainstream views of assessment are evolving. Over many years, I have found that this evolution is helpful for assisting practitioners in understanding the complex-but-crucial issues that make for effective assessment practices. This evolution is also particularly helpful with validity, which is the most important and most complex assessment topic of them all.

Why Disagreement is Good for Learning and Innovation

Over the years, I have also found that letting students see where I disagree with a textbook is a great way to help students make sense of these complex issues. Textbooks necessarily avoid some of the biggest debates in a discipline because they overwhelm students who lack the prior knowledge and context they need to make sense of such things. Once they make sense of the straightforward characterizations in the text, however, we can really dig into those issues. I have found that the same holds true of credentialing innovators with whom I work. When I raise these sorts of issues with people who are just starting to develop a badge system, it overwhelms them. But something magical happens when a group of people come together and struggle through (a) the claims they want to make, (b) the evidence needed to support those claims, and (c) the assessments that would generate this evidence. Once they are in the trenches, they have enough experience to appreciate how the issues I am raising in this series are crucial to the success of their innovation.

I have always disagreed with two aspects of Popham’s validity chapter in his most recent editions. One aspect is “consequential validity,” which concerns the extent to which the uses of a test are valid. The other is “face validity,” which concerns the extent to which the evidence from an assessment appears to be consistent with the assessment’s purpose. In the seventh edition (2014), Popham characterized both as “unsanctioned forms of validity evidence.” He states:

If teachers can’t draw valid score-based inferences about students, there is simply no reason to measure students in the first place. However, because validity is such a central notion in educational assessment, some folks have attached specialized meanings to it that, although helpful at some level, also may introduce confusion (p. 118).

I am definitely one of the “some folks” he is referring to here. While I agree that both concepts are confusing, I have argued for years that it is a mistake for anyone who uses or designs assessments to ignore these aspects of validity. This argument reflects my embrace of newer situative/sociocultural perspectives. These perspectives lead to a much broader view of learning, and therefore assessment, including the social context of classrooms and institutional climate of schools. As I will show across this series of posts, Popham and many other assessment and measurement scholars are gradually coming to embrace this perspective as well.

Why does Consequential Validity Matter when Using Assessments?

To elaborate, consequential validity concerns the impact of taking and administering assessments and tests, and then using the resulting evidence. Importantly, this includes both the intended and unintended consequences of doing so. The idea was introduced and vigorously debated in the context of large-scale testing reforms that spread rapidly in the US in the early 1990s. As I introduced above, Popham has long argued that concern with consequential validity is confusing and distracts from the central goal of making valid inferences about what students know. On one hand, he admits in the 2014 edition that “educators should obviously be attentive to the consequences of test use” (such as inappropriately holding students back from advancing to the next grade). But he goes on to say that

the notion of consequential validity is apt to confuse the central thrust of validity--namely to confirm or disconfirm the defensibility of the score-based inferences we make about our students. If we make accurate inferences about students based on a test, yet rely on those inferences to make terrible decisions, our tests will have negative consequences. But it was the use to which we put our valid score based inferences that was deplorable (p. 118).

While I let Popham have his say in my classes, I still let my students know that I have argued against this position in my research for many years. In general, I think that educators need to be much more sensitive to the way their assessment practices transform the culture of their classrooms. In particular, I think that educators should appreciate that students often resist discussing content in personally meaningful ways in class because they know that they will primarily be held accountable for knowing the abstract decontextualized representations of the content, which teachers typically use in their assessments.

My core beliefs about what people who use assessments need to know about assessment consequences emerged across a decade of assessment research around the GenScope software for learning introductory inheritance (summarized here). It took that whole decade to fully wrap my head around the broader notion of systemic validity as raised in 1989 by Frederiksen and Collins. They describe systemic validity as assessment practices that

induce curricular and instructional changes in educational systems (and learning strategy changes in students) that foster the development of cognitive traits that the tests are designed to measure (p. 27).

While they couched their ideas in very cognitive terms to appeal to measurement specialists, I definitely could read between the lines. What really influenced me was reading that article alongside a groundbreaking initial treatise on situated cognition that Collins, Brown, and Duguid published the same year. One of the things that struck me about measurement specialists at ETS was that back in 1995-97, they dismissed any consideration of consequences. I was thrilled when John Frederiksen came to Princeton to do a talk on this topic in 1997 and I was the only person who showed up! Since I had already gone a long way down the situated cognition path, I began to realize that my path as an assessment researcher was going to be challenging and sometimes lonely.

Why does Consequential Validity Matter when Developing Assessments?

In the early 1990s, Samuel Messick (a leading validity theorist at Educational Testing Service) convinced some (but certainly not all) in the educational testing community to consider the consequences of assessment and testing. More specifically, he argued that consequences were one of “six distinguishable aspects of construct validity” in an influential 1995 paper in American Psychologist.

In the ensuing years, consequential validity emerged as one of the most contentious issues in the history of educational assessment and testing. Some measurement scholars agreed with Mark Reckase’s arguments in his 1998 paper that test developers were not responsible for assessment consequences. Other assessment scholars like Pam Moss and Lorrie Shepard argued that Messick actually downplayed assessment consequences by arguing that test developers could consider consequences after assessments had been developed. I was immediately convinced by their argument that doing so often excuses measurement professionals from some horrific consequences.

One of the examples that Lorrie Shepard used in her 1993 article in the Review of Educational Research concerned kindergarten “readiness” tests. These tests became widely used in the early 1970s during the heyday of Piagetian developmental psychology. The tests provided “scientific” evidence that large cohorts of young children (mostly disadvantaged ethnic minorities) needed to wait one or two years until they were “developmentally ready” to benefit from kindergarten. Ten years later this resulted in large cohorts of those same students starting high school at ages 15 and 16. Shockingly large numbers of these students then went on to drop out of high school when they turned 18.

Conformative, Deformative, and Transformative Assessment Consequences

I have now published several papers that argue that both educators and test developers should think primarily about the consequences of their assessment practices and secondarily about the other aspects of validity. But I think the UK assessment theorist Harry Torrance captured it most clearly for higher education in a 2012 paper in the Oxford Review of Education, where he introduced the notions of conformative, deformative, and transformative assessment. He points out that the increased attention on formative assessment has increased the consequences of assessment practices for classroom culture.

He argues that much of the seemingly helpful definition of content standards (in the name of “transparency”) and the implementation of formative assessment towards those standards is actually very “conformative.” Specifically, he argues that efforts to define specific competencies and standards and then align formative and summative assessment accordingly work to narrow curriculum and standardize content. He goes on to argue that such trends discourage the development of “self-regulated” and “lifelong” learning, the ostensible goal of most efforts to derive content standards and reform assessment:

The issue then, is that we have moved towards transparency of objectives and assessment criteria, coupled with clear feedback being provided in relation to these criteria, but such a combination of transparency and feedback may not really be considered sufficient to the purpose of higher education. What we have here is not so much formative assessment, but conformative assessment (p. 332).

Torrance goes on to make the crucial point that all assessment is “formative,” in that the process of getting performance feedback will always impact learners, but that this impact is not necessarily positive and may well be negative:

It is probably the case that most learners fail most of the time, and many students will come to think of themselves as failures if they do not attain the grades they expect or need, whether they are ‘failures’ in any real sense or not. This may not be the formative impact we might aspire to, but it happens none the less. Perhaps we should call this de-formative assessment. A key issue must be how we minimize such impact if we can. A key starting point must be to acknowledge the nature of the problem (p. 332).

He makes the crucial point that most assessment theorists (and in particular proponents of formative assessment) embrace very cognitivist views of learning and motivation that focus entirely on individual change. In particular, he critiques the widely known Black and Wiliam model of formative assessment because it frames motivational consequences in terms of goal orientation and attribution theory. Like me, Torrance argues that the motivational consequences of assessment should primarily be considered in terms of ethnographic considerations of the social construction of failure in the context of assessment. He argues that

attention must be given to the prior social knowledge and expectations that learners bring with them and which will mediate their experiences of assessment, as well as the social construction of particular assessment incidents themselves (p. 334).

The point that many educators and assessment innovators often overlook is that all assessment is formative, but not necessarily in a positive way. This is what brings Torrance to the key issue of “recognizing the nature of deformative assessment and how we might better understand and attempt to address the affective impact of assessment” (p. 335).

The final section of Torrance’s article introduces the notion of transformative assessment. I found this to be a current and concise characterization of systemic validity as originally envisioned by Frederiksen and Collins 23 years earlier. Torrance takes issue with several key theorists in the formative assessment literature. Particularly relevant to the digital badges community is the need to recognize that the most important consequence of their assessment practices is that those practices define a trajectory that leads learners into a disciplinary or professional community of practice. Indeed, the most successful digital badge systems we studied in the Design Principles Documentation Project did just that. Torrance goes on to outline very specific questions that assessment users and designers can ask to help develop more positively transformative assessment practices. In the interest of space I will not replicate them here, but I will let Torrance have the last word with his own conclusion:

Ultimately, however, the solution to the problem is probably less located in answering such questions directly, less located in programme design and the search for a perfectly understood and functioning system, and more located in the vitality and authenticity of the relationships between teachers and students. We need to understand our task as one of collaborating with students to bring about learning, to be alert to the generation of unpredictable outcomes and indeed to regard the production of unpredictable and unintended outcomes as an indication of success, not lack of compliance with the programme. We need to make the rules of the game as apparent as possible, but we also need to try to communicate that we would be happy to see the rules of the game change, if someone comes up with better ones.