Monday, July 4, 2016

Traditional Approaches to Validity in Classroom Assessment and Innovative Credentialing (Part 1)

By Daniel Hickey
In my work with the Participatory Assessment Lab at Indiana University and in my graduate education courses, I spend a lot of time helping people understand validity in the context of educational assessment.  In this post, I describe validity as it has traditionally been presented to educators. I summarize what one leading textbook has long said educators should know about validity when assessing learning in their own classes, and I extend that to credentialing innovators who are developing digital badge systems, micro-credentials, and competency-based educational programs.  In subsequent posts, I will explore traditional views of “face validity” and “consequential validity.” Together, these posts will lay the groundwork for a final post that will explore several new developments in validity theory that I believe are important for these two communities.

Generally speaking, validity concerns the accuracy of evidence for supporting particular claims. Unlike reliability, validity is not a property of an assessment. While many people refer to “valid assessments,” there is really no such thing. As I will elaborate below, most educators only need to know a few key ideas when using classroom assessments, because they typically make relatively modest claims from student assessment scores. However, my colleagues who are developing innovative credentialing systems (i.e., digital badges, portfolio assessments, competency-based education, etc.) often intend to make more comprehensive claims about what an individual “knows” or “can do.”  I have been working diligently over the last year to show the “credentialing innovators” that a systemic approach to validity is necessary if their efforts are going to thrive.  This has included work with:

Traditional Views of Validity in Classroom Assessment

For the past 15 years, I have been teaching a graduate course called Assessment in Schools.  It is intended for classroom teachers and school administrators. We use Jim Popham’s aptly titled book Classroom Assessment: What Teachers Need to Know, the most widely used textbook on educational assessment in the English language. Because Popham rewrites the book every three years or so, the revisions to the validity chapter nicely capture the most important aspects of validity for educators. The latest (eighth) edition has a completely reworked chapter on validity. Before I write about these new developments, I want to start with the view of validity that many measurement professionals advanced through the 1990s and that many educators are still learning from their teacher education textbooks.

The textbook and my course both start with curricular aims. Sometimes called educational standards, course outcomes, or competencies, these are observable outcomes that a course aims to support and/or that an assessment practice will produce evidence of. In the case of digital badges and competency-based education, these are represented by lists of competencies to be demonstrated, possibly independent of where and how those competencies were gained. Validity then refers to the extent to which the evidence generated by assessments supports those claims. Popham’s seventh edition, published in 2014, describes three types of validity evidence:

Content-related validity evidence. This evidence concerns “the extent to which an assessment procedure adequately represents the content of the curricular aim being measured.” In my assessment course, the students who work as classroom teachers overwhelmingly concluded that this was the most relevant type of validity evidence for them. This is because they are making and/or using classroom assessments to support claims about mastery of the aims of their curriculum. In my work with credentialing innovators, this also turns out to be a very important kind of evidence. As such, I will elaborate on it below.

Criterion-related validity evidence. This evidence concerns “the degree to which performance on an assessment procedure accurately predicts a student’s performance on an external criterion.” In my course, many current or future administrators concluded that this was the most relevant type of validity evidence for them. This is because administrators are often responsible for deciding what scores on placement tests are needed to test into or out of classes, advance grades, etc. In these cases, the “external criterion” concerns whether these decisions are associated with student success. In my work with credentialing innovators, I am often surprised how little attention they give to this evidence, given the nature of the claims they are making. For example, competency-based education programs often assume that someone who has demonstrated many small competencies on performance assessments will be able to use those competencies together in a real-world workplace setting.
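To make this concrete, criterion-related evidence is often summarized as a validity coefficient: a correlation between scores on the assessment and a later external criterion. The following is a minimal sketch with invented numbers (the scores and grades below are hypothetical, not from any real study):

```python
# Hypothetical sketch: criterion-related validity evidence as a simple
# correlation between placement-test scores and a later external criterion
# (here, course grades). All numbers are invented for illustration.
from statistics import mean, pstdev

placement_scores = [62, 71, 55, 80, 68, 90, 47, 74]           # assessment scores
course_grades = [2.1, 2.8, 1.9, 3.4, 2.5, 3.8, 1.5, 3.0]      # external criterion

def pearson_r(xs, ys):
    """Pearson correlation: how well one measure predicts another."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

r = pearson_r(placement_scores, course_grades)
print(f"criterion-related validity coefficient: r = {r:.2f}")
```

A coefficient near 1.0 would suggest the placement test predicts the criterion well; a coefficient near zero would undermine the claim, no matter how carefully the test content was aligned.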

Construct-related validity evidence. This evidence concerns “the extent to which empirical evidence confirms that an inferred construct exists and that a given assessment procedure is measuring the inferred construct accurately.” Typically, just a handful of students in my assessment class find this type of evidence most relevant. These are usually PhD students who are interested in psychological constructs like self-efficacy or “grit.”  Likewise, most of my credentialing colleagues don’t end up assessing or measuring constructs. It’s a good thing too, as it turns out to be quite challenging to gather convincing construct-related validity evidence.

In summary, educators and credential innovators often make claims that call for content-related validity evidence if those claims are to be validated.  Some administrators and program designers make claims from assessments that call for criterion-related evidence. Generally speaking, construct-related validity evidence is associated with psychological tests and professionally developed achievement tests.

Types of Content-Related Validity Evidence
Popham’s 2014 validity chapter summarizes the ways that educators and innovators can gather evidence that the content detailed in their curricular aims or competencies is adequately represented in their assessments. For classroom assessments, the most common source of evidence is simply developmental care, whereby peers or experts examine the assessments in light of the curricular aims or competencies they target.  For higher-stakes assessments, this might be extended to external reviews by panels of outside experts.

Whether gathered informally or formally, content-related validity evidence is concerned with alignment of the assessment’s content to the claims being made. While formal alignment is most relevant for high-stakes achievement tests, the basic idea is the same for classroom assessments, as well as the performance and portfolio assessments often used by credentialing innovators. The four “alignment questions” advanced by Noreen Webb are helpful starting points:

Categorical concurrence.  This is a rather general indication of whether consistent categories of content are represented in both the curricular aims and the assessment.  With multiple-choice tests, this would be indicated by the presence of at least a few items for each curricular aim or standard. With performance and portfolio assessments, the process is the same, but the indicators are likely to be a smaller number of elements on the scoring rubric for each standard. The obvious challenge here is that increasing the number of curricular aims or competencies requires a more extensive assessment.

Depth-of-knowledge (DOK) consistency. This asks about the extent to which the cognitive demands of the assessments are consistent with the curricular aims or competencies. Building on Bloom’s familiar taxonomy, Webb advances four “levels” of depth: recall, skill/concept, strategic thinking, and extended thinking. This distinction is important because some educational programs make ambitious claims about “extended thinking” that go well beyond what their assessment practices capture.  It turns out that it is quite difficult to create multiple-choice items or computer-based assessments that provide valid evidence of ability to solve complex problems.  And even when programs develop assessments that capture deeper knowledge, educators often compromise those assessments by teaching directly to them. (More on this in a subsequent post.)

Range-of-knowledge correspondence.  This is a more specific version of the first question: it compares the span of knowledge represented in the curricular aims with the span of knowledge represented by the assessments, while taking into account that individual items or assessment elements may cover the content of multiple curricular aims or competencies. This is particularly important for performance and portfolio assessments, because such assessments often include multiple extended problems or activities that are intended to cover multiple aims or competencies. This can get very messy very quickly.

Balance of representation. This concerns the degree to which curricular aims are given equal emphasis on assessments. In other words, is the distribution of content in the assessment equal to the distribution of content in the curricular aims or competencies? This is a relatively straightforward problem with multiple-choice tests, which can include quite a few items. But it becomes a very big problem with performance and portfolio assessments that have fewer items or elements.
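Two of these checks, categorical concurrence and balance of representation, can be sketched as simple tallies once each item (or rubric element) has been mapped to the aims it targets. The item-to-aim mapping, the aim names, and the threshold below are all invented for illustration, and the balance index here is a deliberate simplification of Webb's formal procedure:

```python
# Hypothetical sketch of two alignment checks, using an invented mapping
# from assessment items (or rubric elements) to curricular aims.
from collections import Counter

item_to_aims = {
    "item1": ["aim_A"], "item2": ["aim_A"], "item3": ["aim_B"],
    "item4": ["aim_B"], "item5": ["aim_C"], "item6": ["aim_A"],
}
curricular_aims = ["aim_A", "aim_B", "aim_C", "aim_D"]

# Tally how many items touch each aim.
coverage = Counter(aim for aims in item_to_aims.values() for aim in aims)

# Categorical concurrence: does every aim have at least a few items?
MIN_ITEMS = 2  # arbitrary threshold for this tiny example
concurrence = {aim: coverage[aim] >= MIN_ITEMS for aim in curricular_aims}

# Balance of representation (simplified): how evenly items are spread
# across the aims that are covered at all. 1.0 means perfectly even.
counts = [coverage[aim] for aim in curricular_aims if coverage[aim] > 0]
balance = min(counts) / max(counts)

print(concurrence)   # aim_C and aim_D fall short of the threshold
print(f"balance index: {balance:.2f}")
```

Even this toy version shows why fewer, larger assessment elements make alignment harder: with only six items spread across four aims, one uncovered aim or one over-weighted aim immediately degrades both checks.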

One of the points that I hope readers will take away from this post is the challenge that alternative formats such as portfolio and performance assessment present for content-related validity evidence. While many educators and students dislike multiple-choice assessment formats, they allow for the inclusion of many more items in a given assessment. This means that alignment is mostly an “empirical” process where the content and level of each item is matched up with the set of aims or competencies. While there may be some disagreements between reviewers, they will often concern specific items. However, the alignment process for performance and portfolio assessment is largely an interpretive exercise. With a smaller number of more extended activities or elements, there are simply many more degrees of freedom. As I will elaborate in a later post, a combination of formats is sometimes the best solution.

Critiques of Traditional Views of Validity
By the early 1990s, many measurement and assessment professionals began to question this traditional view of validity. Samuel Messick, a leading theorist at Educational Testing Service, led the charge. His 1995 paper in Educational Psychologist on “the Validity of Psychological Assessment” was one of several very influential papers he published during his lifetime. The abstract opens as follows:
The traditional conception of validity divides it into three separate and substitutable types: content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use.
Samuel Messick (1931-1998)

Messick was writing in part in response to the explosion of innovation in large-scale assessment in the US in the early 1990s. I found myself in the middle of these changes when I started my postdoc at the ETS Center for Performance Assessment, which was established to help inform these efforts, and particularly to help them deliver the promised social consequences for improving teaching and learning. Rather than the traditional tripartite view, Messick argued for a
new unified concept of validity interrelates these issues as fundamental aspects of a more comprehensive theory of construct validity that addresses both score meaning and social values in test interpretation and test use. 
Thus Messick attempted to unify the study of validity within the idea of construct validity, broken down into six “distinguishable aspects of construct validity”: content, substantive, structural, generalizability, external, and consequential. While I followed Popham's lead in not introducing these distinctions to teachers and administrators in my classes, this framework was central in my assessment work going forward. In particular, I used them in a validity study of the performance assessments for introductory inheritance that were central to my research from 1996 to 2005. As I will elaborate in my next post, I particularly appreciated the way Messick helped draw attention to the consequences of the ways that evidence from assessments and tests are used.
