By Daniel Hickey
In my work with the Participatory Assessment Lab at Indiana
University and in my graduate education courses, I spend a lot of time helping
people understand validity in the context of educational assessment. In this post, I describe validity as it has traditionally been presented to educators. I summarize what one leading textbook has long
said educators should know about validity when assessing learning in their
own classes, and I extend that to credentialing innovators who are developing
digital badge systems, micro-credentials, and competency-based educational
programs. In subsequent posts, I will explore traditional views of “face validity” and “consequential validity.” Together, these posts will lay the groundwork for a final post that will explore several new developments in
validity theory that I believe are important for these two communities.
Generally speaking, validity concerns the accuracy of
evidence for supporting particular claims. Unlike reliability, validity is not a property of an assessment. While many people refer to “valid assessments,” there is really no such thing. As I
will elaborate below, most educators only need to know a few key ideas
when using classroom assessments, because they typically make
relatively modest claims from student assessment scores. However, my colleagues
who are developing innovative credentialing systems (e.g., digital badges, portfolio assessments, competency-based education) often intend to make more comprehensive claims about what an individual “knows” or “can do.” I have been working diligently over the last year to
show the “credentialing innovators” that a systemic approach to validity is
necessary if their efforts are going to thrive.
This has included work with:
- The American Council on Education and their efforts to define Quality Dimensions for Connected Credentials,
- IMS Global in their efforts to develop metadata standards for digital badges,
- The Mozilla Foundation’s efforts to foster educator competencies with digital-age skills,
- The dozens of badge development efforts that we supported in the Open Badges in Higher Education project at Indiana University, and the efforts of the Badge Alliance and the larger open badges community, and
- ePortfolio designers and users, including members of the Association for the Advancement of Experiential and Evidence-Based Learning (AAEEBL).
Traditional Views of Validity in Classroom Assessment
For the past 15 years, I have been teaching a graduate
course called Assessment in Schools. It is intended for classroom teachers and
school administrators. We use Jim Popham’s aptly titled book Classroom Assessment: What Teachers Need to
Know, the most widely used
textbook on educational assessment in the English language. Because Popham
rewrites the book every three years or so, the revisions to the validity
chapter nicely capture the most important aspects of validity for
educators. The latest (eighth) edition has a completely reworked chapter on validity. Before I write about these new developments, I want to start with the view of validity that many measurement professionals advanced through the 1990s and that many educators are still learning in their teacher education textbooks.
The textbook and my course both start with curricular aims. Sometimes called educational standards, course outcomes, or competencies, these are observable outcomes that a course
aims to support and/or that an assessment practice will produce evidence of. In
the case of digital badges and competency-based education, these are
represented by lists of competencies to
be demonstrated, possibly independent of where and how those competencies were
gained. Validity then refers to the extent to which the evidence generated by assessments supports claims about those aims or competencies. Popham’s seventh edition, published in 2014, describes three types of validity evidence:
Content-related validity evidence. This evidence concerns “the
extent to which an assessment procedure adequately represents the content of
the curricular aim being measured.” In my assessment course, the students who
work as classroom teachers overwhelmingly concluded that this was the most
relevant type of validity evidence for them. This is because they make and/or use classroom assessments to support claims about mastery of the aims of their curriculum. In my work with credentialing innovators, this
also turns out to be a very important kind of evidence. As such, I will
elaborate on it below.
Criterion-related validity evidence. This evidence concerns “the
degree to which performance on an assessment procedure accurately predicts a
student’s performance on an external criterion.” In my course, many current or future administrators concluded that this was the most relevant type of validity
evidence for them. This is because administrators are often responsible for
deciding what scores on placement tests are needed to test into or out of
classes, advance grades, etc. In these cases, the “external criteria” concern whether
these decisions are associated with student success. In my work with
credentialing innovators, I am often surprised by how little attention they give
to this evidence, given the nature of the claims they are making. For example,
competency-based education programs often assume that someone who has
demonstrated many small competencies on performance assessments will be able to
use those competencies together in a real-world workplace setting.
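As a concrete illustration of what criterion-related evidence might look like for such a program, one could compare scores on a competency assessment with a later external criterion, such as subsequent course grades or supervisor ratings of workplace performance, and ask how well the former predicts the latter. The following is a minimal sketch under assumptions of my own: the score lists are made up, and a simple Pearson correlation stands in for the predictive analysis a real validation study would require.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical data: competency-assessment scores and a later external
# criterion (e.g., supervisor ratings on a 0-100 scale) for ten learners.
assessment_scores = [52, 61, 65, 70, 72, 75, 80, 84, 88, 93]
criterion_ratings = [58, 66, 63, 74, 71, 79, 82, 80, 90, 95]

# Pearson r as a rough index of how well the assessment predicts
# performance on the external criterion.
r = correlation(assessment_scores, criterion_ratings)
print(f"Criterion-related validity coefficient (Pearson r): {r:.2f}")
```

A single coefficient computed from made-up numbers proves nothing, of course; the point is simply that criterion-related claims call for this kind of predictive evidence, gathered against a criterion that actually matters to the program.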
Construct-related validity evidence. This evidence concerns “the
extent to which empirical evidence confirms that an inferred construct exists
and that a given assessment procedure is measuring the inferred construct
accurately.” Typically, just a handful of students in my assessment class find
this type of evidence most relevant. These are usually PhD students who are
interested in psychological constructs like self-efficacy or “grit.” Likewise, most of my credentialing colleagues
don’t end up assessing or measuring constructs. It’s a good thing too, as it turns out to
be quite challenging to gather convincing construct-related validity evidence.
In summary, educators and credentialing innovators often make claims that call for content-related validity evidence if those claims are to be validated. Some administrators and program designers make claims from assessments that call for criterion-related evidence. Generally speaking, construct-related validity evidence is associated with psychological tests and professionally developed achievement tests.
Types of Content-Related Validity Evidence
Popham’s 2014 validity chapter summarizes the ways that educators and innovators can gather evidence that the content detailed in their curricular aims or competencies is adequately represented in their assessments. For classroom assessments, the most common source of evidence is
simply developmental care, whereby
peers or experts examine the assessments in light of the curricular aims or
competencies they target. For higher-stakes assessments, this might be extended to external reviews by panels of outside experts.
Whether gathered informally or formally, content-related
validity evidence is concerned with alignment
of the assessment’s content to the claims being made. While formal alignment is most relevant for
high-stakes achievement tests, the basic idea is the same for classroom
assessments, as well as the performance and portfolio assessments often used by
credentialing innovators. The four “alignment questions” advanced by Norman Webb are helpful starting points:
Categorical concurrence. This
is a rather general indication of whether consistent categories of content are
represented in both the curricular aims and the assessment. With multiple-choice tests, this would be indicated by the presence of at least a few items for each curricular aim or standard. With performance and portfolio assessment, the process is the same, but the evidence is likely to come from a smaller number of elements on the scoring rubric for each standard. The obvious challenge here is that increasing the number of curricular
aims or competencies requires a more extensive assessment.
Depth-of-knowledge (DOK) consistency. This asks about the extent to
which the cognitive demands of the assessments are consistent with the curricular
aims or competencies. Building on Bloom’s familiar taxonomy, Webb advances four “levels” of depth: recall, skill/concept, strategic thinking, and extended thinking. This distinction is important because some educational programs
make ambitious claims about “extended thinking” that go well beyond what their
assessment practices capture. It turns
out that it is quite difficult to create multiple-choice items or computer-based
assessments that provide valid evidence of the ability to solve complex problems. And even when programs develop assessments
that capture deeper knowledge, educators often compromise those assessments by
teaching directly to them. (More on this in a subsequent post).
Range-of-knowledge correspondence. This is a more specific version of the first question. It concerns the span of knowledge represented in the curricular aims compared to the span of knowledge represented by the assessments, but it takes into account the fact that individual items or assessment elements may cover the content of multiple curricular aims or competencies. This is particularly important for performance and portfolio assessments, because such assessments often include multiple extended problems or activities that are intended to cover multiple aims or competencies. This can get very messy very quickly.
Balance of representation. This concerns the degree to which curricular aims are given equal emphasis on assessments. In other words, is the distribution of content in the assessment equal to the distribution of content in the curricular aims or competencies? This is a relatively straightforward problem with multiple-choice tests, which can include quite a few items. But
this becomes a very big problem with performance and portfolio assessments that
have fewer items or elements.
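To make the first and fourth questions concrete, here is a minimal sketch of how a test or badge designer might tally an item-to-aim mapping. Everything in it is hypothetical (the items, the aims, and the threshold for “at least a few items”), and it only illustrates the spirit of categorical concurrence and balance of representation, not Webb’s formal alignment procedure.

```python
from collections import Counter

# Hypothetical mapping of assessment items (or rubric elements) to the
# curricular aim each one is judged to address.
item_to_aim = {
    "item_01": "Aim A", "item_02": "Aim A", "item_03": "Aim A",
    "item_04": "Aim B", "item_05": "Aim B",
    "item_06": "Aim C",
}
all_aims = ["Aim A", "Aim B", "Aim C", "Aim D"]
MIN_ITEMS_PER_AIM = 2  # arbitrary stand-in for "at least a few items"

counts = Counter(item_to_aim.values())
total_items = len(item_to_aim)

for aim in all_aims:
    n = counts.get(aim, 0)
    share = n / total_items  # balance: how much of the assessment covers this aim
    flag = "OK" if n >= MIN_ITEMS_PER_AIM else "UNDER-REPRESENTED"
    print(f"{aim}: {n} items ({share:.0%} of the assessment) -> {flag}")
```

With a multiple-choice test, this kind of tally is easy to produce and defend. With a portfolio, a single extended artifact may speak to several aims at once, which is exactly why the alignment work becomes more interpretive, as discussed next.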
One of the points that I hope readers will take away
from this post is the challenge that alternative formats such as portfolio and
performance assessment present for content-related validity evidence. While both educators and students dislike multiple-choice assessment formats, those formats allow for the inclusion of many more items in a given assessment. This means
that alignment is mostly an “empirical” process where the content and level of
each item is matched up with the set of aims or competencies. While there may
be some disagreements between reviewers, they will often concern specific
items. However, the alignment process for performance and portfolio assessment
is largely an interpretive exercise. With a smaller number of more extended activities or elements, there are simply many more degrees of freedom. As I will
elaborate in a later post, a combination of formats is sometimes the best
solution.
Critiques of Traditional Views of Validity
By the early 1990s, many measurement and assessment professionals began to question this traditional view of validity. Samuel Messick, a leading theorist at Educational Testing Service, led the charge. His 1995 paper in American Psychologist on “the Validity of Psychological Assessment” was one of several very influential papers he published during his lifetime. The abstract opens as follows:
The traditional conception of validity divides it into three separate and substitutable types: content, criterion, and construct validities. This view is fragmented and incomplete, especially because it fails to take into account both evidence of the value implications of score meaning as a basis for action and the social consequences of score use.
Samuel Messick (1931-1998)
Messick was writing in part in response to the explosion of innovation in large-scale assessment in the US in the early 1990s. I found myself in the middle of these changes when I started my postdoc at the ETS Center for Performance Assessment, which was established to help inform these efforts, and particularly to help them deliver the promised social consequences for improving teaching and learning. Rather than the traditional tripartite view, Messick argued for a “new unified concept of validity” that “interrelates these issues as fundamental aspects of a more comprehensive theory of construct validity that addresses both score meaning and social values in test interpretation and test use.” Thus Messick attempted to unify the study of validity within the idea of construct validity, broken down into six “distinguishable aspects of construct validity”: content, substantive, structural, generalizability, external, and consequential. While I followed Popham’s lead in not introducing these distinctions to teachers and administrators in my classes, this framework was central in my assessment work going forward. In particular, I used these aspects in a validity study of the performance assessments for introductory inheritance that were central to my research from 1996 to 2005. As I will elaborate in my next post, I particularly appreciated the way that Messick helped draw attention to the consequences of the ways that evidence from assessments and tests is used.