Friday, January 22, 2010

Can We Really Measure "21st Century" Skills?

The members of the 21st Century Assessment Project were asked a while ago to respond to four pressing questions regarding assessment of “21st Century Skills.” These questions had come via program officers at leading foundations, including Connie Yowell at MacArthur’s Digital Media and Learning Initiative, which funds our Project. I am going to launch my efforts to blog more during my much-needed sabbatical by answering the first question, with some help from my doctoral student Jenna McWilliams.

Question One: Can critical thinking, problem solving, collaboration, communication and "learning to learn" be reliably and validly measured?

As Dan Koretz nicely illustrated in the introduction to his 2008 book, Measuring Up: What Educational Testing Really Tells Us, the answers to questions about educational testing are never simple. We embrace a strongly situative and participatory view of knowing and learning, which is complicated to explain to those who do not embrace it. But I have training in psychometrics (and completed a postdoc at ETS) and have spent most of my career refining a more pragmatic stance that treats educational accountability as inevitable. When it comes to assessment, I am sort of a born-again situativity theorist. Like folks who have newly found religion and want to tell everybody how Jesus helped them solve all of the problems they used to struggle with, I am on a mission to tell everyone how situative approaches to measurement can solve some nagging problems that they have long struggled with.

In short, no, we don’t believe we can measure these things in ways that are reliable and yield scores that are valid evidence of what individuals are capable of in this regard. These are actually “practices” that can most accurately be interpreted using methods accounting for the social and technological contexts in which they occur. In this sense, we agree with skeptics like Jim Greeno and Melissa Gresalfi who argued that we can never really know what students know. This point riffs on the title of the widely cited National Research Council report of the same name that Jim Pellegrino (my doctoral advisor) led. And as Val Shute just reminded me, Messick has reminded us forever that measurement never really gets directly at what somebody knows, but instead provides evidence about what they seem to know. My larger point here is my concern about what happens with these new proficiencies in schools and in tests when we treat them as individual skills rather than social practices. In particular I worry about what happens to both education and evidence when students, teachers, and schools are judged according to tests of these new skills.

However, there are lots of really smart folks who have a lot of resources at their disposal who think you can measure them. This includes most of my colleagues in the 21st Century Assessment Project. For example, check out Val Shute’s great article in the International Journal of Learning and Media. Shute also has an edited volume on 21st Century Assessment coming out shortly. Likewise Dan Schwartz has a tremendous program of research building on his earlier work with John Bransford on assessments as preparation for future learning. Perhaps the most far-reaching is Bob Mislevy’s work on evidence-centered design. And of course there is the new Intel-Microsoft-Cisco partnership which is out to change the face of national assessments and therefore the basis of international comparisons. I will elaborate on these examples in my next post, as that is actually the second question we were asked to answer. But first let me elaborate on why I believe that the assessment (of what individuals understand) and the measurement (of what groups of individuals have achieved) of 21st Century skills are both improved if we assume that we can never really know what students know.

To reiterate, from the perspective of contemporary situated views of cognition, all knowledge and skills are primarily located in the social context. This is easy to ignore when focusing on traditional skills like reading and math that can be more meaningfully represented as skills that individuals carry from context to context. This assumption is harder to ignore with these newer ones that everyone is so concerned with. This is especially the case with explicitly social practices like collaborating and communicating, since these can't even be practiced in isolated contexts. As we argued in our chapter in Val’s book, we believe it is dangerously misleading to even use the term skills in this regard. We elected to use the term proficiencies because that term is broad enough to capture the different ways that we think about them. As 21st Century Assessment project leader Jim Gee once put it:
Abstract representations of knowledge, if they exist at all, reside at the end of long chains of situated activity.
However, we are also confident that some of the mental “residue” that gets left behind when people engage meaningfully in socially situated practices can certainly be assessed reliably and used to make valid interpretations about what individuals know. While we think these proficiencies are primarily social practices, this does not exclude recognizing the secondary “echoes” of participating in these practices. This can be done with performance assessments and other extended activities that provide some of that context and then ask individuals or groups to reason, collaborate, communicate, and learn. If such assessments are created carefully, and individuals have not been directly trained to solve the problems on the assessments, it is possible to obtain reliable scores that are valid predictions of how well individuals can solve, communicate, collaborate, and learn in new social and technological contexts. But this continues to be difficult and the actual use of such measures raises serious validity issues. Because of these issues (as elaborated below), we think this work might best be characterized as “guessing what students know.”

More to the point of the question, we believe that only a tiny fraction of the residue from these practices can be measured using conventional standardized multiple-choice tests that provide little or no context. For reasons of economy and reliability, such tests are likely to remain the mainstay of educational accountability for years to come. Of course, when coupled with modern psychometrics, such tests can be extremely reliable, with little score variation across testing time or version. But there are serious limitations in what sorts of interpretations can be validly drawn from the resulting scores. In our opinion, scores on any standardized test of these new skills are only valid evidence of proficiency when they are
a) used to make claims about aggregated proficiencies across groups of individuals;
b) used to make claims about changes over longer time scales, such as comparing the consequences of large-scale policy decisions over years; and
c) isolated from the educational environment which they are being used to evaluate.
Hence, we are pretty sure that national and international assessments like NAEP and PISA should start incorporating such proficiencies. But we have serious concerns about using these measures to evaluate individual proficiencies in any high-stakes way. If such tests are going to continue to be used for high-stakes decisions, they may well best be left to more conventional literacies, numeracies, and knowledge of conventional content domains, which are less likely to be compromised.

I will say that I am less skeptical about standardized measures of writing. They are about the only standardized assessments left in wide use that actually require students to produce something. Such tests will continue to be expensive, and standardized scoring (by humans or machines) requires very peculiar writing formats. But I think the scores that result are valid for making inferences about individual proficiency in written communication more broadly, as was implied by the original question. They are actually performance assessments and as such can bring in elements of different contexts. This is particularly true if we can relax some of the needs for reliability (which requires very narrowly defined prompts and typically gets compromised when writers get creative and original). Given that I think my response to the fourth question will elaborate on my belief that written communication is probably the single most important “new proficiency” needed for economic, civic, and intellectual engagement, I think that improved testing of written communication will be the one focus of assessment research that yields the most impact on learning and equity.

To elaborate on the issue of validity, it is worth reiterating that validity is a property of the way the scores are interpreted. Unlike reliability, validity is never a property of the measure. In other words, validity always references the claims that are being supported by the evidence. As Messick argued in the 90s, the validity of any interpretation of scores also depends on the similarity between prior education and training contexts and the assessment/measurement context. This is where things get messy very quickly. As Kate Anderson and I argued in a chapter in an NSSE Yearbook on Evidence and Decision Making edited by Pam Moss, once we attach serious consequences to assessments or tests for teachers or students, the validity of the resulting scores will get compromised very quickly. This is actually less of a risk with traditional proficiencies and traditional multiple choice tests. This is because these tests can draw from massive pools of items that are aligned to targeted standards. In these cases, the test can be isolated from any preparation empirically, by randomly sampling from a huge pool of items. As we move to newer performance measures of more extended problem solving and collaboration, there necessarily are fewer and fewer items and the items become more and more expensive to develop and validate. If teachers are directly teaching students to solve the problems, then it becomes harder and harder to determine how much of an individual score is real proficiency and how much is familiarity with the assessment format (what Messick called construct-irrelevant variance). The problem is that it is impossible to ever know how much of the proficiency is “real.” Even in closely studied contexts, different observers are sure to differ in their judgments of validity, a point made most cogently in Michael Kane’s discussions of validity as interpretive argument.
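
To make the item-pool argument a bit more concrete, here is a minimal sketch in Python. It is only an illustration under simplifying assumptions, not an analysis of any real test: the function name `expected_exposed_share`, the pool sizes, and the number of coached items are all hypothetical, and it assumes each test form is drawn uniformly at random from the pool.

```python
from fractions import Fraction


def expected_exposed_share(pool_size: int, coached_items: int) -> Fraction:
    """Expected share of a randomly drawn test form that a student
    has already been coached on.

    Assumption (hypothetical, for illustration): the form is sampled
    uniformly at random, without replacement, from a pool of
    `pool_size` items, and the student was drilled on `coached_items`
    of them. The expected share of familiar items on the form is then
    simply coached / pool, whatever the form length.
    """
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    if coached_items < 0:
        raise ValueError("coached_items must be non-negative")
    # A student cannot have seen more items than the pool contains.
    coached = min(coached_items, pool_size)
    return Fraction(coached, pool_size)


# Hypothetical numbers: a large multiple-choice pool versus a small
# pool of expensive performance tasks, each with 15 coached items.
for label, pool in [("multiple-choice pool", 5000), ("performance-task pool", 20)]:
    share = expected_exposed_share(pool, coached_items=15)
    print(f"{label}: {float(share):.1%} of an average form is already familiar")
```

With these made-up numbers, the same 15 coached items account for 0.3% of an average form drawn from the large pool but 75% of a form drawn from the small one, which is the sense in which small pools of performance tasks are much harder to isolate from test preparation.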

Because of these validity concerns, we are terrified that the publishers of these tests of “21st Century Skills” are starting to market curricula and test preparation materials for those same proficiencies. Because of the nature of these new proficiencies, these new “integrated” systems raise even more validity issues than the ones that emerged under NCLB for traditional skills. Another big validity issue we raised in our chapter concerns the emergence of socially networked cheating. Once these new tests are used for high-stakes decisions (especially for college entrance), social networks will emerge to tell students how to solve the kinds of problems that are included on the tests. (This has already begun to happen, as in the "This is SPARTA!" prank on the English Advanced Placement test that we wrote about in our chapter and in a more recent "topic breach" wherein students in Winnipeg leaked the essay topic for the school's 12th grade English exam.)

Of course, proponents of these new tests will argue that learning how to solve the kinds of problems that appear on their tests is precisely what they want students to be doing. And as long as you adopt a relatively narrow view of cognition and learning, there is some truth to that assumption. Our real concern is that this unbalanced focus, on top of new standards and new tests, will distract from the more important challenge of fostering equitable, ethical, and consequential participation in these new practices in schools.

That is it for now. We will be posting my responses to the three remaining questions over the next week or so. We would love to hear back from folks about their responses to the first question.

Questions remaining:
2) Which are the most promising research initiatives?
3) Is it or will it be possible to measure these things in ways that they can be scored by computer? If so, how long would it take and what sorts of resources would be needed?
4) If we had to narrow our focus to the proficiencies most associated with economic opportunity and civic participation, which ones do we recommend? Is there any evidence/research specifically linking these proficiencies to these two outcomes? If we further narrowed our focus to only students from underserved communities, would this be the same list?


  1. My understanding of this question (maybe wrong): Critical thinking, problem solving, collaboration, communication and "learning to learn” cannot be fully measured through multiple choice items and traditional performance assessments. These assessments can measure some of the mental “residue” that people are left with after participating in the practices related to those 21st century skills, and these assessments can be reliable and valid. Knowing those skills can only be interpreted based on the social practices in which people are applying those skills. One question is “is that interpretation reliable and valid?” or “Do we need to be concerned with validity and reliability when measuring knowledge of those skills from the actual social practices?”
    My first thought regarding this question is “for what purpose do we want an exact measure of those skills?” If we can measure the aggregated scores or individual understanding of those skills through external valid and reliable assessments, and the concerns about reliability and validity shift the social practices towards test preparation, why do we need to think about valid and reliable measures of those skills in the actual practices again?
    Maybe researchers need to think about the validity and reliability of these, but I think that concern about reliability and validity is not necessary for teachers and students who act in the real social practices. Their main concern should be full participation in the social practices through critical thinking, problem solving, collaboration, communication and "learning to learn”. When we post something, e.g., in a blog, why do we need to think about measurement of 21st century skills for our blog? We only need to think about running the “blog”. In other words, as inside participants in the actual social practices, why do they need to consider what measuring the skills gained from their actual participation looks like? They only need to focus on their participation in the social practice through critical thinking, problem solving, collaboration, communication.

  2. Val Shute touched on this some in an email, and I made some edits accordingly. Val pointed out that we never "directly" or "exactly" measure human cognition. This is one of the reasons why Val and Jim Pellegrino prefer the word "assess" over "measure" when it comes to education. We can measure behavior, so in that sense we can measure achievement in terms of the number of items a kid gets right.

    Val and Jim and others point out that focusing on prevailing ways of measuring aggregated achievement misses the most important aspects of problem solving and communication. I am extending that argument to say that prevailing and proposed ways of assessing reasoning, problem solving, and communication can't capture the crucial social aspects of those practices. I have no problem with measuring achievement and assessing understanding. The problem is the way those tests and assessments get used and misused. It undermines their validity and corrodes the social practices that are ultimately needed for kids to learn them.

    Take your great example of blogging. When I decided to include some social networking requirements in a doctoral seminar last spring, I used the educational networking exercises that Sam Rose posted at Howard Rheingold's Social Media Classroom. They worked great and by the end of the semester most of my students were blogging about interesting stuff (both scholarly and popular stuff) and one of them (Jenna) had even started getting paid for posts at the Guardian Online in the UK. I never gave them a test of their "blogging skills". I worry that classroom teachers might turn to something like Scholastic's new Expert 21 system. I have not looked at it in the last few months but they seemed to be promising an integrated system of exercises and tests that would promise your students would be 21st century bloggers--without ever actually putting them on the Internet. That time would be better spent helping kids learn to write, as that is a skill that we can readily teach and test.

    As I am learning myself, the actual skills involved in blogging beyond writing are quite minimal. But stuff like gaining an audience, responding to flames, cross posting, etc. are deeply nuanced social practices that are hard to directly teach and impossible to put on any test that a teacher might be able to use. So don't bother. Teachers should help kids learn to blog by giving them access to the Internet, letting them blog about stuff they are interested in, reading their posts, giving them feedback about their writing, and making sure that when they do, their writing skills are improving.

  3. I have two points that I would like to make with respect to this conversation about blogging. Here's point one; point two will come in a second comment.

    It's true that around the same time as Dan introduced blogging to his class (which I was sitting in on virtually, since I lived in another state), I also started up my own blog and eventually got hired by the Guardian to write a series of posts. I was also doing a lot of reading in anticipation of becoming a graduate student. I was also engaging in theory-based arguments in a reading group I had started with some friends, and I wished I had more time and space to develop those arguments further. I was also quite exhausted--just absolutely out of gas--on the day I started my blog, and therefore perhaps a bit more courageous than I may have been had the notion to start a blog come to me at a calmer time.

    Aside from the challenges of assessing "blogging skills," there's the question of what relationship all of these things had to my development as a blogger. It's a question of causality. Dan brought blogging into his class and encouraged all students to start blogging, but I was neither an enrolled student nor required to start a public blog. In fact, most of my earliest posts had nothing to do with the class. Many of the students in that class started up blogs of their own, and some continue to blog but many of those blogs have been discontinued. I soldier on, time permitting.

    This is, in my view, another challenge of teaching and assessing "21st century skills": Blogging was perfect for me. Gaming would be perfect for another learner. Some tool that hasn't yet been invented will be perfect for someone else. But within the confines of a class, there's no way to know, support, or measure deep engagement with a communication tool as complex as the ones we're talking about now. You can learn how to start a blog, submit a post, and possibly build stronger arguments by using the blogger's invincibility spell, the hyperlink; but the ability to engage in the practices Dan mentions above--the practices that separate good writers from good bloggers--takes a commitment of time and energy, generally outside of the classroom setting, and over a space that often outlasts the confines of the class.

  4. My second point:

    Dan, I disagree with your point that "the actual skills involved in blogging beyond writing are quite minimal." In fact, participation in a community of bloggers requires careful attunement to the needs, interests, and values of like-minded and not-so-like-minded bloggers; it requires the work of reading both deeply and broadly; it requires building and maintaining a complex social network. It takes time and energy, and a mindset that I carry with me across affinity spaces.

    The skills required are social proficiencies, and they're deep and often implicit. That's why they're so hard for many people to see, and that's why the challenge of bringing participatory practices into the classroom is so exquisitely problematic. We want our kids to blog, for example, because we see the social value (and we especially see the connection to the traditional literacy practices of reading and writing). But if we don't ourselves blog, it's hard to measure--to assess--students' proficiency with respect to the affinity space occupied by bloggers.

  5. I should have said "testable" skills, or perhaps "measurable" skills. What I am pushing for is that the term "skills" be reserved for behavior or other things that can be precisely measured. I think we should not use that term for "social proficiencies" like you describe that are "deep and often implicit." This is the reason I pushed for the term "proficiencies" in our chapter, because it works nicely in this context and seems to demand qualifiers like "individual" and "social." I have taken to using the phrase "21st Century bla bla bla" that I got from Lorrie Shepard. But in my writing I am finding myself using the phrase "21st Century reasoning, communicating, and learning" and then referencing back to that by saying "these 21st Century proficiencies."

  6. the unlikely scholar, January 25, 2010 at 11:26 PM

    Thanks for this thought-provoking post. I don't know if I'm taking the discussion too far afield and outside the scope of what you'd like to discuss, but I am wondering what the assessment of 21st century skills -- or proficiencies -- means for educational equity. These proficiencies, even more so than discrete skills, are rooted in ways of using language and text. I'm wondering specifically about immigrant English Language Learner students who, typically, are denied access to the kind of education where they can develop the linguistic registers/ discourse mores to fully participate. Do you have thoughts about how assessing 21st century proficiencies can (or should) move us closer to more democratic participation?

  7. You are welcome, and thanks for commenting. Yes, we are frankly terrified of the coming "participation gap". Our thoughts here are shaped by the White Paper at Project New Media Literacies. The issue they don't raise there is that underserved kids will continue to be drilled and drilled on trivial skills and won't have guidance outside of school. Things like the Digital Youth Project, Digital Youth Network, and Global Kids will help. Also check out Learning Race and Ethnicity in the outstanding MacArthur/MIT Press series on Digital Media and Learning. (You can download the chapters.) But we just have to get computers and budgets at underserved schools out from under the drudgery of NCLB testing and test prep. For this to happen, innovators are going to have to work harder to make sure their innovations deliver the goals they are directly supporting, and make sure that meeting those goals also impacts achievement. It is possible.

  8. Thanks for the great post, Dan--very thought-provoking! I just wanted to briefly share a thought about assessment and games vs. assessment and education--my colleague Seann Dikkers brought this up to me, and it's been percolating in the back of my mind ever since. [Warning: I'm going to throw around words like "learn" and "know" and--eeeek!--even "game" willy nilly, with no respect for all the nuances--it seems like to efficiently share a concept, such disrespect is sadly necessary...] Seann noted that when someone finishes playing a game, say "God of War" or Civilization III, you can make certain judgments about what they know--no information about their actions or choices in the game is necessary to establish a baseline. In other words, the simple *completion* of a game tells you a lot--no assessment necessary! And in our educational system, we require an ongoing series of (formative and summative) assessments to keep tabs on what people know--so there is some sort of fundamental difference between what it means to complete a *game* and what it means to complete a *class* or *unit.*

    Now, my partner, Jordan Thevenow-Harrison (who just recently started working with Chuck Kalish and Tim Rogers here at UW), says that the *game* is constantly assessing--players have to learn and pass "tests" in order to beat the game, they're just not teacher-administered. And yes, I agree with that to some extent, and I think he's got a good point--but I believe that "I've beaten game X" means something more consistently than "I've taken class X" (this is contributed to by the inconsistencies between classes and assessments based on the same topic). The fact remains, somehow I consider there to be a qualitatively different learning experience with games, as well as a different sort of certainty about what players have learned at the end of the game. (Maybe it has to do with Messick's comment about tests: if someone fails a test, all you can say is that your test did not measure anything. But what does it mean to "fail" a game? And if failing a game is different from failing a test, then perhaps passing a test is different from beating a game...)

    Sheesh, perhaps if I could sit down with you and talk for an hour, we could get a better sense of this distinction--maybe it's because games can be a strange blend between performance and thought?

  9. Caro!

    What would you think about a change from "I've taken class X" to "I've beaten 'Four Years of Undergraduate Education'" or "I've beaten 'French Minor'"? This is something like what happens with certain books, at least--saying "I've read 'Moby-Dick'" is basically meaningless, because most people who've read 'Moby-Dick' slogged through it to pass a class. But saying "I beat 'Moby-Dick'"--now that's saying something we can understand.

    I know it seems like I'm just talking semantics here. But I think you're on to something, and I think we approximate the distinction you make between passing a test and beating a game when we step back a bit and examine the games (Games) we play in educational systems.

  10. Caro--
    Jim Gee has been pushing this point hard in recent years. One concern is that games are fundamentally about "just-in-time" learning and that makes what you describe possible. As Allan Collins said in his new book, schools at some level are stuck with "just-in-case" knowledge. One of the things I am trying to do on sabbatical is game more with my son. It is fun but I still find myself doing it just to spend time with him. I don't play by myself. School is always going to be stuck with the unpleasant job of teaching stuff that some learners are just not going to want to learn. So we can only get so far with that approach--but we should keep trying. But there is another point that Melissa Gresalfi made that always rings true for me. You can quit World of Warcraft when you want to stop playing. You can't really have a functioning educational system if everybody just quits when they don't feel like playing that day or that game.

  11. Dan,
    I'm not extremely familiar with Jim Gee's writing on big-G and little-g games, but isn't his work around games/Games aimed at getting at exactly your last point? He writes:

    "in an educational context... we want to be certain that we create Games (social systems) that ensure that experiences will be well interrogated in the game (the in-game play). This is where “teachers” and mentors (not necessarily official teachers in the school sense) become crucial. Such teachers and mentors help create an effective Game and game/Game combination."

    After all, what is a game but an agreement to do certain things, according to certain (fairly arbitrary) rules, with the assumption that everybody else is doing the same and that a "win state" can ultimately be attained? It's true that we don't have the same freedom to quit the game of school as we do to quit the game of WoW, but certainly if everybody DID quit, school would cease to be a game, or at least a game worth playing.

  12. Dan,
    I really enjoyed your post! I'm just getting back from giving a talk on literacies in social media spaces at the ESRC funded seminar in the UK so your timing is perfect as assessment is one of the issues that came up mostly in the context of our talking about connecting in-school and out-of-school learning. In terms of assessing the 'communication' piece of the 21st century skills some are pointing to the iSkills assessment coming out of ETS as a step in this direction. I wonder what you think of this. (Educational Testing Service. (2009). iSkills Overview at
    Also, I wonder if Michael Russell's book might be relevant to your efforts here. In it, he lays out the history of the assessment movement and how it is at odds with educational reforms involving technology/21st century skills. A summary of the book is here:
    Russell, M. (2006). Technology and assessment: The Tale of Two Perspectives. Greenwich, CT: Information Age Publishing.

    I look forward to the next installment!

  13. Christine--
    Great question. We did a review of all the various efforts like iSkills about a year ago for our chapter in Val Shute's book. A bunch of it got cut for space considerations, so maybe we should update it and post it here.

    And an update is clearly needed. Notice that ETS's iSkills is now called iCritical Thinking. I suspect they are responding in part to the concerns that we and others are raising about the label of "skills". I am acquainted with Irv Katz and I have tremendous respect for him and the others who are working on this program at ETS. But I have followed their foray into new media proficiencies with concern. Inevitably there are huge consequences attached to any test ETS creates. This inevitably results in narrowing of the curriculum, drill and practice test prep, and outright cheating. This is a big enough problem with traditional tests where massive pools of items with tons of data behind them can be randomly assigned, and where (I presume) DIF or other techniques can detect compromised items. But these newer items are incredibly expensive to get to an operational level, and are inevitably not as reliable. Between narrowing, test prep, and socially-networked cheating, I don't think these scores will remain valid for any high-stakes decisions past about the time the test becomes operational--unless someone can prove that the data was not compromised. I might use these tests in my studies, but only at a great distance from the curriculum.

    Another concern I have with this trajectory at ETS is that they seem to be straying from their roots as first rate measurement professionals. While at ETS, Sam Messick taught the world that validity is a property of an interpretation rather than an assessment. Most of the research papers at the iCritical Thinking site (iCT?--clever!) only talk about how important critical thinking is. We know that already. But the one that focuses on measuring ICT skills in college students only mentions validity once. (!) More worrisome is that they seem to have ignored Messick (I believe their former VP of research) in doing so: "Completion of the assessment development process included pre-launch field testing. ETS also conducted beta testing at colleges and universities to ensure that each task is conceptually and statistically valid. ETS will also conduct ongoing research to examine the performance data, ensuring the test instruments meet the highest standards of validity, reliability, quality and utility" (p. 4). Not only do they assert that validity is a property of items and tests, they remind us how long and drawn out the development cycle is. And at ETS it is REALLY expensive too. So I worry for ETS and for the colleges who might rely on these assessments that the test will be DOA--even if one thinks these are things that can be measured with such a test. I am chair of student recruiting for my doctoral program so I spend a lot of time looking at scores on ETS tests and comparing them to the students whom we admit. I am not sure how these tests will be used, but I suspect I might actually be worried about a high score on an exam like this, because it would make me worry that somebody spent a bunch of time being trained uncritically to think critically.

    As for the Russell volume I have not seen it. (Information Age publishes some great stuff but so far they don't do a very good job getting it out in the libraries or available on line.) But thanks to your detailed review, I can see that it looks like a very thoughtful book that touches on a bunch of very important points. I am going to order it today and hopefully post about it in the next month or two. Thanks!

  14. You should be aware that the last "post" on this page links (via the ....) to a site I don't believe you want associated with your work. It should be removed.

  15. Quote: You can't really have a functioning educational system if everybody just quits when they don't feel like playing that day or that game.

    I think you can. I went to an elementary school like this. In 4th-6th grade, we had a set amount of work to be completed by the end of a quarter (of a year). How you allocated time was up to you. The teachers would help push you, if you were new or unmotivated, but for the most part students did as they pleased. I remember that instead of doing work, I would read Milton in the woods, for entire days, without doing any work, such as research papers or Mathimagination. I would play computer games and take them apart. I could just sit with my head in my hands. But then I had to catch up.