Lord knows I'm not a huge fan of Ning, the social networking tool that allows users to create and manage online networks. I find the design bulky and fairly counterintuitive, and modifying a network to meet your group's needs is extremely challenging, and Ning has made it extremely difficult or impossible for users to control, modify, or move network content. Despite the popularity of Ning's free, ad-supported social networks among K-16 educators, the ads that go along with the free service have tended toward the racy or age-inappropriate.
But given the Ning trifecta--it's free, getting students signed up is fast and fairly easy, and lots of teachers are using it--I've been working with Ning with researchers and teachers for the last two years. So the recent news that Ning will be switching to paid-only membership is obnoxious for two reasons.
The first reason is the obvious: I don't want to pay--and I don't want the teachers who use Ning to have to pay, either. One of the neat things about Ning is the ability to build multiple social networks--maybe a separate one for each class, or a new one each semester, or even multiple networks for a single group of students. In the future, each network will require a monthly payment, which means that most teachers who do decide to pay will stick to a much smaller number of networks. This means they'll probably erase content and delete members, starting fresh each time. The enormous professional development potential of having persistent networks filled with content, conversations, and student work suddenly disappears.
Which brings me to my second point: That anyone who's currently using Ning's free services will be forced to either pay for an upgrade or move all of their material off of Ning. This is tough for teachers who have layers upon layers of material posted on various Ning sites, and it's incredibly problematic for any researcher who's working with Ning's free resources. If we decide to leave Ning for another free network, we'll have to figure out some systematic way of capturing every single thing that currently lives on Ning, lest it disappear forever.
Ning's decision to phase out free services amounts to a paywall, pure and simple. Instead of putting limits on information, as paywalls for news services do, this paywall puts limits on participation. In many ways, this is potentially far worse, far more disruptive and destructive, far more short-sighted than any information paywall could be.
If Ning was smart, it would think a little more creatively about payment structures. What about offering unlimited access to all members of a school district, for a set fee paid at the district level? What about offering an educator account that provides unlimited network creation for a set (and much lower) fee? What about improving the services Ning provides to make it feel like you'd be getting what you paid for?
More information on Ning's decision to go paid-only will be released tomorrow. For now, I'm working up a list of free social networking tools for use by educators. If you have any suggestions, I'd love to hear them.
Update, 4/15/10, 6:48 p.m.: Never one to sit on the sidelines in the first place, Alec Couros has spearheaded a gigantic, collaborative googledoc called "Alternatives to Ning." As of this update, the doc keeps crashing because of the number of collaborators trying to help build this thing (the last time I got into it, I was one of 303 collaborators), so if it doesn't load right away, keep trying.
Thursday, April 15, 2010
Friday, April 2, 2010
Diane Ravitch Editorial on the Failure of NCLB
I have long admired Diane Ravitch. While I have disagreed with her on fundamental philosophical grounds, her arguments have always been grounded in the realities of schooling--even if those were the realities of conservative parents and stakeholders.
Now the evidence has shown what some of us predicted and what many of us have known for years: that external tests of basic skills and punitive sanctions were just going to lead to illusory gains (if any) and undermine other value outcomes. Her editorial in today's (April 2) Washington Post is very direct. While I disagree with her on where to go from here, I applaud her for using her audience and her reputation to help convince a lot of stakeholders who have found one reason or another to ignore the considerable evidence against continuing NCLB. Like Jim Popham has been saying for years, all of the improvement schools could make with test scores already happened between 1990 and 2000, once newspapers began publishing test scores.
Certainly this will factor into the pending NCLB reauthorization. Perhaps Indiana's Republican leadership will read this and think twice about going forward with their two core ideas for their Race to the Top reform proposal, even though it was not funded. The twin shells in their reform shotgun is "Pay for Performance" merit pay for Indiana teachers based on basic skills test scores, and "Value Added" growth modeling that ranks teachers based on how much "achievement" they instilled in their kids. For reasons Ravitch summarizes and other concerns outlined in a recent letter and report by the National Academy, the recoil from pulling these two triggers at once might be just enough to blow our schools and our children pretty far back into the 20th century.
Now the evidence has shown what some of us predicted and what many of us have known for years: that external tests of basic skills and punitive sanctions were just going to lead to illusory gains (if any) and undermine other value outcomes. Her editorial in today's (April 2) Washington Post is very direct. While I disagree with her on where to go from here, I applaud her for using her audience and her reputation to help convince a lot of stakeholders who have found one reason or another to ignore the considerable evidence against continuing NCLB. Like Jim Popham has been saying for years, all of the improvement schools could make with test scores already happened between 1990 and 2000, once newspapers began publishing test scores.
Certainly this will factor into the pending NCLB reauthorization. Perhaps Indiana's Republican leadership will read this and think twice about going forward with their two core ideas for their Race to the Top reform proposal, even though it was not funded. The twin shells in their reform shotgun is "Pay for Performance" merit pay for Indiana teachers based on basic skills test scores, and "Value Added" growth modeling that ranks teachers based on how much "achievement" they instilled in their kids. For reasons Ravitch summarizes and other concerns outlined in a recent letter and report by the National Academy, the recoil from pulling these two triggers at once might be just enough to blow our schools and our children pretty far back into the 20th century.
Tuesday, March 9, 2010
Video of Barry McGaw on Assessment Strategies for 21st Century Skills (Measurement Working Group)
I just came across a video of a keynote by Barry McGaw at last month’s Learning and World Technology Forum. McGaw heads the Intel/Microsoft/Cisco initiative known as Assessment and Teaching of 21st Century Skills. This high-powered group is aiming to transform the tests used in large-scale national comparisons and education more broadly. Their recent white papers are a must read for anyone interested in assessment or new proficiencies. McGaw’s video highlights aspects of this effort that challenge conventional wisdom about assessment. In this post I focus on McGaw’s comments on the efforts of the Measurement working group. In particular they point to (1) the need to iteratively define assessments and the constructs they aim to capture, and (2) the challenge of defining developmental models of these new skills.
Iterative Design of Assessments and Constructs
McGaw highlighted that the Measurement Working Group (led by Mark Wilson) emphasized the need for iterative refinement in the development of new measures. Various groups spent much of the first decade of the 21st century debating how these proficiencies should be defined and organized. In this abstract context, this definition process could easily consume the second decade as well. Wilson’s group argues that the underlying constructs being assessed must be defined and redefined in the context of the assessment development process. Of this, McGaw said
In other words, you can’t just have one group define standards and definitions and then pitch them to the measurement group when dealing with these new proficiencies. Because of their highly contextualized nature, we can’t just pitch standards to testing companies as has been the case with hard skills for years. This has always nagged at me in previous consideration, in that they seemed to overlook both the issue and the challenge that it presents (e.g., the Partnership for 21st Century Skills). Maybe now we can officially decide to stop trying to define what assessment scholar Lorrie Shepard so aptly labeled “21st Century Bla Bla Bla.”
The Lack of Learning Progression Models
McGaw also reiterated the concerns of the Measurement Working Groups over the lack of consensus about the way these new proficiencies develop. There is a strong consensus about the development of many of the hard skills in math, science, and literacy, and these insights are crucial for developing worthwhile assessments. I learned about this first hand developing a performance assessment for introductory genetics working with Ann Kindfield at ETS. Ann taught me the difference between the easier cause-to-effect reasoning (e.g., completing the Punnett square) and the more challenging effect-to-cause reasoning (e.g., using a pedigree chart to infer mode of inheritance). We used these and other distinctions she uncovered in her doctoral studies to create a tool that supported tons of useful studies on teaching inheritance in biology classes. Other more well known work on “learning progressions” include Ravit Duncan’s work in molecular genetics and Doug Clements’ work in algebra. In each case it took multiple research teams many years reach consensus about the way that knowledge typically developed.
Wilson and McGaw are to be commended for reminding us how difficult it is going to be to agree on the development of these much softer 21st century proficiencies. They are by their very definition situated in more elusive social and technological contexts. And those contexts are evolving. Quickly. Take for example judging credibility of information on the Internet. In the 90s this meant websites. In the past decade it came to mean blogs. Now I guess it includes Twitter. (There is a great post about this at MacArthur’s Spotlight Blog, as well as a recent CBC interview about fostering new media literacies, featuring my student Jenna McWilliams.)
Consider that I taught my 11-year-old son to look at the history page on Wikipedia to help distinguish between contested and uncontested information in a given entry. He figured out on his own how to verify the credibility of suggestions for modding his Nerf guns at nerfhaven.com and YouTube. Now imagine you are ETS, where it inevitably takes a long time and buckets of money to produce each new test. They already had to replace their original iSkills test with the iCritical Thinking test. From what I can tell, it is still a straightforward test of information from a website. Lots of firms are starting to market such tests. Some places (like Scholastic’s Expert21) will also sell you curriculum and classroom assessments that will teach students to pass the test—without ever actually going on the Internet. Of course ETS know that they can’t sell curriculum if they want to maintain their credibility. But I am confident that as soon as organizations start attaching meaningful consequences to the test, social networks will spring up telling students exactly how to answer the questions.
There is lots of other great stuff in the Measurement white paper. Much if it is quite technical. But I applaud their sobering recognition of the many challenges that these new proficiencies pose for large scale measurement. And they only get harder when these new tests are used for accountability purposes.
Next up: McGaw’s comments about the Classroom Environments and Formative Evaluation working group.
Iterative Design of Assessments and Constructs
McGaw highlighted that the Measurement Working Group (led by Mark Wilson) emphasized the need for iterative refinement in the development of new measures. Various groups spent much of the first decade of the 21st century debating how these proficiencies should be defined and organized. In this abstract context, this definition process could easily consume the second decade as well. Wilson’s group argues that the underlying constructs being assessed must be defined and redefined in the context of the assessment development process. Of this, McGaw said
You think about it first, you have a theory about what you want those performances to measure. You then begin to develop ways of capturing information about that skills. But the data themselves give you information about the definition, and you refine the definition. This is the important point of pilot work with these assessment devices. And not just giving the tests to students, but giving them to students and seeing what their responses are, and discovering why they gave that response. And not just in the case where it is the wrong response but in the case where it is the correct response, so that you get a better sense of the cognitive processes underlying the solution to the task.
In other words, you can’t just have one group define standards and definitions and then pitch them to the measurement group when dealing with these new proficiencies. Because of their highly contextualized nature, we can’t just pitch standards to testing companies as has been the case with hard skills for years. This has always nagged at me in previous consideration, in that they seemed to overlook both the issue and the challenge that it presents (e.g., the Partnership for 21st Century Skills). Maybe now we can officially decide to stop trying to define what assessment scholar Lorrie Shepard so aptly labeled “21st Century Bla Bla Bla.”
The Lack of Learning Progression Models
McGaw also reiterated the concerns of the Measurement Working Groups over the lack of consensus about the way these new proficiencies develop. There is a strong consensus about the development of many of the hard skills in math, science, and literacy, and these insights are crucial for developing worthwhile assessments. I learned about this first hand developing a performance assessment for introductory genetics working with Ann Kindfield at ETS. Ann taught me the difference between the easier cause-to-effect reasoning (e.g., completing the Punnett square) and the more challenging effect-to-cause reasoning (e.g., using a pedigree chart to infer mode of inheritance). We used these and other distinctions she uncovered in her doctoral studies to create a tool that supported tons of useful studies on teaching inheritance in biology classes. Other more well known work on “learning progressions” include Ravit Duncan’s work in molecular genetics and Doug Clements’ work in algebra. In each case it took multiple research teams many years reach consensus about the way that knowledge typically developed.
Wilson and McGaw are to be commended for reminding us how difficult it is going to be to agree on the development of these much softer 21st century proficiencies. They are by their very definition situated in more elusive social and technological contexts. And those contexts are evolving. Quickly. Take for example judging credibility of information on the Internet. In the 90s this meant websites. In the past decade it came to mean blogs. Now I guess it includes Twitter. (There is a great post about this at MacArthur’s Spotlight Blog, as well as a recent CBC interview about fostering new media literacies, featuring my student Jenna McWilliams.)
Consider that I taught my 11-year-old son to look at the history page on Wikipedia to help distinguish between contested and uncontested information in a given entry. He figured out on his own how to verify the credibility of suggestions for modding his Nerf guns at nerfhaven.com and YouTube. Now imagine you are ETS, where it inevitably takes a long time and buckets of money to produce each new test. They already had to replace their original iSkills test with the iCritical Thinking test. From what I can tell, it is still a straightforward test of information from a website. Lots of firms are starting to market such tests. Some places (like Scholastic’s Expert21) will also sell you curriculum and classroom assessments that will teach students to pass the test—without ever actually going on the Internet. Of course ETS know that they can’t sell curriculum if they want to maintain their credibility. But I am confident that as soon as organizations start attaching meaningful consequences to the test, social networks will spring up telling students exactly how to answer the questions.
There is lots of other great stuff in the Measurement white paper. Much if it is quite technical. But I applaud their sobering recognition of the many challenges that these new proficiencies pose for large scale measurement. And they only get harder when these new tests are used for accountability purposes.
Next up: McGaw’s comments about the Classroom Environments and Formative Evaluation working group.
Friday, January 22, 2010
Can We Really Measure "21st Century" Skills?
The members of the 21st Century Assessment Project were asked a while ago to respond to four pressing questions regarding assessment of “21st Century Skills.” These questions had come via program officers at leading foundations, including Connie Yowell at MacArthur’s Digital Media and Learning Initiative, which funds our Project. I am going to launch my efforts to blog more during my much-needed sabbatical by answering the first question, with some help from my doctoral student Jenna McWilliams.
Question One: Can critical thinking, problem solving, collaboration, communication and "learning to learn" be reliably and validly measured?
As Dan Koretz nicely illustrated in the introduction to his 2008 book, Measuring Up: What Educational Testing Really Tells Us, the answers to questions about educational testing are never simple. We embrace strongly situative and participatory view of knowing and learning, which is complicated to explain to those who do not embrace it. But I have training in psychometrics (and completed a postdoc at ETS) and have spent most of my career refining a more pragmatic stance that treats educational accountability as inevitable. When it comes to assessment, I am sort of a born-again situativity theorist. Like folks who have newly found religion and want to tell everybody how Jesus helped them solve all of the problems they used to struggle with, I am on a mission to tell everyone how situative approaches measurement can solve some nagging problems that they have long struggled with.
In short, no, we don’t believe we can measure these things in ways that are reliable and yield scores that are valid evidence of what individuals are capable of in this regard. These are actually “practices” that can most accurately be interpreted using methods accounting for the social and technological contexts in which they occur. In this sense, we agree with skeptics like Jim Greeno and Melissa Gresalfi who argued that we can never really know what students know. This point riffs on the title of the widely cited National Research Council report of the same name that Jim Pellegrino (my doctoral advisor) led. And as Val Shute just reminded me, Messick has reminded us forever that measurement never really gets directly at what somebody knows, but instead provides evidence about what the seem to know. My larger point here is my concern about what happens with these new proficiencies in schools and in tests when we treat them as individual skills rather than social practices. In particular I worry what happens to both education and evidence when students, teachers, and schools are judged according to tests of these new skills.
However, there are lots of really smart folks who have a lot of resources at their disposal who think you can measure them. This includes most of my colleagues in the 21st Century Assessment Project. For example, check out Val Shute’s great article in the International Journal of Learning and Media. Shute also has an edited volume on 21st Century Assessment coming out shortly. Likewise Dan Schwartz has a tremendous program of research building on his earlier work with John Bransford on assessments as preparation for future learning. Perhaps the most far reaching is Bob Mislevy’s work on evidence-centered design. And of course there is the new Intel-Microsoft-Cisco partnership which is out to change the face of national assessments and therefore the basis of international comparisons. I will elaborate on these examples in my next post, as that is actually the second question we were asked to answer. But first let me elaborate on why I believe that the assessment (of what individuals understand) and the measurement (of what groups of individuals have achieved) of 21st Century skills is improved if we assume that we can never really know what students know.
To reiterate, from the perspective of contemporary situated views of cognition, all knowledge and skills are primarily located in the social context. This is easy to ignore when focusing on traditional skills like reading and math that can be more meaningfully represented as skills that individuals carry from context to context. This assumption is harder to ignore with these newer ones that everyone is so concerned with. This is expecially the case with explicity social practices like collaborating and communicating, since these can't even practiced in isolated contexts. As we argued in our chapter in Val’s book, we believe it is a dangerously misleading to even use the term skills in this regard. We elected to use the term proficiencies because that term is broad enough to capture the different ways that we think about them. As 21st Century Assessment project leader Jim Gee once put it
More to the point of the question, we believe that only a tiny fraction of the residue from these practices can be measured using conventional standardized multiple-choice tests that provide little or no context. For reasons of economy and reliability, such tests are likely to remain the mainstay of educational accountabiity for years to come. Of course, when coupled with modern psychometrics, such tests can be extremely reliable, with little score variation across testing time or version. But there are serious limitations in what sorts of interpretations can be validly drawn from the resulting scores. In our opinion, scores on any standardized test of these new skills are only valid evidence of proficiency when they are
I will say that I am less skeptical about standardized measures of writing. But they are about the only standardized assessments left in wide use that actually requires students to produce something. Such tests will continue to be expensive and standardized scoring (by humans or machines) requires very peculiar writing formats. But I think the scores that result are valid for making inferences about individual proficiency in written communication more broadly, as was implied by the original question. They are actually performance assessments and as such can bring in elements of different contexts. This is particularly true if we can relax some of the needs for reliability (which requires very narrowly defined prompts and typically gets compromized and writers get creative and original). Given that I think my response to the fourth question will elaborate on my belief that written communication is probably the single most important “new proficiency” needed for economic, civic, and intellectual engagement, I think that improved testing of written communication will be the one focus of assessment research that yields the most impact on learning and equity.
To elaborate on the issue of validity, it is worth reiterating that validity is a property of the way the scores are interpreted. Unlike reliability, validity is never a property of the measure. In other words, validity always references the claims that are being supported by the evidence. As Messick argued in the 90s, the validity of any interpretation of scores also depends on the similarity between prior education and training contexts and the assessment/measurement context. This is where things get messy very quickly. As Kate Anderson and I argued in a chapter in an NSSE Yearbook on Evidence and Decision Making edited by Pam Moss, once we attach serious consequences to assessments or tests for teachers or students, the validity of the resulting scores will get compromised very quickly. This is actually less of a risk with traditional proficiencies and traditional multiple choice tests. This is because these tests can draw from massive pools of items that are aligned to targeted standards. In these cases, the test can be isolated from any preparation empirically, by randomly sampling from a huge pool of items. As we move to newer performance measures of more extended problem solving and collaboration, there necessarily are fewer and fewer items and the items become more and more expensive to develop and validate. If teachers are directly teaching students to solve the problems, then it becomes harder and harder to determine how much of an individual score is real proficiency and how much is familiarity with the assessment format (what Messick called construct-irrelevant variance). The problem is that it is impossible to ever know how much of the proficiency is “real.” Even in closely studied contexts, different observers are sure to differ in the validity—a point made most cogently in Michael Kane’s discussions of validity as interpretive argument.
Because of these validity concerns, we are terrified that the publishers of these tests of “21st Century Skills” are starting marketing curricula and test preparation materials of those same proficiencies. Because of the nature of these new proficiencies, these new “integrated” systems raise even more validity issues than the ones that emerged under NCLB for traditional skills. Another big validity issue we raised in our chapter concerns the emergence of socially networked cheating. Once these new tests are used for high-stakes decisions (especially for college entrance), social networks will emerge to tell students how to solve the kinds of problems that are included on the tests. (This has already begun to happen, as in the "This is SPARTA!" prank on the English Advanced Placement test that we wrote about in our chapter and in a more recent "topic breach" wherein students in Winnipeg leaked the essay topic for the school's 12th grade English exam.)
Of course, proponents of these new tests will argue that learning how to solve the kinds of problems that appear on their tests is precisely what they want students to be doing. And as long as you adopt a relatively narrow view of cognition and learning, there is some truth to that assumption. Our real concern is that this unbalanced focus in addition to new standards and new tests will distract from the more important challenge of fostering equitable, ethical, and consequential participation in these new skills in schools.
That is it for now. We will be posting my responses to the three remaining questions over the next week or so. We would love to hear back from folks about their responses to the first question.
Questions remaining:
2) Which are the most promising research initiatives?
3) Is it or will it be possible to measure these things in ways that they can be scored by computer? If so, how long would it take and what sorts of resources would be needed?
4) If we had to narrow our focus to the proficiencies most associated with economic opportunity and civic participation, which ones do we recommend? Is there any evidence/research specifically linking these proficiencies to these two outcomes? If we further narrowed our focus to only students from underserved communities, would this be the same list?
Question One: Can critical thinking, problem solving, collaboration, communication and "learning to learn" be reliably and validly measured?
As Dan Koretz nicely illustrated in the introduction to his 2008 book, Measuring Up: What Educational Testing Really Tells Us, the answers to questions about educational testing are never simple. We embrace strongly situative and participatory view of knowing and learning, which is complicated to explain to those who do not embrace it. But I have training in psychometrics (and completed a postdoc at ETS) and have spent most of my career refining a more pragmatic stance that treats educational accountability as inevitable. When it comes to assessment, I am sort of a born-again situativity theorist. Like folks who have newly found religion and want to tell everybody how Jesus helped them solve all of the problems they used to struggle with, I am on a mission to tell everyone how situative approaches measurement can solve some nagging problems that they have long struggled with.
In short, no, we don’t believe we can measure these things in ways that are reliable and yield scores that are valid evidence of what individuals are capable of in this regard. These are actually “practices” that can most accurately be interpreted using methods accounting for the social and technological contexts in which they occur. In this sense, we agree with skeptics like Jim Greeno and Melissa Gresalfi who argued that we can never really know what students know. This point riffs on the title of the widely cited National Research Council report of the same name that Jim Pellegrino (my doctoral advisor) led. And as Val Shute just reminded me, Messick has reminded us forever that measurement never really gets directly at what somebody knows, but instead provides evidence about what the seem to know. My larger point here is my concern about what happens with these new proficiencies in schools and in tests when we treat them as individual skills rather than social practices. In particular I worry what happens to both education and evidence when students, teachers, and schools are judged according to tests of these new skills.
However, there are lots of really smart folks who have a lot of resources at their disposal who think you can measure them. This includes most of my colleagues in the 21st Century Assessment Project. For example, check out Val Shute’s great article in the International Journal of Learning and Media. Shute also has an edited volume on 21st Century Assessment coming out shortly. Likewise Dan Schwartz has a tremendous program of research building on his earlier work with John Bransford on assessments as preparation for future learning. Perhaps the most far reaching is Bob Mislevy’s work on evidence-centered design. And of course there is the new Intel-Microsoft-Cisco partnership which is out to change the face of national assessments and therefore the basis of international comparisons. I will elaborate on these examples in my next post, as that is actually the second question we were asked to answer. But first let me elaborate on why I believe that the assessment (of what individuals understand) and the measurement (of what groups of individuals have achieved) of 21st Century skills is improved if we assume that we can never really know what students know.
To reiterate, from the perspective of contemporary situated views of cognition, all knowledge and skills are primarily located in the social context. This is easy to ignore when focusing on traditional skills like reading and math that can be more meaningfully represented as skills that individuals carry from context to context. This assumption is harder to ignore with these newer ones that everyone is so concerned with. This is expecially the case with explicity social practices like collaborating and communicating, since these can't even practiced in isolated contexts. As we argued in our chapter in Val’s book, we believe it is a dangerously misleading to even use the term skills in this regard. We elected to use the term proficiencies because that term is broad enough to capture the different ways that we think about them. As 21st Century Assessment project leader Jim Gee once put it
Abstract representations of knowlege, if they exist at all, reside at the end of long chains of situated activity.However, we also are confident that that some of the mental “residue” that gets left behind when people engage meaningfully in socially situated practices can certainly be assessed reliably and used to make valid interpretations about what individuals know. While we think these proficiencies are primarily social practices, it does not exclude recognizing the secondary “echoes” of participating in these practices. This can be done with performance assessments and other extended activities that provide some of that context and then ask individuals or groups to reason, collaborate, communicate, and learn. If such assessments are created carefully, and individuals have not been directly trained to solve the problems on the assessments, it is possible to obtain reliable scores that are valid predictions of how well individuals can solve, communicate, collaborate, and learn in new social and technological contexts. But this continues to be difficult and the actual use of such measures raises serious validity issues. Because of these issues (as elaborated below), we think this work might best be characterized as “guessing what students know.”
More to the point of the question, we believe that only a tiny fraction of the residue from these practices can be measured using conventional standardized multiple-choice tests that provide little or no context. For reasons of economy and reliability, such tests are likely to remain the mainstay of educational accountabiity for years to come. Of course, when coupled with modern psychometrics, such tests can be extremely reliable, with little score variation across testing time or version. But there are serious limitations in what sorts of interpretations can be validly drawn from the resulting scores. In our opinion, scores on any standardized test of these new skills are only valid evidence of proficiency when they are
a) used to make claims about aggregated proficiencies across groups of individuals;Hence, we are pretty sure that national and international assessments like NAEP and PISA should start incorporating such proficiencies. But we have serious concerns about using these measures to evaluate individual proficiencies in an high-stakes sorts of ways. If such tests are going to continue to be used on any high stakes decisions, they may well best be left to more conventional literacies, numeracies, and knowledge of conventional content domains, which are less likely to be compromised.
b) used to make claims about changes over longer times scales, such as comparing the consequences of large scale policy decisions over years; and
c) isolated from the educational environment which they are being used to evaluate.
I will say that I am less skeptical about standardized measures of writing. But they are about the only standardized assessments left in wide use that actually requires students to produce something. Such tests will continue to be expensive and standardized scoring (by humans or machines) requires very peculiar writing formats. But I think the scores that result are valid for making inferences about individual proficiency in written communication more broadly, as was implied by the original question. They are actually performance assessments and as such can bring in elements of different contexts. This is particularly true if we can relax some of the needs for reliability (which requires very narrowly defined prompts and typically gets compromized and writers get creative and original). Given that I think my response to the fourth question will elaborate on my belief that written communication is probably the single most important “new proficiency” needed for economic, civic, and intellectual engagement, I think that improved testing of written communication will be the one focus of assessment research that yields the most impact on learning and equity.
To elaborate on the issue of validity, it is worth reiterating that validity is a property of the way the scores are interpreted. Unlike reliability, validity is never a property of the measure. In other words, validity always references the claims that are being supported by the evidence. As Messick argued in the 90s, the validity of any interpretation of scores also depends on the similarity between prior education and training contexts and the assessment/measurement context. This is where things get messy very quickly. As Kate Anderson and I argued in a chapter in an NSSE Yearbook on Evidence and Decision Making edited by Pam Moss, once we attach serious consequences to assessments or tests for teachers or students, the validity of the resulting scores will get compromised very quickly. This is actually less of a risk with traditional proficiencies and traditional multiple choice tests. This is because these tests can draw from massive pools of items that are aligned to targeted standards. In these cases, the test can be isolated from any preparation empirically, by randomly sampling from a huge pool of items. As we move to newer performance measures of more extended problem solving and collaboration, there necessarily are fewer and fewer items and the items become more and more expensive to develop and validate. If teachers are directly teaching students to solve the problems, then it becomes harder and harder to determine how much of an individual score is real proficiency and how much is familiarity with the assessment format (what Messick called construct-irrelevant variance). The problem is that it is impossible to ever know how much of the proficiency is “real.” Even in closely studied contexts, different observers are sure to differ in the validity—a point made most cogently in Michael Kane’s discussions of validity as interpretive argument.
Because of these validity concerns, we are terrified that the publishers of these tests of “21st Century Skills” are starting marketing curricula and test preparation materials of those same proficiencies. Because of the nature of these new proficiencies, these new “integrated” systems raise even more validity issues than the ones that emerged under NCLB for traditional skills. Another big validity issue we raised in our chapter concerns the emergence of socially networked cheating. Once these new tests are used for high-stakes decisions (especially for college entrance), social networks will emerge to tell students how to solve the kinds of problems that are included on the tests. (This has already begun to happen, as in the "This is SPARTA!" prank on the English Advanced Placement test that we wrote about in our chapter and in a more recent "topic breach" wherein students in Winnipeg leaked the essay topic for the school's 12th grade English exam.)
Of course, proponents of these new tests will argue that learning how to solve the kinds of problems that appear on their tests is precisely what they want students to be doing. And as long as you adopt a relatively narrow view of cognition and learning, there is some truth to that assumption. Our real concern is that this unbalanced focus in addition to new standards and new tests will distract from the more important challenge of fostering equitable, ethical, and consequential participation in these new skills in schools.
That is it for now. We will be posting my responses to the three remaining questions over the next week or so. We would love to hear back from folks about their responses to the first question.
Questions remaining:
2) Which are the most promising research initiatives?
3) Is it or will it be possible to measure these things in ways that they can be scored by computer? If so, how long would it take and what sorts of resources would be needed?
4) If we had to narrow our focus to the proficiencies most associated with economic opportunity and civic participation, which ones do we recommend? Is there any evidence/research specifically linking these proficiencies to these two outcomes? If we further narrowed our focus to only students from underserved communities, would this be the same list?
Subscribe to:
Posts (Atom)