Wednesday, January 11, 2023

ChatGPT vs. Online Formative Self-Assessments and Summative Multiple-Choice Exams

by Daniel Hickey and Qianxu (Morgan) Luo


This is the third post in a row considering ChatGPT’s implications for academic integrity, particularly in online education. The first post analyzed the media predictions. The second post tried to use ChatGPT to complete a graduate course on learning and cognition and write a literature review paper. We concluded that ChatGPT is indeed a threat to the academic integrity of many conventional educational practices, but that it would be difficult to use ChatGPT to complete those particular assignments, interactions, and literature reviews without being detected and without learning quite a bit of the content.


ChatGPT vs. Summative Assessments

Thanks to modern learning management systems and Google Forms, online educational assessment has become widespread, even for face-to-face classes. The popular Canvas LMS offers ten item types and four assessment formats, for 40 combinations, and can automatically score selected-response and short-answer items. Time limits can deter students from scouring the web for answers when scores matter. Together, these features let instructors build powerful and efficient online assessments.

The most obvious question for us (and many others) is whether ChatGPT can thwart efforts to preserve the validity of online exam scores as evidence of what someone has learned in a course or knows about the field. One less obvious question concerns formative assessments: might ChatGPT give accurate feedback and relieve instructors of this burden? Another less obvious question concerns the role of digital proctors: is ChatGPT going to require more of these intrusive and racially biased monitors? (We are currently studying whether the approach below can help an online high school stop using them.)

We scoured the research literature and uncovered a few preprints of recent studies exploring the first two questions. On the first question, Susnjak (2022) concluded that ChatGPT is “capable of exhibiting critical thinking skills and generating highly realistic text with minimal input, making it a potential threat to the integrity of online exams, particularly in tertiary education settings where such exams are becoming more prevalent.” Susnjak suggested we might need to return to in-person and oral exams. Gilson et al. (2022) concluded that ChatGPT performed on the US Medical Licensing Exam (USMLE) at a level “comparable to a third-year medical student,” while Kung et al. (2022) concluded that ChatGPT “performed at or near a passing threshold” on all three USMLE exams. On the second question, Kung et al. concluded that ChatGPT “demonstrated high concordance and insight in its explanations,” which means it “may have the potential to assist with medical education…” We found no studies exploring the third question. But these studies were all conducted within the last month, so more are surely underway.


ChatGPT vs. Participatory Assessment

For the last decade, our lab has been exploring a situative approach that addresses long-standing challenges in assessment. As introduced in Hickey and Zuiker (2012) and elaborated in Hickey (2015) and Hickey and Harris (2021), this approach “aligns” learning across multiple levels of increasingly formal assessments:


  1. Public instructor assessment of student work and annotations positions students as accountable authors and audience for peer authors.

  2. Engagement reflections summatively assess prior engagement while formatively assessing understanding and future engagement.

  3. Formative self-assessments allow students to check and enhance their understanding and indirectly prepare for exams without requiring laborious private instructor feedback.

  4. Automated time-limited multiple-choice exams with unsearchable “best-answer” items provide valid evidence of achievement of learning outcomes.


These features free up instructor time to focus on efficient and powerful public assessment and engagement at the first level. They support understanding and achievement while leaving digital “traces” of learning that bolster integrity. The previous post examined a course that included only the first two levels, in which students drafted a literature review paper week by week. This post presents an initial analysis of a course called Assessment and Schools that includes all four levels.


ChatGPT vs. Our Summative Exams

The course includes three summative exams, one at the end of each of the three modules. Illustrating the situative assessment design principle “measure achievement discreetly (as needed),” each exam is worth only 10% of the final grade. The items are:


  • Drawn from the item banks of multiple textbooks, to help document transfer.

  • Never directly included in the assignments or formative assessments.

  • Written as challenging “best answer” items and rewritten regularly so that correct answers cannot be quickly located via search.

  • Regularly scrutinized to ensure a high discrimination index (d), meaning that students who answer the item correctly tend to score relatively well overall.

  • Regularly scrutinized to ensure that no more than one student earns a perfect score and that scores are roughly normally distributed around an average difficulty (p) between 0.7 and 0.8 (70-80%). A minimal sketch of how p and d can be computed appears after this list.
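For readers less familiar with classical item analysis, here is a minimal sketch of how these two statistics can be computed. It is a hypothetical illustration in Python with made-up scores, not our actual analysis scripts, and it uses a simple top-half versus bottom-half split for the discrimination index (other conventions, such as comparing the upper and lower 27%, are also widely used).

    import numpy as np

    def item_statistics(scores):
        # scores: 0/1 matrix with one row per student and one column per item.
        scores = np.asarray(scores, dtype=float)

        # Difficulty p: the proportion of students who answered each item correctly.
        p = scores.mean(axis=0)

        # Discrimination d: proportion correct in the top half of the class
        # (by total score) minus the proportion correct in the bottom half.
        order = np.argsort(scores.sum(axis=1))
        half = scores.shape[0] // 2
        d = scores[order[-half:]].mean(axis=0) - scores[order[:half]].mean(axis=0)
        return p, d

    # Toy data: six students by four items.
    demo = [[1, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 1, 1, 1],
            [0, 0, 0, 1],
            [1, 1, 0, 0],
            [0, 0, 0, 1]]
    p, d = item_statistics(demo)
    print("difficulty p:     ", np.round(p, 2))
    print("discrimination d: ", np.round(d, 2))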


Students are given forty minutes to complete each exam, which is designed to take about twenty minutes. To further ensure integrity, students are shown only their overall score after completing each exam and do not see the correct answers to individual items.
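As a purely illustrative configuration sketch, settings like these map onto quiz options that can be set in an LMS. The snippet below uses Canvas’s REST API from Python; the domain, token, course ID, and title are placeholders, and the parameter names reflect our reading of the public API documentation, so treat this as an assumption-laden sketch rather than a record of our actual setup.

    import requests

    # Hypothetical placeholders -- substitute a real Canvas domain, API token, and course ID.
    CANVAS_DOMAIN = "https://canvas.example.edu"
    API_TOKEN = "YOUR_API_TOKEN"
    COURSE_ID = 12345

    # Create a timed exam that reveals only the overall score, not the missed items.
    payload = {
        "quiz[title]": "Module 1 Exam",
        "quiz[quiz_type]": "assignment",
        "quiz[time_limit]": 40,           # minutes allowed once a student begins
        "quiz[hide_results]": "always",   # students never see item-level results
        "quiz[shuffle_answers]": "true",  # vary answer order across students
    }

    response = requests.post(
        f"{CANVAS_DOMAIN}/api/v1/courses/{COURSE_ID}/quizzes",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        data=payload,
    )
    response.raise_for_status()
    print("Created quiz with id", response.json()["id"])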

Morgan entered each stem and its four response options into ChatGPT (it recognizes them as multiple-choice items). She found that ChatGPT returned the correct answer for just 51% of the items; in contrast, the average score across students and exams was 80%. She then gave herself two minutes per item to see whether she could locate the correct answer on Google and found it for 45% of the items (a figure that is likely inflated, because, unlike a student, she already knew which answer she was searching for).

We are currently analyzing the items to determine which item types ChatGPT handles better or worse. The correlation between ChatGPT’s correctness and item difficulty (p) was surprisingly low: the average difficulty of the items ChatGPT missed was .67, while the average difficulty of the items it answered correctly was .77. ChatGPT correctly answered the first example item shown below (after a brief sketch of how this comparison can be computed), which 93% of the students also answered correctly.
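As a hedged illustration of that comparison (again with hypothetical numbers rather than our actual item file), the relationship can be summarized as a point-biserial correlation between a 0/1 indicator of whether ChatGPT answered each item correctly and that item’s student difficulty p:

    import numpy as np
    from scipy.stats import pointbiserialr

    # Hypothetical data: 1 = ChatGPT answered the item correctly, 0 = it missed the item,
    # paired with each item's student difficulty p.
    chatgpt_correct = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
    item_difficulty = np.array([.93, .62, .77, .81, .60, .70, .75, .66, .80, .72])

    r, p_value = pointbiserialr(chatgpt_correct, item_difficulty)
    print(f"point-biserial r = {r:.2f} (p = {p_value:.2f})")

    # The contrast reported above: mean difficulty of missed vs. correctly answered items.
    print("mean p for missed items:  ", round(item_difficulty[chatgpt_correct == 0].mean(), 2))
    print("mean p for answered items:", round(item_difficulty[chatgpt_correct == 1].mean(), 2))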


Question/Prompt: Which of the following assertions is most accurate?

Group of answer choices (as entered into ChatGPT)

Criterion-referenced score-based inferences are relative interpretations.

Norm-referenced score-based inferences and criterion-referenced score-based inferences are not essentially interchangeable.

Norm-referenced score-based inferences are typically more useful to classroom teachers than criterion-referenced score-based inferences.

Norm-referenced score-based inferences are absolute interpretations.

ChatGPT Reply: The most accurate assertion out of the options provided is:

"Norm-referenced score-based inferences and criterion-referenced score-based inferences are not essentially interchangeable."

Criterion-referenced score-based inferences are based on a specific set of standards or criteria and are used to determine whether a student has met those standards. They are relative to the specific criteria and are used to evaluate the student's performance relative to the criteria.

Norm-referenced score-based inferences, on the other hand, are based on the student's performance relative to a group of peers. They are typically used to rank the student's performance relative to others. Norm-referenced score-based inferences can be considered as absolute interpretation but is not necessary.

Criterion-referenced score-based inferences and norm-referenced score-based inferences are different types of inferences and are used for different purposes. They cannot be used interchangeably.

It can also depend on the context and the goals of an assessment, whether norm-referenced or criterion-referenced score-based inferences are more useful for a specific assessment. It can be helpful for teachers to understand the difference between the two and when to use them.


Both the answer and the explanation were factually correct. In contrast, the following item was relatively more difficult for students (p = .75):


Question/Prompt: A rubric is a scoring guide to be employed in judging students’ responses to constructed-response assessments such as a performance test.  Which one of the following elements is the least necessary feature of a properly constructed rubric?

Group of answer choices (as entered into ChatGPT)

An indication of whether a holistic or analytic scoring approach is to be used

An identification of the evaluative criteria to be used in appraising a student’s response

A designation of a performance standard required for skill-mastery (correct response)

Descriptions of different quality levels associated with each evaluative criterion

ChatGPT’s (incorrect) reply: The least necessary feature of a properly constructed rubric among the options provided is:

"An indication of whether a holistic or analytic scoring approach is to be used"

Rubrics are often used to score constructed-response assessments such as performance tests, essays, and projects. A well-constructed rubric should have the following elements:

  1. An identification of the evaluative criteria to be used in appraising a student’s response: The rubric should clearly identify the specific criteria that will be used to evaluate the student's performance on the assessment.

  2. A designation of a performance standard required for skill-mastery: The rubric should include a description of the level of performance required for mastery of the skill or content being assessed.

  3. Descriptions of different quality levels associated with each evaluative criterion: The rubric should include a clear description of the different levels of quality that can be achieved for each criterion, such as excellent, good, fair, and poor.

The indication of whether a holistic or analytic scoring approach is to be used is something that is more associated with the choice of scoring method and it can depend on the type of assessment and the goal of the assessment. A holistic scoring approach means that an evaluator assigns a single score to the student's response, whereas an analytic scoring approach means that an evaluator assigns a separate score for each criterion or dimension of the assessment.

Rubrics can be used for both types of scoring methods, and the rubric itself doesn't dictate the scoring approach.

While all of the feedback provided is correct, ChatGPT selected the wrong answer, and nothing in its explanation would point a student to the correct one. The item’s very high discrimination index (d = .75) indicates that nearly all of the students who missed it were in the bottom half of the class in terms of overall score.

We are now further analyzing the items and searching for patterns. But clearly, more systematic study is needed, likely involving measurement scholars who have access to large pools of professionally written items with detailed performance data. For now, we concluded that these exams do indeed bolster academic integrity in the course. Some students may well boost ChatGPT’s performance by scouring its output and double-checking its answers, but doing so would likely require some knowledge of the concept in question and might have limited formative value.

But it is worth noting that marginally engaged students sometimes score nearly as poorly on these exams as ChatGPT did (i.e., around 60%). Because each exam is worth only 10% of the final grade, the three exams together account for 30 points; scoring roughly half of those points means a student who relied entirely on ChatGPT would have lost only about 15 points of a 100-point final grade, ending with roughly a B if all other points were earned.


ChatGPT vs. Our Formative Self-Assessments

Unsurprisingly, the research literature on “cheating” on formative assessments is quite small.  Arnold (2016) used sophisticated techniques to find that doing so is “negatively related to academic progress” and “does not seem to pay off” in terms of summative exam performance. Our broader situative view of learning leads us to a rather different way of thinking about the formative functions of assessment. We view all assessments as “peculiar” forms of disciplinary discourse. We view learning primarily as increasingly successful participation in such discourse and only secondarily as the relatively static knowledge represented by assessment items. We acknowledge that this is unorthodox and continues to baffle our constructivist assessment colleagues and our empiricist measurement colleagues. But it has led us to innovative responses to the “conundrum” of formative assessment described in Bennett (2011) and Hickey (2015).

Building on Duschl and Gitomer (1997), Hickey et al. (2003) initiated an extended program of formative assessment research using situative theory. This work has gone in a number of directions. In the online Assessment course, each of the eleven assignments concludes with 5-10 formative self-assessment items. These are constructed-response items drawn from the textbook’s supplemental digital material and from the assignments. Because they are more formal and objective than the assignments, they require a degree of transfer. Students are instructed to:


  • Try to draft a coherent/complete response from memory.

  • Consult the textbook and/or student work as necessary to draft a complete/coherent response.

  • Compare their response with the “expert” answer displayed only after they submit something (anything).


The key assessment design principles here are that (a) the items should require the transfer of knowledge, (b) students should believe that completing them as suggested will support their exam performance, and (c) students are then well positioned and motivated to understand feedback that goes beyond the correct answer to that item.

Students are instructed to complete the self-assessments after finishing each assignment. They are not mandatory, and 80-90% of students typically attempt them. Naturally, we wondered what would happen if students consulted ChatGPT in addition to, or instead of, the textbook when completing these items. Morgan used ChatGPT to try to find the correct answer for each of the 78 items and compared that answer with the expert answer the items provided. She also tried to find the correct answer on the web (with no time limit) and tested both the ChatGPT response and the expert response using the AI Content Detector at Writer.com.

She concluded that ChatGPT returned a clearly correct answer for 31% of the items and a partly correct answer for 11% of the items. The ChatGPT responses averaged 80% “human-generated” at Writer.com, while the expert answers averaged 98%. It is worth noting that many instructors might elect to use similar items in summative assessments; our results suggest that ChatGPT would not be a large threat to integrity in that case.

As expected, ChatGPT was quite proficient at basic definitional items like this one:

Assessment question/prompt: In What-to-Assess Considerations, what are Relevant Curricular Configurations?

ChatGPT’s (correct) response: In the context of educational assessment, "relevant curricular configurations" refers to the ways in which the curriculum is organized and structured in a particular school or program. This can include things like the specific courses that are offered, the content and themes that are covered in those courses, and the way that instruction is delivered.

When designing assessments, it is important to consider the relevant curricular configurations so that the assessment can accurately measure student learning and progress. For example, if a school has a strong emphasis on hands-on, project-based learning, then an assessment that primarily consists of multiple choice questions may not be the best fit. Instead, the assessment should include opportunities for students to demonstrate their knowledge and skills through hands-on activities or projects.

Similarly, if a school focuses on particular themes or content areas, the assessment should reflect that focus. For example, a school that emphasizes environmental science should include questions or tasks related to environmental science in their assessment.

Relevant curricular configurations also can be linked to regional or national frameworks, like Common Core standards, or other standards or frameworks like the Next Generation Science Standards or International Baccalaureate program, these would also need to be considered when designing assessments.

It is important to keep in mind that different schools and programs may have different curricular configurations, so assessments should be tailored to the specific context in which they will be used.

Course’s “expert” answer: This concerns the extent to which students' mastery of curricula and performance on classroom assessments will contribute to mastery of some external state, national, or professional achievement test.  In other words how related is it to relevant external standards.


In fact, we see that ChatGPT provides a much more complete and comprehensive response than the expert answer provided in the LMS (which was a rewording of an entry in the textbook and probably should be extended). A student who relied on ChatGPT to complete the item as instructed might learn more from ChatGPT than from the textbook, so long as the student paraphrased the information rather than simply copying and pasting it. Not surprisingly, Morgan readily located an answer to this item on the web as well.

In contrast, we had dozens of items like this one that ChatGPT failed to answer correctly:


Assessment question/prompt: If a classroom teacher actually computed a Kuder-Richardson coefficient for a final exam, this would be an example of an:


stability reliability coefficient

internal consistency reliability coefficient

content validity coefficient

construct validity coefficient


ChatGPT’s (incorrect) response: The dominant shortcoming of this item is that it confuses different types of reliability coefficients.


The Kuder-Richardson coefficient is a type of internal consistency reliability coefficient. It is used to measure the consistency of a test or questionnaire by comparing the responses of different items within the test.


Stability reliability coefficients, such as test-retest reliability coefficient or the equivalent-forms coefficient, are used to measure the consistency of test scores over time. They look at the consistency of scores obtained by a group of individuals when they take the same test on two different occasions.


Content validity coefficient and construct validity coefficient are not types of reliability coefficients but rather are forms of validity. Content validity coefficient evaluates if a test or measure is measuring the content it should be measuring, is it covering the whole domain, and if so, in an appropriate way. Construct validity coefficient evaluates if the test or measure is actually measuring what it is supposed to measure.


So to summarize, the statement is incorrect because it describes the Kuder-Richardson coefficient as a stability reliability coefficient, instead of an internal consistency reliability coefficient.


 

Course’s “expert” answer: The use of the article “an” renders only Choice C a grammatically correct option.

 

This response illustrates the title of Ian Bogost’s article in The Atlantic, “ChatGPT Is Dumber Than You Think.” The open question is whether students would recognize that ChatGPT simply misunderstood this relatively easy item.

The following self-assessment item illustrates a different point about the kinds of mistakes that ChatGPT is prone to make: 

 

Assessment question/prompt: What are the three primary deterrents to formative assessment according to Popham?

 

ChatGPT’s (incorrect) response: According to James Popham, a leading expert on educational assessment, there are three primary deterrents to formative assessment:

 

  1. Time constraints: One of the biggest challenges to formative assessment is finding the time to design, administer, and analyze assessments, as well as provide feedback to students. Teachers are often already stretched for time and may not see the value in taking the time for formative assessment.

  2. Limited assessment literacy: Many teachers lack the knowledge and skills needed to effectively implement formative assessment, which can lead to a lack of confidence in their ability to design and use assessments.

  3. Lack of infrastructure support: Another deterrent to formative assessment is the lack of infrastructure support from schools and districts, such as a lack of resources to support assessment, technology and professional development to ensure teachers are prepared to use formative assessment in the classroom.

 

These are the three primary deterrents to formative assessment that Popham commonly highlights in his works, addressing these deterrents will help to promote the use of formative assessment in the classroom, which can lead to improved student learning and achievement.

 

Course’s “expert” response: The three deterrents include:

  • Educators' misunderstanding about the nature of formative assessment: It is a test rather than a process; They only help teachers adjust their teaching (rather than helping students adjust their learning); Conflating commercial interim tests with classroom formative assessments

  • Educators' reluctance to change their ways 

  • Insensitivity of external achievement tests to the impact of formative assessment  (real or perceived).

 

This response nicely illustrates von Hippel’s point in Education Next: “The viral chatbot is often wrong, but never in doubt.” ChatGPT knows that Popham is an assessment expert, and it knows a good bit about formative assessment. But it does not know the answer to this specific question, even though the answer it gives is plausible in that those are indeed deterrents.

Finally, we uncovered a few questions and responses that illustrate perhaps the most worrisome concern that many have expressed about ChatGPT:

 

Assessment question/prompt: In your own words, write the four steps for building a learning progression:

ChatGPT’s (incorrect) response: Building a learning progression involves four steps:

  1. Identify key concepts and skills: The first step in building a learning progression is to identify the key concepts and skills that students should learn at each grade level or developmental stage. This involves reviewing the existing curriculum and research on student learning to identify the most important and relevant content.

  2. Sequence the concepts and skills: The second step is to sequence the concepts and skills in a logical and developmentally appropriate order. This involves creating a clear progression of learning that builds on what students already know and can do, and prepares them for more advanced concepts and skills.

  3. Create assessment tasks: The third step is to create assessment tasks that align with the key concepts and skills identified in step one, and that can be used to measure student progress along the learning progression. These tasks should be designed to be challenging, but achievable for students at each stage of the progression.

  4. Analyze and interpret student data: The final step is to use the assessment tasks from step three to gather data on student progress, and then analyze and interpret that data to identify areas of strength and weakness in student learning. This information can then be used to inform instruction, adjust the learning progression as needed, and target additional support to students who are struggling.

 

Overall, the goal of building a learning progression is to provide a clear roadmap for student learning and to ensure that students are making progress towards long-term academic goals by providing them with well-defined and sequenced learning objectives, formative assessment and well aligned instruction.

 

Course’s “expert” answer: The following are the four steps, written in Popham’s words. Check your own list against these steps to make sure yours has the same meaning.

 

Step 1: Acquire a thorough understanding of the target curricular aim.

Step 2: Identify all the requisite precursory subskills and bodies of enabling knowledge.

Step 3: Determine the measurability of each preliminarily identified building block.

Step 4: Arrange all the building blocks in an instructionally sensible sequence.


This response illustrates Roose’s point in a New York Times article about ChatGPT’s “brilliance and weirdness.” Roose observed that ChatGPT is “ominously good at answering the types of open-ended analytical questions that frequently appear on school assignments” but “prone to giving wrong answers.”

Conclusion

ChatGPT is certainly impressive. We concluded that a score of 53% shows that ChatGPT is a modest threat to the integrity of these particular summative assessments. As with the Learning and Cognition course, we concluded that it would be difficult for a student to use ChatGPT to complete and discuss the weekly e-portfolio assignments in this course without being detected and without learning a substantial amount of content. 

We further concluded that ChatGPT might supplement the textbook as a resource for completing the formative self-assessments but cannot effectively supplant it. We also concluded that a score of 31% means ChatGPT would be a very small threat to integrity even if these particular formative assessment items were used in a summative assessment.

This likely concludes this round of posts on ChatGPT. Our heads are spinning imagining the possible research designs in which we might use these two courses to explore some of the many suggestions for exploiting ChatGPT’s value when instructors and students work together in support of learning (such as Mollick and Mollick, 2022). We are both participating in an advanced topical seminar on theories of online learning this semester and look forward to tracking and systematically analyzing the explosion of research and media accounts.

