Empirische Arbeit
Psychologie in Erziehung und Unterricht, 2024, 71, 80-92
DOI 10.2378/peu2024.art08d
© Ernst Reinhardt Verlag

Comparing Generative AI and Expert Feedback to Students' Writing: Insights from Student Teachers

Thorben Jansen (1), Lars Höft (1), Luca Bahr (1), Johanna Fleckenstein (2), Jens Möller (3), Olaf Köller (1) & Jennifer Meyer (1)
(1) Leibniz-Institut für die Pädagogik der Naturwissenschaften und Mathematik, Kiel
(2) Institut für Erziehungswissenschaft, Universität Hildesheim
(3) Institut für Pädagogisch-Psychologische Lehr- und Lernforschung, Christian-Albrechts-Universität zu Kiel

Summary: Feedback is crucial for learning complex tasks like writing, yet its creation is time-consuming, often leaving students with too little feedback. Generative artificial intelligence, particularly Large Language Models (LLMs) such as ChatGPT 3.5-Turbo, has been discussed as a way to provide more feedback. However, there is not yet sufficient evidence that AI feedback meets the quality criteria for classroom use, and no study has investigated whether LLM-generated feedback already seems useful to its potential users. In our study, 89 student teachers evaluated the usefulness of feedback on students' argumentative writing, comparing LLM-generated with expert-generated feedback without receiving information about the feedback source. Participants rated LLM-generated feedback as useful for revision in 59 % of texts (compared to 88 % for expert feedback). In 23 % of judgments, participants preferred to give the LLM-generated feedback to students. Our discussion focuses on the conditions under which AI-generated feedback might be used effectively and appropriately in educational settings.
Keywords: Generative Artificial Intelligence, ChatGPT, Feedback, Writing

Feedback from Generative AI versus Expert Feedback on Argumentative Writing in Secondary School: Insights from Student Teachers
Zusammenfassung: Feedback is crucial for learning. However, students receive too little feedback, especially on complex written work, one reason being that creating feedback is time-consuming. Generative artificial intelligence, in particular Large Language Models (LLMs) such as ChatGPT-3.5-Turbo, can reduce this effort. There is, however, little evidence that LLM-generated feedback already meets the quality criteria for classroom use, and no study has surveyed potential users. In our study, 89 student teachers compared the usefulness of LLM-generated feedback with expert feedback for revising secondary-school argumentative writing tasks, without receiving any information about the source of the feedback. Participants rated the LLM-generated feedback as useful for students in 59 % of judgments (compared to 88 % for expert feedback). In 23 % of judgments, participants preferred the LLM-generated feedback in the direct comparison. The discussion focuses on the conditions under which AI-generated feedback can be used successfully in learning environments.
Schlüsselbegriffe: Generative artificial intelligence, ChatGPT, feedback, writing

Writing is a fundamental skill that enables individuals to participate in society (UNESCO, 2011) and to succeed in all school subjects (Graham, Kiuhara & MacKay, 2020).
For instance, argumentative writing is crucial for participating in societal discussions, such as those about climate change, and is thus an important goal of science education. Developing writing skills requires students to go through many cycles of writing and revising their texts (Flower & Hayes, 1981).

(Author note: This work was funded by the Bundesministerium für Bildung und Forschung (BMBF), grant 01JA23S03B.)

Text revision is crucial for learning because, while revising, students compare their intended text with their actual text version, identify discrepancies, and make necessary changes in their thinking and knowledge base (Bereiter & Scardamalia, 1987). Feedback can facilitate the revision process by showing students the gap between their current performance and the learning objectives (Biber, Nekrasova & Horn, 2011; Black & Wiliam, 1998). Hence, feedback is a key instructional practice for writing (Graham, Hebert & Harris, 2015; Skar, Graham & Rijlaarsdam, 2022) and, from the students' perspective, an integral part of supportive teaching behavior (Gencoglu, Helms-Lorenz, Maulana, Jansen & Gencoglu, 2023).

Creating feedback on argumentative texts requires clear assessment criteria and the ability to assess writing against these criteria (Hyland & Hyland, 2006). Especially for complex performances such as writing, this assessment process is particularly time-consuming, creating a significant challenge for teachers and a situation in which students receive individual feedback on their writing too infrequently (e.g., in under 20 % of classes in Applebee & Langer, 2011). This underscores the need for supportive mechanisms, such as digital tools, to help teachers efficiently generate and deliver high-quality feedback.

Advances in artificial intelligence make it possible to use machine learning for automated writing evaluation (AWE) of texts, based on a large set of training data for a specific writing task (Meyer, Jansen, Fleckenstein, Keller & Köller, 2020). While AWE can assess students' writing as accurately as teachers (Jansen, Vögelin, Machts, Keller, Köller & Möller, 2021; Zhai, Krajcik & Pellegrino, 2021), teachers still need support in generating feedback because the algorithms work only for a few, specifically trained writing tasks and are thus not well suited for continuous classroom use (Ley et al., 2023).

A potential solution to overcome the task specificity of current AWE feedback is the use of Large Language Models (LLMs), because LLMs are trained on vast amounts of data and thus may resemble teacher feedback in many contexts. The international discourse on the potentials and challenges of LLMs for feedback generation is gaining momentum, with discussions in prominent journals like Nature (van Dis et al., 2023) and by leading organizations such as UNESCO (2023). Despite these discussions and the growing presence of LLMs, the quality of the feedback they generate still needs to be improved, creating a need for evaluation studies. In the few existing studies, automated feedback has been compared with expert feedback (Zhai, Krajcik & Pellegrino, 2021) and evaluated by trained raters (e.g., Steiss et al., 2023). However, no study has asked teachers, as potential users, whether LLM-generated feedback can be useful for classroom use (Demszky et al., 2023).

The following study presents an experiment in which student teachers compared the quality of feedback generated for authentic argumentative writing about climate change from secondary schools in Germany.
For eight texts, participants evaluated the usefulness of two feedback messages, one generated by an LLM (ChatGPT 3.5 Turbo) and one provided by an expert teacher. Participants had no explicit information about the source of the feedback.

Feedback Definition

Feedback is information communicated to learners to modify their thinking or behavior, to close the gap between their actual performance and the target performance (Hattie & Timperley, 2007), and to improve learning (Shute, 2008). Meta-analyses have shown feedback's potential to improve writing skills. However, some studies also report negative effects on writing performance (Graham et al., 2023) and writing motivation (Cen & Zheng, 2023). Such negative effects may occur because feedback that addresses writing strategies a student does not know about, or strategies that are unimportant, may discourage engagement with the feedback (Grimes & Warschauer, 2010; Hattie & Timperley, 2007).

To ensure the effectiveness of feedback, Hattie and Timperley (2007) suggested that feedback should answer the questions "Where am I going?", "How am I going?", and "Where to next?". To address "Where am I going?", it is important to clarify learning goals and assessment criteria, thereby setting clear performance expectations (Reddy & Andrade, 2010). To answer "How am I going?", students need evaluations of their performance. Feedback is particularly effective when it evaluates and explains these evaluations, helping students understand and rectify errors or misconceptions (Rich et al., 2017). Lastly, the question "Where to next?" is answered by providing students with forward-looking guidance. This includes specific directions for future tasks and assignments, aiding their ongoing learning (Sadler, 2010).

To create effective feedback, it is essential to include this information and to make it engaging (Winstone et al., 2017). Student engagement is critical for feedback to support students' interest in and improvement of their writing skills, highlighting the need for students to perceive feedback as useful (Van der Kleij & Lipnevich, 2021). Perceived usefulness encompasses the comprehensibility, detail, and positioning of the provided feedback and should take individual needs into account (Henderson et al., 2021).

Generating Feedback with AI

Technology can support feedback generation through automated writing evaluation (AWE; Ngo et al., 2022). Current AWE systems are based on corpora of essays for single writing tasks, hand-scored by raters for specific elements related to writing, for example, holistic scores of writing quality (Shermis, 2014) or argumentative elements (Crossley, Baffour, Tian, Picou, Benner & Boser, 2022). These "training data", consisting of texts and human ratings, are then analyzed with machine learning algorithms, which recognize patterns in the training data and evaluate new texts on the same task based on these patterns (see Ercikan & McCaffrey, 2022, for a description of the strengths and weaknesses of this process). While the approach has been shown to imitate teacher judgments accurately (Horbach et al., 2022; Zhai et al., 2021) and to be a suitable foundation for feedback systems (Fleckenstein, Liebenow & Meyer, 2023; Jansen, Meyer, Fleckenstein, Horbach, Keller & Möller, 2024), the extensive requirement for training data raises costs and restricts teachers' flexibility in using automated feedback in the classroom.
Additionally, AWE systems match pre-defined feedback sets to texts rather than generating individualized feedback contextually, which can limit their relevance and effectiveness across varying educational scenarios. Recent developments in artificial intelligence (i.e., LLMs or foundation models; Yang et al., 2023) make it possible to overcome these challenges and evaluate texts without task-specific training data (Chen et al., 2023; Tate et al., 2023). LLMs are advanced artificial intelligence systems trained on vast amounts of text data using the transformer architecture (Devlin et al., 2018) to analyze language patterns in varying contexts, enabling them to evaluate and generate human-like natural language. This means that LLMs can evaluate texts and generate feedback for each student individually (Jia et al., 2022; Mizumoto & Eguchi, 2023). While the discussion has recognized the potential of LLMs for creating feedback on argumentative writing (Su, Lin & Lai, 2023) and in science education (Zhai & Nehm, 2023), only one peer-reviewed study has evaluated the quality of LLM-generated feedback (Tack & Piech, 2022). This paucity of evaluation studies creates a great need for research, especially since LLMs, unlike AWE systems, were created without a pedagogical goal, and it is unclear how useful they are for educational purposes.

Empirical Studies Evaluating LLM-generated Feedback

Empirical studies investigating the quality of LLM-generated feedback have compared LLM feedback with teacher feedback, despite the shortcomings of teacher judgments (e.g., Jansen, Vögelin, Machts, Keller & Möller, 2021; Jansen & Möller, 2022; Hennes et al., 2022; Möller et al., 2022). In the only peer-reviewed study, by Tack and Piech (2022), the pedagogical effectiveness of two LLM chatbots was compared with that of teachers in a secondary language classroom. Two trained raters scored LLM and teacher responses based on their likelihood of being teacher-generated, their understandability, and their helpfulness to students. One LLM (Blender) outperformed the teachers' responses, and the teachers' responses outperformed the other LLM (GPT-3).

The finding that LLM-generated feedback can be of comparable quality to teacher-generated feedback is further supported by a series of preprints. Steiss et al. (2023) asked trained raters to compare the quality of LLM-generated with expert teacher-generated feedback on students' writing in secondary schools. They used ChatGPT in version 3.5, the freely available and most widely used LLM at the time. The results showed that expert teachers outperformed the LLM on all criteria (e.g., accuracy, direction for improvement, supportive tone) other than the degree to which the feedback was connected to the evaluation criteria. Other studies underscored that ChatGPT can provide feedback surpassing human feedback in certain aspects, such as readability, positivity (Dai et al., 2023), and praise (Hirunyasiri, Thomas, Lin, Koedinger & Aleven, 2023). One preprint even showed that GPT-4, if prompted effectively, can be superior to teachers in overall quality (Jacobsen & Weber, 2023).

The studies mentioned above all trained human raters to compare the quality of LLM feedback with that of expert feedback. This approach to evaluating LLM-generated feedback has been called expert evaluation, and its strength is that it reliably measures feedback characteristics, such as positivity.
However, it does not capture the practical classroom applicability of the feedback. To address this, Demszky et al. (2023) suggested an alternative, the impact approach. This method focuses on gathering insights from potential users of LLM-generated feedback, aiming to capture the feedback's real-world utility and relevance in educational settings. Notably, no study has yet solicited judgments from teachers as the intended users of LLMs, revealing a gap in understanding teachers' perspectives on these emerging educational technologies (Kizilcec, 2023).

The Present Study

This empirical study is designed to provide initial evidence from the perspective of potential users regarding the usefulness of LLM-generated feedback for supporting secondary students' revisions of argumentative texts. The research question of our study is how preservice teachers assess the usefulness of LLM-generated feedback compared to expert-generated feedback when they have no information about the feedback source. To answer this question, we had preservice teachers read texts written by secondary students, displayed together with two feedback messages, one generated by an LLM (ChatGPT 3.5 Turbo) and one by an expert teacher (see Steiss et al., 2023, for a similar approach). The participants' task was to assess each feedback message's usefulness and to compare the two messages with respect to which of them they would rather give to the student for revising the text. To further investigate the participants' assessments, we debriefed them about the sources of the feedback after the assessment and asked them which feedback messages were LLM-generated. Given the mixed findings from previous studies, and considering that this is the first study focusing on teacher assessments of LLM feedback, our analysis is exploratory.

Method

This study involved 89 student teachers (60 % female, mean age = 22.46 years, SD = 3.16, average semester = 4.31, SD = 3.01) who assessed the quality of two feedback messages per text for eight texts. The research design was a single-group format, with the feedback source as the within-subjects independent variable. The study's design and its analysis were not preregistered. The Ethics and Data Protection Commission of the Leibniz Institute for Science and Mathematics Education (IPN) reviewed the study. The online supplementary materials are available at https://osf.io/tjbrw/

Variables

Independent Variable: Feedback Source

The independent variable had two levels (LLM-generated versus expert-generated feedback). The LLM-generated feedback was produced using OpenAI's ChatGPT (version 3.5 Turbo), selected for its widespread usage and relevance to contemporary classroom settings. The prompts for the LLM-generated feedback included task details, student materials, essays, and assessment criteria. Instructions regarding the feedback structure were also provided. Supplementary material A1 contains the detailed prompts for LLM feedback generation, and supplementary material A2 presents an example of LLM-generated feedback. For the expert-generated feedback, three experienced science teachers, each with a minimum of five years in the field, created individual responses. They adhered to a 150-word limit and followed a feedback structure similar to that used for the LLM. Supplementary material A2 includes an example of teacher-written feedback and the exact instructions the teachers received.
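To make the prompting setup described above more concrete, the following minimal Python sketch shows how feedback of this kind could be requested from ChatGPT 3.5 Turbo via the OpenAI API. The prompt wording, the function name, and the inclusion of the word limit in the prompt are illustrative assumptions; the prompts actually used in the study are documented in supplementary material A1.

```python
# Illustrative sketch of LLM feedback generation with the OpenAI Python SDK.
# The prompt text below is a placeholder, not the study's exact prompt (see
# supplementary material A1 for the prompts used in the study).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_feedback(task_description: str, assessment_criteria: str, essay: str) -> str:
    """Request feedback on one student essay from ChatGPT 3.5 Turbo."""
    prompt = (
        f"Task given to the students:\n{task_description}\n\n"
        f"Assessment criteria:\n{assessment_criteria}\n\n"
        f"Student essay:\n{essay}\n\n"
        "Write feedback for the student (max. 150 words) that names the learning goal, "
        "evaluates the essay against the criteria, and gives concrete suggestions for revision."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A setup like this returns one individually generated feedback message per essay in seconds, which is the property that motivates comparing it with the far more time-consuming expert feedback.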
Dependent Variable: Feedback Usefulness

We assessed the usefulness of the feedback as perceived by the student teachers. For the assessment, we used the feedback usefulness items of the Feedback Perceptions Questionnaire (FPQ; Strijbos, Pat-El & Narciss, 2021). The original items ("I consider this feedback as useful", "I consider this feedback as helpful", "This feedback provides a lot of support to me") were translated into German and adapted for third-party assessment ("This feedback is useful", "This feedback is helpful", "This feedback provides a lot of support") and rated on a 10-point Likert scale from 1 (not at all) to 10 (very).

Covariates

We included three covariates in our analyses. The first covariate was the text's position within the assessment sequence, ranging from the first to the eighth text. This factor was included to account for potential shifts in participants' perceptions of feedback usefulness over the course of the study. As participants progressed through the series of texts, their exposure to various feedback examples could influence their evaluations. This exposure might lead to a more informed understanding of the range and variability of feedback quality, potentially affecting their judgments. The second covariate was feedback position, indicating whether a feedback message was shown first or second for a given text. Third, we analyzed lexical variance, operationalized as the type-token ratio (Johansson, 2008), a measure used in linguistic analysis. This measure was included based on the premise that the perceived quality of feedback might be influenced by its linguistic characteristics (Jacobsen & Weber, 2023). It is calculated by dividing the number of unique words (types) by the total number of words (tokens) in a given text segment. We calculated the value using LATIC, the Linguistic Analyzer for Text and Item Characteristics (Cruz-Neri, Klückmann & Retelsdorf, 2022).

Material: Student Essays

Participants assessed eight argumentative essays written by 10th-grade students. The students had been asked to write a 20-minute essay discussing the pros and cons of building wind, solar, or hydroelectric power plants to achieve climate neutrality (see supplementary material A2 for the essays and A3 for the full task). The essays were chosen from a larger corpus of 56 texts and were selected to represent the distribution of text quality and length in the corpus, as these factors could influence assessments (Fleckenstein, Meyer, Jansen, Keller & Köller, 2020).

Procedure

The study was conducted as an online survey on the participants' personal computers, and participants needed, on average, 28 min (SD = 14.67 min) to complete it. The study was conducted in German, the participants' first language. The process began with an overview of the writing assignment given to the students, which provided context for the subsequent assessment of feedback quality. Throughout the experiment, participants had access to this contextual information. Every webpage revolved around one student essay (for a schematic sequence, see Figure 1), which was presented at the top of the page. Both types of feedback (LLM-generated and expert-generated) were shown on the same page as the student essay in randomized order. Participants viewed the student essay at the top of the screen, with the first feedback text presented underneath. Below the feedback text, the items for its assessment were located.
When participants scrolled down further, the student text was shown again, followed by the second feedback text and its assessment items. After completing the questionnaire for each feedback text, participants were asked to rank the feedback texts based on the question "Would you prefer to give feedback 1 or 2 to the student? Please rank them." After evaluating all feedback texts, participants were informed about the feedback sources (LLM-generated or expert-generated) and asked to identify which feedback was generated by an LLM. The survey concluded with a questionnaire on demographic data. Finally, participants were thanked for their involvement and offered the option to enter a lottery for one of four 25 € gift cards by providing their email address in a separate survey.

Figure 1. Study procedure. Schematic representation of the experimental sequence: (1) welcome and instruction (information about the study procedure and data privacy); (2) Likert-scale assessment of feedback usefulness (student text, LLM-generated feedback, expert-generated feedback); (3) assessment of the feedback source (debriefing about the feedback sources; judging which feedback was LLM-generated); (4) questionnaire (demographic data).

Statistical Analysis

To compare the two feedback sources, we fitted three linear mixed models (estimated using restricted maximum likelihood [REML] and the nloptwrap optimizer) predicting feedback usefulness from the feedback source (0 = expert-generated feedback). In the first model, we did not include covariates. In the second model, we included text position and feedback position. In the full model, we included lexical variance, text position, and feedback position (formula of the full model: feedback usefulness ~ source + text position + feedback position + lexical variance). The models included participant ID as a random intercept (formula: ~1 | id). 95 % confidence intervals (CIs) and p-values were computed using a Wald t-distribution approximation. Please note that the intercept, multiplied by ten, represents the mean rating for the expert-generated feedback because the dependent variable was measured on a scale from zero to ten.

Results

Analysis of the 714 ratings of the LLM-generated feedback showed that in 59 % of cases, participants agreed (i.e., assigned a score higher than five on the scale from 1 [not at all] to 10 [very]) that the feedback would be useful for students (see Figure 2), compared to 87 % for the expert feedback. When participants were asked to directly compare which of the two presented feedback messages they would rather give to the students, they preferred the LLM-generated feedback over the expert feedback in 22.7 % of cases. Conversely, in 77.3 % of cases, participants preferred the expert-generated feedback over the LLM-generated feedback.

The feedback source showed a negative effect in all models (Table 1), with LLM-generated feedback being rated, on average, 1.97 points lower than feedback generated by experts. This finding suggests that student teachers clearly preferred expert-generated feedback over LLM feedback in terms of usefulness. Additionally, the lexical variance of the feedback demonstrated a negative effect: student teachers perceived feedback with a higher proportion of unique words as less useful, suggesting that they preferred simpler, more straightforward feedback.
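For readers who want to reproduce this kind of analysis, the following Python sketch illustrates how the lexical-variance covariate and the full mixed model described under Statistical Analysis could be specified. It is a minimal equivalent under assumed column names, not the study's own analysis code: the reported models appear to have been fitted with an lme4-style REML setup, and the type-token ratios were computed with LATIC.

```python
# Illustrative sketch only: the study used LATIC for the type-token ratio and an
# lme4-style REML fit; this statsmodels version mirrors the full model
# "usefulness ~ source + text position + feedback position + lexical variance"
# with a random intercept per participant. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def type_token_ratio(text: str) -> float:
    """Lexical variance: number of unique words (types) divided by all words (tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical long-format data: one row per rating (participant x text x feedback source).
ratings = pd.read_csv("feedback_ratings.csv")
ratings["lexical_variance"] = ratings["feedback_text"].apply(type_token_ratio)

full_model = smf.mixedlm(
    "usefulness ~ source + text_position + feedback_position + lexical_variance",
    data=ratings,
    groups=ratings["id"],   # random intercept per participant (~1 | id)
)
result = full_model.fit(reml=True)  # restricted maximum likelihood, as reported
print(result.summary())
```

In a fit of this kind, the coefficient for the source term corresponds to the difference between LLM-generated and expert-generated feedback reported in Table 1.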
The lexical variance did vary between the feedback sources (LLM-generated feedback: mean type-token ratio = .71, SD = .03; expert-generated feedback: mean type-token ratio = .65, SD = .09).

After the assessment of all eight texts, we disclosed that one of the two feedback messages per text had been generated by an LLM. When participants were asked which of the two feedback messages was more likely generated by an LLM, they chose the correct one in 85 % of cases, which is above the guessing probability.

Figure 2. Participants' evaluations of the LLM-generated feedback: percentage distribution of ratings (1 = not at all to 10 = very) on the three usefulness items ("This feedback is helpful", "This feedback is useful", "This feedback provides a lot of support").

Discussion

Can any teacher who wants to help a student with writing use a freely available tool to generate, in seconds, feedback that rivals the quality of expert teacher feedback written in twenty minutes? The simple answer, for now, is no. Our findings demonstrate that expert-generated feedback was perceived as more useful than LLM-generated feedback. However, while this result may not surprise many today - and would have been a foregone conclusion in the not-so-distant past - it is remarkable that student teachers perceived LLM-generated feedback as useful to give to a student without even knowing that the feedback was written by an LLM. It is even more remarkable that in one-fifth of the ratings, participants preferred to give the LLM-generated feedback to the student in direct comparison with the expert-generated feedback. While this finding shows the superiority of expert-generated feedback in four-fifths of cases, it also underlines the potential of LLM-generated feedback for providing feedback on students' writing in the future. Even if LLM-generated feedback does not (yet) fulfill quality standards to the same degree as feedback from expert teachers, it might still be beneficial for student learning, especially compared to students receiving no feedback at all (which is often the case for writing assignments in real classrooms; see Applebee & Langer, 2011).

Our findings extend prior research by providing evidence on the quality of LLM-generated feedback obtained with a different methodological approach (i.e., impact rather than expert evaluation; Demszky et al., 2023). Our results are consistent with other studies evaluating LLM-generated feedback (cf. Dai et al., 2023; Hirunyasiri et al., 2023; Jacobsen & Weber, 2023; Steiss et al., 2023). These studies generally show that LLM-generated feedback, especially from ChatGPT 3.5, is usually considered of lower quality than expert-generated feedback, although it has been shown to be superior in certain aspects and specific instances. This pattern of findings highlights the need for future research to better understand the contexts and circumstances in which the quality of LLM-generated feedback is sufficient to improve student learning. More research is also needed to consider student outcomes in addition to student teacher and expert rater perceptions (e.g., improvements in text quality during revision with the feedback; Meyer et al., 2024).
Table 1. Results of the linear mixed models predicting feedback usefulness, indicating distinct effects of the feedback source and its lexical variance. Cells show estimates with 95 % confidence intervals and p-values.

| Predictors | Intercept-only model | Model 1 | Model 2 | Model 3 (full model) |
| (Intercept) | 6.70 [6.48, 6.92], p < 0.001 | 7.64 [7.40, 7.87], p < 0.001 | 7.77 [7.48, 8.07], p < 0.001 | 8.88 [7.99, 9.78], p < 0.001 |
| Feedback source [LLM] | - | -1.88 [-2.05, -1.72], p < 0.001 | -1.88 [-2.05, -1.72], p < 0.001 | -1.97 [-2.15, -1.79], p < 0.001 |
| Text position | - | - | -0.03 [-0.06, 0.01], p = 0.125 | -0.02 [-0.06, 0.01], p = 0.211 |
| Feedback position | - | - | -0.01 [-0.18, 0.15], p = 0.858 | 0.04 [-0.13, 0.21], p = 0.624 |
| Lexical variance | - | - | - | -1.64 [-2.89, -0.39], p = 0.010 |
| σ² | 3.45 | 2.51 | 2.50 | 2.50 |
| τ00 id | 0.90 | 0.96 | 0.96 | 0.98 |
| ICC | 0.21 | 0.28 | 0.28 | 0.28 |
| N id | 89 | 89 | 89 | 89 |
| Observations | 1424 | 1424 | 1424 | 1424 |
| Marginal R² / Conditional R² | 0.000 / 0.208 | 0.203 / 0.425 | 0.204 / 0.426 | 0.206 / 0.429 |

Note: σ² (sigma squared) is the variance of the residuals. τ00 id (tau squared) is the variance of the random intercepts across participants. ICC (intraclass correlation coefficient) measures the proportion of total variance attributable to differences between participants, with N id being the number of participants. Observations refers to the total number of data points. Marginal R² and conditional R² represent the variance explained by the fixed effects alone and by the entire model (fixed and random effects), respectively.

Notably, studies utilizing the more advanced GPT-4 model (Hirunyasiri et al., 2023; Jacobsen & Weber, 2023) suggest that the effectiveness of LLM-generated feedback could improve strongly with further advancements in LLM technology. This advancement in LLM capabilities provides a rationale for replicating our study using a more powerful version, such as GPT-4, to explore potential improvements in feedback quality. However, the rapid advancement also creates a central problem for research on LLMs, namely that results are only a snapshot in time and that sustainable knowledge can be created only under the assumption that LLMs improve in all functions.

In our analyses, we included covariates, such as the lexical variance of the feedback, to attribute the differences more clearly to the feedback content. Our findings revealed a significant negative association between lexical variance and perceived feedback usefulness. This pattern indicates that feedback expressed in more straightforward language is generally received more favorably. Since the feedback sources differed in lexical variance, lexical variance could be one reason for the differences between the sources. This insight opens avenues for enhancing LLM-generated feedback: its perceived usefulness could be improved by prompting LLMs to use more straightforward language, in line with Jacobsen and Weber's (2023) recommendations. This suggests that the efficacy of LLM feedback is not just a product of the technology itself but also of how it is directed and utilized, emphasizing the importance of effective prompting and a clear understanding of what constitutes good feedback.

Limitations

A limitation of our study was the distinctive language and tone of the LLM- and expert-generated feedback (see supplementary material A2), which challenged our ability to achieve complete "blinding" of the feedback sources.
This raises concerns about possible biases among teachers, either favoring or disfavoring AI-generated feedback, as highlighted by Farazouli, Cerratto-Pargman, Bolander-Laksov and McGrath (2023). Such biases could influence participants' assessments and inflate or reduce the observed differences. However, it is important to note that although most of the feedback was correctly identified as LLM-generated once participants were informed about the potential AI source, this does not necessarily imply that participants were consciously considering the source during their initial assessment of the feedback. We did not ask about the source during the assessment in order to avoid triggering stereotypes and to keep participants' focus on the usefulness of the feedback.

Moreover, our study's design focused on directly comparing LLM-generated and teacher-generated feedback without allowing participants to assess the quality of LLM-generated feedback independently. This comparative framework may have affected how participants rated each piece of feedback. Future research could benefit from a design in which participants evaluate LLM-generated feedback in isolation, for example by assessing a single LLM feedback message for each student essay without a direct teacher comparison. We chose our comparative design because the characteristics of the texts could influence the feedback's perceived usefulness.

Another notable limitation concerns the selection and working conditions of the expert feedback providers, which may have biased the comparison in favor of the expert-generated feedback. Our experts had the advantage of ample time and a limited number of essays to evaluate, a scenario not typically reflective of real classroom environments. In contrast, classroom teachers, despite having greater contextual knowledge about their students, often face significant time constraints, potentially affecting the quality of their feedback. Future studies should aim for a more realistic setting in which feedback is provided under conditions that closely resemble the time and workload pressures experienced by teachers in schools. This would yield a more accurate comparison between LLM-generated feedback and the feedback typically given in educational settings, offering more practically applicable insights.

The composition of our sample introduces further limitations. While we focused on student teachers assessing the quality of feedback, an essential perspective - that of the students receiving the feedback - was not included. Additionally, our participants are still developing their didactic skills and may lack the practical experience of routinely providing feedback and observing its impact on students' progress. Their perspectives, while valuable, may not fully capture the nuances of feedback effectiveness as perceived by experienced teachers. Future research would benefit from involving active teachers who engage with students daily to better understand the utility of feedback in real-world classroom environments.

A final consideration is the timing of our study, which was conducted between June and July 2023, in the inaugural year of ChatGPT's deployment. Hence, we face the limitations of an evolving technology: as with any emerging technology, the quality of user-generated instructions and the model's performance will likely improve over time.
Therefore, repeating this study with the latest AI iteration, such as GPT-4, could yield different insights and further inform the evolving landscape of AI in educational settings.

Practical Implications

The findings from our study suggest cautious considerations for educators exploring the use of ChatGPT and similar AI tools in teaching. One key implication is the difference in feedback quality between AI-generated and expert human feedback. While generative AI shows promise, its integration into educational practice should be approached with careful evaluation and supervision. Moreover, our research indicates that the effectiveness of feedback from ChatGPT can vary, often improving with refined prompting and attention to factors like lexical diversity. This aligns with Jacobsen and Weber's (2023) findings and suggests that the utility of the feedback educators obtain from AI can be enhanced through better instruction of the model.

Implications for Research

Considering the convergence of several key factors - our findings, which mirror the broader literature in indicating that LLM-generated feedback is not far behind expert-generated feedback; ChatGPT's expansive user base exceeding 100 million users per month, which has prompted UNESCO's (2023) call for in-depth research into LLM feedback; and the rapid technological advancement of LLMs - there is an evident and growing demand for extensive research in this field. This confluence of factors highlights the critical need to deeply understand and evaluate the impact and potential of LLM-generated feedback in educational settings. Given this context, we advocate applying the full spectrum of educational research methodologies to evaluate classroom use of LLM feedback. This should include rigorous approaches such as randomized controlled trials, examinations of individual differences, and analyses of potential moderators and mediators. Such comprehensive research is essential to understand the depth and breadth of LLM feedback's impact in educational settings and to mitigate its risks. Among the pressing questions to be addressed is the impact of AI-generated feedback on student behavior and writing quality. In particular, it is important to study how students interact with AI feedback: Does it encourage them to engage more deeply in the revision process than they typically would? And if so, does this increased engagement lead to improvements in the quality of their writing?

Conclusion

Returning to our initial question regarding the readiness of LLM-generated feedback for classroom use, our findings suggest a nuanced answer. While LLM-generated feedback was generally rated lower in usefulness than expert-generated feedback, the direct comparison indicated that the difference was smaller than one would need to conclude that it is not classroom-ready. This consideration becomes particularly relevant when acknowledging the significant time and effort required for teachers to generate feedback, a factor known to limit the assignment of writing tasks (Applebee & Langer, 2011). Moreover, our study could not fully explore specific strengths of LLM-generated feedback, such as its ability to provide multiple rounds of feedback during the writing process and to let students interact with the feedback. Accordingly, our results should be viewed as a pathway for further exploration rather than as a definitive conclusion.
Future studies, leveraging the full capabilities of the latest LLMs and without the advantage of extended time for the expert feedback, may well discover scenarios in which LLM-generated feedback surpasses that of teachers. Our findings therefore highlight a potential shift in the landscape of educational feedback, inviting more comprehensive research to understand and fully utilize the evolving capabilities of LLMs.

References

Applebee, A. N. & Langer, J. A. (2011). "EJ" Extra: A snapshot of writing instruction in middle schools and high schools. The English Journal, 100 (6), 14-27.
Bereiter, C. & Scardamalia, M. (1987). The psychology of written composition. Routledge.
Biber, D., Nekrasova, T. & Horn, B. (2011). The Effectiveness of Feedback for L1-English and L2-Writing Development: A Meta-Analysis. ETS Research Report Series, 2011 (1), 1-99. https://doi.org/10.1002/j.2333-8504.2011.tb02241.x
Black, P. & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5 (1), 7-74. https://doi.org/10.1080/0969595980050102
Cen, Y. & Zheng, Y. (2023). The motivational aspect of feedback: A meta-analysis on the effect of different feedback practices on L2 learners' writing motivation. Assessing Writing, 59, 100802. https://doi.org/10.1016/j.asw.2023.100802
Chen, Y., Wang, R., Jiang, H., Shi, S. & Xu, R. (2023). Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. arXiv preprint. arXiv:2304.00723
Crossley, S. A., Baffour, P., Tian, Y., Picou, A., Benner, M. & Boser, U. (2022). The persuasive essays for rating, selecting, and understanding argumentative and discourse elements (PERSUADE) corpus 1.0. Assessing Writing, 54, 100667. https://doi.org/10.1016/j.asw.2022.100667
Cruz-Neri, N. C., Klückmann, F. & Retelsdorf, J. (2022). LATIC - A linguistic analyzer for text and item characteristics. PLOS ONE, 17 (11), e0277250. https://doi.org/10.1371/journal.pone.0277250
Dai, W., Lin, J., Jin, H., Li, T., Tsai, Y. S., Gašević, D. & Chen, G. (2023, July). Can large language models provide feedback to students? A case study on ChatGPT. In 2023 IEEE International Conference on Advanced Learning Technologies (ICALT) (pp. 323-325). https://doi.org/10.35542/osf.io/hcgzj
Demszky, D., Yang, D., Yeager, D. S., Bryan, C. J., Clapper, M., Chandhok, S., … Pennebaker, J. W. (2023). Using large language models in psychology. Nature Reviews Psychology, 2 (11), 688-701. https://doi.org/10.1038/s44159-023-00241-5
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. (2018, October 11). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint. http://arxiv.org/pdf/1810.04805v2
Ercikan, K. & McCaffrey, D. F. (2022). Optimizing Implementation of Artificial-Intelligence-Based Automated Scoring: An Evidence Centered Design Approach for Designing Assessments for AI-based Scoring. Journal of Educational Measurement, 59 (3), 272-287. https://doi.org/10.1111/jedm.12332
Farazouli, A., Cerratto-Pargman, T., Bolander-Laksov, K. & McGrath, C. (2023). Hello GPT! Goodbye home examination? An exploratory study of AI chatbots impact on university teachers' assessment practices. Assessment & Evaluation in Higher Education, 1-13. https://doi.org/10.1080/02602938.2023.2241676
Fleckenstein, J., Meyer, J., Jansen, T., Keller, S. & Köller, O. (2020). Is a Long Essay Always a Good Essay? The Effect of Text Length on Writing Assessment. Frontiers in Psychology, 11, 562462. https://doi.org/10.3389/fpsyg.2020.562462
Fleckenstein, J., Liebenow, L. W. & Meyer, J. (2023). Automated feedback and writing: A multi-level meta-analysis of effects on students' performance. Frontiers in Artificial Intelligence, 6. https://doi.org/10.3389/frai.2023.1162454
Flower, L. & Hayes, J. R. (1981). A cognitive process theory of writing. College Composition and Communication, 32 (4), 365-387. https://doi.org/10.2307/356600
Gencoglu, B., Helms-Lorenz, M., Maulana, R., Jansen, E. P. & Gencoglu, O. (2023). Machine and expert judgments of student perceptions of teaching behavior in secondary education: Added value of topic modeling with big data. Computers & Education, 193, 104682. https://doi.org/10.1016/j.compedu.2022.104682
Graham, S., Hebert, M. & Harris, K. R. (2015). Formative assessment and writing: A meta-analysis. The Elementary School Journal, 115 (4), 523-547.
Graham, S., Kim, Y.-S., Cao, Y., Lee, J., Tate, T., Collins, P., Cho, M., Moon, Y., Chung, H. Q. & Olson, C. B. (2023). A meta-analysis of writing treatments for students in grades 6-12. Journal of Educational Psychology, 115 (7), 1004-1027. https://doi.org/10.1037/edu0000819
Graham, S., Kiuhara, S. A. & MacKay, M. (2020). The Effects of Writing on Learning in Science, Social Studies, and Mathematics: A Meta-Analysis. Review of Educational Research, 0034654320914744. https://doi.org/10.3102/0034654320914744
Grimes, D. & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. The Journal of Technology, Learning and Assessment, 8 (6).
Hattie, J. & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77 (1), 81-112. https://doi.org/10.3102/003465430298487
Henderson, M., Ryan, T., Boud, D., Dawson, P., Phillips, M., Molloy, E. & Mahoney, P. (2021). The usefulness of feedback. Active Learning in Higher Education, 22 (3), 229-243. https://doi.org/10.1177/1469787419872393
Hennes, A. K., Schmidt, B. M., Yanagida, T., Osipov, I., Rietz, C. & Schabmann, A. (2022). Meeting the Challenge of Assessing (Students') Text Quality: Are There any Experts Teachers Can Learn from or Do We Face a More Fundamental Problem? Psychological Test and Assessment Modeling, 64 (3), 272-303.
Hirunyasiri, D., Thomas, D. R., Lin, J., Koedinger, K. R. & Aleven, V. (2023). Comparative analysis of GPT-4 and human graders in evaluating praise given to students in synthetic dialogues. arXiv preprint. arXiv:2307.02018
Hyland, K. & Hyland, F. (2006). Feedback on second language students' writing. Language Teaching, 39 (2), 83-101.
Horbach, A., Laarmann-Quante, R., Liebenow, L., Jansen, T., Keller, S., Meyer, J., … & Fleckenstein, J. (2022). Bringing Automatic Scoring into the Classroom - Measuring the Impact of Automated Analytic Feedback on Student Writing Performance. In Swedish Language Technology Conference and NLP4CALL (pp. 72-83). https://doi.org/10.3384/ecp190008
Jacobsen, L. J. & Weber, K. E. (2023, September 29). The Promises and Pitfalls of ChatGPT as a Feedback Provider in Higher Education: An Exploratory Study of Prompt Engineering and the Quality of AI-Driven Feedback. OSF Preprints. https://doi.org/10.31219/osf.io/cr257
Jansen, T., Vögelin, C., Machts, N., Keller, S. & Möller, J. (2021). Don't Just Judge the Spelling! The Influence of Spelling on Assessing Second Language Student Essays. Frontline Learning Research. https://doi.org/10.14786/flr.v9i1.541
Jansen, T., Vögelin, C., Machts, N., Keller, S., Köller, O. & Möller, J. (2021). Judgment accuracy in experienced versus student teachers: Assessing essays in English as a foreign language. Teaching and Teacher Education, 97, 103216. https://doi.org/10.1016/j.tate.2020.103216
Jansen, T. & Möller, J. (2022). Teacher Judgments in School Exams: Influences of Students' Lower-Order-Thinking Skills on the Assessment of Students' Higher-Order-Thinking Skills. Teaching and Teacher Education, 103616. https://doi.org/10.1016/j.tate.2021.103616
Jansen, T., Meyer, J., Fleckenstein, J., Horbach, A., Keller, S. & Möller, J. (2024). Individualizing goal-setting interventions using automated writing evaluation to support secondary school students' text revisions. Learning and Instruction, 89, 101847. https://doi.org/10.1016/j.learninstruc.2023.101847
Jia, Q., Young, M., Xiao, Y., Cui, J., Liu, C., Rashid, P. & Gehringer, E. (2022). Insta-Reviewer: A data-driven approach for generating instant feedback on students' project reports. International Educational Data Mining Society. https://doi.org/10.5281/zenodo.6853099
Johansson, V. (2008). Lexical diversity and lexical density in speech and writing: A developmental perspective. Working Papers / Lund University, Department of Linguistics and Phonetics, 53, 61-79.
Kizilcec, R. F. (2023). To Advance AI Use in Education, Focus on Understanding Educators. International Journal of Artificial Intelligence in Education, 1-8. https://doi.org/10.1007/s40593-023-00351-4
Ley, T., Tammets, K., Pishtari, G., Chejara, P., Kasepalu, R., Khalil, M., Saar, M., Tuvi, I., Väljataga, T. & Wasson, B. (2023). Towards a partnership of teachers and intelligent learning technology: A systematic literature review of model-based learning analytics. Journal of Computer Assisted Learning, 39 (5), 1397-1417. https://doi.org/10.1111/jcal.12844
Li, T., Reigh, E., He, P. & Miller, E. A. (2023). Can we and should we use artificial intelligence for formative assessment in science? Journal of Research in Science Teaching, 60 (6), 1385-1389. https://doi.org/10.1002/tea.21867
Meyer, J., Jansen, T., Schiller, R., Liebenow, L. W., Steinbach, M., Horbach, A. & Fleckenstein, J. (2024). Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students' text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence, 6, 100199. https://doi.org/10.1016/j.caeai.2023.100199
Meyer, J., Jansen, T., Fleckenstein, J., Keller, S. & Köller, O. (2020). Machine Learning im Bildungskontext: Evidenz für die Genauigkeit der automatisierten Beurteilung von Essays im Fach Englisch [Machine learning in the educational context: Evidence of prediction accuracy considering essays in English as a foreign language]. Zeitschrift für Pädagogische Psychologie. https://doi.org/10.1024/1010-0652/a000296
Mizumoto, A. & Eguchi, M. (2023). Exploring the potential of using an AI language model for automated essay scoring. Research Methods in Applied Linguistics, 2 (2), 100050. https://doi.org/10.1016/j.rmal.2023.100050
Möller, J., Jansen, T., Fleckenstein, J., Machts, N., Meyer, J. & Reble, R. (2022). Judgment accuracy of German student texts: Do teacher experience and content knowledge matter? Teaching and Teacher Education, 119, 103879. https://doi.org/10.1016/j.tate.2022.103879
Ngo, T. T.-N., Chen, H. H.-J. & Lai, K. K.-W. (2022). The effectiveness of automated writing evaluation in EFL/ESL writing: A three-level meta-analysis. Interactive Learning Environments, 1-18. https://doi.org/10.1080/10494820.2022.2096642
Reddy, Y. M. & Andrade, H. (2010). A review of rubric use in higher education. Assessment & Evaluation in Higher Education, 35 (4), 435-448. https://doi.org/10.1080/02602930902862859
Rich, P. R., van Loon, M. H., Dunlosky, J. & Zaragoza, M. S. (2017). Belief in corrective feedback for common misconceptions: Implications for knowledge revision. Journal of Experimental Psychology: Learning, Memory, and Cognition, 43 (3), 492-501. https://doi.org/10.1037/xlm0000322
Sadler, D. R. (2010). Beyond feedback: Developing student capability in complex appraisal. Assessment & Evaluation in Higher Education, 35 (5), 535-550. https://doi.org/10.1080/02602930903541015
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53-76. https://doi.org/10.1016/j.asw.2013.04.001
Shute, V. J. (2008). Focus on Formative Feedback. Review of Educational Research, 78 (1), 153-189. https://doi.org/10.3102/0034654307313795
Skar, G. B., Graham, S. & Rijlaarsdam, G. (2022). Formative writing assessment for change - introduction to the special issue. Assessment in Education, 29 (2), 121-126. https://doi.org/10.1080/0969594X.2022.2089488
Su, Y., Lin, Y. & Lai, C. (2023). Collaborating with ChatGPT in argumentative writing classrooms. Assessing Writing, 57, 100752. https://doi.org/10.1016/j.asw.2023.100752
Steiss, J., Tate, T. P., Graham, S., Cruz, J., Hebert, M., Wang, J., … Warschauer, M. (2023, September 7). Comparing the Quality of Human and ChatGPT Feedback on Students' Writing. OSF Preprints. https://doi.org/10.35542/osf.io/ty3em
Strijbos, J., Pat-El, R. & Narciss, S. (2021). Structural validity and invariance of the Feedback Perceptions Questionnaire. Studies in Educational Evaluation, 68, 100980. https://doi.org/10.1016/j.stueduc.2021.100980
Tack, A. & Piech, C. (2022). The AI teacher test: Measuring the pedagogical ability of Blender and GPT-3 in educational dialogues. arXiv preprint. arXiv:2205.07540
Tate, T. P., Steiss, J., Bailey, D. H., Graham, S., Ritchie, D., Tseng, W., … Warschauer, M. (2023, December 5). Can AI Provide Useful Holistic Essay Scoring? OSF Preprints. https://doi.org/10.31219/osf.io/7xpre
UNESCO. (2011). UNESCO and education: Everyone has the right to education. https://unesdoc.unesco.org/ark:/48223/pf0000212715
UNESCO. (2023, September 8). Guidance for generative AI in education and research. https://www.unesco.org/en/articles/guidance-generative-ai-education-and-research
Van der Kleij, F. M. & Lipnevich, A. A. (2021). Student perceptions of assessment feedback: A critical scoping review and call for research. Educational Assessment, Evaluation and Accountability, 33, 345-373. https://doi.org/10.1007/s11092-020-09331-x
Van Dis, E. A., Bollen, J., Zuidema, W., Van Rooij, R. & Bockting, C. L. (2023). ChatGPT: Five priorities for research. Nature, 614 (7947), 224-226. https://doi.org/10.1038/d41586-023-00288-7
Winstone, N. E., Nash, R. A., Parker, M. & Rowntree, J. (2017). Supporting learners' agentic engagement with feedback: A systematic review and a taxonomy of recipience processes. Educational Psychologist, 52 (1), 17-37. https://doi.org/10.1080/00461520.2016.1207538
Yang, S., Nachum, O., Du, Y., Wei, J., Abbeel, P. & Schuurmans, D. (2023). Foundation Models for Decision Making: Problems, Methods, and Opportunities. arXiv preprint. https://doi.org/10.48550/arXiv.2303.04129
Zhai, X., Krajcik, J. & Pellegrino, J. W. (2021). On the validity of machine learning-based next generation science assessments: A validity inferential network. Journal of Science Education and Technology, 30, 298-312. https://doi.org/10.1007/s10956-020-09879-9
Zhai, X. & Nehm, R. H. (2023). AI and formative assessment: The train has left the station. Journal of Research in Science Teaching, 60 (6), 1390-1398. https://doi.org/10.1002/tea.21885

Dr. Thorben Jansen
Dr. Lars Höft
Luca Bahr
Prof. Dr. Olaf Köller
Dr. Jennifer Meyer
Leibniz-Institut für die Pädagogik der Naturwissenschaften und Mathematik
Olshausenstraße 62
24118 Kiel
E-Mail: tjansen@leibniz-ipn.de, jmeyer@leibniz-ipn.de, hoeft@leibniz-ipn.de, bahr@leibniz-ipn.de, koeller@leibniz-ipn.de

Prof. Dr. Johanna Fleckenstein
Institut für Erziehungswissenschaft, Universität Hildesheim
Universitätsplatz 1
31141 Hildesheim
E-Mail: fleckenstein@leibniz-ipn.de

Prof. Dr. Jens Möller
Institut für Pädagogisch-Psychologische Lehr- und Lernforschung, Christian-Albrechts-Universität zu Kiel
Olshausenstraße 75
24118 Kiel
E-Mail: jmoeller@ipl.uni-kiel.de