Volume 11, Issue 3 e3433
SYNTHESIS
Open Access

Exploring the impact of language models, such as ChatGPT, on student learning and assessment

Araz Zirar

Department of Management, Huddersfield Business School, University of Huddersfield, Huddersfield, UK

Correspondence

Araz Zirar, Department of Management, Huddersfield Business School, University of Huddersfield, Huddersfield, UK.

Email: [email protected]

First published: 30 October 2023

Abstract

Recent developments in language models, such as ChatGPT, have sparked debate. These tools can help, for example, dyslexic people to write formal emails from a prompt, and can be used by students to generate assessed work. Proponents argue that language models enhance the student experience and academic achievement. Those concerned argue that language models impede student learning and call for a cautious approach to their adoption. This paper aims to provide insights into the role of language models in reshaping student learning and assessment in higher education. For that purpose, it probes the impact of language models, specifically ChatGPT, on student learning and assessment. It also explores the implications of language models in higher education settings, focusing on their effects on pedagogy and evaluation. Using the Scopus database, a search protocol was employed to identify 25 articles based on relevant keywords and selection criteria. The developed themes suggest that language models may alter how students learn and are assessed. While language models can provide information for problem-solving and critical thinking, reliance on them without critical evaluation adversely impacts student learning. Language models can also generate teaching and assessment material and evaluate student responses, but they should only ‘play a specific and defined role’ rather than the whole role. Integration of language models in student learning and assessment is only helpful if students and educators play an active and effective role in checking the generated material's validity, reliability and accuracy. Propositions and potential research questions are included to encourage future research.

Context and implications

Rationale for this study

This synthesis explores the use of language models like ChatGPT by students and educators. While they benefit dyslexic students, concerns arise about over-reliance on these tools, given questionable accuracy and unreliable detection tools.

Why the new findings matter

The significance of the themes is in the dual role of language models. While they help students in problem-solving and creativity, over-reliance without critical evaluation impedes learning. Also, language models should have a limited role in generating teaching and assessment material.

Implications for practice

The findings have implications for practitioners, educators and the general public. Students need to be aware of the limitations of language models and critically evaluate their output. Educators should embrace and adapt these tools, ensuring they actively verify the reliability and accuracy of generated material. This necessitates a forward-looking approach rather than fear of, or resistance to, language models. The integration of language models in education should involve a proactive role for educators to ensure validity and reliability, enhancing students' learning and assessment experience.

INTRODUCTION

Student learning requires aligning teaching, learning activities and assessment tools (Biggs & Tang, 2011). In their academic journey, students must develop transferable skills, such as critical thinking abilities (Penkauskienė et al., 2019; Ryan & Aasetre, 2021). Assessments promote student learning by providing feedback and motivation and allowing students to reflect on their progress (Boud, 2000). For example, low-stake quizzing and peer and tutor feedback are practical approaches (Morris et al., 2021). On the other hand, students' learning experiences reflect the nature of teaching practices (Arnold, 2022). Such learning experiences within higher education can lead to deep revelations, challenging assumptions and reshaping students' worldviews (Hodge et al., 2011). Stakeholders, particularly educators, play an integral role in shaping the learning environment and, directly and indirectly, students' worldviews (Hodge et al., 2011). However, language models like ChatGPT are beginning to rewrite this relationship (Stokel-Walker, 2022).

The use of language models in education has gained increasing attention since the release of ChatGPT in November 2022 (Heilweil, 2023). However, how students can use language model tools is debated (Heilweil, 2023; Reich, 2022; Stokel-Walker, 2022), considering the reliability of the generated material (Heaven, 2022; Hunter, 2023; Waters, 2023), expectations in terms of student learning and assessment (Huang, 2023; Kovanovic, 2022; Weale, 2023) and students' perceptions of the use of such tools to fulfil assessment requirements (Heyward, 2022). Education is therefore an area where opinions on the use of language models are sharply divided (Khan et al., 2023; Tlili et al., 2023).

While the public is excited about using generative artificial intelligence (AI) in educational settings, various voices call for a cautious approach to adopting language models (Khan et al., 2023; Tlili et al., 2023). These voices highlight concerns around output quality, usefulness, privacy and ethics; they also point to issues of cheating, honesty, misleading output and manipulation (Gregorcic & Pendrill, 2023; Roy & Rambo-Hernandez, 2021; Tlili et al., 2023).

Recent literature reviews (Table 1) have explored various aspects of language models and education. Jeon et al. (2023) reviewed speech-recognition chatbots, their applications and their effectiveness in educational settings. Wu and Yu (2023) reviewed inconsistencies in findings regarding the impact of AI chatbots on students' learning through a meta-analysis. Yan et al. (2023) systematically scoped the literature on employing language models to automate educational tasks. Sharma et al. (2023) reviewed the use of ChatGPT within the plastic surgery field. Lin et al. (2023) reviewed chatbots' purpose, methods and the datasets employed in their development. Lo (2023) discussed the capabilities of ChatGPT in education. In their review, Alqahtani et al. (2023) introduced readers to AI, natural language processing (NLP) and large language model (LLM) concepts. These diverse reviews contribute to our understanding of the evolving field of language models and education.

TABLE 1. Previous reviews.
Reference Type Topic
Jeon et al. (2023) Systematic review Speech-recognition chatbots
Wu and Yu (2023) Meta-analysis Inconsistency in findings on how AI chatbots impact students' learning
Yan et al. (2023) Systematic scoping review Employing language models to automate educational tasks
Sharma et al. (2023) General review Use of ChatGPT in plastic surgery field
Lin et al. (2023) General review The purpose of chatbots and methods and dataset used to build them
Lo (2023) General review ChatGPT capabilities in education
Alqahtani et al. (2023) General review An introduction to AI, NLP and LLMs

This synthesis adds to these attempts and explores the potential of language models for enhancing the student experience and supporting their academic achievement in higher education. The research question that guides this synthesis is: How do language models, such as ChatGPT, impact student learning and assessment in higher education? This synthesis will probe the literature and identify themes and areas where further research is needed.

There are several sections to this synthesis. The subsequent section offers a concise discussion of concerns about students' use of ChatGPT. This is followed by the section describing the research method and the section that presents the analysis and synthesis. The subsequent sections cover discussion, contribution, limitations and future research direction.

STUDENTS' USE OF ChatGPT

Concerns about students' over-reliance on ChatGPT

Language models such as ChatGPT have triggered educational dilemmas, such as students using the tool to cheat (Rahm & Rahm-Skågeby, 2023). ChatGPT, like other large language models such as Bing AI and Bard, is defined as ‘an artificial intelligence (AI) powered chatbot that creates surprisingly intelligent-sounding [and human-like] text in response to user prompts, including homework assignments and exam-style questions’ (Stokel-Walker, 2022). As language models become prevalent, students will encounter them frequently, directly or through search engines like Bing, Google and Brave (Harrison & Ajjan, 2019).

Language models can generate assessed work in a human-like manner (Dwivedi et al., 2023; Farrokhnia et al., 2023). The generated text, however, is questionable in terms of the knowledge it offers and the accuracy of that knowledge unless it is checked (Harrison & Ajjan, 2019; Roy & Rambo-Hernandez, 2021). It is tempting to suggest that language models be used only to develop early drafts and that the output be thoroughly checked (Farrokhnia et al., 2023; Mohdzaini, 2023). Students may get into the habit of quickly generating assessed work rather than going through the learning process (Strzelecki, 2023). If students rely heavily on language models without verifying the information, they lose essential skills in critical thinking and analysis (Iskender, 2023). Feeding sensitive information to language models as prompts is also highly discouraged (Mohdzaini, 2023).

Concerns about students' cheating with ChatGPT

Students who outsource their assessed work to third parties commit contract cheating (Dawson et al., 2020). It is, therefore, unlikely that a student would think that the terms of reference between them and the academic institution they are enrolled in allow them to outsource their assessed work to third parties (Dawson et al., 2020). However, when using language models to generate assessed work, students might think such tools do not constitute contract cheating (Heyward, 2022). Students might perceive using such tools to fulfil assessment requirements as acceptable (Heyward, 2022). They might lack awareness of how much language model use constitutes contract cheating (Dawson et al., 2020; Heyward, 2022).

Tools have been developed to detect AI-generated text (Dawson et al., 2020). However, they are unreliable (Cotton et al., 2023; Heikkilä, 2023). While the output of language models is likely to be inaccurate and irrelevant when generating assessed work (Farrokhnia et al., 2023; Tlili et al., 2023), detection tools for identifying AI-generated work are also unreliable (Dawson et al., 2020; Heikkilä, 2023). Detectors seem to deter students from using language models to generate assessed work rather than actually detect such use (Fowler, 2023).

In the future, students and academic institutions will likely live with language models (Dwivedi et al., 2023; Farrokhnia et al., 2023). When students are exposed to language models, or expose themselves to them (Harrison & Ajjan, 2019), academic institutions need policies explaining the use of generative AI. Clarity around the use of language models helps students to adapt to such advancements. Students must understand that they must not outsource their work to these tools to gain an unfair advantage (Dawson et al., 2020). However, these policies need to keep pace with language model developments (Mohdzaini, 2023).

MATERIALS AND METHODS

This synthesis adopted a stream-based approach to generate themes. The study combined a systematic, protocol-driven approach to data source identification (Tranfield et al., 2003) with the opportunities that thematic analysis provides to develop analytically driven themes (Braun & Clarke, 2006). The synthesis followed the steps outlined by Tranfield et al. (2003) to plan the synthesis, conduct the study and report the findings. A synthesis protocol was developed to document data source identification. ‘Thematic analysis’ was employed, as described by Braun and Clarke (2006, 2022), to develop the themes. Other studies, such as Vrontis et al. (2021) and Zirar et al. (2023), have adopted this approach.

The article collection process consisted of formulating the synthesis question, determining the keywords, and identifying, collecting, analysing and synthesising the relevant literature (Denyer & Tranfield, 2009; Kitchenham et al., 2009). The research question – How do language models, such as ChatGPT, impact student learning and assessment in higher education? – guided article selection and analysis.

Scopus was searched to locate relevant articles with the following search string:

ABS ( ( "Higher Education" OR "Education" OR "Student Perspective" OR "Pupil Perspective" OR "Stakeholder Perspective" OR Student OR Pupil OR Educator OR "Education Technology" ) AND ( "Language Model*" OR "Large Language Model" OR "Machine Learning" OR "Generative AI" OR "OpenAI" OR "Conversational AI" OR "Pathways Language Model" OR PALM OR "Generalist Language Model" OR GLAM OR "Language Model for Dialogue Applications" OR LAMDA OR "Megatron-Turing NLG" OR "Generative Explicit Textured 3D" OR GET3D OR DreamFusion OR "BigScience Large Open-science Open-access Multilingual" OR BLOOM OR "Generative Pre-trained Transformer" OR GPT?2 OR GPT?3 OR ChatGPT OR InstructGPT OR GPT?3.5 OR "Pre-training of Deep Bidirectional Transformers" OR BERT OR GPT-NeoX-20b OR "Open Pretrained Transformer" OR OPT-175b OR Point-E OR "Robotics Transformer" OR RT-1 OR "Enhanced Language Representation with Informative Entities" OR ERNIE-Code OR VALL-E OR "GitHub Copilot" ) )

The initial search yielded 12,839 documents. The list was narrowed to 4452 documents using the database's inclusion criteria: English language, journal as a source, article as a document type, and the last 10 years (2014–2023). The list was narrowed further to 514 documents by limiting the returned articles to peer-reviewed journals (Kraus et al., 2022) using the Scopus field code EXACTSRCTITLE().
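For readers who wish to script a comparable search, the minimal sketch below shows how such a query might be issued against Elsevier's Scopus Search API. It is illustrative only: the API key is a placeholder, the query is a shortened version of the full search string above, and the result handling is an assumption rather than the exact protocol used in this synthesis.

# Illustrative sketch (not the exact protocol used here): issuing a Scopus
# abstract search via Elsevier's Scopus Search API.
import requests

API_KEY = "YOUR_ELSEVIER_API_KEY"  # placeholder; obtainable from dev.elsevier.com

# Shortened, illustrative version of the synthesis search string.
query = ('ABS(("Higher Education" OR Student OR Educator) '
         'AND ("Language Model*" OR "Generative AI" OR ChatGPT))')

params = {
    "query": query,
    "date": "2014-2023",  # last 10 years, as in the inclusion criteria
    "count": 25,          # results per page; a real run would paginate
    "apiKey": API_KEY,
}

response = requests.get(
    "https://api.elsevier.com/content/search/scopus",
    params=params,
    headers={"Accept": "application/json"},
)
response.raise_for_status()

for entry in response.json()["search-results"]["entry"]:
    print(entry.get("dc:title"), "|", entry.get("prism:publicationName"))

Reproducing the counts reported above would additionally require the full keyword list, the document-type and language filters, and the EXACTSRCTITLE() restriction.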

The AI-assisted ASReview tool was employed to screen the articles' abstracts (Ferdinands, 2021). A thorough reading of the articles' abstracts helped train the AI-assisted ASReview tool (van de Schoot et al., 2021) to narrow the list to 13 full-text articles (see Figure 1).

FIGURE 1. The process of selecting the articles.

ASReview is a machine learning-based screening system. Van de Schoot et al. (2021) explained how the tool works to screen abstracts. The tool accelerates, and improves the efficiency of, screening titles and abstracts when conducting a review (van de Schoot et al., 2021). It uses active learning to train a machine learning model that predicts relevance from texts using a limited number of labelled samples (van de Schoot et al., 2021). The ASReview project file can be provided for review; thus, the transparency of the screening process is also improved in terms of how many, and which, articles were screened (van de Schoot et al., 2021).
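To make the active-learning mechanism concrete, the minimal sketch below illustrates the general idea in Python. It is not ASReview's actual implementation: a classifier is retrained on the labelled records, and the reviewer is shown the record currently ranked most likely to be relevant. The toy abstracts and the keyword 'oracle' standing in for the human reviewer are assumptions for illustration.

# Conceptual sketch of active-learning abstract screening, in the spirit of
# ASReview but NOT its actual implementation. Assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the candidate abstracts (assumption).
abstracts = [
    "ChatGPT and assessment design in higher education.",
    "Deep learning for protein structure prediction.",
    "Large language models as tutors for undergraduate students.",
    "Graph algorithms for road network routing.",
    "Student perceptions of generative AI in coursework.",
    "Thermal properties of novel alloy composites.",
]

def reviewer_label(text: str) -> int:
    # Stand-in for the human reviewer's relevance judgement (assumption).
    keywords = ("chatgpt", "language model", "generative ai", "student")
    return int(any(k in text.lower() for k in keywords))

X = TfidfVectorizer().fit_transform(abstracts)
labels = {0: 1, 1: 0}  # seed: one relevant and one irrelevant record

while len(labels) < len(abstracts):
    seen = sorted(labels)
    model = LogisticRegression().fit(X[seen], [labels[i] for i in seen])
    pool = [i for i in range(len(abstracts)) if i not in labels]
    # Certainty-based querying: surface the record ranked most likely relevant.
    probs = model.predict_proba(X[pool])[:, 1]
    nxt = pool[max(range(len(pool)), key=lambda j: probs[j])]
    labels[nxt] = reviewer_label(abstracts[nxt])
    print(f"Screened record {nxt}: relevant={labels[nxt]}")

In ASReview itself, the reviewer's judgements replace the keyword oracle, and every decision is recorded in a project file that can be shared for transparency.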

The Scopus database provides an alert service for new articles. While conducting the synthesis, more articles from Scopus alerts were added (see Figure 1).

The objective of the analysis stage was to understand the selected list of articles. The articles were organised into themes in an Excel spreadsheet (Table 2). The analysis adopted ‘Reflexive Thematic Analysis’ (Braun & Clarke, 2019, 2022). ‘Reflexive’ in this context highlights the role of the researcher in generating the themes (Braun & Clarke, 2019, 2022). This analysis method relies on the researchers' interpretation and active engagement with the data considering the research question (Braun & Clarke, 2022).

TABLE 2. Key summary and which references contributed to which theme.
Theme 1: Language models can provide students with information for problem-solving, critical thinking and creativity, but relying solely on language models for information without critical evaluation of such information adversely impacts student learning.
Theme 2: Language models can play a role in generating teaching and assessment material and evaluating and scoring student responses to assessment material. However, the focus is on ‘playing a role’ rather than the whole role.
# References Research design
1 Chan and Hu (2023) 399 questionnaire responses, descriptive and thematic analysis
2 Dhruv et al. (2018) Review
3 Dwivedi et al. (2023) Expert opinion
4 Gonzalez et al. (2022) Two between-subjects experiments
5 Farrokhnia et al. (2023) SWOT analysis framework
6 Foroughi et al. (2023) 406 questionnaire responses, a hybrid approach including “partial least squares” (PLS) and “fuzzy-set qualitative comparative analysis” (fsQCA)
7 Harrison and Ajjan (2019) Review
8 Jeon and Lee (2023) An exploratory qualitative approach
9 Khademi (2023) Intraclass correlation (ICC)
10 Lancaster (2023) A small scale study
11 Li et al. (2023) Expert evaluation and Deep Learning classification methods
12 Liu et al. (2022) An additive piecewise-linear value function
13 Luckin and Cukurova (2019) Three case studies
14 Lyons (2017) A framework
15 Mele et al. (2022) A framework
16 Nielsen (2022) Review
17 Rahm & Rahm-Skågeby (2023) A heuristic lens
18 Rodriguez-Torrealba et al. (2022) Neural language models
19 Rosé et al. (2019) Review
20 Stojanov (2023) An autoethnographic study
21 Strzelecki (2023) 534 questionnaire responses, partial-least squares method of structural equation modelling
22 Tapalova and Zhiyenbayeva (2022) Experiment
23 Thontirawong and Chinchanachokchai (2021) A 2-day workshop
24 Zapalska et al. (2018) A stepwise approach
25 Cotton et al. (2023) Opinion

When reviewing the selected articles, the researcher allowed the six phases of thematic analysis (Braun & Clarke, 2006; Braun & Clarke, 2019) to guide theming and anchoring data to themes. First, the researcher immersed himself in the selected articles by re-reading them to familiarise himself with the data. During this phase, by being curious, the researcher made casual notes of interesting statements, such as ‘eight ways in which education technology can change how learning is facilitated and who will facilitate that learning’ (Lyons, 2017, p. 49), ‘focus on real problems and issues and provision of clear unambiguous instructions’ (Zapalska et al., 2018, p. 291), and so on. In the second phase, the researcher started anchoring statements from the selected articles to interesting codes such as ‘superior to humans’, ‘acceptable explanation’, ‘trialling prompts’, ‘scoring students’, and so on. In the third phase, constructing themes, themes were ‘built, molded, and given meaning’ (Braun & Clarke, 2019, p. 854), and they were the analytic output of immersion and engagement (Braun & Clarke, 2022). The researcher reviewed the candidate themes in the fourth and fifth phases and revised and defined them. The researcher also shared the themes in research circles to enhance reflexivity (Braun & Clarke, 2022; Dwivedi et al., 2019). After that, in the sixth phase, the researcher used an iterative approach to report the themes with selected references from the list of articles and relate the analysis to the research question (Braun & Clarke, 2022).

RESULTS

The analysis highlights two overarching themes, which are discussed below.

Theme 1: Reliance solely on language models for factual and relevant information adversely impacts student learning

This theme discusses how inaccurate information from language models likely adversely impacts student learning, skill development, and the ability to transfer learning to new situations (Ryan & Aasetre, 2021; Zapalska et al., 2018).

Language models can analyse rich data at a speed far superior to humans (Luckin & Cukurova, 2019). They can help students acquire knowledge and skills (Khan et al., 2023; Thontirawong & Chinchanachokchai, 2021). However, language models are tools, not persons (Techatassanasoontorn et al., 2023). They will likely provide an acceptable explanation and relevant information in response to a prompt (Farrokhnia et al., 2023; Huh, 2023). For instance, ChatGPT produced linguistically advanced responses to physics questions, but these were unreliable and contradictory (Gregorcic & Pendrill, 2023). It assumed a single truth without evidence (Cooper, 2023). While language models are likely to provide an acceptable explanation, they do not understand the meaning, intent and effect of the utterances in their explanation (Gond et al., 2016). Further, an acceptable explanation differs from a correct one (Huh, 2023).

For example, students in higher education need to develop critical thinking skills (Penkauskienė et al., 2019; Ryan & Aasetre, 2021). Critical thinking requires factual, relevant information and clear instructions (Penkauskienė et al., 2019; Zapalska et al., 2018). Factual and relevant information is also crucial in problem-solving (Desha et al., 2021). Unless their output is checked, language models do not provide factual and relevant information for that purpose (Dwivedi et al., 2023; Farrokhnia et al., 2023). Besides, a correct explanation and relevant information depend on one's ability to interpret the information and the explanation (Moser et al., 2022; Ryan-Mosley, 2023). The language models' ability to interpret is not yet comparable to students' (Farrokhnia et al., 2023; Huh, 2023). Language models such as ChatGPT are still incapable of separating fact from fiction and hallucination (Dwivedi et al., 2023). Therefore, students need to consider the limitations of the output of language models when acquiring knowledge from them (Farrokhnia et al., 2023). Sole reliance on language models such as ChatGPT lowers students' critical thinking skills (Iskender, 2023).

Although students need increased exposure to such tools (Harrison & Ajjan, 2019; Kong et al., 2022), this is unhelpful without students' increased awareness of the importance of factual and relevant information and the limitations of such tools (Desha et al., 2021). Without increasing students' awareness of the nature of such information, students are unlikely to hone their problem-solving skills or creative use of knowledge in new settings using language models (Desha et al., 2021; Ryan & Aasetre, 2021).

Therefore, while language models can help students approach problems with a critical and analytical mindset, the information for such a mindset might be inaccurate when generated by language models without careful research (Heaven, 2022; Stokel-Walker, 2022). ‘[Y]ou can't tell when it's wrong unless you already know the answer’ (Hsu & Thompson, 2023). The output of language models only improves the student experience and academic achievement if students question these outputs and do not merely treat them as factual information (Rosé et al., 2019).

On the other hand, academic institutions, rather than using tools to detect AI-generated text (Cotton et al., 2023), might increase student awareness of the limitations of such tools (Eaton & Mindzak, 2021). Rather than drafting everything from scratch, students' learning can focus on ‘original thoughts’ (Heikkilä, 2023), on creative use of knowledge in new settings (Ryan & Aasetre, 2021), and on editing and fact-checking the output of language models in their assessed work (Heikkilä, 2023; Mohdzaini, 2023). Proposition 1 summarises this theme. Table 3 proposes relevant research questions to investigate this proposition further.

Proposition 1. Language models can provide students with information for problem-solving, critical thinking and other skills, but relying solely on language models for information without critical evaluation of such information adversely impacts student learning.

TABLE 3. Research questions for future studies.
Theme Potential research questions
Reliance solely on language models for factual and relevant information adversely impacts student learning
  • How do students perceive the use of language models in generating assessed work?
  • What constitutes contract cheating when using language models in generating assessed work?
  • How do language models help improve student skills of critical thinking, creativity, fact-checking, etc.?
  • How do language models impact students' awareness of factual and relevant information in problem-solving?
  • How should students be trained to question and evaluate the information generated by language models for problem-solving?
  • How can language models augment students' critical and analytical thinking in problem-solving while ensuring the accuracy of the information generated?
  • How do language models such as ChatGPT affect students' motivation and engagement in learning?
  • How can academic institutions increase student awareness of the limitations of language models in generating assessed work?
Educators use language models to generate teaching and assessment material and evaluate student responses to assessment material
  • How accurately do language models generate teaching and assessment material?
  • How can educators evaluate the quality of generated assessment material using language models?
  • What are the benefits and challenges of using language models for formative and summative assessment?
  • How can language models incorporate feedback from assessors to support learners' development?
  • How can language models account for linguistic diversity, cultural backgrounds, and different needs among learners and educators?
  • How do language models compare to other technologies used in education to enhance student learning and academic achievement?
  • What are the best practices and challenges for integrating language models in student learning?
  • What are the ethical implications of using language models in enhancing student learning?

Theme 2: Educators use language models to generate teaching and assessment material and evaluate student responses to assessment material

Educators use their expertise and creativity to generate teaching and assessment material (Lucas, 2016; Wrigley, 2018). Various resources, particularly time and human resources, are used to ensure the validity and reliability of teaching and assessment material (Wrigley, 2018). Educators can use language models to generate teaching and assessment material, among other uses such as exercises and classroom discussions (Dhruv et al., 2018; Roy & Rambo-Hernandez, 2021). Language models such as ChatGPT and Perplexity AI can contribute to student learning and achievement through personalised learning strategies, teaching interventions and recommended resources (Cheng et al., 2021; Farrokhnia et al., 2023).

By trialling various prompts (prompt writing), language models, in particular ChatGPT, can generate teaching and assessment material in quantities far superior to human educators (Rodriguez-Torrealba et al., 2022; Roy & Rambo-Hernandez, 2021). However, given the nature of the output of language models, human educators will find the output a source of inspiration or suggestion rather than factual, valid and reliable material (Rosé et al., 2019; Roy & Rambo-Hernandez, 2021).

Language models can also play a role in evaluating and scoring student responses to assessment material (Cheng et al., 2021; Cotton et al., 2023; Lee, 2023). However, the emphasis is on ‘playing a role’ rather than the whole role (Cotton et al., 2023; Rahm & Rahm-Skågeby, 2023). Using such technology to evaluate student responses to assessment material is still experimental (Cheng et al., 2021). Unless the experimental nature of such tools is acknowledged as part of the process, educators should take a cautious approach (Cheng et al., 2021). Human educators must actively review and edit the teaching and assessment material generated by language models (Klein et al., 2022; Lee, 2023) and any personalised learning strategies, teaching interventions and resources recommended by such tools (Cheng et al., 2021; Tapalova & Zhiyenbayeva, 2022).

Language models, therefore, facilitate rich and fast feedback loops and add an iterative feedback approach to students' learning towards meeting learning objectives (Lyons, 2017; Roy & Rambo-Hernandez, 2021). Students can engage in a back-and-forth activity with language models to improve their work and to meet learning objectives before submitting work for summative assessment (Farrokhnia et al., 2023; Lyons, 2017). However, this argument depends on how familiar students are with language model tools and the nature of the output of such tools (Gonzalez et al., 2022).

Although language models produce assessment material in quantities incomparable to human abilities, such material mainly assesses knowledge retention (Rodriguez-Torrealba et al., 2022) rather than comprehension, interpretation, analysis, evaluation and synthesis (Huh, 2023; Moser et al., 2022). Therefore, unless language models can make sense of the material they generate, they are unlikely to generate assessment materials targeting, or to assess students on, higher-order thinking skills (Whittle et al., 2018). Further, while educators might adopt language models to provide timely feedback to students when resources are limited, they are advised to do so only when data to train such models is plentiful and for assessment purposes that carry no or low weighting (Lee, 2023; Roy & Rambo-Hernandez, 2021). Proposition 2 summarises this theme. Table 3 proposes relevant research questions to investigate this proposition further.

Proposition 2. Language models can play a role in generating teaching and assessment material and evaluating and scoring student responses to assessment material, but they should only ‘play a specific and defined role’ rather than the whole role.

DISCUSSION

The analysis suggests that developments in generative AI might give higher education institutions new purposes, such as training students to live with short-term employment patterns (Moscardini et al., 2022; Zirar et al., 2023). This also extends to educators' mindsets: rather than fighting generative AI tools, educators can adopt a mindset that explores how such tools augment their abilities to mentor and guide student learning (Moscardini et al., 2022; Zirar et al., 2023).

The analysis also argued that, without effective awareness of the quality of language models' output, students are unlikely to develop the critical thinking skills necessary to question such outputs (Kasneci et al., 2023). Therefore, as mentors, educators must instil in students that critical thinking is a social responsibility by which they can tell fact from fiction in the output of generative AI (Cooper, 2023; Penkauskienė et al., 2019). However, this is unlikely if educators are not themselves trained in critical thinking (Lorencová et al., 2019). The analysis also implies that students and educators need new skills, such as prompt writing alongside critical thinking skills (Kasneci et al., 2023).

The analysis also implied that educators and education institutions need to be given an active role in improving the output of language models. Although this argument suggests that educators will reflect on evidence-based teaching to improve such tools, evidence-based teaching also has limitations (Wrigley, 2018).

The analysis implies that educators have a moral mandate to encourage students to improve the accuracy of generative AI outputs (Kasneci et al., 2023). Although such mandates are argued in the literature for platforms such as Wikipedia (Masukume, 2020), generative AI will have a more far-reaching impact than Wikipedia. Thus, it is reasonable to argue that such mandates extend beyond Wikipedia, and educators have a moral mandate to encourage students to contribute in order to improve the quality of language model outputs.

The analysis, however, points to a contradiction in the literature. Contrary to some promising attempts (e.g., Ganguli et al., 2023) that argue language models can engage in self-correction, other recent studies (e.g., Gregorcic & Pendrill, 2023) do not confirm such claims. Gregorcic and Pendrill (2023) engaged in a Socratic dialogue with ChatGPT to fix the errors and contradictions in ChatGPT's responses to their question. However, these attempts had limited success. Rather than understanding why it was wrong, ChatGPT gave in by stating, ‘you are correct’ (Gregorcic & Pendrill, 2023). Although self-correction is necessary, it is only effective if language models understand why they were wrong before self-correcting. Therefore, the literature provides a contradictory account of the self-correction approach to fixing language models' output.

The output of generative AI will affect the quality of education, negatively or positively. Generative AI models can generate material in quantities beyond human abilities (Rodriguez-Torrealba et al., 2022). However, only humans can make sense of such material and judge whether it is fiction or factual and relevant to enhancing student learning and assessment (Bender & Koller, 2020; Kasneci et al., 2023). Beyond limited news coverage, research needs to explore how these generative AI tools influence the quality of education (Liu et al., 2022).

CONCLUSION

Contribution

The analysis extends Cognitive Load Theory (Sweller, 1988) to discussions about language models like ChatGPT and student learning and assessment in higher education. Language models like ChatGPT can reduce cognitive load for students by providing instant feedback and support (Hunter, 2023). However, the analysis argued that such feedback and support are only helpful if students are effectively aware of the reliability and factual accuracy issues of language models' output (Huh, 2023; Stokel-Walker, 2022). Students also need awareness of how much use of language models in their assessed work constitutes contract cheating (Mitchell, 2022). Therefore, this paper adds to the discussion about improving education and enriching educator resources (Liu et al., 2022).

As a practical contribution, the analysis implied that educators are going to live with large language models such as ChatGPT, Perplexity AI and others. Embracing and adapting to these tools is more forward-looking than fearing and resisting them (Stokel-Walker, 2022; Stokel-Walker & Van Noorden, 2023). These models can help educators generate teaching and assessment material (Khan et al., 2023; Tlili et al., 2023) and, as a result, provide students with engaging and compelling learning experiences and academic outcomes. However, such integration of language models in student learning and assessment is only helpful if educators play an active and effective role in checking the validity, reliability and accuracy of the generated material (Cooper, 2023; Rosé et al., 2019). Further, language models are yet to be an effective cheating tool or to act as a tutor (Gregorcic & Pendrill, 2023). Therefore, language models are unlikely to form the whole learning experience of students or to be accepted as the only medium of learning. There remain advocates of language models augmenting student classroom learning rather than replacing it: ‘To me, that's one of the great things about school and about learning: you're in a classroom with all of these other people who have different life experiences’ (Ryan-Mosley, 2023).

For policy, higher education must prepare for language models' potential consequences and opportunities (Cotton et al., 2023). Policy makers also must debate the impact of language models, such as ChatGPT, on higher education and society (Eaton & Mindzak, 2021; Larsen & Narayan, 2023).

Limitations and future research directions

The analysis is limited to journal articles, excluding books, book chapters and practitioner research. In addition, the search relied on the chosen keywords and was limited to the Scopus database, so relevant articles indexed elsewhere may have been missed.

Also, ASReview helped with screening the abstracts, but the tool is yet to be fine-tuned to provide ‘an accurate estimate of the system's error rate’ (van de Schoot et al., 2021, p. 131). Further, the tool employs machine learning to screen abstracts, yet ‘empirical benchmarks of actual performance’ are still to be produced for using the tool for purposes other than reviews (van de Schoot et al., 2021, p. 131). Moreover, the tool only assists with screening articles based on their abstracts in a synthesis process (van de Schoot et al., 2021).

Explicit consideration of the ‘sensitivity versus precision’ trade-off when selecting the relevant literature is valuable in conducting future literature reviews (Higgins et al., 2023). Researchers can aim for a balance between maximising sensitivity and maintaining relevance in the search strategy (Higgins et al., 2023).

When generating the themes, the researcher followed Braun and Clarke's guideline that ‘… the researcher needs to decide on and develop the particular themes that work best for their project—recognising that the aims and purpose of the analysis …’ (2022, p. 10). The researcher played an active role in theme generation and adopted the reflexive thematic analysis method (Braun & Clarke, 2022). Accordingly, the analysis and the themes need to be viewed through the 10-point core assumptions of reflexive thematic analysis (Braun & Clarke, 2022, pp. 8–9). With this approach, the researcher generated intriguing themes by blending data, subjectivity, theoretical and conceptual understanding, training and experience (Braun & Clarke, 2022). The researcher chose this form of analysis to engage with compelling, insightful, thoughtful, complex and deep meanings from the texts explored for this study (Braun & Clarke, 2022). This is also in line with the existing literature that suggests subjectivity is a resource for research rather than an issue to be managed (Gough & Madill, 2012, as cited in Braun & Clarke, 2022).

However, the researcher understands that this analysis method is inconsistent with the objectivity and reproducibility of themes (Braun & Clarke, 2022). Future studies can adopt analytic approaches of ‘Small q qualitative paradigms’ to control for subjectivity and the researchers' active role (Braun & Clarke, 2022; Kidder & Fine, 1987). Alternative analytical methods can advance our understanding of other themes and the recurrent nature of the themes identified in this study.

One theme that can be generated from news articles, but which the analysis could not substantiate from the returned articles, is that students and educators might develop bonds with generative AI tools similar to the bonds they develop with humans. Recent news articles imply that people might develop romantic relationships with generative AI tools (Tong, 2023), ask such tools to control their lives (Ramage, 2023), or consider such tools ‘best’ friends (Clarke, 2023). While students and educators make choices about whom to love, whom to build a relationship with, and whom to allow any control over their lives (Clarke, 2023; Ramage, 2023; Tong, 2023), replacing a human with a generative AI tool in this relationship is intriguing. Future research can explore educators' and students' metaphorical sense-making of language models (Techatassanasoontorn et al., 2023), and whether there are implications and whether such implications are concerning.

ACKNOWLEDGEMENTS

The author would like to thank Dr Tribi Budhathoki from Huddersfield Business School for his constructive comments and suggestions.

CONFLICT OF INTEREST STATEMENT

The author declares no conflicts of interest.

FUNDING INFORMATION

No funding was received for conducting this study.

ETHICS STATEMENT

The synthesis is based on previously published articles; thus no ethical approval was required.

DATA AVAILABILITY STATEMENT

The data (the scopus.csv, the asreview project file, the final list Excel file, and the archived record of manuscript development) are available on request from the author.