From AI Tools to AI Teammates: Educational Applications in the Gen AI Era

April 15, 2026

By Dr. Zheng Yuan

The conversation around AI in education has shifted. It is no longer simply about whether AI should be used in classrooms, but about how and, more importantly, about who remains in control. This post summarises the key ideas from a recent talk by Dr. Zheng Yuan, exploring this shift.

A Brief History of ML in Education

Figure: A timeline of the rise of ML, from early rule-based systems in the 1960s to LLMs in 2024. Source: Yuan et al. (2025), NLP and Generative AI for Language Learning and Assessment.

The use of machine learning in education is not new. It stretches back decades, from early rule-based chatbots in the 1960s through to statistical methods like SVMs in the 1990s, deep learning architectures such as GANs and VAEs in the 2010s, and the transformer revolution that gave us BERT (2018) and ChatGPT (2022).

One finding worth highlighting: for well-defined educational tasks such as knowledge tracing, error prediction, and performance forecasting, traditional supervised models trained on high-quality annotated data still outperform most LLM-based approaches. Fine-tuned LLMs are closing the gap, but specialised models remain the state of the art. This is a useful reminder that bigger is not always better, and that the right tool depends on the task.

Where Do LLMs Fit in Education?

LLM-based educational applications can be broadly divided into academic research and commercial tools. Within academic research, the target user matters: are we building for teachers, or for students?

Figure: A flowchart leading from teachers and students, through 'AI Technologies', to application areas such as 'Tutoring systems', 'Question Construction' and 'Error Correction'. Source: Yuan et al. (2025), NLP and Generative AI for Language Learning and Assessment.

On the teacher side, LLMs are being used for learning content generation, lesson planning, and automatic assessment and scoring. On the student side, the focus is on AI tutoring systems - tools that can support learners when a human teacher is not available around the clock.

Another useful lens is the distinction between assistive and assessment technologies. Assistive tools help students learn: they adjust reading difficulty based on proficiency, detect and correct errors, and provide personalised feedback and suggestions. Assessment tools, by contrast, sit within the pipeline used by organisations like ETS, Pearson, and Cambridge Assessment. These tools automatically generate questions, score responses, and deliver structured feedback.

Figure: Four quadrants labelled 'Assistive Technologies', 'Assessment Technologies', 'New Research Directions', and 'Ongoing Challenges'. Source: Vajjala et al. (2025), Opportunities and Challenges of LLMs in Education: An NLP Perspective.

The Rise of AI Agents in Education

The most significant shift in recent work is the move from standalone LLM prompting or fine-tuning toward agentic systems. Unlike a simple chatbot, an AI agent combines an LLM with memory, planning capabilities, and access to external tools. This enables systems that can pursue goals and solve complex, multi-step problems.
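To make the distinction concrete, here is a minimal sketch of an agent that wraps a model call with memory, a planner, and tools. It is illustrative only: `stub_llm`, the `Agent` class, the keyword-based planner, and the `calculator` tool are all hypothetical names invented for this example, not part of any system discussed in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"[reply based on: {prompt}]"


@dataclass
class Agent:
    llm: Callable[[str], str]
    tools: Dict[str, Callable[[str], str]]
    memory: List[str] = field(default_factory=list)

    def plan(self, goal: str) -> List[str]:
        # Toy planner (assumption): use every tool whose name is mentioned in the goal.
        return [name for name in self.tools if name in goal.lower()]

    def run(self, goal: str) -> str:
        self.memory.append(f"goal: {goal}")
        for tool_name in self.plan(goal):
            result = self.tools[tool_name](goal)
            self.memory.append(f"{tool_name}: {result}")
        # The final answer is conditioned on everything remembered so far.
        reply = self.llm(" | ".join(self.memory))
        self.memory.append(f"answer: {reply}")
        return reply


agent = Agent(llm=stub_llm, tools={"calculator": lambda goal: "42"})
reply = agent.run("Use the calculator to check 6 * 7")
```

A plain chatbot would correspond to calling `stub_llm` directly; the agent differs in that tool results and earlier steps are written to memory and fed back into the model call.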

In educational settings, agents can take on multiple roles: pedagogical agents acting as tutors, student agents acting as collaborators or simulators, and multi-agent systems where several AI entities and humans work together as a team.

Figure: A diagram showing how teachers and students can interact with a multi-agent system. The system comprises 'Memory', 'Planning', 'Personalisation' and 'Explainability'. Source: Chu et al. (2025), LLM Agents for Education: Advances and Applications.

In a multi-agent classroom setup, the architecture typically includes several specialised components. Teachers interact with agents for classroom simulation, feedback generation, and curriculum design. Students interact with agents for adaptive learning, knowledge tracing, and error correction. Behind the scenes, dedicated agents handle:

Memory - tracking all conversations and learning histories
Planning - coordinating which agents activate and when
Personalisation - tailoring content and difficulty to each learner

The system can also connect to external tools such as search engines and subject-specific databases, so that agents serving a maths class draw on different resources than those supporting a coding or physics course.
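The division of labour described above can be sketched roughly as follows. This is a toy under stated assumptions: the `ROUTES` keyword table, the `SUBJECT_TOOLS` mapping, the `Classroom` class, and the "fewer than three requests" difficulty rule are all hypothetical placeholders, and a real system would use model-driven planning rather than keyword matching.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical routing table: keyword in a request -> agent to activate.
ROUTES: Dict[str, Dict[str, str]] = {
    "teacher": {
        "simulate": "classroom_simulation",
        "feedback": "feedback_generation",
        "curriculum": "curriculum_design",
    },
    "student": {
        "practice": "adaptive_learning",
        "progress": "knowledge_tracing",
        "mistake": "error_correction",
    },
}

# Hypothetical subject-specific tool sets.
SUBJECT_TOOLS: Dict[str, List[str]] = {
    "maths": ["search_engine", "symbolic_solver"],
    "coding": ["search_engine", "code_runner"],
}


@dataclass
class Classroom:
    subject: str
    histories: Dict[str, List[str]] = field(default_factory=dict)

    def remember(self, user: str, request: str) -> None:
        # Memory: keep every request per user.
        self.histories.setdefault(user, []).append(request)

    def plan(self, role: str, request: str) -> List[str]:
        # Planning: activate agents whose keyword appears in the request.
        text = request.lower()
        return [agent for keyword, agent in ROUTES[role].items() if keyword in text]

    def personalise(self, user: str) -> str:
        # Personalisation: placeholder rule based on how much history exists.
        return "introductory" if len(self.histories.get(user, [])) < 3 else "standard"

    def handle(self, role: str, user: str, request: str) -> dict:
        self.remember(user, request)
        return {
            "agents": self.plan(role, request),
            "difficulty": self.personalise(user),
            "tools": SUBJECT_TOOLS.get(self.subject, []),
        }


room = Classroom(subject="maths")
result = room.handle("student", "amy", "I made a mistake. Can I practice more?")
```

Swapping `subject="maths"` for `subject="coding"` changes only the tool set, which mirrors the idea that the same agent roles draw on different resources per course.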

Two cross-cutting concerns run through all of this: personalisation and explainability. Education is a high-stakes domain. Teachers, students, and regulators all need to understand why a system made a particular recommendation or prediction.

How Multi-Agent Systems Work in Practice

Deploying multi-agent systems in real classrooms raises immediate practical questions. In one experimental setup, we worked with classroom dialogue data: one teacher, multiple students, all interacting in a group session. AI student agents were introduced with prescribed personas: one that always asks challenging questions, one that follows the crowd, and one that asks simple clarifying questions. We then observed whether these additional agents improved group learning outcomes.

In a separate study focused on teamwork and creativity, groups of four were configured with varying compositions: four human students, three humans and one AI agent, two humans and two AI agents, and all-AI groups. Each group was given one hour to complete a product design task and produce a business model canvas, which was then rated by both domain experts and LLM-based judges. The aim was not just to measure output quality, but to explore what the optimal team composition might be and whether AI agents with diverse cultural personas could bring perspectives that a homogeneous human team might lack.

A key design challenge is agent initiative. Purely passive agents, i.e. ones that only respond when prompted, are not really agents at all. But fully autonomous agents can create problems: in preliminary experiments, agents sometimes tried to speak simultaneously, talking over each other. In another configuration with only AI agents, nobody spoke at all. Solutions being explored include introducing a facilitator agent to moderate discussion, and using heuristic rules, such as ending every contribution with a direct question to a named participant, to keep conversation flowing.
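The two failure modes (everyone speaking at once, nobody speaking) and the two remedies can be illustrated with a small turn-taking sketch. It is not the mechanism used in the experiments: the `facilitate` and `close_with_question` functions, the "first bid wins" rule, and the round-robin nomination are assumptions chosen to keep the example short.

```python
from typing import List, Optional


def facilitate(bids: List[str], participants: List[str], last_speaker: Optional[str]) -> str:
    """Facilitator: pick exactly one speaker for the next turn."""
    # Collision: several agents bid at once, so keep the first bidder
    # who did not just speak and make the others wait.
    candidates = [name for name in bids if name != last_speaker]
    if candidates:
        return candidates[0]
    # Silence: nobody (new) bid, so nominate the next participant in order.
    if last_speaker in participants:
        index = participants.index(last_speaker)
        return participants[(index + 1) % len(participants)]
    return participants[0]


def close_with_question(text: str, speaker: str, participants: List[str]) -> str:
    """Heuristic rule: end each contribution with a question to a named participant."""
    index = participants.index(speaker)
    addressee = participants[(index + 1) % len(participants)]
    return f"{text} {addressee}, what do you think?"


participants = ["Ana", "Ben", "Chen"]
speaker = facilitate(["Ben", "Chen"], participants, last_speaker="Ana")
turn = close_with_question("I would start with the user.", speaker, participants)
```

The facilitator guarantees one speaker per turn, while the closing question gives the next participant an explicit reason to respond, which addresses the all-silent case from a different direction.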

Human-AI Collaboration: AI as Teammate, Not Just Tool

There is a broader shift underway from human-computer interaction (HCI) to human-AI collaboration (HAIC). The distinction matters. In an HCI paradigm, AI is an external tool you call on when needed. In HAIC, AI is a participant from the start. It is embedded in the workflow, co-creating alongside the human user.

This framing brings its own challenges, particularly around over-reliance on AI and the preservation of student and teacher agency. Early findings from longitudinal observations suggest a concerning pattern: when students begin relying heavily on generative AI, their outputs tend to converge, losing the diversity and creativity that characterised their earlier, unassisted work. Stronger students tend to benefit from AI assistance, producing even better results. But weaker students can actually perform worse than they did without AI, potentially because they lack the ability to evaluate whether the AI’s output is trustworthy. This asymmetry is one of the more important open problems in this field.

GenQuest: Co-Creative Storytelling for Language Learning

Figure: A flowchart of the GenQuest workflow, showing the player input, its transformation into information stored in memory, and the generation of the plot. Source: Wang et al. (2025), GenQuest: An LLM-based Text Adventure Game for Language Learners.

GenQuest is an example of human-LLM co-creation designed specifically for language learning. The system creates personalised, choose-your-own-adventure-style stories where the learner actively shapes the narrative.

The architecture uses multiple specialised LLMs working together. A proficiency LLM adapts the language difficulty to each learner’s level. An outline LLM generates the story structure, including milestones, decision points, and possible endings. At each decision point, the student chooses a direction, and a plot LLM generates the next section accordingly. A memory agent tracks everything, and a summary LLM keeps the narrative coherent across branches.
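One way to picture how these components hand off to each other is the sketch below. It is not the GenQuest implementation: the `StorySession` class, the `canned` stub models, the prompt wording, the order in which the proficiency step is applied, and the "B1" level label are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

LLM = Callable[[str], str]


def canned(reply: str) -> LLM:
    """Hypothetical stub: ignores the prompt and returns a fixed reply."""
    return lambda prompt: reply


@dataclass
class StorySession:
    level: str  # learner proficiency label (assumed scheme, e.g. "B1")
    outline_llm: LLM
    plot_llm: LLM
    proficiency_llm: LLM
    summary_llm: LLM
    memory: List[str] = field(default_factory=list)
    summary: str = ""

    def start(self, theme: str) -> str:
        # Outline step: story structure, decision points, possible endings.
        outline = self.outline_llm(f"Outline a branching story about {theme}.")
        self.memory.append(f"outline: {outline}")
        return outline

    def step(self, choice: str) -> str:
        # Plot step: continue from the running summary and the learner's choice.
        draft = self.plot_llm(f"Story so far: {self.summary}\nLearner chose: {choice}")
        # Proficiency step: adapt the draft to the learner's level.
        passage = self.proficiency_llm(f"Rewrite for level {self.level}: {draft}")
        # Memory and summary steps: record everything, then refresh the summary.
        self.memory.extend([f"choice: {choice}", f"passage: {passage}"])
        self.summary = self.summary_llm("\n".join(self.memory))
        return passage


session = StorySession(
    level="B1",
    outline_llm=canned("three milestones, two endings"),
    plot_llm=canned("draft passage"),
    proficiency_llm=canned("levelled passage"),
    summary_llm=canned("short summary"),
)
session.start("a lost library")
passage = session.step("open the door")
```

Feeding the plot step a summary rather than the full history is what keeps prompts short across many branches, at the cost of depending on the summary's quality.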

The result is that each learner ends up with a unique, personalised story they have co-authored with the AI. A pilot study with Chinese university students learning English as a second language showed clear vocabulary gains and positive perceptions of both usefulness and ease of use. Students could highlight unfamiliar words at any point and were also asked to produce summaries, combining receptive and productive language skills in a single experience.

Tutor CoPilot: AI as a Teaching Teammate

Figure: A depiction of the Tutor CoPilot system, from front-end user interaction to internal generation and privacy protections. Source: Wang et al. (2024), Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise.

Tutor CoPilot takes a different approach, focusing on supporting novice tutors rather than replacing them. The system provides real-time, expert-like guidance during live tutoring sessions.

When a student submits a response, the system uses what it calls a “bridge” network to generate structured feedback: first identifying whether a mistake exists, then locating where the mistake is, and finally providing hints to help the student improve, rather than simply offering a generic one-sentence correction.
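The three-stage shape of that feedback (detect, locate, hint) can be shown with a deliberately simple rule-based stand-in. This swaps the system's model for exact arithmetic checking, so it illustrates only the structure of the output; the `Step` format, the `Feedback` class, and the `review` function are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import List, Optional, Tuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# One line of student working: (left operand, operator, right operand, student's result).
Step = Tuple[int, str, int, int]


@dataclass
class Feedback:
    has_mistake: bool           # stage 1: does a mistake exist?
    step_index: Optional[int]   # stage 2: where is it (0-based)?
    hint: Optional[str]         # stage 3: a nudge, not the answer


def review(steps: List[Step]) -> Feedback:
    for index, (left, op, right, claimed) in enumerate(steps):
        if OPS[op](left, right) != claimed:
            hint = f"Look again at step {index + 1}: what is {left} {op} {right}?"
            return Feedback(True, index, hint)
    return Feedback(False, None, None)


# Student computes 3 * 4 + 5 but slips on the addition.
feedback = review([(3, "*", 4, 12), (12, "+", 5, 18)])
```

Note that the hint points at the faulty step without stating the correct value, which is the difference between guidance and a one-sentence correction.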

The system was evaluated through a randomised controlled trial involving approximately 800 maths tutors and 1,800 K-12 students from historically underserved communities. Students whose tutors were assigned to use the system were 4 percentage points more likely to master the topics being taught. This is a meaningful effect size, and it demonstrates the potential of AI as a teammate for educators rather than a replacement.

Ethics, Regulation, and Responsibility

Any serious discussion of AI in education must confront the ethical dimension. AI systems can introduce bias, lack transparency, and raise privacy concerns - all of which directly affect fairness and trust in educational outcomes.

It is not enough to acknowledge that a model or dataset is biased and treat that as a technical inevitability. The people who build these systems bear responsibility for understanding where bias is likely to arise and designing against it, whether that means addressing algorithmic bias or curating training data more carefully.

The regulatory landscape is also evolving. The EU AI Act classifies educational AI as high-risk, which imposes specific requirements around transparency, safety, and ongoing oversight. These are not simply recommendations; they are legal obligations for anyone deploying AI systems with real students and teachers in real classrooms.

Current Trends and Emerging Directions

Several research directions are converging to shape the next generation of educational AI:

Interpretability and explainability. Traditional feature-based models are inherently more explainable than deep learning or LLM-based approaches. But the need for explainability does not disappear just because the models become more complex - if anything, it becomes more urgent. In a high-stakes domain like education, being able to trace why a model made a particular prediction is essential.
Personalisation beyond content. Moving from “one size fits all” towards genuinely adaptive systems means considering more than just difficulty level. Multilingual support and cross-cultural sensitivity matter: students from different cultural backgrounds respond differently to direct versus indirect feedback, and an AI system that ignores this risks alienating the learners it is designed to help.
Multimodal interaction. Research suggests that learning outcomes improve when students engage through multiple modalities such as text, visual, and audio. Building systems that support this kind of rich interaction is a growing area of focus.
Collective intelligence. Most current work focuses on personalised, one-on-one learning. But in practice, people learn and work in groups. How can AI systems support and enhance collective intelligence across a classroom or cohort, rather than optimising only for individual performance?
Alignment with pedagogical values. Standard ML benchmarks and evaluation metrics are often poorly aligned with educational goals. A model that achieves high accuracy on a public benchmark may still fail to support meaningful learning. Bridging this gap between technical metrics and pedagogical aims is a critical open problem.
Synthetic data for fine-tuning. High-quality annotated educational data is expensive to collect and concentrated in well-resourced institutions. Using LLMs to generate synthetic training data is a promising approach for under-resourced settings, but it carries risks. You do not want AI tutors trained exclusively on conversations with AI students, creating a feedback loop disconnected from real learner behaviour.
Efficiency and edge computing. LLM-based systems are expensive to train and run. For mobile deployment or use in areas with limited computing infrastructure, efficiency is a prerequisite, not a luxury.
Specialised models. For well-defined tasks, specialised models trained on task-specific data consistently outperform general-purpose LLMs.

The Road Ahead

The trajectory is clear: AI in education is moving from tool to teammate. But this shift demands more than technical progress. It requires careful attention to trust, transparency, agency, and equity. The systems that will matter most are not necessarily the most powerful - they are the ones that keep humans in control, adapt to the diversity of real learners, and earn the trust of the educators who use them.