What Two Landmark Studies Tell Us About AI Coaching (and Why It Matters for Starlight)
- Adam Sturdee
- Jan 14
- 4 min read

Over the past year, two rigorous, peer-reviewed studies have quietly changed what we can say with confidence about AI coaching in schools.
Not opinion pieces. Not vendor case studies. Randomised controlled trials, run in real classrooms, with real teachers.
Together, they answer a question many school leaders and teachers are rightly asking:
Does AI-generated feedback actually improve teaching practice?
The answer, based on the best evidence we now have, is yes.
The Stanford Study: AI That Scales Expertise, Not Judgement
The first study, led by researchers at Stanford, examined a tool called Tutor CoPilot, designed to support live tutoring through real-time language suggestions.
You can read the full study here: https://doi.org/10.26300/81nh-8262
In a large randomised trial involving over seven hundred tutors and one thousand students, the researchers found that:
- Students supported by tutors using AI coaching were more likely to master lesson objectives
- The impact was strongest for less experienced practitioners
- The mechanism of improvement was not content knowledge but changes in language

Tutors using the tool:
- Asked more probing questions
- Used less generic praise
- Avoided giving answers too quickly
Crucially, the AI did not replace tutors or direct them. It offered multiple possible responses, preserved tutor agency, and made expert thinking visible in the moment.
The headline insight from Stanford is this:
AI coaching works best when it amplifies expert practice and leaves human judgement intact.
That principle sits at the heart of Starlight.
The Harvard Study: Automated Feedback Works in Real Classrooms
The second study, led by Harvard researchers in partnership with TeachFX, tested automated feedback with mathematics and science teachers in brick-and-mortar classrooms.
You can read the study here: https://doi.org/10.1016/j.compedu.2024.105183
This matters because, until now, most strong evidence for AI coaching came from online or tutoring contexts.
In this trial:
- Teachers received automated feedback focused on a single practice: focusing questions
- Teachers who received the feedback increased their use of those questions by around twenty percent
The feedback worked in this case because it was narrow.
It did not try to improve everything. It did not offer a general judgement on teaching quality. It focused on one high-leverage move and returned to it repeatedly.
The study also surfaced important realities:
- Teachers engaged most when feedback was easy to access
- Trust, accuracy, and time were bigger barriers than technology
- Teachers valued reflection more than prescription
The key insight from Harvard is this:
AI coaching changes practice when it is precise, repeated, and reflective.
What These Studies Mean for Starlight
Taken together, these studies strongly validate the direction Starlight is taking.
They tell us that effective AI coaching is:
- Language-focused, not metric-driven
- Specific, not general
- Repeated over time, not one-off
- Private, reflective, and teacher-owned
- Designed to prompt thinking, not enforce compliance
This is why Starlight is built around:
- Transcript-based analysis
- Coaching templates rather than universal scores
- “You said this → you could try this” feedback
- Optional use, not mandated evaluation
The research also explains why we are cautious about over-engineering dashboards, ratings, or surveillance-style metrics. None of the evidence suggests that those are what drive improvement. Language does.
What This Means for Leaders and Teachers Building Templates
One of the most important implications of both studies is for schools using Starlight’s template system.
The evidence is very clear:
The more focused the template, the greater the impact.
Effective templates tend to:
- Target one practice at a time
- Use concrete examples from the transcript
- Offer alternative phrasing rather than abstract advice
- Encourage experimentation over perfection
Templates that try to do everything tend to do very little.
This is why we actively encourage departments, coaches, and teachers to:
- Build and share their own focused templates
- Reuse the same template across multiple lessons
- Treat AI feedback as a mirror, not a verdict
The power of AI coaching does not come from novelty. It comes from noticing patterns, again and again.
Where We Are Taking This Next
These studies are not the end of the story. They are the foundation.
At Starlight, we are using this evidence to:
- Refine our questioning and feedback templates
- Improve how we surface missed opportunities in transcripts
- Make our emailed reports even clearer and lower-friction
- Support schools in building template cultures grounded in trust and professional reflection
Most importantly, they reinforce a simple belief:
Great coaching does not judge. It helps people see.
Spark Insight with Starlight today and build evidence-informed coaching that scales.
🎥 Subscribe to our channel here: https://www.youtube.com/@Star21-ai
🌐 Read more on our blog: www.coaching.software
💡 Explore the platform: www.starlightmentor.com
🐦 Follow us on X: @star21starlight
The Insight Engine is written by Adam Sturdee, co-founder of Starlight—the UK’s first AI-powered coaching platform—and Assistant Headteacher at St Augustine’s Catholic College. This blog is part of a wider mission to support educators through meaningful reflection, not performance metrics. It documents the journey of building Starlight from the ground up, and explores how AI, when shaped with care, can reduce workload, surface insight, and help teachers think more deeply about their practice. Rooted in the belief that growth should be private, professional, and purposeful, The Insight Engine offers ideas and stories that put insight—not judgment—at the centre of development.
