New AI tutor achieves 0.71-1.30 SD effect size in Dartmouth course [pdf]

(science.uu.nl)

103 points | by jonahbard 3 hours ago

21 comments

radioactivist 2 hours ago
I am somewhat skeptical of this.
First, the headline result of 0.7*sigma improvement is the output of a statistical model based on the lessons/reviews students engaged with and their mid-term score, with that shift being for "full engagement". Based on their tables, something like ~16 students (11% of the group) actually reached that level of engagement.
Second, trying to incorporate past grades into their modelling is not a substitute for a randomized trial.
Third, the headline engagement number of 90% is for "engaging with the platform, via Module Review or Lesson Quizzes, at least once". I don't know why much of that couldn't just be attributed to novelty. Or even partly a professor with all sorts of enthusiasm for the platform.
Fourth, the "full dosage" effectiveness is measured based on the final exam scores. Were these exam questions produced independently from the "Phosphor" materials? (e.g. by blinding?) Were they checked for direct overlap with those materials? The 0.7 sigma shift is 3 points on a 24 point exam; if even a few of the questions on that exam were very similar to those materials it could account for almost all of it. This is not clear to me from the manuscript.
If this was the case, then it's less a question of "is AI effective" than of "did the students look at the materials". You could still argue that the AI platform got them to read, but that is a somewhat different statement than saying the AI helped them learn.
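A quick check of the arithmetic behind the fourth point, as a rough sketch: if a 0.7 SD shift corresponds to 3 points on a 24-point exam (the figures quoted above), the two numbers together imply an exam SD of a little over 4 points. The variable names are illustrative, and nothing here is re-derived from the paper itself.

```python
# Figures quoted in the comment above; not re-derived from the paper.
effect_sd = 0.7        # headline shift, in standard deviations
shift_points = 3.0     # the same shift expressed in exam points
exam_max = 24.0        # total points on the exam

implied_sd_points = shift_points / effect_sd   # exam SD implied by the two figures
shift_fraction = shift_points / exam_max       # shift as a fraction of the whole exam

print(f"implied exam SD: {implied_sd_points:.2f} points")
print(f"shift as share of exam: {shift_fraction:.1%}")
```

With an SD that small in absolute terms, two or three near-duplicate questions would be enough to produce a shift of this size, which is why the overlap question matters.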
- yorwba 1 hour ago
  Worse, because students complained about the difficulty of the AI-graded quizzes, they switch to multiple-choice questions only, which increases engagement, but after analyzing the exam results they determine that multiple-choice questions don't seem to help and add AI-graded questions back, after which engagement drops again.
  That means their experiment design is partially caused by their results instead of the other way around, which is a bad situation to be in. Their statistical analysis is completely inadequate for dealing with this.
  And the change in engagement suggests that there's strong selection involved. Their attempt to use midterm scores to control for selection effects is unconvincing. Why not control for whether students used the platform more when there were only multiple-choice questions? Those are the ones who self-selected out of using the AI grader.
- p1necone 16 minutes ago
  It feels to me that the venn diagram between "students that fully engaged with the material" and "students that learned well from the material" is going to basically be a circle for any teaching method.
- computerdork 1 hour ago
  This is a helpful explanation - am not a researcher so I have little idea how to run an unbiased, meaningful experiment (except that it takes a lot of effort and thought to run one). Useful analysis
KaiserPro 2 hours ago
I'm not an expert, but how much of this is down to novelty, ie https://en.wikipedia.org/wiki/Hawthorne_effect ?
(ie changing the environment can lead to short-term productivity gains because either participants are aware they are being watched, or it breaks up the monotony and makes people work a bit harder.)
baq 3 hours ago
I'm on record saying that a system like this with some extra hardware (i.e. a way for the LLM to have live understanding of the student's paper notebook or handout which are being written in with a plain old pencil) combines the best of both worlds - individual tutoring with approximately zero screen time which scales linearly with the number of students. The role of the teacher or professor then becomes a manager of the student - agentic tutor pairs, a referee when the student and model disagree, etc. and most importantly still being the human teacher you can just talk to in the human education process.
I'm convinced this is the future of education - models are there, we need the classroom tech to catch up. The alternative is obvious and quantified in the paper - students just use models to do their work for them and learn nothing.
- chasd00 2 hours ago
  I work in consulting and one of my projects is piloting an AI use case for a department within one of my clients. On a discovery call someone casually brought up that they bought a reMarkable notebook themselves and were wondering if it could be integrated into the use case. It really got me thinking.
  Maybe reMarkable or something like it could help bridge a student's writing with an LLM without having to fall back to a laptop or ipad.
  https://remarkable.com/
  - fn-mote 6 minutes ago
    > Maybe reMarkable or something like it could help bridge a student's writing with an LLM without having to fall back to a laptop or ipad
    What does “bridge a student’s writing” mean?? If this is a real argument it needs to be clearer.
    What’s the functional difference between a Remarkable and an iPad? The former is less responsive, costs less, and has better battery life, right? I really don’t see how that’s significant to any kind of development of anything.
    Are you talking about running a local model??
- terribleperson 2 hours ago
  A 'smart pen' that records the student's writing in some way, maybe? My first thought was a tablet that boots straight into a writing software but students should not be subjected to any amount of latency in their writing.
  Practically, I think if you want the AI system to have a live view of what the student's doing you're going to have to replace one of either the tablet or the writing instrument. A wearable camera could work as well but there are issues with that.
  - universa1 2 hours ago
    there was a pen that used special paper to directly record your notes (15-20 years ago)... should be possible nowadays to directly transfer this to a connected device and have it feed it to an llm.
    and after looking it up, it appears they are still available: https://www.livescribe.com/landingpage/ls3_onenote/
    - AbsurdCensor 35 minutes ago
      They still exist, along with a bunch of different ones. I don't think it's going to be all that different compared to just writing notes in an e-book or on an iPad though. And for many people who learn in other ways, the iPad or similar is superior because you can copy in pictures, make diagrams, and use other ways of learning all in one spot. For me, honestly, something like OneNote (or especially Obsidian) is awesome, because it's super easy to tie AI into Markdown text.
    - terribleperson 2 hours ago
      That was likely what I was thinking of - I have vague memories of seeing an ad for this in Popular Mechanics or Popular Science in the 2000s.
- tomaskafka 1 hour ago
  Today I saw a demo of Remarkable turned into Voldemort's diary from Harry Potter - you write to it, and it writes back, in handwriting.
- Buttons840 2 hours ago
  I would add that somewhere in there should be a spaced repetition algorithm.
  Spaced repetition is very effective, but it's really really clunky to use. My unpopular opinion is that we all have Stockholm syndrome when it comes to creating "cards", and people talk about how valuable creating cards is; but I think it sucks, it takes a lot of time.
  If AI is already teaching me math (let's say), it would be nice to tell the AI/app "quiz me on this periodically", and then the AI makes up a fresh polynomial to factor (or whatever) and presents that to you according to a spaced repetition algorithm.
  Behind the scenes, the AI should have access to what has happened the last several times a specific topic has been quizzed, so the AI can watch to see that certain mistakes are resolved, and the AI might also know better how to correct the user if it has context about previous quizzes of that topic.
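  A minimal sketch of what that could look like, assuming a deliberately simple scheduling rule (double the interval after a correct answer, reset to one day after a miss). This is not SM-2 or Anki's scheduler, and `Topic`, `fresh_quadratic`, and `due_topics` are hypothetical names made up for illustration. The `history` list stands in for the per-topic context an AI tutor could read before the next quiz.

```python
import random
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Topic:
    """One quizzable topic with a simple doubling review schedule."""
    name: str
    interval_days: int = 1
    due: date = field(default_factory=date.today)
    history: list = field(default_factory=list)  # (day, correct) pairs kept as context

    def record(self, correct: bool, today: date) -> None:
        # Double the gap after a correct answer; reset to one day after a miss.
        self.interval_days = self.interval_days * 2 if correct else 1
        self.due = today + timedelta(days=self.interval_days)
        self.history.append((today, correct))


def fresh_quadratic(rng: random.Random) -> tuple[str, tuple[int, int]]:
    """Make up a new factorable quadratic x^2 + bx + c from two integer roots."""
    r1, r2 = rng.randint(-9, 9), rng.randint(-9, 9)
    b, c = -(r1 + r2), r1 * r2
    return f"Factor x^2 {b:+d}x {c:+d}", (r1, r2)


def due_topics(topics: list[Topic], today: date) -> list[Topic]:
    """Topics whose next review date has arrived."""
    return [t for t in topics if t.due <= today]


today = date(2025, 1, 1)
topic = Topic("factoring quadratics", due=today)
question, roots = fresh_quadratic(random.Random(0))
topic.record(correct=True, today=today)  # next review is two days out
```

  Because each question is generated from fresh roots, there is no card to author; only the topic and its review history persist between sessions.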
  - tired-turtle 2 hours ago
    But the very act of making and organizing your card deck is part of the SRS! It “sucks” because you get no dopamine hit from a fresh deck, as the reward system is not yet in place.
    - Buttons840 2 hours ago
      Again, I really think this is a viewpoint we've talked ourselves into to help us feel better about how cumbersome creating the cards is.
      I'm willing to grant that there is some value in choosing what to put in the cards, but most of the awkwardness around making cards is UI related. Nobody creates cards on their phone, or while they're walking (AI could do both of these) - people create cards sitting at their computer (like cavemen!) usually clicking through a clunky UI and managing thousands of cards with thousands of clicks. That sucks, and people probably won't realize it sucks until something better comes along.
      - fragmede 1 hour ago
        > Nobody creates cards on their phone, or while they're walking
        Wait, when are you doing it then? No wonder you think it sucks! Adopt some modern tools, yo. Use Anki, or vibe code your own app.
rictic 2 hours ago
Yes! Very exciting to see this.
Bloom's Two Sigma Opportunity suggests that there's another SD improvement available: https://en.wikipedia.org/wiki/Bloom%27s_2_sigma_problem
- Herring 2 hours ago
  The story around Bloom's two sigma is a bit complex https://nintil.com/bloom-sigma/
  - SgtBastard 12 minutes ago
    Thank you for this and for parent’s comment - I know what rabbit hole I’m going down today.
rusbus 3 hours ago
This is exciting because the effect size is so large. But as the authors acknowledged, selection bias is nearly impossible to control for in this non-randomized study:
> and lacks randomized controls. Self-selection is the central threat: students who complete more quizzes may be more motivated or higher-performing generally
But this is still a strong result. I'm excited to see more in this space.
- rahimnathwani 2 hours ago
  They tried to control for this. It's described in the first paragraph of section 4.
- syou1024 2 hours ago
  [flagged]
or_am_i 1 hour ago
The article explicitly calls out selection bias (this is entirely based on 90% that opted into using the tutor, there was no control group), I wish the headline did as well. "Engaged students score 0.71 - 1.30 SD better in tests" sounds like a much simpler explanation.
- cgearhart 1 hour ago
  I used to TA a graduate level CS math class at Georgia Tech. We regularly saw that the students who self-organized study groups did dramatically better in the course than average. One semester they told us to put everyone in study groups to see if it helped. The effect disappeared. Turns out that it was the self-selection of the most engaged students into a small group that mattered, not the study group itself.
  - sebastiennight 33 minutes ago
    So there might be zero effect?
    If it's purely a correlation, then maybe those students would be more successful than average even without the study group. They're already the most motivated kids. Maybe they just do "motivated kid stuff" and would still outperform.
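    A toy simulation of that scenario, as a sketch under made-up assumptions: a hidden "motivation" trait drives both who opts into a study group and how well they score, while group membership itself has no effect on the score at all. The coefficients and the opt-in threshold are invented for illustration and are not taken from the course or the paper.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

joined, others = [], []
for _ in range(10_000):
    motivation = rng.gauss(0, 1)                     # hidden trait, never observed directly
    joins_group = motivation + rng.gauss(0, 1) > 1   # motivated students opt in more often
    score = 70 + 8 * motivation + rng.gauss(0, 5)    # joins_group plays no part in the score
    (joined if joins_group else others).append(score)

gap = mean(joined) - mean(others)
print(f"joined: {mean(joined):.1f}  others: {mean(others):.1f}  gap: {gap:.1f}")
```

    The joiners come out clearly ahead even though joining does nothing here, which is the pattern that vanished once everyone was assigned to a group.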
- tfirst 44 minutes ago
  "Full dosage of the Phosphor material is associated with an increase in final exam performance."
  This sentence is accurate, but inevitably leads to the confusion you see in these comments.
mmarian 2 hours ago
Conflicted about this study. On one hand, LLMs have been incredible for my personal learnings of new concepts.
On the other, I'm sceptical that it'll have "strong benefits" at scale; I'd be more in favor if the wording was "some"/"moderate". I reckon self-selection plays a huge part, as mentioned in the "Limitations" section of the paper.
I'd also caution against attaching the tool to grading. That means students have to put more effort into the course, which increases the chances that they will use LLMs to save time rather than make the investment.
- zerobees 2 hours ago
  > LLMs have been incredible for my personal learnings of new concepts.
  Mind if I ask what you learned and how you're using it?
  The reason I'm asking is that I repeatedly felt excitement only to realize down the line that the explanations didn't actually translate into practical skills. I'm not sure it's even an AI problem, it's a "doing versus reading" problem. Same as with reading a pop-science article and thinking to myself that I learned something about physics or medicine or mathematics.
  - mmarian 1 hour ago
    Various concepts when I joined new teams in domains I've never worked in before. And system design. So very practical, and where stakes were high.
wxw 2 hours ago
The title is misleading. This isn't an AI tutor so much as a practice quiz platform with an AI autograder.
> constructed-response questions (CRQ) are graded by Claude Sonnet 4.6 against instructor-defined, question-specific rubric criteria
> Crucially, LLMs make it feasible to grade formative CRQ against rubric criteria at scale, a capability that appears pedagogically significant rather than merely convenient.
They specifically call out that the "RAG chat assistant" part of Phosphor (the platform) wasn't used much.
I commend the effort here, but I don't think these results are particularly noteworthy. The conclusion is essentially that people who do practice quizzes will do better on exams.
- fragmede 1 hour ago
  > a practice quiz platform with an AI autograder.
  What do you think tutoring is?
  - sumeno 26 minutes ago
    What do you think tutoring is? Because it's not just extra quizzes
  - mkl 1 hour ago
    Definitely not just grading. Tutoring is explaining and back and forth discussion to impart knowledge, in context and in response to specific difficulties/confusion the student is having.
zerobees 1 hour ago
While there's some skepticism in the thread, I'm not particularly surprised if this is true. Children who can get human tutoring do a lot better. An LLM that can answer questions and patiently explain likely offers some benefit.
What creeps me out about bringing LLM into early education is that it's a period where kids learn to socialize and cope with problems, and I do worry about forming substitute relationships with chatbots that are engineered for sycophancy / enablement. But I guess that's a problem either way, because almost every student will try an LLM at some point.
- sarchertech 57 minutes ago
  From my understanding, the actual AI was barely used. What was used was a quiz with an AI grader.
NeutralForest 1 hour ago
Interesting article, wonder where we're going with this though, I find it's very difficult to keep LLMs on track and critical enough to be useful.
Just want to say that:
>In our deployment, student-reported reading completion baselines for MATH 010 were approximately 15%, with instructors estimating 10%. Individual student reports of reading compliance ranged from "literally no one does that" to "is this being recorded?"
is hilarious
RA_Fisher 1 hour ago
This is super, but students will have access to AI during the test in real life, so it's ironically less realistic to remove it (thinking of the "... GPT-4 actually harmed subsequent performance by 17% when the tool was removed ..." part).
I'm more curious how students perform on the test with vs. without AI.
boulos 3 hours ago
Do you have a larger study planned for the Fall? It definitely seems promising.
I'm curious how well you feel this worked because the subject was Statistics (objective grading) versus something more subjective like Civics or Literature.
PS - I'd say this qualifies for Show HN, too!
- ilaksh 3 hours ago
  They were using Sonnet 4.6 for some free-form responses so that could be applied to something subjective.
  - boulos 1 hour ago
    But it's not clear that using Sonnet or any other LLM as a "grader" would result in the same improvement. For objective grading, you could be sure that the additional adaptive support is helping. For subjective things like writing style, literature, poetry, you end up with whatever Sonnet thinks is good (and randomly so).
    It still could be better for students, but it's not obvious that it would be (or maybe not as strongly?).
glenstein 1 hour ago
In mice!
Jk, but the skepticism is inevitable. I think we can be dubious about how AI mobilizes global capital while also appreciating tutoring as one of its best targeted use cases.
constantius 3 hours ago
Interesting, congrats.
Are you planning on opening access to Phosphor?
- thadk 2 hours ago
  maybe they did already via the "formerly known as" comment in the paper?: https://www.spongium.org
klustregrif 1 hour ago
A lot of pessimism in the comments, but I am just happy that we are seeing some work towards bridging the 2 Sigma gap for regular education vs. elite private tutoring. I can't imagine that people assume it's the physical presence of the tutor that is making the difference, it has to come down to the personalisation and expertise which is exactly what AI can provide in a form. And yea it might not be "there" yet. But if we don't start trying and studying then it'll never get there.
- johnfn 59 minutes ago
  Tell me if I am oversimplifying, but I never understood the noise about the two sigma problem. Like, of course if you have a private tutor to immediately answer any question that pops into your head at the immediate moment you get confused, you are going to learn vastly more efficiently than in a large classroom where once you get confused you are likely to stay confused. To say nothing of how the pace will likely either drag way behind what you'd like, or accelerate too fast ahead of it.
  The environment is just obviously two sigma better. This just... seems obvious to me? In the same way that I will get stronger much faster if I have a physical trainer to tell me exactly what I am doing wrong when I do it? And it seems obviously unsolvable other than by getting everyone a private tutor (or AI..?).
  Asking from a place of curiosity.
ilaksh 2 hours ago
Shocking that a well executed AI tutor improves outcomes.
Hasn't computer assisted interactive learning already been proven for years? Why does there seem to be so much skepticism about enhancing it with AI?
Is this just something like, astoundingly slow adoption or poor execution? Being held back by paper textbook makers? Teachers unions dragging their feet?
How can interactive AI driven individually paced learning _not_ be obviously dramatically more effective?
- skybrian 2 hours ago
  Selection effects are extremely important in education. Dartmouth students have already had a large selection effect. If you try to apply this more broadly then it might not work.
  Motivation is also a huge part of the problem. I'm wondering if the novelty of the AI tutoring gets more people to try it and whether it would wear off?
  It's surprising to me that many students at Dartmouth don't read the textbook. You'd think college admissions would select for that?
  It seems promising but, as they say, more research needed.
- dghlsakjg 2 hours ago
  Lots of people in education will happily tell you how the past 15 years of tech integration has been a net negative.
  There ARE technologies that have improved things, but so much high-cost useless tech has been shoved into every level of education that many educators are incredibly leery of new tech.
  The issue is that while the underlying technology is useful, the way it gets integrated is frequently not. An administrator cuts a deal with an ed-tech giant, for a huge amount, for a product they never have to use. Because the ink is dry and a huge sum of money has been spent, admins pressure educators to use the technology as much as possible regardless of outcome.
  In that context it makes a lot more sense why there is pushback and FUD among educators.
  - fragmede 1 hour ago
    I had a chance to use Google Classroom for a non-profit I was volunteering with, and wow it sucks. If that is what teachers and students have to contend with, yeah, I'd push back against any and all tech forced on me as well. It's all well intentioned, but the road to hell is paved with them.
- dominotw 2 hours ago
  it's like anything else. benefits students that are already motivated to learn.
  very few are actually motivated to learn and are just there to get a job or it's just the next thing that they have to do in life.
  - hajile 25 minutes ago
    I fear even a lot of bright, motivated people will be so discouraged by AI doomers that they won’t bother trying to learn.
Rperry2174 3 hours ago
Honestly whether or not this was effective seems less important to me than the adoption numbers.
Textbook reading in this course was 10-15% at baseline ... but this AI thing got 90% voluntary usage ungraded.
Even if it's worse per-hour than a textbook, you're now teaching 6x as many students _something_ instead of teaching a small minority everything.
So really it just becomes an optimization problem at that point because most students are at least in the funnel/in the running to learn something.
The paper kind of proves this itself ... they tweaked the quiz formats mid-semester and were able to iterate, which you can't do on a textbook that nobody opens in the first place
- baq 3 hours ago
  I'd argue the results are even better: just reading a textbook doesn't really teach you much. You have to do exercises, but they're expensive to create and grade. LLMs with a proper harness (see paper) tackle both.
kubb 3 hours ago
Too bad the educational use case doesn't make any money. Good LLMs are a game changer for people motivated to learn.
- Robotbeat 3 hours ago
  Wikipedia doesn’t make much money but is still helpful. LLMs don’t need to make a whole bunch of money to be helpful.
  - kubb 2 hours ago
    People aren't paying trillions to train them to be helpful. They want to make quadrillions.
- TheLML 3 hours ago
  I don't want to learn from hallucinations where it will change its answers based on me questioning their teachings. I use it for conversations in a language I'm learning, but I quickly learned that asking it grammar questions for example is not a wise decision.
  - afro88 2 hours ago
    Curious whether you were just bare asking it questions, or whether you provided it with lessons one by one with instruction that the lesson is the baseline truth etc
  - treis 2 hours ago
    Are we talking about human teachers or LLMs here?
    - sumeno 20 minutes ago
      This is such a lazy response to every LLM criticism
- jasondigitized 2 hours ago
  Not sure if it's education, but there is huge money in the college admissions process, e.g., SAT prep.
  - kubb 2 hours ago
    Not enough to cover labs' expenses.
albinahlback 3 hours ago
Very nicely typeset.
tancop 1 hour ago
[dead]
MoneyBurning 2 hours ago
[flagged]
- isomorphic_duck 2 hours ago
  Why did you make a new account to spam AI comments?