James Routley

Generative AI is fast becoming an everyday tool across almost every field and industry. Teaching children about it should include an understanding of how it works, how to think critically about its risks and limitations, and how to use it effectively. In this post, I’ll share how I’ve been doing this.

My approach is to take students through six projects that give a practical and hands-on introduction to generative AI through Scratch. The projects illustrate how generative AI is used in the real world, build intuitions about how models generate text, show how settings and prompts shape the output, highlight the limits (and risks) of “confidently wrong” answers, and introduce some practical techniques to make AI systems more reliable.

As with the rest of what I’ve done with Machine Learning for Kids, my overall aim is AI literacy through making. Students aren’t just told what language models are – they go through a series of exercises to build, test, break, and improve generative AI systems in Scratch. That hands-on approach helps to make abstract ideas (like “context” or “hallucination”) more visible and memorable.

I’ve long described Scratch as a safe sandbox, and this makes it ideal to experiment with the sorts of generative AI concepts they will encounter in daily tools such as chatbots, writing assistants, translation apps, and search experiences.

Core themes

Across the six projects, students repeatedly encounter three core questions:

1. How does a model decide what to say next?

2. Why do outputs vary, and how can we steer them?

3. When should we not trust a model, and what do we do then?

All of this is intended to be a (much simplified!) mirror of real-world practice. Professional uses of generative AI combine generation (writing), grounding (bringing in trusted sources), instruction (prompting), and evaluation (testing and comparing). Children can be introduced to all of these aspects through hands-on experiences.

What jargon do I include?

I’ve tried to avoid teaching jargon for the sake of jargon, but there are a few AI terms that appear through the projects.

A key aspect is that I’ve only introduced terms where students can experiment with the why behind that idea, not only the what.

For example, I avoid just declaring “temperature has the effect of…” as a fact, and instead help students to build an intuition about why temperature has the impact that it does through simplified experimentation and visualisations.

Similarly, I avoid just declaring things like “hallucination is where models do…” as a fact. I want to help students to understand why the way that models work results in these issues.

But, the sorts of terms I do cover include:

Language model – A system that generates text by predicting what word is most likely to come next given the recent text.
Context window – The “working memory” of the model, and the amount of conversation (text) it can consider at once before older parts drop out.
Temperature – A setting for how the model chooses the next word, which can feel like a creativity dial, with higher values making outputs more varied and surprising, and lower values making more predictable and repetitive output.
Top-p – Another setting used in choosing the next word letting you limit the model to the most likely choices. It feels like another creativity dial.
Hallucination – A model producing an answer that sounds confident but is incorrect or invented.
Retrieval-augmented generation (RAG) – A technique for improving answers by fetching relevant information from a trusted source and adding it to the prompt before generating a response.
Role prompting (persona) – A technique where the model is asked to respond “as” a particular character or professional role to shape tone, style, and viewpoint.
Zero / one / few-shot prompting – including no examples, one example, or several examples of the desired output in the context, so the model generates text consistent with the pattern.
Benchmarking – How lanaugage models are tested so that we can compare performance between different models at different tasks.

That sounds like a lot if you’re not familiar with the jargon!

Students likely won’t memorise all of those terms, but I don’t feel like that’s a problem. These are all ideas that even young children are able to build an understanding and an intuition around. That is the more important thing. Whether they remember the label that we use for the idea is less important than whether they can understand the idea.

I introduce the jargon mostly because I hope that when they hear the terms used in discussions about AI, it will ring a bell from something they did themselves. Instead of sounding like a technical foreign language that excludes them from that discussion, I want to give them a sense of familiarity that encourages them to engage with the discussion.

The projects

“Language Models” focuses on foundations
“Story Teller” explores creative generation
“RAG Time” gets into reliability and grounding
“Personas” explores some prompting techniques and roles
“Translation Telephone” gets into semantic drift, bias, and other prompting techniques
“Benchmarking” explores how models are tested and compared

I’ve written about most of these projects before, so I won’t repeat all of that here, but I’ll try and give a high level overview.

“Language models” – the foundation project

The first project sets the tone: langauge models generate text based on patterns and recent text.

The main activity is to build a simple and interpretable “toy” language model that makes some crucial ideas tangible. For example, students see how and why more context helps (longer chains result in more coherent output), and the way that sampling choices like temperature and top-p affect the style and accuracy by altering the randomness controlling the next word choices.

This is the anchor for all the projects, with ideas introduced here referred back to repeatedly through later projects. All the other things that students build using language models will come back to the core idea that is introduced in this project of how the model chose the next words.

It is also an opportunity to discuss the ethics in how these large AI models are trained. As students see for themselves how large amounts of text are an essential requirement to be able to make the models better, they understand some of the motivation that tech companies have to gather huge amounts of text. This understanding is a useful starting point for a discussion on whether it is ethical to use all text on the Internet.

Why I like this project

Understanding the way that language models generate text by statistically predicting the most likely next word demystifies it. The randomness controls that they experiment with start to establish the core idea that different settings lead to different types of responses. All of this helps students to see AI as a tool that can be guided and understood, not a magic black box of truth.

Students see how chatbots can sound fluent and knowledgeable, but start to understand how this is different to “understanding” in a human sense.

Jargon included

language model
context window
temperature
top-p

More info

blog post: “Exploring Language Models in Scratch“
project worksheet
narrated video walkthrough of the project
learn more about language models

“Story teller” – the creative generation project

In the second project, students take the text prediction idea, and use it to to make a story-generating Scratch sprite. They ask the sprite for poems or stories on different themes, and the model generates them.

This lets them put the concepts they learned about in the previous project into action.

They can see how with small context windows, the model loses track of the initial instructions as it gets further into the story. And they see that with larger context windows the model can continue to generate longer stories.

The project is ideal for experimenting with the temperature and Top-P settings they learned about in the previous project, as they can quickly see how they affects the creativity of the text generated by the model. Students see how that for creative tasks, variation is a feature of language models, not a bug – if you use it correctly.

Why I like this project

This is a very simple project to make, so it helps to build students’ confidence in AI as something they can use and build with, not just observe.

Jargon included

language model
context window
temperature
top-p

More info

“RAG-time” – the reliability project

In the third project, students explore the sorts of questions that language models cannot answer, such as fact-based questions on events that happened after the model was trained, or questions on topics where no online material exists. They see for themselves how confident the answers sound that the model generates, even when the answers aren’t correct (and that it wouldn’t be possible for the model to correctly answer).

This reinforces what they learned before – that models generate text based on the patterns derived from data, rather than using knowledge. More importantly, they see one impact of this: language models often generate answers even when they shouldn’t or can’t.

Students then learn one technique for mitigating this limitation: by adding supporting information (e.g. from Wikipedia) into the model’s context that contains the correct answer. They see that this can enable a model to correctly answer questions, even about private topics or recent events.

Why I like this project

Hallucinations are one of the main risks in the real-world use of generative AI, so understanding the reasons behind this and the risks that this introduces is critical.

The technique that the students implement in Scratch for mitigating this is called RAG (retrieval augmented generation) in the real world, and is the same principle behind “chat with your documents” systems, customer support assistants that cite a knowledge base, and enterprise AI tools that restrict answers to approved corporate sources.

Jargon included

language model
hallucination
retrieval-augmented generation

More info

"RAG-time" is a #MLforKids project that introduces how we use language models to answer questions about recent events.

Adding relevant documents to the context transforms the answers the model can give

Step-by-step instructions to create this in #Scratch at machinelearningforkids.co.uk/worksheets

[image or embed]

— Dale Lane (@dalelane.co.uk) 1 September 2025 at 14:02

“Personas” – the first prompt engineering project

The fourth project starts to explore the idea of prompt engineering: how to influence the output of a language model by the way that they ask it questions.

Students create a Scratch project that randomly selects a persona (e.g. pirate, alien, sports commentator, medieval knight, etc.), adding a description of the persona to the language model. They then start interacting with the model, and have to work out the hidden persona just from the answers that it gives.

This playful exploration is an introduction to “role prompting” – which simply means that if you tell a language model who it should pretend to be, it will generate text differently.

After a few projects focusing on technical settings that change the output from a language model, this project teaches that outputs are also shaped by how you phrase your request.

Why I like this project

I sometimes see prompt engineering framed as the “magic words” that you should include in your prompt, which I think is less helpful. Instead, I think a more helpful approach is to build on an understanding of how language models work to understand why role prompting is effective. Student understanding from previous projects about the impact of context on how the next words are selected is the foundation that this project builds upon.

In real-world usage, role prompting can be as simple as changing “Explain gravity” into “You are a friendly science teacher. Explain gravity to a Year 5 pupil in a UK school. Use language from the UK Key Stage 2 National Curriculum.” This dramatically changes the tone and style of the responses, as well as the content and level of detail. This is a super useful skill to learn.

Jargon included

language model
role prompting

More info

blog post: “Explaining role prompting in Scratch“
project worksheet
video recording
learn more about role prompting

“Translation Telephone” – the second prompt engineering project

The fifth project introduces another prompt engineering idea.

Students create a multi-language translation chain across multiple Scratch sprites. It’s a bit like Chinese Whispers, with sprites using a language model to translate an English sentence into French, the French sentence into German, the German sentence into Chinese, and the Chinese sentence back into English – before repeating that loop again.

The chain of translations results in semantic drift – the sentence drifts from the original input the more translations it goes through.

There are several practical insights that become obvious from playing with this. For example, experimenting with starting sentences that contradict common stereotypes (e.g. sentences about female engineers, or male nurses, or female doctors) typically drift in ways that conform to stereotypes (e.g. engineers become men, nurses become women, etc.). This is a very visible illustration of the risks of bias in these models.

Creating this project also introduces a practical challenge: their language models don’t return only a translation, they return a translation together with commentary, explanations, or follow-up questions. This breaks the idea of the project, as that gets included in the text given to the next sprite to translate.

Students have to learn a new prompt engineering technique to get the model to respond only with the translation: few-shot prompting. This is an effective introduction to a very useful technique, and yet another reminder that the way that we ask questions will change the way that the model generates text.

Why I like this project

As with role prompting introduced in the previous project, one-shot and few-shot prompting is a super useful skill to learn to be most effective with generative AI. What students learn with this project mirrors how a lot of real-world generative AI applications are created.

Jargon included

language model
temperature
top-p
zero / one / few-shot prompting

More info

blog post: “Explaining few-shot prompting in Scratch“
project worksheet
video recording
learn more about one-shot prompting

“Benchmark” – the testing project

The sixth project encourages students to step back and compare the many language models they had to choose from in all the projects so far. This project introduces the idea that some models are better at different types of tasks than others.

Students use Scratch to make a benchmarking project, creating a simple visualisation to show what questions the model gives correct answers to and what it gets wrong.

Why I like this project

Over multiple projects, students are repeatedly shown that generative AI models are not perfect and not infallible. They have hands-on experiences to explore multiple reasons why any language model can get things wrong. In this project, they take a step back to get an insight into how different language models are assessed and compared.

They explore the trade-off between different models. Students look at the accuracy in comparison with the complexity (number of parameters) and size (in MB) of the models. This is analogous to the way that organisations have to choose models and settings, choosing the right tool for the job.

Jargon included

language model
benchmarking

More info

blog post: “Introducing LLM benchmarks using Scratch“
project worksheet
more graphs from Scratch!
video recording
learn more about benchmarks

The takeaways

After completing these projects, I hope students are able to:

Explain in everyday terms that a language model predicts likely next words based on what came before
Describe why answers can change even with the same question, and how configuration and settings can influence that
Recognise common failure modes: hallucinations (confidently made-up answers), forgetting (limited context window), drift (meaning changing over repeated transformations), and bias (skewed outputs reflecting stereotypes in data)
Improve results using practical strategies: providing context, asking for the right format, including examples, retrieving information from trusted sources
Understand that there are many different language models and explain how they can be compared

If nothing else, I hope students will leave with an instinct that language models are a powerful statistical next-word guesser, and not a truth machine.

All of this translates directly into being a more capable and critical user of AI tools. These are exactly the set of insights that makes someone more effective with generative AI in everyday life: understanding why it behaves as it does, how to get better results, how to use it safely, and when to apply scepticism.

I want students to learn how to ask, how to check, and when not to rely on AI.

The challenges

All of this is admittedly a best-case outcome. Before I finish, I should acknowledge some of the challenges I’ve run into when doing this in a classroom.

Most of the problems stem from my decision that students download language models to run on their own computer or device (as I didn’t feel that I could afford to host models for them).

This means that students need to be using a machine that is able to run a language model. This brings CPU and memory requirements that not all schools and code clubs can satisfy. My advice here is to test the models on the student devices before a class to check what will work on those systems. I’ve included a wide range of model sizes in Machine Learning for Kids, from some tiny ones to some fairly sophisticated models that require a powerful computer. I hope that the smaller ones will be viable for most classrooms.

Students need to download the model before they can get started. The two largest models need a 1.5gb download, and for many school networks that is a painfully slow and time-consuming step. My advice here is to download the models on all of the student machines before the session. Once downloaded, the models can be used even if a user logs out/in, and/or closes and reopens their web browser, so that shouldn’t prevent an early setup. This is essential, as trying to download models during a lesson would introduce an infeasibly long delay.

To wrap up

If you are a teacher or code club volunteer, please give some of these ideas or projects a try. If there is anything I can do to help, please get in touch.

If you’re a teacher or code club volunteer somewhere near me in Hampshire – I’d love to run through these projects again with some new student groups, so please let me know if you’d be interested in that.

Tags: mlforkids, scratch

This entry was posted on Wednesday, January 28th, 2026 at 11:22 pm and is filed under tech. You can follow any responses to this entry through the RSS 2.0 feed. You can skip to the end and leave a response. Pinging is currently not allowed.

How to explain Generative AI in the classroom

Core themes

What jargon do I include?

The projects

“Language models” – the foundation project

Why I like this project

Jargon included

More info

“Story teller” – the creative generation project

Why I like this project

Jargon included

More info

“RAG-time” – the reliability project

Why I like this project

Jargon included

More info

“Personas” – the first prompt engineering project

Why I like this project

Jargon included

More info

“Translation Telephone” – the second prompt engineering project

Why I like this project

Jargon included

More info

“Benchmark” – the testing project

Why I like this project

Jargon included

More info

The takeaways

The challenges

To wrap up