TECHNOLOGY

AI engine

A glass-box Claude architecture tuned for an 8-year-old. Every reasoning step is visible, every patch is reversible, every output is moderated on both sides of the model call. This is the product IP.

PRIMARY MODEL
Claude Sonnet 4.6
ANTHROPIC_MODEL in lib/anthropic/client.ts. Haiku tier handles classification in Stage 2.
lib/anthropic/client.ts
MODERATION PASSES
2 per turn
Pre-model (kid input) and post-model (Claude output). omni-moderation-latest.
lib/moderation/client.ts
SESSION COST
$1.50 to $3.00
Full build session, Sonnet 4.6 with prompt caching. Platform cap $200/day.
lib/cost/caps.ts
TURN CAP
30 per session
Plus $30/day per parent. Hard caps, not soft advisories.
lib/cost/caps.ts
FIRST-TOKEN TARGET
Under 300ms
Cached system and template prompts land the first token inside the kid attention window.
TODO(ai-engine): wire p50/p95 from observability and publish
WRITE PRIMITIVES
2
submit_patch and checkpoint_question. The model cannot touch the workspace any other way.
lib/ai/tools.ts
LAST UPDATED 2026-04-22

The AI engine is the load-bearing piece of Tekku. It is what turns a kid idea into a shipped URL without losing safety, pedagogy, or parent legibility on the way. The shape of the engine is deliberate and, at each layer, the tradeoffs are readable.

This page walks the full turn. Kid input arrives. The moderation layer runs. The Claude call fires with a cached system prompt and a cached per-template prompt. The model responds with a tool call (submit_patch or checkpoint_question). The response is moderated again. The concept detector labels the code. The snapshot is written. The parent view updates. Everything below is the detail under each of those boxes.

Per-turn architecture

flowchart LR
  K[Kid input<br/>plain text] -->|raw text| M1[OpenAI Moderation<br/>pre-check]
  M1 -->|clean text| C[Claude Sonnet 4.6<br/>+ cached prompts]
  M1 -.->|flagged| R[Redirect flow<br/>kid-language retry]
  C -->|tool call| P[submit_patch<br/>or checkpoint_question]
  P -->|full file| V[Patch validator<br/>scope + AST]
  V -->|valid patch| M2[OpenAI Moderation<br/>post-check on explanation]
  M2 -->|clean output| S[Sandpack apply<br/>+ checkpoint write]
  M2 -.->|flagged| E[Escalation queue<br/>human review + parent notify]
  S -->|saved file| D[Concept detector<br/>regex today, LLM Stage 2]
  D -->|labeled concepts| SN[Snapshot store<br/>Supabase ai_transcripts]
  SN -->|weekly digest| PV[Parent view<br/>email + dashboard]
One turn from kid input to parent view. Moderation passes sit on both sides of the model call. The concept detector labels every code save. Every edge is labeled with what crosses it.
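The flow above can be read as a single typed function. Everything in this sketch is a hypothetical stand-in for the real modules (lib/moderation/client.ts, lib/anthropic/client.ts, lib/ai/tools.ts); the network calls are stubbed as synchronous functions so the control flow reads clearly.

```typescript
type ToolCall =
  | { name: "submit_patch"; file: string; explanation: string }
  | { name: "checkpoint_question"; question: string; choices: [string, string, string] };

type TurnOutcome =
  | { kind: "patch_applied"; file: string }
  | { kind: "checkpoint"; question: string }
  | { kind: "redirect" }    // kid input flagged pre-model
  | { kind: "escalated" };  // model explanation flagged post-model

// Stub moderation: the real pre/post checks call omni-moderation-latest.
const moderationClean = (text: string): boolean => !/unsafe/.test(text);

// Stub model call: always returns a submit_patch tool call here.
const callClaude = (_kidInput: string): ToolCall => ({
  name: "submit_patch",
  file: "export default function App() { return <button>Click me</button>; }",
  explanation: "I added a button you can click.",
});

function runTurn(kidInput: string): TurnOutcome {
  if (!moderationClean(kidInput)) return { kind: "redirect" };           // pre-check
  const tool = callClaude(kidInput);                                     // cached prompts + transcript
  if (tool.name === "checkpoint_question") {
    return { kind: "checkpoint", question: tool.question };
  }
  if (!moderationClean(tool.explanation)) return { kind: "escalated" };  // post-check
  return { kind: "patch_applied", file: tool.file };                     // Sandpack apply + snapshot
}
```

The two dotted edges in the diagram are the two early returns: a pre-check flag never reaches the model, and a post-check flag never reaches the workspace.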
Claude as the primary model
Anthropic because safety-first priors match a kid product. Model tiering across Sonnet and Haiku keeps cost inside the unit economics.

We run Claude Sonnet 4.6 as the primary reasoning model for the kid build loop. The constant lives at ANTHROPIC_MODEL in lib/anthropic/client.ts, so the model is swappable in one line as Anthropic ships tier upgrades. Sonnet handles planning, scaffolding, error explanation, and the tool call that becomes the patch. Haiku is the classifier tier in Stage 2, not the kid-facing conversation.

The selection is not about benchmark scores. It is about the safety posture, the refusal behavior, the reading-level steering, and the data-use contract. Anthropic treats a kid-facing use case the way we need a kid-facing use case to be treated. OpenAI moderation handles the content classification on both sides of the call (cheap, fast, and purpose-built for that job) while Anthropic handles the reasoning.

Cost discipline rides on prompt caching. Input is $3 per 1M tokens, output is $15 per 1M, cache reads land at $0.30 per 1M and cache writes at $3.75 per 1M (see computeCost in lib/anthropic/client.ts). The system prompt and the per-template prompt are cached server-side at the Anthropic layer. The running transcript is the only full-price spend per turn. A full build session lands between $1.50 and $3.00. That envelope is what keeps the $149-per-month plan, at up to 50 sessions a month, gross-margin positive.
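The per-turn math is small enough to show in full. This is a sketch using the rates quoted above; the real computeCost lives in lib/anthropic/client.ts, and the function signature and usage-field names here are illustrative.

```typescript
// Published Sonnet rates, in dollars per million tokens.
const RATES_PER_MTOK = {
  input: 3.0,
  output: 15.0,
  cacheRead: 0.3,   // system + template prompt hits
  cacheWrite: 3.75, // first turn of a session writes the cache
};

interface Usage {
  inputTokens: number;      // uncached transcript + current file state
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

function computeCostUsd(u: Usage): number {
  return (
    (u.inputTokens * RATES_PER_MTOK.input +
      u.outputTokens * RATES_PER_MTOK.output +
      u.cacheReadTokens * RATES_PER_MTOK.cacheRead +
      u.cacheWriteTokens * RATES_PER_MTOK.cacheWrite) /
    1_000_000
  );
}

// A warm-cache turn: 8k cached prompt tokens read at a tenth of input price.
const warmTurnCost = computeCostUsd({
  inputTokens: 2000,
  outputTokens: 800,
  cacheReadTokens: 8000,
  cacheWriteTokens: 0,
}); // about two cents per turn
```

At roughly $0.02 per warm turn, a 30-turn session's model spend is dominated by the uncached transcript growth, which is why the transcript is bounded.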

Concept graph: how we label what a kid built
Five concepts today, detected via regex. Stage 2 is an LLM classifier with confidence scores and subtle concept coverage.

The concept graph is the translation layer between what a kid wrote and what a parent reads. Every time a kid saves code, Tekku walks the file with a regex detector and tags the concepts it finds. The Stage 1 set is five concepts: state, effects, events, lists, and conditionals. Each carries a kid-facing phrase (grade 5) and a parent-facing phrase (adult English) in lib/concepts/catalog.ts. The detector lives in lib/concepts/detector.ts and is 20 lines long.

The regex shape is narrow on purpose. useState matches state, useEffect matches effects, onClick and its siblings match events, Array.map inside JSX matches lists, and a ternary or short-circuit in JSX context matches conditionals. The detector ships with a known false-positive rate around 5% on strings in JSX attribute values that happen to look like event handlers. Acceptable for Stage 1, documented in the file header. That cost is the price we pay for shipping the weekly email loop today rather than next quarter.
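The detector's shape can be sketched in a few lines. The real patterns live in lib/concepts/detector.ts; the regexes below are plausible stand-ins for the matching rules described above, not the actual production patterns.

```typescript
// One pattern per Stage 1 concept, matched against the saved file.
const CONCEPT_PATTERNS: Record<string, RegExp> = {
  state: /\buseState\s*\(/,
  effects: /\buseEffect\s*\(/,
  events: /\bon(Click|Change|Submit|KeyDown)\s*=/,
  lists: /\.map\s*\(/,                          // Array.map inside JSX, roughly
  conditionals: /\{[^}]*(\?[^}]*:|&&)[^}]*\}/,  // ternary or short-circuit in JSX braces
};

function detectConcepts(code: string): string[] {
  return Object.entries(CONCEPT_PATTERNS)
    .filter(([, pattern]) => pattern.test(code))
    .map(([concept]) => concept);
}
```

The narrowness is the point: each pattern under-matches rather than over-claims, and the one documented over-match (strings in JSX attributes that look like handlers) is accepted and written down.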

Stage 2 replaces the regex with a Claude Haiku classifier (TODO-002 in TODOS.md). Input is the kid code plus the running AI transcript. Output is concepts with evidence lines and confidence scores. Cost per save is about $0.001. At 100 projects per month that is 30 cents. The accuracy gain is the difference between a parent trusting the weekly email and a parent unsubscribing.
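The Stage 2 output might look like the shape below. This is a hypothetical schema, not a committed one; the source only specifies that the classifier returns concepts with evidence lines and confidence scores, and the threshold value is an assumption.

```typescript
interface ConceptEvidence {
  concept: "state" | "effects" | "events" | "lists" | "conditionals";
  confidence: number;        // 0..1, from the classifier
  evidenceLines: number[];   // line numbers in the saved file
  evidenceSnippet: string;   // the code the parent-facing claim cites
}

// Illustrative cutoff: only high-confidence detections would reach the
// weekly parent email. The 0.8 value here is an assumption, not policy.
const reportable = (e: ConceptEvidence): boolean => e.confidence >= 0.8;
```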

Teaching evaluator: measuring learning without testing kids
No quizzes. Evidence of learning lives in the code artifact, the session transcript, and the concept graph that links them.

Kids quit products that feel like tests. The teaching evaluator is the reason Tekku never has to ask a kid to prove anything. The evidence of learning is the work itself. A kid who shipped a clicker that survives a page refresh has used state. The detector saw it. The snapshot stores the code and the transcript side by side. The parent-facing claim is auditable, because the evidence line is a real code snippet from a real session.

What the evaluator returns is a concept-per-session record plus a first-time-used flag per concept per kid. The first time a kid uses state across any project is the event that triggers the parent-facing claim "Maya learned state this week." Subsequent uses feed the retention curve on that concept, which is what determines when we promote a concept from "new" to "confident" in the parent dashboard. TODO(learning): the first-to-confident threshold is not yet sourced in published pedagogy work. We will either cite the source at Stage 2 or soften the parent-facing language.
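The first-time-used flag reduces to a set-membership check over the kid's history. The record and function names below are illustrative stand-ins for the evaluator's real shape, which is not specified on this page.

```typescript
interface SessionConcepts {
  kidId: string;
  sessionId: string;
  concepts: string[];  // from the concept detector, deduped per session
}

// For each concept in the current session: is this the kid's first use
// across any project? A true flag is the event behind a parent-facing
// claim like "Maya learned state this week." Subsequent (false) uses
// feed the retention curve on that concept.
function firstTimeFlags(
  history: SessionConcepts[],  // all prior sessions for this kid
  current: SessionConcepts,
): Record<string, boolean> {
  const seen = new Set(history.flatMap((s) => s.concepts));
  return Object.fromEntries(current.concepts.map((c) => [c, !seen.has(c)]));
}
```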

What the evaluator deliberately does not do: grade kid code, score kid output, or expose a rubric to the kid. The kid ships. The parent reads. The concept evidence is the bridge.

Prompt architecture
Three layers. Two are cached aggressively. The running transcript is the only per-turn spend at full price.

Every Claude call is three layers stacked. The first is the system prompt: kid-first voice, pedagogy rules, moderation floor, reading-level targets (grade 3 for errors, grade 5 for checkpoints), refusal patterns, and output format. It lives in lib/prompts/build-assistant.md and is loaded via loadPrompt in lib/ai/system-prompt.ts, with an in-process cache in production. On the wire it is cached at the Anthropic prompt-cache tier so the first token after a cache hit lands fast.

The second layer is the per-template cached prompt. Each starter template (clicker, timer, scorekeeper, drawing pad, reaction tester) has its own scoped prompt that defines what good looks like for that app type. This cuts token cost on repeat sessions and raises answer quality at the same time. Both cached layers read at $0.30 per 1M tokens, against $3 per 1M for a cold input.

The third layer is the running transcript plus the current file state. This is the only per-turn full-price spend. It is bounded so the session never exceeds the $3 soft cap. The transcript is truncated at turn boundaries; no message window management is done inside a single Claude call because the tool-call shape does the right work already.
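The three layers map onto the Messages API cache_control convention. This is a sketch of the request shape only; the model string and prompt contents are placeholders, not the real ANTHROPIC_MODEL constant or the prompts in lib/prompts/.

```typescript
const systemPrompt = "Kid-first voice, pedagogy rules, moderation floor...";  // lib/prompts/build-assistant.md
const templatePrompt = "What good looks like for the clicker template...";    // per-template layer
const transcript = [{ role: "user", content: "make the button bigger" }];     // running transcript

const request = {
  model: "claude-sonnet-4-6",  // stand-in for the ANTHROPIC_MODEL constant
  max_tokens: 2048,
  system: [
    // Layer 1: global system prompt, cached across every session.
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    // Layer 2: per-template prompt, cached per starter template.
    { type: "text", text: templatePrompt, cache_control: { type: "ephemeral" } },
  ],
  // Layer 3: transcript + current file state, the only full-price spend.
  messages: transcript,
};
```

Both cached blocks read back at $0.30 per 1M on a hit, which is what makes the repeat-session economics work.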

Guardrails live at two additional layers. Inside the system prompt, refusal patterns and reading-level constraints are declarative. Outside the model, the two build tools (submit_patch, checkpoint_question in lib/ai/tools.ts) are the only write primitives. The model cannot emit free-form code that lands in the workspace. It cannot emit a partial diff. submit_patch demands a full replacement file with an explanation string. checkpoint_question demands exactly three choices. tool_choice is forced on the classifier and idea-generator calls so the model cannot drop out into free text.
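Expressed as Anthropic tool definitions, the two write primitives could look like this. Only the tool names come from this page; the property names and descriptions are plausible stand-ins for lib/ai/tools.ts.

```typescript
const tools = [
  {
    name: "submit_patch",
    description: "Replace one file in the kid's workspace, with an explanation.",
    input_schema: {
      type: "object",
      properties: {
        file: { type: "string", description: "Full replacement file, never a partial diff" },
        explanation: { type: "string", description: "Kid-readable reason for the change" },
      },
      required: ["file", "explanation"],
    },
  },
  {
    name: "checkpoint_question",
    description: "Pause the build and ask the kid to choose a direction.",
    input_schema: {
      type: "object",
      properties: {
        question: { type: "string" },
        choices: {
          type: "array",
          items: { type: "string" },
          minItems: 3,
          maxItems: 3,  // exactly three choices, enforced at the schema level
        },
      },
      required: ["question", "choices"],
    },
  },
];
```

The schema is the outer guardrail: a model output that is not one of these two shapes never becomes a workspace write.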

Rate limiting and cost controls
Three dimensions: per-project turn count, per-parent 24h rolling cost, platform 24h rolling cost. Hard caps.

A kid holding the Enter key is a predictable failure mode for a conversational AI product. lib/cost/caps.ts implements the Stage 1 cap check with three dimensions. The turn cap is 30 per project per session. The per-parent cap is $30 in the last 24 hours, summed across that parent's kid accounts. The platform cap is $200 in the last 24 hours across everything. When any cap trips, the API returns a cap code and the UI shows a kid-legible message.
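The three-dimensional check reduces to ordered threshold tests. The constants match this page; the function shape and cap codes below are illustrative stand-ins for lib/cost/caps.ts.

```typescript
const TURN_CAP = 30;           // per project per session
const PARENT_CAP_USD = 30;     // rolling 24h, summed across a parent's kid accounts
const PLATFORM_CAP_USD = 200;  // rolling 24h, across everything

type CapCode = "turn_cap" | "parent_cap" | "platform_cap" | null;

// Returns the first cap that trips, or null if the turn may proceed.
// The API surfaces this code and the UI maps it to a kid-legible message.
function checkCaps(
  turnCount: number,
  parentSpend24hUsd: number,
  platformSpend24hUsd: number,
): CapCode {
  if (turnCount >= TURN_CAP) return "turn_cap";
  if (parentSpend24hUsd >= PARENT_CAP_USD) return "parent_cap";
  if (platformSpend24hUsd >= PLATFORM_CAP_USD) return "platform_cap";
  return null;
}
```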

The read-then-act shape of the Stage 1 implementation is documented inside the file: a burst of 5 to 10 calls can slip past before the first write lands. The burst mitigation (an advisory-lock plus reservation-row pattern) is in the v1.1 queue and ships before the Stage 2 public pilot. Today the platform is small enough and the founding cohort is controlled enough that the simple shape is enough.

Cost caps are instrumented in Supabase via the ai_transcripts table. Every transcript row carries a cost_usd column computed by computeCost in lib/anthropic/client.ts. The caps query sums that column over the 24-hour window for the parent and for the platform. No cost-estimation math lives outside that computeCost function, so every pricing decision touches one file.

Provider matrix

Primary model, moderation, fallbacks. The architecture can survive any one of these providers breaking. Switching cost grows with the prompt-cache and fine-tune surface, not with the code path.

Provider | Primary use | Cost per 1M | Latency | Availability | Switching cost
Anthropic (Claude Sonnet 4.6) | Kid-facing reasoning and tool calls | $3 in / $15 out (cache $0.30 / $3.75) | TODO(ai-engine): p50/p95 from observability | Hot path, primary | Moderate: prompt cache warmth, evaluation suite
Anthropic (Claude Haiku) | Stage 2 concept classifier (TODO-002) | $0.80 in / $4 out (approx) | Sub-second | Stage 2, not yet wired | Low: classifier is one call with a strict schema
OpenAI (omni-moderation-latest) | Pre- and post-model content moderation | Included with OpenAI account | 10-second timeout, usually under 200ms | Hot path, primary | Low: moderation is a single call behind a wrapper
OpenAI (GPT) fallback | Cold-swap if Anthropic availability breaks | Comparable tier pricing | Comparable | Contingency only. TODO(ai-engine) | Moderate: different tool-call schema, different safety posture
Vercel (Blob + hosting) | Kid ship pipeline and shared artifact URLs | Platform cost, not per-turn | Edge-cached | Hot path, ship flow | Low at Stage 1: Next.js app, portable
Supabase | Auth, parent-kid linkage, transcripts, snapshots | Platform cost | Fast-path reads | Hot path, persistence layer | Moderate: RLS rules and schema are portable Postgres

Latency p50/p95, cache hit rates, and false-positive rates are TODO(ai-engine) until the observability dashboard publishes them. We will not hand-wave numbers here.

See also