The Problem with Generic AI in Teacher Evaluation
Every week, another AI tool promises to "revolutionize teacher evaluation." Most of them do the same thing: you paste a classroom transcript into a chatbot, and it spits out generic feedback. The language sounds authoritative. The suggestions seem reasonable. But there is a fundamental problem that no amount of prompt engineering can fix.
These tools have no idea what your state actually requires.
Teacher evaluation in the United States is not a free-form exercise. It is a structured, legally mandated process governed by state-specific rubrics that define exactly what effective teaching looks like, how it should be measured, and what scoring levels mean. When a principal in Texas evaluates a teacher, they are scoring against T-TESS -- not against whatever ChatGPT thinks "good teaching" means. When a coach in Tennessee observes a classroom, they are looking at TEAM domains, not generic AI-generated bullet points.
This is the gap that most AI teacher evaluation tools ignore entirely. They treat evaluation as a content analysis problem -- "tell me what happened in this lesson" -- when it is actually a standards alignment problem: "how does this instruction score against the specific indicators my state has defined for Domain 2, Indicator 2.3, at the Distinguished level?"
Why State Rubrics Exist -- and Why They're Non-Negotiable
State teacher evaluation rubrics are not bureaucratic paperwork. They represent decades of research into effective teaching practice, refined through legislative processes, stakeholder input, and field testing. Each rubric reflects a state's specific educational priorities and legal framework.
Consider just a few examples of how dramatically rubrics differ across state lines:
- Texas (T-TESS) uses four domains with 16 dimensions, scored on a five-point scale from Improvement Needed to Distinguished. Its planning domain emphasizes standards alignment and data-driven instruction.
- Tennessee (TEAM) evaluates across four domains with 19 indicators, using a five-point scale from Significantly Below Expectations to Significantly Above Expectations. TEAM places particular emphasis on questioning strategies and academic feedback.
- Danielson Framework (PA, IL, WI, KY, MD, DE, HI, NM, ID, SD) organizes teaching into four domains with 22 components and 76 smaller elements, each with detailed critical attributes across four performance levels. Ten states use variants of this framework, but each has state-specific adaptations.
- Mississippi (PGS) focuses on four domains with specific attention to student engagement metrics and classroom environment indicators unique to Mississippi's educational context.
- Kansas (KEEP) uses a streamlined rubric aligned to Kansas's own teaching standards, with scoring levels and indicators tailored to the state's professional development framework.
A generic AI tool that produces a single "evaluation" regardless of whether the teacher is in Arkansas (TESS) or Ohio (OTES 2.0) is producing something that looks useful but cannot withstand scrutiny. It is like using a single building code for every state -- technically about buildings, but legally meaningless where it matters.
This is why Upraiser was built with state rubrics as the foundation, not an afterthought. The AI doesn't generate generic feedback and then try to map it to a rubric. It evaluates the transcript against each rubric domain's specific indicators from the start, using the exact scoring criteria and performance level descriptors that evaluators use in the field.
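To make that concrete, here is a minimal sketch of how a state rubric can be modeled as structured data rather than free-form prompt text. The type names and fields are illustrative assumptions, not Upraiser's actual schema -- the point is that a rubric-aware system scores against this structure, while a generic chatbot has nothing like it to score against.

```typescript
// Hypothetical types illustrating how a state rubric can be modeled as
// structured data. Names and fields are illustrative, not Upraiser's schema.

interface PerformanceLevel {
  label: string;      // e.g. "Improvement Needed" ... "Distinguished" (T-TESS)
  score: number;      // position on the state's scoring scale
  descriptor: string; // the state's performance level descriptor text
}

interface Indicator {
  id: string;                   // e.g. "2.3"
  title: string;
  criticalAttributes: string[]; // observable attributes defined by the state
  levels: PerformanceLevel[];
}

interface RubricDomain {
  id: string;                   // e.g. "Domain 2: Instruction"
  title: string;
  indicators: Indicator[];
}

interface StateRubric {
  state: string;     // e.g. "TX"
  framework: string; // e.g. "T-TESS"
  scale: string[];   // ordered performance level labels for the framework
  domains: RubricDomain[];
}
```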
The "Paste Into ChatGPT" Problem
We understand the temptation. A principal with 30 evaluations to complete this quarter discovers they can paste a classroom transcript into ChatGPT and get something that reads like evaluation feedback in 30 seconds. For someone who has been spending 45 minutes on each write-up, that looks like a lifeline.
But this approach has three serious problems that go beyond accuracy.
1. No rubric alignment
ChatGPT does not know your state rubric's specific indicators, scoring levels, or critical attributes. It generates plausible-sounding educational feedback, but it cannot tell you whether a teacher demonstrated "Distinguished" level questioning under T-TESS Dimension 2.4 versus "Proficient" level. It does not know the difference, because it was never trained on the specific distinction your state draws.
2. Data privacy and FERPA concerns
When you paste a classroom transcript into a consumer AI tool, you are sending student voice data, teacher performance data, and potentially identifiable classroom information to a third-party service with no data processing agreement, no FERPA compliance, and no guarantee about how that data will be used for model training. Many districts have explicit policies prohibiting this, and for good reason.
3. No audit trail or consistency
Evaluations are professional records. When a teacher questions their rating, the evaluator needs to point to specific evidence aligned to specific rubric indicators. A ChatGPT conversation provides no structured audit trail, no consistency across evaluations, and no way to demonstrate that the same standards were applied to every teacher. Purpose-built teacher evaluation software maintains this chain of evidence automatically.
What Rubric-Aligned AI Evaluation Actually Looks Like
When we built Upraiser, we started from the rubrics. Not from the AI, not from the transcript processing, not from the user interface. The rubrics came first because they define what a valid evaluation actually is.
Here is how the process works in practice:
- Audio capture: The evaluator records the classroom observation using their phone or tablet. This can be an in-person observation, a sit-down coaching meeting, or a remote session.
- AI transcription: The audio is transcribed using AssemblyAI with multi-language support, producing a timestamped, word-level transcript -- not a rough summary, but a precise record of what was said and when.
- State rubric evaluation: The transcript is evaluated by OpenAI's GPT-5 against the teacher's specific state rubric. Every domain is evaluated in parallel using the exact indicators, scoring levels, and critical attributes defined in the rubric. The AI knows that "Proficient" in T-TESS Domain 2 means something specific and different from "Proficient" in TEAM Domain 3.
- Image analysis: Photos of lesson plans, anchor charts, student work, and classroom environment are analyzed as visual evidence alongside the transcript, providing a more complete picture of instruction.
- Human review: The AI generates draft scores and evidence-based feedback for each domain. The evaluator reviews, adjusts, and approves. The AI accelerates the process; the educator makes the final call.
This pipeline runs the same way whether the evaluator is in Mississippi using PGS, in Georgia using TKES, or in Connecticut using TPES. The underlying AI is the same; the rubric knowledge is state-specific. That is the difference between a general-purpose AI tool and purpose-built teacher evaluation software.
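For readers who think in code, here is a simplified sketch of what that pipeline could look like: transcribe once, then evaluate every domain of the state rubric concurrently and hand the drafts to a human reviewer. The functions and types below are hypothetical stand-ins, not the actual AssemblyAI or OpenAI client APIs.

```typescript
// Simplified, hypothetical sketch of a rubric-aligned evaluation pipeline.

interface Transcript {
  words: { text: string; startMs: number; endMs: number }[];
}

// Trimmed version of the state rubric type sketched earlier.
interface Rubric {
  framework: string;      // e.g. "T-TESS", "TEAM", "Danielson FFT"
  domains: { id: string }[];
}

interface DomainDraft {
  domainId: string;
  suggestedLevel: string; // e.g. "Proficient"
  evidence: { quote: string; timestampMs: number; indicatorId: string }[];
}

// Hypothetical integration points: word-level transcription and a single
// rubric-domain evaluation call to the language model.
declare function transcribeAudio(audioUrl: string): Promise<Transcript>;
declare function evaluateDomain(
  transcript: Transcript,
  rubric: Rubric,
  domainId: string
): Promise<DomainDraft>;

// Transcribe once, then evaluate every domain of the teacher's state rubric
// concurrently. The drafts go to a human evaluator for review and final scoring.
async function draftEvaluation(audioUrl: string, rubric: Rubric): Promise<DomainDraft[]> {
  const transcript = await transcribeAudio(audioUrl);
  return Promise.all(
    rubric.domains.map((d) => evaluateDomain(transcript, rubric, d.id))
  );
}
```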
Addressing the Valid Concerns About AI in Evaluation
Skepticism about AI in teacher evaluation is healthy and warranted. The concerns raised by education researchers and advocacy groups -- about context, bias, transparency, and human oversight -- are legitimate. They should shape how AI evaluation tools are designed, not whether they exist.
Here is how state-rubric-aligned AI systems address the most common concerns:
"AI can't understand classroom context"
This is true -- and it is exactly why the human evaluator remains in the loop. The AI processes the transcript and identifies evidence aligned to rubric indicators. But the evaluator who was in the classroom adds context the AI cannot see: a student who had a difficult morning, a fire drill that disrupted the lesson, a co-teaching arrangement that shifted mid-class. The AI handles the time-intensive transcript analysis; the evaluator handles the contextual judgment.
"AI evaluations might be biased"
Bias is a real risk in any evaluation system -- including purely human ones. State rubrics actually help mitigate AI bias because they constrain the evaluation to specific, observable indicators. The AI cannot score based on a vague impression of "good teaching." It must identify evidence for specific rubric components at specific performance levels. This is more structured and auditable than many human evaluation processes.
"How do I know what the AI is scoring?"
Transparency is built into the rubric-aligned approach. For every domain score, Upraiser provides the specific transcript evidence that supports the rating, mapped to the specific rubric indicators. Evaluators can see exactly why the AI suggested a "Proficient" versus "Accomplished" score and override it with a click. This is not a black box producing a number -- it is an evidence-mapping tool that shows its work.
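As an illustration, an evidence-mapped draft for a single domain might look something like the hypothetical object below. The indicator references, quotes, and timestamps are invented for the example; the structure is what matters -- every suggested score carries the evidence that justifies it.

```typescript
// Illustrative, hypothetical example of an evidence-mapped domain draft --
// not actual Upraiser output.
const domainDraft = {
  framework: "T-TESS",
  domainId: "Domain 2: Instruction",
  suggestedLevel: "Proficient",
  evidence: [
    {
      indicatorId: "2.3", // illustrative indicator reference
      quote: "Can anyone build on what Maya just said about the author's purpose?",
      timestampMs: 1_142_000,
    },
    {
      indicatorId: "2.4", // illustrative indicator reference
      quote: "Table three, your group is working with the scaffolded version of the problem set.",
      timestampMs: 1_587_000,
    },
  ],
  evaluatorOverride: null, // the human evaluator can adjust before finalizing
};
```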
"Will AI replace evaluators?"
No. Upraiser is designed to make evaluators more effective, not to replace them. A principal who spends 45 minutes writing up each evaluation can get a rubric-aligned draft in minutes. That does not eliminate the principal's role -- it shifts their time from paperwork to the professional conversations that actually improve teaching. AI classroom observation tools should augment human judgment, not substitute for it.
24 States, One Platform -- Every Rubric Built In
Upraiser currently supports rubric-aligned AI evaluation across 24 states. Each rubric implementation includes the state's specific domains, indicators, scoring levels, performance level descriptors, and critical attributes. This is not a template with state names swapped in -- each rubric is a distinct evaluation framework.
The full list of supported state frameworks:
- Mississippi -- PGS
- Arkansas -- TESS
- Kansas -- KEEP
- Tennessee -- TEAM
- Texas -- T-TESS
- Alabama -- ATOT
- Georgia -- TKES
- Ohio -- OTES 2.0
- North Carolina -- NCEES
- Pennsylvania -- Danielson FFT (Act 82)
- Illinois -- Danielson FFT
- Wisconsin -- Danielson FFT (2022)
- Kentucky -- PGES (Danielson)
- Maryland -- Danielson FFT
- Delaware -- DPAS II (Danielson)
- Hawaii -- EES (Danielson)
- New Mexico -- Elevate NM (Danielson)
- Idaho -- Danielson FFT
- South Dakota -- Danielson FFT
- Indiana -- RISE 3.0
- Massachusetts -- MA Educator Evaluation
- Nevada -- NEPF
- South Carolina -- SCTS 4.0
- Connecticut -- TPES
Ten of these states use variants of the Danielson Framework for Teaching, but each state's implementation differs in meaningful ways. Pennsylvania's Act 82 adaptation has different weighting than Illinois's version. Kentucky's PGES embeds Danielson within a broader professional growth system. Delaware's DPAS II adds state-specific indicators. A rubric-aware system handles these distinctions; a generic AI tool cannot.
Built by Educators Who Have Done This Work
There is a reason most AI teacher evaluation tools feel like they were built by engineers who read an article about education once. They were.
Upraiser was designed by a 17-year veteran educator and principal who has personally conducted hundreds of teacher evaluations. The difference shows up in details that only someone who has sat through a post-observation conference would think to build:
- Coaching vs. evaluation modes: Not every observation needs a rubric score. Upraiser supports both full rubric-scored evaluations and lightweight coaching observations that focus on growth-oriented feedback without formal ratings -- because in practice, most classroom visits are coaching, not evaluation.
- Evidence capture during observation: Audio recording, timestamped notes, and photo capture all happen from the same interface during the observation. No switching between apps, no remembering to save, no reconstructing what happened from memory later.
- Session workspace: In-person observations, sit-down coaching meetings, and remote sessions each have their own workflow because they are fundamentally different interactions. A sit-down meeting with a teacher about their growth goals is not the same as a formal classroom walkthrough.
- Action steps that carry forward: Coaching is longitudinal. When an evaluator sets an action step in one session, it appears in the next session for follow-up. The AI coaching summary references prior sessions to track growth over time, not just evaluate a single lesson in isolation.
This practitioner-driven design extends to the consulting group features built for organizations that support multiple schools. Contract management, engagement tracking, coach assignment, and compliance reporting are all built for the specific workflows that instructional consulting groups use -- because the team has lived those workflows.
What to Look for in AI Teacher Evaluation Software
If you are evaluating AI tools for teacher evaluation, here is a practical checklist based on what actually matters for valid, defensible evaluations:
- State rubric specificity: Does the tool know your state's exact rubric framework -- domains, indicators, scoring levels, and critical attributes? Or does it generate generic feedback that you have to manually map to your rubric?
- Human-in-the-loop design: Does the AI produce draft evaluations that evaluators review and approve? Or does it generate final outputs with no human oversight? Any tool that claims to fully automate evaluation should raise a red flag.
- Data privacy architecture: Where does classroom audio data go? Is it processed within a FERPA-compliant environment with proper data handling agreements? Or is it sent to a consumer AI service?
- Evidence chain: Can you trace every score back to specific transcript evidence aligned to specific rubric indicators? When a teacher asks "why did I get Developing on Domain 3?", can the system show the receipts?
- Coaching support: Does the tool support the full continuum from informal coaching visits to formal evaluations? Or does it only handle one mode?
- Built by practitioners: Was the tool designed by people who have actually conducted teacher evaluations? Or by a technology team that treats education as just another vertical?
See how Upraiser scores against your state's rubric
Watch a live evaluation using your specific state framework. Our AI knows every domain, indicator, and scoring level -- because it was built by educators who use them every day.
Request a Demo




