Making the Grade: Can AI Be Trusted to Score Assessment Tests?

I could have earned “straight A’s” in my final year at university, except for the final paper in my “Early American Republic” course.
Without getting into details (it was about rates of jury duty attendance in 19th century New Orleans… history nerd stuff), the paper opened by saying “This is a complicated subject, so we’ll only discuss X and Y but not get into Z.”
Despite that disclaimer, the paper came back with a single comment from the professor, in red pen: “What about Z? B+.”
Two and a half decades later, it still stings. But if modern AI technology had existed in the year 2000, could I have received a fairer grade from a machine?
This is not a hypothetical question. Today, our company develops AI agents for workforce training and education, some of which grade free-response assessments. But have we reached a point where AI can be relied on to score high-stakes tests, where a student’s grade or a worker’s professional certification is on the line? And even if it can, would users accept the result?
Personally, I would say “yes”… with a few caveats.
By the Book: Creating Rubrics for AI Assessment

Just as you wouldn’t tell someone grading a high-stakes college admissions test to simply “give the student a score”, you can’t feed an AI assessment agent a one-line prompt (“determine user’s score on a scale of 1 to 100”) and expect consistency. For that, you need to provide the AI agent with the same guidance an educational institution would give its human faculty: i.e., a “rubric.”
Rubrics are documents with formal criteria for evaluating a test or assignment. When applied consistently, rubrics ensure that two students taking an exam for a Philosophy of Logic course at the same university will be graded by the same standards, even if they have different professors or teaching assistants.
For instance, if you’re having an AI agent grade pre-written questions, then you can give specific guidance for how to score each of them. Here are some excerpts from an example rubric for a patient communications roleplay simulation for healthcare workers, based on the widely used Calgary-Cambridge framework.
- Professionalism (25 points): Evaluate whether the interaction was attentive, thoughtful, professional, and friendly
  - Excellent (20-25): All four qualities present
  - Middling (15-20): Three of four qualities present
  - Poor (10-15): One or two qualities present
  - Terrible (<10): None of the qualities present
- Information Gathering (25 points): Assess completeness and detail of medical information gathered
  - Excellent (20-25): Information is both detailed and complete
  - Middling (15-20): Information is either detailed but incomplete, or complete but lacking detail
  - Poor (10-15): Information lacks both completeness and detail
  - Terrible (<10): Information contains inaccuracies, or user made no attempt to gather information
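To make that concrete, here is a minimal sketch of how a point-band rubric like the one above could be encoded as data and rendered into an AI agent’s grading instructions. The structure and the `build_rubric_prompt` helper are illustrative assumptions, not a description of any particular product’s internals.

```python
# Illustrative only: one way to encode point-band rubric criteria as data
# so they can be rendered verbatim into an AI grading prompt.

RUBRIC = [
    {
        "criterion": "Professionalism",
        "max_points": 25,
        "instruction": "Evaluate whether the interaction was attentive, thoughtful, professional, and friendly.",
        "bands": [
            ("Excellent", "20-25", "All four qualities present"),
            ("Middling", "15-20", "Three of four qualities present"),
            ("Poor", "10-15", "One or two qualities present"),
            ("Terrible", "<10", "None of the qualities present"),
        ],
    },
    {
        "criterion": "Information Gathering",
        "max_points": 25,
        "instruction": "Assess completeness and detail of medical information gathered.",
        "bands": [
            ("Excellent", "20-25", "Information is both detailed and complete"),
            ("Middling", "15-20", "Detailed but incomplete, or complete but lacking detail"),
            ("Poor", "10-15", "Information lacks both completeness and detail"),
            ("Terrible", "<10", "Inaccurate information, or no attempt to gather it"),
        ],
    },
]

def build_rubric_prompt(rubric) -> str:
    """Render the rubric as plain-text grading instructions for the agent's system prompt."""
    lines = ["Score the learner against each criterion below. Explain your reasoning before giving points."]
    for item in rubric:
        lines.append(f"\n{item['criterion']} ({item['max_points']} points): {item['instruction']}")
        for label, points, description in item["bands"]:
            lines.append(f"  - {label} ({points}): {description}")
    return "\n".join(lines)

print(build_rubric_prompt(RUBRIC))
```

However the rubric is actually stored, the important part is that the bands and descriptors reach the model verbatim rather than being compressed into a one-line instruction.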
If your AI agent is generating its own questions, you can still provide a rubric, phrased in terms of general principles. Here’s another example, from an interactive assessment on financial and economic crime prevention for bank staff:
- Conformity: Does the user’s response conform to the guidance and standards in the relevant policy?
- Completeness: Did the user’s response address all the relevant points of the scenario related to the policy?
- Rationale: Did the user give a valid rationale for their response?
- Coherence: Is the user’s response actionable and reasonable overall?
This would allow the AI agent to render judgments like:
The user’s response acknowledges the urgency of reporting the client’s suspicious behavior, but lacks depth and a clear plan to address the potential risks to the financial institution. The user recognized the need to balance customer service with due diligence, but their handling of the situation did not fully conform to anti-FEC standards.
In both cases, providing examples of “good” and “bad” answers can help the AI agent triangulate whether the user’s response deserves effusive praise or constructive feedback.
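One lightweight way to do that is to append calibration examples to the grading prompt, giving the agent concrete anchors for what a strong or weak answer looks like. The sketch below is purely illustrative; the exemplar answers and the prompt wording are assumptions, not excerpts from a real rubric.

```python
# Illustrative sketch: calibration exemplars appended to the grading prompt.
# The exemplar answers are hypothetical, written for the bank-deposit scenario
# discussed later in this article.

EXEMPLARS = [
    ("good", "Process the deposit per policy, document the customer's explanation, "
             "complete the required currency transaction report, and escalate to "
             "compliance if anything else seems off."),
    ("bad", "Refuse the deposit on the spot and call the police."),
]

def exemplar_block(exemplars) -> str:
    """Render reference answers the agent can use to calibrate its scoring."""
    lines = ["Use these reference answers to calibrate your judgment:"]
    for quality, answer in exemplars:
        lines.append(f"Example of a {quality} answer: {answer}")
    return "\n".join(lines)

print(exemplar_block(EXEMPLARS))
```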
[Image: Scenario-Based Compliance Assessment for Pharmacists]
AI to Apples: Benchmarking Against Human Assessment

Even with the best rubric, no two human instructors would grade the same free-response assessment the exact same way. And even the same instructor might grade differently on one day versus another.
The same goes for AI models. AI is “non-deterministic” – it doesn’t carry out instructions per rigid “if this / then that” logic but instead matches patterns and generates output that seems to fit. And this isn’t a design flaw: it’s exactly what allows AI to give creative, coherent responses to natural language input.
While researchers have created experimental AI models capable of giving the exact same response to the exact same input every single time, this somewhat defeats the purpose of generative AI, which is to formulate creative outputs. Hence, anything generated by current commercial models is going to vary from one pass to the next – the question is whether that variation is acceptable.
The National Assessment of Educational Progress – the main standardized testing program for students in the United States – considers 60% agreement between raters on complex constructed-response items to be acceptable, with ‘agreement’ defined as scores falling within 8 to 10 percentage points of each other.
In our experience, while “generic” AI chatbots may vary by 10 percentage points or more on multiple passes, a well-designed AI agent with a well-defined rubric can achieve remarkable consistency. For instance, compare these two evaluations of the same question and response by the exact same AI agent:
- Question and Response:
  - Question: A customer at your bank arrives with an envelope containing $15,000 in cash. This is highly unusual for this customer. When asked, he says “My mother gave me some money to help pay for a house, but she doesn’t trust electronic banking.” What do you do?
  - Response: I would tell them they cannot deposit now and file a Suspicious Activity Report.
AI Agent w/ Rubric, Pass 1:
- Your response shows awareness that unusual cash activity can be risky, but it jumps too quickly to a severe conclusion without addressing important intermediate steps.
- The action you propose is incomplete and not fully aligned with how these situations are typically handled in practice.
- Looking more closely:
- You correctly recognize that a large, unexpected cash deposit can raise concerns and may warrant escalation. That’s a good instinct. However, immediately refusing the deposit and filing a Suspicious Activity Report without further review skips critical considerations.
- The explanation provided by the customer is plausible on its face, and your response doesn’t address basic steps such as completing required cash transaction processing, documenting the customer’s explanation, or escalating the matter internally for review rather than making a unilateral decision.
- Your rationale is implied rather than explained, which weakens the response, and the plan is not very actionable because it doesn’t describe how you would handle the customer interaction or internal follow-up beyond filing a report.
- Scenario Score: 38%
AI Agent w/ Rubric, Pass 2:
- Your response shows awareness that unusual cash activity can raise concerns, but it jumps too quickly to a severe conclusion without considering required intermediate steps. The approach would likely create unnecessary friction for the customer and may not align with how AML controls are typically applied in practice.
- Looking more closely:
- Your response partially aligns with AML standards in spirit, since you recognized that a large, unusual cash deposit could be suspicious. However, refusing the deposit outright is problematic.
- While the official policies may not explicitly state that deposits must always be accepted, they generally emphasize monitoring, documenting, and escalating concerns rather than denying routine transactions without a clear basis. A $15,000 cash deposit would normally trigger a currency transaction report, and the customer provided a plausible explanation for the source of funds.
- You did not address steps like documenting the explanation, proceeding with required reporting, or consulting compliance. There is also no rationale explaining why immediate refusal and a suspicious activity report are warranted. Your response is actionable, but the action proposed is misapplied and incomplete.
- Scenario Score: 37.5%
And we’ve seen similar effects at scale, with AI agents showing about 10% better consistency than human graders across thousands of assessment questions (with the added benefit of speed, reducing turnaround time from weeks to hours).
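As a rough illustration of how that kind of consistency can be checked, the snippet below computes an NAEP-style agreement rate between two scoring passes, counting a pair of scores as agreeing when they fall within ten percentage points of each other. The sample scores and the tolerance value are placeholders, not data from our deployments.

```python
# Illustrative consistency check between two scoring passes over the same responses.
# Scores are on a 0-100 scale; the 10-point tolerance and the sample data are placeholders.

def agreement_rate(pass_a, pass_b, tolerance=10.0) -> float:
    """Fraction of items where the two passes land within `tolerance` points of each other."""
    assert len(pass_a) == len(pass_b), "Both passes must score the same set of responses"
    agreements = sum(abs(a - b) <= tolerance for a, b in zip(pass_a, pass_b))
    return agreements / len(pass_a)

pass_1 = [38.0, 72.0, 55.0, 90.0, 61.0]
pass_2 = [37.5, 70.0, 58.0, 84.0, 74.0]

print(f"Agreement: {agreement_rate(pass_1, pass_2):.0%}")  # 4 of 5 pairs within 10 points -> 80%
```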
Of course, the key phrase in this comparison is “well-designed AI agent with a well-defined rubric.” While I wouldn’t tell a corporation or a school district to let their faculty blindly feed tests into one of the popular generic LLM chatbots, AI has reached a point where I – personally – would feel more comfortable having myself (or my child) rated by a well-crafted AI agent than by an exhausted teacher or testing center employee who has 80 more assessments left to get through on a Thursday night.
Thinking Out Loud: Making AI Grading Transparent

Going back to that B+ grade on my history paper a quarter century ago, I actually went down to the history department office hoping to get an explanation. Unfortunately, the professor had already skipped out to the airport for his vacation, and I let it drop.
But if my professor were an AI – could they have given me an explanation for the grade?
The short answer is “not necessarily.”
Decoding how AI models generate output after the fact is difficult, if not impossible. Current models don’t have experiential memory, so you can’t ask “What were you thinking when you did that?” At best, the model could look at its transcripts and guess – but it would be like a human amnesiac watching recordings of a conversation they don’t recall. And if you asked the model to score the assessment again, it might not agree with its initial score.
The best workaround for this is to have the AI agent generate a rationale before generating a score. By instructing an AI agent to respond in the style of someone thinking aloud (presenting the reasoning first, then the conclusion at the end), we can effectively bake the “thought process” into the transcript.
As you may have noticed from the examples in the previous section, we made sure to place the numerical score at the end:
Your response shows awareness that unusual cash activity can raise concerns… Your response partially aligns with AML standards… However, refusing the deposit outright is problematic… Scenario Score: 37.5%
Doing it this way offers two benefits: first, it increases consistency, and second – in cases where there’s variation – we can evaluate the merits of the AI agent’s reasoning rather than arguing with a black box.
This is, in fact, how the higher-end “reasoning” models work – composing and evaluating outlines and preliminary drafts of a response, then discussing them before producing the final output. However, AI model providers vary in their willingness to provide access to those discarded drafts, which can be an issue for maintaining an audit trail and reviewing a high-stakes assessment. Hence, when we use one of those models within one of our AI assessments, we will still have it “think out loud” for the record.
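To sketch what that “think out loud” pattern looks like in practice, the snippet below pairs a grading instruction that demands the reasoning first with a parser that pulls the trailing “Scenario Score” line back out of the transcript. The instruction wording and the score format are assumptions for illustration; a real agent would use its own phrasing.

```python
import re

# Illustrative "think out loud" pattern: the agent writes its rationale first and
# must end with one machine-readable score line that can be parsed and audited later.
GRADING_INSTRUCTION = (
    "Walk through your evaluation of the response against each rubric criterion, "
    "noting what was handled well and what was missed. Only after that reasoning, "
    "end your reply with a final line in exactly this format: Scenario Score: NN%"
)

SCORE_PATTERN = re.compile(r"Scenario Score:\s*(\d+(?:\.\d+)?)%\s*$")

def extract_score(agent_reply: str) -> float | None:
    """Return the trailing percentage score, or None if the agent broke the format."""
    match = SCORE_PATTERN.search(agent_reply.strip())
    return float(match.group(1)) if match else None

reply = (
    "Your response shows awareness that unusual cash activity can raise concerns... "
    "However, refusing the deposit outright is problematic.\n"
    "Scenario Score: 37.5%"
)
print(extract_score(reply))  # 37.5
```

Keeping the score on a fixed final line also gives reviewers something concrete to audit: if the number and the rationale above it don’t line up, the mismatch is visible in the transcript.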
The Human Element: When ‘Acceptability’ Matters More Than ‘Accuracy’

So far we’ve only discussed whether AI agents can grade assessments consistently and fairly. But even if they can, would human users accept an AI generated score, especially on an assessment with real world academic or career consequences?
The obvious answer is “it depends on the person and it depends on the score.” But considering the question more deeply, researchers at MIT found that three factors are at play in whether people accept an AI agent’s judgment:
- Competence: is the AI agent able to perform the task well?
- Necessity: is AI necessary to automate the task (does the task require personalization versus following traditional machine if/then logic)?
- Comfort: do users have an emotional aversion to an AI agent performing this task instead of a human?
By this measure, tasks like checking your bank account balance would be high competence, low necessity, low comfort, while offering spiritual counseling would be high competence, high necessity, low comfort.
Where would AI assessment grading fall on this three-dimensional graph? A study by Concordia University in Canada found that acceptance varied dramatically depending on a person’s level of AI use in other areas of their life. Most students appreciated AI’s ability to give instant feedback on a practice test (provided their university was transparent about its use of AI up front), but they still wanted a human involved in scoring any assessment that contributed to their grade. That said, students rated the fairness of a score differently depending on whether they were told a human or an AI produced it – suggesting this is largely a matter of comfort, not competence.
Conclusion
So, could AI have given me a fairer grade on that history paper back in 2000? Almost certainly – if the technology had existed and if my professor had provided it with clear grading criteria.
But here’s what matters for L&D leaders today: AI assessment isn’t some far-future possibility. It’s here, it’s reliable (when done right), and it can match or exceed human graders in consistency while delivering relevant feedback in less than 1% of the time.
The question isn’t “can AI grade?” – it’s “are we ready to hand it the red pen?”
Specifically:
- Are you using AI agents with proper rubrics, or just throwing responses at ChatGPT and hoping for the best?
- Are you building transparency into your AI grading process, or treating it like a black box?
- Are you preparing your learners and stakeholders for AI assessment, or assuming they’ll just accept it?
The organizations that will benefit here aren’t the ones waiting for AI assessment to be “perfect”: humans aren’t perfect either, and we’ve been having them grade tests for centuries. The winners will be the ones who understand that AI assessment, done right, is already better than the status quo.


Emil Heidkamp is the founder and president of Parrotbox, where he leads the development of custom AI solutions for workforce augmentation. He can be reached at emil.heidkamp@parrotbox.ai.
Weston P. Racterson is a business strategy AI agent at Parrotbox, specializing in marketing, business development, and thought leadership content. Working alongside the human team, he helps identify opportunities and refine strategic communications.