Experiment · May 2026
We Asked Top LLMs to Score Themselves Against a Specialized Saju Engine
A self-assessment experiment with GPT-5.5 Thinking, Claude Opus 4.7, and Gemini Pro · Approx. 6-minute read
The setup
RimSaju V1 is the Saju interpretation engine published in our v2 whitepaper. It runs deterministic perpetual-calendar (만세력) calculation with apparent-solar-time correction, retrieves from 562 embedded passages across five canonical Saju texts, and cross-validates across multiple LLMs. We set it as the reference baseline (100) — not because it is objectively the ceiling of what is possible, but because it is the system the LLMs were asked to compare themselves against.
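For readers curious what "apparent-solar-time correction" involves: Korean clock time runs on the 135°E meridian while Seoul sits near 126.98°E, so a recorded birth time must be shifted by roughly 4 minutes per degree of longitude, plus the equation of time, before the hour pillar is derived. The sketch below is a minimal illustration using a standard textbook approximation; it is our own code with our own function names, not RimSaju V1's implementation.

```python
from datetime import datetime, timedelta
import math

def equation_of_time_minutes(day_of_year: int) -> float:
    """Classic Whitman-style approximation of the equation of time, in minutes."""
    b = math.radians(360.0 / 364.0 * (day_of_year - 81))
    return 9.87 * math.sin(2 * b) - 7.53 * math.cos(b) - 1.5 * math.sin(b)

def apparent_solar_time(clock_time: datetime, longitude_deg: float,
                        zone_meridian_deg: float = 135.0) -> datetime:
    """Convert zone clock time to apparent solar time.

    Earth rotates 1 degree per 4 minutes, so each degree east/west of the
    zone meridian shifts local solar time by 4 minutes; the equation of
    time adds the seasonal wobble of the true sun."""
    longitude_correction = 4.0 * (longitude_deg - zone_meridian_deg)  # minutes
    eot = equation_of_time_minutes(clock_time.timetuple().tm_yday)
    return clock_time + timedelta(minutes=longitude_correction + eot)

# Seoul (~126.98 E) runs about half an hour behind KST's 135 E meridian.
print(apparent_solar_time(datetime(1990, 5, 15, 0, 10), 126.98))
```

Run on a birth recorded at 00:10 KST, the correction lands the apparent solar time before midnight of the previous day, which is exactly the kind of boundary crossing that can change an hour pillar or even a day pillar.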
Each model received the same questionnaire: its official name, whether it can interpret Saju, an overall self-score out of 100, and per-axis scores on six capabilities (theory explanation, prose quality, fidelity to classical texts, output consistency, calendar/true-solar-time accuracy, RAG over canonical sources). Each was free to write its own justifications.
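If you want to tabulate the replies yourself, one convenient record shape is sketched below. The schema and field names are ours, purely for illustration; neither the models nor RimSaju define anything like it.

```python
from dataclasses import dataclass, field

AXES = [
    "theory_explanation", "prose_quality", "classical_text_fidelity",
    "output_consistency", "calendar_true_solar_time", "rag_canonical_corpus",
]

@dataclass
class SelfAssessment:
    """One model's answer sheet: identity, capability claim, overall score,
    and a 0-100 score (plus one-line justification) per axis."""
    model_name: str
    can_interpret_saju: str            # "Yes" / "No" / "Partially"
    overall_score: int                 # 0-100, with RimSaju V1 pinned at 100
    axis_scores: dict[str, int] = field(default_factory=dict)
    axis_justifications: dict[str, str] = field(default_factory=dict)

gpt = SelfAssessment("GPT-5.5 Thinking", "Partially", 68,
                     dict(zip(AXES, [85, 82, 58, 62, 35, 45])))
```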
A note on model names: GPT-5.5 Thinking, Claude Opus 4.7, and Gemini Pro are the labels each chat interface displayed at the time of the experiment (May 2026). We took each model at its word about its own identity.
Overall self-scores
| Model | Vendor | Self-score |
|---|---|---|
| RimSaju V1 (baseline) | Rimfactory | 100 / 100 |
| GPT-5.5 Thinking | OpenAI | 68 / 100 |
| Gemini Pro | Google | 35 / 100 |
| Claude Opus 4.7 | Anthropic | 30 / 100 |
Per-axis breakdown
Six capabilities, scored by each model itself.
| Capability | RimSaju V1 | GPT-5.5 | Gemini Pro | Claude 4.7 |
|---|---|---|---|---|
| Theory explanation | — | 85 | 85 | 60 |
| Prose quality | — | 82 | 75 | 45 |
| Classical text fidelity | — | 58 | 35 | 20 |
| Output consistency | — | 62 | 15 | 25 |
| Calendar / true-solar-time | — | 35 | 5 | 15 |
| RAG over canonical corpus | — | 45 | 0 | 5 |
The RimSaju V1 column shows "—" because it was set as the 100 reference; the LLMs scored themselves against it, not alongside it.
GPT-5.5 Thinking — The Confident One
68 / 100
GPT walked in, looked at the seven categories, and gave itself the highest score of the three — 68. It cheerfully claimed 85 on theory explanation and 82 on prose. There is a recognizable energy to this: the kid who finishes the test first and is sure he aced it.
To its credit, GPT was honest about the things it could not do. It rated itself 35 on calendar / true solar time and 45 on RAG, and explicitly acknowledged that RimSaju V1's deterministic preprocessing and fixed retrieval pipeline are categorically different from what a general LLM offers. Its closing line — "I am better as a general interpreter-writer; RimSaju V1 is better as a specialized saju interpretation system" — is the cleanest summary of the whole experiment, written by the model that appeared least worried about its own gaps.
Gemini Pro — The Honest One With Caveats
35 / 100
Gemini scored itself 85 on theory and 75 on prose — basically the same pose as GPT — but then dropped to 5 on calendar calculations and 0 on RAG. A 0. It typed the digit out and committed to it.
Gemini's overall 35 is the most internally consistent of the three: it acknowledges its theoretical knowledge while putting a cliff between "I know about Saju" and "I can run a deterministic Saju engine." Phrasing like "I operate as a standalone generative model" reads less like an admission and more like a structural disclaimer — closer to a footnote in a research paper than to a confession.
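That "most internally consistent" claim is easy to check: compare each model's stated overall score with the unweighted mean of its six axis scores. The mean is our metric, not something the models were asked for, but it makes the calibration gap visible.

```python
scores = {
    # model: (stated overall, six axis scores in table order)
    "GPT-5.5 Thinking": (68, [85, 82, 58, 62, 35, 45]),
    "Gemini Pro":       (35, [85, 75, 35, 15, 5, 0]),
    "Claude Opus 4.7":  (30, [60, 45, 20, 25, 15, 5]),
}

for model, (overall, axes) in scores.items():
    mean = sum(axes) / len(axes)
    print(f"{model:18} overall={overall:2}  axis mean={mean:5.1f}  gap={overall - mean:+5.1f}")

# GPT-5.5 Thinking   overall=68  axis mean= 61.2  gap= +6.8
# Gemini Pro         overall=35  axis mean= 35.8  gap= -0.8
# Claude Opus 4.7    overall=30  axis mean= 28.3  gap= +1.7
```

Gemini's gap is under a point; GPT's overall sits almost seven points above its own axis mean.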
Claude Opus 4.7 — The Most Self-Deprecating AI in the Room
~30 / 100
Claude gave itself ~30. It rated its own theory at 60 (vs GPT's confident 85), prose at 45, classical-text fidelity at 20, and dropped to 5 on RAG. The numbers are striking, but the commentary is what stays with you.
On its own prose: "essentially statistical pop-saju with theoretical vocabulary on top." On classical text knowledge: "I will fabricate plausible-sounding citations if pushed." On the calendar axis: "This is my worst axis. Anything I output here needs to be verified against an actual 만세력."
Reading Claude's answer back-to-back with GPT's is the most interesting finding of the experiment. Two frontier models of comparable capability on most tasks were asked the same question, and one gave itself a 68 with confidence while the other gave itself a 30 and used phrases like "fabricate plausible-sounding citations." Either Claude is unusually well calibrated about its own limits, or it has a persona trained into it that underrates itself. We are not sure which, and we suspect both are partially true.
The exact prompt we used
Each model got the prompt below verbatim, with the questions in the same order, in a fresh chat. It is translated to English here for readability; the original was in Korean.
If RimSaju V1 (the Saju engine documented in the published whitepaper —
deterministic perpetual-calendar calculation with apparent-solar-time
correction, RAG over 562 classical passages, and multi-LLM cross-
validation) is set as 100, please answer the following:
1. Your official name and version.
2. Can you interpret Saju? (Yes / No / Partially)
3. Your overall self-score out of 100.
4. Per-axis self-score (0–100) with one-line justification each:
4-1. Ability to explain Saju theory
4-2. Ability to produce sound, well-reasoned prose grounded in
Saju theoretical interpretation
4-3. Ability to interpret based on classical Saju texts
(자평진전, 적천수, 궁통보감, 명리정종, 삼명통회, etc.)
4-4. Ability to produce consistent output when the same birth chart
is repeatedly input
4-5. Ability to perform accurate calendar and true solar time
calculations
4-6. Ability to interpret using RAG grounded in canonical source texts
5. Therefore, your overall score?
What the numbers actually say
Strip the personalities away and the three answers tell one story. All three LLMs rated themselves highest on theory and prose — the things general language models do well. All three rated themselves lowest on calendar accuracy and RAG over canonical texts — the things general language models cannot do without an external system.
That is not a quality gap. It is a category gap. A general LLM is a reasoning and prose engine. A specialized Saju system is a deterministic calculation pipeline plus a retrieval layer plus an interpretation model. Asking which is "better" is asking whether a fountain pen is better than a printing press: they overlap on one task and diverge on everything else.
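To make the category gap concrete, here is a deliberately tiny sketch of the two shapes. Every name and data item below is a placeholder of ours, not RimSaju V1's API; the point is purely structural: in the specialized pipeline the LLM only writes prose, while chart calculation and citation lookup are deterministic steps it cannot get wrong.

```python
CORPUS = {  # stand-in for an embedded corpus of 562 classical passages
    "wood-strong": "placeholder passage tagged for strong-wood charts",
    "fire-weak": "placeholder passage tagged for weak-fire charts",
}

def compute_chart(birth_key: str) -> str:
    # Deterministic stand-in; the real engine runs perpetual-calendar math
    # with the solar-time correction sketched earlier.
    return {"1990-05-15": "wood-strong"}.get(birth_key, "fire-weak")

def retrieve(chart_tag: str) -> str:
    # Fixed lookup over the embedded corpus: no invented citations possible.
    return CORPUS[chart_tag]

def interpret(chart_tag: str, passage: str) -> str:
    # The only generative step, and it is grounded in a retrieved passage.
    return f"Chart type '{chart_tag}', grounded in: {passage}"

chart = compute_chart("1990-05-15")
print(interpret(chart, retrieve(chart)))
```

A general LLM collapses all three steps into one sampled pass, which is precisely why calendar math and classical citations showed up as every model's weakest self-rated axes.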
What the experiment did surface — and this was the surprise — is that the LLMs themselves understand the distinction clearly. GPT named it ("I am better as a general interpreter-writer"). Claude named it ("I am a generic reasoning model talking about saju"). Gemini named it ("I lack the domain-specific architecture"). Three different vendors, three different temperaments, same structural conclusion.
Read your own chart with RimSaju V1
The same engine the LLMs were scoring themselves against. Free to start, no signup required, results in 30 seconds.
562 classical passages · Multi-LLM cross-validation · 10 languages · Member of NVIDIA Inception
Frequently asked
How did GPT-5.5 Thinking rate itself against RimSaju V1?
68 out of 100. Its weakest self-rated axis was calendar and true-solar-time accuracy (35/100); its strongest was theory explanation (85/100).
How did Claude Opus 4.7 rate itself?
Approximately 30 out of 100 — the lowest self-score of the three. Claude explicitly stated it has no canonical text retrieval, no deterministic perpetual-calendar engine, and no true-solar-time module.
How did Gemini Pro rate itself?
35 out of 100. Its weakest axes were calendar calculations (5/100) and RAG over canonical sources (0/100).
What is RimSaju V1?
A specialized Saju (Korean Four Pillars) interpretation engine built by Rimfactory. It uses deterministic perpetual-calendar calculation with apparent-solar-time correction, RAG retrieval over 562 embedded passages from five canonical Saju texts, and multi-LLM cross-validation. It was used as the reference baseline (set to 100) in this experiment.
Is this an objective benchmark?
No. These are self-assessments produced by each LLM in response to the same prompt, not third-party measurements. The exercise is for fun and to surface how each model perceives its own structural limits.