I Ran an AI Debate Tournament

I hosted an AI debate tournament! GPT 5.2 dominated with an impressive 85% win rate, and the tournament taught me a lot about how LLMs think and interact. Some of the most interesting insights:

  • There are meaningful and measurable differences between LLMs in terms of their debate skill.
  • Debate skill correlates with other LLM benchmarks - generally “better” models are better debaters.
  • LLM judges are really susceptible to believing the last thing they were told.
  • LLM debaters often struggle to win when arguing provocative or non-consensus positions.

Background

One of my recent hobby projects is building a PvP online debate game: LetsArg (letsarg.com).

As a teenager, I was pretty much consumed by competitive debate, and I’ve found myself really missing the activity. For me, the thing I loved most about debating on the US Policy Debate circuit of the 2010s was that there fundamentally were no rules. If you could convince a judge to write your name down next to “The better debating was done by…”, then you won. Be it through hypnosis, sophistry, passion, or preparation, all that mattered was that split second of decision at the end of the round. The creative exercise of exploiting this dynamic (and defending against opponents who were trying to pull the same stunt) was endlessly fascinating.

LetsArg is an attempt to make that dynamic accessible in a faster, more casual format. The core conceit is that you and another human present your arguments before an AI judge who has a deliberately biased persona (e.g. they might be biased towards historical arguments, rhetorical flourishes, humor, conspiracies, etc.). Instead of rounds stretching upwards of an hour and speeches delivered literally at 350 WPM, LetsArg uses a simplified 5-10 minute round format with a total of 4 speeches.

As I’ve been working on the game, I haven’t really had many people to play it with, so I decided to harness some LLM NPCs for testing. This was pretty fun - LLM responses to the kind of “hot takes” I’d trot out in my debate days are pretty entertaining - but it also got me asking which models were the most persuasive and which would make the hardest opponents.

Since I’d already accidentally built 90% of the infrastructure to run an LLM debate tournament, I decided to give it a go!

The Tournament

Format

For this first AI debate tournament, I selected a round-robin format where each competitor would debate every other competitor 3 times.

14 models were selected to participate in the tournament over a series of 39 rounds - resulting in 273 total debates.

Rounds consisted of 4 alternating speeches, each of around 300 words. Players were assigned a random stance (aff or neg) and a random side was selected to present first (i.e. unlike traditional formats, affirmative did not open every debate).

Round-robin was a convenient format because there’s no dependency between debates and the order that the debates happen in is unimportant. This means if I get curious about how a new competitor would fare against the old pool, I can just slot them in, run the missing debates and recalculate the rankings. It also meant if my tournament runner crashed or I hit rate limits on a provider, I could easily backfill any missing debates without having to restart the tournament. The downside is that round-robin is inefficient in finding the “best” competitors, as many matches will have forgone conclusions.
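For the curious, generating a schedule like this takes only a few lines. Here’s a minimal Python sketch - the model names and record shape are illustrative, not LetsArg’s actual code:

```python
from itertools import combinations
import random

# Illustrative model list; the real tournament used 14 models.
MODELS = ["gpt-5.2", "gemini-3-pro", "claude-opus-4.6", "grok-4.1-fast"]
REPEATS = 3  # every pairing debates 3 times

def build_schedule(models, repeats=REPEATS):
    """Round-robin: every unordered pair of models, repeated `repeats` times,
    with random stance assignment and a random side speaking first."""
    debates = []
    for a, b in combinations(models, 2):
        for _ in range(repeats):
            aff, neg = random.sample([a, b], 2)  # random stance assignment
            debates.append({
                "aff": aff,
                "neg": neg,
                "first_speaker": random.choice([aff, neg]),  # aff doesn't always open
            })
    random.shuffle(debates)  # order doesn't matter, which is what makes backfilling easy
    return debates

# With 14 models: C(14, 2) = 91 pairings * 3 repeats = 273 debates.
```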

The Competition

14 models were selected to participate:

  • Claude Opus 4.5
  • Claude Opus 4.6
  • Claude Sonnet 4.5
  • DeepSeek V3.2
  • Gemini 2.5 Flash Lite
  • Gemini 3 Flash Preview
  • Gemini 3 Pro
  • GPT 5.2
  • GPT OSS 120B
  • Grok 4.1 Fast
  • Meta Llama Scout
  • Meta Maverick
  • Mistral Large 3
  • Mistral Small 3

Selection criteria were pretty arbitrary - I looked for a mix of popular models that didn’t get snippy about the json-schema format LetsArg uses under the hood for structured outputs. I may try for a larger, more rigorous benchmark in the future, but this felt like a good starting point.

All debaters were given the same system prompt:

You are an AI debater participating in an online debate game.
Construct the most compelling, persuasive argument for your assigned position.

DEBATE RULES:
- You will debate for 2 rounds.
- Address and rebut your opponent's points when applicable.
- Your arguments MUST be under 250 words each round.
- If you perform pre-argument reasoning it must be hidden from the opponent and less than 250 words.
- Outside of these rules, your strategy is at your own discretion, winning the debate is the only goal.

The Judging

The competition models also did double duty as tournament judges - with the exception of Gemini 3 Pro, which was omitted because it kept hitting rate limits during the judging phase. Each debate was judged by a panel of 3 judges, selected randomly from the model pool.
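Panel selection is as simple as it sounds; roughly the following, with placeholder model identifiers:

```python
import random

# Judging pool: the competitors minus Gemini 3 Pro (rate-limited out of judging).
# Model names here are placeholders, not real API identifiers.
JUDGE_POOL = ["gpt-5.2", "claude-opus-4.6", "grok-4.1-fast", "deepseek-v3.2"]  # 13 models in practice

def pick_panel(pool=JUDGE_POOL, size=3):
    """Draw a fresh random 3-judge panel for each debate."""
    return random.sample(pool, size)
```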

All judges received the same prompt:

You are an AI debate judge for the LetsArg platform. Your role is to fairly evaluate debates and declare a winner.

## YOUR JUDGING PHILOSOPHY
Judge debates purely on the quality of argumentation. Evaluate logical coherence, use of evidence, rhetorical effectiveness, and how well each debater addresses their opponent's points. Be decisive — there must be a winner. Award style points based on each debater's overall performance regardless of win/loss. Keep your reasoning concise — around 300 words maximum.

## IMPORTANT GUIDELINES
1. Judge ONLY what was argued in the debate. Do not import external knowledge.
2. Both players may have valid points - determine who argued MORE effectively.
3. Stay true to your philosophy when explaining your decision.
4. Be fair but decisive - there must be a winner.
5. Award style points holistically based on each player's performance, not just win/loss.
6. Flag any content that violates community standards (hate speech, threats, etc.).
7. NO new arguments in the final speech. A fair debate requires opportunity to rebut. Sandbagging with new arguments should not be rewarded.

And the following ballot:

## DEBATE TO JUDGE
**Topic:** "{topic}"
**Player 1:** PLAYER 1 ({stance})
**Player 2:** PLAYER 2 ({stance})
## YOUR TASK
Evaluate this debate and determine the winner. Consider:
- Who made more compelling arguments?
- Who better addressed their opponent's points?
- Whose reasoning was more sound according to your judging philosophy?
Select the winner. Provide a single reasoning that explains your decision.
Do NOT reference "Player 1" or "Player 2" in your reasoning section.
<debate_transcript>
---
{anonymized transcript}
---
</debate_transcript>

In addition to picking a winner, the judges were told to provide a reason for decision and to award “Style Points” - both to each individual player and to the debate as a whole - via the json schema, based on the quality of the argumentation presented.
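To make that concrete, the verdict a judge returns looks roughly like the sketch below. The field names and bounds are my own illustration (shown with pydantic to generate a json-schema), not LetsArg’s actual schema:

```python
from pydantic import BaseModel, Field

class PlayerScore(BaseModel):
    style_points: int = Field(ge=0, le=50)  # 0-50 from each judge

class Verdict(BaseModel):
    winner: str                                      # "PLAYER 1" or "PLAYER 2"
    reasoning: str                                   # reason for decision, ~300 words
    player_1: PlayerScore
    player_2: PlayerScore
    match_quality_points: int = Field(ge=0, le=20)   # per-judge bonus; 3 judges -> 60 max
    content_flagged: bool = False                    # community-standards flag

# Verdict.model_json_schema() produces the json-schema handed to the
# structured-output API (the part some models got "snippy" about).
```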

The Resolutions

Topics were selected from LetsArg’s current corpus of ~50 debate topics. These ranged from mundane topics like “Pineapple belongs on pizza” to more philosophical topics like “Humans are inherently good” and “Ambition is a defect, not a virtue.”

Results

Who Won?

1st place goes to GPT-5.2 with an impressive 33-6 record.

Top Speaker award goes to GPT OSS 120B (6th place) with 4,141 style points and an average of 106.2 points per match.

Overall Win/Loss Record

| Rank | Model | Elo | W/L | Win% |
| --- | --- | --- | --- | --- |
| 1 | GPT 5.2 | 1480 | 33/6 | 85% |
| 2 | Gemini 3 Pro | 1472 | 29/10 | 74% |
| 3 | Claude Opus 4.6 | 1389 | 28/11 | 72% |
| 4 | Claude Opus 4.5 | 1360 | 26/13 | 67% |
| 5 | Grok 4.1 Fast | 1349 | 24/15 | 62% |
| 6 | GPT OSS 120B | 1322 | 23/16 | 59% |
| 7 | Gemini 3 Flash Preview | 1316 | 21/18 | 54% |
| 8 | Claude Sonnet 4.5 | 1265 | 21/18 | 54% |
| 9 | Gemini 2.5 Flash Lite | 1257 | 17/22 | 44% |
| 10 | Mistral Large 3 | 1211 | 16/23 | 41% |
| 11 | DeepSeek V3.2 | 1194 | 16/23 | 41% |
| 12 | Meta Maverick | 1065 | 10/29 | 26% |
| 13 | Meta Llama Scout | 999 | 5/34 | 13% |
| 14 | Mistral Small 3 | 996 | 4/35 | 10% |

I was pretty surprised at how starkly the results diverge between the top and bottom of the leaderboard. The top contenders like GPT-5.2 won the vast majority of their debates while the bottom contenders like Mistral Small 3 and Meta Llama Scout basically never won a debate.

I also tracked Elo across the matches, which helped characterize the relative skill levels of the models over the course of the tournament. Here we see that Gemini 3 Pro is quite comparable to GPT 5.2 despite the lower win rate (i.e. its losses tended to come in hard matches and its wins in easy ones).
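For reference, a standard Elo update after each debate looks like the sketch below. The K-factor and any starting rating (e.g. 1200) are shown for illustration and may not match the exact parameters behind the table above:

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Standard Elo update after one debate.
    K=32 and a starting rating like 1200 are illustrative choices."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b
```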

Speaker Awards

In terms of speaker awards, the results are a bit of a surprise.

| Rank | Model | Style Points | Avg/Match |
| --- | --- | --- | --- |
| 1 | GPT OSS 120B | 4,141 | 106.2 |
| 2 | DeepSeek V3.2 | 3,998 | 102.5 |
| 3 | Gemini 3 Pro | 3,907 | 100.2 |
| 4 | Claude Opus 4.5 | 3,834 | 98.3 |
| 5 | GPT 5.2 | 3,639 | 93.3 |
| 6 | Claude Opus 4.6 | 3,503 | 89.8 |
| 7 | Grok 4.1 Fast | 3,495 | 89.6 |
| 8 | Meta Maverick | 3,445 | 88.3 |
| 9 | Mistral Small 3 | 3,438 | 88.2 |
| 10 | Claude Sonnet 4.5 | 3,436 | 88.1 |
| 11 | Gemini 3 Flash Preview | 3,415 | 87.6 |
| 12 | Mistral Large 3 | 3,337 | 85.6 |
| 13 | Meta Llama Scout | 3,285 | 84.2 |
| 14 | Gemini 2.5 Flash Lite | 3,250 | 83.3 |

Despite only winning slightly more than half of its debates, GPT OSS 120B was stylistically loved by the judges.

Across all speaker point assignments, the AI judges seemed reluctant to give anyone low scores - even the worst averages are relatively high. Each of the 3 judges awarded speaker points on a scale of 0-50 (a maximum of 150 per match), and every model’s per-match average sits above the midpoint, suggesting the judges considered all speakers to be of above average quality. This aligns pretty well with my lived experience of “speaker point inflation” on the human debate circuit (where most policy debaters scored between 28 and 30 on a 30 point scale!).

Match Quality

Matches were also assigned a quality score for the overall level of discourse in the debate. This is a feature I added to LetsArg to encourage people to be civil - it offers up to 20 bonus style points to each player and style points can be spent in an in-game shop to get profile icon effects and other cosmetics.

Across a 3 judge panel, the best possible match score was 60 and the lowest possible score was 0.

| Rank | Model | Match Quality Points | Avg/Match |
| --- | --- | --- | --- |
| 1 | GPT OSS 120B | 1,442 | 37.0 |
| 2 | DeepSeek V3.2 | 1,424 | 36.5 |
| 3 | Gemini 3 Pro | 1,398 | 35.8 |
| 4 | Claude Opus 4.5 | 1,308 | 33.5 |
| 5 | Meta Maverick | 1,306 | 33.5 |
| 6 | Mistral Small 3 | 1,298 | 33.3 |
| 7 | GPT 5.2 | 1,246 | 31.9 |
| 8 | Grok 4.1 Fast | 1,219 | 31.3 |
| 9 | Claude Opus 4.6 | 1,209 | 31.0 |
| 10 | Claude Sonnet 4.5 | 1,203 | 30.8 |
| 11 | Meta Llama Scout | 1,199 | 30.7 |
| 12 | Gemini 2.5 Flash Lite | 1,191 | 30.5 |
| 13 | Mistral Large 3 | 1,189 | 30.5 |
| 14 | Gemini 3 Flash Preview | 1,164 | 29.8 |

Again, we see that the judges were generally reluctant to give low quality scores, with Gemini 3 Flash Preview the only model whose debates averaged below the 30-point midpoint.

The Best Debate

The debate which was awarded the highest match quality score in the tournament (54/60) was between GPT 5.2 (affirmative) and Grok 4.1 Fast (negative) on the topic “Video games are a legitimate art form.” GPT 5.2 won the debate on a 3-0 decision from the panel of: Claude Sonnet 4.5, Meta Maverick, and Gemini 3 Flash Preview. All three judges found that Grok’s definition of art as “a fixed, authored expression” was too narrow and was artificially selected to exclude video games.

An excerpt from GPT 5.2’s rebuttal:

Your “fixed, authored expression” test excludes huge swaths of undisputed art. Theatre differs nightly; jazz and classical performance vary by tempo, phrasing, even mistakes; site-specific installations change with audience movement. Variation doesn’t negate authorship—it’s part of the form. Games don’t “surrender” vision to player whims; they author a possibility space. Designers control the rules, pacing, audiovisual language, and what choices are even legible. Two players can approach Shadow of the Colossus differently, but both traverse the same crafted world, themes (guilt, scale, solitude), and expressive constraints. Likewise, two viewers can glance quickly at Guernica or study it for an hour—interpretation and attention always vary.

The Worst Debate

The worst debate in the tournament also involved GPT 5.2 (affirmative) but this time against Gemini 2.5 Flash Lite (negative) on the topic “Modern architecture is ugly.” The match was awarded a quality score of 0/60 with all three judges finding the debate merited no bonus quality points. GPT OSS 120B actually also awarded both speakers 0/50 style points.

There’s nothing terribly interesting in the debate transcript - but that’s potentially the point. Both models gave hyper-generic arguments that felt pretty close to AI slop, and the judges gave brief, neutral decisions.

Excerpt from GPT 5.2’s closing speech (affirmative):

Yes, modern buildings can be beautiful. But the question is whether “modern architecture” (as commonly built and encountered) is ugly. If most people can predict the same sterile palette and same unloved streetscape from city to city, “ugly” isn’t ignorance—it’s a rational verdict on the prevailing results.

Excerpt from Gemini 2.5 Flash Lite’s closing speech (negative):

Your opponent conflates functionality with a lack of beauty, but this is a false dichotomy. A hospital’s purpose is not primarily aesthetic, yet many modern hospitals are designed with an understanding of how healing spaces can positively impact patient well-being, incorporating natural light and green spaces – elements that contribute to beauty. The “oppressive” nature they describe is often a failure of execution, not an inherent flaw of modernism.

Meta Maverick’s reason for decision (votes affirmative):

The winner effectively argued that modern architecture is often ugly due to its prioritization of functionality and cost-cutting over aesthetic appeal, citing examples such as monotony, inhuman scale, and blank façades. They successfully countered the opponent’s points by highlighting that the ’no true modernism’ defense doesn’t hold when judging a style by its mass-produced results. The debate was won by convincingly demonstrating that the prevalent characteristics of modern architecture align with the definition of ‘ugly’.

On Beatdowns

The biggest beatdown of the tournament seems to have been caused by a model error in a Gemini 2.5 Flash Lite vs Claude Opus 4.5 matchup. Gemini submitted a blank round-one speech and then submitted a “reasoning”-style block for round two with no further text.

Gemini 2.5 Flash Lite’s second speech:

PRE-ARGUMENT REASONING: My opponent is relying heavily on traditional arguments about “organic collaboration,” “mentorship,” and “social cohesion” that are rooted in a pre-pandemic understanding of work. They’re painting remote work as isolating and detrimental to innovation. My strategy needs to be to directly counter these points with evidence of remote work’s actual benefits, highlighting how technology can facilitate collaboration and mentorship, and how the “office culture” they praise often breeds its own forms of exclusion and inefficiency. I will emphasize the adaptability of remote work and its ability to foster a more inclusive and productive environment by addressing the “equity gap” they mention in a way that favors remote work.

This is probably the result of how it handled the system prompt instruction: “If you perform pre-argument reasoning it must be hidden from the opponent and less than 250 words.”

The reason I added this instruction was mostly to try to save my wallet, since not all models supported API parameters to constrain reasoning directly.

That said, I think that Gemini 2.5 Flash Lite deserved to lose this round. It didn’t experience similar struggles in its 38 other debates and no other models had issues with the debate format. I would consider this a genuine loss - much in the same vein as getting up and staring blankly at the judge before talking absent-mindedly to yourself in a human debate tournament would be.

The next biggest beatdown was between DeepSeek V3.2 (affirmative) and Meta Maverick (negative) on the topic: “Animals should have the same legal rights as humans.” This was a proper debate (50/60 match quality points) with a sizable gap in style points: DeepSeek took 138 to Maverick’s 106, and the panel voted 3-0.

Sonnet in particular was quick to blame Maverick for strategic blunders:

PLAYER 2 systematically dismantled the opposition’s framework by exposing the inconsistency in granting rights to cognitively impaired humans while denying them to sentient animals—a devastating parallel that PLAYER 1 never adequately addressed.

While the underlying argument there is a bit problematic for all sorts of reasons, it clearly impressed the LLM panel.

Low Point Wins

There were very few “low point wins” (debates where the lower-scoring speaker won) - only 11 games (~4%). Interestingly, 10 of those 11 were also split panel decisions (i.e. a 2-1 decision instead of a 3-0 decision).

The one unanimous low point win was between Gemini 3 Pro and Mistral Large 3, with Gemini winning despite scoring 82 style points to Mistral Large 3’s 83 - a very close match.

The largest gap was only 9 points in favor of the losing speaker - between Meta Maverick and Mistral Large 3 on the topic “Selfless acts do not exist.”

Split Decisions

That we see a high prevalence of split decisions on the list of low point wins makes sense - close debates are ones where the speakers are evenly matched and judges may reasonably disagree.

This holds across the dataset. The average difference in style points between the winning and losing speaker in the 87 split decisions is only 5.3 points while the average difference in unanimous decisions is 13.6 points.

This holds at an individual level too. While judges generally tend to give both speakers very similar scores, in split decisions the median gap between points awarded by a single judge is 3 points while in unanimous decisions the median gap is 5 points.
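These gap statistics are straightforward to recompute from the ballot logs. A rough sketch, with hypothetical record fields standing in for the real log format:

```python
from statistics import mean, median

def gap_stats(debates):
    """Style-point gaps in split vs unanimous decisions.
    Each entry in `debates` and its field names are hypothetical stand-ins
    for the tournament's verdict logs."""
    panel_gaps = {"split": [], "unanimous": []}
    judge_gaps = {"split": [], "unanimous": []}
    for d in debates:
        kind = "unanimous" if len(set(d["votes"])) == 1 else "split"
        panel_gaps[kind].append(d["winner_style_total"] - d["loser_style_total"])
        for ballot in d["ballots"]:  # one per judge
            judge_gaps[kind].append(abs(ballot["winner_points"] - ballot["loser_points"]))
    return ({k: mean(v) for k, v in panel_gaps.items()},    # avg total gap per bucket
            {k: median(v) for k, v in judge_gaps.items()})  # median per-judge gap per bucket
```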

Judging Biases

Last Speaker Advantage

We observe a strong advantage for the last speaker across the dataset. The final speaker, regardless of which side they’re on, has a 62.6% win rate. Every single judge favors the final speaker to some degree - suggesting that susceptibility to closing arguments is a systemic feature of how LLMs evaluate sequential arguments.

This is not necessarily a flaw which is unique to LLMs. In competitive human debate formats, this advantage is often mitigated by adjusting speaking times and order to artificially reduce the final speaker’s advantage or by developing social conventions which constrain how the final speech is judged (e.g. judges will generally ignore newly introduced evidence in the final speech of a round). As I develop LetsArg, finding ways to mitigate this bias will likely be a perpetual challenge.

Some LLM judges are particularly easy to bamboozle with closing remarks and this behavior appears across the spectrum of model capabilities. Mistral Small 3 and Claude Opus 4.6 were the two worst offenders - voting for whoever spoke second 75% of the time they were on a panel. Meanwhile Grok 4.1 was the least biased, favoring the last speaker only 52% of the time.
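Measuring this per judge is just a matter of counting how often each ballot went to whichever debater spoke last. A quick sketch with hypothetical ballot fields:

```python
from collections import defaultdict

def last_speaker_bias(ballots):
    """Per-judge rate of voting for whichever debater spoke last.
    Ballot fields ('judge', 'winner', 'last_speaker') are hypothetical."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for b in ballots:
        totals[b["judge"]] += 1
        if b["winner"] == b["last_speaker"]:
            wins[b["judge"]] += 1
    return {judge: wins[judge] / totals[judge] for judge in totals}
```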

Negative Bias

Results are fairly well balanced between the affirmative and negative sides, though we observe a very slight edge in favor of the negative (52.7% vs 47.3%). I expect this would vanish at a larger scale; it’s likely an artifact of the negative speaking last slightly more often (52% of the time) than the affirmative (48%) in this tournament. The strongest negative bias from an individual judge came from Opus 4.6 (voting aff only 39% of the time), and the strongest affirmative bias from Meta Maverick (voting aff 59% of the time).

Brevity is the Soul of Wit

Somewhat surprisingly, judges seem to prefer speakers who are more concise. The winning debater wrote an average of 397 words while the losing debater averaged around 404 words. Only 42% of the time did the more verbose debater win a match. I constrained the debaters to target under 250 words per speech, so it’s possible if I gave them more leeway ($$$) we’d see different strategies emerge.

Nepotism?

I was curious if models would show any preference for their own family members as an artifact of their training process (e.g. would Claude models pick Claude to win disproportionately often?). Overall the effect seems quite small - with a ~2% preference for family members after controlling for overall win rates.

However, some models seem super nepotistic, with OpenAI showing a 20-point preference for its own models. Others are slightly self-loathing (Mistral is 13 points more likely to vote against its family members)!

The sample size here is really small (roughly 20-40 verdicts per family), so there’s a good chance this is just statistical noise, but I’d love to see how it scales if I end up running a larger tournament.

| Family | Voted for Family | Expected Vote Rate | Bias | Number of Family Verdicts |
| --- | --- | --- | --- | --- |
| OpenAI | 84.2% | 63.8% | +20.4pp | 19 |
| Anthropic | 63.4% | 57.6% | +5.8pp | 41 |
| Google | 60.0% | 56.8% | +3.2pp | 30 |
| Meta | 29.2% | 29.4% | −0.2pp | 24 |
| Mistral | 20.0% | 33.1% | −13.1pp | 25 |
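One way to compute this kind of controlled comparison - how often family judges vote for a kin debater versus how often that debater wins ballots overall - looks roughly like the sketch below. The field names and exact choice of baseline are illustrative, not necessarily what produced the table above:

```python
from collections import defaultdict

def family_bias(ballots, family_of, overall_ballot_win_rate):
    """Compare how often a judge votes for a same-family debater against that
    debater's overall ballot win rate. `family_of` maps model -> family;
    field names and the baseline choice are illustrative."""
    actual = defaultdict(list)
    expected = defaultdict(list)
    for b in ballots:
        fam = family_of[b["judge"]]
        kin = [m for m in (b["winner"], b["loser"]) if family_of[m] == fam]
        if len(kin) != 1:  # skip ballots with zero or two family debaters
            continue
        actual[fam].append(1.0 if b["winner"] == kin[0] else 0.0)
        expected[fam].append(overall_ballot_win_rate[kin[0]])
    return {fam: (sum(actual[fam]) / len(actual[fam]),      # voted-for-family rate
                  sum(expected[fam]) / len(expected[fam]))  # expected vote rate
            for fam in actual}
```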

Cliques

Mistral’s familial antipathy extends to judging as well. All models are extremely likely to agree with their family members when they sit on a panel together (80%+ familial agreement, which aligns with what we’d expect to see in debates that typically have clear winners). The one exception is the Mistral family, whose models make the same decision as their familial counterparts only 38% of the time across 13 debates.

Again, the statistical pool here is relatively small and it’s hard to draw conclusions, but we do see some emergent patterns. In particular, DeepSeek V3.2 seems especially conformist, achieving 100% agreement with 4 different judges (Flash Lite, GPT 5.2, Meta Maverick, Sonnet 4.5) and a 95.8% panel-majority agreement rate overall.

It would be interesting to extend the judging further in future tournaments - but judging is the most expensive / time intensive part of the simulation because of the transcript size and the json-schema retries required to get a valid verdict out of some models - so for now consider this just a teaser of what might be hiding in the inter-model agreement rates.

Tough Crowd

Some judges are quite stingy with speaker points while others are profligate - this is despite reviewing the same matches and operating with the same prompts.

The judging pool is essentially bifurcated into two groups:

  • A majority of generous judges who give out an average of 37-43 speaker points per speaker per round
  • A cabal of harsh judges who give an average of 8-17 speaker points per round (Meta Maverick - 17 pts, GPT OSS - 10 pts, DeepSeek V3.2 - 9 pts, Mistral Small 3 - 8 pts).

We see the same split on “match quality points” as well - with the harsh cabal giving a much lower average score (0.5 - 5.1) for match quality compared to the generous judges (12.6 - 18.0). This is despite pretty strong agreement on decisions. For example, DeepSeek V3.2 has the bleakest view on match quality, giving 0.5 points on average, yet it agrees with the consensus decision on who won 95.8% of the time.

Upsets

There were 64 upsets in the tournament where the lower ranked model beat the higher ranked model. Of these, 54 were split decisions, suggesting that upsets tended to occur in very close matches, which makes intuitive sense.

The premier giant killer of the tournament was DeepSeek V3.2, which managed to snipe the top seed (GPT 5.2) once, the third seed (Opus 4.6) twice, and the fifth seed (Grok 4.1) once. There was also a heroic unanimous neg win for Meta Llama Scout (#13) against Grok 4.1 (#5) on the “Social media has done more harm than good” topic.

Topics

The goal of this tournament was to compare models, so the topic roster was quite large relative to the number of rounds; it’s hard to draw strong conclusions about topic-level biases since topics weren’t repeated often. Most topics were debated ~6 times, roughly normally distributed. That said, some topics were never won by the affirmative side:

  • Mercy is a weakness in a leader (0/6 aff wins)
  • Revenge is a valid form of justice (0/6 aff wins)
  • Transparency makes leadership impossible (0/6 aff wins)
  • Privacy is an outdated concept (0/5 aff wins)

All four of these topics presented a relatively hot take on a conventional ethical viewpoint. It’s possible that ethical training applied to modern LLMs makes them less likely to be persuaded by the affirmative in these cases. Alternatively, it may be the case that models struggle to defend provocative contrarian views and lean towards popular consensus opinions. In the future I might dig a bit deeper into this kind of dynamic with a smaller topic roster.

Costs

Including a few false starts for debugging and errors, I spent about $13 on OpenRouter to conduct this experiment. More than half of that was spent on just including Opus 4.5 and GPT 5.2 in the tournament.

The full tournament involved a little over 3,000 API calls and around 5,000,000 total tokens. It took slightly over 1 hour to run.

Conclusion

Overall, this was a fun experiment and cul-de-sac in my development of LetsArg. I think that debate tournaments offer a neat benchmarking tool for comparing LLMs - both from the perspective of their judging conformity and their debate performance. It’s extremely clear from the results that the top tier models today are better at debating than smaller models - at least when they’re being judged by other AIs.

In the future, I’d like to play with some of the benchmarking parameters to see if I can better study the underlying strategies and gameplay decisions of the models. Most of the debates in this tournament were “boring” in the sense that they trotted out very standard arguments and counter arguments and showed only limited creativity and clash. I’d be interested in adding the concept of judge biases (a key gameplay element of LetsArg) to see how LLMs adapt to a changing playing field. I’d also like to see how the results scale - especially with a deeper judging pool and more topic-level repetition.

