Meta’s Maverick Ranks High—But Is It the Same Model?

Meta’s newly released AI model, Maverick, is already generating buzz—but not for the reasons the company might have hoped. Over the weekend, the tech giant proudly announced Maverick’s performance on LM Arena, a benchmark where human reviewers rank AI model outputs by preference. According to Meta, Maverick ranked second. However, what developers are getting access to may not be the same version that earned that top-tier ranking.

Multiple AI researchers pointed out on X (formerly Twitter) that the Maverick model tested on LM Arena isn’t identical to the version released to the public. In fact, Meta quietly acknowledged this in its announcement. The version on LM Arena is an “experimental chat version,” specifically optimized for conversations. A chart on Meta’s official Llama website confirms this, stating that the benchmark results were based on “Llama 4 Maverick optimized for conversationality.”

This raises serious questions about transparency. LM Arena has never been the gold standard for evaluating AI models, but it has remained a consistent reference point. Most companies do not customize their models specifically to score better on it, or at least do not admit to doing so. Meta's approach blurs that line.

By tailoring a model to excel on a benchmark and then offering a “vanilla” version to the public, Meta risks misleading developers. The two variants may behave quite differently in real-world use, making it harder for developers to gauge how well the model will perform in actual applications. Benchmarks, flawed as they may be, are meant to reflect a model’s overall strengths and weaknesses—not just its optimized best-case scenario.

This discrepancy isn’t just theoretical. Several users on X have posted side-by-side comparisons showing noticeable differences between the LM Arena version of Maverick and the downloadable model. The benchmarked version frequently uses emojis and tends to deliver overly lengthy responses—traits not seen as strongly in the public release.

One researcher commented, “Okay Llama 4 is def a little cooked lol, what is this yap city,” posting a screenshot showing Maverick’s long-winded style. Another noted that on platforms like Together.ai, the model behaves much more reasonably and professionally.

The inconsistency is fueling criticism that Meta may be gaming the benchmarking system while failing to give developers a clear picture of what to expect. Transparency in AI model performance isn’t just a nice-to-have—it’s essential for fair comparisons and responsible deployment.

As of now, both Meta and Chatbot Arena—the team behind LM Arena—have yet to respond to questions or issue any clarification. Developers and researchers are still waiting for clear answers about which version of Maverick they’re working with—and whether benchmark scores like these can still be trusted.
