Meta’s benchmarks for its new AI models are a bit misleading


One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test that has human raters compare the outputs of models and choose which they prefer. But it seems the version of Maverick that Meta deployed to LM Arena differs from the version that’s widely available to developers.
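For readers unfamiliar with how head-to-head human votes become a leaderboard position, here is a minimal sketch. It assumes an Elo-style update, a common way to rank from pairwise preferences; it is not presented as LM Arena's actual scoring method. The model names, the starting rating of 1000 and the step size `k` are hypothetical.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) preference votes into Elo-style ratings.

    Assumption: every model starts at `base`, and each vote moves the
    two ratings by the same amount in opposite directions.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        rw, rl = ratings[winner], ratings[loser]
        # Probability the winner was expected to win, given current ratings.
        expected_w = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        # An upset (low expected_w) moves the ratings more than a sure thing.
        delta = k * (1.0 - expected_w)
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return dict(ratings)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because the ranking depends only on which answer raters prefer, a model tuned to produce answers that raters like can climb the board without being better at other tasks.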

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, discloses that Meta’s LM Arena testing was conducted using “Llama 4 Maverick optimized for conversationality.”

As we’ve written about before, for various reasons, LM Arena has never been the most reliable measure of an AI model’s performance. But AI companies generally haven’t customized or otherwise fine-tuned their models to score better on LM Arena — or haven’t admitted to doing so, at least.

The problem with tailoring a model to a benchmark, withholding it, and then releasing a “vanilla” variant of that same model is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts. It’s also misleading. Ideally, benchmarks — woefully inadequate as they are — provide a snapshot of a single model’s strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use a lot of emojis and give incredibly long-winded answers.

We’ve reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.



Meta introduces Llama 4 with two new AI models available now, and two more on the way


Meta has released the first two models from its multimodal Llama 4 suite: Llama 4 Scout and Llama 4 Maverick. Maverick is “the workhorse” of the two and excels at image and text understanding for “general assistant and chat use cases,” the company said in a blog post, while the smaller model Scout could tackle things like “multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.” The company also introduced Llama 4 Behemoth, an upcoming model it says is “among the world’s smartest LLMs,” and CEO Mark Zuckerberg said we’ll be hearing about a fourth model, Llama 4 Reasoning, “in the next month.”

Both Maverick and Scout are available to download now from the Llama website and Hugging Face, and they’ve been added to Meta AI, including for WhatsApp, Messenger and Instagram DMs.

A text slide describing three models from the Llama 4 family: Llama 4 Behemoth, Llama 4 Maverick and Llama 4 Scout

Meta

Scout has 17 billion active parameters with 16 experts, Meta says. According to Zuckerberg, “It’s extremely fast, natively multimodal, and has an industry leading, nearly infinite 10 million token context length, and it is designed to run on a single GPU.” Maverick, on the other hand, has 17 billion active parameters with 128 experts. The company says it beats competitors like GPT-4o and Gemini 2.0 on coding, reasoning, multilingual, long-context and image benchmarks, and stacks up against DeepSeek v3.1 on reasoning and coding.

Zuckerberg is already calling the upcoming Behemoth model, which is still training, “the highest performing base model in the world,” with 288 billion active parameters, according to the company. It may not be here yet, but it’s likely we’ll be hearing a lot more about that and the Reasoning model soon; Meta’s big AI developer conference, LlamaCon, is just a few weeks away.



Meta’s next Llama models may have upgraded voice features


Meta’s next major “open” AI model may have a voice focus, per a report in the Financial Times.

According to the piece, Meta is planning to introduce improved voice features with Llama 4, the next flagship in its Llama model family, which is expected to arrive in “weeks.” Reportedly, Meta has been particularly focused on allowing users to interrupt the model mid-speech, similar to OpenAI’s Voice Mode for ChatGPT and Google’s Gemini Live experience.

In comments this week at a Morgan Stanley conference, Meta chief product officer Chris Cox said that Llama 4 will be an “omni” model, capable of natively interpreting and outputting speech as well as text and other types of data.

The success of open models from the Chinese AI lab DeepSeek, which perform on par with or better than Meta’s Llama models, has kicked Llama development into overdrive. Meta is said to have scrambled to set up war rooms to decipher how DeepSeek lowered the cost of running and deploying models.