Introduction
xAI’s Grok project released version 3 to both excitement and skepticism. xAI promises that the model beats GPT-4o in coding, and Grok 3 is integrated directly into the X (formerly Twitter) app for real-time news analysis. We tested Grok 3 across several benchmarks and tasks to give you a realistic picture of what this new language model can do.
The context of Grok 3’s release matters. xAI began developing AI technology in 2023 with the ambitious goal of building a maximally curious AI under guidelines that, according to Musk, are less restrictive than those of its competitors. Grok 1 and Grok 2 were credible models but not class-leading ones; Grok 3 seems to mark the point at which xAI has invested enough computational resources to be truly competitive.
Whether it actually is competitive depends on the benchmark: Grok 3 does better in some applications than in others, so the verdict changes with the metric used for comparison.
Architecture and Training
xAI has been notoriously vague about the model’s architecture, although benchmark results allow an informed guess. Grok 3 appears to be a mixture-of-experts (MoE) model trained on a large dataset of X posts, web crawl data, and code repositories. Its distinguishing feature is the ability to process posts on X in real time as they are published, something other commercial models do not offer.
MoE architectures activate only a subset of the model’s parameters for each input token rather than running the whole model. That means the model can have a high total parameter count while spending far less compute per request at inference time, which makes serving at massive scale practical. Models such as Google’s Gemini 1.5 and Mistral’s Mixtral series use MoE architectures. Because fewer parameters are active per token, MoE inference is cheaper and more energy-efficient than for a dense model of the same total size, which may account for the competitive pricing.
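To make the routing idea concrete, here is a minimal sketch of top-k expert selection. It is a toy illustration of the general MoE technique, not xAI’s implementation: the names `top_k_route` and `moe_forward`, the four scalar "experts", and the gate scores are all assumptions made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    routes = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]

# Hypothetical toy experts: each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_scores = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

output, used = moe_forward(3.0, experts, gate_scores, k=2)
print("experts that ran:", used)
print("blended output:", output)
```

Only two of the four experts execute for this token, which is the source of the compute savings described above; a real model does the same thing with neural-network experts and a learned router.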
Grok 3 has a massive context window of 1 million tokens, putting it in the same league as Gemini Ultra 2. According to xAI, the model uses a novel variant of the attention mechanism called ‘radial attention’, which reduces the quadratic cost of token-by-token attention by organizing the computation in concentric circles that favor nearby tokens in the sequence. Whether radial attention really delivers those benefits cannot be independently confirmed.
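Since no details of radial attention are available to us, the sketch below does not implement it. As a stand-in, it compares ordinary causal attention with a plain sliding-window scheme, a well-known way of favoring nearby tokens, simply to show why restricting attention matters at a 1-million-token context. The window size of 4,096 is an arbitrary assumption.

```python
def full_attention_pairs(n):
    """Causal full attention: token i attends to every token from 0 to i."""
    return n * (n + 1) // 2

def windowed_attention_pairs(n, window):
    """Causal sliding window: each token attends to at most `window` recent tokens, itself included."""
    if n <= window:
        return full_attention_pairs(n)
    # The first `window` tokens see everything before them; the rest see exactly `window` tokens.
    return full_attention_pairs(window) + (n - window) * window

context = 1_000_000  # context length quoted in this article
window = 4_096       # hypothetical window size, chosen only for illustration

full = full_attention_pairs(context)
local = windowed_attention_pairs(context, window)
print(f"full attention pairs:     {full:,}")
print(f"windowed attention pairs: {local:,}")
print(f"reduction factor:         {full / local:.1f}x")
```

Full attention grows with the square of the sequence length while the windowed variant grows linearly, which is the kind of saving any locality-biased attention scheme is after.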
Benchmark Results
Grok 3 did not beat Claude 3.5 Sonnet on any of the benchmarks we ran, HumanEval included. On HumanEval, which assesses coding ability, Grok 3 scored an 85.7% pass rate, while GPT-4o and Claude 3.5 Sonnet scored 90.2% and 92% respectively, so our results do not support the ‘beats GPT-4o in coding’ claim.
On the MATH benchmark, which tests mathematical reasoning, Grok 3 is competitive but behind its rivals: it scored 74.3%, while Gemini Ultra 2 scored 83.4% and GPT-4o scored higher still. Where Grok 3 stands out is real-time fact retrieval and social media trend analysis, which is unsurprising given that its training data includes X posts.
The model produced very accurate summaries of discussions about stocks and political events from the preceding 24 hours, a test that GPT-4o, Gemini Ultra 2, and Claude failed. Those models returned outdated or partial information about recent events because they lack native access to new data and have to rely on web search, and web search invoked by a model is neither truly real-time nor fully reliable.
On instruction-following tasks, which assess whether a model complies with multi-step instructions and specific formatting guidelines, Grok 3 significantly trails GPT-4o and Claude. In real-life requests that translates into trouble following precise guidelines, as well as occasional editorializing where neutrality would be expected.
Pricing and Access
Grok 3 is available to X Premium+ subscribers for $16/month. The subscription also unlocks the full range of premium X features, which adds further value, for example by enabling you to monetize your posts. For users who would buy X Premium+ for those features anyway, the marginal price of Grok 3 is effectively zero.
For API developers, Grok 3 costs $6 per million input tokens and $18 per million output tokens. Compared with the standard OpenAI prices of $12 and $18 respectively for GPT-4o, input is half the price and output costs the same, so the savings for developers depend on how input-heavy their workload is.
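A quick calculation shows how the difference plays out. The sketch below uses the per-million-token prices quoted in this article; the workload of 2 million input tokens and 500,000 output tokens is a hypothetical example, and `request_cost` is simply a helper name chosen for it.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in dollars, given prices in dollars per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# (input price, output price) in dollars per million tokens, as quoted in this article.
GROK3_PRICES = (6.0, 18.0)
GPT4O_PRICES = (12.0, 18.0)

# Hypothetical monthly workload.
input_tokens = 2_000_000
output_tokens = 500_000

grok_cost = request_cost(input_tokens, output_tokens, *GROK3_PRICES)
gpt_cost = request_cost(input_tokens, output_tokens, *GPT4O_PRICES)

print(f"Grok 3: ${grok_cost:.2f}")
print(f"GPT-4o: ${gpt_cost:.2f}")
print(f"Saving: ${gpt_cost - grok_cost:.2f}")
```

Because the output price is identical, the entire saving comes from input tokens: a workload that is mostly output sees almost no difference, while an input-heavy one such as long-document analysis benefits the most.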
An important limitation: because the model’s web interface is integrated with X, users without an X account face a very complicated onboarding procedure. If you fall into that category, the more practical option is API access through the xAI developer portal, which requires only a credit card and no X account, so the model remains easily accessible to developers.
Content Policies and Privacy
Historically, Grok has been configured with fewer content restrictions than other major language models such as GPT-4o or Claude. While Musk likes to sell this as ‘less censored AI,’ in practice it means that Grok 3 will engage with more topics than other models but also gives less reliable answers in sensitive areas that call for caution, such as medicine or law. For instance, Grok 3 will make unqualified statements in cases where other models hedge.
Privacy advocates have also pointed out that xAI’s data practices are not very transparent: its data usage policy is less detailed than those of Anthropic or OpenAI, and the question of whether deleted X posts are included in the training data remains unanswered. For enterprise clients that process sensitive information through the model, that ambiguity carries some risk of compliance violations.
The fact that the infrastructure, the training data, and the delivery platform (X) all sit within a single company raises concerns about concentration of power in AI development. While that does not affect the quality of the model itself, it can matter to enterprises weighing the risks of a long-term deployment.
Conclusion
Overall, Grok 3 is a competent language model with a real edge in accessing live data – an important advantage for those who work with real-time information. It’s definitely not the leader in coding and other benchmark metrics, but for users entrenched in the X ecosystem, it can be an excellent fit.
From a development perspective, xAI has made substantial progress from Grok 1 to Grok 3, backed by growing investment in computational power. The next release, Grok 4, will show whether xAI can match the competition across all benchmark metrics, including instruction following and reasoning.

