At Topliner, we use AI to assess candidate relevance for executive search projects. Specifically, we rely on GPT-4o, because, well… at the time it was among the sharpest knives in the drawer.
And to be fair, it mostly works. Mostly.
The problem? Every now and then, GPT-4o goes rogue. It decides that a perfectly relevant candidate should be tossed aside, or that someone utterly irrelevant deserves a golden ticket. It’s like flipping a coin, but with a fancy API. Predictability is out the window, and in our line of work, that’s unacceptable.
So, I started wondering: is it time to move on?
Ideally, the new model should be available on Microsoft Azure (we’re already tied into their infrastructure, plus shoutout to Microsoft for the free tokens - still running on those, thanks guys). But if not, any other model that gets the job done would do.
Here’s what matters to us:
Recently, I stumbled upon xAI’s new Grok-4 Fast Reasoning model, which promised speed, affordability, and smart reasoning. Naturally, I put it to the test.
I designed a test around one “problem candidate profile” - a case where GPT-4o typically fails. The prompt asked the model to decide if a candidate had ever held a role equivalent to “CFO / Chief Financial Officer / VP Finance / Director Finance / SVP Finance” at SpaceX (with all the expected variations in title, scope, and seniority).
Here’s the prompt I used:
Evaluate candidate's eligibility based on the following criteria. Evaluate whether this candidate has ever held a role that matches or is equivalent to 'CFO OR Chief Financial Officer OR VP Finance OR Director Finance OR SVP Finance' at 'SpaceX'. Consider variations of these titles, related and relevant positions that are similar to the target role(s).

When making this evaluation, consider:
- Variations in how the role title may be expressed.
- Roles with equivalent or similar or close or near scope of responsibilities and seniority level.
- The organizational context, where titles may reflect different levels of responsibility depending on the company's structure.

If the candidate's role is a direct or reasonable equivalent to the target title(s), set targetRoleMatch = true. If it is unrelated or clearly much below the intended seniority level, set targetRoleMatch = false.

Return answer: true only if targetRoleMatch = true. In all other cases return answer: false.

Candidate's experience: [here is context about a candidate]
Simple in theory, but a surprisingly effective way to separate models that understand nuance from those that hallucinate or guess.
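To make the setup concrete, here is a minimal Python sketch of how a prompt like this could be assembled and how the reply could be reduced to a boolean. The function names `build_prompt` and `parse_answer` are hypothetical, the prompt text is abbreviated, and the parser assumes the model replies with a literal `answer: true` or `answer: false` somewhere in its output, which is an assumption about the response format rather than something the test guarantees.

```python
import re


def build_prompt(titles: list[str], company: str, experience: str) -> str:
    """Assemble an abbreviated version of the evaluation prompt (illustrative only)."""
    joined = " OR ".join(titles)
    return (
        "Evaluate whether this candidate has ever held a role that matches or is "
        f"equivalent to '{joined}' at '{company}'. "
        "Return answer: true only if targetRoleMatch = true. "
        "In all other cases return answer: false. "
        f"Candidate's experience: {experience}"
    )


def parse_answer(reply: str) -> bool:
    """Return True only for an explicit 'answer: true'; anything else counts as False."""
    match = re.search(r"answer\s*[:=]\s*(true|false)", reply, flags=re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "true"
```

Treating every unparseable reply as `False` mirrors the prompt's own rule that only an explicit match should pass.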
I ran the experiment across 9 different models:
All the latest OpenAI releases: GPT-4o, GPT-4.1, GPT-5 Mini, GPT-5 Nano, GPT-5 (August 2025), plus o3-mini and o4-mini.
xAI’s Grok-3 Mini and Grok-4 Fast Reasoning.
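A comparison like this boils down to calling each model repeatedly with the same prompt and recording correctness and latency. The sketch below shows one way to do that in Python. The `benchmark` function and the `ask` callable are hypothetical: `ask` stands in for whatever client code sends the prompt to a given model and returns a boolean, and the expected answer is passed in as a parameter. The default of 10 runs matches the "10/10 correct" figures in the results.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(ask: Callable[[str], bool], prompt: str, expected: bool, runs: int = 10) -> dict:
    """Call one model `runs` times and summarize accuracy and average latency."""
    latencies = []
    correct = 0
    for _ in range(runs):
        start = time.perf_counter()
        answer = ask(prompt)  # hypothetical wrapper around a real API call
        latencies.append(time.perf_counter() - start)
        if answer == expected:
            correct += 1
    return {
        "runs": runs,
        "correct": correct,
        "accuracy_pct": 100 * correct / runs,
        "avg_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
    }


if __name__ == "__main__":
    # Stand-in model that always answers True, so the sketch runs without any API key.
    print(benchmark(lambda prompt: True, "example prompt", expected=True))
```

Repeating the same request matters here because the failure mode is inconsistency: a single call can look fine while the model still gets the case wrong most of the time.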
🥇 xAI Grok-4 Fast Reasoning: 93.1/100 overall
├── Speed: 88/100 (2.83s avg)
├── Cost: 94/100 ($0.99 per 1000 req)
└── Accuracy: 100/100 (10/10 correct)

🥈 xAI Grok-3 Mini: 82.5/100 overall
├── Speed: 65/100 (5.65s avg)
├── Cost: 90/100 ($1.47 per 1000 req)
└── Accuracy: 100/100 (10/10 correct)

🥉 Azure OpenAI o4-mini: 80.9/100 overall
├── Speed: 89/100 (2.68s avg)
├── Cost: 58/100 ($5.47 per 1000 req)
└── Accuracy: 100/100 (10/10 correct)

🏃 Fastest individual response: 0.75 seconds (Azure OpenAI GPT-4o)
🐌 Slowest individual response: 21.25 seconds (OpenAI GPT-5 2025-08-07)
🎯 Most accurate model: OpenAI GPT-5 Nano (100%)
❌ Least accurate model: OpenAI GPT-4.1 (0%)
💰 Most expensive model: Azure OpenAI GPT-4o ($12.69 per 1000 req)
💎 Most cost-effective model: OpenAI GPT-5 Nano ($0.29 per 1000 req)
💵 Total cost for all tests: $0.452
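The results do not say how the overall score is computed, but the three podium rows are all consistent with a weighted sum of 40% speed, 35% cost, and 25% accuracy. Those weights are inferred by solving the three reported rows as a system of equations; they are an assumption, not a documented formula. A small Python check, with `overall_score` as a hypothetical helper name:

```python
# Weights in percent, inferred from the three reported rows (an assumption,
# not a formula stated alongside the results).
WEIGHTS = {"speed": 40, "cost": 35, "accuracy": 25}


def overall_score(speed: int, cost: int, accuracy: int) -> float:
    """Weighted sum of the three sub-scores, each on a 0-100 scale."""
    total = (
        WEIGHTS["speed"] * speed
        + WEIGHTS["cost"] * cost
        + WEIGHTS["accuracy"] * accuracy
    )
    return total / 100


if __name__ == "__main__":
    print(overall_score(88, 94, 100))  # Grok-4 Fast Reasoning sub-scores
    print(overall_score(65, 90, 100))  # Grok-3 Mini sub-scores
    print(overall_score(89, 58, 100))  # o4-mini sub-scores
```

If these weights are right, speed and cost together count for three quarters of the ranking, which helps explain how models with identical 10/10 accuracy end up more than twelve points apart.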
Grok-4 Fast Reasoning came out cheap, accurate, and reasonably fast. It's not the absolute fastest (the quickest single response, 0.75 seconds, came from GPT-4o), but considering GPT-4o answered correctly only 1 out of 10 times, I’ll take slightly slower in exchange for far more reliable.
A year ago, GPT-4o was one of the most advanced and reliable options. We built big chunks of our product around it. But time moves fast in AI land. What was cutting-edge last summer looks shaky today.
This little experiment with Grok-4 was eye-opening. Not only does it give us a better option for candidate evaluation, but it also makes me want to revisit other parts of our application where we blindly trusted GPT-4o.
Moral of the story: don’t get too attached to your models. The landscape shifts, and if you don’t keep testing, you might wake up one day realizing your AI is confidently giving you the wrong answers… at record speed.
So yes, GPT-4o, thank you for your service. But it looks like Grok-4 Fast Reasoning is taking your seat at the table.