Industry

The 10 Frontier AI Companies That Need What We're Building

Ajay Jetty · 10 min read · April 30, 2025

Anthropic, OpenAI, Cursor, Bolt.new, Perplexity, Cohere, Mistral, Sierra, Cognition, and Google DeepMind all share one problem: they can't evaluate their own models at the scale and depth they need. Here's exactly what Jetty Train solves for each of them.

Every frontier AI company has the same hidden problem.

They build extraordinary models. They run automated benchmarks. They have internal red teams. And then they ship — and discover that their model fails in ways their benchmarks never caught. In production. At scale. In languages their team doesn't speak.

This is not a small problem. It is the central unsolved problem of the AI industry right now.

Jetty Train exists to solve it. And the companies that need this most are exactly the ones building the AI systems that will define the next decade.

Here are the ten companies we're building for — and precisely what we solve for each of them.

---

The 10 Companies. The 10 Problems.

| Company | What They Build | The Evaluation Gap | What Jetty Train Solves |
|---|---|---|---|
| Anthropic | Claude — frontier reasoning & safety | Constitutional AI boundary testing at scale; Indic language alignment gaps | Red-team evaluation data, safety boundary testing, multilingual alignment |
| OpenAI | GPT-4o, o1, o3 — reasoning & multimodal | Chain-of-thought hallucination in complex domains; instruction-following edge cases | Structured hallucination audits, adversarial instruction testing, RLHF preference data |
| Cursor | AI code editor — autocomplete & refactoring | Code suggestion correctness in real codebases; does the AI actually understand context? | Code correctness evaluation, logic error detection, real-world debugging trace review |
| Bolt.new | AI app generator — full-stack from prompt | Does the generated app actually run? Does it match intent? Multi-file coherence | App output evaluation, prompt-to-product fidelity scoring, UI correctness review |
| Perplexity AI | AI search engine — real-time answers | Factual accuracy of cited answers; hallucination in niche domains | Answer accuracy evaluation, citation quality review, domain-specific fact-checking |
| Cohere | Enterprise LLMs — RAG & embeddings | Retrieval accuracy in enterprise workflows; does the model use context correctly? | RAG output evaluation, context utilisation scoring, enterprise workflow task review |
| Mistral AI | Open-weight frontier models | Multilingual performance gaps; safety alignment in non-English contexts | Multilingual evaluation (Indic, European), safety boundary testing, reasoning quality |
| Sierra AI | AI customer experience agents | Does the agent resolve the issue? Does it stay on-brand? Escalation accuracy | Agent task completion evaluation, brand alignment scoring, escalation decision review |
| Cognition (Devin) | Autonomous software engineering agent | Does the agent complete multi-step engineering tasks correctly end-to-end? | Agentic workflow evaluation, multi-step code generation review, PR quality assessment |
| Google DeepMind | Gemini — multimodal frontier model | Indic language performance; multimodal reasoning accuracy; safety in diverse contexts | Indic language evaluation, multimodal output review, safety red-teaming |

---

Why Each Company Has This Problem

### Anthropic

Anthropic is the most safety-focused frontier lab in the world. Their Constitutional AI framework is designed to make Claude helpful, harmless, and honest. But testing the boundaries of that framework requires humans who can probe it systematically — not just automated red-teaming scripts.

The gap is especially acute in Indic languages. Claude's performance in Hindi, Tamil, and Telugu lags significantly behind its English performance. Anthropic knows this. They don't have a cost-effective solution for it at scale.

Jetty Train's evaluators — trained, tiered, and scored — can run structured boundary tests and multilingual alignment evaluations that Anthropic's internal team cannot run at the volume they need.

### OpenAI

OpenAI's models are the most widely deployed in the world. GPT-4o handles everything from customer service to legal research to medical triage. The failure modes that matter most are the ones that happen in production: confident hallucinations in complex domains, instruction-following failures on ambiguous prompts, reasoning chains that look correct but aren't.

Jetty Train produces structured evaluation data on exactly these failure modes — with failure category classification that tells OpenAI not just that the model failed, but *how* and *why*.
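
To make "structured" concrete, here is a minimal sketch of what one such record might look like. The field names and the failure taxonomy are illustrative, not Jetty Train's actual schema; the point is the shape of the data: pass/fail plus *why*.

```python
# A hypothetical evaluation record with failure-category classification.
# The enum values and field names are invented for this sketch.
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    NONE = "none"                            # output passed
    HALLUCINATION = "hallucination"          # confident but unsupported claim
    INSTRUCTION_DRIFT = "instruction_drift"  # ignored or misread the prompt
    REASONING_ERROR = "reasoning_error"      # chain looks valid but isn't
    FORMAT_VIOLATION = "format_violation"    # broke the requested format


@dataclass
class EvaluationRecord:
    prompt_id: str
    model_output: str
    passed: bool
    category: FailureCategory
    evaluator_notes: str = ""


record = EvaluationRecord(
    prompt_id="legal-triage-0042",
    model_output="...",
    passed=False,
    category=FailureCategory.HALLUCINATION,
    evaluator_notes="Cites a statute that does not exist.",
)
print(record.category.value)  # -> hallucination
```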

### Cursor

Cursor is the fastest-growing AI coding tool in the world, with a reported valuation of $9.9 billion. Their core product is AI-assisted code editing — autocomplete, refactoring, and chat-based debugging.

The evaluation problem for Cursor is different from frontier model labs. They don't need safety red-teaming. They need to know: does the AI suggestion actually work in a real codebase? Does it introduce bugs? Does it understand the context of the file it's editing?

Jetty Train's coding evaluators — engineers who can read, run, and debug code — can produce exactly this kind of evaluation data at scale.
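
For the mechanical half of that question, a rough sketch: apply the suggested patch to a scratch copy of the repository and run its tests. The repo path, patch file, and pytest runner are assumptions for illustration; the evaluator's judgment about *why* a suggestion failed sits on top of this raw signal.

```python
# Sketch of a patch-level correctness check. Paths and test runner are
# placeholders, not any company's actual harness.
import shutil
import subprocess
import tempfile
from pathlib import Path


def suggestion_passes_tests(repo: Path, patch: Path) -> bool:
    with tempfile.TemporaryDirectory() as scratch:
        work = Path(scratch) / "repo"
        shutil.copytree(repo, work)  # never mutate the real checkout
        applied = subprocess.run(
            ["git", "apply", str(patch.resolve())], cwd=work
        )
        if applied.returncode != 0:
            return False  # the patch does not even apply cleanly
        tests = subprocess.run(["pytest", "-q"], cwd=work)
        return tests.returncode == 0


# Hypothetical usage:
# suggestion_passes_tests(Path("/tmp/app-checkout"), Path("fix.patch"))
```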

### Bolt.new

Bolt.new (by StackBlitz) lets users generate entire web applications from a text prompt. It's an extraordinary product. And it has an extraordinary evaluation challenge: how do you know if the generated app is actually good?

The answer requires humans who can: run the app, check if it works, evaluate whether it matches the prompt intent, review the UI for usability, and assess whether the multi-file codebase is coherent. That's not something you can automate.

Jetty Train's evaluators can do this systematically — producing app output evaluation data that Bolt.new's team cannot generate internally at the volume they need.

### Perplexity AI

Perplexity is building the AI-native search engine. Their core promise is accurate, cited, real-time answers. The failure mode that would destroy that promise is confident hallucination — an answer that sounds authoritative but is wrong.

Evaluating factual accuracy at scale, across diverse domains, with citation quality review, requires human evaluators with domain knowledge. Jetty Train's tiered evaluator pool — with specialists in coding, math, science, and reasoning — can provide exactly this.

### Cohere

Cohere builds enterprise LLMs focused on retrieval-augmented generation (RAG) and embeddings. Their customers use Cohere to build internal knowledge bases, document search systems, and enterprise chatbots.

The evaluation gap for Cohere is about context utilisation: does the model actually use the retrieved context correctly? Does it answer from the document or hallucinate? Does it handle long contexts correctly?

Jetty Train can produce structured RAG evaluation data — testing whether model outputs are grounded in the provided context — at a scale that Cohere's internal team cannot match.
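
As an illustration, here is one hypothetical way to fold per-claim human judgments into a single groundedness score, assuming an evaluator has labeled each claim in the answer against the retrieved context. The labels and weighting are invented for this sketch.

```python
# Hypothetical aggregation of per-claim groundedness judgments.
from typing import Literal

ClaimLabel = Literal["grounded", "unsupported", "contradicted"]


def groundedness_score(labels: list[ClaimLabel]) -> float:
    """Fraction of claims supported by the retrieved context.

    Any contradicted claim zeroes the score: a single fabricated fact
    can sink an enterprise RAG answer, so it is treated as a hard fail.
    """
    if not labels or "contradicted" in labels:
        return 0.0
    return labels.count("grounded") / len(labels)


print(groundedness_score(["grounded", "grounded", "unsupported"]))  # ~0.67
```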

### Mistral AI

Mistral is building open-weight frontier models that are competitive with the best closed models. Their strength is multilingual performance and efficiency. Their gap — like every frontier lab — is evaluation coverage in non-English languages.

Jetty Train's Indic language evaluators are a direct solution to Mistral's multilingual evaluation gap. And as Mistral expands into enterprise markets, structured safety evaluation becomes increasingly important.

### Sierra AI

Sierra (founded by former Salesforce and Google executives) builds AI agents for customer experience — replacing or augmenting human customer service teams for major brands. Their agents handle returns, complaints, account changes, and escalations.

The evaluation problem for Sierra is fundamentally different from model labs: it's about task completion and brand alignment. Did the agent actually resolve the customer's issue? Did it stay on-brand? Did it escalate correctly when it should have?

Jetty Train's evaluators can simulate customer interactions and evaluate agent responses against these criteria — producing structured task completion and brand alignment scores.
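
A minimal sketch of what such scoring could look like, with invented criteria and weights; a production rubric would be tuned per client and per interaction type.

```python
# Illustrative rubric for scoring one simulated customer interaction.
RUBRIC = {
    "issue_resolved": 0.5,      # did the agent actually fix the problem?
    "on_brand_tone": 0.3,       # did it stay within the brand's voice?
    "correct_escalation": 0.2,  # did it hand off when it should have?
}


def score_interaction(judgments: dict[str, bool]) -> float:
    """Weighted score in [0, 1] from an evaluator's pass/fail judgments."""
    return sum(w for name, w in RUBRIC.items() if judgments.get(name))


print(score_interaction(
    {"issue_resolved": True, "on_brand_tone": True, "correct_escalation": False}
))  # -> 0.8
```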

### Cognition (Devin)

Cognition's Devin is billed as the first autonomous software engineering agent. It takes a task — "build a web scraper", "fix this bug", "add this feature" — and completes it end-to-end, including writing code, running tests, and submitting pull requests.

Evaluating Devin requires engineers who can review the entire output: does the code work? Is it well-structured? Does it actually solve the stated problem? Are there edge cases it missed?

Jetty Train's expert-tier coding evaluators are exactly the right people to produce this evaluation data.

### Google DeepMind

Google DeepMind's Gemini is a multimodal frontier model competing directly with GPT-4o and Claude. Its evaluation challenges span multiple dimensions: Indic language performance (critical for Google's India market), multimodal reasoning accuracy, and safety alignment across diverse cultural contexts.

Jetty Train's evaluator pool — with depth in Indic languages, coding, math, and reasoning — can address multiple evaluation gaps simultaneously for DeepMind.

---

What This Means for Jetty Train

These ten companies represent the core of the frontier-native AI industry. Most are US-based, with Mistral, Cohere, and Google DeepMind anchoring the frontier from Paris, Toronto, and London. They all have significant evaluation gaps. And they all have the budget to pay for structured, high-quality evaluation data.

Jetty Train is the only platform that can provide:

1. Scale — hundreds of trained, tiered evaluators producing structured data

2. Indic language depth — the only evaluation platform with native Hindi, Tamil, and Telugu evaluators

3. Structured output — not just pass/fail, but failure category classification that tells companies *why* their model failed

4. A track record — every evaluator's quality score is documented and verifiable

We're not building a labeling marketplace. We're building evaluation infrastructure for the AI industry.

The train is moving. These are the companies that need to be on it.

If you work at any of these companies — or if you're building frontier-native AI — reach out. We want to work with you.

→ Request a pilot: https://jettytrain.com/#request-pilot

Ready to get in on the AI economy?

Evaluate AI, get paid up to ₹50,000/month, build your track record.