Building an AI Agent Platform: The Three Questions That Shaped Every Decision
I started with one question: which model do I use? Then I hit two more I didn't expect: which model survives an agentic loop, and how do I actually deploy this? Here is what I learned answering each one in production, and how it shaped the platform I ended up building.
When I started building what eventually became ModelFitAI, I thought I had one question to answer:
"Which LLM should I use for this product?"
A few months later, I realised I'd been answering the wrong question — or rather, I'd been answering a first question, and there were two more behind it that I hadn't even seen yet. Each one reshaped the architecture in ways I didn't expect.
This post is about those three questions, in the order they hit me, and the trade-offs that came with each. If you're building anything agentic in production right now, you're probably standing at one of them.
Question 1: Which model fits my use case?
This is where everyone starts. It's also the question that gets the most ink and the least clarity, because the honest answer is "it depends, and you have to test."
My use case at the start was narrow: an AI engine that could read a user's natural-language description and recommend the right LLM and stack from a database of 50+ options. Structured output, decent reasoning, conversational handling, free-tier-viable cost — those were the requirements.
I prototyped against three Claude tiers — Haiku, Sonnet, and Opus — because I wanted to compare across a single provider's quality curve before I added cross-provider variance.
What I learned (the longer write-up is here):
- Haiku was cheap and fast, but its JSON output failed validation ~20% of the time. In production, that meant either an aggressive retry layer or a degraded UX, neither of which was free.
- Opus was excellent at the reasoning, but at 3× the cost of Sonnet, it killed the unit economics of any meaningful free tier.
- Sonnet sat in the middle on price and at the top on structured-output reliability for my workload. 95%+ valid JSON, decent latency, sustainable margin at the price I wanted to charge.
Sonnet won for that use case. But the lesson wasn't "use Sonnet." The lesson was that model selection is a function of three things: task complexity, output reliability requirements, and unit economics — and you can only resolve them by running the same workload across candidates and looking at the failure modes, not the marketing pages.
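For illustration, here is the shape of that comparison harness: run your real prompts through each candidate and score them with the same validation the production path uses. This sketch assumes the Anthropic Python SDK; the model IDs, per-token prices, and expected JSON keys are placeholders, not the values from my actual tests.

```python
import json
import anthropic  # assumes the official Anthropic Python SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder model IDs and illustrative $/input-token prices; substitute real values.
CANDIDATES = {
    "claude-haiku": 0.25 / 1_000_000,
    "claude-sonnet": 3.00 / 1_000_000,
    "claude-opus": 15.00 / 1_000_000,
}

def valid_recommendation(text: str) -> bool:
    """The same check production uses: parse JSON, confirm the expected keys exist."""
    try:
        rec = json.loads(text)
        return isinstance(rec, dict) and {"model", "stack"} <= rec.keys()
    except json.JSONDecodeError:
        return False

def evaluate(model: str, prompts: list[str]) -> dict:
    """Run the real prompts through one candidate and track failure rate and spend."""
    failures, cost = 0, 0.0
    for prompt in prompts:
        resp = client.messages.create(
            model=model, max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        if not valid_recommendation(resp.content[0].text):
            failures += 1
        cost += resp.usage.input_tokens * CANDIDATES[model]  # input-token cost only, for brevity
    return {"failure_rate": failures / len(prompts), "cost": cost}
```

The output of a harness like this is exactly the table you need: failure rate and cost per candidate, on your workload, with your validation logic.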
This is also where I started thinking the rest of the system was going to be straightforward. I had a model. I had a use case. I had structured output that worked. What else could there be?
Question 2: Which model survives an agentic use case?
Then I started building actual agents — agents that loop, call tools, monitor a Reddit feed, post to X, score leads, send Telegram messages, retry on failure, and manage their own state across hours and days.
Almost immediately, the answer to Question 1 stopped being load-bearing.
Here's what I didn't appreciate at the start: agentic workloads are not the same shape as one-shot LLM calls, and "best for chat" or "best for structured output" is only loosely correlated with "best for agents."
A few things that look obvious in hindsight:
Tool-calling reliability matters more than raw reasoning. A reasoning model that picks the wrong tool 5% of the time, or calls a tool with hallucinated arguments, will silently corrupt agent state in ways that are very hard to debug after the fact. The first time you see an agent confidently insert a malformed row into your database at 3am, you stop caring about leaderboard scores.
Latency compounds across an agentic loop. A single LLM call at 1.5 seconds is fine. Six chained calls is nine seconds, plus tool execution. Users feel that. Agents that drive interactive flows (chat, Telegram replies) need a noticeably faster model than agents that run on a cron and don't care if they take 30 seconds.
Cost-per-task replaces cost-per-call. The "$0.015 per query" math from Question 1 is meaningless when one user-facing task expands into eight LLM calls, four tool invocations, and a retry. The unit you actually have to budget against is the task, not the call. Otherwise your unit economics quietly invert.
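To make "budget against the task" concrete, here is a minimal ledger sketch. The model names and per-token prices are illustrative, and record() would be fed from wherever your LLM client reports usage counts.

```python
from dataclasses import dataclass, field

# Illustrative $/token prices; substitute your provider's real rates.
PRICE = {"sonnet": {"in": 3e-6, "out": 15e-6}, "haiku": {"in": 0.25e-6, "out": 1.25e-6}}

@dataclass
class TaskLedger:
    """Accumulates LLM spend across every call made while serving one user-facing task."""
    calls: list[float] = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> None:
        cost = input_tokens * PRICE[model]["in"] + output_tokens * PRICE[model]["out"]
        self.calls.append(cost)

    @property
    def cost_per_task(self) -> float:
        return sum(self.calls)

# One "score this Reddit lead" task might fan out into eight calls plus a retry;
# the number to budget against is ledger.cost_per_task, not any single call.
```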
Context management is its own subsystem. Agentic loops accumulate state — prior tool outputs, partial results, retrieved documents. You either pay for ever-growing input contexts (linear cost growth, increasing latency, higher hallucination risk near the context limit), or you build a summarisation / pruning layer. There is no third option.
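A minimal version of that pruning layer, assuming you already have some cheap summarise() call (a Haiku-class model, or even deterministic truncation); the message shape follows the usual role/content convention.

```python
def prune_context(messages: list[dict], summarise, keep_last: int = 6) -> list[dict]:
    """Keep the most recent turns verbatim and collapse everything older into a summary."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarise(older)  # hypothetical helper: returns a short string
    return [{"role": "user", "content": f"Summary of earlier steps: {summary}"}] + recent
```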
Failure recovery is the agent. A non-agentic call has two outcomes: it works or it errors. An agentic call has dozens — wrong tool, missing argument, partial completion, infinite loop, timeout, rate-limit. The failure-handling code becomes the largest and most important part of the system. The model you pick should be the one that fails predictably, not necessarily the one that succeeds the most often.
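As a sketch of what "fails predictably" buys you: if every step failure carries a kind, the loop can decide which failures are worth retrying and which should stop the agent. The exception class and failure kinds here are hypothetical, not taken from my runtime.

```python
import time

class AgentStepError(Exception):
    """Raised with a `kind` so the loop knows what the failure means."""
    def __init__(self, kind: str, detail: str = ""):
        super().__init__(detail)
        self.kind = kind  # e.g. "bad_tool", "bad_args", "timeout", "rate_limit"

def run_step_with_budget(step, max_retries: int = 2):
    """Run one agent step, retrying only the failure kinds that a retry can fix."""
    for attempt in range(max_retries + 1):
        try:
            return step()
        except AgentStepError as err:
            retriable = err.kind in ("rate_limit", "bad_tool", "bad_args")
            if not retriable or attempt == max_retries:
                raise                      # timeouts, loops, exhausted budget: surface it
            if err.kind == "rate_limit":
                time.sleep(2 ** attempt)   # back off before the next attempt
```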
For my agentic workloads — Reddit lead-gen, X autopilot, HN monitoring, Product Hunt response — I ended up with a small routing layer, sketched in code after the list:
- Sonnet handles tool-calling, planning, and any task where structured output reliability matters more than speed.
- Haiku handles classification, intent detection, and anything cheap enough to run inside a tight loop where latency dominates the user experience.
- Cross-provider fallback kicks in when the primary returns malformed tool calls twice in a row — same task, different model, different blind spots.
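Here is roughly what that routing-plus-fallback looks like. call_model and is_malformed stand in for whatever provider-agnostic wrapper and tool-call validator you already have, and the model names are placeholders.

```python
# Hypothetical model identifiers; the routing shape is the point, not the names.
ROUTES = {
    "plan": "sonnet",       # tool-calling, planning, structured output
    "classify": "haiku",    # intent detection inside tight loops
}
FALLBACK = "other-provider-model"

def call_with_fallback(task_type: str, prompt: str, call_model, is_malformed) -> str:
    """Try the routed model; after two malformed tool calls in a row, switch providers."""
    primary = ROUTES.get(task_type, "sonnet")
    for model in (primary, primary, FALLBACK):   # two tries on the primary, then fall back
        result = call_model(model, prompt)       # hypothetical provider-agnostic wrapper
        if not is_malformed(result):
            return result
    raise RuntimeError("all candidates returned malformed tool calls")
```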
The cost discipline came from one principle: the LLM is the most expensive thing in the loop, so spend on it last. Cache aggressively. Use deterministic code where you can. Only escalate to the model when the problem actually requires a model.
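"Escalate to the model last" in sketch form: a deterministic rule handles the obvious cases, a cache absorbs repeats, and llm_classify (a stand-in for the routed model call) is only hit when neither applies.

```python
_cache: dict[str, str] = {}

def classify_intent(text: str, llm_classify) -> str:
    """Cheapest path first: deterministic rules, then cache, then the model."""
    key = text.strip().lower()
    if key in ("stop", "unsubscribe"):    # deterministic rule: no model call at all
        return "opt_out"
    if key not in _cache:                 # cache: pay for each distinct input once
        _cache[key] = llm_classify(key)   # hypothetical call into the routed model
    return _cache[key]
```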
This is also when I noticed that the agent frameworks I was looking at — LangChain, CrewAI, AutoGen — were optimised for letting you build agents quickly, not for letting you operate them reliably. They didn't have strong opinions on retry budgets, cost ceilings, observability, or per-tenant isolation. So I started building those layers myself, on top of a thin agent runtime that I eventually open-sourced as OpenClaw once it stabilised.
By this point I had something that worked — agents I could run locally, deploy to a VPS, and trust to handle real-world inputs. I figured the hard part was over.
I was wrong again.
Question 3: How do I actually deploy this?
The third question is the one nobody warns you about. It's also the one that ended up taking the most engineering hours, by a wide margin.
Once an agent works on your laptop, you have to answer:
- Where does it run?
- How do users control it without learning a new tool?
- How do new users deploy their agent without setting up a server?
I tried, in this order, the three obvious paths.
Path 1: Vercel functions / serverless. I started here because the rest of the app already lived there. It collapsed almost immediately. Agentic loops are long-running by nature — minutes, sometimes hours. Serverless function timeouts (10s on hobby, 60s with config gymnastics) make this a bad fit. Worse, agents that maintain in-memory state across iterations don't survive cold starts. You can paper over both with queues and external state, but at that point you've reinvented a worker, badly.
Path 2: Self-hosted Kubernetes. I considered this for about an afternoon and dropped it. The complexity surface for a multi-tenant agent platform on K8s is absurd for a solo founder. Cluster management, node autoscaling, container security, network policies, image registry, RBAC. None of that work makes the product better; all of it makes the on-call worse.
Path 3: Long-running Docker containers on a managed VPS. This is where I landed and stayed. Each agent runs in its own isolated Docker container, on a Hetzner VPS, with constrained CPU/memory and a strict per-container resource budget. Hetzner because the per-core cost on long-running compute is roughly an order of magnitude lower than the comparable AWS instance, and for workloads that don't need elastic burst, that math is decisive. Docker because container isolation is good enough for multi-tenant agents if you're disciplined about secrets, networking, and the supply chain — which is its own subsystem to build, but a finite one.
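For a sense of what "constrained CPU/memory and a strict per-container resource budget" means in practice, here is a sketch using the docker-py SDK. The specific limits and the launch_agent name are illustrative, not my production values.

```python
import docker  # assumes the docker-py SDK and a reachable Docker daemon

client = docker.from_env()

def launch_agent(tenant_id: str, image: str, env: dict) -> str:
    """Start one tenant's agent in its own container with a hard resource budget."""
    container = client.containers.run(
        image,
        detach=True,
        name=f"agent-{tenant_id}",
        environment=env,          # per-tenant secrets injected here, never baked into the image
        mem_limit="512m",         # hard memory ceiling
        nano_cpus=500_000_000,    # 0.5 CPU
        pids_limit=128,           # stop runaway process spawning
        read_only=True,           # immutable filesystem apart from mounted volumes
        cap_drop=["ALL"],         # drop Linux capabilities the agent doesn't need
    )
    return container.id
```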
Path 3 also forced a UX decision I hadn't expected. If agents are long-running and live on a server somewhere, the customer needs a control plane. The default answer is a web dashboard. I built one — and almost no one used it for the day-to-day. People wanted to interact with their agents from where they already were: their phone, their messaging app, on the train, between meetings.
So I made the control plane a Telegram bot. The customer creates an agent in the dashboard once. After that, they talk to the agent from Telegram. The agent reports leads, asks for clarifications, and accepts commands the same way. The dashboard becomes provisioning and billing, which is what dashboards are actually good at.
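The mechanics of that relay are not exotic. A sketch against the public Telegram Bot API, where handle_command is a hypothetical hook into the agent runtime:

```python
import os
import requests

TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
API = f"https://api.telegram.org/bot{TOKEN}"

def poll_commands(handle_command, offset: int = 0) -> None:
    """Long-poll Telegram for messages and relay each one to the tenant's agent.

    handle_command(chat_id, text) is assumed to return the agent's reply as a string.
    """
    while True:
        updates = requests.get(f"{API}/getUpdates",
                               params={"offset": offset, "timeout": 30}).json()
        for update in updates.get("result", []):
            offset = update["update_id"] + 1
            msg = update.get("message") or {}
            if "text" in msg:
                reply = handle_command(msg["chat"]["id"], msg["text"])
                requests.post(f"{API}/sendMessage",
                              json={"chat_id": msg["chat"]["id"], "text": reply})
```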
That single decision — "chat as control plane, dashboard as configuration" — changed retention more than any model upgrade did.
The fourth thing I had to build, which I didn't anticipate at all, was a standardised skill format so I could ship and update agents without rewriting the runtime each time. Each agent template is a small package — SKILL.md + an entry script + a manifest of credentials and AI keys it needs. The runtime knows how to load any skill and run it. Adding a new agent type became a config change, not a code change. Eventually I open-sourced the skill format and started a small marketplace for them. That decision turned out to be load-bearing for everything that came after.
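To illustrate the idea (not the published format), a loader like this is all the runtime needs: read the manifest, check the credentials it declares, import the entry script, and hand back a callable. The manifest field names and the run() convention are assumptions for the sketch.

```python
import json
import os
import importlib.util
from pathlib import Path

def load_skill(skill_dir: str):
    """Load one skill package: manifest, credential check, then the entry script."""
    root = Path(skill_dir)
    manifest = json.loads((root / "manifest.json").read_text())  # illustrative filename

    missing = [key for key in manifest.get("required_credentials", [])
               if key not in os.environ]
    if missing:
        raise RuntimeError(f"skill {manifest['name']} is missing credentials: {missing}")

    spec = importlib.util.spec_from_file_location(manifest["name"], root / manifest["entrypoint"])
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run  # assumed convention: every skill exposes a run(config) callable
```

With a contract like that in place, adding a new agent type really is a config change: drop in a new directory with its SKILL.md, entry script, and manifest, and the runtime picks it up.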
What this taught me about LLM architecture in general
Looking back across all three questions, the meta-lesson isn't about models or frameworks. It's that every layer of an LLM system is interdependent, and decisions made at one layer set hard constraints on the layers above and below.
- Pick the wrong model for an agentic loop, and your retry layer has to do double duty.
- Pick the wrong runtime, and your model choice gets re-litigated every time a serverless cold start kills an in-flight agent.
- Pick the wrong control-plane UX, and customers churn before they ever see how good the agent is.
- Skip the skill standardisation, and every new agent type becomes a re-implementation of the previous one.
The teams I see succeeding with LLM products in 2026 aren't the ones that picked the smartest model. They're the ones that took these three questions seriously, in this order, and resisted the urge to rush past any of them.
What I'd do differently if I started over today
A few specifics, in case this is useful as a checklist:
- Test models on the actual workload, not the marketing. Run your real prompts through three candidates with your real validation logic. The cheapest model that doesn't break your validation budget wins.
- Budget against tasks, not calls. Track total LLM cost per user-facing outcome. That number, not per-call cost, decides whether your business model works.
- Pick the runtime before you pick the framework. Long-running agents on serverless is a forced error. Decide where this thing actually lives before you write more agent code.
- Treat the control plane as a separate product decision. Web dashboards are great for configuration. They are mediocre for ongoing agent interaction. Most users live in chat apps; meet them there.
- Standardise the agent contract early. A small, boring "skill" abstraction with a clear manifest format pays for itself the third time you add a new agent type.
If you're working through any of these questions right now, I'd genuinely love to hear which one is hurting most. The platform I ended up with — ModelFitAI — is essentially the codified answer to all three, but the questions themselves are the part I think is worth sharing.
I write about LLM architecture, AI agent systems, and the practical economics of shipping AI products. More at modelfitai.com/blog or on LinkedIn.