From Bash to Bliss: Scaling Vespa Operations with Temporal
Growing platform - Growing maintenance
Keeping the Lights On (KTLO) is an essential, yet often taxing, part of a platform engineer’s role. It represents the routine operational work required to keep the business running and the platform stable. For our team, this primarily involves maintenance on our search engine, Vespa - ranging from version upgrades and service restarts to draining traffic from nodes for hardware replacements. In the O’Reilly book Platform Engineering, the authors recommend that “KTLO work should account for no more than 40% of your team’s workload. Any more than that and you risk burning out your team”. I couldn’t agree more. While necessary, KTLO tasks are often labor-intensive and repetitive rather than intellectually challenging.
As Vinted grows, our infrastructure must follow. We have transitioned from managing a hundred nodes to over a thousand, and without intervention, the KTLO “tax” compounds with every new node. We faced a binary choice: scale the team linearly by hiring, or scale our efficiency by reducing the manual burden.
The Scaling Wall: When Scripts Aren’t Enough
Our maintenance wasn’t fully manual; we relied on Bash scripts and Knife commands. This was sufficient for a few dozen nodes, but as the Vespa search engine became our default solution for search problems, our node count exploded. We reached a tipping point: we were no longer managing a single deployment, but dozens of unique deployments with varying maintenance needs. Our existing tooling simply couldn’t keep up with this complexity.
As we hit this limit, the flaws in script-based automation became clear:
- Fragility: Bash scripts are “fire and forget.” A network blip in the middle of an upgrade leaves an engineer to manually reconcile the cluster state.
- Operational Toil: without native state management, scripts require “babysitting” to ensure completion.
- Lack of Guardrails: scripts are often “blind.” We needed a system capable of checking node health and readiness before proceeding to the next node.
To support Vinted’s growth, we pivoted from scripts to durable orchestration with three goals:
- Zero-Impact: transparent operations with automated health checks.
- Autonomy: scheduled, hands-off upgrades.
- Self-Service: guardrails that allow product teams to safely manage their own restarts.
Temporal
To understand Temporal, you have to stop thinking about “running a script” and start thinking about “durable execution.” In a traditional environment, when you run a script to restart a Vespa search engine node, the state lives in the memory of the process running that script. If your laptop closes, the CI/CD runner times out, or the network blips, that state is lost. You’re left wondering: Did the node upgrade? Temporal changes this by acting as a fault-tolerant state machine. It records every successful step of your code in a backend database. If the execution is interrupted, Temporal simply spins it back up on a different worker and resumes from the last successful “event,” with all its variables and local state intact.
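To make that concrete, here is a minimal sketch of a node-restart workflow written with the Temporal Go SDK. The package name, activity names, timeouts, and retry settings are illustrative assumptions rather than our actual code; the point is that every completed activity is persisted in the event history, so a replacement worker resumes exactly where the previous one left off.

```go
package vespaops

import (
	"context"
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// Hypothetical activities; in a real worker these would call the Vespa and
// infrastructure APIs. They are stubbed here only to keep the example
// self-contained.
func DrainTraffic(ctx context.Context, node string) error         { return nil }
func RestartVespaServices(ctx context.Context, node string) error { return nil }
func WaitUntilHealthy(ctx context.Context, node string) error     { return nil }

// RestartNodeWorkflow illustrates durable execution: each completed activity
// is recorded in Temporal's event history, so if the worker running this code
// dies after DrainTraffic, another worker replays the history and continues
// from RestartVespaServices instead of starting over.
func RestartNodeWorkflow(ctx workflow.Context, node string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 15 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    5 * time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5,
		},
	})

	if err := workflow.ExecuteActivity(ctx, DrainTraffic, node).Get(ctx, nil); err != nil {
		return err
	}
	if err := workflow.ExecuteActivity(ctx, RestartVespaServices, node).Get(ctx, nil); err != nil {
		return err
	}
	return workflow.ExecuteActivity(ctx, WaitUntilHealthy, node).Get(ctx, nil)
}
```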
Why Temporal?
As our Vespa footprint grew to a thousand nodes, we considered building an event-driven system, where one service would emit a “Node Down” event and another service would react. However, we realized that events are often the wrong abstraction for complex maintenance.
Here is why we chose Temporal’s orchestration over traditional events or scripts:
- Orchestration over Choreography: in an event-driven “choreography,” it’s nearly impossible to see the “big picture” of an upgrade. With Temporal, the entire workflow - draining traffic, upgrading, and health-checking - is defined in a single block of code. We have a clear “manager” for the process rather than a dozen disconnected services “reacting” to each other.
- The Code is the State: usually, to automate an upgrade, you’d need a database to track which nodes are PENDING, UPGRADING, or FAILED. Temporal removes this “toil.” The state is simply the current line of code being executed.
- Built-in Reliability: our old Bash scripts had neither error handling nor durability. Temporal provides both as primitives. If a Vespa API call fails, we don’t write a retry loop; we tell Temporal to “retry with exponential backoff,” and it handles the rest.

We chose the Go SDK because it allows us to treat infrastructure as code in the truest sense. Our automation splits into two layers:

- Workflows (The Brain): we wrote a VespaUpgradeWorkflow in Go. It contains the deterministic logic that orchestrates each operation: locking the Chef client, bumping the version, restarting nodes, and ensuring we never take down too many nodes at once.
- Activities (The Muscles): these are the individual Go functions that talk to the various services and execute each step of the procedure. Because activities are decoupled from the workflow, we can fail and retry an activity (like a slow node restart) without ever failing the overall upgrade (a simplified sketch follows below).

By moving our KTLO work into Temporal, we transformed “babysitting scripts” into a self-healing platform operation.
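To illustrate the split between the brain and the muscles, here is a hedged sketch of what an upgrade workflow along these lines can look like with the Go SDK. The input type, activity names, timeouts, and batching logic are assumptions made for this example, not our production implementation:

```go
package vespaops

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// UpgradeParams is an illustrative input type; the real inputs are internal.
type UpgradeParams struct {
	Deployment     string
	TargetVersion  string
	Nodes          []string
	MaxUnavailable int // upper bound on nodes restarted at the same time
}

// VespaUpgradeWorkflow sketches the "brain": deterministic orchestration that
// locks config management, bumps the version, and rolls through the fleet in
// small batches so the cluster keeps serving throughout the upgrade.
func VespaUpgradeWorkflow(ctx workflow.Context, p UpgradeParams) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Minute,
		HeartbeatTimeout:    2 * time.Minute,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    10 * time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    10,
		},
	})

	// Activity names are hypothetical; the "muscles" are registered on the
	// worker and are retried independently without failing the workflow.
	if err := workflow.ExecuteActivity(ctx, "LockChefClient", p.Deployment).Get(ctx, nil); err != nil {
		return err
	}
	if err := workflow.ExecuteActivity(ctx, "BumpVespaVersion", p.Deployment, p.TargetVersion).Get(ctx, nil); err != nil {
		return err
	}

	if p.MaxUnavailable < 1 {
		p.MaxUnavailable = 1 // be conservative if the caller leaves it unset
	}

	// Upgrade nodes in batches of at most MaxUnavailable, waiting for every
	// node in a batch to come back healthy before starting the next batch.
	for start := 0; start < len(p.Nodes); start += p.MaxUnavailable {
		end := start + p.MaxUnavailable
		if end > len(p.Nodes) {
			end = len(p.Nodes)
		}
		var futures []workflow.Future
		for _, node := range p.Nodes[start:end] {
			futures = append(futures, workflow.ExecuteActivity(ctx, "UpgradeAndRestartNode", node, p.TargetVersion))
		}
		for _, f := range futures {
			if err := f.Get(ctx, nil); err != nil {
				return err
			}
		}
	}

	return workflow.ExecuteActivity(ctx, "UnlockChefClient", p.Deployment).Get(ctx, nil)
}
```

Because the batching lives in deterministic workflow code, the “never take down too many nodes at once” guarantee survives worker crashes and retries just like every other step.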
A platform within a platform
By leveraging Temporal, we automated far more than just routine maintenance. We extended our orchestration to include new cluster provisioning and other recurring operational tasks. Today, upgrades are scheduled automatically twice a month during weekdays (a schedule sketch is included at the end of this post). The system accounts for public holidays and traffic surges, ensuring engineers are online to respond if issues arise. We’ve integrated Slack for real-time progress reporting and use the Temporal UI for in-depth monitoring, backed by a robust alerting suite for stalled or failed workflows. This automation has transformed our daily operations:
- Self-Service: feature teams now use a Slack bot to trigger restarts independently, removing our team as a bottleneck.
- Provisioning at Scale: as the demand for new nodes increased, we automated the entire provisioning lifecycle.
- Reduced Toil: while hardware failures still occasionally require manual intervention, these are now outliers. What used to be the brunt of our on-duty backlog has effectively disappeared.

We have essentially built a platform within our search platform. This shift has not only lowered our KTLO “tax” but has also allowed us to focus on higher-value engineering rather than the logistics of scale.
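As a closing illustration, here is a hedged sketch of how a twice-monthly cadence with holiday skips can be expressed as a Temporal Schedule via the Go SDK’s schedule client. The IDs, dates, task queue, and time zone below are assumptions for the example, not our real configuration:

```go
package main

import (
	"context"
	"log"

	"go.temporal.io/sdk/client"
)

func main() {
	// Connects to localhost:7233 by default; adjust client.Options as needed.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	// Illustrative cadence: the 1st and 15th of every month at 08:00, with a
	// skip entry for a public holiday. All values here are examples.
	_, err = c.ScheduleClient().Create(context.Background(), client.ScheduleOptions{
		ID: "vespa-upgrade-schedule",
		Spec: client.ScheduleSpec{
			Calendars: []client.ScheduleCalendarSpec{{
				DayOfMonth: []client.ScheduleRange{{Start: 1}, {Start: 15}},
				Hour:       []client.ScheduleRange{{Start: 8}},
			}},
			// Skip entries suppress runs that land on listed dates,
			// e.g. December 25th.
			Skip: []client.ScheduleCalendarSpec{{
				Month:      []client.ScheduleRange{{Start: 12}},
				DayOfMonth: []client.ScheduleRange{{Start: 25}},
			}},
			TimeZoneName: "Europe/Vilnius",
		},
		Action: &client.ScheduleWorkflowAction{
			ID:        "vespa-upgrade",
			Workflow:  "VespaUpgradeWorkflow", // registered on the worker by name
			TaskQueue: "vespa-maintenance",
		},
	})
	if err != nil {
		log.Fatalln("unable to create schedule:", err)
	}
	log.Println("upgrade schedule created")
}
```

The schedule only starts the workflow; the guardrails themselves, such as health checks and batching, still live in the workflow code.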