<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://vinted.engineering//atom.xml" rel="self" type="application/atom+xml" /><link href="https://vinted.engineering//" rel="alternate" type="text/html" /><updated>2026-05-18T10:24:57+00:00</updated><id>https://vinted.engineering//atom.xml</id><title type="html">Vinted Engineering</title><subtitle>The Engineering Blog from Vinted. These are the voyages of code tailors that help create Vinted.</subtitle><entry><title type="html">How Vinted Serves Personalised Search Autocomplete</title><link href="https://vinted.engineering//2026/04/22/personalized-search-autocomplete/" rel="alternate" type="text/html" title="How Vinted Serves Personalised Search Autocomplete" /><published>2026-04-22T00:00:00+00:00</published><updated>2026-04-22T00:00:00+00:00</updated><id>https://vinted.engineering//2026/04/22/personalized-search-autocomplete</id><content type="html" xml:base="https://vinted.engineering//2026/04/22/personalized-search-autocomplete/"><![CDATA[<style>
/* Scoped to this post: allow diagram-heavy figures to bleed beyond the 720px
   column on wide viewports while collapsing back on narrow viewports. */
.post__entry figure.wide { max-width: 100%; }
@media (min-width: 1150px) {
  .post__entry figure.wide {
    max-width: 1100px;
    width: 1100px;
    margin-left: calc((720px - 1100px) / 2);
    margin-right: calc((720px - 1100px) / 2);
  }
}
/* References: left-align and allow long URLs to break cleanly instead of
   forcing wide word-spacing on justified lines. */
.post__entry .references { text-align: left; }
.post__entry .references a { word-break: break-all; }
/* Tables: left-align cells so narrow columns don't stretch words apart. */
.post__entry table td,
.post__entry table th { text-align: left; }
</style>

<p>At Vinted, more than 20% of all search sessions now start with a click on an autocomplete suggestion. A few years ago, that number was below 8%. Autocomplete does more than save typing effort: it helps people discover listings they didn’t know existed and guides them toward successful searches.</p>

<p>Today, across 24 languages and 50+ country-language combinations, we have a pool of 125 million different queries ready to suggest to users. Our service, <code class="language-plaintext highlighter-rouge">svc-suggestions</code>, runs on Vespa and matches and ranks 4,700 queries per second at 31 ms P99.</p>

<!--truncate-->

<figure style="text-align: center;">
  <img src="/static/2026/04/autocomplete_demo_short.gif" alt="Vinted autocomplete in action" style="max-width: 500px; width: 100%; border: 1px solid #e5e5e5; border-radius: 4px;" />
</figure>

<p>Autocomplete systems work in two phases:</p>

<ul>
  <li><strong>Offline</strong>, we generate a pool of candidate queries - the things we might suggest - from product data, search logs, or both.</li>
  <li><strong>Online</strong>, every time a user types a character, we match those candidates against the input (tolerating typos), rank the matches by what the user is most likely to click or buy, and return the top few. All of this has to happen in milliseconds - the bar is set by Google, Amazon, and every other product people use daily.</li>
</ul>

<p>This post walks through each step - candidate generation from product metadata and search logs (and why 2% of candidates drive half the clicks), edge-ngram indexing for performance, fuzzy matching for typo tolerance, personalisation via a Learning-to-Rank model, and what 35+ A/B experiments taught us over two years.</p>

<h2 id="generating-and-scoring-125-million-suggestions">Generating and scoring 125 million suggestions</h2>

<p>The foundation of any autocomplete system is candidate generation - producing the pool of suggestions that will later be matched and ranked at query time.</p>

<p>At Vinted, our Self-Learning Suggestions (SLS) pipeline draws candidates from two sources:</p>

<ul>
  <li><strong>Product metadata</strong> - item features such as category, colour, brand, and attributes are combined to generate all possible entity combinations (e.g., “Nike shoes”, “red dress Zara”).</li>
  <li><strong>Search logs</strong> - popular user queries in each country-language market, capturing real demand signals and seasonal trends. These include queries that metadata alone could never produce, like book titles (“harry potter and the prisoner of azkaban”), holiday-related content (“world book day costume girls”), or fashion trends (“y2k baggy jeans”).</li>
</ul>

<p>This dual approach ensures broad coverage: metadata-based suggestions surface inventory that users may not yet know to search for, while query-based suggestions reflect proven demand.</p>

<p>The graph below shows the UK’s most-clicked suggestions in April 2026, by source type:</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/uk-top-clicks-by-source.png" alt="UK top 30 most clicked suggestions by source (April 2026)" />
</figure>

<p><br /></p>

<p>Our members’ data also shows how suggestions from both product metadata and query logs shift with trends and seasons. Let’s compare the UK’s top-clicked search suggestions in December 2025 and April 2026:</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/seasonal-shift-uk.png" alt="UK seasonal shift: Christmas to Spring suggestion clicks" />
</figure>

<p><br /></p>

<p>It seems “unwanted christmas gifts” climbs into the UK’s most popular search suggestions during December. Nothing says “Happy Holidays!” quite like millions of users collectively speed-typing their way to rehome that <em>lovely reindeer jumper</em> or <em>luxury mini spa set</em> before New Year’s 😄.</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/unwanted-gifts-vinted.png" alt="A Champneys mini spa set listed on Vinted, labelled 'Unwanted gift from Christmas'" />
</figure>

<p>Multiple r/vinted threads show that reselling unwanted Christmas gifts is a recurring topic Vinted members discuss each year:</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/unwanted-gifts-reddit-only.png" alt="r/vinted threads discussing selling unwanted Christmas gifts" />
</figure>

<h3 id="candidate-scoring">Candidate scoring</h3>

<p>Once candidates are generated, they need to be scored. The SLS model computes a final ranking score for each suggestion using a multi-objective heuristic approach. It builds on the <a href="https://dl.acm.org/doi/10.1145/1963405.1963424">Most Popular Completion (MPC) framework</a> <a href="#ref-1">[1]</a>, extended with our own metrics, normalisation, and weighting logic.</p>

<p>Several performance metrics are calculated per suggestion, including:</p>

<ul>
  <li><strong>Item STR (sell-through rate)</strong> - how often items matching this suggestion actually sell</li>
  <li><strong>Number of sold items</strong> - absolute transaction volume behind the suggestion</li>
  <li><strong>Suggestion usage</strong> - the share of search sessions where the user clicks on a suggestion rather than submitting their own typed query</li>
  <li><strong>Suggestions CTR (click-through rate)</strong> - the ratio of suggestion clicks to number of suggestion lists shown</li>
</ul>

<p>The model scores candidates using aggregated metrics (such as 7-day item STR and suggestions CTR) at the country-language level. This means the ranking reflects the preferences of the “average” Vinted user in a given market - typically a 25-35-year-old woman living in a city, looking for affordable fashion. Great as a baseline. Blind to everyone else.</p>

<p>These raw metrics vary significantly in scale, so we normalise them in two steps: first, extreme values are capped using a sigma rule to prevent outliers from dominating; then, capped values are min-max normalised to a [0, 100] range. Normalisation is done per country, language, and first letter of the suggestion - so “Nike” competes with other “N” suggestions in the same market, not with globally popular suggestions starting with different letters.</p>
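<p>The two normalisation steps can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name, the sample CTR values, and the choice of 3 standard deviations for the sigma rule are assumptions, since the exact multiple is not stated here. The input is expected to be the values of one metric within a single country, language, and first-letter group.</p>

```python
from statistics import mean, pstdev


def cap_and_normalise(values: list[float], n_sigma: float = 3.0) -> list[float]:
    """Cap outliers at mean +/- n_sigma * std, then min-max scale to [0, 100]."""
    mu = mean(values)
    sigma = pstdev(values)
    lower, upper = mu - n_sigma * sigma, mu + n_sigma * sigma
    capped = [min(max(v, lower), upper) for v in values]
    lo, hi = min(capped), max(capped)
    if hi == lo:
        # All values identical after capping: there is no spread to scale.
        return [0.0 for _ in capped]
    return [100.0 * ((v - lo) / (hi - lo)) for v in capped]


# Hypothetical CTR values for "N..." suggestions in one country-language market.
ctr_by_suggestion = {"nike": 0.30, "new balance": 0.12, "next": 0.06, "napapijri": 0.02}
scores = cap_and_normalise(list(ctr_by_suggestion.values()))
normalised = dict(zip(ctr_by_suggestion, scores))
```

<p>Because each group is normalised on its own, the best suggestion in every group lands at 100 regardless of how its raw metric compares with other letters or markets.</p>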

<p>Each normalised metric is then multiplied by a hand-tuned weight that controls its relative importance. Weights are only applied when a metric has sufficient data - for instance, the CTR weight kicks in only if the suggestion has been shown enough times; below that threshold, the metric contributes with reduced influence. Additional bonuses are added based on structural properties: the entity combination type (e.g., brand-only vs. brand+category), whether the suggestion is trending, and so on. The final score is a linear sum of all weighted metrics plus these structural adjustments, normalised once more to produce the total score that each suggestion carries into Vespa.</p>
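<p>As a rough sketch of that weighting logic: the class, field names, threshold values, and the <code class="language-plaintext highlighter-rouge">low_data_factor</code> of 0.5 below are all hypothetical, and the final re-normalisation step is left out for brevity.</p>

```python
from dataclasses import dataclass


@dataclass
class Metric:
    normalised: float      # value already scaled to [0, 100]
    weight: float          # hand-tuned relative importance
    observations: int      # how much data backs this metric
    min_observations: int  # threshold for applying the full weight


def total_score(metrics: list[Metric], bonuses: list[float],
                low_data_factor: float = 0.5) -> float:
    """Linear sum of weighted metrics plus structural bonuses."""
    score = 0.0
    for m in metrics:
        weight = m.weight
        if m.observations < m.min_observations:
            # Not enough data: the metric still counts, but with less influence.
            weight *= low_data_factor
        score += weight * m.normalised
    return score + sum(bonuses)
```

<p>A suggestion with a well-observed STR of 80 (weight 0.5), a thinly observed CTR of 60 (weight 0.3, halved), and a trending bonus of 5 would score 40 + 9 + 5 = 54 under these assumed numbers.</p>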

<p>Finally, a diversity filter balances popular and low-viewed suggestions so that high-CTR items don’t dominate every market. Without it, newer or niche suggestions would never get the exposure needed to prove themselves.</p>

<p>The result is over 125 million scored suggestions across all 50+ country-language combinations, generated twice a week and ready to be indexed into Vespa, the search and ranking engine that also stores, matches, and ranks suggestions at serving time.</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/sls-pipeline.png" alt="SLS generation and scoring pipeline" />
</figure>

<h2 id="indexing-suggestions-into-vespa">Indexing suggestions into Vespa</h2>

<p>Suggestions are indexed through Vinted’s existing <a href="https://vinted.engineering/2023/09/25/search-indexing-pipeline/">Search Indexing Pipeline</a> - BigQuery exports to a Kafka topic, and Apache Flink streams updates into a dedicated Vespa cluster. The pipeline is fully streaming, so even though we only regenerate suggestions twice a week, we could switch to real-time updates without touching the infrastructure.</p>

<h3 id="why-vespa">Why Vespa?</h3>

<p>Vespa is the primary search engine at Vinted - we <a href="https://vinted.engineering/2024/09/05/goodbye-elasticsearch-hello-vespa/">migrated from Elasticsearch in 2023</a> and have written about the decision on this blog. For autocomplete specifically, the deciding factor was ranking: Vespa provides native support for ranking expressions and ML inference in the serving path, which means we can run a LightGBM model per keystroke without leaving the search engine.</p>

<p>The tradeoff: Vespa is weaker than Elasticsearch/OpenSearch in lexical analysis - it lacks built-in edge-ngram tokenisers and the rich analyser chains we needed. We close that gap in the matching section below.</p>

<h2 id="matching-user-input-in-milliseconds">Matching user input in milliseconds</h2>

<p>Once a user starts typing, we need to match their input against 125 million suggestions as fast as possible.</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/matching-diagram.png" alt="Matching user query 'ap' against 125M suggestions" />
</figure>

<p>We first implemented autocomplete using Vespa’s native <a href="https://docs.vespa.ai/en/text-matching.html#prefix-match">prefix query</a> - the approach Vespa recommends in their official <a href="https://github.com/vespa-engine/sample-apps/tree/master/incremental-search/search-suggestions">Search Suggestions sample app</a>. It worked, but at Vinted scale our load tests revealed a P99 latency of ~220 ms. That is not good enough for autocomplete, where users feel every millisecond of delay and slow queries burn CPU.</p>

<p>So we moved the matching cost from query time to indexing time. Borrowing the idea of an edge-ngram tokeniser from Elasticsearch, we split each suggestion into all its prefixes at index time:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"apple" → ["a", "ap", "app", "appl", "apple"]
</code></pre></div></div>
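<p>To make the idea concrete, here is a small Python sketch of edge-ngram expansion and a toy in-memory index. It only illustrates the technique; the real tokenisation happens in a Lucene analyser inside Vespa, and splitting suggestions on whitespace before expanding each word is an assumption made for this example.</p>

```python
def edge_ngrams(term: str, min_len: int = 1) -> list[str]:
    """Return every prefix of `term`, shortest first."""
    return [term[:i] for i in range(min_len, len(term) + 1)]


# Toy index: prefix -> suggestions that contain a word starting with that prefix.
index: dict[str, set[str]] = {}
for suggestion in ["apple watch", "apron", "nike shoes"]:
    for token in suggestion.split():
        for gram in edge_ngrams(token):
            index.setdefault(gram, set()).add(suggestion)

# Query time is a plain dictionary lookup, with no prefix scan over terms.
matches = index.get("ap", set())
```

<p>All the prefix work is paid once at index time; the lookup cost no longer depends on how many terms share the prefix.</p>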

<p>At query time, matching becomes a simple <code class="language-plaintext highlighter-rouge">contains</code> lookup:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">suggestions</span> <span class="k">where</span> <span class="n">title_edge_ngrams</span> <span class="k">contains</span> <span class="nv">"ap"</span>
</code></pre></div></div>

<p>The suggestion text field is indexed in memory using Vespa’s <code class="language-plaintext highlighter-rouge">attribute: fast-search</code>. We brought the edge-ngram tokeniser into Vespa through <a href="https://blog.vespa.ai/lucene-linguistics/">Lucene Linguistics</a> - our contribution to the Vespa project that allows Lucene analysers to run inside Vespa’s indexing pipeline.</p>

<p>The tradeoff is higher memory usage from storing more terms. For us it was worth it: P99 Vespa matching latency dropped from ~220 ms to ~25 ms, CPU usage decreased, and autocomplete felt faster for users.</p>

<h3 id="accent-tolerance-without-losing-intent">Accent tolerance without losing intent</h3>

<p>Vinted operates in many European countries, and a lot of languages use accented characters like š, ė, ą, ž, ł. In practice, users rarely type accents - instead of ž they simply type z - but we still want suggestions to match.</p>

<p>The naive approach: apply Lucene’s ASCIIFolding at both indexing and query time. ASCIIFolding is a token filter that maps accented Unicode characters to their closest ASCII equivalents (Ž → Z, ė → e, ł → l). This works for matching - but it throws away information. Typing an accent is a strong intent signal. If a user types Ž, they usually mean Žalgiris, not Zara.</p>

<p>To support this, we use a Multiplexer token filter, which stores two versions of every token:</p>

<ul>
  <li>original accented token</li>
  <li>ASCII-folded token</li>
</ul>

<p>So “žalgiris” is indexed as both “žalgiris” and “zalgiris” (each then further split into edge-ngrams: <code class="language-plaintext highlighter-rouge">["ž", "ža", ..., "žalgiris", "z", "za", ..., "zalgiris"]</code>).</p>

<p>Typing Z matches the ASCII-folded token → finds both Zara and Žalgiris:</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/accent-tolerance-z.png" alt="User query Z matches both Zara and Žalgiris via the ASCII-folded token" />
</figure>

<p>Typing Ž only matches the real accented token → finds Žalgiris only:</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/accent-tolerance-z-diacritic.png" alt="User query Ž only matches the real accented token for Žalgiris" />
</figure>

<p>This gives us accent tolerance when users don’t type accents, while preserving intent when they do.</p>
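<p>The behaviour can be approximated in a few lines of Python. This sketch is not Lucene’s ASCIIFolding filter: it strips combining marks via Unicode decomposition and adds a small, illustrative mapping for letters such as ł that have no decomposition. It also assumes tokens and queries are already lower-cased. All function names are hypothetical.</p>

```python
import unicodedata

# Letters with no Unicode decomposition need an explicit mapping (illustrative subset).
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"})


def ascii_fold(token: str) -> str:
    """Strip diacritics: decompose, drop combining marks, then apply extra folds."""
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def multiplex(token: str) -> list[str]:
    """Emit the original token and, if it differs, its ASCII-folded form."""
    folded = ascii_fold(token)
    return [token] if folded == token else [token, folded]


def edge_ngrams(term: str) -> list[str]:
    return [term[:i] for i in range(1, len(term) + 1)]


def index_terms(token: str) -> set[str]:
    """All indexed prefixes for a token, across both multiplexed variants."""
    return {gram for variant in multiplex(token) for gram in edge_ngrams(variant)}
```

<p>With this, <code class="language-plaintext highlighter-rouge">index_terms("žalgiris")</code> contains both “z” and “ž”, while <code class="language-plaintext highlighter-rouge">index_terms("zara")</code> contains only “z”. The query is deliberately not folded, which is what keeps a typed accent meaningful.</p>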

<h3 id="handling-typos-with-fuzzy-matching">Handling typos with fuzzy matching</h3>

<p>Handling misspellings is a must-have for autocomplete. Vespa provides fuzzy matching based on Levenshtein edit distance, allowing suggestions to match even when the user mistypes part of a word.</p>

<p>Two parameters control the tradeoff between fuzziness and performance:</p>

<ul>
  <li><a href="https://docs.vespa.ai/en/reference/query-language-reference.html#maxeditdistance"><code class="language-plaintext highlighter-rouge">maxEditDistance</code></a> - how many total character edits are allowed</li>
  <li><a href="https://docs.vespa.ai/en/reference/query-language-reference.html#prefixlength"><code class="language-plaintext highlighter-rouge">prefixLength</code></a> - how many prefix characters must match exactly (no edits allowed)</li>
</ul>

<p>In our case:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">prefixLength = 1</code> - improves latency by reducing how many edit combinations Vespa needs to consider, and improves relevance by avoiding edits on the first character (changing the first letter often turns the query into a completely different word).</li>
  <li><code class="language-plaintext highlighter-rouge">maxEditDistance = 1 or 2</code> depending on the fallback level. Edit distance 1 catches most common typos; we escalate to 2 only when the first pass returns too few results.</li>
</ul>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">suggestions</span>
<span class="k">where</span> <span class="n">title_edge_ngrams</span> <span class="k">contains</span> <span class="p">({</span><span class="n">prefixLength</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="n">maxEditDistance</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span> <span class="n">fuzzy</span><span class="p">(</span><span class="nv">"ap"</span><span class="p">))</span>
</code></pre></div></div>
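<p>To illustrate how the two parameters interact, here is a plain-Python sketch of the matching rule: the first <code class="language-plaintext highlighter-rouge">prefix_length</code> characters must be identical, and the remainder must be within the edit budget. It is a simplified model for intuition only, not Vespa’s implementation, and the function names are hypothetical. In the edge-ngram setup, <code class="language-plaintext highlighter-rouge">term</code> would be one of the indexed prefix tokens.</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_match(term: str, query: str, max_edit_distance: int, prefix_length: int) -> bool:
    """True if `term` is within the edit budget of `query`, with a locked prefix."""
    if term[:prefix_length] != query[:prefix_length]:
        return False
    return levenshtein(term[prefix_length:], query[prefix_length:]) <= max_edit_distance
```

<p>Under this model “nikr” matches “nike” at edit distance 1, while “mike” never does, because the first character is locked.</p>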

<p>When we started, Vespa didn’t support fuzzy prefix search - unlike Elasticsearch, where it’s built in. This was critical for us because users make typos before they finish a word, not only in finished ones. We opened <a href="https://github.com/vespa-engine/vespa/issues/30720">a feature request</a>, and the Vespa team shipped fuzzy on <code class="language-plaintext highlighter-rouge">prefix:true</code> shortly after - one of several fast turnarounds we’ve had with them on this project. By the time it landed, our edge-ngram design already covered the same ground: once prefixes are materialised as indexed tokens, Vespa never needs <code class="language-plaintext highlighter-rouge">prefix:true</code> mode for either exact or fuzzy queries. We stuck with edge-ngrams.</p>

<h3 id="cascading-from-precise-to-permissive-queries">Cascading from precise to permissive queries</h3>

<p>A single match pass isn’t always enough. For common queries, exact prefix returns plenty of results and the work ends there. For typos or unusual prefixes, we fall back through progressively more permissive matchers. Each additional tier is another client-to-Vespa round-trip on the hot path of every keystroke, so we order the tiers from most precise to most permissive and stop the moment we have 10 unique, deduplicated suggestions:</p>

<ol>
  <li>exact prefix</li>
  <li>fuzzy (edit distance 1)</li>
  <li>fuzzy (edit distance 2)</li>
</ol>

<p>Most popular queries produce 10 results in the first tier and never pay for the rest. Queries with a typo or an unusual prefix fall through to the next tier, stopping as soon as we have enough results.</p>

<p>This ordering also gives us ranking signal for free. Vespa doesn’t expose how many edits a fuzzy match used, so within one query we can’t tell an exact hit from a heavily corrected one. By splitting tiers we know the “strength” of a match from the phase that produced it, which lets us rank exact matches ahead of fuzzy ones when we merge results.</p>
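<p>The cascade can be sketched as a loop over matchers that stops early. The code below is a simplified illustration with hypothetical names; each matcher stands in for one round-trip to Vespa, and every result is tagged with the tier that produced it.</p>

```python
from typing import Callable

Matcher = Callable[[str], list[str]]


def cascade(query: str, tiers: list[tuple[str, Matcher]],
            limit: int = 10) -> list[tuple[str, str]]:
    """Run tiers in order; stop as soon as `limit` unique suggestions are collected."""
    results: list[tuple[str, str]] = []
    seen: set[str] = set()
    for tier_name, matcher in tiers:
        for suggestion in matcher(query):
            if suggestion not in seen:
                seen.add(suggestion)
                # The tier name doubles as a match-strength signal when merging.
                results.append((suggestion, tier_name))
            if len(results) == limit:
                return results
    return results
```

<p>Because later matchers are only called when earlier tiers fall short, a query that fills the list from the exact tier costs a single round-trip, and exact hits always precede fuzzy ones in the merged output.</p>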

<p>One deliberate design choice: we don’t relax aggressively. Returning nothing is sometimes better UX than something irrelevant. If we can’t suggest anything good, we let the user finish typing. If their query eventually succeeds, it enters the search logs and over time becomes a suggestion for everyone.</p>

<p>Most keystrokes never leave the first tier. Of all Vespa requests issued by <code class="language-plaintext highlighter-rouge">svc-suggestions</code>, ~62% are exact-prefix; the remaining ~21% and ~17% go to the edit-distance-1 and edit-distance-2 fuzzy tiers respectively.</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/request_type_ratio.png" alt="Pie chart: 62% of Vespa requests are prefix_exact_lightgbm, 21% prefix_fuzzy1_lightgbm, 17% prefix_fuzzy2_lightgbm" />
</figure>

<h2 id="personalising-ranking-with-learning-to-rank">Personalising ranking with Learning-to-Rank</h2>

<p>The matching layer produces a ranked list of candidate suggestions using the SLS heuristic score. But heuristic ranking treats all users the same. To make suggestions personal, we added a second-phase re-ranking layer using a Learning-to-Rank (LTR) model that runs natively inside Vespa.</p>

<h3 id="model-specifics">Model specifics</h3>

<p>We chose a tree-based LTR approach using LightGBM with a LambdaRank objective, optimising directly for NDCG@1. LightGBM is fast at inference, memory-efficient, supports categorical features natively, and remains interpretable - all important properties for an autocomplete system that must re-rank 20 suggestions per keystroke within milliseconds.</p>

<p>The model is trained on user-query-suggestion interaction data: for each query prefix, we observe which suggestions were shown and which ones users clicked. This produces labelled query-suggestion pairs that the model learns to rank.</p>
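<p>For readers less familiar with the metric, here is a small sketch of how NDCG@k is computed for one labelled impression. The example data is hypothetical, and the gain formula (2<sup>label</sup> - 1 with a log2 position discount) is the common textbook definition rather than anything specific to our training setup.</p>

```python
import math


def dcg_at_k(labels: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** label - 1) / math.log2(pos + 2)
               for pos, label in enumerate(labels[:k]))


def ndcg_at_k(labels_in_ranked_order: list[int], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg_at_k(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0


# One hypothetical impression for prefix "sh": suggestions as shown, 1 = clicked.
shown = ["shoes", "shirt", "shein", "shorts"]
clicked = [0, 0, 1, 0]
```

<p>With binary click labels, NDCG@1 is 1 only when the clicked suggestion sits in the first position, so for the impression above it is 0. Optimising for it pushes the model to get the very top slot right.</p>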

<h3 id="features">Features</h3>

<p>Our model uses 63 features organised into four groups:</p>

<table style="border-collapse: collapse; width: 100%; font-size: 0.9em;">
  <thead>
    <tr>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Feature group</th>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Examples</th>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Count</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Query &amp; Suggestion</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Input length, suggestion length, entity type combinations</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~10</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Popularity</td>
      <td style="border: 1px solid #ddd; padding: 8px;">CTR, click count, total score, frequency ratios</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~15</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">User Behaviour</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Click history, purchase patterns, category preferences (e.g., fashion vs. electronics vs. high-value fashion), suggestion interaction patterns</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~25</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Contextual</td>
      <td style="border: 1px solid #ddd; padding: 8px;">Country, language, platform (iOS / Android / Web), month</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~13</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<p>Below you can see what the model actually learns (feature importance) and which features make the biggest impact:</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/ltr-feature-importance.png" alt="Top 10 most important features in the LTR model, by split count and by gain" />
</figure>

<p>The single biggest signal is how much the user has typed - top feature by split count and top 3 by gain. Close behind are popularity signals: prefix-level click frequency, suggestions CTR, and the SLS heuristic score (4th by split - validating that the LTR model builds on the baseline rather than replacing it).</p>

<p>Equally important is the gap between the current input length and the input length at which users typically click a given suggestion (top 2 by gain). The model isn’t just learning what to show - it’s learning <em>when</em> to show it.</p>

<p>User preference features - favourite categories, brand interests - carry less individual weight but work collectively. They appear in the gain top 10 but not in the split top 10: the model doesn’t check them often, but when it does, they meaningfully change the ranking - especially on short, ambiguous prefixes where generic popularity can’t distinguish intent.</p>

<h3 id="serving-architecture">Serving architecture</h3>

<p>Vespa runs two-phase ranking on every keystroke:</p>

<ol>
  <li><strong>First-phase - SLS score.</strong> Each content node matches up to 1,000 candidates and orders them by the offline SLS score (described in <a href="#candidate-scoring">Candidate scoring</a>). This baseline is the same for every user in a given country-language market.</li>
  <li><strong>Second-phase - LightGBM re-ranking.</strong> The top 20 per node are re-ranked by the LightGBM model inside Vespa, combining indexed suggestion-side features with user-side features fetched in real time from Vinted’s Feature Store (VFS).</li>
</ol>
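<p>Conceptually, the per-node flow looks like the sketch below. It is a plain-Python model of two-phase ranking, not Vespa’s rank-profile configuration; the dictionary keys and the <code class="language-plaintext highlighter-rouge">second_phase</code> callable (standing in for the LightGBM model) are hypothetical.</p>

```python
from typing import Callable


def two_phase_rank(candidates: list[dict], second_phase: Callable[[dict], float],
                   rerank_count: int = 20) -> list[dict]:
    """Order by the cheap first-phase score, then re-score only the top slice."""
    first_phase = sorted(candidates, key=lambda c: c["sls_score"], reverse=True)
    head, tail = first_phase[:rerank_count], first_phase[rerank_count:]
    reranked = sorted(head, key=second_phase, reverse=True)
    # Candidates outside the re-rank window keep their first-phase order.
    return reranked + tail
```

<p>The expensive model only ever sees <code class="language-plaintext highlighter-rouge">rerank_count</code> candidates, which keeps per-keystroke inference cost bounded no matter how many suggestions matched.</p>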

<h3 id="personalisation-in-practice">Personalisation in practice</h3>

<p>The model learns that different users want different suggestions for the same prefix. For example, when a user in the UK types “sh”:</p>

<ul>
  <li>The baseline SLS ranking (same for all users) might show: “shoes”, “shirt”, “shein”, “shorts”…</li>
  <li>A user who predominantly clicks on men’s items sees: “shoes men”, “shirt”, “shacket”…</li>
  <li>A user who mostly browses women’s items sees: “shein”, “shoes women”, “shoulder bag”…</li>
</ul>

<p>This is especially impactful for short prefixes (1-3 characters), where the candidate space is large and generic ranking struggles to surface what any individual user actually wants.</p>

<h3 id="making-it-work-in-production">Making it work in production</h3>

<p>Building the LTR model was not a straight path: it took over 20 model variations and 5 experiments, with results ranging from clear regressions on early iterations to meaningful lifts in the version that eventually scaled.</p>

<p>Not all problems were about the model. Early on, our LightGBM scores in Vespa didn’t match the offline ones - same inputs, different outputs. We suspected a bug in Vespa’s LightGBM lambdarank integration around categorical features, put together a minimal reproduction, and reported it. The Vespa team shipped fixes very quickly (<a href="https://github.com/vespa-engine/vespa/pull/34084">vespa#34084</a>, <a href="https://github.com/vespa-engine/vespa/pull/34094">vespa#34094</a>).</p>

<p>On the data side, the biggest win was deceptively simple: cleaning up noisy training labels. When users type short prefixes (1-4 characters), they’re still typing, not choosing. But our click attribution marked suggestions as “clicked” at these lengths even when the actual click happened later, after the user had typed more characters. Stripping those noisy positives immediately improved ranking quality.</p>
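<p>A filter of this kind might look like the sketch below. The row schema and function name are hypothetical, and dropping the affected rows (rather than relabelling them as negatives) is an assumption made for this example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    prefix: str        # what the user had typed when the list was shown
    suggestion: str
    clicked: bool      # label as produced by click attribution
    click_prefix: str  # prefix at which the click actually happened ("" if none)


def strip_noisy_positives(rows: list[Row], short_prefix_max: int = 4) -> list[Row]:
    """Drop positives on short prefixes when the real click came at a later prefix."""
    kept = []
    for row in rows:
        noisy = (row.clicked
                 and len(row.prefix) <= short_prefix_max
                 and row.click_prefix != row.prefix)
        if not noisy:
            kept.append(row)
    return kept
```

<p>A positive recorded at “sh” for a click that really happened at “shoes” is removed, while the positive at “shoes” itself and all negatives are kept.</p>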

<p>We also learned where not to re-rank. Initially, the LTR model scored all returned suggestions - including fuzzy and fallback matches. But fuzzy matches are already lower-confidence by nature, and re-ranking them alongside exact prefix matches muddied the results. Restricting LTR re-ranking to exact prefix matches gave a clear boost to relevance metrics.</p>

<h2 id="high-level-architecture">High-level architecture</h2>

<p>Offline, BigQuery generates and scores the 125M suggestions, which Kafka and Flink stream into Vespa. Online, <code class="language-plaintext highlighter-rouge">svc-suggestions</code> fetches user features from the Feature Store, queries Vespa with progressive relaxation, and returns the top 10.</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/architecture.png" alt="Complete high-level architecture of Vinted autocomplete" />
</figure>

<h2 id="vespa-hardware">Vespa hardware</h2>

<p>The Vespa clusters run in three European data centres - with US expansion coming - each with 6 content nodes organised as 2 groups of 3. Indexes are split per country, so a query touches only the country’s shard group. Each content node is an AMD EPYC 7713P (64 cores, 128 threads) with 512 GB of RAM.</p>

<p>Search CPU averages ~2% and peaks at ~4.5% during Western European evening traffic, even at 4,700 QPS. The write-CPU spikes come from the twice-weekly index rebuilds, which push total CPU to ~7% on those days. The cluster has a lot of headroom: we could grow the user base, expand the suggestion pool, or run heavier ranking models without needing more metal.</p>

<figure class="wide" style="text-align: center;">
  <img src="/static/2026/04/vespa-cpu-usage.png" alt="Vespa search-node CPU usage over a week: Total ~2% mean / ~7% max, Search ~2% mean / ~4.5% max, Write ~0.1% mean / ~3% max" />
</figure>

<h2 id="lessons-from-35-ab-tests">Lessons from 35+ A/B tests</h2>

<p>Over two years we ran 35+ A/B experiments (30+ on SLS, 5 on LTR), of which ~30% were scaled, that is, rolled out to all users. Our two primary engagement metrics were suggestion usage (share of search sessions that click a suggestion) and suggestions CTR.</p>

<h3 id="ab-tests-and-interleaving">A/B tests and interleaving</h3>

<p>An A/B test at Vinted splits users into two groups - one sees the current autocomplete (“OFF”), one sees the change we want to evaluate (“ON”) - and we compare their behaviour: which group uses autocomplete more (suggestion clicks, session-level usage) and which group buys more on Vinted overall (transactions, GMV). A typical test needs a week or more of traffic before the numbers stabilise.</p>

<p>During periods of rapid iteration, we had more variants to test than A/B could handle in reasonable time. So we used team-draft interleaving: one list per user with suggestions drawn alternately from both variants, and we measured which variant’s suggestions users clicked more. We knew which variant was better in about a day instead of a week, which let us test ideas in batches.</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/interleaving.png" alt="Team-draft interleaving: a single merged suggestion list composed of items from variants A and B, with per-side click counts driving the decision" />
</figure>
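<p>For reference, the textbook team-draft procedure can be sketched as follows: the two variants take turns drafting their best remaining suggestion, with a coin flip deciding who goes first whenever they are level, and clicks are credited to the variant that drafted the clicked item. This is a generic sketch with hypothetical names, not our production implementation.</p>

```python
import random


def team_draft_interleave(ranking_a: list[str], ranking_b: list[str], size: int,
                          rng: random.Random) -> list[tuple[str, str]]:
    """Merge two rankings; each item is tagged with the team ("A" or "B") that drafted it."""
    merged: list[tuple[str, str]] = []
    used: set[str] = set()
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(merged) < size:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = min(picks, key=picks.get)
        else:
            team = rng.choice(["A", "B"])
        candidate = next((s for s in rankings[team] if s not in used), None)
        if candidate is None:
            # This team has nothing left to offer: let the other one draft instead.
            team = "B" if team == "A" else "A"
            candidate = next((s for s in rankings[team] if s not in used), None)
            if candidate is None:
                break  # both rankings are exhausted
        used.add(candidate)
        merged.append((candidate, team))
        picks[team] += 1
    return merged


def credit_clicks(merged: list[tuple[str, str]], clicked: set[str]) -> dict[str, int]:
    """Count clicks per team; the team with more clicks wins the comparison."""
    wins = {"A": 0, "B": 0}
    for suggestion, team in merged:
        if suggestion in clicked:
            wins[team] += 1
    return wins
```

<p>Because every user sees both variants in the same list, the comparison is within-user, which is why it needs far less traffic than a between-group A/B test.</p>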

<h3 id="why-evaluate-engagement-not-sales">Why evaluate engagement, not sales</h3>

<p>Autocomplete sits at the very top of the search funnel. A user accepts a suggestion, gets a better query, finds a relevant item, and eventually may purchase. This path has many confounding variables (item and shipping price, seller responsiveness, recommended items, promoted items), and the signal diminishes at each step.</p>

<p>Industry literature confirms the pattern: even Amazon reports modest ~0.13% revenue lifts from query autocomplete (QAC) improvements <a href="#ref-2">[2]</a>, and most published work by eBay, Walmart, and Spotify <a href="#ref-3">[3]</a><a href="#ref-4">[4]</a><a href="#ref-5">[5]</a> focuses on engagement metrics (MRR, acceptance rate, keystroke savings) rather than direct sales attribution. Sustained improvements in suggestion relevance and top-of-funnel engagement compound into better search sessions, discovery, and eventually conversions - but the causal chain is long.</p>

<h3 id="a-few-notable-tests">A few notable tests</h3>

<p>Out of 30+ SLS experiments, a few are worth calling out.</p>

<h4 id="sls-massively-increasing-autocomplete-usage">SLS: massively increasing autocomplete usage</h4>

<p>Before SLS, Vinted’s autocomplete was a static list of brand and category names - no data-driven ranking and no real user queries in the pool. The first SLS version shipped in early 2023 in the UK; over two years and 30+ experiments, suggestion usage climbed from under 8% of search sessions to over 17%.</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/sls-usage-over-time.png" alt="Search Suggestion Usage over the period of SLS experiments" />
</figure>

<p>The first significant increase came in April 2023, when we scaled SLS in the UK, with a +0.9% lift in search GMV per active user and +0.7% in search transactions per active user. The plateau from May to September 2023 (~10%) covered multiple iterations to get SLS performing well across all countries and languages. The first big jump across the whole marketplace came when SLS was scaled to all countries (September 2023). The step-change in early 2024, from ~13% to 17%+, came from adding query-data suggestions - one of the biggest levers in the entire SLS journey.</p>

<p>SLS originally drew candidates only from product metadata - combinations of brand, category, colour, and attribute. This was a clear constraint: users search for things metadata can’t describe (e.g. “atomic habits”, “sony wh-1000xm5”, “charizard pokemon card”, “mob wife coat”), and without those queries in the pool, autocomplete felt out of touch with how people actually shop.</p>

<p>Adding user search queries to the candidate pool changed that. Query-based candidates make up only ~2% of the total pool but account for roughly half of all suggestion clicks - users gravitate toward suggestions that match how real people search.</p>

<p>The caveat: query data is only as good as the query volume behind it. It works best in markets with enough users to generate signal and enough inventory for those queries to return real results. In smaller markets, query data adds more noise than value.</p>
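<p>To put the pool-share and click-share figures above side by side, here is a rough back-of-the-envelope calculation. It takes the quoted ~2% and “roughly half” at face value as 0.02 and 0.5; the function name is ours and purely illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def per_candidate_click_ratio(pool_share, click_share):
    """How many times more clicks an average candidate inside a group attracts
    than an average candidate outside it."""
    inside = click_share / pool_share
    outside = (1 - click_share) / (1 - pool_share)
    return inside / outside


# Assumed inputs: ~2% of the candidate pool, roughly half of all suggestion clicks.
ratio = per_candidate_click_ratio(pool_share=0.02, click_share=0.5)
print(round(ratio))
</code></pre></div></div>

<p>Under those assumptions, an average query-based candidate attracts on the order of 49 times as many clicks as an average metadata-based one. The exact multiple moves a lot with small changes in the inputs, so treat it as an order of magnitude rather than a measurement.</p>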

<p>Measured in short-term (&lt;= 2 weeks) experiments, the cumulative impact of the full SLS system was:</p>

<table style="border-collapse: collapse; width: 100%; font-size: 0.9em;">
  <thead>
    <tr>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Metric</th>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Result</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Suggestions CTR</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~+49%</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Suggestion usage</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~+42%</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Search channel transactions</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~+0.8% (p &lt; 0.05)</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<h4 id="debounce-latency-is-a-product-feature">Debounce: latency is a product feature</h4>

<p>The app used to wait 350 ms after each keystroke before asking for suggestions - every new character would cancel the pending timer and restart it. We dropped that to 100 ms, roughly the commonly cited limit below which a response feels instant. In a four-variant test, 100 ms won cleanly: suggestion usage up ~12%, iOS session-level usage up ~15%, and manual typing (users giving up and typing the full query themselves) down ~4%.</p>

<p>Shorter debounce also means more keystrokes actually fire a request instead of being cancelled by the next character - meaningfully increasing QPS to Vespa.</p>
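<p>The effect on request volume can be illustrated with a small simulation. This is a minimal sketch, not the app’s actual client code: the function name and the typing rhythm are made up for illustration, and it only models the cancel-and-restart rule described above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def fired_requests(keystroke_times_ms, debounce_ms):
    """Return the keystroke times whose debounce timer survives long enough to fire.

    A keystroke fires a request only if no later keystroke arrives within
    debounce_ms of it; otherwise its pending timer is cancelled and restarted.
    """
    fired = []
    for i, t in enumerate(keystroke_times_ms):
        is_last = i == len(keystroke_times_ms) - 1
        if is_last or keystroke_times_ms[i + 1] - t &gt;= debounce_ms:
            fired.append(t)
    return fired


# Hypothetical typing rhythm: three quick characters, a pause, then two more.
keystrokes = [0, 120, 240, 700, 820]

print(len(fired_requests(keystrokes, 350)))  # 2 requests
print(len(fired_requests(keystrokes, 100)))  # 5 requests
</code></pre></div></div>

<p>With the 350 ms timer only the keystroke before the pause and the final one reach the backend; at 100 ms every keystroke in this example does, which is where the extra QPS comes from.</p>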

<h4 id="capitalisation-a-deliberate-tradeoff">Capitalisation: a deliberate tradeoff</h4>

<p>A multi-variant test compared mixed-case suggestions (“Nike Shoes”) against all-lowercase (“nike shoes”). We expected casing not to affect user engagement metrics, but we were wrong: mixed-case had slightly better suggestions CTR and usage.</p>

<p>However, we had to make a deliberate tradeoff. Mixed-case broke visually when we started mixing metadata suggestions (which have predictable capitalisation) with query-data suggestions (which come in whatever case users type). Lowercase was less polished on individual suggestions but more consistent across the whole list, and could unblock future work. So we shipped lowercase anyway. Not every engagement win is worth scaling; sometimes the cleaner design is worth a point of CTR.</p>

<h4 id="scoped-suggestions-richer-ui-weaker-conversion">Scoped suggestions: richer UI, weaker conversion</h4>

<p>We tested suggestions that carried a category scope - clicking one didn’t just run a query, it applied the category as a hard filter on the results page. Visually richer, two jobs at once (search and scope).</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/scoped-suggestions-mockup.png" alt="Scoped suggestions mockup showing 'vintage' query with category scopes" />
</figure>

<p>CTR was up ~2.4% and usage up ~2.7% against the no-scopes baseline. But the downstream numbers told a different story: users who clicked a scoped suggestion were ~1.3% less likely to buy in that session (statistically significant), and transactions per active user trended slightly negative. The scope was restricting the result set in ways that hurt conversion. We did not roll it out.</p>

<p>Looking at major e-commerce and search players today, almost all stick with a plain lowercase query list. Showing more per suggestion (categories, filters, thumbnails) is rare even at the largest scale.</p>

<h3 id="ltr-personalisation-on-top-of-sls">LTR: personalisation on top of SLS</h3>

<p>With SLS producing strong baseline suggestions, the remaining problem was that every user saw the same ranking. A user who predominantly browses women’s clothing and one who shops for men’s luxury brands got the same list for the prefix “dr” (you guessed it - “dresses” often being the number one suggestion).</p>

<figure style="text-align: center;">
  <img src="/static/2026/04/personalisation-example.png" alt="Three different users get different personalised suggestions for the same prefix" />
</figure>

<p>Five experiments over several months iterated on features, training data, and when to apply the re-ranking. The final model scaled to 100% of users.</p>

<p>Key results:</p>

<table style="border-collapse: collapse; width: 100%; font-size: 0.9em;">
  <thead>
    <tr>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Metric</th>
      <th style="border: 1px solid #ddd; padding: 8px; text-align: left;">Result</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Suggestions CTR</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~+8%</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Suggestion usage</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~+4%</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">CTR on longer queries</td>
      <td style="border: 1px solid #ddd; padding: 8px;">up to +16%</td>
    </tr>
    <tr>
      <td style="border: 1px solid #ddd; padding: 8px;">Avg. value of viewed suggestions (EUR)</td>
      <td style="border: 1px solid #ddd; padding: 8px;">~+5%</td>
    </tr>
  </tbody>
</table>

<p><br /></p>

<p>More interesting than the aggregates was multivertical visibility. The SLS baseline was implicitly biased toward clothing, the dominant category on Vinted. Personalisation surfaced non-clothing verticals - electronics, sports, high-value fashion, luxury, and home - for the users who actually wanted them. The sports vertical showed particularly clear downstream impact: transactions per active user up ~0.91% (p &lt; 0.05), buyer GMV per active user up ~1.5%.</p>

<p>We also saw users deepening their relationship with autocomplete: clicks per user up ~3.8%, share of multi-day suggestion users up ~1.3%. Personalisation doesn’t just change a single session - it changes how much users lean on the feature over time. The effect on the suggestion-usage metric is clear: across all countries we saw significant shifts, with countries newly onboarded to Vinted seeing the biggest increases.</p>

<h2 id="learnings-along-the-way">Learnings along the way</h2>

<p>Getting here meant building the SLS pipeline from scratch, migrating from Elasticsearch to Vespa, implementing edge-ngram indexing and progressive query relaxation, and layering on a LightGBM Learning-to-Rank model for personalisation. A few things during this journey stood out:</p>

<ul>
  <li><strong>Get the retrieval foundations right first.</strong> The Vespa migration and SLS-generated suggestions doubled suggestion usage before we even added ML re-ranking. A solid baseline makes ML improvements additive, not a rescue operation.</li>
  <li><strong>Don’t underestimate heuristics.</strong> The SLS heuristic baseline carried most of the usage lift before we added any ML - simple, well-tuned heuristic approaches go a long way.</li>
  <li><strong>Real queries often beat the generated ones.</strong> Real search queries outperform machine-generated metadata combinations - but only when there are enough users typing and enough inventory to make those queries useful.</li>
  <li><strong>Personalisation pays off in the long tail.</strong> Its primary value lies in the ambiguous queries where individual intent diverges from the average, which aggregate business metrics do not easily capture. Patience and good experimentation infrastructure are essential.</li>
  <li><strong>Engagement metrics are the right leading indicators.</strong> Suggestions CTR, usage, and keystroke savings are the most sensitive and reliable signals for autocomplete quality. Downstream business metrics follow, but take longer to materialise.</li>
  <li><strong>Know when to show nothing.</strong> Our progressive relaxation and deliberate restraint on aggressive fuzzy fallback reflect that principle.</li>
  <li><strong>Industry defaults exist for a reason.</strong> We tried richer visuals more than once - capitalisation, category scopes - and the results rarely beat plain lowercase suggestions that run a simple text search. Most major search players do the same. Novelty in autocomplete UI consistently lost to user familiarity with the basic pattern.</li>
</ul>

<h2 id="whats-next">What’s next</h2>

<p>With the retrieval, ranking and personalisation foundations in place for search autocomplete, here’s where we’re heading:</p>

<ul>
  <li><strong>Session-aware re-ranking</strong> - using the queries a user has typed earlier in the session as context for the LTR reranker. A user who just searched “nike air max” and then types “s” is likely after “shoes”, not “skirt”.</li>
  <li><strong>Surfacing each user’s previous searches</strong> directly in autocomplete, drawn from both the current session and earlier ones. Google and eBay already do this - past searches render alongside popular suggestions, typically with a clock icon and an inline “remove” control.</li>
  <li><strong>Neural suggestion generation</strong> - LLMs open up an exciting frontier for autocomplete: generating suggestions that no user has typed before and no metadata combination could produce, be it long-tail queries, conversational phrasings, or trend-aware suggestions that adapt faster than any log-based pipeline. The challenge, though, is latency - autocomplete fires on every keystroke under a 100 ms budget, so generative inference doesn’t yet fit the head traffic. But with smaller and faster models, smarter caching, and better serving infrastructure, this gap is closing fast, so we see LLM generation as a natural next layer on top of the foundations we’ve built.</li>
</ul>

<h2 id="references">References</h2>

<div class="references">

  <p><span id="ref-1">[1]</span> Ziv Bar-Yossef, Naama Kraus. <a href="https://dl.acm.org/doi/10.1145/1963405.1963424">Context-sensitive query auto-completion</a>. 2011.</p>

  <p><span id="ref-2">[2]</span> Sonali Singh, Sachin Farfade, Prakash Mandayam Comar. <a href="https://assets.amazon.science/65/38/ed911a6b42718e19768e804f142e/evaluating-auto-complete-ranking-for-diversity-and-relevance.pdf">Evaluating Auto-complete Ranking for Diversity and Relevance</a>. Amazon Science.</p>

  <p><span id="ref-3">[3]</span> Adithya Rajan, Weiqi Tong, Greg Sharp, et al. <a href="https://arxiv.org/pdf/2505.08182">Semantic De-boosting in e-commerce Query Autocomplete</a>. Walmart Global Tech.</p>

  <p><span id="ref-4">[4]</span> Enrico Palumbo, Gustavo Penha, Alva Liu, et al. <a href="https://earl-workshop.github.io/pdf/recsys2025-workshops_paper_7.pdf">AudioBoost: Increasing Audiobook Retrievability in Spotify Search with Synthetic Query Generation</a>. Spotify.</p>

  <p><span id="ref-5">[5]</span> Hung Nguyen, Jayanth Yetukuri, Phuong Ha Nguyen, et al. <a href="https://genai-ecommerce.github.io/assets/papers/GenAIECommerce2025/recsys2025-workshops_paper_216.pdf">Enhancing Related Searches Recommendation System by Leveraging LLM Approaches</a>. eBay.</p>

</div>]]></content><author><name>Justina Bartulevičienė</name><uri>https://github.com/justinakiud</uri></author><summary type="html"><![CDATA[At Vinted, more than 20% of all search sessions now start with a click on an autocomplete suggestion. A few years ago, that number was below 8%. Autocomplete not only saves typing effort - it helps people discover listings they didn’t know existed, and guides them toward successful searches. Today, across 24 languages and 50+ country-language combinations, we have a pool of 125 million different queries ready to suggest to users. Our service, svc-suggestions, runs on Vespa and matches and ranks 4,700 queries per second at 31 ms P99.]]></summary></entry><entry><title type="html">Test Smarter, Not Harder: Risk-Based Data Quality Without Pipeline Paralysis</title><link href="https://vinted.engineering//2026/03/11/risk-based-testing/" rel="alternate" type="text/html" title="Test Smarter, Not Harder: Risk-Based Data Quality Without Pipeline Paralysis" /><published>2026-03-11T00:00:00+00:00</published><updated>2026-03-11T00:00:00+00:00</updated><id>https://vinted.engineering//2026/03/11/risk-based-testing</id><content type="html" xml:base="https://vinted.engineering//2026/03/11/risk-based-testing/"><![CDATA[<p>Upstream schema changes were breaking our finance pipelines daily. With monthly reporting deadlines looming, we needed to balance data quality with pipeline reliability. Here’s how we solved it without compromising either.</p>

<!--truncate-->

<h2 id="we-obsessed-over-data-quality-why">We obsessed over data quality. Why?</h2>

<p>Ideally, we’d like data to be pristine. Most data practitioners accept trade-offs in data quality rather than spending copious amounts of time addressing underlying data issues. For finance reporting, the margin of error is much smaller.</p>

<p>Our data domain covers reporting on shipments from Vinted’s network partners. We oversee carrier invoices for financial reporting and corroborate them against cost expectations based on shipments created and contractual terms. The data we receive from carriers comes in all shapes, sizes, and forms: think CSV, JSON, Parquet, and Excel files. This requires us to be flexible while keeping a close eye on any schema or format changes that may inadvertently affect the accuracy and completeness of the data.</p>

<p>Post-migration, we shifted left in our testing approach. We implemented several tests close to the source: not-null, accepted-values, and expression validations. An example would be an accepted-values test for cost descriptions, as they determine whether a positive or negative sign is applied to the invoice amount. Despite keeping only <em>essential</em> tests as errors (as opposed to tests that only raise warnings), our pipeline was being blocked by upstream errors.</p>

<p>The reality hit hard: our pipelines went from consistently high daily success to a significantly lower rate within two months. The variety of data we received changed much more frequently than we thought. Our initial attempts to stay on top of our input files were too strict.</p>

<figure style="text-align: center;">
  <img src="/static/2026/03/overzealous-testing.png" alt="Chart showing pipeline success rate declining over two months due to overly strict data quality testing" />
</figure>

<p>We went back to the drawing board and asked ourselves - what is just enough? Getting bombarded by alerts and failing pipelines in the morning was eroding stakeholder confidence. Data quality that doesn’t arrive on time defeats its own purpose. We needed our pipelines to be available for consumption at the start of the working day.</p>

<h2 id="materiality-and-informational-quality">Materiality and informational quality</h2>

<p><em>Materiality is related to the significance of information within a company’s financial statements. If a transaction or business decision is significant enough to warrant reporting to investors or other users of the financial statements, that information is “material” to the business and cannot be omitted.</em></p>

<p>Financial accounting principles are a helpful tool for determining how to deal with data quality. We’re interested in two concepts: materiality, and the qualitative characteristics of financial information. The latter can also be understood through the Informational Quality Framework.</p>

<p>We revised our processes. The most crucial dates for financial reporting were at the start of the month and in the middle of the month, when financial reporting information is downloaded and reported on. At all other times of the month, the daily pipeline serves analytical purposes. In other words, localised errors with a new incoming invoice would not be significant enough to influence someone’s decision-making process. Therefore, we could afford to be less strict on localised errors.</p>

<p>Data quality can be viewed through the lens of the Informational Quality Framework:</p>

<figure style="text-align: center;">
  <img src="/static/2026/03/informational-quality.png" alt="Informational Quality Framework diagram showing key dimensions: accuracy, completeness, timeliness, consistency, accessibility, and interpretability" />
</figure>

<p>In our context, this would mean:</p>

<ul>
  <li><strong>Accuracy</strong>: If the mart says the shipment was invoiced for 5€, was it the same at source?</li>
  <li><strong>Completeness</strong>: Are all 100 shipments invoiced reflected in data products?</li>
  <li><strong>Timeliness</strong>: Can I see the updated data by the start of the work day?</li>
</ul>

<p>The framework also highlights other qualities such as consistency, accessibility, and interpretability. The reference to the paper is included below.</p>

<p>With these quality dimensions defined, we still faced a practical challenge: how do we translate abstract concepts like “materiality” into concrete testing decisions? We needed a systematic way to tell apart the data quality issues that truly mattered for business decisions from those where perfection was merely “nice to have”. This led us to develop a framework that combined business impact assessment with frequency patterns.</p>

<h2 id="the-risk-based-approach">The risk-based approach</h2>

<p>Setting guidance on materiality in view of accuracy and completeness helped us assess the potential impact on the team. After several months of observing schema evolution with an overly strict testing regime, we had a sense of how frequently exceptions occurred.</p>

<p>This allowed us to conceptualise the risk-based testing framework, based on the issues’ impact and frequency. The framework helped us reduce daily pipeline failures while maintaining critical data quality checks.</p>

<figure style="text-align: center;">
  <img src="/static/2026/03/risk-matrix.png" alt="Risk-based testing matrix with four quadrants: high impact/high frequency (avoid), high impact/low frequency (avoid), low impact/high frequency (reduce), low impact/low frequency (accept)" />
</figure>

<p>We took a more cautious approach. High impact risks should be avoided at all costs. For low impact but high frequency exceptions, we reduce the risk by monitoring them more closely. Alerts are triggered when there are exceptions, and there are tests that run daily. In cases where the exception doesn’t happen often and is unlikely to have high impact, we accept the risk. We monitor them in weekly reviews, without the need to trigger an alert every day.</p>
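<p>The decision rule in this paragraph can be written down as a small lookup. This is only an illustrative sketch: the <code class="language-plaintext highlighter-rouge">Action</code> enum and <code class="language-plaintext highlighter-rouge">risk_action</code> function are hypothetical names and not part of our pipeline.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from enum import Enum


class Action(Enum):
    AVOID = "keep the test in the main run and fail on error"
    REDUCE = "run the test daily and alert on exceptions"
    ACCEPT = "look at the results in the weekly review"


def risk_action(impact, frequency):
    """Map an issue's impact and frequency ('high' or 'low') to a response."""
    levels = ("high", "low")
    if impact not in levels or frequency not in levels:
        raise ValueError("impact and frequency must be 'high' or 'low'")
    if impact == "high":
        # High-impact issues are avoided regardless of how often they occur.
        return Action.AVOID
    return Action.REDUCE if frequency == "high" else Action.ACCEPT


for impact, frequency in [("high", "high"), ("high", "low"), ("low", "high"), ("low", "low")]:
    print(impact, frequency, risk_action(impact, frequency).name)
</code></pre></div></div>

<p>Writing the rule out this way makes the asymmetry explicit: frequency only matters once an issue has been judged low impact.</p>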

<p>These are some examples:</p>

<figure style="text-align: center;">
  <img src="/static/2026/03/examples.png" alt="Examples of data quality issues categorized by risk level: missing invoice amounts (high impact), duplicate tracking IDs (high impact), new cost types (low impact), and date parsing errors (low impact)" />
</figure>

<p>We only kept high impact tests in the main run. <code class="language-plaintext highlighter-rouge">dbt build</code> by default runs all tests, whereas we opted to exclude a substantial number of tests which look out for low impact silent failures. This helped us make the main run leaner while preserving checks.</p>

<figure style="text-align: center;">
  <img src="/static/2026/03/dag-timeline.png" alt="DAG timeline comparison showing improved pipeline performance after implementing lean test runs with excluded low-impact tests" />
</figure>

<h3 id="translating-this-to-code">Translating this to code</h3>

<p>We tag tests and use exclusion flags to build models and run tests.</p>

<p>At the test level, in the model’s configuration:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">data_tests</span><span class="pi">:</span>
  <span class="c1"># Missing amounts break financial reconciliation</span>
  <span class="pi">-</span> <span class="na">not_null</span><span class="pi">:</span>
      <span class="na">column_name</span><span class="pi">:</span> <span class="s">invoice_amount_eur</span>
      <span class="na">config</span><span class="pi">:</span>
        <span class="na">tags</span><span class="pi">:</span> <span class="s">highimpact_highfrequency</span>
  <span class="c1"># Duplicate tracking IDs corrupt cost allocation</span>
  <span class="pi">-</span> <span class="na">unique</span><span class="pi">:</span>
      <span class="na">column_name</span><span class="pi">:</span> <span class="s">shipment_tracking_id</span>  
      <span class="na">config</span><span class="pi">:</span>
        <span class="na">tags</span><span class="pi">:</span> <span class="s">highimpact_lowfrequency</span>
  <span class="c1"># New cost types appear occasionally, can be mapped later</span>
  <span class="pi">-</span> <span class="na">accepted_values</span><span class="pi">:</span>
      <span class="na">column_name</span><span class="pi">:</span> <span class="s">cost_description</span>
      <span class="na">values</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">delivery</span><span class="pi">,</span> <span class="nv">return</span><span class="pi">,</span> <span class="nv">surcharge</span><span class="pi">,</span> <span class="nv">fuel_surcharge</span><span class="pi">]</span>
      <span class="na">config</span><span class="pi">:</span>
        <span class="na">tags</span><span class="pi">:</span> <span class="s">lowimpact_highfrequency</span>
  <span class="c1"># Occasional date parsing errors, rarely material</span>
  <span class="pi">-</span> <span class="na">expression</span><span class="pi">:</span>
      <span class="na">expression</span><span class="pi">:</span> <span class="s2">"</span><span class="s">invoice_date</span><span class="nv"> </span><span class="s">&gt;=</span><span class="nv"> </span><span class="s">'2020-01-01'"</span> 
      <span class="na">config</span><span class="pi">:</span>
        <span class="na">tags</span><span class="pi">:</span> <span class="s">lowimpact_lowfrequency</span>
</code></pre></div></div>

<p>In <code class="language-plaintext highlighter-rouge">dbt_project.yml</code>:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">models</span><span class="pi">:</span>
  <span class="na">vgo_finance</span><span class="pi">:</span>
    <span class="na">+meta</span><span class="pi">:</span>
      <span class="na">excluded_tests</span><span class="pi">:</span>
        <span class="c1"># This way, only high impact tests run</span>
        <span class="pi">-</span> <span class="s">tag:lowimpact_highfrequency</span>
        <span class="pi">-</span> <span class="s">tag:lowimpact_lowfrequency</span>

<span class="na">sources</span><span class="pi">:</span>
  <span class="na">vgo_finance</span><span class="pi">:</span>
    <span class="na">+meta</span><span class="pi">:</span>
      <span class="na">excluded_tests</span><span class="pi">:</span>
        <span class="c1"># This way, only high impact tests run</span>
        <span class="pi">-</span> <span class="s">tag:lowimpact_highfrequency</span>
        <span class="pi">-</span> <span class="s">tag:lowimpact_lowfrequency</span>
</code></pre></div></div>

<p>Our dbt project is split up into different tasks in Airflow. We have internal orchestration that reads the <code class="language-plaintext highlighter-rouge">+meta.excluded_tests</code> configuration and turns it into <code class="language-plaintext highlighter-rouge">--exclude</code> flags when calling the dbt CLI. See this article on our Airflow set-up: <a href="/2025/12/29/orchestrating-success/">Orchestrating Success</a>.</p>

<p>Stock dbt does not interpret <code class="language-plaintext highlighter-rouge">meta</code> in this way, so if you do not have similar tooling you should pass <code class="language-plaintext highlighter-rouge">--exclude</code> directly to <code class="language-plaintext highlighter-rouge">dbt test</code> / <code class="language-plaintext highlighter-rouge">dbt build</code>, as in:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Main pipeline run - only critical tests</span>
dbt build <span class="nt">--exclude</span> tag:lowimpact_highfrequency tag:lowimpact_lowfrequency

<span class="c"># Daily monitoring run - low-impact high-frequency tests for alerting</span>
dbt <span class="nb">test</span> <span class="nt">--select</span> tag:lowimpact_highfrequency
</code></pre></div></div>
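<p>For readers without such tooling, the translation from an <code class="language-plaintext highlighter-rouge">excluded_tests</code> list into CLI arguments could look roughly like the sketch below. It is an assumption about the general shape, not our internal orchestration code; <code class="language-plaintext highlighter-rouge">dbt_build_args</code> is a hypothetical helper, and the <code class="language-plaintext highlighter-rouge">meta</code> dictionary stands in for configuration you would load from your own project.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def dbt_build_args(excluded_tests):
    """Build a dbt CLI argument list from selectors such as 'tag:lowimpact_lowfrequency'."""
    args = ["dbt", "build"]
    if excluded_tests:
        # dbt accepts several space-separated selectors after a single --exclude.
        args += ["--exclude", *excluded_tests]
    return args


# Stand-in for the +meta.excluded_tests configuration shown above.
meta = {"excluded_tests": ["tag:lowimpact_highfrequency", "tag:lowimpact_lowfrequency"]}

print(" ".join(dbt_build_args(meta["excluded_tests"])))
</code></pre></div></div>

<p>The resulting list can be handed to whatever launches dbt in your orchestrator; keeping it as a list rather than a single string avoids shell-quoting surprises.</p>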

<h2 id="managing-a-process-change">Managing a process change</h2>

<figure style="text-align: center;">
  <img src="/static/2026/03/change-model.png" alt="Change management process flow: recognizing problems, gathering requirements, writing RFC, taking action, with emphasis on stakeholder engagement" />
</figure>

<p>This was an exercise in change management. To obtain a mandate to prioritise these tasks, we first needed consensus that there was a problem. Then, while gathering requirements, we wrote an RFC and took action.</p>

<p>The key is recognising that when data quality issues arise, they are closely tied to process deficiencies. Engaging the people who interact with the process is essential to ensuring it succeeds.</p>

<h2 id="designing-for-continuity-first-principles-and-inversion-as-a-mental-model">Designing for continuity: first principles and inversion as a mental model</h2>

<p><em>First principles thinking is a problem-solving method that breaks complex issues down into their most fundamental, foundational truths, rather than reasoning by analogy or convention.</em></p>

<p><em>Inversion thinking is a problem-solving technique that flips challenges upside down by focusing on how to avoid failure rather than solely on how to achieve success.</em></p>

<p>For data quality issues, we zoomed in on testing at the source to ensure completeness. We focused on key pieces of information like the invoice amount, the date, and the cost description. We tried to imagine, and asked, how our stakeholders would check for these issues. What would they want to know, and what could help them resolve an issue as quickly as possible?</p>

<figure style="text-align: center;">
  <img src="/static/2026/03/questions.png" alt="Key stakeholder questions for data quality issues: What happened? How much data is affected? When did it occur? What action is needed?" />
</figure>

<p>This is an iterative journey, and expectations are likewise built over time. When we started alerting on source issues, the situation we wanted to avoid was not catching them at all. That gave us an alert, but it wasn’t obvious what someone needed to do with it until they looked at a Google Sheet. Our expectation then evolved: no alert should be unactionable. In other words, looking at the alert should immediately tell someone what to do with it.</p>

<p>The key insight: <strong>actionable alerts build trust, while noisy alerts erode it</strong>. We learned this the hard way when our Slack channel went from essential updates to a source of alert fatigue.</p>
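<p>As an illustration of what “actionable” can mean in practice, the sketch below renders an alert that answers the four stakeholder questions from the figure above. The class, field names, and example values are all hypothetical; our real alerts are not built from this code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass


@dataclass
class DataQualityAlert:
    what_happened: str
    rows_affected: int
    rows_total: int  # assumed to be greater than zero
    detected_on: str
    action_needed: str

    def render(self):
        """Format the alert so each stakeholder question has a one-line answer."""
        share = 100 * self.rows_affected / self.rows_total
        return "\n".join([
            f"What happened: {self.what_happened}",
            f"How much is affected: {self.rows_affected} of {self.rows_total} rows ({share:.1f}%)",
            f"When: {self.detected_on}",
            f"Action needed: {self.action_needed}",
        ])


# Made-up example values for illustration only.
alert = DataQualityAlert(
    what_happened="Unmapped cost_description value in a carrier invoice",
    rows_affected=12,
    rows_total=4800,
    detected_on="2026-03-02",
    action_needed="Add the new cost type to the accepted values mapping",
)
print(alert.render())
</code></pre></div></div>

<p>The point of the structure is that an alert missing any of the four answers is easy to spot before it is ever sent.</p>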

<p>We will keep revising, and learning. For now, we have something that balances alerts (preventing fatigue) while ensuring quality and building trust.</p>

<h2 id="appendix">Appendix</h2>

<p>We’ve also shared our work at the <a href="https://www.youtube.com/watch?v=tNZMm4KTjTc">Forward Data Conference</a>. Do check it out!</p>

<p>Some prior research referenced is:</p>
<ul>
  <li>Prochaska, J. O., Norcross, J. C., &amp; DiClemente, C. C. (2013). Applying the stages of change. Psychotherapy in Australia, 19(2), 10-15.</li>
  <li>Eppler, M. J., &amp; Wittig, D. (2000). Conceptualizing Information Quality: A Review of Information Quality Frameworks from the Last Ten Years. IQ, 20(0), 0.</li>
  <li>IFRS Foundation (2018). Definition of Material, Amendments to IAS 1 and IAS 8.</li>
</ul>]]></content><author><name>Jeremy Chia</name><uri>https://github.com/jeremychia</uri></author><summary type="html"><![CDATA[Upstream schema changes were breaking our finance pipelines daily. With monthly reporting deadlines looming, we needed to balance data quality with pipeline reliability. Here’s how we solved it without compromising either.]]></summary></entry><entry><title type="html">From Dagger to Metro</title><link href="https://vinted.engineering//2026/02/12/from-dagger-to-metro/" rel="alternate" type="text/html" title="From Dagger to Metro" /><published>2026-02-12T00:00:00+00:00</published><updated>2026-02-12T00:00:00+00:00</updated><id>https://vinted.engineering//2026/02/12/from-dagger-to-metro</id><content type="html" xml:base="https://vinted.engineering//2026/02/12/from-dagger-to-metro/"><![CDATA[<p><a href="https://zacsweers.github.io/metro/latest/">Metro</a> is a modern Kotlin dependency injection framework <a href="https://www.zacsweers.dev/introducing-metro/">created by Zac Sweers</a>.
We, the Android developers at Vinted, have officially and fully migrated to it! It was quite a bumpy ride for our huge codebase.</p>

<p>Our story begins…</p>

<!--truncate-->

<h2 id="era-before-metro">Era before Metro</h2>

<p>We have a huge codebase of a few hundred Gradle modules, which has collected some good code and some legacy code over the past 14 years.
We adopted dependency injection from the beginning: first the original Dagger released by Square, then the second, fully static version released by Google.
A couple of years later, we adopted dagger.android, and the idea of having subcomponents per fragment looked fantastic back then (spoiler alert: it was not).
Later, a simpler yet more powerful DI idea arrived as the Hilt framework, but it was too late to redo all fragments.</p>

<p>After modularization gained momentum and the module count grew rapidly, we began to envy Hilt’s way of installing dependencies instead of providing them via large Dagger modules.
But it was hard to justify to the business the time we would spend rewriting the code.</p>

<p>Until one day we found Anvil, a Kotlin compiler plugin that brings Hilt’s idea of contributing dependencies via annotations.
It was also faster thanks to its Dagger factory generation.
We eagerly began adopting it, and even migrating Fragments from Android Injector to constructor injection.</p>

<p>But technology moves fast: Kotlin released K2, and since Anvil was a Kotlin compiler plugin, it needed a huge effort to support K2. Later, Anvil moved to maintenance mode.</p>

<p>So even after upgrading Kotlin to 2.x, we were still stuck on K1 and 1.9 language features, without incremental compilation.
K1 support nearing its end also added pressure. We had several options: Hilt, kotlin-inject, and … Metro.</p>

<h2 id="why-metro">Why Metro</h2>

<p>Metro was built using lessons learned from other DI frameworks, bringing together many solid ideas.
It supports Kotlin idioms well and is fast, consistent, and easy to learn.
However, it was difficult to justify switching at the time—at first glance, it seemed too risky to rely on a brand-new framework.</p>

<p>Metro has a major migration advantage over other frameworks: robust, feature-rich interoperability with popular DI solutions like Dagger, Anvil, and kotlin-inject.
In fact, Metro was the quickest path for us to adopt K2, since other frameworks would have required migrating more business code, adding not only time but also risk.</p>

<p>We evaluated all the options, but Metro was growing fast, and the direction of its growth aligned closely with our needs.
This further solidified our choice.</p>

<h2 id="bumpy-migration">Bumpy migration</h2>

<p>Not gonna lie, the ride was not easy.
We decided to migrate everything at once, without using any of the interoperability options, while keeping the scopes and graph structure.
The first obvious thing to do was to mass-replace imports and annotation names.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- import javax.inject.Inject
</span><span class="gi">+ import dev.zacsweers.metro.Inject
</span></code></pre></div></div>

<p>Funny thing about javax.inject.Inject - a lot of libraries are “leaking” it!
So the IDE will always try to suggest it in autocomplete.
At some point, we had to set up a separate validating KSP processor just to fail the build when an @Inject annotation from the wrong library was encountered in the source code, since it can lead to subtle and hard-to-catch bugs.
Later, though, we were able to remove it from the compile classpath completely, which solved the autocomplete problem.</p>
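
<p>The idea behind such a check can be sketched without KSP. The snippet below is a simplified stand-in, assuming a plain text scan over source files instead of a real symbol processor; <code>findForbiddenImports</code> and the sample file names are hypothetical, and it only catches the explicit import, not wildcard imports or fully qualified usages.</p>

<pre><code class="language-kotlin">// Simplified stand-in for the validating processor: a plain text scan, not KSP.
// All names here are illustrative assumptions.
const val FORBIDDEN_IMPORT = "import javax.inject.Inject"

// Returns "file:line" locations where the forbidden import appears.
fun findForbiddenImports(sources: Map&lt;String, String&gt;): List&lt;String&gt; =
    sources.flatMap { (file, content) -&gt;
        content.lines()
            .withIndex()
            .filter { (_, line) -&gt; line.trim() == FORBIDDEN_IMPORT }
            .map { (index, _) -&gt; "$file:${index + 1}" }
    }

fun main() {
    val sources = mapOf(
        "Good.kt" to "import dev.zacsweers.metro.Inject\nclass Good",
        "Bad.kt" to "package demo\nimport javax.inject.Inject\nclass Bad",
    )
    val violations = findForbiddenImports(sources)
    violations.forEach { println("Wrong @Inject import at $it") }
    // A real build check would fail here when the list is not empty.
}
</code></pre>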

<p>We also removed the <code class="language-plaintext highlighter-rouge">@JvmSuppressWildcards</code> annotations, as they are no longer needed.
A harder thing was replacing <code class="language-plaintext highlighter-rouge">@ContributesMultibinding</code>, since Metro has two annotations in its place: <code class="language-plaintext highlighter-rouge">@ContributesIntoSet</code> and <code class="language-plaintext highlighter-rouge">@ContributesIntoMap</code>.
But no worries, Metro will let you know if you make a mistake!</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gd">- @ContributesMultibinding(FragmentComponent::class)
</span><span class="gi">+ @ContributesIntoMap(FragmentScope::class)
</span>  @ViewModelKey(AddressPluginViewModel::class)
  class AddressPluginViewModel @Inject constructor() : ViewModel()
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">boundType</code> (and many other cases) can be fixed with regexp magic.
Pro tip: write a script instead of doing the mass replace manually; it will help later when merging upstream changes.</p>

<pre><code class="language-regexp">boundType = (.*)::class
⬇️
binding = binding&lt;$1&gt;()
</code></pre>
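
<p>As a rough illustration of such a script, the Kotlin sketch below applies the same replacement to a source string. The function name <code>migrateBoundType</code> and the sample input are hypothetical, and it uses a non-greedy group (<code>.*?</code>) instead of the greedy one above, so that a line with several <code>::class</code> references is not over-matched. Treat it as a starting point, not as the exact script used in the migration.</p>

<pre><code class="language-kotlin">// Hypothetical helper: rewrites `boundType = Foo::class`
// into Metro-style `binding = binding&lt;Foo&gt;()`.
private val boundTypeRegex = Regex("""boundType = (.*?)::class""")

fun migrateBoundType(source: String): String =
    boundTypeRegex.replace(source) { match -&gt;
        "binding = binding&lt;${match.groupValues[1]}&gt;()"
    }

fun main() {
    // Illustrative input line; the annotation and scope names are made up.
    val before = "@ContributesBinding(AppScope::class, boundType = Repository::class)"
    println(migrateBoundType(before))
}
</code></pre>

<p>Using a lambda for the replacement avoids having to escape <code>$1</code> inside a Kotlin string template.</p>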

<p>The other half was tricky. Do you remember the Android Injectors from <code class="language-plaintext highlighter-rouge">dagger.android</code> mentioned earlier?
We still have more than 100 fragments left…
But there is nothing code generation cannot solve!
We made a crude implementation that generates graph extensions from similar annotations (we took some shortcuts here).
From this code:</p>

<div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Container only for Android Injector contributions</span>
<span class="nd">@InjectorModule</span><span class="p">(</span><span class="nc">ActivityScope</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
<span class="k">abstract</span> <span class="kd">class</span> <span class="nc">LegacyFragmentsModule</span> <span class="p">{</span>
    <span class="nd">@FragmentScope</span>
    <span class="nd">@ContributesAndroidInjector</span><span class="p">(</span><span class="n">modules</span> <span class="p">=</span> <span class="p">[</span><span class="nc">LegacyModule</span><span class="o">::</span><span class="k">class</span><span class="p">])</span>
    <span class="k">abstract</span> <span class="k">fun</span> <span class="nf">contributesLegacyFragment</span><span class="p">():</span> <span class="nc">LegacyFragment</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We generated this:</p>

<div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@FragmentScope</span>
<span class="nd">@GraphExtension</span><span class="p">(</span>
    <span class="nc">FragmentScope</span><span class="o">::</span><span class="k">class</span><span class="p">,</span>
    <span class="n">bindingContainers</span> <span class="p">=</span> <span class="p">[</span><span class="nc">LegacyModule</span><span class="o">::</span><span class="k">class</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">public</span> <span class="kd">interface</span> <span class="nc">LegacyFragmentInjectorGraph</span> <span class="p">{</span>
    <span class="c1">// Still using member injection</span>
    <span class="k">public</span> <span class="k">fun</span> <span class="nf">inject</span><span class="p">(</span><span class="n">instance</span><span class="p">:</span> <span class="nc">LegacyFragment</span><span class="p">)</span>

    <span class="nd">@ContributesTo</span><span class="p">(</span><span class="nc">ActivityScope</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
    <span class="nd">@GraphExtension</span><span class="p">.</span><span class="nc">Factory</span>
    <span class="k">public</span> <span class="kd">interface</span> <span class="nc">Factory</span> <span class="p">{</span>
        <span class="k">fun</span> <span class="nf">create</span><span class="p">(</span>
            <span class="nd">@Provides</span> <span class="n">instance</span><span class="p">:</span> <span class="nc">LegacyFragment</span>
        <span class="p">):</span> <span class="nc">LegacyFragmentInjectorGraph</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="nd">@Inject</span>
<span class="nd">@ContributesIntoMap</span><span class="p">(</span>
    <span class="nc">ActivityScope</span><span class="o">::</span><span class="k">class</span><span class="p">,</span>
    <span class="n">binding</span> <span class="p">=</span> <span class="n">binding</span><span class="p">&lt;</span><span class="nc">InstanceInjector</span><span class="p">&lt;</span><span class="nc">Fragment</span><span class="p">&gt;&gt;()</span>
<span class="p">)</span>
<span class="nd">@ClassKey</span><span class="p">(</span><span class="nc">LegacyFragment</span><span class="o">::</span><span class="k">class</span><span class="p">)</span>
<span class="k">public</span> <span class="kd">class</span> <span class="nc">LegacyFragmentInjector</span><span class="p">(</span>
    <span class="k">private</span> <span class="kd">val</span> <span class="py">graphFactory</span><span class="p">:</span> <span class="nc">LegacyFragmentInjectorGraph</span><span class="p">.</span><span class="nc">Factory</span><span class="p">,</span>
<span class="p">)</span> <span class="p">:</span> <span class="nc">InstanceInjector</span><span class="p">&lt;</span><span class="nc">Fragment</span><span class="p">&gt;</span> <span class="p">{</span>
    <span class="k">override</span> <span class="k">fun</span> <span class="nf">inject</span><span class="p">(</span><span class="n">instance</span><span class="p">:</span> <span class="nc">Fragment</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">graphFactory</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span><span class="n">instance</span> <span class="k">as</span> <span class="nc">LegacyFragment</span><span class="p">).</span><span class="nf">inject</span><span class="p">(</span><span class="n">instance</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>… and voilà, another big chunk of code was done!
The rest was easier.
We took advantage of our existing KSP-powered code generation.
We had custom codegen for a lot of things, since we needed to generate boilerplate for Anvil (yes, we generate boilerplate for everything).
Changing the codegen was not hard, mostly imports and annotation names, and boom, another couple hundred cases were fixed!</p>

<div class="language-kotlin highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@ContributesFragment</span>
<span class="kd">class</span> <span class="nc">InfoFragment</span> <span class="nd">@Inject</span> <span class="k">constructor</span><span class="p">()</span> <span class="p">:</span> <span class="nc">Fragment</span><span class="p">()</span>

<span class="nd">@ContributesViewModel</span>
<span class="kd">class</span> <span class="nc">InfoViewModel</span> <span class="nd">@Inject</span> <span class="k">constructor</span><span class="p">()</span> <span class="p">:</span> <span class="nc">ViewModel</span><span class="p">()</span>
</code></pre></div></div>

<p>Not everything was so smooth.
We learned the hard way what it means to adopt a 0.x tool.
We found quite a few cases where the compiler crashed with a <code class="language-plaintext highlighter-rouge">StackOverflowError</code>, or the generated code was too slow.
We began with version 0.7.x and finished with 0.9.2 (0.9.3 was broken for us).
Most of the problems arose from MemberInjector, which we don’t recommend using.</p>

<p>Moreover, being a compiler plugin, Metro does not output much into build directories, unlike Anvil and Dagger.
At first, this made debugging a bit harder, but once we got accustomed to the <a href="https://zacsweers.github.io/metro/latest/debugging/">rich diagnostic reports</a>, which are hidden behind a Metro Gradle plugin property, debugging became much easier.</p>

<p>Another big problem was constant upstream changes.
A few dozen developers produced a lot of changes daily, which meant a 30-minute conflict-resolution ceremony each day.
We strategically chose to do the migration around the holiday season.</p>

<p>All things considered, we have zero regrets.
Of course, it was a bumpy ride, but a worthwhile one.
We learned so much about compilers, how to do mass migrations, and how to build good codegen.
Big thanks to Zac Sweers, who put a lot of effort into fixing problems in a timely manner, and we hope that our small contributions to Metro will help others have a smoother migration.</p>

<h2 id="the-results">The results</h2>

<p>Two months after migrating and solving all the issues, we were able to enable the juicy K2 and, a bit later, incremental compilation (a topic that deserves an article of its own).
The results look quite good for us.
Apart from getting rid of the existential dread caused by the fact that K1 support will be dropped sooner or later, we got some solid CI build time improvements!
For our large codebase, they look as follows:</p>

<table class="table table-bordered">
  <thead>
    <tr>
      <th>Build Scenario</th>
      <th>Metro</th>
      <th>Dagger/Anvil</th>
      <th>Reduction</th>
      <th>Savings</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Best Case build; most tasks are cached</td>
      <td>3m 23s</td>
      <td>4m 33s</td>
      <td>25.64%</td>
      <td>1m 10s</td>
    </tr>
    <tr>
      <td>Worst Case build; no tasks are cached</td>
      <td>24m 12s</td>
      <td>27m 05s</td>
      <td>10.65%</td>
      <td>2m 53s</td>
    </tr>
    <tr>
      <td>Worst Case Release build; no tasks are cached</td>
      <td>37m 43s</td>
      <td>40m 09s</td>
      <td>6.06%</td>
      <td>2m 26s</td>
    </tr>
    <tr>
      <td>ABI change in a core module that all feature modules depend on</td>
      <td>15m 46s</td>
      <td>17m 22s</td>
      <td>9.21%</td>
      <td>1m 36s</td>
    </tr>
  </tbody>
</table>
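
<p>For readers who want to check the numbers: the Reduction column is the saved time divided by the Dagger/Anvil time. A minimal Kotlin sketch of that arithmetic, where the helper names <code>parseSeconds</code> and <code>reductionPercent</code> are made up for this example:</p>

<pre><code class="language-kotlin">// Parses durations written like "4m 33s" into seconds.
fun parseSeconds(text: String): Int {
    val match = Regex("""(\d+)m (\d+)s""").matchEntire(text)
        ?: error("Unexpected duration format: $text")
    val (minutes, seconds) = match.destructured
    return minutes.toInt() * 60 + seconds.toInt()
}

// Reduction in percent, relative to the baseline (Dagger/Anvil) time.
fun reductionPercent(metro: String, baseline: String): Double {
    val saved = parseSeconds(baseline) - parseSeconds(metro)
    return 100.0 * saved / parseSeconds(baseline)
}

fun main() {
    // Best-case row: 273s - 203s = 70s saved, and 70 / 273 is about 25.64%.
    println("%.2f%%".format(reductionPercent("3m 23s", "4m 33s")))
}
</code></pre>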

<p><br />
These stats were recorded with Metro 0.9.x.
Metro continues to grow and improve, including the code it generates and therefore build times, so if we measured again with the latest version, the results would likely be even better!</p>

<p>Our local build times also improved greatly; incremental compilation is no joke! However, for the sake of brevity, we will not include them here.</p>

<h2 id="conclusion">Conclusion</h2>

<p>To sum it all up, Metro consolidates the best practices from other popular frameworks while leaving the not-so-good ones aside. It allowed us to enable K2 and immediately see significant build time improvements, and it unlocked incremental compilation, which should make our builds even faster.</p>

<p>The migration, even in a big codebase with plenty of lingering legacy code and without using any interoperability capabilities, was interesting and as challenging as one would expect in such circumstances.</p>

<p>Developer satisfaction and confidence around dependency injection have also increased with the arrival of Metro.
It’s easier to reason about one DI framework, rather than two, especially when this framework is kotlin-first and kotlin-centric.</p>]]></content><author><name>Andrius Semionovas</name><uri>https://github.com/neworld</uri></author><summary type="html"><![CDATA[Metro - modern and kotlin injection framework created by Zac Sweers. And we, Android developers at Vinted, officially and fully migrated to it! It was quite a bumpy ride for our huge codebase. Our story begins…]]></summary></entry><entry><title type="html">From Bash to Bliss: Scaling Vespa Operations with Temporal</title><link href="https://vinted.engineering//2026/01/21/from-bash-to-bliss-scaling-vespa-operations-with-temporal/" rel="alternate" type="text/html" title="From Bash to Bliss: Scaling Vespa Operations with Temporal" /><published>2026-01-21T00:00:00+00:00</published><updated>2026-01-21T00:00:00+00:00</updated><id>https://vinted.engineering//2026/01/21/from-bash-to-bliss-scaling-vespa-operations-with-temporal</id><content type="html" xml:base="https://vinted.engineering//2026/01/21/from-bash-to-bliss-scaling-vespa-operations-with-temporal/"><![CDATA[<h2 id="growing-platform---growing-maintenance">Growing platform - Growing maintenance</h2>

<p><strong>Keeping the Lights On (KTLO)</strong> is an essential, yet often taxing, part of a platform engineer’s role. It represents the routine operational work required to keep the business running and the platform stable. For our team, this primarily involves maintenance on our search engine, Vespa - ranging from version upgrades and service restarts to draining traffic from nodes for hardware replacements.
<!--truncate-->
In the O’Reilly book Platform Engineering, the authors recommend that <em>“KTLO work should account for no more than 40% of your team’s workload. Any more than that and you risk burning out your team”.</em> I couldn’t agree more. While necessary, KTLO tasks are often labor-intensive and repetitive rather than intellectually challenging.</p>

<p>As Vinted grows, our infrastructure must follow. We have transitioned from managing a hundred nodes to over a thousand, and without intervention, the KTLO “tax” accrues exponentially. We faced a binary choice: scale the team linearly by hiring or scale our efficiency by reducing the manual burden.</p>

<h2 id="the-scaling-wall-when-scripts-arent-enough">The Scaling Wall: When Scripts Aren’t Enough</h2>

<p>Our maintenance wasn’t fully manual; we relied on Bash scripts and <a href="https://www.chef.io/">Knife</a> commands. This was sufficient for a few dozen nodes, but as the Vespa search engine became our default solution for search problems, our node count exploded. We reached a tipping point: we were no longer managing a single deployment, but dozens of unique deployments with varying maintenance needs. Our existing tooling simply couldn’t keep up with this complexity.</p>

<p>As we hit this limit, the flaws in script-based automation became clear:</p>
<ul>
  <li><strong>Fragility</strong>: bash scripts are “fire and forget.” A network blip in the middle of an upgrade leaves an engineer to manually reconcile the cluster state.</li>
  <li><strong>Operational Toil</strong>: without native state management, scripts require “babysitting” to ensure completion.</li>
  <li><strong>Lack of Guardrails</strong>: scripts are often “blind.” We needed a system capable of checking node health and readiness before proceeding to the next node.</li>
</ul>

<p>To support Vinted’s growth, we pivoted from <strong>scripts</strong> to <strong>durable orchestration</strong> with three goals:</p>
<ol>
  <li><strong>Zero-Impact</strong>: transparent operations with automated health checks.</li>
  <li><strong>Autonomy</strong>: scheduled, hands-off upgrades.</li>
  <li><strong>Self-Service</strong>: guardrails that allow product teams to safely manage their own restarts.</li>
</ol>

<h2 id="temporal">Temporal</h2>

<p>To understand Temporal, you have to stop thinking about “running a script” and start thinking about <strong>“durable execution.”</strong>
In a traditional environment, when you run a script to restart a Vespa search engine node, the state lives in the memory of the process running that script. If your laptop closes, the CI/CD runner times out, or the network blips, that state is lost. You’re left wondering: Did the node upgrade?
Temporal changes this by acting as a <strong>fault-tolerant state machine</strong>. It records every successful step of your code in a backend database. If the execution is interrupted, Temporal simply spins it back up on a different worker and resumes from the last successful “event,” with all its variables and local state intact.</p>

<h3 id="why-temporal">Why Temporal?</h3>

<p>As our Vespa footprint grew to a thousand nodes, we could have built an event-driven system - where one service would emit a “Node Down” event and another service would react. However, we realized that <strong>events are often the wrong abstraction</strong> for complex maintenance.</p>

<p>Here is why we chose Temporal’s orchestration over traditional events or scripts:</p>
<ul>
  <li><strong>Orchestration over Choreography</strong>: in an event-driven “choreography,” it’s nearly impossible to see the “big picture” of an upgrade. With Temporal, the entire workflow - draining traffic, upgrading, and health-checking - is defined in a single block of code. We have a clear “manager” for the process rather than a dozen disconnected services “reacting” to each other.</li>
  <li><strong>The Code is the State</strong>: usually, to automate an upgrade, you’d need a database to track which nodes are <em>PENDING</em>, <em>UPGRADING</em>, or <em>FAILED</em>. Temporal removes this “toil.” The state is simply the current line of code being executed.</li>
  <li><strong>Built-in Reliability</strong>: in our old bash scripts, we didn’t have error handling or durability for that matter. Temporal provides these as primitives. If a Vespa API call fails, we don’t write a loop; we tell Temporal to “retry with exponential backoff,” and it handles the rest.</li>
</ul>

<p>We chose the <strong>Go SDK</strong> because it allows us to treat infrastructure-as-code in the truest sense.</p>
<ul>
  <li><strong>Workflows (The Brain)</strong>: we wrote a <em>VespaUpgradeWorkflow</em> in Go. It’s deterministic logic that orchestrates different operations like locking the chef client, bumping the version, restarting nodes and ensuring we never take down too many nodes at once.</li>
  <li><strong>Activities (The Muscles)</strong>: these are the individual Go functions that talk to different services and execute the steps for the procedures. Because activities are decoupled from the workflow, we can fail and retry an activity (like a slow node restart) without ever failing the overall upgrade process.</li>
</ul>

<p>By moving our KTLO work into Temporal, we transformed “babysitting scripts” into a <strong>self-healing platform operation</strong>.</p>
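
<p>To make “retry with exponential backoff” concrete, here is a minimal sketch of the idea in plain Kotlin. It deliberately does not use the Temporal SDK, and the function name <code>retryWithBackoff</code>, its defaults, and the flaky call are hypothetical. Unlike Temporal, it keeps its state in memory, so it only illustrates the retry policy, not durable execution.</p>

<pre><code class="language-kotlin">// Illustration only: plain Kotlin, not the Temporal SDK.
// Retries `action` up to `maxAttempts` times, doubling the delay after each failure.
fun &lt;T&gt; retryWithBackoff(
    maxAttempts: Int = 5,
    initialDelayMs: Long = 100,
    action: (attempt: Int) -&gt; T,
): T {
    require(maxAttempts &gt;= 1) { "maxAttempts must be at least 1" }
    var delayMs = initialDelayMs
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return action(attempt)
        } catch (e: Exception) {
            lastError = e
            if (attempt &lt; maxAttempts) {
                Thread.sleep(delayMs)
                delayMs *= 2
            }
        }
    }
    throw IllegalStateException("Gave up after $maxAttempts attempts", lastError)
}

fun main() {
    // Hypothetical flaky call that succeeds on the third attempt.
    val result = retryWithBackoff(initialDelayMs = 10) { attempt -&gt;
        if (attempt &lt; 3) throw RuntimeException("node not ready") else "restarted"
    }
    println(result)
}
</code></pre>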

<h2 id="a-platform-within-a-platform">A platform within a platform</h2>

<p>By leveraging Temporal, we automated far more than just routine maintenance. We extended our orchestration to include new cluster provisioning and other recurring operational tasks. Today, upgrades are scheduled automatically twice a month on weekdays. The system accounts for public holidays and traffic surges, ensuring we are online to respond if issues arise. We’ve integrated Slack for real-time progress reporting and the Temporal UI for deep-visibility monitoring, backed by a robust alerting suite for stalled or failed workflows.
This automation has transformed our daily operations:</p>
<ul>
  <li><strong>Self-Service</strong>: feature teams now use a Slack bot to trigger restarts independently, removing our team as a bottleneck.</li>
  <li><strong>Provisioning at Scale</strong>: as the demand for new nodes increased, we automated the entire provisioning lifecycle.</li>
  <li><strong>Reduced Toil</strong>: while hardware failures still occasionally require manual intervention, these are now outliers.</li>
</ul>

<p>What used to be the “brunt” of our on-duty backlog has effectively disappeared. We have essentially built a platform within our search platform. This shift has not only lowered our KTLO “tax” but has also allowed us to focus on higher-value engineering rather than the logistics of scale.</p>

<h4 id="resources">Resources</h4>

<ul>
  <li><a href="https://www.oreilly.com/library/view/platform-engineering/9781098153632/">O’Reilly Platform Engineering</a></li>
  <li><a href="https://temporal.io/">Temporal</a></li>
  <li><a href="https://www.chef.io/">Chef</a></li>
  <li><a href="http://vespa.ai">Vespa search engine</a></li>
  <li><a href="https://temporal.io/blog/events-are-the-wrong-abstraction-rethinking-distributed-systems">Events are the wrong abstraction: Rethinking distributed systems</a></li>
</ul>]]></content><author><name>Martynas Jakimčikas</name><uri>https://github.com/jakimcikas</uri></author><summary type="html"><![CDATA[Growing platform - Growing maintenance Keeping the Lights On (KTLO) is an essential, yet often taxing, part of a platform engineer’s role. It represents the routine operational work required to keep the business running and the platform stable. For our team, this primarily involves maintenance on our search engine, Vespa - ranging from version upgrades and service restarts to draining traffic from nodes for hardware replacements.]]></summary></entry><entry><title type="html">Building a Global, Event-Driven Platform: Our Ongoing Journey, Part 1</title><link href="https://vinted.engineering//2026/01/09/building-global-event-driven-platform-part-1/" rel="alternate" type="text/html" title="Building a Global, Event-Driven Platform: Our Ongoing Journey, Part 1" /><published>2026-01-09T00:00:00+00:00</published><updated>2026-01-09T00:00:00+00:00</updated><id>https://vinted.engineering//2026/01/09/building-global-event-driven-platform-part-1</id><content type="html" xml:base="https://vinted.engineering//2026/01/09/building-global-event-driven-platform-part-1/"><![CDATA[<p>A few years ago, our platform reached a point where the way we’d always built software simply wasn’t enough anymore. The monolith that powered our early success had served us well, but as the business expanded across the continent, it started showing real limits. Global growth forced us to confront problems we couldn’t ignore: latency across regions, unpredictable load patterns, and an architecture that didn’t match the scale of the company. We needed to rethink how the entire system worked, from the shape of our data to the boundaries between teams.</p>

<!--truncate-->

<figure style="text-align: center;">
  <img src="/static/2026/01/blog_post_ongoing_journey_part_1.png" alt="ongoing-journey" />
</figure>

<p>What follows is the story of that shift - and why the most interesting engineering challenges are still ahead of us.</p>

<h2 id="the-moment-growth-outpaced-the-monolith">The Moment Growth Outpaced the Monolith</h2>

<p>Many of us grew up professionally inside the monolith. Everything lived in one place. Data was easy to reach. Consistency was immediate. Debugging meant reading a single flow and knowing exactly where things went wrong.</p>

<p>In mid-2020, as our traffic increased to 150k requests per second at peak hours, that simplicity turned into a constraint. Some endpoints triggered hundreds of database queries. Others spanned dozens of logical databases. Latency between regions created unpredictable behavior. The entire platform lived inside one large failure domain, which meant any issue could cascade much further than it should.</p>

<p>The monolith didn’t just slow us down - it held us back from being truly global.</p>

<h2 id="discovering-the-shape-of-the-system">Discovering the Shape of the System</h2>

<p>Our first step wasn’t to break the monolith apart. It was to understand it. A few engineers introduced Domain-Driven Design as a way to map responsibilities and expose natural boundaries inside the application.
This quickly became more than a technical exercise. It gave teams clarity about what they owned. It highlighted places where responsibilities were tangled together. It made development faster simply because people weren’t stepping on each other’s toes anymore. Eventually, it guided how we restructured teams and how we planned the future of the architecture.</p>

<p>DDD didn’t give us all the answers, but it gave us the vocabulary to find them.</p>

<h2 id="moving-from-synchronous-calls-to-events-and-sagas">Moving from Synchronous Calls to Events and Sagas</h2>

<figure style="text-align: center;">
  <img src="/static/2026/01/blog_post_ongoing_journey_part_1_1.png" alt="ongoing-journey" />
</figure>

<p>It took us at least two years to understand the domains, as we identified almost 300 of them. But once we understood the domains, another problem became obvious: the entire system relied on synchronous communication. And while that worked fine in a single region, it didn’t survive real-world distributed conditions.</p>

<p>Every synchronous call added latency. Every tight integration increased fragility. And any workflow that needed data across regions suffered from unpredictable delays.</p>

<p>Shifting to business events changed that. Instead of expecting a remote service to respond in real time, services could publish state changes and let other domains react whenever they were ready.</p>

<p>For multi-step workflows, we are introducing Saga-style orchestration. Instead of trying to fake distributed transactions, we are embracing compensations, retries, and eventual completion. These ideas required new habits, new coding patterns, and a new mindset - but they let us operate reliably across geographic boundaries.</p>
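
<p>The core of the compensation idea can be sketched in a few lines of Kotlin. This is a toy, in-memory illustration with hypothetical names (<code>SagaStep</code>, <code>runSaga</code>, and the step names), not a description of our actual tooling: each step carries the action that undoes it, and a failure triggers compensations for the completed steps in reverse order.</p>

<pre><code class="language-kotlin">// A saga step pairs an action with the compensation that undoes it.
class SagaStep(val name: String, val action: () -&gt; Unit, val compensate: () -&gt; Unit)

// Runs steps in order; on failure, compensates completed steps in reverse order.
// Returns true if every step succeeded.
fun runSaga(steps: List&lt;SagaStep&gt;, log: MutableList&lt;String&gt;): Boolean {
    val completed = ArrayDeque&lt;SagaStep&gt;()
    for (step in steps) {
        try {
            step.action()
            log += "done:${step.name}"
            completed.addFirst(step) // most recent first
        } catch (e: Exception) {
            log += "failed:${step.name}"
            for (done in completed) {
                done.compensate()
                log += "compensated:${done.name}"
            }
            return false
        }
    }
    return true
}

fun main() {
    val log = mutableListOf&lt;String&gt;()
    val ok = runSaga(
        listOf(
            SagaStep("reserve-item", {}, {}),
            SagaStep("charge-payment", { throw IllegalStateException("declined") }, {}),
        ),
        log,
    )
    println("$ok $log")
}
</code></pre>

<p>A production saga would also persist its progress and retry transient failures before compensating; both are left out here to keep the sketch short.</p>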

<p>This was the moment the platform started to behave more like a distributed system and less like a stretched monolith.</p>

<p><em>In the next part, we will move from why the architecture had to change to how it operates at a global scale. We will look at the concrete decisions behind our multi-region model, why we centralized writes while distributing reads, and how events and projections make that possible. This is where the platform stops being just distributed in theory and starts delivering predictable performance worldwide.</em></p>]]></content><author><name>Dejan Menges</name><uri>https://github.com/dejo1307</uri></author><summary type="html"><![CDATA[A few years ago, our platform reached a point where the way we’d always built software simply wasn’t enough anymore. The monolith that powered our early success had served us well, but as the business expanded across the continent, it started showing real limits. Global growth forced us to confront problems we couldn’t ignore: latency across regions, unpredictable load patterns, and an architecture that didn’t match the scale of the company. We needed to rethink how the entire system worked, from the shape of our data to the boundaries between teams.]]></summary></entry><entry><title type="html">Building a Global, Event-Driven Platform: Our Ongoing Journey, Part 2</title><link href="https://vinted.engineering//2026/01/09/building-global-event-driven-platform-part-2/" rel="alternate" type="text/html" title="Building a Global, Event-Driven Platform: Our Ongoing Journey, Part 2" /><published>2026-01-09T00:00:00+00:00</published><updated>2026-01-09T00:00:00+00:00</updated><id>https://vinted.engineering//2026/01/09/building-global-event-driven-platform-part-2</id><content type="html" xml:base="https://vinted.engineering//2026/01/09/building-global-event-driven-platform-part-2/"><![CDATA[<p><em>In the first part, we described how growth pushed us beyond the limits of the monolith and forced us to rethink our architecture from the ground up. 
We explored how Domain-Driven Design helped us uncover clear boundaries, and how shifting from synchronous calls to events and sagas changed the way the system behaves under real distributed conditions. With that foundation in place, we can now focus on what it takes to run this platform reliably across continents.</em></p>

<!--truncate-->

<h2 id="designing-for-a-global-footprint">Designing for a Global Footprint</h2>

<p>Operating across continents forces you to rethink even the assumptions that once felt foundational. One of the biggest decisions we made was choosing not to shard our primary data models across regions. Instead, we embraced a model where all writes happen in the primary site and read-only projections are replicated around the world. This gives us a single source of truth for writes while still providing fast, local reads to users regardless of geography.</p>

<p>It wasn’t an easy choice. It meant accepting that global consistency would always lag by at least a few moments and that network behavior would play a real role in how fresh data appears in different regions. But it also avoided the complexity and operational cost of a fully sharded, multi-writer system - complexity that becomes hard to justify unless the business absolutely demands it.</p>

<p>What this approach gave us, though, was predictable behavior under load. We started building our projections to tolerate replication delays, survive partial failures, and recover automatically when regions fall behind. Most importantly, we are learning to evaluate every domain and feature through a new lens: how does its read path behave globally, how sensitive is it to freshness, and what happens when the network isn’t cooperating?</p>

<p>Today, this model allows us to serve hundreds of thousands of requests per second worldwide while keeping write logic centralized and robust. Eventual consistency is built into how the system works rather than being an edge case we try to hide. And that clarity is making our platform both more resilient and easier to evolve as we plan to expand to more regions.</p>

<h2 id="faster-reads-through-data-projections">Faster Reads Through Data Projections</h2>
<p>One of the biggest improvements came from separating how we write data from how we read it. Features like feeds, search, and listing pages need fast, region-local access. Depending on remote services simply isn’t an option once you operate across continents.</p>

<p>The answer was to build read-optimized data projections generated directly from our event streams. Each team could decide what their projection looked like, how it should be optimized, and where it should live geographically. This reduced cross-team dependencies and made performance far more predictable.</p>
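
<p>As a small illustration of what building a projection from an event stream involves, the Kotlin sketch below folds events into a read model that tolerates duplicates and out-of-order delivery. The event type <code>PriceChanged</code>, the per-listing version number, and the class names are assumptions made for this example, not our real schema.</p>

<pre><code class="language-kotlin">// Hypothetical event: a listing's price changed, with a per-listing version number.
data class PriceChanged(val listingId: String, val version: Int, val price: Int)

// Read-optimised projection: latest known price per listing.
class PriceProjection {
    private val rows = mutableMapOf&lt;String, PriceChanged&gt;()

    // Idempotent and order-tolerant: stale or duplicate events are ignored.
    fun handle(event: PriceChanged) {
        val current = rows[event.listingId]
        if (current == null || event.version &gt; current.version) {
            rows[event.listingId] = event
        }
    }

    fun priceOf(listingId: String): Int? = rows[listingId]?.price
}

fun main() {
    val projection = PriceProjection()
    // Events arrive out of order and one is duplicated.
    listOf(
        PriceChanged("a1", 2, 900),
        PriceChanged("a1", 1, 1000),
        PriceChanged("a1", 2, 900),
    ).forEach(projection::handle)
    println(projection.priceOf("a1"))
}
</code></pre>

<p>Comparing versions is one simple way to make a consumer safe to replay; other schemes, such as deduplicating by event ID, would serve the same purpose.</p>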

<figure style="text-align: center;">
  <img src="/static/2026/01/blog_post_ongoing_journey_part_2.png" alt="ongoing-journey" />
</figure>

<p>By mid-2026, projections should power many of our most visible features. They’re a key part of our strategy for low latency and global scale.</p>

<h2 id="the-cultural-shift-behind-the-architecture">The Cultural Shift Behind the Architecture</h2>

<p>None of this would have been possible if we hadn’t shifted how teams think about building software. The monolith encouraged a mindset where consistency was free, and data lived everywhere. Distributed systems demand the opposite. And we are talking about ~50 teams.</p>

<p>Teams learned to design for retries, compensations, idempotency, and partial failure. They had to build experiences that hold together even when some events arrive late or in a different order than expected. And perhaps most importantly, they had to take real ownership of domain behavior from end to end and not just code paths.</p>

<p>Conversations changed. Instead of debating individual endpoints, teams talk about flows, boundaries, event lifecycles, data freshness, and recovery. This shift has made our architecture stronger, and it has made our engineering culture stronger too.</p>

<h2 id="where-we-are-and-whats-next">Where We Are, and What’s Next</h2>

<p>We are now ready to run a hybrid architecture built around services, events, and globally replicated projections. The foundations are in place, but we’re very much in the middle of the journey. A significant part of our work today is focused on strengthening the platform itself: improving our async tooling, defining clear standards for how projections and consumers should be built, and making sure our infrastructure can sustain the traffic patterns we’re seeing — and the ones we know are coming.</p>

<p>We’re still refining the rules for how events flow through the system, how projections handle late or conflicting updates, and how consumers recover after interruptions. A lot of energy is going into making the development experience smoother: better local tooling, more predictable event schemas, cleaner testing patterns, and clearer guidelines for how domains should emit and react to events. At the same time, the infrastructure side is evolving to support larger volumes, faster replication, and better observability across regions.</p>

<p>There’s plenty left to do. Some domains still need to be extracted. Some projections need to be redesigned for scale. Some need to be designed from scratch. Our event propagation paths can get faster, and our recovery mechanisms can become more automated. The long-term goal is to reach a point where operating a distributed, event-driven system feels no more complicated to an engineer than working inside the monolith once did, but with all the resilience, clarity, and global performance benefits of the new world.</p>

<p>We’ve built the basic shape of the platform we want. Now we’re tuning it, scaling it, and making it something teams can rely on with full confidence as the company keeps growing.</p>

<h2 id="why-this-work-matters-and-why-you-might-want-to-join">Why This Work Matters, and Why You Might Want to Join</h2>

<p>If you’ve spent enough years in engineering, you can tell when a team is solving real problems versus rearranging abstractions. The work we’re doing sits firmly in the first category. We’re building systems that have to hold together across continents, under real traffic (we have already reached 300k requests per second, and it is growing steadily), in environments where eventual consistency, replication delays, and partial failures aren’t theoretical edge cases - they’re everyday constraints we have to design for.</p>

<p>You’d be joining a group of people who care deeply about getting the fundamentals right. The problems are complex in a way that rewards good engineering instincts: modeling domains cleanly, designing robust asynchronous flows, understanding how events propagate through a large system, and building projections that remain fast and correct under load. There’s room here for engineers who enjoy thinking holistically, who appreciate clarity in domain boundaries, and who like improving the systems that everyone else will depend on for years.</p>

<p>You’d have influence, not in the “we have a committee for that” sense, but in the way where well-reasoned ideas actually shape how the platform evolves. If you see a gap in our tooling, you can fix it. If you find a better pattern for consumers, you can drive its adoption. If you notice a weakness in our replication or event flows, you can help redesign them. This is the kind of environment where senior engineers don’t just write code - they leave fingerprints on the architecture.</p>

<p>And perhaps most importantly: we’re not done. The foundations are in place, but many of the hardest challenges are still ahead. We’re scaling across new markets, pushing more traffic through the system, and tightening the guarantees we provide while keeping the developer experience simple. If you’re the kind of engineer who enjoys working on systems that matter, who wants real ownership, and who’s motivated by building the kind of platform that other teams can stand on with confidence, we’d love to talk to you.</p>

<p>You could make a meaningful impact here - not someday, but immediately.</p>]]></content><author><name>Dejan Menges</name><uri>https://github.com/dejo1307</uri></author><summary type="html"><![CDATA[In the first part, we described how growth pushed us beyond the limits of the monolith and forced us to rethink our architecture from the ground up. We explored how Domain Driven Design helped us uncover clear boundaries, and how shifting from synchronous calls to events and sagas changed the way the system behaves under real distributed conditions. With that foundation in place, we can now focus on what it takes to run this platform reliably across continents.]]></summary></entry><entry><title type="html">Orchestrating Success</title><link href="https://vinted.engineering//2025/12/29/orchestrating-success/" rel="alternate" type="text/html" title="Orchestrating Success" /><published>2025-12-29T00:00:00+00:00</published><updated>2025-12-29T00:00:00+00:00</updated><id>https://vinted.engineering//2025/12/29/orchestrating-success</id><content type="html" xml:base="https://vinted.engineering//2025/12/29/orchestrating-success/"><![CDATA[<p><strong>TL;DR</strong>: How Vinted standardizes large-scale decentralized data pipelines.</p>

<!--truncate-->

<p>When we started migrating Vinted’s data infrastructure to the cloud, we set out to create a decentralized way of working. The idea was simple: teams know their data best, so they should be fully empowered to build, own, and operate their pipelines without a central platform team getting in the way.</p>

<p>In that early phase, this worked reasonably well. Teams were moving fast, experimenting, and shipping value. They orchestrated their pipelines independently, inside their own domain. But as the platform grew, reality caught up with us: handling dependencies between decentralized teams requires a sophisticated solution.</p>

<p>In practice, teams were constantly using each other’s data assets. A marketing model would rely on product events; a finance report depended on operational data; a machine learning feature set pulled from three different domains. The business logic was naturally cross‑cutting, but our orchestration model pretended that domains were islands.
This led to a subtle but very real problem: <b>coordination moved from code into endless meetings</b>.</p>

<p>We were left with a pretty hefty task: how do we make sure that these domains fit together naturally, so as to complete Vinted’s puzzle of data pipeline orchestration?</p>

<figure style="text-align: center;">
  <img src="/static/2025/12/orchestration-puzzle.png" alt="orchestration-puzzle" />
</figure>

<h2 id="the-dark-side-of-decentralization">The Dark Side of Decentralization</h2>
<p>The goal of our decentralized setup was to let teams work autonomously, without constantly leaning on a central data platform. They got their own infrastructure in GCP, their own <a href="https://github.com/dbt-labs/dbt-core">dbt</a> project, and were expected to run their own pipelines on an <a href="https://github.com/apache/airflow">Airflow</a> instance provided by the data platform team.</p>

<p>At the same time, we didn’t have Airflow experts scattered across the organization, and we didn’t want to create that requirement. Asking every team to hand‑craft DAGs and become fluent in Airflow would distract them from doing what they do best: creating impactful data models that positively influence Vinted.</p>

<p>In fact, the “classic” way to run dbt with Airflow leans into that idea: keep Airflow simple and let dbt handle the complexity. You schedule a small number of tasks, often just a <code class="language-plaintext highlighter-rouge">dbt run</code> executed as a Bash command, and dbt resolves the full dependency graph internally. Airflow doesn’t try to mirror dbt’s model-level lineage; it just triggers the job and reports whether it succeeded.</p>

<p>We followed the same philosophy, but adapted it to our scale and constraints. A single end-to-end execution was simple, but it didn’t fit our cost profile: if something failed late in the run, the easiest recovery was often to rerun the whole job, which meant recomputing (and paying for) a lot of already-finished work, including some very large tables. To keep retries cheaper and failures more contained, we split execution into layers: one Airflow task per dbt “layer”. Airflow would call dbt with “run the staging layer,” “run the fact layer,” “run the mart layer,” and dbt would take care of the rest. Inside the job, dbt figured out the dependency graph within that layer and executed the models in the right order.</p>

<p>This kept things approachable, but it had sharp edges. If an unrelated staging model broke, the entire staging layer task would fail and everything downstream would be blocked. The figure shows an example:</p>

<figure style="text-align: center;">
  <img src="/static/2025/12/task-execution.png" alt="task-execution" />
</figure>

<p>The dbt lineage shows that <code class="language-plaintext highlighter-rouge">mrt_items</code> should complete without problems even if <code class="language-plaintext highlighter-rouge">int_orders</code> fails. However, because Airflow doesn’t have this granularity, it never even starts the jobs downstream of the <code class="language-plaintext highlighter-rouge">intermediate</code> layer. Data that didn’t actually depend on the broken model still arrived late. This was only the beginning of our troubles.</p>
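<p>The difference can be illustrated with a small simulation. The lineage below is a hypothetical one chosen to resemble the example (the real graph is larger), and the layer is assumed to be encoded in the model name prefix:</p>

```python
# Hypothetical lineage: model -> upstream models it depends on.
LINEAGE = {
    "stg_items": [],
    "stg_orders": [],
    "int_items": ["stg_items"],
    "int_orders": ["stg_orders"],
    "mrt_items": ["int_items"],
    "mrt_orders": ["int_orders"],
}
LAYERS = ["stg", "int", "mrt"]  # execution order of the dbt layers


def layer_of(model: str) -> str:
    """Assumption for this sketch: the name prefix identifies the layer."""
    return model.split("_", 1)[0]


def blocked_per_layer(failed: str) -> set:
    """One task per layer: every later layer is skipped after a failure."""
    later = LAYERS[LAYERS.index(layer_of(failed)) + 1:]
    return {m for m in LINEAGE if layer_of(m) in later}


def blocked_per_model(failed: str) -> set:
    """One task per model: only true descendants of the failure are skipped."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for model, parents in LINEAGE.items():
            hit = any(p == failed or p in blocked for p in parents)
            if model not in blocked and hit:
                blocked.add(model)
                changed = True
    return blocked


print(sorted(blocked_per_layer("int_orders")))  # the whole mart layer never starts
print(sorted(blocked_per_model("int_orders")))  # only the orders mart is held back
```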

<p>This “dbt handles the graph, Airflow runs the job” approach works extremely well at small scale. However, once you’re dealing with thousands of models spread across ~20 teams, the lack of model-level transparency becomes a real operational problem, especially when dependencies cross team boundaries. When something breaks, it’s no longer obvious what’s actually blocking what, which missing dependency caused the failure, or who owns the upstream piece that needs fixing. In a decentralized setup, that ambiguity is expensive: tracing issues turns into detective work, and responsibility becomes harder to pinpoint. The same lack of visibility forces teams to wait for an entire upstream pipeline to finish, because they can’t reliably tell when the specific piece they depend on is actually done. You’d hear questions like:</p>

<ul>
  <li>“At what time can I assume your daily job is finished?”</li>
  <li>“If your pipeline fails, how will I know?”</li>
  <li>“Can we align our schedules so my pipeline doesn’t start too late?”</li>
</ul>

<figure style="text-align: center;">
  <img src="/static/2025/12/isolated-domain-dependencies.png" alt="isolated-domain-dependencies" />
</figure>

<p>We had successfully decentralized ownership, but we had accidentally introduced fragility in the hand‑offs between teams.</p>

<h2 id="the-rise-of-our-dag-generator">The Rise of our DAG Generator</h2>
<p>We believe decentralized teams are what we need to scale, as Vinted grows. So we needed a way to remove the cognitive load of “orchestration trivia” from domain teams, especially around cross‑domain dependencies.</p>

<p>The key design goal was this: <i>Let teams think in terms of data models and lineage, not in terms of pipeline scheduling and cross‑pipeline wiring</i>.</p>

<p>To get there, we focused on two things:</p>

<ul>
  <li>Abstracting away pipeline creation, so engineers didn’t need to hand‑craft DAGs and dependency chains</li>
  <li>Standardizing the way dependencies interact, so relationships between data assets were expressed and enforced consistently</li>
</ul>

<p>We already had the perfect source of truth for dependencies: the dbt manifest. It knows which model depends on which, how data flows through the domain, and where the boundaries between sources and transforms lie.</p>

<p>So we built a DAG generator that:</p>

<ol>
  <li>Reads the dbt manifest</li>
  <li>Understands the full lineage within a domain</li>
  <li>Unfolds that lineage into a task‑per‑model setup in Airflow</li>
</ol>
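<p>The three steps above can be sketched as follows. This is a simplified illustration, not our generator: it relies only on the <code class="language-plaintext highlighter-rouge">nodes</code>, <code class="language-plaintext highlighter-rouge">resource_type</code> and <code class="language-plaintext highlighter-rouge">depends_on.nodes</code> keys of dbt’s <code class="language-plaintext highlighter-rouge">manifest.json</code>, the <code class="language-plaintext highlighter-rouge">shop</code> project is made up, and the final mapping of each entry to an Airflow task is left out.</p>

```python
import json

# A trimmed-down stand-in for dbt's manifest.json (real manifests have many more keys).
MANIFEST_JSON = """
{
  "nodes": {
    "model.shop.stg_orders": {"resource_type": "model", "name": "stg_orders",
                              "depends_on": {"nodes": ["source.shop.raw.orders"]}},
    "model.shop.int_orders": {"resource_type": "model", "name": "int_orders",
                              "depends_on": {"nodes": ["model.shop.stg_orders"]}},
    "model.shop.mrt_orders": {"resource_type": "model", "name": "mrt_orders",
                              "depends_on": {"nodes": ["model.shop.int_orders"]}},
    "test.shop.not_null_orders": {"resource_type": "test", "name": "not_null_orders",
                                  "depends_on": {"nodes": ["model.shop.stg_orders"]}}
  }
}
"""


def task_graph(manifest: dict) -> dict:
    """Return {task_id: [upstream task_ids]} with one task per dbt model."""
    models = {uid: node for uid, node in manifest["nodes"].items()
              if node["resource_type"] == "model"}
    graph = {}
    for uid, node in models.items():
        upstream = [models[dep]["name"]
                    for dep in node["depends_on"]["nodes"]
                    if dep in models]  # sources and tests do not become tasks here
        graph[node["name"]] = upstream
    return graph


graph = task_graph(json.loads(MANIFEST_JSON))
for task, upstream in graph.items():
    print(task, "runs after", upstream or "nothing")
```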

<p>By unfolding the lineage into a task‑per‑model structure, we gained granularity and flexibility. Suddenly, we weren’t just running “the staging layer”, we were running concrete, addressable units of work that mapped directly to dbt models. That opened the door to do something much more powerful across domains.</p>

<h2 id="decentralized-domains-centralized-dependencies">Decentralized Domains, Centralized Dependencies</h2>
<p>With task‑per‑model pipelines in place, the next step was to actually wire domains together. Practically, that meant setting up sensors that could wait on upstream work in other domains and only move forward when the right data was ready. Conceptually, the problem is simple (“don’t start this until that has finished”), but at platform scale the implementation details matter: who do you wait on, how do you express that, and how do you keep those relationships from turning into spaghetti?</p>

<figure style="text-align: center;">
  <img src="/static/2025/12/connected-domain-dependencies.png" alt="connected-domain-dependencies" />
</figure>

<p>Airflow already has an opinionated way to model cross-DAG dependencies: <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/asset-scheduling.html">Airflow Assets</a>. They’re event-driven, first-class citizens, and looked like the perfect fit for connecting domains without tight scheduling coordination.</p>

<p>Unfortunately, we found ourselves running into a hard limitation quite early on: Airflow Assets operate at the DAG level. An Airflow Asset update can trigger an entire downstream DAG, but we needed something more precise. Our pipelines are owned end-to-end by domains, and we wanted to keep that ownership boundary intact: upstream domains shouldn’t be “starting” other teams’ pipelines, and downstream domains shouldn’t have to understand (or care) how upstream work is split across DAGs or tasks. What we needed was task-level unblocking inside larger pipelines: resume this specific unit of work as soon as that specific upstream unit is ready.</p>

<p>We found a more fitting candidate in the <code class="language-plaintext highlighter-rouge">ExternalTaskSensor</code>. It lets a task in one pipeline wait for the completion of a specific task in another DAG, exactly the fine-grained dependency we were after. However, this came with two obvious downsides. First, if teams wired sensors by hand, we’d end up with a fragile web of hard-coded references that’s difficult to validate, painful to refactor, and easy to break silently. Second, the mechanism is polling-based and timeout-driven, and in real life upstream tasks sometimes finish after a downstream sensor has already timed out, turning “just rerun it” into an operational habit.</p>

<p>So we set out to enrich this candidate to ensure we solve both downsides. To achieve this, we built an Asset Registry: a central catalog of all tasks and their relationships. It knows which domain, pipeline, and dbt model each task belongs to, and how tasks depend on each other across domains. We use it in CI/CD to validate that upstream references are valid and to attach metadata like “when should my data be available?” and “which task should I poll for completion?”. This metadata is collected automatically, as it is already available in the dbt manifests.</p>
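<p>To make the idea concrete, here is a hedged sketch of what such a registry check could look like. The <code class="language-plaintext highlighter-rouge">Asset</code> fields, the domain names and the DAG ids are all hypothetical; the point is only that references are resolved centrally and validated before deployment.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Hypothetical registry entry for one dbt model."""
    domain: str
    dag_id: str
    task_id: str
    upstream: tuple = ()  # names of the models this asset depends on


REGISTRY = {
    "product.mrt_items": Asset("product", "product_daily", "mrt_items"),
    "finance.mrt_revenue": Asset("finance", "finance_daily", "mrt_revenue",
                                 upstream=("product.mrt_items", "ops.mrt_shipments")),
}


def validate(registry: dict) -> list:
    """Return human-readable errors for references to unknown upstream assets."""
    errors = []
    for name, asset in registry.items():
        for ref in asset.upstream:
            if ref not in registry:
                errors.append(f"{name} depends on unknown asset {ref}")
    return errors


def sensor_target(registry: dict, ref: str) -> tuple:
    """Resolve a model reference to the (dag_id, task_id) a sensor would poll."""
    asset = registry[ref]
    return asset.dag_id, asset.task_id


errors = validate(REGISTRY)
print(errors)  # a CI job would fail the build when this list is not empty
print(sensor_target(REGISTRY, "product.mrt_items"))
```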

<figure style="text-align: center;">
  <img src="/static/2025/12/asset-registry.png" alt="asset-registry" />
</figure>

<p>For engineers, this means they don’t wire pipelines together directly. They simply say “this model depends on that model,” and the combination of DAG generator and Asset Registry turns that into concrete task‑level dependencies, distributed amongst decentralized data pipelines, using <code class="language-plaintext highlighter-rouge">ExternalTaskSensor</code> behind the scenes. This effectively solved the wiring problem. One down, one to go.</p>

<figure style="text-align: center;">
  <img src="/static/2025/12/sensor-dag-dependency.png" alt="sensor-dag-dependency" />
</figure>

<p>To solve the timeout problem, we use the registry too. When an upstream task completes, even if it’s late, we look up all downstream sensors that depend on it (including the ones that have already timed out) and mark them as satisfied via a completion callback. Downstream pipelines then continue automatically, without manual restarts.</p>
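<p>A rough sketch of that reconciliation logic, with Airflow itself left out: the <code class="language-plaintext highlighter-rouge">SensorBook</code> class and its state names are assumptions made for illustration, and in practice the lookup would go through the registry and Airflow’s own task state.</p>

```python
from collections import defaultdict


class SensorBook:
    """Hypothetical bookkeeping of which sensors wait on which upstream task."""

    def __init__(self):
        self.waiting = defaultdict(list)  # upstream task -> downstream sensor ids
        self.state = {}                   # sensor id -> waiting | timed_out | satisfied

    def register(self, sensor_id: str, upstream_task: str) -> None:
        self.waiting[upstream_task].append(sensor_id)
        self.state[sensor_id] = "waiting"

    def time_out(self, sensor_id: str) -> None:
        self.state[sensor_id] = "timed_out"

    def on_upstream_success(self, upstream_task: str) -> list:
        """Completion callback: satisfy every dependent sensor, timed out or not."""
        resumed = []
        for sensor_id in self.waiting.get(upstream_task, []):
            if self.state[sensor_id] != "satisfied":
                self.state[sensor_id] = "satisfied"
                resumed.append(sensor_id)
        return resumed


book = SensorBook()
book.register("finance.wait_for_mrt_items", "product.mrt_items")
book.register("marketing.wait_for_mrt_items", "product.mrt_items")
book.time_out("finance.wait_for_mrt_items")  # upstream is running late
resumed = book.on_upstream_success("product.mrt_items")
print(resumed)
```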

<p>From an engineer’s perspective, this complexity is invisible. They don’t restart stuck runs, chase timing mismatches between teams, or track who depends on what in their heads. The platform reconciles dependencies as tasks complete and makes the behavior transparent and deterministic: you can always ask the registry who depends on a task and why something did or didn’t run.</p>

<h2 id="turning-decentralized-data-modelling-into-a-breeze">Turning Decentralized Data Modelling into a Breeze</h2>
<p>Not only does our approach solve the dependency issues, it also sheds light on a complex and decentralized data landscape. In an intricate web of domain dependencies, it can become tricky to understand who depends on which data assets you own. This created a risky environment for our engineers to make changes to assets they own. Upon introducing breaking changes, like altering the schema, they were tasked with finding out which domains were using this asset. Often this resulted in many back-and-forths and meetings that could otherwise have been avoided.</p>

<p>Our Asset Registry lets CI/CD detect which models a change touches and which teams depend on them. We collect these cases and post them in the body of the PR the engineer is working on. By including each dependent team’s Slack channel, we make it obvious who to reach out to.</p>
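<p>As an illustration only, a report of this kind could be assembled along the following lines. The asset names, teams and Slack channels are invented, and this sketch lists direct consumers only.</p>

```python
# Hypothetical registry extract: model -> (owning team, Slack channel, upstream models).
ASSETS = {
    "product.mrt_items": ("product", "#product-data", []),
    "finance.mrt_revenue": ("finance", "#finance-data", ["product.mrt_items"]),
    "marketing.mrt_campaigns": ("marketing", "#marketing-data", ["product.mrt_items"]),
    "marketing.mrt_reach": ("marketing", "#marketing-data", ["marketing.mrt_campaigns"]),
}


def impact_report(changed_models: list) -> str:
    """Build a PR comment listing other teams that directly consume the changed models."""
    lines = []
    for changed in changed_models:
        owner = ASSETS[changed][0]
        for model, (team, channel, upstream) in sorted(ASSETS.items()):
            if changed in upstream and team != owner:
                lines.append(f"- {changed} is used by {model} ({team}, {channel})")
    return "\n".join(lines) or "No cross-domain consumers found."


report = impact_report(["product.mrt_items"])
print(report)
```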

<figure style="text-align: center;">
  <img src="/static/2025/12/scheduler-report.png" alt="scheduler-report" />
</figure>

<h2 id="why-everyone-needs-a-dag-generator">Why Everyone Needs a DAG Generator</h2>
<p>A standardized DAG generator has become one of the most valuable pieces of our platform. Because every pipeline is created through this generator, we effectively hide DAG authoring from users and constrain them to a small, curated set of building blocks. Under the hood, those map to a limited number of Airflow operators and patterns we control, which means we only need to test and maintain a narrow surface area instead of a zoo of custom DAGs.</p>

<p>The trade-off is that Airflow has a huge ecosystem of operators and built-in features, and our generator only exposes a small subset of them. Sometimes that means we can’t use a capability straight out of the box, or we have to reimplement parts of it inside the generator to keep the interface consistent. Still, the leverage we get from standardization is worth it.</p>

<p>The interface for users stays stable: they describe models and dependencies in the same way, regardless of what’s happening underneath. This gives us freedom to change the generator’s output when we need to. If we want to tweak retries, swap an operator, or adopt a new Airflow feature, we update the generator and regenerate the DAGs. Teams don’t have to manually configure anything in their pipelines.</p>

<p>This approach really paid off when we upgraded to Airflow 3. We adapted the generated DAG structure and operators, rolled out the new generator, and were done. For engineers, the migration was almost invisible; for us, it was a controlled platform change instead of a manual cleanup of dozens of hand‑written DAGs.</p>

<p>And for most of our engineers, that’s exactly how it should be: they think in terms of data, while the platform quietly does its job in the background.</p>

<h2 id="appendix">Appendix</h2>
<p>We had the privilege of presenting this solution in more detail at the Airflow Summit in Seattle, PyData Amsterdam and Astronomer’s The Data Flowcast. Please find the links here if the above piqued your interest!</p>

<ul>
  <li><a href="https://www.youtube.com/watch?v=9YAVD3kwU58">PyData Amsterdam</a></li>
  <li><a href="https://www.youtube.com/watch?v=YU-4My_dneM">Airflow Summit</a></li>
  <li><a href="https://www.youtube.com/watch?v=325Rjx3pWr0">Astronomer’s The Data Flowcast</a></li>
</ul>]]></content><author><name>Oscar Ligthart</name><uri>https://github.com/OscarLigthart</uri></author><summary type="html"><![CDATA[TL;DR: How Vinted standardizes large-scale decentralized data pipelines.]]></summary></entry><entry><title type="html">Dense Retrieval</title><link href="https://vinted.engineering//2025/11/18/dense-retrieval/" rel="alternate" type="text/html" title="Dense Retrieval" /><published>2025-11-18T00:00:00+00:00</published><updated>2025-11-18T00:00:00+00:00</updated><id>https://vinted.engineering//2025/11/18/dense-retrieval</id><content type="html" xml:base="https://vinted.engineering//2025/11/18/dense-retrieval/"><![CDATA[<p><strong>TL;DR</strong>: integrating embedding-based retrieval into the e-commerce search application is a significant undertaking.</p>

<!--truncate-->

<h2 id="introduction">Introduction</h2>

<p>The low recall of keyword-based search is a challenge when the content is both highly visual and multilingual. Returning “no results” even when there are relevant items is a missed business opportunity. Our initial experiments with dense embedding-based retrieval, aimed at this problem, can be traced back to internal hackathons as early as 2022.</p>

<p>To prove the business value and to maximise learning, we started by filling low-recall search sessions with a few dense retrieval matches. A small increase in recall translated into improved search metrics, which gave us more confidence in the technique.</p>

<p>Attempts to include dense retrieval matches in all search sessions started in the spring of 2024. Once the technical approach and the business value were proven in one market with one dominant language, it was time to work on scaling the solution.</p>

<p>Some 50 AB tests later, the dense retrieval is fully enabled. It required numerous model improvements and a bunch of engineering wizardry which are covered in the remainder of this post.</p>

<h2 id="ml-model">ML Model</h2>

<p>At its core, our system is a <strong>Two-Tower Model</strong>:</p>

<ul>
  <li><strong>The Query Tower:</strong> Takes a user’s search query (e.g. “red summer dress”) and encodes it into a 256-dimension vector (a.k.a. “query embedding”).</li>
  <li><strong>The Item Tower:</strong> Takes an item from our catalog and encodes all its features into a vector (a.k.a. “item embedding”) of the same 256 dimensions.</li>
</ul>

<h3 id="the-query-tower">The Query Tower</h3>

<p>The frozen pre-trained <strong>multilingual <a href="https://huggingface.co/docs/transformers/en/model_doc/clip">CLIP</a> model</strong> is our base for further fine-tuning. We train a “projection head” with <strong>GELU</strong> activations and <strong>LayerNorm</strong>. That not only complements CLIP’s general-purpose knowledge with our search domain specifics, but also makes training fast and efficient.</p>

<h3 id="the-item-tower">The Item Tower</h3>

<p>An item’s representation is a fusion of signals, including its image and metadata (brand, color, category, price, etc.):</p>

<ul>
  <li><strong>Categorical Features:</strong> Embeddings from data like brand ids, category ids, etc., each projected to a 256-dimensional space</li>
  <li><strong>Visual Features:</strong> Pre-calculated CLIP embeddings from our primary product images (512 dimensions input projected to 256 dimensions).</li>
</ul>

<p>These vectors are concatenated and passed through a final fusion layer. This allows the model to learn the complex interactions between all features (e.g., how a specific brand relates to a specific image style).</p>

<p>Including the textual item information in the model turned out to be challenging, but it remains a promising direction for future improvements; a much larger training dataset or different incorporation techniques may be worth a try.</p>

<h3 id="the-training-recipe">The Training Recipe</h3>

<p>The two towers are trained together using contrastive learning. The model learns to pull positive (relevant) query-item pairs together while pushing negative (irrelevant) pairs apart. For every (query, positive_item) pair, we force the model to distinguish it from 7,000–10,000 random negative items.</p>
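<p>For intuition, the loss for a single query can be written as a softmax cross-entropy in which the positive item is the correct class. The sketch below is plain Python rather than our training code: it uses toy 3-dimensional vectors, two negatives instead of thousands, and a fixed temperature of 0.07 (an assumed value; in training the temperature is learnable).</p>

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(query, positive, negatives, temperature=0.07):
    """Softmax cross-entropy with the positive item as the correct class."""
    q = normalize(query)
    logits = [dot(q, normalize(item)) / temperature
              for item in [positive] + negatives]
    max_logit = max(logits)  # subtract the maximum for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


query = [1.0, 0.0, 0.0]
close_item = [0.9, 0.1, 0.0]   # points almost the same way as the query
far_item = [0.0, 1.0, 0.0]     # orthogonal to the query
negatives = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

good = contrastive_loss(query, close_item, negatives)
bad = contrastive_loss(query, far_item, negatives)
print(good < bad)
```

<p>A lower temperature sharpens the softmax, so small similarity differences between the positive and the negatives are penalised more strongly.</p>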

<p>To get this training to converge, we use a recipe of the following practices:</p>

<ul>
  <li>A <strong>learnable temperature</strong> parameter that self-tunes the “difficulty” of the loss.</li>
  <li><strong>Mixed-precision (FP16) training</strong> for a speed-up.</li>
  <li>The <strong>AdamW optimizer</strong> with separate weight decay.</li>
  <li>A <strong>Cosine Annealing</strong> learning rate scheduler for stable convergence.</li>
</ul>

<p>With the same training recipe, we saw model improvements from scaling the dataset by 10x, to more than 100 million positive pairs.</p>

<p>For deployment, converting the item model to <a href="https://onnx.ai/">ONNX</a> introduced complexities in handling categorical feature preprocessing. Our solution is to integrate the preprocessing logic directly into the ONNX graph by using a masking technique to filter and manage out-of-vocabulary (OOV) inputs.</p>

<p>Removing irrelevant nearest neighbors requires picking an arbitrary similarity threshold. This not only requires manual tuning combined with further adjustments through AB testing, but also mandates readjustment after every model retraining. Furthermore, the threshold could be adjusted per query based on an estimated recall value at query time.</p>

<h2 id="system-architecture">System architecture</h2>

<p>The entire implementation is within a Vespa application package.</p>
<figure style="text-align: center;">
  <img src="/static/2025/11/deployment-diagram.png" alt="deployment-diagram" />
</figure>

<p><a href="https://github.com/dainiusjocas/notes/issues/9#issuecomment-3531634112">In the high-level architecture diagram</a> above, we see that within the Vespa stateless layer the query and feeding containers are physically separated. The separation allows adding resources to each individually and on demand. Query cluster nodes run the query model configured as a standard <a href="https://docs.vespa.ai/en/embedding.html#huggingface-embedder">Vespa embedder</a>. The feed cluster nodes host the item tower, which is invoked from a custom <a href="https://docs.vespa.ai/en/document-processing.html">Vespa document processor</a> through a <a href="https://docs.vespa.ai/en/stateless-model-evaluation.html#model-inference-using-java">model evaluator</a>; the result is added to the document update operation. Both query and feed nodes communicate with the same content cluster. Each content cluster node holds an <a href="https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world">HNSW</a> index over its portion of the dataset. All model files and configuration are managed within a single Vespa <a href="https://docs.vespa.ai/en/applications.html">application package</a>.</p>

<p>It is also worth mentioning that the initial calculation of image embeddings, for only the primary photo of every item, took about 1 month. The image embeddings take up around 1 TB of memory (10^9 items * 512 dimensions * 2 bytes per dimension). The item embeddings weigh half as much.</p>
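<p>The arithmetic behind these estimates is easy to reproduce. The sketch assumes, as the figures above imply, that item embeddings are stored with the same 2 bytes per dimension as image embeddings; index overhead is not included.</p>

```python
ITEMS = 10**9        # catalog size used in the estimate
BYTES_PER_DIM = 2    # 16-bit values

image_bytes = ITEMS * 512 * BYTES_PER_DIM  # CLIP image embeddings, 512 dimensions
item_bytes = ITEMS * 256 * BYTES_PER_DIM   # item tower output, 256 dimensions

print(image_bytes / 1e12, "TB for image embeddings")
print(item_bytes / 1e12, "TB for item embeddings")
```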

<h2 id="performance">Performance</h2>

<p>Combining the fast HNSW search with filtering is a major performance challenge. This combination requires extensive optimization, even in a high-performance system like Vespa, to stay within a tight latency budget. Below we list such optimizations roughly sorted by complexity.</p>

<h3 id="add-nodes-to-content-groups">Add nodes to content groups</h3>

<p>The worst-case scenario for ANN search has high tail latencies because a lot of vector comparisons are needed. One simple approach to lower the latency is to make HNSW smaller by splitting the data into more content nodes and executing more searches in parallel. Currently, we have 30 nodes per content group.</p>

<h3 id="split-the-index">Split the index</h3>

<p>Some Vinted markets are connected, e.g. members in Spain can see French items. Markets that are transitively connected form a logical bloc. We leveraged that to create smaller HNSW indices per bloc. This was achieved by creating 3 separate Vespa indices that are deployed to the same content cluster.</p>

<p>This trick not only made HNSW searches more manageable, but as a side effect, it also sped up all search requests. However, the indexing and querying now need to be routed to the correct indices.</p>
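<p>Finding blocs amounts to computing connected components of the market graph. Below is a small union-find sketch of that step together with the routing lookup. Apart from the Spain and France pair mentioned above, the market codes (<code class="language-plaintext highlighter-rouge">X1</code> to <code class="language-plaintext highlighter-rouge">X4</code>) and the index names are placeholders, not our real configuration.</p>

```python
def market_blocs(connections):
    """Group markets into blocs: connected components of the visibility graph."""
    parent = {}

    def find(market):
        parent.setdefault(market, market)
        while parent[market] != market:
            parent[market] = parent[parent[market]]  # path halving
            market = parent[market]
        return market

    for a, b in connections:
        parent[find(a)] = find(b)  # merge the two components

    blocs = {}
    for market in parent:
        blocs.setdefault(find(market), set()).add(market)
    return sorted(sorted(bloc) for bloc in blocs.values())


# Placeholder pairs; a market paired with itself is connected to no other market.
connections = [("ES", "FR"), ("FR", "X1"), ("X2", "X3"), ("X4", "X4")]
blocs = market_blocs(connections)

# Routing table used by both feeding and querying to pick the right index.
index_for_market = {market: f"items_bloc_{i}"
                    for i, bloc in enumerate(blocs)
                    for market in bloc}
print(blocs)
print(index_for_market["ES"] == index_for_market["X1"])
```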

<h3 id="tune-thresholds">Tune thresholds</h3>

<p>Vespa dynamically decides which <a href="https://blog.vespa.ai/tweaking-ann-parameters/">nearest neighbor search strategy</a> to apply based on the estimated hit ratio. Thresholds can be tuned to balance latency vs. resource utilization. During benchmarking, we discovered that the sweet spot was to set <code class="language-plaintext highlighter-rouge">ranking.matching.approximateThreshold</code> to a value that translates to ~1M documents per index per content node, with 8 <a href="https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.matching.numthreadspersearch">threads per search request</a>.</p>

<figure style="text-align: center;">
  <img src="/static/2025/11/exact-vs-approximate-nearest-neighbor.png" alt="exact-vs-approximate" />
</figure>

<p>Note that there is no perfect threshold.</p>

<h3 id="retry-strategy">Retry strategy</h3>

<p>Search requests have a tight latency budget of 500 ms. Even with the tuned thresholds, we’d still get some timeouts due to how the approximate nearest neighbor search is executed. The worst part is that such failures end in zero-result sessions. Knowing that the exact nearest neighbor search doesn’t have such a failure mode, we implemented a retry strategy:</p>

<ul>
  <li>Divide latency budget into 2 parts: e.g. 350 ms and 150 ms</li>
  <li>First, prefer executing the approximate nearest neighbor search with a timeout of 350 ms</li>
  <li>If the first request fails, then force the exact nearest neighbor search by setting <code class="language-plaintext highlighter-rouge">ranking.matching.approximateThreshold=1</code> with a timeout of 150 ms.</li>
</ul>
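<p>In code, the strategy could look roughly like this. The <code class="language-plaintext highlighter-rouge">run_query</code> callable and the <code class="language-plaintext highlighter-rouge">SearchTimeout</code> exception stand in for a real search client and are assumptions; only the <code class="language-plaintext highlighter-rouge">timeout</code> and <code class="language-plaintext highlighter-rouge">ranking.matching.approximateThreshold</code> query parameters come from Vespa.</p>

```python
class SearchTimeout(Exception):
    """Raised by the (hypothetical) search client when the timeout is exceeded."""


def search_with_retry(run_query, query, budget_ms=500, first_attempt_ms=350):
    """Try the approximate search first, then force exact search with the remainder."""
    try:
        return run_query({**query, "timeout": f"{first_attempt_ms}ms"})
    except SearchTimeout:
        retry = {
            **query,
            "timeout": f"{budget_ms - first_attempt_ms}ms",
            # A threshold of 1 forces the exact nearest neighbor search.
            "ranking.matching.approximateThreshold": 1,
        }
        return run_query(retry)


calls = []


def fake_client(params):
    """Test double: the approximate attempt always times out."""
    calls.append(params)
    if "ranking.matching.approximateThreshold" not in params:
        raise SearchTimeout()
    return {"hits": ["item-1"]}


result = search_with_retry(fake_client, {"yql": "select * from items where true"})
print(result, [c["timeout"] for c in calls])
```

<p>If the second attempt also times out, the exception propagates to the caller, which keeps the total time spent within the overall budget.</p>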

<p>The sequence diagram below explains the flow.</p>

<figure style="text-align: center;">
  <img src="/static/2025/11/retry-strategy.png" alt="retry-strategy" />
</figure>

<p>Typically, a Vespa timeout means that the <a href="https://docs.vespa.ai/en/document-summaries.html">summary fetching</a> was skipped. Such responses are mostly unusable as the document data is not in the response. To “recover” such requests, we leveraged the <code class="language-plaintext highlighter-rouge">match-features</code> to return a <a href="https://vinted.engineering/2025/11/06/vespa-match-features/">document ID as a tensor</a>. The strategy eliminated most of the timeouts.</p>

<h2 id="consistency">Consistency</h2>

<p>The key requirement was to <strong>integrate</strong> the dense retrieval into the existing setup. This means:</p>

<ul>
  <li>Support a hard limit on the number of dense retrieval matches added to the Top-K results, as simply flooding results with approximate matches is a bad user experience.</li>
  <li>Adding a filter should not increase the number of search results.</li>
  <li>Changing the sorting of search results should not cause surprises.</li>
</ul>

<p>It might be hard to believe, but all of the above happened for real user queries.</p>

<h3 id="example">Example</h3>

<p>A query <code class="language-plaintext highlighter-rouge">zx750</code> with a brand filter has ~300 lexical matches, and with the tuned distance threshold the nearest neighbor search matches another ~7000 items. The problem with those 7000 items is that they are mostly random things, and the lexical matches are spread across those ~7300 hits. The poor matches are due to the <a href="https://huggingface.co/spaces/Xenova/the-tokenizer-playground">BERT-based tokenizer</a> producing the tokens <code class="language-plaintext highlighter-rouge">[CLS]z##x##75##0[SEP]</code>, which tend to give high similarity with random items. If the brand filter is removed, we get the expected ~450 (~300 lexical and ~150 nearest neighbor) matches.</p>

<h3 id="nearest-neighbor-target-hits">Nearest neighbor target hits</h3>

<p>It turns out that the dynamic switching between the exact and approximate nearest neighbor search strategies introduces these failure modes. It boils down to how the <code class="language-plaintext highlighter-rouge">targetHits</code> param is handled in each strategy. Each <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> query clause has the <code class="language-plaintext highlighter-rouge">targetHits</code> param. However, it is only a target, not a limit! Vespa is free to return more nearest neighbor matches than that, and it does so when the execution strategy is the exact nearest neighbor search. For example, when a filter is added to the search request, surprisingly, there can be more matches than without a filter. Also, <code class="language-plaintext highlighter-rouge">targetHits</code> is per content node, so the total limit is <code class="language-plaintext highlighter-rouge">targetHits * number_of_content_nodes</code>. Not ideal. Note that the approximate nearest neighbor execution strategy returns exactly <code class="language-plaintext highlighter-rouge">targetHits</code> per content node.</p>

<p>This behaviour originates in the matching phase, so a workaround can be implemented later, in the ranking phases.</p>

<h3 id="approach">Approach</h3>

<p>Having consistency requirements and the technical limitation in mind, there were only a few paths to go:</p>

<ul>
  <li>Ignore the problem</li>
  <li>Two requests to Vespa and combine the results in the application layer</li>
  <li>Push the complexity down to Vespa</li>
</ul>

<p>Taking the <a href="https://en.wikipedia.org/wiki/Red_pill_and_blue_pill">blue pill</a> and ignoring the problem (1) was not really an option as it would mean either killing a promising project or leaving the expensive feature as a filler for low-result queries. (2) Two requests felt unattractive as they would introduce plenty of complexity (e.g. faceting), making the codebase toxic to the extent that nobody would dare to touch it.</p>

<p>We decided to proceed with taking the red pill and creatively pushing the complexity down to Vespa (3). We believe that reading Vespa documentation and following a ranking profile is easier than reading a hacky implementation code.</p>

<h3 id="reciprocal-rank-fusion">Reciprocal rank fusion</h3>

<p>Reciprocal rank fusion, a.k.a. <a href="https://docs.vespa.ai/en/phased-ranking.html#cross-hit-normalization-including-reciprocal-rank-fusion">RRF</a>, is a Vespa ranking feature available in the <code class="language-plaintext highlighter-rouge">global-phase</code>. It is defined as <code class="language-plaintext highlighter-rouge">rrf_score = 1.0 / (k + rank)</code>. The score depends on an arbitrary parameter <code class="language-plaintext highlighter-rouge">k</code> and the rank (position) within a virtual list of search results ranked by some ranking feature, e.g. nearest neighbor similarity.</p>

<p>If you squint, when <code class="language-plaintext highlighter-rouge">k=0</code> then <code class="language-plaintext highlighter-rouge">rrf_score</code> looks like a position. We can find a number that represents N-th position <code class="language-plaintext highlighter-rouge">rrf_score@N=1/N</code>, e.g. <code class="language-plaintext highlighter-rouge">rrf_score@160=1/160=0.00625</code>.</p>
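
<p>To make the arithmetic concrete, here is a small Python sketch of the formula and of the “score as a position” reading. The function names are ours and purely illustrative; only the formula comes from the definition above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def rrf_score(rank: int, k: float = 0.0):
    """Reciprocal rank score for a 1-based rank: 1 / (k + rank)."""
    return 1.0 / (k + rank)


def within_top_n(rank: int, n: int):
    """With k=0, a hit is within the first n positions when its score
    is at least the score of the n-th position, 1 / n."""
    return rrf_score(rank) >= rrf_score(n)
</code></pre></div></div>

<p>For example, <code class="language-plaintext highlighter-rouge">within_top_n(160, 160)</code> holds while <code class="language-plaintext highlighter-rouge">within_top_n(161, 160)</code> does not, so comparing the score against <code class="language-plaintext highlighter-rouge">1/N</code> is the same as comparing the position against <code class="language-plaintext highlighter-rouge">N</code>.</p>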

<h3 id="identifying-the-nearest-neighbor-matches">Identifying the nearest neighbor matches</h3>

<p>All documents matched by the <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> query clause have a non-zero <code class="language-plaintext highlighter-rouge">itemRawScore</code> <a href="https://docs.vespa.ai/en/reference/rank-features.html#itemrawscore\(label\)">rank feature</a>. However, to tell whether a document is matched <strong>only</strong> by the <code class="language-plaintext highlighter-rouge">nearestNeighbor</code> clause, we also need to know that the document was not matched by any other query clause.</p>

<p>The YQL conceptually looks like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">documents</span>
<span class="k">WHERE</span> <span class="n">filters</span> <span class="k">AND</span> <span class="p">(</span><span class="n">lexical_match</span> <span class="k">OR</span> <span class="n">interpretations</span> <span class="k">OR</span> <span class="n">nearestNeighbor</span><span class="p">)</span> 
</code></pre></div></div>

<p>If a document is a lexical match then <code class="language-plaintext highlighter-rouge">textSimilarity(text_field).queryCoverage &gt; 0</code> must be true.</p>

<p>By <code class="language-plaintext highlighter-rouge">interpretations</code> we mean rewrites of a query into filters on metadata, e.g. a query <code class="language-plaintext highlighter-rouge">red dress</code> becomes a filter <code class="language-plaintext highlighter-rouge">color_id=10 AND category_id=1010</code>. The complication is that there might be multiple interpretations, and we need to know if the document is a full match of at least one <code class="language-plaintext highlighter-rouge">interpretation</code>. There is no easy way to calculate such a condition. A workaround is to introduce a boolean attribute field where the value is always <code class="language-plaintext highlighter-rouge">false</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>field bool_const type bool {
  indexing: attribute
  attribute: fast-search
  rank: filter
}
</code></pre></div></div>

<p>Then rewrite each interpretation by AND’ing it with the always-true <code class="language-plaintext highlighter-rouge">bool_const=FALSE</code> condition, e.g.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>color_id=10 AND category_id=1010 AND bool_const=FALSE
</code></pre></div></div>

<p>Now, if the rank feature <code class="language-plaintext highlighter-rouge">attributeMatch(bool_const) &gt; 0</code> then the document is fully matched by at least one interpretation. To distinguish between interpretations a field per interpretation would be needed.</p>
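
<p>The rewrite itself is mechanical. Below is a hedged Python sketch of how a query builder might produce such clauses; the function names and the <code class="language-plaintext highlighter-rouge">brand_id</code> field are hypothetical, and the output follows the conceptual notation used in this post rather than exact YQL syntax.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict, List

MARKER = "bool_const=FALSE"  # the always-true condition described above


def interpretation_clause(filters: Dict[str, int]):
    """AND the metadata filters of one interpretation with the marker."""
    conditions = [f"{field}={value}" for field, value in filters.items()]
    return " AND ".join(conditions + [MARKER])


def interpretations_clause(interpretations: List[Dict[str, int]]):
    """OR the rewritten interpretations together."""
    return " OR ".join(f"({interpretation_clause(i)})" for i in interpretations)


# Example with two interpretations of the same query (brand_id is made up).
clause = interpretations_clause([
    {"color_id": 10, "category_id": 1010},
    {"brand_id": 7},
])
</code></pre></div></div>

<p>Because the marker is AND’ed into every interpretation, it can only contribute to a match when all the other conditions of that interpretation hold as well.</p>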

<p>Finally, we can identify if the document is only a nearest neighbor match:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function match_on_nearest_neighbor_only() {
  expression {
    if(itemRawScore(dense_retrieval) &gt; 0 
          &amp;&amp; textSimilarity(text_field).queryCoverage == 0
          &amp;&amp; attributeMatch(bool_const) == 0,
      1,
      0
    )
  }
}
</code></pre></div></div>

<h3 id="using-the-global-ranking-phase-as-a-filter">Using the global ranking phase as a filter</h3>

<p>Matches from all content nodes are available in the container node in the global ranking phase. The global phase supports a <code class="language-plaintext highlighter-rouge">rank-score-drop-limit</code> parameter that can be used to remove matched documents whose score does not exceed some constant numeric value. This feature was <a href="https://github.com/vespa-engine/vespa/pull/33298">contributed</a> to Vespa.</p>

<p>The trick to filter out documents is to change the score of some matches to a value that falls under the <code class="language-plaintext highlighter-rouge">rank-score-drop-limit</code>.</p>

<p>A nice thing is that this parameter can be passed as an HTTP request parameter, which allows controlling it per request or for AB testing.</p>

<h3 id="putting-all-together">Putting it all together</h3>

<p>Here is the implementation that limits the number of nearest neighbor matches using the tricks described above:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global-phase {
  expression {
    if(reciprocal_rank(itemRawScore(dense_retrieval), 0) &gt;= 1.0 / query(nn_hits_limit)
          || match_on_nearest_neighbor_only == 0,
    relevanceScore,
    -100000.0)
  }
  rerank-count: 20000
  rank-score-drop-limit: -100000.0
}
</code></pre></div></div>

<p>It works like this: when the document is either a nearest neighbor match up to the <code class="language-plaintext highlighter-rouge">query(nn_hits_limit)</code> position, or it is not only a nearest neighbor match, then the document keeps the score calculated in the previous ranking phases (i.e., no change). Otherwise, the score is set to <code class="language-plaintext highlighter-rouge">-100000.0</code> (way outside the range of the normal scores). We take 20,000 documents (meaning “all”) to rerank. After reranking, the documents whose score is at or below the drop limit of <code class="language-plaintext highlighter-rouge">-100000.0</code> are filtered out.</p>
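
<p>For readers who prefer code to ranking expressions, the same logic can be simulated outside of Vespa. The Python sketch below is only an illustration of the expression above, not how Vespa evaluates it internally: the <code class="language-plaintext highlighter-rouge">Hit</code> class and its field names are hypothetical stand-ins for the rank features, and it assumes that hits with a score at the drop limit are removed.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass
from typing import List

DROP_SCORE = -100000.0  # same constant as in the rank profile above


@dataclass
class Hit:
    doc_id: str
    relevance: float     # score from the previous ranking phases
    nn_raw_score: float  # itemRawScore(dense_retrieval); 0 when not matched
    nn_only: bool        # match_on_nearest_neighbor_only


def limit_nn_hits(hits: List[Hit], nn_hits_limit: int):
    """Keep at most nn_hits_limit hits that are nearest-neighbor-only."""
    # reciprocal_rank(..., 0) is 1 / rank, where rank is the 1-based position
    # when hits are ordered by the nearest neighbor raw score, best first.
    by_nn = sorted(hits, key=lambda h: h.nn_raw_score, reverse=True)
    rank = {h.doc_id: position for position, h in enumerate(by_nn, start=1)}

    kept = []
    for h in hits:
        in_limit = 1.0 / rank[h.doc_id] >= 1.0 / nn_hits_limit
        score = h.relevance if (in_limit or not h.nn_only) else DROP_SCORE
        if score > DROP_SCORE:  # hits at the drop limit are removed
            kept.append(h)
    return sorted(kept, key=lambda h: h.relevance, reverse=True)
</code></pre></div></div>

<p>Note that positions are counted over all hits that have a nearest neighbor score, so hits that also match lexically occupy positions too. The limit is therefore an upper bound on the nearest-neighbor-only hits, not an exact count.</p>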

<h3 id="sorted-queries">Sorted queries</h3>

<p>Yet another complication is that every result ordering has to deal with the same issue of limiting the number of nearest neighbor matches. The required sorting has to be implemented with ranking profiles because <code class="language-plaintext highlighter-rouge">global-phase</code> prevents using the <code class="language-plaintext highlighter-rouge">order by</code> <a href="https://docs.vespa.ai/en/reference/query-language-reference.html#order-by">clause</a>. Here is an example rank profile for the “newest first” sorting:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank-profile dense_retrieval_newest inherits dense_retrieval_global_phase {
    match-phase {
        attribute: first_visible_at
        order: descending
    }
    first-phase {
        expression: attribute(first_visible_at)
    }
}
</code></pre></div></div>

<p>Another way to see this complication is as an opportunity: secondary sorting can use some clever scoring that might include, for example, personalization, whereas secondary sorting with <code class="language-plaintext highlighter-rouge">order by</code> is limited to an indexed attribute value.</p>

<h3 id="custom-jvm">Custom JVM</h3>

<p>Using the <code class="language-plaintext highlighter-rouge">global-phase</code> reranking as a filter added quite a bit of work to the query container nodes, which increased latencies. To compensate for that, we experimented with a newer JVM, as Vespa ships with a relatively old OpenJDK 17.</p>

<p>In the past, we’ve had great results with <a href="https://www.graalvm.org/">GraalVM</a> when running Elasticsearch. Why not try it with Vespa? And it turns out that it works pretty well! The tail latencies dropped by double-digit percentages. Later, we also <a href="https://github.com/vespa-engine/vespa/pull/34439">configured</a> <a href="https://docs.oracle.com/en/java/javase/21/gctuning/z-garbage-collector.html">ZGC</a> so that JVM garbage collection pauses would cause fewer timeouts.</p>

<p>Using the combination of GraalVM and ZGC not only helped with this project but also proved to help with <a href="https://vinted.engineering/2025/11/06/vespa-match-features/">other</a> ultra low-latency use cases.</p>

<h3 id="peak-of-complexity">Peak of Complexity</h3>

<p>Even though the current implementation gets the job done, we are not happy. The logic is not particularly complicated, but it has many components, and therefore it is easy to get lost. We’ve added plenty of integration tests that prevent introducing bugs when changing something remotely related.</p>

<p>The worst part of it is that additional requirements can be implemented only by adding even more complexity. To get rid of this complexity, either the consistency requirements need to change, or the performance of ANN needs to improve significantly, or the model needs to improve so that all items that pass the distance thresholds are relevant.</p>

<p>This reminds us of the <a href="https://www.youtube.com/watch?v=u08hjp6PF-Q">Peak of Complexity</a> model introduced by the Java architect Brian Goetz. We’ve just passed the complexity peak, and we’re in the virtuous collapse phase where simplification leads to even more simplification. When looking into parts of the solution, it is not uncommon to hear questions like “what took you so long?”</p>

<h2 id="summary">Summary</h2>

<p>The dense retrieval is a significant improvement to the item search. The benefits were proven in lots of AB tests. However, as is typical for search, good work leads to even more work: the model can get better, the integration into the ranking can be improved, nuance can be introduced into when and to what extent the dense retrieval is applied, resource utilization can be optimized, etc.</p>

<p>Even though a lot of work is ahead of us, we are proud of what has been achieved. The overall architecture is relatively simple and contained, which allows for team autonomy. Due to multiple performance optimizations, we’ve reached a &lt;0.02% error rate. The techniques mastered for this feature have laid the foundations for other advancements such as image search or advanced personalization.</p>

<p>We hope that this long blog post was interesting and shed some light on what it takes to work out a significant search feature on a <a href="https://vinted.engineering/2025/01/10/1-billion-items-in-search/">billion-scale e-commerce dataset</a>.</p>]]></content><author><name>Laurynas Jasiukėnas</name><uri>https://github.com/laurynasjs</uri></author><summary type="html"><![CDATA[TL;DR: integrating embedding-based retrieval into the e-commerce search application is a significant undertaking.]]></summary></entry><entry><title type="html">Teaching the Old Dog a New Trick</title><link href="https://vinted.engineering//2025/11/06/vespa-match-features/" rel="alternate" type="text/html" title="Teaching the Old Dog a New Trick" /><published>2025-11-06T00:00:00+00:00</published><updated>2025-11-06T00:00:00+00:00</updated><id>https://vinted.engineering//2025/11/06/vespa-match-features</id><content type="html" xml:base="https://vinted.engineering//2025/11/06/vespa-match-features/"><![CDATA[<p><strong>TL;DR</strong>: When required data can be encoded with <code class="language-plaintext highlighter-rouge">match-features</code>, Vespa can apply a new optimisation, which can be a lifesaver when data is frequently redistributed.</p>

<!--truncate-->

<h2 id="use-case">Use-case</h2>

<p>When ranking, <a href="https://vespa.ai/">Vespa</a> often requires additional information that cannot be easily stored within the documents being ranked. For example, statistical cross features between the user and the document, such as a counter of how many interactions the user has had with documents of this category, can’t be stored within the item itself. Such values need to be passed as parameters via <a href="https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.features">ranking features</a>. An important question is: where do you store that information, and where do you fetch it from?</p>

<p>When Redis became a bottleneck for this task, we decided to try Vespa itself. Why?</p>

<ul>
  <li>The data format is already suitable because it is going to be passed into the ranking profile.</li>
  <li>It becomes possible to eliminate some network round-trips.</li>
  <li>Vespa is scalable to any dataset size and can store data both in memory and on disk.</li>
  <li>Vespa allows for fetching multiple filtered documents.</li>
  <li>Vespa allows for collocating calculations with the data.</li>
</ul>

<p>And of course, such tasks require single-digit millisecond latency.</p>

<h2 id="problem">Problem</h2>

<p>Initially, the use of Vespa for such use cases worked well and looked like a great success. However, the proverbial honeymoon ended when the number of schemas (50+) and the update rate (500M+ per hour) skyrocketed. We noticed that tail (p99+) latencies sometimes jumped to 100+ ms, seemingly out of nowhere, although the jumps sometimes correlated with high feeding bursts (happening right after the feeding burst). Such high latencies are unacceptable when the latency budget is 50 ms.</p>

<h2 id="investigation">Investigation</h2>

<p>We noticed that the spike in tail latencies always happened after a burst of feeding requests. E.g. features that are recalculated hourly/daily for all Vinted users (i.e. 100+ million records) create such bursts. When latencies started spiking more and more frequently, it was a signal to have a closer look to see what the cause was.</p>

<figure style="text-align: center;">
  <img src="/static/2025/11/p99-latency-spike-after-feeding.png" alt="latency" />
</figure>

<p>The diagram above shows that a spike in p99 latency occurred after a feeding burst.</p>

<p>After inspecting the logs during such a latency spike, there were multiple records such as the following:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Docsum fetch failed for 36 hits (64 ok hits), no retry
</code></pre></div></div>

<p>The log says that fetching the document summaries failed for some of the hits.</p>

<p>After a <a href="https://en.wikipedia.org/wiki/Guru_Meditation">guru meditation</a> session at the <a href="https://www.mruni.eu/news/apsilankymas-pirmajame-lietuvos-vienaragyje-vinted-praplete-studiju-horizontus/">Vilnius office sauna</a> (which is intended for this type of work), we concluded that data moving around the cluster was causing problems for document summary fetching.</p>

<p>This theory was quickly confirmed by checking the dashboard on data redistribution.</p>
<figure style="text-align: center;">
  <img src="/static/2025/11/p99-latency-vs-data-redistribution.png" alt="latency" />
</figure>
<p>With this evidence, the problem to solve was clear. But first, we need to familiarise ourselves with how Vespa executes queries.</p>

<h3 id="query-execution">Query Execution</h3>

<p>This is the typical query execution flow:</p>
<figure style="text-align: center;">
  <img src="/static/2025/11/vespa-default-query-execution.png" alt="latency" />
</figure>
<p>The diagram above shows that a request first comes to the Vespa container node. Then, it is scattered to all content nodes (typically over the network) of an available content group. Responses with local Top-K hits are then gathered in the container node. For the global Top-K matching documents, a <code class="language-plaintext highlighter-rouge">.fill()</code> request is then sent to the relevant content nodes to fetch the document summary (i.e. document data or calculated values).</p>
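
<p>The two phases can be mimicked with a short Python sketch. It is a simplified illustration of the scatter-gather idea under our own naming (<code class="language-plaintext highlighter-rouge">gather_top_k</code> and <code class="language-plaintext highlighter-rouge">fill_requests</code> are hypothetical helpers), not Vespa’s actual dispatch code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (relevance, document id)


def gather_top_k(node_hits: Dict[str, List[Hit]], k: int):
    """Merge the local top-K lists of all content nodes into a global top-K.

    Returns (relevance, document id, node) triples, best first.
    """
    candidates = [
        (relevance, doc_id, node)
        for node, hits in node_hits.items()
        for relevance, doc_id in hits
    ]
    return heapq.nlargest(k, candidates)


def fill_requests(top_k):
    """Group the global top-K by node: one summary request per node."""
    per_node: Dict[str, List[str]] = {}
    for _relevance, doc_id, node in top_k:
        per_node.setdefault(node, []).append(doc_id)
    return per_node
</code></pre></div></div>

<p>The second step is the interesting one here: it relies on every document still living on the node that returned it in the first step.</p>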

<p>When data redistribution is ongoing during query handling, it might happen that between the query execution and <code class="language-plaintext highlighter-rouge">.fill()</code> (which typically takes a couple of milliseconds), a document is moved from one content node to another (or the content node goes down, or some other unexpected situation occurs, as happens in distributed systems). To handle such a situation, Vespa queries all known content nodes for the summary data, <a href="https://github.com/vespa-engine/vespa/blob/63c770c26f24c77357aef9e78d3a03bebc45c5f3/container-search/src/main/java/com/yahoo/search/dispatch/rpc/RpcProtobufFillInvoker.java#L304-L313">potentially doing multiple retries</a>.</p>

<p>Typically, summary fetching takes ~1 ms, but we’ve seen <code class="language-plaintext highlighter-rouge">summary</code> fetching taking ~100 ms.</p>

<p>A small nuance to note about the query execution flow is that, with the first response from content nodes, the <code class="language-plaintext highlighter-rouge">matchfeatures</code> can be returned.</p>

<p>Match features are rank features added to <a href="https://docs.vespa.ai/en/reference/schema-reference.html#match-features">each hit</a> in the <code class="language-plaintext highlighter-rouge">matchfeatures</code> field. The feature was added to Vespa in <a href="https://github.com/vespa-engine/vespa/issues/19645#issuecomment-970802029">2021</a>. The values can be either floating point numbers or tensors (but not strings, booleans, etc.). Typically, they are useful to record the feature values used in scoring for further <a href="https://docs.vespa.ai/en/tutorials/rag-blueprint.html">ranking</a> optimisation.</p>

<p>A clever trick to encode non-numeric data, e.g. a string label, is to convert it into a <a href="https://docs.vespa.ai/en/reference/tensor.html">mapped tensor</a>. If you squint a little, the mapped tensor looks like a regular JSON object.</p>

<h2 id="solution">Solution</h2>

<p>Knowing the problem and being familiar with <code class="language-plaintext highlighter-rouge">matchfeatures</code>, we can draft a workaround for summary fetching.</p>

<p>Luckily, we’ve already thought about such an <a href="https://github.com/vespa-engine/vespa/issues/33979">optimisation</a>! What if everything we needed could be fetched with the <code class="language-plaintext highlighter-rouge">select matchfeatures from …</code>? Summary fetching would then not be required, and <code class="language-plaintext highlighter-rouge">.fill()</code> <a href="https://github.com/vespa-engine/vespa/pull/34029">could be eliminated</a>.</p>

<p>The enthusiasm led to a quick proof of concept; however, the benchmarks surprisingly showed no improvement at all. This led to an inspection of the <a href="https://docs.vespa.ai/en/reference/query-api-reference.html#tracing">query trace</a>, in which we found that the summary was being fetched! This was confirmed with the metrics on the Vespa side, on <code class="language-plaintext highlighter-rouge">docsum</code> <a href="https://docs.vespa.ai/en/reference/vespa-set-metrics-reference.html">operations</a>.</p>

<p>It was high time to roll up our sleeves and do some <a href="https://github.com/vespa-engine/vespa/pull/35011">open source work</a>. The feature was released with <a href="https://factory.vespa.ai/changes/8.596.7">Vespa 8.596.7</a>.</p>

<p>Open source contributions take time (the review, accept, release, adopt cycle can take weeks), but we needed the solution quickly. Vespa is extremely flexible, and we could alter the platform ourselves with a plugin by adding <a href="https://github.com/dainiusjocas/notes/tree/main/examples/ignore-fill-bundle">a bundle JAR</a> file into the <code class="language-plaintext highlighter-rouge">components/</code> <a href="https://docs.vespa.ai/en/jdisc/container-components.html#adding-component-to-application-package">directory</a>, and configuring the <a href="https://docs.vespa.ai/en/searcher-development.html#deploying-a-searcher">search chain</a>.</p>

<p>Let’s explore the Vespa application setup. First, we need to create a rank profile that encodes data into tensors.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>schema doc {
    document doc {
        field my_feature type string {
            indexing: attribute
        }
    }
    rank-profile fields inherits unranked {
        function my_feature() {
            expression: tensorFromLabels(attribute(my_feature))
        }
        match-features {
            my_feature
        }
    }
}
</code></pre></div></div>

<p>Then, we need to specify the <code class="language-plaintext highlighter-rouge">fields</code> rank profile when querying. As a bonus, we can disable the <a href="https://docs.vespa.ai/en/reference/query-api-reference.html#ranking.querycache">query cache</a>, because it only helps during summary fetching, and ask for the <a href="https://docs.vespa.ai/en/reference/query-api-reference.html#presentation.format.tensors">short version</a> of tensors:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"yql"</span><span class="p">:</span><span class="w"> </span><span class="s2">"select matchfeatures from doc where true"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"ranking"</span><span class="p">:</span><span class="w"> </span><span class="s2">"fields"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"ranking.queryCache"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="p">,</span><span class="w">
  </span><span class="nl">"presentation.format.tensors"</span><span class="p">:</span><span class="w"> </span><span class="s2">"short-value"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The response looks like:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"root"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
  </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"toplevel"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"relevance"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.0</span><span class="p">,</span><span class="w">
  </span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nl">"totalCount"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">},</span><span class="w">
  </span><span class="nl">"coverage"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
   </span><span class="nl">"coverage"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span><span class="p">,</span><span class="w">
   </span><span class="nl">"documents"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
   </span><span class="nl">"full"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
   </span><span class="nl">"nodes"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
   </span><span class="nl">"results"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
   </span><span class="nl">"resultsFull"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">},</span><span class="w">
  </span><span class="nl">"children"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w">
    </span><span class="nl">"id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"index:content/0/c4ca42388ce70a10b392b401"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"relevance"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.0</span><span class="p">,</span><span class="w">
    </span><span class="nl">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"doc"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"fields"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"matchfeatures"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"my_feature"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"MY_LABEL_VALUE"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.0</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">}]</span><span class="w">
</span><span class="p">}}</span><span class="w">
</span></code></pre></div></div>

<p>Third, the <code class="language-plaintext highlighter-rouge">match-features</code> need to be converted into a usable form. This can be done either in a custom <a href="https://docs.vespa.ai/en/glossary.html#searcher">searcher</a> or in your application.</p>
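
<p>On the application side, the conversion can be as small as the Python sketch below. It assumes the short-value tensor format and the single-label <code class="language-plaintext highlighter-rouge">my_feature</code> tensor from the example response; the helper names are ours, not part of any Vespa client.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from typing import Dict


def decode_label(tensor: Dict[str, float]):
    """Return the label of a single-cell mapped tensor in short-value form."""
    (label,) = tensor.keys()  # raises ValueError unless there is exactly one cell
    return label


def decode_hits(response: dict, feature: str):
    """Extract one string feature from every hit of a parsed response."""
    children = response["root"].get("children", [])
    return [
        decode_label(hit["fields"]["matchfeatures"][feature])
        for hit in children
    ]
</code></pre></div></div>

<p>The strict unpacking in <code class="language-plaintext highlighter-rouge">decode_label</code> is deliberate: a tensor with zero or several cells would signal a data problem that is better surfaced than silently ignored.</p>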

<p>Without summary fetching, the query execution is much simpler.</p>
<figure style="text-align: center;">
  <img src="/static/2025/11/vespa-matchfeatures-query-execution.png" alt="mean-latency" />
</figure>
<p>In the diagram above, one network round-trip is eliminated when compared to the typical query execution. Also, this eliminates all the potential summary fetching problems because documents are findable even during data redistributions.</p>

<h2 id="results">Results</h2>

<p>When the solution was deployed, we immediately noticed a drop in tail latencies. But the most important thing was that there were no more latency spikes during data redistribution!</p>

<figure style="text-align: center;">
  <img src="/static/2025/11/p99-latency-drop.png" alt="p99-latency" />
</figure>
<p>When the change was deployed, the p99 latencies dropped from ~9 ms to about 3 ms. And the latency spikes are gone.</p>

<figure style="text-align: center;">
  <img src="/static/2025/11/mean-latency.png" alt="mean-latency" />
</figure>
<p>Currently, the mean query latency with ~7.5k RPS per container node is around 430 microseconds.</p>
<figure style="text-align: center;">
  <img src="/static/2025/11/max-query-latency.png" alt="max-latency" />
</figure>
<p>The max latencies (pro tip: always monitor max latencies) are typically ~20 ms. Those ~200 ms spikes are due to packet loss in the network layer (not Vespa-specific).</p>

<h3 id="discussion">Discussion</h3>

<p>Even though the optimisation is nice, the journey is not yet finished. There are other ways to get even more out of Vespa. Here are several ideas:</p>

<ul>
  <li>The current implementation creates tensors from attributes at query time. They could be precalculated during indexing.</li>
  <li>The current implementation is usable when querying one schema. For multi-schema support, you either have to encode the datatype name in the document attributes or ask Vespa to add <a href="https://docs.vespa.ai/en/reference/default-result-format.html#sddocname"><code class="language-plaintext highlighter-rouge">sddocname</code></a> to the response. However, <code class="language-plaintext highlighter-rouge">sddocname</code> is filled only on <a href="https://github.com/vespa-engine/vespa/blob/63c770c26f24c77357aef9e78d3a03bebc45c5f3/container-search/src/main/java/com/yahoo/search/dispatch/rpc/RpcProtobufFillInvoker.java#L282">receiving the summary</a>.</li>
  <li>A <a href="https://docs.vespa.ai/en/result-rendering.html">custom renderer</a> could be implemented that serialises data based on the schema into a binary format, avoiding JSON serialisation.</li>
</ul>

<h2 id="summary">Summary</h2>

<p>This new trick of selecting only the <code class="language-plaintext highlighter-rouge">matchfeatures</code>, available in <a href="https://factory.vespa.ai/changes/8.596.7">Vespa 8.596.7</a>, helps eliminate not only a network round-trip, but also the problems and latencies associated with summary fetching. The overhead of converting attributes into tensors and transmitting slightly more data over the network was negligible in our setup. Of course, this optimisation is not a silver bullet for all use cases, but when summary fetching is problematic, it really helps.</p>

<p>Kudos to the team for this great work! And thanks to everyone who helped!</p>

<h2 id="ps">P.S.</h2>

<p>A fun fact: the initial hypothesis for the latency spikes was JVM garbage collector pauses. However, after setting up the <a href="https://wiki.openjdk.org/display/zgc/Main">generational ZGC</a>, the latency spikes were still there. The garbage collector is almost never the root cause.</p>]]></content><author><name>Dainius Jocas</name><uri>https://github.com/dainiusjocas</uri></author><summary type="html"><![CDATA[TL;DR: When required data can be encoded with match-features, Vespa can apply a new optimisation, which can be a lifesaver when data is frequently redistributed.]]></summary></entry><entry><title type="html">Investigation: Identical Servers, Different Performance</title><link href="https://vinted.engineering//2025/07/15/clocksource-performance/" rel="alternate" type="text/html" title="Investigation: Identical Servers, Different Performance" /><published>2025-07-15T00:00:00+00:00</published><updated>2025-07-15T00:00:00+00:00</updated><id>https://vinted.engineering//2025/07/15/clocksource-performance</id><content type="html" xml:base="https://vinted.engineering//2025/07/15/clocksource-performance/"><![CDATA[<p>Inconsistent Redis performance was observed across a fleet of otherwise identical servers. After investigation, we discovered that differences in Linux system clocksource settings - specifically, servers running the slower HPET clocksource instead of the default TSC - led to significant increases in Redis latency and CPU usage.</p>

<!--truncate-->

<p>This post summarizes our findings, shows how to spot and fix the issue, and gives tips to prevent performance drops from unintended clocksource changes. Fixing this simple config can lead to immediate, measurable improvements in throughput and system efficiency.</p>

<h2 id="tldr">TL;DR</h2>

<ul>
  <li>System performance drops, especially for high-throughput workloads, if the server switches from the default TSC clocksource to HPET.</li>
  <li>The kernel might fall back to HPET if TSC sync fails, often after the server is powered off for a long period.</li>
  <li>You can check the available clocksources and switch the current one easily (see below).</li>
</ul>

<h2 id="problem">Problem</h2>

<p>Some “identical” servers were running slower, with increased latency and CPU usage. We needed to understand why.</p>

<h2 id="whats-a-clocksource">What’s a Clocksource?</h2>

<p>A clocksource is how the Linux kernel keeps track of time (“read the clock!”).</p>

<ul>
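  <li><strong>TSC</strong> (Time Stamp Counter): Crazy fast to read, a counter built into modern CPUs.</li>
  <li><strong>HPET</strong> (High Precision Event Timer): Accurate, but slow for frequent reads, because it is a separate hardware timer rather than a CPU register.</li>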
</ul>

<p>More technical details:</p>
<ul>
  <li><a href="https://www.kernel.org/doc/Documentation/timers/timekeeping.txt">Linux kernel timekeeping documentation</a></li>
  <li><a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_for_real_time/7/html/reference_guide/chap-timestamping#Reading_hardware_clock_sources">Red Hat: Reading hardware clock sources</a></li>
</ul>

<h2 id="investigation-whats-going-on">Investigation: What’s Going On?</h2>

<p>🔎 We dug in and noticed a pattern:</p>

<ul>
  <li>Slower servers were all using the HPET clocksource.</li>
  <li>Fast servers stuck with the default, TSC.</li>
  <li>By default, kernels prefer TSC, but will abandon it if synchronization issues are detected.</li>
</ul>
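
<p>The same comparison can be scripted across a fleet. A minimal sketch, assuming SSH access to each server; the <code class="language-plaintext highlighter-rouge">non_tsc_hosts</code> helper and the host names in the usage comment are hypothetical:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Reads "host clocksource" pairs on stdin and prints every host whose
# clocksource is not tsc (including hosts where the value is missing).
non_tsc_hosts() {
  awk '$2 != "tsc" { print $1 }'
}

# Hypothetical usage (host names are placeholders):
# for host in redis-1 redis-2; do
#   printf '%s %s\n' "$host" "$(ssh "$host" cat /sys/devices/system/clocksource/clocksource0/current_clocksource)"
# done | non_tsc_hosts
</code></pre></div></div>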

<p><strong>Example log snippet:</strong></p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Apr 15 18:22:57 srv kernel: TSC synchronization <span class="o">[</span>CPU#0 -&gt; CPU#8]:
Apr 15 18:22:57 srv kernel: Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
Apr 15 18:22:57 srv kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed
</code></pre></div></div>
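
<p>To look for this on a running server, you can search the kernel log for the same message. A minimal sketch; the <code class="language-plaintext highlighter-rouge">tsc_marked_unstable</code> helper is hypothetical, and it matches only the wording shown in the snippet above, which may differ between kernel versions:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Succeeds (exit status 0) if the kernel log read from stdin contains
# the "Marking TSC unstable" message; fails otherwise.
tsc_marked_unstable() {
  grep -q "Marking TSC unstable"
}

# Usage: pipe the kernel log into it, for example
# sudo dmesg | tsc_marked_unstable
# journalctl -k | tsc_marked_unstable
</code></pre></div></div>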

<h3 id="why-does-this-happen">Why does this happen?</h3>

<p>We’re not 100% sure. It could be hardware or firmware quirks, or the server being powered off for long periods.</p>

<h2 id="visual-evidence">Visual Evidence</h2>

<h3 id="system-cpu-usage-tsc-vs-hpet">System CPU Usage (TSC vs. HPET)</h3>

<figure style="text-align: center;">
  <img src="/static/2025/06/visualCPU1.png" />
  <br />
  <img src="/static/2025/06/visualCPU2.png" alt="TSC" />
</figure>

<ul>
  <li>CPU usage is significantly higher on HPET than TSC for the same workload.</li>
</ul>

<h3 id="redis-cpuperformancerate">Redis CPU/performance/rate</h3>

<figure style="text-align: center;">
  <img src="/static/2025/06/visualRedis1.png" />
  <br />
  <img src="/static/2025/06/visualRedis2.png" alt="TSC" />
</figure>

<ul>
  <li>Application latency and CPU usage spike when HPET is used.</li>
</ul>

<h2 id="benchmark-the-pain-is-real">Benchmark: The Pain is Real</h2>

<p><strong>Objective:</strong> How much does the clocksource really matter for high-throughput workloads? (Benchmarked by proxying Redis traffic through Envoy.)</p>

<h3 id="test-setup">Test Setup</h3>

<ul>
  <li>Deployed Envoy proxies on two identical servers.</li>
  <li>Both routed requests to the same Redis cluster.</li>
  <li>Tests run in three phases to single out clocksource effects.</li>
</ul>

<figure style="text-align: center;">
  <img src="/static/2025/06/diagram.png" />
</figure>

<h3 id="how-the-test-was-performed">How the test was performed</h3>

<p>On a dedicated server, we ran two instances of a custom Go benchmark app, one aimed at each Envoy. Each goroutine sent SET and GET commands to Redis at a constant rate, and the apps added goroutines at regular intervals, so the total Redis command RPS grew steadily over time.</p>

<p>Envoy’s Redis metrics were collected every 10 seconds using a standalone Prometheus server.</p>

<h3 id="benchmark-phases">Benchmark Phases</h3>

<ul>
  <li><strong>Baseline:</strong> Both servers on TSC.</li>
  <li><strong>HPET on Server 1:</strong> Server 1 switches to HPET, Server 2 stays on TSC.</li>
  <li><strong>HPET on Server 2:</strong> Swap: Server 2 on HPET, Server 1 on TSC.</li>
</ul>

<h3 id="results">Results</h3>

<ul>
  <li>Switching to HPET = instant slow-down.</li>
  <li>Increased CPU usage and application latency were very evident.</li>
</ul>

<figure style="text-align: center;">
  <img src="/static/2025/06/benchmark.png" />
</figure>

<h3 id="conclusion">Conclusion</h3>

<p><strong>HPET = Performance Killer for High-Throughput Workloads.</strong> Stick with TSC whenever possible - otherwise, expect increased latency and higher CPU usage.</p>

<h2 id="how-to-reproduce-or-fix-this">How to Reproduce (Or Fix) This</h2>

<p><strong>When does it happen?</strong><br />
The following screenshots show a timeline of the selected clocksource. Periods with no color in a server’s timeline indicate that the server was offline.</p>

<ul>
  <li>Servers left powered off for a long time sometimes boot with an unstable TSC, so the kernel falls back to HPET:</li>
</ul>
<figure style="text-align: center;">
  <img src="/static/2025/06/when1.png" />
</figure>

<ul>
  <li>A simple reboot might fix it, but not always:</li>
</ul>
<figure style="text-align: center;">
  <img src="/static/2025/06/when2.png" />
</figure>
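
<p>A timeline like this needs the selected clocksource recorded over time. A minimal sketch, assuming Prometheus node_exporter with its textfile collector enabled; the metric name <code class="language-plaintext highlighter-rouge">clocksource_current_info</code>, the helper name, and the output path are hypothetical, and this is not necessarily how the graphs above were produced:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Prints one Prometheus text-format sample naming the current clocksource.
# The optional argument is the sysfs directory; it defaults to the real one.
clocksource_metric() {
  dir="${1:-/sys/devices/system/clocksource/clocksource0}"
  current="$(cat "$dir/current_clocksource")"
  printf 'clocksource_current_info{clocksource="%s"} 1\n' "$current"
}

# Hypothetical usage from cron (the target path is an assumption):
# clocksource_metric | sudo tee /var/lib/node_exporter/clocksource.prom
</code></pre></div></div>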

<h3 id="how-to-check--change-your-clocksource">How to Check / Change Your Clocksource</h3>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># See available clocksources</span>
<span class="nb">cat</span> /sys/devices/system/clocksource/clocksource0/available_clocksource

<span class="c"># See current clocksource</span>
<span class="nb">cat</span> /sys/devices/system/clocksource/clocksource0/current_clocksource

<span class="c"># Change current clocksource</span>
<span class="nb">echo</span> <span class="s2">"tsc"</span> | <span class="nb">sudo tee</span> /sys/devices/system/clocksource/clocksource0/current_clocksource
</code></pre></div></div>
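
<p>When scripting the switch, it may be worth checking first that the kernel still lists TSC as available. A minimal sketch of such a guarded switch; the <code class="language-plaintext highlighter-rouge">switch_to_tsc</code> helper is hypothetical and has to run from a root shell on a real system:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Writes "tsc" as the current clocksource only if the kernel lists it as
# available; otherwise prints a message and returns a non-zero status.
# The optional argument is the sysfs directory; it defaults to the real one.
switch_to_tsc() {
  dir="${1:-/sys/devices/system/clocksource/clocksource0}"
  case " $(cat "$dir/available_clocksource") " in
    *" tsc "*) ;;
    *) echo "tsc is not available"; return 1 ;;
  esac
  # tee writes the value and echoes it to stdout
  echo "tsc" | tee "$dir/current_clocksource"
}
</code></pre></div></div>

<p>After calling it, read <code class="language-plaintext highlighter-rouge">current_clocksource</code> again to confirm that the kernel accepted the change.</p>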

<h2 id="key-takeaways">Key Takeaways</h2>

<ul>
  <li>Always check your clocksource if you notice unexplained performance drops.</li>
  <li>Prefer TSC for high-throughput or latency-sensitive workloads.</li>
  <li>A simple reboot or manual switch can restore performance.</li>
</ul>]]></content><author><name>Simonas Rupšys</name><uri>https://github.com/simonasr</uri></author><summary type="html"><![CDATA[Inconsistent Redis performance was observed across a fleet of otherwise identical servers. After investigation, we discovered that differences in Linux system clocksource settings - specifically, servers running the slower HPET clocksource instead of the default TSC - led to significant increases in Redis latency and CPU usage.]]></summary></entry></feed>