Vinted Engineering

Adopting the Vespa search engine for serving personalized second-hand fashion recommendations at Vinted

2023-10-09T00:00:00+00:00

In today’s digital landscape, recommender systems have become ubiquitous, curating user experiences across a wide array of online platforms, including Vinted - Europe’s largest online second-hand fashion marketplace. In this blog post, we outline our journey of adopting the Vespa search engine to serve personalized homepage listing recommendations, helping our members find deals they will enjoy. We are excited to share our story as we have found Vespa to be a great solution combining the now trendy vector search with more traditional sparse search techniques, as well as offering a great engineering experience.

At Vinted, we’ve implemented a 3-stage recommender system that leverages both explicit and implicit user preferences to offer users a curated list of items presented on the homepage. Explicit preferences are inputted by users on the app, allowing them to specify details such as the clothing sizes they are interested in. Meanwhile, implicit preferences are extracted from historical user interactions on the platform, including clicks and purchases, via the use of machine learning models. This system distills a tailored selection from millions of available listings, presenting users with options most aligned with their tastes and behaviors.

Figure 1. 3-stage recommender system

The goal of the first stage of the system is to quickly (< 100 ms) recall the most relevant content based on historical user behavior. This is done by using approximate nearest neighbor (ANN) search with embeddings obtained from an in-house two-tower recommendation model. The listing “tower” of this model is responsible for generating vector representations of listings based on various metadata such as brand, price, and size, as well as unstructured data such as photos. The second “tower” is responsible for generating embeddings of a user’s preferences, characterized by a sequence of past interactions (clicks, favorites, and purchases) with listings on the platform. The model is trained in such a way that the distance between a user’s and a listing’s embedding represents the affinity or relevance for the given user-item pair. This score can then be used to rank listings by relevance for a given user and select a smaller list of candidates for the next stage.
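To make the ranking step concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance as the scoring function (the post does not say which metric the model uses) and performs an exact scan rather than ANN; the embeddings and listing ids are made up for illustration.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Distance between two embeddings; smaller means more relevant."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_listings(user_embedding: list[float],
                   listing_embeddings: dict[str, list[float]],
                   k: int) -> list[str]:
    """Return the ids of the k listings closest to the user embedding."""
    ranked = sorted(
        listing_embeddings,
        key=lambda listing_id: euclidean_distance(
            user_embedding, listing_embeddings[listing_id]
        ),
    )
    return ranked[:k]


# Hypothetical 2-dimensional embeddings (real ones have hundreds of dimensions).
user = [0.9, 0.1]
listings = {
    "jacket": [1.0, 0.0],
    "sneakers": [0.0, 1.0],
    "coat": [0.8, 0.3],
}
candidates = top_k_listings(user, listings, k=2)
```

An ANN index replaces the full sort with a graph traversal, trading some accuracy for speed, but the contract stays the same: a user embedding goes in, a short list of candidate ids comes out.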

Figure 2. Two-tower recommendation retrieval model

When implementing the first iteration of this system, we chose the Facebook AI Similarity Search (Faiss) library for performing ANN searches. While Faiss served us well in the first iterations of this system to prove value, it is not a complete database solution, and we found the following drawbacks:

We used Faiss as a read-only index in a stateless Kubernetes service that would have to be periodically rebuilt and redeployed to include newly uploaded items and remove sold or deleted content.
Faiss has no capability for approximate nearest neighbor searches with pre-filtering based on metadata. You can only retrieve the top-k scoring items from this index, and any filtering would have to be performed as a post-processing step on the fixed-length list of retrieved items. This was especially problematic for us, as our product allows users to specify custom filters. Therefore, if the top-scoring items retrieved did not pass these filters, our users would see no recommendations at all.
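The post-filtering problem can be shown with a small sketch. It uses a brute-force dot-product search in plain Python instead of Faiss or Vespa, and the items, sizes, and function names are hypothetical.

```python
def score(user: list[float], item: dict) -> float:
    """Dot product between the user and item embeddings."""
    return sum(u * v for u, v in zip(user, item["embedding"]))


def post_filter_search(user, items, k, predicate):
    """Take the top-k first, then drop whatever fails the filter."""
    top_k = sorted(items, key=lambda it: score(user, it), reverse=True)[:k]
    return [it["id"] for it in top_k if predicate(it)]


def pre_filter_search(user, items, k, predicate):
    """Apply the filter first, then take the top-k of what remains."""
    eligible = [it for it in items if predicate(it)]
    top_k = sorted(eligible, key=lambda it: score(user, it), reverse=True)[:k]
    return [it["id"] for it in top_k]


items = [
    {"id": 1, "size": "M", "embedding": [0.9, 0.1]},
    {"id": 2, "size": "M", "embedding": [0.8, 0.2]},
    {"id": 3, "size": "S", "embedding": [0.1, 0.9]},
]
user = [1.0, 0.0]


def wants_small(item: dict) -> bool:
    return item["size"] == "S"


post_filtered = post_filter_search(user, items, 2, wants_small)
pre_filtered = pre_filter_search(user, items, 2, wants_small)
```

With post-filtering, both top-scoring items are size M, so the user who asked for size S gets an empty list. With pre-filtering, the only size S item is returned even though it scores lowest overall.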

So we set out in search of a database system that would manage the data and indices and let us filter items on metadata such as brand and size, so that we could always retrieve recommendations for our users, no matter what filters they have set.

In search of a vector search database

In the summer of 2022, we evaluated Vespa and Elasticsearch as alternative technologies that could satisfy the constraints mentioned above. Other systems that support ANN with prefiltering were researched but eventually rejected, either because of licensing concerns (Vinted prefers truly open-source licensed software) or because the project lacked overall maturity.

Vespa

Vespa is an application platform for low-latency computations over large datasets. It is used to solve problems such as search, vector search, real-time recommendation, personalization, ad targeting, etc. The platform is open source under the Apache 2.0 license. One particular aspect that drew us to Vespa was its first-class support for machine learning based search and ranking. On top of that, the real-time data update capability is appealing. The main complicating factor for adoption was that Vinted had no experience with Vespa.

Elasticsearch

Elasticsearch is a mature and popular system for search and analytics use cases. Elasticsearch is built on top of the Lucene library. The seemingly endless list of features makes it a trusty and future-proof technology. Elasticsearch supports ANN with prefiltering from version 8.0.

Even though the license is not open-source, Elasticsearch was a strong contender because Vinted was already using it for search and had solid engineering competencies to operate it at scale.

Benchmarking

To understand how these technologies would perform for our use case, we implemented benchmarks using real data. The goal of these benchmarks was to measure peak document indexing throughput as well as query throughput and latency.

Setup

Benchmarks were performed on a single Google Cloud Platform n1-standard-64 VM instance (64 vCPUs, 236 GB). The dataset consisted of ~1M documents, each containing 12 fields and a 256-dimensional float32 embedding. Both Elasticsearch (8.2.2) and Vespa (8.17.19) were deployed as Docker containers, and we made sure to keep the ANN index (HNSW) hyperparameters consistent across both platforms for a fair comparison.

Results

In our benchmarks, we found that Vespa had a 3.8x higher document indexing throughput. Furthermore, the querying benchmarks showed that Vespa was able to handle 8x more RPS before saturating the CPU, and at this throughput had a P99 latency of 26ms. Elasticsearch, even at just 250 RPS, had a P99 latency of 110ms (4.23 times higher).

Of course, if the benchmarks were run today with up-to-date versions then the numbers would be different.

Given these results, we have decided to move forward with setting up Vespa for an AB test.

System setup

Having the numbers from the load testing, we estimated that to achieve high availability (HA), 3 servers with 56 CPU cores each were needed to handle the expected load for the AB test. Deploying Vespa was as easy as setting an environment variable:

VESPA_CONFIGSERVERS=server1.vinted.com,server2.vinted.com,server3.vinted.com

and then running a Docker container with Vespa on each server.

The application package was mostly the same as the one used for the load testing. The only change was that we’ve set up the content cluster with 3 groups. That made each server store a complete copy of the dataset and having more groups helped to scale the query throughput.

Operations

We’ve found that Vespa is generally easy to operate. One reason is that after the initial setup there is no need to touch the running servers: all the configuration is controlled by deploying the application package. On top of that, Vespa out-of-the-box exposes an extensive set of metrics in Prometheus format that makes creating detailed Grafana dashboards an easy task.

We consider the performance to be good enough: the P99 latency of first-stage retrieval handled by Vespa is around 50 ms. However, there was a small portion of problematic queries that took much longer to execute than the set query timeout of 150ms. Vespa has an excellent tool for debugging problematic queries: tracing. With the hints from the traces, we sought help in the Vespa Slack, which led to a GitHub issue. The Vespa team was quick to respond and fixed the root cause of the issue in subsequent Vespa releases. So far so good.

Approximate Search vs Exact Search

As mentioned previously, the first-stage of our recommendation system utilizes an approximate nearest neighbor search algorithm to balance the trade-off between accuracy and speed. When dealing with large datasets, finding exact nearest neighbors can be computationally expensive, as it requires a linear scan across the entire corpus. Approximate search algorithms such as HNSW aim to find neighbors that are “close enough”, which makes the search faster at the cost of accuracy. Additionally, ANN search algorithms often allow for fine tuning of the accuracy vs speed trade-off via parameters such as the “max-links-per-node”.

We were curious to quantify exactly how much accuracy was traded off by our choice of the HNSW parameters we’ve set in our Vespa deployment. Initially, we started by measuring recall - the proportion of matching documents retrieved between approximate and exact searches. We’ve found that with our choice of parameters the recall was around 60-70%. However, visually the retrieved results and scores were very similar, and we were wondering if our users could perceive this difference and if that difference would affect their engagement and satisfaction. To test this hypothesis, we performed an AB test where half of our users received recommendations retrieved using approximate search, and the other half received exact search results.
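The recall measurement described above can be sketched as follows. The function name and the example ids are hypothetical; the only assumption is that both searches return lists of document ids for the same query.

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical results for one query with targetHits = 10.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 11, 12, 13, 7]
recall = recall_at_k(approx, exact)
```

Here 7 of the 10 exact neighbors were also found by the approximate search, which corresponds to the upper end of the 60-70% range mentioned above. Note that recall ignores ordering and scores, which is one reason the results can still look very similar to a human.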

To accommodate such an experiment we needed some spare hardware resources. Luckily, we had recently set up a bigger Vespa deployment, and until other features were deployed the resources were readily available. In Vespa, switching from ANN to exact search only requires changing a query parameter: approximate:true becomes approximate:false, for example:

select * from doc where {targetHits: 100, approximate:true}nearestNeighbor(user_embeddings)
// to
select * from doc where {targetHits: 100, approximate:false}nearestNeighbor(user_embeddings)

The change in algorithm caused the latency at P99 to jump from a stable ~50ms to a more bumpy ~70ms (+40%).

Figure 3. Vespa P99 search latency after starting the exact search experiment

The CPU load on Vespa search nodes increased slightly. However, we found that user satisfaction with the exact search had not increased enough to justify the higher resource usage and query latency.

Member testimonies

The implementation of our recommender system on Vespa was a pleasant experience from an engineering point of view. While we were able to measure increased member satisfaction via a sequence of AB tests along the way, we were pleasantly surprised to hear member feedback about improvements that we were able to deliver by utilizing the new capabilities provided by Vespa:

I don’t know why I hadn’t looked at this or used this before as much as I do now.

Actually, Vinted is I think the only app that I use to just browse the main page because the stuff that comes up there is personalized to the user and based probably on my recent searches and recent buys and finds.

I’ve recently found that I do find myself overnight time scrolling through. Actually, the matches are pretty good, you know, often where I put quite a lot of stuff in my favorites by just looking at that.

A cherry on top is when we hear anecdotal feedback from random people mentioning that they only use the recommendations feature on Vinted because for them it seems that Vinted has a better understanding of their taste now.

Summary and future work

By leveraging ANN with prefiltering we’ve significantly improved the relevance of recommendations on our homepage. Also, the broader adoption of Vespa for item recommendation use cases enables numerous other product improvements and paves the way to simplify our system architecture.

Our team is excited about what we’ve achieved so far, and we can’t wait until we release new features for Vinted members that leverage the blend of dense and sparse retrieval techniques. Stay tuned!

The Downsides of Excessive Mocks (framework) and Stubs in Unit Testing

2023-10-02T00:00:00+00:00

A tiny disclaimer: when I say mock, I am referring to a Mock Test Double. The mocking frameworks used to create this Test Double are also known as dynamic mock libraries, as defined by Mark Seemann. All the code shown here uses the Kotlin programming language.

A robust and reliable test suite helps us be more confident during refactors, preventing breakage from going unnoticed. It can also speed up code reviews, where the tests act as documentation of the happy path, error handling, and edge cases. Lastly, it can highlight design problems if the code is difficult to test. Using mocks and stubs has become standard practice to isolate components and ensure reliable tests (London School). However, it’s imperative to recognize the fragility and potential pitfalls associated with excessive use of dynamic mock libraries and stubs.

Clarifying Mocks and Stubs:

Before delving into the drawbacks, let’s clarify the terms mock and stub. A mock is typically created with a mocking framework (e.g., Mockito) and helps emulate and examine interactions between the System Under Test (SUT) and its dependencies. The classic example is to verify that a mock was invoked. Stubs, on the other hand, stand in for a dependency and provide specific data to the SUT. They are functions that return a value that we then use to assert some condition.

The Fragility of Excessive Dynamic Mocking:

Mocking libraries are used in virtually every unit test in many workplaces, leading to repetitive setup code. This repetitive process of creating mocks and stubs can hinder scalability and create maintenance overhead, either because the same configuration for a given dependency has to be created in multiple places or because all these mocks have to be updated whenever something in the API changes.

Moreover, the ease of constructing the SUT’s dependencies using dynamic mocking libraries tempts developers to overlook the importance of a good design. Imagine an interface that contains many methods, of which we are interested in using only one: a violation of the Interface Segregation Principle. It is very troublesome to create a working implementation of this interface and pass it in as an argument to the SUT constructor. On the other hand, it’s a breeze to do that with a dynamic mocking library.

Without paying attention, we can end up in a situation where we spend a couple of hundred lines just configuring the dependencies of the SUT, setting up things like:

val someDependency: SomeDependency = mock()
whenever(someDependency.callSomething(any())).doReturn(SomethingElse())

This extensive setup code makes the review process painful, making it difficult for reviewers to grasp the code’s intention. The back-and-forth between test and production files can be time-consuming and inefficient. Unit tests should be as declarative as possible.

Fragile Interaction Testing:

As discussed above, a mock helps emulate and examine the interactions between a particular dependency and the SUT being exercised. Interaction testing verifies whether specific dependencies were invoked. The fact that the SUT has called a method of its dependency is an implementation detail and, in most cases, should not leak to the test suite. This leakage results in brittle tests that require frequent updates whenever implementation details change, undermining the value of automation. We can write this kind of test using the verify function from the Mockito library:

verify(someDependency).callSomething(any())

Abusing this type of test, by only checking calls without asserting behavior, can lead to a false feeling of completeness: test coverage is high but quality is low, since by just verifying the calls we can’t be sure that the expected result is happening. It is a mere assumption.
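The same pitfall can be reproduced outside Kotlin. The sketch below uses Python’s standard unittest.mock instead of Mockito, and PriceService with its repository is a hypothetical SUT invented for illustration. The interaction check passes even though the behavior is wrong.

```python
from unittest.mock import Mock


class PriceService:
    """Hypothetical SUT: should return the price with a 10% discount applied."""

    def __init__(self, repository):
        self.repository = repository

    def discounted_price(self, item_id: int) -> int:
        price = self.repository.price_of(item_id)
        return price  # bug: the discount is never applied


repository = Mock()
repository.price_of.return_value = 100
sut = PriceService(repository)
result = sut.discounted_price(42)

# Interaction-only check: passes although the behavior is wrong.
repository.price_of.assert_called_once_with(42)

# Behavior check: only this comparison exposes the bug.
behavior_is_correct = result == 90
```

A suite made only of the first kind of check would report full coverage of discounted_price while missing the defect entirely.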

If a set of tests needs to be manually tweaked by engineers for each change, calling it an “automated test suite” is a bit of a stretch! (Software Engineering at Google, p.223)

Risks of Stubbing External Functions:

Stubbing functions from external sources that are not owned by us or fully understood can lead to a mismatch between the stubbed behavior and the actual implementation. This practice poses a risk of breaking present or future preconditions, invariants, or postconditions in the external function.

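// Example of stubbing a function - MyClassTest
import kotlin.math.abs

class Calculator {
    fun sum(a: Int, b: Int): Int {
        return abs(a + b) // always returns a positive integer
    }
}

// Test file (imports assume JUnit 5 and mockito-kotlin; MyClass is not shown)
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test
import org.mockito.kotlin.doReturn
import org.mockito.kotlin.mock
import org.mockito.kotlin.whenever

class MyClassTest {
    private val calc: Calculator = mock()
    private val sut: MyClass = MyClass(calc)

    @Test
    fun test_add() {
        whenever(calc.sum(1, -3)).doReturn(2) // Stubbing the sum function

        val result = sut.getNewValue(1, -3)

        assertEquals(2, result) // expected value first, then the actual one
    }
}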

The test passes. But by stubbing the function sum, we are forced to duplicate the details of the contract, and there is no way to guarantee that it has or will have fidelity to the actual implementation. Just by reading the signature of the sum method, there is no guarantee that this function always returns positive integers. See more about depending on implicit interface behavior.

Time went by, and the owner of the Calculator#sum method decided to change the postcondition of always returning positive integers so that it can now also return negative values. The owner updates their tests and runs the entire test suite of the project (assuming that all code belongs to the same repository). The worst happens: MyClassTest#test_add still passes, giving a false feeling of safety. If a particular behavior is always expected but not explicitly promised by the contract, you should write a test for it (The Beyoncé Rule).
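The drift between a stub and the real implementation can be sketched in a few lines. This is a Python analog using unittest.mock rather than the Kotlin code above; the body of MyClass is an assumption (it simply delegates to the calculator), since the original class is not shown.

```python
from unittest.mock import Mock


class Calculator:
    def sum(self, a: int, b: int) -> int:
        return a + b  # contract changed: negative results are now possible


class MyClass:
    """Hypothetical SUT that simply delegates to the calculator."""

    def __init__(self, calc):
        self.calc = calc

    def get_new_value(self, a: int, b: int) -> int:
        return self.calc.sum(a, b)


# The stale stub still encodes the old postcondition.
stubbed_calc = Mock()
stubbed_calc.sum.return_value = 2
stubbed_result = MyClass(stubbed_calc).get_new_value(1, -3)

# The real collaborator reveals the new behavior.
real_result = MyClass(Calculator()).get_new_value(1, -3)
```

A test asserting on stubbed_result keeps passing with the value 2, while the same assertion against the real Calculator would now see -2 and fail. Using the real dependency, or a fake maintained by its owner, is one way to let such contract changes surface.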

Conclusion:

Excessive use of mocks and stubs in unit testing can introduce fragility, hinder maintainability, and lead to incomplete test coverage. Awareness of these downsides is crucial for fostering a robust and reliable testing strategy.

At Vinted, we still rely heavily on dynamic mocking libraries to write tests. However, recognizing their fragility is the first step to start thinking and treating tests as first-class citizens.

If the testing culture is an afterthought, the test suite’s quality can be, and most certainly will be, put at risk, providing a false sensation of safety where everything is virtually verified.

A very good quote from Mockito repo: If everything is mocked, are we really testing the production code?

This blog post was inspired by the following resources:

Software Engineering at Google curated by Titus Winters, Tom Manshreck, and Hyrum Wright
Effective Software Testing by Maurício Aniche
Unit Testing (Principles, Practices, and Patterns) by Vladimir Khorikov

Vinted Search Scaling Chapter 7: Rebuilding search indexing pipeline

2023-09-25T00:00:00+00:00

Building an effective and efficient data ingestion pipeline is a challenging task. Let’s cover the migration from scheduled Sidekiq background jobs to real-time indexing built using Apache Flink.

Until recently, all the searchable data was fed to Elasticsearch using delta jobs. The model was simple: whenever a searchable entity changed, Rails hooks would bump its updated_at value and the delta indexing job would pick it up. We could control the job frequency to strike a balance between system load and event lateness. We would update our items every 7 minutes, meaning that whenever members uploaded something to our catalog, they would have to wait up to 7 minutes for their changes to be visible in the app, which was not ideal.

Whenever a new important field was introduced to searchable entities, we had to schedule a separate job to backfill the data. Backfilling our catalog would take a week and hog a large portion of the Sidekiq resources. This was a major bottleneck because it discouraged experimentation and new feature development: reindexing simply took too much time and was hard to iterate on.

Luckily, our Data platform team had completed the change data capture (CDC) implementation, which we had anticipated for a long time. With streaming data, we could react to changes in near real-time without putting any extra pressure on the database. This also meant that we had to completely rearchitect the way our indexing pipeline works.

We chose Apache Flink for a very simple reason: our data platform team had already adopted it successfully and was really happy with it. To name a few things that Flink does exceptionally well:

Built for streaming applications
Rich operator suite
Easy to scale
Fault-tolerant
An easy-to-use web interface
Provides extensive metrics

Phase one

We started slowly by implementing streaming applications that are not complex and have very small or no state. This was a good exercise to onboard ourselves to application development with Flink and programming with Scala. We also could easily experiment with different Flink settings to see how much throughput we could get.

To measure things, we set up a Grafana dashboard with Flink metrics. We were mainly interested in Kafka throughput and application resource usage. Without knowing the exact numbers, it would be impossible to measure our work. On top of that, we extensively used the flame graphs provided by Flink to look into job hot spots. This phase did not take long because the jobs we were migrating were relatively simple and had low throughput.

Phase two

After becoming more comfortable with Flink, we rolled our sleeves up and started working on item search migration. We started with housekeeping: there were lots of deprecated and no longer used features, so we cleaned them up instead of blindly migrating them and potentially adding extra complexity that no one asked for.

Now this is a bit counterintuitive, but we decided not to use the Flink state for a couple of reasons:

We expected the application state to become large, meaning that whenever we wanted to redeploy the application, we would need to load the state back into the job. This could easily take half an hour, and we could not afford such downtime.
To construct an item document, we need to join a lot of streams, which would result in a lot of expensive Flink network shuffles.

Instead, we opted for Redis as our state. Using Redis hashes, we were able to model data in such a way that it was enough to make 3 Redis calls to fetch everything needed to construct an item. One for the item, one for members, and one for the enrichments such as catalog, brand, and color details. We could do all of that in a single asynchronous operator and avoid network shuffle. Besides that, such a model was more familiar to other developers being onboarded to Flink and Scala, meaning they would be able to implement features themselves with very little support.
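The three-call lookup can be sketched as follows. FakeRedis is an in-memory stand-in written for this example (only the HGETALL and HMGET commands it mimics are real Redis commands), and the key layout, field names, and document shape are all assumptions rather than the actual data model.

```python
class FakeRedis:
    """In-memory stand-in for a Redis client that counts read calls."""

    def __init__(self) -> None:
        self.hashes: dict = {}
        self.calls = 0

    def hset(self, key: str, mapping: dict) -> None:
        self.hashes.setdefault(key, {}).update(mapping)

    def hgetall(self, key: str) -> dict:
        self.calls += 1
        return dict(self.hashes.get(key, {}))

    def hmget(self, key: str, fields: list) -> list:
        self.calls += 1
        stored = self.hashes.get(key, {})
        return [stored.get(field) for field in fields]


def build_item_document(redis: FakeRedis, item_id: int):
    """Assemble an item document with three reads (hypothetical key layout)."""
    item = redis.hgetall(f"item:{item_id}")  # call 1: the item itself
    if not item:
        return None
    member = redis.hgetall(f"member:{item['member_id']}")  # call 2: the member
    brand, color = redis.hmget(  # call 3: all enrichments from one hash
        "enrichments",
        [f"brand:{item['brand_id']}", f"color:{item['color_id']}"],
    )
    return {
        "id": item_id,
        "title": item["title"],
        "seller": member.get("login"),
        "brand": brand,
        "color": color,
    }


redis = FakeRedis()
redis.hset("item:2", {"title": "Denim jacket", "member_id": "7",
                      "brand_id": "5", "color_id": "3"})
redis.hset("member:7", {"login": "anna"})
redis.hset("enrichments", {"brand:5": "Levi's", "color:3": "blue"})
document = build_item_document(redis, 2)
```

Because every lookup is a plain key or field access, the whole assembly fits in a single asynchronous operator and needs no stream joins.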

To implement that, we deployed separate Flink jobs that populate the Redis state for us and emit a change event whenever they update the state. We can then consume these events in other jobs and trigger actions whenever an event arrives. Every event is a JSON-encoded record representing what was set in or deleted from Redis. For instance, an event representing an item bump could look like this:

{
  "__deleted": false,
  "id": 1,
  "item_id": 2,
  "bumped_until": "2023-10-25T08:15:27Z"
}
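A minimal sketch of such a state-populating step, written in plain Python rather than Flink and Scala, could look like this. The function name and the dictionary standing in for Redis are hypothetical; only the event fields are taken from the example above.

```python
import json


def apply_bump_record(state: dict, record: dict) -> str:
    """Update the state from one record and return the change event to emit."""
    deleted = bool(record.get("__deleted", False))
    if deleted:
        state.pop(record["id"], None)
    else:
        # Store the record without the bookkeeping flag.
        state[record["id"]] = {
            key: value for key, value in record.items() if key != "__deleted"
        }
    return json.dumps({**record, "__deleted": deleted})


state: dict = {}  # stands in for the Redis state
set_event = apply_bump_record(
    state, {"id": 1, "item_id": 2, "bumped_until": "2023-10-25T08:15:27Z"}
)
state_after_set = dict(state)
delete_event = apply_bump_record(
    state, {"id": 1, "item_id": 2, "__deleted": True}
)
```

Downstream jobs only need the emitted JSON: the item_id tells them which item to refresh, and the __deleted flag tells them whether the underlying record still exists.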

Figure 1. Flink jobs building read model streams

Phase three

The last and most important piece of the jigsaw was data reindexing. We could not stop the existing application and redeploy it from scratch, because this would result in items not being seen in the catalog for some time. We needed something better. After lots of workshops and procrastination, we had a light bulb moment: “Why not produce stream events representing a reindexing event and make it part of our organic stream?” We connected the organic item stream and the reindexing stream, and whenever we received an event to reindex an item, we would look its details up in Redis and forward it down the chain. Items that did not exist in the Redis store would be skipped. With such a model, data reindexing becomes only a matter of publishing messages to a Kafka topic. The best thing was that with such a model we could plug in any stream related to the organic one and trigger reindexing.

Figure 2. Flink reindexing flow

The reindexing events look as follows:

{
  "offset_id": 0,
  "batch_size": 1000
}

We consume such events in one of the stream operators and expand each of them into the ids of the items to reindex, in this example ids ranging from 0 to 1000.

To update items whenever there’s an item change event, we would pick only the item_id field from the change event and plug that into the same stream as with reindexing.
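The way both kinds of events end up as a single stream of item ids can be sketched in plain Python. The function names are hypothetical, and treating the batch as the half-open range from offset_id to offset_id + batch_size is an assumption about how the boundaries are handled.

```python
def expand_reindex_event(event: dict) -> range:
    """Turn a reindexing event into the item ids it covers (half-open range)."""
    start = event["offset_id"]
    return range(start, start + event["batch_size"])


def item_ids_to_index(events, known_item_ids):
    """Merge reindexing and change events into one stream of item ids."""
    for event in events:
        if "batch_size" in event:
            ids = expand_reindex_event(event)
        else:
            ids = [event["item_id"]]  # only item_id is used from change events
        for item_id in ids:
            if item_id in known_item_ids:  # items missing from the store are skipped
                yield item_id


events = [
    {"offset_id": 0, "batch_size": 1000},
    {"__deleted": False, "id": 1, "item_id": 2,
     "bumped_until": "2023-10-25T08:15:27Z"},
]
known_item_ids = {2, 5, 1500}  # stands in for a lookup in the Redis store
to_index = list(item_ids_to_index(events, known_item_ids))
```

Everything after this point in the pipeline sees only item ids, so it does not need to know whether an id came from a full reindex or from an organic change.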

The hard parts

Having minimal experience with Redis, it was a challenge to model data efficiently. We started with regular key-value pairs and expected to fetch them using the MGET operation. The design was simple but would not scale well because it would have to query lots of Redis shards to get the results. Remodeling them as hashes with common keys allowed us to reduce the number of network calls we made, which resulted in lower CPU and network usage.
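The difference between the two layouts can be illustrated by counting how many shards a lookup touches. The sketch below is a simplification: it uses CRC32 modulo a hypothetical shard count as a stand-in for real key-to-shard routing (Redis Cluster itself uses CRC16 over 16384 hash slots), and the key names are made up.

```python
import zlib

NUM_SHARDS = 16  # hypothetical cluster size


def shard_of(key: str) -> int:
    """Stand-in for key-to-shard routing."""
    return zlib.crc32(key.encode()) % NUM_SHARDS


def shards_touched(keys: list[str]) -> int:
    """Number of distinct shards that must be queried for these keys."""
    return len({shard_of(key) for key in keys})


# Layout 1: one key per entity, fetched together with MGET.
flat_keys = [f"brand:{i}" for i in range(100)]

# Layout 2: one hash under a common key, with the entity ids as fields.
hash_keys = ["brands"]

flat_fanout = shards_touched(flat_keys)
hash_fanout = shards_touched(hash_keys)
```

The flat layout spreads the hundred keys over many shards, so one logical read fans out into many network calls, while the hash layout always resolves to a single shard.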

Storing data as JSON is convenient but expensive. For large volumes of data, we switched to MessagePack. It is not as flexible as JSON and required implementing serialization and deserialization manually, but it resulted in much more compact storage and faster storage times.

Flink loves memory and gives all the knobs to control the way to use it. Our Redis client of choice (Lettuce) uses native transports, which means we had to allocate lots of off-heap memory. Otherwise, task managers would run out of memory and constantly restart. Flink can recover from these restarts easily, but this is costly.

Due to Flink’s distributed nature, everything Flink transfers between operators must be serializable. Sockets, connections, and similar resources cannot be sent over the wire and must be reconstructed by every operator. The easiest way to do that is to carry around the serializable components required for initialization, such as connection strings. It is recommended to use rich operators and their open methods for this: the open method is called once per operator instance, before it starts processing records. For us, this resulted in lots of open Redis connections, even though the connections are thread-safe and it is recommended to share a single connection across the application.
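The pattern of shipping configuration and building the connection in open can be sketched in plain Python. This is only an analog of a rich function, not the Flink or PyFlink API: EnrichFunction, the host name, and the use of pickle and a raw socket in place of Flink serialization and a Redis connection are all assumptions made for illustration.

```python
import pickle
import socket


class EnrichFunction:
    """Plain-Python analog of a rich function with an open/close lifecycle."""

    def __init__(self, host: str, port: int) -> None:
        self.config = (host, port)  # plain data: safe to send over the wire
        self.connection = None      # created in open(), never shipped

    def open(self) -> None:
        # Called once on the worker, before any records are processed.
        self.connection = socket.socket()  # stands in for a Redis connection

    def close(self) -> None:
        if self.connection is not None:
            self.connection.close()
            self.connection = None


def is_serializable(value) -> bool:
    try:
        pickle.dumps(value)
        return True
    except (TypeError, pickle.PicklingError):
        return False


function = EnrichFunction("redis.internal", 6379)
config_ships = is_serializable(function.config)

function.open()
connection_ships = is_serializable(function.connection)
function.close()
```

The configuration tuple serializes without trouble, while the live socket does not, which is why each parallel operator instance has to create its own connection after it has been deployed.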

We did use Flink’s state for smaller applications, and changing the underlying schema can be tedious. You have to ensure that the underlying class is a POJO, and you need to use savepoints to be able to evolve the schema; regular checkpoints will break your application.

Flink upgrades are hard. The application and cluster versions must match. Whenever we upgrade the system, we have to temporarily stop all the streams, take their snapshots, and then redeploy them one by one. If the application state is large, this can take a lot of time.

Outcomes

The migration was difficult and required many design changes. We remodeled the data numerous times, but the outcomes were totally worth it.

We can now process changes in near real-time and members can see their newly published items in our catalog within a minute.

Intrusion detection for containers

2023-08-31T00:00:00+00:00

As containerisation continues to revolutionize the way software applications are developed and deployed, ensuring the security of container environments has become an utmost priority. Security in a containerised environment requires a different approach than traditional security mechanisms. It involves continuously monitoring container activities, identifying security threats, and ensuring compliance.

In this blog post, we want to share our experience with Falco, an open-source tool designed specifically for securing containerised environments.

Empowering Kubernetes runtime security

Falco, originally developed by Sysdig, is an open-source runtime security tool purpose-built for Kubernetes. Leveraging kernel instrumentation, Falco monitors system calls and events within containers and provides real-time insights into container activity.

The main factors why we chose Falco to enhance Vinted Kubernetes security are:

Container native approach
Real-time threat detection through rule-based monitoring
Robust community support
Extensibility

Falco deployment in Vinted

Within Vinted’s Kubernetes clusters, Falco is deployed as a DaemonSet, utilizing the official Falco Helm Charts. This ensures Falco operates within containers across every Kubernetes node. Additionally, to enhance operational efficiency, ensure a contemporary security posture, and mitigate dependencies on external connections, we have implemented a mechanism that retrieves Falco Helm charts to our internal Harbor Helm charts repository. Deployment management is executed through ArgoCD, adhering to GitOps best practices.

Instead of utilising the conventional approach of building, compiling, and maintaining Falco probe drivers, we opted for Falco’s modern BPF (Berkeley Packet Filter) probe. This decision has streamlined the instrumentation of our Kubernetes clusters for enhanced runtime security monitoring. It is imperative, however, to acknowledge that harnessing BPF’s capabilities demands specific requirements. To maximise the potential of this advanced probe, the custom Linux kernel must be eBPF-compatible and compiled with BTF (BPF Type Format) support.

Rules update mechanism

Our approach to distributing our custom Falco security rules is complex but highly effective, with a lot of moving parts. At its core, we have an index file placed in a private AWS S3 bucket, which acts as a guide for Falco, directing it to the specific location where it can access our security rules. These rules are securely stored within the internal Harbor OCI (Open Container Initiative) repository.

The effectiveness of this system unfolds when updates are required. Whenever rules, maintained in a private Git repository, undergo modifications, our continuous integration (CI) pipeline takes charge. First of all, it mounts rules and runs validation tests:

rules-validity-check:
    image: falcosecurity/falco-no-driver:0.35.1
    volumes:
    - ./rules:/etc/falco/rules
    command: /usr/bin/falco -o load_plugins[0]=k8saudit -o load_plugins[1]=json --validate /etc/falco/rules/vinted_rules.yaml

After successful validation, CI infrastructure encapsulates the custom rules in the OCI format and pushes them to the internal Harbor OCI repository:

rules-registry-push-oci:
    image: falcosecurity/rules-registry:v0.1.0-1-876e81b
    environment:
    - REGISTRY_USER
    - REGISTRY_TOKEN
    - GITHUB_REPO_URL
    - OCI_REPO_PREFIX
    volumes:
    - .:/etc/falco
    - ./rules:/app/rules
    command: /app/rules-registry push-to-oci /etc/falco/registry.yaml $OCI_TAG

To accommodate Falco security rules in the OCI package, we have developed a custom container image with Falco’s rules-registry CLI.

Furthermore, Falco checks for rule updates hourly, so within that interval all Falco pods autonomously retrieve the revised rules. This ensures that our security policies remain consistently up-to-date.

Integration with security incident management process

The integration of Falco with Vinted’s security incident management process involved leveraging Falco’s runtime security capabilities to monitor, detect, and alert on anomalous behaviour within Vinted Kubernetes infrastructure. By integrating Falco, we can automatically detect and respond to security threats in real-time, ensuring the safety and integrity of our systems.

It is not just about detecting threats using Falco but also about promptly notifying the Security Engineering team when something suspicious occurs. When Falco detects a breach of its rules, it generates an alert, which Falcosidekick delivers to the security incident management platform. This integration creates a dynamic and efficient feedback loop, ensuring the security team is aware of potential threats and equipped to respond swiftly and effectively.

Using Vitess in our CI/CD pipeline

2023-05-19T00:00:00+00:00

Here at Vinted, we adhere to the continuous deployment principle, meaning that each merge to the main branch of the code repository initiates an automatic deployment process. As a result, the merged code goes live in a short period of time. Merge small and merge often are two practices that are instrumental to our day-to-day engineering work.

This approach mitigates friction when working on a shared code base, shortens the feedback loop, and reduces the anxiety of breaking the production environment with a gigantic change. But for it to actually work, one must have very good test coverage, deployment automation tools with solid safeguards in place, a reliable observability stack to catch problems in the deployed products, battle-tested rollback and recovery procedures, and much more.

In this post I’ll touch on the testing phase. Specifically, I’m going to explain how we make use of the Vitess database to run backend tests of our core application.

I assume that you’re already somewhat familiar with Ruby on Rails, RSpec, Vitess, and Kubernetes. If not, then I would suggest spending some time getting acquainted with them first, as I’ll jump right into some details about them below.

Tests - then and now

Vinted Marketplace is one of the biggest and most important products created in Vinted. Under the hood of this online platform there’s a Ruby on Rails application called core. It was the very first Vinted application that was migrated from ProxySQL to Vitess. Here’s a series of posts by my colleague Vilius on our journey to Vitess that you should definitely check out: Vinted Vitess Voyage: Chapter 1 - Autumn is coming.

At the time of writing this post, core has more than 30 keyspaces (logical databases), with critical ones horizontally sharded or in the process of sharding. VTGates serve 1.3 million queries per second during peak hours, and data size is around 15 TB.

As described in the introduction, core is deployed continuously, with around 150 pull request builds and 50 main branch builds (that eventually end up with code release to production) per average working day. During each build, backend tests are executed by running 30,000 RSpec examples. It’s worth mentioning that the database is usually not mocked out, therefore tests execute actual SQL queries on a running MySQL server.

Our CI/CD pipeline is driven by the good old Jenkins with a fleet of agent nodes. For a very long time, specs used a MySQL server running directly on the agent server. This made setup super easy, and prevented additional latency and possible transient failures due to the network layer. The sheer number of examples that needed to be run meant that we started using the parallel_tests Gem a long time ago. With around 10 parallel runners, we were able to keep the duration of the backend tests stage under 15 minutes, which we considered acceptable.

Once we started migrating core from ProxySQL to Vitess, we had to decide what to do with the tests. Essentially, the question was if we trusted Vitess’ claim of MySQL compatibility to keep running tests on plain MySQL. In the end, we did. Our experience confirmed that it was the right choice. Granted, we had some nasty Vitess-related surprises, but these were usually caused by various configuration issues, and not an unexpected (untested) database query behaviour.

Then came the moment when we started preparing for the first horizontal sharding. Now this was going to be a bit more challenging. Vitess horizontal sharding has no direct analogue in plain MySQL, and it introduces a number of new limitations. For example, such an innocent looking call as User.first would fail if the underlying users table was sharded. Keep in mind that this was way before the query_constraints feature was introduced in Rails.

We decided that we needed to be able to run tests on Vitess. It was helpful that we were only going to horizontally shard a single table. The potential impact of releasing an unsupported or incorrectly working query to production was quite limited. That’s why we picked the middle road - to only run a limited subset of specs on Vitess in parallel to running all of them on plain MySQL as before. We were well aware that this was risky, error prone, and would not scale in the future. But kicking the can down the road is a legitimate tactical decision in certain cases. It allowed us to concentrate on more important tasks at the time, while reducing the risk to an acceptable level.

The final design was a simple one. We added a parallel Jenkins Pipeline step that started a Vitess cluster in a Docker container on the agent, loaded DB schema, and ran some predefined specs. RSpec tagging functionality was really handy here:

RSpec.describe <TestedThing>, vitess: true do
  ...
end

and

bundle exec rspec --tag vitess

This setup worked well for us for around six months. But then another “Vinted Autumn” came and we were once again struggling with an ever increasing load. It was more than clear that we needed to horizontally shard many more tables, and we needed to do that as soon as possible. This also meant that running only a subset of specs on Vitess was no longer a viable option, so we needed to up our game.

At that point we toyed with two main ideas: run all backend tests on Vitess or run tests on plain MySQL and analyse (explain) the produced queries somewhere on the side. We were leaning towards the first option, but had some obstacles to overcome. Simply put, we needed tests on Vitess to run comparably fast to tests on agent-local MySQL servers. At that point in time, we were quite far from this goal.

Luckily, in the end, we managed to find solutions to all major challenges. We’re currently running all core backend tests on Vitess with each pull request build. The Vitess setup used for tests reflects the production setup as closely as is needed in this scope. Most importantly, code that runs on horizontally sharded keyspaces in production runs on horizontally sharded keyspaces in tests too. This does not guarantee that an unsupported query won’t sneak into production, as there’s a chance that some code paths are not fully tested. But the chances of this happening are greatly reduced.

Below is a general overview of our test setup. In the following sections I’ll give you more details on specific elements.

Vitess adapter

Before diving into the test setup details, I have to briefly introduce our Vitess adapter - an internally developed Ruby Gem that does a lot of heavy lifting for applications working with Vitess. Some of the most important features and capabilities added by the adapter are:

Support for a mixed setup of multiple ProxySQL clusters and multiple Vitess clusters in a single application. This includes the necessary glue to make our internal online migrations Gem (that uses gh-ost under the hood) work with any such configuration.

Instrumentation for sharding annotations together with automatic primary VIndex columns injection into certain queries (functionally quite similar to query_constraints of Rails):

class ShardModel < ApplicationRecord
  self.abstract_class = true
  vitess_functional_shard :name, horizontally_sharded: true
end

class ConcreteModel < ShardModel
  vitess_primary_vindex %i(col1 col2)
end

Transaction patches to enforce transaction mode and to decorate queries in our query log:

ActiveRecord::Base.transaction(mode: :single, tag: 'description') do
  ...
end

Throttler for background jobs to protect Vitess shards from overly aggressive writes that may drive replica lag unacceptably high. It makes use of the Vitess tablet throttler interface:

throttler = Vinted::VitessAdapter::Throttler
  .new('name', [User, Item, Message])

<Foo>.in_batches(of: BATCH_SIZE) do |batch|
  throttler.wrap do
    <operations with batch>
  end
end

In the context of testing, a special mention is reserved for the patches of database tasks like db:schema:load to make them work in the case of Vitess. You will hear about this a bit later.

And now we can get back to test setup.

Vitess in a Docker

The very first thing that we needed to do was to make Vitess a thing that we could throw around. We decided to take inspiration from a Vitess local setup example and pack all the important pieces into an “all-in-one” Docker container.

Inside the container, we run etcd, vtctld, vtgate, multiple mysql, mysqlctld and vttablet instances. MySQL instances are semi-shared by VTTablets to limit resource usage.

The Vitess cluster is configured dynamically during container startup. Configuration is driven by an (almost) VSchema JSON that is passed as an argument. Take, for example, the following JSON:

{
  "vitess_sequences": {
    "tables": {
      "table_id_seq": {
        "type": "sequence"
      }
    }
  },
  "horizontal": {
    "sharded": true,
    "vindexes": {
      "hash": {
        "type": "hash"
      }
    },
    "tables": {
      "table": {
        "autoIncrement": {
          "column": "id",
          "sequence": "vitess_sequences.table_id_seq"
        },
        "columnVindexes": [
          {
            "column": "ref_id",
            "name": "hash"
          }
        ]
      }
    }
  },
  "main": {
    "tables": {}
  }
}

It describes a cluster consisting of 3 keyspaces (sharded horizontal and two unsharded ones - vitess_sequences and main) with their appropriate VSchemas from the JSON. All the underlying resources (MySQL instances, VTTablets, etc.) are created automatically.

This pseudo VSchema is auto-generated from correctly annotated ActiveRecord models and is tracked in the same application code repository. The aforementioned Vitess adapter provides a vitess:schema:dump rake task for just that. This task and its companion task vitess:schema:verify also verify that the model annotations are valid: all horizontally sharded models have a primary VIndex annotation, and the annotations do not contradict each other if multiple models point to the same underlying table.
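The two validity rules above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `ModelAnnotation` and `annotation_errors` are hypothetical names, and the real adapter inspects ActiveRecord models rather than a simple struct.

```ruby
# Hypothetical, simplified stand-in for an annotated ActiveRecord model.
ModelAnnotation = Struct.new(:name, :table, :horizontally_sharded, :primary_vindex)

# Returns a list of human-readable problems found in the annotations.
def annotation_errors(models)
  errors = []
  sharded = models.select(&:horizontally_sharded)

  # Rule 1: every horizontally sharded model needs a primary VIndex.
  sharded.each do |model|
    missing = model.primary_vindex.nil? || model.primary_vindex.empty?
    errors << "#{model.name}: missing primary VIndex" if missing
  end

  # Rule 2: models that share a table must agree on the primary VIndex.
  sharded.group_by(&:table).each do |table, group|
    vindexes = group.map(&:primary_vindex).compact.uniq
    errors << "#{table}: conflicting primary VIndex annotations" if vindexes.size > 1
  end

  errors
end

models = [
  ModelAnnotation.new("User", "users", true, [:id]),
  ModelAnnotation.new("Item", "items", true, nil),
  ModelAnnotation.new("Message", "messages", true, [:user_id]),
  ModelAnnotation.new("MessageAlias", "messages", true, [:thread_id]),
  ModelAnnotation.new("Setting", "settings", false, nil)
]
errors = annotation_errors(models)
```

Here `Item` is reported for the missing annotation and `messages` for the conflict, while the unsharded `Setting` model is ignored.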

As already mentioned in the Vitess adapter section, the standard db:schema:load task is patched so that it works in the following manner:

The database schema from db/structure.sql (or other files depending on database connection configuration) is split into chunks per keyspace, by consulting their "tables" hashes from VSchema JSON.
Keyspace with empty "tables" (in our case - main) receives all tables that do not explicitly fall into any other keyspace. In other words, it is a “catch all” keyspace.
The constructed schema load command is executed directly on the primary MySQL instance of each shard of all keyspaces. This shaves 3-5 minutes (for close to 400 tables) as compared to loading schema via VTGate which became excruciatingly slow at some point in time, taking close to one second per table.
Finally, VSchemas of all unsharded keyspaces are updated by comparing VSchema to the actual tables in MySQL database, and executing appropriate ALTER VSCHEMA … SQL statements.
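The table-to-keyspace assignment from the first two steps can be sketched as follows. It is a minimal illustration, not the adapter's actual code: the helper name `split_tables_by_keyspace` and the `users` and `items` table names are hypothetical. Only the rule itself comes from the description above: explicit "tables" entries win, and the keyspace with an empty "tables" hash catches everything else.

```ruby
require "json"

# Hypothetical helper: assigns each table name to a keyspace using the
# "tables" hashes of the (almost) VSchema JSON.
def split_tables_by_keyspace(vschema, table_names)
  # The keyspace with an empty "tables" hash acts as the catch-all.
  catch_all, = vschema.find { |_name, spec| spec.fetch("tables", {}).empty? }

  table_names.each_with_object(Hash.new { |h, k| h[k] = [] }) do |table, result|
    owner, = vschema.find { |_name, spec| spec.fetch("tables", {}).key?(table) }
    keyspace = owner || catch_all
    raise ArgumentError, "no keyspace for table #{table}" if keyspace.nil?

    result[keyspace] << table
  end
end

vschema = JSON.parse(<<~JSON)
  {
    "vitess_sequences": { "tables": { "table_id_seq": { "type": "sequence" } } },
    "horizontal": { "sharded": true, "tables": { "table": {} } },
    "main": { "tables": {} }
  }
JSON

chunks = split_tables_by_keyspace(vschema, %w[table table_id_seq users items])
```

With this input, `table` lands in `horizontal`, `table_id_seq` in `vitess_sequences`, and the two unlisted tables fall through to `main`.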

With these three components (generic Vitess Docker image, VSchema for configuration, and the patched Rails database tasks) we can spin-up and prepare a Vitess cluster for local development or testing of any Vitess-enabled Rails application.

Enter the scale

Having dockerised Vitess running locally on the Jenkins agent machine was sufficient when we were only dealing with a small subset of all specs. But when we realised that we may need to run ALL of the backend tests on Vitess, we had a brand new challenge on our hands.

As you may remember, we were executing around 30,000 RSpec examples, and were relying on the parallel_tests Gem for parallelisation, as otherwise tests would have taken about two hours. So, in a nutshell, we needed to parallelise Vitess tests to a similar degree without exploding our Jenkins agent machines.

Luckily, by that time Vinted was fully into Kubernetes - the major applications, including core itself, were already running on it. So it should come as no surprise that we decided to deploy our Vitess clusters in Kubernetes.

The Kubernetes side of things was rather standard. We used Helm charts to describe our Kubernetes resources, Harbor to store artifacts such as Docker images and Helm charts, and Argo CD to drive delivery pipelines. We define a Vitess cluster as an Argo CD application, and it can be deployed by installing a chart:

helm install \
  --values  \
  --name-template ${instanceName} \
  --set fullnameOverride=${instanceName} \
  ...

As a separate step, we wait for the deployment to finish before starting to use the cluster.

kubectl \
  --namespace vitess-docker-ci \
  rollout status deployment ${instanceName} --watch

One Vitess cluster on demand is nice, but not nearly enough for our need to execute 30,000 RSpec examples. So we spin up many more (15 at the moment) in parallel. For distributing specs over these clusters, we reuse the same parallel_tests Gem.

When Rake tasks from this Gem are used to perform parallel actions, the Gem provides the TEST_ENV_NUMBER env variable with a unique batch identifier for each process. This identifier can then be used to build a unique database name, for example:

# config/database.yml
test:
  database: test_db<%= ENV['TEST_ENV_NUMBER'] %>

In reality this is a tiny bit more complicated, as we add more components to the name (like Jenkins build ID) to make the database name unique across all build jobs running on the same agent.

In Vitess’ case we don’t change the database name, but instead change the host to point to a different Vitess cluster:

# config/database.yml
test:
  host: vitess-<%= ENV['TEST_ENV_NUMBER'] %>.example.com

Once again, it’s more complicated, but the idea is the same: we know how to construct this name from having the base host name and parallel process ID from the TEST_ENV_NUMBER env variable.
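As a rough sketch of that idea, the host name can be derived from a base name and the process identifier. The helper name `vitess_host`, the `example.com` domain and the choice to number the first process as 1 are assumptions for illustration. By default, parallel_tests leaves TEST_ENV_NUMBER empty for the first process and uses "2", "3" and so on for the rest, which is why the empty value needs handling.

```ruby
# Hypothetical helper: maps a parallel_tests process to a Vitess cluster host.
def vitess_host(base_name, test_env_number, domain: "example.com")
  # Treat a missing or empty TEST_ENV_NUMBER as process 1 (an assumption of
  # this sketch, not necessarily the naming scheme used in production CI).
  index = test_env_number.to_s.empty? ? 1 : Integer(test_env_number)
  "#{base_name}-#{index}.#{domain}"
end

hosts = ["", "2", "3"].map { |number| vitess_host("vitess", number) }
```

The same function can be called when deploying the clusters and when rendering config/database.yml, so both sides agree on the names.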

Now you may remember that we used the instanceName variable in our cluster deployment helm and kubectl commands. It’s the same host name as the one that we construct in the config/database.yml file. This way, we spin up clusters in advance by naming them precisely, and then they are automagically used for parallel spec batches:

bundle exec rake parallel:load_schema[$parallelism]
bundle exec parallel_test spec -n $parallelism -t rspec

The only missing component is the VSchema that’s needed to correctly configure a new cluster. As mentioned above, this schema file is generated from the ActiveRecord annotations and it’s tracked in the code repository. To make it available to the Vitess cluster that’s being spun up, the Jenkins agent automatically uploads it to the Cloudsmith repository using its package upload API, and provides its public URL to the cluster as one of the environment variables in the helm install command.

With all these building blocks in place, running all backend tests on Vitess is a fully automatic process. Vitess clusters’ setup is driven by developers appropriately marking ActiveRecord models. This is very convenient when preparing for the next horizontal sharding. Developers mark a functional shard root model as horizontally sharded, add primary VIndex annotations to the descendant models, generate new schema by running rake vitess:schema:dump and push committed changes to a new branch on GitHub. As a result they get all the unsupported queries or changes in behaviour in the build output and can safely work on fixing them in their branch.

Final words

In the engineering world everything is permanently in progress until final decommissioning is completed. Naturally, this holds true for our testing setup. It reflects the progression of our needs and obstacles that we faced on the way, and not the configuration that is the best or the most desirable for us. There are also some potential directions of improvement.

One of the more nagging issues is a rather slow cluster spin-up. Container initialisation and database schema load add up to a few minutes. While not a deal breaker, all of that could go away if we had a pool of pre-spawned clusters. Then a Jenkins build run would only need to check out a collection of suitable clusters and bring them up to date by executing the latest database migrations.

To conclude, while we had some challenges on the way, this task proved not to be overly complicated. We mostly made use of the technology stack that we already had, which is always a plus in our book. Right now running all backend tests on Vitess is an important component of our CI/CD pipeline that majorly contributes to ensuring the quality of the product that we’re releasing to our members.

Vinted Vitess Voyage: Chapter 1 - Autumn is coming

2023-04-27T00:00:00+00:00

“Winter is coming” - Ned Stark, Game of Thrones

Au contraire, autumn is the season to warn us of upcoming challenges - the most extreme growth and workload period of the whole year. In fact, we coined the term - ‘Vinted Autumn’. Our main MySQL databases were taking a massive beating every autumn despite sharding them vertically multiple times. While managing a dozen physical servers is an ok task, managing 42 servers manually is quite cumbersome. Additionally, the process of vertical sharding itself was increasingly hard to orchestrate and there was no way of turning back. Hacking around a horizontal sharding solution was not an option either. Naturally, we wanted to bring in tools to efficiently manage MySQL.

Vitess became the eighth CNCF project to graduate in November 2019. It did promise all the nice bells and whistles to improve painful vertical sharding and the possibility to shard horizontally. After all, Vitess was created to solve the MySQL scalability challenges that the team at YouTube faced. So, 2019 marked the start of our Vitess Voyage and this is the first in a series of chapters sharing the story.

The Beginning

Different countries had their own Vinted portal with separate deployment and a separate set of resources. We’ve already been using functional sharding in our monolithic Rails application for some of our largest portals for years. We used a relatively complex ProxySQL cluster setup of core and satellite (aka replicas) nodes with a couple of routing rules to send queries to appropriate primary or replicas of target functional shards. All satellites were running on application servers to minimise network hops. Our analytics applications were using dedicated functional shard replicas (Fig. 1).

Vertical sharding is a process where some tables from a single functional shard are moved to another functional shard.

Horizontal sharding is a process where rows of some tables are spread over multiple shards on the same functional shard.

Functional shard is a group of tables that are closely related and expected to reside in a single database (and potentially share the same connection objects), while tables from different functional shards may be located in different databases.

Figure 1: ProxySQL functional shard

The vertical sharding process was quite laborious and prone to errors (Fig. 2). The following steps summarise the overall process:

1) Preparation
- Monitor and fix queries that cross functional shard boundaries
  - Remove joins
  - Refactor transactions
- Create new MySQL db2 cluster with primary replicating from db1 cluster
- Configure application with new functional shard db2 which reuses db1 connection
2) Separate connections
- Assign VIP (IP alias) to db1 and switch functional shard db2 to use it
3) Final switch
- Perform IP alias switcheroo from db1 primary to db2 primary
4) Cleanup
- Stop replication from db1 to db2
- Delayed drop of tableB on db1
- Delayed drop of tableA on db2
- Configure new ProxySQL functional shard and switch applications to use it instead of VIP

Figure 2: ProxySQL vertical sharding

Despite the relative simplicity of the approach, it had several drawbacks:

Human error - misspelled, malformed or out of order executed commands could lead to downtime;
Cluster db1 had to be able to hold at least double the number of connections due to VIP;
Both primary and replica reads were forwarded to VIP, thus creating even more pressure on db1;
Connections to VIP were created much slower compared to ProxySQL;
After Step 3 there was no turning back without downtime.

Eventually, we reached the point where vertical sharding was only possible by turning off asynchronous jobs and reducing running application instances during the lowest load hours. To rub more salt into the wound, any small bug or even a super popular seller could cause connection storms on some MySQL primaries. There were just too many applications and ProxySQL instances. Some tables were terabytes in size and no longer migratable. Lastly, shards were already running on high-end hardware.

Luckily, our Platform and SRE engineers had been preparing for such a ‘Vinted Autumn’.

The Voyage of 2019

Proof of concept

It was the beginning of 2019 and Vitess had already moved past the v3 version. Our SRE brewed a Vitess Chef cookbook and made it publicly available. With the help of the platform team, we tested a single portal with a single functional shard. After several Cups of Joe and code fixes it ran without a hiccup. However, the proof of concept only tested a subset of portal features and the load was too small.

At the end of the summer, I joined Vinted with an enthralling challenge to help slingshot the Vitess project. ‘Vinted Autumn’ was already foretold, so I had my hands on the old school vertical sharding early on. Shortly after, I dug into Vitess.

“We’ve made a cunning plan on how to test the main portal load on Vitess. Firstly, we’re going to collect all MySQL requests to a log cluster and then use these same requests on Vitess to see if there are any bottlenecks” - Vinted SRE

Our newly formed SRE Databases team started preparing the tooling for a Vitess performance test. In essence, we were going to collect all of the MySQL queries and try them out on a test cluster. The opportunity to test Vitess with a real load would give us more confidence to move forward with the project. Also, we wanted to be able to check different negative scenarios, such as server issues, master failover, and faulty migrations.


Vinted Vitess Voyage: Chapter 2 - The Cunning Plan

2023-04-27T00:00:00+00:00

This is the second in a series of chapters sharing our Vitess Voyage story. After a busy ‘Vinted Autumn’, the real work took hold.

The Voyage of 2020

By the start of 2020, we had our biggest portal, consisting of 11 functional shards with 8TB of data, on a test cluster. Additionally, we developed a testing environment as well as query capture and replay tooling. One feature switch away, applications would send all generated and annotated queries to a Kafka topic partitioned by request_id, app_username. Then, queries would be batched by the same partitioning fields and sent to another Kafka topic for storage and replay tests.
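The batching step can be pictured with a small, Kafka-free sketch. Everything here is hypothetical apart from the two partitioning fields named above: `CapturedQuery` and `batch_queries` are illustrative names, and the real tooling consumed from and produced to Kafka topics rather than working on an in-memory array.

```ruby
# Hypothetical record for one captured query.
CapturedQuery = Struct.new(:request_id, :app_username, :sql)

# Groups captured queries into batches keyed by [request_id, app_username],
# preserving the capture order inside each batch.
def batch_queries(queries)
  queries.group_by { |query| [query.request_id, query.app_username] }
         .transform_values { |batch| batch.map(&:sql) }
end

captured = [
  CapturedQuery.new("req-1", "core", "SELECT 1"),
  CapturedQuery.new("req-2", "core", "SELECT 2"),
  CapturedQuery.new("req-1", "core", "SELECT 3")
]
batches = batch_queries(captured)
```

Keeping the queries of one request together and in order is what lets a replay resemble the original request, instead of an arbitrary interleaving of statements.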

Thus, the SRE was tested there!

The basic workflow was to restore the cluster, replay load, adjust and repeat. This is the part where things got more interesting. We were replaying the load as quickly as possible, which required quite some tuning of both Vitess and the replay tooling itself. The Vitess team’s support was invaluable whenever we hit unknowns and bugs (Fig. 1). Anyhow, the whole replay testing effort deserves its own blog post.

Figure 1: Vitess support by Sougou

During the year, outside of ‘Vinted Autumn’, we upgraded and tested versions up to v8 with some backports, and the following results held true for us:

1) Additional 1.5 - 2ms overhead on query time

From the test results we saw that the mean query time would degrade by up to 2 times with Vitess. The peak-time average of our non-Vitess queries was in the 1.5-2ms range. Longer-running queries were impacted proportionally less than faster-running ones, which meant that the performance penalty was a constant per-query overhead rather than a multiplier. We saw that our test cluster could handle the load that we had at the time during peak times (190K QPS vs 220K QPS). Vitess components (vtgate, vttablet) expose a lot of timing metrics, which gave us many hints about where parts of the overhead came from.
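A quick worked example shows why a fixed overhead hurts short queries the most. The 2 ms figures come from the ranges quoted above, while the 50 ms query is a made-up value for comparison.

```ruby
# Relative slowdown when a fixed overhead is added to a query.
def slowdown_factor(base_ms, overhead_ms)
  (base_ms + overhead_ms) / base_ms.to_f
end

fast = slowdown_factor(2.0, 2.0)   # a 2 ms query with 2 ms of overhead
slow = slowdown_factor(50.0, 2.0)  # a hypothetical 50 ms query, same overhead
```

The short query takes twice as long (factor 2.0), whereas the long one is only about 4% slower (factor 1.04), which matches the observation that the mean degraded by up to 2 times while long-running queries were barely affected.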

2) Additional resource overhead

We could not slim down any shard to the recommended 250GB size and we were far from ready to shard horizontally. It was essential for vttablet, a Vitess component, to run alongside MySQL without overloading existing hardware. The Vitess documentation and Slack workspace history provided accurate estimates of resource consumption overheads. We observed that vttablet consumed the same amount of CPU as MySQL, which suggested that further vertical sharding would be necessary in the future. Moreover, if combined vttablet and MySQL CPU usage exceeded 50% of the host’s capacity, performance would degrade. A vttablet using double the CPU of MySQL would indicate an issue with either the query load or vttablet itself. Since Vitess is written in Go, pprof turned out to be an invaluable tool in determining some of those issues. Notably, Go programs set GOMAXPROCS to match the number of CPU cores on the machine by default, which caused the Go scheduler to go mad and steal all the CPU resources on our high spec machines (64, 96 or 128 CPU cores) (Fig. 2).

Figure 2: Vitess GOMAXPROCS

So our rules of thumb:

Per vtgate instance
- 4 CPU cores
- GOMAXPROCS=4
Per combination of vttablet and mysqld instance
- All CPU cores available to mysqld process
- Half of CPU cores available to vttablet process and GOMAXPROCS set to that number
- All connection pools (transaction, stream, query) configured to 2K for vttablet process (if possible use much smaller)
- All connection pools prefilling configured to 400 for vttablet process (if possible use much smaller)
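For a concrete host size, these rules of thumb translate into numbers as in the sketch below. The helper and key names are hypothetical and do not correspond to actual Vitess flags; the values are taken directly from the list above.

```ruby
# Fixed sizing for every vtgate instance.
VTGATE_SETTINGS = { cpu_cores: 4, gomaxprocs: 4 }.freeze

# Hypothetical helper: sizing for a host running one mysqld and one vttablet.
def tablet_host_settings(cpu_cores)
  vttablet_cores = cpu_cores / 2
  {
    mysqld_cores: cpu_cores,              # all cores available to mysqld
    vttablet_cores: vttablet_cores,       # half of the cores for vttablet
    vttablet_gomaxprocs: vttablet_cores,  # GOMAXPROCS set to that number
    pool_size_max: 2_000,                 # upper bound for each connection pool
    pool_prefill_max: 400                 # upper bound for pool prefilling
  }
end

settings = tablet_host_settings(96)
```

On a 96-core machine this gives vttablet 48 cores and GOMAXPROCS=48 instead of the default of 96.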

3) Still manual primary switch

Vitess provided orchestrator integration for automatic failover support. However, after some chaos monkeying we ended up with split-brain situations and decided not to use it. This was OK for us. In practice, we had just one emergency reparent over 3 years. Every other primary switch was manual and done for maintenance. Still, Vitess itself provided tools that made manually switching the primary much easier than our masterfully crafted old school Bash script.

4) If possible, session variables should not be changed

A user might also want to change one of the many different system variables that MySQL exposes. Vitess handles system variables in different ways. Dynamically changing these values might return to bite us in quite unexpected ways: unexpected transaction timeouts, SET queries routed to unavailable replicas… It happened repeatedly in the past for us. So just make sure that the global MySQL variables are set to the same values the application would require and/or use Vitess-aware variables.

5) A big NO to advisory locks

Vitess supports the advisory locking functions and we use them heavily. However, we hold locks for quite a long time while Vitess applies the same transaction timeout for any transaction - 30 seconds by default. Additionally, the reserved connection pool (an extra special transaction pool) was filling up in the vttablet almost instantly. Also, only one vttablet was picked from the alphabetically first keyspace - certainly not scalable. Instead, we reserved a dedicated MySQL cluster for just that. This deserves its own blog post too :) .

6) Tons of query fixing

The obvious elephant in the room. There are compatibility issues where Vitess differs from MySQL. Most of the effort was spent on these:

Lots of 10k queries to refactor
Prevent usage of non-Vitess aware session variables due to special behaviour
Refactor transactions to prevent exceeding 30s timeout
Refactor queries to fit 30s timeout
Fix most cross-shard transactions to at least self heal
Remove cross-shard joins

7) Other work

In order to really migrate to Vitess, we had other prerequisites:

Dockerised Vitess for development and CI
Database adapter for Rails applications
- Schema migration mechanism support for Vitess and multiple keyspaces
- Cross-shard join and transaction monitoring
- Primary and replica query handling
- Database and table introspection
HAProxy balancer in front of vtgates

Autumn 2020

In the beginning of the second wave of the Covid-19 pandemic, we were getting our “bones” broken and performing the last old school vertical sharding.

We had to compromise on some tasks and push forward since we had many portals with hundreds of tables - each being bigger than the other.

Migration workflow designed

After surviving yet another ‘Vinted Autumn’ and more extensive testing, we muscled up our Vitess skills and laid down the grand migration plan. It was based on the documented Movetables functionality. Each functional shard was migrated this way with the help of migration plan generation scripts. The following 5-step example briefly illustrates how we migrate a single table tableB to a new vertical shard.

Unmanaged vttablet is a vttablet process which connects to an already existing MySQL server setup.

Managed vttablet is a vttablet process which connects to a Vitess managed MySQL server setup. Usually it is the Vitess mysqlctld process which manages mysqld.

1) Deploy & Copy data

Prepare all applications for a new vertical shard just like with old school vertical sharding.
Lock table migrations for functional shards.
Provision unsharded keyspace db2 in Vitess.
Start unmanaged vttablet processes for each source MySQL server as keyspace db1.
Provision additional source MySQL server with RDONLY vttablet type. This is going to speed up data copying during migration and prevent any disruptions.
Start Movetables workflow to migrate tables from keyspace db1 to db2.
Perform VDiff to ensure there are no missing/extra/unmatched rows between source and target keyspaces.

2) Canary release

Switching to Vitess might reveal hidden issues, so we make sure to minimise the blast radius. Canary release gives us an option to revert and have time to fix our applications.

Switch analytics apps fully to Vitess for functional shard. All queries would be served by db1 unmanaged RDONLY vttablet.
Switch only some instances of asynchronous apps to Vitess.
Switch only some instances of main apps to Vitess.

3) Full release

Fully switch all applications to Vitess for functional shard.
Run at least 24h to make sure nothing breaks.
Switch RDONLY traffic from db1 to db2 keyspace.
Switch REPLICA traffic from db1 to db2 keyspace.

4) Final cut-over

The major difference from our old school vertical sharding final switch - it is reversible.

Switch PRIMARY traffic from db1 to db2 keyspace.
Ensure reverse VReplication was created after the traffic switch.
Change applications to use db2 keyspace for functional shard.

5) Complete

Perform Movetables completion. We always kept the renamed source table for some time before dropping it.
Celebrate.


Vinted Vitess Voyage: Chapter 3 - The Great Migration

2023-04-27T00:00:00+00:00

This is the third in a series of chapters sharing our Vitess Voyage story. With the plan ready and wounds healed, we made the move.

The Voyage of 2021

All our CIs were still running on bare MySQL.

“Ain’t nobody got time for that” - Kimberly “Sweet Brown” Wilkins

‘Vinted Autumn’ was coming…

Q1 sandboxes

Sandboxes - the only environment other than production. This was relatively simple since there were only a couple of GB of data. Yet, it took an unacceptably long time due to almost 360 tables and ~500MB of data per portal, because v8 Movetables could only copy or perform VDiff one table at a time. Apparently, one could split the tables across multiple Movetables workflows, start them in parallel, and later merge them into one in order to perform a clean cut-over (Fig. 1). It did require a very careful composition of gargantuan commands, but it did the trick. Everything would break if an errant GTID were to creep in unnoticed.

Figure 1: Vitess Movetables split

Q2 first portals

Here came some already vertically sharded portals and first steps on rakes. This is when we had to add dedicated source MySQL replicas to prevent disruption. In addition to this, target primary could no longer keep up with the amount of row changes from multiple Movetables, so we had to limit them too. The throttling feature was not yet available in v8.

Several more interesting issues were detected during the VDiff process. Due to mismatched max_allowed_packet and grpc_max_message_size we received false positive differences. In addition to this, for large enough values WEIGHT_STRING would return NULL on both source and target, so VDiff compared NULL vs NULL instead of comparing binary value representations. Vitess support quickly reacted and released a fix.

Even later on, some more issues appeared during other migrations due to feature differences between portals. We had to backport fixes and build our own binaries since we had already settled on the v8 version for stability. It was, in a way, our initiation into the Vitess codebase.

Q3 Boss level

With a little bit more confidence built, the only thing left to migrate was the main portal.

Figure 2: Vitess Movetables too slow

Despite all previous divide and conquer optimisations, it was not enough (Fig. 2). Even after devising a number of workarounds, rake hits were sustained. The following list contains most of the tricks employed to pass the Boss level.

1) MySQL 5.7 downgrade

We forgot to pin the MySQL version for Vitess deployments. The bug fix PS-3951: Fix decreasing status counters introduced additional locking which significantly affected transaction performance. Thus, we had to downgrade to 5.7.21-20.

2) Replica optimisations

This was another piece of configuration missing compared to our old school deployment. We do have different global system variables on primaries and replicas. However, Vitess does not support such functionality. Luckily, vttablet has a hooks API. A workaround of a simple cron job triggering a hook on each vttablet every 5 minutes worked like a charm. Based on vttablet type, the hook would set appropriate global variables.

PRIMARY
- SET GLOBAL sync_binlog=1
- SET GLOBAL innodb_flush_log_at_trx_commit=1
- SET GLOBAL innodb_flush_log_at_timeout=1
REPLICA/RDONLY
- SET GLOBAL sync_binlog=0
- SET GLOBAL innodb_flush_log_at_trx_commit=2
- SET GLOBAL innodb_flush_log_at_timeout=300
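
The hook logic described above can be sketched roughly as follows. This is a minimal illustration in Python, not the hook we actually ran: the function name statements_for and the way the tablet type is passed in are assumptions, and the sketch only builds the statements, leaving execution to whatever MySQL client the hook uses.

```python
# Hypothetical sketch: choose SET GLOBAL statements by vttablet type.
# Variable values are the ones listed above; names and structure are illustrative.

SETTINGS = {
    "PRIMARY": {
        "sync_binlog": 1,
        "innodb_flush_log_at_trx_commit": 1,
        "innodb_flush_log_at_timeout": 1,
    },
    "REPLICA": {
        "sync_binlog": 0,
        "innodb_flush_log_at_trx_commit": 2,
        "innodb_flush_log_at_timeout": 300,
    },
}
# RDONLY tablets share the replica settings.
SETTINGS["RDONLY"] = SETTINGS["REPLICA"]


def statements_for(tablet_type: str) -> list[str]:
    """Return the SET GLOBAL statements for a given tablet type."""
    settings = SETTINGS.get(tablet_type.upper())
    if settings is None:
        # Unknown tablet types are left untouched rather than guessed at.
        return []
    return [f"SET GLOBAL {name}={value}" for name, value in settings.items()]


if __name__ == "__main__":
    for stmt in statements_for("replica"):
        print(stmt)
```

Since the statements are idempotent, re-running them every 5 minutes is harmless, and a tablet that changes type after a reparent picks up the right values on the next run.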

3) Secondary index drop & create

This trick is probably one of the best there is. After the initial Movetables command execution, we would stop it, drop all secondary indexes and restart the workflow. Only after the data copy phase finished would we recreate the dropped indexes. It brought us at least a ~2.5x improvement of the copy phase for the heaviest tables with 2-3 or more secondary indexes.

Example: table of 170GB and 3.2B rows and 3 indexes:

Normal Movetables
- Copy rate degraded over time due to index rebalancing (rows/s every 4h: 38K, 30K, 25K, 23K, 21K, 21K, 18K, 18K, 18K, 17K, 18K)
- Copy time 42h
- Total time 42h
Secondary indexes dropped
- Copy rate 70K rows/s is constant
- Copy time 13h
- Index add time 4h
- Total time 17h
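
As a quick sanity check, the figures in this example can be cross-checked with a few lines of arithmetic. The snippet below is only a worked calculation in Python using the numbers listed above; the variable names are illustrative.

```python
# Figures taken from the example above.
rows = 3.2e9                 # rows in the table
constant_rate = 70_000       # rows/s with secondary indexes dropped
normal_total_h = 42          # normal Movetables: copy time == total time
dropped_copy_h = 13          # copy time with indexes dropped
index_add_h = 4              # time to recreate the indexes afterwards

# Copy time implied by a constant 70K rows/s.
estimated_copy_h = rows / constant_rate / 3600

# Average rate implied by a 42h copy of the same table.
normal_avg_rate = rows / (normal_total_h * 3600)

# Speedups: copy phase only, and end to end including index creation.
copy_speedup = normal_total_h / dropped_copy_h
total_speedup = normal_total_h / (dropped_copy_h + index_add_h)

print(f"estimated copy time: {estimated_copy_h:.1f} h")
print(f"average normal copy rate: {normal_avg_rate:,.0f} rows/s")
print(f"copy phase speedup: {copy_speedup:.2f}x, total speedup: {total_speedup:.2f}x")
```

The constant rate implies roughly 12.7h of copying, in line with the reported 13h, and the 42h run averages about 21K rows/s, which matches the middle of the degrading rate series.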

4) More vertical sharding, turn off or refactor

Some vertical shards were already getting too much write load or the use case was bad for Vitess. For example, analytics imports were rewriting whole tables each day. Most jobs like this have moved to other solutions by now. At the time, turning them off temporarily was the only solution.

5) Beware of large GTID sets and source table lists

The Movetables workflow itself generated quite a lot of row updates on the target primary's internal table _vt.vreplication in order to update its position. Each row update contained a GTID set and an encoded source table list. This became especially painful when source shards had very long GTID sets in Executed_Gtid_Set due to a lot of reparents to new hardware. A set of about 30 GTIDs of the form MySQL56/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:y-yyyyyyy would weigh approximately 1000 bytes. In addition to this, if 350 tables were picked for migration, the encoded source table list could weigh about 30K bytes. Such big workflow position updates caused not only high lag on target replicas, but also disks filling up with binary logs. Of course, this grew linearly with the number of Movetables running in parallel - yet another limitation. Plain manual binary log purges and limiting the workflow count helped. The issue was addressed later by Vitess.

6) Treading dangerous ground

“He who is not courageous enough to take risks will accomplish nothing in life” - Muhammad Ali

Set these at your own calculated risk on target keyspace:

innodb_doublewrite=OFF
sync_binlog=0
sync_relay_log=0
sync_relay_log_info=0
innodb_flush_log_at_timeout=1800

Autumn 2021

In the aftermath of the migration, 23 new vertical shards were running our biggest portal. 2x growth was evident both in query traffic and even more in data size.

Ready or not, ‘Vinted Autumn’ still caused a bit of a stir, but afraid we were not. With roads already paved, a couple more vertical shards were added quickly under our wide belt. Some peculiar things still befell.

1) October 2021 Facebook outage

That caused an almost 15% increase on top of 2x yearly growth, but this time we held it! I’m sure that not every other platform got this lucky.

2) Cache bust

Everyone suffers from this once in a while (Fig. 3).

Figure 3: Cache bust rate

A sudden extreme increase of read load hit some of our primaries multiple times. The good news: vttablet has a query consolidation feature that is designed to protect the underlying database server. When a vttablet receives a query, if an identical query is already in the process of being executed, the new query will wait. As soon as the first query returns from the underlying database, the result is sent to all callers that have been waiting. A high consolidation rate means that there were a lot of simultaneous identical queries, as the cache was partially bypassed. Instead of a thundering herd beating MySQL, vttablet processes got the beating and still did not go down. Goroutine counts and query times increased tenfold, filling up connection pools too. In a way, the Go scheduler had too much work yet again. Still, we kept this feature on to protect MySQL instances and proceeded to redirect load to replicas. Similar experiences were reported too (Fig. 4).

Figure 4: Query consolidation
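
To make the consolidation idea concrete, here is a toy sketch in Python. It is not how vttablet implements it (vttablet is written in Go, and its internals are not shown in this post); the Consolidator class, its consolidated counter and the slow_db stand-in are all hypothetical names used purely for illustration. Identical queries that arrive while one is already running wait for the first result instead of hitting the database again.

```python
import threading


class _Pending:
    """Bookkeeping for one in-flight query."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class Consolidator:
    """Toy single-flight helper: identical in-flight queries share one result."""

    def __init__(self, execute):
        self._execute = execute      # function that actually hits the database
        self._lock = threading.Lock()
        self._in_flight = {}         # query text -> _Pending
        self.consolidated = 0        # callers that waited instead of executing

    def query(self, sql):
        with self._lock:
            pending = self._in_flight.get(sql)
            leader = pending is None
            if leader:
                pending = _Pending()
                self._in_flight[sql] = pending
            else:
                self.consolidated += 1
        if leader:
            try:
                pending.result = self._execute(sql)
            except Exception as exc:
                pending.error = exc
            finally:
                with self._lock:
                    del self._in_flight[sql]
                pending.done.set()   # wake up every waiting caller
        else:
            pending.done.wait()
        if pending.error is not None:
            raise pending.error
        return pending.result


if __name__ == "__main__":
    import time

    calls = []
    release = threading.Event()

    def slow_db(sql):
        calls.append(sql)            # count real database executions
        release.wait()               # simulate a slow query
        return f"rows for {sql!r}"

    c = Consolidator(slow_db)
    threads = [threading.Thread(target=c.query, args=("SELECT 1",)) for _ in range(5)]
    for t in threads:
        t.start()
    while c.consolidated < 4:        # wait until the other four callers are queued
        time.sleep(0.01)
    release.set()
    for t in threads:
        t.join()
    print(len(calls), "database call for 5 identical queries")
```

The sketch also shows where the cost moves: the database sees one query, but every waiting caller still occupies a thread (a goroutine in vttablet's case) until the result arrives.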

3) Distributed deadlocks

Random short bursts of query timeouts and connection pool fill ups did slap some transaction heavy shards. The issues resolved themselves on their own. All the symptoms looked similar to the cache bust with high consolidation rates. Little did we know that distributed deadlocks were the culprit too. Luckily, the Vitess community shares their horror stories - Square did encounter the same issue. At least that was clear six months later.

“We still think the tradeoff was worth it: the deadlocks were a small price to pay to shard months earlier and avoid much bigger outages.” - Mike Gershunovsky, Square Inc

4) Space & Lag

One table in particular exhausted all vertical scaling options. Almost 3TB in size and alone it ruled a dedicated shard. Table migrations were impossible long ago, but now even write load was above the threshold for replicas to keep up. It was prime time for horizontal sharding.

Vinted Vitess Voyage: Chapter 4 - Autumn Strikes Back

2023-04-27T00:00:00+00:00

This is the fourth in a series of chapters sharing our Vitess Voyage story. After a rough ‘Vinted Autumn’, this time we came to the conclusion that vertical sharding was no longer an option.

The Voyage of 2022

The first horizontally-sharded keyspace

The Vitess documentation and community already provides a significant amount of information about horizontal sharding. I’ll just share the most interesting part of our own sharding.

As luck would have it, the table mentioned in the previous chapter served a simple use case and not a lot of different queries were issued to it. We picked the user_id column as the sharding key after a basic manual inspection of column usage in queries. Some queries were fixed by just adding an additional predicate with user_id. Around 8% of queries were using the primary key id and 2% other columns. Due to their inconsequential rate and importance, we decided to let them scatter anyway.
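
The difference between a targeted and a scattered query can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the two-shard layout, the route and shard_for helpers, and especially the hash, which is a simple MD5 stand-in and not the function a Vitess vindex actually uses.

```python
import hashlib

# Hypothetical two-shard layout, named after the key ranges they cover.
SHARDS = ["-80", "80-"]


def shard_for(user_id: int) -> str:
    """Map a sharding key to a shard via a stand-in hash (not Vitess's own)."""
    digest = hashlib.md5(str(user_id).encode()).digest()
    # The first byte of the hash decides which half of the key range owns the row.
    return SHARDS[0] if digest[0] < 0x80 else SHARDS[1]


def route(predicates: dict) -> list[str]:
    """Return the shards a query must visit, given its equality predicates."""
    if "user_id" in predicates:
        # The sharding key is present: exactly one shard holds the rows.
        return [shard_for(predicates["user_id"])]
    # No sharding key: the query has to scatter to every shard.
    return list(SHARDS)


if __name__ == "__main__":
    print(route({"user_id": 42, "id": 7}))  # single shard
    print(route({"id": 7}))                 # scatter to all shards
```

This is why adding a user_id predicate to an existing query was often enough: the query returns the same rows, but only one shard has to do the work.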

As the saying goes, “Rome wasn’t built in a day”. We had to put in even more elbow grease:

Rerun all queries through vtexplain + EXPLAIN FORMAT=VITESS to warrant correct query routing
Train ourselves: replay, break and recover resharding process
Modify migrations to support horizontal shards and test/break/recover (our own tasks using gh-ost)
Custom auto-injection of primary VIndex column predicate into appropriate queries. Just recently, we noticed a feature got into a future Rails version.
Teach developers to work with horizontal Vitess shards.

First contribution

Anyhow, manual query parsing was certainly not going to cut it for us. vtgate instances logged queries with shard, vttablet type and used table tags, but after a second look the bulk of them had incorrect table tags. Additionally, they did not have any client connection identification whereas the MySQL general log would contain such information. Besides, the Vitess version we used had just fallen out of support. Implementing such features might take months, or even longer. Yet, ‘Vinted Autumn’ was coming and there were other more complicated candidates for horizontal sharding.

With the help of Andres Taylor from PlanetScale, we added client session UUID and some missing table extraction to the query logs. Additionally, we flagged queries if they were in transaction, which greatly improved query analysis. Then, of course the matter of backporting was left.

Vitess later improved table and keyspace extraction quite a bit in v15.

Vitess upgrade

With more time available, upgrade testing from v8 to v11 with additional backport fixes was started on our same Vitess test cluster. It contained a great deal of improvements. Most notably:

Golang upgrade from 1.13 to 1.16 resulted in 20-50% garbage collection time improvement for different components.
More performant Cache Implementation for query plans using LFU eviction algorithm.
ProtoBuf APIv2 and custom Protocol Buffers compiler. Interesting Vitess blog post about how they got there.
Movetables v2 - less manual cleanups, progress status, less overhead.
Throttle API.
Comparing v8 and v11 under the same workload, the test results were outright great:
- Query latency median (p50) was 0.3% lower for main application queries and 3.4% for job queries
- Latency p95 was 19.6% lower for main application and 19.3% for jobs.
- vtgate garbage collection time reduced by ~20% and vttablet by ~50%
- Though, a small resource usage increase (used up to ~2% more of CPU and up to ~4% more of memory)

Semi-sync + replicas by the dozen

With an unexpected geopolitical situation escalating in February 2022 and a danger lurking around Lithuania, it was time to deploy a second region with replicas. Initially, all was fine, but some shards had more than a dozen replicas. Since we use semi-sync replication, a perplexing issue crept in alongside inter-region latency. It really deserves a blog post of its own. Based on documentation and our configuration, the primary writes the data to the binlog and waits up to rpl_semi_sync_master_timeout=2147483646 milliseconds (~25 days, effectively forever) for rpl_semi_sync_master_wait_for_slave_count=1 semi-sync enabled replicas to acknowledge that they have received the data (Fig. 1). We could not control the order in which replicas were asked for confirmation.

Figure 1: Semi sync problem

Latency of multiple remote round trips increased the duration of transactions a lot and filled up vttablet transaction pools quickly causing short outages. In the end, we decided to leave up to 2 replicas with semi-sync enabled in the primary region. All other replicas had semi-sync disabled.

There are numerous posts about semi-sync shortcomings, which I recommend reading.

Autumn 2022

The new patched v11 was running everything fine and dandy. We did occasional vertical sharding and switched some read traffic to replicas. The write enthusiasm of the app's periodic jobs was curbed by the throttle API. We felt as ready as ever. But one does not simply dodge ‘Vinted Autumn’. vtgate metrics showed a total of 1M+ QPS. Most critical keyspaces had reached vertical scaling limits. The déjà vu of partial weekly downtimes at peak times kept us awake for several weeks. To say that the problems were too complex for the SRE and Platform teams to solve alone was in all respects an understatement. We had to get immediate help from product teams.

“Performance Task Force, Assemble!”

One of our Vinted cultural features is picking peculiar and contemporary names for teams. The Performance Task Force was assembled from “S-Class” engineers. After the initial one-day conference, the necessary confidence to shard horizontally was there.

Oh boy, I can’t wait to tell you the story of how it winds up! :)

“Engage.” - Jean-Luc Picard, Star Trek: The Next Generation

Engineering in the heart of Lithuania

2023-04-24T00:00:00+00:00

Vinted is taking a significant step to strengthen our presence in Lithuania - we’re officially expanding to Kaunas this year. We have a number of software engineering and engineering management roles that were previously only located in Vilnius which are now open for applicants living in or near Kaunas as well.

In this blog post I will explain our reasons for this expansion, paint a picture of Vinted today and share my personal reasons for joining Vinted more than seven years ago. Let’s find out if my reasons aged well.

Why expand to Kaunas now?

As Vinted keeps growing significantly every year, we need to keep up on the engineering side of the business to continue thriving. We’ve had our eye on Kaunas as a potential office and expansion location for a few years.

We were drawn to the city because it’s home to Kaunas University of Technology - one of the best tech schools in Lithuania. With other smaller excellent education institutions as well, we have access to a sizable and capable talent pool.

The familiar operational, legal and cultural environment of our home turf means that we can move quickly.

Kaunas is very close to our headquarters in Vilnius, so we’re treating this as an extension, rather than a completely new location. Among other things, this means that Kaunas and Vilnius compensation and benefits packages for the same roles will be the same (or comparable, where we can’t provide an identical benefit).

We’ve been hiring people from around Kaunas for a while now - around 70 current Vinted employees are from the area. While working remotely in recent years has changed the flexibility standard people expect, it’s becoming clear from our own office attendance that the office is a great place to collaborate. We recognise that people thrive in different environments. For some, no office is necessary, for others working in an office is a perk, but for others it’s a necessity.

Vinted today

For years Vinted operated one business - Europe’s biggest second-hand fashion marketplace - in many different countries. A marketplace generates a lot of parcels, most of which are sent via Pick-Up Drop-Off (lockers or paštomatai) points, or PUDO points for short. Last year we started another business - a PUDO point network called Vinted Go which aims to deliver a more seamless shipping and delivery experience across Europe. Vinted Group Functions - Finance & Legal, People, Strategy, Data Science & Analytics and Engineering - serve these businesses by providing infrastructure and various services to both.

I frequently get asked - “Is Vinted an enterprise yet, or is it still a startup?”, to which I answer that no, we’re not yet an enterprise - hopefully one day. A better term would be a ‘scaleup’. A scaleup has a mature, established and profitable product that the company is working to scale, while a startup is a company still trying to build one. “A scaleup or a startup then?” I’d say we’re a bit of both, and something in between. Let me explain.

Vinted Marketplace - the scaleup. Powered by hundreds of engineers, this is the largest, most established and most structured organisation in Vinted. Expanding into new markets and segments every year at an enviable pace, Vinted Marketplace faces the challenges of creating and moving to a new architecture, modularising a monolithic codebase, scaling a successful product that’s handling more than 145K RPS on peak time, and diving deep to solve complex, specific problems for more than 80 million Vinted members. Donatas Kulvičius leads Marketplace Engineering, with Adam Jay as the CEO.

Vinted Go - more of a startup. The brainchild of one of Vinted’s founding team members - Mantas Mikuckas - Vinted Go is furiously expanding. We’re aiming to provide affordable and climate-friendly parcel shipping via an extensive PUDO point network, including a new PUDO point network in France, to be used not just by Vinted members, but other shipping customers as well. Aistė Miškūnienė leads Vinted Go Engineering, with Vytautas Atkočaitis as the VP.

Vinted Group Engineering - the something in between. Home to our data infrastructure, site reliability engineering, IT, engineering experience, and security and privacy teams. We’re aiming to provide common infrastructure and joint services with other functions to the businesses and the group. This includes a redundant data centre setup in Europe, various solutions on top of public cloud offerings, a Kubernetes based compute platform, a Vitess based database platform, an observability platform, an in-house data warehouse that’s being migrated to an upcoming public cloud based dataverse solution, risk management frameworks, privacy solutions, software and hardware asset management, a CI & CD pipeline that deploys developers’ code to production in less than 30 minutes… and much more. Vinted Group Engineering is here to help our businesses thrive. Mindaugas Mozūras, our VP of Engineering, leads Vinted Group Engineering.

All three of these organisations - Vinted Marketplace, Vinted Go and Vinted Group Functions - together make up Vinted Group, a company aiming to make second-hand the first choice worldwide, led by Thomas Plantenga as Group CEO.

The various environments at Vinted provide opportunities for everyone. From people who are just starting their career, who are looking for mentorship or to take big risks, through to people who have some experience and want to apply it in a new context, and people with a lot of experience to share and apply, who already have big commitments in life and want more stability - there’s a place for everyone here.

An opportunity

I joined Vinted more than seven years ago. Back then, I was a young professional with a few years of software development and management experience, eager to work hard and “make it”. I didn’t have any big commitments or dependants. I was very confident in my ability, therefore I was willing and able to take on a big risk, like joining a cool startup.

I’d been working remotely for years, and I was fed up with not being able to connect on a personal level with most of my coworkers. So an office in Vilnius (even in Žirmūnai, where the cool startup was located back then) was a huge perk for me. My professional experience matched what Vinted was looking for - I knew how to develop software with Ruby on Rails. A lucky coincidence.

Today, I suppose my younger self would consider that I “made it”. I acknowledge that I was lucky and privileged. I’m grateful to have an opportunity to work on trying to make the world a better place by helping reduce the climate impact of retail fashion. I’m happy that more people now are able to join us on this fulfilling journey.

In recent years we’ve made an effort to make opportunities at Vinted more accessible. We welcome much more technological expertise than Ruby on Rails - recently we adopted Go as a general-purpose backend programming language. We also embrace people who don’t have experience but are willing to learn through our Academy and Internship programs. We don’t insist on folks coming into the office anymore - allowing employees to work in ways that best suit their needs. To those who can’t afford to take a risk - we have a more stable business - the scaling marketplace - that grows at a predictable pace and doesn’t reorganise itself every year anymore. If you want and can afford to take on more risk, join the Vinted Go startup and get ready for a wild ride.

We recognise that Vinted is seen as more of a Vilnius company in Lithuania. By opening an office in Kaunas, valuing both specific and broad technological expertise and welcoming people who prefer more stability, we work to change this perception.

Dear people of Kaunas and everyone else - join us. If you’re interested in exploring the opportunities at Vinted - see our jobs page.