Boldly refactoring complex code

At Vinted, new code is deployed to production machines all the time with zero downtime for members. Deployment changesets are small. So we can move fast and fearlessly.

To keep the speed of development from decreasing we have an extensive test suite for easy refactoring.

But in cases where it’s difficult to reproduce a complete dataset from a production environment there are other approaches to testing changes safely.

Here’s what we do:

  • we run both old and new code at the same time and discard the result of new code,
  • we log old and new code output diff,
  • by analyzing the logs we tell if new code performs as expected.

Even if it doesn’t, it won’t disrupt anything in production.

This approach has some caveats. Both code paths must be idempotent. For example, imagine code being tested increments a counter in a database. If we run both old and new code, the counter will be incremented twice. Such cases must be considered in advance.

Query code on the other hand is a good candidate for such testing.

We used to do this by rolling an ad hoc solution for each case. It was cumbersome and had limited logging capabilities. There ought to have been a better way, and there was. :)

Scientist by GitHub provides a simple framework for conducting such testing. It’s a gem for carefully refactoring critical paths.

We use an adapter affectionately called VintedExperiment. It logs discrepancies to our centralized logging facility (Graylog) and counts matches and mismatches in statsd. For data visualization and dashboards we use Grafana with Graphite.

The combination of all of these tools allow us to easily test critical paths.

Let’s say we want to improve a method called find_items. We extract the old implementation into old_find_items temporarily and put the new one into new_find_items. And then we set up the test like this:

def find_items(options)
  user_id = options[:user_id]'find_items') do |e|
    # optional: will appear in logs, eases debugging
    e.context(arguments: options)

    e.use { old_find_items(options) }
    e.try { new_find_items(options) }

    # optional do |control, candidate| ==

    # optional
    e.run_if do
      user = User.find_by_id(user_id)
      Vinted.catalog_refactor_enabled? ||
        (user && Ab::Tests.new_from_user(user).crf.on?)

def new_find_items(options)
  # ...

def old_find_items(options)
  # ...

VintedExperiment is our wrapper over the Experiment class of scientist:

class VintedExperiment
  include Scientist::Experiment

  self.raise_on_mismatches = Rails.env.test?

    experiment = new(name)
    yield experiment

  attr_reader :name

  def initialize(name)
    @name = name

  def publish(result)
    Stats.timing "science.#{name}.control", result.control.duration
    Stats.timing "science.#{name}.candidate",

    if result.matched?
      Stats.increment "science.#{name}.matched"
    elsif result.ignored?
      Stats.increment "science.#{name}.ignored"
      Stats.increment "science.#{name}.mismatched"
      control_value = result.control.value
      candidate_value = result.candidates.first.value
        "VintedExperiment #{name} mismatched, " +
        "control: #{control_value.inspect}, " +
        "candidate: #{candidate_value.inspect}, context: #{context}"

  # overrides everything, if `false` only code in `use` block
  # will be run we prefer to control this through
  # the `run_if` block instead
  def enabled?

Scientist will always run the code block provided to use, and it will run the try block only if the run_if block returns true or it is not specified.

If the try block is run, the results from use (control) and try (candidate) will be compared. If a simple == comparison is not enough, then the compare block can be specified.

Results can be inspected in a Grafana dashboard. In the provided example the test was disabled after running it for a while. Multiple lines correspond to countries where our app is deployed.

Grafana example

If there are discrepancies, log messages can be viewed in Graylog. They include details of what exactly did not match.

Graylog example

While recording matches or mismatches as well as the runtime duration the log enables all sorts of live benchmarking possibilities. You can inspect mean, median or any other statistic of runtime durations.

Once the result is satisfying - no discrepancies and increased runtime duration - the test can be removed in favor of the new code.

VintedExperiment can be extended to support several candidate logging, scientist supports that, but we haven’t had the use for it.

Vinted is a peer-to-peer marketplace. Searching for items is a core part of our product. The code inside find_items is one of the most critical code paths in our app. Moreover, it’s one of our oldest code. We wrote it nearly 6 years ago. To make it more complicated, we run many A/B tests inside this code. All of these factors add up to a lot of complexity in one place.

Scientist proved to be a great tool for doing a lot of refactoring on this bit of our codebase. Best part being that no user suffered the consequences of refactoring gone bad.