Reminders

  • Feedback given on final project proposals; ask me if you have any questions
    • Notes indicate whether the project scope should be expanded, reduced, or is good as is
  • HW4 due tonight (remember your remaining late days)
  • Joy Liu guest lecture next week, attendance will be taken

Scaling a Brokerage

Early 2020

  • Robinhood (RH) gaining traction/virality
  • COVID-19 hits; huge traffic increase across the entire tech industry
  • RH signed up ~3M users in just 4 months
  • Daily traffic increased 10x within just a month or so
  • No other brokerage was handling scale comparable to RH at the time (perhaps still today)

Why is This Different?

  • Instagram feed fails to load? Annoying, but no real harm
  • A brokerage has strict correctness requirements
    • Money cannot appear or disappear due to a bug
    • Regulations require accurate accounting at all times
  • Most web services optimize for availability — a brokerage must also optimize for consistency

How Most Services Scale

  • Add read replicas — spread reads across copies of the database
  • Cache aggressively — serve most requests without hitting the DB at all
  • Eventual consistency — tolerate slightly stale data for higher throughput
  • Parallelize everything — process requests independently, merge later
  • These work great for Instagram, Twitter, Netflix, etc.
  • Why don't they work for a brokerage?
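
The caching strategy above can be sketched with the cache-aside pattern. This is a minimal illustrative example, not code from any real service; all names (`get_feed`, `cache`, `db`) are hypothetical.

```python
# Sketch of cache-aside: check a cache first, fall back to the
# database on a miss, then populate the cache. The dicts stand in
# for e.g. Redis and the primary DB.

cache = {}                                # stand-in for a cache tier
db = {"feed:42": ["post1", "post2"]}      # stand-in for the primary DB

def get_feed(user_id: int) -> list:
    key = f"feed:{user_id}"
    if key in cache:            # cache hit: the DB is never touched
        return cache[key]
    value = db.get(key, [])     # cache miss: read from the database
    cache[key] = value          # populate so the next read is a hit
    return value                # may be slightly stale -- fine for a feed

get_feed(42)   # first call reads the DB and fills the cache
get_feed(42)   # second call is served entirely from the cache
```

The key point for what follows: this pattern tolerates stale reads, which is exactly what a brokerage cannot do.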

The Serial Constraint

  • Most services can process requests in parallel — order doesn't matter
    • Two users liking the same photo? Process both at once, no problem
  • Brokerage operations on a single user must be serial
    • Check balance → lock funds → place order → confirm fill → update balance
    • If two orders race, you could spend money you don't have
  • This means: per-user operations are inherently sequential
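
The race in the bullets above can be shown with a per-user lock, a minimal sketch with illustrative names and balances (not Robinhood's actual implementation):

```python
# Two concurrent orders against one account, guarded by a per-user
# lock so the check-balance -> debit sequence can never interleave.
import threading
from collections import defaultdict

balances = {"alice": 100}
user_locks = defaultdict(threading.Lock)   # one lock per user

def place_order(user: str, cost: int) -> bool:
    with user_locks[user]:                 # serialize this user's ops
        if balances[user] < cost:          # check balance
            return False                   # insufficient funds: reject
        balances[user] -= cost             # lock funds / debit atomically
        return True

# Two $60 orders race against a $100 balance. With the lock, exactly
# one succeeds; without it, the check and the debit could interleave
# and both could succeed, spending money the user doesn't have.
results = []
threads = [threading.Thread(target=lambda: results.append(place_order("alice", 60)))
           for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
# results contains one True and one False; the balance is 40
```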

Real-Time + Serial = Hard to Scale

  • Markets move in milliseconds — stale data means wrong prices, missed fills
  • But you can't just process faster by adding more workers per user
    • Each user's operations form a queue, not a pool
  • Traditional scaling (more replicas, eventual consistency) breaks correctness
  • You need strong locks per user while still keeping latency low
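
The "queue, not a pool" point can be sketched as one worker per user draining that user's FIFO queue, while different users proceed in parallel. All names here are illustrative:

```python
# Per-user operation queues: within a user, strict arrival order;
# across users, full parallelism.
import queue, threading

def user_worker(q, log, user):
    while True:
        op = q.get()
        if op is None:           # sentinel: shut down the worker
            break
        log.append((user, op))   # ops are applied strictly in FIFO order

q_alice, q_bob = queue.Queue(), queue.Queue()
log = []
workers = [threading.Thread(target=user_worker, args=(q_alice, log, "alice")),
           threading.Thread(target=user_worker, args=(q_bob, log, "bob"))]
for w in workers: w.start()

for i in range(3):               # alice's ops must stay ordered 0, 1, 2
    q_alice.put(f"order-{i}")
q_bob.put("order-0")             # bob's queue drains independently
q_alice.put(None); q_bob.put(None)
for w in workers: w.join()

alice_ops = [op for u, op in log if u == "alice"]
# alice_ops is always ['order-0', 'order-1', 'order-2']
```

Adding more workers helps only across users; within one user the queue depth, not the worker count, bounds latency.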

The Scaling Paradox

  • More users → more load on the database
  • Can't use caching tricks (data must be real-time and consistent)
  • Can't use eventual consistency (money must always add up)
  • Can't parallelize per-user work (operations must be ordered)
  • So what can you do?

System Overview

Brokerage Service

  • Handles order state transitions & accounting
  • Strongest guarantees are needed, e.g. full locks on balances when creating an order
  • Unlike at many other large companies, correctness is paramount
    • If your Instagram feed fails to load, it's not that bad. If money disappears in a brokerage due to a bug, it's really bad.
  • Traditionally a monolith, so it still handles a ton of APIs (stocks, options, admin, balances, etc.)

Problem: Market Open

  • This is when the database of the brokerage service experiences peak load
  • We cannot send orders to our venues until the market opens
  • Overnight, tons of orders are queued up to be submitted
  • At market open, all orders are sent to venues, and updates need to be made, e.g. filled, cancelled, pending, etc.

Problem: Market Open

  • The brokerage service DB begins to experience 100% load at market open
  • What do you do when you can't throw money at the problem anymore?

Sharding

  • Idea: let's shard the brokerage service
  • Multiple ways to shard; it's important to consider your shard key. We'll use user sharding
  • Split users across multiple databases, each user lives completely within a single database
  • Now we'll have brokerage-service-1, brokerage-service-2, etc.
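
The assignment step can be sketched as below. The round-robin policy and all names are assumptions for illustration; the slides only say that a user is assigned to a shard at signup and lives entirely on that shard.

```python
# Hypothetical user-sharding sketch: assign each user to a shard at
# signup, record the mapping (durably, in practice), and route every
# later operation to the one database that owns that user.
import itertools

NUM_SHARDS = 3
_next_shard = itertools.cycle(range(NUM_SHARDS))
user_to_shard = {}                     # persisted mapping in a real system

def assign_shard(user_id: str) -> int:
    shard = next(_next_shard)          # e.g. round-robin at signup
    user_to_shard[user_id] = shard
    return shard

def route(user_id: str) -> str:
    # every operation for this user goes to exactly one service instance
    return f"brokerage-service-{user_to_shard[user_id] + 1}"

assign_shard("alice")   # -> shard 0
assign_shard("bob")     # -> shard 1
# route("alice") == "brokerage-service-1"
# route("bob")   == "brokerage-service-2"
```

A stored mapping (rather than pure hashing) is what makes later user migration between shards possible, though still scary.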

Sharding Problems

  • How do we decide where a user goes?
  • Traffic needs to be routed accordingly, internally and externally
  • Kafka streams need to be routed accordingly
  • Internal processes that read from replicas of the brokerage service need to be updated
  • Jobs that run on the brokerage service need to be OK to be replicated
  • Users need to be migrated off of current shard (scary!)

User Sharding

  • We can rely on the authentication service to help us route users properly
  • On signup, the user is assigned to a shard
  • Then, when authenticated traffic arrives, nginx will tag it with the shard number
  • Add a new nginx in front of all brokerage services to respect this shard number
  • We can take a similar approach for Kafka streams
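
The nginx layer described above might look roughly like this. The directives (`upstream`, `map`, `proxy_pass`) are real nginx, but the header name (`X-Shard`), upstream names, and addresses are illustrative assumptions:

```nginx
# Hypothetical fragment: route by the shard number tagged on
# authenticated traffic. The map belongs at the http level.
upstream brokerage_service_1 { server 10.0.1.10:8000; }
upstream brokerage_service_2 { server 10.0.2.10:8000; }

# Map the shard header (assumed to be set by the auth layer) to a backend.
map $http_x_shard $brokerage_backend {
    1       brokerage_service_1;
    2       brokerage_service_2;
    default brokerage_service_1;   # fallback for untagged traffic
}

server {
    listen 80;
    location / {
        proxy_pass http://$brokerage_backend;
    }
}
```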

Battle Test: GameStop

GameStop

  • One of the highest-traffic market opens ever
  • In a purely technical sense, our solution worked!
  • Funnily enough, account creation was the service that fell over instead