In Part 1 we covered the three fault lines that take down trading platforms at scale: the reconnection storm, the bandwidth trap, and the last-mile reality. The firms that avoid 2AM incidents have solved these at the infrastructure layer, not the application layer. Read Part 1 →
In Part 2 we turn to the solution to the last-mile data streaming problem. We will walk through the architecture that Tier-1 trading firms run in production - the stack, the decisions, the trade-offs, and the specific Lightstreamer capabilities that make each layer work at scale.
We'll also touch on the question that every engineering leader asks at some point: could we just build this ourselves, or does it make more sense to buy?
The Stack, Layer by Layer
Most Tier-1 trading platforms share a common architectural pattern that has emerged over years of production experience. It's not the pattern they started with; it's the pattern they arrived at after enough 2AM incidents to understand what goes wrong and how best to handle it.
The architecture has four layers, each with a specific task. The key insight is that Kafka and Lightstreamer are not competitors. They solve different problems and should be evaluated as complements, not alternatives.
Kafka handles the upstream problem: high-throughput, fault-tolerant message ingestion from market feeds, risk engines, and order management systems. It does this exceptionally well. What Kafka was not designed to do is deliver individual ticks to 80,000 concurrent browser and mobile clients with varying network conditions, adaptive bandwidth throttling, and transparent firewall traversal.
That's the last-mile problem. And that's precisely what Lightstreamer is built for.
"Kafka gets the data to the building. Lightstreamer gets it to the desk - on every floor, through every wall, at whatever speed each connection can sustain."
The Kafka–Lightstreamer Integration
The Lightstreamer Kafka Connector is the bridge between these two layers. It subscribes to Kafka topics and streams updates to clients in real time - handling all of the last-mile complexity that Kafka deliberately leaves out of scope.
How it works in production
The connector subscribes to Kafka topics on behalf of all connected clients. When a new message arrives on a topic - say, a price update for AAPL - the connector determines which clients are subscribed to that instrument and delivers the update to each of them, applying conflation and delta delivery along the way.
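To make the flow concrete, here is a minimal client-side sketch using the Lightstreamer web client SDK in TypeScript. It assumes the connector maps a Kafka prices topic to items named `stock-<TICKER>`; the server address, adapter set name, item name, and field names are illustrative assumptions, not details from any deployment discussed here.

```typescript
import { LightstreamerClient, Subscription, ItemUpdate } from "lightstreamer-client-web";

// Hypothetical Lightstreamer server address and adapter set name.
const client = new LightstreamerClient("https://push.example-broker.com", "KafkaConnector");
client.connect();

// MERGE mode: the server keeps one current snapshot per item and conflates
// intermediate updates when the client can't keep up.
const sub = new Subscription("MERGE", ["stock-AAPL"], ["last_price", "bid", "ask"]);
sub.setRequestedSnapshot("yes");  // deliver the current value immediately on subscribe
sub.setRequestedMaxFrequency(10); // cap this client at ~10 updates/second
sub.addListener({
  onItemUpdate: (update: ItemUpdate) => {
    console.log("AAPL last:", update.getValue("last_price"));
  },
});
client.subscribe(sub);
```

Note that the client never pulls from Kafka directly: it subscribes to items, and the connector handles topic consumption and per-item fan-out on the server side.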
This distinction matters at scale. At adesso's production numbers - 6 million instruments, 500,000–600,000 updates per second - each update needs to reach only the clients subscribed to that specific instrument. Without intelligent fan-out and conflation, we would be pushing the full update volume to every client, and the network would be overwhelmed.
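As rough arithmetic using the figures above: broadcasting ~550,000 updates per second to 80,000 concurrent clients would mean on the order of 44 billion client messages per second. With per-instrument routing, each update travels only to the clients actually watching that instrument - orders of magnitude less traffic for the same perceived freshness.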
Conflation in practice
Consider a volatile session. AAPL is updating 50 times per second. A client on a mobile connection can realistically consume 10 updates per second before their experience degrades. Without conflation, 80% of the updates we send arrive too late to be useful - while still draining the client's battery and data plan.
With Lightstreamer's conflation, the server tracks the latest value for each subscribed item and delivers at the rate the client can sustain. The client always sees the most current price. They never see a stale value. They never receive redundant intermediate ticks. The network sees a fraction of the raw update volume.
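Here is a toy latest-value-wins buffer in TypeScript - a conceptual sketch of what conflation does, not Lightstreamer's actual implementation. The 100 ms flush interval and the `Tick` shape are illustrative assumptions.

```typescript
type Tick = { item: string; price: number };

// Toy conflation: keep only the newest value per item, flush at a fixed rate.
class ConflatingSender {
  private latest = new Map<string, Tick>(); // at most one pending value per item

  constructor(
    private send: (tick: Tick) => void,
    flushIntervalMs = 100 // ~10 deliveries/second for a constrained mobile client
  ) {
    setInterval(() => this.flush(), flushIntervalMs);
  }

  // Called for every raw upstream tick; overwrites any unsent older value.
  onTick(tick: Tick): void {
    this.latest.set(tick.item, tick);
  }

  // Deliver only the most recent value per item, then clear the buffer.
  private flush(): void {
    for (const tick of this.latest.values()) this.send(tick);
    this.latest.clear();
  }
}

// 50 AAPL ticks/second in, at most 10 out - always the freshest price.
const sender = new ConflatingSender((t) => console.log(t.item, t.price));
sender.onTick({ item: "AAPL", price: 192.31 });
```

The real server adds per-client rate adaptation, delta delivery, and backpressure handling on top of this basic idea.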
The Case Studies, Revisited
In Part 1 we mentioned the production numbers. Here we want to put them in architectural context, because the numbers only make sense once we understand the decisions that produced them.
Could We Just Build This Ourselves?
It's a choice we have to make, and it deserves an honest estimate: not just what it costs to build, but what it costs to maintain, including the unknowns that only surface over time. Before committing, price in a realistic view of the work 18 months down the line and beyond. Many of these lessons have already been learned and fixed inside Lightstreamer - do we really want to re-learn them ourselves, and at what cost?
| Capability | DIY WebSocket | Lightstreamer |
|---|---|---|
| Reconnection handling | Custom - we own every bug | Built in - intelligent backoff, queuing |
| Firewall / proxy traversal | Manual fallback code required | Automatic protocol negotiation |
| Data conflation | Build it ourselves or skip it | Native - only latest value delivered |
| Delta delivery | Full records on every tick | Changed fields only |
| Mobile SDK | Generic WebSocket - we adapt | Native iOS, Android SDKs |
| Kafka integration | Custom consumer + fan-out logic | Kafka Connector - drop-in |
| Bandwidth throttling | All clients get same rate | Per-client adaptive delivery |
| On-call incidents | 2AM is ours to own | Lightstreamer handles the edge cases |
The table above isn't a knock on DIY; it's an honest accounting. Every item in the middle column is solvable, and smart teams have solved all of them. The question is not whether we can build it - it's whether our best engineers should spend the next 18 months building it and then maintaining it forever, and what that does to the product development velocity that actually generates revenue.
eToro's 60% infrastructure cost reduction wasn't just about server bills. It was about the engineering hours that stopped going into infrastructure maintenance and started going into product. That's the real ROI of the decision.
"The DIY might turn out to be more expensive than it appears in the begining, and compounds every quarter."
What "Production-Grade" Actually Requires
There's a gap between "WebSocket infrastructure that works" and "WebSocket infrastructure that works at 2AM during a market dislocation." The gap is filled with edge cases - and those edge cases follow distinct patterns that have been thoroughly analyzed and handled inside Lightstreamer.
Network heterogeneity. Not all of our users are on fiber. A meaningful percentage are on mobile networks, corporate proxies, hotel WiFi, and VPNs. Each of these environments breaks standard WebSocket connections in its own way. Handling them gracefully requires protocol fallback logic that works silently and correctly every time - not 99% of the time.
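As a small illustration, the Lightstreamer web client reports the negotiated transport through its status events, so the fallback can be observed without writing any of it ourselves. The endpoint and adapter set below are hypothetical:

```typescript
import { LightstreamerClient } from "lightstreamer-client-web";

const client = new LightstreamerClient("https://push.example-broker.com", "DEMO");

client.addListener({
  onStatusChange: (status: string) => {
    // "CONNECTED:WS-STREAMING" on open networks; behind strict proxies the
    // client silently falls back to "CONNECTED:HTTP-STREAMING" or
    // "CONNECTED:HTTP-POLLING" - no application code involved.
    console.log("connection status:", status);
  },
});
client.connect();
```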
Correlated failures. The worst streaming failures are correlated and triggered by market events that cause every user to be active simultaneously. The reconnection storm from Part 1 is a correlated failure. A platform that handles 10,000 concurrent normal connections may fall over under 10,000 simultaneous reconnections, because the load shape is entirely different. Production-grade infrastructure accounts for the worst-case load shape, not the average.
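This is also where the "we own every bug" row in the table bites hardest. A DIY layer needs, at minimum, jittered exponential backoff so that a market-event disconnect does not turn into a synchronized reconnection storm. A minimal sketch, with illustrative endpoint and tuning constants:

```typescript
// Full-jitter exponential backoff: each client waits a random fraction of an
// exponentially growing window, decorrelating the retry wave across clients.
function reconnectDelayMs(attempt: number): number {
  const base = 1_000;  // 1s initial window
  const cap = 30_000;  // never wait more than 30s
  return Math.random() * Math.min(cap, base * 2 ** attempt);
}

function connectWithBackoff(url: string, attempt = 0): void {
  const ws = new WebSocket(url); // browser WebSocket
  ws.onopen = () => { attempt = 0; /* resubscribe, reconcile missed state... */ };
  ws.onclose = () => {
    setTimeout(() => connectWithBackoff(url, attempt + 1), reconnectDelayMs(attempt));
  };
}

connectWithBackoff("wss://quotes.example.com/stream"); // hypothetical endpoint
```

Backoff is only the first edge case; resubscription, state reconciliation, and duplicate suppression on reconnect all follow from it.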
Operational visibility. When something goes wrong at 2AM, we need to know what went wrong, why, and what to do about it - in minutes, not hours. DIY streaming infrastructure typically has minimal built-in observability. Lightstreamer ships with monitoring dashboards and diagnostic tooling built into the deployment. That's not a feature we want to build from scratch.
We've seen how the stack is assembled. We've seen the capability comparison. We have a sense of what the DIY path actually costs over time. But every engineering team's situation is different - different team size, different update frequency, different concurrency targets, different compliance constraints.
In Part 3, we will build a scoring framework you can run against your own architecture and capabilities. We will explore three scenarios - startup, growth-stage, and enterprise - with objective, measurable criteria to help decide which path makes sense at each stage. The answer may not always be Lightstreamer, but the framework will guide you toward an informed decision.
DP5 offers a free architecture consultation for trading platform engineering teams.
Schedule Call for Free Consultation →