I Built a High-Frequency Trading Matching Engine. Here's What I Learned.
I want to be upfront: I am not a quant. I don't trade. I built this because I wanted to understand what "low latency" actually means when the stakes are real - when microseconds translate directly to money, and when the engineers who build these systems think about problems that most software developers never encounter.
The result was a REG-NMS compliant C++17 matching engine with ~230 microsecond order processing latency. This post is about what I built, why certain decisions matter at this scale, and the three things that surprised me most.
What a Matching Engine Actually Does
Before the architecture, the basics.
A matching engine sits at the heart of every exchange. When you place a buy order for 100 shares of AAPL at $180, that order enters the matching engine. The engine checks if there's a sell order at $180 or below. If yes, the trade executes. If no, your order joins the order book - a sorted list of all outstanding buy and sell orders - and waits.
Simple concept. The complexity is everything else:
- Orders must be matched in price-time priority (best price wins; equal prices matched in order of arrival)
- The system must handle tens of thousands of orders per second
- Every trade must be persisted durably (regulatory requirement)
- The order book state must be broadcast to market participants in real time
- The system must survive crashes without losing any orders
Each of these requirements is individually manageable. Getting all of them right simultaneously while hitting microsecond latency targets is where it gets interesting.
The Architecture Decision That Everything Else Flows From
The most important architectural decision I made was this: separate the hot path from everything else.
The "hot path" is the sequence of operations that must complete before I can tell a client their order was accepted or matched. Everything on the hot path adds latency. Everything off the hot path doesn't.
In a naive implementation, the hot path looks like:
```
Receive order → Match against book → Write to database → Broadcast to subscribers → Acknowledge
```

The database write is the killer. Even a fast SSD write takes 50-200 microseconds. A network write to a subscriber takes similar time. If these are on the hot path, you'll never hit sub-millisecond latency.
My hot path is:
```
Receive order → Match against book → Write to WAL buffer → Acknowledge
```

That's it. The matching logic runs against an in-memory order book. The WAL (Write-Ahead Log) write is to a memory-mapped file - not a network call, not a disk seek, just a memcpy into an mmap'd buffer that the kernel flushes asynchronously.
Everything else - flushing the WAL to disk, broadcasting Level 2 market data to WebSocket subscribers, updating aggregate statistics - happens on separate threads that don't block the matching loop.
The Order Book Data Structure
The order book is a sorted collection of price levels, where each price level contains a queue of orders at that price.
The obvious implementation is `std::map<double, std::queue<Order>>`. This works. It won't win any performance benchmarks.
The problem with `std::map` is cache behavior. A red-black tree stores nodes scattered across the heap. Traversing it means following pointers that are probably in different cache lines. At the volume of a real exchange, this pointer chasing becomes significant.
My implementation uses a custom price-indexed structure with a flat array for "active" price levels (prices with outstanding orders) and a hash map for O(1) lookup by price. The flat array enables cache-friendly iteration when sweeping through prices during matching.
For the order queue at each price level, I used a doubly-linked list rather than `std::queue`. This allows O(1) order cancellation by keeping a map from order ID to list iterator - critical because in real markets, a significant fraction of orders are cancelled before they execute.
```cpp
struct PriceLevel {
    double price;
    std::list<Order> orders;
    uint64_t total_quantity;
};

struct OrderBook {
    // Bids sorted descending (highest first)
    std::map<double, PriceLevel, std::greater<double>> bids;
    // Asks sorted ascending (lowest first)
    std::map<double, PriceLevel> asks;
    // O(1) order lookup for cancellations
    std::unordered_map<uint64_t, std::list<Order>::iterator> order_index;
};
```

Write-Ahead Logging and Crash Recovery
This was the part I was most nervous about getting right.
The WAL is a sequential log of every operation that affects the order book state. Before any state change is acknowledged, it's written to the WAL. This means if the process crashes, you can reconstruct the complete order book by replaying the WAL from the beginning.
My WAL uses memory-mapped I/O:
```cpp
// Map 256MB of WAL file into virtual address space
void* wal_buffer = mmap(nullptr, WAL_SIZE,
                        PROT_READ | PROT_WRITE,
                        MAP_SHARED, wal_fd, 0);
```

Writing a WAL entry is a `memcpy` into this buffer. The OS handles the actual disk flush via `msync` on a background thread. This is safe because the WAL is append-only - even if the OS crashes before flushing, we lose at most the last few milliseconds of orders, which is acceptable in the scenarios I was designing for.
Crash recovery replays WAL entries in sequence, rebuilding the order book state:
```cpp
void recover_from_wal() {
    WALEntry* entry = (WALEntry*)wal_buffer;
    while (entry->sequence_number != 0) {
        apply_entry(entry);
        entry++;
    }
}
```

I tested this by writing a test harness that randomly kills the process mid-operation and verifies that recovery produces a consistent order book. Getting this right took three days and two complete rewrites of the entry format.
The recovery of 100k+ log entries takes about 180ms on my test machine - fast enough to restart without significant downtime.
REG-NMS Compliance
REG-NMS (Regulation National Market System) is US securities law that governs how exchanges must handle orders. The relevant rule for a matching engine is the Order Protection Rule: you cannot execute a trade at a price inferior to a better price available on another exchange.
For a standalone matching engine, the practical implication is: when executing against the book, you must sweep through all available prices in strict priority order, and you must implement certain order types correctly.
I implemented:
Market orders: Execute immediately at best available price, sweeping through the book until filled or exhausted.
Limit orders: Execute only at specified price or better; remainder joins the book.
Stop orders: Dormant until a trigger price is reached, then convert to market order.
Trailing stops: The most complex - the trigger price adjusts dynamically as the market moves in your favor, and holds when the market moves against you. Implementing this required a sorted index of trailing stop orders, updated on every price tick.
The trailing stop implementation took longer than everything else combined. The edge cases are numerous: What if the market gaps past the trigger? What if the order is partially filled when triggered? What if the trailing distance is larger than the current bid-ask spread?
The Three Things That Surprised Me
1. Lock-free data structures are harder than they sound.
I initially tried to use lock-free queues for the message passing between the matching loop and the WAL writer thread. After two weeks of intermittent bugs that disappeared under the debugger, I switched to a mutex-protected queue with a condition variable. The performance difference was negligible at my scale, and the correctness was immediate.
Lock-free programming requires a level of mental discipline around memory ordering that I'm not confident I had. For a hobby project, the "simple and correct" approach was the right call.
2. The matching logic is not the bottleneck.
I spent weeks optimizing the order book data structures. Profiling revealed that matching itself was about 20% of total hot-path time. The other 80% was serialization (converting orders to/from wire format) and the mmap write. The lesson: profile before you optimize.
3. Latency and throughput are in tension.
To minimize latency, you want to process each order immediately as it arrives. To maximize throughput, you want to batch operations (amortize serialization overhead, take fewer locks). A real production system has to make explicit decisions about where on this tradeoff curve to sit. I optimized for latency since that was the interesting constraint.
The Number
Median order processing latency: ~230 microseconds.
For context, a human blink takes 150,000 microseconds. In the time it takes you to blink, this system could process approximately 650 orders.
Is 230µs competitive with real HFT systems? No. Production matching engines from major exchanges run in the single-digit microseconds range, using kernel bypass networking, custom hardware, and co-location. But for software running on a commodity server without any of that infrastructure, I'm reasonably happy with it.
More than the number, I'm happy with what I understand now that I didn't before: what it actually means to design for latency, why every data structure choice matters, and why the engineers who build these systems are so obsessive about the details.
The code is on GitHub at https://github.com/ngoyal88/REG-NMS-Matching-Engine
Thanks for reading. Feel free to reach out!