Trading Venues Face Resilience Test in Covid-19 Pandemic

Software testing and monitoring keep market infrastructure a step ahead amid market volatility.

British author Terry Pratchett once wrote, “Million-to-one chances crop up nine times out of 10.” In 2020, venue operators learned the truth of this maxim, as unprecedented market volatility met the unprecedented occurrence of billions of people made to work from home under lockdown orders.

The market infrastructure that remained steady through this period had robust software testing practices, says Laurence Rose, chairman and CEO of Omega ATS, an alternative trading system (ATS) that facilitates connectivity to major Canadian listed markets.

Some three months ago in the North American markets, regulatory circuit breakers were triggered not once, but four times. “Imagine you have thousands of computers all interconnected, transferring many billions of dollars in value in securities all day long. And imagine saying, ‘OK, now we have to hit the pause button and then resume operations 15 minutes later and hope that everything goes back to working properly.’ And ‘hope’ really isn’t a great word to use in this context. You have to ensure that your systems are architected and tested in a way that you know it’s going to resume seamlessly,” he says.

The first circuit-breaker level is triggered when the S&P 500 falls 7%. When that happens, market practitioners must halt trading, wait 15 minutes, and then resume, all together, and in coordination with regulators. To do that four times in the span of a few weeks, having not had to do it since the last crisis in 2008, was stressful, Rose says.

“Because it hadn’t been tested in a production environment since December 2008, any venue that says that after they resumed everything they were 100% sure that it was all going to work perfectly is not being truthful,” he says. 

Every night after close, and every weekend in March, Omega ATS staff had to keep a close eye on the venue’s systems to ensure they had the capacity to process the new levels of order flow coming in.

“This process involved ensuring that when we reviewed the order messaging and transaction activity, we determined any changes we needed to make in our system to ensure that if those levels happened the next day, we were prepared for it,” Rose says. “We were trying to stay one step ahead. So every time, for example, that we had additional order volumes in our system, we made sure that we had multiples of capacity to handle more than that level of volume for the next trading day.”

All of this took place just after Omega ATS staff had gone into lockdown, with about 80% of the workforce at home. It was a perfect storm of unusual events that could not have been imagined before the era of Covid-19. It is possible, however, to prepare for the unimaginable, by testing the load that systems can handle, and combining those scenarios with others, such as servers going down.

This is the job of Exactpro Systems, a London-based company that specializes in functional and non-functional testing of systems that process wholesale financial products. Omega ATS is one of its customers. Exactpro has also done resilience testing and quality assurance for large clients like the London Stock Exchange and interdealer broker Tradition.

Paranoid Testing

Omega ATS and Exactpro have partnered on specific product implementations and are currently performing quality assurance on a matching engine that Omega plans to launch within the next month. The engine will offer midpoint trading, an order type the ATS has never provided before.

Exactpro CEO and co-founder Iosif Itkin says that from his vantage point, the systems that held up during this latest crisis were tested to a “paranoid degree,” at levels more than double historical maximums. “In regulated markets, exchanges appeared to be adequately prepared, at least from what we observed, despite prior discussions [with clients] that such levels of testing are unrealistic and not required. That is what people think until the crisis actually happens,” he says.

Rose says in March, Omega took to testing at three times the volumes it expected to see.

“What you need to do is test for multiples of those types of volumes. So, for example, if during the height of the volume and volatility in March you were getting 10 million orders a day, you need to test for 30 million orders a day, or more. There was definitely a gap there based on historical levels, so we were prepared to handle the volumes that we were seeing, but we needed to up our testing and our capacity around testing by multiples because the baselines have changed now. The baselines for what we thought was this busiest day we could imagine actually tripled,” Rose says.
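The arithmetic Rose describes can be written down in a few lines. The sketch below is illustrative only: the function names and the default multiple of three are assumptions drawn from his example, not Omega ATS's actual tooling.

```python
import math


def load_test_target(observed_peak_orders: int, multiple: float = 3.0) -> int:
    """Return the daily order volume a load test should reach.

    `multiple` is the safety factor applied to the busiest day observed;
    3.0 mirrors the "10 million -> 30 million" example in the text.
    """
    if observed_peak_orders < 0 or multiple < 1:
        raise ValueError("peak must be non-negative and multiple at least 1")
    return math.ceil(observed_peak_orders * multiple)


def has_headroom(tested_capacity: int, observed_peak_orders: int,
                 multiple: float = 3.0) -> bool:
    """True if the capacity proven in testing covers the target volume."""
    return tested_capacity >= load_test_target(observed_peak_orders, multiple)


if __name__ == "__main__":
    # Hypothetical figures taken from Rose's example.
    print(load_test_target(10_000_000))            # 30000000
    print(has_headroom(25_000_000, 10_000_000))    # False
```

When a new record day arrives, the observed peak is updated and the target moves with it, which is the "new baseline" Rose refers to.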

Exactpro builds software to test the software of its clients. This includes injecting simulations of heavy market volume into exchanges’ platforms to see if they can withstand the volatility. “Ideally, we put the simulators into the co-location environment so that the network topography is very close to the real market interaction because of the volume that comes into the exchanges from the co-location,” Itkin says.

Exactpro conducts testing at paranoid levels to find the exact point at which a system can no longer tolerate more volume. In other words, Exactpro tests to see when, rather than if, a system will fail. For the majority of the market, this point seems to be higher than the load experienced in March and April, Itkin says.
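One way to picture the search for a breaking point is as a bisection over load levels. This is a minimal sketch, not Exactpro's method: the `survives` callback is a hypothetical stand-in for a full load-test run, and the code assumes that a system which fails at one load level also fails at every higher level.

```python
from typing import Callable, Optional


def find_breaking_point(survives: Callable[[int], bool],
                        start: int, limit: int) -> Optional[int]:
    """Return the smallest load in [start, limit] at which the system fails.

    Returns None if the system still survives at `limit`.
    `survives(load)` is a hypothetical probe: True if the system under
    test handled `load` messages per second without failing.
    """
    if survives(limit):
        return None          # no failure found inside the tested range
    if not survives(start):
        return start         # already failing at the lowest load tried
    lo, hi = start, limit    # invariant: survives(lo), not survives(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if survives(mid):
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Toy system that copes with up to 42,000 messages per second.
    toy_system = lambda load: load <= 42_000
    print(find_breaking_point(toy_system, 1_000, 1_000_000))  # 42001
```

In a real test each probe is expensive, so a coarse step ramp is often more practical than a full bisection; the point of the sketch is only that the output is a number, not a pass or fail.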

The company’s testing software needs to be able to reproduce a real market event as accurately as possible. The injection of hundreds of thousands of transactions should not trigger the exchange’s internal protective mechanisms, such as its own circuit-breakers or market surveillance systems—unless that is intended—because that will alter how the exchange’s platform behaves and responds. In a real-life market event, volume is not evenly distributed over a long time; rather, a flash flood of orders will suddenly appear within seconds, or milliseconds. And the simulations must be deployed with as small a hardware footprint as possible, because in real life, there are thousands of servers connected to stock exchanges that split the load between them.  

“When we do load testing, resilience testing, it’s not possible to secure thousands of servers for the test, so [you] take limited hardware and then use it to simulate a huge wave of orders,” Itkin says.
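The difference between evenly spread volume and a "flash flood" can be shown with a simple send schedule. The sketch below is an assumption-laden illustration: the function name, the parameters, and the 80% burst share are all hypothetical, and a real simulator would send protocol messages rather than produce a list of timestamps.

```python
import random
from typing import List


def burst_schedule(total_orders: int, window_s: float, burst_fraction: float,
                   burst_start_s: float, burst_len_s: float,
                   seed: int = 0) -> List[float]:
    """Return sorted send offsets (seconds from test start) for each order.

    `burst_fraction` of the orders are packed into the short interval that
    begins at `burst_start_s` and lasts `burst_len_s`; the remainder are
    spread uniformly over the whole test window.
    """
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    n_burst = round(total_orders * burst_fraction)
    offsets = [burst_start_s + rng.random() * burst_len_s
               for _ in range(n_burst)]
    offsets += [rng.random() * window_s
                for _ in range(total_orders - n_burst)]
    return sorted(offsets)


if __name__ == "__main__":
    # Hypothetical test: 10,000 orders over one minute, 80% of them
    # arriving within half a second at the 30-second mark.
    schedule = burst_schedule(10_000, 60.0, 0.8, 30.0, 0.5)
    in_burst = sum(30.0 <= t <= 30.5 for t in schedule)
    print(len(schedule), in_burst >= 8_000)  # 10000 True
```

Because the schedule is just data, a single machine can replay it on behalf of many simulated participants, which is the limited-hardware constraint Itkin describes.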

It’s also important to simulate another, simultaneous event, as few outages historically have occurred due to a single factor. Let’s say, for example, that an exchange is experiencing unexpectedly heavy volumes while at the same time, software updates in a production environment turn out to be incompatible with the servers that traders were using to access the system.

“We don’t just test for different load levels; we need in parallel to kill various servers inside this system, and see that there is still no single point of failure,” Itkin says. “Whatever we kill inside the system, there is a workaround, and the system will be able to switch to this workaround, and the server will die.”
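The "kill anything and check for a workaround" idea can be modeled on paper before it is tried on real servers. The following sketch is a static toy model, not a chaos-testing tool: the cluster layout, role names, and node names are all hypothetical, and a real test would terminate live processes while load is running.

```python
from typing import Dict, List


def single_points_of_failure(cluster: Dict[str, List[str]]) -> List[str]:
    """Return the nodes whose loss leaves some role with no live node.

    `cluster` maps a role (for example "matching") to the nodes able to
    serve it. Each node is "killed" in turn, one at a time.
    """
    nodes = sorted({n for members in cluster.values() for n in members})
    spofs = []
    for victim in nodes:
        # Remove the victim everywhere it appears, then look for a role
        # that has been left with nobody to serve it.
        survivors = {role: [n for n in members if n != victim]
                     for role, members in cluster.items()}
        if any(not members for members in survivors.values()):
            spofs.append(victim)
    return spofs


if __name__ == "__main__":
    # Hypothetical venue: two gateways, two matching engines, but only
    # one market-data publisher.
    venue = {
        "gateway": ["gw1", "gw2"],
        "matching": ["me1", "me2"],
        "market_data": ["md1"],
    }
    print(single_points_of_failure(venue))  # ['md1']
```

An empty result is the outcome Itkin is after: whatever is killed, another node can take over the role.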

Itkin likens this approach to Netflix’s interpretation of the concept of chaos engineering, which engineers at the streaming service have promoted as an approach to resilience. In 2012, Netflix released the source code for a tool called Chaos Monkey that tests the resilience of its infrastructure by randomly terminating virtual machine instances that run in production environments, testing how computers and humans respond. It’s not possible to embrace this wholeheartedly in the highly regulated and systemically important market infrastructure world, but it’s nonetheless useful, Itkin says.

Exactpro’s testing software is written mainly in Java and Kotlin. The exception is the load-testing software, which is written in C++ because it is more efficient at simulating the thousands of servers to which hypothetical brokers and traders are connected, without a heavy hardware footprint. The hardware Exactpro uses depends on the client.

“Our clients use a variety of tools to achieve scale and resilience: InfiniBand, FPGA for low latency, and of course they use various servers and firewalls. Everything is duplicated [in testing] so there is no single point of failure,” Itkin says.

He says that when testing for resilience in market infrastructure, software problems are more urgent than hardware problems. “Hardware will inevitably fail within large server farms. So it is necessary to keep reserve servers and network devices to accommodate for this event. But if something is wrong with the software, there could be a knock-on effect. If something kills software on a single server, there is a high degree of probability that the very same problem will kill any other server.”

While Itkin remembers sleeping on the floors of exchanges in Exactpro’s early days, around 2010, the company now does this kind of testing remotely. The only time a human needs to be in the office is to simulate an event such as a cleaning staff member mistakenly pulling a power cord out of a wall, which happens occasionally.

Itkin says companies should invest not just in testing, but also in ongoing monitoring systems with automated alerts. The recent volatility seems to have leveled out, but it was no black swan, he says—there will always be crises.
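A monitoring rule of the kind Itkin recommends might compare live message rates against the capacity proven in testing. The sketch below is one possible shape for such a rule: the class name, the 50% warning threshold, and the averaging window are all assumptions, not a description of any vendor's product.

```python
from collections import deque
from typing import Optional


class RateMonitor:
    """Raise an alert when the average message rate nears tested capacity."""

    def __init__(self, tested_capacity_per_s: float,
                 warn_fraction: float = 0.5, window: int = 5) -> None:
        self.tested_capacity_per_s = tested_capacity_per_s
        self.warn_fraction = warn_fraction       # share of capacity that triggers
        self.samples = deque(maxlen=window)      # most recent rate samples

    def observe(self, msgs_per_s: float) -> Optional[str]:
        """Record one sample; return an alert string or None."""
        self.samples.append(msgs_per_s)
        average = sum(self.samples) / len(self.samples)
        threshold = self.warn_fraction * self.tested_capacity_per_s
        if average >= threshold:
            return (f"ALERT: average rate {average:.0f}/s is at or above "
                    f"{threshold:.0f}/s")
        return None


if __name__ == "__main__":
    # Hypothetical venue tested to 1,000 messages per second.
    monitor = RateMonitor(tested_capacity_per_s=1_000, window=3)
    for rate in (100, 200, 1_500):
        print(monitor.observe(rate))
    # Prints None, None, then an alert (average 600/s vs threshold 500/s).
```

Alerting well below the tested limit gives staff time to add capacity before the next trading day, rather than discovering the limit in production.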

Rose says the recent crisis has sharpened Omega’s focus on operational resilience going forward.

“If volume spikes to new highs, we are confident that we can handle it and deliver that resiliency for our clients. There is a new baseline for that now, but I think the crisis has made us focus on these areas and develop some new daily, weekly, and monthly procedures to ensure that we are always checking for these things,” he says.
