Code Red: Trading Firms Turn to AI for System Stability

As IT systems buckle under the pressure of modern-day trading volumes and regulatory requirements, some firms are turning to AI to predict and minimize outages.


IT systems are under pressure to cope with modern-day trading activity and data volumes, with glitches, downtime and outages proving to be frequent occurrences.

As US and EU regulators weigh in on preventing disruptions and downtime, a regulatory imperative has been added to the reputational concerns such problems raise.

Firms are turning to AI technologies to detect patterns and predict future system outages as systems become ever more interrelated and complex. But AI is no substitute for proper planning.

The London Stock Exchange, Deutsche Börse, Euronext, the New York Stock Exchange, Cboe Global Markets, the Miami Options Exchange, Nasdaq—they’ve all suffered technology issues in the past few years, because when it comes to operating the technology powering the world’s markets, there is one constant truth, and it’s not one that anyone likes to admit: At some point it’s going to go down.

“Failures of complex platforms will always happen,” says Wolfgang Eholzer, head of department for cash and derivatives trading IT at Deutsche Börse. 

There is a litany of reasons why system outages are becoming increasingly difficult to overcome. Some of the challenges of recent years derive from the adoption of complicated trading mechanisms and the data explosion that has consumed the industry.

In many cases, trading platforms are under pressure to process unprecedented numbers of transactions, operate faster than ever before and handle greater volumes of traffic. Eholzer says that there are two primary areas to consider when it comes to operating complex systems: the infrastructure, and its applications. In other words, all hardware should have the capacity and resilience to withstand the demands of its intended workload.

That doesn’t always happen, of course. As exchanges and trading venues scramble to innovate and create products according to client demand or add new infrastructure to support new activities, this is often being bolted on to decades-old technology. The problems are, quite literally, stacking up.

“When you are putting these new capabilities into existing systems that are relatively high-speed, it just introduces all sorts of risk and all sorts of complexity that is really difficult to deal with,” says Lev Lesokhin, vice president of strategy at software analytics firm Cast, which specializes in identifying misbehaving systems. “Traditional approaches to dealing with that complexity for software development shops have been testing, so trying to test to make sure you don’t have any glitches or problems. But testing has gotten really hard to do because with newer architectures you have parts of your systems that are always running in production.”

Glitches may be unavoidable, but the problem is particularly acute in finance. A trading platform going down, a data feed going dead, or an exchange’s datacenter short-circuiting can cause market turmoil—and the potential loss of significant sums of money—for retail and institutional investors alike.

Therefore, given the complexity of modern trading machinery, how can firms continue to improve and analyze their systems for potential problems while keeping their markets running? For some, the answer lies in emerging technologies—in particular, artificial intelligence (AI).

The All-Seeing AI 

AI is being broadly developed—and sometimes, actually used—in multiple areas across the financial markets. In many instances, smart tools have proven effective at detecting behavioral patterns and fine-tuning market surveillance operations.

More recently, AI has proven to be a valuable tool for mitigating system outages, using historical events to predict future failures or spot malicious activity. Until now, firms have built resilient hardware and commonly used simple rule models to identify patterns of abnormal behavior and create alerts. Sumit Gupta, vice president of AI, machine learning and high-performance computing (HPC) at IBM Cognitive Systems, says there has been a shift toward using more advanced technologies to bolster existing maintenance controls.

“This notion of predictive maintenance is one of the key ways that artificial intelligence can really help reduce the number of outages,” he says. “There are lots of things where you can look at history, historical failures or historical events. You could even do the same thing for cyber attacks.”
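
To make the idea concrete, the sketch below shows predictive maintenance along the lines Gupta describes: train a classifier on historical system metrics labeled with whether a failure followed, then flag hosts whose current readings resemble pre-failure conditions. It is a minimal illustration assuming hypothetical CSV files, column names and an alert threshold; no vendor’s actual pipeline looks this simple.

```python
# Minimal predictive-maintenance sketch: learn from historical failures,
# then alert on hosts whose current metrics look like pre-failure states.
# File names, columns, and the 0.8 threshold are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

history = pd.read_csv("host_metrics_history.csv")  # hypothetical dataset
features = ["cpu_temp_c", "fan_rpm", "disk_errors", "queue_depth"]
X_train, X_test, y_train, y_test = train_test_split(
    history[features], history["failed_within_24h"], test_size=0.2)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.2f}")

# Score the latest readings and alert the operations team on high risk.
current = pd.read_csv("host_metrics_now.csv")  # hypothetical dataset
risk = model.predict_proba(current[features])[:, 1]
for host, p in zip(current["host"], risk):
    if p > 0.8:
        print(f"ALERT: {host} predicted failure risk {p:.0%}")
```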

Because IT systems are usually complex and siloed, it can be challenging to monitor performance and security activity using basic technologies. Enzo Signore, chief marketing officer at FixStream, a provider of AI technology, says there are three fundamental stages to minimizing outages: correlating data across entire IT stacks, applying machine learning algorithms to detect historical patterns, and using that information to create an alert to prevent future glitches. 

“The machine correlation is learning about every single fault, alert, log or any sort of abnormality that happens across the entire stack,” he explains. “We’ll see the sequencing of those events, what starts first, what is next and what is after that. And once we can connect them one by one, we can actually see that this is a pattern and this is the level of probability that a particular event will happen, and then we can tell the operations team.”
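
As a toy illustration of the correlation step Signore outlines, the following sketch counts how often an observed sequence of faults has preceded an outage and turns that into the kind of probability an operations team could act on. The event names and incident data are invented for the example.

```python
# Toy event-sequence correlation: how often does a given prefix of
# faults end in an outage? All events and incidents here are invented.
from collections import Counter

# Each incident: the ordered faults observed across the stack, plus
# whether an outage followed.
incidents = [
    (("fan_warning", "cpu_overheat", "disk_errors"), True),
    (("fan_warning", "cpu_overheat"), True),
    (("login_spike",), False),
    (("fan_warning",), False),
]

prefix_counts, outage_counts = Counter(), Counter()
for events, outage in incidents:
    for i in range(1, len(events) + 1):
        prefix = events[:i]
        prefix_counts[prefix] += 1
        if outage:
            outage_counts[prefix] += 1

# Probability that an outage follows the currently observed sequence.
observed = ("fan_warning", "cpu_overheat")
p = outage_counts[observed] / prefix_counts[observed]
print(f"P(outage | {' -> '.join(observed)}) = {p:.0%}")  # 100% here
```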

In many ways, this is similar to how other industries have been deploying machine learning to gain a measure of predictive analysis for where and when something along the tech stack might buckle or break.

Indeed, financial services firms could take a lesson by looking beyond terra firma, and into outer space, where bodies like the European Space Agency (ESA) are tasked with monitoring the health of thousands of different systems that they can’t physically reach to repair.

One of the ways that the ESA accomplishes this is through machine-learning algorithms that monitor the health and performance of individual mechanisms within deep-space satellites. The algorithms can also use pattern analysis logic to pick up on potential anomalies far quicker than a human analyst might be able to—if they could at all.

The ESA is now partnering with a vendor, Mosaic Smart Data, in order to gauge how this technology could be applied to financial markets. While it has applications in surveillance, it may also be applicable to other areas, such as monitoring globally dispersed infrastructure and systems.

“These machine-learning models spot potential technical issues on satellites before things go seriously wrong by learning what ‘normal’ behavior is and then spotting anomalies in the data from the tens of thousands of telemetry parameters,” says Matt Hodgson, CEO and founder of Mosaic Smart Data. “The difference is that satellites have tens of thousands of inputs, and catching something before it goes wrong can save millions of dollars in damages. In the markets, there are millions of data inputs, but catching something earlier could save hundreds of millions, possibly even billions.”
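
A minimal sketch of that “learn normal, flag deviations” approach, using scikit-learn’s IsolationForest as a stand-in detector (an assumption; neither the ESA nor Mosaic Smart Data discloses its exact models): fit on telemetry captured during healthy operation, then score new readings as they arrive.

```python
# Learn "normal" telemetry, then flag readings that deviate from it.
# IsolationForest is an illustrative choice, and the telemetry is
# simulated; real systems would stream thousands of live parameters.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated healthy telemetry: 10,000 samples of four parameters
# (e.g., temperature, fan speed, error count, queue depth).
normal = rng.normal(loc=[50, 3000, 0, 10],
                    scale=[3, 100, 0.5, 2], size=(10_000, 4))

detector = IsolationForest(contamination=0.01, random_state=0)
detector.fit(normal)

latest = np.array([[51, 2980, 0.2, 11],    # plausible reading
                   [78, 1200, 9.0, 45]])   # far outside the norm
for reading, label in zip(latest, detector.predict(latest)):
    if label == -1:  # scikit-learn marks outliers with -1
        print("anomaly flagged:", reading)
```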

Using AI-powered platforms can give firms a hawk-eye view of IT operations and domains. In this case, a map of the IT environment can be formed through what FixStream’s Signore describes as the discovery of every single element, including routers, switches, devices, servers and containers. Alerts can then be allocated to incidents that have previously led to failures, such as an overheated hard drive, a struggling fan or unauthorized entry.
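
That kind of map is naturally a dependency graph. The toy sketch below uses the networkx library (an illustrative choice, not FixStream’s actual implementation) to record discovered elements and trace which services an alerting component could take down.

```python
# Toy IT-topology map: nodes are discovered elements, and an edge
# A -> B means "B depends on A". networkx is an illustrative choice.
import networkx as nx

topology = nx.DiGraph()
topology.add_edges_from([
    ("router-1", "switch-3"),
    ("switch-3", "server-12"),
    ("server-12", "matching-engine"),
    ("server-12", "market-data-feed"),
])

def blast_radius(component: str) -> set:
    """Everything downstream of a faulty component."""
    return nx.descendants(topology, component)

# An overheated drive on server-12 maps to an alert on these services:
print(blast_radius("server-12"))  # {'matching-engine', 'market-data-feed'}
```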

However, AI technology is just one layer of surveillance used to strengthen a multifaceted strategy for reducing the likelihood of downtime. In many cases, AI is just one tool in a box of many.

Regulatory Imperatives

The need to develop new methods of testing infrastructure comes at a time when regulators and the public are increasingly focused on outages at exchanges. When Nasdaq suffered a major outage in 2013, for instance, triggered by connectivity problems with the New York Stock Exchange’s (NYSE’s) Arca venue and software flaws in the Securities Information Processor that Nasdaq runs, the White House was reportedly receiving up-to-the-minute information on the problem as it unfolded. Regulators have also introduced new rules—and penalties—for firms that allow their systems to go haywire.

Under Regulation Systems Compliance and Integrity (Reg SCI) adopted by the US Securities and Exchange Commission (SEC) in 2014, trading venues and clearinghouses must reduce the occurrence of systems issues, improve resiliency, and “enhance the Commission’s oversight and enforcement of securities and market technology infrastructure.” In March, the SEC flexed its muscles by fining NYSE $14 million for what was described as “several disruptive market events,” stretching back to 2014, the first such fine under the provisions of Reg SCI.

And across the Atlantic, January 3 saw the implementation of the revised Markets in Financial Instruments Directive, through which EU regulators clamped down on trading venues’ performance and their ability to function without “failures, outages or errors,” a provision outlined in Regulatory Technical Standards 7.

David Howson, COO of Cboe Europe, emphasizes the importance of performance and reliable infrastructures because outages are costly on many different levels. “For us, an outage can affect many firms or even all firms at the same time, and so it’s not just about lost revenue for the day, it’s the reputational impact and the potential for future lost confidence and volumes,” he says.

In recent years the industry has seen a number of prominent system failures at tech heavyweights and financial services firms such as Bloomberg, Amazon, Nasdaq and, most recently, the London Stock Exchange. Many of the reported causes are technical glitches or software issues, but those are just a few entries in a long list of threats. At face value, they seem relatively simple; in reality, modern-day data and technology challenges are much more complex.

“The more complex and dynamic the environments, the more challenging they are to manage,” says FixStream’s Signore. 

Code Red

Not all outages are due to system components simply failing, of course. One well-known and frequent cause of downtime is traffic spikes, where users overload a platform at unanticipated times of the day or week. Other causes relate to software issues and bugs, or to hardware failures, including overheated central processing units, device malfunctions and damage to connecting network cables.

Further complications can arise with the adoption of hybrid storage models where applications are run across third-party cloud services and proprietary datacenters. In situations like these, it can be difficult to pinpoint where a malfunction originated—whether it occurred in the cloud, the firm’s private datacenter or its internal system. 

“That’s a very challenging environment because you’re using different tools and you don’t know the correlation between the application running in the cloud and your own prime infrastructure,” says FixStream’s Signore. “Also, you don’t know if the application in the cloud is running on top of your routers or switches or not, at any point in time.”

Limitations also exist in the use of AI. While it can be beneficial as a predictive technology, or in a monitoring capacity, much of the grunt work and heavy lifting involved in mitigating the effects of glitches still lies in tried-and-tested methods.

This can include scenarios where enough preparation and foresight are in place to ensure graceful failures, or where a piece of hardware fails but all data and applications migrate to another server to minimize disruption. In other words, says IBM’s Gupta, have a plan B—and that B stands for backups.

IT systems are the lifeblood of any modern firm and key to its survival. As such, many financial services firms depend on mission-critical systems, where IT is built to be highly resilient but, in the event of a failure, backups are readily available. This usually involves doubling or tripling up on hardware infrastructure, including multiple servers, power supplies, and devices.

“If hardware fails, most real-time-mission critical systems have a redundant backup waiting to take over for the primary in the event of an outage and that is certainly the case throughout our infrastructure,” says Cboe’s Howson. “When hardware failures do occur there may be some interruption to service but the resumption of service is typically very quick.”
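
In schematic form, that primary/standby pattern might look like the heartbeat monitor below. The hosts, health endpoint and miss threshold are placeholders, and real failover involves far more, from state replication to split-brain protection, but the core loop is this simple.

```python
# Schematic heartbeat monitor: poll the primary's health endpoint and
# promote the standby after several consecutive missed heartbeats.
# Hosts, endpoint, and threshold are placeholders for illustration.
import time
import urllib.request

PRIMARY = "http://primary:8080/health"   # placeholder host
STANDBY = "http://standby:8080/health"   # placeholder host
MISS_LIMIT = 3

def healthy(url: str, timeout: float = 1.0) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:  # connection refused, DNS failure, timeout, etc.
        return False

misses = 0
while True:
    misses = 0 if healthy(PRIMARY) else misses + 1
    if misses >= MISS_LIMIT:
        print("primary unresponsive; promoting standby")
        # In practice: repoint service discovery or a virtual IP at STANDBY.
        break
    time.sleep(1)
```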

In extreme cases, major institutions such as banks in high-risk locations—those vulnerable to natural disasters—are expected to have extremely resilient hardware. Gupta recalls being shown an image of a Japanese bank in the aftermath of the March 2011 earthquake: the bank’s datacenter and mainframe had both collapsed, but its banking operations remained up and running because the connecting wires stayed intact.

Reliability of services is critical in situations where firms carry huge responsibilities over vast amounts of data and investor finances. 

“I think if you are dealing with critical customer information or data—for example, your customers’ money, whether it’s my stock, whether it’s my cash, whether it’s my mortgages—[complete] failure is just not an option,” explains Gupta. 

Therefore, despite all of the promise of AI, failure testing is a crucial part of maintaining and safeguarding a system’s integrity and remains one of the core methodologies targeting weaknesses. 

“On the software side, there are always bugs, as code of a certain complexity cannot be error-free,” says Deutsche Börse’s Eholzer. “As no software is error-free, there have to be built-in mechanisms that deal with partial failures.”
