From burst to bust: What happens when cloud runs dry?

After years of initial resistance, the capital markets have come to depend heavily on the compute capacity of the public cloud. But increasing market volumes are rapidly outpacing the cloud capacity that organizations thought would be sufficient for years to come, causing firms to be more critical about how they assess cloud services and providers.


At its technology conference in June, industry regulator Finra revealed a startling statistic—the amount of information being collected to support the Securities and Exchange Commission (SEC)-mandated Consolidated Audit Trail (Cat) of all US equities quote and trade data was straining the limits of the capacity it had provisioned in Amazon Web Services’ public cloud to run the Cat.

“The peak volumes of the Cat have exposed some scalability issues in the public cloud. Volumes keep going up. When the Cat was originally contemplated, the original plan stated that peak volumes would be 80 billion market events in a single day,” said Finra chief information officer Steve Randich at the conference.

But—partially exacerbated by the market volatility during the Covid-19 pandemic—peak volumes grew much faster than expected. By the time Finra confirmed AWS as its partner for the Cat in 2019, the system was already ingesting more than 100 billion events, and now regularly handles more than 350 billion events per day. “We’ve just seen half a trillion events, and it keeps going up. It will be 1 trillion before we know it,” Randich added.
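To put those figures in proportion, the short sketch below expresses each volume Randich cited as a multiple of the 80 billion-event peak in the original plan. The numbers come from his remarks; the names in the code are illustrative, and the 1 trillion entry is his projection rather than an observed volume.

```python
# Daily Cat event volumes cited by Randich, in billions of events.
PLANNED_PEAK = 80  # peak assumed in the original Cat plan

cited_volumes = {
    "2019, when AWS was confirmed": 100,  # "more than 100 billion"
    "regular daily volume now": 350,      # "more than 350 billion"
    "recent peak": 500,                   # "half a trillion"
    "projected": 1000,                    # "1 trillion before we know it"
}


def multiple_of_plan(volume_billions, planned=PLANNED_PEAK):
    """Return how many times larger a volume is than the planned peak."""
    return volume_billions / planned


for label, volume in cited_volumes.items():
    print(f"{label}: {multiple_of_plan(volume):.2f}x the planned peak")
```

On these figures, the regular daily volume is already more than four times the planned peak, and the recent peak is more than six times it.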


Not only that, but because Finra needs to perform surveillance over timeframes ranging from days to calendar quarters, it needs to be able to access tens of trillions of market events. Being able to use the cloud to dial capacity up and down when needed for those scheduled tasks has been “huge,” Randich said, estimating this has improved Finra’s infrastructure efficiency by 40%.

That’s the good news. Also good news is that the statistics about market volumes validate Finra’s decision to use the public cloud. “Obviously, if we are pushing the scale of [the resources provisioned for the Cat by] AWS, then there is no way we could have done this in a private cloud,” he said.


“The bad news is, we’re handling volumes that are even challenging AWS. But they are working with us to handle these challenges,” Randich added.

Addressing this issue has meant holding monthly meetings with Amazon executives to improve scale, as well as continuous internal reviews of whether the Cat’s technical elements are operating in the most efficient manner.

“We may need to re-think in some regards how the Cat is being operated in terms of the policies and requirements of it,” Randich said.

“For example, we run the Linker three times per day, but with the cost and scalability numbers associated with it, that may not be the best way to go about it. Volumes are going to continue to go up. So we need to find ways to make sure that elastic scale continues to be efficient, and continues to be something that we don’t need to worry about.” (The Linker is a component of the Cat that links related orders to create “ecosystems” of data around a trade.)

Finra was an early adopter of cloud for its compute and storage requirements. In 2014, it began moving its market surveillance functions and Cat forerunner the Order Audit Trail System (Oats) to AWS. Once that was completed in 2016 and deemed a success, Finra decided to move more workflows to the public cloud, which it completed in December 2019.

“Currently, the only applications left in our private datacenters are ‘cats and dogs’ applications that will die on the vine. About 99.9% of our data and 99% of our technology is all in the public cloud. In 2021, Finra is a public cloud company,” said Randich, who was a CIO at Citi before joining Finra in 2013.

Finra declined to comment further for this story beyond what Randich said at the conference. AWS declined to comment specifically on Finra, or on how it helps clients in general address and mitigate capacity issues.

But Michael Borts, CTO of governance and regulatory compliance technology provider ACA Group, who joined the vendor in October after a year as principal of AWS’ advisory practice, says “Amazon is very focused on helping customers save money. They go out of their way to help every customer run as efficiently as possible. They would come in and sit down with a customer and have a strategic conversation about what their governance is like, what data they have, and what data they need, what can be stored, and what is rarely accessed.”

Then, Borts adds, a company can identify which applications are subject to spikes, and can decide how to configure applications to achieve the right balance of cloud storage versus cloud compute capacity. For example, instead of persisting a certain data type, a firm could simply stream it in memory, allowing it to weight applications more heavily toward compute capacity instead of storage.

Bursting restrictions


“It’s all about ‘burstability’ … and what happens when you want to expand availability,” says Peter Nabicht, president of STAC (the Securities Technology Analysis Center), which works with financial firms and technology providers to develop benchmarks for diverse technology stacks, including cloud.

“For example, if a spike in market volumes requires you to scale up from 12 to 20 servers, and the spike lasts for 20 seconds, but it takes 45 seconds to scale up, then there’s no point. But if it takes just five seconds, then people would do it all the time.”
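Nabicht's arithmetic can be sketched in a few lines. This is a simplified model, not a STAC benchmark: it assumes the extra servers all come online at once after the scale-up delay and that the spike runs at a constant level for its whole duration. The function names are illustrative.

```python
def useful_burst_seconds(spike_seconds, scale_up_seconds):
    """Seconds of the spike during which the extra servers are actually online.

    Assumes the new capacity arrives all at once, scale_up_seconds after
    the spike begins. If scaling takes longer than the spike, the result is 0.
    """
    return max(0.0, spike_seconds - scale_up_seconds)


def burst_is_worthwhile(spike_seconds, scale_up_seconds):
    """True if the added capacity arrives before the spike has ended."""
    return useful_burst_seconds(spike_seconds, scale_up_seconds) > 0


# Nabicht's two cases: a 20-second spike, scaling from 12 to 20 servers.
for scale_up in (45, 5):
    covered = useful_burst_seconds(20, scale_up)
    print(f"scale-up of {scale_up}s covers {covered:.0f}s of a 20s spike")
```

With a 45-second scale-up, the extra eight servers arrive after the spike is over. With a five-second scale-up, they are available for 15 of the 20 seconds.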

While Finra may be in a unique position as a regulator and the body responsible for collecting data on behalf of the Cat, the challenge it faces is by no means unique: Exchanges and data vendors all capture and aggregate vast quantities of data, while trading and investment firms consume and combine huge amounts of exchange-based quote and trade data with other datasets. And during periods of peak volumes, much of the industry will be experiencing the same sudden demand at exactly the same time.


While some of these market events may be unpredictable, a lot of demand for resources is scheduled and predictable, and can be planned for in advance, which means that when unexpected peak volumes do occur, firms are better placed to handle those spikes, says Christin Brown, global financial services strategy and solutions lead at Google Cloud.

This also means firms can use the cloud’s elasticity to dial back capacity between those periods. “You don’t want to be permanently over-subscribed. You want to grab that capacity back between spikes,” says ACA’s Borts.

“Financial institutions are very good at planning ahead. By using cloud, they can manage the bursts gracefully and minimize impacts on those workloads. They know their seasonality, and we work together to achieve the best results balancing their infrastructure elasticity, capacity and costs,” says Brown, who joined Google five years ago, and had previously spent 17 years at IBM.

In addition, she says that firms respond differently in these circumstances, noting that algorithmic trading companies actually slow down trading.

“Algorithms tend to not do well in a ‘black swan’ event, and when algorithms get thrown for a loop, they slow down. While the market is going haywire, another group slows down. So there is a little bit of a see-saw effect that balances out the demand on the network,” she says.

In some cases, firms not subject to those same market events still experience challenges relating to cloud capacity during high-pressure periods of demand.

For instance, San Francisco-based wealth manager Wetherby Asset Management is nearing the end of a project to migrate many of its corporate IT functions to Microsoft’s Azure cloud. The firm originally made the move because it felt some applications could run much better in the cloud than on its in-house infrastructure, and because it wanted to pay for services based on usage rather than on a continuous basis. As it turned out, Wetherby’s timing was fortuitous: With employees scattered across the US during the pandemic, a situation that would have strained its previous setup, availability of services was “fantastic,” and the firm didn’t need to rely on its own VPN.


However, Trevor Hicks, CTO at Wetherby, warns that “bursting restrictions are a real risk of leveraging cloud services” that must be assessed, mitigated, or accepted just like other types of risk. This risk can present itself both directly and indirectly. To illustrate, while planning its Azure migration, Wetherby expected that its processing loads would not generally be subject to spikes, and that any spikes that did occur would be “insignificant” compared to what Azure could handle, given that it uses Azure to support “fairly predictable” internal services.

“However, what we did not appropriately plan for was the throttling of services during our migration process. We are moving a significant amount of data to the Microsoft cloud, and based on what time of day and how much data we are moving, Microsoft throttles how much speed/resources are available to us. This is something we were aware of when we started the migration project, but we didn’t fully understand the actual impact it would have on our project timeline until we started to experience the throttling for ourselves,” Hicks says.

As a result, the firm had to adjust its migration strategy and extend the six-month project by around 45 days to compensate for delays in data transfer, completing the migration around the end of September.

However, bursting restrictions caused by external and completely unrelated issues can also impact firms indirectly.

For example, Hicks describes an incident that occurred unexpectedly in the summer of 2020. “One of our third-party service providers (which leverages AWS) bumped into their bursting limits as a result of a misconfiguration or misuse of their platform by another (unrelated) customer. Although we had nothing to do with this other customer, the services that the third party provides to us were severely degraded and brought our business to a crawl,” he says. “Fortunately, we had considered this potential in a previous risk assessment of the third-party provider and had appropriate workarounds available to our employees so they could continue their work.”

Safety buffer

Different markets and industries can also indirectly impact firms who may think they’re prepared for any eventuality affecting their industry, only to find—similar to Hicks’ example—that the culprit lies far beyond their control.

“Different resources are needed for different uses across industries,” says STAC’s Nabicht. “Other industries have capacity demands that are completely independent of financial services and burst to meet those demands in the same cloud environments. So what happens if there’s also a big market spike that takes place at the same time?”


Preparation is key, agrees Jim Nevotti, president of Chicago-based trading and risk software vendor Sterling Trading Tech, whose Sterling Risk Engine is cloud-based. Nevotti says the circumstances created by the Covid-19 outbreak have placed unforeseen stresses on all components of firms’ infrastructures, including the cloud.


“Last year saw massive spikes in volumes as a result of Covid, and other factors, such as more stock listings. And if you look at historical trends, the volume of market data never goes down—it’s always trending upwards,” Nevotti says. “We’ve taken an approach over the last couple of years to make sure we could handle data spikes and not hit capacity.”

Now, Sterling monitors spikes and cloud capacity as part of its ongoing systems testing and stress testing, including regular benchmarking to ensure it has a “buffer zone” of capacity.
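A “buffer zone” check of this kind might look something like the sketch below. It is an illustration only, not Sterling's method: the function names, the capacity units, and the 25% minimum headroom are all hypothetical assumptions chosen for the example.

```python
def headroom_fraction(provisioned_capacity, observed_peak):
    """Share of provisioned capacity left unused at the observed peak.

    Both arguments use the same (hypothetical) unit, for example
    messages per second measured during a stress test.
    """
    if provisioned_capacity <= 0:
        raise ValueError("provisioned_capacity must be positive")
    return (provisioned_capacity - observed_peak) / provisioned_capacity


def buffer_is_adequate(provisioned_capacity, observed_peak, min_headroom=0.25):
    """True if the spare capacity meets the chosen minimum buffer.

    min_headroom=0.25 is an arbitrary example threshold, not an
    industry standard.
    """
    return headroom_fraction(provisioned_capacity, observed_peak) >= min_headroom


# Example: capacity of 1,000 units against two benchmarked peaks.
for peak in (700, 900):
    status = "adequate" if buffer_is_adequate(1000, peak) else "too thin"
    print(f"peak of {peak}: buffer is {status}")
```

Run against each new benchmark, a check like this flags when rising peak volumes have eaten into the buffer before a real spike does.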

“There were many industry outages over the past year, and we held up really well because we put in the time and investment beforehand,” Nevotti adds.

Google’s Brown also stresses the importance of careful planning to maintain a safe operating buffer.

“We do a lot of capacity planning for peaks and troughs, and we can plan ahead and reserve resources for specific customers. We do this on the retail side of the house for the holiday season, for example. For events like the current pandemic, we rely on our buffer to be able to handle it, and we can easily move compute resources around the world, even taking advantage of time differences,” she says.

These are the types of capacity planning factors that firms need to take into account when deciding how to best utilize cloud resources, and which cloud provider is right for them. But even with the capacity planning and resilience testing undertaken by cloud companies themselves, the industry needs to change the way it thinks about cloud.

Instead of treating it as an infinite resource, firms need to understand that it’s finite and must be planned for and paid for. They must also treat cloud computing suppliers in the same ways as they treat, assess, and test other suppliers for potential risks. As firms place greater reliance on the cloud, it promises to deliver significant benefits, but also becomes a greater area of potential risk for the industry as a whole.
