Data Lakes: Uncharted Waters
Anthony Malakian examines why data lakes are growing in popularity, and the risks that they pose.
When marketing something new, it’s best to pit it against something established.
Data warehouses are established. They’re ubiquitous. Data lakes, on the other hand, are relatively nascent. And the tools and strategies around data lakes are basically embryonic, especially in the capital markets space.
Data lakes are repositories where firms dump huge quantities of raw, ungoverned and unformatted data. They are staging areas, for all intents and purposes.
“Even before deploying a data lake, it is required to first understand that they are not equivalent to enterprise data warehouses,” says Ruban Phukan, co-founder and chief product and analytics officer at analytics platform provider DataRPM. “An enterprise data warehouse is about storing prepared data—cleansed, transformed and modeled for specific analytical purposes.”
But a Google search of “data lake vs. data warehouse” brings up a large number of results, most of which sing the praises of the challenger over the incumbent. The fact is that data lakes and data warehouses aren’t mutually exclusive. As the tools in and around data lakes evolve, this may change, but right now these two technologies are most likely to exist in unison within trading houses.
As firms look to capitalize on the promise of big data, data lakes are likely to take on greater prominence on Wall Street. And the firms that have the best chance of capturing the true advantage of a data lake—and not having it turn into a data swamp—will find a way to incorporate these tools alongside a data warehouse.
Constructing a Lake
Data warehouses have structure built into them and the user (hopefully) has in mind what he or she will use the data for and the context built around that information. Data lakes are built in an amorphous way, according to Marshall Saffer, COO at hedge fund solutions provider MIK Fund Solutions. “You don’t know the exact need for all this data you’re pumping into the lake, but when somebody needs that data, you want to be able to go and get it,” he says.
It’s this lack of knowledge and structure inherent in data lakes that makes data warehouses still very necessary. It’s also why building scalability and flexibility into a lake is so crucial.
Apache Hadoop, an open-source Java framework for distributed storage and processing, has proven popular for building data lakes, but it’s not the only choice. For the purposes of this feature, Hadoop serves as the example.
So, for starters, a firm will need a cluster of servers and a significant amount of storage on each server. Then the Hadoop Distributed File System (HDFS)—or equivalent—storage layer is added into the cluster. Users can then drop huge datasets into the lake, where the information is scattered across all of the servers automatically by way of the batch-processing-oriented system.
If a firm has 10 terabytes of data and 10 nodes, the data lake will distribute one terabyte to each node automatically, and, when designed correctly, the system will be able to recall that information from across its various servers. This is a primitive environment used for capturing raw files, but from there firms can build semantic and governance layers on top of the lake through which the refined information can be passed to a data warehouse.
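The 10-terabyte, 10-node example can be sketched in a few lines of Python. This is a toy illustration only: the round-robin placement, the `distribute` function, the node names and the 100 GB block size are all assumptions made for the sketch, and it ignores replication. It is not how HDFS actually places blocks.

```python
BLOCK_GB = 100  # hypothetical block size, chosen only to keep the numbers small


def distribute(dataset_gb, nodes, block_gb=BLOCK_GB):
    """Split a dataset into fixed-size blocks and deal them out round-robin.

    Returns a dict mapping each node name to the gigabytes it ends up holding.
    """
    stored = {node: 0 for node in nodes}
    remaining = dataset_gb
    block_index = 0
    while remaining > 0:
        chunk = min(block_gb, remaining)  # the last block may be smaller
        target = nodes[block_index % len(nodes)]
        stored[target] += chunk
        remaining -= chunk
        block_index += 1
    return stored


nodes = [f"node-{i}" for i in range(1, 11)]  # hypothetical 10-node cluster
layout = distribute(10_000, nodes)           # 10 TB expressed as 10,000 GB
for node, gb in layout.items():
    print(node, gb, "GB")
```

With these inputs every node ends up holding the same share, which is the even spread the example describes.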
Dan Graham, director of technical marketing for analytics platform provider Teradata, has over four decades’ industry experience and has witnessed the evolution of data warehouses and the impact data lakes have had on the capital markets landscape. He says that where people most often get data lakes wrong is when they simply throw everything into the repository without much thought.
“The downside is that you have to keep track of file names or else you’ll start losing data and duplicating it because it’s not well organized yet. So you see a lot of people investing in governance to make it easy to find and use all these files,” Graham says. “Its current purposes are limited to what a programmer can do. It’s up to a programmer to do a lot of do-it-yourself work in the data lake. Once you get the information into the data warehouse, there are a lot of tools and a semantic layer. The data lake isn’t as robust of a technology, but it’s a mandatory technology for some workloads.”
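One simple form the governance Graham describes might take is a catalog that records every file by a hash of its contents, so that the same data loaded under a different name is flagged rather than silently duplicated. The sketch below is a minimal illustration under that assumption; the `FileCatalog` class and the file paths are hypothetical and do not describe any vendor’s tooling.

```python
import hashlib


class FileCatalog:
    """Hypothetical catalog that indexes lake files by a hash of their content."""

    def __init__(self):
        self._paths_by_digest = {}  # SHA-256 hex digest -> list of file paths

    def register(self, path, content):
        """Record a file; return True if identical content is already cataloged."""
        digest = hashlib.sha256(content).hexdigest()
        paths = self._paths_by_digest.setdefault(digest, [])
        already_present = len(paths) > 0
        paths.append(path)
        return already_present

    def duplicates(self):
        """Return groups of paths that share exactly the same content."""
        return [p for p in self._paths_by_digest.values() if len(p) > 1]


catalog = FileCatalog()
first = catalog.register("raw/trades.csv", b"id,price\n1,100\n")
second = catalog.register("sandbox/trades_copy.csv", b"id,price\n1,100\n")
third = catalog.register("raw/quotes.csv", b"id,bid,ask\n1,99,101\n")
print(catalog.duplicates())
```

Hashing the content rather than trusting the file name means a renamed copy is still caught, although files that differ by even one byte are treated as distinct.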
Still Evolving
Greg Bujak, CTO of social analytics platform provider Social Alpha, says investing in a classification system is necessary because these tools are still evolving and firms don’t want to make an investment and dump data into them today, only to have to completely change their strategy down the road as data lakes mature.
“The problem you are solving for is extracting insight from the data you possess in new and creative ways at some point in the future,” he says. “More often than not, data lakes are used as a data dumping ground; no thought is given to the fact that you always need some form of schema or semantic layer. Consideration needs to be given to data extraction and correlation.”
After a data management layer is built that can catalog downstream processes, an analytics layer will need to be implemented in order to consume the crude information and turn it into a product or commodity, according to DataRPM’s Phukan. He adds that it’s important to identify all of the data silos that will be streamed into the lake and the pipes that need to be built in order to make that possible.
“Most organizations only think about internal data streams like POS transaction logs, inventory data, or data in customer-relationship management systems, and so on,” he says. “But the true power of a data lake comes not only from being able to break down internal data silos, but even more from being able to host relevant external data streams.”
Into the Lake
While speaking at a recent Waters-run conference, Rashmi Gupta, a data manager at MetLife, noted that her firm had developed a data lake, which serves as a data acquisition repository. On top of that, the asset manager built a data translation layer, which she called the semantics layer.
“So you have one set of information, one single version of the truth, but you don’t have all the cost associated and the work and labor involved in creating a single data warehouse,” Gupta explained. In addition to the cost, scalability has proven to be a great benefit, she added.
At the same conference, Scott Burleigh, executive director at JPMorgan Asset Management, said that data lakes, along with other tools built around those repositories, could eventually displace warehouses as we know them today.
“It boils down to the data warehouse now being replaced by a high-technology data service layer,” Burleigh said. “You just talk to the service layer, tell it what data elements you want, and it knows where they are. It serves it up to you as though it was one source.”
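The idea of a service layer that “knows where they are” can be illustrated with a small registry that maps each data element to whatever fetches it, so the caller asks by name and never sees the underlying source. This is a sketch of the general pattern only; the `DataServiceLayer` class, the element names and the stand-in sources are hypothetical and are not a description of JPMorgan Asset Management’s system.

```python
class DataServiceLayer:
    """Hypothetical facade that resolves data element names to their sources."""

    def __init__(self):
        self._fetchers = {}  # element name -> zero-argument callable

    def register(self, element, fetch):
        """Tell the layer which callable supplies a given data element."""
        self._fetchers[element] = fetch

    def get(self, *elements):
        """Return the requested elements as one dict, whatever their origin."""
        missing = [e for e in elements if e not in self._fetchers]
        if missing:
            raise KeyError(f"no registered source for: {missing}")
        return {e: self._fetchers[e]() for e in elements}


# Stand-ins for two different back ends (a lake file and a warehouse table).
layer = DataServiceLayer()
layer.register("closing_price", lambda: 101.25)   # imagined lake lookup
layer.register("counterparty", lambda: "ACME")    # imagined warehouse lookup

result = layer.get("closing_price", "counterparty")
print(result)
```

The caller receives a single answer and has no way of telling that the two values came from different places, which is the effect Burleigh describes.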
Moving away from how others have used and viewed data lakes, Burleigh added that JPMorgan Asset Management governs the data before it enters the lake, which some would say dilutes the purpose of having a lake in the first place.
“We’re identifying the source for the data that goes into the lake and we make changes, or the governance says we need to make changes to the data element,” Burleigh said. “We make it at the source and it is reflected in the data lake.”
Third-party providers are also deploying data lakes as part of their own business practices. Ben Cuthbert, CEO at Celer Technologies, which provides a front-to-back trading ecosystem, has also taken the data lake route; he feels that this is a trend that will continue in the capital markets.
“Most people presume they know up front what they will need from their data. In our view, that is wrong,” Cuthbert says. “When we built Celer, we thought we knew everything about the type of data we’d need. That was not the case. So we changed our model to store as much raw data as possible, and then we hydrated our analytics and trade data from that raw data.”
Swim at Your Own Risk
The term “data swamp” has been coined to describe data lake projects that end up going nowhere. If you can’t extract information easily and quickly, then the point of a data lake is lost.
Teradata’s Graham once again notes that it’s easy to get the governance layer and the lake’s security wrong. While governance is vital at some stage, he says that firms should also be careful about implementing too many different “refinement” tools that end up creating numerous copies of the same file.
For example, a user might load one terabyte of data into the lake, which Hadoop then duplicates. From there, programmers create several derivatives of the original file to suit their needs. Without proper governance, that one terabyte quickly turns into eight. It should also be borne in mind that raw data, dumped into the lake complete with errors and inconsistencies, will at some point need to be cleaned if it is to be used for trading rather than just for “playing in a sandbox” discovery.
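The arithmetic behind one terabyte becoming eight can be made explicit. The sketch below assumes, purely for illustration, that the cluster keeps two copies of every file (the “duplicates” above), that programmers create three derived files, and that each derivative is as large as the original and is itself replicated. None of these figures comes from the article beyond the one-to-eight outcome, and the replication factor is a configurable setting that would change the total.

```python
def physical_footprint_tb(logical_tb, derived_copies, replication):
    """Estimate raw storage consumed by one dataset and its derivatives.

    Assumes every derived file is the same size as the original and that the
    storage layer replicates originals and derivatives alike.
    """
    distinct_files = 1 + derived_copies  # the original plus its derivatives
    return logical_tb * distinct_files * replication


# Illustrative assumptions: 1 TB loaded, 3 derivatives, 2 copies of each file.
footprint = physical_footprint_tb(1.0, derived_copies=3, replication=2)
print(footprint, "TB")
```

Under those assumptions the footprint comes to eight terabytes, and each extra derivative adds another two.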
“All of these data cleansing techniques are a foreign idea and a new area of exploration for the Hadoop data lake people,” Graham says. “This is something that has been going on in the data warehouse since the 1990s. But it’s very simple: If there’s dirt in the water, users won’t drink it.”
Not for Everyone
Nick Heudecker, research director at Gartner, who has done extensive research on data lakes and data warehouses, says that some firms are simply attaching a new term to what a data warehouse already is. The data lake, for many firms, is simply a new type of data container, and when they build governance, semantic and security layers on top of it, it’s hard to tell the difference between a lake and a warehouse.
“If I’m building a very detailed semantics layer and add security in the way that I would do it in my data warehouse, and I’m optimizing data for storage or reuse, is it really a data lake, or is it a different physical implementation of a data warehouse?” Heudecker asks. “People confuse the physical implementation of the data warehouse with its architectural considerations. And that’s fine; call it what you want. But I want to know about the architecture and from there we can figure out what characteristics you’re going to need.”
Alexey Utkin, practice leader at DataArt, notes that a data lake will not suit every company’s needs, simply because neither the ancillary products nor the human capital are mature. As a result, firms that can’t extract useful information lose the benefit of having a data lake.
“The data lake approach puts high demands on rare skills, particularly in a field of information science—such as for data and metadata modeling—and sophisticated data analytics and data science,” he says. “Tooling around data lakes, for analysis and reporting, data lifecycle management, modeling and security, is also new and sometimes patchy. Unless these points are addressed properly, the data lake program is at risk of delivering just a messy big data storage repository without any business benefits.”
There are also reporting hurdles that need to be considered. Suresh Kandula, director of technology at commodities specialist Sapient Global Markets, says that buy-side firms have been quicker than tier-one investment banks to embrace data lakes. Part of the reason for this has to do with regulatory reporting concerns, where a data warehouse might be more appropriate.
“For sell-side firms and some critical regulatory regimes, executing regulatory reporting and finalizing regulatory forms out of a data lake is still maturing,” Kandula says. “Policies and procedures will need to mature to show lineage and data control before data lakes become more dominant with sell-side firms. Data lakes work best for sandbox environments, where data analysts can ‘play’ around with their own scenarios, data inspection or on-boarding, before formalizing them into an environment that is more strictly governed.”
But in some instances, regulatory challenges might actually help firms get to the point where they implement a data lake, notes Brian Sentance, CEO of Xenomorph Software. Take, for example, new risk data aggregation needs stemming from BCBS 239. When large firms have too many legacy systems to integrate, a data lake can be the answer.
“We know of one leading firm that tried to go along the route of establishing a common data model and integrating all its data in a traditional data warehouse—this approach proved too difficult and complex, and they have since adopted a data lake approach,” he says.
Early Days
Data lakes, as a concept, are enticing and they tick several boxes: They deliver tradeable research from large, unstructured datasets; they are relatively cheap and quick to deploy; they have huge storage capacity; they are scalable; and they are evolutionary. But as with any innovation, banks and asset managers will have to decide just how early an adopter they would like to be.
Right now, however, it’s important to remember that a data lake is not a solution that will automatically replace a data warehouse. Sure, firms can add pieces around the lake that will make it look more like an enterprise data warehouse, but there’s a danger in undermining the main purposes of a data lake: to consume and house enormous datasets in their raw and unstructured form, and to save that data for when it might be needed at a later date; or to run deep analysis on massive datasets without bankrupting a firm’s processing capabilities. Any other purpose and you may simply be redefining a data warehouse.
These waters are still muddy and the potential to create a swamp omnipresent, but for any firm looking to extract value from big data, the promise of a data lake has to be explored.
Salient Points
- Firms need to identify the data streams they want to flow into their lake, build those pipes, choose a storage layer, and then add data management and analytics layers.
- Hiring programmers who know how to build around, and extract data from, data lakes is vital, although they’re difficult to find.
- Security needs to be built into the data lake, and not added on top like icing on a cake.
- It’s still uncertain whether or not data lakes can help with regulatory reporting, but when poorly executed, a data lake can turn into a data swamp, and extracting data—whether for reporting or trading needs—becomes incredibly challenging.
Copyright Infopro Digital Limited. All rights reserved.