IHS Markit to Add About 1 Million Analyst Reports to its Data Lake
IHS Markit uses Google’s transformer-based model BERT and a combination of classification and extraction techniques to determine what the documents mean and summarize them.
IHS Markit is adding unstructured data, in the form of research articles and papers, to its proprietary Data Lake.
By the end of Q4, the data service provider aims to upload about one million documents published by internal analysts over the past 10 years. The research reports cover topics related to financial services, the automotive industry, agriculture, chemicals, economics and country risks, energy, life sciences, and more.
Yaacov Mutnikas, chief technology officer and chief data scientist at IHS Markit, says the documents will be summarized and tagged so that users can understand their gist, and search for articles and reports by topic.
“For example, you can pull up ‘Argentina GDP’ and all the results on anything that was ever published on Argentina’s GDP will come up,” he says.
IHS Markit will generate a synopsis and extract domain-specific entities for each document before it goes into the data lake.
“We are also running feature engineering through all the documents and extracting specific features such that we can label those articles to facilitate a much easier topic and article discovery,” Mutnikas says.
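As a rough illustration of how labeled documents can support the kind of topic search Mutnikas describes, the Python sketch below builds an inverted index from tags to documents and answers a query such as ‘Argentina GDP’. It is a minimal sketch under stated assumptions, not IHS Markit’s implementation: the document IDs, the tags, and the function names build_index and search are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tagged documents; IDs and tags are illustrative only.
DOCUMENTS = {
    "doc-1": {"argentina", "gdp", "economics"},
    "doc-2": {"argentina", "agriculture"},
    "doc-3": {"brazil", "gdp"},
}


def build_index(documents):
    """Map each tag to the set of document IDs labeled with it."""
    index = defaultdict(set)
    for doc_id, tags in documents.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of documents tagged with every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get() avoids creating empty entries for unknown tags.
    matches = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*matches))


index = build_index(DOCUMENTS)
print(search(index, "Argentina GDP"))  # prints ['doc-1']
```

In this sketch a document matches only if it carries every query term as a tag, so broader queries return more documents and narrower ones return fewer.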
IHS Markit built the tagging system with a range of machine learning and natural language processing techniques, including Google’s transformer-based model BERT.
All this work will make it easier for clients to distinguish between, for example, ‘Trump, the man’ and ‘Trump Tower, the building’, he explains. “[This is] to understand when you’re talking about an organization, or a location, or when you’re talking about a person versus a publication like a book.”
Mutnikas adds that development work on the documents started in February and was completed towards the end of July. Currently, the service is going through testing and validation. “When that is done, we will onboard all the documents, all the machinery, indexing, and curating. And we will do it in just under two months—about six weeks,” he says. “It’s a very big step for us, to manage all the unstructured content that we have in the company.”
Data Lake has been available to clients since May 18, and currently has about 1,000 proprietary datasets from the financial services, energy and resources, and transportation sectors.
There are a few ways that buy- and sell-side clients can use Data Lake. The first, Mutnikas says, appeals to those who want to monetize their own data. These users can ingest their data into the platform and use IHS Markit’s framework to distribute it to other users.
Other users might want to compare their in-house datasets to the datasets that IHS Markit has. “They might want to merge the breadth of their data and the depth of our data. They can merge two datasets to get the best outputs,” he says.
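The merge Mutnikas describes can be pictured as a join on a shared key. The Python sketch below is a minimal illustration under assumptions: the country-code keys, the field names, and the values are placeholders invented for the example, and merge_datasets is a hypothetical helper, not part of any IHS Markit tooling.

```python
# Hypothetical in-house and vendor records keyed by country code.
# Field names and values are placeholders, not real data.
in_house = {
    "AR": {"client_metric": 1},
    "BR": {"client_metric": 2},
}
vendor = {
    "AR": {"vendor_metric": 10},
    "CL": {"vendor_metric": 30},
}


def merge_datasets(in_house, vendor):
    """Outer-join two keyed datasets, combining the fields for each key.

    Where both sides define the same field, the in-house value wins.
    """
    merged = {}
    for key in sorted(in_house.keys() | vendor.keys()):
        row = {}
        row.update(vendor.get(key, {}))
        row.update(in_house.get(key, {}))
        merged[key] = row
    return merged


merged = merge_datasets(in_house, vendor)
for key, row in merged.items():
    print(key, row)
```

An outer join keeps keys that appear on only one side, which suits the case where one dataset is broader and the other deeper. An inner join would keep only the overlap.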
A third use case is to research opportunities or additional insights, for example in emerging markets. “If you look at macroeconomic data, for example, we’ve got north of 18 million time series,” Mutnikas says.
The cloud-based platform stores, catalogs, and governs access to structured and unstructured data. Using the catalog, clients can search and explore IHS Markit’s datasets via a standardized taxonomy. Clients can use the tools of their choice to work with both their own data and IHS Markit’s proprietary data in one place.
“We cannot tell people where the opportunity is, but we can help people to find those opportunities, because people look for different things,” he says. “For example, somebody will look for opportunities in Latin America, or somebody else will look for opportunities in Southeast Asia. They are both emerging markets and the opportunities there are different. So we enable people to use the tools they’re comfortable with rather than imposing tools on them. Essentially, we support any tool that people want to use in that space.”
To make browsing easier, IHS Markit has curated its roughly 1,000 datasets into data packages that include metadata, sample data, and data dictionaries.
Sorting Documents
IHS Markit worked closely with the relevant analysts to ensure the reports and articles were summarized correctly, and the appropriate topics were tagged.
Mutnikas says the analysts played a vital role in helping to develop the machine that IHS Markit used to summarize the documents.
“When we sit down and write the initial data science machinery around it, they can validate if our assumptions and how it summarizes documents across their domain are represented correctly. From there, the machine does the work,” he says.
The benefit for IHS Markit is that it has engineers, researchers, and specialists who understand the data. “You’ve got to understand what you’re looking at. … That is why having specialists curating that data and owning the data is key. Otherwise, [it’s a] no go,” he adds.
Copyright Infopro Digital Limited. All rights reserved.