Web-scraped data for investment purposes accounts for as much as 5% of online traffic—but operational, legal and technological barriers remain for investment firms looking to fully tap this vast resource.
This figure comes from a report by Opimas, published earlier this year, which also estimates that spending on web scraping for investment purposes will exceed $1.8 billion by 2020.
Since the early days of the World Wide Web, its potential as a source of investment data has been obvious. Everything from social media trends to retail prices to job postings is available online. And this potential will only grow as investors increasingly look for alternative data to help them stay ahead of the pack.
“What people don’t realize is web data is the largest publicly available dataset in the world, and it doubles in size every year,” says Vinicius Vacanti, CEO of YipitData, whose clients include hedge funds and mutual funds. When YipitData started five years ago, it had three employees. It now employs 100 people, thanks to a seemingly insatiable thirst for new and unique datasets.
Collecting this data is getting more complicated as websites set up barriers to prevent machines from crawling their data, and firms worry about ambiguity and the lack of standards around the legality of using public data for commercial purposes.
How It’s Done
Nonetheless, YipitData belongs to a small universe of specialists that have sprung up to meet the growing investor appetite for this type of information. Another is Thinknum, which was founded in 2014 and aims to collect web data and sell it in a structured format to investors.
Thinknum founders Justin Zhen and Gregory Ugwi met while studying at Princeton University. After they graduated, Zhen landed at a hedge fund and Ugwi became a strategist at Goldman Sachs. Zhen was looking for information available on social media—specifically, Twitter—while Ugwi was interested in real estate data. However, both realized they were facing the same issue: how to access, and make sense of, public web data. Many of their contacts and colleagues were having the same problems.
“We thought that we should build a company that organizes public web-data trails and makes them usable for investors,” Zhen says.
Thinknum works like a business search engine, Zhen says. While traditional search engines collect everything online, Thinknum gathers specific information related to business activity. What data is collected depends on what the bot is programmed to pick up.
Thinknum uses crawlers—bots that scour websites looking for information—similar to what Google does. It then organizes and structures the underlying data to make it more easily digestible. Thinknum offers 30 different datasets on its platform, including job listings, car inventory, store locations, LinkedIn profiles, Twitter followers, restaurant menu pricing, and government contracts.
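To illustrate how such a crawler turns raw pages into structured records, here is a minimal sketch using only Python's standard library. The HTML snippet, the "job" class, and the field names are all hypothetical, not Thinknum's actual markup or schema:

```python
from html.parser import HTMLParser

class JobListingParser(HTMLParser):
    """Collect the text of every <li class="job"> element."""
    def __init__(self):
        super().__init__()
        self.jobs = []
        self._in_job = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "job") in attrs:
            self._in_job = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_job = False

    def handle_data(self, data):
        if self._in_job and data.strip():
            self.jobs.append(data.strip())

# Stand-in for a page the crawler fetched; a real bot would download
# this with an HTTP client and respect the site's robots.txt.
html_page = """
<ul>
  <li class="job">Data Engineer - New York</li>
  <li class="job">Quant Analyst - London</li>
  <li>About us</li>
</ul>
"""

parser = JobListingParser()
parser.feed(html_page)

# Structure the raw listings into records an analyst can query.
structured = [dict(zip(["title", "location"], job.split(" - ")))
              for job in parser.jobs]
```

The point is the shape of the work, not the parser itself: the bot only keeps what it is programmed to pick up, and the value is added when the free-form page is reshaped into uniform rows.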
For any company tracked, a user can view who the firm is trying to hire each day, along with each job’s title, type, and location. “I can see how many jobs the company has every single day; if I actually overlay the stock price, I can see that this data is predictive,” Zhen says. “When the company hires, the stock price goes up about six weeks later. When the company stops hiring, the stock price goes down about six weeks later.”
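Zhen's overlay can be illustrated with a toy calculation in Python: correlate a hiring series against prices shifted forward by six weeks. The weekly numbers below are invented purely for illustration:

```python
# Toy weekly series: job-listing counts and a stock price. The six-week
# lag comes from Zhen's observation; the numbers here are made up.
hiring = [10, 12, 15, 20, 26, 30, 28, 25, 20, 14, 10, 8, 7, 6]
price  = [50, 50, 51, 51, 52, 53, 55, 58, 62, 67, 70, 69, 67, 64]

LAG = 6  # weeks between a hiring change and the price response

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Correlate hiring today against the price LAG weeks later, versus
# the naive same-week comparison.
lagged = pearson(hiring[:-LAG], price[LAG:])
same_time = pearson(hiring, price)
```

In this fabricated series the lagged correlation is strong while the same-week correlation is not, which is the pattern Zhen describes; a real study would need far longer histories and out-of-sample testing.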
Zhen likens the service to a Bloomberg terminal in that an investor could build their own, but wouldn’t bother when they could just subscribe to a better, ready-made service. “There is no reason why a fund would scrape 400,000 companies across 30 datasets,” he says.
DIY Scraping
However, many asset managers do have in-house web-scraping operations. Nick Jain, founder of Citizen Asset Management, has been scraping data himself, rather than using a third-party provider. The kind of web data that interests him is site traffic, browsing history, API calls, and social media analysis.
“I am actually not a technologist or programmer by background, [but] I think it took me five or six hours to go from knowing no code to being able to write code that can scrape data that I want,” he says.
Jain has an MBA from Harvard, as well as a background in mathematics and theoretical physics. While not a technologist by trade, he does have the practical foundation to expand his skillset. For a large number of asset management firms, though, building a team with the skills necessary to extract value from such a massive universe of data might not be cost-effective. Not only do firms need to hire experts, but if they want to do this on their own, web scraping on a large scale requires hundreds of gigabytes of data and a mass of servers. Even Jain has to rely on third parties to help provide the infrastructure.
“If I wanted to scrape that sort of data, I have the coding skills to do so, but I don’t have the server farms that I would need to go do that,” says Jain. “I can rent them from Amazon or [another vendor], but that is the one limiting factor.”
There is another option for funds that don’t want to outsource this capability, but also don’t have the resources to do it all in-house.
YipitData launched a product called ReadyPipe, which is delivered via a software-as-a-service model. This allows users to scrape data themselves without worrying about the infrastructure and databases required.
“We are starting to see investors try to collect their own web data by hiring an engineer or a technical data analyst to their team, which is why we developed ReadyPipe,” says Vacanti.
Others in the space, like Sequentum, collect the data, but then hand it to clients who want it raw so they can generate their own specialized reports.
“As long as it’s in a machine-readable format, then [clients] are happy,” says Sarah McKenna, the vendor’s CEO. Sequentum can perform some transformation of the data, such as changing date and time formats, or converting currency to US dollars. Occasionally, smaller clients without engineering expertise will request some sentiment analysis or text analytics.
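A sketch of the light transformation McKenna describes (normalizing date formats and converting prices to US dollars) might look like this in Python. The record layout and the EUR/USD rate are placeholders, not Sequentum's actual pipeline:

```python
from datetime import datetime

# Placeholder conversion rate, not a live quote.
EUR_TO_USD = 1.10

def normalise(record):
    """Convert a scraped date string to ISO 8601 and the price to USD."""
    out = dict(record)
    # Source site uses day/month/year; emit the machine-friendly form.
    out["date"] = datetime.strptime(record["date"], "%d/%m/%Y").date().isoformat()
    if record["currency"] == "EUR":
        out["price"] = round(record["price"] * EUR_TO_USD, 2)
        out["currency"] = "USD"
    return out

raw = {"date": "03/09/2019", "price": 20.0, "currency": "EUR"}
clean = normalise(raw)
```

The deliverable stays raw in substance, as McKenna says; only the formats are made consistent so clients can feed the data straight into their own models.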
Differing Returns
But the fact remains that no matter what help or solutions are out there, the data itself needs to be relevant to a fund’s particular investment strategy, and sometimes the effort and expense involved outweigh the benefits.
Neil Bond, head of trading at Ardevora Asset Management, says his firm did some work with web-scraped data a couple of years ago but ended up dropping the project. It took a lot of work and did not add much value to the firm’s alpha generation.
“We were looking for keywords in trading updates that were followed by unexpected outperformance or underperformance in results,” says Bond. “For example, if a CEO used the word ‘absolutely’ several times in a trading update, we could expect disappointing results. We no longer do this.”
Social media can be a mess of unstructured, low-quality datasets. You have to know how to make it useful, says Citizen’s Jain.
“Some quant funds, more on the high-frequency side, monitor Twitter sentiment data to trade stocks on a minute-by-minute or hour-by-hour basis,” Jain says. “They notice Twitter sentiment is positive, so they go buy Microsoft stock, or vice versa. That is a relatively well-understood space and there are lots of quant funds doing that.”
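A toy version of the rule Jain describes, going long when rolling Twitter sentiment turns positive and staying flat otherwise, could be sketched as follows. The scores and the three-tick window are invented, and a real desk would use scored tweet feeds and far more careful execution logic:

```python
from collections import deque

def sentiment_signal(scores, window=3, threshold=0.0):
    """Yield 'BUY' or 'FLAT' per tick from a rolling mean of scores."""
    recent = deque(maxlen=window)
    for score in scores:
        recent.append(score)
        mean = sum(recent) / len(recent)
        yield "BUY" if mean > threshold else "FLAT"

# Hypothetical minute-by-minute sentiment scores in [-1, 1].
ticks = [-0.2, 0.1, 0.4, 0.5, -0.1, -0.6, -0.4]
signals = list(sentiment_signal(ticks))
```

The signal flips to BUY only while the rolling average is positive, which captures the "sentiment is positive, so buy" logic Jain attributes to high-frequency quant funds.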
Using Twitter data for longer-term investing is more challenging and there are fewer firms able to do that—indeed, there is a lot of skepticism about whether it can be done at all, says Jain.
“I personally have figured out a few useful cases to do it, so I think it works,” he says. “But I don’t know whether it works generally.”
Thinknum’s Zhen says it is important to look beyond any single dataset to get a holistic view of a company. “Let’s say their Twitter followers are going up, their job listings are also going up, the product is becoming more expensive, and people are saying good things about management internally. All these things are good signs for the company,” says Zhen. “You want to paint a very complete picture of each company that you are looking at.”
Josh Sutton, CEO of artificial intelligence technology vendor Agorai, says that looking at how usage of a certain phrase increases or decreases in frequency can be interesting.
“I think eventually you will start to see a few of the very technology-savvy hedge funds find ways to identify meaningful trading signals based on what is and isn’t said,” says Sutton. “I think from a natural language understanding point of view, we are still a ways away from that. I do think there is a window that is continually moving, which is the ability to trade off web-scraped data in a quant-driven type of model.”
Regulatory and Privacy Hurdles
Another hindrance to wider adoption of these techniques is concern about ending up in court.
“There are massive compliance issues,” says Jain. “Some people abide by the rules very clearly, and some people don’t.”
Jain says he works with little web-scraped data because so much of it is not compliant with rules and regulations.
“Most third-party providers that I looked at, I just didn’t trust their compliance procedures. I think their goal was to scrape as much data and sell it, without respecting the terms and conditions of the place they were subscribing with. I just didn’t want that to ever be an issue for me,” he says.
Anyone collecting this type of data for investment purposes must read the small print on the website. Many websites have a standard clause that their information cannot be used for commercial purposes.
A few years ago, Jonathan Streeter, a lawyer at Dechert LLP, began to notice a significant increase in queries about the legality of alternative data for investment strategies. “I think activity in the space picked up considerably about three or four years ago and a lot more investment managers got interested in it at that time,” he says.
Sequentum’s McKenna says that when her firm contracts with funds, it addresses specific compliance concerns to make sure it is not exposed to insider trading accusations. It also respects Captcha tests—the challenge-response tests that ask users to check all the boxes containing photos of storefronts, cars or street markings before they are allowed to access a site.
“As long as we are getting data that is public, it is not behind fake accounts, not behind logins, it is readily available to basically anybody who is cruising the web, then that is not considered insider trading,” says McKenna.
She says the firm calculates a site’s average daily traffic and limits its own visits to as little as 1% of that volume.
“When the analysts say, ‘I want all the data, every hour,’ then we explain to them the goal is to get them reliable, high-quality data on a constant basis,” she says. “If we do a denial-of-service attack against the site, we are basically going to have to stop pulling data altogether.”
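The pacing rule McKenna describes reduces to simple arithmetic: cap requests at 1% of the site's average daily traffic and spread them evenly across the day. The traffic figure below is hypothetical:

```python
# Hypothetical site traffic; a vendor would estimate this per site.
AVG_DAILY_VISITS = 500_000
CRAWL_SHARE = 0.01  # stay at or below 1% of the site's own traffic

# Daily request budget and the even pacing that spends it over 24 hours.
max_requests_per_day = int(AVG_DAILY_VISITS * CRAWL_SHARE)
seconds_between_requests = 86_400 / max_requests_per_day
```

For this example site the crawler would make at most 5,000 requests a day, roughly one every 17 seconds, which is why an analyst asking for "all the data, every hour" has to be talked down to a sustainable cadence.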
While any data in the public domain could potentially be useful for investment purposes, it is usually not that simple. The quantity of data needed to glean meaningful insight can be huge. The information on websites is also constantly changing; there are always newer and better tools to prevent users from scraping information. Even when scraping is successful, the data tends to be unstructured, with each website having its own schema and internal database.
But the fact remains that for buy-side firms struggling to find alpha, the greatest source of data is the internet—and there is no slowing this trend down.
Copyright Infopro Digital Limited. All rights reserved.