Banks, Asset Managers Turn to Web Scraping to Generate Alpha

The immense growth of online data is driving an increasing number of asset managers to deploy web-scraping tools to find unique investment insights.

Web-scraped data for investment purposes accounts for as much as 5% of online traffic—but operational, legal and technological barriers remain for investment firms looking to fully tap this vast resource.

This figure comes from a report by Opimas, published earlier this year, which also estimates that spending on web scraping for investment purposes will exceed $1.8 billion by 2020.

Since the early days of the World Wide Web, its potential as a source of investment data has been obvious. Everything from social media trends to retail prices to job postings is available online. And this potential will only grow as investors increasingly look for alternative data to help them stay ahead of the pack.

“What people don’t realize is web data is the largest publicly available dataset in the world, and it doubles in size every year,” says Vinicius Vacanti, CEO of YipitData, whose clients include hedge funds and mutual funds. When YipitData started five years ago, it had three employees. It now employs 100 people, thanks to a seemingly insatiable thirst for new and unique datasets.

Collecting this data is getting more complicated as websites set up barriers to prevent machines from crawling their data, and firms worry about ambiguity and the lack of standards around the legality of using public data for commercial purposes.

How It’s Done

Nonetheless, YipitData belongs to a small universe of specialists that have sprung up to meet the growing investor appetite for this type of information. Another is Thinknum, which was founded in 2014 and aims to collect web data and sell it in a structured format to investors.

Thinknum founders Justin Zhen and Gregory Ugwi met while studying at Princeton University. After they graduated, Zhen landed at a hedge fund and Ugwi became a strategist at Goldman Sachs. Zhen was looking for information available on social media—specifically, Twitter—while Ugwi was interested in real estate data. However, both realized they were facing the same issue: how to access, and make sense of, public web data. Many of their contacts and colleagues were having the same problems. 

“We thought that we should build a company that organizes public web-data trails and makes them usable for investors,” Zhen says. 

Thinknum works like a business search engine, Zhen says. While traditional search engines collect everything online, Thinknum gathers specific information related to business activity. What data is collected depends on what the bot is programmed to pick up.

Thinknum uses crawlers—bots that scour websites looking for information—similar to what Google does. It then organizes and structures the underlying data to make it more easily digestible. Thinknum offers 30 different datasets on its platform, including job listings, car inventory, store locations, LinkedIn profiles, Twitter followers, restaurant menu pricing, and government contracts. 
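
To make the mechanics concrete, here is a minimal sketch of the kind of targeted crawler described above, gathering job listings from a careers page. The URL, CSS selectors, and page structure are invented for illustration; a real crawler would be tailored to each site (and would need to respect that site's terms of use).

```python
# Minimal sketch of a job-listings scraper. The URL and CSS selectors
# are hypothetical; real sites each have their own page structure.
import requests
from bs4 import BeautifulSoup

def scrape_job_listings(url: str) -> list[dict]:
    """Fetch a careers page and return structured job records."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    jobs = []
    # Assume each listing lives in a <div class="job-listing"> element.
    for listing in soup.select("div.job-listing"):
        jobs.append({
            "title": listing.select_one(".job-title").get_text(strip=True),
            "location": listing.select_one(".job-location").get_text(strip=True),
        })
    return jobs

print(scrape_job_listings("https://example.com/careers"))
```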

A user can view which roles a company is trying to fill each day, including each job’s title, type, and location. “I can see how many jobs the company has every single day; if I actually overlay the stock price, I can see that this data is predictive,” Zhen says. “When the company hires, the stock price goes up about six weeks later. When the company stops hiring, the stock price goes down about six weeks later.”
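
A rough way to test the six-week lag Zhen describes would be to correlate changes in a company’s open job count with its stock returns roughly 30 trading days later. The sketch below assumes a hypothetical daily dataset with those two columns; it is not Thinknum’s actual methodology.

```python
# Sketch of testing whether hiring activity leads the stock price by
# ~6 weeks (~30 trading days). 'jobs.csv' and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("jobs.csv", parse_dates=["date"], index_col="date")
# Assumed columns: 'open_jobs' (daily job-listing count), 'close' (stock price)

hiring_change = df["open_jobs"].pct_change(periods=5)        # weekly change in listings
future_return = df["close"].pct_change(periods=5).shift(-30)  # return ~6 weeks later

# A positive correlation would support the "hiring leads price" claim.
print(hiring_change.corr(future_return))
```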

Zhen likens the service to a Bloomberg terminal in that an investor could build their own, but wouldn’t bother when they could just subscribe to a better, ready-made service. “There is no reason why a fund would scrape 400,000 companies across 30 datasets,” he says.

DIY Scraping

However, many asset managers do have in-house web-scraping operations. Nick Jain, founder of Citizen Asset Management, has been scraping data himself rather than using a third-party provider. The kinds of web data that interest him include site traffic, browsing history, API calls, and social media analysis.

“I am actually not a technologist or programmer by background, [but] I think it took me five or six hours to go from knowing no code to being able to write code that can scrape data that I want,” he says. 

Jain has an MBA from Harvard, as well as a background in mathematics and theoretical physics. While not a technologist by trade, he does have the practical foundation to expand his skillset. For a large number of asset management firms, though, building a team with the skills necessary to extract value from such a massive universe of data might not be cost-effective. Not only do firms need to hire experts, but web scraping on a large scale also requires storing hundreds of gigabytes of data and running a mass of servers. Even Jain has to rely on third parties to help provide the infrastructure.

“If I wanted to scrape that sort of data, I have the coding skills to do so, but I don’t have the server farms that I would need to go do that,” says Jain. “I can rent them from Amazon or [another vendor], but that is the one limiting factor.”

There is another option for funds that don’t want to outsource this capability, but also don’t have the resources to do it all in-house.

YipitData launched a product called ReadyPipe, which is delivered via a software-as-a-service model. This allows users to scrape data themselves without worrying about the infrastructure and databases required. 

“We are starting to see investors try to collect their own web data by hiring an engineer or a technical data analyst to their team, which is why we developed ReadyPipe,” says Vacanti.

Others in the space, like Sequentum, collect the data, but then hand it to clients who want it raw so they can generate their own specialized reports. 

“As long as it’s in a machine-readable format, then [clients] are happy,” says Sarah McKenna, the vendor’s CEO. Sequentum can perform some transformation of the data, such as changing date and time formats, or converting currency to US dollars. Occasionally, smaller clients without engineering expertise will request some sentiment analysis or text analytics.
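
The transformations McKenna mentions are deliberately light. A sketch of what they might look like appears below; the field names and the fixed exchange rate are placeholders, since in practice the rate would come from an FX feed.

```python
# Sketch of light post-scrape transformation: normalizing timestamps
# to ISO 8601/UTC and converting prices to US dollars. Field names
# and the EUR/USD rate are placeholders for illustration.
from datetime import datetime, timezone

EUR_USD = 1.10  # in practice this would come from an FX feed

def normalize(record: dict) -> dict:
    ts = datetime.strptime(record["scraped_at"], "%d/%m/%Y %H:%M")
    return {
        "scraped_at": ts.replace(tzinfo=timezone.utc).isoformat(),
        "price_usd": round(record["price_eur"] * EUR_USD, 2),
    }

print(normalize({"scraped_at": "03/07/2019 14:30", "price_eur": 25.00}))
```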

Differing Returns

But the fact remains that no matter what help or solutions are out there, the data itself needs to be relevant to a fund’s particular investment strategy, and sometimes the effort and expense involved outweigh the benefits.

Neil Bond, head of trading at Ardevora Asset Management, says his firm did some work with web-scraped data a couple of years ago but ended up dropping the project. It took a lot of work and did not add much value to the firm’s alpha generation.

“We were looking for keywords in trading updates that were followed by unexpected outperformance or underperformance in results,” says Bond. “For example, if a CEO used the word ‘absolutely’ several times in a trading update, we could expect disappointing results. We no longer do this.”  
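
Mechanically, a keyword screen of this kind is simple: count how often flagged words appear in a trading update and raise a warning past some threshold. The word list and threshold below are illustrative, not Ardevora’s actual model.

```python
# Sketch of the keyword screen Bond describes: counting hedging words
# in a trading update as a crude warning flag. Word list and threshold
# are invented for illustration.
import re

HEDGE_WORDS = {"absolutely", "confident", "challenging"}

def keyword_flags(text: str, threshold: int = 3) -> dict:
    words = re.findall(r"[a-z']+", text.lower())
    counts = {w: words.count(w) for w in HEDGE_WORDS}
    return {w: n for w, n in counts.items() if n >= threshold}

update = "We are absolutely confident... absolutely... absolutely committed."
print(keyword_flags(update))  # {'absolutely': 3} -> potential red flag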

Social media can be a mess of unstructured, low-quality data, and firms have to know how to make it useful, says Citizen’s Jain.

“Some quant funds, more on the high-frequency side, monitor Twitter sentiment data to trade stocks on a minute-by-minute or hour-by-hour basis,” Jain says. “They notice Twitter sentiment is positive, so they go buy Microsoft stock, or vice versa. That is a relatively well-understood space and there are lots of quant funds doing that.”  
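
A toy version of the rule Jain describes might map rolling Twitter sentiment to buy and sell signals. In the sketch below, the per-minute sentiment scores are assumed to come from an upstream NLP model, and the window and thresholds are arbitrary.

```python
# Toy version of the sentiment rule Jain describes: buy when rolling
# Twitter sentiment turns positive, sell when negative. Scores are
# assumed to come from an upstream NLP model; thresholds are arbitrary.
import pandas as pd

def sentiment_signal(scores: pd.Series, window: int = 60) -> pd.Series:
    """Map per-minute sentiment scores in [-1, 1] to +1 (buy) / -1 (sell) / 0."""
    rolling = scores.rolling(window, min_periods=window).mean()
    return rolling.apply(lambda s: 1 if s > 0.1 else (-1 if s < -0.1 else 0))

# Hypothetical per-minute scores for tweets mentioning a stock:
scores = pd.Series([0.3, 0.2, 0.4, -0.1, 0.5] * 20)
print(sentiment_signal(scores, window=10).tail())
```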

Using Twitter data for longer-term investing is more challenging and there are fewer firms able to do that—indeed, there is a lot of skepticism about whether it can be done at all, says Jain. 

“I personally have figured out a few useful cases to do it, so I think it works,” he says. “But I don’t know whether it works generally.”

Thinknum’s Zhen says it is important to look beyond just one dataset to get a holistic view about a company. “Let’s say their Twitter followers are going up. But if their job listings are also going up, the product is becoming more expensive,” says Zhen. “If people are saying good things about management internally, all these things are good signs for the company. You want to paint a very complete picture about each company that you are looking at.” 
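
One plausible way to combine datasets into the “complete picture” Zhen describes is a composite score: standardize each signal across companies, then average them. The dataset names, companies, and equal weighting below are invented for illustration.

```python
# Sketch of a multi-dataset composite score per company. Dataset names,
# companies, and equal weights are invented for illustration.
import pandas as pd

signals = pd.DataFrame({
    "twitter_follower_growth": {"ACME": 0.12, "GLOBEX": -0.03},
    "job_listing_growth":      {"ACME": 0.08, "GLOBEX": 0.01},
    "employee_sentiment":      {"ACME": 0.65, "GLOBEX": 0.40},
})

# Z-score each signal across companies, then average into one score.
composite = ((signals - signals.mean()) / signals.std()).mean(axis=1)
print(composite.sort_values(ascending=False))
```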

Josh Sutton, CEO of artificial intelligence technology vendor Agorai, says that looking at how usage of a certain phrase increases or decreases in frequency can be interesting.  

“I think eventually you will start to see a few of the very technology-savvy hedge funds find ways to identify meaningful trading signals based on what is and isn’t said,” says Sutton. “I think from a natural language understanding point of view, we are still a ways away from that. I do think there is a window that is continually moving, which is the ability to trade off web-scraped data in a quant-driven type of model.”  

Regulatory and Privacy Hurdles

Another hindrance to wider adoption of these techniques is concern about ending up in court. 

“There are massive compliance issues,” says Jain. “Some people abide by the rules very clearly, and some people don’t.”

Jain says he works with little web-scraped data because so much of it is not compliant with rules and regulations. 

“Most third-party providers that I looked at, I just didn’t trust their compliance procedures. I think their goal was to scrape as much data and sell it, without respecting the terms and conditions of the place they were subscribing with. I just didn’t want that to ever be an issue for me,” he says.

Anyone collecting this type of data for investment purposes must read the small print on the website. Many websites have a standard clause that their information cannot be used for commercial purposes.
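
There is also a programmatic counterpart to reading the small print: checking a site’s robots.txt before crawling, which Python’s standard library supports directly. Note that robots.txt complements, but does not replace, reviewing a site’s terms of use; the URLs below are illustrative.

```python
# Checking a site's robots.txt before crawling. This complements, but
# does not replace, reading the site's terms of use. URLs are illustrative.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

url = "https://example.com/careers"
if parser.can_fetch("my-research-bot", url):
    print(f"robots.txt permits crawling {url}")
else:
    print(f"robots.txt disallows crawling {url}")
```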

A few years ago, Jonathan Streeter, a lawyer at Dechert LLP, began to notice a significant increase in queries about the legality of alternative data for investment strategies. “I think activity in the space picked up considerably about three or four years ago and a lot more investment managers got interested in it at that time,” he says. 

Sequentum’s McKenna says that when her firm contracts with funds, it has specific compliance procedures in place to make sure it is not subject to insider-trading accusations. It always respects Captcha tests—Turing tests that ask users to check all the boxes that contain photos of storefronts, cars or street signs before they are allowed to access a site.

“As long as we are getting data that is public, it is not behind fake accounts, not behind logins, it is readily available to basically anybody who is cruising the web, then that is not considered insider trading,” says McKenna.

She says the firm calculates the average daily volume of traffic on a site and limits its own visits to as little as 1% of that traffic.

“When the analysts say, ‘I want all the data, every hour,’ then we explain to them the goal is to get them reliable, high-quality data on a constant basis,” she says. “If we do a denial-of-service attack against the site, we are basically going to have to stop pulling data altogether.”
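
The throttling McKenna describes boils down to a request budget. The sketch below caps a crawler at 1% of a site’s estimated average daily traffic; the traffic figure is hypothetical, and a real crawler would also honor any crawl-delay directives.

```python
# Sketch of the throttling McKenna describes: capping requests at 1%
# of a site's estimated average daily traffic. The traffic figure is
# hypothetical; real crawlers would also honor robots.txt crawl delays.
import time

AVG_DAILY_REQUESTS = 1_000_000           # estimated site traffic (hypothetical)
BUDGET = int(AVG_DAILY_REQUESTS * 0.01)  # our cap: 1% of that traffic
SECONDS_PER_DAY = 86_400
delay = SECONDS_PER_DAY / BUDGET         # ~8.6 seconds between requests

def polite_fetch(urls):
    for url in urls:
        # fetch(url) would go here; sleeping keeps us under the 1% budget
        time.sleep(delay)
```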

While any data in the public domain could potentially be useful for investment purposes, it is usually not that simple. The quantity of data needed to glean meaningful insight can be huge. The information on websites is also constantly changing, and there are always newer and better tools to prevent users from scraping it. Even when scraping is successful, the data tends to be unstructured, with each website having its own schema and internal database.

But the fact remains that for buy-side firms struggling to find alpha, the greatest source of data is the internet—and there is no slowing this trend down.  
