What Are Data Scraping and Web Scraping?
Data scraping and web scraping are two different automated techniques that achieve the same end. They harvest data from systems owned by third parties. They extract the data, collate it, and store it in ways that facilitate its reuse. Typically this means putting it into a database or into a portable format like CSV.
Data scraping makes use of APIs provided by the platform that is being scraped, even though the terms of use of the API almost certainly prohibit the gathering of data en masse.
Web scraping works by making requests for web pages just like a web browser does. But instead of displaying the webpage, the software extracts the data it is interested in, saves it, and requests another page. The terms and conditions of most websites and certainly all social media platforms prohibit data and web scraping. Despite this, the user numbers associated with social media platforms make them attractive targets for scrapers.
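The extract-and-store loop described above can be sketched with Python's standard library alone. This is a minimal illustration, not a real scraper: the HTML in SAMPLE_PAGE, the class names "profile", "name", and "city", and the helper names ProfileParser and page_to_csv are all invented for the example, and the page is an in-memory string rather than something fetched from a live site.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would fetch this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <div class="profile"><span class="name">Ada Example</span><span class="city">London</span></div>
  <div class="profile"><span class="name">Bob Sample</span><span class="city">Leeds</span></div>
</body></html>
"""


class ProfileParser(HTMLParser):
    """Collects the text of name and city spans inside each profile div."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per profile found
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "profile":
            self.rows.append({})
        elif tag == "span" and attrs.get("class") in ("name", "city") and self.rows:
            self._field = attrs["class"]

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field:
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def page_to_csv(html):
    """Extract the profile rows from a page and return them as CSV text."""
    parser = ProfileParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "city"])
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()


print(page_to_csv(SAMPLE_PAGE))
```

The page is never rendered. The parser ignores everything except the fields it was told to look for, which is why scraping is so cheap to run at scale.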
Scraping can be performed by cybercriminals who want to collect login credentials, payment details, or personally identifiable information. It can also be used for legitimate reasons such as aggregating news stories, monitoring your resellers to see that they don’t break pricing agreements, or performing market analysis. It’s also used for collecting business intelligence, locating sales leads, and underpinning marketing and advertising.
Big Numbers – Scraping and Cybercrime
In 2020, the number of personal records scraped from YouTube was 4 million. The figure for TikTok was over ten times higher, at 42 million. That same year, 191 million personal records were scraped from Instagram. All of these platforms prohibit the scraping of data.
In April 2021, LinkedIn hit the headlines when a database of 500 million personal records was put up for sale on the dark web. Microsoft, which owns LinkedIn, said there had been no security breach. The database was the result of data scraping.
The database contained each affected member’s:
Real name
Gender
LinkedIn profile URLs
Registered email addresses
Landline and smartphone numbers
Physical addresses
Geolocation details
Usernames for other social media accounts
In June 2021, a second database of 700 million personal records appeared. That’s over 90 percent of LinkedIn’s membership. As well as containing an extra 200 million records, the second database is cross-referenced to data scraped from other sources, providing a more detailed picture of the affected individuals.
Created by cybercriminals for cybercriminals, the database can be bought on dark web marketplaces and forums, for $5,000 at the time of writing. The information it contains will be used for crimes such as phishing attacks, spear-phishing attacks, social engineering attacks, and other financial frauds.
Commercial Scraping Is Problematic Too
What about the commercial web and data scraping that takes place? There are companies you can engage with who will scrape data for you. You can use data parsing toolkits such as the freely available Beautiful Soup Python library to create your own web scraping applications.
The problem is, you’re still almost certainly violating the rules of the platform you’re scraping. And the platforms will try to defend themselves. If they don’t, their members, customers, or other users are liable to leave their platform.
When you choose to provide personal data to an online service, you’re entrusting that organization with your data. You’re not giving permission for anyone else to come and hoover up that data and use it as they see fit. When organizations scrape your data you don’t know who they are, what they’re going to do with the data, how they’re going to safeguard and protect it, nor who they are going to share it with.
LinkedIn and hiQ Labs Inc. went to court over hiQ’s data and web scraping. hiQ argued that the data it was scraping from LinkedIn was in the public domain and that meant it was up for grabs. In 2019, the US Court of Appeals for the Ninth Circuit ruled in hiQ’s favor. But on June 14, 2021, the Supreme Court vacated the Ninth Circuit’s decision. As of July 2021, data scraping and web scraping for non-criminal purposes is in a legal gray area.
And things get more complicated when you take into account the data legislation that applies to the members of the platform. For example, whether an EU citizen’s data is in the public domain or not, you can’t harvest it, store it, and process it digitally without a lawful basis—as defined by the GDPR—for doing so. Also, there’s a difference between publicly visible and in the public domain.
Under the GDPR there are only two lawful bases that could conceivably apply to scraping data. One is “consent” and the other is “legitimate interest.” Plainly, consent has not been given by the individuals, so that’s off the table. And it would be extremely difficult to argue that you had a legitimate interest in scraping the data that didn’t trample on the legitimate interests of the data subjects, and their data privacy rights and freedoms. The GDPR demands that you uphold those rights and freedoms and not ride roughshod over them.
The GDPR protects the data privacy rights of EU citizens regardless of where the processing is taking place. An organization in the U.S. that is scraping data from another U.S.-based organization must still comply with the GDPR if personally identifiable information of EU citizens is in the data being scraped.
Data protection legislation from other regions adopts the same stance, with some small variances. The legality of scraping is tenuous, to say the least. We’re likely to see more formal challenges.
How To Protect Your Organization
There are steps and measures that you can put in place to make life more difficult for the data scrapers.
Terms of Use and Conditions
Although Terms and Conditions and Terms of Use won’t do anything to stop cybercriminals and might not even stop “legitimate” scraping, it still makes sense to explicitly prohibit the gathering, processing, storing, or sharing of any data including but not limited to personally identifiable data.
It might stop some people from scraping. If it does, that was an easy win. Even if it doesn’t, it’ll give you a legal advantage if matters need to be resolved in court.
Disable Hotlinking
Displaying images and other media on one website by linking back to the original website is called hotlinking. It uses the original website’s bandwidth and other resources to serve the media.
Web scraping usually retrieves images directly and so disabling hotlinking won’t affect their scraping activities. But, if any scraping takes place that relies on hotlinking, it at least prevents insult from being added to injury. They won’t be pinching even more bandwidth when your stolen data is being viewed.
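Hotlink protection is usually switched on in the web server or CDN configuration, but the underlying check is simple enough to sketch. The example below assumes the decision is based on the HTTP Referer header. The hostnames in ALLOWED_HOSTS and the function name is_hotlink are placeholders invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical hostnames for the site that owns the media.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_hotlink(referer):
    """Return True if a media request was referred by a page on another site."""
    if not referer:
        # Some browsers and privacy tools omit the header entirely.
        # This sketch lets those requests through rather than break images.
        return False
    host = urlparse(referer).hostname
    return host not in ALLOWED_HOSTS


print(is_hotlink("https://www.example.com/gallery"))
print(is_hotlink("https://another-site.test/stolen-page"))
```

Because the Referer header is supplied by the client, it can be forged or left out. That is one reason this measure does little against a scraper that downloads the images directly.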
Use CSRF Tokens
The automated systems that do the scraping make successive HTTPS requests to your website. They crawl from page to page, following links. They also create URLs to try. If they spot a pattern—such as URLs that differ by a single digit—the software works its way through the predictable combinations until the sequence fails.
Introducing Cross-Site Request Forgery tokens to your website can fox all but the smartest of scraping software. A CSRF token is a unique identifier sent from the webserver to the client making the request. Under normal circumstances, this would be a browser.
The client must send the CSRF token back to the server when it makes its next request. The server will not respond to any requests that don’t include the correct CSRF token. Most web scraping software cannot handle CSRF tokens, so this is an effective measure to limit your exposure.
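The issue-and-check cycle can be sketched as follows. This is an illustration under stated assumptions, not a production implementation: the in-memory dictionary stands in for whatever session store your web framework provides, and the names issue_token and is_valid_request are invented for the example. Most web frameworks ship their own CSRF middleware, which is normally the better choice.

```python
import hmac
import secrets

# Hypothetical in-memory store mapping a session ID to its current token.
_session_tokens = {}


def issue_token(session_id):
    """Create an unpredictable token and remember it for this session."""
    token = secrets.token_urlsafe(32)
    _session_tokens[session_id] = token
    return token


def is_valid_request(session_id, submitted_token):
    """Accept a request only if it carries the token issued to its session."""
    expected = _session_tokens.get(session_id)
    if expected is None or submitted_token is None:
        return False
    # compare_digest runs in constant time, so response timing
    # does not reveal how much of a guessed token was correct.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())


token = issue_token("session-1")
print(is_valid_request("session-1", token))
print(is_valid_request("session-1", "guessed-value"))
```

A client that constructs URLs on its own, without first loading the page that carries the token, has nothing valid to send back and is refused.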
Rate Limit Page Requests
Rate limiting sets thresholds on the number of requests that can be made from a client within a given period of time. Typically this is done by IP address, with restrictions on how many page requests or downloads can be made per second.
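One common way to implement this is a sliding window kept per IP address. The sketch below assumes a single server process holding its counters in memory. The class name RateLimiter and its parameters are invented for illustration, and a real deployment would more likely use the rate limiting built into a web server, API gateway, or CDN.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allows at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                # injectable so it can be tested
        self._hits = defaultdict(deque)   # client IP -> recent request times

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = RateLimiter(max_requests=5, window_seconds=1.0)
results = [limiter.allow("203.0.113.7") for _ in range(7)]
print(results)
```

A server would call allow() for each incoming request and answer with HTTP status 429 (Too Many Requests) when it returns False. Scrapers that rotate through many IP addresses can stay under a per-address threshold, so this works best alongside the other measures.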
Use Dedicated Anti-Scraping Software
Commercial packages are available that will detect scraping activity and block it. They use techniques that far surpass simply identifying a client by its IP address. They use machine learning techniques to identify bot activity by measuring actions such as the speed at which the client fills in fields and forms, the way the mouse moves across the page, and the way the client moves through the website. Any activity that looks non-human is blocked.
Require Human Interaction
Forcing clients to create an account and using CAPTCHA or other challenge-response tests can help in rejecting automatic scrapers.
Make Your APIs Tight-Lipped
Secure your APIs, and limit their capabilities so that they return the minimum amount of data to satisfy the API call they’re servicing.
It’s appealing to developers to provide data-rich APIs, and to over-provide rather than under-provide. This places the responsibility on the client to parse out the information they want and to reject the rest. It reduces the chance of rework being required because the API didn’t provide a particular piece of information. But that verbosity plays into the scrapers’ hands.
Instead, make your APIs lean and mean. Provide what was asked for, and no more. You can rate limit API clients, too.
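A lean response can be enforced with an allowlist, so that fields are excluded by default rather than included by default. The sketch below is an assumption-laden illustration: the field names, the PUBLIC_FIELDS set, and the lean_response function are all invented for the example.

```python
# Hypothetical allowlist of fields this endpoint is permitted to expose.
PUBLIC_FIELDS = {"id", "display_name"}


def lean_response(record, requested_fields):
    """Return only the fields that were both requested and allowlisted."""
    allowed = PUBLIC_FIELDS & set(requested_fields)
    return {key: record[key] for key in allowed if key in record}


record = {
    "id": 42,
    "display_name": "Ada",
    "email": "ada@example.com",
    "phone": "555-0100",
}

# The caller asks for email, but it is not on the allowlist.
print(lean_response(record, ["display_name", "email"]))
```

With this shape, a new column added to the underlying record stays private until someone deliberately adds it to the allowlist.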
Use Decoy Links
Hidden links on a webpage will be invisible to genuine users, but web scraping software will find and follow all links. If a client follows a hidden link, it is likely an automated process. You can then block it.
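The server-side half of a decoy link might look like the sketch below. It assumes clients are identified by some stable value such as an IP address. The decoy path, the in-memory block list, and the handle_request function are all hypothetical names made up for the example.

```python
# Hypothetical URL that only appears in a link hidden from human visitors.
DECOY_PATHS = {"/archive/full-listing"}

# Clients that have followed a decoy link. In-memory for illustration only.
blocked_clients = set()


def handle_request(client_id, path):
    """Return an HTTP status code, blocking any client that visits a decoy."""
    if client_id in blocked_clients:
        return 403
    if path in DECOY_PATHS:
        blocked_clients.add(client_id)
        return 403
    return 200


print(handle_request("198.51.100.9", "/products"))
print(handle_request("198.51.100.9", "/archive/full-listing"))
print(handle_request("198.51.100.9", "/products"))
```

Search engine crawlers follow links too. Listing the decoy path as disallowed in robots.txt is one way to avoid blocking well-behaved crawlers that honor it, while still catching software that ignores it.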
Time Will Tell
Cybercriminals, by definition, don’t care about the law. Commercial operations don’t have a choice. If the hiQ v. LinkedIn case establishes a legal precedent that scraping violates the Computer Fraud and Abuse Act, it’ll only affect “commercial” scraping. Data scraping by cybercriminals will continue.
So whatever the outcome, you’ll still need to protect your organization.