Ethics in Web Scraping: principles to develop good practices

Web Scraping Ethics Practices

Exploiting large volumes of data brings great responsibility. Knowing the principles behind good practice is essential for data scientists, developers, and website owners alike.

Whether Web Scraping is ethical, and what ethical Web Scraping would look like, is a fundamental question in this regard. Even where Web Scraping is legal, that does not necessarily make it ethical. People do unethical things within the limits of the law all the time, and Web Scraping can fall into that category.


Here are some principles that help make Web Scraping an ethical activity:

Ethical Principles of Web Scraping

1- Do not overload the destination website

These principles can be summarized in one general rule: do no harm. One way Web Scraping can harm a business or website is by sending requests at an unreasonable rate. Don’t let your scraper be mistaken for a DDoS attack. Do some research to find out how much load the target website can handle, so that your scraping doesn’t cause functionality issues. There is a big difference between scraping a huge website like Google or Amazon and scraping a small local business site.

Websites that are not used to high traffic may not be able to cope with many requests sent by bots. Sending too many requests can skew the company’s user statistics and cause the website to slow down or even crash. So pace your requests according to what the website can handle, space them out, and consider running your scraping tasks at off-peak times.
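As a rough illustration of spacing requests out and preferring off-peak hours, here is a minimal Python sketch. The names PoliteThrottle and is_off_peak, the 1:00 to 5:00 quiet window, and the delay you choose are all assumptions for the example; the right values depend on the site you are scraping.

```python
import time
from datetime import datetime, time as dtime


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval_s has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def is_off_peak(now, start=dtime(1, 0), end=dtime(5, 0)):
    """Return True when `now` falls inside the assumed quiet window [start, end)."""
    return start <= now.time() < end
```

A scraper would call throttle.wait() right before every request, and could check is_off_peak(datetime.now()) before starting a large job. A fixed delay is the simplest option; slowing down further when the site starts responding slowly or with errors would be a reasonable extension.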


2- Respect the creators of the data that is collected

Everyone is generating data, and some of it is personal information, like contact details. If the data is public, scraping is usually not a problem. However, information that requires a login to access is generally not public. Using or sharing such data for commercial purposes without permission could be a legal violation. Be sure to check if it is legal to extract that data before proceeding.

It is worth reading the terms of use. If the website outright prohibits automated web scraping, consider contacting the webmaster to explain why you want to extract their data and ask for permission. And remember that the data you collect is not yours. Consider what you are using the information for and keep only the data necessary for your purpose. Ask yourself whether what you plan to do with the data will add value. Whenever possible and relevant, credit your sources if you share the data you’ve collected.
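Keeping only the necessary data and crediting the source can be built into the scraper itself. The following Python sketch is one possible way to do it; the function name minimize_record and the field names in the usage note are made up for the example.

```python
def minimize_record(raw, needed_fields, source_url):
    """Keep only the fields the project needs and note where they came from.

    `raw` is a dict of everything the scraper extracted from one page.
    Anything not listed in `needed_fields` (for example contact details)
    is dropped before the record is stored.
    """
    kept = {key: raw[key] for key in needed_fields if key in raw}
    kept["source"] = source_url  # lets you credit the source if you share the data
    return kept
```

For instance, calling minimize_record({"name": "Cafe", "price": "3.50", "email": "owner@example.com"}, ["name", "price"], "https://example.com/menu") stores the name, the price, and the source URL, and discards the email address.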


3- Promote an open web

The “open web” is a broad term that can include everything from technical concepts like open source code and standards to democratic ideas like freedom of expression and digital inclusion. But the idea that connects them is that the web was created by and for users, not by large corporations, governments, and select gatekeepers. The web is for everyone, and the information it contains (with some exceptions such as intellectual property) should not be the exclusive property of companies and institutions. If you are a website owner, you must accept that web scraping is a reality of the open web.


4- Do not monopolize the data

It is unethical to acquire user-produced data and claim it as one’s own property. It is not uncommon for large companies to prohibit others from scraping their websites for data that the companies themselves acquired by scraping. Respect for the creators of the data applies both to website owners and to those who want to extract data from their sites.


5- Do not block scrapers without a good reason

If you respect the open web and recognize that the data you hold is not your exclusive property, don’t block web scrapers unless you have a good reason to do so. One reason could be the need to protect user privacy when people try to extract personal data for unethical purposes. Another could be that large-scale scraping is skewing your traffic statistics. In such cases, consider blocking the scraper temporarily before banning it permanently. If a scraper identifies itself through its user agent string, or a developer asks for permission to extract your data, reply and get in touch with them. If you need to block them, explain why.
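For website owners, the idea of blocking temporarily before banning permanently, and always giving a reason, could look something like the Python sketch below. The BlockList class, the one-hour block, and the three-strike threshold are assumptions chosen for the example, not a recommendation for any particular site.

```python
from dataclasses import dataclass, field


@dataclass
class BlockList:
    """Escalate from temporary blocks to a permanent ban (hypothetical helper)."""

    temp_block_s: float = 3600.0   # assumed length of a temporary block, in seconds
    strikes_before_ban: int = 3    # assumed number of incidents before a permanent ban
    _strikes: dict = field(default_factory=dict)
    _blocked_until: dict = field(default_factory=dict)
    _banned: dict = field(default_factory=dict)  # client -> reason for the ban

    def report_abuse(self, client, now, reason):
        """Record one incident and return a message explaining the action taken."""
        strikes = self._strikes.get(client, 0) + 1
        self._strikes[client] = strikes
        if strikes >= self.strikes_before_ban:
            self._banned[client] = reason
            return f"banned: {reason}"
        self._blocked_until[client] = now + self.temp_block_s
        return f"blocked for {self.temp_block_s:.0f}s: {reason}"

    def is_allowed(self, client, now):
        """Return True unless the client is banned or still inside a temporary block."""
        if client in self._banned:
            return False
        return now >= self._blocked_until.get(client, 0.0)
```

The message returned by report_abuse is meant to be passed on to the scraper’s developer, for example in the response body or by email if the user agent string includes contact details, so that the block comes with an explanation.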


And finally, don’t be greedy. Data is information, information is power, and with great power comes great responsibility. Don’t try to keep all the power for yourself, but be sure to use and share data responsibly.