Scraping is a practice in which the content of websites is extracted, copied and stored, either manually or with the help of software, and, where applicable, reused in modified form on one's own website. Used legitimately, web scraping offers the opportunity to add value to a website with content from other sites. Misused, however, scraping violates copyright and is considered spam.
Techniques
Scraping can be done with different techniques. The most common are briefly described below:
- Using HTTP manipulation, the content of static or dynamic web pages can be copied via HTTP requests (see the sketch after this list).
- With data mining, the different pieces of content are identified by the templates and scripts in which they are embedded. The content is converted using a wrapper and made available to a different website; the wrapper acts as a kind of interface between the two systems.
- Scraping tools perform various tasks, both automated and manually controlled, ranging from copying content to copying structures or functionality.
- HTML parsers, like those used in browsers, retrieve data from other websites and convert it for other purposes.
- Manually copying content is also often called scraping, from the simple copying of texts to the copying of complete snippets of source code. Manual scraping is commonly used when automated scraping programs fail or are blocked, for example by the robots.txt file.
- Scanning microformats is also a form of scraping. With the ongoing development of the semantic web, microformats have become popular components of many websites.
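To illustrate the first of these techniques in combination with HTML parsing, here is a minimal sketch in Python. The URL and the CSS selector are placeholders, and the `requests` and `beautifulsoup4` packages are assumed to be installed:

```python
# A minimal scraping sketch: fetch a page via an HTTP request and parse
# its HTML to extract content. URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"  # hypothetical target page

# HTTP manipulation: retrieve the raw HTML of a (static) page.
response = requests.get(url, timeout=10)
response.raise_for_status()

# HTML parsing: locate the embedded content within its surrounding template.
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.select("h2.article-title")]

# Output the extracted content.
for headline in headlines:
    print(headline)
```

The final loop stands in for the wrapper step described above: the extracted data could just as well be written out as JSON or an RSS feed for reuse on another system.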
Common applications
Scraping is used for many purposes. Some examples are:
- Web analysis tools: they record rankings in Google and other search engines and prepare the data for their customers. In 2012, this topic was hotly debated when Google blocked some of these services.
- RSS services: content provided through RSS feeds is used on other websites.
- Meteorological data: many websites, such as travel portals, use weather data from large weather websites to extend their own functionality.
- Transit and flight schedules: for example, Google uses relevant data from public transport services to supplement the route-planning function of Google Maps.
Scraping as a spam method
In the context of content syndication, website content can be distributed to other publishers under agreed terms. Scraping, however, usually violates these terms. There are websites whose only content has been scraped from other sites. Very often you can find pages containing information copied directly from Wikipedia without citing the source. Another case of spam scraping is online stores copying product descriptions from successful competitors, often even keeping the same formatting.
It is essential that webmasters know whether their content is being copied by other websites, because in the extreme case Google may take the original author for the scraper, which can cause the scraped domain's rankings in the SERPs to drop. Alerts can be configured in Google Alerts to monitor whether content is being copied by other websites.
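Beyond alerts, a simple in-house check is possible. The sketch below is hypothetical: it assumes you already have a list of suspect URLs (for example from search results or referrer logs) and a distinctive sentence from your own content to use as a fingerprint:

```python
# A minimal monitoring sketch: fetch each candidate page and check
# whether a distinctive sentence from your own site appears in it.
import requests

# Placeholder fingerprint and URL list; substitute your own values.
FINGERPRINT = "a distinctive sentence copied verbatim from your own article"
candidate_urls = [
    "https://example.org/suspect-page",
]

for url in candidate_urls:
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue  # skip unreachable pages
    if FINGERPRINT.lower() in html.lower():
        print(f"Possible scraped copy found: {url}")
```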
Google as a scraper
Search engines like Google use scraping to enrich their own content with relevant information from other sources. In particular, Google uses scraping methods for its OneBox results or to build its Knowledge Graph. Google also scrapes the web to add entries to Google Maps that have not yet been claimed by businesses. At the same time, Google collects relevant data from websites that have marked up their content with microformats in order to create rich snippets.
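The markup that enables such rich snippets can itself be scanned with a few lines of code. The following minimal sketch uses schema.org microdata, one common flavor of this semantic markup, with an illustrative HTML fragment and the `beautifulsoup4` package:

```python
# A minimal sketch of microformat/microdata scanning: extract schema.org
# itemprop values from a page's HTML. The sample HTML is illustrative.
from bs4 import BeautifulSoup

html = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Espresso Machine</span>
  <span itemprop="price">199.00</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect every tag that carries an itemprop attribute.
data = {tag["itemprop"]: tag.get_text(strip=True)
        for tag in soup.find_all(attrs={"itemprop": True})}
print(data)  # {'name': 'Espresso Machine', 'price': '199.00'}
```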
How to prevent scraping
There are several simple measures that webmasters can use to prevent their websites from being affected by scraping:
- Blocking bots with robots.txt (see the sketch after this list).
- Inserting CAPTCHA challenges on the site.
- Using CSS to display phone numbers or email addresses, so they cannot be read directly from the HTML source.
- Enforcing firewall rules on the server.
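As a starting point for the first measure, a minimal robots.txt sketch follows; the bot name is a placeholder, and it should be kept in mind that honoring robots.txt is voluntary, which is why the other measures remain useful:

```
# robots.txt - placeholder bot name; compliance by crawlers is voluntary
User-agent: BadScraperBot
Disallow: /

User-agent: *
Disallow:
```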