Headless Crawling

Headless crawling is the automated browsing of the Internet and of individual domains using a headless browser, that is, a web browser without a graphical user interface. It covers many approaches and methods for extracting, storing, analyzing, and processing data. Websites, web apps, and individual web features can also be tested and verified automatically. Headless crawling overlaps thematically with topics such as information retrieval, data mining, scraping, and test automation.

General information

Until recently, Google recommended the use of headless browsers to crawl dynamic websites. Operators had to provide an HTML snapshot of their website so that Google could read and evaluate its content. This so-called AJAX crawling scheme has been deprecated and is no longer used. Instead, web content should be provided independently of the technology used, including the device, the browser, and the Internet connection, an approach known as progressive enhancement [1]. Headless crawling is essentially a part of any search engine: web content is navigated, but it is not rendered or displayed graphically to a user.

What happens to the collected data depends on the focus of the application. The Google search engine is believed to have used headless crawling since 2004, and JavaScript has no longer been an obstacle since October 2015. Search engines can use headless crawling to evaluate websites: to the extent that the crawler simulates a visit to a website without a graphical interface, search engines can draw conclusions from this information and rate websites based on their behavior in the headless browser [2].

How it works

At the center of headless crawling is the headless browser, a program that reads web content and either passes it to other programs or outputs it as text in the form of files, lists, and arrays. Such browsers access websites from within a server infrastructure; optionally, a virtual server or proxy server can be used. From there, the headless browser tries to access a URL. This is the starting point of the crawling procedure, which is started from the command line or by a script [3]. Depending on the configuration, the browser can discover further URLs. The content stored there can be processed, and even the positions of links within the website can be analyzed. However, an API that transfers the data to the processing program is often necessary for this purpose.
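
The crawl procedure described above (start at one URL, read the content, discover further URLs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the `fetch` callable stands in for the headless browser and is expected to return the HTML of a page, and the names `crawl`, `LinkExtractor`, and `max_pages` are hypothetical. Only Python's standard library is used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl beginning at start_url.

    fetch is an assumed callable: fetch(url) -> HTML string, or None on failure.
    Returns a dict mapping each fetched URL to the absolute links found on it.
    """
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        parser = LinkExtractor()
        parser.feed(html)
        links = []
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            links.append(absolute)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        results[url] = links
    return results
```

Because `fetch` is passed in, the same loop can be exercised with an in-memory dictionary of pages (for example `crawl("http://example.test/", pages.get)`) or wired to a real headless browser. The `max_pages` limit keeps the sketch from following links indefinitely.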

What makes headless crawling special is machine-to-machine (M2M) communication. Neither the URLs called nor the web content found is shown to an end user, as would be the case with a conventional browser. Instead, the headless browser returns the retrieved data in formats that must be defined beforehand but can then be processed automatically. If implemented extensively, a headless browser can work with different programming languages, scripts, and processes thanks to an API that communicates with other programs or infrastructures through HTTP or TCP requests. This principle is frequently used to extract large amounts of data, which ultimately raises the question of the extent to which it is legal to compile and process such data. In principle, copyrights, privacy agreements, and user privacy could be violated [4]. The same applies to price comparison portals, search engines, and meta-search providers.
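
To illustrate what a format "defined beforehand" might look like, the following sketch serializes crawl results as JSON Lines (one JSON object per line) so that another program can consume them. The `PageRecord` schema and the function names are assumptions made for this example; the source does not prescribe any particular format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PageRecord:
    """One crawled page in a fixed, machine-readable shape (hypothetical schema)."""
    url: str
    status: int
    title: str
    links: list = field(default_factory=list)


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def from_json_lines(text):
    """Parse JSON Lines back into PageRecord objects, ignoring blank lines."""
    return [
        PageRecord(**json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]
```

A fixed schema like this lets the crawler and the processing program evolve independently, as long as both agree on the field names.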

Practical relevance

Headless crawling is applied not only in search engines but also in other use cases. Two examples:

  • Test automation: Testing websites, website elements, and functions is a common use of headless crawling. Broken links, redirects, interactive elements, individual components (units), and modules can be checked for correct functioning. Performance characteristics and the generation of website content from databases can be tested as well. With an extensive implementation, websites can be tested relatively completely and, above all, automatically. In this way, test scenarios that use headless crawling go far beyond merely testing a system for crashes, system errors, and unwanted behavior. Headless crawling tests are similar to acceptance tests because the headless browser can simulate user behavior on a website, for example by clicking links [5]. However, in-depth programming and scripting skills are required for this scenario. Since testing is performed at the customer's request or on a chosen test object whose rights belong to the site owner, test automation with headless crawling is generally not objectionable. Well-known headless browsers and frameworks (with an API, programming language support, or DOM handling) are Selenium, PhantomJS, and HtmlUnit. Headless browsers generally use a layout engine of the kind that is also built into conventional browsers and search engine crawlers. Examples of layout engines are WebKit, Gecko, and Trident.
  • Web scraping: Scraping is a crawling technique in which data is extracted and stored for later use. Sometimes large amounts of data are collected, read, and processed from one or more sources. Scraping can be harmful and is classified as a black-hat or cracker technique in many usage scenarios. Denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks use the principle of headless crawling to access a website or web application [6]. Such attacks usually involve illegal methods, for example hiding the IP address (IP spoofing) to distract from the actual attack on the network, or infiltrating the TCP communication between the server and its clients (TCP hijacking).
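
As a small illustration of the test automation use case, the check for broken links and redirects mentioned above could be sketched as follows. The `get_status` callable is an assumption: it stands in for whatever the headless browser or HTTP client reports as the status code of a URL. The names `classify_link` and `check_links` are hypothetical.

```python
def classify_link(status):
    """Map an HTTP status code to a coarse test verdict."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    return "broken"  # 4xx, 5xx, and anything unexpected


def check_links(urls, get_status):
    """Group URLs by verdict.

    get_status is an assumed callable: get_status(url) -> int status code,
    or None when the request failed entirely.
    """
    report = {"ok": [], "redirect": [], "broken": []}
    for url in urls:
        status = get_status(url)
        verdict = "broken" if status is None else classify_link(status)
        report[verdict].append(url)
    return report
```

In an automated test run, a non-empty `report["broken"]` list would typically fail the test, while redirects might only be logged for review.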

Relevance for search engine optimization

Headless crawling is an important aspect of SEO. As already mentioned, the principle is (most likely) used by various search engines to crawl websites and web apps, even though the AJAX crawling scheme is deprecated. At various points in its Quality Guidelines, Google recommends using a text-based browser such as Lynx to view websites as Google sees them. It can be assumed that the capabilities of Google and other search engines go well beyond those of text-based browsers and beyond what is officially communicated. Consequently, it makes sense to learn about headless crawling in detail: with this principle, websites can be tested thoroughly, and SEOs can venture a look behind the scenes of the search engine operators without losing sight of users.

Web Links