Googlebot

Googlebot is Google's web crawler. It collects documents from the web and makes them available to Google Search. Documents are collected through an automated procedure that works much like a web browser: the bot sends a request to a server and receives a response.

If the site's settings allow Googlebot access, it downloads a single web page, which is addressed by a URL, and initially stores it in the Google index. In this way Googlebot crawls the web using distributed resources. Its computing power is spread across a large system of data centers, so it can crawl hundreds of websites simultaneously.

General information

Google's crawl technology is essentially an algorithm that works autonomously. It is based on the structure of the World Wide Web (WWW), which can be thought of as a very large network of documents (nodes) connected by hyperlinks (links).

Mathematically, this structure can be described as a graph. Each node is reachable through a web address, the URL. Links on a page lead to subpages or to other resources with a different URL or domain. The crawler therefore distinguishes between HREF links (the connections) and SRC links (the resources). How quickly and efficiently a crawler can traverse the entire graph is a question studied in graph theory.
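The graph view can be illustrated with a minimal sketch. It is not Google's implementation: the in-memory PAGES dictionary, the LinkExtractor class and the crawl function are hypothetical names chosen for illustration, and breadth-first traversal is only one possible strategy. The sketch follows HREF links as edges of the graph and records SRC links as resources.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects HREF links (edges to follow) and SRC links (embedded resources)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.href_links = []
        self.src_links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "href":
                self.href_links.append(urljoin(self.base_url, value))
            elif name == "src":
                self.src_links.append(urljoin(self.base_url, value))


# Hypothetical in-memory "web": URL -> HTML body, so no network access is needed.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a> <img src="/logo.png">',
    "https://example.com/a": '<a href="/b">B</a> <script src="/app.js"></script>',
    "https://example.com/b": '<a href="/">Home</a>',
}


def crawl(seed):
    """Breadth-first traversal of the link graph, starting at the seed URL."""
    visited, resources = [], set()
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:  # URL not present in this toy web
            continue
        visited.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        resources.update(parser.src_links)
        for link in parser.href_links:
            if link not in seen:  # each node is queued at most once
                seen.add(link)
                queue.append(link)
    return visited, resources


visited, resources = crawl("https://example.com/")
print(visited)
print(sorted(resources))
```

The seen set is what keeps the traversal from looping when pages link back to each other, as the home page and /b do here.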

Google uses several techniques. One is multi-threading, the simultaneous processing of several crawl processes. In addition, Google uses focused crawlers that concentrate on restricted subject areas, for example searching the web for certain types of links, websites or content. Google has a bot for crawling images, one for search advertising, and another for mobile devices.
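As a rough sketch of the multi-threading idea, several crawl tasks can be processed at once with a thread pool. The fetch function below is a stand-in that only sleeps to imitate network latency, and the URLs and the worker count are assumptions made for illustration. Nothing here reflects how Google actually schedules its crawlers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for an HTTP request; sleeps briefly to imitate network latency."""
    time.sleep(0.05)
    return url, len(url)  # pretend the "document size" is the URL length


# Hypothetical URLs to crawl.
URLS = [f"https://example.com/page{i}" for i in range(8)]


def crawl_concurrently(urls, workers=4):
    """Process several fetches at the same time using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))


results = crawl_concurrently(URLS)
print(len(results), "pages fetched")
```

Because the simulated work is waiting rather than computing, the threads overlap their waiting time, which is the same reason threading helps with real network requests.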

Practical application

Webmasters and site operators have several options for providing information about their sites to the crawler, or for withholding it. Each crawler identifies itself through its user agent. In server log files, Googlebot appears under the name "Googlebot" with the host address "googlebot.com".^[1]
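A simple check on log data might combine both pieces of information. The following is a minimal sketch: the function name looks_like_googlebot and the sample hostnames are hypothetical, and the hostname is assumed to have been obtained already through a reverse DNS lookup of the requesting IP address. A user agent string can be forged, so a thorough check would also confirm that the hostname resolves back to the same IP address.

```python
def looks_like_googlebot(user_agent, reverse_dns_host):
    """Return True if the user agent names Googlebot and the host is under googlebot.com."""
    ua_ok = "googlebot" in user_agent.lower()
    host = reverse_dns_host.lower().rstrip(".")
    # Require an exact match or a real subdomain, so "fakegooglebot.com" is rejected.
    host_ok = host == "googlebot.com" or host.endswith(".googlebot.com")
    return ua_ok and host_ok


# Illustrative values, not taken from a real log file.
UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(looks_like_googlebot(UA, "crawl-66-249-66-1.googlebot.com"))
print(looks_like_googlebot(UA, "fakegooglebot.com"))
```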

For the Bing search engine, the name is "BingBot" and the address is "bing.com/bingbot.htm". The log files reveal who is sending requests to the server. Webmasters can deny or grant access to individual bots. This is done through the robots.txt file, using the Disallow: directive, or with certain meta tags in an HTML document. By adding a meta tag to a web page, the webmaster can grant Googlebot limited access to the site's data as needed. Such a meta tag addresses the crawler by name, for example <meta name="googlebot" content="nofollow">.
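How Disallow rules in a robots.txt file are evaluated can be sketched with Python's standard urllib.robotparser module. The robots.txt content and the example.com URLs below are assumptions made for illustration: Googlebot is kept out of one directory only, while all other bots are denied completely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything except /private/,
# every other bot is denied access to the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network request

googlebot_public = parser.can_fetch("Googlebot", "https://example.com/index.html")
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/data.html")
bingbot_public = parser.can_fetch("bingbot", "https://example.com/index.html")

print(googlebot_public, googlebot_private, bingbot_public)  # True False False
```

This only models the rules as a well-behaved crawler would read them. robots.txt is a request to crawlers and does not by itself block access at the server.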

You can also influence how often Googlebot crawls a website. This is normally done in the Google Search Console. It is especially recommended when the crawler degrades server performance, or when the site is updated frequently and therefore needs to be crawled frequently. It is also necessary to know how many pages of the website are to be crawled, since this number is essential for estimating the crawl budget.

Relevance for SEO

For search engine optimization, it is important to know how Googlebot works, not only in theory but above all in practice. It is recommended to provide new URLs to the crawler (seeding), that is, to give the bot an address as a starting URL. Because the bot discovers additional content and links by following links on other websites, an HREF link to the new URL from a resource that is already crawled helps ensure that the bot receives it.

You basically ping the WWW. Sooner or later, Googlebot will come across the address. In addition, it is recommended to provide sitemaps to the bot. A sitemap gives it important information about the structure of your website and tells it which URLs to follow next. This is particularly useful when a website has been relaunched.
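A sitemap is an XML file that lists the URLs of a site. The sketch below builds a small one with Python's standard library, using the namespace of the sitemaps.org protocol. The build_sitemap function, the URLs and the dates are hypothetical examples, and a real sitemap would normally be written to a file and submitted, for instance, through the Google Search Console.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap XML string from (URL, last-modified date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages of a relaunched site.
sitemap_xml = build_sitemap([
    ("https://example.com/", "2024-01-15"),
    ("https://example.com/products", "2024-01-10"),
])
print(sitemap_xml)
```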

Since Googlebot can read different types of content, not just text or images, you should pay attention to how the site is developed. Google has been working for several years on reading Flash content, dynamic web pages, JavaScript and Ajax code, and is already partially successful in these areas.^[2] Googlebot can already identify certain methods such as GET and POST, and parts of Flash content can also be read.^[3]