The term duplicate content (also: duplicated content) comes from SEO. Duplicate content is created when the same content can be accessed and indexed under different URLs. Indexing websites with duplicate content can have a negative effect on their ranking in the SERPs.
Types of duplicate content
Duplicate content can arise if:
- Content is illegally syndicated, sold or copied, so that different websites use the same content. In this case, the duplicate content can harm the creator.
- The content of a website is accidentally served on different domains or subdomains (for example, with and without "www").
- The same content is used in different categories. This can happen, for example, if the content of a URL is also published in a news feed.
- The content management system cannot assign unique URLs to content.
- Different attribute filters in online stores produce the same product lists under different URLs.
So-called near duplicate content, i.e. very similar content, can also lead to problems. Blocks of text that are repeated many times (such as recurring teasers or texts that appear on every page) can be classified as duplicate content by search engines.
Background
Google has made several adjustments to its algorithms to ensure that the search engine can filter out duplicate content reliably. Both the 2004 Brandy Update and the 2005 Bourbon Update improved Google's ability to detect duplicate content.
Consequences of duplicate content
Duplicate content presents a problem for search engines: they have to decide which of the duplicate pages is the most relevant to a search query. Google stresses that "duplicate content on a website […] is not a reason to take action against this website". However, the search engine provider reserves the right to impose penalties where there is manipulative intent: "In the rare cases where we have to assume that duplicate content is displayed with the intention of manipulating rankings or misleading our users, we make the appropriate corrections to the index and ranking of the websites in question." Webmasters should not leave it to Google to decide whether duplicate content is inadvertent or deliberate; as a rule, they should avoid duplicate content altogether.
Technical causes of duplicate content
Duplicate content can have a number of causes, often rooted in server misconfiguration.
Content duplication due to bad server configuration
The levers for avoiding duplicate content within the website itself lie in the server configuration. The following problems can be solved easily:
Duplicate content due to catch-all / wildcard subdomains
One of the most basic SEO mistakes on a site arises when a domain responds to all subdomains at once. This can easily be tested by visiting
"http://www.DOMAIN.com" followed by "http://DOMAIN.com" (i.e., without "www")
If the same content is displayed in both cases (and the address bar still shows the address entered), you should act quickly. In the worst case, the server responds to all subdomains, including a subdomain like
"http://potatoe.DOMAIN.com"
Such additional pages with the same content are called duplicates. To make it easier for search engines to decide which URL is relevant, the server must be configured correctly. This can be done, for example, with the mod_rewrite module of the widely used Apache server. With an .htaccess file in the root directory of the website, the following code instructs the server, via a 301 redirect, to respond only to the correct domain and to automatically redirect common subdomains to it:
RewriteEngine On
# Please remember to replace "DOMAIN" with the respective domain of your project!
RewriteCond %{HTTP_HOST} !^www\.DOMAIN\.com$ [NC]
RewriteRule (.*) http://www.DOMAIN.com/$1 [R=301,L]
As a preliminary consideration, you should first decide which should be the main domain - with or without "www"? For international websites, the country code should also be considered as a subdomain, for example:
http://en.DOMAIN.com/
Duplicate content due to missing trailing slashes
Another widespread form of duplicate content arises from missing trailing slashes. This concerns URLs that do not contain file names but point to directories. For example:
http://www.DOMAIN.com/register_a/register_b/
This (usually) opens the index file in the "register_b" subfolder. Depending on the configuration, the following URL may respond in the same way:
http://www.DOMAIN.com/register_a/register_b
In the example above, the trailing slash is missing. The server first tries to find a file named "register_b", which does not exist, but then notices that a folder with that name exists. Since the server does not want to return an unnecessary error message ("file does not exist"), it displays the index file of that folder instead. In principle this is helpful, but unfortunately it results in duplicate content (as soon as a link points to the "wrong" URL). This problem can be treated in different ways:
- 301 Redirect using .htaccess.
- Canonical tag pointing to the correct URL.
- Blocking by robots.txt.
- Correction of all misspelled links (difficult for inbound links).
The best way is to combine a 301 redirect via .htaccess with rectifying the faulty links, as sketched below. This spares Google unnecessary crawling effort, which in turn can benefit the website elsewhere.
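As a rough sketch, such a trailing slash redirect could look like this in the .htaccess file (the rule assumes the Apache mod_rewrite module and the DOMAIN placeholder used above; details depend on the individual server setup):
RewriteEngine On
# Append a trailing slash to requests that point to an existing directory
RewriteCond %{REQUEST_FILENAME} -d
RewriteCond %{REQUEST_URI} !/$
RewriteRule ^(.*)$ http://www.DOMAIN.com/$1/ [R=301,L]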
Treatment of duplicate content
Optimizing a page consists not only of avoiding duplicate content, but also of identifying it and acting appropriately. So-called duplicate content checkers can help here: they list the URLs that display similar content. It is particularly important that webmasters and SEOs react appropriately to duplicate content. Since search engine bots are indexing ever faster, similar content also reaches the index more quickly. This carries the risk of misclassification or even accelerated exclusion from the index.
Uniqueness of the text
Duplicate content frequently affects online stores that take over product texts 1:1 from manufacturers and also use them for price comparison portals. Matt Cutts has already expressed his opinion on this matter. [1] For this reason, you should create different texts for your own site and for price comparison or external shopping portals. Even though it may seem like a troublesome task, individualized texts for different pages are worth it: first, your own website and brand are strengthened, and secondly, the price comparison portals receive individualized and thus more interesting texts, both for Google and for the user.
In order to avoid internal duplicate content on the site itself, webmasters should review their content carefully and consider whether some categories can be merged. In some cases, it may also be useful to mark filter pages with the tag "noindex, follow", for example. Search engines then do not index these pages, but still follow the links that appear on them.
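A minimal sketch of such a markup, placed in the <head> of the filter page in question:
<meta name="robots" content="noindex, follow">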
To create unique content, tools are available that take the TF*IDF formula into account.
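For orientation, the formula in its common form weights a term t in a document d as
tf-idf(t, d) = tf(t, d) * log(N / df(t))
where tf(t, d) is the frequency of the term in the document, N is the number of documents considered, and df(t) is the number of documents containing the term.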
Content theft
In the event of external content duplication as a result of "content theft", you should immediately contact the respective webmaster and request that they cite the original source of the text or delete it. In most cases, a simple request is sufficient. In extreme cases, a formal warning can also be issued. In addition, webmasters have the opportunity to report pages to Google that violate copyright by copying content. The corresponding form can be submitted via the Google Search Console.
301 redirect
If external duplicate content arises because a webmaster is operating websites with the same content on two or more domains, a 301 redirect is often sufficient to prevent the duplicate content.
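A minimal sketch of such a redirect in the .htaccess file of the secondary domain, assuming the hypothetical placeholders OLD-DOMAIN.com (the duplicate) and DOMAIN.com (the main site):
RewriteEngine On
# Send all requests for the secondary domain to the main domain
RewriteCond %{HTTP_HOST} ^(www\.)?OLD-DOMAIN\.com$ [NC]
RewriteRule (.*) http://www.DOMAIN.com/$1 [R=301,L]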
Alternatively, you can let Google know the preferred version of a website, for example via the Google Search Console.
Canonical tag, noindex tag, and robots.txt
There are several options for dealing with internal duplicate content on the website itself. The canonical tag is an important tool in this case: it points from the duplicated subpage to the original page, and the duplicate is then excluded from indexing. If you need to be completely sure that a subpage with duplicate content is not indexed, you can mark it with a noindex tag. To additionally exclude duplicate content from crawling, the respective subpages can also be listed in the robots.txt file.
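A minimal sketch of the first and third option, assuming a hypothetical original at http://www.DOMAIN.com/original-page/ and a duplicate under /duplicate-page/. In the <head> of the duplicated subpage, the canonical tag points to the original:
<link rel="canonical" href="http://www.DOMAIN.com/original-page/">
A corresponding robots.txt entry that keeps the duplicate out of the crawl could look like this:
User-agent: *
Disallow: /duplicate-page/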
Hreflang tags on translated pages
Google can now identify translated pages quite well and assign content to an original page. In order to avoid duplicate content caused by identical translations or by the same language being used for different target markets, the hreflang tag can be used to indicate the region and language of individual URLs. In this way, Google recognizes that translations of a page exist and that each URL targets a specific market.
An example: a German online store also offers its products in the German-speaking part of Switzerland and in Austria. In this case, the target language is German. Nevertheless, the store uses the respective country endings .ch and .at for the destination countries. To avoid duplicate content, an hreflang tag is placed in the header of the German version that refers to the variant for Switzerland.
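A minimal sketch of such an annotation, assuming the hypothetical domains shop.de, shop.ch and shop.at (placed in the <head> of each version):
<link rel="alternate" hreflang="de-DE" href="http://www.shop.de/">
<link rel="alternate" hreflang="de-CH" href="http://www.shop.ch/">
<link rel="alternate" hreflang="de-AT" href="http://www.shop.at/">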
Rel=alternate with mobile subdomains
Smartphone optimization can also produce duplicate content. This is particularly true if the mobile website is located on its own subdomain. Duplicate content can be avoided by using the rel=alternate tag: it points from the desktop version to the smartphone version. Search engines then recognize that both versions belong to the same website and avoid double indexing.
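A minimal sketch, assuming a hypothetical mobile subdomain m.DOMAIN.com. In the <head> of the desktop page:
<link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.DOMAIN.com/page/">
In Google's documented setup for separate mobile URLs, this is usually paired with a canonical tag on the mobile page that points back to the desktop version:
<link rel="canonical" href="http://www.DOMAIN.com/page/">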
Prevention
To avoid internal duplicate content, it is a good idea to plan the page hierarchy carefully. This makes it possible to detect possible sources of duplicate content in advance. When creating products in online stores, provisions should also be made so that canonical tags can be implemented easily. At the text level, the following applies: the more individual the text, the better it is for Google and the user, and the easier it is to avoid duplicate content.
Duplicate content checker
For an initial analysis, so-called duplicate content checkers are suitable, such as those from Copyscape or Ryte. These tools identify similar or even identical content on the web. Online stores in particular, which transmit their product data via CSV files to price comparison portals or sales platforms such as Amazon, are often affected by these problems. Matt Cutts has already expressed his opinion on this matter. [2]
Web Links
- Google Support: duplicate content
- Video by Matt Cutts: If I cite another source, will I be penalized for duplicate content?