What’s Google’s policy on duplicate content?

Oct 15th, 2015

Updated July 2024

Duplicate content is a term used to describe content that appears exactly the same across different URLs on the internet. While it might seem like a straightforward issue, its impact on your site can vary (from wasting resources and crawl budget to creating problems for the user) and there are a number of ways you can get rid of it.

In this guide, we take a look at exactly what ‘counts’ as duplicate content, as well as how it can affect your site and how to successfully remove it.

What is duplicate content?

Here’s the short answer: duplicate content is content that is exactly replicated at more than one URL, i.e. it appears in multiple places on the internet.

The not so short answer is: “Any page, content or section of content which is exactly replicated across any URL, whether that is on a www or non-www prefix, an http or https, an index.html and similar page suffixes, including mobile friendly sites, tag pages, press releases, syndicated content and product descriptions.” This was stated by John Mueller in his 2015 webmaster video hangout on duplicate content.
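To make those URL variants concrete, here is a minimal Python sketch that groups addresses which probably serve the same page. The normalise and group_variants helpers and the list of index file names are our own illustrative assumptions, not anything Google publishes. The sketch deliberately ignores query strings and any subdomain other than www.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed list of default document names; adjust for your own server.
INDEX_SUFFIXES = ("index.html", "index.htm", "index.php")


def normalise(url: str) -> str:
    """Collapse common URL variants into one comparison key.

    Drops the scheme (http vs https), a leading "www.", a trailing
    index file name, any trailing slash and the query string.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path
    for suffix in INDEX_SUFFIXES:
        if path.endswith("/" + suffix):
            path = path[: -len(suffix)]
            break
    path = path.rstrip("/") or "/"
    return host + path


def group_variants(urls):
    """Return only the keys that are reachable through more than one URL."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalise(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Running group_variants over a list of crawled URLs gives you a rough shortlist of pages to check by hand. Two URLs sharing a key only suggests duplication, since the pages could still serve different content.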

The even less short answer is that though duplicate content can take any of the above forms, a lot of the perceived side effects of duplication come down to more complicated factors and signals that are taken into account when Google filters duplicate content. It is therefore much easier to answer the next question and come back to this in a moment.

What isn’t duplicate content?

Now that we know what duplicate content is, it’s important to also consider what it isn’t.

Translated content, for instance, isn’t considered to be duplicated. This is because it’s recognised as serving a separate purpose to the original content.

This is also true of content that is duplicated but location specific. For example, let’s take a company that offers the same services to two local areas that are sufficiently far away from each other. This could be different English-speaking countries, such as Ireland and England, or American states.

In this case, the information is equally significant in both areas and wouldn’t be considered duplicated, so both pages could be indexed as a primary source for searches in each area.

In-app content is also not considered duplication, even if it shares a title and description with other website pages.

Does duplicate content affect SEO?

According to John Mueller, the main concern for Google with duplication is simply that it wastes resources, crawl budget and time – delaying the pick-up of new content and making metrics more difficult to track.

Many SEOs, content writers and website owners nevertheless get hung up on duplicate content, certain that it has an impact on rankings. However, Google has updated its webmaster guidelines page on duplicate content and removed any mention of duplicate content being negative or having an impact on your site. The guidelines used to say:

“Google tries hard to index and show pages with distinct information…In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results”.

This wording is slightly ambiguous, and hints that Google will remove your page or site based on the amount of duplicate content it contains. They have since removed any mention of adjusting rankings or removing sites, and the page now simply explains how to avoid duplicate pages with the help of canonical URLs and redirects.

The document also supports John Mueller’s point that duplicate pages will only take up crawl time and may complicate your tracking metrics (each URL records its own metrics when they could all be consolidated into one).

Google simply wants to provide the best user experience possible, and this means returning a wide variety of search results – something a significant density of duplicate content can interfere with. So it has a method for filtering such content, which is split into three stages. These are:

Scheduling: As Google cannot crawl the whole web the whole time, some duplicate content is detected during the scheduling process. This is when the URLs that need to be crawled are decided upon.
Indexing: Duplicate indexed items waste storage space, so Google will generally index only one version of the content. The exception is content that meets certain criteria such as those mentioned earlier, including localisation, in which case both versions are indexed.
Search: Duplicated search results can be confusing to users, which is why Google chooses to omit some entries from the SERPs. This shouldn’t be considered a penalty, but instead is simply a way for Google to improve the user experience.

Does Google penalise duplicate content?

This, of course, leads to the question – if duplicate content is simply an annoyance, why is it penalised? Well, it isn’t.

Google won’t penalise a site for duplicate content alone. They will, however, penalise poor-quality scraper sites that automatically skim content from other sites, doorway sites that exist to redirect the user to other pages, and other similar or equivalent spammy sites. These sorts of websites will likely have other red flags that will cause them to be penalised too.

How to avoid duplicate content

Although there is no direct penalty related to duplicate content, it can still cause problems for your site in other ways – from causing Google to display the non-preferred version of the content to splitting page authority and traffic between two pages, which will obviously impact the content’s ranking.

The first way to avoid content duplication is to ensure any copy on your site is unique and doesn’t appear anywhere else on the web. Once you’ve written the content, you can run it through a plagiarism tool, such as Copyscape, to check just how unique it is. You can also check your existing content in these tools and then rewrite certain sections or entire pages to ensure there’s no duplication.
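Before reaching for an external tool, you can also do a quick check across your own pages. The sketch below is a minimal illustration, assuming you have already extracted each page’s body text into a dictionary keyed by URL. The function names are our own. It only catches copy that is identical after lowercasing and whitespace clean-up, and it is not a substitute for a plagiarism checker that searches the wider web.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    cleaned = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def find_exact_duplicates(pages: dict) -> list:
    """Return lists of URLs whose body text has the same fingerprint.

    `pages` maps a URL to the body text already extracted from it.
    """
    by_hash = defaultdict(list)
    for url, text in pages.items():
        by_hash[fingerprint(text)].append(url)
    return [sorted(urls) for urls in by_hash.values() if len(urls) > 1]
```

Any group this returns is a candidate for rewriting, canonicalising or redirecting.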

There are other ways of specifying which duplicate content Google should show in the search results. These include:

Adding a canonical link to the duplicated page (rel="canonical").
Redirecting duplicate content to the canonical URL.

These options can be useful for product pages that have very similar content but the URLs differ for each product colour or size (e.g. /fluffy-dressing-gown/pink/18, /fluffy-dressing-gown/grey/12, etc) or for a homepage that can be accessed from multiple URLs, such as URL.com/home, URL.com and home.URL.com.

Adding a canonical link allows you to specify which page is your preferred version. In our example above, you might want to make the canonical URL ‘/fluffy-dressing-gown/pink/’, requesting that Google doesn’t show the specific size URLs in the search engine results pages (SERPs).

This can also be useful for products that fall into two categories. For example, a website that sells heat pumps may have the following URLs that should be canonicalised:

url.com/heating-systems/heat-pumps
url.com/renewable-heating-systems/heat-pumps

It’s advised that you don’t have both pages competing against each other, so instead add a canonical link element to the <head> section of the duplicate page, e.g. <link rel="canonical" href="https://url.com/heating-systems/heat-pumps">. This will avoid problems if the content on both pages is identical or very similar.
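If you want to confirm that a page really declares the canonical URL you expect, a small script can read it back from the HTML. This is a minimal sketch using Python’s built-in html.parser module. The CanonicalFinder class and find_canonicals helper are our own illustrative names, and the sketch does not check that the link sits inside the <head> section.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values and may be missing.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonicals(html: str) -> list:
    """Return all canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals
```

An empty list means the page declares no canonical link at all, while more than one entry means the page is sending conflicting signals and is worth tidying up.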

Redirects can help you to remove duplicate content too. Unlike canonical links, this will remove the redirected page altogether so users can no longer access it. This would be useful in our homepage example, where a homepage can be accessed via multiple URLs. You may choose to permanently redirect (301) url.com/home and home.url.com to url.com.
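How you set up a redirect depends on your server or CMS, but the logic is simple. The sketch below illustrates it for the homepage example in plain Python. The redirect_for function and the https scheme on the target URL are our own assumptions for illustration, and in practice this rule would usually live in your server configuration rather than in application code.

```python
from urllib.parse import urlsplit

# Assumed preferred homepage address (the https scheme is our assumption).
CANONICAL_HOME = "https://url.com/"

# Duplicate homepage addresses from the example, as (host, path) pairs.
DUPLICATE_HOMEPAGES = {
    ("url.com", "/home"),
    ("home.url.com", "/"),
}


def redirect_for(url: str):
    """Return (301, target) for a duplicate homepage URL, otherwise None."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Treat "/home" and "/home/" alike, and an empty path as "/".
    path = parts.path.rstrip("/") or "/"
    if (host, path) in DUPLICATE_HOMEPAGES:
        return 301, CANONICAL_HOME
    return None
```

The canonical homepage itself returns None, which avoids the classic mistake of redirecting a page to itself.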

There are some methods to avoid when getting rid of your duplicate content. We’d advise that you don’t use robots.txt or Google’s URL removal tool for canonicalisation. Blocking a page in robots.txt stops Google from crawling it, so it can’t see any canonical link on that page, and the removal tool hides every version of the URL from search results.
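To see why robots.txt is the wrong tool for this job, the sketch below uses Python’s built-in urllib.robotparser to test a hypothetical rule that tries to hide the duplicate heat pump category from earlier. The robots.txt contents and the can_crawl helper are our own assumptions for illustration. A crawler that obeys the rule never fetches the blocked page, so it cannot read any canonical link placed on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that tries to hide the duplicate category.
ROBOTS_TXT = """\
User-agent: *
Disallow: /renewable-heating-systems/
"""


def can_crawl(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Here can_crawl returns False for url.com/renewable-heating-systems/heat-pumps and True for url.com/heating-systems/heat-pumps, so the duplicate is simply blocked from crawling rather than consolidated into the preferred page.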