Duplicate content is the same or near-identical text appearing on multiple URLs. It confuses search engines about which version to rank, dilutes PageRank across the copies, and can trigger manual actions for egregious scraping.
What counts as duplicate content?
Google defines duplicate content broadly: substantial blocks of content within or across domains that completely match or are "appreciably similar." This covers:
- Printer-friendly page variants (/print/).
- HTTP vs HTTPS or www vs non-www serving the same content.
- URL parameters that reorder products or track sessions: /shoes?color=red&size=10 vs /shoes?size=10&color=red.
- Pagination (/page/1) that shares content with the root category.
- Syndicated content republished verbatim on partner sites.
- CMS-generated tag, category, and archive pages that aggregate the same articles.
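The parameter-reordering case above is worth making concrete: sorting query parameters collapses every ordering into one canonical form. A minimal sketch in Python using only the standard library (the `normalize_url` name and URLs are illustrative, not from any particular tool):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Return a canonical form: query params sorted, fragment dropped."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

# Both parameter orderings collapse to the same canonical URL:
a = normalize_url("https://example.com/shoes?color=red&size=10")
b = normalize_url("https://example.com/shoes?size=10&color=red")
```

The same normalization is useful server-side when generating rel=canonical values for parameterized pages.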
Why it hurts rankings
When Googlebot finds two identical pages, it must decide which to index. It uses
inbound links, PageRank, and canonical hints, but may pick the wrong version.
Worse: link equity splits across duplicates instead of consolidating on the canonical.
Mental model: every duplicate URL is a vote going to the wrong candidate.
Identifying duplicate content
- AuditAI: flags pages with identical <title> or meta description, the fastest proxy for duplication.
- Google Search Console → Pages → "Duplicate, Google chose different canonical."
- Screaming Frog → Bulk Export → filter by content hash collision.
- Siteliner.com: free cross-page duplicate analysis.
- site:yourdomain.com "exact sentence": spot-check for scrapers republishing your content.
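The hash-collision approach in the list above can be approximated in a few lines: hash each page's normalized body text and group URLs whose hashes collide. A sketch with made-up page data (the function names and the whitespace/case normalization are my own choices, not a specific crawler's algorithm):

```python
import hashlib
from collections import defaultdict

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivial variations still collide
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(pages: dict) -> list:
    """pages: {url: body_text} -> groups of URLs with identical content."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_hash(text)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

dupes = find_duplicates({
    "/shoes?color=red&size=10": "Red shoes, size 10.",
    "/shoes?size=10&color=red": "Red shoes,  size 10.",
    "/about": "About our store.",
})
```

Real crawlers typically hash the extracted main content rather than the raw HTML, so boilerplate (nav, footer) does not mask duplication.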
The canonical tag, your primary fix
For content that must exist on multiple URLs, use rel=canonical to declare
the preferred version:
<!-- On the duplicate -->
<link rel="canonical" href="https://example.com/shoes/red" />
Rules:
- Self-referential canonicals on every page (even the canonical itself).
- Canonical must be an absolute URL with scheme + domain.
- Canonical ≠ noindex: combining them sends conflicting signals.
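The rules above can be checked mechanically. A minimal sketch using only Python's standard library that extracts rel=canonical from an HTML snippet and flags relative URLs (the `CanonicalFinder` and `check_canonical` names are my own):

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class CanonicalFinder(HTMLParser):
    """Collect the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical" and self.canonical is None:
            self.canonical = a.get("href")

def check_canonical(html: str):
    """Return (href, is_absolute); absolute means scheme + host present."""
    finder = CanonicalFinder()
    finder.feed(html)
    href = finder.canonical
    if href is None:
        return None, False
    parts = urlsplit(href)
    return href, bool(parts.scheme and parts.netloc)

href, ok = check_canonical(
    '<link rel="canonical" href="https://example.com/shoes/red" />'
)
```

Running this across a crawl export quickly surfaces pages that violate the absolute-URL rule or lack a canonical entirely.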
Fixing www vs non-www and HTTP vs HTTPS
Redirect ALL variants to one canonical origin at the server level:
# Nginx: force HTTPS + www
server {
    listen 80;
    server_name example.com www.example.com;
    return 301 https://www.example.com$request_uri;
}
server {
    listen 443 ssl;
    server_name example.com;
    # ssl_certificate and ssl_certificate_key directives required here
    return 301 https://www.example.com$request_uri;
}
# The site itself is served by a separate block:
# listen 443 ssl; server_name www.example.com;
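The invariant behind the config above is simple: every scheme/host variant must 301 to one origin. A toy Python model of that redirect logic (not a replacement for the server config; the canonical origin is assumed to be www + HTTPS):

```python
from urllib.parse import urlsplit

CANONICAL_ORIGIN = "https://www.example.com"  # assumption: www + HTTPS chosen

def redirect_target(url: str) -> str:
    """Model the Nginx rules: every variant 301s to the canonical origin,
    preserving path and query string."""
    parts = urlsplit(url)
    query = "?" + parts.query if parts.query else ""
    return CANONICAL_ORIGIN + parts.path + query

# All origin variants collapse to a single target:
variants = [
    "http://example.com/shoes?size=10",
    "http://www.example.com/shoes?size=10",
    "https://example.com/shoes?size=10",
]
targets = {redirect_target(u) for u in variants}
```

Asserting that the set of targets has size one is exactly the property a post-deploy smoke test should check against the live server.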
Handling syndicated content
- Ask the publisher to add <link rel="canonical" href="your-original">; Medium and Substack both support this.
- Wait at least 24–48 hours after your original is indexed before syndicating.
- Never syndicate before your own URL is crawled; otherwise the syndicated copy may be indexed first.
CMS archive and tag pages
WordPress generates /tag/seo/, /category/tips/, and
/page/2/ pages that duplicate post excerpts. Noindex thin archives:
<meta name="robots" content="noindex, follow" />
Keep follow so PageRank continues to flow through the archive's links; noindex alone is enough to keep the thin archive from competing with the canonical post.
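The decision of which paths get the noindex tag can be expressed as a small rule. A sketch matching the default WordPress paths mentioned above (the `robots_meta` name and regex are my own, and real sites should adapt the patterns to their permalink structure):

```python
import re

# WordPress-style archive paths: /tag/x/, /category/x/, and paginated pages
THIN_ARCHIVE = re.compile(
    r"^/(tag|category)/[^/]+/(page/\d+/)?$|^/page/\d+/$"
)

def robots_meta(path: str) -> str:
    """Return the robots meta value: noindex thin archives, index the rest."""
    if THIN_ARCHIVE.match(path):
        return "noindex, follow"
    return "index, follow"
```

For example, robots_meta("/tag/seo/") yields "noindex, follow" while a post URL like "/my-best-post/" stays indexable.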
Duplicate content checklist
- ☑ Self-referential canonical on every page.
- ☑ 301 redirect www → non-www (or vice versa) at server level.
- ☑ 301 redirect HTTP → HTTPS.
- ☑ URL parameters handled via canonical tags (Google retired the GSC URL Parameters tool in 2022).
- ☑ Printer/PDF variants canonicalized to main page.
- ☑ CMS archive/tag pages noindexed or consolidated.
- ☑ Syndicated content has canonical back to origin.
- ☑ No conflicting canonical + noindex on same page.
Related: Technical SEO Checklist · XML Sitemaps Guide · Internal Linking Strategy
Run a free AuditAI scan to find duplicate-content issues on your site →