Long-tail keyword clustering at scale: a practical workflow
Ivaylo
March 8, 2026
CTR is dropping and everyone is acting surprised.
We have watched the same pattern across properties: impressions rise, traffic gets weird, and suddenly the old “rank the head term” playbook feels like it belongs in a museum. If you are doing long-tail keyword clustering at scale, this is the moment you stop treating clustering like a tidy spreadsheet exercise and start treating it like a visibility system for messy, fragmented search.
Here’s the uncomfortable part: AI Overviews changed what “winning” looks like. BrightEdge data shows impressions up 49% since the AIO launch while CTR dropped 30%. That is not a rounding error. It is the new weather. Our job is not to argue with the sky. Our job is to build a workflow that keeps producing pages that get discovered, cited, and trusted, even when the click never comes.
The goal shifted: from ranking to being included
Traditional keyword plans assume a straight line: pick a keyword, build a page, rank, get clicks. That line is bent now.
BrightEdge also reported that AIO-triggering queries of 8+ words grew 7x since May 2024. Those are long, specific prompts, and they tend to pull in citations from deeper in the SERP than old-school SEO people are comfortable admitting. BrightEdge saw a 400% increase in citations sourced from ranks 21 to 30 and a 200% increase from ranks 31 to 100. And the number that should make you sit up: 89% of AI citations come from outside the top 10.
So clustering is no longer just about consolidating variants to avoid cannibalization. It is also about building topical coverage that increases your odds of being the page the model can safely quote.
Potential friction shows up fast here: teams keep optimizing for rank #1 and panic when CTR falls, instead of optimizing for coverage, citations, and cluster-level share of voice. If your reporting still ends with “average position,” you are going to keep having the wrong argument with your stakeholders.
The part that actually breaks projects: what belongs in one cluster
Most clustering projects fail in the same boring way. The tool outputs clusters. The clusters look reasonable. The content team builds pages. Three months later you have cannibalization, irrelevant rankings, and a bunch of “why are we getting impressions but no conversions?” meetings.
This is not a tooling issue. It is intent boundaries.
Google does not cluster keywords the way your embedding model does. Google clusters by what it believes a searcher wants right now, and it enforces that belief with the SERP. When you merge two intents that Google keeps separate, you end up with a page that is mediocre for both. When you split a single intent into ten pages, you create thin content and internal competition.
We have done both. More than once.
Why semantic similarity alone betrays you
Embeddings are good at meaning. They are not inherently good at SERP segmentation.
A semantic model will happily group together:
- “best waterproof trail running shoes for wide feet”
- “how to waterproof trail running shoes”
- “waterproof trail running shoes vs gore tex”
Those are related, but they are not the same job. One is product selection, one is DIY maintenance, one is comparison research. If you force them into one page because cosine similarity says they are close, you get a Frankenstein article. It might rank for some crumbs, but it often fails to win the real queries.
What trips people up is assuming the model understands the same intent boundary Google enforces, then discovering cannibalization or irrelevant rankings later. The model is not wrong. Your objective function is.
A practical hybrid decision framework that survives scale
At 100,000 keywords, you cannot SERP-cluster everything unless you enjoy lighting money on fire via SERP APIs. ContentGecko’s point is real: semantic clustering is viable for 100k+ keywords without heavy SERP API cost. We agree.
But you also cannot blindly trust embeddings for ambiguous clusters. Our workflow is a hybrid:
First, semantic pre-cluster to cheaply create buckets that are mostly coherent.
Then, selectively run SERP overlap tests on the buckets that matter or look suspicious.
Finally, apply explicit merge-split rules so different people do not “feel” their way into inconsistent decisions.
Here is the triage rubric we use for when to pay the SERP API cost:
- We validate with SERPs when the cluster is high value (revenue-adjacent, brand risk, or a major category page), when the cluster contains mixed intent modifiers (buy vs how-to vs comparison), or when the semantic cluster is unusually diverse (high variance in modifiers, entities, or parts of speech).
- We usually trust embeddings when the cluster is clearly single-intent (very consistent modifiers), low stakes (informational long-tail with no conversion path), or highly niche where SERP results are sparse anyway.
- We also validate when a cluster is large enough to justify a pillar page that will become a hub for internal linking. If you are going to build architecture around it, verify it.
Now the operational SERP rule, kept intentionally simple so the team can apply it without philosophical debates: if two keywords produce overlapping Google top-10 results, they belong in the same cluster. In practice, we treat this as a threshold question. Do they share enough of the same URLs in the top 10 that Google is basically returning the same set of answers? If yes, merge. If no, split.
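If you want to operationalize that rule, it is a few lines of code. Here is a minimal sketch, assuming you already have the top-10 URLs per keyword from whichever SERP API you pay for; the shared-URL cutoff is our convention to tune, not an industry constant.

```python
def same_cluster(urls_a, urls_b, min_shared=4):
    """Merge test: do two top-10 SERPs share enough URLs that Google
    is basically returning the same set of answers?

    urls_a, urls_b: top-10 ranking URLs for each keyword.
    min_shared: shared-URL cutoff. 4-of-10 is our starting point; tune it.
    """
    return len(set(urls_a) & set(urls_b)) >= min_shared

# Toy example: three shared URLs would merge at min_shared=3.
serp_a = ["a.com/guide", "b.com/review", "c.com/best", "d.com/forum"]
serp_b = ["a.com/guide", "b.com/review", "c.com/best", "e.com/blog"]
print("merge" if same_cluster(serp_a, serp_b, min_shared=3) else "split")
```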
There is nuance, but don’t overcomplicate it. The moment you make the rule “it depends,” the whole system becomes politics.
Merge-split rules we wish we had written down earlier
We learned these the hard way, usually after publishing something we had to unpublish.
If the primary intent differs, split. “Buy” and “how to” do not belong on the same URL unless the SERP already blends commerce and guidance. If the SERP is mixed, you can sometimes make a hybrid page work, but we only do it when we can answer the question in the first paragraph and still provide a clean path to the commercial elements without burying the lead.
If the entity changes, split. Different product models, different medications, different laws by state, different integrations, different anything. Embeddings will cluster them. Users will not forgive you.
If the constraint changes and alters the solution, split. “For wide feet” and “for plantar fasciitis” can look like minor modifiers, but the content requirements are different enough that merging often produces vague advice that never gets cited.
If it’s just phrasing, merge. That is the easy win. Different word orders, synonyms, and question formats usually belong together.
We still get this wrong. Honestly, it took us three tries to get our merge rules stable on one enterprise taxonomy because every department had a different definition of “same intent.” Eventually we stopped arguing and started measuring cannibalization by cluster. That ended the debate.
Input quality at 100k keywords: you cannot cluster garbage
Most teams skip this because it feels like janitorial work, and nobody gets promoted for janitorial work. Then they cluster raw exports, get nonsense clusters, and blame the model.
The annoying part: at scale, input hygiene determines whether your centroid keywords are meaningful or misleading.
Here is what we do before we vectorize anything.
We dedupe hard. Not just exact duplicates. Near-duplicates too: punctuation variants, pluralization, casing, stray spaces, and the same keyword with tracking junk appended from internal exports.
We normalize locale. en-US vs en-GB spelling differences can fracture clusters in subtle ways. If you mix them, you will often get centroid terms that are not how your audience actually types.
We standardize brand and product naming. If one export says “G Suite” and another says “Google Workspace,” your clusters will split like a cell in a biology lab. Pick a canonical naming scheme and map variants into it.
We preserve intent-bearing modifiers. People love stripping “best,” “cheap,” “near me,” “for beginners,” “without,” “vs,” as if they are stopwords. They are not. They are the whole point. Strip only the truly useless stuff.
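Here is a minimal sketch of that hygiene pass in Python. The canonical brand map and the specific normalizations are illustrative; the point is that cleanup happens before you vectorize, and intent-bearing modifiers survive it.

```python
import re

# Illustrative canonical naming map; extend with your own brand variants.
CANONICAL = {"g suite": "google workspace"}

def normalize_keyword(kw: str) -> str:
    """Lowercase, collapse stray whitespace, strip quote junk, and map
    brand variants to one canonical name.
    Note what we do NOT do: strip 'best', 'vs', 'near me', 'without'."""
    kw = kw.lower().strip()
    kw = re.sub(r"\s+", " ", kw)        # collapse stray spaces
    kw = kw.strip('"\u201c\u201d')      # drop quote variants from exports
    for variant, canonical in CANONICAL.items():
        kw = kw.replace(variant, canonical)
    return kw

raw = ["G Suite  pricing", "google workspace pricing", "google workspace pricing "]
clean = sorted(set(normalize_keyword(k) for k in raw))
print(clean)  # ['google workspace pricing'] -- three inputs, one keyword
```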
Potential friction: clustering raw exports with duplicates, near-duplicates, mixed locales, and inconsistent brand naming creates fragmented clusters and misleading centroid terms. The downstream effect is worse than “messy clusters.” It causes wrong page decisions.
Finding long-tails worth clustering (without deleting the good stuff)
You can get keyword ideas from anywhere. The hard part is not discovery; it’s resisting the urge to over-filter.
Semrush’s Keyword Magic Tool is massive (27.2B keywords in the database), which is both a blessing and a curse. You can drown in variants. Their practical filters are a decent starting point: volume 0 to 1,000, PKD 0 to 29, word count 3+. The “Questions” filter is useful when you are explicitly building informational coverage.
The catch is that volume filters can trick you into deleting the exact phrases that matter in the AI era. Semrush themselves point out that long-tail monthly volumes can be as low as 10 searches per month, yet clustered volume can exceed 1.5K. Those 10-search phrases are often the ones with clear constraints that produce citations.
We also pull phrasing from places where people talk like humans. Reddit is messy, but it is where you find the “how do I do X without Y breaking” wording that matches the 8+ word queries BrightEdge sees growing.
Competitor gaps still matter. A “keyword gap” style pull often reveals clusters you never would have brainstormed, especially for integrations, edge-case troubleshooting, and niche comparisons.
Anyway, back to clustering.
The practical workflow for long-tail keyword clustering at scale
If you want a repeatable workflow that does not collapse under enterprise volume, you need to separate three concerns: representing meaning (embeddings), making clustering computationally sane (dimensionality reduction), and choosing a clustering method that matches your data distribution.
ContentGecko’s 4-step workflow is close to what we run: vectorize, reduce dimensions with UMAP, cluster, then label and extract centroids.
Where this falls apart is when teams treat it like a one-click step. Then you get one mega-cluster, too many singletons, or clusters that look coherent but fail in search.
Step 1: Embeddings that behave in keyword space
We have used Sentence-Transformers models like `all-MiniLM-L6-v2` for cost and speed, and OpenAI’s `text-embedding-3-small` when we want more semantic sensitivity. Both can work.
The practical difference is not “accuracy” in the abstract. It’s how often the model collapses distinct intents because the keywords share the same nouns. In verticals with lots of repeated nouns (software features, medical conditions, legal topics), the slightly richer embedding often reduces the weird merges.
You are embedding short phrases, not essays. Keyword embeddings are brittle. Expect it.
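A minimal embedding sketch with Sentence-Transformers, assuming the cleaned keyword list from the hygiene step. We normalize the vectors on the way out so downstream cosine math is just a dot product.

```python
from sentence_transformers import SentenceTransformer

# Cheap and fast; swap in a richer model (or an API model like
# text-embedding-3-small) for noun-heavy verticals that over-merge.
model = SentenceTransformer("all-MiniLM-L6-v2")

keywords = [
    "best waterproof trail running shoes for wide feet",
    "how to waterproof trail running shoes",
    "waterproof trail running shoes vs gore tex",
]
embeddings = model.encode(keywords, normalize_embeddings=True)
print(embeddings.shape)  # (3, 384) for this model
```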
Step 2: UMAP to 5 to 10 dimensions (and why that range is not arbitrary)
UMAP is doing two jobs for us: it preserves local neighborhood structure while compressing the space enough that clustering is faster and more stable.
We target 5 to 10 dimensions because below 5, we start losing too much nuance. Clusters become blob-like. Above 10, we do not see enough benefit to justify the extra noise and compute.
UMAP parameters can become a rabbit hole. We keep it boring unless we have evidence it’s broken. If your output becomes one giant cluster, that can be a sign that your reduction step is over-smoothing neighborhoods or your clustering threshold is too permissive.
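The reduction step is a few lines with umap-learn. The random vectors here are a stand-in for your real keyword embeddings, and the parameters are our boring defaults, not tuned magic.

```python
import numpy as np
import umap

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(10_000, 384))  # stand-in for real keyword vectors

# 8 dims sits inside our 5-10 band. metric="cosine" matches the space
# the embeddings live in; n_neighbors=15 is the library default.
reducer = umap.UMAP(n_components=8, n_neighbors=15, metric="cosine",
                    random_state=42)
reduced = reducer.fit_transform(embeddings)
print(reduced.shape)  # (10000, 8)
```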
Step 3: Agglomerative vs HDBSCAN, and how to choose without guessing
Agglomerative clustering with a strict distance threshold in cosine space is the workhorse when you want predictability. It will form clusters based on linkage and stop when distances exceed your threshold.
HDBSCAN is better when your keyword universe has uneven density. That is most real datasets. You have one area with a million “best X for Y” variants and another area with sparse, technical troubleshooting phrases. HDBSCAN can find dense clusters and leave noise as noise without forcing everything into a cluster.
The decision we use is practical: if you need stable, explainable clusters that barely change month to month, start with Agglomerative. If you are seeing tons of singletons in some areas and monster clusters in others, try HDBSCAN.
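In scikit-learn and the hdbscan package, the two options look like this. The vectors are a stand-in for your UMAP-reduced matrix, and the threshold and min_cluster_size values are starting points, not recommendations.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
import hdbscan

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 8))  # stand-in for UMAP-reduced vectors

# Option A: predictable, threshold-driven clusters in cosine space.
agg = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.25,  # pick via the elbow sampling below
    metric="cosine",
    linkage="average",
)
labels_agg = agg.fit_predict(vectors)

# Option B: uneven-density data. Noise stays noise (label -1) instead
# of being forced into the nearest cluster.
hdb = hdbscan.HDBSCAN(min_cluster_size=5)
labels_hdb = hdb.fit_predict(vectors)
```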
Picking a strict distance threshold is the part competitors rarely spell out because it is annoying to explain. Here is how we do it without pretending there is a universal number.
We sample a few thousand keyword pairs from within a semantic pre-cluster and compute cosine distances. Then we look at the distribution: you will usually see a dense bump of “basically the same query” pairs and a long tail of “related but not same intent” pairs. We set the threshold near the elbow, then run clustering and inspect cluster size distribution.
If you want a concrete starting point: we often begin with a conservative threshold and loosen it only if singleton rate is unacceptably high. The goal is not to minimize singletons at any cost. The goal is to avoid over-merging.
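Here is the sampling in code, a sketch assuming unit-normalized embeddings. With random stand-in vectors the percentiles are meaningless; on real keyword data you should see the bump-and-tail shape described above.

```python
import numpy as np

rng = np.random.default_rng(7)
vectors = rng.normal(size=(2_000, 384))  # stand-in for pre-cluster embeddings
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Sample a few thousand random pairs and compute cosine distances.
n_pairs = 5_000
i = rng.integers(0, len(vectors), n_pairs)
j = rng.integers(0, len(vectors), n_pairs)
cos_dist = 1.0 - np.sum(vectors[i] * vectors[j], axis=1)

# Dense bump = "basically the same query"; long tail = "related but not
# same intent". Set the threshold near the elbow between the two.
for q in (10, 25, 50, 75, 90):
    print(f"p{q}: {np.percentile(cos_dist, q):.3f}")
```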
QA signals that tell you clustering is lying to you
We do not trust the first run. Ever.
We look at:
- Cluster size distribution: one mega-cluster is usually a bug, not insight.
- Singleton rate: some singletons are fine, especially for genuinely unique queries. A sea of singletons usually means your threshold is too strict or your embeddings are not capturing the domain.
- Top terms per cluster: if the top keywords in a cluster do not share intent modifiers, something is off.
- Random spot checks: we pick clusters that “feel” borderline and manually review. This is where you catch the merges that will become cannibalization.
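The first three checks reduce to a short report you can run after every clustering job. A minimal sketch; the thresholds for “acceptable” are yours to set.

```python
from collections import Counter

def qa_report(labels):
    """Sanity checks on a clustering run.
    labels: one cluster ID per keyword (HDBSCAN noise is -1)."""
    sizes = Counter(l for l in labels if l != -1)
    total = len(labels)
    singletons = sum(1 for c in sizes.values() if c == 1)
    noise = sum(1 for l in labels if l == -1)
    print(f"clusters: {len(sizes)}")
    print(f"largest cluster: {max(sizes.values())} keywords "
          f"({max(sizes.values()) / total:.0%} of all keywords)")
    print(f"singleton rate: {singletons / len(sizes):.0%} of clusters")
    print(f"noise rate: {noise / total:.0%} of keywords")

qa_report([0, 0, 0, 0, 1, 1, 2, 3, -1, -1])
```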
We once shipped a clustering output to a content team without QA because we were behind. The next two sprints became cleanup. Do not do that.
Minimal reproducible pipeline (inputs, outputs, artifacts)
You do not need a fancy platform to make this useful. You need consistent artifacts that content and SEO teams can act on.
Input: a cleaned keyword list with locale normalized, duplicates removed, and canonical entity naming.
Process: embeddings, UMAP to 5 to 10 dims, cluster via Agglomerative or HDBSCAN.
Output per cluster: a centroid keyword (closest to the cluster center), a human intent label, and supporting variants. We also store the cluster ID so the same keyword stays in the same place month to month, unless we deliberately re-cluster.
Those artifacts become your planning unit. Not the individual keyword.
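Extracting the centroid keyword is the only slightly fiddly part. A minimal sketch, assuming vectors and labels from the clustering step; the keywords and toy vectors are illustrative.

```python
import numpy as np

def centroid_keywords(keywords, vectors, labels):
    """For each cluster, return the keyword closest to the cluster center."""
    labels = np.asarray(labels)
    out = {}
    for label in set(labels) - {-1}:   # skip HDBSCAN noise
        idx = np.where(labels == label)[0]
        center = vectors[idx].mean(axis=0)
        dists = np.linalg.norm(vectors[idx] - center, axis=1)
        out[int(label)] = keywords[idx[np.argmin(dists)]]
    return out

kws = ["best crm for startups", "top crm for startups",
       "best crm tools for startups", "how to migrate crm data"]
vecs = np.array([[1.0, 0.0], [0.9, 0.2], [0.95, 0.1], [0.0, 1.0]])
print(centroid_keywords(kws, vecs, [0, 0, 0, 1]))
# {0: 'best crm tools for startups', 1: 'how to migrate crm data'}
```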
Turning clusters into pages without building a content farm
Clustering outputs are seductive because they look like a content calendar. That is how you end up publishing 400 near-identical pages.
The fix is to treat a cluster as a map of sub-intents, not a to-do list.
We pick one pillar query per cluster. Not because it has the highest volume, but because it best represents the shared intent and can host the other variants without turning into sludge.
Then we map supporting variants into sections. This is what BrightEdge calls “prompt completeness,” and it matters because long-tail questions often contain multiple implied requirements. If the query implies causes plus constraints plus options, your page needs those parts. Fast.
Potential friction: creating one page per micro-variant or stuffing variants into a page without structuring sections that match the sub-intents in the cluster. Both fail, just in different ways.
A practical rule we use: if a supporting variant would require a new set of examples, a different decision framework, or a different audience, it probably wants its own page. If it can be answered as a section with a clear heading and a concrete answer, it belongs in the pillar.
AI-friendly on-page patterns that scale across lots of pages
Most advice here is obvious, so we keep only what changes outcomes.
Use the full long-tail query in the Title or H1 when it is not grotesque. If it is grotesque, use a close variant and keep the exact query as an H2.
Answer the implied question in the first paragraph. Not after a brand story, not after a definition, not after “in this guide.” Put the answer up front, then expand.
Format for extractability: short paragraphs, lists when it helps, concrete examples. You are writing for a human and for a machine that prefers clean snippets.
Schema is scaffolding. We use `FAQPage` when it matches the page honestly, not as a trick. Even when it does not directly trigger an AIO, it improves machine parsing.
Potential friction in one sentence: burying the answer below introductions reduces extractability for AI systems and frustrates high-intent long-tail searchers.
Measurement when impressions rise and CTR drops
If you measure this like it’s 2019, you will call it a failure.
We track cluster-level KPIs, not just keyword-level positions. That means we roll up impressions, clicks, conversions, and visibility across all keywords in the cluster footprint. The cluster is your product now.
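The rollup itself is trivial once cluster IDs are stable. A pandas sketch with made-up numbers and illustrative column names:

```python
import pandas as pd

# Per-keyword performance rows joined to stable cluster IDs.
df = pd.DataFrame({
    "keyword":     ["best crm for startups", "top crm for startups", "crm pricing"],
    "cluster_id":  [0, 0, 1],
    "impressions": [1200, 800, 5000],
    "clicks":      [40, 25, 60],
    "conversions": [3, 2, 1],
})

rollup = df.groupby("cluster_id").agg(
    impressions=("impressions", "sum"),
    clicks=("clicks", "sum"),
    conversions=("conversions", "sum"),
)
rollup["ctr"] = rollup["clicks"] / rollup["impressions"]
print(rollup)  # report at this level; the cluster is the unit, not the keyword
```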
We also track AIO citation presence. If AI Overviews cite outside the top 10 most of the time, then being the cited page is a win, even if the click never arrives.
We still watch engagement metrics like scroll depth and time on page, but only as diagnostics. If a page is getting impressions and citations but users bounce instantly when they do click, you probably answered the wrong intent or you buried the answer.
What nobody mentions: CTR dropping is not automatically a content problem anymore. It can be a SERP problem. Your job is to show stakeholders that visibility and influence are moving even when clicks are flat.
Cost and governance: keeping the system alive month to month
At-scale workflows fail operationally more than technically.
Semantic clustering replaces SERP APIs for the bulk of the work because it is the only way to process 100k+ keywords without exhausting budget. SERP validation becomes a scalpel, not a hammer.
We re-run clustering on a cadence, usually monthly or quarterly depending on how fast the market shifts. We keep cluster IDs stable when possible, and we document merge-split rules so new team members do not reinvent chaos.
The last friction point is the most boring one: running clustering once as a project, then never re-clustering or re-validating. Clusters rot. New keywords appear. Old ones drift in meaning. You end up making content decisions based on last quarter’s map.
If you want this to work, treat clustering like infrastructure. Not a deck.
FAQ
How do you cluster long-tail keywords at scale without paying for SERP APIs on everything?
Pre-cluster with embeddings, then validate only the high-value or suspicious clusters with SERP overlap checks. Use written merge-split rules so reviewers make consistent decisions.
What is an example of a long-tail keyword?
A long-tail keyword is a specific query with clear constraints, for example: best waterproof trail running shoes for wide feet. These are often 8 or more words and map to a narrow intent.
How do you decide if two keywords belong in the same cluster?
Use intent first, then confirm with SERP overlap when needed. If the top 10 results substantially overlap, Google is treating them as the same job and they usually belong together.
What should you measure if impressions are up but CTR is down?
Track cluster-level performance, not just individual keyword position. Include impressions, clicks, conversions, and whether your pages are being cited in AI Overviews.