Posted by Dr. Pete
There’s an app for everything – the problem is that we’re so busy chasing the newest shiny toy that we rarely stop to learn to use simple tools well. As a technical SEO, one of the tools I seem to never stop finding new uses for is the site: operator. I recently devoted a few slides to it in my BlueGlassX presentation, but I realized that those 5 minutes were just a tiny slice of all of the uses I’ve found over the years.
People often complain that site:, by itself, is inaccurate (I’ll talk about that more at the end of the post), but the magic is in the combination of site: with other query operators. So, I’ve come up with two dozen killer combos that can help you dive deep into any site.
Ok, this one’s not really a combination, but let’s start with the basics. Paired with a root domain or sub-domain, the [site:] operator returns an estimated count of the number of indexed pages for that domain. The “estimated” part is important, but we’ll get to that later. For a big picture, I generally stick to the root domain (leave out the “www”, etc.).
Each combo in this post will have a clickable example (see below). I'm picking on Amazon.com in my examples, because they're big enough for all of these combos to come into play:
You’ll end up with two bits of information: (1) the actual list of pages in the index, and (2) the count of those pages (circled in purple below):
I think we can all agree that 273,000,000 results is a whole lot more than most of us would want to sort through. Even if we wanted to do that much clicking, Google would stop us after 100 pages. So, how can we get more sophisticated and drill down into the Google index?
The simplest way to dive deeper into this mess is to provide a sub-folder (like “/blog”) – just append it to the end of the root domain. Don’t let the simplicity of this combo fool you – if you know a site’s basic architecture, you can use it to drill down into the index quickly and spot crawl problems.
You can also drill down into specific sub-domains. Just use the full sub-domain in the query. I generally start with #1 to sweep up all sub-domains, but #3 can be very useful for situations like tracking down a development or staging sub-domain that may have been accidentally crawled.
The "inurl:" operator searches for specific text in the indexed URLs. You can pair “site:” with “inurl:” to find the sub-domain in the full URL. Why would you use this instead of #3? On the one hand, "inurl:" will look for the text anywhere in the URL, including the folder and page/file names. For tracking sub-domains this may not be desirable. However, "inurl:" is much more flexible than putting the sub-domain directly into the main query. You'll see why in examples #5 and #6.
Adding [-] to most operators tells Google to search for anything but that particular text. In this case, by separating out "inurl:www", you can change it to "-inurl:www" and find any indexed URLs that are not on the "www" sub-domain. If "www" is your canonical sub-domain, this can be very useful for finding non-canonical URLs that Google may have crawled.
I'm not going to list every possible combination of Google operators, but keep in mind that you can chain most operators. Let's say you suspect there are some stray sub-domains, but you aren't sure what they are. You are, however, aware of "www.", "dev." and "shop.". You can chain multiple "-inurl:" operators to remove all of these known sub-domains from the query, leaving you with a list of any stragglers.
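If you find yourself chaining exclusions often, a tiny helper can assemble the query for you. This is only a rough sketch: `site_query` and `search_url` are hypothetical names I made up for illustration, and the sub-domain list is just the "www", "dev", and "shop" example from above.

```python
from urllib.parse import quote_plus


def site_query(domain, exclude_subdomains=()):
    """Build a site: query that removes known sub-domains with -inurl:."""
    parts = [f"site:{domain}"]
    parts += [f"-inurl:{sub}" for sub in exclude_subdomains]
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so it can be pasted into a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = site_query("amazon.com", ["www", "dev", "shop"])
print(query)              # site:amazon.com -inurl:www -inurl:dev -inurl:shop
print(search_url(query))
```

Because "inurl:" matches text anywhere in the URL, treat whatever comes back as a list of candidates to review, not a clean list of sub-domains.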
You can't put a protocol directly into "site:" (e.g. "https:", "ftp:", etc.). Fortunately, you can put "https" into an "inurl:" operator, allowing you to see any secure pages that Google has indexed. As with all "inurl:" queries, this will find "https" anywhere in the URL, but it's relatively rare to see it somewhere other than the protocol.
URL parameters can be a Panda's dream. If you're worried about something like search sorts, filters, or pagination, and your site uses URL parameters to create those pages, then you can use "inurl:" plus the parameter name to track them down. Again, keep in mind that Google will look for that name anywhere in the URL, which can occasionally cause headaches.
Pro Tip: Try out the example above, and you'll notice that "inurl:ref" returns any URL with "ref" in it, not just traditional URL parameters. Be careful when searching for a parameter that is also a common word.
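If you copy the result URLs into a list, you can separate true parameters from accidental matches after the fact. The sketch below is one possible approach using Python's standard library; the URLs and the `has_parameter` helper are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def has_parameter(url, name):
    """True only if `name` is an actual query-string parameter of `url`."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?ref=" still counts as a parameter
    return name in parse_qs(query, keep_blank_values=True)


# Hypothetical result URLs, for illustration only
urls = [
    "https://www.example.com/product?ref=homepage",
    "https://www.example.com/reference-guide",
    "https://www.example.com/list?sort=price&ref=",
]

for url in urls:
    loose = "ref" in url                 # roughly the "anywhere in the URL" match
    strict = has_parameter(url, "ref")   # real URL parameters only
    print(url, loose, strict)
```

All three URLs pass the loose substring check, but only the first and third actually carry a "ref" parameter.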
Maybe you want to know how many search pages are being indexed without sorts or how many product pages Google is tracking with no size or color selection – just add [-] to your "inurl:" statement to exclude that parameter. Keep in mind that you can combine "inurl:" with "-inurl:", specifically including some parameters and excluding others. For complex e-commerce sites, these two combos alone can have dozens of uses.
Of course, you can always combine the "site:" operator with a plain old text query. This will search the contents of the entire page within the given site. Like standard queries, this is essentially a logical [AND], but it's a bit of a loose [AND] – Google will try to match all terms, but those terms may be separated on the page or you may get back results that only include some of the terms. You'll see that the example below matches the phrase "free Kindle books" but also phrases like "free books on Kindle".
If you want to search for an exact-match phrase, put it in quotes. This simple combination can be extremely useful for tracking down duplicate and near-duplicate copy on your site. If you're worried about one of your product descriptions being repeated across dozens of pages, for example, pull out a few unique terms and put them in quotes.
This is just a reminder that you can combine text (with or without quotes) with almost any of the combinations previously discussed. Narrow your query to just your blog or your store pages, for example, to really target your search for duplicates.
If you specifically want a logical [OR], Google does support use of "or" in queries. In this case, you'd get back any pages indexed on the domain that contained either "this" or "that" (or both, as with any logical [OR]). This can be very useful if you've forgotten exactly which term you used or are searching for a family of keywords.
Edit: Hat Tip to TracyMu in the comments - this is one case where capitalization matters. Either use "OR" in all-caps or the pipe "|" symbol. If you use lower-case "or", Google could interpret it as part of a phrase.
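If you build these queries in a script, it's easy to bake the capitalization rule in so you never type a lower-case "or" by accident. A minimal sketch, where `any_of` is a hypothetical helper name and "example.com" is a placeholder domain:

```python
def any_of(*terms):
    """Join terms with an upper-case OR, quoting any multi-word term."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return " OR ".join(quoted)


print(f"site:example.com {any_of('this', 'that')}")
# site:example.com this OR that
```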
The asterisk [*] can be used as a wildcard in Google queries to replace unknown text. Let's say you want to find all of the "Top X" posts on your blog. You could use "site:" to target your blog folder and then "Top *" to query only those posts.
Pro Tip: The wildcard [*] operator will match one or multiple words. So, "top * books" can match "Top 40 Books" or "Top Career Management Books". Try the sample query above for more examples.
If you have a specific range of numbers in mind, you can use "X..Y" to return anything in the range from X to Y. While the example above is probably a bit silly, you can use ranges across any kind of on-page data, from product IDs to prices.
The tilde [~] operator tells Google to find words related to the word in question. Let's say you wanted to find all of the posts on your blog related to the concept of consulting – just add "~consulting" to the query, and you'll get the wider set of terms that Google thinks are relevant.
By using [-] to exclude the specific word, you can tell Google to find any pages related to the concept that don't specifically target that term. This can be useful when you're trying to assess your keyword targeting or create new content based on keyword research.
The "intitle:" operator only matches text that appears in the <TITLE></TITLE> tag. One of the first spot-checks I do on any technical SEO audit is to use this tactic with the home-page title (or a unique phrase from it). It can be incredibly useful for quickly finding major duplicate content problems.
You can use almost any of the variations mentioned in (12)-(17) with "intitle:" – I won't list them all, but don't be afraid to get creative. Here's an example that uses the wildcard search in #14, but targets it specifically to page titles.
Pro Tip: Remember to use quotes around the phrase after "intitle:", or Google will view the query as a one-word title search plus straight text. For example, "intitle:text goes here" will look for "text" in the title plus "goes" and "here" anywhere on the page.
This one's not really a "site:" combo, but it's so useful that I had to include it. Are you suspicious that other sites may be copying your content? Just put any unique phrase in quotes after "intitle:" and you can find copies across the entire web. This is the fastest and cheapest way I've found to find people who have stolen your content. It's also a good way to make sure your article titles are unique.
If you want to get a bit more sophisticated, you can use "-site:" and exclude mentions of copy on any domain (including your own). This can be used with straight text or with "intitle:" (like in #20). Including your own site can be useful, just to get a sense of where your ranking ability stacks up, but subtracting out your site allows you to see only the copies.
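The title-based checks above differ only in which "site:" pieces you attach, so they fit neatly into one small helper. This is just a sketch under my own naming (`title_query` is hypothetical, and "example.com" stands in for your domain). The one detail that matters is that the phrase is always wrapped in quotes, so the whole phrase stays tied to "intitle:".

```python
def title_query(phrase, site=None, exclude_site=None):
    """Build an intitle: query with the phrase always in quotes."""
    parts = []
    if site:
        parts.append(f"site:{site}")           # duplicates on your own site
    parts.append(f'intitle:"{phrase}"')
    if exclude_site:
        parts.append(f"-site:{exclude_site}")  # copies everywhere else
    return " ".join(parts)


print(title_query("text goes here", site="example.com"))
print(title_query("text goes here"))
print(title_query("text goes here", exclude_site="example.com"))
```

The first query looks for duplicate titles on your own site, the second searches the whole web, and the third subtracts your site so that only the copies remain.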
The "intext:" operator looks for keywords in the body of the document, but doesn't search the <TITLE> tag. The text could appear in the title, but Google won't look for it there. Oddly, "intext:" will match keywords in the URL (seems like a glitch to me, but I don't make the rules).
You might think that #22 and #23 are the same, but there's a subtle difference. If you use "intext:", Google will ignore the <TITLE> tag, but it won't specifically remove anything with "text goes here" in the title. If you specifically want to remove any title mentions in your results, then use "-intitle:".
One of the drawbacks of "inurl:" is that it will match any string in the URL. So, for example, searching on "inurl:pdf", could return a page called "/guide-to-creating-a-great-pdf". By using "filetype:", you can specify that Google only search on the file extension. Google can detect some filetypes (like PDFs) even without a ".pdf" extension, but others (like "html") seem to require a file extension in the indexed document.
Finally, you can target just the Top-Level Domain (TLD), by leaving out the root domain. This is more useful for link-building and competitive research than on-page SEO, but it's definitely worth mentioning. One of our community members, Himanshu, has an excellent post on his own blog about using advanced query operators for link-building.
Experienced SEOs may be wondering why I left out the operators "allintitle:" and "allinurl:" – the short answer is that I've found them increasingly unreliable over the past couple of years. Using "intitle:" or "inurl:" with your keywords in quotes is generally more predictable and just as effective, in my opinion.
I want to give you a quick case study to show that these combos aren't just parlor tricks. I once worked with a fairly large site that we thought was hit by Panda. It was an e-commerce site that allowed members to spin off their own stores (think Etsy, but in a much different industry). I discovered something very interesting just by using "site:" combos (all URLs are fictional, to protect the client):
First, I found that the site had a very large number (11 million) of indexed pages, especially relative to its overall authority. So, I quickly looked at the site architecture and found a number of sub-folders. One of them was the "/stores" sub-folder, which contained all of the member-created stores:
Over 8 million pages in Google's index were coming just from those customer stores, many of which were empty. I was clearly on the right track. Finally, simply by browsing a few of those stores, I noticed that every member-created store had its own internal search filters, all of which used the "?filter" parameter in the URL. So, I narrowed it down a bit more:
Over 60% of the indexed pages for this site were coming from search filters on user-generated content. Obviously, this was just the beginning of my work, but I found a critical issue on a very large site in less than 30 minutes, just by using a few simple query operator combos. It didn't take an 8-hour desktop crawl or millions of rows of Excel data – I just had to use some logic and ask the right questions.
Historically, some SEOs have complained that the numbers you get from "site:" can vary wildly across time and data centers. Let's cut to the chase: they're absolutely right. You shouldn't take any single number you get back as absolute truth. I ran an experiment recently to put this to the test. Every 10 minutes for 24 hours, I automatically queried the following:
Even using a fixed IP address (single data center, presumably), the results varied quite a bit, especially for the broad queries. The range for each of the "site:" combos across 24 hours (144 measurements) was as follows:
Across two sets of IPs (unique C-blocks), the range was even larger (see the "/blog" data):
Does that mean that "site:" is useless? No, not at all. You just have to be careful. Sometimes, you don't even need the exact count – you're just interested in finding examples of URLs that match the pattern in question. Even if you need a count, the key is to drill down. The narrowest range in the experiment was completely consistent across 24 hours and both data centers. The more you drill down, the better off you are.
You can also use relative numbers. In my example above, it didn't really matter if the 11M total indexed page count was accurate. What mattered was that I was able to isolate a large section of the index based on one common piece of site architecture. Presumably, the margin of error for each of those measurements was similar – I was only interested in the relative percentages at each step. When in doubt, take more than one measurement.
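Here's a rough sketch of what "relative numbers plus more than one measurement" might look like in practice. The readings below are invented for illustration, loosely modeled on the 11M and 8M figures from the case study, and taking the median of repeated readings is my own assumption about a sensible way to smooth out the noise.

```python
from statistics import median


def relative_share(part_counts, total_counts):
    """Share of the index, using the median of repeated readings."""
    return median(part_counts) / median(total_counts)


# Hypothetical repeated readings of the same two queries
total_counts = [11_000_000, 10_400_000, 11_600_000]   # site:example.com
stores_counts = [8_000_000, 7_900_000, 8_300_000]     # site:example.com/stores

print(f"{relative_share(stores_counts, total_counts):.0%}")  # 73%
```

The individual counts can bounce around quite a bit, but the ratio between them tends to tell the same story from one reading to the next.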
Keep in mind that this problem isn't unique to the "site:" operator – all search result counts on Google are estimates, especially the larger numbers. Matt Cutts discussed this in a recent video, along with how you can use the page 2 count to sometimes reduce the margin of error:
If you run enough "site:" combos often enough, even by hand, you may eventually be greeted with this:
If you managed to trigger a CAPTCHA without using automation, then congratulations, my friend! You're a real SEO now. Enjoy your new tools, and try not to hurt anyone.