
Best way to identify low quality pages?

[ Edited ]
Visitor ✭ ✭ ✭
# 1



The website


The website in question has 3.3k pages and decent traffic. In the past it was hit by Panda (possibly multiple times), but no remedial action was taken by the webmasters. Now, digging into the Google Analytics reports (using the last 12 months as a time frame), I realized that about 1.3k of those pages bring only 5% of the traffic to the site. In short, almost half of the site contributes almost nothing to its well-being.


I want to noindex these pages and see what happens. There's not much to lose, except maybe a small portion of the traffic in the worst-case scenario, and I think this way the site will gain traffic in the long run.


Google Analytics Reports


The problem now is how to precisely identify those pages. I looked at three page-level dimension reports in Google Analytics, and they all order the pages differently. I looked at "Unique Pageviews", "Sessions", and "Sessions coming from Google" (these are simplified titles). For example, a given page might sit at, say, 450th place in the first report (ascending order), 960th in the second, and 1300th in the third report's long list of pages.


That basically means that, compared to the other pages on the site, the page does best on sessions from Google, not so well on sessions in general, and really badly on unique pageviews.


Now, when looking at sessions at the page-level dimension, they basically represent entrances (the first hit of the session that is also a page), as per this Google Help explanation. So the reports I'm looking at, at the page-level dimension, are effectively "Unique Pageviews", "Page Entrances", and "Page Entrances from Google".


It's also worth noting that the last report (Entrances from Google Search) is sampled at 1%, meaning it may be somewhat unreliable, although the chance of that is small.




The same page can rank differently in these three reports when it comes to "entrances", "unique pageviews", and "entrances from Google search". Which is the correct report for determining which pages are least valuable to the site? Is a combination of the three the best way to determine the least valuable pages with confidence? For example, if a page ranks below 1000th place in all three reports, it's a candidate for noindexing.
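To make the combination idea concrete, here's a rough sketch (in Python, with made-up page paths, metric values, and cutoff) of flagging pages that rank below a cutoff in all three exported reports:

```python
# Hypothetical sketch: combine three exported GA page-level reports and
# flag pages that rank below a cutoff in ALL of them. The page paths,
# metric values, and cutoff below are illustrative only.

def rank_pages(report):
    """Map each page to its rank (1 = most traffic) within one report."""
    ordered = sorted(report, key=report.get, reverse=True)
    return {page: i + 1 for i, page in enumerate(ordered)}

def noindex_candidates(reports, cutoff):
    """Pages ranking below `cutoff` in every report are candidates."""
    rankings = [rank_pages(r) for r in reports]
    pages = set(rankings[0])
    return sorted(p for p in pages
                  if all(r.get(p, float("inf")) > cutoff for r in rankings))

# Toy data: {page: metric value} for unique pageviews, entrances,
# and entrances from Google (in reality, exported from GA as CSV).
unique_pv = {"/a": 900, "/b": 40, "/c": 700, "/d": 10}
entrances = {"/a": 500, "/b": 30, "/c": 450, "/d": 5}
google_entrances = {"/a": 300, "/b": 60, "/c": 200, "/d": 2}

print(noindex_candidates([unique_pv, entrances, google_entrances], cutoff=2))
# → ['/b', '/d']
```

With real exports the dicts would come from the three CSV files, and the cutoff would be something like the 1000th place mentioned above.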



Re: Best way to identify thin content?

Explorer ✭ ✭ ☆
# 2
It sounds like your goal is to determine which pages show up the least. Well, Google Analytics only tells you which pages are receiving traffic; otherwise no data is sent to Google. Odds are you will need an XML sitemap of all the content on the website, and then you can match that to the data in Analytics over an extended period of time, maybe 12 months.

Start with the pages in your sitemap that have 0 pageviews in Analytics. I would say next are the pages with 0 entrances, since those are not being reached from Google or external websites, then sort from least to greatest. But generally speaking, setting them to noindex might not help your cause as much as you would think.
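As a rough illustration of that sitemap-to-Analytics matching (the sitemap string and page paths here are made up, not from any real site):

```python
# Rough sketch: parse the XML sitemap and flag URLs that never appear
# in the Analytics page report. All data here is illustrative.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_paths(xml_text):
    """Extract the URL paths listed in a sitemap."""
    root = ET.fromstring(xml_text)
    return {urlparse(loc.text).path for loc in root.findall(".//sm:loc", NS)}

def zero_traffic_pages(xml_text, ga_pages):
    """Sitemap URLs with no pageviews at all in the GA export."""
    return sorted(sitemap_paths(xml_text) - set(ga_pages))

# Suppose the GA page report only saw /a and /c over the period:
print(zero_traffic_pages(SITEMAP, {"/a": 900, "/c": 120}))  # → ['/b']
```

In practice `ga_pages` would be the Page dimension column from a 12-month export.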

Maybe you should do a qualitative analysis to see whether these pages aren't receiving traffic because of errors, something to do with the content or topic, or because the subjects don't align with the rest of your website. Odds are you will find some trend explaining why they are not performing well.

Re: Best way to identify thin content?

[ Edited ]
Visitor ✭ ✭ ✭
# 3
  1. I do have an XML sitemap, and the pages match the data in Analytics. Of course, in the GA report(s) I have a few advanced conditions in place to exclude pages that are not actual website pages (i.e. pages with queries, error pages, etc.).

  2. There are no pages with 0 entrances or pageviews. As I've mentioned, the site has decent traffic. It's just that a large portion of the pages contributes very little to the overall traffic. The pages in the reports are properly listed from least to greatest, but the problem, as previously noted, is choosing the most effective report.

  3. I've done qualitative analysis, and those pages with very little traffic are certainly thin pages. No doubt about it. But rewriting 1.3k pages is not an option; the scope and effort for that job are far greater than the eventual reward. I did some testing half a year ago, revamping around 50 pages, and there are no results to this day. Basically, those pages don't have enough backlinks, social shares, and everything else to begin with to climb in the SERPs. Rewriting alone won't do the job; they would need to be republished, re-exposed on the front page of the site, etc. to possibly gain traction. I think even then the results would be minuscule, and as I've mentioned, that task is a mammoth one.

Re: Best way to identify thin content?

[ Edited ]
Visitor ✭ ✭ ✭
# 4

After much thought, I think the most appropriate GA reports for determining the least valuable pages are the "Sessions/Entrances" and "Sessions/Entrances from Google" reports.

Unique Pageviews

For example, the "Unique Pageviews" report doesn't represent landing pages, just pages. So a poor page can rack up unique pageviews if it's listed as a related page on one of the good pages: a visitor enters the site via Google, a backlink, social media, email, etc., lands at the front page or one of the good pages, then follows an internal link to a poor page and goes back or jumps to another page. The poor page earns a unique pageview that way, even though it's actually poor in the eyes of the external sources of traffic.


The "Sessions/Entrances" report does represent landing pages, so there the pages are listed according to their 'value' to the external sources of traffic, or in other words, how good they are at attracting traffic from the external world.

Google Sessions/Entrances

Now, the "Google Sessions/Entrances" report is only a part of the whole "Sessions/Entrances" report, and, as I've said, it's heavily sampled because of the complexity of the query. Maybe this report is the best one, since the noindexing will be done ONLY for Google Search purposes, but I'm worried about its accuracy, because only a 1.18% sample of sessions is included in the report. I've been reading about the accuracy of sampled reports, and people have mixed feelings about them.
So what do you think?

Re: Best way to identify thin content?

Explorer ✭ ✭ ☆
# 5
I think sampling is something that gets discussed very often here and I would be wary of that 1.18% sample size. Since you are likely looking at a longer period of time, your secondary dimension is likely causing the sampling. If you decrease the period or remove the secondary dimension, the sample rate should improve. Another option is using a custom segment instead, which can yield more accurate results.

It seems we have the same train of thought: start with the pages with the least unique pageviews and then Sessions/Entrances. I am still not confident that noindex will solve your problem. There is always a chance they are still getting indexed and impacting your statistics.

This abstract discussion is challenging without knowing what website you are working on. If you know which pages are performing well on a related topic, you could try consolidating content and using a 302 temporary redirect from the low-performing page to see if it is still needed. Ultimately, you might decide to remove that content altogether, and a 301 permanent redirect will help prevent 404 errors on the pages that were indexed.

Let me ask you this: what do you hope to see by using noindex on that content, and how do you intend to measure the result?

Re: Best way to identify thin content?

[ Edited ]
Visitor ✭ ✭ ✭
# 6
  1. There's no way to avoid the sampling there. Even looking at a one-month period (with highest precision) you get a 33.31% session sample, which isn't great. I would have to look at periods of less than a week to get 100%, and then add up all those reports for all 3.3k URLs for a whole year. That would be a lot of work, almost impossible. And there's no secondary dimension in that report:

    Acquisition > Source/Medium > Google/Organic > Landing Page

    Landing Page there is the primary dimension, along with Source/Medium.

  2. About the unique pageviews: after some thought, I said I wouldn't trust that metric, because a poor page can have a lot of unique pageviews simply because it's listed as a related article at the bottom of a great article, linked from some section of the site that gets decent exposure, or even accessed from the website's own search engine. So if you trust that metric, you could end up keeping some poor pages and deleting ones with good external traffic. That's why I think the more reliable metric for determining low-quality pages is Sessions (which are Entrances when looking at the landing-page dimension level). That way you would be pruning only the pages that don't get enough external traffic.

    I had read the article you're referring to before opening this topic. It makes good points, and it basically says that if you noindex a lot of pages on your site and still continue to link to them from your indexed pages, you will dilute your link juice and confuse the robots. However, there's a solution for that: noindex your low-quality pages and also de-link them from your good ones. Do not show them as related articles anywhere on your site. They will end up accessible only in your deep archives (for humans and robots), which are noindex pages too. That way the robots would not end up in a chain like this:

    index > noindex > index > noindex > index > etc.

    instead they would see:

    index > index > index > noindex > noindex > etc.

    So if a human wants to browse the archives, they can still see those pages, and robots willing to crawl that deep can access them too.

  3. I've thought a lot about consolidating low-quality pages, and I have done it in the past on a small sample. But it's a tricky business. The pages you consolidate have to be really related; you can't redirect them if they just have a few words in common. If you start redirecting en masse without scrutiny, you start fooling your visitors, and the robots too. In my case redirection is probably not an option, except maybe in a very small number of cases, because the pages on the site are actually movies. If you decide to delete a movie, I don't think you can really redirect users to another, similar movie. They may revolve around the same topic, but they're fundamentally different. What do you think?

    Ultimately, deleting low-quality pages may be the best option, but Google's Gary Illyes disagrees, and I agree with him. He said noindex is better than a 404, and even to keep those noindex pages in the sitemap.

  4. As I've said, the site was hit by Panda multiple times (not severely, but 10%-15% each time). As you may know, Panda is a domain-level penalty, which means the low-quality pages you keep are hurting every other page on the site and the domain itself. For starters, by noindexing those pages that bring only 1% of the traffic to the site, I want to see if the site's traffic goes up. If not, I can easily bring them back into the index. The main problem is determining the best way to pinpoint the low-quality pages.
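For what it's worth, the "add up unsampled weekly reports" idea from point 1 could be sketched like this; `fetch_report` is a stand-in for whatever GA export or API call would supply one window's data, and all numbers are made up:

```python
# Rough sketch: split the year into week-long windows (where the report
# would be unsampled) and sum the per-page entrances across windows.
# `fetch_report` is a hypothetical stand-in for a GA export/API call.
from datetime import date, timedelta
from collections import Counter

def week_windows(start, end, days=7):
    """Yield (window_start, window_end) date ranges covering [start, end]."""
    cur = start
    while cur <= end:
        stop = min(cur + timedelta(days=days - 1), end)
        yield cur, stop
        cur = stop + timedelta(days=1)

def total_entrances(start, end, fetch_report):
    """Sum per-page entrances over all short (unsampled) windows."""
    totals = Counter()
    for win_start, win_stop in week_windows(start, end):
        totals.update(fetch_report(win_start, win_stop))
    return totals

# Fake fetcher: pretend every window reports the same small numbers.
def fake_fetch(win_start, win_stop):
    return {"/a": 10, "/b": 1}

totals = total_entrances(date(2016, 1, 1), date(2016, 1, 28), fake_fetch)
# Four windows, so totals["/a"] == 40 and totals["/b"] == 4.
```

Automating it this way would at least make the "lot of work" part tractable, though it still means 52 exports per year.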

I hope this makes sense.