Analytics
Learn to identify referral spam traffic, as well as best practices to reduce referral spam in your Google Analytics account
 

hostname "include only" filter seems to be excluding valid traffic

Visitor ✭ ✭ ✭
# 1

my default view has no filters on it. this default view has hits from many different hostnames, visible by looking at the behavior report → site content → all pages, and then setting primary dimension to be hostname. the two hostnames i am interested in are 'www.example.com' and 'blog.example.com'.

 

to that end, i've made additional "hostname-only" views configured identically to the primary/default view. these "hostname-only" views each have their own filter defined, a "predefined" filter set up to include only traffic to the hostname equal to "www.example.com". as expected, this filter configuration works: no undesired traffic is making it to the views.

 

i am finding, however, that a whole lot of seemingly valid traffic is NOT making it to the views. in an attempt to debug this situation and inspect/validate the analytics traffic GA is receiving, i've added a hitCallback to the pageview that makes a request to my server.
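
For reference, a minimal sketch of that kind of hitCallback logging. Only `ga('send', 'pageview', { hitCallback: ... })` is the analytics.js interface; the `buildHitLog` helper and the `/ga-hit-log` endpoint are hypothetical names chosen for illustration. Logging `location.hostname` next to the user agent makes it possible to see the exact hostname string each browser reports.

```javascript
// Hypothetical helper: builds the record sent to the server for each pageview.
function buildHitLog(loc, userAgent) {
  return {
    hostname: loc.hostname, // normally what GA reports in the Hostname dimension
    path: loc.pathname,
    userAgent: userAgent,
    loggedAt: new Date().toISOString()
  };
}

// Browser-only wiring; skipped when analytics.js is not loaded (e.g. in Node).
if (typeof ga === 'function' && typeof window !== 'undefined') {
  ga('send', 'pageview', {
    hitCallback: function () {
      // '/ga-hit-log' is an assumed endpoint on your own server.
      var body = JSON.stringify(buildHitLog(window.location, navigator.userAgent));
      navigator.sendBeacon('/ga-hit-log', body);
    }
  });
}
```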

 

after reviewing this traffic, it's clear the traffic in the primary/default view more closely represents the traffic i am seeing on my server via the hitCallback. the "hostname-only" view is seeing roughly 1/10 of the default view's traffic.

 

it seems as though the hostname-only filter is filtering out waaaaay too much seemingly valid data. my first thought was bots executing the GA JS, but this hitCallback thing tells me the hits appear to be from non-bot user agents. (all of the views, btw, are configured to "exclude all known bots".)

 

i hope that all makes sense. does anyone have any clue as to what's going on? i'm at a loss to explain this.

 


hostname "include only" filter seems to be excluding valid traffic

Top Contributor
# 2

Hi :-)
I'm not really understanding your View/filter setup.

If you want to set up a view that collects data for both www.example.com and blog.example.com
together in the same View, then you will need to either

1) create a custom (not predefined) include filter using a regular expression for the filter pattern, e.g. ^(blog|www)\.example\.com
or
2) create a predefined filter to include only > traffic to the hostname > that end with example.com (bear in mind, though, that if there are other subdomains such as m.example.com, it will likely include data for those hostnames too)

If you create two separate include filters for the same View to filter the same Field, any visits that do not fulfill the conditions of the first filter will be discarded and not seen by the 2nd filter.
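
The effect of chained include filters can be sketched with a small simulation. This is a simplified model and not GA's actual processing: `applyIncludeFilters` is a hypothetical helper, JavaScript regular expressions stand in for GA filter patterns, and the hostnames are made-up sample values.

```javascript
// Simplified model of a view's include filters: each filter only sees hits
// that survived the previous one. An illustration, not GA's code.
function applyIncludeFilters(hostnames, patterns) {
  return patterns.reduce(function (remaining, pattern) {
    return remaining.filter(function (h) { return pattern.test(h); });
  }, hostnames);
}

var hits = ['www.example.com', 'blog.example.com', 'm.example.com', 'spam.invalid'];

// Two separate include filters on Hostname: no hit can satisfy both.
var twoFilters = applyIncludeFilters(hits, [/^www\.example\.com$/, /^blog\.example\.com$/]);

// One include filter with alternation keeps both hostnames.
var oneFilter = applyIncludeFilters(hits, [/^(blog|www)\.example\.com$/]);

console.log(twoFilters); // []
console.log(oneFilter);  // [ 'www.example.com', 'blog.example.com' ]
```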

Bronwyn Vourtis, Google Analytics Top Contributor
Was my response helpful? If yes, please mark it as the ‘Best Answer.’ Learn how here

Re: hostname "include only" filter seems to be excluding valid traffic

Visitor ✭ ✭ ✭
# 3

Hi Bronwyn, thanks for your quick reply.

 

Sorry if I wasn't clear about the filters. I've had them set up similar to how you have described. Here is the full setup:

 

  • default/unfiltered view
    • no filters applied
  • hostname-only view for www.example.com
  • hostname-only view for blog.example.com
    • predefined, include traffic to hostname: blog.example.com (screenshot)
  • combined hostname view for www and blog
    • custom, include hostname filter w/ regexp: ^(www|blog)\.example\.com (screenshot)
    • the usual "prepend hostname to request URI" when you're tracking different hostnames in the same account (screenshot)

My issue is that these single hostname-only views, specifically the one for www.example.com, are not receiving the data I expect: the hostname-only view shows only about 1/10 of the unfiltered view's traffic for www.example.com. Surely 9/10 of the traffic can't be spam! (I have not yet been able to determine a pattern to the excluded traffic.) This prompted me to see which user agents were making analytics calls via the hitCallback config. Sure enough, the hits logged to my server via hitCallback are 'good' (seemingly real) users, which means I'm missing 9/10 of the traffic in the hostname-only view! I'm getting some of the traffic, so clearly it's (kinda) working. My filter is so simple... what can be going wrong? This is the crux of my concern.

 

Note that while my question only concerns a single view (the hostname-only view for www.example.com), I'm glad you brought up the regex for the "combined" view. When I first set this up a few months ago, I was having issues with the 'verify filter' functionality... the regexp was not working as expected. I tried numerous configurations of the regexp (anchoring, no anchoring; capture groups with parens, no capture groups or parens; ordering of www and blog, etc.) and whichever way it was configured, only the hostname that matched the first condition of the regexp (before the pipe char) was matched. This drove me absolutely nuts. I felt like I couldn't trust the 'verify filter' functionality (or that my core understanding of GA was off), so I gave up on this combined view, leaving the supposedly correct regexp in place. (My regexp was essentially identical to yours, save for the ordering of www vs blog).
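
One way to sanity-check a hostname pattern independently of 'verify filter' is to run it against a list of known hostnames. The sketch below uses JavaScript's `RegExp` as a stand-in for GA's matcher, which is an assumption (the two engines are not guaranteed to behave identically); `matchingHostnames` is a hypothetical helper and the sample hostnames are illustrative.

```javascript
// Returns which hostnames a candidate filter pattern would keep.
// JavaScript's RegExp stands in for GA's matcher here (an assumption).
function matchingHostnames(patternSource, hostnames) {
  var re = new RegExp(patternSource);
  return hostnames.filter(function (h) { return re.test(h); });
}

var sample = ['www.example.com', 'blog.example.com', 'example.com', 'example.dev'];

// The order of the alternatives should not change which hostnames match.
var wwwFirst = matchingHostnames('^(www|blog)\\.example\\.com$', sample);
var blogFirst = matchingHostnames('^(blog|www)\\.example\\.com$', sample);

console.log(wwwFirst);
console.log(blogFirst);
```

In this model both orderings keep the same two hostnames. A typo such as an unbalanced parenthesis makes `new RegExp` throw a SyntaxError, so it is caught right away.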

 

I just tried testing out using 'verify filter' with the above regexp on the unfiltered view and it now works as expected. However... strangely, my dev server shows up with 2 sessions and 0 pageviews (screenshot). I'm having trouble reconciling what this is telling me. A session that included hits to my dev server also included hits to other servers? Does this have to do with the way the cookie is configured? I'm using this on all server instances (local, staging, production):

 

ga('create', 'UA-XXXXXXXX-1', 'example.com');

 

(A note re: the missing traffic... I've set up many sites over the years nearly identical to this scheme, and this is the first time I've noticed traffic/analytics anomalies. Maybe that's because I never actually verified traffic to the extent I've done here? *shrug*)

 

Anyhow, thanks for reading this far and for any ideas you might have.

 

-Justin

 

Re: hostname "include only" filter seems to be excluding valid traffic

Top Contributor
# 4

Hmm my reply got eaten when i tried to post it.. 
will try again :-)

 

Few things to consider.

 

If it were me, I would not be using the same Property for localhost or staging setup as for the live website(s)

 

> When I first set this up a few months ago, I was having issues with the 'verify filter' functionality... the regexp was not working as expected. I tried numerous configurations of the regexp (anchoring, no anchoring; capture groups with parens, no capture groups or parens; ordering of www and blog, etc.) and whichever way it was configured, only the hostname that matched the first condition of the regexp (before the pipe char) was matched. This drove me absolutely nuts. I felt like I couldn't trust the 'verify filter' functionality (or that my core understanding of GA was off), so I gave up on this combined view, leaving the supposedly correct regexp in place. (My regexp was essentially identical to yours, save for the ordering of www vs blog).

The filter verification is only an indicator. It can give incorrect results.
Filters are usually applied to new Views with little or no data (since filters have no effect on historical data), and the verification uses the View's data from the past 7 days, so it is often unable to return a result because the View holds too little data. The same applies to established Views if the site has had little traffic in the past 7 days.

 

Regarding ga('create', 'UA-XXXXXXXX-1', 'example.com');

https://developers.google.com/analytics/devguides/collection/analyticsjs/cookies-user-id


For tracking a domain and its own subdomains without any extra setup, and for successful sharing of the GA clientId, sessions and source/medium info between them, cookieDomain should be set to 'auto'.

 

If you set the cookieDomain to example.com in the tracking code and then add that tracking code to somedomain.com, those hits do not get sent to analytics. (i have verified this with testing)

 

cookieDomain gets set to 'none' automatically for the tracking code instance, when GA detects it is running on localhost
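
A minimal sketch of the 'auto' setting, keeping the placeholder tracking ID used earlier in the thread. The `createArgs` variable is only there for illustration; the guard skips the call wherever analytics.js is not loaded.

```javascript
// 'auto' lets analytics.js choose the cookie domain for the current host,
// instead of hard-coding 'example.com' in every environment.
var createArgs = ['create', 'UA-XXXXXXXX-1', 'auto'];

// Browser-only: call ga() when analytics.js is present.
if (typeof ga === 'function') {
  ga.apply(null, createArgs);
}
```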

 

For filtering out spam data, I suggest the following article. It's a long read but well worth it.
It outlines how to check your data and how to correctly set up the necessary filters and segments for keeping your reports clean.

http://help.analyticsedge.com/spam-filter/definitive-guide-to-removing-google-analytics-spam/

Bronwyn Vourtis, Google Analytics Top Contributor

Re: hostname "include only" filter seems to be excluding valid traffic

Visitor ✭ ✭ ✭
# 5

> If it were me, I would not be using the same Property for localhost or staging setup as for the live website(s)

 

Agreed, I usually direct analytics for local dev to a separate property ID. However, I inherited the site (a web app) from the previous developer and I have just recently refactored the app config to use environment variables. Moving forward I'll be able to set the property ID per app instance. For now, though, I've left it as-is to not rock the boat too much. In theory, hostname filters can handle this nicely.
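
As a rough illustration of the per-instance setup, assuming the app reads its config from environment variables: `GA_PROPERTY_ID` is a hypothetical variable name, `propertyIdFromEnv` is a hypothetical helper, and both IDs are placeholders.

```javascript
// GA_PROPERTY_ID is a hypothetical environment variable name.
// Falls back to a placeholder dev property so local runs never use the live one.
function propertyIdFromEnv(env) {
  return env.GA_PROPERTY_ID || 'UA-XXXXXXXX-2'; // placeholder dev property
}

console.log(propertyIdFromEnv({ GA_PROPERTY_ID: 'UA-XXXXXXXX-1' })); // UA-XXXXXXXX-1
console.log(propertyIdFromEnv({}));                                   // UA-XXXXXXXX-2
```

On a server-side Node app this would be called as `propertyIdFromEnv(process.env)`.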

 

> The filter verification is only an indicator. It can give incorrect results.

 

I get that maybe it truncates the number of rows it displays, but the issue I was experiencing was seemingly incorrect processing of a regexp. Is the 'incorrect results' thing documented somewhere or otherwise acknowledged by Google? Anyway...

 

> Considering filters are usually applied to new Views with little/no data (since filters have no effect on historical data) and that the verification uses the View's data from the past 7 days.. often it is unable to return a result due to too little data within the particular View. Same applies to established Views if the site has not a lot of traffic for the past 7 days.

 

Yup, I hear ya. However, when I was having the odd problem, I had done the filter verification on the default/main view which has been capturing data for 2+ years. *shrug* I chalk it up to software bugs.

 

> Regarding ga('create', 'UA-XXXXXXXX-1', 'example.com');
> [...]

 

The cookie domain is something that I just realized as I was assembling this post and sanity-checking everything. I don't expect it's directly related to my issue with the filters excluding data. That said, I will adjust this (change to 'auto') for the dev environment. I should note, though, that my dev environment (example.dev) is successfully logging data to GA with the cookie set to example.com. (I can see pageviews on example.dev in the default view.)

 

> For filtering out spam data, I suggest the following article. It's a long read but well worth it.

 

Yeah, I've seen that article. The second item on its list is the hostname filter to eliminate ghost visits.... which brings us back to my original question:

 

When I implement the hostname filter for www.example.com, I get substantially less traffic in the hostname-only view. This traffic does NOT match my web server logs. In other words, too much traffic is not being included by this simple filter. What's going on?
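
One diagnostic that may help here is to tally the server-side log by exact hostname string, since a predefined "equal to" filter would only be expected to keep hits whose hostname is exactly www.example.com. The sketch below assumes each logged hit is an object with a `hostname` property; `tallyHostnames` is a hypothetical helper and the sample data is made up.

```javascript
// Counts logged hits per exact hostname string. Input shape is assumed:
// one object per hit with a `hostname` property, e.g. from hitCallback logs.
function tallyHostnames(hits) {
  return hits.reduce(function (counts, hit) {
    counts[hit.hostname] = (counts[hit.hostname] || 0) + 1;
    return counts;
  }, {});
}

// Made-up sample data for illustration only.
var logged = [
  { hostname: 'www.example.com' },
  { hostname: 'example.com' },
  { hostname: 'www.example.com' },
  { hostname: 'example.dev' }
];

var counts = tallyHostnames(logged);
console.log(counts);
// { 'www.example.com': 2, 'example.com': 1, 'example.dev': 1 }
```

If most logged hits turn out to carry a hostname other than the exact filter value (a bare domain, another subdomain, a dev host), that would account for the gap between the two views.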

 

(Also note that, embarrassingly, I posted this in the Referral Spam Traffic forum. I meant to post this in the general Filters forum.)

 

hostname "include only" filter seems to be excluding valid traffic

Top Contributor
# 6

Did you create a custom segment in the unfiltered View with these conditions:

Filter > Sessions > Include

Hostname > matches regex > ^www\.example\.com$

and compare what it shows to the data captured in the View with the hostname filter for the main domain?

Also be aware of date ranges. For example, if the filtered view was not set up until Jan 5th, do not compare its data against the unfiltered view left at the default date range of the past 30 days. Set the unfiltered view to include only dates from the day the filtered view was set up.

Also, are you able to provide the URL of the website? If possible, I'd like to have a quick look at it in a browser.

Thanks

 

eta - 

> Is the 'incorrect results' thing documented somewhere or otherwise acknowledged by Google? Anyway...

It is, in various places within the help documentation:
"... Even with filter verification, we STRONGLY encourage you to apply new filters to a test view before assigning them to your real views. ..."
" ... Because filter verification uses a calculated sampling of your data, the results cannot be guaranteed to be accurate in all cases. You should always maintain an unfiltered view of your data as a back up. ..."
ref - https://support.google.com/analytics/answer/6046990?hl=en




 

Bronwyn Vourtis, Google Analytics Top Contributor

Re: hostname "include only" filter seems to be excluding valid traffic

Visitor ✭ ✭ ✭
# 7

Hi Justin,

 

My team is having exactly the same problem.

 

We've even set up a new view with the option "Exclude all hits from known bots and spiders" turned on, in the hope that bots were somehow causing this difference. But after a few days it was clear that this new view's data was closer to the view with no hostname filter than to the one with the filter.

 

Did you manage to find what was causing it?

 

We're thinking of using the view with no filter as our "trusted" source. From what you've said I'd guess that's the way you went too.

 

It's good to see we weren't the only ones with this problem at least.

 

Cheers,

Mikail