To deal with these, we first need to properly extract the second double-quoted field. Note that Apache log files use backslashes to escape extra quotes or other special characters. This means naive regular expressions such as `"[^"]*"` aren't good enough.
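To see why the naive pattern falls short, here is a small sketch. The log line is made up for illustration (its request path contains an escaped quote), and `sed -n '2p'` is just one assumed way of picking the second match:

```shell
# Hypothetical log line: the request contains an escaped quote (\").
line='127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /a\"b HTTP/1.1" 200 512 "http://example.com/" "Mozilla/5.0"'

# Naive approach: print every "..." run and keep the second one.
naive_second=$(printf '%s\n' "$line" | grep -o '"[^"]*"' | sed -n '2p')

# The escaped quote ends the first match early, so the "second field"
# comes out as the status and size rather than the referrer.
printf '%s\n' "$naive_second"
```

With this input the output is `" 200 512 "`, not the referrer.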
Using grep to extract the referrer field (second double-quoted field):
grep -oP '^[^"]+"[^"\\]*(?:\\.[^"\\]*)*"[^"]+"\K[^"\\]*(?:\\.[^"\\]*)*(?=")' logfile.txt
Looks crazy! Let's break it down:
- The `o` argument to `grep` means we just get the matching part of the line, not the rest of it
- The `P` argument to `grep` tells it to use Perl-compatible regular expressions
- The overall structure of the regular expression in use here, `...\K...(?=...)`, means we are checking the whole pattern but only the things between the `\K` and the `(?=...)` will be output
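The `\K` and `(?=...)` trick is easier to see on a tiny input first. This is only a sketch with a made-up string, and it assumes a `grep` built with `-P` support (GNU grep usually is):

```shell
# Made-up input: we want only the text inside the quotes after name=.
result=$(printf '%s\n' 'name="value" other="ignored"' | grep -oP 'name="\K[^"]*(?=")')

# Everything before \K must match but is dropped from the output;
# the lookahead (?=") must match but is never consumed.
printf '%s\n' "$result"   # prints: value
```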
Breaking the regular expression down further:
- `^[^"]+` – Get everything between the start of the line and the first `"`
- `"[^"\\]*(?:\\.[^"\\]*)*"` – Get the entire first double-quoted string
- `[^"]+` – Get everything between the two strings
- `"\K[^"\\]*(?:\\.[^"\\]*)*(?=")` – The same as above, but we have the `\K` after the first `"` to start matching data after that and the `(?=")` to stop matching data before the last `"`
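Putting the pieces together, here is a sketch that runs the full expression against a single made-up log line (combined log format assumed, with an escaped quote in the request to exercise the backslash handling):

```shell
# Hypothetical log line; the request contains an escaped quote (\").
line='127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /a\"b HTTP/1.1" 200 512 "http://www.google.com/search?q=apache+logs" "Mozilla/5.0"'

# Same regular expression as above, fed from a variable instead of logfile.txt.
referrer=$(printf '%s\n' "$line" |
  grep -oP '^[^"]+"[^"\\]*(?:\\.[^"\\]*)*"[^"]+"\K[^"\\]*(?:\\.[^"\\]*)*(?=")')

printf '%s\n' "$referrer"   # prints: http://www.google.com/search?q=apache+logs
```

The escaped quote no longer throws the match off, because `\\.` swallows the backslash together with whatever character follows it.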
After this point the data will be much easier to process because you no longer have to worry about the quotes and extracting the field properly from the log file.
For example, you could pipe the output into another grep:
grep -oP ... logfile.txt | grep -oPi '^https?://www\.google\.com/search\?\K.*'
Here the `i` option to the second grep makes it case-insensitive.
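As a quick sketch of that second step on its own, assume the referrers have already been extracted, one per line (the three values below are invented):

```shell
# Made-up referrers, as they might come out of the first grep.
referrers='http://www.google.com/search?q=apache+logs&hl=en
HTTP://WWW.GOOGLE.COM/search?q=grep
http://example.com/page'

# Keep only the query string of google.com search referrers.
queries=$(printf '%s\n' "$referrers" |
  grep -oPi '^https?://www\.google\.com/search\?\K.*')

printf '%s\n' "$queries"
```

This prints `q=apache+logs&hl=en` and `q=grep`; the upper-case URL still matches thanks to `-i`, and the `example.com` line is dropped.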
Alternatively, you could add the check for the start of the google.com referrer directly into the first regular expression and move the `\K` as appropriate, but I would recommend against this: it's better to run two regular expressions that each do one job and do it well than to combine them into one whose job is not clear.
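For comparison only, this is roughly what the combined version might look like. It is a sketch on a made-up log line, not a recommendation, and it shows how much harder the single expression is to read:

```shell
# Hypothetical log line in combined log format.
line='127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 512 "http://www.google.com/search?q=apache+logs" "Mozilla/5.0"'

# The google.com prefix check sits inside the referrer field and \K has
# moved to just after it, so only the query string is printed.
query=$(printf '%s\n' "$line" |
  grep -oPi '^[^"]+"[^"\\]*(?:\\.[^"\\]*)*"[^"]+"https?://www\.google\.com/search\?\K[^"\\]*(?:\\.[^"\\]*)*(?=")')

printf '%s\n' "$query"   # prints: q=apache+logs
```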
Note that if you want to collect referrers from other Google domains you will need to modify your regular expression a fair bit. Google owns a lot of search domains.
If you didn't mind potentially catching a few non-Google sites, you could do:
... | grep -oPi '^https?://(www\.)?google\.[a-z]{2,3}(\.[a-z]{2})?/search\?\K.*'
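A rough check of what that looser pattern lets through. All referrers below are invented, and `www.google.zz` is a deliberately fake host standing in for a non-Google site:

```shell
# Made-up referrers, one per line.
referrers='https://google.co.uk/search?q=a
http://www.google.de/search?q=b
http://www.google.com/search?q=c
http://www.google.zz/search?q=d
http://www.bing.com/search?q=e'

loose=$(printf '%s\n' "$referrers" |
  grep -oPi '^https?://(www\.)?google\.[a-z]{2,3}(\.[a-z]{2})?/search\?\K.*')

printf '%s\n' "$loose"
```

The first four lines all match, including the fake `google.zz` one, which is exactly the kind of false positive you are accepting with this pattern.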
Otherwise you would need to attempt to match only Google-owned search domains, which is a constantly moving target:
... | grep -oPi '^https?://(www\.)?google\.(a[cdelmstz]|b[aefgijsty]|cat|c[acdfghilmnvz]|co\.(ao|bw|c[kr]|i[dln]|jp|k[er]|ls|m[az]|nz|t[hz]|u[gkz]|v[ei]|z[amw])|com(\.(a[fgiru]|b[dhnorz]|c[ouy]|do|e[cgt]|fj|g[hit]|hk|jm|k[hw]|l[bcy]|m[mtxy]|n[afgip]|om|p[aeghkry]|qa|s[abglv]|t[jrw]|u[ay]|v[cn]))?|d[ejkmz]|e[es]|f[imr]|g[aefglmpry]|h[nrtu]|i[emoqst]|j[eo]|k[giz]|l[aiktuv]|m[degklnsuvw]|n[eloru]|p[lnst]|r[osuw]|s[cehikmnort]|t[dgklmnot]|us|v[gu]|ws)/search\?\K.*'
Also note that if you want to include Google's image search and other search subdomains, you will need to change the `(www\.)?` in one of the above grep commands to something like `((www|images|other|sub|domains)\.)?`.
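A minimal sketch of that change, applied to the google.com-only pattern. The subdomain list (`www`, `images`) and the referrers are assumptions picked purely for illustration:

```shell
# Made-up referrers; only www, images and the bare domain should pass.
referrers='http://images.google.com/search?q=cats
http://google.com/search?q=x
http://maps.google.com/search?q=y'

subs=$(printf '%s\n' "$referrers" |
  grep -oPi '^https?://((www|images)\.)?google\.com/search\?\K.*')

printf '%s\n' "$subs"
```

Here `q=cats` and `q=x` come out, while the `maps` referrer is skipped because that subdomain is not in the alternation.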