To deal with these, we first need to properly extract the second double-quoted field. Note that Apache log files use backslashes to escape extra quotes or other special characters. This means naive regular expressions such as "[^"]*" aren't good enough.
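To see why the naive pattern falls short, here is a small sketch. The log line is made up for illustration (it is not from a real log file) and has an escaped quote inside the request field:

```shell
# Hypothetical log line: the request field contains an escaped quote (\").
line='127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /a\"b HTTP/1.1" 200 512 "http://example.com/" "Mozilla/5.0"'

# Naive pattern: a quote, any run of non-quote characters, then a quote.
naive_fields=$(printf '%s\n' "$line" | grep -o '"[^"]*"')

# Print each "field" the naive pattern found, one per line.
printf '%s\n' "$naive_fields"
```

The escaped quote ends the first match early, so the second "field" reported is `" 200 512 "` rather than the referrer.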
Using grep to extract the referrer field (second double-quoted field):
grep -oP '^[^"]+"[^"\\]*(?:\\.[^"\\]*)*"[^"]+"\K[^"\\]*(?:\\.[^"\\]*)*(?=")' logfile.txt
Looks crazy! Let's break it down:
- The o argument to grep means we just get the matching part of the line, not the rest of it
- The P argument to grep tells it to use Perl-compatible regular expressions
- The overall structure of the regular expression in use here, ...\K...(?=...), means we are checking the whole pattern but only the things between the \K and the (?=...) will be output
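The three points above can be tried on a much smaller input first. This is only a sketch: the sample string and the variable names are invented for illustration, and it assumes a grep built with PCRE support (GNU grep's -P).

```shell
# Hypothetical input, chosen only to illustrate -o, -P, \K and (?=...).
sample='name="alice" role="admin"'

# Everything before \K must match but is dropped from the output;
# the lookahead (?=") must match but is not part of the output either.
role=$(printf '%s\n' "$sample" | grep -oP 'role="\K[^"]*(?=")')

printf '%s\n' "$role"
```

Only the text between the \K and the lookahead is printed, here the word admin.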
Breaking the regular expression down further:
- ^[^"]+ – Get everything between the start of the line and the first "
- "[^"\\]*(?:\\.[^"\\]*)*" – Get the entire first double-quoted string.
- [^"]+ – Get everything between the two strings
- "\K[^"\\]*(?:\\.[^"\\]*)*(?=") – The same as above, but we have the \K after the first " to start matching data after that and the (?=") to stop matching data before the last ".
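As a quick check of the full expression, the sketch below runs it against a single made-up log line (the line and the variable names are assumptions, not real log data). The request field deliberately contains an escaped quote:

```shell
# Hypothetical log line with an escaped quote (\") in the request field.
line='127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /a\"b HTTP/1.1" 200 512 "http://www.google.com/search?q=apache+logs" "Mozilla/5.0"'

# Same regular expression as above, reading from stdin instead of logfile.txt.
referrer=$(printf '%s\n' "$line" |
  grep -oP '^[^"]+"[^"\\]*(?:\\.[^"\\]*)*"[^"]+"\K[^"\\]*(?:\\.[^"\\]*)*(?=")')

printf '%s\n' "$referrer"
```

This prints http://www.google.com/search?q=apache+logs, because the \\. part of the pattern consumes the escaped quote instead of treating it as the end of the first string.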
After this point the data will be much easier to process because you no longer have to worry about the quotes and extracting the field properly from the log file.
For example, you could pipe the output into another grep:
grep -oP ... logfile.txt | grep -oPi '^https?://www\.google\.com/search\?\K.*'
Here the i option to the second grep makes it case-insensitive.
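To try the second grep on its own, you can feed it a few referrer values directly. The values below are invented stand-ins for the output of the first grep:

```shell
# Hypothetical referrers, one per line, standing in for the first grep's output.
referrers='http://www.google.com/search?q=apache+logs
http://example.com/page
HTTPS://WWW.GOOGLE.COM/search?q=grep'

# Keep only the query string of www.google.com search referrers.
queries=$(printf '%s\n' "$referrers" |
  grep -oPi '^https?://www\.google\.com/search\?\K.*')

printf '%s\n' "$queries"
```

The example.com line is dropped, and the upper-case URL still matches because of the i option.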
Alternatively, you could add the check for the start of the google.com referrer directly into the first regular expression and move the \K as appropriate, but I would recommend against this: it's better to run two regular expressions that each do one job and do it well than to combine them into one whose job is unclear.
Note that if you want to collect referrers from other Google domains you will need to modify your regular expression a fair bit. Google owns a lot of search domains.
If you don't mind potentially catching a few non-Google sites, you could do:
... | grep -oPi '^https?://(www\.)?google\.[a-z]{2,3}(\.[a-z]{2})?/search\?\K.*'
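A small sketch of what this looser pattern accepts and rejects. All of the referrers are made up, and google.zzz is a deliberately fictional domain used to show the kind of false positive the pattern lets through:

```shell
# Hypothetical referrers; google.zzz is fictional and stands in for a non-Google site.
referrers='https://www.google.co.uk/search?q=one
http://google.de/search?q=two
http://www.google.zzz/search?q=three
http://www.bing.com/search?q=four'

loose=$(printf '%s\n' "$referrers" |
  grep -oPi '^https?://(www\.)?google\.[a-z]{2,3}(\.[a-z]{2})?/search\?\K.*')

printf '%s\n' "$loose"
```

The first three lines all match, including the fictional google.zzz, while the bing.com line does not.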
Otherwise you would need to attempt to match only Google-owned search domains, which is a constantly moving target:
... | grep -oPi '^https?://(www\.)?google\.(a[cdelmstz]|b[aefgijsty]|cat|c[acdfghilmnvz]|co\.(ao|bw|c[kr]|i[dln]|jp|k[er]|ls|m[az]|nz|t[hz]|u[gkz]|v[ei]|z[amw])|com(\.(a[fgiru]|b[dhnorz]|c[ouy]|do|e[cgt]|fj|g[hit]|hk|jm|k[hw]|l[bcy]|m[mtxy]|n[afgip]|om|p[aeghkry]|qa|s[abglv]|t[jrw]|u[ay]|v[cn]))?|d[ejkmz]|e[es]|f[imr]|g[aefglmpry]|h[nrtu]|i[emoqst]|j[eo]|k[giz]|l[aiktuv]|m[degklnsuvw]|n[eloru]|p[lnst]|r[osuw]|s[cehikmnort]|t[dgklmnot]|us|v[gu]|ws)/search\?\K.*'
Also note that if you want to include Google's image search and other search subdomains, you will need to change the (www\.)? in one of the above grep commands to something like ((www|images|other|sub|domains)\.)?.
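For instance, a sketch that allows only the www and images subdomains (or none at all) might look like this. The referrers are invented, and the choice of subdomains is an assumption to adjust for your own logs:

```shell
# Hypothetical referrers; only www, images, and the bare domain should match.
referrers='http://images.google.com/search?q=cats
http://www.google.com/search?q=dogs
http://google.com/search?q=birds
http://maps.google.com/search?q=fish'

matched=$(printf '%s\n' "$referrers" |
  grep -oPi '^https?://((www|images)\.)?google\.com/search\?\K.*')

printf '%s\n' "$matched"
```

The maps.google.com line is skipped because maps is not in the alternation.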