
I have blogged before about how I dislike Google’s growing monopoly on the web, and how they increasingly block non-Chrome browsers when you use their own web applications.
It looks like Google is now moving into even more unsound web practices, using their web crawling power for unfair, unethical and illegal purposes.
Google pays bloggers for Chrome links
Earlier this month, news broke that Google paid bloggers to write about Chrome, their own browser. Embedding paid advertisements and, in some cases, disguising them as genuine blog content has always been something I found highly unethical.
OK, in some cases during Google’s campaign, the paid posts were clearly marked as such, but they still embedded links to Chrome’s download site. “Paid” links are something Google themselves have always battled against, saying they cheat search engines. Now they are doing it themselves, to market their own products.
Verdict: unethical and unfair competition.
Google crawls for data for their own profit
Today, I read this interesting and in-depth story about how Google crawled data from Mocality, a small Kenyan company publishing an online Kenya business directory.
In their article, the Mocality folks detail how Google crawls their business directory for the sole purpose of contacting the listed companies and convincing them to switch to a Google business directory. The article features recordings of telephone calls from Google employees to the listed companies, which also reveal other unclear and illegal business practices.
Even worse, after the story broke, Google deliberately switched the crawler’s IP address and continued crawling Mocality’s data. This shows it was not ignorance, but clear ill intent.
Verdict: unethical, unfair competition, commercial spying.
My own thing with Google
Web crawlers scanning your web content are supposed to obey the rules web admins set in “robots.txt”. This file defines what crawlers can scan, what they should not, and at what rate they can scan.
I have always had a problem with the fact that, for the Google crawler, some settings, like the crawl rate, can only be set in Google Webmaster Tools, while the settings in “robots.txt” are ignored. Even worse, every three months, Google resets its crawl rates. If you want to spare your server from excessive Google crawling, you have to manually reset the crawl rate for each of your sites, again. Every three months.
Today, I discovered that Google’s crawlers also ignore other, more basic rules:
On “Humanitarian News”, one of my sites, the “robots.txt” explicitly disallows crawlers from accessing the search facility:
Disallow: /search/
Disallow: /opensearch/
The reason is that, at the rate Google crawls and with the number of search combinations possible, my server performance goes down each time Google spams my site with excessive crawls. Even worse, this “opensearch” path calls the Solr search, which invokes a Java script offering advanced RSS features for “live” users. Needless to say, this sucks up server resources, especially if you fire off opensearch requests at the rate Google’s crawler does. This is the reason why I block crawlers from accessing the search in the first place.
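To make explicit what those two rules are supposed to mean to a well-behaved crawler, here is a minimal sketch using Python’s standard `urllib.robotparser`. It assumes the two `Disallow` lines sit under a `User-agent: *` group (the snippet above does not show that line, so that part is an assumption), and `is_allowed` is just an illustrative helper name. It only shows how the file should be read; it says nothing about what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the "User-agent: *" line is an assumption,
# only the two Disallow lines come from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /opensearch/
"""


def is_allowed(url, user_agent="Googlebot"):
    """Return True if a crawler honouring ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    search_url = "http://humanitariannews.org/search/apachesolr_search/somalia?page=1442"
    print(is_allowed(search_url))                        # search pages are disallowed
    print(is_allowed("http://humanitariannews.org/"))    # the home page is not
```

Under that assumption, any URL whose path starts with /search/ or /opensearch/ is off limits to every crawler that follows the file, whatever query string or page number comes after it.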
This morning, while checking my log, I found a string of searches which caught my attention:
search 13 Jan 2012 – 12:02 somalia (Search). Anonymous results
search 13 Jan 2012 – 12:02 somalia (Search). Anonymous results
search 13 Jan 2012 – 12:02 somalia (Search). Anonymous results
search 13 Jan 2012 – 12:02 somalia (Search). Anonymous results
search 13 Jan 2012 – 12:02 somalia (Search). Anonymous results
etc etc etc..
Looking into it further, the search strings were of the format:
http://humanitariannews.org/search/apachesolr_search/somalia?page=1442
And who generated those searches? They all come from IP 66.249.72.105
Which is (source)…:
OrgName: Google Inc.
OrgId: GOGL
Address: 1600 Amphitheatre Parkway
City: Mountain View
StateProv: CA
PostalCode: 94043
Country: US
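Since a user-agent string is easy to fake, a whois lookup like the one above can be complemented with the check Google itself documents for its crawler: do a reverse DNS lookup on the IP, confirm the host name ends in googlebot.com or google.com, then do a forward lookup and confirm it resolves back to the same IP. A rough sketch, assuming Python’s standard `socket` module; the function name `is_verified_googlebot` and the injectable lookup parameters are hypothetical, added only so the logic can be tried without network access.

```python
import socket


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-then-forward DNS check for an IP claiming to be Google's crawler.

    The two lookup parameters are hypothetical hooks for testing; by default
    the real resolver functions from the socket module are used.
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record at all

    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False  # host name is not in one of Google's crawler domains

    try:
        # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist)
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # host name does not resolve

    # The forward lookup must lead back to the IP we started from.
    return ip in addresses


if __name__ == "__main__":
    print(is_verified_googlebot("66.249.72.105"))
```

The result depends on live DNS at the moment you run it, so treat it as a quick check rather than proof either way.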
So…
- My robots.txt clearly blocks /search access.
- Google ignores that rule, and as such ignores a basic rule to protect certain web content from crawling.
- Going over my log, I find repeated Google crawls of the search.
- On top of that, there is not much relevant data to be found in the search which cannot be found in a normal crawl of my site. And certainly not on “page 1,442” of a search, as in the example above.
The problem is that, other than blocking their crawler’s IP address, you can’t do much about it. And if you block their crawler’s IP address, as a web manager, you might just as well throw yourself under a train: the Google search results for your site would disappear. (Do I smell a monopoly here?)
Verdict: monopoly, unethical business practice, bad example for anyone else on the web.
Is it time for an #OccupyGoogle ? I think it is.
PS: If anyone has bright ideas on how to tweak the robots.txt so that it actually keeps crawlers out, I’d like to hear about them.
Angry face picture courtesy SearchEnginePeople
Peter. Flemish, European, aid worker, blogger, expeditioner, sailor, traveller, husband, father, friend, nutcase. Not necessarily in that order.
{ 2 comments }
Finally, Check this out http://www.google.com/support/forum/p/Webmasters/thread?tid=0da8e0b9fd682420&hl=en
You don’t have to be so serious about these problems. Besides, immediately switching your browser to IE9 on Windows or Firefox on Mac would be well advised.
Hi Ferb,
Everything recommended in that post, I did, and more… And that is exactly my point: Google does NOT follow the robots.txt rules.
Peter