

A few alternative search engines

A recurrent concern of mine is Google’s ever-increasing internet monopoly. At first they had just the search engine (so successful that it turned into a verb), then Gmail (even my current work e-mail box is there), then YouTube (we all know about Dailymotion but for some reason mainly use YouTube), but also the maps, the translation (whatever happened to Babelfish?)… and of course the apparently geek-friendly Chrome, which passed the 5% browser market share mark at the beginning of the year.
Anyway, apart from the other big ones (although not nearly as big…), Bing and Yahoo, a few other search engines can be worth checking out. I’ve been contributing to a distributed search engine called Majestic-12 for almost 2 years, and on their forums we sometimes discuss robots.txt issues. Recently someone there posted the following robots.txt, and I thought it might be fun to check the bots out 🙂

User-agent: MJ12bot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: twiceler
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: spbot
Disallow: /
User-agent: ScoutJet
Disallow: /
User-agent: Linguee
Disallow: /
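
If you want to double-check what a file like this actually does, here is a minimal sketch using Python's standard urllib.robotparser module. The function name blocked_bots and the example.com URL are made up for illustration, and Googlebot is only added as a bot that the file does not mention.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, copied as-is.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /
User-agent: dotbot
Disallow: /
User-agent: twiceler
Disallow: /
User-agent: Yandex
Disallow: /
User-agent: Baiduspider
Disallow: /
User-agent: spbot
Disallow: /
User-agent: ScoutJet
Disallow: /
User-agent: Linguee
Disallow: /
"""

BOTS = ["MJ12bot", "dotbot", "twiceler", "Yandex",
        "Baiduspider", "spbot", "ScoutJet", "Linguee"]


def blocked_bots(robots_txt, user_agents, url="http://example.com/"):
    """Return the user agents that may NOT fetch `url` under `robots_txt`.

    `url` is a placeholder; any page on the site would do, since the
    file above disallows "/" (everything) for each listed bot.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, url)]


if __name__ == "__main__":
    # Googlebot is not listed in the file, so it should not be reported.
    print(blocked_bots(ROBOTS_TXT, BOTS + ["Googlebot"]))
```

Each listed bot is shut out of the whole site, while any crawler the file doesn't name stays allowed, since there is no catch-all "User-agent: *" rule.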

  • MJ12bot is the bot used by Majestic-12. They seem to have an index as big as Google’s, with over 1 trillion pages (source), but they currently focus on their SEO tools and the search engine itself doesn’t really give good results yet. They’re planning to work on it a lot this year, though (source).
  • dotbot is the bot of dotnetdotcom.org. I don’t really have a clue what they’re trying to achieve. They’re not a search engine (yet?) and all they do is offer their whole index for download as a huge archive. The latest version to date was released 10 months ago, in April 2009, and contains only 600k pages according to the description… Not worth checking out ATM IMO.
  • twiceler is the bot of Cuil. They say their index holds 127 billion pages, which is not bad for a search engine I had never heard of. They let you choose your preferred language (8 languages available) and the layout is rather pretty. Too bad they can only display 10 results per page, and the results aren’t always sorted very well. But this means that they will sometimes give you interesting results that even Google didn’t manage to find! So, definitely worth trying.
  • Yandex is from Yandex.ru, a Russian search engine. I found no figures about their index; they seem to give good results, although not extensive ones.
  • Baiduspider is from Baidu, a Chinese search engine. For some reason a Google search about it gives a lot of results about people having trouble blocking it. The results don’t seem very good; at least patheticcockroach.com and wiki4games.com don’t show up when I look for them…
  • spbot is from SEOprofiler by Axandra GmbHSE. So apparently only an SEO site, not a search engine.
  • ScoutJet is the bot of Blekko, a yet-to-be-released search engine.
  • Linguee describes itself as a “German-English dictionary”. So I’m not sure why they have a crawler…

On a side note, while searching for the above bots, I also found Slurp, which is Yahoo’s crawler, and Teoma, which is ask.com’s crawler.

Posted in Google, Internet.


3 Responses


  1. Jo says

    Hi there, I just came across your scroogle add-on for Firefox. Would you mind writing the same one for the scroogle german search? Can be found here (I’m sure you know): http://www.scroogle.org/scrapde8.html
    Best
    Jo

  2. patheticcockroach says

    Actually, I didn’t know there were localized versions. Will work on it this week-end then 🙂 (and I’d rather do the SSL version I think: https://ssl.scroogle.org/scrapde8.html)


