Running notes from "Google is Harder Than it Looks"
Nelson Minar, Google
http://conferences.oreillynet.com/cs/et2004/view/e_sess/4916
at the O'Reilly Emerging Technology Conference: http://conferences.oreilly.com/etech/
11 February, 2004, San Diego, CA
by Cory Doctorow doctorow@craphound.com

--

Simplicity is a key part of Google's value proposition, but the back end is complex.

Google: Organize the world's information and make it universally accessible and useful (not just the web, not just all the possible results, not just for geeks).

Google is massively localized, with over 90 ccTLDs. More than half of Google's searches originate outside of the US.

--

Google results:
1. Search results
2. Quicklinks/onebox (not traditional search results, but relevant, such as "define 802.11b")
3. Advertising sidebar (separated carefully from results)
4. Bragging (response time)

--

Search results

Main crawl: the stuff that comes off of the googlebot's bulk-of-the-web crawl; used to be monthly
Fresh: pages that change a lot, worth indexing more often
News: (via onebox and Google News) every ten minutes

--

How a search works:

A query comes into a custom httpd, the Google Web Server ("gwis"). It is sent in parallel to several places:

* Index server: "every page with the word 'apple' in it" -- a cluster that manages "shards" or "partitions" (everything starting with the letter "a"), with load-balancing replicas for each. Has to calculate intersections for multiple-term queries.
* Doc server: copies of webpages, whence page-snippets are served in results.
  Sharded and replicated for scalability and redundancy.
* Misc servers: QuickLinks, spell-checkers, ad server (the first two are small servers; the ad server is humongous)

--

Relevance: Google examines > 100 factors to ensure accurate results:
Link text
Font size
Proximity
Anchor text -- allows for matches to pages that don't even contain the keywords

Google has an adversarial relationship with people who want to get higher results, because Google is committed to presenting unbiased results.

[Ed: what's bias? Nelson: sometimes it's obvious]
[Ed: what if it's not obvious?]

Does Google punish people who spam results? No, we try to keep good stuff at the top.

--

PageRank

Examine the graph structure of the Web: a page with a lot of inbound links is probably relevant, especially if each of the inbound linkers has lots of inbound links in turn. This is a matrix calculation over 30MM nodes with 10 edges each.

--

Onebox: There's one place you type stuff; we guess what you want and put the results in the onebox.

--

Our index has 3.3B pages -- for some definition of "page". We try to give supplemental results with deeper crawls.

Burtonator: A search for "the" returns 5.4B results.

--

Google Ads

Advertising goal: connect people who are visiting webpages to ads.

We do a subtle thing to calculate whose ad gets displayed -- if your clickthrough rate is higher, we give you more prominent placement on the page. Ads that don't rank don't get shown (you're not getting clicks, we're not making money, it annoys the users -- forget it).

--

Ad targeting

We understand pages based on keywords, word frequency, font size, anchor text, linguistic processing; works on dynamic content.

--

Hardware: PCs are unreliable, cheap, and fast. Use software to make it reliable.
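The hardware note above -- cheap, failure-prone PCs made reliable in software -- pairs with the sharded-and-replicated index described earlier. A minimal sketch of the failover idea (the replica names, the ReplicaDown exception, and the query functions are invented for illustration, not Google's actual code):

```python
# Toy sketch: each index shard has several replicas on cheap machines;
# software-level fault tolerance means trying replicas until one answers.
# Replica names and the "down" naming convention are invented here.
class ReplicaDown(Exception):
    """Raised when a simulated replica does not answer."""

def query_replica(replica, term):
    # Stand-in for a network call to one replica of an index shard;
    # replicas whose name contains "down" simulate crashed machines.
    if "down" in replica:
        raise ReplicaDown(replica)
    return f"postings for {term!r} from {replica}"

def reliable_query(replicas, term):
    # Fail over across replicas of the shard; give up only if all fail.
    for replica in replicas:
        try:
            return query_replica(replica, term)
        except ReplicaDown:
            continue
    raise RuntimeError("all replicas down")

result = reliable_query(["rep0-down", "rep1", "rep2"], "apple")
# rep0 is down, so rep1 answers: "postings for 'apple' from rep1"
```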
--

Google File System

Fault-tolerant mass storage: 300TB on 1000 machines.

Design decisions: machine failure is common; 1GB+ files; high bandwidth traded for low latency; files are appended, not edited; API and apps co-developed with the filesystem.

Implementation: the master sends requests to chunkservers that manage 64MB chunks; the master's metadata tracks the chunks.

eof
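The chunking scheme in the GFS notes above can be sketched roughly as follows. This is a toy illustration only: the Master class, the round-robin placement, and all names are assumptions for the sketch, not GFS's actual design (which also handles replication, leases, and failure recovery).

```python
# Toy sketch of GFS-style chunking, per the notes: files are split into
# 64MB chunks stored on chunkservers, and the master holds only the
# metadata mapping each file to its chunks. Round-robin placement and
# the class names are invented for illustration.
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks

class Master:
    """Metadata only: file name -> [(chunk_id, chunkserver), ...]."""

    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        self.files = {}
        self.next_id = 0

    def create(self, name, size_bytes):
        # One chunk per 64MB of file (ceiling division), placed
        # round-robin across chunkservers.
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))
        chunks = []
        for _ in range(n_chunks):
            server = self.chunkservers[self.next_id % len(self.chunkservers)]
            chunks.append((self.next_id, server))
            self.next_id += 1
        self.files[name] = chunks
        return chunks

master = Master(["cs0", "cs1", "cs2"])
layout = master.create("crawl.log", 200 * 1024 * 1024)
# A 200MB file needs 4 chunks; they land on cs0, cs1, cs2, cs0.
```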