Re: 請教一些面試問題

看板Oversea_Job (海外工作)作者LINC.時間18年前 (2007/08/24 09:23)推噓0(0推 0噓 0→)

留言0則, 0人參與討論串3/8 (看更多)

※ 引述《michaelz (michaelz)》之銘言： : ※ 引述《LINC (Go cubs!)》之銘言： : : 第一道題: : : Given N computers networked together, with each computer storing N integers, : : finds the "median" of all of the numbers. Assuming a computer can hold O(N) : : integers. Also assume that there exists a computer on the network without : : integers, that we can use to interface with the other computers storing the : : integers. : : 第二道題: : : How to fast check if a URL is visited by web crawler? : : 我看到的解法: hash table (有這麼簡單嗎@@) : : 直覺上來說好像不對勁 : : 一個URL假設是20 char, 算20 bytes : : 假設Internet有5 billion pages -> 5 * 20 billion bytes = 100 billion bytes : : = 100 GB : : 100GB(至少) hastable? 有沒搞錯? : : 我查了一下wikipedia 上面也是說Google有個URL server專門在作這個URL revisit : : check : : 請問真的是用Hashing嗎還是Distributed Hashing?? : 這看起來滿像google的題目 : 第一題應該可以做到average NlogN 或是 linear, 把quick sort變一下就行了 : 第二題的話可以用database 加上index 然後再加一層cache,太大的話做partition : 分到不同的database上, 或是把database換成hashtable也行今天看了一下以前上課上過的paper(Mercator web crawler, written in Java) 它有提到他們是怎麼作的簡單的敘述: 假設整個Internet URLs無法放進整個HashSet(註) 所以就把它放在disk上另外使用LRU cache來作in memory cache 程序如下新的URL進來: -> 用LRU cache看有無cache hit -> 有就丟掉 -> 無的話找disk上的資料 -> 在disk上丟掉 -> 不在disk上, unvisited URL, pop first URL in LRU cache and write it to disk. Add new URL to LRU cache. 註: 1998年為10 million web pages, 2006年為4.2 billion -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 141.158.245.93