about 2 years ago

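The crawled data is already stored in the database in structured form. Suppose we enter the query 中文搜尋 ("Chinese search") and want every url on which all of the query words appear on the same page, together with each word's position (wordlocation) and its wordid. Taking 3400 and 2 as the wordids of the two query words, this can be written in SQL as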

select w0.urlid,w0.location,w1.location
from wordlocation w0, wordlocation w1
where w0.wordid=3400
and w0.urlid=w1.urlid 
and w1.wordid=2        

which returns the matching rows, one for every combination of positions of the two words on the same page.

| urlid | w0.location | w1.location |
|---|---|---|
| 31 | 20 | 11 |
| 31 | 20 | 119 |
| 13 | 102 | 87 |
| 13 | 102 | 107 |
| $$\vdots$$ | $$\vdots$$ | $$\vdots$$ |
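
To make the self-join concrete, here is a minimal sketch that runs the same query against an in-memory SQLite database. The table contents are made up to reproduce the rows shown above, the extra url 7 is a hypothetical page that contains only one of the two words, and the `order by` clause is added only to make the output order predictable; none of this comes from the real crawl.

```python
import sqlite3

# In-memory database with a toy wordlocation table (illustrative values only)
con = sqlite3.connect(":memory:")
con.execute(
    "create table wordlocation(urlid integer, wordid integer, location integer)")
con.executemany(
    "insert into wordlocation values (?,?,?)",
    [
        (31, 3400, 20),
        (31, 2, 11),
        (31, 2, 119),
        (13, 3400, 102),
        (13, 2, 87),
        (13, 2, 107),
        (7, 3400, 5),  # hypothetical page with only word 3400: must not match
    ],
)

# Same self-join as above; "order by" is an addition for a stable row order
query = """
select w0.urlid, w0.location, w1.location
from wordlocation w0, wordlocation w1
where w0.wordid=3400
and w0.urlid=w1.urlid
and w1.wordid=2
order by w0.urlid desc, w1.location
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

Each page that contains both words yields one row per pair of positions, so a page where word 2 occurs twice appears twice, and a page missing either word drops out of the join.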

Doing the same thing from Python with a SQLite connection gives the following function:


Chinese search

news_crawler.py
import jieba  # Chinese word segmentation

def getmatchrows(self, q):
    # q: input query, usage: getmatchrows('中文搜尋')
    # output: (rows, wordids)
    #    --> ([(urlid1, wordlocation1, wordlocation2, ...),
    #          (urlid2, wordlocation1, wordlocation2, ...),
    #          ...], [wordid1, wordid2])
    # executes SQL such as:
    #    select w0.urlid,w0.location,w1.location
    #    from wordlocation w0, wordlocation w1
    #    where w0.wordid=3400
    #    and w0.urlid=w1.urlid
    #    and w1.wordid=2

    # strings used to build the query
    fieldlist = 'w0.urlid'
    tablelist = ''
    clauselist = ''
    wordids = []

    # split the query into words with the jieba engine
    words = jieba.cut(q, cut_all=False)

    # tablenumber counts only the words found in wordlist, so the aliases
    # w0, w1, ... stay consecutive even when a query word is unknown
    tablenumber = 0
    for word in words:
        # look up the wordid; the ? placeholder avoids quoting problems
        wordrow = self.con.execute(
            "select rowid from wordlist where word=?", (word,)).fetchone()
        if wordrow is not None:
            wordid = wordrow[0]
            wordids.append(wordid)
            if tablenumber > 0:
                tablelist += ','
                clauselist += ' and '
                clauselist += 'w%d.urlid=w%d.urlid and ' % (tablenumber-1, tablenumber)
            fieldlist += ',w%d.location' % tablenumber
            tablelist += 'wordlocation w%d' % tablenumber
            clauselist += 'w%d.wordid=%d' % (tablenumber, wordid)
            tablenumber += 1

    # none of the query words is in the index: nothing to select
    if not wordids:
        return [], wordids

    # create the full query from the separate parts
    fullquery = "select %s from %s where %s" % (fieldlist, tablelist, clauselist)
    cur = self.con.execute(fullquery)
    rows = [row for row in cur]

    return rows, wordids
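
The query-building step can also be looked at in isolation. The sketch below uses a hypothetical helper, `build_query`, which is not part of news_crawler.py; it takes a list of integer wordids that are assumed to have been looked up already and prints the SQL that the string concatenation produces.

```python
def build_query(wordids):
    """Build the self-join SQL for a list of integer word ids (hypothetical helper)."""
    if not wordids:
        raise ValueError("need at least one wordid")
    fields = ["w0.urlid"]
    tables = []
    clauses = []
    for n, wordid in enumerate(wordids):
        # one location column and one table alias per query word
        fields.append("w%d.location" % n)
        tables.append("wordlocation w%d" % n)
        if n > 0:
            # tie each alias to the previous one so all words share a url
            clauses.append("w%d.urlid=w%d.urlid" % (n - 1, n))
        clauses.append("w%d.wordid=%d" % (n, wordid))
    return "select %s from %s where %s" % (
        ",".join(fields), ",".join(tables), " and ".join(clauses))


sql = build_query([3400, 2])
print(sql)
```

For the wordids 3400 and 2 this prints the same select statement as the hand-written SQL, on a single line. Collecting the parts in lists and joining them at the end is just one way to avoid tracking where the commas and `and` separators go.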

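getmatchrows returns a pair (rows, wordids). The search result rows has the form [(urlid1,wordlocation1,wordlocation2),(urlid2,wordlocation1,wordlocation2),...], and wordids is the list of matched word ids, here [3400,2]. We can use this information to tackle the ranking problem next.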
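
As a rough illustration of what the returned data makes possible, the sketch below derives two simple per-page signals from a sample return value: how many position combinations each url has, and the smallest gap between the two words. The sample rows are copied from the example table, and both signals are assumptions chosen for illustration, not necessarily what the ranking post uses.

```python
from collections import Counter

# Sample return value in the shape described above (values from the example table)
rows = [(31, 20, 11), (31, 20, 119), (13, 102, 87), (13, 102, 107)]
wordids = [3400, 2]

# Number of matching position combinations per url
counts = Counter(row[0] for row in rows)

# Smallest distance between the two query words on each page
min_gap = {}
for urlid, loc0, loc1 in rows:
    gap = abs(loc0 - loc1)
    min_gap[urlid] = min(gap, min_gap.get(urlid, gap))

print(counts)
print(min_gap)
```

Because every row starts with the urlid, grouping by the first element is enough to turn the flat list of rows into per-page statistics.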

Next post… Search Engine (3): Page Ranking
