
Type a few keywords into a computer and you get back a ranked list of relevant web pages; that is the job of a search engine. This series of posts walks through building a bare-bones search engine from scratch, modeled after Google. The work breaks down as follows:

  • Web crawling (Crawler)

    • Relational database schema (SQLite)
    • Chinese word segmentation (Jieba)
    • Regular expressions
  • Page ranking (searcher)

    • Word frequency
    • Word location
    • Inbound links
    • PageRank
    • Link text

Database schema

All of the text and link data has to be pulled down from the web pages, so besides writing the crawler itself we first need to decide on the database structure. Specifically, we want to keep the following information:

  • word
  • url
  • wordlocation
  • fromurl, tourl

The database is designed as follows:


(In the schema diagram, the columns marked in red carry an index to speed up RDBMS queries; these are the indexed columns in the SQL sketch below.)
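
For reference, the same schema can be written out as a short SQL script; this is only a sketch equivalent to what the createindextable method further down executes, and the database file name searchindex.db is just an example:

# schema_sketch.py -- stand-alone sketch of the crawler's SQLite schema
import sqlite3

schema = """
create table urllist(url);
create table wordlist(word);
create table link(fromid integer, toid integer);
create table wordlocation(wordid, urlid, location);
create table linkword(wordid, linkid);
create index wordidx on wordlist(word);
create index urlidx on urllist(url);
create index wordurlidx on wordlocation(wordid);
create index urltoidx on link(toid);
create index urlfromidx on link(fromid);
"""

con = sqlite3.connect('searchindex.db')  # example file name
con.executescript(schema)
con.commit()
con.close()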

Chinese word segmentation (Jieba)

Unlike English text, which can be split with a simple split() call, segmenting Chinese into words is an inherently messy problem. The Jieba (結巴) package handles most of that trouble for Python users (R users can refer to this post). For an introduction to segmenting Traditional Chinese, see this write-up; it should get you productive with Chinese word segmentation quickly.

news_crawler.py
def wordsplit(content):
    # use the jieba engine to split a Chinese article into words

    time0 = time.time()
    words = jieba.cut(content, cut_all=False)
    print "word split elapsed: {}".format(time.time()-time0)
    return (word.strip() for word in words if len(word.strip()) > 1)  # iterator
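
As a quick sanity check, here is roughly how jieba.cut behaves on a short headline, filtered the same way as in wordsplit; the sample sentence and the printed segmentation are only illustrative, and the exact output depends on the dictionary loaded via jieba.set_dictionary:

# -*- coding: utf8 -*-
import jieba

jieba.set_dictionary('dict.txt.big')  # traditional-Chinese dictionary, as in news_crawler.py

sentence = u'今天的頭條新聞出爐了'     # illustrative headline
words = [w.strip() for w in jieba.cut(sentence, cut_all=False) if len(w.strip()) > 1]
print '/'.join(words)                 # e.g. 今天/頭條/新聞/出爐 (dictionary dependent)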

Regular expressions

In a regular expression, Chinese characters can be matched with the Unicode range u'[\u4e00-\u9fa5]'.

news_crawler.py
def gettextonly(text):
    # keep only the runs of Chinese characters, joined by single spaces
    pat = re.compile(u'[\u4e00-\u9fa5]+')
    return ' '.join(re.findall(pat, text))
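
For example, applied to a string that mixes Chinese text with English, numbers, and markup leftovers, gettextonly keeps only the Chinese runs (the sample string is just for illustration):

# -*- coding: utf8 -*-
import re

def gettextonly(text):
    pat = re.compile(u'[\u4e00-\u9fa5]+')
    return ' '.join(re.findall(pat, text))

print gettextonly(u'Google 新聞 2014-05-01 <br> 頭條新聞！')
# -> 新聞 頭條新聞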


Crawling workflow

Breadth-first search (cover the whole current level before drilling deeper); a stripped-down skeleton of the loop follows the list below.

  1. Start from a page, fetch its content, and segment it into words.
  2. Store the segmented words in the database (urllist/wordlocation/wordlist).
  3. Extract the page's links and their link text.
  4. Segment the link text (linkTexts).
  5. Store the link data in the database (link/linkword).
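
Stripped of the database details, the breadth-first loop looks roughly like this; index_page, fetch_links, and store_link are hypothetical placeholders for the indexing, link-extraction, and link-storage steps, and the real logic lives in Crawler.crawler below:

def breadth_first_crawl(pages, pagedepth=1):
    # pages: the seed URLs for the current level
    # index_page / fetch_links / store_link are placeholders for steps 1-5 above
    for depth in range(pagedepth):
        newpages = {}                                # links discovered on this level
        for page in pages:
            index_page(page)                         # steps 1-2: fetch, segment, store words
            for url, linktext in fetch_links(page):  # step 3: outgoing links and their text
                store_link(page, url, linktext)      # steps 4-5: store link and link words
                newpages[url] = linktext             # queue the link for the next level
        pages = newpages                             # go one level deeper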


Crawler code

news_crawler.py
#! encoding:utf8

# search engine


from collections import defaultdict
from bs4 import BeautifulSoup
from urlparse import urljoin
from sqlite3 import dbapi2 as sqlite
import requests
import pandas as pd
import jieba 
import re
import ipdb
import time

jieba.set_dictionary('dict.txt.big')
url = 'https://news.google.com.tw/'

class Crawler:

    ## crawler for Google News, for searching and ranking purposes

    ## only Chinese characters are kept

    ## usage: Crawler('dbname.db')


    def __init__(self,dbname):
        self.con = sqlite.connect(dbname)
    def __del__(self):
        self.con.close()


    def createindextable(self):
        
        self.con.execute('create table urllist(url)')
        self.con.execute('create table wordlist(word)')
        self.con.execute('create table link(fromid integer,toid integer)')
        self.con.execute('create table wordlocation(wordid,urlid,location)')
        self.con.execute('create table linkword(wordid,linkid)')
        self.con.execute('create index wordidx on wordlist(word)')
        self.con.execute('create index urlidx on urllist(url)')
        self.con.execute('create index wordurlidx on wordlocation(wordid)')
        self.con.execute('create index urltoidx on link(toid)')
        self.con.execute('create index urlfromidx on link(fromid)')
        self.con.commit()

    def isindexed(self,url):
    # return true if url is already indexed

        # ipdb.set_trace()

        cur = self.con.execute(
            "select rowid from urllist where url= '%s' " %url
            ).fetchone()
        if cur != None:
            # check it has been crawled

            # urlid is stored as an integer, so compare it without quotes
            v = self.con.execute(
                "select * from wordlocation where urlid=%d" % cur[0]).fetchone()
            if v!= None:            
                return True

        return False

    def addtoIndex(self,url,words):
    # Index a whole page, wordlocation

        if self.isindexed(url): return
        print "Indexing, url: %s" %url
        
        # Get urlid

        urlid = self.getentryid('urllist','url',url)
        # Link each word to this url

        for location,word in enumerate(words):
            wordid = self.getentryid('wordlist','word',word)
            self.con.execute(
                "insert into wordlocation(urlid,wordid,location) values(%d,%d,%d)"
                %(urlid,wordid,location) 
                )
        # db.commit()


    def getentryid(self,table,field,value, create=True):
        # auxilary function to get entry id, if not create it in default

        
        cur = self.con.execute(
                        "select rowid from %s where %s = '%s'" 
                        % (table,field,value)
                        )
        res = cur.fetchone()
        # db.commit()

        # ipdb.set_trace()

        if res == None:
            cur = self.con.execute(
                            "insert into %s (%s) values ('%s')"
                            %(table,field,value)
                            )            
            return cur.lastrowid
        else:             
            return res[0]

    def addlinkref(self,fromurl,tourl,texts):
        # add a link between two pages: linkword/link

        
        fromid = self.getentryid('urllist','url',fromurl)
        toid = self.getentryid('urllist','url',tourl)
        if fromid==toid: return 
        cur = self.con.execute(
                    "insert into link(fromid,toid) values(%d,%d)" % (fromid,toid))
        linkid=cur.lastrowid

        for text in texts:
            # linkword table

            wordid = self.getentryid('wordlist','word',text)
            cur = self.con.execute(
                "insert into linkword(linkid,wordid) values(%d,%d)" %(linkid,wordid)
                                )        
    

    def crawler(self,pages,pagedepth=1):
        # crawl for given pages:[url1,url2,url3...]


        for depth in range(pagedepth):
            newpages = {}
            for num,page in enumerate(pages):
                try:
                    res = requests.get(page)
                except:
                    # skip pages that cannot be fetched instead of reusing a stale response
                    print 'Could not find %s page' % page
                    continue

                soup = BeautifulSoup(res.text)

                # chinese contents

                contents=gettextonly(soup.text)
                words = wordsplit(contents) # split

                
                # indexed wordlist/urllist/wordlocation into db

                self.addtoIndex(page,words)
                print "addtoindex, page:%d"%num
                # url links <a href=''>xx</a>

                links = soup.select('a')

                for i,link in enumerate(links):

                    if 'href' in link.attrs:
                        url = urljoin(page,link['href'])
                        if url.find("'") != -1: continue
                        url = url.split('#')[0]
                        if url[0:4] == 'http':
                            newpages[url]=link.text # link.text->linktitle, url->link

                        linkTexts = wordsplit(gettextonly(link.text))

                        self.addlinkref(page,url,linkTexts)
                        # print "addlinkref, link:%d"%i

                # print 'link %s,newpages %s '%(link,newpages.keys())

                self.con.commit()
            pages = newpages
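
Putting it together, a minimal run of the crawler could look like the snippet below; the database file name searchindex.db and the crawl depth are only examples, and createindextable should only be called once on a fresh database:

# run_crawler.py -- example usage of the Crawler class above
from news_crawler import Crawler

crawler = Crawler('searchindex.db')                            # example database file
crawler.createindextable()                                     # only on a brand-new database
crawler.crawler(['https://news.google.com.tw/'], pagedepth=2)  # seed URL from news_crawler.py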

Next up… Search Engine (2): Chinese Search
