
In Part 1, the recommender system imagined the users and movie titles as one large table and assumed a linear relationship between movie features and each user's preference weights: fit the known entries of the table as well as possible, then obtain the best approximate parameters by minimizing an error function.

But we can also compute how similar each pair of users is, and use that similarity to make recommendations. That is the approach taken in this post, described below.

We already have each user's ratings of the movies:

| Movie \ User | Claudia Puig | Gene Seymour | Jack Matthews | Lisa Rose | Michael Phillips | Mick LaSalle | Toby |
|---|---|---|---|---|---|---|---|
| Just My Luck | 3 | 1.5 | ? | 3 | ? | 2 | ? |
| Lady in the Water | ? | 3.0 | 3 | 2.5 | 2.5 | 3.0 | ? |
| Snakes on a Plane | 3.5 | 3.5 | 4 | 3.5 | 3.0 | 4 | 4.5 |
| Superman Returns | 4 | 5.0 | 5.0 | 3.5 | 3.5 | 3 | 4.0 |
| The Night Listener | 4.5 | 3.0 | 3.0 | 3.0 | 4.0 | 3.0 | ? |
| You, Me and Dupree | 2.5 | 3.5 | 3.5 | 2.5 | ? | 2 | 1.0 |

There are two common ways to define how similar two users are:

Euclidean distance

To measure how similar Toby and Rose are, treat each movie's rating as one axis of a space, take the straight-line distance between the two users in that space, and convert it into a similarity score:
$$
\frac{1}{1 + ||x_{Toby} - x_{Rose}||}
$$

pearson_recommendation.py
def sim_distance(data, person1, person2):
    # Euclidean-distance similarity: 1 / (1 + distance)

    square = (data[person1] - data[person2])**2
    sumOfsquare = np.sum(square.fillna(0))  # movies rated by only one person contribute 0
    distance = np.sqrt(sumOfsquare)
    score = 1/(1 + distance)

    return score
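As a quick check, the function can be applied to just the three movies that both Toby and Lisa Rose rated (values taken from the table above; the function is repeated here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

def sim_distance(data, person1, person2):
    # same Euclidean-distance similarity as above
    square = (data[person1] - data[person2]) ** 2
    distance = np.sqrt(np.sum(square.fillna(0)))
    return 1 / (1 + distance)

# the three movies both Toby and Lisa Rose rated, from the table above
data = pd.DataFrame(
    {'Toby': [4.5, 1.0, 4.0], 'Lisa Rose': [3.5, 2.5, 3.5]},
    index=['Snakes on a Plane', 'You, Me and Dupree', 'Superman Returns'])

score = sim_distance(data, 'Toby', 'Lisa Rose')
print(round(score, 3))  # 0.348
```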

Pearson correlation

Using distance alone has a drawback. Suppose Toby and Rose have exactly the same taste in movies, but Rose is more conservative and consistently rates everything 1–2 points lower than Toby. Distance would call them dissimilar, while correlation, which ignores such constant offsets, gives a better result.

See Wikipedia for the definition of the Pearson correlation coefficient.

In numpy, calling np.corrcoef(list1,list2) directly returns the 2×2 correlation-coefficient matrix (not the covariance matrix); the off-diagonal entry is the Pearson coefficient between the two lists.
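For example, using Toby's and Rose's ratings on their three shared movies, plus a pair of offset-only lists to illustrate why correlation handles the "conservative rater" case:

```python
import numpy as np

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is
# the Pearson coefficient between the two rating lists
r = np.corrcoef([4.5, 1.0, 4.0], [3.5, 2.5, 3.5])[0, 1]
print(round(r, 2))  # 0.99

# a constant offset leaves the correlation unchanged: a stricter rater
# who scores everything 2 points lower still gets r = 1
r_offset = np.corrcoef([3.0, 4.0, 5.0], [1.0, 2.0, 3.0])[0, 1]
print(round(r_offset, 2))  # 1.0
```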

pearson_recommendation.py
def sim_pearson(data, p1, p2):
    # return the Pearson similarity between two persons

    p1Lists = data[p1]
    p2Lists = data[p2]

    # keep only the movies both persons have rated
    p1index = p1Lists[p1Lists.notnull()].index
    p2index = p2Lists[p2Lists.notnull()].index
    commonIndex = p1index.intersection(p2index)

    return np.corrcoef(p1Lists[commonIndex], p2Lists[commonIndex])[0, 1]

Similarity weighting

Toby has not yet rated The Night Listener, Just My Luck, and Lady in the Water. To predict his rating for one of them, multiply each other user's rating of that movie by that user's similarity to Toby. Taking The Night Listener as an example:

| User | Similarity | The Night Listener | Similarity × rating |
|---|---|---|---|
| Rose | 0.99 | 3.0 | 2.97 |
| Seymour | 0.38 | 3.0 | 1.14 |
| Puig | 0.89 | 4.5 | 4.02 |
| LaSalle | 0.92 | 3.0 | 2.77 |
| Matthews | 0.66 | 3.0 | 1.99 |
| Total | | | 12.89 |
| Sum of similarities | 3.84 | | |
| Total / sum of similarities | | | 3.35 |

So we can predict that Toby would rate The Night Listener 3.35.
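The table's arithmetic can be reproduced directly. Note this uses the rounded similarities shown above, so the intermediate weighted total comes out slightly below the table's 12.89, which was computed from unrounded values; the final prediction still agrees:

```python
# similarity of each user to Toby, and their rating of The Night Listener
sims    = [0.99, 0.38, 0.89, 0.92, 0.66]   # Rose, Seymour, Puig, LaSalle, Matthews
ratings = [3.0, 3.0, 4.5, 3.0, 3.0]

weighted_total = sum(s * r for s, r in zip(sims, ratings))
prediction = weighted_total / sum(sims)
print(round(prediction, 2))  # 3.35
```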


Code

pearson_recommendation.py
## using Pearson correlation to evaluate the similarity

## collaborative filtering based on 1) users 2)items 


# 2016.02.05


# module 

from collections import defaultdict
import pandas as pd
import numpy as np

### data ####

critics = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
                         'The Night Listener': 3.0},
           'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                            'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
                            'You, Me and Dupree': 3.5},
           'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                                'Superman Returns': 3.5, 'The Night Listener': 4.0},
           'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                            'The Night Listener': 4.5, 'Superman Returns': 4.0,
                            'You, Me and Dupree': 2.5},
           'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                            'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
                            'You, Me and Dupree': 2.0},
           'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                             'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
           'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}}


data = pd.DataFrame(critics)
dataItem = data.T
## return the similarity score between two persons


def sim_distance(data, person1, person2):
    # Euclidean-distance similarity: 1 / (1 + distance)

    square = (data[person1] - data[person2])**2
    sumOfsquare = np.sum(square.fillna(0))  # movies rated by only one person contribute 0
    distance = np.sqrt(sumOfsquare)
    score = 1/(1 + distance)

    return score

def sim_pearson(data, p1, p2):
    # return the Pearson similarity between two persons

    p1Lists = data[p1]
    p2Lists = data[p2]

    # keep only the movies both persons have rated
    p1index = p1Lists[p1Lists.notnull()].index
    p2index = p2Lists[p2Lists.notnull()].index
    commonIndex = p1index.intersection(p2index)

    return np.corrcoef(p1Lists[commonIndex], p2Lists[commonIndex])[0, 1]



def top_match(data, person, n=5, similarity=sim_pearson):
    # return the n users most similar to `person`;
    # the number of results (n) is optional

    scores = [(other, similarity(data, other, person)) for other in data
              if other != person]
    scores.sort(key=lambda pair: pair[1], reverse=True)

    return scores[:n]


def getRecommendations(data, person, similarity=sim_pearson):
    # Get recommendations for a person by using a weighted average
    # of every other user's ratings

    # similarity of every other user to `person`
    sim_person = pd.Series({other: similarity(data, other, person)
                            for other in data if other != person})

    # ignore users whose similarity is not positive
    sim_person = sim_person[sim_person > 0]

    # items the person has not rated yet
    dataperson = data[person]
    itemRecommendation = dataperson[dataperson.isnull()].index

    rankings = []
    for item in itemRecommendation:
        # ratings of this item from the similar users, skipping NaN
        scores_from_other = data.loc[item, sim_person.index]
        scores_from_other = scores_from_other[scores_from_other.notnull()]

        weights = sim_person[scores_from_other.index]
        normalized_star = np.dot(scores_from_other, weights) / weights.sum()
        rankings.append((item, normalized_star))

    # sort by predicted score, highest first
    rankings.sort(key=lambda pair: pair[1], reverse=True)

    return rankings
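Putting it together, the pipeline can be exercised end to end. This compact sketch reimplements the same logic (same data, Pearson weights, positive similarities only) so it runs on its own:

```python
import numpy as np
import pandas as pd

critics = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0, 'Superman Returns': 3.5,
                         'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
           'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                            'Just My Luck': 1.5, 'Superman Returns': 5.0,
                            'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
           'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                                'Superman Returns': 3.5, 'The Night Listener': 4.0},
           'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                            'The Night Listener': 4.5, 'Superman Returns': 4.0,
                            'You, Me and Dupree': 2.5},
           'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                            'Just My Luck': 2.0, 'Superman Returns': 3.0,
                            'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
           'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                             'The Night Listener': 3.0, 'Superman Returns': 5.0,
                             'You, Me and Dupree': 3.5},
           'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
                    'Superman Returns': 4.0}}
data = pd.DataFrame(critics)

def sim_pearson(data, p1, p2):
    # Pearson correlation over the movies both users rated
    common = data[[p1, p2]].dropna()
    return np.corrcoef(common[p1], common[p2])[0, 1]

# similarity of every other user to Toby, keeping positive values only
sims = pd.Series({u: sim_pearson(data, u, 'Toby') for u in data if u != 'Toby'})
sims = sims[sims > 0]

# predicted rating for each movie Toby has not seen
preds = {}
for item in data['Toby'][data['Toby'].isnull()].index:
    scores = data.loc[item, sims.index].dropna()
    w = sims[scores.index]
    preds[item] = np.dot(scores, w) / w.sum()

for item, p in sorted(preds.items(), key=lambda kv: kv[1], reverse=True):
    print(item, round(p, 2))
# The Night Listener 3.35
# Lady in the Water 2.83
# Just My Luck 2.53
```

This matches the hand calculation above for The Night Listener; Michael Phillips is dropped automatically because his correlation with Toby over their two shared movies is negative.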

        