almost 3 years ago

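This post implements the recommender-system approach taught by Andrew Ng. For detailed material, see the course notes cited in the code below (http://www.holehouse.org/mlclass/16_Recommender_Systems.html).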

Suppose we have users' ratings of online movies, as follows:

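| Movie \ User | Alice | Bob | Carol | Dave | Eva | Genre 1 (x1) | Genre 2 (x2) |
|---|---|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | ? | ? | ? |
| Romance forever | 5 | ? | ? | 0 | ? | ? | ? |
| Cute puppies of love | ? | 4 | 0 | ? | ? | ? | ? |
| Non stop car chase | 0 | 0 | 5 | 4 | ? | ? | ? |
| Swords vs. Karate | 0 | 0 | 5 | ? | ? | ? | ? |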

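We want to predict the score a user would give to a movie they have not rated yet. This is implemented with the linear-regression hypothesis ($Y \approx X^T\Theta$).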

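The idea is to treat users and movie titles as one large table. Assume that the movie genre features ($X$) and each user's rating weights ($\Theta$) are linearly related ($Y = X^T\Theta$). We fit the known entries of the table as closely as possible, and obtain the best solution by minimizing the error function.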

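Let $X$ hold the genre weights of each movie (a matrix with one row per genre feature and one column per movie), and let $\Theta$ hold each user's preference for each genre (a matrix with one row per genre feature and one column per user).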
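To make the shapes concrete, here is a minimal sketch of the prediction $X^T\Theta$. The feature values are made up by hand for illustration; they are not taken from the post's data.

```python
import numpy as np

# Hypothetical feature values, chosen by hand for illustration only.
# X: one row per genre feature, one column per movie (2 x 3).
X = np.array([[0.9, 1.0, 0.1],    # x1: "romance" weight of each movie
              [0.0, 0.1, 1.0]])   # x2: "action" weight of each movie

# Theta: one row per genre feature, one column per user (2 x 2).
Theta = np.array([[5.0, 0.0],     # how much each user likes x1
                  [0.0, 5.0]])    # how much each user likes x2

# Predicted ratings: rows are movies, columns are users (3 x 2).
Y_pred = X.T @ Theta
print(Y_pred.shape)
print(Y_pred)
```

Each entry of `Y_pred` is the dot product of one movie's feature column with one user's preference column, which is exactly the linear-regression hypothesis applied per user.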

We minimize
$$
\min_{X,\Theta} \frac{1}{2} \| X^T \Theta - Y\|^2
$$

over the rated entries only. The $X$ and $\Theta$ found this way, substituted into $X^T\Theta$, give the best prediction.
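The error function can be sketched as follows. This is an illustration under assumptions, not the post's own code: the function name `masked_cost` is hypothetical, missing ratings are assumed to be stored as `np.nan`, and the optional L2 penalty (`lam`) mirrors the regularization factor used later in the implementation.

```python
import numpy as np

def masked_cost(X, Theta, Y, lam=0.0):
    """Half the squared error over rated entries, plus an optional L2 penalty.

    Y uses np.nan for missing ratings; those entries contribute no error.
    """
    diff = X.T @ Theta - Y
    diff = np.where(np.isnan(Y), 0.0, diff)   # ignore unrated entries
    penalty = 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return 0.5 * np.sum(diff ** 2) + penalty

# Tiny hand-made example: 2 movies, 2 users, 1 feature.
Y = np.array([[5.0, np.nan],
              [np.nan, 4.0]])
X = np.ones((1, 2))
Theta = np.array([[4.0, 4.0]])
print(masked_cost(X, Theta, Y))  # 0.5 * ((4 - 5)**2 + (4 - 4)**2) = 0.5
```

Zeroing the difference at unrated positions is what keeps the unknown "?" cells from pulling the fit toward any particular value.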

The implementation steps are:

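  1. Initialize $\tilde{X}$ and $\tilde{\Theta}$ with small random numbers, and define the error matrix $\Delta = \tilde{X}^T\tilde{\Theta} - Y$ with the ? entries replaced by 0, where $Y$ is the training matrix after subtracting each movie's mean $\mu$.
  2. Iterate with gradient descent, where $\alpha$ is the learning rate and $\lambda$ the regularization factor: $$ \tilde{X} \leftarrow \tilde{X} - \alpha \left( \tilde{\Theta}\Delta^T + \lambda \tilde{X} \right)
    $$ $$ \tilde{\Theta} \leftarrow \tilde{\Theta} - \alpha \left( \tilde{X}\Delta + \lambda \tilde{\Theta}\right) $$
  3. Add each movie's mean back: $$ Y_{pred} = \tilde{X}^T\tilde{\Theta} + \mu $$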
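The three steps can be condensed into a NumPy-only sketch like the one below. It is an illustration, not the post's script: the function name `factorize`, the `np.nan` encoding of missing ratings, the fixed seed, and the step size and iteration count (`alpha=0.02`, `n_iter=5000`) are all assumptions picked to keep the example conservative.

```python
import numpy as np

def factorize(Y, n_features=2, alpha=0.02, lam=0.1, n_iter=5000, seed=0):
    """Mean-normalized low-rank factorization by gradient descent.

    Y: movies x users array with np.nan for missing ratings.
    Returns the predicted ratings with the per-movie mean added back.
    """
    rng = np.random.default_rng(seed)
    rated = ~np.isnan(Y)

    # Per-movie mean over rated entries (0 if a movie has no ratings).
    counts = rated.sum(axis=1, keepdims=True)
    mu = np.where(rated, Y, 0.0).sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    Y_tilde = np.where(rated, Y - mu, 0.0)

    # Step 1: small random initialization.
    X = 0.1 * rng.standard_normal((n_features, Y.shape[0]))
    Theta = 0.1 * rng.standard_normal((n_features, Y.shape[1]))

    for _ in range(n_iter):
        # Error matrix, forced to zero where no rating exists.
        delta = np.where(rated, X.T @ Theta - Y_tilde, 0.0)
        # Step 2: simultaneous update of both factors (the right-hand
        # side is evaluated with the old X and Theta).
        X, Theta = (X - alpha * (Theta @ delta.T + lam * X),
                    Theta - alpha * (X @ delta + lam * Theta))

    # Step 3: add the movie means back.
    return X.T @ Theta + mu

nan = np.nan
# Rows: the five movies; columns: Alice, Bob, Carol, Dave, Eva
# (Eva has no ratings), following datasets1 in the post's code.
Y = np.array([[5, 5, 0, 0, nan],
              [5, nan, nan, 0, nan],
              [nan, 4, 0, nan, nan],
              [0, 0, 5, 4, nan],
              [0, 0, 5, nan, nan]], dtype=float)
Y_pred = factorize(Y)
print(np.round(Y_pred, 1))
```

Because Eva has no ratings, only the regularization term acts on her column of $\tilde{\Theta}$, so her predictions shrink toward each movie's mean. That is the practical benefit of the mean normalization in steps 1 and 3.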

Code

my_recommend_means.py
#! encoding: utf8 


import numpy as np
import pandas as pd

## datasets #####

datasets2 = {
         'Lisa Rose': {'Lady in the Water': 2.5,
                       'Snakes on a Plane': 3.5,
                       'Just My Luck': 3.0,
                       'Superman Returns': 3.5,
                       'You, Me and Dupree': 2.5,
                       'The Night Listener': 3.0},
         'Gene Seymour': {'Lady in the Water': 3.0,
                          'Snakes on a Plane': 3.5,
                          'Just My Luck': 1.5,
                          'Superman Returns': 5.0,
                          'The Night Listener': 3.0,
                          'You, Me and Dupree': 3.5},
 
        'Michael Phillips': {'Lady in the Water': 2.5,
                             'Snakes on a Plane': 3.0,
                             'Superman Returns': 3.5,
                             'The Night Listener': 4.0},
        'Claudia Puig': {'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0,
                         'The Night Listener': 4.5,
                         'Superman Returns': 4.0,
                         'You, Me and Dupree': 2.5},
        'Mick LaSalle': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'Just My Luck': 2.0,
                         'Superman Returns': 3.0,
                         'The Night Listener': 3.0,
                         'You, Me and Dupree': 2.0},
       'Jack Matthews': {'Lady in the Water': 3.0,
                         'Snakes on a Plane': 4.0,
                         'The Night Listener': 3.0,
                         'Superman Returns': 5.0,
                         'You, Me and Dupree': 3.5},
      'Toby': {'Snakes on a Plane':4.5,
               'You, Me and Dupree':1.0,
               'Superman Returns':4.0}}

datasets1 = {'Alice':{'Love at last':5,
                     'Romance forever':5,
                     'Nonstop car chase':0,
                     'Swords vs Karate':0},
             'Bob':{'Love at last':5,
                     'Cute puppies of love':4,
                     'Nonstop car chase':0,
                     'Swords vs Karate':0},
             'Carol':{'Love at last':0,
                     'Cute puppies of love':0,
                     'Nonstop car chase':5,
                     'Swords vs Karate':5},
             'Dave':{'Love at last':0,
                     'Romance forever':0,
                     'Nonstop car chase':4,
                     },
             'Eva':{

             }
            }

### 


data = pd.DataFrame(datasets1)

totalUsersNumber = data.shape[1]  ## number of users (columns)

totalMoviesNumber = data.shape[0] ## number of movies (rows)


featuresNumber = 2 # features used:

                    # 2 is good for datasets1, 5 is good for datasets2.


## Gradient Descent with mean normalization and Low rank factorization

## algorithm detail: 

## see here (http://www.holehouse.org/mlclass/16_Recommender_Systems.html)



X = abs(np.random.randn(featuresNumber,totalMoviesNumber)) # initialize the movie feature matrix X with random values

Ymean = data.mean(axis=1)# mean score for each movie

Ytilde = data.subtract(Ymean,axis=0)
# Xtilt = X.subtract(Xmean,axis=0) 


Theta = np.random.randn(featuresNumber,totalUsersNumber)
DeltaY = (np.dot(X.T,Theta) - Ytilde).fillna(0)

alpha = 10**-1 # learning rate

regul = 0.1 # regularization factor

err = 1
iterNo =0

while iterNo<1000 and err>10**-4:
    # One gradient descent step: update X and Theta simultaneously,
    # then recompute the error on the rated entries.

    Xiter = X - alpha*(np.dot(Theta,DeltaY.T) + regul*X)
    ThetaIter = Theta - alpha*(np.dot(X,DeltaY) + regul*Theta)
    Ypred = np.dot(Xiter.T,ThetaIter) 
    DeltaY = (Ypred - Ytilde).fillna(0) ## IMPORTANT: keep the error at zero wherever there is no rating

    err = (DeltaY**2).sum().max()  # largest per-user sum of squared errors

    X = Xiter
    Theta = ThetaIter
    iterNo +=1
    
    print('iterNo:{}, error:{}'.format(iterNo,err))

dataPredict = pd.DataFrame(Ypred,index=data.index,columns=data.columns)
dataPredict = dataPredict.add(Ymean,axis=0)
print('\n===========================================')
print('original star ratings are: \n')
print('{}'.format(data))
print('\n===========================================')
print('predicted star ratings are: \n')
print('{}'.format(dataPredict))
print('===========================================')

Results:

Dataset 1:

Dataset 2:


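Notes:

  1. Dataset 1 converges quickly with only 2 genre features, but dataset 2 needs 5.
  2. Because this approach combines linear regression with gradient descent, a learning rate and a regularization factor have to be chosen, which adds to the complexity of the computation.
  3. Movie similarity can also be computed with the Pearson distance, as in a related article. I hope to implement that case next time.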
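
As a rough preview of the Pearson idea in note 3, the sketch below computes movie-to-movie correlations with pandas on a three-movie subset of `datasets2`. Treating `1 - correlation` as the distance and requiring at least 3 common raters (`min_periods=3`) are assumptions made for this illustration, not choices from the post.

```python
import numpy as np
import pandas as pd

# A subset of the post's datasets2: rows are users, columns are movies.
# Users who did not rate a movie end up as NaN in that column.
ratings = pd.DataFrame({
    'Lady in the Water': {'Lisa Rose': 2.5, 'Gene Seymour': 3.0,
                          'Michael Phillips': 2.5, 'Mick LaSalle': 3.0,
                          'Jack Matthews': 3.0},
    'Snakes on a Plane': {'Lisa Rose': 3.5, 'Gene Seymour': 3.5,
                          'Michael Phillips': 3.0, 'Claudia Puig': 3.5,
                          'Mick LaSalle': 4.0, 'Jack Matthews': 4.0,
                          'Toby': 4.5},
    'Superman Returns': {'Lisa Rose': 3.5, 'Gene Seymour': 5.0,
                         'Michael Phillips': 3.5, 'Claudia Puig': 4.0,
                         'Mick LaSalle': 3.0, 'Jack Matthews': 5.0,
                         'Toby': 4.0},
})

# Pairwise Pearson correlation between movie columns; each pair uses
# only the users who rated both movies.
similarity = ratings.corr(method='pearson', min_periods=3)
distance = 1.0 - similarity   # assumed definition of "Pearson distance"
print(similarity.round(2))
```

Unlike the factorization above, this needs no learning rate or regularization factor, but it only compares movies that share enough raters.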