over 2 years ago

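Hong-e-learning is an online data science teaching site. The company has collected, for every member, the courses they have registered for or are currently taking. Using this information, we want to recommend courses each user is likely to be interested in, so that users are more satisfied.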

User-based

Suppose we have the following course data for our users:

users_interests = [
["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
["R", "Python", "statistics", "regression", "probability"],
["machine learning", "regression", "decision trees", "libsvm"],
["Python", "R", "Java", "C++", "Haskell", "programming languages"],
["statistics", "probability", "mathematics", "theory"],
["machine learning", "scikit-learn", "Mahout", "neural networks"],
["neural networks", "deep learning", "Big Data", "artificial intelligence"],
["Hadoop", "Java", "MapReduce", "Big Data"],
["statistics", "R", "statsmodels"],
["C++", "deep learning", "artificial intelligence", "probability"],
["pandas", "R", "Python"],
["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
["libsvm", "regression", "support vector machines"]
]

The idea is to weight each other user's courses by how similar that user is to the target user.

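For example, suppose the course-selection similarity between 李小仁 and each of 柳德華, 飯冰冰, and 諸葛涼 is 0.5, 0.7, and 0.2 respectively (this example uses cosine similarity; Pearson correlation or a measure based on Manhattan distance could be used instead). 李小仁's recommendation score is then computed as 柳德華's courses * 0.5 + 飯冰冰's courses * 0.7 + 諸葛涼's courses * 0.2. This gives a score for every course from 李小仁's point of view. Courses 李小仁 has already taken are then filtered out, and the rest are recommended in descending order of score.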
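To make the arithmetic concrete, here is a minimal sketch of that weighted sum. The three similarity values come from the example; the course lists and the helper name `weighted_recommendations` are made up for illustration and are not part of the listings in this post.

```python
from collections import defaultdict

# Similarity of each classmate to 李小仁 (the three values from the example).
similarity_to_target = {"柳德華": 0.5, "飯冰冰": 0.7, "諸葛涼": 0.2}

# Hypothetical course lists, invented only to make the arithmetic visible.
courses_taken = {
    "李小仁": ["Python"],
    "柳德華": ["Python", "R"],
    "飯冰冰": ["R", "Hadoop"],
    "諸葛涼": ["Hadoop", "Spark"],
}

def weighted_recommendations(target):
    # Each classmate adds their similarity to every course they have taken.
    scores = defaultdict(float)
    for classmate, similarity in similarity_to_target.items():
        for course in courses_taken[classmate]:
            scores[course] += similarity
    # Filter out courses the target already has, then sort by score.
    already_taken = set(courses_taken[target])
    candidates = [(c, s) for c, s in scores.items() if c not in already_taken]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

ranking = weighted_recommendations("李小仁")
print(ranking)
```

With these made-up lists, R collects 0.5 + 0.7, Hadoop collects 0.7 + 0.2, and Spark collects 0.2, while Python is dropped because 李小仁 already has it.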

The code is as follows:

recommendation-user.py
## recommendation  user based

from __future__ import division
from collections import Counter,defaultdict
import numpy as np
import math

# most popular topic

popular_topics = Counter([interest_topic for user in users_interests
                        for interest_topic in user]).most_common()

def most_popular_interests(user_interests,max_result=5):
    suggestions = [interest for interest,freq in popular_topics if \
                     interest not in user_interests
                     ]
    return suggestions[:max_result]

## try this with user no3


most_popular_interests(users_interests[2], max_result=5)

###### user-based collaborative filtering


# cosine similarity


def cosine_similarity(v,w):
    # v and w are vectors; return the cosine of the angle between them.

    # The value lies between -1 and 1 in general, and between 0 and 1 for the 0/1 vectors used here.

    return np.dot(v,w)/(math.sqrt(np.dot(v,v) * np.dot(w,w)))

topics = sorted(Counter([interest_topic for user in users_interests
                        for interest_topic in user]).keys())

# construct users interests' vector based on all topic


def user_interest_vector(user_interests):
    vectors = []
    for topic in topics:
        if topic not in user_interests:
            vectors.append(0)
        else:
            vectors.append(1)
    return vectors

users_interests_matrix = [user_interest_vector(user_interests) for 
                            user_interests in users_interests]  
                            # same as list(map(user_interest_vector, users_interests))


# for each user construct cosine relation to every others' interests


def users_similarity(users_interests_matrix):
    users_similarity_matrix = []
    for i,user in enumerate(users_interests_matrix):
        similarity_vector = []
        for j in range(len(users_interests_matrix)):
            similarity_val=cosine_similarity(user, users_interests_matrix[j])
            # print similarity_val

            similarity_vector.append(similarity_val)
        users_similarity_matrix.append(similarity_vector)

    return users_similarity_matrix
        
sim_matrix = users_similarity(users_interests_matrix)

def most_similiar_to(userId):
    pairs=[(otherId,e) for otherId,e in enumerate(sim_matrix[userId])
     if (userId!=otherId and e!=1 and e!=0)]
    # sort by the similarity value; indexing the pair works in Python 2 and 3
    return sorted(pairs,key=lambda pair: pair[1],
        reverse=True)


def user_based_suggestion(userId,include_current_interest=False):
    suggestions = defaultdict(float)
    for otherId,similarity in most_similiar_to(userId):
        for j in users_interests[otherId]:
            suggestions[j] += similarity
    
    suggestions = sorted(suggestions.items(),
        key=lambda pair: pair[1],reverse=True)
    # drop the interests the user already has unless asked to keep them

    if include_current_interest:
        return suggestions
    else:
        return [(suggestion,weight) for suggestion,weight in 
                suggestions if suggestion not in users_interests[userId]
                ]
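As a quick sanity check on the similarity values, note that for 0/1 interest vectors the cosine reduces to a count of shared interests. The sketch below uses a hypothetical helper, `cosine_from_sets`, which is not part of the listing above, and three rows copied from `users_interests`.

```python
import math

def cosine_from_sets(a, b):
    # For 0/1 vectors the dot product is the number of shared interests,
    # and each vector's squared length is its number of interests.
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a) * len(b))

# Three rows copied from users_interests (indices 0, 9 and 2).
user_0 = ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"]
user_9 = ["Hadoop", "Java", "MapReduce", "Big Data"]
user_2 = ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"]

sim_0_9 = cosine_from_sets(user_0, user_9)  # 3 shared interests, list sizes 7 and 4
sim_0_2 = cosine_from_sets(user_0, user_2)  # no shared interests
print(sim_0_9, sim_0_2)
```

Users 0 and 9 share Hadoop, Java, and Big Data, so their similarity is 3 / sqrt(7 * 4). Users 0 and 2 share nothing, so their similarity is 0, and `most_similiar_to` would skip that pair.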

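This approach (computing a user-based similarity matrix) works well when the dataset is small and the information collected about each customer is fairly complete.
But as the catalog of items (courses) keeps growing:

  1. Most users' item entries will be empty.

    • As a result, the similarity between users becomes unreliable, because any two users share very few courses.
  2. The similarity matrix between users can only be computed once a user comes online.

    • This clearly cannot deliver suitable course recommendations online in real time.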
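The sparsity problem in point 1 can be illustrated with a small brute-force experiment. This is only a sketch under a strong assumption that is not in the original post: every user picks exactly `k` courses uniformly at random from the catalog. The function name `average_cosine` is made up for illustration.

```python
from itertools import combinations

def average_cosine(n_courses, k):
    # Brute-force average cosine similarity over every ordered pair of
    # k-course selections from a catalog of n_courses (a selection paired
    # with itself is included).
    selections = [set(c) for c in combinations(range(n_courses), k)]
    total = 0.0
    for a in selections:
        for b in selections:
            # Two 0/1 vectors with k ones each: cosine = shared / k.
            total += len(a & b) / float(k)
    return total / (len(selections) ** 2)

averages = {n: average_cosine(n, 3) for n in (4, 8, 12)}
print(averages)
```

Under this assumption the average works out to `k / n_courses`, so with `k` fixed the typical similarity between two users shrinks as the catalog grows.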

In that case, an item-based recommendation method is needed.

Item-based


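We can transpose the users' interest data to get, for each course, the students who are interested in it or are currently taking it. From this item-based data we compute how similar the different items are to one another. Then, for the courses 李小仁 is currently taking, we add up the similarity weights of the related items.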

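  • For example: suppose the similarity of Hadoop to Big Data, Python, and HBase is 0.7, 0.2, and 0.8 respectively, and 李小仁 is taking the Hadoop course. Big Data, Python, and HBase then receive weight scores of 0.7, 0.2, and 0.8. Summing in the same way over every course 李小仁 has taken or is currently taking gives the recommended courses.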
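Here is a minimal sketch of that summation. The Hadoop row uses the numbers from the example; the second course, "Java", its similarity row, and the helper name `item_based_scores` are hypothetical and only there to show how scores accumulate across several courses.

```python
from collections import defaultdict

# Similarity of each course 李小仁 takes to other courses.
item_similarity = {
    "Hadoop": {"Big Data": 0.7, "Python": 0.2, "HBase": 0.8},  # from the example
    "Java": {"Big Data": 0.5, "Python": 0.25, "Hadoop": 0.5},  # hypothetical row
}
courses_taken = ["Hadoop", "Java"]  # "Java" is a made-up second course

def item_based_scores(taken):
    # Every course the user takes adds its similarity to each related course.
    scores = defaultdict(float)
    for course in taken:
        for other, weight in item_similarity[course].items():
            scores[other] += weight
    # Drop courses the user already takes, then sort by accumulated score.
    candidates = [(c, s) for c, s in scores.items() if c not in taken]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)

ranking = item_based_scores(courses_taken)
print(ranking)
```

Big Data accumulates 0.7 + 0.5 and ends up first, HBase keeps its 0.8, Python gets 0.2 + 0.25, and Hadoop is removed because 李小仁 is already taking it.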

The implementation is as follows:

recommendation-item.py
## recommendation based on item collaborative filtering

from __future__ import division
from collections import defaultdict,Counter
import numpy as np
import math

topics = sorted(Counter([interest_topic for user in users_interests
                        for interest_topic in user]).keys())

def user_interest_vector(user_interests):
    vectors = []
    for topic in topics:
        if topic not in user_interests:
            vectors.append(0)
        else:
            vectors.append(1)
    return vectors

def cosine_similarity(v,w):
    # v and w are vectors; return the cosine of the angle between them.

    # The value lies between -1 and 1 in general, and between 0 and 1 for the 0/1 vectors used here.

    return np.dot(v,w)/(math.sqrt(np.dot(v,v) * np.dot(w,w)))

users_interests_matrix = [user_interest_vector(user_interests) for 
                            user_interests in users_interests]  
                            # same as list(map(user_interest_vector, users_interests))


item_array = np.array(users_interests_matrix).T

item_similarity_matrix = [[cosine_similarity(item_i, item_j)
                        for item_j in item_array] for item_i in
                        item_array
                        ]

def most_similiar_to(itemId):    
    pairs=[(topics[otherId],e) for otherId,e in enumerate(item_similarity_matrix[itemId])
     if (itemId!=otherId and e!=1 and e!=0)]

    # sort by the similarity value; indexing the pair works in Python 2 and 3
    return sorted(pairs,key=lambda pair: pair[1],
        reverse=True)


def item_based_suggestion(userId,include_current_interest=False):
    suggestions = defaultdict(float)
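    # walk the user's 0/1 vector so that itemId is an index into topics,
    # not a position in the user's own interest list
    for itemId,is_interested in enumerate(users_interests_matrix[userId]):
        if not is_interested:
            continue
        for interest,weight in most_similiar_to(itemId):
            suggestions[interest] += weight
    suggestions = sorted(suggestions.items(), key=lambda pair: pair[1],
                         reverse=True)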
    if include_current_interest:
        return suggestions

    return [(suggestion,weight) for 
    suggestion,weight in suggestions if
     suggestion not in users_interests[userId]]
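The transposition step can be checked by hand on a small subset. The sketch below is a pure-Python illustration, not the NumPy code from the listing: it uses four rows copied from `users_interests`, and the names `sample_users`, `users_by_course`, and `course_similarity` are made up for this example.

```python
import math
from collections import defaultdict

# Four rows copied from users_interests (indices 0, 1, 9 and 13).
sample_users = {
    0: ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    1: ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    9: ["Hadoop", "Java", "MapReduce", "Big Data"],
    13: ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
}

# "Transpose": for each course, the set of users who list it.
users_by_course = defaultdict(set)
for user_id, interests in sample_users.items():
    for course in interests:
        users_by_course[course].add(user_id)
users_by_course = dict(users_by_course)  # unknown courses now raise KeyError

def course_similarity(course_a, course_b):
    # Cosine similarity between two courses' 0/1 user vectors, in set form.
    a, b = users_by_course[course_a], users_by_course[course_b]
    return len(a & b) / math.sqrt(len(a) * len(b))

print(sorted(users_by_course["HBase"]))
print(course_similarity("Hadoop", "HBase"))
```

In this subset Hadoop is listed by users 0 and 9 and HBase by users 0, 1, and 13, so they share one user and their similarity is 1 / sqrt(2 * 3). Hadoop and Postgres share no users, so their similarity is 0.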

