How to Build a Content-Based Recommender System for The New York Times

Category: Databases · Published: 6 years ago


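We are helping The New York Times (NYT) build a content-based recommender system, which you can treat as a very simple worked example of recommender development. Given the articles a user has recently viewed, we recommend new articles for them to read. To do that, we only need the text of the article being read, and we recommend articles with similar content.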

Inspecting the Data

Below is an excerpt from the first NYT article in the dataset, after text processing.

"TOKYO — State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia's Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, [...]"

The first question is how to vectorize this text, and whether to engineer additional features such as Parts-of-Speech, N-grams, sentiment scores, or Named Entities.

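The NLP tunnel clearly runs deep, and one could spend a great deal of time experimenting with existing approaches. But good science usually starts with the simplest viable solution and iterates from there.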

In this article we implement that simplest viable solution.

Splitting the Data

We preprocess the data by selecting the fields we need from the dataset, shuffling them, and splitting them into training and test sets.

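from sklearn.utils import shuffle

# `df` is assumed to be an already-loaded pandas DataFrame of NYT articles (loading is not shown)
# move articles to an array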
articles = df.body.values

# move article section names to an array
sections = df.section_name.values

# move article web_urls to an array
web_url = df.web_url.values

# shuffle these three arrays in unison so rows stay aligned
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)

# split the shuffled articles into two arrays
n = 10

# one will have all but the last 10 articles -- think of this as your training set/corpus 
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]

# the other will have those last 10 articles -- think of this as your test set/corpus 
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

Text Vectorization

There are several text vectorizers to choose from, such as Bag-of-Words (BoW), Tf-Idf, and Word2Vec.

One reason we chose Tf-Idf is that, unlike BoW, it measures a word's importance using inverse document frequency in addition to term frequency.

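For example, a word like "Obama" may appear only a few times in an article, but it appears in far fewer articles than words like "a" or "the", which occur almost everywhere and convey little information. It should therefore receive a higher weight.

"Obama" is neither a stop word nor everyday vocabulary, which means it is closely tied to the topic of the article.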
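To make the weighting concrete, here is a minimal sketch that computes Tf-Idf by hand on a toy corpus. The three documents and the helper name `tf_idf` are made up for illustration, and the formula (relative term frequency times log(N / document frequency)) is one common textbook variant, not necessarily the exact one used in the article.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the president met the press at the white house",
    "the market fell as the bank raised rates",
    "obama said the white house will review the policy",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * log(N / df)} for one tokenized document."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }

weights = tf_idf(tokenized[2])
print(weights["the"])    # "the" occurs in every document, so its idf is log(1) = 0
print(weights["obama"])  # "obama" occurs in a single document, so its weight is positive
```

Even though "the" is the most frequent token in the third document, its weight collapses to zero because it appears in every document, while "obama" keeps a positive weight.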

Similarity Metric

There are several candidate similarity metrics, for example Jaccard and Cosine.

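Jaccard similarity compares two sets and measures how many elements they share. Because we chose Tf-Idf as the vectorizer, Jaccard is not a sensible option here, since it looks only at set membership and ignores the weights. It might be useful if we had chosen a BoW vectorization instead.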
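As an illustration of why Jaccard discards weight information, the sketch below computes it on two token sets. The helper name `jaccard` and the two sentences are hypothetical examples, not part of the original article.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical tokenized articles
doc_a = "obama spoke at the white house".split()
doc_b = "the white house issued a statement".split()

score = jaccard(doc_a, doc_b)
print(score)
```

Each token counts as either present or absent, so a highly weighted token such as "obama" contributes no more to the score than "the".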

So we use Cosine as the similarity metric.

Since Tf-Idf assigns a weight to every token in every article, we can take the dot product of the token weights of different articles.

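If tokens such as "Obama" or "White House" have high weights in article A and also in article B, their product contributes more to the similarity score than it would if the same tokens had low weights in article B.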
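The effect described above can be checked with a small sketch. The `cosine` helper and the weight dictionaries are hypothetical; real Tf-Idf vectors would come from a vectorizer, but the arithmetic is the same.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {token: weight} dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical Tf-Idf weights for article A and two versions of article B
article_a = {"obama": 0.8, "white house": 0.5, "budget": 0.1}
article_b_high = {"obama": 0.7, "white house": 0.6, "senate": 0.2}
article_b_low = {"obama": 0.1, "white house": 0.1, "senate": 0.9}

sim_high = cosine(article_a, article_b_high)
sim_low = cosine(article_a, article_b_low)
print(sim_high, sim_low)
```

Both versions of article B share the same tokens with article A, yet the version that weights "obama" and "white house" heavily scores much higher.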

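Building the Recommender

Using the similarity scores between the article a user has read and every other article in the corpus (i.e., the training data), you can now write a function that returns the top N articles and start making recommendations.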

import numpy as np

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n = 5):
    '''This function calculates similarity scores between a document and a corpus
    
       INPUT: vectorized document corpus, 2D array
              text document corpus, 1D array
              vectorized user article, sparse matrix with one row
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int
              
       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores between user article and entire corpus, sorted high to low, 1D array
              '''
    # calculate similarity between the corpus (i.e. the "train" data) and the user's article,
    # flattened to 1D so the indexing below returns 1D arrays
    similarity_scores = np.asarray(X_train_tfidf.dot(test_article.toarray().T)).ravel()
    # get similarity score indices, sorted from highest to lowest score
    sorted_indicies = np.argsort(similarity_scores)[::-1]
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indicies]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indicies[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indicies[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indicies[:n]]
    
    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

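The function works as follows:

1. Compute the similarity between the user's article and the corpus;

2. Sort the similarity scores from high to low;

3. Take the top N most similar articles;

4. Look up the section names and URLs of those top N articles;

5. Return the top N articles, section names, URLs, and scores.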
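The article does not show how the vectorized inputs are produced, so the following is only a sketch of the same steps on a toy corpus, assuming scikit-learn's `TfidfVectorizer` is the vectorizer. The four documents, the user article, and the variable names `corpus`, `user_article`, `corpus_tfidf`, and `user_tfidf` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the training corpus and the user's article (hypothetical data)
corpus = np.array([
    "Obama speaks at the White House about defense spending",
    "Stock markets rally as banks report strong earnings",
    "White House spokesman comments on Pentagon bonuses",
    "Local team wins the championship game",
])
user_article = ["Obama tells the Pentagon to treat soldiers fairly, White House says"]

# Fit on the corpus only, then reuse the same vocabulary for the user's article
vectorizer = TfidfVectorizer(stop_words="english")
corpus_tfidf = vectorizer.fit_transform(corpus)  # sparse; rows are L2-normalized by default
user_tfidf = vectorizer.transform(user_article)  # sparse, one row

# Step 1: dot products of L2-normalized rows are cosine similarities
scores = np.asarray(corpus_tfidf.dot(user_tfidf.toarray().T)).ravel()
# Steps 2-3: sort from high to low and keep the top 2
top = np.argsort(scores)[::-1][:2]

print(corpus[top])
print(scores[top])
```

Because the default `TfidfVectorizer` normalizes each row to unit length, the plain dot product here already equals the cosine similarity.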

Validating the Results

Now we can check whether the results make sense by recommending articles based on what a user is currently reading.

Next, let's compare the user's article and its section name with the recommended articles and their section names.

First, look at the similarity scores.

# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# .
# .

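Cosine similarity ranges from 0 to 1 here, so these scores are not high. How could we raise them? We could try a different vectorizer, such as Doc2Vec, or a different similarity metric. Even so, let's look at the results.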

# user's article's section name
X_test_sections[k]
# OUTPUT:
'U.S'

# corresponding section names for top n recs 
rec_sections
# OUTPUT:
'World'
'U.S'
'World'
'World'
'U.S.'

As the output shows, the recommended section names are in line with what we want.

# user's article
X_test[k]
# OUTPUT:
'LOS ANGELES — The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard. If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we're not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'

All five recommended articles are related to the article the reader is currently viewing, so the recommender behaves as expected.

A Note on Validation

Our ad-hoc validation, which compares the recommended text and section names, suggests that the recommender works as intended.

Manual validation works well enough for now, but ultimately we want a fully automated system that can be put into production and validate itself.
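One possible automated proxy, offered only as an assumption and not as the author's method, is the fraction of recommendations that share the user's article section. The helper name `section_match_rate` and the labels below are hypothetical.

```python
def section_match_rate(user_section, rec_sections):
    """Fraction of recommended articles whose section equals the user's article section."""
    if len(rec_sections) == 0:
        return 0.0
    matches = sum(1 for section in rec_sections if section == user_section)
    return matches / len(rec_sections)

# Hypothetical section labels for one user article and its five recommendations
rate = section_match_rate("U.S.", ["World", "U.S.", "World", "World", "U.S."])
print(rate)
```

Averaged over many test articles, a number like this could be tracked over time, although a strict section match is a crude signal because related articles often sit in neighboring sections.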

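How to put this recommender into production is beyond the scope of this article, which aims to show how to prototype such a recommender on a real dataset.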

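The original author is data scientist Alexander Barriga. Translated and compiled by 先荐, a recommendation platform, with some edits. Please credit the source when reposting.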
