用word2vec分析标题，预测文章转发量

栏目: 编程工具 · 发布时间: 6年前

内容简介：本文研究目标为:操作的全过程可以参见：https://nbviewer.jupyter.org/github/chambliss/Notebooks/blob/master/Word2Vec_News_Analysis.ipynb

用word2vec分析标题，预测文章转发量流行新闻网站的标题情绪分布图

词嵌入是表达词语以及文档（即词语集合）潜在信息的有效方式。利用一系列关于文章标题的数据集，包括来源、情感、主题以及转发量等影响因素，基于各篇文章的词嵌入方法，可以它们之间的关系。

本文研究目标为:

使用NLTK预处理/清理文本数据
使用word2vec创建词汇和标题嵌入，并通过t-SNE使其显示为数据集
将标题情绪与文章转发量之间的关系可视化
尝试通过嵌入及其他相关方法预测文章转发量
使用模型融合来提升文章转发量预测的准确性（此步最终未实现）

操作的全过程可以参见：

https://nbviewer.jupyter.org/github/chambliss/Notebooks/blob/master/Word2Vec_News_Analysis.ipynb

数据输入和预处理

首先，数据输入如下:

import pandas as pd

import gensim

import seaborn as sns

import matplotlib.pyplot as plt

import numpy as np

import xgboost as xgb

然后读取数据：

main_data = pd.read_csv('News_Final.csv')

main_data.head()

# Grab all the titles

article_titles = main_data['Title']

# Create a list of strings, one for each title

titles_list = [title for title in article_titles]

# Collapse the list of strings into a single long string for processing

big_title_string = ' '.join(titles_list)

from nltk.tokenize import word_tokenize

# Tokenize the string into words

tokens = word_tokenize(big_title_string)

# Remove non-alphabetic tokens, such as punctuation

words = [word.lower() for word in tokens if word.isalpha()]

# Filter out stopwords

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

words = [word for word in words if not word in stop_words]

# Print first 10 words

words[:10]

接下来，需要加载预先设定好的word2vec模型（可参照https://github.com/RaRe-Technologies/gensim-data）。由于这是一个新闻数据集，笔者使用的是谷歌新闻模型，该模型已经通过大约1000亿词汇量的测试。

# Loadword2vecmodel (trained on an enormous Google corpus)

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary = True)

# Check dimension of word vectors

model.vector_size

300

该模型将生成300维的词汇向量，生成的向量必须通过模型传递下去。向量列举如下：

economy_vec = model['economy']

economy_vec[:20] # First 20 components

用word2vec分析标题，预测文章转发量

word2vec只能依据自己的词汇库来创造向量。因此，在创建词汇向量的完整列表时，要指定“if model in model.vocab”

# Filter the list of vectors to include only those that Word2Vec has a vector for

vector_list = [model[word] for word in words if word in model.vocab]

# Create a list of the words corresponding to these vectors

words_filtered = [word for word in words if word in model.vocab]

# Zip the words together with their vector representations

word_vec_zip = zip(words_filtered, vector_list)

# Cast to a dict so we can turn it into a DataFrame

word_vec_dict = dict(word_vec_zip)

df = pd.DataFrame.from_dict(word_vec_dict, orient='index')

df.head(3)

使用t-SNE降低维数

接下来，使用t-SNE压缩这些词汇向量（即降维）。如果想了解更多关于t-SNE的相关知识，请查看：https://distill.pub/2016/misread-tsne/。

选择t-SNE的参数非常重要，因为不同的参数会产生截然不同的结果。笔者测试了0到100之间的几个值的准确性，每次产生的结果大致相同。笔者又测试了20到400之间几个学习率的数值，最终决定将学习率设为在默认值（200）。

为了节省处理时间，笔者并没有使用全部的20000个单词向量，而是只选用了其中的400个。

from sklearn.manifold import TSNE

# Initialize t-SNE

tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

# Use only 400 rows to shorten processing time

tsne_df = tsne.fit_transform(df[:400])

现在，可以准备测绘减少的词汇向量数组的二维图。在这里，笔者使用adjust_text将单词自动分开。

sns.set()

# Initialize figure

fig, ax = plt.subplots(figsize = (11.7, 8.27))

sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)

# Import adjustText, initialize list of texts

from adjustText import adjust_text

texts = []

words_to_plot = list(np.arange(0, 400, 10))

# Append words to list

for word in words_to_plot:

texts.append(plt.text(tsne_df[word, 0], tsne_df[word, 1], df.index[word], fontsize = 14))

# Plot text using adjust_text (because overlapping text is hard to read)

adjust_text(texts, force_points = 0.4, force_text = 0.4,

expand_points = (2,1), expand_text = (1,2),

arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))

plt.show()

如果你有兴趣使用adjust_text来进行测绘，可以参照：https://github.com/Phlya/adjustText。注意要使用驼峰格式adjustText进行导入，且adjustText目前与matplotlib3.0或其更高版本不兼容。

即使将矢量嵌入减少到2维，还是可以看到某些项聚集在一起。例如，我们在左/左上角导入“month”，在底部导入“corporatefinance terms”，在中间导入较多的通用非主题词（如“full”, “really”, “slew”）。

需要注意的是，如果使用不同的参数再次运行t-SNE，可能会得到相似的结果，但不一定完全相同。t-SNE不是确定性的，数据集的紧密度和它们之间的分散距离并不能说明什么。t-SNE主要是一种探索性工具，而不是相似性的决定性指标。

平均词嵌入

上文已经介绍了如何将词嵌入的方法应用于此数据集中。接下来，本文将探讨ML的应用：找到聚集的标题集，并研究它们的关系。

Doc2Vec没有预先设定好的模型，因此需要更长时间的测试过程。笔者在这里使用更简洁更有效率的方法：即平均每个文档中单词向量的嵌入。在这个例子中，文档指的就是标题。

此处需要对标题再次进行预处理步骤，这要比分隔词汇的处理更复杂。Dimitris Spathis开发了一系列功能，在本例中可以得到很好的应用。

def document_vector(word2vec_model, doc):

# remove out-of-vocabulary words

doc = [word for word in doc if word in model.vocab]

return np.mean(model[doc], axis=0)

# Our earlier preprocessing was done when we were dealing only with word vectors

# Here, we need each document to remain a document

def preprocess(text):

text = text.lower()

doc = word_tokenize(text)

doc = [word for word in doc if word not in stop_words]

doc = [word for word in doc if word.isalpha()]

return doc

# Function that will help us drop documents that have no word vectors inword2vec

def has_vector_representation(word2vec_model, doc):

"""check if at least one word of the document is in the

word2vecdictionary"""

return not all(word not inword2vec_model.vocab for word in doc)

# Filter out documents

def filter_docs(corpus, texts, condition_on_doc):

"""

Filter corpus and texts given the function condition_on_doc which takes a doc. The document doc is kept if condition_on_doc(doc) is true.

"""

number_of_docs = len(corpus)

if texts is not None:

texts = [text for (text, doc) in zip(texts, corpus)

if condition_on_doc(doc)]

corpus = [doc for doc in corpus if condition_on_doc(doc)]

print("{} docs removed".format(number_of_docs - len(corpus)))

return (corpus, texts)

根据上述结果，进行下一步处理：

# Preprocess the corpus

corpus = [preprocess(title) for title in titles_list]

# Remove docs that don't include any words in W2V's vocab

corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: has_vector_representation(model, doc))

# Filter out any empty docs

corpus, titles_list = filter_docs(corpus, titles_list, lambda doc: (len(doc) != 0))

x = []

for doc in corpus: # append the vector for each document

x.append(document_vector(model, doc))

X = np.array(x) # list to array

t-SNE操作：文件向量

现在，我们已成功创建了文档向量数组，对它们进行t-SNE测算可能会得到与上文类似的结果。

# Initialize t-SNE

tsne = TSNE(n_components = 2, init = 'random', random_state = 10, perplexity = 100)

# Again use only 400 rows to shorten processing time

tsne_df = tsne.fit_transform(X[:400])

fig, ax = plt.subplots(figsize = (14, 10))

sns.scatterplot(tsne_df[:, 0], tsne_df[:, 1], alpha = 0.5)

from adjustText import adjust_text

texts = []

titles_to_plot = list(np.arange(0, 400, 40)) # plots every 40th title in first 400 titles

# Append words to list

for title in titles_to_plot:

texts.append(plt.text(tsne_df[title, 0], tsne_df[title, 1], titles_list[title], fontsize = 14))

# Plot text using adjust_text

adjust_text(texts, force_points = 0.4, force_text = 0.4,

expand_points = (2,1), expand_text = (1,2),

arrowprops = dict(arrowstyle = "-", color = 'black', lw = 0.5))

plt.show()

根据上述操作，可以发现t-SNE将文档向量折叠成一个维度空间，文档根据其内容与某些领域的相关性而展开，如与国家、世界领导者、外交事务及技术公司等方面的相关性。

现在可以对文章的转发量进行研究。通常，人们认为标题要耸人听闻才会点击率越高，文章被转发分享的可能性也就越高。那么这种共识在数据集中是否能够得到验证呢？下一部分将进行具体介绍。

文章转发量与标题情绪分析

首先，我们需要删除所有没有来源或者无法判定转发量的文章。转发量的零位测量在数据中表示为-1.

# Drop all the rows where the article popularities are unknown (this is only about 11% of the data)

main_data = main_data.drop(main_data[(main_data.Facebook == -1) |

(main_data.GooglePlus == -1) |

(main_data.LinkedIn == -1)].index)

# Also drop all rows where we don't know the source

main_data = main_data.drop(main_data[main_data['Source'].isna()].index)

main_data.shape

用word2vec分析标题，预测文章转发量

删除一部分文章后，仍有81000篇文章可供使用。接下来，可以研究标题情绪与文章的转发量之间是否存在某种相关性。

fig, ax = plt.subplots(1, 3, figsize=(15, 10))

subplots = [a for a in ax]

platforms = ['Facebook', 'GooglePlus', 'LinkedIn']

colors = list(sns.husl_palette(10, h=.5)[1:4])

for platform, subplot, color in zip(platforms, subplots, colors):

sns.scatterplot(x = main_data[platform], y = main_data['SentimentTitle'], ax=subplot, color=color)

subplot.set_title(platform, fontsize=18)

subplot.set_xlabel('')

fig.suptitle('Plot of Popularity (Shares) by Title Sentiment', fontsize=24)

plt.show()

经过上述操作，依然无法明确两者之间的相关性，因为某些文章的转发量显著高于其他文章。可以尝试对x轴进行对数转换，并使用regplot，这样seaborn可以覆盖所有的线性回归。

# Our data has over 80,000 rows, so let's also subsample it to make the log-transformed scatterplot easier to read

subsample = main_data.sample(5000)

fig, ax = plt.subplots(1, 3, figsize=(15, 10))

subplots = [a for a in ax]

for platform, subplot, color in zip(platforms, subplots, colors):

# Regression plot, so we can gauge the linear relationship

sns.regplot(x = np.log(subsample[platform] + 1), y = subsample['SentimentTitle'],

ax=subplot,

color=color,

# Pass an alpha value to regplot's scatterplot call

scatter_kws={'alpha':0.5})

# Set a nice title, get rid of x labels

subplot.set_title(platform, fontsize=18)

subplot.set_xlabel('')

fig.suptitle('Plot of log(Popularity) by Title Sentiment', fontsize=24)

plt.show()

与我们的预期相反，在这个数据集中，吸睛的标题与文章转发量并没有直接的联系。为了更清楚文章转发量的影响因素，可以再做一次关于文章转发量的二维图测绘。

fig, ax = plt.subplots(3, 1, figsize=(15, 10))

subplots = [a for a in ax]

for platform, subplot, color in zip(platforms, subplots, colors):

sns.distplot(np.log(main_data[platform] + 1), ax=subplot, color=color, kde_kws={'shade':True})

# Set a nice title, get rid of x labels

subplot.set_title(platform, fontsize=18)

subplot.set_xlabel('')

fig.suptitle('Plot of Popularity by Platform', fontsize=24)

plt.show()

接下来，当发布者不同时，吸睛的标题所产生的影响是否与上述结论一致呢？

# Get the list of top 12 sources by number of articles

source_names = list(main_data['Source'].value_counts()[:12].index)

source_colors = list(sns.husl_palette(12, h=.5))

fig, ax = plt.subplots(4, 3, figsize=(20, 15), sharex=True, sharey=True)

ax = ax.flatten()

for ax, source, color in zip(ax, source_names, source_colors):

sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],

ax=ax, color=color, kde_kws={'shade':True})

ax.set_title(source, fontsize=14)

ax.set_xlabel('')

plt.xlim(-0.75, 0.75)

plt.show()

得出的图像大体一致，但难以分辨细节的区别。因此，尝试将这些图像全部叠加在一起。

# Overlay each density curve on the same plot for closer comparison

fig, ax = plt.subplots(figsize=(12, 8))

for source, color in zip(source_names, source_colors):

sns.distplot(main_data.loc[main_data['Source'] == source]['SentimentTitle'],

ax=ax, hist=False, label=source, color=color)

ax.set_xlabel('')

plt.xlim(-0.75, 0.75)

plt.show()

可以发现，消息来源与标题情绪的分布图非常相似，没有任何一个来源的值是过高或过低的，12个最常见的源都以0为中心，分布图的尾部值也都很接近。但是这能说明真正的关系吗？请看下面这组数据：

# Group by Source, then get descriptive statistics for title sentiment

source_info = main_data.groupby('Source')['SentimentTitle'].describe()

# Recall that `source_names` contains the top 12 sources

# We'll also sort by highest standard deviation

source_info.loc[source_names].sort_values('std', ascending=False)[['std', 'min', 'max']]

非常明显的是，与其他新闻发布者相比，华尔街日报分布图的标准差最大，范围最大。这表明，华尔街日报经常使用带有负面情绪的标题。这是一个很有趣的潜在发现。但要严格地对这个结论进行验证，需要进行假设检验。这超出了本文的研究范围。

文章转发量预测

第一个任务是重新加入带有各自标题的文档向量。在进行预处理时，笔者同时处理了语料库和titles_list，因此向量及其所代表的标题仍将匹配。同时，在main_df中，笔者删除了所有人气值为-1的文章，因此也需要删除代表这些文章标题的向量。

在如此巨大的数据集上构建模型绝非易事。因此，可以减少一部分向量的维度。基于Unix时间，笔者根据文章的发布日期设计了一个新的变量。

具体可以参见：

https://en.wikipedia.org/wiki/Unix_time

import datetime

# Convert publish date column to make it compatible with other datetime objects

main_data['PublishDate'] = pd.to_datetime(main_data['PublishDate'])

# Time since Linux Epoch

t = datetime.datetime(1970, 1, 1)

# Subtract this time from each article's publish date

main_data['TimeSinceEpoch'] = main_data['PublishDate'] - t

# Create another column for just the days from the timedelta objects

main_data['DaysSinceEpoch'] = main_data['TimeSinceEpoch'].astype('timedelta64[D]')

main_data['TimeSinceEpoch'].describe()

根据上述操作可见，这些文章发布的时间形成了一个250天的数据集合。

from sklearn.decomposition import PCA

pca = PCA(n_components=15, random_state=10)

# as a reminder, x is the array with our 300-dimensional vectors

reduced_vecs = pca.fit_transform(x)

df_w_vectors = pd.DataFrame(reduced_vecs)

df_w_vectors['Title'] = titles_list

# Use pd.concat to match original titles with their vectors

main_w_vectors = pd.concat((df_w_vectors, main_data), axis=1)

# Get rid of vectors that couldn't be matched with the main_df

main_w_vectors.dropna(axis=0, inplace=True)

现在，需要删除非数字和非虚拟列，以便更好地建模。笔者还将比例缩放功能应用于DaysSinceEpoch，因为与其他变量相比，发布时间的数据范围更广。

# Drop all non-numeric, non-dummy columns, for feeding into the models

cols_to_drop = ['IDLink', 'Title', 'TimeSinceEpoch', 'Headline', 'PublishDate', 'Source']

data_only_df = pd.get_dummies(main_w_vectors, columns = ['Topic']).drop(columns=cols_to_drop)

# Standardize DaysSinceEpoch since the raw numbers are larger in magnitude

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Reshape so we can feed the column to the scaler

standardized_days = np.array(data_only_df['DaysSinceEpoch']).reshape(-1, 1)

data_only_df['StandardizedDays'] = scaler.fit_transform(standardized_days)

# Drop the raw column; we don't need it anymore

data_only_df.drop(columns=['DaysSinceEpoch'], inplace=True)

# Look at the new range

data_only_df['StandardizedDays'].describe()

# Get Facebook data only

fb_data_only_df = data_only_df.drop(columns=['GooglePlus', 'LinkedIn'])

# Separate the features and the response

X = fb_data_only_df.drop('Facebook', axis=1)

y = fb_data_only_df['Facebook']

# 80% of data goes to training

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)

现在对数据进行非优化的XGBoost处理：

from sklearn.metrics import mean_squared_error

# Instantiate an XGBRegressor

xgr = xgb.XGBRegressor(random_state=2)

# Fit the classifier to the training set

xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)

mean_squared_error(y_test, y_pred)

353495.42

可以说，得出的结果并不令人满意。那么，是否可以通过超参数调整来改善结果呢？笔者根据以下文章重新设计了一个超参数调整布局，具体参见：

https://www.kaggle.com/jayatou/xgbregressor-with-gridsearchcv

from sklearn.model_selection import GridSearchCV

# Various hyper-parameters to tune

xgb1 = xgb.XGBRegressor()

parameters = {'nthread':[4],

'objective':['reg:linear'],

'learning_rate': [.03, 0.05, .07],

'max_depth': [5, 6, 7],

'min_child_weight': [4],

'silent': [1],

'subsample': [0.7],

'colsample_bytree': [0.7],

'n_estimators': [250]}

xgb_grid = GridSearchCV(xgb1,

parameters,

cv = 2,

n_jobs = 5,

verbose=True)

xgb_grid.fit(X_train, y_train)

根据xgb_grid，最佳参数如下：

{'colsample_bytree': 0.7, 'learning_rate': 0.03,'max_depth': 5, 'min_child_weight': 4, 'n_estimators': 250, 'nthread': 4,'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}

使用新参数再试一次：

params = {'colsample_bytree': 0.7, 'learning_rate':0.03, 'max_depth': 5, 'min_child_weight': 4,

'n_estimators': 250, 'nthread':4, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.7}

# Try again with new params

xgr = xgb.XGBRegressor(random_state=2, **params)

# Fit the classifier to the training set

xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)

mean_squared_error(y_test, y_pred)

351220.13

可以看出，数据量在35000左右的时候，结果更好。因此，可以推断，当前状态下的数据量不足导致模型产生的结果不好。那么，是否可以对数据特征进行改善呢？笔者利用分类方法将文章分为两组：Duds（0或1次转载）与Not Duds。

如果能够赋予回归量一个新的特征，将更有利于预测转发量更高的文章，从而使残值降低，并减少均方误差。

进行Detour操作: 检测Dud文章

从之前制作的对数转换图中可以看出，总共有两个文章集合，一个集合在0的位置，而另一个从1开始。笔者用一些分类方法来识别文章是否属于“Duds”（即转载次数只有0或1次），并将这种分类作为最终回归量的特征，以此来预测文章的转发量的可能性。这种方法被称为模型融合。

# Define a quick function that will return 1 (true) if the article has 0-1 share(s)

def dud_finder(popularity):

if popularity <= 1:

return 1

else:

return 0

# Create target column using the function

fb_data_only_df['is_dud'] = fb_data_only_df['Facebook'].apply(dud_finder)

fb_data_only_df[['Facebook', 'is_dud']].head()

# 28% of articles can be classified as "duds"

fb_data_only_df['is_dud'].sum() / len(fb_data_only_df)

0.28

现在，dud分类已经完成。接下来，要使用随机森林算法（RandomForest）、优化的XGBC分类法和K-Nearest Neighbors分类法。这里对于优化XGB的过程不做赘述。

from sklearn.ensemble import RandomForestClassifier

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

X = fb_data_only_df.drop(['is_dud', 'Facebook'],axis=1)

y = fb_data_only_df['is_dud']

# 80% of data goes to training

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state = 10)

# Best params, produced by HP tuning

params = {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 5,'min_child_weight': 4,

'n_estimators': 200, 'nthread':4, 'silent': 1, 'subsample': 0.7}

# Try xgc again with new params

xgc = xgb.XGBClassifier(random_state=10, **params)

rfc = RandomForestClassifier(n_estimators=100, random_state=10)

knn = KNeighborsClassifier()

preds = {}

for model_name, model in zip(['XGClassifier', 'RandomForestClassifier','KNearestNeighbors'], [xgc, rfc, knn]):

model.fit(X_train, y_train)

preds[model_name] =model.predict(X_test)

测试模型，获取分类报告:

from sklearn.metrics import classification_report,roc_curve, roc_auc_score

for k in preds:

print("{}performance:".format(k))

print()

print(classification_report(y_test,preds[k]), sep='\n')

可以看出，f1分数最高的是XGC分类法，其次是RF，最后是KNN，但KNN在数据撤回方面做得最好（能够成功识别duds）。这足以说明模型融合的意义，因为像XGBoost这样的优秀模型也会在识别任务上表现不佳，而KNN的预测缺乏一定的多样性。

# Plot ROC curves

for model in [xgc, rfc, knn]:

fpr, tpr, thresholds =roc_curve(y_test, model.predict_proba(X_test)[:,1])

plt.plot([0, 1], [0, 1], 'k--')

plt.plot(fpr, tpr)

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curves')

plt.show()

文章人气预测：第2轮

现在我们可以从三种分类方法中得出预测概率的平均值，并将其作为回归量的特征。

averaged_probs = (xgc.predict_proba(X)[:, 1] +

knn.predict_proba(X)[:,1] +

rfc.predict_proba(X)[:,1]) / 3

X['prob_dud'] = averaged_probs

y = fb_data_only_df['Facebook']

接下来将进行再一轮对向量特征的调整，此处不作赘述。结果如下所示：

xgb.XGBRegressor(random_state=2,**params)

# Fit the classifier to the training set

xgr.fit(X_train, y_train)

y_pred = xgr.predict(X_test)

mean_squared_error(y_test, y_pred)

314551.59

所得结果与模型融合之前所得的结果基本相同。这就表明，MSE在测量误差时，异常值所占的比重较大。其实，我们还可以计算平均绝对误差（MAE），以此评估显著异常值产生的影响。在数学术语中，MAE可以用来计算残差绝对值的L1范数，而不是MSE应用的L2范数。我们可以将MAE与MSE的平方根进行比较，也称为均方根误差（RMSE）。

mean_absolute_error(y_test,y_pred), np.sqrt(mean_squared_error(y_test, y_pred))

（180.528167661507，560.8489939081992）

平均绝对误差仅为RMSE的1/3左右。模型的准确性比当初设想得要好很多。最后一步，根据XGRegressor了解每个特征的重要性：

zip(list(X.columns),xgr.feature_importances_):

print('Model weight for feature {}:{}'.format(feature, importance))

结果发现，prob_dud是最重要的特征，自定义的StandardsDays功能是第二重要的特征。（特征值0到14对应于缩小的标题嵌入向量。）

尽管通过这一轮模型融合没有改善整体的结果，但我们成功发现了数据变化的重要来源。

如果继续研究下去，笔者会考虑使用外部数据来增加数据量，包括通过分箱或散列将Source作为变量，在原始的300维向量上运行模型，使用每篇文章在不同时间的人气的“分片”数据来进行预测。

本文所使用的原始数据可以参见：

https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms

以上就是本文的全部内容，希望本文的内容对大家的学习或者工作能带来一定的帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

机器学习

周志华 / 清华大学出版社 / 2016-1-1 / 88.00元

机器学习是计算机科学与人工智能的重要分支领域. 本书作为该领域的入门教材，在内容上尽可能涵盖机器学习基础知识的各方面。为了使尽可能多的读者通过本书对机器学习有所了解, 作者试图尽可能少地使用数学知识. 然而, 少量的概率、统计、代数、优化、逻辑知识似乎不可避免. 因此, 本书更适合大学三年级以上的理工科本科生和研究生, 以及具有类似背景的对机器学习感兴趣的人士. 为方便读者, 本书附录给出了一......一起来看看《机器学习》这本书的介绍吧!

码农工具

html转js在线工具

HEX CMYK 转换工具

HEX CMYK 互转工具