Building and deploying an end-to-end fake news classifier


1. Exploratory Data Analysis

Create a file named eda.ipynb or eda.py in your project directory.

We will first import all the required packages.

#Importing all the libraries
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
from wordcloud import WordCloud
import os

Now we will first read fake news dataset using pd.read_csv() and then we will explore the dataset.
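A minimal sketch of that step, assuming the file is named fake.csv and sits in the project directory:

#Reading the fake news dataset (file name assumed)
fake = pd.read_csv("fake.csv")
print(fake.shape)
fake.head()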

In cell 4 of the above notebook, we count the number of fake news samples in each subject. We also plot the distribution using the seaborn count plot sns.countplot() .
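That cell looks roughly like the following, assuming the column is named subject:

#Counting fake news samples per subject and plotting the distribution
print(fake.subject.value_counts())
plt.figure(figsize=(10, 5))
sns.countplot(x="subject", data=fake)
plt.xticks(rotation=45)
plt.show()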

We will now plot a word cloud by first concatenating all the news into a single string, then generating tokens and removing stopwords. A word cloud is a very good way to visualize text data.
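A sketch of that step; the stopword set here comes from the wordcloud package itself, which the original notebook may or may not use:

from wordcloud import STOPWORDS
#Concatenating all fake news into a single string
all_text = " ".join(fake.text.tolist())
#Generating the word cloud with stopwords removed
wc = WordCloud(width=800, height=400, stopwords=STOPWORDS).generate(all_text)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()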

As you can see in the next cell, we now read true.csv as the real news dataset and perform the same steps as we did on fake.csv. One difference you'll notice in the real news dataset is that the text column contains a publication name, like WASHINGTON (Reuters), separated by a hyphen (-).

It seems that the real news is credible as it comes from a publication house, so we will separate the publication from the news part to make the dataset uniform in the preprocessing part of this tutorial. For now, we’ll just explore the dataset.

If you are following along, you can see that the news subject column has a non-uniform distribution across the real and fake news datasets, so we will drop this column later. That concludes our EDA.

Now we can get our hands dirty with something you have been waiting for. I know this part can feel frustrating, but EDA and preprocessing are among the most important steps in any data science lifecycle.

2. Preprocessing and Model Training

Photo by Carlos Muza on Unsplash

In this part we will perform some preprocessing steps on our data and train our model using insights obtained from the EDA we did previously.

Preprocessing

To follow along with the code in this part, open the train.ipynb file. So without further ado, let's get started.

As usual, we import all of the packages and read the data. We will first remove Reuters from the real data's text column. Since there are some rows in which Reuters is absent, we will first get those indices.

Removing Reuters or Twitter Tweet information from the text

  • The text can be split only once at " — ", which is always present after the source of publication is mentioned; this gives us a publication part and a text part
  • If we do not get a text part, it means publication details weren't given for that record
  • The Twitter tweets always have the same source, a long text of at most 259 characters
#First creating a list of indices that do not have a publication part
unknown_publishers = []
for index, row in enumerate(real.text.values):
    try:
        record = row.split(" -", maxsplit=1)
        #if no text part is present, the following will raise an error
        record[1]
        #if the publication part is 260 characters or longer, the following assert fails,
        #ensuring no text that merely contains "-" is counted
        assert(len(record[0]) < 260)
    except:
        unknown_publishers.append(index)

To summarize in one line: the above code collects the indices of rows in the real dataset's text column where the publisher is absent.

Now we will separate the Reuters from the text column.

# separating publishers from the news text
publisher = []
tmp_text = []
for index, row in enumerate(real.text.values):
    if index in unknown_publishers:
        #add text to tmp_text and "Unknown" to publisher
        tmp_text.append(row)
        publisher.append("Unknown")
        continue
    record = row.split(" -", maxsplit=1)
    publisher.append(record[0])
    tmp_text.append(record[1])

In the above code we iterate over the text column and check whether the index belongs to unknown_publishers; if it does, we append the text as it is to tmp_text and "Unknown" to the publisher list. Otherwise, we split the text into the publisher and the news text and append them to their respective lists.

#Replace existing text column with new text
#add separate column for publication info
real["publisher"] = publisher
real["text"] = tmp_text

The above code is pretty self-explanatory: we add a new publisher column and replace the text column with the news text, which no longer contains Reuters.

We will now check if there are any missing values in the text column of both the real and fake news datasets and drop those rows.
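As a sketch of that check (the fake set's missing text is handled by the title merge below, so here only the real set's empty rows are dropped):

#Finding rows whose text is empty (the original notebook may do this differently)
empty_real = [i for i, t in enumerate(real.text.tolist()) if str(t).strip() == ""]
empty_fake = [i for i, t in enumerate(fake.text.tolist()) if str(t).strip() == ""]
print(len(empty_real), len(empty_fake))
#Dropping the empty-text rows from the real dataset
real = real.drop(empty_real, axis=0).reset_index(drop=True)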

If we check through the fake news dataset we will see that there are many rows with missing text values where the whole news is present in the title column, so we will merge the title and text columns.

real['text'] = real['text'] + " " + real['title']
fake['text'] = fake['text'] + " " + fake['title']

Next we will add class to our dataset, drop unnecessary columns and merge our data into one.

# Adding class info
real['class'] = 1
fake['class'] = 0
# Subject is different for real and fake, thus dropping it
# Also dropping date, title and publisher
real.drop(["subject", "date", "title", "publisher"], axis=1, inplace=True)
fake.drop(["subject", "date", "title"], axis=1, inplace=True)
#Combining both into a new dataframe
data = real.append(fake, ignore_index=True)

Removing stopwords, punctuation, and single-character words (a very common and basic task in any NLP project).
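A hedged sketch of that cleanup (the exact cleaning in the original notebook may differ slightly); the cleaned corpus X below is a list of token lists that is fed to Word2Vec in the next section:

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    #lowercase, keep only letters, drop stopwords and single-character tokens
    text = re.sub(r"[^a-z\s]", " ", str(text).lower())
    return [w for w in text.split() if w not in stop_words and len(w) > 1]

#X is a list of token lists, y holds the class labels
X = [clean_text(t) for t in data.text.values]
y = data["class"].values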

Model Training

Vectorization: Word2Vec

Photo by Nick Morrison on Unsplash

Word2Vec is one of the most popular techniques to learn word embeddings using shallow neural networks. It was developed by Tomas Mikolov in 2013 at Google. Word embedding is the most popular representation of document vocabulary. It is capable of capturing the context of a word in a document, semantic and syntactic similarity, relation with other words, etc.


Let’s create our Word2Vec model.

#install gensim if you haven't already
#!pip install gensim
import gensim
#Dimension of vectors we are generating
EMBEDDING_DIM = 100
#Creating Word Vectors by Word2Vec Method
w2v_model = gensim.models.Word2Vec(sentences=X, size=EMBEDDING_DIM, window=5, min_count=1)
#vocab size
len(w2v_model.wv.vocab)
#We have now represented each of 122248 words by a 100dim vector.
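(Note: with gensim 4.0 or later, the size argument is named vector_size and the vocabulary is accessed through w2v_model.wv.key_to_index rather than w2v_model.wv.vocab.)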

These Vectors will be passed to LSTM/GRU instead of words. 1D-CNN can further be used to extract features from the vectors.

Keras has an implementation called the "Embedding Layer" which can create word embeddings (vectors). Since we already did that with gensim's word2vec, we will load these vectors into the embedding layer and make the layer non-trainable.

We cannot pass string words to the embedding layer, so we need some way to represent each word by a number.

A Tokenizer can represent each word by a number.

from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenizing text -> representing each word by a number
# Mapping of original word to number is preserved in the word_index property of the tokenizer
# Tokenizer applies basic processing like lowercasing by default (it can be disabled with lower=False)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)

We create a matrix that maps word indices to vectors and use it as the weights of the embedding layer. The embedding layer accepts the numerical token of a word and outputs the corresponding vector to the inner layer. For unknown words, which are tokenized to 0, it sends a vector of zeros to the next layer. The input length of the embedding layer is the length of each news item (700 now, due to padding and truncating).
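A sketch of that step under the assumptions stated above (maxlen of 700; the helper name get_weight_matrix is illustrative, not from the original notebook):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

#Pad/truncate every news item to a fixed length of 700 tokens
maxlen = 700
X = pad_sequences(X, maxlen=maxlen)

#Index 0 is reserved for padding/unknown words, hence the +1
vocab_size = len(tokenizer.word_index) + 1

def get_weight_matrix(w2v, word_index):
    #Rows stay all-zero for words that have no learned vector
    weight_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
    for word, i in word_index.items():
        if word in w2v.wv:
            weight_matrix[i] = w2v.wv[word]
    return weight_matrix

embedding_vectors = get_weight_matrix(w2v_model, tokenizer.word_index)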

Now we will create a sequential Neural Network model and add the weights generated from w2v in the embedding layer and also add an LSTM layer.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

#Defining the neural network
model = Sequential()
#Non-trainable embedding layer
model.add(Embedding(vocab_size, output_dim=EMBEDDING_DIM, weights=[embedding_vectors], input_length=maxlen, trainable=False))
#LSTM layer
model.add(LSTM(units=128))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

Let's now split the dataset into a train set and a test set using sklearn's train_test_split method.
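For example (the test size used in the article isn't shown, so the split ratio here is illustrative):

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)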

Let's train the model using model.fit(X_train, y_train, validation_split=0.3, epochs=6) . It will take some time; on my machine it took around 40 minutes, so sit back, have some coffee, and relax.

After training is done, we will test the model on the test dataset and generate a report using the classification_report() method.
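A sketch of that evaluation; the 0.5 threshold converts the sigmoid outputs into class labels:

from sklearn.metrics import classification_report
#Thresholding sigmoid probabilities at 0.5 to get 0/1 predictions
y_pred = (model.predict(X_test) >= 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred))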

Wow, we got 99% accuracy with good precision and recall, so our model looks good. Now let's save it to disk so we can use it in our web application.
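One way to do that (a sketch; the file names model.h5 and tokenizer.pkl are assumptions, so match whatever app.py expects):

import pickle
#Save the trained Keras model and the fitted tokenizer for the web app
model.save("model.h5")
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)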

3. Building and deploying a web app

I am not going into much detail in this part; I'd recommend you go through my code, as it is very easy to understand. If you've followed along till now, you should have the same directory structure; if not, just change the path variables in the app.py file.

Now upload the whole directory into a GitHub repository.

We will host our web app on Heroku. So if you haven’t already, create a free account on Heroku and then:

  1. Click on create new app
  2. Then select a name
  3. Connect GitHub and select the repository you want to deploy from
  4. Click on deploy.

And BOOM it’s done, your fake news classifier is now live.

Conclusion …

If you’ve made it till the end, congratulations, now you can build and deploy a complex machine learning application.

I know it was a lot to grasp, but kudos to you for making it this far.

Note: The app works on most news; just remember to paste the whole paragraph of the news, and preferably US news, because the dataset was constrained to US news.

In case we haven’t met already, I am Eish Kumar you can follow me on Linkedin: https://www.linkedin.com/in/eish-kumar/ .

Follow me for more such articles.

