The Right Way to Use Deep Learning for Tabular Data | Entity Embedding

Category: IT Technology · Published: 6 years ago


Hence, I decided to do a project using tabular data to demonstrate the use of entity embeddings. The dataset I use is the IEEE-CIS Fraud Detection data from Kaggle, which you can find here.

Here is the step-by-step code (including Google Colab-specific code, as I worked on Colab).

First, to check which GPU has been allocated to you in Colab, you can run the following code.

!nvidia-smi

Next, mount Google Drive.

from google.colab import drive
drive.mount('/content/drive')

Download the dataset from Kaggle. You will need your Kaggle API token for this. If you need any help downloading a dataset from Kaggle, this might help.

!mkdir /root/.kaggle
!echo '{"username":"USERNAME","key":"KEY"}' > /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c ieee-fraud-detection
# unzip all files
!unzip train_transaction.csv.zip
!unzip test_transaction.csv.zip
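
The token step can also be done from Python, which avoids shell quoting issues around the JSON string. This is a minimal sketch, assuming the Kaggle CLI reads its credentials from `~/.kaggle/kaggle.json`; the helper name `write_kaggle_token` is hypothetical, and `USERNAME` and `KEY` remain placeholders.

```python
import json
import os
import stat
from pathlib import Path


def write_kaggle_token(username, key, directory):
    """Write a kaggle.json credentials file readable only by its owner."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    token_path = directory / "kaggle.json"
    token_path.write_text(json.dumps({"username": username, "key": key}))
    # equivalent of `chmod 600`: owner read/write only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# On Colab the post writes to /root/.kaggle, so a call might look like:
# write_kaggle_token("USERNAME", "KEY", "/root/.kaggle")
```

Restricting the file to mode 600 matters because the file holds an API key in plain text.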

Then, read the CSV files into pandas DataFrames.

import pandas as pd
train = pd.read_csv("train_transaction.csv")
test = pd.read_csv("test_transaction.csv")

As this is a fraud detection dataset, an imbalanced dataset shouldn’t be surprising.

train["isFraud"].mean() # 0.03499000914417313

As data exploration and feature engineering are not the purpose of this post, I will use a minimal set of features to predict the fraud label. To make sure you can replicate my code, here are my processing steps.

import numpy as np
from sklearn.preprocessing import StandardScaler
# generate time of day
train["Time of Day"] = np.floor(train["TransactionDT"]/3600/183)
test["Time of Day"] = np.floor(test["TransactionDT"]/3600/183)
# drop columns
train.drop("TransactionDT",axis=1,inplace=True)
test.drop("TransactionDT",axis=1,inplace=True)
# define continuous and categorical variables
cont_vars = ["TransactionAmt"]
cat_vars = ["ProductCD","addr1","addr2","P_emaildomain","R_emaildomain","Time of Day"] + [col for col in train.columns if "card" in col]
# set training and testing set
# (as written, both sets are taken from train; test_transaction.csv has no isFraud column)
x_train = train[cont_vars + cat_vars].copy()
y_train = train["isFraud"].copy()
x_test = train[cont_vars + cat_vars].copy()
y_test = train["isFraud"].copy()
# process cont_vars
# scale values
scaler = StandardScaler()
x_train["TransactionAmt"] = scaler.fit_transform(x_train["TransactionAmt"].values.reshape(-1,1))
x_test["TransactionAmt"] = scaler.transform(x_test["TransactionAmt"].values.reshape(-1,1))
# reduce cardinality of categorical variables
card1_counts = x_train["card1"].value_counts()
idx_list = card1_counts[card1_counts<=100].index.tolist()
x_train.loc[x_train["card1"].isin(idx_list),"card1"] = "Others"
x_test.loc[x_test["card1"].isin(idx_list),"card1"] = "Others"
# fill missing
x_train[cat_vars] = x_train[cat_vars].fillna("Missing")
x_test[cat_vars] = x_test[cat_vars].fillna("Missing")
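
To see what the cardinality-reduction step does, here is a small sketch on made-up data. The column name `card1` comes from the post; the toy values, the threshold of 1, and the helper name `collapse_rare` are assumptions for illustration only. The point it tries to show is that the list of rare values is computed from the training column and then applied to both columns.

```python
import pandas as pd


def collapse_rare(train_col, test_col, max_count, label="Others"):
    """Replace values seen at most `max_count` times in train with `label`."""
    counts = train_col.value_counts()
    rare = counts[counts <= max_count].index
    # cast to object so a string label can sit alongside the original values
    train_out = train_col.astype(object).where(~train_col.isin(rare), label)
    test_out = test_col.astype(object).where(~test_col.isin(rare), label)
    return train_out, test_out


toy_train = pd.Series([1001, 1001, 1001, 2002, 3003], name="card1")
toy_test = pd.Series([1001, 2002, 4004], name="card1")

new_train, new_test = collapse_rare(toy_train, toy_test, max_count=1)
print(new_train.tolist())  # [1001, 1001, 1001, 'Others', 'Others']
print(new_test.tolist())   # [1001, 'Others', 4004]
```

Note that a value that never appears in training (4004 here) is left untouched by this step, so it still has to be handled when the categories are converted to integers.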

Once the processing steps are done, we can convert the categorical variables into integers.

# convert to numerical value for modelling
def categorify(df, cat_vars):
    categories = {}
    for cat in cat_vars:
        df[cat] = df[cat].astype("category").cat.as_ordered()
        categories[cat] = df[cat].cat.categories
    return categories
def apply_test(test,categories):
    for cat, index in categories.items():
        test[cat] = pd.Categorical(test[cat],categories=categories[cat],ordered=True)
# convert to integers
categories = categorify(x_train,cat_vars)
apply_test(x_test,categories)
# shift codes by 1 so that -1 (missing or unseen category) becomes 0
for cat in cat_vars:
    x_train[cat] = x_train[cat].cat.codes+1
    x_test[cat] = x_test[cat].cat.codes+1
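
Here is a small check of how this encoding behaves, on made-up values. Categories are learned from the training column, and any test value outside them (or any missing value) gets code -1 from pandas, which the +1 shift turns into 0. The toy data is an assumption; the pandas calls are the same ones used in the code above.

```python
import pandas as pd

toy_train = pd.DataFrame({"ProductCD": ["W", "C", "W", "H"]})
toy_test = pd.DataFrame({"ProductCD": ["C", "S", "W", None]})

# learn the category index from the training column
toy_train["ProductCD"] = toy_train["ProductCD"].astype("category").cat.as_ordered()
learned = toy_train["ProductCD"].cat.categories

# reuse the same index for the test column
toy_test["ProductCD"] = pd.Categorical(
    toy_test["ProductCD"], categories=learned, ordered=True
)

# shift by 1 so that -1 (unseen or missing) becomes 0
train_codes = (toy_train["ProductCD"].cat.codes + 1).tolist()
test_codes = (toy_test["ProductCD"].cat.codes + 1).tolist()

print(list(learned))  # ['C', 'H', 'W']
print(train_codes)    # [3, 1, 3, 2]
print(test_codes)     # [1, 0, 3, 0]
```

With this scheme, index 0 ends up acting as a shared slot for unknown values, and known categories start at 1.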

Due to the highly imbalanced dataset, I have to artificially generate more fraud data using a technique called Synthetic Minority Over-sampling Technique (SMOTE). The documentation can be found here.
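
To give a feel for the idea behind SMOTE, here is a rough pure-Python sketch: each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is not the code used in the post, and the function name `smote_sketch`, the toy points, and the parameter defaults are all assumptions. In practice one would more likely use a library implementation, such as the `SMOTE` class in imbalanced-learn.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of `base`, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# made-up minority-class (fraud) points with two numeric features
fraud_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(fraud_points, n_new=3)
print(len(new_points))  # 3
```

Because every synthetic point is an interpolation between two real minority samples, it always stays inside the region already covered by the minority class.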
