Fraud detection — Unsupervised Anomaly Detection

Category: IT Technology · Published: 3 years ago

A 100% unsupervised approach to discovering fraud in credit card transactions

Jun 2 · 6 min read

Photo by Ryan Born on Unsplash

One of the greatest concerns of many business owners is how to protect their company from fraudulent activity. This concern has motivated large companies to store data about their past frauds. However, whoever commits a fraud aims not to be caught, so this kind of data is usually unlabeled or only partially labeled.

In this article, we will look at how to discover frauds in a credit card transaction dataset. Unlike most fraud datasets, this one is completely labeled; however, we won’t use the labels to discover the frauds. Credit card fraud is when someone uses another person’s credit card or account information to make unauthorized purchases or access funds through cash advances. Credit card fraud doesn’t just happen online; it happens in brick-and-mortar stores, too. As a business owner, you can avoid serious headaches, and unwanted publicity, by recognizing potentially fraudulent use of credit cards in your payment environment.

One of the most common approaches to finding fraudulent transactions was to randomly select some transactions and ask an auditor to review them. This approach was quite inaccurate, since fraudulent transactions make up only about 0.1% of all transactions.

We therefore aim to leverage machine learning to detect and prevent frauds and to make fraud fighters more efficient and effective. There are commonly two approaches, supervised and unsupervised. A supervised model learns from transactions that are already labeled as fraudulent or legitimate, while an unsupervised model looks for transactions that deviate from the rest of the data without using any labels.

These models can then be deployed to automatically identify new instances of known fraud patterns in the future. Ideally, this kind of machine learning algorithm should be validated temporally (trained on older data and evaluated on newer data), since fraud patterns can change over time; to keep this article simple, however, a simplified validation is used.
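A temporal validation of the kind mentioned above could look like the following minimal sketch. It assumes the transactions sit in a pandas DataFrame with a “Time” column, as in the Kaggle file; the helper name temporal_split and the split ratio are illustrative choices, not taken from the article.

```python
import pandas as pd

def temporal_split(df, time_col="Time", train_frac=0.8):
    """Split so that every training row is no later than any evaluation row."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Made-up rows, deliberately out of order (hypothetical values).
demo = pd.DataFrame({"Time": [5, 1, 4, 2, 3], "Amount": [50, 10, 40, 20, 30]})
train, holdout = temporal_split(demo, train_frac=0.6)
print(train["Time"].tolist(), holdout["Time"].tolist())
```

Fitting the detector on the earlier part only and scoring the later part gives a more honest picture of how it would behave on future transactions.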

The dataset

The project uses a dataset of around 284,000 credit card transactions taken from Kaggle.

The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data are not provided. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are “Time” and “Amount”. There are no null values (see the dataset page).
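The class balance and the absence of null values can be checked with a few lines of pandas. This is only a sketch: it assumes the label column is named “Class” with 1 marking a fraud, as in the Kaggle CSV, and it runs on a tiny made-up frame rather than the real file.

```python
import pandas as pd

def fraud_rate(df, label_col="Class"):
    """Fraction of rows whose label marks a fraud (assumed to be 1)."""
    return float((df[label_col] == 1).mean())

def has_nulls(df):
    """True if any cell in the frame is missing."""
    return bool(df.isnull().values.any())

# Made-up stand-in; the real data would be loaded with pd.read_csv
# from the downloaded Kaggle file.
demo = pd.DataFrame({
    "Time": [0, 1, 2, 3],
    "Amount": [10.0, 25.5, 3.2, 999.0],
    "Class": [0, 0, 0, 1],
})
print(f"fraud rate: {fraud_rate(demo):.3%}, nulls: {has_nulls(demo)}")
```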

Since only the “Time” and “Amount” features are easily interpreted, we can use some visualizations to see their impact on the target variable (fraud). First, does fraud happen more often on small transactions or on large ones?
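One way to approach this question numerically, alongside the plots, is to summarize the transaction amount separately for each class. The sketch below again assumes the “Amount” and “Class” column names from the Kaggle file and uses invented values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

def amount_by_class(df, amount_col="Amount", label_col="Class"):
    """Summary statistics of the transaction amount, one row per class."""
    return df.groupby(label_col)[amount_col].describe()

# Invented values for illustration only.
demo = pd.DataFrame({
    "Amount": [12.0, 30.0, 8.5, 120.0, 1.0, 250.0],
    "Class": [0, 0, 0, 0, 1, 1],
})
summary = amount_by_class(demo)
print(summary[["count", "mean", "50%", "max"]])
```

Comparing the median and the upper quantiles of the two rows shows whether frauds concentrate on small or large amounts.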

Approach to detect frauds

This article proposes an unsupervised approach to detecting frauds; the labels are used only to evaluate the algorithm. One of the biggest challenges of this problem is that the target is highly imbalanced, as only 0.17% of the cases are fraudulent transactions. An advantage of the representation learning approach is that it can still handle this kind of imbalance. Using t-SNE, we can try to see how similar the transactions are to one another.
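A t-SNE projection of this kind can be sketched with scikit-learn's TSNE class. The article does not say how its plot was produced, so the perplexity, the PCA initialization and the random demo data below are assumptions; on the real dataset it is common to project only a subsample, since t-SNE is slow on hundreds of thousands of rows.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(X, perplexity=30.0, seed=0):
    """Project the rows of X to two dimensions with t-SNE for plotting."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(X)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 30))   # random stand-in for 30 features
embedding = embed_2d(X_demo, perplexity=20.0)
print(embedding.shape)
```

The two columns of the result can then be drawn as a scatter plot, colored by the fraud label purely for inspection.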

The main idea of this approach is to compress the data into a “latent representation” and then reconstruct it. If a sample is similar to the rest of the dataset, the reconstructed data will be similar or even equal to the original data. However, if the sample is not similar to the rest, the reconstructed sample will not be similar to the original one.

In short, we compress the data and reconstruct it. If the reconstructed data is not similar to the original, we flag the sample as a likely fraud.

Principal Component Analysis

Using principal component analysis (PCA), we compressed the data from 30 features to 10 and calculated a reconstruction score for each sample, which we then plotted as a histogram.

We can see that most samples have a low reconstruction score, while most frauds probably have a reconstruction score above 50. Using t-SNE, we can compare the layout of the original data with that of the PCA-compressed data.
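The PCA compress-and-reconstruct step described in this section can be sketched as follows. The article does not define its reconstruction score or say whether the features were scaled; this example assumes standardized features and uses the per-sample squared error between the input and its reconstruction. The synthetic data is only there to make the sketch runnable.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_reconstruction_score(X, n_components=10):
    """Compress X with PCA, reconstruct it, return per-sample squared error."""
    X_scaled = StandardScaler().fit_transform(X)      # assumption: scale first
    pca = PCA(n_components=n_components, random_state=0)
    latent = pca.fit_transform(X_scaled)              # e.g. 30 features -> 10
    X_hat = pca.inverse_transform(latent)             # back to 30 features
    return np.sum((X_scaled - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in: 1000 "normal" rows near a 5-dimensional subspace,
# plus 10 rows that do not follow that structure.
basis = rng.normal(size=(5, 30))
normal = rng.normal(size=(1000, 5)) @ basis + 0.05 * rng.normal(size=(1000, 30))
odd = rng.normal(scale=5.0, size=(10, 30))
X_demo = np.vstack([normal, odd])

scores = pca_reconstruction_score(X_demo)
print(scores[:1000].mean(), scores[1000:].mean())
```

Rows that share the structure of the bulk of the data are reconstructed almost perfectly, while the odd rows keep a large error, which is what the histogram above visualizes.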

Original distribution vs PCA distribution

Now we need to set a threshold on the reconstruction score. Usually, domain expertise is used to help set this threshold, since it directly affects the precision and recall trade-off.

Using the mean and standard deviation of the reconstruction score, we can set a reasonable threshold. I chose to set the threshold to mean + 2*std. With this threshold, by auditing 0.74% of the transactions we managed to find 87% of the frauds.
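The mean + 2*std rule, and the two figures reported for it (share of transactions audited and share of frauds found), can be computed as in this small sketch. The function names and the toy scores are hypothetical; the labels appear only in the evaluation step, as in the article.

```python
import numpy as np

def flag_by_threshold(scores, n_std=2.0):
    """Flag samples whose score exceeds mean + n_std * std."""
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + n_std * scores.std()
    return scores > threshold, threshold

def audit_summary(flags, labels):
    """Return (fraction of samples flagged, fraction of frauds flagged)."""
    flags = np.asarray(flags, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    audited_fraction = flags.mean()
    recall = (flags & labels).sum() / labels.sum()
    return float(audited_fraction), float(recall)

# Toy data: 98 low scores and 2 high scores that happen to be the frauds.
toy_scores = np.array([1.0] * 98 + [50.0, 60.0])
toy_labels = np.array([0] * 98 + [1, 1])
flags, threshold = flag_by_threshold(toy_scores)
audited, recall = audit_summary(flags, toy_labels)
print(f"threshold={threshold:.2f} audited={audited:.2%} recall={recall:.0%}")
```

Raising n_std shrinks the audited share at the cost of recall, which is the precision and recall trade-off mentioned above.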

Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. It is composed of an encoder, responsible for compressing the data, and a decoder, which reconstructs it.
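To make the encoder and decoder structure concrete, here is a rough sketch. The article does not show its network or name a framework, so this is a stand-in rather than the author's model: scikit-learn's MLPRegressor is trained to reproduce its own input, with a narrow middle layer acting as the latent representation. The layer sizes, activation and iteration count are arbitrary assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def autoencoder_reconstruction_score(X, bottleneck=10, seed=0):
    """Train a small MLP to reproduce X and return per-sample squared error."""
    X_scaled = StandardScaler().fit_transform(X)
    # Hidden layers: encoder (20) -> bottleneck -> decoder (20).
    # The output layer has the same width as the input.
    model = MLPRegressor(hidden_layer_sizes=(20, bottleneck, 20),
                         activation="tanh", max_iter=300, random_state=seed)
    model.fit(X_scaled, X_scaled)          # the target is the input itself
    X_hat = model.predict(X_scaled)
    return np.sum((X_scaled - X_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in: structured "normal" rows plus a few unstructured ones.
basis = rng.normal(size=(5, 30))
normal = rng.normal(size=(1000, 5)) @ basis + 0.05 * rng.normal(size=(1000, 30))
odd = rng.normal(scale=5.0, size=(10, 30))
ae_scores = autoencoder_reconstruction_score(np.vstack([normal, odd]))
print(ae_scores[:1000].mean(), ae_scores[1000:].mean())
```

A dedicated deep learning framework would give more control over the architecture and training, but the principle is the same: the network learns to reconstruct typical rows well and atypical rows poorly.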

The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. As with the PCA approach, we computed the reconstruction score for each sample and plotted its histogram.

We can see that most samples have a low reconstruction score, while most frauds probably have a reconstruction score above roughly 60. Using t-SNE, we can compare the layout of the original data with that of the autoencoder-compressed data.

Original distribution vs Autoencoder distribution

The autoencoder representation seems to separate the frauds from the normal data quite well. Now we need to set a threshold on the reconstruction score. Usually, domain expertise is used to help set this threshold, since it directly affects the precision and recall trade-off.

Using the mean and standard deviation of the reconstruction score, we can set a reasonable threshold. I chose to set the threshold to mean + 2*std. With this threshold, by auditing 0.85% of the transactions we managed to find 65% of the frauds.

Conclusion

The objective of the approach was fulfilled: it is possible to detect frauds with a 100% unsupervised approach. Nevertheless, there are several ways to make this approach work better, such as:

  • Tuning the models used (PCA and autoencoder);
  • Tuning the threshold on the reconstruction score;
  • Exploring whether PCA and the autoencoder detect the same frauds. If they work in different ways, it may be worth building an ensemble;
  • Augmenting the data with some feature engineering.
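The third suggestion can be explored with a sketch like the one below. It measures how much the two detectors' flagged sets overlap (Jaccard index) and combines their scores by averaging normalized ranks. Both choices are assumptions made for illustration; the article does not prescribe a particular overlap measure or ensembling scheme, and the input values are invented.

```python
import numpy as np

def detection_overlap(flags_a, flags_b):
    """Jaccard overlap between the samples flagged by two detectors."""
    a = np.asarray(flags_a, dtype=bool)
    b = np.asarray(flags_b, dtype=bool)
    union = (a | b).sum()
    return float((a & b).sum() / union) if union else 1.0

def rank_average(scores_a, scores_b):
    """Ensemble score: mean of each detector's rank, scaled to [0, 1]."""
    def to_rank(s):
        s = np.asarray(s, dtype=float)
        return np.argsort(np.argsort(s)) / max(len(s) - 1, 1)
    return (to_rank(scores_a) + to_rank(scores_b)) / 2.0

# Invented flags and scores for four transactions.
pca_flags, ae_flags = [1, 1, 0, 0], [1, 0, 1, 0]
pca_scores, ae_scores_demo = [0.1, 5.0, 0.2, 9.0], [1.0, 2.0, 3.0, 40.0]
overlap = detection_overlap(pca_flags, ae_flags)
combined = rank_average(pca_scores, ae_scores_demo)
print(overlap, combined)
```

A low overlap suggests the two models catch different frauds, in which case a combined score is more likely to help than either model alone.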

The code used for this approach, along with a little more analysis, is available on GitHub.

