Can We Detect Digital Fraud in a Cashless Post COVID-19 Economy Using AI?

Category: IT Technology · Published: 4 years ago

COVID-19 has changed the way we pay: the growing use of digital payments has pushed the potential for digital fraud to an all-time high in our soon-to-be cashless economy.

May 3 · 8 min read

Photo by Muhammad Raufan Yusup & Pixaby on Unsplash & Pexels

COVID-19’s Acceleration of Digitization

After the World Health Organization announced that cash could harbor the coronavirus, several countries took immediate measures to quarantine or destroy large portions of their money supply. Some went as far as banning the use of cash altogether, forcing customers to fully embrace digital payments.

Is This The End For Cash?

Photo by Blake Wisz on Unsplash

Analysts now expect global non-cash transactions to surpass the $1 trillion milestone by 2024. Beyond that, the rapid increase in both the number and dollar value of electronic transactions has some analysts predicting that governments will legislate cash out of existence in the near future to help prevent the spread of another pandemic.

Additionally, this rapid ascension of digital payments is transforming not only how consumers, businesses, and governments are moving money, but also how criminals steal money: digital fraud.

Digital Fraud’s Growth

Online fraud grew by 13% to $16.9b in 2019, even as instances of fraud fell from 14.4m to 13m. Hackers shifted their focus to higher-value fraud rather than many lower-value occurrences, overall stealing an extra $3.5b in a year across 1.4m fewer transactions.

Photo by Clint Patterson on Unsplash

This steady uptrend in fraud is now accelerating as quarantined people turn to online platforms more than ever before. Attempted online payment fraud is expected to increase by at least 73% in 2020.

Building An Early Warning System — Digital Fraud

To better prepare for the threats the digital era is bringing, we decided to build an autoencoder fraud detection model that not only detects fraud but also simulates rare fraudulent cases, creating more “anomaly” transactions to examine.

Photo by Ales Nesetril on Unsplash

Problem: Imbalanced Dataset

The dataset we are using contains credit card transactions that occurred over 2 days, with 492 frauds out of 284,807 transactions, which means that only 0.17% of our dataset consists of fraud.

In essence, our dataset is highly imbalanced. A model trained on it naively would learn to identify normal transactions far better than fraudulent ones, making it of little use against new cases of fraud.
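To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It is an illustrative sketch rather than part of the original analysis: it shows that a hypothetical classifier that labels every transaction as normal would score very high accuracy while catching no fraud at all.

```python
# Counts quoted in the article.
total_transactions = 284_807
fraud_transactions = 492

# Share of the dataset that is fraudulent.
fraud_share = fraud_transactions / total_transactions

# A hypothetical "always normal" classifier is right on every normal
# transaction and wrong on every fraudulent one.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions
baseline_recall = 0 / fraud_transactions  # it never flags a single fraud

print(f"Fraud share:       {fraud_share:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall:   {baseline_recall:.2%}")
```

This is why accuracy alone is a poor guide here, and why the rest of the article looks at recall and precision instead.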

Tradeoff: Recall vs. Precision

Our objective is to maximize recall and trade a bit of the precision, as it is less financially damaging to predict “fraud” on non-fraudulent transactions than to miss any fraudulent ones.

Solution: Autoencoders

Autoencoders are unsupervised artificial neural networks that learn to compress (encode) data efficiently and then reconstruct it from that compressed form.

In essence, an autoencoder rebuilds the data from the reduced, encoded representation into an output that is as close as possible to the original input.

It does this largely by learning to ignore the noise in the data, which lets it reduce the data’s dimensions.

Autoencoder: Example of Input / Output Image From MNIST Dataset

Planning Our Model

We will train an autoencoder neural network in an unsupervised manner. Our simulated rare events will vary slightly from the original ones, and the model will predict whether a case is fraudulent or not from the input alone.

Evaluating Our Model

The main metric used in our project to determine whether a transaction is fraudulent (1) or normal (0) is the reconstruction error, which the model is trained to minimize.

This allows our autoencoder to learn the important features of the transactions it is trained on: when a representation allows a good reconstruction of its input, it has retained much of the information present in that input. A transaction that reconstructs poorly is therefore a candidate for fraud.
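As a rough illustration of the metric, the sketch below computes a per-transaction reconstruction error as the mean squared error between each input row and its reconstruction. The function name and the toy arrays are assumptions made for this example; in the article the reconstructions would come from the trained autoencoder.

```python
import numpy as np

def reconstruction_error(original, reconstructed):
    """Per-row mean squared error between inputs and their reconstructions."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.mean((original - reconstructed) ** 2, axis=1)

# Two hypothetical transactions with three features each.
inputs = np.array([[0.0, 1.0, 2.0],
                   [1.0, 1.0, 1.0]])
outputs = np.array([[0.0, 1.0, 2.0],   # reproduced exactly
                    [1.0, 3.0, 1.0]])  # one feature is off by 2

errors = reconstruction_error(inputs, outputs)
print(errors)
```

The first row is reproduced exactly and gets an error of zero, while the second row gets a larger error because one of its features is reconstructed badly.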

Exploratory Data Analysis

A quick summary of the dataset shows 31 columns, 2 of which are Time and Amount.

Class (Target Variable)

1: Fraudulent transaction

0: Normal/Non-fraudulent transaction

The remaining 28 variables come from a PCA transformation and have been transformed for security purposes.

Variables — Digital Fraud Model
There are no missing values, so we can proceed to plot the data.

Visualizing The Imbalanced Dataset

Highly Imbalanced Dataset — Normal to Fraud

Do Fraudulent Transactions Occur At Specific Timeframes?

Fraudulent Transactions — Timeframe Analysis
No clear insight can be extracted from the Time variable, as the timing of transactions seems to vary for both types of transactions.

Data Preprocessing

Data Scaling

The Time variable is dropped as irrelevant, and the remaining values are standardized in preparation for our autoencoder model.
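The article does not show its scaling code, so the following is only a sketch of what standardization does, written with plain NumPy. The helper name `standardize` and the sample amounts are hypothetical; a library scaler would serve the same purpose.

```python
import numpy as np

def standardize(values):
    """Rescale each column to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return (values - mean) / std

# Hypothetical transaction amounts, one column.
amounts = np.array([[10.0], [20.0], [30.0]])
scaled = standardize(amounts)
print(scaled.ravel())
```

After scaling, every column has a mean of zero and a standard deviation of one, so features measured on very different scales contribute comparably to the reconstruction error.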

Train-Test Split [80:20]

Unlike most models, our primary focus isn’t building a classification model; it is detecting anomalies. Hence our train & test split will be slightly different.

To account for the imbalanced dataset, we will train our model only on normal transactions. However, we will refrain from modifying the test set, so it keeps the original class split and gives an accurate & unbiased evaluation of our model’s performance.
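One way to sketch such a split is shown below. It holds out 20% of the rows as a test set with its classes untouched, then drops the fraudulent rows from the training portion only. The function name, the seed, and the toy data are assumptions for illustration and are not taken from the original code.

```python
import numpy as np

def split_for_anomaly_detection(features, labels, test_fraction=0.2, seed=42):
    """Hold out a test set as-is and keep only normal rows for training."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_test = int(round(len(labels) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Training uses normal (label 0) rows only; the test set is left untouched.
    train_idx = train_idx[labels[train_idx] == 0]
    return features[train_idx], features[test_idx], labels[test_idx]

# Ten hypothetical transactions with two features each; rows 2 and 6 are fraud.
features = np.arange(20, dtype=float).reshape(10, 2)
labels = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

X_train, X_test, y_test = split_for_anomaly_detection(features, labels)
print(X_train.shape, X_test.shape, y_test)
```

Because the autoencoder only ever sees normal transactions during training, fraudulent ones in the test set should be harder for it to reconstruct.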

Building Our Model

Model Setup

Next, the autoencoder model is set up. The input has one dimension per feature (X_train.shape[1]) and is fed into 4 fully connected layers with 14, 7, 7, and X_train.shape[1] units respectively.

The first 2 layers represent the encoding part and the remaining 2 layers represent the decoding part.

To build a less complex model and address over-fitting and feature selection, we incorporate Lasso-style (L1) regularization, applied to the activations of the first encoding layer.

Hyperparameters for each layer are specified with the kernel initializer set to glorot_uniform and alternating sigmoid and ReLU activation functions.

We picked these hyperparameters because they are widely used defaults that tend to perform well.

from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras import regularizers

# Convert the DataFrames to NumPy arrays for Keras.
X_train = X_train.values
X_test = X_test.values

input_dim = X_train.shape[1]  # number of input features
encoding_dim = 14             # width of the first encoding layer

input_layer = Input(shape=(input_dim,))

# Encoder: input_dim -> 14 -> 7, with an L1 penalty on the first layer's activations.
encoder1 = Dense(encoding_dim, activation="sigmoid", kernel_initializer="glorot_uniform",
                 activity_regularizer=regularizers.l1(0.0003))(input_layer)
encoder2 = Dense(7, activation="relu", kernel_initializer="glorot_uniform")(encoder1)

# Decoder: 7 -> 7 -> input_dim, mirroring the encoder.
decoder1 = Dense(7, activation="sigmoid", kernel_initializer="glorot_uniform")(encoder2)
decoder2 = Dense(input_dim, activation="relu", kernel_initializer="glorot_uniform")(decoder1)

autoencoder = Model(inputs=input_layer, outputs=decoder2)

Model Training

The model is trained for 20 epochs with a batch size of 32 samples to allow the model to learn the best weights. The best model weights are defined as the weights that minimize the loss function (reconstruction error).

The best model is saved using the ModelCheckpoint callback, and training progress is logged for TensorBoard.

from keras.callbacks import ModelCheckpoint, TensorBoard

autoencoder.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

# Save the model whenever the monitored validation loss improves.
checkpoint = ModelCheckpoint(filepath=r"C:\Users\Ramy\Desktop\AI\autoencode.h", verbose=0, save_best_only=True)
# Log training progress so it can be inspected in TensorBoard.
tensorboard = TensorBoard(log_dir=r"C:\Users\Ramy\Desktop\AI\logs", histogram_freq=0, write_graph=True, write_images=True)
# early_stop = EarlyStopping(monitor='loss', patience=2, verbose=0, mode='min')

# The autoencoder learns to reproduce its input, so X_train is both input and target.
history = autoencoder.fit(X_train, X_train,
                          epochs=20,
                          batch_size=32,
                          shuffle=True,
                          validation_data=(X_test, X_test),
                          verbose=1,
                          callbacks=[checkpoint, tensorboard]).history

Results

To evaluate our model’s learning capabilities, we plot the train & test model losses to verify that the error falls as the number of epochs increases.

Model Loss — Epochs vs Loss

Our MSE serves as the reconstruction error, and it appears to converge well on both the test & training sets.

Summary Statistics of Reconstruction Error & True Class

Reconstruction Error vs. True Class

We then plot the reconstruction errors for both class types (Normal and Fraudulent).

Reconstruction — Without Fraud

Reconstruction — With Fraud

Validation

Recall vs. Precision

High Precision: few false positives among the transactions flagged as fraud

High Recall: few false negatives, meaning few frauds are missed

Autoencoder Model: Recall vs. Precision

The plot shows that the values are very extreme in this case. The model can either do well in precision or recall alone but can’t have both at the same time.

Autoencoder Model’s Optimal Points:

Recall: 20%

Precision: 60%

A granular plot of Precision & Recall curves by threshold

Autoencoder Model: Precision & Recall by Threshold

The graph shows that as the reconstruction error threshold increases, the model’s precision increases as well. Recall, however, moves in the opposite direction.
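The trade-off can be reproduced on a toy example. The sketch below is not the article’s plotting code: it uses six made-up reconstruction errors and labels, and a hypothetical helper that computes precision and recall at a given threshold. On real data precision does not have to rise at every step, but the overall pattern is the same.

```python
import numpy as np

def precision_recall_at(errors, labels, threshold):
    """Precision and recall when every error above the threshold is flagged."""
    errors = np.asarray(errors, dtype=float)
    labels = np.asarray(labels)
    flagged = errors > threshold
    tp = int(np.sum(flagged & (labels == 1)))   # frauds correctly flagged
    fp = int(np.sum(flagged & (labels == 0)))   # normal transactions flagged
    fn = int(np.sum(~flagged & (labels == 1)))  # frauds missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up reconstruction errors; 1 marks a fraudulent transaction.
errors = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0]
labels = [0, 0, 0, 1, 0, 1]

curve = {t: precision_recall_at(errors, labels, t) for t in (1.5, 3.0, 5.0)}
for threshold, (precision, recall) in curve.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold removes false alarms first, which lifts precision, and eventually starts to miss real frauds, which lowers recall.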

Model Testing & Evaluation

To finally differentiate between fraudulent and normal transactions, we introduce a threshold value. We compare each transaction’s reconstruction error against it: if the error is larger than the defined threshold, the transaction is marked as fraudulent.

Optimal Threshold Value = 3.2

We could have also estimated the threshold value from the test data. However, that would have carried a risk of overfitting to the test set, which could prove detrimental in the long run.
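Applying the threshold is a one-line comparison. The sketch below assumes the reconstruction errors are already available as an array and uses the threshold of 3.2 quoted above. The function name and the sample errors are hypothetical.

```python
import numpy as np

THRESHOLD = 3.2  # threshold value quoted in the article

def flag_fraud(errors, threshold=THRESHOLD):
    """Return 1 (fraud) where the error exceeds the threshold, else 0 (normal)."""
    return (np.asarray(errors, dtype=float) > threshold).astype(int)

# Hypothetical reconstruction errors for four transactions.
predictions = flag_fraud([0.4, 3.2, 3.3, 12.0])
print(predictions)
```

Note that an error exactly equal to the threshold is treated as normal here, since the rule is "larger than the defined threshold".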

Visualization of Division Between Normal & Fraudulent Transactions w/ Respect to Threshold Values

Reconstruction Error — Normal vs. Fraudulent

Confusion Matrix

This offers a more comprehensive overview of our model’s precision & recall values. Overall, our autoencoder model is robust, with high True Positives & True Negatives (fraud vs. normal transaction detection rates).

Fraudulent Transactions

Looking at the confusion matrix, we can see that there are 16 + 85 = 101 fraudulent transactions.

85 of them were correctly classified as fraudulent and 16 of them were incorrectly classified as normal transactions.

Normal Transactions

On the other hand, 1159 normal transactions are incorrectly classified as fraudulent, equivalent to approximately 2% of the total normal cases.
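The counts quoted above are enough to work out recall and precision at this threshold. The short calculation below is added for illustration and uses only those three numbers. The count of correctly classified normal transactions is not stated in the text, so it is left out.

```python
# Counts read from the confusion matrix described above.
true_positives = 85      # frauds correctly flagged
false_negatives = 16     # frauds classified as normal
false_positives = 1159   # normal transactions flagged as fraud

# Recall: share of all frauds that were caught.
recall = true_positives / (true_positives + false_negatives)

# Precision: share of flagged transactions that really were fraud.
precision = true_positives / (true_positives + false_positives)

print(f"Recall:    {recall:.1%}")
print(f"Precision: {precision:.1%}")
```

At this threshold the model catches roughly 84% of the frauds, while only about 7% of the transactions it flags are actually fraudulent. That is the recall-over-precision trade the article set out to make.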

Original Objective

In general, it’s much more costly to mistake a fraudulent transaction for a normal one than the reverse.

Solution

To make sure that this objective is satisfied, we try to boost the predictive power of detecting a fraudulent transaction by trading off our ability to accurately predict normal transactions.

Conclusion

Overall, the model is relatively robust because we did catch most of the fraudulent cases. However, it could be improved further if our dataset were a bit more balanced.

