COVID-19 has changed the way we pay. The increasing use of digital payments has pushed the potential for digital fraud to an all-time high in our soon-to-be cashless economy.
May 3 · 8 min read
COVID-19’s Acceleration of Digitization
After the World Health Organization announced that cash could harbor the coronavirus, several countries took immediate measures to quarantine or destroy large portions of their money supply. Some went as far as banning the use of cash altogether, forcing customers to fully embrace digital payments.
Is This The End For Cash?
Analysts now expect global non-cash transactions to surpass the $1 trillion milestone by 2024. Moreover, the rapid increase in both the number and dollar value of electronic transactions has some analysts predicting that governments will legislate cash out of existence in the near future to help prevent the spread of another pandemic.
Additionally, this rapid ascension of digital payments is transforming not only how consumers, businesses, and governments are moving money, but also how criminals steal money: digital fraud.
Digital Fraud’s Growth
Online fraud grew by 13% to $16.9b in 2019, even as the number of fraud incidents fell from 14.4m to 13m. Hackers shifted their focus from many lower-value frauds to fewer higher-value ones, stealing an extra $3.5b overall with 1.4m fewer incidents.
This steady uptrend is now accelerating as quarantined people turn to online platforms more than ever before, and attempted online payment fraud is expected to increase by at least 73% in 2020.
Building An Early Warning System — Digital Fraud
To better prepare ourselves for the threats the digital era is bringing, we decided to create an autoencoder fraud detection model that will not only detect fraud, but also simulate rare fraudulent cases, creating more “anomaly” transactions to examine.
Problem: Imbalanced Dataset
The dataset we are using contains credit card transactions that occurred within 2 days, with 492 frauds out of 284,807 transactions, which means that only 0.17% of our dataset consists of fraud.
In essence, our dataset is highly imbalanced, which means our model would learn how to better identify normal transactions as opposed to fraudulent ones, making it entirely useless when applied against new cases of fraud.
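To make the imbalance concrete, here is a small sketch that uses only the two counts quoted above. The "all-normal" baseline is a hypothetical classifier added purely for illustration; it is not part of the project.

```python
# Class counts quoted in the article for the credit card dataset.
n_total = 284_807
n_fraud = 492
n_normal = n_total - n_fraud

fraud_rate = n_fraud / n_total
print(f"Fraud share: {fraud_rate:.2%}")

# A hypothetical baseline that labels every transaction "normal" is almost
# always right, yet it catches no fraud at all.
baseline_accuracy = n_normal / n_total
baseline_recall = 0 / n_fraud
print(f"All-normal baseline: accuracy={baseline_accuracy:.2%}, recall={baseline_recall:.0%}")
```

This is why accuracy alone is a poor yardstick here: a model can score above 99.8% without detecting a single fraudulent transaction.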
Tradeoff: Recall vs. Precision
Our objective is to maximize recall and trade a bit of the precision, as it is less financially damaging to predict “fraud” on non-fraudulent transactions than to miss any fraudulent ones.
Solution: Autoencoders
Autoencoders are unsupervised artificial neural networks that learn to compress (encode) data efficiently and then reconstruct it.
In essence, an autoencoder rebuilds the data from the reduced, encoded representation so that the output is as close as possible to the original input.
It does this largely by learning how to ignore the noise in the data to reduce the data dimensions.
Planning Our Model
We will train an autoencoder neural network in an unsupervised manner. Our simulated rare events will vary slightly from the original ones, and the model will predict whether a case is fraudulent from the input alone.
Evaluating Our Model
The main metric used in our project to determine whether a transaction is fraudulent (1) or normal (0) is the reconstruction error, which the model is trained to minimize.
This allows our autoencoder to learn the important features present in the data, because a representation that allows a good reconstruction of its input has retained much of the information present in that input.
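As a rough illustration of the metric, the sketch below computes a per-transaction reconstruction error as the mean squared error between an input row and its reconstruction. It assumes MSE (the loss used when the model is trained), and the helper name `reconstruction_error` and the toy values are made up for this example.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Per-transaction mean squared error between input and reconstruction."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return np.mean((x - x_hat) ** 2, axis=1)

# Toy, made-up values: two transactions with three features each.
x = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 2.0]])
x_hat = np.array([[0.0, 1.0, 2.0],     # reconstructed perfectly
                  [3.0, 1.0, -1.0]])   # reconstructed poorly
errors = reconstruction_error(x, x_hat)
print(errors)
```

A transaction the model reconstructs poorly gets a large error, which is the signal used to flag it as a possible anomaly.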
Exploratory Data Analysis
A quick summary of the dataset shows 31 columns, 2 of which are Time and Amount.
Class (Target Variable)
1: Fraudulent transaction
0: Normal/Non-fraudulent transaction
The remaining 28 variables come from a PCA transformation and have been transformed for security purposes.
There are no missing values, so we can proceed to plot the data.
Visualizing The Imbalanced Dataset
Do Fraudulent Transactions Occur At Specific Timeframes?
No clear insight can be extracted from the time variable, as transaction times seem to vary in a similar way for both types of transactions.
Data Preprocessing
Data Scaling
The Time variable is dropped as irrelevant, and the values are standardized in preparation for our autoencoder model.
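Standardization here means rescaling a column to zero mean and unit variance. The sketch below shows the transformation with NumPy on made-up amounts. The helper name `standardize` is hypothetical, and the article does not say which tool was used for this step (scikit-learn's `StandardScaler` applies the same transformation column by column).

```python
import numpy as np

def standardize(values):
    """Rescale a 1-D array to zero mean and unit (population) variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Made-up transaction amounts, used only to illustrate the transformation.
amount = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
amount_scaled = standardize(amount)
print(amount_scaled)
```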
Train-Test Split [80:20]
Unlike most models, our primary focus is not to build a classifier but to detect anomalies, so our train & test split will be slightly different.
To account for the imbalanced dataset, we will train our model only on normal transactions. The test set, however, is left unmodified and keeps the original class split, so that we retain an accurate & unbiased evaluation of our model's performance.
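The split described above can be sketched as follows. The data is synthetic (random features with roughly 2% of rows labelled as fraud) and every variable name is an assumption made for illustration; only the 80:20 ratio and the "train on normal transactions only" rule come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 1,000 rows, 5 features, ~2% fraud.
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.02).astype(int)

# Shuffle once, then hold out the last 20% as the test set.
order = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = order[:cut], order[cut:]

# Training uses normal transactions only (label 0).
normal_train = train_idx[y[train_idx] == 0]
X_train = X[normal_train]

# The test set is left untouched, so it keeps the original class mix.
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape, int(y_test.sum()))
```

Because the autoencoder never sees fraud during training, fraudulent rows in the test set should be harder for it to reconstruct.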
Building Our Model
Model Setup
Next, the autoencoder model is set up: the input layer (one dimension per feature) feeds into 4 fully connected layers with sizes 14, 7, 7, and the input dimension, respectively.
The first 2 layers represent the encoding part and the remaining 2 layers represent the decoding part.
To build a less complex model and address over-fitting and feature selection, we incorporate L1 (Lasso-style) regularization on the activations of the first encoding layer.
Each layer uses the glorot_uniform kernel initializer, with alternating sigmoid and ReLU activation functions.
We picked these hyperparameters because they tend to perform well and are widely used.
X_train = X_train.values
X_test = X_test.values
input_dim = X_train.shape[1]
encoding_dim = 14

from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras import regularizers

input_layer = Input(shape=(X_train.shape[1], ))
encoder1 = Dense(14, activation="sigmoid", kernel_initializer="glorot_uniform",
                 activity_regularizer=regularizers.l1(0.0003))(input_layer)
encoder2 = Dense(7, activation="relu", kernel_initializer="glorot_uniform")(encoder1)
decoder1 = Dense(7, activation='sigmoid', kernel_initializer="glorot_uniform")(encoder2)
decoder2 = Dense(X_train.shape[1], activation='relu', kernel_initializer="glorot_uniform")(decoder1)
autoencoder = Model(inputs=input_layer, outputs=decoder2)
Model Training
The model is trained for 20 epochs with a batch size of 32 samples to allow the model to learn the best weights. The best model weights are defined as the weights that minimize the loss function (reconstruction error).
The best model is saved using the ModelCheckpoint callback, and the training run is logged with the TensorBoard callback.
from keras.callbacks import ModelCheckpoint, TensorBoard

autoencoder.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

checkpoint = ModelCheckpoint(filepath=r"C:\Users\Ramy\Desktop\AI\autoencode.h", verbose=0, save_best_only=True)
tensorboard = TensorBoard(log_dir=r"C:\Users\Ramy\Desktop\AI\logs", histogram_freq=0, write_graph=True, write_images=True)
#early_stop = EarlyStopping(monitor='loss', patience=2, verbose=0, mode='min')

history = autoencoder.fit(X_train, X_train,
                          epochs=20,
                          batch_size=32,
                          shuffle=True,
                          validation_data=(X_test, X_test),
                          verbose=1,
                          callbacks=[checkpoint, tensorboard]).history
Results
To evaluate our model’s learning capabilities, we plot the train & test model losses to verify that the higher the number of epochs, the lower our error rate.
We refer to the MSE as the reconstruction error, and it appears to converge well on both the training and test sets.
Summary Statistics of Reconstruction Error & True Class
We then plot the reconstruction errors for both class types ( Normal and Fraudulent ).
Validation
Recall vs. Precision
High Precision: Low False Positive Rate
High Recall: Low False Negative Rate
The plot shows an extreme trade-off in this case: the model can do well on either precision or recall, but not on both at the same time.
Autoencoder Model’s Optimal Points:
Recall: 20%
Precision: 60%
A granular plot of Precision & Recall curves by threshold
The graph shows that as the reconstruction error threshold increases, the model precision increases as well. However, the opposite is seen for recall metrics.
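The relationship between the threshold and the two metrics can be reproduced on a toy example. The sketch below uses made-up reconstruction errors and labels, and the helper `precision_recall_at` is hypothetical. It is meant to show the direction of the trade-off, not the article's actual curves.

```python
import numpy as np

def precision_recall_at(errors, labels, threshold):
    """Precision and recall when errors above the threshold are flagged as fraud."""
    errors = np.asarray(errors, dtype=float)
    labels = np.asarray(labels, dtype=int)
    flagged = errors > threshold
    tp = np.sum(flagged & (labels == 1))
    fp = np.sum(flagged & (labels == 0))
    fn = np.sum(~flagged & (labels == 1))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# Made-up reconstruction errors and true labels (1 = fraud).
errors = np.array([0.2, 0.5, 0.9, 1.5, 2.5, 4.0, 6.0, 8.0])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])

for t in (1.0, 3.0, 5.0):
    p, r = precision_recall_at(errors, labels, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 1.0 to 5.0 lifts precision from 0.80 to 1.00 while recall drops from 1.00 to 0.50.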
Model Testing & Evaluation
To finally differentiate between fraudulent and normal transactions, we introduce a threshold value. If a transaction's reconstruction error is larger than the defined threshold, that transaction is marked as fraudulent.
Optimal Threshold Value = 3.2
We could also have estimated the threshold value from the test data. However, that would risk overfitting, which could prove detrimental in the long run.
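Applying the threshold amounts to a single comparison per transaction. In the sketch below, the value 3.2 is the threshold reported above, while the helper `flag_fraud` and the sample errors are made up for illustration.

```python
import numpy as np

THRESHOLD = 3.2  # value reported in the article

def flag_fraud(reconstruction_errors, threshold=THRESHOLD):
    """Return 1 (fraud) where the error exceeds the threshold, else 0 (normal)."""
    errs = np.asarray(reconstruction_errors, dtype=float)
    return (errs > threshold).astype(int)

# Made-up reconstruction errors for five transactions.
sample_errors = [0.4, 3.1, 3.2, 3.3, 12.0]
predictions = flag_fraud(sample_errors)
print(predictions)
```

Note that an error exactly equal to the threshold is treated as normal here, since the rule is "larger than".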
Visualization of Division Between Normal & Fraudulent Transactions w/ Respect to Threshold Values
Confusion Matrix
This offers a more comprehensive overview of our model’s precision & recall values. Overall, our autoencoder model is robust, with high True Positives & True Negatives (fraud vs. normal transaction detection rates).
Fraudulent Transactions
Looking at the confusion matrix, we can see that there are 16 + 85 = 101 fraudulent transactions.
85 of them were correctly classified as fraudulent and 16 of them were incorrectly classified as normal transactions.
Normal Transactions
On the other hand, 1,159 normal transactions are incorrectly classified as fraudulent, equivalent to approximately 2% of the total normal cases.
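Recall and precision at this threshold follow directly from the three counts quoted above. The short calculation below uses only those counts, and the variable names are chosen for illustration.

```python
# Counts read from the confusion matrix described above.
tp = 85     # fraudulent transactions correctly flagged
fn = 16     # fraudulent transactions missed
fp = 1159   # normal transactions flagged as fraudulent

recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"recall = {recall:.1%}, precision = {precision:.1%}")
```

At this operating point the model catches roughly 84% of the fraudulent transactions, while only about 7% of the flagged transactions are actually fraudulent.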
Original Objective
In general, it’s much more costly to mistake a fraudulent transaction for a normal one than the reverse.
Solution
To make sure that this objective is satisfied, we try to boost the predictive power of detecting a fraudulent transaction by trading off our ability to accurately predict normal transactions.
Conclusion
Overall, the model is relatively robust because we did catch most of the fraudulent cases. However, this could be further improved if our dataset were a bit more balanced.