Overview
In this post, we will train an autoencoder to detect credit card fraud. We will also demonstrate how to deploy and train Keras models in the cloud using a GPU-enabled environment, CloudML.
Our model is based on a Kaggle dataset, the result of a collaboration between Worldline and Université Libre de Bruxelles (ULB) on big data mining and fraud detection.
The dataset contains credit card transactions made by European cardholders over a two-day period in September 2013. Of the 284,807 transactions recorded, only 492 are fraudulent. The dataset is highly unbalanced: fraudulent transactions account for just 0.172% of all transactions.
Reading the data
After downloading the data from Kaggle, you can read it into R with the read_csv() function:
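A minimal sketch of this step (the file name creditcard.csv is an assumption about where the download was saved):

library(readr)
# read the Kaggle CSV into a data frame
df <- read_csv("creditcard.csv")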
The input variables consist only of numerical values, which are the result of a Principal Component Analysis (PCA) transformation. In order to preserve confidentiality, no further information about the original features was provided. The 28 features V1 through V28 were obtained with PCA. There are, however, two exceptions: the Time and Amount variables.
Time contains the seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount; this feature could be used, for example, for cost-sensitive learning. The Class variable takes the value 1 in cases of fraud and 0 otherwise.
Autoencoders
Since only 0.172% of our observations are frauds, we have a highly unbalanced classification problem. With this kind of imbalance, traditional classification approaches usually struggle because there are so few examples of the minority class.
An autoencoder is a type of neural network that is commonly used to learn an encoding (or representation) of a dataset, often with the goal of dimensionality reduction. To work around the class imbalance, we will train an autoencoder only on the non-fraudulent transactions from our training set. Since fraudulent transactions are expected to follow a different distribution than normal ones, we expect the autoencoder's reconstruction error to be significantly higher for frauds than for legitimate transactions. The reconstruction error then serves as a quantity indicating whether a transaction is legitimate or fraudulent.
To learn more about autoencoders, good starting points are video tutorials on YouTube and the Deep Learning book by Goodfellow et al.
Visualization
For the autoencoder to work well, we rely on a strong assumption: that the distributions of the variables for normal and fraudulent transactions are different. Let's make some plots to verify this. Variables were transformed to the [0,1] interval for plotting.
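A sketch of one way to produce such plots (the dplyr/tidyr/ggplot2 approach is an assumption, not necessarily the original code):

library(dplyr)
library(tidyr)
library(ggplot2)

df %>%
  select(-Time) %>%
  # rescale every feature except Class to the [0,1] interval
  mutate(across(-Class, ~ (.x - min(.x)) / (max(.x) - min(.x)))) %>%
  pivot_longer(-Class, names_to = "variable", values_to = "value") %>%
  ggplot(aes(value, fill = factor(Class))) +
  geom_density(alpha = 0.3) +
  facet_wrap(~ variable, scales = "free_y")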
We can see that the distributions of the variables for fraudulent transactions are very different from those for normal ones, except for the Time variable, whose distribution is strikingly similar in both cases.
Preprocessing
Before modeling, we need to do some preprocessing. We will split the dataset into train and test sets and then min-max normalize the data (this is done because neural networks work much better with small input values). Since the Time variable has essentially the same distribution for normal and fraudulent transactions, we will remove it from the analysis.
We will use the first 200,000 observations as our training set and keep the rest for testing. This is good practice because, when using the model in production, we want to predict future frauds based on transactions that happened before them.
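A sketch of this split, assuming the data frame is still in its original (time-ordered) order:

# drop the Time variable (same distribution for both classes)
df$Time <- NULL

# first 200,000 observations for training, the rest for testing
df_train <- df[1:200000, ]
df_test  <- df[200001:nrow(df), ]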
Now let's work on the normalization of the inputs. We created two helper functions: the first gets descriptive statistics about the dataset that are used for scaling, and the second performs the min-max scaling itself. It's important that the same normalization constants are used for both the training and the test sets.
Now let's normalize our datasets. We also transformed our data frames into matrices, since this is the format expected by Keras.
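A minimal sketch of these two helpers and their use (the function names are illustrative):

# helper 1: descriptive statistics (per-column min and max) from the training set
get_desc <- function(data) {
  feature_cols <- setdiff(names(data), "Class")
  list(min = sapply(data[feature_cols], min),
       max = sapply(data[feature_cols], max))
}

# helper 2: min-max scaling using previously computed statistics
normalize_minmax <- function(data, desc) {
  scaled <- sweep(data[names(desc$min)], 2, desc$min, "-")
  sweep(scaled, 2, desc$max - desc$min, "/")
}

desc <- get_desc(df_train)   # statistics come from the training set only

x_train <- as.matrix(normalize_minmax(df_train, desc))
x_test  <- as.matrix(normalize_minmax(df_test,  desc))
y_train <- df_train$Class
y_test  <- df_test$Class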
We will now define our model in Keras: a symmetric autoencoder with four dense layers.
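A sketch of a definition consistent with the summary below (29 input features, hidden sizes 15 and 10; the tanh activation is an assumption):

library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 15, activation = "tanh", input_shape = ncol(x_train)) %>%
  layer_dense(units = 10, activation = "tanh") %>%
  layer_dense(units = 15, activation = "tanh") %>%
  layer_dense(units = ncol(x_train))

summary(model)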
Layer (type)                 Output Shape              Param #
===============================================================
dense_1 (Dense)              (None, 15)                450
dense_2 (Dense)              (None, 10)                160
dense_3 (Dense)              (None, 15)                165
dense_4 (Dense)              (None, 29)                464
===============================================================
Total params: 1,239
Trainable params: 1,239
Non-trainable params: 0
We will then compile our model, using mean squared error loss and the Adam optimizer for training.
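For example:

model %>% compile(
  loss = "mean_squared_error",
  optimizer = "adam"
)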
Training the model
We can train our model using the fit() function. Training the model is reasonably fast, taking around 14 seconds per epoch on my laptop. We will only train our model on legitimate (non-fraudulent) transactions.
We will use callback_model_checkpoint() in order to save our model after each epoch. By passing the argument save_best_only = TRUE we will keep on disk only the epoch with the smallest validation loss. We will also use callback_early_stopping() to stop training if the validation loss stops decreasing for 5 consecutive epochs.
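A sketch of the training call under these settings (the checkpoint file name, batch size, and number of epochs are assumptions based on the output below):

checkpoint <- callback_model_checkpoint(
  filepath = "model.hdf5",
  save_best_only = TRUE,
  verbose = 1
)
early_stopping <- callback_early_stopping(patience = 5)

# an autoencoder reconstructs its input, so x and y are the same matrix;
# we train and validate only on legitimate (Class == 0) transactions
model %>% fit(
  x = x_train[y_train == 0, ],
  y = x_train[y_train == 0, ],
  epochs = 100,
  batch_size = 32,
  validation_data = list(x_test[y_test == 0, ], x_test[y_test == 0, ]),
  callbacks = list(checkpoint, early_stopping)
)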
Train on 199,615 samples, validate on 84,700 samples.
Epochs 1-100:
  step time: 83us to 94us
  loss: 0.0036 down to 3.2259e-04
  val_loss: 6.8522e-04 down to 4.0852e-04
Model checkpoint saved to 'model.hdf5'.
After training, we can get the final loss on the test set by using the evaluate() function.
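For example (evaluating on the legitimate transactions of the test set; which subset was used here is an assumption):

evaluate(model, x_test[y_test == 0, ], x_test[y_test == 0, ])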
loss
0.0003534254
Tuning with CloudML
We may be able to get better results by tuning our model's hyperparameters. We can tune, for example, the normalization function, the learning rate, the activation functions, and the size of the hidden layers. CloudML uses Bayesian optimization to tune model hyperparameters, as described in this post.
To tune our model, we first need to prepare our project: we create a training flag for each hyperparameter we want to tune and a tuning.yml file that tells CloudML which parameters to tune and how.
The training script can be found at https://github.com/GoogleCloudPlatform/cloud-ml-samples. The main change to the code was the addition of training flags for the hyperparameters.
We then used the FLAGS variable to drive the hyperparameters of the model, for example:
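A sketch of how such flags can be declared with the tfruns flags() API (the default values shown are illustrative):

library(tfruns)

FLAGS <- flags(
  flag_string("normalization", "minmax"),
  flag_string("activation", "relu"),
  flag_numeric("learning_rate", 0.001),
  flag_integer("hidden_size", 15)
)

# the flag values then drive model definition and compilation, e.g.
#   layer_dense(units = FLAGS$hidden_size, activation = FLAGS$activation)
#   optimizer_adam(lr = FLAGS$learning_rate)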
We also created a tuning.yml file describing which hyperparameters we want CloudML to tune during training and which metric we want to minimize (in this case, the validation loss, val_loss).
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  hyperparameters:
    goal: MINIMIZE
    hyperparameterMetricTag: val_loss
    maxTrials: 10
    maxParallelTrials: 5
    params:
      - parameterName: normalization
        type: CATEGORICAL
        categoricalValues: [zscore, minmax]
      - parameterName: activation
        type: CATEGORICAL
        categoricalValues: [relu, selu, tanh, sigmoid]
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.000001
        maxValue: 0.1
        scaleType: UNIT_LOG_SCALE
      - parameterName: hidden_size
        type: INTEGER
        minValue: 5
        maxValue: 50
        scaleType: UNIT_LINEAR_SCALE
We requested a standard_gpu machine for this job, asked CloudML to minimize the validation loss, and allowed at most 10 trials with different combinations of hyperparameters (up to 5 running in parallel). We also defined the search space for each hyperparameter we want to tune.
You can learn more about the tuning.yml file in the CloudML documentation.
We are now ready to send our job to Google CloudML. We can do this by calling cloudml_train().
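A sketch of this call, assuming the training script is saved as train.R (the script name is an assumption):

library(cloudml)
cloudml_train("train.R", config = "tuning.yml")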
The cloudml package takes care of uploading the dataset and installing the R package dependencies required to run the script on CloudML. If you are using RStudio v1.1 or higher, you can also monitor your job in a background terminal. You can also monitor your job from the Google Cloud Console.
After the job has finished, we can collect the job results with job_collect().
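For example, using the cloudml package's job_collect() function:

job_collect()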
This will copy the files from the job with the best val_loss to our local system and create a training run report, similar to the one we get when training locally.
Since we used a callback to save model checkpoints during training, the model file was also copied from Google CloudML. Files created during training are copied to the "runs" subdirectory of the working directory from which cloudml_train() was called. You can determine this directory for the most recent run with latest_run()$run_dir:
[1] runs/cloudml_2018_01_23_221244595-03
You can also list all previous runs and their validation losses with ls_runs():
              run_dir   metric_loss   metric_val_loss
1 2017-12-09 21:01:11        0.2577            0.1482
2 2017-12-09 21:00:11        0.2655            0.1505
3 2017-12-09 19:59:44        0.2597            0.1402
4 2017-12-09 19:56:48        0.2610            0.1459
In our case, the job downloaded from CloudML was saved to runs/cloudml_2018_01_23_221244595-03/, so the saved model file is available at runs/cloudml_2018_01_23_221244595-03/model.hdf5. We can now use our tuned model to make predictions.
Making predictions
Now that we have trained and tuned our model, we are ready to generate predictions with our autoencoder. We are interested in the MSE (mean squared error) for each observation, and we expect observations of fraudulent transactions to have higher MSEs.
First, let's load our model.
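For example (the path corresponds to the run directory shown above):

model <- load_model_hdf5("runs/cloudml_2018_01_23_221244595-03/model.hdf5")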
Now let's calculate the MSE for the training and test set observations.
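A sketch of computing the per-observation reconstruction MSE (this assumes x_train and x_test were normalized in the same way as during training):

pred_train <- predict(model, x_train)
mse_train  <- rowMeans((x_train - pred_train)^2)

pred_test <- predict(model, x_test)
mse_test  <- rowMeans((x_test - pred_test)^2)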
A good measure of model performance on highly unbalanced datasets is the Area Under the ROC Curve (AUC). AUC has a nice interpretation for this problem: it is the probability that a fraudulent transaction has a higher MSE than a normal one. We can calculate this with a package such as Metrics, which implements a wide variety of common machine learning model performance metrics.
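For example, using the auc() function from the Metrics package (the choice of package is an assumption; any AUC implementation works):

library(Metrics)
auc(y_train, mse_train)
auc(y_test, mse_test)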
[1] 0.9546814
[1] 0.9403554
In practice, to use the model for making predictions we need to find a threshold k for the MSE: if the MSE is above k we classify the transaction as fraudulent, otherwise we consider it normal. To choose this value, it is useful to look at precision and recall while varying the threshold.
Picking the best threshold from these curves alone is not easy; an alternative is to quantify the expected financial loss from fraud. Suppose each manual verification of a potential fraud costs $1, but if a fraudulent transaction goes unverified we lose its full amount. Let's find the expected loss for each threshold.
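A sketch of this calculation under the cost model described above (the grid of candidate thresholds is an assumption):

cost_verification <- 1
thresholds <- seq(0, 0.5, length.out = 100)

expected_loss <- sapply(thresholds, function(k) {
  flagged <- mse_test > k                 # transactions we would manually verify
  missed  <- !flagged & y_test == 1       # frauds we fail to catch
  sum(flagged) * cost_verification + sum(df_test$Amount[missed])
})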
We can then find the threshold that minimizes the expected loss:
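For example:

thresholds[which.min(expected_loss)]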
[1] 0.005050505
Manually verifying all frauds would cost us approximately $13,000. Using our model, we can reduce this cost to approximately $2,500.