Credit Card Fraud Detection is one of the most popular datasets on Kaggle. Its appeal stems from the fact that transaction fraud detection is a practical application that many businesses care about, and it's pretty cool to stop crime with machine learning.
The dataset is relatively easy to work with given that it’s structured, doesn’t have missing values, and is under 1GB in size. Our goal is to build a binary classifier that identifies which transactions are fraudulent and which are genuine. One challenge is dealing with the highly unbalanced distribution of labels: only 492 of the 284,807 transactions in the dataset are fraudulent (0.173%). Less fraud is good for a credit card company, but makes life a little more difficult for machine learning engineers.
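The imbalance is easy to quantify from the counts above (plain Python arithmetic, no Cortex involved):

```python
# Quantifying the class imbalance using the counts quoted above.
total_transactions = 284807
fraudulent = 492
genuine = total_transactions - fraudulent

fraud_rate = fraudulent / total_transactions
print(f"{fraudulent} of {total_transactions} transactions are fraudulent ({fraud_rate:.3%})")
# i.e. roughly 1 fraudulent transaction for every 578 genuine ones
```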
While it would be cool to just build an accurate model, it would be more useful to build a production application that can automatically scale to handle more data, update when new data becomes available, and serve real-time predictions. This usually requires a lot of DevOps work, but we can do it with minimal effort using Cortex, an open source machine learning infrastructure platform. Cortex converts declarative configuration into scalable machine learning pipelines. In this guide, we’ll see how to use Cortex to build and deploy a fraud detection API using Kaggle’s dataset.
Setting up Data Ingestion
Typically, we’d configure Cortex to ingest data from a production data warehouse, but for this example we’ll use a public S3 bucket named cortex-examples. Machine learning applications should not tamper with the data warehouse, so Cortex ingests the data and manages it independently.
The environment configuration below tells Cortex to ingest the CSV data with the defined schema. There’s no need to write custom data wrangling scripts or manage Spark workloads for data processing.
We need to convert the CSV file into columns that we can transform and then use to build our model. For production applications, it’s good to perform type checking and ensure that we don’t have any missing values. The raw columns configuration tells Cortex to validate that class is always an integer, and time, v1-v28, and amount are all floats. Without these checks, we could have data quality issues that are hard to debug and degrade model performance.
- kind: environment
  name: dev
  data:
    type: csv
    path: s3a://cortex-examples/fraud.csv
    csv_config:
      header: true
    schema: [time, v1, v2, v3, ..., amount, class]

- kind: raw_column
  name: time
  type: FLOAT_COLUMN
  required: true

- kind: raw_column
  name: v1
  type: FLOAT_COLUMN
  required: true

- kind: raw_column
  name: v2
  type: FLOAT_COLUMN
  required: true

# v3 - v28 omitted for brevity

- kind: raw_column
  name: amount
  type: FLOAT_COLUMN
  required: true

- kind: raw_column
  name: class
  type: INT_COLUMN
  required: true

Defining Data Transformations
When the data is ingested and validated, we’ll need to prepare it for training. time, v1-v28, and amount are numeric columns with very different ranges, so we should normalize them to prevent some features from being treated as more or less important based on the magnitude of their values.
Normalization requires computing two aggregate values for each data column, namely the mean and standard deviation, as well as transforming each value in the column (by subtracting the mean and dividing by the standard deviation). Cortex has these aggregation and transformation functions built in. The YAML below shows how to configure a Cortex pipeline to compute all the aggregates and transformed columns. Note that the target column class doesn’t need any modification because its values are already 0 or 1.
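The transformation itself is just a z-score. Here is a minimal sketch of the arithmetic cortex.normalize performs (our own illustration, not Cortex's code), using the amount statistics that appear in the processing output later in this guide:

```python
# Z-score normalization: subtract the column mean, divide by its stddev.
def normalize(values, mean, stddev):
    return [(v - mean) / stddev for v in values]

# amount mean/stddev as computed over the full dataset
amount_mean, amount_stddev = 88.35, 250.12
normalized = normalize([149.62, 2.69, 378.66], amount_mean, amount_stddev)
print([round(x, 2) for x in normalized])  # → [0.24, -0.34, 1.16]
```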
- kind: aggregate
  name: time_mean
  aggregator: cortex.mean
  inputs:
    columns:
      col: time

- kind: aggregate
  name: time_stddev
  aggregator: cortex.stddev
  inputs:
    columns:
      col: time

- kind: transformed_column
  name: time_normalized
  transformer: cortex.normalize
  inputs:
    columns:
      num: time
    args:
      mean: time_mean
      stddev: time_stddev

# v1 - v28 omitted for brevity

- kind: aggregate
  name: amount_mean
  aggregator: cortex.mean
  inputs:
    columns:
      col: amount

- kind: aggregate
  name: amount_stddev
  aggregator: cortex.stddev
  inputs:
    columns:
      col: amount

- kind: transformed_column
  name: amount_normalized
  transformer: cortex.normalize
  inputs:
    columns:
      num: amount
    args:
      mean: amount_mean
      stddev: amount_stddev

Processing the Data
Now that the data preparation steps are defined, cortex deploy will launch and orchestrate all required workloads for processing the data at scale. We can deploy our application at any time and Cortex will create the desired state based on the configuration. Subsequent deployments will use cached resources when possible before launching additional workloads. Cortex streams output to our terminal in real time, which we can use to sanity check that our code is working correctly:
Ingesting fraud data from s3a://cortex-examples/fraud.csv
284807 rows ingested
...
v1_mean: -2.237831565309384e-10
v1_stddev: 1.958695804149988
Transforming v1 to v1_normalized
v1: -1.36 1.19 -1.36
v1_normalized: -0.69 0.61 -0.69
...
amount_mean: 88.34961924204623
amount_stddev: 250.12010901734928
Transforming amount to amount_normalized
amount: 149.62 2.69 378.66
amount_normalized: 0.24 -0.34 1.16

Configuring Model Training
We’ll use TensorFlow’s pre-made DNNClassifier to keep our example simple, although Cortex supports any TensorFlow code that implements the tf.estimator API.
import tensorflow as tf

def create_estimator(run_config, model_config):
    feature_columns = [
        tf.feature_column.numeric_column(feature_column["name"])
        for feature_column in model_config["feature_columns"]
    ]

    return tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=model_config["hparams"]["hidden_units"],
        n_classes=2,
        config=run_config,
    )

We configure Cortex to make the normalized columns available to the training workload, automatically split the dataset into 80% for training and 20% for evaluation, and train for 5000 steps.
- kind: model
  name: dnn
  path: dnn.py
  type: classification
  target_column: class
  feature_columns:
    [time_normalized, v1_normalized, v2_normalized, v3_normalized, v4_normalized, v5_normalized, v6_normalized, v7_normalized, v8_normalized, v9_normalized, v10_normalized, v11_normalized, v12_normalized, v13_normalized, v14_normalized, v15_normalized, v16_normalized, v17_normalized, v18_normalized, v19_normalized, v20_normalized, v21_normalized, v22_normalized, v23_normalized, v24_normalized, v25_normalized, v26_normalized, v27_normalized, v28_normalized, amount_normalized]
  hparams:
    hidden_units: [100, 100, 100]
  data_partition_ratio:
    training: 0.8
    evaluation: 0.2
  training:
    num_steps: 5000

Training the Model
Now, cortex deploy will launch and orchestrate the workloads required for training the model at scale:
...
loss = 0.02041925, step = 4501 (0.703 sec)
loss = 0.012982414, step = 4601 (0.612 sec)
loss = 0.004186087, step = 4701 (0.762 sec)
loss = 0.0035106726, step = 4801 (0.755 sec)
loss = 0.0006830689, step = 4901 (0.640 sec)
accuracy = 0.9965
accuracy_baseline = 0.995
auc = 0.92329776
auc_precision_recall = 0.60935247
precision = 0.60714287
recall = 0.85

99.7% accuracy looks good, but precision and recall are too low. This is because the dataset is highly unbalanced. Almost every training sample is a genuine transaction, so the model learns to classify every sample as genuine. Therefore, this model isn’t actually useful for fraud detection.
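To see why accuracy is misleading here, consider a trivial classifier that labels every transaction as genuine (a back-of-the-envelope check using the dataset's overall counts rather than the evaluation split):

```python
# A classifier that always predicts "genuine" is almost always right,
# yet catches zero fraud.
total, fraudulent = 284807, 492
genuine = total - fraudulent

trivial_accuracy = genuine / total  # predict class 0 for every sample
fraud_caught = 0                    # ...so no fraud is ever flagged

print(f"accuracy: {trivial_accuracy:.4f}, fraudulent transactions caught: {fraud_caught}")
```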
Adding a Weight Column
We can address this problem in several ways: upsampling, downsampling, or weighting. Upsampling means duplicating samples from the rare class until the ratio of the classes is closer to 1, but this could get expensive in terms of compute resources and storage if the dataset is large. Alternatively, downsampling eliminates samples from the more common class until the ratio is closer to 1, but this could shrink our training dataset significantly. Weighting scales a training sample’s impact on the loss function based on its class. In this application, it will tell the model to take fraudulent transactions a lot more seriously.
We’ll opt to use weighting and create a new data column containing weights for each sample. The weights for the fraud class will be the ratio of genuine transactions to the full dataset, and the weights for the genuine class will be the ratio of fraudulent transactions to the full dataset. So if 99% of transactions are genuine, a fraudulent sample should have a weight 99 times greater than a genuine sample.
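With the actual class counts, the scheme described above works out as follows (a sketch of the arithmetic only; in the pipeline, the real fractions come from Cortex's class distribution aggregate):

```python
# Each sample is weighted by the share of the *opposite* class:
# fraud samples get the genuine fraction, genuine samples get the fraud fraction.
total, fraudulent = 284807, 492
genuine = total - fraudulent

weight_for_fraud = genuine / total       # ~0.998, applied when class == 1
weight_for_genuine = fraudulent / total  # ~0.002, applied when class == 0

# A fraudulent sample contributes roughly 578x as much to the loss:
print(round(weight_for_fraud / weight_for_genuine))
```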
We can implement this using PySpark. Cortex already has a built-in class_distribution aggregation function, as well as support for custom PySpark code which we’ll use to create the weight column:
def transform_spark(data, columns, args, transformed_column_name):
    import pyspark.sql.functions as F

    distribution = args["class_distribution"]

    return data.withColumn(
        transformed_column_name,
        F.when(data[columns["col"]] == 0, distribution[1]).otherwise(distribution[0]),
    )

Below is the configuration for the class distribution aggregate, the custom PySpark transformer, and the transformed column. Cortex will automatically execute the aggregation and transformation workloads based on this configuration:
- kind: aggregate
  name: class_distribution
  aggregator: cortex.class_distribution_int
  inputs:
    columns:
      col: class

- kind: transformer
  name: weight
  path: weight.py
  inputs:
    columns:
      col: INT_COLUMN
    args:
      class_distribution: {INT: FLOAT}
  output_type: FLOAT_COLUMN

- kind: transformed_column
  name: weight_column
  transformer: weight
  inputs:
    columns:
      col: class
    args:
      class_distribution: class_distribution

Updating the Model
We can use the weight column in our model by making a small modification to the TensorFlow estimator implementation (adding a weight_column argument to the DNNClassifier):
import tensorflow as tf

def create_estimator(run_config, model_config):
    feature_columns = [
        tf.feature_column.numeric_column(feature_column["name"])
        for feature_column in model_config["feature_columns"]
    ]

    return tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=model_config["hparams"]["hidden_units"],
        n_classes=2,
        weight_column="weight_column",
        config=run_config,
    )

And a small update to the model configuration (adding training_columns):
- kind: model
  name: dnn
  path: dnn.py
  type: classification
  target_column: class
  feature_columns:
    [time_normalized, v1_normalized, v2_normalized, v3_normalized, v4_normalized, v5_normalized, v6_normalized, v7_normalized, v8_normalized, v9_normalized, v10_normalized, v11_normalized, v12_normalized, v13_normalized, v14_normalized, v15_normalized, v16_normalized, v17_normalized, v18_normalized, v19_normalized, v20_normalized, v21_normalized, v22_normalized, v23_normalized, v24_normalized, v25_normalized, v26_normalized, v27_normalized, v28_normalized, amount_normalized]
  training_columns: [weight_column]
  hparams:
    hidden_units: [100, 100, 100]
  data_partition_ratio:
    training: 0.8
    evaluation: 0.2
  training:
    num_steps: 5000

We add weight_column to the training_columns list rather than feature_columns because it is not a feature; it will only be used during training, not inference.
Retraining the Model
Once we’ve made these modifications, cortex deploy will create the weight column and retrain the model. The feature data won’t have to be ingested or normalized again because Cortex caches as much as possible.
...
loss = 0.01819101, step = 4501 (0.491 sec)
loss = 0.010051587, step = 4601 (0.461 sec)
loss = 0.008202191, step = 4701 (0.491 sec)
loss = 0.0047955187, step = 4801 (0.716 sec)
loss = 0.002248696, step = 4901 (0.464 sec)
accuracy = 0.9960093
accuracy_baseline = 0.74384576
auc = 0.9978455
auc_precision_recall = 0.99920136
precision = 0.9946642
recall = 1.0

These metrics look a lot better! For comparison, precision was 0.61 before adding the weight column. Now our model is much more useful for detecting fraudulent transactions in production.
Configuring Prediction Serving
We can make the model available as a live web service that can serve real-time predictions using the configuration below:
- kind: api
  name: fraud
  model_name: dnn
  compute:
    replicas: 3

After deploying again, we can test the API with the following prediction request:
$ curl -k \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{ "samples": [ { "amount": 10, "time": 123, "v1": 1.0, ...} ] }' \
  https://abc.amazonaws.com/fraud

{"classification_predictions": [{"class_ids": ["1"]}]}

Cortex will automatically apply the same data transformations that were used during training to the prediction request.
Running this Yourself
Cortex is open source and free to download. The full fraud detection example code can be found here.