Transformed Columns

Transform data at scale.
Config
- kind: transformed_column
  name: <string>  # transformed column name (required)
  transformer: <string>  # the name of the transformer to use (this or transformer_path must be specified)
  transformer_path: <string>  # a path to an transformer implementation file (this or transformer must be specified)
  input: <input_value>  # the input to the transformer, which may contain references to columns, constants, and aggregates (e.g. @column1) (required)
  compute:
    executors: <int>  # number of spark executors (default: 1)
    driver_cpu: <string>  # CPU request for spark driver (default: 1)
    driver_mem: <string>  # memory request for spark driver (default: 500Mi)
    driver_mem_overhead: <string>  # off-heap (non-JVM) memory allocated to the driver (overrides mem_overhead_factor) (default: min[driver_mem * 0.4, 384Mi])
    executor_cpu: <string>  # CPU request for each spark executor (default: 1)
    executor_mem: <string>  # memory request for each spark executor (default: 500Mi)
    executor_mem_overhead: <string>  # off-heap (non-JVM) memory allocated to each executor (overrides mem_overhead_factor) (default: min[executor_mem * 0.4, 384Mi])
    mem_overhead_factor: <float>  # the proportion of driver_mem/executor_mem which will be additionally allocated for off-heap (non-JVM) memory (default: 0.4)
See Data Types for details about input values. Note: the input of the the transformed column must match the input type of the transformer (if specified).
See transformers.yaml for a list of built-in transformers.
Example
- kind: transformed_column
  name: age_normalized
  transformer: cortex.normalize
  input:
    num: @age  # "age" is the name of a numeric raw column
    mean: @age_mean  # "age_mean" is the name of an aggregate which used the cortex.mean aggregator
    stddev: @age_stddev  # "age_stddev" is the name of an aggregate which used the cortex.stddev aggregator
​
- kind: transformed_column
  name: class_indexed
  transformer: cortex.index_string
  input:
    col: @class  # "class" is the name of a string raw column
    index: {"indexes": ["t", "f"], "reversed_index": ["t": 0, "f": 1]}  # a value to be used as the index
​
- kind: transformed_column
  name: price_bucketized
  transformer: cortex.bucketize
  input:
    num: @price  # "price" is the name of a numeric raw column
    bucket_boundaries: @bucket_boundaries  # "bucket_boundaries" is the name of a [FLOAT] constant
Validating Transformers
Cortex does not run transformers on the full dataset until they are required for model training. However, in order to catch bugs as early as possible, Cortex sanity checks all transformed columns by running their transformers against the first 100 samples in the dataset.
Pipeline Deployments - Previous
Custom Transformers
Next - Pipeline Deployments
Estimators
Last updated -2