Validate raw data at scale and define columns.
- kind: raw_columnname: <string> # raw column name (required)type: INT_COLUMN # data type (required)required: <boolean> # whether null values are allowed (default: false)min: <int> # minimum allowed value (optional)max: <int> # maximum allowed value (optional)values: <[int]> # an exhaustive list of allowed values (optional)compute:executors: <int> # number of spark executors (default: 1)driver_cpu: <string> # CPU request for spark driver (default: 1)driver_mem: <string> # memory request for spark driver (default: 500Mi)driver_mem_overhead: <string> # off-heap (non-JVM) memory allocated to the driver (overrides mem_overhead_factor) (default: min[driver_mem * 0.4, 384Mi])executor_cpu: <string> # CPU request for each spark executor (default: 1)executor_mem: <string> # memory request for each spark executor (default: 500Mi)executor_mem_overhead: <string> # off-heap (non-JVM) memory allocated to each executor (overrides mem_overhead_factor) (default: min[executor_mem * 0.4, 384Mi])mem_overhead_factor: <float> # the proportion of driver_mem/executor_mem which will be additionally allocated for off-heap (non-JVM) memory (default: 0.4)​- kind: raw_columnname: <string> # raw column name (required)type: FLOAT_COLUMN # data type (required)required: <boolean> # whether null values are allowed (default: false)min: <float> # minimum allowed value (optional)max: <float> # maximum allowed value (optional)values: <[float]> # an exhaustive list of allowed values (optional)compute:executors: <int> # number of spark executors (default: 1)driver_cpu: <string> # CPU request for spark driver (default: 1)driver_mem: <string> # memory request for spark driver (default: 500Mi)driver_mem_overhead: <string> # off-heap (non-JVM) memory allocated to the driver (overrides mem_overhead_factor) (default: min[driver_mem * 0.4, 384Mi])executor_cpu: <string> # CPU request for each spark executor (default: 1)executor_mem: <string> # memory request for each spark executor (default: 500Mi)executor_mem_overhead: <string> # off-heap (non-JVM) memory allocated to each executor (overrides mem_overhead_factor) (default: min[executor_mem * 0.4, 384Mi])mem_overhead_factor: <float> # the proportion of driver_mem/executor_mem which will be additionally allocated for off-heap (non-JVM) memory (default: 0.4)​- kind: raw_columnname: <string> # raw column name (required)type: STRING_COLUMN # data type (required)required: <boolean> # whether null values are allowed (default: false)values: <[string]> # an exhaustive list of allowed values (optional)compute:executors: <int> # number of spark executors (default: 1)driver_cpu: <string> # CPU request for spark driver (default: 1)driver_mem: <string> # memory request for spark driver (default: 500Mi)driver_mem_overhead: <string> # off-heap (non-JVM) memory allocated to the driver (overrides mem_overhead_factor) (default: min[driver_mem * 0.4, 384Mi])executor_cpu: <string> # CPU request for each spark executor (default: 1)executor_mem: <string> # memory request for each spark executor (default: 500Mi)executor_mem_overhead: <string> # off-heap (non-JVM) memory allocated to each executor (overrides mem_overhead_factor) (default: min[executor_mem * 0.4, 384Mi])mem_overhead_factor: <float> # the proportion of driver_mem/executor_mem which will be additionally allocated for off-heap (non-JVM) memory (default: 0.4)
- kind: raw_columnname: column1type: INT_COLUMNrequired: truemin: 0max: 10​- kind: raw_columnname: column2type: FLOAT_COLUMNrequired: truemin: 1.1max: 2.2​- kind: raw_columnname: column3type: STRING_COLUMNrequired: falsevalues: [a, b, c]
Cortex integrates with your existing data warehouse and runs all validations every time new data is ingested. All raw columns are cached to speed up additional processing.