Transfer data at scale from data warehouses like S3 into the Cortex environment. Once data is ingested, its lifecycle is fully managed by Cortex.
```yaml
- kind: environment
  name: <string>  # environment name (required)
  limit:
    # specify `num_rows` or `fraction_of_rows` if using `limit`
    num_rows: <int>  # maximum number of rows to select from the dataset
    fraction_of_rows: <float>  # fraction of rows to select from the dataset
    randomize: <bool>  # flag to indicate random selection of data (exact dataset size will not be guaranteed when this flag is true)
    random_seed: <int>  # seed value for randomizing
  log_level:
    tensorflow: <string>  # TensorFlow log level (DEBUG, INFO, WARN, ERROR, or FATAL) (default: DEBUG)
    spark: <string>  # Spark log level (ALL, TRACE, DEBUG, INFO, WARN, ERROR, or FATAL) (default: WARN)
  data:
    <data_config>
```
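For example, a sketch of an environment that randomly samples a fraction of the dataset and quiets Spark logging (the name, fraction, and seed values below are illustrative, not defaults):

```yaml
- kind: environment
  name: dev-sample  # hypothetical environment name
  limit:
    fraction_of_rows: 0.1  # sample roughly 10% of the rows
    randomize: true        # select rows at random (exact dataset size is not guaranteed)
    random_seed: 42        # fix the seed so the sample is reproducible
  log_level:
    spark: ERROR  # reduce Spark logging from the default WARN
  data:
    <data_config>
```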
```yaml
data:
  type: csv  # file type (required)
  path: s3a://<bucket_name>/<file_name>  # S3 is currently supported (required)
  drop_null: <bool>  # drop any rows that contain at least 1 null value (default: false)
  csv_config: <csv_config>  # optional configuration that can be provided
  schema:
    - <string>  # raw column references listed in the CSV columns' order (required)
    ...
```
To help ingest different styles of CSV files, Cortex supports the parameters listed below. All of these parameters are optional. A description and default values for each parameter can be found in the PySpark CSV Documentation.
```yaml
csv_config:
  sep: <string>
  encoding: <string>
  quote: <string>
  escape: <string>
  comment: <string>
  header: <bool>
  ignore_leading_white_space: <bool>
  ignore_trailing_white_space: <bool>
  null_value: <string>
  nan_value: <string>
  positive_inf: <string>
  negative_inf: <string>
  max_columns: <int>
  max_chars_per_column: <int>
  multiline: <bool>
  char_to_escape_quote_escaping: <string>
  empty_value: <string>
```
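For instance, a semicolon-delimited file with a header row and `NA` markers for missing values might use a `csv_config` along these lines (the specific values are illustrative):

```yaml
csv_config:
  sep: ";"          # columns are separated by semicolons rather than commas
  header: true      # treat the first row as column names, not data
  null_value: "NA"  # parse the string "NA" as a null value
  ignore_leading_white_space: true
  ignore_trailing_white_space: true
```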
```yaml
data:
  type: parquet  # file type (required)
  path: s3a://<bucket_name>/<file_name>  # S3 is currently supported (required)
  drop_null: <bool>  # drop any rows that contain at least 1 null value (default: false)
  schema:
    - parquet_column_name: <string>  # name of the column in the parquet file (required)
      raw_column: <string>  # raw column reference (required)
    ...
```
For example, a `dev` environment that ingests CSV data alongside a `prod` environment that ingests parquet data:

```yaml
- kind: environment
  name: dev
  data:
    type: csv
    path: s3a://my-bucket/data.csv
    schema:
      - @column1
      - @column2
      - @column3
      - @label

- kind: environment
  name: prod
  data:
    type: parquet
    path: s3a://my-bucket/data.parquet
    schema:
      - parquet_column_name: column1
        raw_column: @column1
      - parquet_column_name: column2
        raw_column: @column2
      - parquet_column_name: column3
        raw_column: @column3
      - parquet_column_name: column4
        raw_column: @label
```
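Note the difference between the two schemas: the CSV schema lists raw column references in the order the columns appear in the file, while the parquet schema maps each parquet column to a raw column by name.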