Transfer data at scale from data warehouses like S3 into the Cortex environment. Once data is ingested, its lifecycle is fully managed by Cortex.
```yaml
- kind: environment
  name: <string>  # environment name (required)
  limit:
    # specify `num_rows` or `fraction_of_rows` if using `limit`
    num_rows: <int>  # maximum number of rows to select from the dataset
    fraction_of_rows: <float>  # fraction of rows to select from the dataset
    randomize: <bool>  # flag to indicate random selection of data (exact dataset size will not be guaranteed when this flag is true)
    random_seed: <int>  # seed value for randomizing
  log_level:
    tensorflow: <string>  # TensorFlow log level (DEBUG, INFO, WARN, ERROR, or FATAL) (default: DEBUG)
    spark: <string>  # Spark log level (ALL, TRACE, DEBUG, INFO, WARN, ERROR, or FATAL) (default: WARN)
  data:
    <data_config>
```
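For example, a sketch of an environment that randomly samples a fraction of the dataset and quiets Spark logging (the name, fraction, and seed values below are illustrative, not defaults):

```yaml
- kind: environment
  name: dev-sample  # hypothetical environment name
  limit:
    fraction_of_rows: 0.1  # sample roughly 10% of the rows
    randomize: true        # select rows at random (exact dataset size is not guaranteed)
    random_seed: 42        # fix the seed so the sample is reproducible
  log_level:
    spark: ERROR  # reduce Spark logging from the default WARN
  data:
    <data_config>
```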
```yaml
data:
  type: csv  # file type (required)
  path: s3a://<bucket_name>/<file_name>  # S3 is currently supported (required)
  drop_null: <bool>  # drop any rows that contain at least 1 null value (default: false)
  csv_config: <csv_config>  # optional configuration that can be provided
  schema:
    - <string>  # raw column references listed in the CSV columns' order (required)
    ...
```
To help ingest different styles of CSV files, Cortex supports the parameters listed below. All of these parameters are optional. A description and default values for each parameter can be found in the PySpark CSV Documentation.
```yaml
csv_config:
  sep: <string>
  encoding: <string>
  quote: <string>
  escape: <string>
  comment: <string>
  header: <bool>
  ignore_leading_white_space: <bool>
  ignore_trailing_white_space: <bool>
  null_value: <string>
  nan_value: <string>
  positive_inf: <string>
  negative_inf: <string>
  max_columns: <int>
  max_chars_per_column: <int>
  multiline: <bool>
  char_to_escape_quote_escaping: <string>
  empty_value: <string>
```
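For instance, a semicolon-delimited file with a header row and `NA` markers for missing values might use a `csv_config` along these lines (the specific values are illustrative):

```yaml
csv_config:
  sep: ";"          # columns are separated by semicolons rather than commas
  header: true      # treat the first row as column names, not data
  null_value: "NA"  # parse the string "NA" as a null value
  ignore_leading_white_space: true
  ignore_trailing_white_space: true
```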
```yaml
data:
  type: parquet  # file type (required)
  path: s3a://<bucket_name>/<file_name>  # S3 is currently supported (required)
  drop_null: <bool>  # drop any rows that contain at least 1 null value (default: false)
  schema:
    - parquet_column_name: <string>  # name of the column in the parquet file (required)
      raw_column: <string>  # raw column reference (required)
    ...
```
For example, a `dev` environment that ingests CSV data alongside a `prod` environment that ingests parquet data:

```yaml
- kind: environment
  name: dev
  data:
    type: csv
    path: s3a://my-bucket/data.csv
    schema:
      - @column1
      - @column2
      - @column3
      - @label

- kind: environment
  name: prod
  data:
    type: parquet
    path: s3a://my-bucket/data.parquet
    schema:
      - parquet_column_name: column1
        raw_column: @column1
      - parquet_column_name: column2
        raw_column: @column2
      - parquet_column_name: column3
        raw_column: @column3
      - parquet_column_name: column4
        raw_column: @label
```
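Note the difference between the two schemas: the CSV schema lists raw column references in the order the columns appear in the file, while the parquet schema maps each parquet column to a raw column by name.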