```python
def aggregate_spark(data, columns, args):
    """Aggregate a column in a PySpark context.

    This function is required.

    Args:
        data: A dataframe including all of the raw columns.

        columns: A dict with the same structure as the aggregator's input
            columns specifying the names of the dataframe's columns that
            contain the input columns.

        args: A dict with the same structure as the aggregator's input args
            containing the values of the args.

    Returns:
        Any json-serializable object that matches the data type of the aggregator.
    """
    pass
```
For example, an aggregator that uses Spark's `QuantileDiscretizer` to compute the bucket boundaries of a column:

```python
def aggregate_spark(data, columns, args):
    from pyspark.ml.feature import QuantileDiscretizer

    # Fit a discretizer on the input column; the output column is unused,
    # so it is named "_".
    discretizer = QuantileDiscretizer(
        numBuckets=args["num_buckets"], inputCol=columns["col"], outputCol="_"
    ).fit(data)

    # The split points are a list of floats, which is json-serializable.
    return discretizer.getSplits()
```
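The same pattern works for simpler aggregations. As a minimal sketch (the `"col"` input key is an assumption, following the example above), an aggregator that returns a column's mean could look like this:

```python
def aggregate_spark(data, columns, args):
    from pyspark.sql import functions as F

    # Aggregate the input column down to a single row on the cluster.
    row = data.agg(F.mean(columns["col"]).alias("mean")).collect()[0]

    # Cast to a plain float so the returned value is json-serializable.
    return float(row["mean"])
```

Casting to a built-in `float` matters because the return value must be json-serializable, as noted in the docstring above.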
The following packages have been pre-installed and can be used in your implementations:
```text
pyspark==2.4.0
boto3==1.9.78
msgpack==0.6.1
numpy>=1.13.3,<2
requirements-parser==0.2.0
packaging==19.0.0
```
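Since these packages are available without any extra setup, implementations can import them directly. A minimal sketch using numpy (the `"col"` input key is an assumption; note that `collect()` pulls the column to the driver, so this approach only suits data that fits in driver memory):

```python
import numpy as np

def aggregate_spark(data, columns, args):
    # Pull the column's values to the driver and compute the median.
    values = [row[0] for row in data.select(columns["col"]).collect()]
    return float(np.median(values))
```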
You can install additional PyPI packages and import your own Python packages. See Python Packages for more details.
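As a hypothetical example (the exact file location and mechanism depend on your project setup; see Python Packages), pinning an extra PyPI dependency in a `requirements.txt` could look like:

```text
# requirements.txt (hypothetical example)
scikit-learn==0.20.3
```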