Sets the threshold on the number of reducer partitions below which sort-based shuffle skips its internal merge sort and instead writes each partition to a separate file. This is similar to the hash-based approach; the difference is that these files are eventually merged into a single file, with an index file marking the offset of each partition. From the reducer's point of view, the data file and index file have exactly the same format whether or not a merge sort was performed internally. This can be seen as sort-based shuffle's compromise toward hash-based shuffle when the shuffle volume is small. Of course, like hash-based shuffle, it also suffers from increased memory usage caused by having too many files open at once, so if GC is severe or memory is tight, consider lowering this value.
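A minimal sketch of tuning this threshold, assuming the property described here is spark.shuffle.sort.bypassMergeThreshold (the property name does not appear in this row):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed property name: spark.shuffle.sort.bypassMergeThreshold.
// Lowering it forces jobs with more reduce partitions through the full
// merge-sort path instead of opening one file per partition, which can
// help when GC is severe or memory is tight.
val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  .set("spark.shuffle.sort.bypassMergeThreshold", "100")
val sc = new SparkContext(conf)
```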
Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used only for subsequent executor requests.
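A sketch combining the two backlog timeouts; the property name for this row is assumed to be spark.dynamicAllocation.sustainedSchedulerBacklogTimeout, and the timeout values are illustrative:

```scala
import org.apache.spark.SparkConf

// The sustained timeout (assumed name) governs every executor request
// after the first one triggered by schedulerBacklogTimeout.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")          // first request
  .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "5s") // subsequent requests
```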
Whether Spark ACLs are enabled. If enabled, this checks to see if the user has access permissions to view or modify the job. Note that this requires the user to be known, so if the user comes across as null, no checks are done. Filters can be used with the UI to authenticate and set the user.
Comma separated list of users/administrators that have view and modify access to all Spark jobs. This can be used if you run on a shared cluster and have a set of administrators or developers who help debug when things do not work. Putting a “*” in the list means any user can have the privilege of admin.
Whether Spark authenticates its internal connections. See spark.authenticate.secret if not running on YARN.
Set the secret key used for Spark to authenticate between components. This needs to be set if not running on YARN and authentication is enabled.
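A sketch of enabling internal authentication when not running on YARN, using the two properties just described; the secret value is a placeholder:

```scala
import org.apache.spark.SparkConf

// Both properties must be set when not running on YARN; every component
// that communicates internally must share the same secret.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "my-shared-secret") // placeholder value
```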
Enable encrypted communication when authentication is enabled. This option is currently only supported by the block transfer service.
Disable unencrypted connections for services that support SASL authentication. This is currently supported by the external shuffle service.
How long the connection waits for an ack to occur before timing out and giving up. To avoid unwanted timeouts caused by long pauses such as GC, you can set a larger value.
How long the connection waits for authentication to occur before timing out and giving up.
Comma separated list of users that have modify access to the Spark job. By default only the user that started the Spark job has access to modify it (kill it for example). Putting a “*” in the list means any user can have access to modify it.
Comma separated list of filter class names to apply to the Spark web UI. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a Java system property of the form: spark.<class name of filter>.params='param1=value1,param2=value2'. For example: -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing'
Comma separated list of users that have view access to the Spark web UI. By default only the user that started the Spark job has view access. Putting a “*” in the list means any user can have view access to this Spark job.
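A combined sketch of the ACL settings described above; the property names (spark.acls.enable, spark.admin.acls, spark.modify.acls, spark.ui.view.acls) are inferred from the descriptions, and the user names are placeholders:

```scala
import org.apache.spark.SparkConf

// Inferred property names; user lists are placeholders.
val conf = new SparkConf()
  .set("spark.acls.enable", "true")
  .set("spark.admin.acls", "alice,bob") // view + modify on all jobs
  .set("spark.modify.acls", "carol")    // may modify (e.g. kill) this job
  .set("spark.ui.view.acls", "dave")    // may view this job's web UI
```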
A password to the key-store.
A protocol name. The protocol must be supported by the JVM. A reference list of protocols can be found in the JVM's security documentation.
A path to a trust-store file. The path can be absolute or relative to the directory in which the component is started.
A password to the trust-store.
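A sketch tying the SSL rows together; the spark.ssl.* property names are inferred from the descriptions, and the paths and passwords are placeholders:

```scala
import org.apache.spark.SparkConf

// Inferred spark.ssl.* property names; values are placeholders.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStorePassword", "changeit")
  .set("spark.ssl.protocol", "TLSv1.2") // must be a protocol the JVM supports
  .set("spark.ssl.trustStore", "conf/truststore.jks") // relative to the component's start directory
  .set("spark.ssl.trustStorePassword", "changeit")
```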
13. Spark Streaming
Enables or disables Spark Streaming’s internal backpressure mechanism (since 1.5). This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives data only as fast as it can process it. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values of spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
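A sketch of enabling backpressure together with the rate caps it respects; the enable flag is assumed to be spark.streaming.backpressure.enabled, and the rate values are illustrative:

```scala
import org.apache.spark.SparkConf

// Backpressure adapts the receiving rate to the observed processing
// speed; the two maxRate settings act as hard upper bounds when set.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")      // assumed property name
  .set("spark.streaming.receiver.maxRate", "10000")         // records/sec per receiver
  .set("spark.streaming.kafka.maxRatePerPartition", "2000") // records/sec per Kafka partition
```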
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. Minimum recommended: 50 ms. See the performance tuning section in the Spark Streaming programming guide for more details.
Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programming guide for more details.
Enable write ahead logs for receivers. All the input data received through receivers will be saved to write ahead logs that will allow it to be recovered after driver failures. See the deployment guide in the Spark Streaming programming guide for more details.
Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from Spark’s memory. The raw input data received by Spark Streaming is also automatically cleared. Setting this to false will allow the raw data and persisted RDDs to be accessible outside the streaming application as they will not be cleared automatically. But it comes at the cost of higher memory usage in Spark.
If true, Spark shuts down the StreamingContext gracefully on JVM shutdown rather than immediately.
Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API. See the Kafka Integration guide for more details.
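A sketch of the direct stream with this rate cap in place, using the 0.8-era spark-streaming-kafka API; the broker address and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("kafka-direct-sketch")
  .set("spark.streaming.kafka.maxRatePerPartition", "2000") // records/sec per partition
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder broker and topic; each batch reads at most
// maxRatePerPartition * batchInterval records from every partition.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.map(_._2).count().print() // record count per batch
ssc.start()
ssc.awaitTermination()
```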
Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the new Kafka direct stream API.
How many batches the Spark Streaming UI and status APIs remember before garbage collecting.
Whether to close the file after writing a write ahead log record on the driver. Set this to ‘true’ when you want to use S3 (or any file system that does not support flushing) for the metadata WAL on the driver.
Whether to close the file after writing a write ahead log record on the receivers. Set this to ‘true’ when you want to use S3 (or any file system that does not support flushing) for the data WAL on the receivers.
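A sketch combining the WAL-related rows for an S3-backed checkpoint directory; the property names are inferred from the descriptions:

```scala
import org.apache.spark.SparkConf

// Inferred property names; closeFileAfterWrite is needed because S3
// does not support flushing an open file.
val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  .set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
  .set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")
```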
Executable for executing R scripts in cluster modes for both driver and workers.
Executable for executing R scripts in client mode for the driver. Ignored in cluster modes.
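A sketch of the two R-executable rows; the property names (spark.r.command for cluster modes, spark.r.driver.command for the client-mode driver) are inferred from the descriptions, and the Rscript paths are placeholders:

```scala
import org.apache.spark.SparkConf

// Inferred property names; Rscript paths are placeholders.
val conf = new SparkConf()
  .set("spark.r.command", "/usr/bin/Rscript")              // driver and workers in cluster modes
  .set("spark.r.driver.command", "/usr/local/bin/Rscript") // driver in client mode
```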