Apache Spark 2.2.0 正式发布

Apache Spark 2.2.0是 2.x 中的第三个发行版,原计划3月底发布,距离上个发行版本(2.1.0)的发布已有6个多月的时间,就Spark的常规发版节奏而言,2.2.0的发版可谓是长了不少。

这个版本中Structured Streaming移除了实验性标记,可用于生产环境。该版本共处理了1100多个Issue,侧重于可用性、Bug修复及稳定性改进,建议所有2.x的用户更新到此版本。

Core and Spark SQL

Spark2.2.0版本中SQL的CBO(Cost-Based Optimizer)优化器基本完成。

  • API updates
    • SPARK-19107: Support creating hive table with DataFrameWriter and Catalog
    • SPARK-13721: Add support for LATERAL VIEW OUTER explode()
    • SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables
    • SPARK-16475: Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN, for SQL Queries
    • SPARK-18350: Support session local timezone
    • SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS
    • SPARK-20420: Add events to the external catalog
    • SPARK-18127: Add hooks and extension points to Spark
    • SPARK-20576: Support generic hint function in Dataset/DataFrame
    • SPARK-17203: Data source options should always be case insensitive
    • SPARK-19139: AES-based authentication mechanism for Spark
  • Performance and stability
    • Cost-Based Optimizer
    • SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators
    • SPARK-17080: Cost-based join re-ordering
    • SPARK-17626: TPC-DS performance improvements using star-schema heuristics
    • SPARK-17949: Introduce a JVM object based aggregate operator
    • SPARK-18186: Partial aggregation support of HiveUDAFFunction
    • SPARK-18362 SPARK-19918: File listing/IO improvements for CSV and JSON
    • SPARK-18775: Limit the max number of records written per file
    • SPARK-18761: Uncancellable / unkillable tasks shouldn’t starve jobs of resources
    • SPARK-15352: Topology aware block replication
  • Other notable changes
    • SPARK-18352: Support for parsing multi-line JSON files
    • SPARK-19610: Support for parsing multi-line CSV files
    • SPARK-21079: Analyze Table Command on partitioned tables
    • SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
    • SPARK-18209: More robust view canonicalization without full SQL expansion
    • SPARK-13446: [SPARK-18112] Support reading data from Hive metastore 2.0/2.1
    • SPARK-18191: Port RDD API to use commit protocol
    • SPARK-8425:Add blacklist mechanism for task scheduling
    • SPARK-19464: Remove support for Hadoop 2.5 and earlier
    • SPARK-19493: Remove Java 7 support

Structured Streaming

  • General Availablity
    • SPARK-20844: The Structured Streaming APIs are now GA and is no longer labeled experimental
  • Kafka Improvements
    • SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka
    • SPARK-19968: Cached producer for lower latency kafka to kafka streams.
  • API updates
    • SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
    • SPARK-19876: Support for one time triggers
  • Other notable changes
    • SPARK-20979: Rate source for testing and benchmarks


  • New algorithms in DataFrame-based API
    • SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
    • SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
    • SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
    • SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
    • SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
    • SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
  • Existing algorithms added to Python and R APIs
    • SPARK-18239: Gradient Boosted Trees ®
    • SPARK-18821: Bisecting K-Means ®
    • SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
    • SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
  • Major bug fixes
    • SPARK-19110: DistributedLDAModel.logPrior correctness fix
    • SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
    • SPARK-18715: Fix wrong AIC calculation in Binomial GLM
    • SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
    • SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
    • SPARK-20047: Box-constrained Logistic Regression


SparkR在2.2.0中的主要变化是添加了诸多Spark SQL的特性。

  • Major features
    • SPARK-19654: Structured Streaming API for R
    • SPARK-20159: Support complete Catalog API in R
    • SPARK-19795: column functions tojson, fromjson
    • SPARK-19399: Coalesce on DataFrame and coalesce on column
    • SPARK-20020: Support DataFrame checkpointing
    • SPARK-18285: Multi-column approxQuantile in R


  • Bug fixes
    • SPARK-18847: PageRank gives incorrect results for graphs with sinks
    • SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results ClassCastException
  • Optimizations
    • SPARK-18845: PageRank initial value improvement for faster convergence
    • SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError


  • MLlib
    • SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.
  • SparkR
    • SPARK-20195: deprecate createExternalTable


  • MLlib
    • SPARK-19787: DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API’s default regParam setting.
  • SparkR
    • SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.




Apache Spark 2.2.0 官方发版说明:http://spark.apache.org/releases/spark-release-2-2-0.html


Apache Spark 2.2.0 下载地址:http://spark.apache.org/downloads.html