org.apache.spark.sql

DataFrameStatFunctions

final class DataFrameStatFunctions extends AnyRef

Statistic functions for DataFrames.

Annotations
@Stable()
Source
DataFrameStatFunctions.scala
Since

1.4.0

Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]

    Calculates the approximate quantiles of numerical columns of a DataFrame.

    cols

    the names of the numerical columns

    probabilities

    a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.

    relativeError

    The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

    returns

    the approximate quantiles at the given probabilities of each column

    Since

    2.2.0

    Note

    null and NaN values will be ignored in numerical columns before calculation. For columns only containing null or NaN values, an empty array is returned.

    See also

    approxQuantile(col: String, probabilities: Array[Double], relativeError: Double) for a detailed description.

  5. def approxQuantile(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]

    Calculates the approximate quantiles of a numerical column of a DataFrame.

    The result of this algorithm has the following deterministic bound: If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely,

    floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)

    This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.

    col

    the name of the numerical column

    probabilities

    a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.

    relativeError

    The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

    returns

    the approximate quantiles at the given probabilities

    Since

    2.0.0

    Note

    null and NaN values will be removed from the numerical column before calculation. If the DataFrame is empty or the column only contains null or NaN, an empty array is returned.
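
The deterministic bound above can be checked by hand on a small data set. The following is a minimal sketch in plain Scala, not Spark's implementation. It assumes that rank(x) means the number of elements less than or equal to x; the names QuantileRankBound, rank, and withinBound are hypothetical.

```scala
object QuantileRankBound {
  /** Assumed definition of rank: the number of elements less than or equal to x. */
  def rank(data: Seq[Double], x: Double): Long = data.count(_ <= x).toLong

  /** Checks floor((p - err) * N) <= rank(x) <= ceil((p + err) * N). */
  def withinBound(data: Seq[Double], x: Double, p: Double, err: Double): Boolean = {
    val n = data.size
    val lower = math.floor((p - err) * n).toLong
    val upper = math.ceil((p + err) * n).toLong
    val r = rank(data, x)
    lower <= r && r <= upper
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 100).map(_.toDouble)
    // With p = 0.5 and err = 0.125 the bounds are floor(37.5) = 37 and ceil(62.5) = 63.
    println(withinBound(data, 50.0, 0.5, 0.125)) // rank 50 lies inside [37, 63]
    println(withinBound(data, 64.0, 0.5, 0.125)) // rank 64 lies outside [37, 63]
  }
}
```

Any value that passes this check would be an acceptable answer for the requested probability; a smaller relativeError narrows the range of acceptable ranks.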

  6. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  7. def bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter

    Builds a Bloom filter over a specified column.

    col

    the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    numBits

    expected number of bits of the filter.

    Since

    2.0.0

  8. def bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter

    Builds a Bloom filter over a specified column.

    colName

    name of the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    numBits

    expected number of bits of the filter.

    Since

    2.0.0

  9. def bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter

    Builds a Bloom filter over a specified column.

    col

    the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    fpp

    expected false positive probability of the filter.

    Since

    2.0.0

  10. def bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter

    Builds a Bloom filter over a specified column.

    colName

    name of the column over which the filter is built

    expectedNumItems

    expected number of items which will be put into the filter.

    fpp

    expected false positive probability of the filter.

    Since

    2.0.0
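
The bloomFilter overloads differ only in how the filter is sized: by an explicit numBits or by a target fpp. The sketch below relates the two using the textbook Bloom filter sizing formula m = -n * ln(p) / (ln 2)^2. Whether Spark applies exactly this formula internally is an assumption, and BloomSizing and its methods are hypothetical names.

```scala
object BloomSizing {
  private val Ln2 = math.log(2.0)

  /** Textbook number of bits for n expected items and false positive probability p. */
  def optimalNumBits(expectedNumItems: Long, fpp: Double): Long = {
    require(expectedNumItems > 0, "expectedNumItems must be positive")
    require(fpp > 0.0 && fpp < 1.0, "fpp must be in (0, 1)")
    math.ceil(-expectedNumItems * math.log(fpp) / (Ln2 * Ln2)).toLong
  }

  /** Textbook number of hash functions for the given sizing: (m / n) * ln 2, at least 1. */
  def optimalNumHashes(expectedNumItems: Long, numBits: Long): Int =
    math.max(1, math.round(numBits.toDouble / expectedNumItems * Ln2).toInt)

  def main(args: Array[String]): Unit = {
    val bits = optimalNumBits(1000L, 0.03)
    println(s"bits = $bits, hashes = ${optimalNumHashes(1000L, bits)}")
  }
}
```

Under this formula, lowering fpp for the same expectedNumItems always increases the number of bits, which is the trade-off the fpp overloads expose.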

  11. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  12. def corr(col1: String, col2: String): Double

    Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.

    col1

    the name of the column

    col2

    the name of the column to calculate the correlation against

    returns

    The Pearson Correlation Coefficient as a Double.

    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2")
    res1: Double = 0.613...
    Since

    1.4.0

  13. def corr(col1: String, col2: String, method: String): Double

    Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.

    col1

    the name of the column

    col2

    the name of the column to calculate the correlation against

    returns

    The Pearson Correlation Coefficient as a Double.

    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.corr("rand1", "rand2", "pearson")
    res1: Double = 0.613...
    Since

    1.4.0
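
For reference, the Pearson Correlation Coefficient is the covariance of the two columns divided by the product of their standard deviations. The following plain-Scala sketch computes it on local sequences; it is an illustration of the definition, not Spark's distributed implementation, and PearsonSketch is a hypothetical name.

```scala
object PearsonSketch {
  /** Pearson correlation of two equally sized, non-empty sequences. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.size == ys.size, "inputs must be non-empty and of equal size")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sums of co-deviations and squared deviations; the 1/n factors cancel in the ratio.
    val coDev = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sqDevX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val sqDevY = ys.map(y => (y - meanY) * (y - meanY)).sum
    // A constant column makes the denominator zero and the result NaN.
    coDev / math.sqrt(sqDevX * sqDevY)
  }

  def main(args: Array[String]): Unit = {
    println(pearson(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0)))
  }
}
```

A perfectly linear increasing relationship gives a coefficient of 1, and a perfectly linear decreasing one gives -1.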

  14. def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch

    Builds a Count-min Sketch over a specified column.

    col

    the column over which the sketch is built

    eps

    relative error of the sketch

    confidence

    confidence of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column col

    Since

    2.0.0

  15. def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch

    Builds a Count-min Sketch over a specified column.

    col

    the column over which the sketch is built

    depth

    depth of the sketch

    width

    width of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column col

    Since

    2.0.0

  16. def countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch

    Builds a Count-min Sketch over a specified column.

    colName

    name of the column over which the sketch is built

    eps

    relative error of the sketch

    confidence

    confidence of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column colName

    Since

    2.0.0

  17. def countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch

    Builds a Count-min Sketch over a specified column.

    colName

    name of the column over which the sketch is built

    depth

    depth of the sketch

    width

    width of the sketch

    seed

    random seed

    returns

    a CountMinSketch over column colName

    Since

    2.0.0
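
To make the depth, width, and seed parameters concrete, here is a toy Count-min Sketch in plain Scala. It is a sketch of the general technique under simplifying assumptions (string items only, Scala's MurmurHash3 with one derived seed per row) and is not the CountMinSketch class that Spark returns; TinyCountMinSketch is a hypothetical name.

```scala
import scala.util.Random
import scala.util.hashing.MurmurHash3

/** Toy Count-min Sketch: `depth` rows of `width` counters, one hash function per row. */
final class TinyCountMinSketch(depth: Int, width: Int, seed: Int) {
  require(depth > 0 && width > 0, "depth and width must be positive")

  private val table: Array[Array[Long]] = Array.ofDim[Long](depth, width)

  // One hash seed per row, derived deterministically from the user-supplied seed.
  private val rowSeeds: Array[Int] = {
    val rng = new Random(seed)
    Array.fill(depth)(rng.nextInt())
  }

  private def bucket(item: String, row: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(item, rowSeeds(row)), width)

  /** Increments one counter in every row. */
  def add(item: String): Unit =
    (0 until depth).foreach(row => table(row)(bucket(item, row)) += 1L)

  /** The minimum over all rows; collisions can only inflate counters, never shrink them. */
  def estimateCount(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}

object TinyCountMinSketchDemo {
  def main(args: Array[String]): Unit = {
    val sketch = new TinyCountMinSketch(depth = 4, width = 64, seed = 42)
    Seq("a", "a", "a", "a", "a", "b", "b").foreach(sketch.add)
    println(sketch.estimateCount("a"))
  }
}
```

In this toy version a larger width reduces collisions within a row and a larger depth makes it less likely that every row collides at once, which is the intuition behind the eps and confidence parameters of the other overloads.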

  18. def cov(col1: String, col2: String): Double

    Calculates the sample covariance of two numerical columns of a DataFrame.

    col1

    the name of the first column

    col2

    the name of the second column

    returns

    the covariance of the two columns.

    val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
      .withColumn("rand2", rand(seed=27))
    df.stat.cov("rand1", "rand2")
    res1: Double = 0.065...
    Since

    1.4.0
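
The word "sample" matters here: the sum of co-deviations is divided by N - 1 rather than N. A minimal plain-Scala sketch of that definition follows; it works on local sequences, is not Spark's implementation, and SampleCovarianceSketch is a hypothetical name.

```scala
object SampleCovarianceSketch {
  /** Sample covariance: sum((x - meanX) * (y - meanY)) / (n - 1). */
  def sampleCov(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size, "inputs must have the same size")
    require(xs.size > 1, "at least two observations are needed")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val coDev = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    coDev / (n - 1)
  }

  def main(args: Array[String]): Unit = {
    // The co-deviations sum to 10 over 4 observations, so the result is 10 / 3.
    println(sampleCov(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0)))
  }
}
```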

  19. def crosstab(col1: String, col2: String): DataFrame

    Computes a pair-wise frequency table of the given columns, also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and backticks will be dropped from elements if they exist.

    col1

    The name of the first column. Distinct items will make the first item of each row.

    col2

    The name of the second column. Distinct items will make the column names of the DataFrame.

    returns

    A DataFrame containing the contingency table.

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
      .toDF("key", "value")
    val ct = df.stat.crosstab("key", "value")
    ct.show()
    +---------+---+---+---+
    |key_value|  1|  2|  3|
    +---------+---+---+---+
    |        2|  2|  0|  1|
    |        1|  1|  1|  0|
    |        3|  0|  1|  1|
    +---------+---+---+---+
    Since

    1.4.0
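
The shape of the result can be reproduced on a local collection. The sketch below builds the same pair counts in plain Scala, including the zero entries for pairs that never occur; it only illustrates the definition of a contingency table, and CrosstabSketch is a hypothetical name.

```scala
object CrosstabSketch {
  /** Counts every (row value, column value) combination, with 0 for unobserved pairs. */
  def crosstab[A, B](pairs: Seq[(A, B)]): Map[(A, B), Long] = {
    val observed: Map[(A, B), Long] =
      pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size.toLong }
    val rowValues = pairs.map(_._1).distinct
    val colValues = pairs.map(_._2).distinct
    // Fill in the full grid so that missing combinations show up as zero.
    (for (r <- rowValues; c <- colValues) yield (r, c) -> observed.getOrElse((r, c), 0L)).toMap
  }

  def main(args: Array[String]): Unit = {
    val pairs = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
    val table = crosstab(pairs)
    println(table((2, 1))) // the pair (2, 1) occurs twice in the input
    println(table((1, 3))) // the pair (1, 3) never occurs, so its count is zero
  }
}
```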

  20. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  21. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  22. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  23. def freqItems(cols: Seq[String]): DataFrame

    (Scala-specific) Finds frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou, with a default support of 1%.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

    cols

    the names of the columns to search frequent items in.

    returns

    A Local DataFrame with the Array of frequent items for each column.

    Since

    1.4.0

  24. def freqItems(cols: Seq[String], support: Double): DataFrame

    (Scala-specific) Finds frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

    cols

    the names of the columns to search frequent items in.

    support

    The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.

    returns

    A Local DataFrame with the Array of frequent items for each column.

    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Seq("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    |  [1,-1.0]|
    |   ...    |
    +----------+
    Since

    1.4.0

  25. def freqItems(cols: Array[String]): DataFrame

    Finds frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou, with a default support of 1%.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

    cols

    the names of the columns to search frequent items in.

    returns

    A Local DataFrame with the Array of frequent items for each column.

    Since

    1.4.0

  26. def freqItems(cols: Array[String], support: Double): DataFrame

    Finds frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.

    cols

    the names of the columns to search frequent items in.

    support

    The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.

    returns

    A Local DataFrame with the Array of frequent items for each column.

    val rows = Seq.tabulate(100) { i =>
      if (i % 2 == 0) (1, -1.0) else (i, i * -1.0)
    }
    val df = spark.createDataFrame(rows).toDF("a", "b")
    // find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
    // "a" and "b"
    val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
    freqSingles.show()
    +-----------+-------------+
    |a_freqItems|  b_freqItems|
    +-----------+-------------+
    |    [1, 99]|[-1.0, -99.0]|
    +-----------+-------------+
    // find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
    val pairDf = df.select(struct("a", "b").as("a-b"))
    val freqPairs = pairDf.stat.freqItems(Array("a-b"), 0.1)
    freqPairs.select(explode($"a-b_freqItems").as("freq_ab")).show()
    +----------+
    |   freq_ab|
    +----------+
    |  [1,-1.0]|
    |   ...    |
    +----------+
    Since

    1.4.0
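
The phrase "possibly with false positives" follows from how counter-based frequent-item algorithms work. The plain-Scala sketch below is one simple single-pass variant in the spirit of the algorithm named above: it keeps a bounded set of counters, so every item whose frequency exceeds the support survives, but some infrequent items may survive too. It is an assumption-laden illustration, not Spark's implementation, and FrequentItemsSketch is a hypothetical name.

```scala
import scala.collection.mutable

object FrequentItemsSketch {
  /** Returns a superset of the items whose relative frequency exceeds `support`. */
  def frequentCandidates[T](items: Seq[T], support: Double): Set[T] = {
    require(support > 0.0 && support <= 1.0, "support must be in (0, 1]")
    // With k = ceil(1 / support), k - 1 counters are enough to retain every item
    // that occurs more than items.size / k times.
    val maxCounters = math.max(1, math.ceil(1.0 / support).toInt - 1)
    val counters = mutable.Map.empty[T, Long]
    items.foreach { item =>
      if (counters.contains(item)) {
        counters(item) += 1L
      } else if (counters.size < maxCounters) {
        counters(item) = 1L
      } else {
        // No free counter: decrement every tracked item and drop those that reach zero.
        counters.keys.toList.foreach { key =>
          val remaining = counters(key) - 1L
          if (remaining == 0L) counters.remove(key) else counters(key) = remaining
        }
      }
    }
    counters.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    // Same shape as column "a" in the example above: the value 1 dominates.
    val columnA = Seq.tabulate(100)(i => if (i % 2 == 0) 1 else i)
    println(frequentCandidates(columnA, 0.4))
  }
}
```

The returned set always contains the truly frequent value, and it may also contain a value that merely happened to hold a counter when the scan ended, which is why the results are best treated as candidates.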

  27. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  28. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  29. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  30. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  31. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  32. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  33. def sampleBy[T](col: Column, fractions: java.util.Map[T, java.lang.Double], seed: Long): DataFrame

    (Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.

    T

    stratum type

    col

    column that defines strata

    fractions

    sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

    seed

    random seed

    returns

    a new DataFrame that represents the stratified sample

    Since

    3.0.0

  34. def sampleBy[T](col: Column, fractions: Map[T, Double], seed: Long): DataFrame

    Returns a stratified sample without replacement based on the fraction given on each stratum.

    T

    stratum type

    col

    column that defines strata

    fractions

    sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

    seed

    random seed

    returns

    a new DataFrame that represents the stratified sample

    The stratified sample can be performed over multiple columns:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.struct
    
    val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
      ("Alice", 10))).toDF("name", "age")
    val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
    df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
    +-----+---+
    | name|age|
    +-----+---+
    | Nico|  8|
    |Alice| 10|
    +-----+---+
    Since

    3.0.0

  35. def sampleBy[T](col: String, fractions: java.util.Map[T, java.lang.Double], seed: Long): DataFrame

    Returns a stratified sample without replacement based on the fraction given on each stratum.

    T

    stratum type

    col

    column that defines strata

    fractions

    sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

    seed

    random seed

    returns

    a new DataFrame that represents the stratified sample

    Since

    1.5.0

  36. def sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame

    Returns a stratified sample without replacement based on the fraction given on each stratum.

    T

    stratum type

    col

    column that defines strata

    fractions

    sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

    seed

    random seed

    returns

    a new DataFrame that represents the stratified sample

    val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
      (3, 3))).toDF("key", "value")
    val fractions = Map(1 -> 1.0, 3 -> 0.5)
    df.stat.sampleBy("key", fractions, 36L).show()
    +---+-----+
    |key|value|
    +---+-----+
    |  1|    1|
    |  1|    2|
    |  3|    2|
    +---+-----+
    Since

    1.5.0
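
The semantics of fractions can be illustrated with independent per-row coin flips: each row is kept with the probability assigned to its stratum, and strata missing from the map are treated as fraction zero. The plain-Scala sketch below shows that idea on a local sequence. That Spark samples exactly this way is an assumption, and StratifiedSampleSketch is a hypothetical name.

```scala
import scala.util.Random

object StratifiedSampleSketch {
  /** Keeps each (key, value) row with the probability given for its key; default is 0. */
  def sampleBy[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    require(fractions.values.forall(f => f >= 0.0 && f <= 1.0), "fractions must be in [0, 1]")
    val rng = new Random(seed)
    rows.filter { case (key, _) =>
      // nextDouble() is in [0, 1), so fraction 1.0 keeps every row and 0.0 keeps none.
      rng.nextDouble() < fractions.getOrElse(key, 0.0)
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
    sampleBy(rows, Map(1 -> 1.0, 3 -> 0.5), 36L).foreach(println)
  }
}
```

Because every row is decided independently, the number of rows kept from a stratum with fraction 0.5 varies with the seed; only fractions of 0 and 1 give a fixed count in this sketch.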

  37. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  38. def toString(): String
    Definition Classes
    AnyRef → Any
  39. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  40. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  41. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()

Inherited from AnyRef

Inherited from Any

Ungrouped