All Classes and Interfaces
Class
Description
Class for absolute error loss calculation (for regression).
Base class for launcher implementations.
Indicates that the source accepts the latest seen offset, which requires streaming execution
to provide the latest seen offset when restarting the streaming query from checkpoint.
:: DeveloperApi ::
Information about an AccumulatorV2 modified during a task or stage.
An internal class used to track accumulators by Spark itself.
The base class for accumulators, that can accumulate inputs of type IN, and produce output of type OUT.
Trait for functions and their derivatives for functional layers
Fit a parametric survival regression model named accelerated failure time (AFT) model
(see Accelerated failure time model (Wikipedia)) based on the Weibull distribution of the survival time.
Model produced by AFTSurvivalRegression.
Params for accelerated failure time (AFT) regression.
AggregatedDialect can unify multiple dialects into one virtual Dialect.
Base class of the Aggregate Functions.
Interface for a function that produces a result value by aggregating over multiple input rows.
Aggregation in SQL statement.
:: DeveloperApi ::
A set of functions used to aggregate data.
A base class for user-defined aggregations, which can be used in Dataset operations to take
all of the elements of a group and reduce them to a single value.
Enum to select the algorithm for the decision tree
A message used by ReceiverTracker to ask for the IDs of all receivers still stored in
ReceiverTrackerEndpoint.
Alternating Least Squares (ALS) matrix factorization.
Alternating Least Squares matrix factorization.
Trait for least squares solvers applied to the normal equation.
Rating class for better code readability.
Model fitted by ALS.
Common params for ALS and ALSModel.
Common params for ALS.
A predicate that always evaluates to false.
A filter that always evaluates to false.
A predicate that always evaluates to true.
A filter that always evaluates to true.
Thrown when a query fails to analyze, usually because the query itself is invalid.
A predicate that evaluates to true iff both left and right evaluate to true.
A filter that evaluates to true iff both left and right evaluate to true.
ANOVA Test for continuous data.
An AbstractDataType that matches any concrete data types.
An interface for creating history listeners (to replay event logs) defined in other modules like
SQL, and set up the UI of the plugin to rebuild the history UI.
Implements in-place application of functions in the arrays
An object that computes a function incrementally by merging in results of type U from multiple
tasks.
Computes the area under the curve (AUC) using the trapezoidal rule.
ARPACK routines for MLlib's vectors and matrices.
Implicit methods related to Scala Array.
A column vector backed by Apache Arrow.
This class handles the storage of artifacts as well as preparing the artifacts for use.
Generates association rules from a RDD[FreqItemset[Item]].
An association rule between sets of items.
An asynchronous queue for events.
A set of asynchronous RDD actions available through an implicit conversion.
Abstract class for ML attributes.
Trait for ML attribute factories.
Attributes that describe a vector ML column.
Keys used to store attributes.
An enum-like type for attribute types: AttributeType$.Numeric, AttributeType$.Nominal,
and AttributeType$.Binary.
An aggregate function that returns the mean of all the values in a group.
A mapper class from Spark-supported Avro compression codecs to Avro compression codecs.
Helper class to perform field lookup/matching on Avro schemas.
:: Experimental ::
A TaskContext with extra contextual info and tooling for tasks in a barrier stage.
:: Experimental ::
Carries all task infos of a barrier task.
Base class for resource handlers that use app-specific data.
Trait for MLWriter and MLReader.
Represents a collection of tuples with a known schema.
Base class for streaming API handlers, provides easy access to the streaming listener that
holds the app's information.
A physical representation of a data source scan for batch queries.
:: DeveloperApi ::
Class having information on completed batches.
An interface that defines how to write the data to data source for batch processing.
:: DeveloperApi ::
A sampler based on Bernoulli trials for partitioning a data sequence.
:: DeveloperApi ::
A sampler based on Bernoulli trials.
Binarize a column of continuous features given a threshold.
A binary attribute.
Evaluator for binary classification, which expects input columns rawPrediction, label and
an optional weight column.
Trait for a binary classification evaluation metric computer.
Evaluator for binary classification.
Abstraction for binary classification results for a given model.
Trait for a binary confusion matrix.
Abstraction for binary logistic regression results for a given model.
Binary logistic regression results for a given model.
Abstraction for binary logistic regression training results.
Binary logistic regression training results.
Abstraction for BinaryRandomForestClassification results for a given model.
Binary RandomForestClassification for a given model.
Abstraction for BinaryRandomForestClassification training results.
Binary RandomForestClassification training results.
Class that represents the group and value of a sample.
The data type representing Array[Byte] values.
Utility functions that help us determine bounds on adjusted sampling rate to guarantee exact
sample size with high confidence when sampling without replacement.
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques"
by Steinbach, Karypis, and Kumar, with modification to fit Spark.
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques"
by Steinbach, Karypis, and Kumar, with modification to fit Spark.
Model fitted by BisectingKMeans.
Clustering model produced by BisectingKMeans.
Common params for BisectingKMeans and BisectingKMeansModel
Summary of BisectingKMeans.
BLAS routines for MLlib's vectors and matrices.
BLAS routines for MLlib's vectors and matrices.
Abstracts away how blocks are stored and provides different ways to read the underlying block
data.
Listener object for BlockGenerator events
:: DeveloperApi ::
Identifies a particular Block of data, usually associated with a single file.
:: DeveloperApi ::
This class represents a unique identifier for a BlockManager.
The response message of GetLocationsAndStatus request.
Driver to Executor message to get a heap histogram.
Driver to Executor message to trigger a thread dump.
Represents a distributed matrix in blocks of local matrices.
::DeveloperApi::
BlockReplicationPrioritization provides logic for prioritizing a sequence of peers for
replicating blocks.
:: DeveloperApi ::
Stores information about a block status in a block manager.
A Bloom filter is a space-efficient probabilistic data structure that offers an approximate
containment test with one-sided error: if it claims that an item is contained in it, this
might be in error, but if it claims that an item is not contained in it, then this is
definitely true.
Specialized version of Param[Boolean] for Java.
The data type representing Boolean values.
Configuration options for GradientBoostedTrees.
A Double value with error bars and associated confidence.
Represents a function that is bound to an input type.
In-place DGEMM and DGEMV for Breeze
A broadcast variable.
An interface for all the broadcast implementations in Spark (to allow
multiple broadcast implementations).
This BucketedRandomProjectionLSH implements Locality Sensitive Hashing functions for
Euclidean distance metrics.
Model produced by BucketedRandomProjectionLSH, where multiple random vectors are stored.
Params for BucketedRandomProjectionLSH.
Bucketizer maps a column of continuous features to a column of feature buckets.
Helper class that ensures a ManagedBuffer is released upon InputStream.close() and
also detects stream corruption if streamCompressedOrEncrypted is true
The data type representing Byte values.
Basic interface that all cached batches of data must support.
Provides APIs that handle transformations of SQL data associated with the cache/persist APIs.
The class representing calendar intervals.
The data type representing calendar intervals.
Case-insensitive map of string keys to string values.
Represents a cast expression in the public logical expression API.
Catalog interface for Spark.
An API to extend the Spark built-in session catalog.
A catalog in Spark, as returned by the listCatalogs method defined in Catalog.
A marker interface to provide a catalog implementation for Spark.
Conversion helpers for working with v2 CatalogPlugin.
::Experimental::
An interface for experimenting with a more direct connection to the query planner.
Split which tests a categorical feature.
Extractor Object for pulling out the root cause of an error.
Enumeration to manage state transitions of an RDD through checkpointing
A mutable class loader that gives preference to its own URLs over the parent class loader
when loading classes and resources.
Deprecated. Use UnivariateFeatureSelector instead.
Creates a ChiSquared feature selector.
Model fitted by ChiSqSelector.
Chi Squared selector model.
Conduct the chi-squared test for the input RDDs using the specified method.
param: name String name for the method.
Object containing the test results for the chi-squared hypothesis test.
Chi-square hypothesis testing for categorical data.
Compute Cholesky decomposition.
Model produced by a Classifier.
Represents a classification model that predicts to which of a set of categories an example
belongs.
Abstraction for multiclass classification results for a given model.
Classifier<FeaturesType,E extends Classifier<FeaturesType,E,M>,M extends ClassificationModel<FeaturesType,M>>
Single-label binary or multiclass classification.
(private[spark]) Params for classification.
Listener class used when any item has been cleaned by the Cleaner class.
Classes that represent cleaning tasks.
A WeakReference associated with a CleanupTask.
An interface to represent clocks, so that they can be mocked out in unit tests.
A cleaner that renders closures serializable if this can be done safely.
This class represents a transform for ClusterBySpec.
Helper class for storing model data
A distribution where tuples that share the same values for clustering expressions are co-located
in the same partition.
Evaluator for clustering results.
Metrics for clustering, which expects two input columns: prediction and label.
Summary of clustering algorithms.
Metrics for code generation.
:: DeveloperApi ::
An RDD that cogroups its parents.
A function that returns zero or more output records from each grouping key and its values from 2
Datasets.
An accumulator for collecting a list of elements.
A column in Spark, as returned by listColumns method in Catalog.
A column that will be computed based on the data in a DataFrame.
An interface representing a column of a Table.
Array abstraction in ColumnVector.
This class wraps multiple ColumnVectors as a row-wise table.
This class wraps an array of ColumnVector and provides a row view.
Map abstraction in ColumnVector.
Row abstraction in ColumnVector.
A class representing the default value of a column.
A convenient class used for constructing schema.
Utility transformer for removing temporary columns from a DataFrame.
An interface to represent column statistics, which is part of Statistics.
An interface representing in-memory columnar data in Spark.
Contains basic command line parsing functionality and methods to parse some common Spark CLI
options.
A FutureAction for actions that could trigger multiple Spark jobs.
Represents a ReadLimit where the MicroBatchStream should scan approximately the
given maximum number of rows with at least the given minimum number of rows.
:: DeveloperApi ::
CompressionCodec allows the customization of choosing different compression implementations
to be used in block storage.
A trait to implement Configurable interface.
Connected components algorithm.
An input stream that always returns the same RDD on each time step.
Deprecated since 4.0.0, as its only usage for Python evaluation is now extinct.
For each barrier stage attempt, at most one barrier() call can be active at any time, thus
we can use (stageId, stageAttemptId) to identify the stage attempt where the barrier() call is
from.
A variation on PartitionReader for use with continuous streaming processing.
A variation on PartitionReaderFactory that returns ContinuousPartitionReader instead of PartitionReader.
Split which tests a continuous feature.
A SparkDataStream for streaming queries with continuous mode.
Represents a matrix in coordinate format.
API for correlation functions in MLlib, compatible with DataFrames and Datasets.
Trait for correlation algorithms.
Maintains supported and default correlation names.
Delegates computation to the specific correlation object based on the input method name.
An efficient and parallel implementation of the Silhouette using the cosine distance measure.
An aggregate function that returns the number of the specific row in a group.
A Count-min sketch is a probabilistic data structure used for frequency estimation using
sub-linear space.
An aggregate function that returns the number of rows in a group.
Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Converts a text document to a sparse vector of token counts.
Params for CountVectorizer and CountVectorizerModel.
Trait to restrict calls to create and replace operations.
K-fold cross validation performs model selection by splitting the dataset into a set of
non-overlapping randomly partitioned folds which are used as separate training and test datasets.
For example, with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs,
each of which uses 2/3 of the data for training and 1/3 for testing.
CrossValidatorModel contains the model with the highest average cross-validation
metric across folds and uses this model to transform input data.
Writer for CrossValidatorModel.
Params for CrossValidator and CrossValidatorModel.
A util class for manipulating IO encryption and decryption streams.
Built-in `CustomMetric` that computes average of metric values.
A custom metric.
Built-in `CustomMetric` that sums up metric values.
A custom task metric.
Types of events that can be handled by the DAGScheduler.
A database in Spark, as returned by the listDatabases method defined in Catalog.
Functionality for working with missing data in DataFrames.
Interface used to load a Dataset from external storage systems (e.g.
Statistic functions for DataFrames.
Interface used to write a Dataset to external storage systems (e.g.
Interface used to write a Dataset to external storage using the v2 API.
A Dataset is a strongly typed collection of domain-specific objects that can be transformed
in parallel using functional or relational operations.
A container for a Dataset, used for implicit conversions in Scala.
Data sources should implement this trait so that they can register an alias to their data source.
Functions for registering user-defined data sources.
Interface used to load a streaming Dataset from external storage systems (e.g.
Interface used to write a streaming Dataset to external storage systems (e.g.
The base type of all Spark SQL data types.
Object for grouping error messages from (most) exceptions thrown during query execution.
To get/create specific data type, users should use singleton objects and factory methods
provided by this class.
A collection of methods used to validate data before applying ML algorithms.
A data writer returned by DataWriterFactory.createWriter(int, long) and is
responsible for writing data for an input RDD partition.
A factory of DataWriter returned by BatchWrite.createBatchWriterFactory(PhysicalWriteInfo), which is responsible for
creating and initializing the actual data writer at executor side.
The date type represents a valid date in the proleptic Gregorian calendar.
The type represents day-time intervals of the SQL standard.
A feature transformer that takes the 1D discrete cosine transform of a real vector.
A mutable implementation of BigDecimal that can hold a Long if values are small enough.
An Integral evidence parameter for Decimals.
Common methods for Decimal evidence parameters
A Fractional evidence parameter for Decimals.
The data type representing java.math.BigDecimal values.
A class which implements a decision tree learning algorithm for classification and regression.
Decision tree model (http://en.wikipedia.org/wiki/Decision_tree_learning) for classification.
Decision tree learning algorithm (http://en.wikipedia.org/wiki/Decision_tree_learning)
for classification.
Abstraction for Decision Tree models.
Decision tree model for classification or regression.
Helper classes for tree model persistence
Info for a Node
Info for a Split
Parameters for Decision Tree-based algorithms.
Decision tree (Wikipedia) model for regression.
Decision tree learning algorithm for regression.
Returns DefaultAWSCredentialsProviderChain for authentication.
Helper trait for making simple Params types readable.
Helper trait for making simple Params types writable.
Coalesce the partitions of a parent RDD (prev) into fewer partitions, so that each partition of
this RDD computes one or more of the parent ones.
A TopologyMapper that assumes all nodes are in the same rack
A simple implementation of CatalogExtension, which implements all the catalog functions
by calling the built-in session catalog directly.
An interface that defines how to write a delta of rows during batch processing.
A logical representation of a data source write that handles a delta of rows.
An interface for building a DeltaWrite.
A data writer returned by DeltaWriterFactory.createWriter(int, long) and is
responsible for writing a delta of rows.
A factory for creating DeltaWriters returned by DeltaBatchWrite.createBatchWriterFactory(PhysicalWriteInfo), which is responsible for
creating and initializing writers at the executor side.
Column-major dense matrix.
Column-major dense matrix.
A dense vector represented by a value array.
A dense vector represented by a value array.
:: DeveloperApi ::
Base class for dependencies.
:: DeveloperApi ::
A stream for reading serialized objects.
A holder for storing the deserialized values.
The deterministic level of RDD's output (i.e.
A parent trait for aggregators used in fitting MLlib models.
A Breeze diff function which represents a cost function for differentiable regularization
of parameters.
Distributed model fitted by LDA.
Distributed LDA model.
Represents a distributively stored matrix backed by one or more RDDs.
An interface that defines how data is distributed across partitions.
Helper methods to create distributions to pass into Spark.
An accumulator for computing sum, count, and averages for double precision
floating numbers.
Specialized version of Param[Array[Array[Double]]] for Java.
Specialized version of Param[Array[Double]] for Java.
A function that returns zero or more records of type Double from each input record.
A function that returns Doubles, and can be used to construct DoubleRDDs.
Specialized version of Param[Double] for Java.
Extra functions available on RDDs of Doubles through an implicit conversion.
The data type representing Double values.
:: DeveloperApi ::
Driver component of a SparkPlugin.
A Discretized Stream (DStream), the basic abstraction in Spark Streaming, is a continuous
sequence of RDDs (of the same type) representing a continuous stream of data (see
org.apache.spark.rdd.RDD in the Spark core documentation for more details on RDDs).
Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
A single directed edge consisting of a source id, target id,
and the data associated with the edge.
Criteria for filtering edges based on activeness.
Represents an edge along with its neighboring vertices and allows sending messages along the
edge.
The direction of a directed edge relative to a vertex.
EdgeRDD[ED, VD] extends RDD[Edge[ED]] by storing the edges in columnar format on each
partition for performance.
An edge triplet represents an edge along with the vertex attributes of its neighboring vertices.
Compute eigen-decomposition.
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a
provided "weight" vector.
Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a
provided "weight" vector.
Optimizer for EM algorithm which stores data + parameter graph, plus algorithm parameters.
Placeholder term for the result of undefined interactions, e.g.
Used to convert a JVM object of type T to and from the internal Spark SQL representation.
Methods for creating an Encoder.
Enum to select ensemble combining strategy for base learners
Info for one Node in a tree ensemble
Class for calculating entropy during multiclass classification.
Performs equality comparison, similar to EqualTo.
A filter that evaluates to true iff the column evaluates to a value
equal to value.
A reader to load error information from one or more JSON files.
Information associated with an error class.
Information associated with an error state / SQLSTATE.
Information associated with an error subclass.
Abstract class for estimators that fit models to data.
Abstract class for evaluators that compute metrics from predictions.
:: DeveloperApi ::
Task failed due to a runtime exception.
Manager for QueryExecutionListener.
:: DeveloperApi ::
Stores information about an executor to pass from the scheduler to SparkListeners.
:: DeveloperApi ::
The task failed because the executor that it was running on was lost.
Executor metric types for executor-level metrics stored in ExecutorMetrics.
:: DeveloperApi ::
Executor component of a SparkPlugin.
An Executor resource request.
A set of Executor resource requests.
ExpectationAggregator computes the partial expectation results.
:: Experimental ::
Holder for experimental methods for the bravest.
Class used to provide access to expired timer's expiry time.
Generates i.i.d.
Base class of the public logical expression API.
Helper methods to create logical transforms to pass into Spark.
A trait for a session extension to implement that provides additional explain plan
information.
A cluster manager interface to plug in an external scheduler.
An interface to execute an arbitrary string command inside an external execution engine rather
than Spark.
Represents an extract function, which extracts and returns the value of a
specified datetime field from a datetime or interval value expression.
Params for Factorization Machines
False positive rate.
Feature hashing projects a set of categorical or numerical features into a feature vector of
specified dimension (typically substantially smaller than that of the original feature
space).
Enum to describe whether a feature is "continuous" or "categorical"
:: DeveloperApi ::
Task failed to fetch shuffle data from a remote node.
A simple file based topology mapper.
A filter predicate for data sources.
Base interface for a function used in Dataset's filter function.
Event fired after Estimator.fit.
Event fired before Estimator.fit.
A function that returns zero or more output records from each input record.
A function that takes two inputs and returns zero or more output records.
A function that returns zero or more output records from each grouping key and its values.
::Experimental::
Base interface for a map function used in
org.apache.spark.sql.KeyValueGroupedDataset.flatMapGroupsWithState(
FlatMapGroupsWithStateFunction, org.apache.spark.sql.streaming.OutputMode,
org.apache.spark.sql.Encoder, org.apache.spark.sql.Encoder)
Specialized version of Param[Float] for Java.
The data type representing Float values.
Model produced by FMClassifier
Abstraction for FMClassifier results for a given model.
FMClassifier results for a given model.
Abstraction for FMClassifier training results.
FMClassifier training results.
Factorization Machines learning algorithm for classification.
Params for FMClassifier.
Model produced by FMRegressor.
Factorization Machines learning algorithm for regression.
Params for FMRegressor
Base interface for a function used in Dataset's foreach function.
Base interface for a function used in Dataset's foreachPartition function.
The abstract class for writing custom logic to process data generated by a query.
A parallel FP-growth algorithm to mine frequent itemsets.
A parallel FP-growth algorithm to mine frequent itemsets.
Frequent itemset.
Model fitted by FPGrowth.
Model trained by FPGrowth, which holds frequent itemsets.
Common params for FPGrowth and FPGrowthModel
Base interface for functions whose return types do not create special RDDs.
A user-defined function in Spark, as returned by listFunctions method in Catalog.
Base class for user-defined functions.
A zero-argument function that returns an R.
A two-argument function that takes arguments of type T1 and T2 and returns an R.
A three-argument function that takes arguments of type T1, T2 and T3 and returns an R.
A four-argument function that takes arguments of type T1, T2, T3 and T4 and returns an R.
Catalog methods for working with Functions.
Commonly used functions available for DataFrame operations.
A future for the result of an action to support cancellation.
FValue test for continuous data.
Generates i.i.d.
Gaussian Mixture clustering.
This class performs expectation maximization for multivariate Gaussian
Mixture Models (GMMs).
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points
are drawn from each Gaussian i with probability weights(i).
Multivariate Gaussian Mixture Model (GMM) consisting of k Gaussians, where points
are drawn from each Gaussian i=1..k with probability w(i); mu(i) and sigma(i) are
the respective mean and covariance for each Gaussian distribution i=1..k.
Common params for GaussianMixture and GaussianMixtureModel
Summary of GaussianMixture.
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting)
model for classification.
Gradient-Boosted Trees (GBTs) (http://en.wikipedia.org/wiki/Gradient_boosting)
learning algorithm for classification.
Parameters for Gradient-Boosted Tree algorithms.
Gradient-Boosted Trees (GBTs) model for regression.
Gradient-Boosted Trees (GBTs) learning algorithm for regression.
The general implementation of AggregateFunc, which contains the upper-cased function
name, the `isDistinct` flag and all the inputs.
GeneralizedLinearAlgorithm implements methods to train a Generalized Linear Model (GLM).
GeneralizedLinearModel (GLM) represents a model trained using
GeneralizedLinearAlgorithm.
Fit a Generalized Linear Model (see Generalized linear model (Wikipedia))
specified by giving a symbolic description of the linear
predictor (link function) and a description of the error distribution (family).
Binomial exponential family distribution.
Gamma exponential family distribution.
Gaussian exponential family distribution.
Poisson exponential family distribution.
Params for Generalized Linear Regression.
Model produced by GeneralizedLinearRegression.
Summary of GeneralizedLinearRegression model and predictions.
Summary of GeneralizedLinearRegression fitting and model.
Trait for classes that provide GeneralMLWriter.
A ML Writer which delegates based on the requested format.
The general representation of SQL scalar expressions, which contains the upper-cased
expression name and all the children expressions.
Class for calculating the Gini impurity
(http://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity)
during multiclass classification.
Helper class for import/export of GLM classification models.
Helper methods for import/export of GLM regression models.
Class used to compute the gradient for a loss function, given a single data point.
A class that implements Stochastic Gradient Boosting
for regression and binary classification.
Represents a gradient boosted trees model.
Class used to solve an optimization problem using Gradient Descent.
The Graph abstractly represents a graph with arbitrary objects
associated with vertices and edges.
A collection of graph generating functions.
An implementation of Graph to support computation on graphs.
Provides utilities for loading Graphs from files.
Contains additional functionality for Graph.
A filter that evaluates to true iff the attribute evaluates to a value
greater than value.
A filter that evaluates to true iff the attribute evaluates to a value
greater than or equal to value.
This Spark trait is used for mapping a given userName to a set of groups which it belongs to.
:: Experimental ::
Represents the type of timeouts possible for the Dataset operations
mapGroupsWithState and flatMapGroupsWithState.
::DeveloperApi::
Hadoop delegation token provider.
Utility functions to simplify and speed-up file listing.
:: DeveloperApi ::
An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS,
sources in HBase, or S3), using the older MapReduce API (org.apache.hadoop.mapred).
Trait for shared param aggregationDepth (default: 2).
Trait for shared param blockSize.
Trait for shared param checkpointInterval.
Trait for shared param collectSubModels (default: false).
Trait for shared param distanceMeasure (default: "euclidean").
Trait for shared param elasticNetParam.
Trait for shared param featuresCol (default: "features").
Trait for shared param fitIntercept (default: true).
Trait for shared param handleInvalid.
Maps a sequence of terms to their term frequencies using the hashing trick.
Maps a sequence of terms to their term frequencies using the hashing trick.
A Partitioner that implements hash-based partitioning using
Java's Object.hashCode.
Trait for shared param inputCol.
Trait for shared param inputCols.
Trait for shared param labelCol (default: "label").
Trait for shared param loss.
Trait for shared param maxBlockSizeInMB (default: 0.0).
Trait for shared param maxIter.
Trait for shared param numFeatures (default: 262144).
Trait for shared param outputCol (default: uid + "__output").
Trait for shared param outputCols.
Trait to define a level of parallelism for algorithms that are able to use
multithreaded execution, and provide a thread-pool based execution context.
A mix-in for input partitions whose records are clustered on the same set of partition keys
(provided via SupportsReportPartitioning, see below).
A mix-in for input partitions whose records are clustered on the same set of partition keys
(provided via SupportsReportPartitioning, see below).
Trait for shared param predictionCol (default: "prediction").
Trait for shared param probabilityCol (default: "probability").
Trait for shared param rawPredictionCol (default: "rawPrediction").
Trait for shared param regParam.
Trait for shared param relativeError (default: 0.001).
Trait for shared param seed (default: this.getClass.getName.hashCode.toLong).
Trait for shared param solver.
Trait for shared param standardization (default: true).
Trait for shared param stepSize.
Trait for shared param threshold.
Trait for shared param thresholds.
Trait for shared param tol.
Trait for models that provide a training summary.
Trait for shared param validationIndicatorCol.
Trait for shared param varianceCol.
Trait for shared param weightCol.
Compute gradient and loss for a Hinge loss function, as used in SVM binary classification.
An interface to represent an equi-height histogram, which is a part of ColumnStatistics.
An interface to represent a bin in an equi-height histogram.
Metrics for access to the hive external catalog.
A servlet filter that implements HTTP security features.
Trait for an object with an immutable unique ID that identifies itself and its derivatives.
Identifies an object in a catalog.
Compute the Inverse Document Frequency (IDF) given a collection of documents.
Inverse document frequency (IDF).
Document frequency aggregator.
Model fitted by IDF.
Represents an IDF model that can transform term frequency vectors.
image package implements Spark SQL data source API for loading image data as DataFrame.
Defines the image schema and methods to read and manipulate images.
Factory for Impurity instances.
Trait for calculating information gain.
Imputation estimator for completing missing values, using the mean, median or mode
of the columns in which the missing values are located.
Model fitted by Imputer.
Params for Imputer and ImputerModel.
A filter that evaluates to true iff the attribute evaluates to one of the values in the array.
Represents a row of IndexedRowMatrix.
Represents a row-oriented DistributedMatrix with indexed rows.
A Transformer that maps a column of indices back to a new column of corresponding
string values.
Information gain statistics for each split
param: gain information gain value
param: impurity current node impurity
param: leftImpurity left node impurity
param: rightImpurity right node impurity
param: leftPredict left node predict
param: rightPredict right node predict
In-process launcher for Spark applications.
This is the abstract base class for all input streams.
This holds file names of the current Spark task.
:: DeveloperApi ::
Parses and holds information about inputFormat (and files) specified as a parameter.
A serializable representation of an input partition returned by Batch.planInputPartitions()
and the corresponding ones in streaming.
A BaseRelation that can be used to insert data into it through the insert method.
Specialized version of Param[Array[Int]] for Java.
The data type representing Int values.
A term that may be part of an interaction, e.g.
Implements the feature interaction transform.
A collection of fields and methods concerned with internal accumulators that represent
task level metrics.
A writer for KMeans that handles the "internal" (or default) format
A writer for LinearRegression that handles the "internal" (or default) format
Internal Decision Tree node.
:: DeveloperApi ::
An iterator that wraps around an existing iterator to provide task killing functionality.
Specialized version of Param[Int] for Java.
An extractor object for parsing strings into integers.
A filter that evaluates to true iff the attribute evaluates to a non-null value.
A filter that evaluates to true iff the attribute evaluates to null.
Isotonic regression.
Isotonic regression.
Params for isotonic regression.
Model fitted by IsotonicRegression.
Regression model for isotonic regression.
A Java-friendly interface to DStream, the basic
abstraction in Spark Streaming that represents a continuous stream of data.
A Java-friendly interface to InputDStream.
A Kryo serializer for serializing results returned by asJavaIterable.
DStream representing the stream of data generated by mapWithState operation on a JavaPairDStream.
This helper class is used to place some JVM runtime options (e.g. `--add-opens`)
required by Spark when using Java 17.
A dummy class as a workaround to show the package doc of spark.mllib in generated
Java API docs.
A Java-friendly interface to a DStream of key-value pairs, which provides extra methods
like reduceByKey and join.
A Java-friendly interface to InputDStream of key-value pairs.
A Java-friendly interface to ReceiverInputDStream, the
abstract class for defining any input stream that receives data over the network.
Java-friendly wrapper for Params.
Defines operations common to several Java RDD implementations.
A Java-friendly interface to ReceiverInputDStream, the
abstract class for defining any input stream that receives data over the network.
:: DeveloperApi ::
A Spark serializer that uses Java's built-in serialization.
A Java-friendly version of SparkContext that returns JavaRDDs and works with Java collections instead of Scala ones.
Low-level status reporting APIs for monitoring job and stage progress.
Deprecated.
This is deprecated as of Spark 3.4.0.
Base trait for events related to JavaStreamingListener
::DeveloperApi::
Connection provider which opens connection toward various databases (database specific instance
needed).
:: DeveloperApi ::
Encapsulates everything (extensions, workarounds, quirks) to handle the
SQL dialect of a certain database or jdbc driver.
:: DeveloperApi ::
Registry of dialects that apply to every new jdbc org.apache.spark.sql.DataFrame.
An RDD that executes a SQL query on a JDBC connection and reads results.
The builder to build a single SELECT query.
:: DeveloperApi ::
A database type definition coupled with the jdbc type needed to send null
values to the database.
Utilities for launching a web server using Jetty's HTTP Server class
Event classes for JobGenerator
Interface used to listen for job completion or failure events after submitting a job to the
DAGScheduler.
:: DeveloperApi ::
A result of a job in the DAGScheduler.
Handle via which a "run" function passed to a ComplexFutureAction can submit jobs for execution.
Serializes SparkListener events to/from JSON.
Kernel density estimation.
Represents a partitioning where rows are split across partitions based on the
partition transform expressions returned by KeyGroupedPartitioning.keys.
A Dataset has been logically grouped by a user specified grouping key.
This is a helper class that wraps the methods in KinesisUtils into more Python-friendly class and
function so that it can be easily instantiated and called from Python's KinesisUtils.
K-means clustering with support for k-means|| initialization proposed by Bahmani et al.
K-means clustering with a k-means++ like initialization mode
(the k-means|| algorithm by Bahmani et al).
KMeansAggregator computes the distances and updates the centers for blocks
in sparse or dense matrix in an online fashion.
Generate test data for KMeans.
Model fitted by KMeans.
A clustering model for K-means.
Common params for KMeans and KMeansModel
Summary of KMeans.
A trait that allows a class to give SizeEstimator more accurate size estimation.
Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
continuous distribution.
Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
continuous distribution.
Object containing the test results for the Kolmogorov-Smirnov test.
Interface implemented by clients to register their classes with Kryo when using Kryo
serialization.
A Spark serializer that uses the
Kryo serialization library.
Updater for L1 regularized problems.
Class that represents the features and label of a data point.
Class that represents the features and labels of a data point.
Label Propagation algorithm.
LAPACK routines for MLlib's vectors and matrices.
Regression model trained using Lasso.
Train a regression model with L1-regularization using Stochastic Gradient Descent.
Trait that holds Layer properties, that are needed to instantiate it.
Trait that holds Layer weights (or parameters).
Class used to solve an optimization problem using Limited-memory BFGS.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Model fitted by LDA.
Latent Dirichlet Allocation (LDA) model.
An LDAOptimizer specifies which optimization/learning/inference algorithm to use, and it can
hold optimizer-specific parameters for users to set.
Utility methods for LDA.
Decision tree leaf node.
Compute gradient and loss for a Least-squared loss function, as used in linear regression.
A filter that evaluates to true iff the attribute evaluates to a value less than value.
A filter that evaluates to true iff the attribute evaluates to a value less than or equal to value.
libsvm package implements Spark SQL data source API for loading LIBSVM data as DataFrame.
Generate sample data used for Linear Data.
Linear regression.
Model produced by LinearRegression.
Regression model trained using LinearRegression.
Params for linear regression.
Linear regression results evaluated on a dataset.
Linear regression training results.
Train a linear regression model with no regularization using Stochastic Gradient Descent.
Linear SVM Model trained by LinearSVC
Params for linear SVM Classifier.
Abstraction for LinearSVC results for a given model.
LinearSVC results for a given model.
Abstraction for LinearSVC training results.
LinearSVC training results.
An event bus which posts events to its listeners.
Interface used for arbitrary stateful operations with the v2 API to capture
list value state.
Convenience extractor for any Literal.
Represents a constant literal value in the public expression API.
Tracker for data related to a persisted RDD.
Data about a single partition of a cached RDD.
Trait for classes which can load models and transformers from files.
Event fired after MLReader.load.
Event fired before MLReader.load.
A utility object to run K-means locally.
Local (non-distributed) model fitted by LDA.
Local LDA model.
A special Scan which will happen on Driver locally instead of Executors.
Helper methods for working with the logical expressions API.
This interface contains logical write information that data sources can use when generating a WriteBuilder.
Compute gradient and loss for a multinomial logistic loss function, as used
in multi-class classification (it is also used in binary logistic regression).
Logistic regression.
Generate test data for LogisticRegression.
Model produced by LogisticRegression.
Classification model trained using Multinomial/Binary Logistic Regression.
Params for logistic regression.
Abstraction for logistic regression results for a given model.
Multiclass logistic regression results for a given model.
Abstraction for multiclass logistic regression training results.
Multiclass logistic regression training results.
Train a classification model for Multinomial/Binary Logistic Regression using
Limited-memory BFGS.
Train a classification model for Binary Logistic Regression
using Stochastic Gradient Descent.
Class for log loss calculation (for classification).
Generates i.i.d.
:: DeveloperApi ::
Utils for querying Spark logs with Spark SQL.
An accumulator for computing sum, count, and average of 64-bit integers.
Specialized version of Param[Long] for Java.
The data type representing Long values.
A trait to encapsulate catalog lookup function and helpful extractors.
Extract legacy table identifier from a multi-part identifier.
Extract legacy table identifier from a multi-part identifier.
Extract catalog and identifier from a multi-part name with the current catalog if needed.
Extract catalog and identifier from a multi-part name with the current catalog if needed.
Extract catalog and namespace from a multi-part name with the current catalog if needed.
Extract catalog and namespace from a multi-part name with the current catalog if needed.
Extract non-session catalog and identifier from a multi-part identifier.
Extract non-session catalog and identifier from a multi-part identifier.
Extract session catalog and identifier from a multi-part identifier.
Extract session catalog and identifier from a multi-part identifier.
Trait for adding "pluggable" loss functions for the gradient boosting algorithm.
Trait for loss function
A loss reason that means we don't yet know why the executor exited.
Lower priority implicit methods for converting Scala objects into Datasets.
Params for LSH.
:: DeveloperApi ::
LZ4 implementation of CompressionCodec.
:: DeveloperApi ::
LZF implementation of CompressionCodec.
Base interface for a map function used in Dataset's map function.
Base interface for a map function used in GroupedDataset's mapGroup function.
::Experimental::
Base interface for a map function used in KeyValueGroupedDataset.mapGroupsWithState(MapGroupsWithStateFunction, org.apache.spark.sql.Encoder, org.apache.spark.sql.Encoder)
:: Private ::
Represents the result of writing map outputs for a shuffle map task.
:: Private ::
An opaque metadata tag for registering the result of committing the output of a
shuffle map task.
Base interface for function used in Dataset's mapPartitions.
An AccumulatorV2 counter for collecting a list of (mapper index, row count).
Interface used for arbitrary stateful operations with the v2 API to capture
map value state.
Result returned by a ShuffleMapTask to a scheduler.
The data type for Maps.
DStream representing the stream of data generated by mapWithState operation on a pair DStream.
Factory methods for Matrix.
Factory methods for Matrix.
Trait for a local matrix.
Trait for a local matrix.
Represents an entry in a distributed matrix.
Model representing the result of matrix factorization.
Provides utility functions to be used inside SparkSubmit.
An aggregate function that returns the maximum value in a group.
Rescale each feature individually to range [-1, 1] by dividing through the largest maximum
absolute value in each feature.
Model fitted by MaxAbsScaler.
Params for MaxAbsScaler and MaxAbsScalerModel.
An extractor object for parsing JVM memory strings, such as "10g", into an Int representing
the number of megabytes.
MergeIntoWriter provides methods to define and execute merge actions based
on specified conditions.
Default Meta-Algorithm read and write implementation.
Metadata is a wrapper over Map[String, Any] that limits the value type to simple ones: Boolean,
Long, Double, String, Metadata, Array[Boolean], Array[Long], Array[Double], Array[String], and
Array[Metadata].
Builder for Metadata.
Interface for a metadata column.
Helper utilities for algorithms using ML metadata
Helper class to identify a method.
Generate RDD(s) containing data for Matrix Factorization.
A SparkDataStream for streaming queries with micro-batch mode.
Helper object that creates instance of Duration representing a given number of milliseconds.
An aggregate function that returns the minimum value in a group.
LSH class for Jaccard distance.
Model produced by MinHashLSH, where multiple hash functions are stored.
Rescale each feature individually to a common range [min, max] linearly using column summary
statistics, which is also known as min-max normalization or Rescaling.
Model fitted by MinMaxScaler.
Params for MinMaxScaler and MinMaxScalerModel.
Helper object that creates instance of Duration representing a given number of minutes.
:: DeveloperApi ::
Stores information about a Miscellaneous Process to pass from the scheduler to SparkListeners.
Event emitted by ML operations.
A small trait that defines some methods to send MLEvent.
ML export formats should implement this trait so that users can specify a shortname rather
than the fully qualified class name of the exporter.
Machine learning specific Pair RDD functions.
Trait for objects that provide MLReader.
Abstract class for utility classes that can load ML instances.
Helper methods to load, save and pre-process data used in MLLib.
Trait for classes that provide MLWriter.
Abstract class for utility classes that can save ML instances in Spark's internal format.
Abstract class to be implemented by objects that provide ML exportability.
A fitted model, i.e., a Transformer produced by an Estimator.
Evaluator for multiclass classification, which expects input columns: prediction, label,
weight (optional) and probability (only for logLoss).
Evaluator for multiclass classification.
:: Experimental ::
Evaluator for multi-label classification, which expects two input
columns: prediction and label.
Evaluator for multilabel classification.
Classification model based on the Multilayer Perceptron.
Abstraction for MultilayerPerceptronClassification results for a given model.
MultilayerPerceptronClassification results for a given model.
Abstraction for MultilayerPerceptronClassification training results.
MultilayerPerceptronClassification training results.
Classifier trainer based on the Multilayer Perceptron.
Params for Multilayer Perceptron.
This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
This class provides basic functionality for a Multivariate Gaussian (Normal) Distribution.
MultivariateOnlineSummarizer implements MultivariateStatisticalSummary to compute the mean,
variance, minimum, maximum, counts, and nonzero counts for instances in sparse or dense vector
format in an online fashion.
Trait for multivariate statistical summary of a data matrix.
A Row representing a mutable aggregation buffer.
:: DeveloperApi ::
A tuple of 2 elements.
URL class loader that exposes the `addURL` method in URLClassLoader.
Naive Bayes Classifiers.
Trains a Naive Bayes model given an RDD of (label, features) pairs.
Model produced by NaiveBayes
Model for Naive Bayes Classifiers.
Params for Naive Bayes Classifiers.
Represents a field or column reference in the public logical expression API.
Convenience extractor for any Transform.
NamespaceChange subclasses represent requested changes to a namespace.
A NamespaceChange to remove a namespace property.
A NamespaceChange to set a namespace property.
:: DeveloperApi ::
Base class for dependencies where each partition of the child RDD depends on a small number
of partitions of the parent RDD.
:: DeveloperApi ::
An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS,
sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).
A feature transformer that converts the input array of strings into an array of n-grams.
InputStream implementation which uses direct buffer
to read a file to avoid extra copy of data between Java and
native memory which happens when using BufferedInputStream.
Object used to solve nonnegative least squares problems using a modified
projected gradient method.
Decision tree node interface.
Node in a decision tree.
A nominal attribute.
NOOP dialect object, always returning the neutral element.
Interface for classes that solve the normal equations locally.
Normalize a vector to have unit norm using the given p-norm.
Normalizes samples individually to unit L^p norm
A predicate that evaluates to true iff child is evaluated to false.
A filter that evaluates to true iff child is evaluated to false.
A null order used in sorting expressions.
The data type representing NULL values.
A numeric attribute with optional summary statistics.
A generic, re-usable histogram class that supports partial aggregations.
The Coord class defines a histogram bin, which is just an (x,y) pair.
Simple parser for a numeric structure consisting of three types:
Numeric data types.
Helper class to simplify usage of Dataset.observe(String, Column, Column*):
Helper class to simplify usage of Dataset.observe(String, Column, Column*):
An abstract representation of progress through a MicroBatchStream or ContinuousStream.
A one-hot encoder that maps a column of category indices to a column of binary vectors, with
at most a single one-value per row that indicates the input category index.
Private trait for params and common methods for OneHotEncoder and OneHotEncoderModel
Provides some helper methods used by OneHotEncoder.
param: categorySizes Original number of categories for each feature being encoded.
:: DeveloperApi ::
Represents a one-to-one dependency between partitions of the parent and child RDDs.
Reduction of Multiclass Classification to Binary Classification.
Model produced by OneVsRest.
Params for OneVsRest.
An online optimizer for LDA.
Trait for optimization problem solvers.
Like java.util.Optional in Java 8, scala.Option in Scala, and
com.google.common.base.Optional in Google Guava, this class represents a
value of a given type that may or may not exist.
A predicate that evaluates to true iff at least one of left or right evaluates to true.
A filter that evaluates to true iff at least one of left or right evaluates to true.
A distribution where tuples have been ordered across partitions according
to ordering expressions, but not necessarily within a given partition.
Extra functions available on RDDs of (key, value) pairs where the key is sortable through
an implicit conversion.
OutputMode describes what data will be written to a streaming sink when there is
new data available in a streaming DataFrame/Dataset.
:: DeveloperApi ::
Class having information on output operations.
A paged table that will generate an HTML table for a specified page and also the page navigation.
PageRank algorithm implementation.
Extra functions available on DStream of (key, value) pairs through an implicit conversion.
A function that returns zero or more key-value pair records from each input record.
A function that returns key-value pairs (Tuple2<K, V>), and can be used to
construct PairRDDs.
Extra functions available on RDDs of (key, value) pairs through an implicit conversion.
Form an RDD[(Int, Array[Byte])] from key-value pairs returned from R.
A param with self-contained documentation and optionally default value.
Builder for a param grid used in grid search-based model selection.
A param to value map.
A param and its value.
Trait for components that take parameters.
Factory methods for common validation functions for Param.isValid.
A class loader which makes some protected methods in ClassLoader accessible.
An identifier for a partition in an RDD.
::DeveloperApi::
A PartitionCoalescer defines how to coalesce the partitions of a given RDD.
An object that defines how the elements in a key-value pair RDD are partitioned by key.
An evaluator for computing RDD partitions.
A factory to create PartitionEvaluator.
::DeveloperApi::
A group of Partitions
param: prefLoc preferred location for the partition group
An interface to represent the output data partitioning for a data source, which is returned by
SupportsReportPartitioning.outputPartitioning().
Used for per-partition offsets in continuous processing.
:: DeveloperApi ::
An RDD used to prune RDD partitions so we can avoid launching tasks on
all partitions.
A partition reader returned by PartitionReaderFactory.createReader(InputPartition)
or PartitionReaderFactory.createColumnarReader(InputPartition).
A factory used to create PartitionReader instances.
Represents the way edges are assigned to edge partitions based on their source and destination
vertex IDs.
Assigns edges to partitions by hashing the source and destination vertex IDs in a canonical
direction, resulting in a random vertex cut that colocates all edges between two vertices,
regardless of direction.
Assigns edges to partitions using only the source vertex ID, colocating edges with the same
source.
Assigns edges to partitions using a 2D partitioning of the sparse edge adjacency matrix,
guaranteeing a 2 * sqrt(numParts) bound on vertex replication.
Assigns edges to partitions by hashing the source and destination vertex IDs, resulting in a
random vertex cut that colocates all same-direction edges between two vertices.
PCA trains a model to project vectors to a lower dimensional space of the top
PCA.k principal components.
A feature transformer that projects vectors to a low-dimensional space using PCA.
Model fitted by PCA.
Model fitted by PCA that can project vectors to a low-dimensional space using PCA.
Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix
for an RDD of the type RDD[Vector].
This interface contains physical write information that data sources can use when
generating a DataWriterFactory or a StreamingDataWriterFactory.
A simple pipeline, which acts as an estimator.
Represents a fitted pipeline.
A stage in a pipeline, either an Estimator or a Transformer.
:: DeveloperApi ::
Context information and operations for plugins loaded by Spark.
Export model to the PMML format
Predictive Model Markup Language (PMML) is an XML-based file format
developed by the Data Mining Group (www.dmg.org).
A writer for KMeans that handles the "pmml" format
A writer for LinearRegression that handles the "pmml" format
Utility functions that help us determine bounds on adjusted sampling rate to guarantee exact
sample sizes with high confidence when sampling with replacement.
Generates i.i.d.
:: DeveloperApi ::
A sampler for sampling with replacement, based on values drawn from Poisson distribution.
Perform feature expansion in a polynomial space.
A class that allows DataStreams to be serialized and moved around by not creating them
until they need to be read
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
Lin and Cohen.
Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
Lin and Cohen.
Cluster assignment.
Model produced by PowerIterationClustering.
Common params for PowerIterationClustering
Precision.
The general representation of predicate expressions, which contains the upper-cased expression
name and all the children expressions.
Predicted value for a node
param: predict predicted value
param: prob probability of the label (classification only)
Abstraction for a model for prediction tasks (regression and classification).
Predictor<FeaturesType,Learner extends Predictor<FeaturesType,Learner,M>,M extends PredictionModel<FeaturesType,M>>
Abstraction for prediction problems (regression and classification).
(private[ml]) Trait for parameters for prediction (regression and classification).
A parallel PrefixSpan algorithm to mine frequent sequential patterns.
A parallel PrefixSpan algorithm to mine frequent sequential patterns.
Represents a frequent sequence.
Model fitted by PrefixSpan
param: freqSequences frequent sequences
Implements a Pregel-like bulk-synchronous message-passing API.
ProbabilisticClassificationModel<FeaturesType,M extends ProbabilisticClassificationModel<FeaturesType,M>>
Model produced by a ProbabilisticClassifier.
ProbabilisticClassifier<FeaturesType,E extends ProbabilisticClassifier<FeaturesType,E,M>,M extends ProbabilisticClassificationModel<FeaturesType,M>>
Single-label binary or multiclass classifier which can output class conditional probabilities.
(private[classification]) Params for probabilistic classification.
:: DeveloperApi ::
ProtobufSerDe used to represent the API for serialize and deserialize of
Protobuf data related to UI.
A Jetty handler to handle redirects to a proxy server.
A BaseRelation that can eliminate unneeded columns and filter using selected
predicates before producing an RDD containing all matching tuples as Row objects.
A BaseRelation that can eliminate unneeded columns before producing an RDD
containing all of its tuples as Row objects.
:: DeveloperApi ::
A class with pseudorandom behavior.
Helper class for ShuffleBlockFetcherIterator that encapsulates all the push-based
functionality to fetch push-merged block meta and shuffle chunks.
Py4J allows a pure interface so this proxy is required.
Represents QR factors.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned
categorical features.
Params for QuantileDiscretizer.
Enum for selecting the quantile calculation strategy
Object for grouping error messages from exceptions thrown during query compilation.
Query context of a SparkThrowable.
The type of QueryContext.
The trait exposes util methods for preparing error messages such as quoting of error elements.
Object for grouping error messages from (most) exceptions thrown during query execution.
The interface of query execution listener that can be used to analyze execution metrics.
Represents the query info provided to the stateful processor used in the
arbitrary state API v2 to easily identify task retries on the same partition.
Object for grouping all error messages of the query parsing.
Trait for random data generators that generate i.i.d.
ALGORITHM
A class that implements a Random Forest
learning algorithm for classification and regression.
Random Forest model for classification.
Abstraction for multiclass RandomForestClassification results for a given model.
Multiclass RandomForestClassification results for a given model.
Abstraction for multiclass RandomForestClassification training results.
Multiclass RandomForestClassification training results.
Random Forest learning algorithm for
classification.
Represents a random forest model.
Parameters for Random Forest algorithms.
Random Forest model for regression.
Random Forest
learning algorithm for regression.
Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
:: DeveloperApi ::
A pseudorandom sampler.
:: DeveloperApi ::
Represents a one-to-one dependency between ranges of partitions in the parent and child RDDs.
A Partitioner that partitions sortable records by range into roughly
equal ranges.
:: Experimental ::
Evaluator for ranking, which expects two input columns: prediction and label.
Evaluator for ranking algorithms.
A component that estimates the rate at which an InputDStream should ingest
records, based on updates at every batch completion.
A more compact class to represent a rating than Tuple3[Int, Int, Double].
A helper program that sends blocks of Kryo-serialized text strings out on a socket at a
specified rate.
Authentication handler for connections from the R process.
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
:: Experimental ::
Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together.
Machine learning specific RDD functions.
A custom sequence of partitions based on a mutable linked list.
InputStream implementation which asynchronously reads ahead from the underlying input
stream when specified amount of data has been read from the current buffer.
Represents a ReadLimit where the MicroBatchStream must scan all the data
available at the streaming source.
Interface representing limits on how much to read from a MicroBatchStream when it
implements SupportsAdmissionControl.
Represents a ReadLimit where the MicroBatchStream should scan files whose total
size doesn't go beyond a given maximum total size.
Represents a ReadLimit where the MicroBatchStream should scan approximately the
given maximum number of files.
Represents a ReadLimit where the MicroBatchStream should scan approximately the
given maximum number of rows.
Represents a ReadLimit where the MicroBatchStream should scan approximately
at least the given minimum number of rows.
Recall.
Trait representing a received block
Trait that represents a class that handles the storage of blocks received by receiver
Trait that represents the metadata related to storage of blocks
Trait representing any event in the ReceivedBlockTracker that updates its state.
:: DeveloperApi ::
Abstract class of a receiver that can be run on worker nodes to receive external data.
:: DeveloperApi ::
Class having information about a receiver
Abstract class for defining any InputDStream
that has to start a receiver on worker nodes to receive external data.
Messages sent to the Receiver.
Enumeration to identify current state of a Receiver
Messages used by the driver and ReceiverTrackerEndpoint to communicate locally.
Messages used by the NetworkReceiver and the ReceiverTracker to communicate
with each other.
Base interface for function used in Dataset's reduce.
A 'reducer' for output of user-defined functions.
Base class for user-defined functions that can be 'reduced' on another function.
Convenience extractor for any NamedReference.
A regex based tokenizer that extracts tokens either by using the provided regex pattern to split
the text (default) or repeatedly matching the regex (if gaps is false).
Evaluator for regression, which expects input columns prediction, label and
an optional weight column.
Evaluator for regression.
Model produced by a Regressor.
Regressor<FeaturesType,Learner extends Regressor<FeaturesType,Learner,M>,M extends RegressionModel<FeaturesType,M>>
Single-label regression
To indicate it's the CUBE
To indicate it's the GroupBy
The Grouping Type
To indicate it's the ROLLUP
Implemented by objects that produce relations for a specific kind of data source.
A mix-in interface for streaming sinks to signal that they can report
metrics.
A mix-in interface for SparkDataStream streaming sources to signal that they can report
metrics.
A write that requires a specific distribution and ordering of data.
Trait used to help executor/worker allocate resources.
:: DeveloperApi ::
A plugin that can be dynamically loaded into a Spark application to control how custom
resources are discovered.
The default plugin that is loaded into a Spark application to control how custom
resources are discovered.
Resource identifier.
Class to hold information about a type of Resource.
A case class to simplify JSON serialization of ResourceInformation.
Resource profile to associate with an RDD.
Resource profile builder to build a ResourceProfile to associate with an RDD.
Class that represents a resource request.
:: DeveloperApi ::
A org.apache.spark.scheduler.ShuffleMapTask that completed successfully earlier, but we
lost the executor before the stage completed.
Allows Spark to rewrite the given references of the transform during analysis.
Implements the transforms required for fitting a dataset against an R model formula.
Base trait for RFormula and RFormulaModel.
Model fitted by RFormula.
Limited implementation of R formula parsing.
Regression model trained using RidgeRegression.
Train a regression model with L2-regularization using Stochastic Gradient Descent.
Scale features using statistics that are robust to outliers.
Model fitted by RobustScaler.
Params for RobustScaler and RobustScalerModel.
Defines the policy based on which RollingFileAppender will
generate rolling files.
Represents one row of output from a relational operator.
A factory class used to construct Row objects.
A logical representation of a data source DELETE, UPDATE, or MERGE operation that requires
rewriting data.
A row-level SQL command.
An interface for building a RowLevelOperation.
An interface with logical information for a row-level operation such as DELETE, UPDATE, MERGE.
Represents a row-oriented distributed Matrix with no meaningful row indices.
An RDD that stores serialized R objects as Array[Byte].
Runtime configuration interface for Spark.
This is the Scala stub of SparkR read.ml.
Filter that allows loading a fraction of HDFS files.
Trait for models and transformers which may be saved as files.
Event fired after MLWriter.save.
Event fired before MLWriter.save.
SaveMode is used to specify the expected behavior of saving a DataFrame to a data source.
Interface for a function that produces a result value for each input row.
A logical representation of a data source scan.
This enum defines how the columnar support for the partitions of the data source
should be determined.
An interface for building the Scan.
An interface for schedulable entities.
An interface to build Schedulable tree
buildPools: build the tree nodes(pools)
addTaskSetManager: build the leaf nodes(TaskSetManagers)
A backend interface for scheduling systems that allows plugging in different ones under
TaskSchedulerImpl.
An interface for sort algorithm
FIFO: FIFO algorithm between TaskSetManagers
FS: FS algorithm between Pools, and FIFO or FS within Pools
"FAIR" and "FIFO" determines which policy is used
to order tasks amongst a Schedulable's sub-queues
"NONE" is used when a Schedulable has no sub-queues.
This object contains methods that are used to convert Spark SQL schemas to Avro schemas and vice
versa.
Internal wrapper for SQL data type and nullability.
Implemented by objects that produce relations for a specific kind of data source
with a given schema.
Utils for handling schemas.
Utils for handling schemas.
Helper object that creates instance of Duration representing
a given number of seconds.
There are cases when global JVM security configuration must be modified.
Various utility methods used by Spark Security.
Params for Selector and SelectorModel.
Extra functions available on RDDs of (key, value) pairs to create a Hadoop SequenceFile,
through an implicit conversion.
Utility functions to serialize, deserialize objects to / from R
Hadoop configuration but serializable.
SerializableWritable<T extends org.apache.hadoop.io.Writable>
An implicit class that allows us to call private methods of ObjectStreamClass.
:: DeveloperApi ::
A stream for writing serialized objects.
A holder for storing the serialized values.
:: DeveloperApi ::
A serializer.
:: DeveloperApi ::
An instance of a serializer, for use by one thread at a time.
A mix-in interface for TableProvider.
Code generator for shared params (sharedParams.scala).
Computes shortest paths to the given set of landmark vertices, returning a graph where each
vertex attribute is a map containing the shortest-path distance to each reachable landmark.
The data type representing Short values.
:: Private ::
An interface for plugging in modules for storing and reading temporary shuffle data.
:: DeveloperApi ::
Represents a dependency on the output of a shuffle stage.
:: DeveloperApi ::
The resulting RDD from a shuffle (e.g.
:: Private ::
An interface for building shuffle support modules for the Driver.
:: Private ::
An interface for building shuffle support for Executors.
A listener to be called at the completion of the ShuffleBlockFetcherIterator
param: data the ShuffleBlockFetcherIterator to process
:: Private ::
A top-level writer that returns child writers for persisting the output of a map task,
and then commits all of the writes as one atomic operation.
A common trait between MapStatus and MergeStatus.
:: Private ::
An interface for opening streams to persist partition bytes to a backing data store.
Helper class used by the MapOutputTrackerMaster to perform bookkeeping for a single
ShuffleMapStage.
Various utility methods used by Spark.
Contains utilities for working with posix signals.
A FutureAction holding the result of an action that triggers a single job.
A CachedBatch that stores some simple metrics that can be used for filtering of batches with
the SimpleMetricsCachedBatchSerializer.
Provides basic filtering for CachedBatchSerializer implementations.
A simple updater for gradient descent *without* any regularization.
Optional extension for partition writing that is optimized for transferring a single
file to the backing store.
Represents singular value decomposition (SVD) factors.
Information about progress made for a sink in the execution of a StreamingQuery
during a trigger.
:: DeveloperApi ::
Estimates the sizes of Java objects (number of bytes of memory they occupy), for use in
memory-aware caches.
:: DeveloperApi ::
Snappy implementation of CompressionCodec.
A sort direction used in sorting expressions.
Represents a sort order in the public expression API.
Information about progress made for a source in the execution of a StreamingQuery
during a trigger.
A handle to a running Spark application.
Listener for updates to a handle's state.
Represents the application's state.
Serializable interface providing a method executors can call to obtain an
AWSCredentialsProvider instance for authenticating to AWS services.
Builder for SparkAWSCredentials instances.
Configuration for a Spark application.
Main entry point for Spark functionality.
Object for grouping error messages from (most) exceptions thrown during query execution.
The base interface representing a readable data stream in a Spark streaming query.
:: DeveloperApi ::
Holds all the runtime environment objects for a running Spark instance (either master or worker),
including the serializer, RpcEnv, block manager, map output tracker, etc.
Exposes information about Spark Executors.
Resolves paths to files added through SparkContext.addFile().
TODO (PARQUET-1809): This is a temporary workaround; it is intended to be moved to Parquet.
Class that allows users to receive all SparkListener events.
Exposes information about Spark Jobs.
Launcher for Spark applications.
:: DeveloperApi ::
A default implementation for SparkListenerInterface that has no-op implementations for
all callbacks.
A SparkListenerEvent bus that relays SparkListenerEvents to its listeners
Deprecated.
use SparkListenerExecutorExcluded instead.
Deprecated.
use SparkListenerExecutorExcludedForStage instead.
Periodic updates from executors.
Deprecated.
use SparkListenerExecutorUnexcluded instead.
Interface for listening to events from the Spark scheduler.
An internal class that describes the metadata of an event log.
Deprecated.
use SparkListenerNodeExcluded instead.
Deprecated.
use SparkListenerNodeExcludedForStage instead.
Deprecated.
use SparkListenerNodeUnexcluded instead.
Peak metric values for the executor for the stage, written to the history log at stage
completion.
A collection of regexes for extracting information from the master string.
A canonical representation of a file path.
:: DeveloperApi ::
A plugin that can be dynamically loaded into a Spark application.
Utils for handling schemas.
The entry point to programming Spark with the Dataset and DataFrame API.
Builder for SparkSession.
:: Experimental ::
Holder for injection points to the SparkSession.
:: Unstable ::
Exposes information about Spark Stages.
Low-level status reporting APIs for monitoring job and stage progress.
Interface mixed into Throwables thrown from Spark.
Companion object used by instances of SparkThrowable to access error class information and
construct error messages.
Column-major sparse matrix.
Column-major sparse matrix.
A sparse vector represented by an index array and a value array.
A sparse vector represented by an index array and a value array.
Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
for an RDD of the type RDD[Vector].
A SparkListener that detects whether spills have occurred in Spark jobs.
Interface for a "Split," which specifies a test made at a decision tree node
to choose the left or right path.
Split applied to a feature
param: feature feature index
param: threshold Threshold for continuous feature.
The entry point for working with structured data (rows and columns) in Spark 1.x.
SQL data types for vectors and matrices.
A collection of implicit methods for converting common Scala objects into Datasets.
Implements the transformations which are defined by SQL statement.
::DeveloperApi::
A user-defined type which can be automatically recognized by a SQLContext and registered.
Class for squared error loss calculation.
SquaredEuclideanSilhouette computes the average of the
Silhouette over all the data of the dataset, which is
a measure of how appropriately the data have been clustered.
Updater for L2 regularized problems.
Represents a table which is staged for being committed to the metastore.
:: DeveloperApi ::
Stores information about a stage to pass from the scheduler to SparkListeners.
An optional mix-in for implementations of TableCatalog that support staging creation of
a table before committing the table's metadata along with its contents in CREATE TABLE AS
SELECT or REPLACE TABLE AS SELECT operations.
Generates i.i.d.
Standardizes features by removing the mean and scaling to unit variance using column summary
statistics on the samples in the training set.
Standardizes features by removing the mean and scaling to unit std using column summary
statistics on the samples in the training set.
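As a rough illustration of the standardization described in the two entries above (remove the mean, then scale to unit standard deviation), the following plain-Java sketch processes a single column. It is not Spark's implementation: the class name StandardizeSketch, the method standardize, the use of the sample (n - 1) standard deviation, and the mapping of constant columns to 0.0 are all assumptions made for this example.

```java
// Hypothetical helper, not part of Spark: standardizes one column of values.
public final class StandardizeSketch {

    // Returns (x - mean) / std for each value, where std is the sample
    // standard deviation (divisor n - 1). A column with zero spread maps to 0.0.
    public static double[] standardize(double[] column) {
        int n = column.length;
        double sum = 0.0;
        for (double v : column) {
            sum += v;
        }
        double mean = sum / n;

        double squaredDeviations = 0.0;
        for (double v : column) {
            squaredDeviations += (v - mean) * (v - mean);
        }
        double std = n > 1 ? Math.sqrt(squaredDeviations / (n - 1)) : 0.0;

        double[] scaled = new double[n];
        for (int i = 0; i < n; i++) {
            scaled[i] = std == 0.0 ? 0.0 : (column[i] - mean) / std;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Mean is 2.0 and sample std is 1.0, so the result is -1.0, 0.0, 1.0.
        System.out.println(java.util.Arrays.toString(standardize(new double[] {1.0, 2.0, 3.0})));
    }
}
```

In a fit/transform design, the mean and standard deviation would be computed once on the training set and then reused to transform new data; the sketch folds both steps into one call for brevity.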
Model fitted by StandardScaler.
Represents a StandardScaler model that can transform vectors.
Params for StandardScaler and StandardScalerModel.
A class for tracking the statistics of a set of numbers (count, mean and variance) in a
numerically robust way.
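One common way to track count, mean and variance in a numerically robust manner is Welford's online update, which avoids subtracting two large sums of squares. The sketch below shows that technique in plain Java; it is an illustration only and is not claimed to match the class described above. The class name RunningStats and its method names are hypothetical, and the variance returned is the population variance.

```java
// Hypothetical helper, not part of Spark: single-pass count/mean/variance (Welford's method).
public final class RunningStats {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the current mean

    // Folds one value into the running statistics.
    public void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean); // uses the updated mean
    }

    public long count() {
        return count;
    }

    public double mean() {
        return mean;
    }

    // Population variance (divisor n); NaN when no values have been added.
    public double variance() {
        return count == 0 ? Double.NaN : m2 / count;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.add(x);
        }
        System.out.println(stats.count() + " " + stats.mean() + " " + stats.variance());
    }
}
```

Because each update touches only three fields, the same structure works for streams of unknown length, and two partial results could be combined with a merge step (not shown here).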
:: Experimental ::
Abstract class for getting and updating the state in mapping function used in the
mapWithState operation of a pair DStream (Scala) or a JavaPairDStream (Java).
Represents the operation handle provided to the stateful processor used in the
arbitrary state API v2.
Information about updates made to stateful operators in a StreamingQuery during a trigger.
:: Experimental ::
Abstract class representing all the specifications of the DStream transformation
mapWithState operation of a pair DStream (Scala) or a JavaPairDStream (Java).
API for statistical functions in MLlib.
An interface to represent statistics for a data source, which is returned by
SupportsReportStatistics.estimateStatistics().
:: DeveloperApi ::
Simple SparkListener that logs a few summary statistics when each stage completes.
:: DeveloperApi ::
A simple StreamingListener that logs summary statistics across Spark Streaming batches
param: numBatchInfos Number of last batches to consider for generating statistics (default: 10)
This message will trigger ReceiverTrackerEndpoint to send stop signals to all registered
receivers.
A feature transformer that filters out stop words from input.
:: DeveloperApi ::
Flags for controlling the storage of an RDD.
A mapper class easy to obtain storage levels based on their names.
Expose some commonly useful storage level constants.
Helper methods for storage-related objects.
Protobuf type org.apache.spark.status.protobuf.AccumulableInfo
Protobuf type org.apache.spark.status.protobuf.AccumulableInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationAttemptInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationAttemptInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfoWrapper
Protobuf type org.apache.spark.status.protobuf.ApplicationEnvironmentInfoWrapper
Protobuf type org.apache.spark.status.protobuf.ApplicationInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationInfo
Protobuf type org.apache.spark.status.protobuf.ApplicationInfoWrapper
Protobuf type org.apache.spark.status.protobuf.ApplicationInfoWrapper
Protobuf type org.apache.spark.status.protobuf.AppSummary
Protobuf type org.apache.spark.status.protobuf.AppSummary
Protobuf type org.apache.spark.status.protobuf.CachedQuantile
Protobuf type org.apache.spark.status.protobuf.CachedQuantile
Protobuf enum org.apache.spark.status.protobuf.DeterministicLevel
Protobuf type org.apache.spark.status.protobuf.ExecutorMetrics
Protobuf type org.apache.spark.status.protobuf.ExecutorMetrics
Protobuf type org.apache.spark.status.protobuf.ExecutorMetricsDistributions
Protobuf type org.apache.spark.status.protobuf.ExecutorMetricsDistributions
Protobuf type org.apache.spark.status.protobuf.ExecutorPeakMetricsDistributions
Protobuf type org.apache.spark.status.protobuf.ExecutorPeakMetricsDistributions
Protobuf type org.apache.spark.status.protobuf.ExecutorResourceRequest
Protobuf type org.apache.spark.status.protobuf.ExecutorResourceRequest
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummary
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummary
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ExecutorStageSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ExecutorSummary
Protobuf type org.apache.spark.status.protobuf.ExecutorSummary
Protobuf type org.apache.spark.status.protobuf.ExecutorSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ExecutorSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.InputMetricDistributions
Protobuf type org.apache.spark.status.protobuf.InputMetricDistributions
Protobuf type org.apache.spark.status.protobuf.InputMetrics
Protobuf type org.apache.spark.status.protobuf.InputMetrics
Protobuf type org.apache.spark.status.protobuf.JobData
Protobuf type org.apache.spark.status.protobuf.JobData
Protobuf type org.apache.spark.status.protobuf.JobDataWrapper
Protobuf type org.apache.spark.status.protobuf.JobDataWrapper
Protobuf enum org.apache.spark.status.protobuf.JobExecutionStatus
Protobuf type org.apache.spark.status.protobuf.MemoryMetrics
Protobuf type org.apache.spark.status.protobuf.MemoryMetrics
Protobuf type org.apache.spark.status.protobuf.OutputMetricDistributions
Protobuf type org.apache.spark.status.protobuf.OutputMetricDistributions
Protobuf type org.apache.spark.status.protobuf.OutputMetrics
Protobuf type org.apache.spark.status.protobuf.OutputMetrics
Protobuf type org.apache.spark.status.protobuf.PairStrings
Protobuf type org.apache.spark.status.protobuf.PairStrings
Protobuf type org.apache.spark.status.protobuf.PoolData
Protobuf type org.apache.spark.status.protobuf.PoolData
Protobuf type org.apache.spark.status.protobuf.ProcessSummary
Protobuf type org.apache.spark.status.protobuf.ProcessSummary
Protobuf type org.apache.spark.status.protobuf.ProcessSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.ProcessSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.RDDDataDistribution
Protobuf type org.apache.spark.status.protobuf.RDDDataDistribution
Protobuf type org.apache.spark.status.protobuf.RDDOperationClusterWrapper
Protobuf type org.apache.spark.status.protobuf.RDDOperationClusterWrapper
Protobuf type org.apache.spark.status.protobuf.RDDOperationEdge
Protobuf type org.apache.spark.status.protobuf.RDDOperationEdge
Protobuf type org.apache.spark.status.protobuf.RDDOperationGraphWrapper
Protobuf type org.apache.spark.status.protobuf.RDDOperationGraphWrapper
Protobuf type org.apache.spark.status.protobuf.RDDOperationNode
Protobuf type org.apache.spark.status.protobuf.RDDOperationNode
Protobuf type org.apache.spark.status.protobuf.RDDPartitionInfo
Protobuf type org.apache.spark.status.protobuf.RDDPartitionInfo
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfo
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfo
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfoWrapper
Protobuf type org.apache.spark.status.protobuf.RDDStorageInfoWrapper
Protobuf type org.apache.spark.status.protobuf.ResourceInformation
Protobuf type org.apache.spark.status.protobuf.ResourceInformation
Protobuf type org.apache.spark.status.protobuf.ResourceProfileInfo
Protobuf type org.apache.spark.status.protobuf.ResourceProfileInfo
Protobuf type org.apache.spark.status.protobuf.ResourceProfileWrapper
Protobuf type org.apache.spark.status.protobuf.ResourceProfileWrapper
Protobuf type org.apache.spark.status.protobuf.RuntimeInfo
Protobuf type org.apache.spark.status.protobuf.RuntimeInfo
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetrics
Protobuf type org.apache.spark.status.protobuf.ShufflePushReadMetrics
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetrics
Protobuf type org.apache.spark.status.protobuf.ShuffleReadMetrics
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetricDistributions
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetrics
Protobuf type org.apache.spark.status.protobuf.ShuffleWriteMetrics
Protobuf type org.apache.spark.status.protobuf.SinkProgress
Protobuf type org.apache.spark.status.protobuf.SinkProgress
Protobuf type org.apache.spark.status.protobuf.SourceProgress
Protobuf type org.apache.spark.status.protobuf.SourceProgress
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphClusterWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphClusterWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphEdge
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphEdge
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNode
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNode
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNodeWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphNodeWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphWrapper
Protobuf type org.apache.spark.status.protobuf.SparkPlanGraphWrapper
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummary
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummary
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.SpeculationStageSummaryWrapper
Protobuf type org.apache.spark.status.protobuf.SQLExecutionUIData
Protobuf type org.apache.spark.status.protobuf.SQLExecutionUIData
Protobuf type org.apache.spark.status.protobuf.SQLPlanMetric
Protobuf type org.apache.spark.status.protobuf.SQLPlanMetric
Protobuf type org.apache.spark.status.protobuf.StageData
Protobuf type org.apache.spark.status.protobuf.StageData
Protobuf type org.apache.spark.status.protobuf.StageDataWrapper
Protobuf type org.apache.spark.status.protobuf.StageDataWrapper
Protobuf enum org.apache.spark.status.protobuf.StageStatus
Protobuf type org.apache.spark.status.protobuf.StateOperatorProgress
Protobuf type org.apache.spark.status.protobuf.StateOperatorProgress
Protobuf type org.apache.spark.status.protobuf.StreamBlockData
Protobuf type org.apache.spark.status.protobuf.StreamBlockData
Protobuf type org.apache.spark.status.protobuf.StreamingQueryData
Protobuf type org.apache.spark.status.protobuf.StreamingQueryData
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgress
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgress
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgressWrapper
Protobuf type org.apache.spark.status.protobuf.StreamingQueryProgressWrapper
Protobuf type org.apache.spark.status.protobuf.TaskData
Protobuf type org.apache.spark.status.protobuf.TaskData
Protobuf type org.apache.spark.status.protobuf.TaskDataWrapper
Protobuf type org.apache.spark.status.protobuf.TaskDataWrapper
Protobuf type org.apache.spark.status.protobuf.TaskMetricDistributions
Protobuf type org.apache.spark.status.protobuf.TaskMetricDistributions
Protobuf type org.apache.spark.status.protobuf.TaskMetrics
Protobuf type org.apache.spark.status.protobuf.TaskMetrics
Protobuf type org.apache.spark.status.protobuf.TaskResourceRequest
Protobuf type org.apache.spark.status.protobuf.TaskResourceRequest
Stores all the configuration options for tree construction
param: algo Learning goal.
Auxiliary functions and data structures for the sampleByKey method in PairRDDFunctions.
Deprecated.
This is deprecated as of Spark 3.4.0.
:: DeveloperApi ::
Represents the state of a StreamingContext.
A factory of DataWriter returned by
StreamingWrite.createStreamingWriterFactory(PhysicalWriteInfo), which is responsible for
creating and initializing the actual data writer at executor side.
StreamingKMeans provides methods for configuring a
streaming k-means analysis, training the model on streaming data,
and using the model to make predictions on streaming data.
StreamingKMeansModel extends MLlib's KMeansModel for streaming
algorithms, so it can keep track of a continuously updated weight
associated with each cluster, and also update the model by
doing a single iteration of the standard k-means algorithm.
StreamingLinearAlgorithm implements methods for continuously
training a generalized linear model on streaming data,
and using it for prediction on (possibly different) streaming data.
Train or predict a linear regression model on streaming data.
:: DeveloperApi ::
A listener interface for receiving information about an ongoing streaming
computation.
:: DeveloperApi ::
Base trait for events related to StreamingListener
Train or predict a logistic regression model on streaming data.
A handle to a query that is executing continuously in the background as new data arrives.
Exception that stopped a StreamingQuery.
Interface for listening to events related to StreamingQueries.
Base type of StreamingQueryListener events
Event representing that query is idle and waiting for new data to process.
Event representing any progress updates in a query.
Event representing the start of a query
param: id A unique query id that persists across restarts.
Event representing the termination of a query.
A class to manage all the StreamingQuery active in a SparkSession.
Information about progress made in the execution of a StreamingQuery during
a trigger.
Reports information about the instantaneous status of a streaming query.
Performs online 2-sample significance testing for a stream of (Boolean, Double) pairs.
Significance testing methods for StreamingTest.
An interface that defines how to write the data to data source in streaming queries.
:: DeveloperApi ::
Track the information of input stream at specified batch time.
::Experimental::
Implemented by objects that can produce a streaming Sink for a specific format or system.
::Experimental::
Implemented by objects that can produce a streaming Source for a specific format or system.
Specialized version of Param[Array[String]] for Java.
A filter that evaluates to true iff the attribute evaluates to
a string that contains the string value.
A filter that evaluates to true iff the attribute evaluates to
a string that ends with value.
A label indexer that maps string column(s) of labels to ML column(s) of label indices.
A SQL Aggregator used by StringIndexer to count labels in string columns during fitting.
Base trait for StringIndexer and StringIndexerModel.
Model fitted by StringIndexer.
An RDD that stores R objects as Array[String].
A filter that evaluates to true iff the attribute evaluates to
a string that starts with value.
The data type representing String values.
Strongly connected components algorithm implementation.
A field inside a StructType.
A StructType object can be constructed by
Performs Student's 2-sample t-test.
:: DeveloperApi ::
Task succeeded.
An aggregate function that returns the summation of all the values in a group.
Tools for vectorized statistics on MLlib Vectors.
A builder object that provides summary statistics about a given column.
A mix-in interface for SparkDataStream streaming sources to signal that they can control
the rate of data ingested into the system.
An atomic partition interface of Table to operate multiple partitions atomically.
An interface, which TableProviders can implement, to support table existence checks and creation
through a catalog, without having to use table identifiers.
A mix-in interface for Table delete support.
A mix-in interface for Table delete support.
A mix-in interface for RowLevelOperation.
Write builder trait for tables that support dynamic partition overwrite.
Table methods for working with index
An interface for exposing data columns for a table that are not in the table schema.
Catalog methods for working with namespaces.
Write builder trait for tables that support overwrite by filter.
Write builder trait for tables that support overwrite by filter.
A partition interface of Table.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface for Scan.
A mix-in interface for ScanBuilder.
A mix-in interface for ScanBuilder.
A mix-in interface of Table, to indicate that it's readable.
A mix in interface for Scan.
A mix in interface for Scan.
A mix in interface for Scan.
A mix-in interface for Table row-level operations support.
A mix-in interface for Scan.
A mix-in interface for Scan.
Implemented by StreamSourceProvider objects that can generate file metadata columns.
An interface for streaming sources that supports running in Trigger.AvailableNow mode, which
will process all the available data at the beginning of the query in (possibly) multiple batches.
Write builder trait for tables that support truncation.
A mix-in interface of Table, to indicate that it's writable.
Implementation of SVD++ algorithm.
Configuration parameters for SVDPlusPlus.
Generate sample data used for SVM.
Model for Support Vector Machines (SVMs).
Train a Support Vector Machine (SVM) using Stochastic Gradient Descent.
A table in Spark, as returned by the listTables method in Catalog.
An interface representing a logical structured data set of a data source.
Capabilities that can be provided by a Table implementation.
Catalog methods for working with Tables.
Capabilities that can be provided by a TableCatalog implementation.
TableChange subclasses represent requested changes to a table.
A TableChange to add a field.
Column position AFTER means the specified column should be put after the given `column`.
A TableChange to delete a field.
Column position FIRST means the specified column should be the first column.
A TableChange to remove a table property.
A TableChange to rename a field.
A TableChange to set a table property.
A TableChange to update the comment of a field.
A TableChange to update the default value of a field.
A TableChange to update the nullability of a field.
A TableChange to update the position of a field.
A TableChange to update the type of a field.
Index in a table
The base interface for v2 data sources which don't have a real catalog.
A BaseRelation that can produce all of its tuples as an RDD of Row objects.
:: DeveloperApi ::
Task requested the driver to commit, but was denied.
:: DeveloperApi ::
Contextual information about a task which can be read or mutated during
execution.
Names of the CSS classes corresponding to each type of task detail.
:: DeveloperApi ::
Various possible reasons why a task ended.
:: DeveloperApi ::
Various possible reasons why a task failed.
:: DeveloperApi ::
Tasks have a lot of indices that are used in a few different places.
:: DeveloperApi ::
Information about a running task attempt inside a TaskSet.
:: DeveloperApi ::
Task was killed intentionally and needs to be rescheduled.
:: DeveloperApi ::
Exception thrown when a task is explicitly killed (i.e., task failure is expected).
A location where a task should run.
A task resource request.
A set of task resource requests.
:: DeveloperApi ::
The task finished successfully, but the result was lost from the executor's block manager before
it was fetched.
Low-level task scheduler interface, currently implemented exclusively by
TaskSchedulerImpl.
An event that SparkContext uses to notify HeartbeatReceiver that SparkContext.taskScheduler is
created.
R formula terms.
:: Experimental ::
Trait for hypothesis test results.
Utilities for tests.
This is a simple class that represents an absolute instant of time.
Represents the time modes (used for specifying timers and ttl) possible for
the Dataset operations transformWithState.
Class used to provide access to timer values for processing and event time populated
before method invocations using the arbitrary state API v2.
The timestamp without time zone type represents a local time in microsecond precision,
which is independent of time zone.
The timestamp type represents a time instant in microsecond precision.
Intercepts write calls and tracks total time spent writing in order to update shuffle write
metrics.
A tokenizer that converts the input string to lowercase and then splits it by white spaces.
Trait for the artificial neural network (ANN) topology properties
::DeveloperApi::
TopologyMapper provides topology information for a given host
param: conf SparkConf to get required properties, if needed
Trait for ANN topology model
Abstraction for training results.
Validation for hyper-parameter tuning.
Model from train validation split.
Writer for TrainValidationSplitModel.
Params for TrainValidationSplit and TrainValidationSplitModel.
Represents a transform function in the public logical expression API.
Event fired after Transformer.transform.
Abstract class for transformers that transform one dataset into another.
Event fired before Transformer.transform.
Parameters for Decision Tree-based classification algorithms.
Parameters for Decision Tree-based ensemble classification algorithms.
Abstraction for models which are ensembles of decision trees
Parameters for Decision Tree-based ensemble algorithms.
Parameters for Decision Tree-based ensemble regression algorithms.
Parameters for Decision Tree-based regression algorithms.
Compute the number of triangles passing through each vertex.
Policy used to indicate how often results should be produced by a StreamingQuery.
Represents a subset of the fields of an EdgeTriplet or EdgeContext.
Represents a table which can be atomically truncated.
TTL Configuration for state variable.
Deprecated.
As of release 3.0.0, please use the untyped builtin aggregate functions.
Deprecated.
Please use untyped builtin aggregate functions.
A Spark SQL UDF that has 0 arguments.
A Spark SQL UDF that has 1 argument.
A Spark SQL UDF that has 10 arguments.
A Spark SQL UDF that has 11 arguments.
A Spark SQL UDF that has 12 arguments.
A Spark SQL UDF that has 13 arguments.
A Spark SQL UDF that has 14 arguments.
A Spark SQL UDF that has 15 arguments.
A Spark SQL UDF that has 16 arguments.
A Spark SQL UDF that has 17 arguments.
A Spark SQL UDF that has 18 arguments.
A Spark SQL UDF that has 19 arguments.
A Spark SQL UDF that has 2 arguments.
A Spark SQL UDF that has 20 arguments.
A Spark SQL UDF that has 21 arguments.
A Spark SQL UDF that has 22 arguments.
A Spark SQL UDF that has 3 arguments.
A Spark SQL UDF that has 4 arguments.
A Spark SQL UDF that has 5 arguments.
A Spark SQL UDF that has 6 arguments.
A Spark SQL UDF that has 7 arguments.
A Spark SQL UDF that has 8 arguments.
A Spark SQL UDF that has 9 arguments.
Functions for registering user-defined functions.
Functions for registering user-defined table functions.
This object keeps the mappings between user classes and their User Defined Types (UDTs).
This trait is shared by all the root containers for application UI information --
the HistoryServer and the application UI.
Utility functions for generating XML pages with spark content.
Continuously generates jobs that expose various features of the WebUI (internal testing tool).
Abstract class for transformers that take one input column, apply transformation, and output the
result as a new column.
Represents a user-defined function that is not bound to input types.
Generates i.i.d.
Feature selector based on univariate statistical tests against labels.
Model fitted by UnivariateFeatureSelectorModel.
Params for UnivariateFeatureSelector and UnivariateFeatureSelectorModel.
Represents a partitioning where rows are split across partitions in an unknown pattern.
:: DeveloperApi ::
We don't know why the task ended -- for example, because of a ClassNotFound exception when
deserializing the task result.
An unresolved attribute.
A distribution where no promises are made about co-location of data.
Rule that defines which upcasts are allowed in Spark.
Class used to perform steps (weight update) using Gradient Descent methods.
The general representation of user defined aggregate function, which implements AggregateFunc,
contains the upper-cased function name, the canonical function name,
the `isDistinct` flag and all the inputs.
Deprecated.
UserDefinedAggregateFunction is deprecated.
A user-defined function.
The general representation of user defined scalar function, which contains the upper-cased
function name, canonical function name and all the children expressions.
The data type for User Defined Types (UDTs).
Various utility methods used by Spark.
A trait that should be implemented by V1 DataSources that would like to leverage the DataSource
V2 read code paths.
A logical write that should be executed using V1 InsertableRelation interface.
The builder to generate SQL from V2 expressions.
A V2 table with V1 fallback support.
Common params for TrainValidationSplitParams and CrossValidatorParams.
Interface used for arbitrary stateful operations with the v2 API to capture
single value state.
Class for calculating variance during regression
Feature selector that removes all low-variance features.
Model fitted by VarianceThresholdSelector.
Params for VarianceThresholdSelector and VarianceThresholdSelectorModel.
This class is structurally equivalent to VariantVal.
Build variant value and metadata by parsing JSON values.
An exception indicating that we are attempting to build a variant with its value or metadata
exceeding the 16MiB size limit.
The data type representing semi-structured values with arbitrary hierarchical data structures.
This class defines constants related to the variant format and provides functions for
manipulating variant binaries.
Represents a numeric vector, whose index type is Int and value type is Double.
Represents a numeric vector, whose index type is Int and value type is Double.
A feature transformer that merges multiple columns into a vector column.
Utility transformer that rewrites Vector attribute names via prefix replacement.
Class for indexing categorical feature columns in a dataset of Vector.
Model fitted by VectorIndexer.
Private trait for params for VectorIndexer and VectorIndexerModel
Factory methods for Vector.
Factory methods for Vector.
A feature transformer that adds size information to the metadata of a vector column.
This class takes a feature vector and outputs a new feature vector with a subarray of the
original features.
Trait for transformation of a vector
:: AlphaComponent ::
Utilities for working with Spark version strings
VertexPartitionBaseOpsConstructor<T extends org.apache.spark.graphx.impl.VertexPartitionBase<Object>>
A typeclass for subclasses of VertexPartitionBase representing the ability to wrap them in a
VertexPartitionBaseOps.
Extends RDD[(VertexId, VD)] by ensuring that there is only one entry for each vertex and by
pre-indexing the entries for fast, efficient joins.
An interface representing a persisted view.
Catalog methods for working with views.
ViewChange subclasses represent requested changes to a view.
A class that holds view information.
Entry in vocabulary
A function with no return value.
A two-argument function that takes arguments of type T1 and T2 with no return value.
Generates i.i.d.
Performs Welch's 2-sample t-test.
A class for defining actions to be taken when matching rows in a DataFrame during
a merge operation.
A class for defining actions to be taken when no matching rows are found in a DataFrame
during a merge operation.
A class for defining actions to be performed when there is no match by source
during a merge operation in a MergeIntoWriter.
Utility functions for defining window in DataFrames.
A window specification that defines the partitioning, ordering, and frame boundaries.
Word2Vec trains a model of Map(String, Vector), i.e.
Word2Vec creates vector representation of words in a text corpus.
Params for Word2Vec and Word2VecModel.
Model fitted by Word2Vec.
Word2Vec model
param: wordIndex maps each word to an index, which can retrieve the corresponding
vector from wordVectors
param: wordVectors array of length numWords * vectorSize, vector corresponding
to the word mapped with index i can be retrieved by the slice
(i * vectorSize, i * vectorSize + vectorSize)
:: Private ::
A thin wrapper around a WritableByteChannel.
A logical representation of a data source write.
:: DeveloperApi ::
This abstract class represents a write ahead log (aka journal) that is used by Spark Streaming
to save the received data (by receivers) and associated metadata to a reliable storage, so that
they can be recovered after driver failures.
:: DeveloperApi ::
This abstract class represents a handle that refers to a record written in a WriteAheadLog.
A helper class with utility functions related to the WriteAheadLog interface
An interface for building the Write.
Configuration methods common to create/replace operations and insert/overwrite operations.
A commit message returned by DataWriter.commit() and will be sent back to the driver side
as the input parameter of BatchWrite.commit(WriterCommitMessage[]) or
StreamingWrite.commit(long, WriterCommitMessage[]).
The type represents year-month intervals of the SQL standard.
:: DeveloperApi ::
ZStandard implementation of CompressionCodec.
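The Word2Vec model entry in this index describes a flat wordVectors array of length numWords * vectorSize, where the vector for the word with index i is the slice (i * vectorSize, i * vectorSize + vectorSize). The following is a minimal sketch of that indexing rule in plain Python. It does not use Spark; the function name lookup_vector and the toy data are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def lookup_vector(
    word: str,
    word_index: Dict[str, int],
    word_vectors: List[float],
    vector_size: int,
) -> List[float]:
    """Return the vector for `word` from a flat array laid out word by word.

    Hypothetical helper: mirrors the slice rule
    (i * vectorSize, i * vectorSize + vectorSize) from the index entry.
    """
    i = word_index[word]          # index assigned to the word
    start = i * vector_size       # first element of the word's vector
    return word_vectors[start:start + vector_size]


# Toy data (assumed for illustration): 3 words, vectors of size 2.
word_index = {"spark": 0, "scala": 1, "java": 2}
vector_size = 2
word_vectors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# The flat array must hold numWords * vectorSize values.
assert len(word_vectors) == len(word_index) * vector_size

print(lookup_vector("scala", word_index, word_vectors, vector_size))  # [0.3, 0.4]
```

Storing all vectors in one flat array avoids one object per word; the cost is that every lookup has to compute the offset from the word's index, as above.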