pyspark.pandas.DataFrame

class pyspark.pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

A pandas-on-Spark DataFrame that logically corresponds to a pandas DataFrame. It holds a Spark DataFrame internally.

Variables
    _internal – an internal immutable Frame to manage metadata.

Parameters
    data : dict, pandas DataFrame, Spark DataFrame, pandas-on-Spark Series, or array-like
        A dict can contain Series, arrays, constants, or list-like objects. Note that if data is a pandas DataFrame, a Spark DataFrame, or a pandas-on-Spark Series, the other arguments should not be used.
    index : Index or array-like
        Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.
    columns : Index or array-like
        Column labels to use for the resulting frame. Defaults to RangeIndex (0, 1, 2, …, n) if no column labels are provided.
    dtype : dtype, default None
        Data type to force. Only a single dtype is allowed. If None, the dtype is inferred.
    copy : bool, default False
        Copy data from inputs. Only affects DataFrame / 2d ndarray input.
Examples
Constructing DataFrame from a dictionary.
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = ps.DataFrame(data=d, columns=['col1', 'col2'])
>>> df
   col1  col2
0     1     3
1     2     4
Constructing DataFrame from a pandas DataFrame:

>>> df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
>>> df
   col1  col2
0     1     3
1     2     4
Notice that the inferred dtype is int64.
>>> df.dtypes
col1    int64
col2    int64
dtype: object
To enforce a single dtype:
>>> df = ps.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object
Constructing DataFrame from numpy ndarray:
>>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
...                    columns=['a', 'b', 'c', 'd', 'e'])
>>> df2
   a  b  c  d  e
0  3  1  4  9  8
1  4  8  4  8  4
2  7  6  5  6  7
3  8  7  9  1  0
4  2  5  4  3  9
Methods
abs()
    Return a Series/DataFrame with absolute numeric value of each element.
add(other)
    Get Addition of dataframe and other, element-wise (binary operator +).
add_prefix(prefix)
    Prefix labels with string prefix.
add_suffix(suffix)
    Suffix labels with string suffix.
agg(func)
    Aggregate using one or more operations over the specified axis.
aggregate(func)
    Aggregate using one or more operations over the specified axis.
align(other[, join, axis, copy])
    Align two objects on their axes with the specified join method.
all([axis])
    Return whether all elements are True.
any([axis])
    Return whether any element is True.
append(other[, ignore_index, …])
    Append rows of other to the end of caller, returning a new object.
apply(func[, axis, args])
    Apply a function along an axis of the DataFrame.
applymap(func)
    Apply a function to a DataFrame elementwise.
assign(**kwargs)
    Assign new columns to a DataFrame.
astype(dtype)
    Cast a pandas-on-Spark object to a specified dtype.
at_time(time[, asof, axis])
    Select values at a particular time of day (example: 9:30 AM).
backfill([axis, inplace, limit])
    Synonym for DataFrame.fillna() or Series.fillna() with method='bfill'.
between_time(start_time, end_time[, …])
    Select values between particular times of the day (example: 9:00-9:30 AM).
bfill([axis, inplace, limit])
    Synonym for DataFrame.fillna() or Series.fillna() with method='bfill'.
bool()
    Return the bool of a single element in the current object.
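The element-wise arithmetic and aggregation methods above follow the pandas API. A minimal sketch, using plain pandas so it runs without a Spark session; pyspark.pandas deliberately mirrors these signatures, so the same calls work on a ps.DataFrame (import pyspark.pandas as ps):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Element-wise addition (binary operator +)
added = df.add(10)

# Aggregate with several operations at once over the columns
summary = df.agg(['sum', 'min'])
```

Here `added` holds 11/12 and 13/14, and `summary` is indexed by the operation names 'sum' and 'min'.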
clip([lower, upper])
    Trim values at input threshold(s).
combine_first(other)
    Update null elements with value in the same location in other.
copy([deep])
    Make a copy of this object's indices and data.
corr([method])
    Compute pairwise correlation of columns, excluding NA/null values.
count([axis, numeric_only])
    Count non-NA cells for each column.
cov([min_periods])
    Compute pairwise covariance of columns, excluding NA/null values.
cummax([skipna])
    Return cumulative maximum over a DataFrame or Series axis.
cummin([skipna])
    Return cumulative minimum over a DataFrame or Series axis.
cumprod([skipna])
    Return cumulative product over a DataFrame or Series axis.
cumsum([skipna])
    Return cumulative sum over a DataFrame or Series axis.
describe([percentiles])
    Generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
diff([periods, axis])
    First discrete difference of element.
div(other)
    Get Floating division of dataframe and other, element-wise (binary operator /).
divide(other)
    Get Floating division of dataframe and other, element-wise (binary operator /).
dot(other)
    Compute the matrix multiplication between the DataFrame and other.
drop([labels, axis, index, columns])
    Drop specified labels from columns.
drop_duplicates([subset, keep, inplace])
    Return DataFrame with duplicate rows removed, optionally only considering certain columns.
droplevel(level[, axis])
    Return DataFrame with requested index / column level(s) removed.
dropna([axis, how, thresh, subset, inplace])
    Remove missing values.
duplicated([subset, keep])
    Return boolean Series denoting duplicate rows, optionally only considering certain columns.
eq(other)
    Compare if the current value is equal to the other.
equals(other)
    Test whether two objects contain the same elements.
eval(expr[, inplace])
    Evaluate a string describing operations on DataFrame columns.
expanding([min_periods])
    Provide expanding transformations.
explode(column)
    Transform each element of a list-like to a row, replicating index values.
ffill([axis, inplace, limit])
    Synonym for DataFrame.fillna() or Series.fillna() with method='ffill'.
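The missing-data methods (dropna, ffill, bfill, fillna) also follow pandas semantics. A minimal sketch in plain pandas; the same calls apply unchanged to a ps.DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, 5.0, 6.0]})

# Keep only rows with no missing values
dropped = df.dropna()

# Propagate the last valid observation forward (fillna with method='ffill')
filled = df.ffill()
```

Only the last row survives `dropna()`, while `ffill()` fills the gap in column 'a' from the row above but leaves the leading NaN in 'b' (nothing precedes it).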
fillna([value, method, axis, inplace, limit])
    Fill NA/NaN values.
filter([items, like, regex, axis])
    Subset rows or columns of dataframe according to labels in the specified index.
first(offset)
    Select first periods of time series data based on a date offset.
first_valid_index()
    Retrieve the index of the first valid value.
floordiv(other)
    Get Integer division of dataframe and other, element-wise (binary operator //).
from_dict(data[, orient, dtype, columns])
    Construct DataFrame from dict of array-like or dicts.
from_records(data[, index, exclude, …])
    Convert structured or record ndarray to DataFrame.
ge(other)
    Compare if the current value is greater than or equal to the other.
get(key[, default])
    Get item from object for given key (DataFrame column, Panel slice, etc.).
get_dtype_counts()
    Return counts of unique dtypes in this object.
groupby(by[, axis, as_index, dropna])
    Group DataFrame or Series using one or more columns.
gt(other)
    Compare if the current value is greater than the other.
head([n])
    Return the first n rows.
hist([bins])
    Draw one histogram of the DataFrame's columns.
idxmax([axis])
    Return index of first occurrence of maximum over requested axis.
idxmin([axis])
    Return index of first occurrence of minimum over requested axis.
info([verbose, buf, max_cols, null_counts])
    Print a concise summary of a DataFrame.
insert(loc, column, value[, allow_duplicates])
    Insert column into DataFrame at specified location.
isin(values)
    Whether each element in the DataFrame is contained in values.
isna()
    Detect missing values for items in the current DataFrame.
isnull()
    Detect missing values for items in the current DataFrame.
items()
    This is an alias of iteritems.
iteritems()
    Iterator over (column name, Series) pairs.
iterrows()
    Iterate over DataFrame rows as (index, Series) pairs.
itertuples([index, name])
    Iterate over DataFrame rows as namedtuples.
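groupby on a pandas-on-Spark frame behaves like pandas groupby. A minimal sketch, written in plain pandas so it runs without a Spark session:

```python
import pandas as pd

df = pd.DataFrame({'key': ['x', 'y', 'x'], 'val': [1, 2, 3]})

# Group rows by the 'key' column and sum each group
totals = df.groupby('key').sum()
```

The result is indexed by the group keys: 'x' sums to 4 and 'y' to 2.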
join(right[, on, how, lsuffix, rsuffix])
    Join columns of another DataFrame.
kde([bw_method, ind])
    Generate Kernel Density Estimate plot using Gaussian kernels.
keys()
    Return alias for columns.
kurt([axis, numeric_only])
    Return unbiased kurtosis using Fisher's definition of kurtosis (kurtosis of normal == 0.0).
kurtosis([axis, numeric_only])
    Return unbiased kurtosis using Fisher's definition of kurtosis (kurtosis of normal == 0.0).
last(offset)
    Select final periods of time series data based on a date offset.
last_valid_index()
    Return index for last non-NA/null value.
le(other)
    Compare if the current value is less than or equal to the other.
lt(other)
    Compare if the current value is less than the other.
mad([axis])
    Return the mean absolute deviation of values.
mask(cond[, other])
    Replace values where the condition is True.
max([axis, numeric_only])
    Return the maximum of the values.
mean([axis, numeric_only])
    Return the mean of the values.
median([axis, numeric_only, accuracy])
    Return the median of the values for the requested axis.
melt([id_vars, value_vars, var_name, value_name])
    Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.
merge(right[, how, on, left_on, right_on, …])
    Merge DataFrame objects with a database-style join.
min([axis, numeric_only])
    Return the minimum of the values.
mod(other)
    Get Modulo of dataframe and other, element-wise (binary operator %).
mul(other)
    Get Multiplication of dataframe and other, element-wise (binary operator *).
multiply(other)
    Get Multiplication of dataframe and other, element-wise (binary operator *).
ne(other)
    Compare if the current value is not equal to the other.
nlargest(n, columns)
    Return the first n rows ordered by columns in descending order.
notna()
    Detect non-missing values for items in the current DataFrame.
notnull()
    Detect non-missing values for items in the current DataFrame.
nsmallest(n, columns)
    Return the first n rows ordered by columns in ascending order.
nunique([axis, dropna, approx, rsd])
    Return number of unique elements in the object.
pad([axis, inplace, limit])
    Synonym for DataFrame.fillna() or Series.fillna() with method='ffill'.
pct_change([periods])
    Percentage change between the current and a prior element.
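merge performs a database-style join and melt unpivots wide data to long form; pyspark.pandas provides the same signatures. A minimal sketch in plain pandas:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'rval': [3, 4]})

# Inner join on the shared 'key' column
merged = left.merge(right, on='key')

# Unpivot the two value columns into (key, variable, value) rows
long_form = merged.melt(id_vars='key', value_vars=['lval', 'rval'])
```

`merged` has one row per key with both value columns; `long_form` has one row per (key, variable) pair.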
pipe(func, *args, **kwargs)
    Apply func(self, *args, **kwargs).
pivot([index, columns, values])
    Return reshaped DataFrame organized by given index / column values.
pivot_table([values, index, columns, …])
    Create a spreadsheet-style pivot table as a DataFrame.
pop(item)
    Return item and drop from frame.
pow(other)
    Get Exponential power of series of dataframe and other, element-wise (binary operator **).
prod([axis, numeric_only, min_count])
    Return the product of the values.
product([axis, numeric_only, min_count])
    Return the product of the values.
quantile([q, axis, numeric_only, accuracy])
    Return value at the given quantile.
query(expr[, inplace])
    Query the columns of a DataFrame with a boolean expression.
radd(other)
    Get Addition of dataframe and other, element-wise (binary operator +).
rank([method, ascending])
    Compute numerical data ranks (1 through n) along axis.
rdiv(other)
    Get Floating division of dataframe and other, element-wise (binary operator /).
reindex([labels, index, columns, axis, …])
    Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.
reindex_like(other[, copy])
    Return a DataFrame with matching indices as other object.
rename([mapper, index, columns, axis, …])
    Alter axes labels.
rename_axis([mapper, index, columns, axis, …])
    Set the name of the axis for the index or columns.
replace([to_replace, value, inplace, limit, …])
    Return a new DataFrame replacing a value with another value.
reset_index([level, drop, inplace, …])
    Reset the index, or a level of it.
rfloordiv(other)
    Get Integer division of dataframe and other, element-wise (binary operator //).
rmod(other)
    Get Modulo of dataframe and other, element-wise (binary operator %).
rmul(other)
    Get Multiplication of dataframe and other, element-wise (binary operator *).
rolling(window[, min_periods])
    Provide rolling transformations.
round([decimals])
    Round a DataFrame to a variable number of decimal places.
rpow(other)
    Get Exponential power of dataframe and other, element-wise (binary operator **).
rsub(other)
    Get Subtraction of dataframe and other, element-wise (binary operator -).
rtruediv(other)
    Get Floating division of dataframe and other, element-wise (binary operator /).
sample([n, frac, replace, random_state])
    Return a random sample of items from an axis of object.
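query filters rows with a boolean expression evaluated against column names, and pivot_table reshapes with aggregation. A minimal plain-pandas sketch; the same calls work on a ps.DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b'], 'val': [1, 2, 3]})

# Keep rows where the expression is True
filtered = df.query('val > 1')

# Spreadsheet-style pivot: sum 'val' per category
table = df.pivot_table(values='val', index='cat', aggfunc='sum')
```

`filtered` keeps the rows with val 2 and 3; `table` is indexed by 'cat' with both groups summing to 3.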
select_dtypes([include, exclude])
    Return a subset of the DataFrame's columns based on the column dtypes.
sem([axis, ddof, numeric_only])
    Return unbiased standard error of the mean over requested axis.
set_index(keys[, drop, append, inplace])
    Set the DataFrame index (row labels) using one or more existing columns.
shift([periods, fill_value])
    Shift DataFrame by desired number of periods.
skew([axis, numeric_only])
    Return unbiased skew normalized by N-1.
sort_index([axis, level, ascending, …])
    Sort object by labels (along an axis).
sort_values(by[, ascending, inplace, …])
    Sort by the values along either axis.
squeeze([axis])
    Squeeze 1 dimensional axis objects into scalars.
stack()
    Stack the prescribed level(s) from columns to index.
std([axis, ddof, numeric_only])
    Return sample standard deviation.
sub(other)
    Get Subtraction of dataframe and other, element-wise (binary operator -).
subtract(other)
    Get Subtraction of dataframe and other, element-wise (binary operator -).
sum([axis, numeric_only, min_count])
    Return the sum of the values.
swapaxes(i, j[, copy])
    Interchange axes and swap values axes appropriately.
swaplevel([i, j, axis])
    Swap levels i and j in a MultiIndex on a particular axis.
tail([n])
    Return the last n rows.
take(indices[, axis])
    Return the elements in the given positional indices along an axis.
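sort_values orders rows by column values, and head/tail then slice the result; the calls apply unchanged to a pandas-on-Spark DataFrame. A minimal plain-pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [3, 1, 2]})

# Sort rows by the values in column 'x' (ascending by default)
ordered = df.sort_values(by='x')

# Take the first two rows of the sorted result
top = ordered.head(2)
```

Note that sorting reorders the rows but keeps each row's original index label.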
to_clipboard([excel, sep])
    Copy object to the system clipboard.
to_csv([path, sep, na_rep, columns, header, …])
    Write object to a comma-separated values (csv) file.
to_delta(path[, mode, partition_cols, index_col])
    Write the DataFrame out as a Delta Lake table.
to_dict([orient, into])
    Convert the DataFrame to a dictionary.
to_excel(excel_writer[, sheet_name, na_rep, …])
    Write object to an Excel sheet.
to_html([buf, columns, col_space, header, …])
    Render a DataFrame as an HTML table.
to_json([path, compression, num_files, …])
    Convert the object to a JSON string.
to_latex([buf, columns, col_space, header, …])
    Render an object to a LaTeX tabular environment table.
to_markdown([buf, mode])
    Print Series or DataFrame in Markdown-friendly format.
to_numpy()
    A NumPy ndarray representing the values in this DataFrame or Series.
to_orc(path[, mode, partition_cols, index_col])
    Write the DataFrame out as an ORC file or directory.
to_pandas()
    Return a pandas DataFrame.
to_parquet(path[, mode, partition_cols, …])
    Write the DataFrame out as a Parquet file or directory.
to_records([index, column_dtypes, index_dtypes])
    Convert DataFrame to a NumPy record array.
to_spark([index_col])
    Return the current DataFrame as a Spark DataFrame.
to_spark_io([path, format, mode, …])
    Write the DataFrame out to a Spark data source.
to_string([buf, columns, col_space, header, …])
    Render a DataFrame to a console-friendly tabular output.
to_table(name[, format, mode, …])
    Write the DataFrame into a Spark table.
transform(func[, axis])
    Call func on self, producing a Series with transformed values that has the same length as its input.
transpose()
    Transpose index and columns.
truediv(other)
    Get Floating division of dataframe and other, element-wise (binary operator /).
truncate([before, after, axis, copy])
    Truncate a Series or DataFrame before and after some index value.
unstack()
    Pivot the (necessarily hierarchical) index labels.
update(other[, join, overwrite])
    Modify in place using non-NA values from another DataFrame.
var([axis, ddof, numeric_only])
    Return unbiased variance.
where(cond[, other, axis])
    Replace values where the condition is False.
xs(key[, axis, level])
    Return cross-section from the DataFrame.
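Among the conversion methods, to_dict and to_numpy behave exactly as in pandas, while to_pandas() exists only on pyspark.pandas objects and collects the distributed data into a local pandas DataFrame (which can be expensive for large data). A minimal plain-pandas sketch of the shared conversions:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Nested dict keyed by column, then by index label (default orient)
as_dict = df.to_dict()

# 2-D NumPy ndarray of the values
as_array = df.to_numpy()
```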
Attributes
T
    Transpose index and columns.
at
    Access a single value for a row/column label pair.
axes
    Return a list representing the axes of the DataFrame.
columns
    The column labels of the DataFrame.
dtypes
    Return the dtypes in the DataFrame.
empty
    Returns true if the current DataFrame is empty.
iat
    Access a single value for a row/column pair by integer position.
iloc
    Purely integer-location based indexing for selection by position.
index
    The index (row labels) of the DataFrame.
loc
    Access a group of rows and columns by label(s) or a boolean Series.
ndim
    Return an int representing the number of array dimensions.
shape
    Return a tuple representing the dimensionality of the DataFrame.
size
    Return an int representing the number of elements in this object.
style
    Property returning a Styler object containing methods for building a styled HTML representation of the DataFrame.
values
    Return a NumPy representation of the DataFrame or the Series.
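The introspection attributes (shape, size, ndim, columns) mirror pandas exactly. A minimal sketch in plain pandas:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

dims = df.shape   # (rows, columns) tuple
count = df.size   # total number of elements
```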