pyspark.pandas.Series.str.split

str.split(pat=None, n=-1, expand=False)

Split strings around given separator/delimiter.

Splits the string in the Series from the beginning, at the specified delimiter string. Equivalent to str.split().

Parameters

pat : str, optional

String or regular expression to split on. If not specified, split on whitespace.

n : int, default -1 (all)

Limit on the number of splits in the output. None, 0 and -1 are all interpreted as "return all splits".

expand : bool, default False

Expand the split strings into separate columns.

If True, n must be a positive integer, and a DataFrame is returned, expanding dimensionality.
If False, a Series containing lists of strings is returned.

Returns

Series, DataFrame: Type matches caller unless expand=True (see Notes).

See also

str.rsplit: Splits string around given separator/delimiter, starting from the right.
str.join: Join lists contained as elements in the Series/Index with passed delimiter.

Notes

The handling of the n keyword depends on the number of found splits:

If found splits > n, make first n splits only
If found splits <= n, make all splits
If for a certain row the number of found splits < n, append None for padding up to n if expand=True

If using expand=True, Series callers return DataFrame objects with n + 1 columns.

Note

Unlike pandas, the number of columns does NOT shrink, even if n is much larger than the number of splits found.
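
The handling of n with expand=True described in these Notes can be sketched in plain Python for a single row. This is only an illustration under stated assumptions, not the pandas-on-Spark implementation: `split_and_pad` is a hypothetical helper, and it assumes that `pat`, when given, is treated as a regular expression.

```python
import re
from typing import List, Optional


def split_and_pad(value: Optional[str], pat: Optional[str], n: int) -> List[Optional[str]]:
    """Hypothetical helper: split at most n times, then pad with None to n + 1 items."""
    if n <= 0:
        raise ValueError("n must be a positive integer when expanding")
    if value is None:
        # A missing value propagates to every column.
        return [None] * (n + 1)
    if pat is None:
        # No pattern: split on runs of whitespace, like the built-in str.split().
        parts = value.split(None, n)
    else:
        # Assumption: the pattern is interpreted as a regular expression.
        parts = re.split(pat, value, maxsplit=n)
    # Rows with fewer than n splits are padded, so every row has n + 1 entries.
    return parts + [None] * (n + 1 - len(parts))


data = [
    "this is a regular sentence",
    "https://docs.python.org/3/tutorial/index.html",
    None,
]
rows = [split_and_pad(value, None, 4) for value in data]
for row in rows:
    print(row)
```

Every row comes back with exactly n + 1 entries, which is why the number of columns stays fixed no matter how few splits a given row produces.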

Examples

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> s = ps.Series(["this is a regular sentence",
...                "https://docs.python.org/3/tutorial/index.html",
...                np.nan])

In the default setting, the string is split by whitespace.

>>> s.str.split()  
0                   [this, is, a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                               None
dtype: object

Without the n parameter, the outputs of rsplit and split are identical.

>>> s.str.rsplit()  
0                   [this, is, a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                               None
dtype: object

The n parameter can be used to limit the number of splits on the delimiter. With n set, the outputs of split and rsplit differ, because split counts from the left and rsplit from the right.

>>> s.str.split(n=2)  
0                     [this, is, a regular sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                               None
dtype: object

>>> s.str.rsplit(n=2)  
0                     [this is a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                               None
dtype: object
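
Because the method is described as equivalent to the built-in str.split(), the per-element difference between split and rsplit with n=2 can also be checked in plain Python. This is a sketch of the element-wise behaviour only; it does not involve Spark, and the variable names are illustrative.

```python
sentence = "this is a regular sentence"

# Split at most twice, counting from the left.
left = sentence.split(None, 2)

# Split at most twice, counting from the right.
right = sentence.rsplit(None, 2)

print(left)
print(right)
```

Both results contain three pieces; only the side that keeps the unsplit remainder changes.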

The pat parameter can be used to split by other characters.

>>> s.str.split(pat="/")  
0                         [this is a regular sentence]
1    [https:, , docs.python.org, 3, tutorial, index...
2                                                 None
dtype: object

When using expand=True, the split elements will expand out into separate columns. If NaN is present, it is propagated throughout the columns during the split.

>>> s.str.split(n=4, expand=True)  
                                               0     1     2        3         4
0                                           this    is     a  regular  sentence
1  https://docs.python.org/3/tutorial/index.html  None  None     None      None
2                                           None  None  None     None      None

For slightly more complex use cases, such as splitting the HTML document name from a URL, a combination of parameter settings can be used.

>>> s.str.rsplit("/", n=1, expand=True)  
                                    0           1
0          this is a regular sentence        None
1  https://docs.python.org/3/tutorial  index.html
2                                None        None

Remember to escape special characters when explicitly using regular expressions.

>>> s = ps.Series(["1+1=2"])
>>> s.str.split(r"\+|=", n=2, expand=True)  
   0  1  2
0  1  1  2
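
As a hedged illustration of why escaping matters, the same pattern can be built with the standard-library `re` module. This sketch assumes the pattern is handled like a Python regular expression; the list of delimiters and the variable names are illustrative.

```python
import re

text = "1+1=2"

# An unescaped "+" is a quantifier with nothing to repeat, so it is not a valid pattern.
try:
    re.compile("+")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# re.escape() makes each delimiter literal before they are joined with "|".
pattern = "|".join(re.escape(delimiter) for delimiter in ["+", "="])
parts = re.split(pattern, text, maxsplit=2)

print(unescaped_ok)
print(parts)
```

Building the pattern with `re.escape` avoids hand-writing backslashes when the delimiters come from data rather than from a literal in the source.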