PySpark uses Spark as an engine. PySpark uses Py4J to leverage Spark to submit and computes the jobs.
On the driver side, PySpark communicates with the driver on JVM by using Py4J. When pyspark.sql.SparkSession or pyspark.SparkContext is created and initialized, PySpark launches a JVM to communicate.
pyspark.sql.SparkSession
pyspark.SparkContext
On the executor side, Python workers execute and handle Python native functions or data. They are not launched if a PySpark application does not require interaction between Python workers and JVMs. They are lazily launched only when Python native functions or data have to be handled, for example, when you execute pandas UDFs or PySpark RDD APIs.
This page focuses on debugging Python side of PySpark on both driver and executor sides instead of focusing on debugging with JVM. Profiling and debugging JVM is described at Useful Developer Tools.
Note that,
If you are running locally, you can directly debug the driver side via using your IDE without the remote debug feature. Setting PySpark with IDEs is documented here.
There are many other ways of debugging PySpark applications. For example, you can remotely debug by using the open source Remote Debugger instead of using PyCharm Professional documented here.
This section describes remote debugging on both driver and executor sides within a single machine to demonstrate easily. The ways of debugging PySpark on the executor side is different from doing in the driver. Therefore, they will be demonstrated respectively. In order to debug PySpark applications on other machines, please refer to the full instructions that are specific to PyCharm, documented here.
Firstly, choose Edit Configuration… from the Run menu. It opens the Run/Debug Configurations dialog. You have to click + configuration on the toolbar, and from the list of available configurations, select Python Debug Server. Enter the name of this new configuration, for example, MyRemoteDebugger and also specify the port number, for example 12345.
+
MyRemoteDebugger
12345
pydevd-pycharm
pip install pydevd-pycharm~=<version of PyCharm on the local machine>
To debug on the driver side, your application should be able to connect to the debugging server. Copy and paste the codes with pydevd_pycharm.settrace to the top of your PySpark script. Suppose the script name is app.py:
pydevd_pycharm.settrace
app.py
echo "#======================Copy and paste from the previous dialog=========================== import pydevd_pycharm pydevd_pycharm.settrace('localhost', port=12345, stdoutToServer=True, stderrToServer=True) #======================================================================================== # Your PySpark application codes: from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() spark.range(10).show()" > app.py
Start to debug with your MyRemoteDebugger.
spark-submit app.py
To debug on the executor side, prepare a Python file as below in your current working directory.
echo "from pyspark import daemon, worker def remote_debug_wrapped(*args, **kwargs): #======================Copy and paste from the previous dialog=========================== import pydevd_pycharm pydevd_pycharm.settrace('localhost', port=12345, stdoutToServer=True, stderrToServer=True) #======================================================================================== worker.main(*args, **kwargs) daemon.worker_main = remote_debug_wrapped if __name__ == '__main__': daemon.manager()" > remote_debug.py
You will use this file as the Python worker in your PySpark applications by using the spark.python.daemon.module configuration. Run the pyspark shell with the configuration below:
spark.python.daemon.module
pyspark
pyspark --conf spark.python.daemon.module=remote_debug
Now you’re ready to remotely debug. Start to debug with your MyRemoteDebugger.
spark.range(10).repartition(1).rdd.map(lambda x: x).collect()
top
ps
The Python processes on the driver and executor can be checked via typical ways such as top and ps commands.
On the driver side, you can get the process id from your PySpark shell easily as below to know the process id and resources.
>>> import os; os.getpid() 18482
ps -fe 18482
UID PID PPID C STIME TTY TIME CMD 000 18482 12345 0 0:00PM ttys001 0:00.00 /.../python
To check on the executor side, you can simply grep them to figure out the process ids and relevant resources because Python workers are forked from pyspark.daemon.
grep
pyspark.daemon
ps -fe | grep pyspark.daemon
000 12345 1 0 0:00PM ttys000 0:00.00 /.../python -m pyspark.daemon 000 12345 1 0 0:00PM ttys000 0:00.00 /.../python -m pyspark.daemon 000 12345 1 0 0:00PM ttys000 0:00.00 /.../python -m pyspark.daemon 000 12345 1 0 0:00PM ttys000 0:00.00 /.../python -m pyspark.daemon ...
memory_profiler is one of the profilers that allow you to check the memory usage line by line.
Unless you are running your driver program in another machine (e.g., YARN cluster mode), this useful tool can be used to debug the memory usage on driver side easily. Suppose your PySpark script name is profile_memory.py. You can profile it as below.
profile_memory.py
echo "from pyspark.sql import SparkSession #===Your function should be decorated with @profile=== from memory_profiler import profile @profile #===================================================== def my_func(): session = SparkSession.builder.getOrCreate() df = session.range(10000) return df.collect() if __name__ == '__main__': my_func()" > profile_memory.py
python -m memory_profiler profile_memory.py
Filename: profile_memory.py Line # Mem usage Increment Line Contents ================================================ ... 6 def my_func(): 7 51.5 MiB 0.6 MiB session = SparkSession.builder.getOrCreate() 8 51.5 MiB 0.0 MiB df = session.range(10000) 9 54.4 MiB 2.8 MiB return df.collect()
PySpark provides remote memory_profiler for Python/Pandas UDFs, which can be enabled by setting spark.python.profile.memory configuration to true. That can be used on editors with line numbers such as Jupyter notebooks. An example on a Jupyter notebook is as shown below.
spark.python.profile.memory
true
pyspark --conf spark.python.profile.memory=true
from pyspark.sql.functions import pandas_udf df = spark.range(10) @pandas_udf("long") def add1(x): return x + 1 added = df.select(add1("id")) added.show() sc.show_profiles()
The result profile is as shown below.
============================================================ Profile of UDF<id=2> ============================================================ Filename: ... Line # Mem usage Increment Occurrences Line Contents ============================================================= 4 974.0 MiB 974.0 MiB 10 @pandas_udf("long") 5 def add1(x): 6 974.4 MiB 0.4 MiB 10 return x + 1
The UDF IDs can be seen in the query plan, for example, add1(...)#2L in ArrowEvalPython as shown below.
add1(...)#2L
ArrowEvalPython
added.explain()
== Physical Plan == *(2) Project [pythonUDF0#11L AS add1(id)#3L] +- ArrowEvalPython [add1(id#0L)#2L], [pythonUDF0#11L], 200 +- *(1) Range (0, 10, step=1, splits=16)
This feature is not supported with registered UDFs or UDFs with iterators as inputs/outputs.
Python Profilers are useful built-in features in Python itself. These provide deterministic profiling of Python programs with a lot of useful statistics. This section describes how to use it on both driver and executor sides in order to identify expensive or hot code paths.
To use this on driver side, you can use it as you would do for regular Python programs because PySpark on driver side is a regular Python process unless you are running your driver program in another machine (e.g., YARN cluster mode).
echo "from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() spark.range(10).show()" > app.py
python -m cProfile app.py
... 129215 function calls (125446 primitive calls) in 5.926 seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1198/405 0.001 0.000 0.083 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist) 561 0.001 0.000 0.001 0.000 <frozen importlib._bootstrap>:103(release) 276 0.000 0.000 0.000 0.000 <frozen importlib._bootstrap>:143(__init__) 276 0.000 0.000 0.002 0.000 <frozen importlib._bootstrap>:147(__enter__) ...
To use this on executor side, PySpark provides remote Python Profilers for executor side, which can be enabled by setting spark.python.profile configuration to true.
spark.python.profile
pyspark --conf spark.python.profile=true
>>> rdd = sc.parallelize(range(100)).map(str) >>> rdd.count() 100 >>> sc.show_profiles() ============================================================ Profile of RDD<id=1> ============================================================ 728 function calls (692 primitive calls) in 0.004 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 12 0.001 0.000 0.001 0.000 serializers.py:210(load_stream) 12 0.000 0.000 0.000 0.000 {built-in method _pickle.dumps} 12 0.000 0.000 0.001 0.000 serializers.py:252(dump_stream) 12 0.000 0.000 0.001 0.000 context.py:506(f) ...
To use this on Python/Pandas UDFs, PySpark provides remote Python Profilers for Python/Pandas UDFs, which can be enabled by setting spark.python.profile configuration to true.
>>> from pyspark.sql.functions import pandas_udf >>> df = spark.range(10) >>> @pandas_udf("long") ... def add1(x): ... return x + 1 ... >>> added = df.select(add1("id")) >>> added.show() +--------+ |add1(id)| +--------+ ... +--------+ >>> sc.show_profiles() ============================================================ Profile of UDF<id=2> ============================================================ 2300 function calls (2270 primitive calls) in 0.006 seconds Ordered by: internal time, cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 10 0.001 0.000 0.005 0.001 series.py:5515(_arith_method) 10 0.001 0.000 0.001 0.000 _ufunc_config.py:425(__init__) 10 0.000 0.000 0.000 0.000 {built-in method _operator.add} 10 0.000 0.000 0.002 0.000 series.py:315(__init__) ...
The UDF IDs can be seen in the query plan, for example, add1(...)#2L in ArrowEvalPython below.
>>> added.explain() == Physical Plan == *(2) Project [pythonUDF0#11L AS add1(id)#3L] +- ArrowEvalPython [add1(id#0L)#2L], [pythonUDF0#11L], 200 +- *(1) Range (0, 10, step=1, splits=16)
This feature is not supported with registered UDFs.
AnalysisException
AnalysisException is raised when failing to analyze a SQL query plan.
Example:
>>> df = spark.range(1) >>> df['bad_key'] Traceback (most recent call last): ... pyspark.errors.exceptions.AnalysisException: Cannot resolve column name "bad_key" among (id)
Solution:
>>> df['id'] Column<'id'>
ParseException
ParseException is raised when failing to parse a SQL command.
>>> spark.sql("select * 1") Traceback (most recent call last): ... pyspark.errors.exceptions.ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near '1': extra input '1'.(line 1, pos 9) == SQL == select * 1 ---------^^^
>>> spark.sql("select *") DataFrame[]
IllegalArgumentException
IllegalArgumentException is raised when passing an illegal or inappropriate argument.
>>> spark.range(1).sample(-1.0) Traceback (most recent call last): ... pyspark.errors.exceptions.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
>>> spark.range(1).sample(1.0) DataFrame[id: bigint]
PythonException
PythonException is thrown from Python workers.
You can see the type of exception that was thrown from the Python worker and its stack trace, as TypeError below.
TypeError
>>> import pyspark.sql.functions as F >>> from pyspark.sql.functions import udf >>> def f(x): ... return F.abs(x) ... >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect() 22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232) org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
>>> def f(x): ... return abs(x) ... >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect() [Row(id=-1, abs='1'), Row(id=0, abs='0')]
StreamingQueryException
StreamingQueryException is raised when failing a StreamingQuery. Most often, it is thrown from Python workers, that wrap it as a PythonException.
>>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming") >>> from pyspark.sql.functions import col, udf >>> bad_udf = udf(lambda x: 1 / 0) >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable() Traceback (most recent call last): ... org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ... pyspark.errors.exceptions.StreamingQueryException: [STREAM_FAILED] Query [id = 74eb53a8-89bd-49b0-9313-14d29eed03aa, runId = 9f2d5cf6-a373-478d-b718-2c2b6d8a0f24] terminated with exception: Job aborted
Fix the StreamingQuery and re-execute the workflow.
SparkUpgradeException
SparkUpgradeException is thrown because of Spark upgrade.
>>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"]) >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa")))) >>> df2.collect() Traceback (most recent call last): ... pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
>>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa")))) >>> df2.collect() [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
There are specific common exceptions / errors in pandas API on Spark.
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe
Operations involving more than one series or dataframes raises a ValueError if compute.ops_on_diff_frames is disabled (disabled by default). Such operations may be expensive due to joining of underlying Spark frames. So users should be aware of the cost and enable that flag only when necessary.
ValueError
compute.ops_on_diff_frames
Exception:
>>> ps.Series([1, 2]) + ps.Series([3, 4]) Traceback (most recent call last): ... ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
>>> with ps.option_context('compute.ops_on_diff_frames', True): ... ps.Series([1, 2]) + ps.Series([3, 4]) ... 0 4 1 6 dtype: int64
RuntimeError: Result vector from pandas_udf was not the required length
>>> def f(x) -> ps.Series[np.int32]: ... return x[:-1] ... >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f) 22/04/12 13:46:39 ERROR Executor: Exception in task 2.0 in stage 16.0 (TID 88) org.apache.spark.api.python.PythonException: Traceback (most recent call last): ... RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 0
>>> def f(x) -> ps.Series[np.int32]: ... return x ... >>> ps.DataFrame({"x":[1, 2], "y":[3, 4]}).transform(f) x y 0 1 3 1 2 4
Py4JJavaError
Py4JJavaError is raised when an exception occurs in the Java client code. You can see the type of exception that was thrown on the Java side and its stack trace, as java.lang.NullPointerException below.
java.lang.NullPointerException
>>> spark.sparkContext._jvm.java.lang.String(None) Traceback (most recent call last): ... py4j.protocol.Py4JJavaError: An error occurred while calling None.java.lang.String. : java.lang.NullPointerException ..
>>> spark.sparkContext._jvm.java.lang.String("x") 'x'
Py4JError
Py4JError is raised when any other error occurs such as when the Python client program tries to access an object that no longer exists on the Java side.
>>> from pyspark.ml.linalg import Vectors >>> from pyspark.ml.regression import LinearRegression >>> df = spark.createDataFrame( ... [(1.0, 2.0, Vectors.dense(1.0)), (0.0, 2.0, Vectors.sparse(1, [], []))], ... ["label", "weight", "features"], ... ) >>> lr = LinearRegression( ... maxIter=1, regParam=0.0, solver="normal", weightCol="weight", fitIntercept=False ... ) >>> model = lr.fit(df) >>> model LinearRegressionModel: uid=LinearRegression_eb7bc1d4bf25, numFeatures=1 >>> model.__del__() >>> model Traceback (most recent call last): ... py4j.protocol.Py4JError: An error occurred while calling o531.toString. Trace: py4j.Py4JException: Target Object ID does not exist for this gateway :o531 ...
Access an object that exists on the Java side.
Py4JNetworkError
Py4JNetworkError is raised when a problem occurs during network transfer (e.g., connection lost). In this case, we shall debug the network and rebuild the connection.
There are Spark configurations to control stack traces:
spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled is true by default to simplify traceback from Python UDFs.
spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled
spark.sql.pyspark.jvmStacktrace.enabled is false by default to hide JVM stacktrace and to show a Python-friendly exception only.
spark.sql.pyspark.jvmStacktrace.enabled
Spark configurations above are independent from log level settings. Control log levels through pyspark.SparkContext.setLogLevel().
pyspark.SparkContext.setLogLevel()