pyspark.sql.DataFrameReader.jdbc

DataFrameReader.jdbc(url: str, table: str, column: Optional[str] = None, lowerBound: Union[int, str, None] = None, upperBound: Union[int, str, None] = None, numPartitions: Optional[int] = None, predicates: Optional[List[str]] = None, properties: Optional[Dict[str, str]] = None) → DataFrame

Construct a DataFrame representing the database table named table accessible via JDBC URL url and connection properties.

Partitions of the table will be retrieved in parallel if either column or predicates is specified. lowerBound, upperBound and numPartitions are required when column is specified.

If both column and predicates are specified, column will be used.

New in version 1.4.0.
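For instance, a partitioned read driven by a numeric column might look like the sketch below; the JDBC URL, database, table, and column names are illustrative assumptions, and the matching JDBC driver JAR must be available to Spark.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read-sketch").getOrCreate()

# Hypothetical PostgreSQL URL and table; Spark splits the read into
# numPartitions ranges of "order_id" between lowerBound and upperBound.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/shop",
    table="orders",
    column="order_id",
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties={"user": "SYSTEM", "password": "mypassword"},
)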

Parameters
table : str

the name of the table

column : str, optional

alias of the partitionColumn option. Refer to partitionColumn in Data Source Option in the version you use.

predicates : list, optional

a list of expressions suitable for inclusion in WHERE clauses; each one defines one partition of the DataFrame

properties : dict, optional

a dictionary of JDBC database connection arguments. Normally it contains at least the "user" and "password" properties with their corresponding values, for example { 'user' : 'SYSTEM', 'password' : 'mypassword' }.

Returns
DataFrame
Other Parameters
Extra options

For the extra options, refer to Data Source Option in the version you use.

Notes

Don’t create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
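As a second hedged sketch, the same table could be read with predicates instead of a partition column; each WHERE-clause expression defines one partition. The URL, table, and column names below are assumptions, not part of the API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-predicates-sketch").getOrCreate()

# Hypothetical quarterly ranges over an "order_date" column; each string
# becomes the WHERE clause of one partition's query.
df = spark.read.jdbc(
    url="jdbc:postgresql://localhost:5432/shop",
    table="orders",
    predicates=[
        "order_date >= '2024-01-01' AND order_date < '2024-04-01'",
        "order_date >= '2024-04-01' AND order_date < '2024-07-01'",
        "order_date >= '2024-07-01' AND order_date < '2024-10-01'",
        "order_date >= '2024-10-01' AND order_date < '2025-01-01'",
    ],
    properties={"user": "SYSTEM", "password": "mypassword"},
)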