Data Types
Supported Data Types
Spark SQL and DataFrames support the following data types:
- Numeric types
  - ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
  - ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
  - IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
  - LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
  - FloatType: Represents 4-byte single-precision floating point numbers.
  - DoubleType: Represents 8-byte double-precision floating point numbers.
  - DecimalType: Represents arbitrary-precision signed decimal numbers. Backed internally by java.math.BigDecimal. A BigDecimal consists of an arbitrary-precision integer unscaled value and a 32-bit integer scale.
- String type
  - StringType: Represents character string values.
- Binary type
  - BinaryType: Represents byte sequence values.
- Boolean type
  - BooleanType: Represents boolean values.
- Datetime type
  - TimestampType: Represents values comprising fields year, month, day, hour, minute, and second, with the session local time zone. The timestamp value represents an absolute point in time.
  - DateType: Represents values comprising fields year, month, and day, without a time zone.
- Complex types (see the schema sketch after this list)
  - ArrayType(elementType, containsNull): Represents values comprising a sequence of elements with the type of elementType. containsNull is used to indicate if elements in an ArrayType value can have null values.
  - MapType(keyType, valueType, valueContainsNull): Represents values comprising a set of key-value pairs. The data type of keys is described by keyType and the data type of values is described by valueType. For a MapType value, keys are not allowed to have null values. valueContainsNull is used to indicate if values of a MapType value can have null values.
  - StructType(fields): Represents values with the structure described by a sequence of StructFields (fields).
  - StructField(name, dataType, nullable): Represents a field in a StructType. The name of a field is indicated by name. The data type of a field is indicated by dataType. nullable is used to indicate if values of these fields can have null values.
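As a quick illustration of how the simple and complex types compose, here is a minimal Scala sketch of a nested schema; the column names and types are hypothetical, chosen only for illustration:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema combining simple and complex types.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType),                                  // nullable defaults to true
  StructField("balance", DecimalType(10, 2)),                       // precision 10, scale 2
  StructField("tags", ArrayType(StringType, containsNull = false)),
  StructField("attributes", MapType(StringType, IntegerType, valueContainsNull = true)),
  StructField("address", StructType(Seq(                            // nested struct
    StructField("city", StringType),
    StructField("zip", StringType)
  )))
))
```

Calling schema.treeString renders the nested layout of such a schema as text.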
All data types of Spark SQL are located in the package org.apache.spark.sql.types.
You can access them by doing:

import org.apache.spark.sql.types._
Data type | Value type in Scala | API to access or create a data type |
---|---|---|
ByteType | Byte | ByteType |
ShortType | Short | ShortType |
IntegerType | Int | IntegerType |
LongType | Long | LongType |
FloatType | Float | FloatType |
DoubleType | Double | DoubleType |
DecimalType | java.math.BigDecimal | DecimalType |
StringType | String | StringType |
BinaryType | Array[Byte] | BinaryType |
BooleanType | Boolean | BooleanType |
TimestampType | java.sql.Timestamp | TimestampType |
DateType | java.sql.Date | DateType |
ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]) Note: The default value of containsNull is true. |
MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is true. |
StructType | org.apache.spark.sql.Row | StructType(fields) Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed. |
StructField | The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]) Note: The default value of nullable is true. |
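For example, a minimal sketch of building a DataFrame against an explicit schema using these Scala types; this assumes an active SparkSession named spark, and the column names and values are made up for illustration:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)
))

// Rows must match the schema positionally; null is allowed where nullable = true.
val rows = java.util.Arrays.asList(Row("Alice", 30), Row("Bob", null))
val df = spark.createDataFrame(rows, schema)
df.printSchema()
// root
//  |-- name: string (nullable = false)
//  |-- age: integer (nullable = true)
```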
All data types of Spark SQL are located in the package org.apache.spark.sql.types. To access or create a data type, please use factory methods provided in org.apache.spark.sql.types.DataTypes.
Data type | Value type in Java | API to access or create a data type |
---|---|---|
ByteType | byte or Byte | DataTypes.ByteType |
ShortType | short or Short | DataTypes.ShortType |
IntegerType | int or Integer | DataTypes.IntegerType |
LongType | long or Long | DataTypes.LongType |
FloatType | float or Float | DataTypes.FloatType |
DoubleType | double or Double | DataTypes.DoubleType |
DecimalType | java.math.BigDecimal | DataTypes.createDecimalType() or DataTypes.createDecimalType(precision, scale) |
StringType | String | DataTypes.StringType |
BinaryType | byte[] | DataTypes.BinaryType |
BooleanType | boolean or Boolean | DataTypes.BooleanType |
TimestampType | java.sql.Timestamp | DataTypes.TimestampType |
DateType | java.sql.Date | DataTypes.DateType |
ArrayType | java.util.List | DataTypes.createArrayType(elementType) or DataTypes.createArrayType(elementType, containsNull) Note: With the first form, containsNull will be true. |
MapType | java.util.Map | DataTypes.createMapType(keyType, valueType) or DataTypes.createMapType(keyType, valueType, valueContainsNull) Note: With the first form, valueContainsNull will be true. |
StructType | org.apache.spark.sql.Row | DataTypes.createStructType(fields) Note: fields is a List or an array of StructFields. Also, two fields with the same name are not allowed. |
StructField | The value type in Java of the data type of this field (For example, int for a StructField with the data type IntegerType) | DataTypes.createStructField(name, dataType, nullable) |
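Although this table targets Java, the factory methods in DataTypes are ordinary static methods, so the same calls can also be sketched from Scala; the field names below are chosen purely for illustration:

```scala
import org.apache.spark.sql.types.DataTypes

val decimal = DataTypes.createDecimalType(10, 2)
val tags    = DataTypes.createArrayType(DataTypes.StringType)       // containsNull defaults to true
val lookup  = DataTypes.createMapType(DataTypes.StringType, DataTypes.IntegerType)
val person  = DataTypes.createStructType(Array(
  DataTypes.createStructField("name", DataTypes.StringType, false),
  DataTypes.createStructField("age", DataTypes.IntegerType, true)
))
```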
All data types of Spark SQL are located in the package pyspark.sql.types.
You can access them by doing:

from pyspark.sql.types import *
Data type | Value type in Python | API to access or create a data type |
---|---|---|
ByteType | int or long Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | ByteType() |
ShortType | int or long Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | ShortType() |
IntegerType | int or long | IntegerType() |
LongType | long Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, please convert data to decimal.Decimal and use DecimalType. | LongType() |
FloatType | float Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | FloatType() |
DoubleType | float | DoubleType() |
DecimalType | decimal.Decimal | DecimalType() |
StringType | string | StringType() |
BinaryType | bytearray | BinaryType() |
BooleanType | bool | BooleanType() |
TimestampType | datetime.datetime | TimestampType() |
DateType | datetime.date | DateType() |
ArrayType | list, tuple, or array | ArrayType(elementType, [containsNull]) Note: The default value of containsNull is True. |
MapType | dict | MapType(keyType, valueType, [valueContainsNull]) Note: The default value of valueContainsNull is True. |
StructType | list or tuple | StructType(fields) Note: fields is a list of StructFields. Also, two fields with the same name are not allowed. |
StructField | The value type in Python of the data type of this field (For example, int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]) Note: The default value of nullable is True. |
Data type | Value type in R | API to access or create a data type |
---|---|---|
ByteType | integer Note: Numbers will be converted to 1-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -128 to 127. | "byte" |
ShortType | integer Note: Numbers will be converted to 2-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -32768 to 32767. | "short" |
IntegerType | integer | "integer" |
LongType | integer Note: Numbers will be converted to 8-byte signed integer numbers at runtime. Please make sure that numbers are within the range of -9223372036854775808 to 9223372036854775807. | "long" |
FloatType | numeric Note: Numbers will be converted to 4-byte single-precision floating point numbers at runtime. | "float" |
DoubleType | numeric | "double" |
DecimalType | Not supported | Not supported |
StringType | character | "string" |
BinaryType | raw | "binary" |
BooleanType | logical | "boolean" |
TimestampType | POSIXct | "timestamp" |
DateType | Date | "date" |
ArrayType | vector or list | list(type="array", elementType=elementType, containsNull=[containsNull]) Note: The default value of containsNull is TRUE. |
MapType | environment | list(type="map", keyType=keyType, valueType=valueType, valueContainsNull=[valueContainsNull]) Note: The default value of valueContainsNull is TRUE. |
StructType | named list | list(type="struct", fields=fields) Note: fields is a list of StructFields. Also, two fields with the same name are not allowed. |
StructField | The value type in R of the data type of this field (For example, integer for a StructField with the data type IntegerType) | list(name=name, type=dataType, nullable=[nullable]) Note: The default value of nullable is TRUE. |
The following table shows the type names and aliases used in the Spark SQL parser for each data type.
Data type | SQL name |
---|---|
BooleanType | BOOLEAN |
ByteType | BYTE, TINYINT |
ShortType | SHORT, SMALLINT |
IntegerType | INT, INTEGER |
LongType | LONG, BIGINT |
FloatType | FLOAT, REAL |
DoubleType | DOUBLE |
DateType | DATE |
TimestampType | TIMESTAMP |
StringType | STRING |
BinaryType | BINARY |
DecimalType | DECIMAL, DEC, NUMERIC |
CalendarIntervalType | INTERVAL |
ArrayType | ARRAY<element_type> |
StructType | STRUCT<field1_name: field1_type, field2_name: field2_type, …> |
MapType | MAP<key_type, value_type> |
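For illustration, these SQL names can be exercised by casting literals in a query and inspecting the resulting schema; this sketch assumes an active SparkSession named spark, and the column names are made up:

```scala
val df = spark.sql("""
  SELECT CAST('42' AS INT)             AS an_int,
         CAST(1 AS BIGINT)             AS a_bigint,
         CAST('3.14' AS DECIMAL(4, 2)) AS a_decimal,
         CAST('2020-01-01' AS DATE)    AS a_date,
         ARRAY(1, 2, 3)                AS an_array
""")
df.printSchema()  // shows the resolved Catalyst type for each column
```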
Floating Point Special Values
Spark SQL supports several special floating point values in a case-insensitive manner (see the sketch after this list):
- Inf/+Inf/Infinity/+Infinity: positive infinity
  - FloatType: equivalent to Scala Float.PositiveInfinity.
  - DoubleType: equivalent to Scala Double.PositiveInfinity.
- -Inf/-Infinity: negative infinity
  - FloatType: equivalent to Scala Float.NegativeInfinity.
  - DoubleType: equivalent to Scala Double.NegativeInfinity.
- NaN: not a number
  - FloatType: equivalent to Scala Float.NaN.
  - DoubleType: equivalent to Scala Double.NaN.
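Because parsing is case-insensitive, the exact spelling of these literals does not matter; a small sketch, assuming an active SparkSession named spark:

```scala
spark.sql("SELECT float('iNf') AS f_inf, double('-INFINITY') AS d_neg_inf, double('NaN') AS d_nan").show()
// +--------+---------+-----+
// |   f_inf|d_neg_inf|d_nan|
// +--------+---------+-----+
// |Infinity|-Infinity|  NaN|
// +--------+---------+-----+
```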
Positive/Negative Infinity Semantics
There is special handling for positive and negative infinity. They have the following semantics (see the sketch after this list):
- Positive infinity multiplied by any positive value returns positive infinity.
- Negative infinity multiplied by any positive value returns negative infinity.
- Positive infinity multiplied by any negative value returns negative infinity.
- Negative infinity multiplied by any negative value returns positive infinity.
- Positive/negative infinity multiplied by 0 returns NaN.
- Positive/negative infinity is equal to itself.
- In aggregations, all positive infinity values are grouped together. Similarly, all negative infinity values are grouped together.
- Positive infinity and negative infinity are treated as normal values in join keys.
- Positive infinity sorts lower than NaN and higher than any other value.
- Negative infinity sorts lower than any other value.
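A brief Scala sketch of the multiplication and equality rules above, using the DataFrame API and assuming an active SparkSession named spark; the SQL examples at the end of this page show the same behavior from SQL:

```scala
import org.apache.spark.sql.functions._

val inf = lit(Double.PositiveInfinity)
spark.range(1).select(
  (inf * lit(2.0)).as("inf_times_positive"),   // Infinity
  (inf * lit(-2.0)).as("inf_times_negative"),  // -Infinity
  (inf * lit(0.0)).as("inf_times_zero"),       // NaN
  (inf === inf).as("inf_equals_inf")           // true
).show()
```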
NaN Semantics
There is special handling for not-a-number (NaN) when dealing with float or double types; it does not exactly match standard floating point semantics. Specifically (see the sketch after this list):
- NaN = NaN returns true.
- In aggregations, all NaN values are grouped together.
- NaN is treated as a normal value in join keys.
- NaN values sort last in ascending order, larger than any other numeric value.
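The same rules can be seen through the DataFrame API; a minimal sketch, assuming an active SparkSession named spark:

```scala
import spark.implicits._

val df = Seq(1.0, Double.NaN, Double.NegativeInfinity, Double.PositiveInfinity, Double.NaN).toDF("v")
df.orderBy($"v".asc).show()       // -Infinity, 1.0, Infinity, NaN, NaN: NaN sorts last
df.groupBy($"v").count().show()   // the two NaN rows fall into a single group
```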
Examples
SELECT double('infinity') AS col;
+--------+
| col|
+--------+
|Infinity|
+--------+
SELECT float('-inf') AS col;
+---------+
| col|
+---------+
|-Infinity|
+---------+
SELECT float('NaN') AS col;
+---+
|col|
+---+
|NaN|
+---+
SELECT double('infinity') * 0 AS col;
+---+
|col|
+---+
|NaN|
+---+
SELECT double('-infinity') * (-1234567) AS col;
+--------+
| col|
+--------+
|Infinity|
+--------+
SELECT double('infinity') < double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+
SELECT double('NaN') = double('NaN') AS col;
+----+
| col|
+----+
|true|
+----+
SELECT double('inf') = double('infinity') AS col;
+----+
| col|
+----+
|true|
+----+
CREATE TABLE test (c1 int, c2 double);
INSERT INTO test VALUES (1, double('infinity'));
INSERT INTO test VALUES (2, double('infinity'));
INSERT INTO test VALUES (3, double('inf'));
INSERT INTO test VALUES (4, double('-inf'));
INSERT INTO test VALUES (5, double('NaN'));
INSERT INTO test VALUES (6, double('NaN'));
INSERT INTO test VALUES (7, double('-infinity'));
SELECT COUNT(*), c2 FROM test GROUP BY c2;
+---------+---------+
| count(1)| c2|
+---------+---------+
| 2| NaN|
| 2|-Infinity|
| 3| Infinity|
+---------+---------+