Counting Null, NaN and Empty Values in PySpark and Spark Scala DataFrames
A critical data quality check in machine learning and analytics workloads is determining how many data points in the source data have null, NaN or empty values, with a view to either dropping them or replacing them with meaningful values. As a general rule, the fewer null, NaN or empty values a dataset contains, the better its quality.
In Spark, a null value means nothing or no value; it is used to represent instances where no useful value exists. Note, however, that null is different from an empty string or a zero value.
NaN (Not-a-Number) is used to denote a value that is not a valid number. In Python, NaN is a special floating-point value, so it only appears in float (in Spark, float/double) columns.
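To make the distinction concrete, here is a minimal PySpark sketch (it assumes an active SparkSession named spark): a Python None becomes a SQL null, while float("nan") stays a NaN in a double column, and the two are detected by different functions.
# null vs NaN - minimal illustration (assumes an active SparkSession named spark)
from pyspark.sql import functions as F
demo = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["value"])
demo.select(
    F.count(F.when(F.isnull("value"), 1)).alias("null_count"),
    F.count(F.when(F.isnan("value"), 1)).alias("nan_count")
).show()
# null_count = 1 and nan_count = 1 - the None row and the NaN row are counted separately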
The goal of this post is to design a reusable function, in both Scala and Python, that easily extracts the counts of all null, NaN and empty values from a given dataframe or table.
Use Case
If you need to count null, nan and blank values across all the columns in your table or dataframe, the helper functions described below might be helpful.
Solution Steps
- Get all your dataframe column names into an array/list
- Create an empty array/list to temporarily store column names and count values
- Iterate over the fields in the column array, selecting, filtering and counting each field's null, NaN and blank values
- Append results of the iteration to the array/list
- Convert the array/list into a new dataframe
Code Design - Spark Scala
// import packages
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer
def getNullNanBlankCount()(df: DataFrame): DataFrame = {
  // get all the column names into an array
  val allColsArr = df.columns
  // create an empty array buffer to temporarily store column names and count values
  val arrBuffer = ArrayBuffer[(String, Long, Long, Long)]()
  // iterate the columns array; select each column, then filter and count its null, NaN and blank values
  for (field <- allColsArr) {
    // count nulls
    val nullCondition = col(field).isNull
    val nullCount = df.select(col(field)).filter(nullCondition).count()
    // count NaNs
    val nanCondition = col(field).isNaN
    val nanCount = df.select(col(field)).filter(nanCondition).count()
    // count blank/empty values (empty strings or whitespace-only strings)
    val blankCount = df.select(col(field)).filter(trim(col(field)) === "").count()
    // append the column name and count values to the array buffer
    arrBuffer.append((field, nullCount, nanCount, blankCount))
  }
  // convert the array buffer into a new dataframe
  val resDf = spark.createDataFrame(arrBuffer.toSeq).toDF("table_column_name", "null_count", "nan_count", "blank_count")
  resDf
}
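One design detail worth noting: the empty first parameter list means that calling getNullNanBlankCount() returns a DataFrame => DataFrame function, which is exactly the shape DataFrame.transform expects, so the helper chains cleanly with other transformations (see the Call Function section below).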
Code Design - PySpark
# import packages
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types
def getNullNanBlankCount(df: DataFrame) -> DataFrame:
    # get all the column names into a list
    allColsList = df.columns
    # create an empty list to temporarily store column names and count values
    listBuffer = []
    # iterate the columns list; select each column, then filter and count its null, NaN and blank values
    for field in allColsList:
        # count nulls
        nullCondition = F.isnull(F.col(field))
        nullCount = df.select(F.col(field)).filter(nullCondition).count()
        # count NaNs
        nanCondition = F.isnan(F.col(field))
        nanCount = df.select(F.col(field)).filter(nanCondition).count()
        # count blank/empty values (empty strings or whitespace-only strings)
        blankCount = df.select(F.col(field)).filter(F.trim(F.col(field)) == "").count()
        # append the column name and count values to the list
        listBuffer.append((field, nullCount, nanCount, blankCount))
    # convert listBuffer into a new dataframe
    resDf = spark.createDataFrame(listBuffer, ["table_column_name", "null_count", "nan_count", "blank_count"])
    return resDf
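The per-column filter-and-count pattern above is easy to read, but it launches three separate Spark jobs for every column. For wide or large tables, a single-pass variant that computes every count in one aggregation can be noticeably faster. The sketch below is one possible way to do this; the helper name getNullNanBlankCountSinglePass is hypothetical, and the NaN check is guarded so it only runs against float/double columns. Treat it as an optional optimisation, not part of the original design.
# Single-pass variant (sketch): build one aggregate expression per column and metric,
# run a single job, then reshape the one result row into the same output layout
def getNullNanBlankCountSinglePass(df: DataFrame) -> DataFrame:
    floatCols = {name for name, dtype in df.dtypes if dtype in ("float", "double")}
    aggs = []
    for field in df.columns:
        c = F.col(field)
        # count(when(cond, 1)) counts only the rows where cond is true
        aggs.append(F.count(F.when(c.isNull(), 1)).alias(field + "__null"))
        nanCond = F.isnan(c) if field in floatCols else F.lit(False)
        aggs.append(F.count(F.when(nanCond, 1)).alias(field + "__nan"))
        aggs.append(F.count(F.when(F.trim(c) == "", 1)).alias(field + "__blank"))
    row = df.agg(*aggs).collect()[0]
    counts = [(field, row[field + "__null"], row[field + "__nan"], row[field + "__blank"])
              for field in df.columns]
    return spark.createDataFrame(counts, ["table_column_name", "null_count", "nan_count", "blank_count"])
Calling pdf.transform(getNullNanBlankCountSinglePass).show(truncate=False) should produce the same table layout as the original helper.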
Get A Sample Dataframe
# PySpark Sample Dataframe
pdf = spark.read.option("inferSchema", "true").option("header", "true").csv("/databricks-datasets/definitive-guide/data/bike-data/201508_trip_data.csv")
pdf.show(10, truncate=False)
# Sample Dataframe Records
+-------+--------+---------------+---------------------------------------------+--------------+---------------+---------------------------------------------+------------+------+---------------+--------+
|Trip ID|Duration|Start Date |Start Station |Start Terminal|End Date |End Station |End Terminal|Bike #|Subscriber Type|Zip Code|
+-------+--------+---------------+---------------------------------------------+--------------+---------------+---------------------------------------------+------------+------+---------------+--------+
|913460 |765 |8/31/2015 23:26|Harry Bridges Plaza (Ferry Building) |50 |8/31/2015 23:39|San Francisco Caltrain (Townsend at 4th) |70 |288 |Subscriber |2139 |
|913459 |1036 |8/31/2015 23:11|San Antonio Shopping Center |31 |8/31/2015 23:28|Mountain View City Hall |27 |35 |Subscriber |95032 |
|913455 |307 |8/31/2015 23:13|Post at Kearny |47 |8/31/2015 23:18|2nd at South Park |64 |468 |Subscriber |94107 |
|913454 |409 |8/31/2015 23:10|San Jose City Hall |10 |8/31/2015 23:17|San Salvador at 1st |8 |68 |Subscriber |95113 |
|913453 |789 |8/31/2015 23:09|Embarcadero at Folsom |51 |8/31/2015 23:22|Embarcadero at Sansome |60 |487 |Customer |9069 |
|913452 |293 |8/31/2015 23:07|Yerba Buena Center of the Arts (3rd @ Howard)|68 |8/31/2015 23:12|San Francisco Caltrain (Townsend at 4th) |70 |538 |Subscriber |94118 |
|913451 |896 |8/31/2015 23:07|Embarcadero at Folsom |51 |8/31/2015 23:22|Embarcadero at Sansome |60 |363 |Customer |92562 |
|913450 |255 |8/31/2015 22:16|Embarcadero at Sansome |60 |8/31/2015 22:20|Steuart at Market |74 |470 |Subscriber |94111 |
|913449 |126 |8/31/2015 22:12|Beale at Market |56 |8/31/2015 22:15|Temporary Transbay Terminal (Howard at Beale)|55 |439 |Subscriber |94130 |
|913448 |932 |8/31/2015 21:57|Post at Kearny |47 |8/31/2015 22:12|South Van Ness at Market |66 |472 |Subscriber |94702 |
+-------+--------+---------------+---------------------------------------------+--------------+---------------+---------------------------------------------+------------+------+---------------+--------+
only showing top 10 rows
Call Function
// Spark Scala - get count of nulls, NaNs and blank values (sdf is assumed to be the same trip CSV loaded into a Scala DataFrame)
sdf.transform(getNullNanBlankCount()).show(false)
# PySpark - get count of nulls, NaNs and blank values
pdf.transform(getNullNanBlankCount).show(truncate=False)
Function Output
+-----------------+----------+---------+-----------+
|table_column_name|null_count|nan_count|blank_count|
+-----------------+----------+---------+-----------+
|Trip ID |0 |0 |0 |
|Duration |0 |0 |0 |
|Start Date |0 |0 |0 |
|Start Station |0 |0 |0 |
|Start Terminal |0 |0 |0 |
|End Date |0 |0 |0 |
|End Station |0 |0 |0 |
|End Terminal |0 |0 |0 |
|Bike # |0 |0 |0 |
|Subscriber Type |0 |0 |0 |
|Zip Code |278 |0 |0 |
+-----------------+----------+---------+-----------+
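From the output, only the Zip Code column contains nulls (278 rows). If you then decided to drop those rows or substitute a default, the usual PySpark follow-ups would look something like the lines below (assuming Zip Code was inferred as a string column):
# drop rows where Zip Code is null, or replace the nulls with a placeholder value
cleaned = pdf.na.drop(subset=["Zip Code"])
filled = pdf.na.fill({"Zip Code": "unknown"})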
Thanks for reading.