Counting Null, NaN and Empty Values in PySpark and Spark Dataframes

A critical data quality check in machine learning and analytics workloads is determining how many data points in the source data being prepared for processing have null, NaN or empty values, with a view to either dropping them or replacing them with meaningful values. Depending on the context, it is generally the case that the fewer null, NaN or empty values a dataset contains, the better the quality of the data.

In Spark, a null value represents the absence of a value: it is used for instances where no useful value exists. Note, however, that null is different from an empty string or a zero value.
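
To make the distinction concrete, the small illustrative PySpark snippet below (assuming an active spark session) shows that isnull flags a true null but not an empty string or a zero:

# null vs empty string vs zero (illustrative only)
import pyspark.sql.functions as F
demo = spark.createDataFrame([(None,), ("",), ("0",)], ["value"])
demo.select("value", F.isnull("value").alias("is_null")).show()
# only the None row is flagged as null; "" and "0" are not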

NaN (Not-a-Number) is used to denote values that are not valid numbers. In Python, NaN is a special floating-point value: it exists only as a float and cannot be converted to any other numeric type.
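
Nulls and NaNs are therefore detected by different functions in PySpark. The short illustrative snippet below (again assuming an active spark session) shows that isnan flags NaN but not null, while isnull does the opposite:

# NaN vs null (illustrative only)
import pyspark.sql.functions as F
nan_demo = spark.createDataFrame([(float("nan"),), (None,), (1.0,)], ["value"])
nan_demo.select("value",
                F.isnan("value").alias("is_nan"),
                F.isnull("value").alias("is_null")).show()
# float("nan") is caught by isnan but not by isnull; None is the reverse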

The goal of this post is to design a reusable function, in both Scala and Python, that makes it easy to extract the counts of all null, NaN and empty values from a given dataframe or table.

Use Case

If you need to count null, NaN and blank values across all the columns of your table or dataframe, the helper functions described below might be helpful.

Solution Steps

  • Get all the dataframe column names into an array/list
  • Create an empty array/list to temporarily store column names and count values
  • Iterate over the column names, selecting, filtering and counting each field's null, NaN and blank values
  • Append the results of each iteration to the array/list
  • Convert the array/list into a new dataframe

Code Design - Spark Scala


// import packages 
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer


// note: the extra empty parameter list keeps the function curried, so that
// getNullNanBlankCount() can be passed straight to DataFrame.transform (see Call Function below)
def getNullNanBlankCount()(df: DataFrame): DataFrame = {
  
   // get all the column names into an array
  val allColsArr = df.columns
  
  // create an empty array buffer to temporarily store column names and count values
  val arrBuffer = ArrayBuffer[(String, Long, Long, Long)]()
  
  // iterate the columns array, select each column, filter and count its null, nan and blank values
  for(field <- allColsArr) { 
  
    // count nulls
    val nullCondition = col(field).isNull 
    val nullCount = df.select(col(field)).filter(nullCondition).count()  
    
    // count nans
    val nanCondition = col(field).isNaN
    val nanCount = df.select(col(field)).filter(nanCondition).count()  
    
    // count blank/empty values (empty or whitespace-only strings)
    val blankCount = df.select(col(field)).filter(trim(col(field)) === "").count()  
     
    // append column names and count values to the array buffer
    arrBuffer.append((field, nullCount, nanCount, blankCount))
  }
   
  // convert the array buffer into a new dataframe
  val resDf = spark.createDataFrame(arrBuffer.toSeq).toDF("table_column_name", "null_count", "nan_count", "blank_count")
 
  resDf
}

Code Design - PySpark


# import packages 
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types


def getNullNanBlankCount(df: DataFrame) -> DataFrame: 
  
  # get all the column names into a list
  allColsList = df.columns
  
  # create an empty list to temporarily store column names and count values
  listBuffer = []
  
  for field in allColsList:   
    # count nulls
    nullCondition = F.isnull(F.col(field))
    nullCount = df.select(F.col(field)).filter(nullCondition).count()  
    
    # count nans
    nanCondition = F.isnan(F.col(field))
    nanCount = df.select(F.col(field)).filter(nanCondition).count()  
    
    # count blank/empty values (empty or whitespace-only strings)
    blankCount = df.select(F.col(field)).filter(F.trim(F.col(field)) == "").count()  
     
    # append column names and count values to the list
    listBuffer.append((field, nullCount, nanCount, blankCount))

  # convert listBuffer into a new dataframe
  resDf = spark.createDataFrame(listBuffer, ["table_column_name", "null_count", "nan_count", "blank_count"])
 
  return resDf
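
A note on performance: the loop above runs three Spark jobs per column, which is fine for narrow tables but can become slow on wide ones. A possible single-pass variant is sketched below (illustrative only, not part of the original helper; the function name and the double-underscore aliases are made up for this example). It collects every count in one aggregation:


# alternative sketch: compute all counts in a single aggregation pass
def getNullNanBlankCountSinglePass(df: DataFrame) -> DataFrame:
  aggs = []
  for field in df.columns:
    c = F.col(field)
    aggs.append(F.count(F.when(c.isNull(), 1)).alias(field + "__null"))
    aggs.append(F.count(F.when(F.isnan(c), 1)).alias(field + "__nan"))
    aggs.append(F.count(F.when(F.trim(c) == "", 1)).alias(field + "__blank"))

  # a single job over the data returns one wide row holding every count
  row = df.agg(*aggs).collect()[0]

  # reshape the wide row into one record per column
  listBuffer = [(field, row[field + "__null"], row[field + "__nan"], row[field + "__blank"])
                for field in df.columns]
  return spark.createDataFrame(listBuffer, ["table_column_name", "null_count", "nan_count", "blank_count"])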
  
  

Get A Sample Dataframe


# PySpark Sample Dataframe
pdf = (spark.read
         .option("inferSchema", "true")
         .option("header", "true")
         .csv("/databricks-datasets/definitive-guide/data/bike-data/201508_trip_data.csv"))
pdf.show(10, truncate=False)


# Sample Dataframe Records 
+-------+--------+---------------+---------------------------------------------+--------------+---------------+---------------------------------------------+------------+------+---------------+--------+
|Trip ID|Duration|Start Date     |Start Station                                |Start Terminal|End Date       |End Station                                  |End Terminal|Bike #|Subscriber Type|Zip Code|
+-------+--------+---------------+---------------------------------------------+--------------+---------------+---------------------------------------------+------------+------+---------------+--------+
|913460 |765     |8/31/2015 23:26|Harry Bridges Plaza (Ferry Building)         |50            |8/31/2015 23:39|San Francisco Caltrain (Townsend at 4th)     |70          |288   |Subscriber     |2139    |
|913459 |1036    |8/31/2015 23:11|San Antonio Shopping Center                  |31            |8/31/2015 23:28|Mountain View City Hall                      |27          |35    |Subscriber     |95032   |
|913455 |307     |8/31/2015 23:13|Post at Kearny                               |47            |8/31/2015 23:18|2nd at South Park                            |64          |468   |Subscriber     |94107   |
|913454 |409     |8/31/2015 23:10|San Jose City Hall                           |10            |8/31/2015 23:17|San Salvador at 1st                          |8           |68    |Subscriber     |95113   |
|913453 |789     |8/31/2015 23:09|Embarcadero at Folsom                        |51            |8/31/2015 23:22|Embarcadero at Sansome                       |60          |487   |Customer       |9069    |
|913452 |293     |8/31/2015 23:07|Yerba Buena Center of the Arts (3rd @ Howard)|68            |8/31/2015 23:12|San Francisco Caltrain (Townsend at 4th)     |70          |538   |Subscriber     |94118   |
|913451 |896     |8/31/2015 23:07|Embarcadero at Folsom                        |51            |8/31/2015 23:22|Embarcadero at Sansome                       |60          |363   |Customer       |92562   |
|913450 |255     |8/31/2015 22:16|Embarcadero at Sansome                       |60            |8/31/2015 22:20|Steuart at Market                            |74          |470   |Subscriber     |94111   |
|913449 |126     |8/31/2015 22:12|Beale at Market                              |56            |8/31/2015 22:15|Temporary Transbay Terminal (Howard at Beale)|55          |439   |Subscriber     |94130   |
|913448 |932     |8/31/2015 21:57|Post at Kearny                               |47            |8/31/2015 22:12|South Van Ness at Market                     |66          |472   |Subscriber     |94702   |
+-------+--------+---------------+---------------------------------------------+--------------+---------------+---------------------------------------------+------------+------+---------------+--------+
only showing top 10 rows

Call Function


// Spark Scala - get count of nulls, NaNs and blank values
// (sdf is assumed to be the same sample data loaded as a Scala DataFrame)
sdf.transform(getNullNanBlankCount()).show(false)


# PySpark - get count of nulls, NaNs and blank values
pdf.transform(getNullNanBlankCount).show(truncate=False)
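
Note that DataFrame.transform is only available in PySpark from version 3.0 onwards; on older versions the helper can simply be called directly:

# equivalent direct call, no DataFrame.transform required
getNullNanBlankCount(pdf).show(truncate=False)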

Function Output


+-----------------+----------+---------+-----------+
|table_column_name|null_count|nan_count|blank_count|
+-----------------+----------+---------+-----------+
|Trip ID          |0         |0        |0          |
|Duration         |0         |0        |0          |
|Start Date       |0         |0        |0          |
|Start Station    |0         |0        |0          |
|Start Terminal   |0         |0        |0          |
|End Date         |0         |0        |0          |
|End Station      |0         |0        |0          |
|End Terminal     |0         |0        |0          |
|Bike #           |0         |0        |0          |
|Subscriber Type  |0         |0        |0          |
|Zip Code         |278       |0        |0          |
+-----------------+----------+---------+-----------+
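
With the counts in hand, the affected rows can either be dropped or patched, in line with the goal stated at the start. For example, to deal with the 278 nulls reported for the Zip Code column (an illustrative sketch; whether dropping or filling is the right call depends on the downstream use):

# drop the rows whose Zip Code is null ...
cleaned = pdf.na.drop(subset=["Zip Code"])

# ... or keep them and substitute a placeholder instead
# (coalesce + cast sidesteps the column's inferred type)
patched = pdf.withColumn("Zip Code",
                         F.coalesce(F.col("Zip Code").cast("string"), F.lit("unknown")))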

Thanks for reading.