Open Password-Protected Excel Files in Python Without Manual Input of Password
4 min read

Open Password-Protected Excel Files in Python Without Manual Input of Password

Open Password-Protected Excel Files in Python Without Manual Input of Password

Spreadsheet is a common tool in many organizations for inputting, computing, processing and presenting data. Microsoft Excel, as a variant of spreadsheet is particularly popular for it's ease of use and less efforts required to learn and use it for meaningful tasks.

In analytics world, Excel files, like CSV files are ubiquitous. Oftentimes, data sources that are required for downstream analytical processing or transformation are usually in excel or in csv format. Hence almost all data processing tools, framework or libraries have features to load, read or parse excel and csv files.

When processing excel files, it is common to come across files that are password-protected as a result of contents that are confidential. Such files are easy to open and use as long as the password used to lock them is known to the user. This simple manual task however, can become tedious where there are multitude of such protected files that are needed to be unlocked or decrypted before they can be consumed for further processing. In such a situation, even when the password is known, it is impractical to manually input password on Excel interface to open each file.

It would thus be desirable to design a reusable Python function that can read a password-protected excel file programmatically without having to manually type the password in Excel. The function will automate the process of decrypting/unlocking password-protected excel file, read the data and expose the outcome as dataframe that can be consumed by downstream processing tools. This is the the goal of this post.

In my search for a potential Python package that can handle the situation described above, a number of libraries came up but most of them have one or two downsides that are not helpful in achieving the purpose of this post. For example, pywin32 library can accept password input programmatically when reading a password-protected Excel file. However, it has a limitation that it can only work in a Windows environment.

I eventually settled on a library called msoffcrypto-tool. According to the developer of the package, msoffcrypto-tool is a "Python tool and library for decrypting encrypted MS Office files with password, intermediate key, or private key which generated its escrow key". See the project page here.

Charles Heckroth provides a detailed tutorial on the msoffcrypto-tool usage on his blog page. See it here.

Solution Steps

  • Define the path of the Excel file to be read or opened. For the purpose of this post, the password-protected Excel file is assumed to be located in an S3 bucket.
  • Create an in-memory ByteIO object where the byte streams from the Excel file will be written to.
  • Open the protected file in binary mode. This mode allows data to be read and written as bytes objects.
  • Provide the file password to the msoffcrypto-tool's load_key method to decrypt the file. To ensure confidentiality, the password can be stored in the AWS System Manager's Parameter Store and referenced in the code.
  • Use Pandas library to read and load the decrypted file as a dataframe.
  • Optionally convert Pandas dataframe to Spark dataframe for distributed data processing use cases.

Code Design


from pyspark.sql import DataFrame
import io
import msoffcrypto as mso
import pandas as pd
import pyspark.pandas as ps
import s3fs
import openpyxl


def readPasswordProtectedXLFileFromS3(objectPath, password, **kwargs):
  
  """ A function to read/load a password-protected Excel file without having to manually type it in Excel interface
    Args:
        objectPath (str): The path to the object in the S3 bucket.
        password (str): The password to the object.
        **kwargs: Any additional arguments to pass to the Pandas's read_excel method. E.g. header=1, sheet_name='testSheet'
    Returns:
        DataFrame: Pandas and Spark DataFrames.
  """
  
  # mount s3 like local fs
  s3 = s3fs.S3FileSystem (anon=False)

  # create an in-memory ByteIO object 
  decrypted_wb = io.BytesIO()

  with s3.open(objectPath, 'rb') as xlsfile:
    
    # open the protected file
    office_file = mso.OfficeFile(xlsfile)
    
    # provide the password
    office_file.load_key(password=password)
    
    # decrypt and write to output file
    office_file.decrypt(decrypted_wb)
    
  # read the output file with pandas
  pandasDf = pd.read_excel(decrypted_wb, engine="openpyxl", **kwargs)
    
  # extract spark dataframe from pandas dataframe
  sparkDf = ps.from_pandas(pandasDf).to_spark()
    
  return (pandasDf, sparkDf)
    
    

Function Test

We will test the function using a sample password-protected excel file (LMIA2021p.xlsx) located in an S3 bucket. To enable reading the excel file using Python's open method, we will mount the S3 like local file system using s3fs Python library as shown below.

Note that you have to provide AWS credentials (key and secret ids) when using s3fs method to open a file in S3 bucket from outside your AWS account. Otherwise this will lead to a permission error.


s3 = s3fs.S3FileSystem (anon=False, key="XXXXXXXXXXXXXXXXXXXX", secret="YYYYYYYYYYYYYYYYYY")

The password to unlock the file will be fetched as secured string from AWS System Manager's Parameter store using the code below:


import boto3

ssm = boto3.client("ssm") 
parameter = ssm.get_parameter(Name="SampleExcelPassword", WithDecryption=True)["Parameter"]
file_password = parameter["Value"]

Function Output


# get both pandas and spark dataframes
pDf, sDf = readPasswordProtectedXLFileFromS3("s3://xxxxx/LMIA2021p.xlsx", file_password, header=1)


# get only pandas dataframe
pDf, _ = readPasswordProtectedXLFileFromS3("s3://xxxxx/LMIA2021p.xlsx", file_password, header=1)

pDf.head()


# get only spark dataframe
_, sDf = readPasswordProtectedXLFileFromS3("s3://xxxxx/LMIA2021p.xlsx", file_password, header=1)

sDf.show(truncate=False)

+-------------------------+--------------+---------------------------------------+----------------------------------+-----------------------------------+-----------------------+--------------+------------------+
|Province/Territory       |Program Stream|Employer                               |Address                           |Occupation                         |Incorporate Status     |Approved LMIAs|Approved Positions|
+-------------------------+--------------+---------------------------------------+----------------------------------+-----------------------------------+-----------------------+--------------+------------------+
|Newfoundland and Labrador|    High Wage |Central Regional Health Authority      |GRAND FALLS - WINDSOR, NL A2A 2E1 |3111-Specialist physicians         |Non-Profit Organization|2.0           |3.0               |
|Newfoundland and Labrador|    High Wage |Eastern Regional Health Authority      |Mount Pearl, NL A1N 3J5           |3111-Specialist physicians         |Unknown                |2.0           |2.0               |
|Newfoundland and Labrador|    High Wage |Green Bay Fibre Products ltd           |King's Point, NL A0J 1H0          |9531-Boat assemblers and inspectors|Unknown                |1.0           |3.0               |
|Newfoundland and Labrador|   Low Wage   |10565 NFLD Inc                         |Happy Valley-Goose Bay, NL A0P 1C0|6731-Light duty cleaners           |Unknown                |1.0           |2.0               |
|Newfoundland and Labrador|   Low Wage   |67527 Newfoundland and Labrador Limited|COME-BY-CHANCE, NL A0B 1N0        |6322-Cooks                         |Unknown                |1.0           |3.0               |
+-------------------------+--------------+---------------------------------------+----------------------------------+-----------------------------------+-----------------------+--------------+------------------+

See the code in my Github repository.

Thanks for reading.