How to Easily Extract Salesforce Data Using PySpark in Python

Published in

Python in Plain English

3 min readMar 4, 2025

Extracting Salesforce data with PySpark can be a daunting task, especially when you’re faced with outdated modules and tricky dependencies. After spending days figuring out the process, we wanted to share a simplified guide to help you avoid the same struggles.

If you prefer the video version, click here!

Key Steps to Extract Salesforce Data

Understand the PySpark Module

PySpark uses the spark-salesforce module, which is functional but not actively maintained.
While it offers support for Scala, Java, and R, its Python support isn’t as straightforward. Here is an example for Python usage.

df = spark.read.format("com.springml.spark.salesforce") \
        .option("username", username) \
        .option("password", f"{password}{security_token}") \
        .option("soql", soql) \
        .option("version", "58.0") \
        .load()

Prerequisites

Username: Your Salesforce email.
Password and Security Token: Concatenated together.
SOQL Queries: Salesforce’s version of SQL.

Manage Dependencies

Dependencies are a major roadblock. The module’s official repository lists one dependency, but it doesn’t work for PySpark. After trial and error, I curated the required JAR dependencies.

I’ve forked the original repository, added these dependencies, and updated the README for clarity. Here is the link to my fork.

These dependencies make the module compatible with PySpark, saving you countless hours.

$ bin/spark-shell --packages com.springml:spark-salesforce_2.12:1.1.4,org.mockito:mockito-core:2.0.31-beta,org.codehaus.woodstox:woodstox-core-asl:4.4.0,com.springml:salesforce-wave-api:1.0.10,com.force.api:force-partner-api:40.0.0,com.force.api:force-wsc:40.0.0,com.fasterxml.jackson.dataformat:jackson-dataformat-xml:2.4.4

Learn SOQL Basics

SOQL (Salesforce Object Query Language) is similar to SQL but tailored for Salesforce’s object-based structure. Features like WHERE date = LAST_N_DAYS are unique and useful. Use extensions like Salesforce Inspector for quick query exports and testing.

Tip:

To bypass SOQL’s limitation of 200 rows per query, extract columns first (LIMIT 1), then use the metadata to dynamically query all data. This method prevents manual updates when fields are added or removed.

Handle API Inconsistencies

Salesforce’s displayed field names differ from the API field names.

Use the Tooling API and the sfdc-cli Python package to map display names to API names. Query the FieldDefinition object to fetch mappings for seamless integration.

Important: Ensure you have the right permissions and know whether you’re working in a sandbox or production environment.

Why This Matters

By following this guide, you’ll avoid the frustration of outdated documentation and mismatched dependencies. Whether you’re building ETL pipelines or exploring Salesforce data for insights, this method streamlines the process, ensuring scalability and efficiency.

Bonus: GitHub repository

Looking to enhance your data engineering processes or scale your pipeline architecture? At enabledata.io, we offer expert insights and tailored solutions. Schedule a FREE consulting call today or reach out to us at contact@enabledata.io. Let’s build smarter, scalable solutions together!

Thank you for being a part of the community

Before you go:

Be sure to clap and follow the writer ️👏️️
Follow us: X | LinkedIn | YouTube | Newsletter | Podcast | Differ
Check out CoFeed, the smart way to stay up-to-date with the latest in tech 🧪
Start your own free AI-powered blog on Differ 🚀
Join our content creators community on Discord 🧑🏻‍💻
For more content, visit plainenglish.io + stackademic.com

Python in Plain English

How to Easily Extract Salesforce Data Using PySpark in Python

Key Steps to Extract Salesforce Data

Manage Dependencies

Learn SOQL Basics

Tip:

Handle API Inconsistencies

Why This Matters

Thank you for being a part of the community

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in Python in Plain English

Written by Aline Rodrigues

No responses yet