How to Easily Extract Salesforce Data Using PySpark in Python

Extracting Salesforce data with PySpark can be a daunting task, especially when you’re faced with outdated modules and tricky dependencies. After spending days figuring out the process, we wanted to share a simplified guide to help you avoid the same struggles.
If you prefer the video version, click here!
Key Steps to Extract Salesforce Data
Understand the PySpark Module
- PySpark uses the
spark-salesforce
module, which is functional but not actively maintained. - While it offers support for Scala, Java, and R, its Python support isn’t as straightforward. Here is an example for Python usage.
df = spark.read.format("com.springml.spark.salesforce") \
.option("username", username) \
.option("password", f"{password}{security_token}") \
.option("soql", soql) \
.option("version", "58.0") \
.load()
Prerequisites
- Username: Your Salesforce email.
- Password and Security Token: Concatenated together.
- SOQL Queries: Salesforce’s version of SQL.
Manage Dependencies
Dependencies are a major roadblock. The module’s official repository lists one dependency, but it doesn’t work for PySpark. After trial and error, I curated the required JAR dependencies.
I’ve forked the original repository, added these dependencies, and updated the README for clarity. Here is the link to my fork.
These dependencies make the module compatible with PySpark, saving you countless hours.
$ bin/spark-shell --packages com.springml:spark-salesforce_2.12:1.1.4,org.mockito:mockito-core:2.0.31-beta,org.codehaus.woodstox:woodstox-core-asl:4.4.0,com.springml:salesforce-wave-api:1.0.10,com.force.api:force-partner-api:40.0.0,com.force.api:force-wsc:40.0.0,com.fasterxml.jackson.dataformat:jackson-dataformat-xml:2.4.4
Learn SOQL Basics
SOQL (Salesforce Object Query Language) is similar to SQL but tailored for Salesforce’s object-based structure. Features like WHERE date = LAST_N_DAYS
are unique and useful. Use extensions like Salesforce Inspector for quick query exports and testing.
Tip:
To bypass SOQL’s limitation of 200 rows per query, extract columns first (LIMIT 1
), then use the metadata to dynamically query all data. This method prevents manual updates when fields are added or removed.
Handle API Inconsistencies
Salesforce’s displayed field names differ from the API field names.
Use the Tooling API and the sfdc-cli
Python package to map display names to API names. Query the FieldDefinition
object to fetch mappings for seamless integration.
Important: Ensure you have the right permissions and know whether you’re working in a sandbox or production environment.
Why This Matters
By following this guide, you’ll avoid the frustration of outdated documentation and mismatched dependencies. Whether you’re building ETL pipelines or exploring Salesforce data for insights, this method streamlines the process, ensuring scalability and efficiency.
Bonus: GitHub repository
Looking to enhance your data engineering processes or scale your pipeline architecture? At enabledata.io, we offer expert insights and tailored solutions. Schedule a FREE consulting call today or reach out to us at contact@enabledata.io. Let’s build smarter, scalable solutions together!
Thank you for being a part of the community
Before you go:
- Be sure to clap and follow the writer ️👏️️
- Follow us: X | LinkedIn | YouTube | Newsletter | Podcast | Differ
- Check out CoFeed, the smart way to stay up-to-date with the latest in tech 🧪
- Start your own free AI-powered blog on Differ 🚀
- Join our content creators community on Discord 🧑🏻💻
- For more content, visit plainenglish.io + stackademic.com