Python in Plain English

New Python content every day. Follow to join our 3.5M+ monthly readers.

Follow publication

How to Easily Extract Salesforce Data Using PySpark in Python

--

Extracting Salesforce data with PySpark can be a daunting task, especially when you’re faced with outdated modules and tricky dependencies. After spending days figuring out the process, we wanted to share a simplified guide to help you avoid the same struggles.

If you prefer the video version, click here!

Key Steps to Extract Salesforce Data

Understand the PySpark Module

  • PySpark uses the spark-salesforce module, which is functional but not actively maintained.
  • While it offers support for Scala, Java, and R, its Python support isn’t as straightforward. Here is an example for Python usage.
df = spark.read.format("com.springml.spark.salesforce") \
.option("username", username) \
.option("password", f"{password}{security_token}") \
.option("soql", soql) \
.option("version", "58.0") \
.load()

Prerequisites

  • Username: Your Salesforce email.
  • Password and Security Token: Concatenated together.
  • SOQL Queries: Salesforce’s version of SQL.

Manage Dependencies

Dependencies are a major roadblock. The module’s official repository lists one dependency, but it doesn’t work for PySpark. After trial and error, I curated the required JAR dependencies.

I’ve forked the original repository, added these dependencies, and updated the README for clarity. Here is the link to my fork.

These dependencies make the module compatible with PySpark, saving you countless hours.

$ bin/spark-shell --packages com.springml:spark-salesforce_2.12:1.1.4,org.mockito:mockito-core:2.0.31-beta,org.codehaus.woodstox:woodstox-core-asl:4.4.0,com.springml:salesforce-wave-api:1.0.10,com.force.api:force-partner-api:40.0.0,com.force.api:force-wsc:40.0.0,com.fasterxml.jackson.dataformat:jackson-dataformat-xml:2.4.4

Learn SOQL Basics

SOQL (Salesforce Object Query Language) is similar to SQL but tailored for Salesforce’s object-based structure. Features like WHERE date = LAST_N_DAYS are unique and useful. Use extensions like Salesforce Inspector for quick query exports and testing.

Tip:

To bypass SOQL’s limitation of 200 rows per query, extract columns first (LIMIT 1), then use the metadata to dynamically query all data. This method prevents manual updates when fields are added or removed.

Handle API Inconsistencies

Salesforce’s displayed field names differ from the API field names.

Use the Tooling API and the sfdc-cli Python package to map display names to API names. Query the FieldDefinition object to fetch mappings for seamless integration.

Important: Ensure you have the right permissions and know whether you’re working in a sandbox or production environment.

Why This Matters

By following this guide, you’ll avoid the frustration of outdated documentation and mismatched dependencies. Whether you’re building ETL pipelines or exploring Salesforce data for insights, this method streamlines the process, ensuring scalability and efficiency.

Bonus: GitHub repository

Looking to enhance your data engineering processes or scale your pipeline architecture? At enabledata.io, we offer expert insights and tailored solutions. Schedule a FREE consulting call today or reach out to us at contact@enabledata.io. Let’s build smarter, scalable solutions together!

Thank you for being a part of the community

Before you go:

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

--

--

Published in Python in Plain English

New Python content every day. Follow to join our 3.5M+ monthly readers.

Written by Aline Rodrigues

Co-founder @ enabledata. Check out our 2-week FREE Data Journey for companies at enabledata.io

No responses yet

Write a response