Supporting wide Salesforce objects

Introduction

When objects have lots of (e.g. 800) attributes, the SOQL for these objects can hit URL length limits. This page documents an approach to handle this limit, by splitting attributes into multiple batches, without the knowledge of users

Use case(s)

I have a Salesforce object with over 800 attributes. When I send a SOQL query for this object with all attributes in it to the Salesforce APIs, I hit URL length limits. To get around this situation, I have wrapped my code for querying Salesforce into a script that retrieves data for an object in batches, and then joins the batches together. This process is extremely manual, custom, one-off and error prone, and therefore extremely hard to manage and maintain. I would like the ability to for the Salesforce plugins in CDAP to batch the attributes before submitting the SOQL to Salesforce, and handle this grunt work automatically.

User Storie(s)

Plugin Type

This is not a new plugin, but enhancements to existing plugins.

Configurables

This section defines properties that are configurable for this plugin. 

User Facing NameTypeDescriptionConstraints








Design / Implementation Tips

  • Tip #1
  • Tip #2

Design

Approach(s)

  1. In CDAP Salesforce Source Batch plugin user specifies sObject (ex: Opportunity). Using Salesforce describe functionality, we receive all fields and generate SOQL 
    1. length of the generated SOQL does not exceed 20,000 [1] characters then we will use standard approach based on Bulk API. Details: Salesforce Batch Source.
    2. length of the generated SOQL exceeds 20,000 characters [1] then we will use 2 APIs (SOAP and Bulk) to solve length problem:
      • Fetch all IDs [2] for provided sObject using Bulk API (generate: select id from sObject).
      • Create chunks by 2,000 ids and query all fields using SOAP API (2,000 is a limit of records exposed by SOAP API [3]). Details: Detailed wide object flow
      • Make retrieve() request for each chunk.
      • Save to sink or expose to further transformations in CDAP.
  2. In CDAP Salesforce Source Batch plugin user specifies SOQL (ex: select  id, type … from Opportunity where type = ‘ABC’)
    1. length of the SOQL does not exceed 20,000 characters then we will use standard approach based on Bulk API. Details: Salesforce Batch Source.
    2. length of the SOQL exceeds 20,000 characters then we will use 2 APIs (SOAP and Bulk) to solve length problem:
      • Parse given SOQL query and extract all fields in SELECT clause and replace them with id field (ex: select  id, type … from Opportunity where type = ‘ABC' -> select id from Opportunity where type = ‘ABC').
      • Fetch all IDs [2] produced by generated SOQL query (ex: select id from Opportunity where type = ‘ABC’).
      • Create chunks by 2,000 ids and query fields extracted from initial SOQL query using SOAP APl. Details: Detailed wide object flow
      • Make retrieve() request for each chunk.
      • Save to sink or expose to further transformations in CDAP.

[1] https://developer.salesforce.com/docs/atlas.en-us.salesforce_app_limits_cheatsheet.meta/salesforce_app_limits_cheatsheet/salesforce_app_limits_platform_soslsoql.htm
[2] https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/system_fields.htm
[3] https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/sforce_api_calls_retrieve.htm

Detailed flow

  1. Initial wide object SOQL: `SELECT Id, Name, Type ...., Column_800__c from sObject`
  2. Extract query fields from wide SOQL and replace with field `Id`.
    • Extracted fieldList: `Id, Name, Type ...., Column_800__c`
    • Generated SOQL: `select id from sObject`
  3. Execute SOQL using standard flow (Bulk API)
  4. Execution result:
    • Total count of records: 6005
    • Number of batches: 1
    • Result CSV

      Id
      1
      2
      3
      ...
      6005
  5. Each resulting batch will be processed using standard flow: for each batch new RecordReader will be created
  6. RecordReader will retrieve list of Ids from assigned batchId. This Ids will be split into sub-lists by 2000 records (in our case to 4):
    • SubList_A: "1", ..., "2000"
    • SubList_B: "2001", ..., "4000"
    • SubList_C: "4001", ..., 6000
    • SubList_D: "6001", ..., 6005
  7. Each sub-list will be executed in parallel through SOAP API.
    • `connection.retrieve(fieldList, sObject, SubList_A)`
    • `connection.retrieve(fieldList, sObject, SubList_B)`
    • `connection.retrieve(fieldList, sObject, SubList_C)`
    • `connection.retrieve(fieldList, sObject, SubList_D)`
  8. All responses will be transformed to CDAP Records and exposed to further CDAP transformations.

Properties

Security

Limitation(s)

To add wide sObject query support SOAP API will be used but it has some limitations described below.

  • Single SOAP API retrieve() call can contain a maximum of 2000 records [1] while Bulk API batch can contain a maximum of 10,000 records. [2]

Since first we will fetch all IDs via Bulk API and then process them by 2000 via SOAP API, this may lead to performance degradation compared to the queries executed via Bulk API only.

[1] https://developer.salesforce.com/docs/atlas.en-us.218.0.api.meta/api/sforce_api_calls_retrieve.htm
[2] https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/asynch_api_concepts_limits.htm

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2



Table of Contents

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature

Created in 2020 by Google Inc.