Important Things to Know Before Starting

ETL Phases

Extract

  • Raw data is extracted from various sources such as databases (via SQL queries) or APIs.
  • Important: extract only the necessary columns (avoid queries like SELECT *); a sketch follows this list.
  • On ConnectX:
    • The framework provides various connectors that only need to be configured.
    • Each connector returns a DataFrame as output.
    • All columns in the DataFrame are of type String.
    • You can have N extracts per flow.
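
As an illustration of a narrow extract, here is a minimal Polars sketch. The connection URI, table, and column names are placeholders (not part of ConnectX), and `pl.read_database_uri` assumes a SQL engine such as ConnectorX is installed:

```python
import polars as pl

# Placeholder URI and schema; extract only the columns the flow needs
# instead of SELECT *, so less data moves over the wire.
DB_URI = "postgresql://user:password@host:5432/crm"

query = """
    SELECT account_id, email, last_modified
    FROM accounts
"""

accounts_raw = pl.read_database_uri(query, DB_URI)

# ConnectX connectors return every column as String; casting here mirrors
# that behavior so the later sketches start from the same point.
accounts_raw = accounts_raw.with_columns(pl.all().cast(pl.String))
```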

Transform

  • The transformation phase is the most critical and complex.
  • During transformation, raw data is cleaned and merged to achieve the desired format.
  • First, ensure that extracted data is unique; group it if necessary before performing joins.
  • On ConnectX:
    • All transformations are done using Polars through DataFrame manipulations.
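
A minimal Polars sketch of these steps, with made-up frames standing in for two extracts: deduplicate on the key, cast the String columns, group the many-side, then join.

```python
import polars as pl

# Hypothetical frames in place of real extracts (all columns String,
# as ConnectX connectors return them).
accounts = pl.DataFrame({
    "account_id": ["1", "1", "2"],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
})
orders = pl.DataFrame({
    "account_id": ["1", "1", "2"],
    "amount": ["10.5", "4.5", "7.0"],
})

# 1. Guarantee uniqueness on the key before any join.
accounts = accounts.unique(subset=["account_id"])

# 2. Cast String columns to their real types.
orders = orders.with_columns(pl.col("amount").cast(pl.Float64))

# 3. Group the many-side so the join cannot multiply rows.
order_totals = orders.group_by("account_id").agg(
    pl.col("amount").sum().alias("total_amount")
)

# 4. Join on the now-unique key.
result = accounts.join(order_totals, on="account_id", how="left")
```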

Load

  • Transformed data is ready to be loaded into the destination.
  • On ConnectX:
    • Various connectors are available; they only need to be configured.
    • Each connector takes a DataFrame and a MAPPING dictionary as input (see the sketch after this list).
    • You can have N loads per flow.
  • You can repeat these phases multiple times in a flow: after a LOAD, you can start a new extract, transform, load sequence.
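
The exact loader API is internal to ConnectX, so the sketch below only illustrates the documented contract: a DataFrame plus a MAPPING dictionary pairing DataFrame columns with destination fields. The `load` function and the field names are hypothetical.

```python
import polars as pl

# Hypothetical mapping: DataFrame column -> destination field.
MAPPING = {
    "account_id": "external_key",
    "email": "email",
    "total_amount": "lifetime_value",
}

def load(df: pl.DataFrame, mapping: dict[str, str]) -> None:
    # Keep only the mapped columns and rename them to the destination
    # fields; the print is a stand-in for the real connector write.
    payload = df.select(list(mapping.keys())).rename(mapping)
    for row in payload.iter_rows(named=True):
        print(row)
```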

Do You Need Only ConnectX Knowledge to Implement an ETL Flow?

  • No, developing an ETL flow requires a mix of:
    • Soft Skills: understanding and analyzing client requirements, anticipating risks (e.g., duplicate records), and optimizing processing for large data volumes.
    • Hard Skills: coding ability.
  • Soft skills often have a bigger impact on successful ETL projects, so always ask plenty of questions.

Guidelines for Approaching a Project

Information to Define Clearly

  • Connection
    • How do we connect to the source data?
    • Is the connection secure?
  • Flows
    • How many flows are there?
    • What is the direction of each flow?
      • Import, Export, or Bidirectional
      • ⚠️ Bidirectional flows require careful handling to avoid overwriting data.
    • Data volume per flow
      • Is an initial bulk import needed?
      • Is the source for the initial import the same as for the recurring flow?
      • Is additional RAM required for the first import?
      • Does the source support Delta mode?
        • If not, propose a strategy that still allows processing in DELTA mode:
          • DB/API sources: request a “last modified” field to filter on.
          • File sources: have the source generate incremental files, or compute a diff between today’s file and the one from the last execution (see the first sketch after this list).
  • Flow relationships
    • Are relationships between flows clear?
    • Respect module relationships in the application; orphaned data can require complex recovery procedures.
  • Client usage
    • Is the application already live?
    • If so, pay special attention to the first ETL run.
    • Ensure existing data is consistent with defined rules.
  • Key fields
    • Are key fields clear for each flow?
    • Is deduplication required?
      • DEV should verify source data uniqueness and report any inconsistencies.
    • Can clients modify keys or create duplicate records without keys?
      • Use the technical field (external_key or sync_key) provided by CRM modules; it is always indexed.
  • Data Consistency
    • Do the data follow application rules?
    • Example: in HubSpot Accounts, email is unique and mandatory. Source data must respect this rule; otherwise, data cleansing is required (a validation sketch follows this list).
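
For the file-source DELTA strategy above, a diff between two snapshots can be computed with an anti-join. A minimal Polars sketch, assuming daily CSV snapshots with identical schemas (the file names are placeholders):

```python
import polars as pl

# infer_schema_length=0 loads every column as String, matching the
# all-String convention of ConnectX extracts.
previous = pl.read_csv("accounts_previous_run.csv", infer_schema_length=0)
today = pl.read_csv("accounts_today.csv", infer_schema_length=0)

# Anti-join on all columns: keep only today's rows that have no identical
# counterpart in the previous snapshot, i.e. new or modified records.
delta = today.join(previous, on=today.columns, how="anti")
```

Note that this catches inserts and updates but not deletions, which need separate handling.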
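
And a sketch of the pre-load checks mentioned under Key fields and Data Consistency: verify that the key field is present and unique before loading, and report what fails. The frame and column names are made up.

```python
import polars as pl

accounts = pl.DataFrame({
    "account_id": ["1", "2", "3"],
    "email": ["a@x.com", "", "a@x.com"],
})

# Mandatory: no missing emails.
missing_email = accounts.filter(
    pl.col("email").is_null() | (pl.col("email") == "")
)

# Unique: no email used by more than one record.
duplicated_email = (
    accounts.group_by("email")
    .agg(pl.len().alias("n"))
    .filter(pl.col("n") > 1)
)

# Report inconsistencies to the client instead of silently loading bad rows.
print(f"{missing_email.height} rows without email")
print(f"{duplicated_email.height} duplicated email values")
```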