Important Things to Know Before Starting

ETL Phases

Extract

  • Raw data is extracted from various sources such as databases (via SQL queries) or APIs.
  • Important: extract only the necessary columns (avoid queries like SELECT *); a sketch follows this list.
  • On ConnectX:
    • The framework provides various connectors that only need to be configured.
    • Each connector returns a DataFrame as output.
    • All columns in the DataFrame are of type String.
    • You can have N extracts per flow.
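
As an illustration of a narrow extract, here is a minimal Polars sketch. The connection URI, table, and column names are placeholders (not part of ConnectX), and `pl.read_database_uri` assumes a SQL engine such as ConnectorX is installed:

```python
import polars as pl

# Placeholder URI and schema; extract only the columns the flow needs
# instead of SELECT *, so less data moves over the wire.
DB_URI = "postgresql://user:password@host:5432/crm"

query = """
    SELECT account_id, email, last_modified
    FROM accounts
"""

accounts_raw = pl.read_database_uri(query, DB_URI)

# ConnectX connectors return every column as String; casting here mirrors
# that behavior so the later sketches start from the same point.
accounts_raw = accounts_raw.with_columns(pl.all().cast(pl.String))
```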

Transform

  • The transformation phase is the most critical and complex.
  • During transformation, raw data is cleaned and merged to achieve the desired format.
  • First, ensure that extracted data is unique; group it if necessary before performing joins.
  • On ConnectX:
    • All transformations are done using Polars through DataFrame manipulations.
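
A minimal Polars sketch of these steps, with made-up frames standing in for two extracts: deduplicate on the key, cast the String columns, group the many-side, then join.

```python
import polars as pl

# Hypothetical frames in place of real extracts (all columns String,
# as ConnectX connectors return them).
accounts = pl.DataFrame({
    "account_id": ["1", "1", "2"],
    "email": ["a@x.com", "a@x.com", "b@x.com"],
})
orders = pl.DataFrame({
    "account_id": ["1", "1", "2"],
    "amount": ["10.5", "4.5", "7.0"],
})

# 1. Guarantee uniqueness on the key before any join.
accounts = accounts.unique(subset=["account_id"])

# 2. Cast String columns to their real types.
orders = orders.with_columns(pl.col("amount").cast(pl.Float64))

# 3. Group the many-side so the join cannot multiply rows.
order_totals = orders.group_by("account_id").agg(
    pl.col("amount").sum().alias("total_amount")
)

# 4. Join on the now-unique key.
result = accounts.join(order_totals, on="account_id", how="left")
```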

Load

  • Transformed data is ready to be loaded into the destination.
  • On ConnectX:
    • Various connectors are available; they only need to be configured.
    • Each connector takes a DataFrame and a MAPPING dictionary as input (see the sketch after this list).
    • You can have N loads per flow.
  • You can repeat these phases multiple times in a flow: after a LOAD, you can start a new extract, transform, load sequence.
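
The exact loader API is internal to ConnectX, so the sketch below only illustrates the documented contract: a DataFrame plus a MAPPING dictionary pairing DataFrame columns with destination fields. The `load` function and the field names are hypothetical.

```python
import polars as pl

# Hypothetical mapping: DataFrame column -> destination field.
MAPPING = {
    "account_id": "external_key",
    "email": "email",
    "total_amount": "lifetime_value",
}

def load(df: pl.DataFrame, mapping: dict[str, str]) -> None:
    # Keep only the mapped columns and rename them to the destination
    # fields; the print is a stand-in for the real connector write.
    payload = df.select(list(mapping.keys())).rename(mapping)
    for row in payload.iter_rows(named=True):
        print(row)
```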

Do You Need Only ConnectX Knowledge to Implement an ETL Flow?

  • No, developing an ETL flow requires a mix of:
    • Soft Skills: understanding and analyzing client requirements, anticipating risks (e.g., duplicate records), and optimizing processing for large data volumes.
    • Hard Skills: coding ability.
  • Soft skills often have a bigger impact on successful ETL projects, so always ask plenty of questions.

Guidelines for Approaching a Project

Information to Define Clearly

  • Connection
    • How do we connect to the source data?
    • Is the connection secure?
  • Flows
    • How many flows are there?
    • What is the direction of each flow?
      • Import, Export, or Bidirectional
      • ⚠️ Bidirectional flows require careful handling to avoid overwriting data.
    • Data volume per flow
      • Is an initial bulk import needed?
      • Is the source for the initial import the same as for the recurring flow?
      • Is additional RAM required for the first import?
      • Does the source support Delta mode?
        • If not, propose a strategy that still allows processing in DELTA mode:
          • DB/API sources: request a “last modified” field to filter on.
          • File sources: have the source generate incremental files, or compute a diff between today’s file and the one from the last execution (see the first sketch after this list).
  • Flow relationships
    • Are relationships between flows clear?
    • Respect module relationships in the application; orphaned data can require complex recovery procedures.
  • Client usage
    • Is the application already live?
    • If so, pay special attention to the first ETL run.
    • Ensure existing data is consistent with defined rules.
  • Key fields
    • Are key fields clear for each flow?
    • Is deduplication required?
      • DEV should verify source data uniqueness and report any inconsistencies.
    • Can clients modify keys or create duplicate records without keys?
      • Use the technical field (external_key or sync_key) provided by CRM modules; it is always indexed.
  • Data Consistency
    • Do the data follow application rules?
    • Example: in HubSpot Accounts, email is unique and mandatory. Source data must respect this rule; otherwise, data cleansing is required (a validation sketch follows this list).
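
For the file-source DELTA strategy above, a diff between two snapshots can be computed with an anti-join. A minimal Polars sketch, assuming daily CSV snapshots with identical schemas (the file names are placeholders):

```python
import polars as pl

# infer_schema_length=0 loads every column as String, matching the
# all-String convention of ConnectX extracts.
previous = pl.read_csv("accounts_previous_run.csv", infer_schema_length=0)
today = pl.read_csv("accounts_today.csv", infer_schema_length=0)

# Anti-join on all columns: keep only today's rows that have no identical
# counterpart in the previous snapshot, i.e. new or modified records.
delta = today.join(previous, on=today.columns, how="anti")
```

Note that this catches inserts and updates but not deletions, which need separate handling.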
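
And a sketch of the pre-load checks mentioned under Key fields and Data Consistency: verify that the key field is present and unique before loading, and report what fails. The frame and column names are made up.

```python
import polars as pl

accounts = pl.DataFrame({
    "account_id": ["1", "2", "3"],
    "email": ["a@x.com", "", "a@x.com"],
})

# Mandatory: no missing emails.
missing_email = accounts.filter(
    pl.col("email").is_null() | (pl.col("email") == "")
)

# Unique: no email used by more than one record.
duplicated_email = (
    accounts.group_by("email")
    .agg(pl.len().alias("n"))
    .filter(pl.col("n") > 1)
)

# Report inconsistencies to the client instead of silently loading bad rows.
print(f"{missing_email.height} rows without email")
print(f"{duplicated_email.height} duplicated email values")
```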