ETL Testing in a nutshell

ETL (Extract, Transform, Load) is an automated process to move data from source systems to a target system, and ETL testing verifies every stage of that journey.

This includes testing of any component involved in the ETL workflow: micro-services, transformation engines (like Google Dataflow, Spark, etc.), pipelines (Kafka, Google PubSub, etc.), storage components (like HDFS, Google GCS buckets, etc.), landing zones (like Unix directories), workflow managers and orchestrators (like Argo), and any other participating component.

It is important to ensure that each component works as expected per the business requirements and system design, both functionally and non-functionally (performance, load, disaster recovery, etc.).

Component testing focuses on each component in isolation (using mocks) as well as on its integration with other components.

Data quality checks verify that the source data is of acceptable quality as per the metadata agreements defined in the requirements. There should be no inconsistency or inaccuracy in the data. Common checks include null-value checks on certain columns, duplicate checks, blank headers and/or footers, etc.

Any bad data is rejected, separated and logged for audit purposes. The good data is ingested and goes through the rest of the ETL stages. Testing should cover both scenarios.
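
A minimal sketch of such checks in PySpark is shown below; the file path and column names (account_id, acc_status) are illustrative assumptions, not taken from any particular requirement.

```python
# A minimal PySpark sketch of source data-quality checks.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
source = spark.read.option("header", True).csv("landing/accounts.csv")

# Null-value check on mandatory columns: bad rows are rejected and logged
# for audit, good rows continue through the remaining ETL stages.
is_bad = F.col("account_id").isNull() | F.col("acc_status").isNull()
bad_rows = source.filter(is_bad)
good_rows = source.filter(~is_bad)
bad_rows.write.mode("overwrite").parquet("audit/rejected_accounts")

# Duplicate check on the business key.
duplicates = good_rows.groupBy("account_id").count().filter(F.col("count") > 1)
assert duplicates.count() == 0, "duplicate account_id values found in source"
```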

The source systems usually send a lot of raw data. Some of that data may not be relevant from a business perspective, which is why it needs to be filtered. While the RAW data layer stores all the data, only filtered-in data gets propagated to further processing. Filtered-out data is, again, segregated and logged for audit purposes. Filter conditions are defined as per business requirements.

Some examples (sketched in code after this list):

  1. Filter all records older than a certain date, e.g. all accounts opened before 01/03/21.
  2. Filter all records where a specific column has a specific value, e.g. all accounts with Acc_Status = Open.
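
A minimal PySpark sketch of a filter stage built on these two examples might look like the following; the cut-off date (read here as 1 March 2021), column names and paths are assumptions, and both conditions are treated together as the filter-in rule.

```python
# A minimal PySpark sketch of the filter stage based on the two examples above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-stage").getOrCreate()
raw = spark.read.parquet("raw/accounts")

# Keep accounts opened on or after the cut-off date AND with Acc_Status = Open.
filter_in = (F.col("open_date") >= F.to_date(F.lit("2021-03-01"))) & (
    F.col("Acc_Status") == "Open"
)

filtered_in = raw.filter(filter_in)    # propagated to further processing
filtered_out = raw.filter(~filter_in)  # segregated and logged for audit
filtered_out.write.mode("overwrite").parquet("audit/filtered_out_accounts")
```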

Just like filter conditions, transformation logic is also defined in the business requirements. Data is transformed and enriched based on these requirements. Some transformations are as simple as a direct conversion of values, whereas others can be complex calculations.

Test scenarios are derived from this logic and cover positive, negative and boundary-value cases. It is also important to test how transformation engines behave when errors are encountered, especially in calculation-based rules.

The output of data transformation is clean, curated data.

Some transformation rule examples (sketched in code after this list):

  1. Transform a True/False value in source to 1/0 in target.
  2. Derive Account_Id in target by prefixing 0s to the source Account_Id, e.g. 72737455 to 00072737455.
  3. Derive predictive_index in target by multiplying the source eod_balance by 0.025.
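
A minimal PySpark sketch of these three rules could look like this; column names follow the examples, while the input path and the string representation of the True/False flag are assumptions.

```python
# A minimal PySpark sketch of the three transformation rules above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-stage").getOrCreate()
filtered = spark.read.parquet("filtered/accounts")

curated = (
    filtered
    # Rule 1: True/False in source becomes 1/0 in target (flag assumed to be a string).
    .withColumn("is_active", F.when(F.col("is_active") == "True", 1).otherwise(0))
    # Rule 2: left-pad Account_Id with 0s to 11 characters (72737455 -> 00072737455).
    .withColumn("Account_Id", F.lpad(F.col("Account_Id").cast("string"), 11, "0"))
    # Rule 3: predictive_index = eod_balance * 0.025.
    .withColumn("predictive_index", F.col("eod_balance") * 0.025)
)
```

A matching test would feed known inputs, including boundary values (an Account_Id that is already 11 digits, a zero eod_balance), and assert the curated output against independently calculated expected values.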

Schema validation refers to testing of the source and target schemas. Schemas are the structure definitions of source and target, defined by solution architects. They typically include column data types, max/min lengths, uniqueness checks, headers/footers, etc.

For example, the Telephone_Number field could be 11 digits in the source schema but 10 digits in the target schema. This will lead to a schema error during ETL.
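
A schema-validation check along these lines could be sketched as follows; the expected target fields (int_acc_id, prim_tele_no) and the 10-digit length rule mirror the examples in this post, everything else is assumed.

```python
# A minimal sketch of a schema-validation check in PySpark.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-validation").getOrCreate()
target = spark.read.parquet("curated/accounts")

expected = StructType([
    StructField("int_acc_id", LongType(), nullable=False),
    StructField("prim_tele_no", StringType(), nullable=True),
])

# Structural check: column names and data types match the architect-defined schema.
actual = {(f.name, f.dataType) for f in target.schema.fields}
assert {(f.name, f.dataType) for f in expected.fields} <= actual, "schema drift detected"

# Length check: an 11-digit source telephone number cannot fit a 10-digit target field.
too_long = target.filter(F.length("prim_tele_no") > 10)
assert too_long.count() == 0, "prim_tele_no exceeds the 10-digit target schema"
```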

Data mapping refers to which column(s) in the source map to which column(s) in the target. It is defined in business requirements, usually in a simple Excel format, although the technical implementation typically expresses it differently using other technologies.

For example:

  1. Account_Id in source should map to int_acc_id in target
  2. telephone_num in source should map to prim_tele_no on target
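
A minimal sketch of a mapping check might look like this; the mapping dictionary stands in for the Excel mapping sheet and covers the two mappings above, the paths are assumptions, and the key is assumed to be carried over unchanged for the purpose of the sketch.

```python
# A minimal sketch of a source-to-target mapping check.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mapping-check").getOrCreate()
source = spark.read.parquet("raw/accounts")
target = spark.read.parquet("curated/accounts")

mapping = {"Account_Id": "int_acc_id", "telephone_num": "prim_tele_no"}

# Every mapped column must exist on both sides.
for src_col, tgt_col in mapping.items():
    assert src_col in source.columns, f"missing source column: {src_col}"
    assert tgt_col in target.columns, f"missing target column: {tgt_col}"

# Spot-check that values actually flow across the mapping (join on the key, then
# compare the mapped fields; any transformation rules would be applied first).
joined = source.join(target, source["Account_Id"] == target["int_acc_id"])
mismatches = joined.filter(F.col("telephone_num") != F.col("prim_tele_no"))
assert mismatches.count() == 0, "telephone_num not mapped to prim_tele_no correctly"
```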

Data reconciliation is a verification process to ensure that the data migration or ETL process has transferred the data correctly from source (LHS) to target (RHS) by comparing the target data against the original source. This comparison can be between two ETL stages or between source and target.

Since the volume of data is huge, it is nearly impossible to compare data manually or using generic comparison tools. Reconciliation rules are special mathematical checks that provide confidence that the data has moved successfully. Recon rules typically generate three types of reports: DIFF, DETAIL and SUMMARY.

Common types of reconciliation techniques are listed below (with a code sketch after the list):

  • Record Count Reconciliation: Comparing # of records at source vs target.
  • Missing Records (Destination break): Records that exist at source but do not exist in target.
  • Extra Records (Source break): Records that exist in target but not at source.
  • Duplicate Records: Records that occur multiple times in target.
  • Summary Reconciliation (MIN, MAX, SUM, AVG): Mathematical calculations on column(s) at source vs target, e.g. the average eod_balance at source should equal the average eod_balance in target.
  • Record Content Reconciliation: Field-by-field comparison of source and target records.
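
A minimal PySpark sketch covering these techniques is shown below; the business key (account_id), the numeric column (eod_balance) and all paths are assumptions.

```python
# A minimal PySpark sketch of the reconciliation techniques above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconciliation").getOrCreate()
source = spark.read.parquet("source/accounts")
target = spark.read.parquet("target/accounts")
key = "account_id"

# Record count reconciliation (feeds the SUMMARY report).
counts = {"source": source.count(), "target": target.count()}

# Missing records (destination break): at source but not in target.
missing = source.join(target, on=key, how="left_anti")
# Extra records (source break): in target but not at source.
extra = target.join(source, on=key, how="left_anti")
# Duplicate records in target.
duplicates = target.groupBy(key).count().filter(F.col("count") > 1)

# Summary reconciliation: MIN, MAX, SUM, AVG on a numeric column, source vs target.
stats = [F.min("eod_balance"), F.max("eod_balance"),
         F.sum("eod_balance"), F.avg("eod_balance")]
source_stats = source.agg(*stats).collect()[0]
target_stats = target.agg(*stats).collect()[0]

# Record content reconciliation (field-by-field): run only when necessary, since
# it is the most expensive check and requires matching schemas on both sides.
content_diff = source.exceptAll(target)

# DIFF / DETAIL outputs are persisted for audit; counts and stats feed the SUMMARY.
missing.write.mode("overwrite").parquet("recon/missing_records")
extra.write.mode("overwrite").parquet("recon/extra_records")
print(counts, list(source_stats), list(target_stats))
```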

Recons are driven by Financial and Operational proof points.

Reconciliation testing is usually done on full-volume data; that's where it yields the most value.

Reconciliations are automated using technologies like Spark, or similar engines capable of processing huge volumes of data. There are commercial tools available as well, like QuerySurge, and some organizations prefer to build their own bespoke tooling.

Some recons (like record content reconciliation) can be very resource-intensive and may take a long time to run, depending on the volume of data. This needs to be kept in mind, and such recons should be run only when absolutely necessary.

When all ETL stages and their components are well-tested, E2E testing is performed. This is to ensure that the ETL process works flawlessly from start to finish and everything is orchestrated as expected.

Full Volume testing refers to E2E testing of the ETL with large datasets (millions or billions of records). The objective is to test how the ETL behaves and performs with a huge and varied dataset. This helps to uncover unforeseen scenarios and performance issues, if any.

Apart from the above, there are some very specialized types of testing performed on ETL, like Data Integrity, Data Lineage and Data Classification testing.

Data extraction testing ensures that the source file(s) are created as expected and contain the right data from the source system(s). It covers all data extraction conditions and logic.
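
As a minimal sketch, a source-extract check for a flat file might look like this, assuming a pipe-delimited file with a header row and a trailer row carrying the record count (a common extract convention, not something stated in this post); the file name and layout are assumptions.

```python
# A minimal sketch of a source-extract file check (header, trailer, record count).
def validate_extract(path: str) -> None:
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]

    header, *records, trailer = lines
    assert header.startswith("HDR|"), "missing or malformed header"
    assert trailer.startswith("TRL|"), "missing or malformed trailer"

    # The trailer's declared record count must match the actual number of data rows.
    declared = int(trailer.split("|")[1])
    assert declared == len(records), (
        f"trailer declares {declared} records, file contains {len(records)}"
    )

validate_extract("landing/accounts_20210301.dat")
```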
