ETL VS DATABASE TESTING

Last updated on Sep 21 2022
Sarika Patil

Table of Contents

ETL VS DATABASE TESTING

Differences between ETL and Database Testing

ETL and database testing both involve data validation, but they are not the same. ETL testing is usually performed on data in a data warehouse, whereas database testing is performed on transactional systems. Data flows into the transactional database from different applications.

Operations performed in ETL Testing

ETL testing involves the following operations:

  • Validation of data movement from the source to the target system.
  • Verification of data counts in the source and target systems (see the sketch after this list).
  • Verification that data is extracted and transformed as per requirements and expectations.
  • Verification that data loads within the expected time frame, confirming scalability and performance.
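
As an illustration of the count verification above, a minimal sketch might compare row counts between the source and target with plain SQL. The connections and table names here are hypothetical:

```python
import sqlite3  # stand-in for any DB-API driver; real systems would use their own

def row_count(conn, table):
    # Count the rows in the given table; the table names below are illustrative.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

source = sqlite3.connect("source.db")      # hypothetical source system
target = sqlite3.connect("warehouse.db")   # hypothetical target warehouse

src_count = row_count(source, "orders")
tgt_count = row_count(target, "dim_orders")
assert src_count == tgt_count, f"Count mismatch: {src_count} vs {tgt_count}"
```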

Operations performed in Database Testing

Database testing focuses on data accuracy, correctness, and valid values.

Database testing performs the following operations:

  • Verifying that every column in a table holds valid data values.
  • Verifying that primary and foreign key relationships are maintained.
  • Verifying whether data is missing in a column, i.e., checking whether any null values exist in a column that should hold a valid value.
  • Verifying the accuracy of data in columns.

For example, a column holding a number of months shouldn’t have a value greater than 12.
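
Several of these checks can be expressed directly in SQL. The sketch below is illustrative only: the connection, table names, and column names are assumptions.

```python
import sqlite3  # any DB-API connection would do; all names below are illustrative

conn = sqlite3.connect("app.db")  # hypothetical transactional database

checks = {
    # A mandatory column must not contain NULLs.
    "null_customer_name": "SELECT COUNT(*) FROM customers WHERE name IS NULL",
    # A month column must stay within 1..12.
    "invalid_month": "SELECT COUNT(*) FROM invoices WHERE month NOT BETWEEN 1 AND 12",
    # Every order must reference an existing customer (foreign-key check).
    "orphan_orders": """SELECT COUNT(*) FROM orders o
                        LEFT JOIN customers c ON o.customer_id = c.id
                        WHERE c.id IS NULL""",
}

for name, sql in checks.items():
    bad_rows = conn.execute(sql).fetchone()[0]
    print(f"{name}: {bad_rows} offending rows")
```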

| Function | ETL Testing | Database Testing |
| --- | --- | --- |
| Primary Goal | Performed to validate data extraction, transformation, and loading for BI reporting. | Performed to validate and integrate the data. |
| Business Need | Used for information, forecasting, and analytical reporting. | Used to integrate data from multiple applications and assess server impact. |
| Applicable System | Applies to historical data, not to a live business-flow environment. | Applies to transactional systems where the business flow occurs. |
| Modeling | The multidimensional method is used. | The ER method is used. |
| Database Type | Applied to OLAP systems. | Applied to OLTP systems. |
| Data Type | Uses de-normalized data with fewer joins, more indexes, and aggregations. | Uses normalized data with joins. |
| Common Tools | QuerySurge, Informatica, etc. | QTP, Selenium, etc. |

ETL Process

ETL stands for Extraction, Transformation, and Loading. It is the process of extracting data, transforming it, and loading it into the final target. ETL follows the process of moving data from the source systems into the data warehouse.


Steps to perform the ETL process are:

Extraction

Extraction is the first step, where data is collected from different sources such as text files, XML files, Excel files, or various other sources.

Transformation

Transformation is the second step of the ETL process, where all the collected data is transformed into the same format. The format can be anything as per the requirement. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following tasks (a small code sketch follows the list):

  • Filtering: Loading only specific attributes into the data warehouse.
  • Cleaning: Filling null values with specific default values.
  • Joining: Joining multiple attributes into one.
  • Splitting: Splitting a single attribute into multiple attributes.
  • Sorting: Sorting the tuples on the basis of an attribute.
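
A compact way to picture these transformation tasks is a small pandas sketch; the column names and values are made up for illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    "first_name": ["Ada", "Grace"],
    "last_name": ["Lovelace", "Hopper"],
    "city_state": ["London,UK", "Arlington,VA"],
    "amount": [120.0, None],
})

df = raw[["first_name", "last_name", "city_state", "amount"]]          # Filtering: keep only needed attributes
df = df.fillna({"amount": 0.0})                                        # Cleaning: default values for nulls
df["full_name"] = df["first_name"] + " " + df["last_name"]             # Joining: combine attributes into one
df[["city", "state"]] = df["city_state"].str.split(",", expand=True)   # Splitting: one attribute into two
df = df.sort_values("amount")                                          # Sorting: order tuples by an attribute
print(df)
```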

Loading

Loading is the final step of the ETL process. The large volume of data collected from various sources is transformed and finally loaded into the data warehouse.

ETL is the process of extracting data from different source systems, transforming it, and loading it into the data warehouse. The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives.

ETL (Extract, Transform and Load) is an automated process that extracts the information required for analysis from raw data, transforms it into a format that can serve business needs, and loads it into a data warehouse. ETL typically summarizes data to reduce its size and improve performance for a specific type of analysis.

The ETL process uses the concept of pipelining. As soon as some data is extracted, it can be transformed, and while it is being transformed, new data can be extracted. Likewise, while the transformed data is being loaded into the data warehouse, the already extracted data can be transformed.
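
One way to picture the pipelining idea is a generator-based sketch in Python, where each record flows through extract, transform, and load without waiting for the whole batch; the source and sink below are just stand-ins:

```python
def extract():
    # Stand-in source; in practice this would read from files or a database.
    for raw in ["10,alice", "20,bob", "30,carol"]:
        yield raw

def transform(rows):
    # Each record is transformed as soon as it has been extracted.
    for raw in rows:
        amount, name = raw.split(",")
        yield {"name": name.title(), "amount": int(amount)}

def load(records):
    # Stand-in sink; in practice this would write to the data warehouse.
    for rec in records:
        print("loading", rec)

# Because generators are lazy, extraction, transformation, and loading
# overlap record by record instead of running as three separate batches.
load(transform(extract()))
```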


When we build an ETL infrastructure, we must integrate data sources, plan carefully, and test to ensure that we transform the source data correctly.

Here, we explain two common ways to build an ETL pipeline: with batch processing and with stream processing.

1) Building an ETL pipeline with batch processing

Here is the process for building a traditional ETL pipeline, in which data is transferred and processed in batches from the source database to the data warehouse. Developing an enterprise ETL pipeline from scratch is challenging, so teams typically rely on ETL tools such as Stitch and Blendo, which simplify and automate the process.

When building ETL with batch processing, the following best practices apply.

1) Reference Data: Create a set of data that defines the permissible values your data may contain.

Example: In a country data field, we can define the country codes that are allowed.
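
In code, a reference-data check can be as simple as a membership test against the permitted set; the codes below are illustrative:

```python
# Reference data: the set of permissible country codes (values are illustrative).
ALLOWED_COUNTRY_CODES = {"IN", "US", "GB", "DE", "JP"}

def is_valid_country(code):
    # A record passes only if its country field appears in the reference set.
    return code in ALLOWED_COUNTRY_CODES

print(is_valid_country("IN"))  # True
print(is_valid_country("ZZ"))  # False
```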

2) Extract from Data Sources: The success of the later ETL steps depends on extracting the data correctly. Most ETL systems combine data from multiple source systems, each with its own organization and format, including relational databases, non-relational databases, and XML, JSON, and CSV files. After successful extraction, the data is converted into a single standardized format.
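
A minimal sketch of that standardization step, assuming hypothetical CSV and JSON inputs, converts every source into the same list-of-dicts shape:

```python
import csv
import json

def extract_csv(path):
    # Each CSV row becomes a dict keyed by the header row.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_json(path):
    # The JSON source is assumed to already be a list of objects.
    with open(path) as f:
        return json.load(f)

# Hypothetical file names; real pipelines pull from many different systems.
records = extract_csv("customers.csv") + extract_json("customers.json")
# From here on, every downstream step works with one standard format.
```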

3) Data Validation: An automated process confirms whether the data pulled from the sources has the expected values. For example, in a database of financial transactions from the past year, a date field should contain valid dates within the past 12 months. The validation engine rejects data that fails the validation rules. We analyze rejected records on a regular basis to identify what went wrong, and either correct the source data or modify the extracted data to resolve the problem in the next batches.
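
A toy version of such a validation engine, assuming each record has a date field that must fall within roughly the past 12 months, might look like this:

```python
from datetime import datetime, timedelta

def validate(record):
    # Reject records whose transaction date is missing, unparseable,
    # or outside roughly the past 12 months.
    try:
        tx_date = datetime.strptime(record["date"], "%Y-%m-%d")
    except (KeyError, ValueError):
        return False
    return datetime.now() - timedelta(days=365) <= tx_date <= datetime.now()

records = [
    {"date": (datetime.now() - timedelta(days=30)).strftime("%Y-%m-%d")},
    {"date": "not-a-date"},
]
accepted = [r for r in records if validate(r)]
rejected = [r for r in records if not validate(r)]
# Rejected records are set aside and analyzed to correct the source data
# or adjust the extraction for the next batches.
```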

4) Transform Data: Remove extraneous or erroneous data, apply business rules, check data integrity (ensure that data was not corrupted at the source or by the ETL process, and that no data was dropped in previous stages), and create aggregates as necessary. For example, if we analyze revenue, we can summarize the dollar amount of invoices into daily or monthly totals. We need to program and test a series of rules or functions that achieve the required transformation and run them on the extracted data.
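
The revenue example can be sketched with pandas, rolling invoice amounts up into daily totals; the column names are assumptions:

```python
import pandas as pd

invoices = pd.DataFrame({
    "invoice_date": ["2022-09-01", "2022-09-01", "2022-09-02"],
    "amount": [100.0, 250.0, 80.0],
})

invoices["invoice_date"] = pd.to_datetime(invoices["invoice_date"])
# Summarize the dollar amount of invoices into a daily total.
daily_revenue = invoices.groupby(invoices["invoice_date"].dt.date)["amount"].sum()
print(daily_revenue)
```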

5) Stage: Transformed data is not usually loaded directly into the target data warehouse. It should first be entered into a staging database, which makes it easier to roll back if something goes wrong. At this point, we can also generate audit reports for regulatory compliance, or diagnose and repair data problems.
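
In SQL terms, staging often amounts to loading into a scratch table first and promoting the data only after it checks out. The sketch below uses hypothetical table names and a local SQLite connection as a stand-in for the staging database:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse connection

# Load the transformed batch into a staging table first.
conn.execute("CREATE TABLE IF NOT EXISTS stg_sales (sale_date TEXT, amount REAL)")
conn.executemany("INSERT INTO stg_sales VALUES (?, ?)",
                 [("2022-09-01", 350.0), ("2022-09-02", 80.0)])

# A simple audit query before promoting: how many rows arrived in this batch?
staged = conn.execute("SELECT COUNT(*) FROM stg_sales").fetchone()[0]
print("rows staged:", staged)

# If something looks wrong, the staging table can be truncated and reloaded
# without touching the target tables.
conn.commit()
```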

6) Publish to Data Warehouse: Load the data into the target tables. Some data warehouses overwrite existing information every time the ETL pipeline loads a new batch, whether daily, weekly, or monthly. Alternatively, ETL can add new data without overwriting, with a timestamp indicating that each row belongs to a particular load. This must be done carefully to prevent the data warehouse from “bursting” due to disk space and performance limitations.
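
Publishing can then either overwrite the target or append new rows tagged with a load timestamp, as in this sketch, which continues the hypothetical staging example above:

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect("warehouse.db")
conn.execute("""CREATE TABLE IF NOT EXISTS fact_sales
                (sale_date TEXT, amount REAL, load_ts TEXT)""")

# Append the staged batch, stamping each row with the load time so it can be
# told apart from earlier batches instead of overwriting them.
load_ts = datetime.now().isoformat()
conn.execute("""INSERT INTO fact_sales (sale_date, amount, load_ts)
                SELECT sale_date, amount, ? FROM stg_sales""", (load_ts,))
conn.commit()
```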

2) Building an ETL Pipeline with stream processing

Modern data processes often include real-time data, for example, web analytics data from a large e-commerce website. In these use cases, we cannot extract and transform data in large batches; instead, we need to perform ETL on data streams. This means that as client applications write data to the data source, the data should be treated, transformed, and saved to the target data store immediately. Many stream processing tools are available today, including Apache Samza, Apache Storm, and Apache Kafka.


Building streaming ETL based on Kafka involves the following points:

1) Extracting data into Kafka: The JDBC connector pulls each row of the source table. As client applications add rows to the table, the connector automatically writes them as new messages to a Kafka topic, which enables a real-time data stream.
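
With Kafka Connect, that JDBC source is typically declared as a connector configuration submitted to the Connect REST API. The sketch below is a rough outline, not a verified configuration: the URL, connection string, and property names follow common Confluent JDBC connector conventions and should be checked against the installed version.

```python
import json
import urllib.request

# Rough outline of registering a JDBC source connector with Kafka Connect.
# Property names follow common Confluent JDBC connector conventions and
# should be verified against the installed connector version.
connector = {
    "name": "orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://localhost:5432/shop",  # placeholder
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "db-",
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",  # default Kafka Connect REST endpoint
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```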

2) Pulling data from Kafka: The ETL application extracts messages from the Kafka topic as Avro records, deserializes them using the Avro schema, and creates KStream objects from the messages.

3) Transform data in KStream objects: Using the Kafka Streams API, the stream processor receives one record at a time, processes it, and can produce one or more output records for downstream processors. These processors can transform one message at a time, filter messages based on conditions, or perform data operations on multiple messages.
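
Kafka Streams itself is a Java library, so a faithful example would be Java; as a rough Python stand-in, the same one-record-at-a-time transform can be sketched with a plain consumer and producer. The confluent-kafka client, topic names, and transformation rule below are all assumptions:

```python
import json
from confluent_kafka import Consumer, Producer  # assumes confluent-kafka is installed

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "etl-transform",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["db-orders"])  # hypothetical source topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Transform one message at a time; filter out records that fail a condition.
    if record.get("amount", 0) <= 0:
        continue
    record["amount_usd"] = round(record["amount"] / 100, 2)  # illustrative conversion
    producer.produce("orders-clean", json.dumps(record).encode("utf-8"))
    producer.flush()
```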

4) Load data to other systems: The ETL application still holds the data, which now needs to be streamed into target systems such as a data warehouse or data lake. A common approach is to use an S3 sink connector to stream the data to Amazon S3. Integration with other systems is also possible; for example, the data can be streamed to a Redshift data warehouse using Amazon Kinesis.
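
The usual route is Kafka Connect's S3 sink connector; as a simplified stand-in, the sketch below writes a transformed batch straight to S3 with boto3. The bucket name and object key are placeholders:

```python
import json
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
batch = [{"order_id": 1, "amount_usd": 1.2}, {"order_id": 2, "amount_usd": 0.8}]

# Write one newline-delimited JSON object per record, a common layout for
# files that a warehouse such as Redshift can later COPY from.
body = "\n".join(json.dumps(rec) for rec in batch).encode("utf-8")
s3.put_object(Bucket="my-etl-bucket", Key="orders/2022-09-21.json", Body=body)
```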

So, this brings us to the end of the blog. This Tecklearn ‘ETL vs Database Testing’ blog helps you with commonly asked questions if you are looking for a job in ETL. If you wish to learn Data Warehousing and build a career in the ETL domain, check out our interactive ETL Testing Training, which comes with 24*7 support to guide you throughout your learning period. Please find the link for course details:

https://www.tecklearn.com/course/etl-testing-training/

ETL Testing Training

About the Course

Today’s businesses have to work with data in multiple formats extracted from multiple sources. All this makes ETL Testing all the more important. Tecklearn’s ETL Testing training offers an in-depth understanding of data warehousing and business intelligence concepts through real-world examples. You will also gain essential knowledge of ETL testing, performance tuning, cubes, etc., through hands-on projects, which will help you become a successful ETL Testing expert.

Why Should you take ETL Testing Training?

  • An ETL Developer can earn $100,000 per year – indeed.com
  • Global Big Data Analytics Market to reach $40.6 Billion in 4 years.
  • Most companies estimate that they’re analysing a mere 12% of the available data. – Forrester Research

What you will Learn in this Course?

Introduction to ETL testing

  • Introduction to ETL testing
  • Life cycle of ETL Testing
  • Database concepts and ETL in Business Intelligence
  • Understanding the difference between OLTP and OLAP and data warehousing

Database Testing and Data Warehousing Testing

  • Introduction to Relational Database Management Systems (RDBMS)
  • Concepts of Relational database
  • Data warehousing testing versus database testing
  • Integrity constraints
  • Testing a data warehouse
  • Hands On

ETL Testing Scenarios

  • Data warehouse workflow
  • ETL Testing scenarios and ETL Mapping
  • Data Warehouse Testing
  • Data Mismatch and Data Loss Testing
  • Creation of Data warehouse workflow
  • Create ETL Mapping
  • Hands On

Various Testing Scenarios

  • Introduction to various testing scenarios
  • Structure validation and constraint validation
  • Data correctness, completeness, quality and Data validation
  • Negative testing
  • Hands On

Data Checks using SQL

  • Using SQL for checking data
  • Understanding database structure
  • Working with SQL Scripts
  • Hands On

Reports & Cube testing

  • Reports and Cube Testing
  • Scope of Business Intelligence Testing
  • Hands On

Got a question for us? Please mention it in the comments section and we will get back to you.
