3. Techniques
3.10 Extract, Transform, and Load (ETL)
3.10.1 Purpose
Extract, Transform, and Load (ETL) refers to a data sourcing and curation approach for application development, business intelligence, and the analytics domain. ETL is often used to make data available from a variety of sources to a target repository that acts as a single standardized source of data for further analysis to drive important insights.
3.10.2 Description
The core principles of ETL are:
- Identify high-quality data, free of data integrity and consistency errors, from a variety of sources (for example, data warehouses, data marts, data lakes, transactional databases, external websites, or mobile data).
- Transition this data to a target repository, creating a “single source of truth,” while applying data governance guidelines and verifying business and data rules.
- Provide easy access and the ability to use this repository of data to support analytics related work.
If applied to significantly large or diverse data, such as big data, the underlying architecture and processes to handle this data will change. For example, when using big data approaches and tools, the order of transform and load tasks may be changed or delegated to the analytics tools or solutions. Many of the tools and methods used for ETL in the context of traditional data sources use structured languages such as SQL to interact with relational databases or structured but multi-dimensional data from data warehouses.
Various architectures, such as Lambda and Kappa, exist for handling unstructured and distributed data. While knowledge of these architectures and of distributed environments for data storage and processing is important, analysts should also understand their conceptual differences, advantages, and limitations. The finer points of big data technologies, concepts, and architectures fall outside the scope of this guide.
Most analytics solutions pass through a proof of concept stage, where traditional ETL processes are most relevant. Advanced big data or streaming data implementations tend to appear later, when advanced analytics solutions are deployed (consider analytics model development versus analytics model deployment). Most often, analysts conceptualize and interpret analytical solutions at the proof of concept stage.
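To make these core principles concrete, the following is a minimal sketch of a traditional ETL flow in Python using pandas and SQL against a relational source. All names (sales.db, channel.csv, warehouse.db, the orders table and its columns) are illustrative assumptions rather than prescribed structures.

```python
# A minimal ETL sketch: extract from a relational source and a flat-file
# export, transform into a consistent format, and load a single target
# table that acts as the "single source of truth" for analysis.
import sqlite3
import pandas as pd

def extract() -> pd.DataFrame:
    # Extract: query a structured source with SQL and read a second,
    # file-based source, then combine them on a shared key.
    with sqlite3.connect("sales.db") as conn:
        orders = pd.read_sql("SELECT order_id, price, units, region FROM orders", conn)
    channel = pd.read_csv("channel.csv")  # assumed columns: order_id, channel
    return orders.merge(channel, on="order_id", how="left")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: derive calculated values and reconcile encodings.
    df["sales"] = df["price"] * df["units"]
    df["region"] = df["region"].str.upper()
    return df

def load(df: pd.DataFrame) -> None:
    # Load: write the curated data to the target repository.
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("fact_sales", conn, if_exists="replace", index=False)

load(transform(extract()))
```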
3.10.3 Elements
.1 Extract Data
Data extraction is the first stage in the ETL process. It enables the identification and aggregation of data across multiple heterogeneous or homogeneous sources.
Key steps include:
- Identify data sources and types: Analysts may suggest relevant data sources and the type of data required based on the business problem to be solved. Data classifications, meta-data descriptions, and schema are usually developed through a top-down approach, starting with coarse granularity drilled down to finer granularity as the engagement progresses. For example, if the business problem involves tracking the success of a marketing campaign, the typical sources may be customer relationship management (CRM) data, sales data, and channel data. These can be drilled down to customer data, engagement data, in-store sales, online sales, marketing automation platform data, loyalty programs data, web analytics, and so on.
- Create a universal classification of data: Various data sources include different conventions and data dictionaries. When aggregating data from varied sources, the definition, description, and data formats need to be reconciled so that a uniform scheme for extraction can be used.
- Verify data integrity: Integrity of extracted data can be automatically verified during data extraction through a universal schema that maps source elements to extracted data elements (conformity of data sizes, formats, redundancy, and data transmission losses) through ETL tools. However, many of the data integrity checks can be manually performed by analysts to verify minimum values, maximum values, identifiers, valid and expected values, or by using data sampling.
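The integrity checks above can be partially automated. Below is a sketch, in Python with pandas, of verifying extracted data against a hypothetical universal schema; the column names, expected types, allowed values, and ranges are assumptions for illustration.

```python
# Verify extracted data against a hypothetical universal schema that maps
# each column to its expected format and allowed values/ranges.
import pandas as pd

SCHEMA = {
    "customer_id": {"dtype": "int64", "min": 1, "max": None, "allowed": None},
    "channel":     {"dtype": "object", "min": None, "max": None,
                    "allowed": {"in-store", "online", "loyalty"}},
    "order_total": {"dtype": "float64", "min": 0.0, "max": 1_000_000.0, "allowed": None},
}

def verify_integrity(df: pd.DataFrame) -> list[str]:
    issues = []
    for col, rule in SCHEMA.items():
        if col not in df.columns:                      # conformity: column present
            issues.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != rule["dtype"]:        # conformity: expected format
            issues.append(f"{col}: dtype {df[col].dtype}, expected {rule['dtype']}")
        if rule["min"] is not None and df[col].min() < rule["min"]:
            issues.append(f"{col}: value below minimum {rule['min']}")
        if rule["max"] is not None and df[col].max() > rule["max"]:
            issues.append(f"{col}: value above maximum {rule['max']}")
        if rule["allowed"] is not None and not set(df[col].dropna()).issubset(rule["allowed"]):
            issues.append(f"{col}: unexpected values found")
    if df.duplicated().any():                          # redundancy check
        issues.append("duplicate rows detected")
    return issues
```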
.2 Transform Data
Data transformation is the translation of extracted data into a usable and accurate format. It ensures the data follows sound business logic. Analysts may be involved in verifying or prescribing the following common transformation types:
- Reconciling encoded values from the source to the target format. For example, date formats or list of values.
- Deriving calculated values. For example, sales = price x units.
- Standardizing values of a variable. For example, rescaling.
- Dropping redundant attributes of an entity.
- Masking attributes. For example, social security number or credit card numbers.
- Translating text data to word vectors, specifically for Natural Language Processing (NLP) applications.
- Binning data. For example, age-to-age groups.
- Splitting strings into separate columns (for example, USIL12@# into Country|State|ID).
- Missing data imputation.
- Joins, merges, pivots, roll-ups.
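Several of these transformation types can be sketched in a few lines of Python with pandas. The input columns (price, units, income, ssn, age, location_code) are hypothetical, and each rule is simplified for illustration.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Derive a calculated value: sales = price x units.
    df["sales"] = df["price"] * df["units"]
    # Impute missing data with the column median.
    df["income"] = df["income"].fillna(df["income"].median())
    # Standardize (rescale) a numeric variable to zero mean, unit variance.
    df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()
    # Mask a sensitive attribute, keeping only the last four digits.
    df["ssn"] = "***-**-" + df["ssn"].str[-4:]
    # Bin a continuous variable: age to age groups.
    df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 55, 120],
                             labels=["<18", "18-34", "35-54", "55+"])
    # Split an encoded string into separate columns (Country|State|ID).
    df[["country", "state", "id"]] = df["location_code"].str.extract(
        r"^([A-Z]{2})([A-Z]{2})(.+)$")
    # Drop the now-redundant attribute.
    return df.drop(columns=["location_code"])
```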
.3 Load Data
Load data involves transitioning the transformed data to a target repository (database, data warehouse, or data lake) that serves the business or analytics applications. In most cases, analytics applications interact directly with the target repository to receive and execute analytical models or to generate reports based on analysis of the data. Target databases involve different internal controls; some data warehouses allow only an incremental load, which checks the delta between existing and incoming data and then adds only the changed data from the sources. This type of incremental load allows faster generation of business intelligence (BI) reports. A full extraction from the sources, on the other hand, may correspond to a complete replacement of the data on the target system, which better supports predictive and prescriptive analytics tasks.
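The two load types can be contrasted in a short sketch, again assuming a SQLite file (warehouse.db) as the target and an order_id key for identifying the delta; both are illustrative choices.

```python
import sqlite3
import pandas as pd

def full_load(df: pd.DataFrame) -> None:
    # Full load: completely replace the table in the target repository.
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("fact_sales", conn, if_exists="replace", index=False)

def incremental_load(df: pd.DataFrame) -> None:
    # Incremental load: compare against the keys already loaded and append
    # only the new (changed) rows; assumes fact_sales already exists.
    with sqlite3.connect("warehouse.db") as conn:
        existing = pd.read_sql("SELECT order_id FROM fact_sales", conn)
        delta = df[~df["order_id"].isin(existing["order_id"])]
        delta.to_sql("fact_sales", conn, if_exists="append", index=False)
```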
The typical tasks in the load process involve:
- Reviewing the target data format and load types in alignment with the business need.
- Performing data load from the staging area to the target. The staging area is a logical placeholder for data which allows easy transformations.
- Generating audit trails of changed or replaced data.
- Standardizing the workflows (for example, data pipelines), which enables repeatability of the data load process. For example, when generating quarterly sales reports, once the metrics computation and data transformation requirements are finalized, the pipeline can be re-run on refreshed data each quarter to regenerate the report.
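As one way of standardizing the workflow, the quarterly example above could be wrapped in a single re-runnable pipeline that also records an audit trail; the staging file layout, table names, and quarter label below are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone
import pandas as pd

def run_quarterly_load(quarter: str) -> None:
    # Load from the staging area (here, one CSV per quarter) to the target.
    staged = pd.read_csv(f"staging/sales_{quarter}.csv")
    with sqlite3.connect("warehouse.db") as conn:
        staged.to_sql("fact_sales", conn, if_exists="append", index=False)
        # Audit trail: record when the load ran and how many rows were added.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS load_audit (quarter TEXT, loaded_at TEXT, rows INTEGER)"
        )
        conn.execute(
            "INSERT INTO load_audit VALUES (?, ?, ?)",
            (quarter, datetime.now(timezone.utc).isoformat(), len(staged)),
        )

run_quarterly_load("2024Q1")  # re-run each quarter as refreshed data arrives
```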
3.10.4 Usage Considerations
.1 Strengths
- Provides well-established process steps for analytics tasks, primarily for descriptive analytics problems.
- Many of the ETL tools provide a graphical view of the operation and connections to enterprise data sources that facilitate easy data sourcing and self-service analytics.
- Simplifies data maintenance by keeping automatic traces to data sources and audit trails.
- Serves as a robust pre-analysis prior to modelling activities by providing a summary of the nature of the data.
.2 Limitations
- Heterogeneous and streaming data are difficult to process with traditional ETL tools and technologies.
- High-volume data movement through ETL requires a lot of planning and effort to maintain consistency.
- ETL is not suitable for near real-time interventions through algorithmic decision-making.