2.3 Tasks

2.3.2 Prepare Data

Guide to Business Data Analytics

Preparing data involves obtaining access to the planned data sources and establishing the relationships and linkages between sources in order to create a coherent dataset. Data scientists identify how different datasets are related, consider whether the data can be linked in theory, and decide whether it can happen in practice.

Preparing data includes understanding the relationships that exist between data. For example, do two tables have a 0 to 1, 1 to 1, or 1 to many relationship? Preparing data also involves establishing the joins or linkages between sources, normalizing data to reduce redundancy, and standardizing, scaling, and converting data. Sometimes the data collected is uninterpretable in its raw form and must be transformed to lend value to the analytics effort. Data cleansing is the process by which data is transformed to correct or remove bad data.
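As a minimal sketch of how the relationship between two tables might be checked before joining them, the following counts how many rows in one table match each key in another. The function name, table names, and key values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def classify_relationship(parent_keys, child_keys):
    """Describe how many child rows match each parent key.

    Returns "1 to many" if any parent key matches several child rows,
    "0 to 1" if every parent key matches at most one child row and at
    least one matches none, and "1 to 1" if every key matches exactly one.
    """
    matches_per_key = Counter(child_keys)
    match_counts = [matches_per_key[key] for key in parent_keys]
    if any(count > 1 for count in match_counts):
        return "1 to many"
    if any(count == 0 for count in match_counts):
        return "0 to 1"
    return "1 to 1"


# Hypothetical key columns from illustrative tables.
customer_ids = [101, 102, 103]
order_customer_ids = [101, 101, 103]    # customer 101 placed two orders
loyalty_customer_ids = [101, 103]       # customer 102 has no loyalty card
profile_customer_ids = [101, 102, 103]  # one profile per customer

print(classify_relationship(customer_ids, order_customer_ids))    # 1 to many
print(classify_relationship(customer_ids, loyalty_customer_ids))  # 0 to 1
print(classify_relationship(customer_ids, profile_customer_ids))  # 1 to 1
```

Knowing the relationship in advance helps anticipate whether a join will duplicate rows (1 to many) or leave gaps (0 to 1).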

Data preprocessing, scaling, normalization, imputation, and cleansing are common terms used in analytics.
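To illustrate two of these terms, the sketch below shows one common way to scale values to the range 0 to 1 (min-max scaling) and one common way to standardize them (z-scores). The function names and sample values are assumptions made for this example, and other definitions of scaling are also in use.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0.0 and the largest 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical monthly order counts for three stores.
order_counts = [10, 20, 30]
print(min_max_scale(order_counts))  # [0.0, 0.5, 1.0]
print(standardize(order_counts))
```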

Data scientists identify the rules for consolidating data, perform the consolidation, and then validate the results to confirm that the business rules are being followed. Any mechanisms data scientists build to automate the data acquisition or preparation processes can be repurposed for use by other analytics teams.
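The rule-then-consolidate-then-validate sequence might look like the following sketch. The two sources (billing and CRM), the consolidation rule (non-missing CRM values overwrite billing values), and the business rules (every customer has an email, no balance is negative) are all hypothetical and stand in for whatever rules an actual initiative defines.

```python
def consolidate(billing_rows, crm_rows):
    """Merge two sources on customer_id.

    Assumed consolidation rule: start from billing, then let any
    non-missing CRM value overwrite the billing value for that field.
    """
    merged = {}
    for row in billing_rows:
        merged[row["customer_id"]] = dict(row)
    for row in crm_rows:
        record = merged.setdefault(row["customer_id"], {})
        for field, value in row.items():
            if value is not None:
                record[field] = value
    return merged


def validate(merged):
    """Return (customer_id, problem) pairs for each broken business rule."""
    violations = []
    for customer_id, record in merged.items():
        if not record.get("email"):
            violations.append((customer_id, "missing email"))
        balance = record.get("balance")
        if balance is not None and balance < 0:
            violations.append((customer_id, "negative balance"))
    return violations


# Hypothetical source extracts.
billing = [
    {"customer_id": 1, "email": "old@example.com", "balance": 40.0},
    {"customer_id": 2, "email": None, "balance": -5.0},
]
crm = [
    {"customer_id": 1, "email": "new@example.com"},
    {"customer_id": 3, "email": None},
]

merged = consolidate(billing, crm)
for violation in validate(merged):
    print(violation)
```

Keeping the consolidation and validation steps as separate functions is one way to make them reusable by other teams, since each can be applied to a different pair of sources or a different rule set.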

Data scientists leverage a host of techniques when preparing data. Weighting is one technique applied to data to correct bias. Sample weights can be applied to address unequal probabilities of selection, and survey weights can be applied to address bias in surveys. Data scientists use strong technical skills and knowledge of statistics when preparing data for use in an analytics initiative.
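One common form of sample weighting is inverse probability weighting, where each observation is weighted by the reciprocal of its chance of being selected. The sketch below assumes a hypothetical population of 900 urban and 100 rural customers with two respondents sampled from each group. The scores and function names are illustrative only.

```python
def inverse_probability_weights(selection_probabilities):
    """Weight each observation by 1 / (its probability of selection)."""
    if any(p <= 0 or p > 1 for p in selection_probabilities):
        raise ValueError("selection probabilities must be in (0, 1]")
    return [1.0 / p for p in selection_probabilities]


def weighted_mean(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical satisfaction scores: two urban respondents, two rural.
scores = [4.0, 4.0, 2.0, 2.0]
# Rural customers are oversampled: 2 of 100 versus 2 of 900.
selection_probabilities = [2 / 900, 2 / 900, 2 / 100, 2 / 100]

weights = inverse_probability_weights(selection_probabilities)
unweighted = sum(scores) / len(scores)
weighted = weighted_mean(scores, weights)
print(round(unweighted, 2), round(weighted, 2))
```

In this illustration the unweighted mean is 3.0, while the weighted mean is about 3.8, because the oversampled rural respondents no longer count as half of the population.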

When preparing data, analysts provide the business context for the data, which may or may not differ from the statistical interpretation. For example, if there are missing data elements, a data scientist may choose to impute those elements with the mean or median value to keep the distribution of the variable largely intact. While this may be a sound approach from a statistical point of view, it may conflict with business rules, which the analyst may be able to highlight.
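A minimal sketch of mean or median imputation follows. It also returns a flag for each filled value, on the assumption that the analyst will want to review imputed records against business rules. The function name and sample values are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Fill missing values (None) with the mean or median of observed values.

    Returns the filled list and a parallel list of booleans marking which
    positions were imputed, so they can be reviewed later.
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill_value = mean(observed) if strategy == "mean" else median(observed)
    filled = [fill_value if v is None else v for v in values]
    was_imputed = [v is None for v in values]
    return filled, was_imputed


# Hypothetical order amounts with one missing entry.
order_amounts = [10, None, 30]
filled, was_imputed = impute(order_amounts, strategy="mean")
print(filled)       # [10, 20, 30]
print(was_imputed)  # [False, True, False]
```

If a business rule says, for example, that a missing amount means the order was cancelled, the flagged positions show exactly where the statistical fill would contradict that rule.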

Similarly, if a portion of the data has missing information, a data scientist may choose to ignore those observations and continue the analysis because the portion may be statistically insignificant. From a business standpoint, however, further investigation may be required to determine the course of analysis. These scenarios are best handled by analysts with facilitation, collaboration, and elicitation skills, who can supplement the information through stakeholder collaboration and investigation of the recording process.
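One way to support both viewpoints is to set incomplete observations aside rather than discard them, so the analysis can proceed on complete records while the incomplete ones remain available for investigation. The sketch below assumes hypothetical field names and records.

```python
def split_by_completeness(rows, required_fields):
    """Separate rows with all required fields present from those without."""
    complete, incomplete = [], []
    for row in rows:
        if all(row.get(field) is not None for field in required_fields):
            complete.append(row)
        else:
            incomplete.append(row)
    return complete, incomplete


def missing_share(rows, required_fields):
    """Fraction of rows missing at least one required field."""
    if not rows:
        return 0.0
    _, incomplete = split_by_completeness(rows, required_fields)
    return len(incomplete) / len(rows)


# Hypothetical sales records; one is missing its region.
sales = [
    {"order_id": 1, "region": "North", "amount": 120.0},
    {"order_id": 2, "region": None, "amount": 80.0},
    {"order_id": 3, "region": "South", "amount": 95.0},
    {"order_id": 4, "region": "North", "amount": 60.0},
]

complete, to_investigate = split_by_completeness(sales, ["region", "amount"])
print(missing_share(sales, ["region", "amount"]))  # 0.25
print([row["order_id"] for row in to_investigate])  # [2]
```

The list of records held back gives the analyst a concrete starting point for questions to stakeholders about how the data was recorded.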