
2.3 Tasks

2.3.1 Develop Data Analysis Plan


The data analysis plan may be formal or informal. The objective is to ensure sufficient time to plan the data analysis activities required for the initiative.

When developing the data analysis plan, the analyst determines:

  • which mathematical or statistical techniques the data scientist plans to use,
  • which statistical and algorithmic models are expected to be used (such as regression, logistic regression, decision trees/random forests, support vector machines, and neural nets),
  • which data sources will be used and how data will be linked or joined, and
  • how data will be preprocessed and cleaned.
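The elements above can be captured in a lightweight, structured form even when the plan is informal. The sketch below is purely illustrative; the Guide does not prescribe a format, and the field names and values are invented for this example.

```python
# Hypothetical structure for an informal data analysis plan.
# All field names and values are illustrative only.
data_analysis_plan = {
    "research_question": "Which customers are most likely to churn next quarter?",
    "techniques": ["descriptive statistics", "correlation analysis"],
    "candidate_models": ["logistic regression", "decision tree", "random forest"],
    "data_sources": {
        "crm_accounts": "joined to billing_history on account_id",
        "support_tickets": "joined to crm_accounts on customer_id",
    },
    "preprocessing": [
        "remove duplicate records",
        "impute missing values",
        "normalize numeric features",
    ],
}
```

A structure like this gives the business analysis professional and the data scientist a shared artifact to review and refine together.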
The business analysis professional provides insights into the plan or may draft the initial plan for review by the data scientist. The data scientist, who possesses the deep technical expertise, decides how the data analysis will be conducted. The analyst applies analysis skills by ensuring the data scientist has sufficient information about the business domain to develop an effective approach to data analysis. Analysts understand the mathematical techniques and algorithmic models in enough detail to explain the analysis approach to business stakeholders, including why a particular model may be chosen for a given research question.

If the data analysis plan is formally documented, analysts use templates to ensure consistency and guide planning decisions. Analysts use metrics and key performance indicators to assist the data scientist in determining if the outcomes from data analysis are producing the results required to address the business need. Organizational knowledge helps business analysis professionals provide the context for the data scientist's work.

Planning Business Data Analytics Approach at Various Stages

Analysts may not require a rigorous understanding of the various algorithmic models used in predictive analytics exercises, but it is helpful to understand them at a high level. A foundational understanding of these models helps analysts describe to stakeholders which models are being considered, and why.

A limited sample of different models is presented below with some of their advantages and disadvantages.

Ordinary Least Squares Regression
  Description: A linear regression model. A linear relationship is established between the predictor variables and the dependent variable by minimizing the squared errors between observed values and predictions.
  Advantages:
    • Used extensively
    • Easy to understand and explain
  Disadvantages:
    • May perform poorly due to its simple construct

ARIMA (Auto-Regressive Integrated Moving Average)
  Description: Primarily used for time-series data analysis, for example, stock movements based on moving averages and data trends.
  Advantages:
    • Can handle time-series data with trends
  Disadvantages:
    • Slowly being phased out by more accurate algorithms

Decision Trees
  Description: Variables are iteratively chosen to separate the predictions into buckets containing the maximum number of observations.
  Advantages:
    • Easy to understand and visualize
    • Decision rules can be extracted
  Disadvantages:
    • May have generalization errors (may perform poorly if future data differs significantly from the training data)

Random Forest
  Description: Combines many shallow decision trees and aggregates their results through voting.
  Advantages:
    • Works in most cases with high accuracy
  Disadvantages:
    • The result is complex to explain
    • Too general-purpose

Logistic Regression
  Description: Maximizes the probability difference between different classes.
  Advantages:
    • Well suited to binary classification
  Disadvantages:
    • Can have a high bias towards model assumptions
    • Requires preprocessing and normalization of data

KNN (K-Nearest Neighbors)
  Description: Classifies new data based on its distance to the nearest existing data points.
  Advantages:
    • General-purpose algorithm
  Disadvantages:
    • Too many modelling assumptions
    • Fails in higher dimensions

Naïve Bayes (NB)
  Description: Computes conditional probabilities from the data and predicts the outcome.
  Advantages:
    • Works well for text processing
  Disadvantages:
    • Other algorithms often outperform NB
    • The conditional independence assumption affects the posterior probability estimates

SVM (Support Vector Machine)
  Description: Maximizes the margin between two disparate classes of data.
  Advantages:
    • Good performance for image and video use cases
  Disadvantages:
    • Requires specific hyperparameter tuning expertise

Perceptron
  Description: Makes fewer model assumptions and is a building block for neural networks and deep learning.
  Advantages:
    • Easy to understand
    • Can be chained together in a neural network (NN) to produce accurate predictions
  Disadvantages:
    • Extremely complicated when used in neural networks
    • Low performance outside a neural network
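
For analysts who want a concrete feel for how several of these models are applied and compared, the sketch below uses scikit-learn (an assumption; the Guide does not prescribe any tooling) to run a like-for-like cross-validated comparison of the classification models from the list above on a synthetic dataset. The regression and time-series models (Ordinary Least Squares, ARIMA) would need a different setup and are omitted here.

```python
# Minimal sketch: compare several classification models on synthetic data.
# Assumes scikit-learn is available; the dataset stands in for real project data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic data: 1,000 observations, 20 predictor variables, binary outcome.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Scaling is included for the models that are sensitive to feature scale
# (logistic regression, KNN, SVM, perceptron); see the preprocessing and
# normalization notes above.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Perceptron": make_pipeline(StandardScaler(), Perceptron()),
}

# 5-fold cross-validated accuracy gives a rough, like-for-like comparison.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")
```

A comparison of this kind can support the conversation with business stakeholders about why a particular model was preferred, for example, a decision tree chosen over a random forest because its decision rules can be extracted and explained.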