3. Techniques
3.11 Exploratory Data Analysis
Guide to Business Data Analytics
3.11.1 Purpose
Exploratory data analysis (EDA) is an approach used to maximize the insights gained from data by investigating, analyzing, and summarizing data to uncover relevant patterns. Exploratory data analysis often uses visual analysis to gain a comfort level with the data before applying more formal approaches such as hypothesis testing, machine learning algorithms, or advanced statistical inferences.
3.11.2 Description
EDA serves as an investigative mechanism into the problem. It is an iterative approach to understanding data where data is investigated and explored without any prior assumption or bias. Analysts use iterative discovery processes to build their understanding of the business domain and clarify the research problem.
Some of the key outcomes from EDA are:
- Recognizing the available data labels, data types, and individual characteristics of the data variables in the context of the business environment and the research problem.
- Refining the research problem.
- Recognizing missing and incorrect data within the available data to ensure the right data is sourced for the given business problem.
- Performing a preliminary analysis of the underlying structure of the data and of the important data variables (features/predictors) that are relevant to the research problem.
- Understanding the interdependencies, collinearity, and correlations between data variables.
- Detecting outliers and anomalies that do not conform to the underlying data structure.
- Optimizing the data variables that are most suited for more formal analysis and analytical modelling. This is often called feature engineering, where derived and optimized variables are identified for further analysis.
- Preparing visual representations that communicate initial insights for different stakeholders.
- Identifying ancillary research questions or insights that may not be directly related to the research question but may be relevant to the business.
3.11.3 Elements
.1 Exploratory Data Analysis Scheme
Exploratory data analysis depends on investigation and exploration, which allows flexibility in the analysis. However, a basic step-by-step structure can be formulated to ensure common practices are followed. This foundational scheme helps analysts take a structured approach while adapting analysis practices as needed. Activity diagrams, decision trees, and sequence diagrams all help with recording and tracking each step taken during the EDA exercise.
The following describes a typical approach:
- Check consistency and integrity when data is loaded to the analytics platform.
- Verify the data dimensions: number of data elements, size, and data labels.
- Verify missing data.
- Separate training data and validation data sets.
- Review descriptive statistics for individual variables.
  - Distribution types and shapes.
  - Basic parameters: central tendencies (for example, mean, median, mode), variance, skewness, and kurtosis.
- Verify variable inter-dependence and collinearity.
  - Correlation and variance inflation.
  - ANOVA, t-test, F-test, chi-square.
- Formulate the treatment of missing data and outliers, including data imputation.
  - Replacement with mean, median, or mode.
  - Bayesian/algorithmic replacement.
  - Business rules application.
- Provide appropriate visualization at relevant steps and derive preliminary insights.
- Conduct feature engineering.
- Build and test a basic hypothesis to validate insights.
- Refine research problems based on insights.
- Report the initial finding with visual analysis and corroborations.
The scheme varies with the type of data. For example, natural language data may require stop-word removal and part-of-speech tagging in an EDA scheme.
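The first steps of the scheme can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed implementation: the records, the field names `region` and `order_value`, and the choice of median replacement are all hypothetical. The sketch uses only Python's standard library. On an analytics platform, a data frame library such as pandas would typically do the same work.

```python
import statistics

# Hypothetical records; None marks a missing value.
rows = [
    {"region": "north", "order_value": 120.0},
    {"region": "south", "order_value": 95.0},
    {"region": "north", "order_value": None},
    {"region": "east", "order_value": 110.0},
    {"region": "south", "order_value": 430.0},
    {"region": "east", "order_value": 105.0},
]

# Verify the data dimensions: number of records and data labels.
n_rows = len(rows)
labels = sorted({key for row in rows for key in row})

# Verify missing data: count missing values per variable.
missing = {
    label: sum(row.get(label) is None for row in rows) for label in labels
}

# Review descriptive statistics for an individual variable.
observed = [r["order_value"] for r in rows if r["order_value"] is not None]
mean = statistics.fmean(observed)
median = statistics.median(observed)
stdev = statistics.pstdev(observed)
# Population skewness: the third standardized moment.
skewness = sum(((v - mean) / stdev) ** 3 for v in observed) / len(observed)

# Treat missing data: replacement with the median (one option among several).
imputed = [
    r["order_value"] if r["order_value"] is not None else median for r in rows
]

print(f"rows={n_rows}, labels={labels}, missing={missing}")
print(f"mean={mean:.1f}, median={median:.1f}, skewness={skewness:.2f}")
```

The median is used here because the single large order pulls the mean upward. The appropriate replacement rule depends on the variable and on the applicable business rules.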
.2 Exploratory Data Analysis Visuals
Visuals are the primary vehicle through which the data is understood in exploratory data analysis. With visual descriptions and summarizations, analysts apply their pattern recognition abilities to discover insights. The right choice of visuals depends on the number of variables involved, the type of data (categorical or continuous), and the specific step in the EDA scheme.
Typical visualizations and graphs include:
- Univariate plots: histograms, probability distribution plots, and run-sequence plots. These visuals show the frequency or the distribution shape of a single variable. For example, the returns of an equity produce a negatively skewed plot when large losses are more likely than large gains, which gives the distribution a longer left tail.
- Bivariate plots: bar graphs, scatter plots, boxplots, correlation plots (heat maps), and others. Bar graphs and scatter plots are good for recognizing trends, boxplots are good for identifying outliers, and correlation heat maps show the interrelationships between variables.
- Special purpose plots: pair plots, contour plots, and density plots can show more than two variables, for example, how sales volume changes with advertisement and delivery time. Spider charts reveal a dominant variable among many other variables. Lag plots, autocorrelation plots, and Box-Jenkins analysis can all be used for time-series data. Weibull, log, and lognormal plots make distributions easier to read by changing the scale of an axis to an exponential or logarithmic format.
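For time-series data, the quantity drawn in an autocorrelation plot can be computed directly. The following is a minimal sketch, assuming the usual sample autocorrelation estimate. The `autocorrelation` function and the `monthly_sales` series are hypothetical names and values used only for illustration.

```python
import statistics


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given lag.

    Assumes the series is not constant, since a constant series has zero
    variance and the ratio is undefined.
    """
    if not 0 <= lag < len(series):
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = statistics.fmean(series)
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(len(series) - lag)
    )
    return numerator / denominator


# Hypothetical monthly sales with a repeating four-month pattern.
monthly_sales = [10, 20, 30, 20] * 6

for lag in range(1, 5):
    print(lag, round(autocorrelation(monthly_sales, lag), 3))
```

Plotting these values against the lag gives the autocorrelation plot. A strong positive value at a lag equal to the length of the repeating pattern suggests seasonality worth investigating.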
.3 Exploratory Data Analysis Findings
Findings and decisions from EDA require the right packaging when insights are shared with stakeholders. The findings are summarized in a way that the actions taken to clean the data, as well as resulting business insights, can be articulated with appropriate justifications. Analysts validate the results by applying other business analysis techniques such as business rule analysis, process analysis, and elicitation to corroborate the findings from EDA. The EDA scheme, underlying assumptions during EDA, each of the decision points, and the data sourcing process may be communicated along with the business insights.
3.11.4 Usage Considerations
.1 Strengths
- Integrates a visual and intuitive approach to understanding data in a more scientific way.
- Refines the research problem by providing business insights and capturing most of the assumptions related to data upfront.
- Aids in sourcing the right data for a research problem.
- Improves stakeholder engagement and confidence that the analytics effort is going in the right direction by presenting preliminary findings visually.
- Prepares the data for more formal analysis by staging and transforming data before it is used, which increases the performance of the future models.
.2 Limitations
- Often limits the analysis to purely quantitative and statistical intuitions. Analysts may get caught up in analyzing existing data and disregard the bigger picture.
- Usually requires scientific software and analytics platforms. Knowledge of a programming language such as R or Python, and of associated packages such as pandas, SciPy, and seaborn, is needed to perform EDA exercises in a meaningful way.
- Models built on the EDA assumptions are not always scalable if the business environment and underlying goals change significantly. A fresh analytics engagement and EDA exercise may be required.