3. Techniques
3.9 Descriptive and Inferential Statistics
3.9.1 Purpose
Descriptive statistics derive information about the population under study. Inferential statistics assess information from a sample of that population to make informed generalizations about the whole. Together, they are important techniques through which business information is quantified, compared, or predicted.
3.9.2 Description
Business data is generally plentiful in most organizations. The challenge is how to interpret and use that data. The first step is to quantify the data in meaningful ways so it can be compared, contrasted, and assessed to discern trends. Once quantified, the data needs to be interpreted in the context of the business to derive important insights. These insights can then be used to support business decision-making. Descriptive and inferential statistics provide powerful tools and techniques for addressing these challenges.
Descriptive statistics are a set of formal techniques and tools that summarize data and provide the means to describe it. Most of these tools mirror how people naturally understand data. For example, if an organization's sales figures need to be compared across quarters, the most natural summary to reach for is average sales.
Inferential statistics provide the tools and methods to infer meaning from statistical summaries that are based on a sample drawn from the overall population under study. Most analytical methods are built on these concepts and their extensions. For example, most advanced analytics, machine learning applications, and neural networks use inferential statistics as a first principle.
An introduction to descriptive statistics and inferential statistics is presented in this technique. This technique purposely avoids the mathematical complexity inherent in these approaches. Statistical textbooks can be referenced for more detailed explanations, as needed.
3.9.3 Elements
.1 Descriptive Statistics
The foundational concepts of descriptive statistics include:
- Types of data: In its most atomic form, the data used in an analytics exercise falls into different types. Revenue, sales, profit, and age are continuous data. The number of employees, number of items ordered, and number of stocks purchased are discrete data. Data that represent different categories, such as gender, a company's product types, and different departments, are categorical data. If there is an order to the categorical data, it is referred to as ordinal data, such as grades in a subject (for example, A, B, C).
Mixing up types of data during data sourcing undermines the analysis. For example, treating ordinal data as categorical will not provide accurate analysis because the order is ignored; that is, a grade A and a grade F in a subject are treated equally in the analysis.
- Organization of data: Most business data is either structured or unstructured. Structured data, often referred to as rectangular data, is highly organized. The relation between each element is pre-defined and managed through rules. Unstructured data, on the other hand, is an aggregation of non-homogeneous types of data.
Depending on the analytics context, structured and unstructured data can both be used. For faster access, processing, and computation, structured data is preferable. If volume and variety of data are needed, unstructured data may work better.
- Measures of central tendency/location: This is the most natural way to summarize data: saying where the data is most concentrated. The mean (average) is the sum of values divided by the number of items. The median is the value that halves the data (50% of values on either side). The mode is the most frequent value observed in the data. Quantiles and percentiles are the values that partition the data into several pieces. (See the first sketch after this list.)
The choice of measure is contextual to the research problem. The mean is susceptible to outliers, the median is not universally applicable, and a dataset can have more than one mode. The median is often used as an imputation value when data is missing.
- Measures of deviation: Mean values alone may not describe the data fully. For example, consider two stocks with the same yearly mean return, where one stock's daily returns fluctuate wildly; the other stock is most likely less risky to purchase. This concept leads to measuring the variation in data, captured through variance and standard deviation, which describe how much the data is dispersed around the mean.
Variance is the square of the standard deviation and is not in the same unit as the mean, which may prove confusing for business stakeholders to interpret. However, variance is a common measure used in analytics models. Skewness and kurtosis are higher-order measures that describe the shape of the distribution of values around the mean.
- Sample measures: It is often difficult to measure the entirety of a population; a sample set can be sufficient. For example, it is not advisable to survey every customer of a company to identify quality issues with service delivery. All the measures described, such as mean and variance, apply to sample sets and can be used to estimate the population parameters.
Sample statistics only estimate the population parameters. There are small differences in the way parameters are estimated for the moments around the mean (for example, variance, skewness, and so on); sample variance, for instance, divides by n - 1 rather than n. The values are adjusted in this way to give accurate estimates when used in modelling or for reporting purposes.
- Probability basics: Probability is a branch of mathematics and statistics that is heavily used in analytical modelling. Outcomes and predictions are not certain events, only likely ones. The probability of an event is computed as the number of favourable outcomes over the number of possible outcomes. The expected value is calculated as a mean of values weighted by their associated probabilities. (See the expected-value sketch after this list.)
Most predictive problems use probabilities and expected values as outputs. It is possible to optimize the wrong measure in an analytics problem. For example, an organization targeting a lower rate of attrition may be optimizing the probability of employees leaving, when it should focus on the expected value of the business lost due to attrition.
- Multivariate measures: When summarizing more than one variable, the focus of analysis is on understanding their interrelationships. For example, total sales may depend on advertising spend and the number of distribution channels. To establish such relationships, covariance and correlation are used: for a small change in one variable, how much change is expected in the other? Correlation is a scaled version of covariance that ranges between -1 and 1. High correlation, positive or negative, in the right context suggests a relationship. (See the covariance and correlation sketch after this list.)
Correlation among the independent variables needs to be studied as well as their correlation with the dependent variable. In the example, if the distribution channel and advertising spend are highly correlated, there is a possibility of double counting their effect on sales. Covariance and correlation have many applications in modelling, for example, dimensionality reduction of data, which helps simplify the analysis.
- Probability distributions: The distribution of a variable refers to the probability of encountering different values around the mean value. For example, daily sales numbers and their associated probabilities over a year, if plotted, would form a probability distribution graph. (See the final sketch after this list.)
A probability distribution provides a lot of information about how a variable behaves, and knowledge of the various distribution types is an essential element of analytical models. Distribution graphs are best represented as visuals when explaining the behaviour of different variables.
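To make the summary measures concrete, here is a minimal sketch using Python's built-in statistics module (statistics.quantiles requires Python 3.8 or later); the quarterly sales figures are invented for illustration.

```python
import statistics

# Hypothetical quarterly sales figures (invented for illustration).
sales = [120, 135, 150, 128, 135, 142, 160, 135]

# Measures of central tendency/location.
print("mean:     ", statistics.mean(sales))      # arithmetic average
print("median:   ", statistics.median(sales))    # value that halves the data
print("mode:     ", statistics.mode(sales))      # most frequent value
print("quartiles:", statistics.quantiles(sales, n=4))  # partition points

# Measures of deviation around the mean.
print("sample variance:    ", statistics.variance(sales))   # divides by n - 1
print("sample std dev:     ", statistics.stdev(sales))
print("population variance:", statistics.pvariance(sales))  # divides by n
```

Note the two variance functions: the sample version applies the n - 1 adjustment described under sample measures, while the population version divides by n.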
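The expected-value calculation described under probability basics can be sketched directly; the outcome values and probabilities below are hypothetical.

```python
# Hypothetical outcomes of a business decision and their probabilities.
outcomes = [
    (100_000, 0.2),   # strong uptake
    (40_000, 0.5),    # moderate uptake
    (-25_000, 0.3),   # initiative fails
]

# Probabilities of all possible outcomes must sum to 1.
assert abs(sum(p for _, p in outcomes) - 1.0) < 1e-9

# Expected value: the probability-weighted mean of the outcome values.
expected_value = sum(value * p for value, p in outcomes)
print(f"expected value: {expected_value:,.0f}")  # 20,000 + 20,000 - 7,500 = 32,500
```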
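The relationship between advertising spend and sales described under multivariate measures can be quantified as in the sketch below; the paired figures are invented, and statistics.covariance and statistics.correlation require Python 3.10 or later.

```python
import statistics

# Hypothetical paired observations per period (invented for illustration).
ad_spend = [10, 12, 15, 18, 20, 25]        # advertising spend
sales    = [200, 210, 260, 300, 310, 380]  # total sales

# Covariance: how the two variables move together, in raw units.
print("covariance: ", statistics.covariance(ad_spend, sales))

# Correlation: covariance scaled to the range [-1, 1].
r = statistics.correlation(ad_spend, sales)
print("correlation:", r)  # close to +1 here, suggesting a strong positive relationship
```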
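Finally, an empirical probability distribution for the daily-sales example can be approximated by counting how often each value occurs; the data here is randomly generated around an assumed mean, purely for illustration.

```python
import random
from collections import Counter

random.seed(1)  # make the illustration reproducible

# Simulate a year of daily sales clustered around a mean of 50 units.
daily_sales = [round(random.gauss(50, 5)) for _ in range(365)]

# Empirical distribution: relative frequency of each observed value.
counts = Counter(daily_sales)
distribution = {value: n / len(daily_sales) for value, n in sorted(counts.items())}

# Crude text histogram: values near the mean are the most probable.
for value, p in distribution.items():
    print(f"{value:3d} | {'#' * int(p * 200)}")
```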
.2 Inferential Statistics
The foundational concepts of inferential statistics include:
- Statistical tests: Statistical tests are used to draw conclusions from the summary statistics observed when describing the data. The sample statistics (for example, sample mean, variance, and so on) are used to draw inferences about the population. Common tests include:
  - z-test: verifies whether a population parameter, such as an observed sample mean, is significantly different from some hypothesized value.
  - t-test: used instead of a z-test when the number of observations in a sample is small.
  - f-test: compares population variances to sample variances.
  - chi square test: verifies how well an observed distribution resembles a hypothesized distribution, and is also used to test variation, such as sample variances.
Different tests apply based on the business situation. Particularly in the finance and healthcare industries, these tests are industry best practices, and analysts are expected to know enough about them to engage stakeholders in an analytics context. (A t-test sketch appears after this list.)
- Regression analysis: Regression is one of the most used tools in predictive analytics. An outcome can first be related to, and then predicted from, one or more independent variables. In regression analysis, the goal is to predict the values of a dependent variable based upon the values of the independent variables, so that when new data is encountered, this knowledge can be applied to predict the dependent variable. (See the regression sketch after this list.)
When analysts communicate complex analytical models to business stakeholders, regression analysis can be used to explain the basic principles. During actual modelling activities, however, data scientists focus on model development, and analysts support this by verifying the key factors (independent variables) using their business acumen and domain knowledge.
- Bayesian inference: This is a simple model for inferring population characteristics through the application of Bayes' theorem. Simply stated, beliefs about future values can be estimated from historical values, provided those values can be computed from the data that exists. (See the Bayes' theorem sketch after this list.)
Bayesian inference is used in industry for quick benchmarking of the success rate of predictions. It also has the added advantage that such inferences can be quickly updated as new data arrives.
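As one illustration of these tests, the sketch below runs a one-sample t-test with scipy.stats (a third-party library, installed separately); the satisfaction scores and the hypothesized mean of 7.0 are invented.

```python
from scipy import stats

# Hypothetical sample of customer-satisfaction scores (invented).
sample = [7.2, 6.8, 7.5, 7.9, 6.5, 7.1, 7.4, 6.9, 7.3, 7.6]

# H0: the population mean score equals 7.0.
# A t-test is used rather than a z-test because the sample is small.
result = stats.ttest_1samp(sample, popmean=7.0)

print(f"t statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.3f}")
# If the p-value exceeds the chosen significance level (often 0.05),
# the sample gives no strong evidence that the population mean differs from 7.0.
```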
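A minimal regression sketch follows, reusing the invented advertising and sales figures from the multivariate example; statistics.linear_regression requires Python 3.10 or later.

```python
import statistics

# Hypothetical history: advertising spend (independent) vs. sales (dependent).
ad_spend = [10, 12, 15, 18, 20, 25]
sales    = [200, 210, 260, 300, 310, 380]

# Ordinary least-squares fit: sales = slope * spend + intercept.
slope, intercept = statistics.linear_regression(ad_spend, sales)
print(f"sales = {slope:.1f} * spend + {intercept:.1f}")

# Apply the fitted model to predict the dependent variable for new data.
new_spend = 22
predicted_sales = slope * new_spend + intercept
print(f"predicted sales at spend {new_spend}: {predicted_sales:.0f}")
```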
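Last, a minimal sketch of Bayes' theorem applied to a hypothetical churn scenario; all rates are invented for illustration.

```python
# Invented historical rates for a hypothetical churn scenario.
p_churn = 0.10                 # prior: 10% of customers churn
p_signal_given_churn = 0.60    # 60% of churners raised a complaint
p_signal_given_stay = 0.05     # 5% of non-churners raised a complaint

# Total probability of observing the signal (a complaint).
p_signal = (p_signal_given_churn * p_churn
            + p_signal_given_stay * (1 - p_churn))

# Bayes' theorem: update the churn estimate after observing the signal.
p_churn_given_signal = p_signal_given_churn * p_churn / p_signal
print(f"P(churn | complaint) = {p_churn_given_signal:.2f}")  # ~0.57
```

As new complaint and churn data arrives, the rates above can be re-estimated and the inference re-run, which is the quick-update property noted above.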
3.9.4 Usage Considerations
.1 Strengths
- Statistical methods are embedded in the core business rules of most organizations and are already used for decision-making. For analytics engagements, these concepts form the foundation for more advanced techniques.
- These concepts allow the enumeration of business decisions into quantitative values, supporting quantitative research and evidence-based decisions.
- They are a good tool for developing consensus among stakeholders, as the quantitative analysis does not change based on who performs it.
- Most business metrics are applications of these statistical concepts, and this foundational knowledge helps engage stakeholders in setting analytics goals.
.2 Limitations
- The application of statistical methods to real-world contexts requires many assumptions. These assumptions must be identified and recorded.
- It is not possible to identify all the factors that may influence business predictions; all predictions carry some degree of uncertainty.
- For initiatives where the influencing factors are not distinguishable, a subjective assessment may be more appropriate.
- The use of statistical methods makes the analysis more difficult to communicate without the aid of visuals and stories.