Fundamentals of Statistics for Data Science

By Sushmita Rai, 3EA

Data science is a practical tool that helps businesses operate more effectively, reduce costs and increase revenue. Analysts frequently work with the frequency distribution of data, the smooth curve that describes how often each value occurs. Understanding distributions is essential because they form the basis of analytics and inferential statistics. Probability supplies the mathematical calculations, whereas a distribution helps us visualize what is actually happening in the data.

Data science is about using data to make decisions that drive actions. The foremost goal of data science is to use data analytics thinking to:
1- Replace intuition with data-driven analytical decisions
2- Transform raw data into a valuable asset
3- Increase the pace of action

Data science involves:
1- Finding data
2- Acquiring data
3- Cleaning and transforming data
4- Understanding relationships in data
5- Delivering value from data

Statistics in data science is a diverse field, and a number of classification algorithms, clustering algorithms, neural network algorithms, decision trees and so on build on it. Some of the fundamentals of statistics used in the field of data science are:

Bayes' Theorem: Bayes' theorem describes how to update the probability of a hypothesis H once new evidence E is observed: P(H|E) = P(E|H) * P(H) / P(E). For example, it can be used to estimate the probability that someone has cancer given their age, or the probability that an email is spam given how often certain words appear in it. In essence, the theorem is a tool for reducing uncertainty as evidence accumulates.
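
As a rough illustration of the spam example, the sketch below applies Bayes' theorem to a single piece of evidence (an email contains a particular word). The function name and all of the probabilities are made up for this example and are not taken from any real dataset.

```python
def posterior_probability(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) for a yes/no hypothesis H and observed evidence E."""
    # Total probability of seeing the evidence, whether or not H is true.
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1.0 - prior))
    return p_evidence_given_h * prior / p_evidence


# Assumed, illustrative numbers: 20% of all email is spam, the word
# appears in 60% of spam emails and in 5% of legitimate emails.
p_spam = posterior_probability(
    prior=0.20,
    p_evidence_given_h=0.60,
    p_evidence_given_not_h=0.05,
)
print(round(p_spam, 2))  # 0.75
```

Under these assumed numbers, seeing the word raises the probability of spam from 20% to 75%.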

K-Nearest Neighbour Algorithm: It is considered one of the easiest algorithms in terms of both understanding and implementation. To classify a new data point, the algorithm finds the k labelled points closest to it and assigns the most common label among them. (Searching for local groups around a specified number of focal points is the job of the related k-means clustering algorithm.) The nearest-neighbour concept is used for classification, basic market segmentation and seeking out outliers in a group of data entries.
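
A minimal sketch of the nearest-neighbour idea follows: a new point receives the label that is most common among its k closest labelled points. The function name, the toy points and the labels are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional data, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["low", "low", "low", "high", "high", "high"]

pred_a = knn_predict(points, labels, (1.2, 1.4))
pred_b = knn_predict(points, labels, (6.4, 6.2))
print(pred_a, pred_b)  # low high
```

The choice of k is a tuning decision: a small k follows local noise, while a large k smooths over genuine local structure.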

Bagging (Bootstrap aggregating): It is considered one of the most useful techniques for building more than one model from a single algorithm, such as a decision tree. Each model is trained on a different bootstrap sample of the data (a sample drawn with replacement), and the models' predictions are combined by voting or averaging, which reduces overfitting. In other words, it is a tool designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.
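
The sketch below illustrates bagging with the simplest possible decision tree, a one-split "stump" on a single feature. Each stump is trained on its own bootstrap sample and the ensemble predicts by majority vote. All names, the toy data and the seed are assumptions made for this example.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a one-split tree on (x, label) pairs with labels 0 or 1.

    The stump predicts 1 when x >= threshold; the threshold with the
    fewest training errors is returned.
    """
    best = None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def bagging_fit(data, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(fit_stump(sample))
    return models


def bagging_predict(models, x):
    """Combine the stumps' predictions by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration: small x -> 0, large x -> 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 2), bagging_predict(models, 8))
```

Because each stump sees a slightly different sample, the individual thresholds vary, and the vote averages out that variation.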

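Cross-Validation: It is a technique for validating model performance. The training data is split into several parts (folds); the model is trained on some folds and evaluated on the held-out fold, and the process is repeated so that every fold is used for validation once.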
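
One common way of splitting the training data is k-fold cross-validation, sketched below. The helper names are hypothetical, and the "model" is a deliberately trivial placeholder (it always predicts the mean of the training targets) so that the splitting logic stays in focus.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, val_idx
        start = stop


def cross_val_mse(ys, k):
    """Mean squared error per fold for a predict-the-mean placeholder model."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        mean_y = sum(ys[i] for i in train_idx) / len(train_idx)
        mse = sum((ys[i] - mean_y) ** 2 for i in val_idx) / len(val_idx)
        scores.append(mse)
    return scores


# Toy targets, invented for illustration.
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
scores = cross_val_mse(ys, k=3)
print(scores)  # [37.0, 1.0, 37.0]
```

In practice the data would usually be shuffled before splitting; the folds are kept in order here so the result is easy to check by hand.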

Classification Technique: It is a data mining technique that assigns categories to a collection of data in order to aid more accurate predictions and analysis. The decision tree is one well-known classification method, and classification in general is intended to make the analysis of very large datasets effective. Two major classification techniques used for the purpose of analysis are:

Logistic Regression: It is a predictive analysis, mostly used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables. It is typically used to evaluate questions such as whether body weight, fat intake, calorie intake and participant age have an influence on the occurrence of a heart attack.
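
To make the idea concrete, here is a minimal sketch of a logistic regression with a single predictor, fitted by plain gradient descent. The data are invented (think of x as some risk score and y as whether the event occurred), and the function names, learning rate and step count are assumptions chosen for this example rather than recommended settings.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction error for every observation under the current w, b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradient of the average log-loss with respect to w and b.
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Invented data: higher x tends to go with y = 1, with some overlap.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)
p_low = sigmoid(w * 1 + b)   # estimated probability of y = 1 at x = 1
p_high = sigmoid(w * 8 + b)  # estimated probability of y = 1 at x = 8
print(round(p_low, 3), round(p_high, 3))
```

The fitted model returns a probability for the binary outcome, which can then be turned into a yes/no prediction by applying a cut-off such as 0.5.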

Discriminant Analysis: In discriminant analysis, observations are classified into one of two or more known populations (groups) based on their measured characteristics. It models the distribution of the predictors within each group and then uses Bayes' theorem to estimate the probability of each response category given the predictor values.
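
The two steps described above can be sketched for a single predictor. The example assumes each group is normally distributed with a shared variance (the linear discriminant analysis setting); the function names and the toy measurements are hypothetical.

```python
import math


def fit_lda(xs, ys):
    """Estimate each group's mean and prior, plus one pooled variance."""
    classes = sorted(set(ys))
    n = len(xs)
    params = {}
    pooled_ss = 0.0
    for c in classes:
        xc = [x for x, y in zip(xs, ys) if y == c]
        mean_c = sum(xc) / len(xc)
        pooled_ss += sum((x - mean_c) ** 2 for x in xc)
        params[c] = (mean_c, len(xc) / n)  # (group mean, prior probability)
    variance = pooled_ss / (n - len(classes))
    return params, variance


def lda_posterior(params, variance, x):
    """Apply Bayes' theorem: P(group | x) is proportional to prior * density."""
    # The Gaussian normalising constant is identical for every group
    # (shared variance), so it cancels and is omitted here.
    scores = {
        c: prior * math.exp(-((x - mean) ** 2) / (2.0 * variance))
        for c, (mean, prior) in params.items()
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Toy measurements for two known groups, invented for illustration.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = ["A", "A", "A", "B", "B", "B"]

params, variance = fit_lda(xs, ys)
post_mid = lda_posterior(params, variance, 5.0)
post_low = lda_posterior(params, variance, 3.0)
print(post_mid["A"], round(post_low["A"], 3))  # 0.5 1.0
```

A point exactly halfway between the two group means is assigned equal probability for each group, while a point near one group's mean is assigned to that group almost with certainty.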

The basic statistical concepts and probability distributions have many applications and are widely used. Some data science professionals run these algorithms through Python and R libraries to understand the basics of statistical analysis, since the libraries make manipulation and abstraction easier. Thus, it is very important to learn these fundamental concepts of statistics for data science.

