Introduction to Statistics for Data Science
Introduction
Statistics is the science of conducting studies to collect, organize, summarize, and analyze data and to draw conclusions from it. Put simply, it is learning from data. As a branch of mathematics, statistics deals with collecting information, interpreting it, and drawing conclusions from data sets. It is used in many fields.
For example, when we watch a cricket match, we hear terms like batting average, bowling economy, and strike rate, and we see many graphs and data visualizations. These things are all part of statistics: information is analyzed and the results are presented accordingly.
We talk about statistics all the time, but do we know the science behind it?
Large cricket organizations use various statistical methods to compare players and teams and rank them accordingly. If we learn the science behind it, we can create our own rankings, compare different things, and debate with hard facts.
Statistics is very important in analytics, data science, artificial intelligence (AI), machine learning, and deep learning (deep neural networks). It is used to work through complex real-world problems so that data professionals such as data analysts and data scientists can analyze data and retrieve meaningful insights from it.
In simple words, statistics can be used to derive meaningful insights from data by performing mathematical computations on it.
The field of statistics is divided into two parts: descriptive statistics and inferential statistics. Data comes in two types, quantitative and qualitative, and it can be either labeled or unlabeled.
Some important terms used
Population: In statistics, a population is the entire pool from which a statistical sample is drawn. For example, if we are studying the students of a college, all students in that college form the population. A population can be contrasted with a sample.
Samples: A sample is a subset of the population, that is, a set of observations drawn from it. A good sample is representative of the population.
Samples are necessary in research because it is often impractical to study the whole population. For example, suppose we want to know the average height of boys in a college. There may be a large number of boys, so measuring every one of them is not practical. In such cases we take a sample: a certain number of boys are selected, and because the sample is meant to be representative of the population, the average computed from it serves as an estimate for all boys.
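The idea of estimating a population average from a sample can be sketched in Python. This is only an illustration: the heights below are synthetic values generated for the example (the population size, mean, and spread are assumptions, not real measurements), and the sample size of 50 is arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (in cm) of 2000 boys.
# Synthetic data, generated only for illustration.
population = [random.gauss(170, 7) for _ in range(2000)]

# Draw a simple random sample of 50 boys (without replacement).
sample = random.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.1f} cm")
print(f"Sample mean:     {sample_mean:.1f} cm")
```

The two averages will usually be close but not identical. That gap is the price of measuring 50 boys instead of all 2000.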
Variable: A characteristic of each element of a population or sample is called a variable.
Types of Statistics
Statistics is divided into two major categories: descriptive and inferential statistics.
Descriptive statistics:
This is one of the most important parts of statistics. Here we use numbers, figures, or other information to describe a phenomenon. These numbers are known as descriptive statistics.
It helps us organize and summarize data using numbers and graphs so that we can look for patterns in the data set.
Some examples of this type of statistics are measures of central tendency, which include the mean, median, and mode, and measures of variability, which include the standard deviation, range, and variance.
Example: production reports, cricket batting averages, ages, ratings, marks, etc.
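As a small sketch, the measures named above can be computed with Python's standard statistics module. The marks list is a made-up example, and the population versions of variance and standard deviation are used here on the assumption that these nine students are the whole group of interest.

```python
import statistics

# Hypothetical marks of 9 students (illustrative values only).
marks = [45, 52, 52, 60, 63, 70, 70, 70, 85]

# Measures of central tendency
mean = statistics.mean(marks)      # arithmetic average: 63
median = statistics.median(marks)  # middle value when sorted: 63
mode = statistics.mode(marks)      # most frequent value: 70

# Measures of variability
value_range = max(marks) - min(marks)   # 85 - 45 = 40
variance = statistics.pvariance(marks)  # population variance: 134
std_dev = statistics.pstdev(marks)      # population standard deviation, about 11.58

print(mean, median, mode, value_range, variance, round(std_dev, 2))
```

If the marks were only a sample from a larger group, statistics.variance and statistics.stdev (which divide by n - 1) would be the usual choice instead.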
Inferential statistics:
To make an inference or draw a conclusion about a population, sample data is used. An inferential statistic is a decision, estimate, prediction, or generalization about a population based on a sample.
Inferential statistics is used to make inferences from the data, whereas descriptive statistics simply describes what is going on in our data.
Scenario based study:
Suppose a particular college has 1000 students. We are interested in finding out how many of them prefer eating in the canteen and how many prefer eating in the mess. A random group of 100 students is selected, and this becomes our sample data.
So, population size = 1000 college students
sample size = 100 random students selected
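Selecting such a random group could look roughly like this in Python. The student IDs 1 to 1000 are a hypothetical stand-in for the college's real enrollment list.

```python
import random

random.seed(0)  # fixed seed so the same sample is drawn on every run

# Hypothetical student IDs 1..1000 standing in for the real enrollment list.
student_ids = list(range(1, 1001))

# Simple random sample: 100 distinct students, each equally likely to be picked.
sample_ids = random.sample(student_ids, 100)

print(len(student_ids), len(sample_ids))
```

random.sample draws without replacement, so no student is surveyed twice.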
We can now survey this 100-student sample. After analyzing and visualizing the responses, we get the following insights.
Insights derived:
- 72% of students prefer eating in the canteen.
- Of the students who prefer the canteen, 44.4% are from 4th year.
- Of the students who prefer the canteen, 72% are from 3rd and 4th year.
- 1st year students are more inclined towards eating in the mess.
The above statistics show the trends within the sample data. These insights summarize the collected data with numbers, so they all fall under descriptive statistics.
Now, suppose we want to open a canteen or mess in the college. From the above insights we can assume that:
- 3rd year and 4th year students are the main target for starting the business.
- To get more sales, you can offer discounts to 1st year and 2nd year students.
- A canteen is a better option than a mess for running a business, since most of the students in the data are inclined towards the canteen rather than the mess.
So here we made inferences, assumptions, and estimates about the whole college on the basis of the sample data. Hence this is a crucial part of inferential statistics.
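One common way to put numbers on such an inference is a confidence interval for a proportion. The sketch below uses the 72% canteen preference from the 100-student sample and a normal approximation with z = 1.96 for roughly 95% confidence. The choice of method is an assumption made for illustration, not something the survey above prescribes, and it ignores the finite population correction.

```python
import math

n = 100                 # sample size from the scenario
p_hat = 0.72            # share of the sample preferring the canteen
population_size = 1000  # students in the college
z = 1.96                # approximate z-value for 95% confidence (assumed level)

# Standard error of a sample proportion (normal approximation).
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin = z * standard_error

lower = p_hat - margin
upper = p_hat + margin

print(f"Estimated share preferring canteen: {lower:.1%} to {upper:.1%}")
print(f"Roughly {lower * population_size:.0f} to "
      f"{upper * population_size:.0f} of {population_size} students")
```

Under these assumptions, the sample suggests that somewhere around 63% to 81% of the whole college prefers the canteen, which is a statement about the population rather than only about the 100 students surveyed.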
- Jay Charole
- Mar, 11 2022