UNIT-4

Data science is a domain of AI concerned with data systems and processes, in which a system collects large amounts of data, maintains data sets, and derives meaning from them. The information extracted through data science can then be used to make decisions.

Data science mainly revolves around analysing data. In the context of AI, that analysis helps make a machine intelligent enough to perform tasks by itself.

Artificial Intelligence is a technology that depends completely on data. It is the data fed into the machine that makes it intelligent. Depending on the type of data we have, AI can be classified into three broad domains, and Data Sciences is one of them.

Types of Data/Data Formats For Data Science:

Usually the data is collected in the form of tables. These tabular datasets can be stored in different formats. Some of the commonly used formats are:

1. CSV: CSV stands for comma-separated values. It is a simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. Because the values in each record are separated by commas, these files are known as CSV files.

2. Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows and columns into which information can be entered. Microsoft Excel is a program that helps in creating spreadsheets.

3. SQL: SQL, or Structured Query Language, is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful for handling structured data. Many other database formats also exist; you can explore them online!

4. XML: XML stands for Extensible Markup Language.

5. JSON: JSON stands for JavaScript Object Notation.

6. XLSX: An XLSX file is a Microsoft Excel Open XML format spreadsheet.

(CSV, XML, XLSX, spreadsheet, JSON, SQL)
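To make two of these formats concrete, here is a small sketch using only Python's standard csv and json modules. The sample records (names and marks) are made up for illustration, and the CSV text is held in a string instead of a file so that the example is self-contained.

```python
import csv
import io
import json

# Hypothetical sample data; a real program would open a .csv file instead.
csv_text = "name,marks\nAsha,82\nRavi,74\n"

# csv.DictReader treats the first line as field names
# and yields one dictionary per data record.
records = list(csv.DictReader(io.StringIO(csv_text)))

# The same records written out in JSON format.
json_text = json.dumps(records)

print(records[0]["name"])  # CSV field values are read as strings
print(json_text)
```

Note that the CSV reader returns every field as a string, so numeric columns such as marks would need to be converted before doing arithmetic on them.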

Data Access

After collecting the data, we need to know how to access it in Python code before we can use it for programming. To make our lives easier, there are various Python packages that help us access structured (tabular) data inside the code. Let us take a look at some of these Python packages:

NumPy: NumPy (Numerical Python) is a Python library that provides functions for handling arrays (collections of homogeneous data).
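As a rough illustration of what an array of homogeneous data looks like, the sketch below builds a NumPy array from a made-up list of marks. The values and variable names are assumptions chosen only for this example.

```python
import numpy as np

# Hypothetical marks of five students, stored as a NumPy array.
marks = np.array([82, 74, 91, 66, 78])

print(marks.dtype)  # every element shares one data type
print(marks.shape)  # the array has one dimension with five elements

# Arithmetic is applied to every element at once.
bonus_marks = marks + 5
print(bonus_marks)
```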

Pandas and Matplotlib: these two packages are described below.

Matplotlib

Matplotlib is a visualization library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays. One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in an easily digestible form. Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make correlations; they are typically instruments for reasoning about quantitative information. Some types of graphs that we can make with this package include line plots, bar graphs, histograms, scatter plots, and pie charts.

Plotting is not all: you can also modify your plots the way you wish. You can stylise them and make them more descriptive and communicable.
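A minimal sketch of a styled bar graph is given here. The subjects and marks are invented for illustration, and the "Agg" backend is an assumption made so that the figure is drawn off-screen and the code runs even without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen, no window needed
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
subjects = ["Maths", "Science", "English"]
marks = [82, 74, 91]

fig, ax = plt.subplots()
ax.bar(subjects, marks, color="teal")  # one bar per subject

# Styling: a title and axis labels make the plot more descriptive.
ax.set_title("Marks by subject")
ax.set_xlabel("Subject")
ax.set_ylabel("Marks")

# Save the picture as PNG data in memory; a file name also works here.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a notebook or an interactive session you would usually skip the backend line and call plt.show() to see the graph.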

Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis.

Pandas is well suited for many different kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

• Ordered and unordered (not necessarily fixed-frequency) time series data.

• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
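The first kind of data in the list above, a table whose columns hold different types, can be sketched as a Pandas DataFrame. The student records below are hypothetical and serve only to show the idea.

```python
import pandas as pd

# Hypothetical table with heterogeneously typed columns:
# text, whole numbers, and True/False values.
students = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [82, 74, 91],
    "passed": [True, True, True],
})

print(students.dtypes)  # each column has its own data type

# Select the rows where marks are above 80.
top = students[students["marks"] > 80]
print(top)

# Average of one column.
average = students["marks"].mean()
print(average)
```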

Applications of Data Sciences

Fraud and Risk Detection:

The earliest applications of data science were in finance. Companies were fed up with bad debts and losses every year. However, they had a lot of data that used to be collected during the initial paperwork while sanctioning loans. They decided to bring in data scientists to rescue them from losses. Over the years, banking companies learned to divide and conquer data via customer profiling, past expenditures, and other essential variables to analyse the probabilities of risk and default. It also helped them push their banking products based on each customer’s purchasing power.

Genetics & Genomics:

Data Science applications also enable an advanced level of treatment personalization through research in genetics and genomics. The goal is to understand the impact of the DNA on our health and find individual biological connections between genetics, diseases, and drug response. Data science techniques allow integration of different kinds of data with genomic data in disease research, which provides a deeper understanding of genetic issues in reactions to particular drugs and diseases. As soon as we acquire reliable personal genome data, we will achieve a deeper understanding of the human DNA. The advanced genetic risk prediction will be a major step towards more individual care.

Internet Search:

When we talk about search engines, we think ‘Google’. Right? But there are many other search engines, such as Yahoo, Bing, Ask, and AOL. All these search engines (including Google) use data science algorithms to deliver the best result for our search query in a fraction of a second. Considering that Google processes more than 20 petabytes of data every day, without data science Google would not be the ‘Google’ we know today.

Targeted Advertising: If you thought Search was the biggest of all data science applications, here is a challenger: the entire digital marketing spectrum. From the display banners on various websites to the digital billboards at airports, almost all of them are decided using data science algorithms. This is why digital ads have been able to achieve a much higher CTR (Click-Through Rate) than traditional advertisements. They can be targeted based on a user’s past behaviour.

Website Recommendations: Aren’t we all used to the suggestions about similar products on Amazon? They not only help us find relevant products from billions of products available with them but also add a lot to the user experience. A lot of companies have fervidly used this engine to promote their products in accordance with the user’s interest and relevance of information. Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDB and many more use this system to improve the user experience. The recommendations are made based on previous search results for a user.

Airline Route Planning: The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratio and operating profits. With the steep rise in air-fuel prices and the need to offer heavy discounts to customers, the situation has become worse. It wasn’t long before airline companies started using Data Science to identify the strategic areas of improvement. Now, while using Data Science, the airline companies can:

Basic Statistics

Measures of Central Tendency(Central location) :

A measure of central tendency is a single value that represents a typical, or central, entry of a data set. The most common measures of central location are the mean, the median, and the mode.

Mean: It is the average of the numbers

for example 7,13,22

(7+13+22)/3 = 14

Median: The middle number in a set of data that is ordered from least to greatest

for example 7,13,22,15

arrange the data in increasing order: 7,13,15,22

consider the middle of the list; with four values there are two middle numbers, 13 and 15

take the average of the two middle numbers: (13+15)/2 = 14

Mode: The mode is the number that occurs most often

for example 7,13,22,13

find the number which repeats most, which is 13 - it is the mode for the given set
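The three worked examples above can be checked with Python's built-in statistics module. This is only a sketch; the variable names are chosen for illustration.

```python
import statistics

# Mean: (7 + 13 + 22) / 3
mean_value = statistics.mean([7, 13, 22])

# Median: the list is sorted internally; with four values,
# the two middle numbers (13 and 15) are averaged.
median_value = statistics.median([7, 13, 22, 15])

# Mode: the value that occurs most often.
mode_value = statistics.mode([7, 13, 22, 13])

print(mean_value, median_value, mode_value)
```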

Measures of Variability :

Variability means scatter or spread. Measures of variability refer to the scatter or spread of scores around their central tendency.

The most common measures of variability are range, variance, and standard deviation.

Range: difference between the lowest and highest values of a given set of values.

for example: 7,13,15,22

22-7=15 is the range

Variance: the average of the squared differences from the mean

for example 7,13,22

the mean is (7+13+22)/3 = 14

14-7 =7 --> (7*7) = 49

14-13 =1 --> (1*1) = 1

14-22=-8 --> (-8 * -8 ) = 64

sum of squares of differences=49+1+64 = 114

Average squared difference (variance ) = 114/3 = 38

Standard Deviation: the square root of the variance, shown with the Greek letter sigma (σ).

It is a statistic used to describe the spread of data about the mean.

For the example above, the square root of 38 is about 6.164, or roughly 6.

NumPy provides the functions mean(), median(), var() and std() to calculate these statistics.
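As a sketch, the worked examples in this section can be reproduced with those NumPy functions. By default np.var() and np.std() divide by the number of values, which matches the calculation 114/3 shown earlier. The range is computed here as maximum minus minimum, which is one simple way to do it.

```python
import numpy as np

data = np.array([7, 13, 22])

mean_value = np.mean(data)  # (7 + 13 + 22) / 3
variance = np.var(data)     # average of the squared differences from the mean
std_dev = np.std(data)      # square root of the variance

four_values = np.array([7, 13, 15, 22])
median_value = np.median(four_values)               # average of 13 and 15
value_range = four_values.max() - four_values.min() # highest minus lowest

print(mean_value, variance, std_dev, median_value, value_range)
```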

K-Nearest Neighbour Model/algorithm (KNN )

Supervised learning algorithm mostly used for classification based problems

K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on the Supervised Learning technique.
The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
The K-NN algorithm can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and acts on it at the time of classification.
In the training phase the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category it is most similar to.

A data point is classified on the basis of its K nearest neighbours by a majority vote of those neighbours: a query point (unlabelled point) is assigned the class that has the most representatives among the point's nearest neighbours.

The value of K is the number of neighbours that take part in the vote.
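The majority vote described above can be sketched in a few lines of plain Python. This is only an illustration, not a full implementation: the function name knn_classify and the small two-feature dataset are hypothetical, and straight-line (Euclidean) distance is assumed as the measure of similarity, since these notes do not name one.

```python
import math
from collections import Counter

def knn_classify(training_data, query_point, k=3):
    """Return the majority label among the k points nearest to query_point.

    training_data is a list of (point, label) pairs.
    Assumption: similarity is measured by Euclidean distance.
    """
    # "Lazy learner": nothing is learned in advance; the stored data
    # is simply sorted by distance when a query arrives.
    by_distance = sorted(
        training_data,
        key=lambda pair: math.dist(pair[0], query_point),
    )
    # Collect the labels of the k closest points and count the votes.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical dataset: two features per point, two classes.
training_data = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"),
    ((6.5, 7.0), "B"),
    ((7.0, 6.5), "B"),
]

predicted = knn_classify(training_data, (2.0, 2.0), k=3)
print(predicted)
```

An odd value of K is a common choice when there are two classes, because it avoids a tied vote.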

Activities:

Game: Rock, Paper & Scissors https://next.rockpaperscissors.ai/
