Data Science works with numbers and tabular data, while Computer Vision is all about visual data such as images and videos.
The third domain, Natural Language Processing (commonly called NLP), takes in the data of the natural languages which humans use in their daily lives and operates on it.
NLP is a branch of AI that enables computers to process human language in the form of text or voice data.
Applications of NLP
Automatic Text Summarization:
In this approach, we build algorithms or programs that reduce the size of a text and create a summary of it. This is called automatic text summarization in machine learning.
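As an illustration, below is a minimal sketch of extractive summarization in Python: it scores sentences by the frequency of the words they contain and keeps the top-scoring ones. The stop-word list and the scoring scheme are simplified assumptions for this sketch, not a production method.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"is", "a", "an", "the", "and", "of", "to", "in", "it"}

def tokens(s):
    # Lowercase and strip basic punctuation from each word
    return [w.strip(".,!?").lower() for w in s.split()]

def summarize(text, num_sentences=2):
    # Split the text into sentences (naive split on full stops)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Count how often each non-stop word occurs in the whole text
    freq = Counter(w for w in tokens(text) if w not in STOP_WORDS)
    # Score each sentence by the total frequency of its words
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in tokens(s)),
                    reverse=True)
    # Keep the top-scoring sentences as the summary
    return ". ".join(scored[:num_sentences]) + "."

text = ("NLP helps computers process text. Summarization reduces text size. "
        "A good summary keeps the most important sentences of the text.")
print(summarize(text, num_sentences=1))
```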
Sentiment Analysis:
Sentiment analysis (or opinion mining) is an NLP technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs.
Companies use NLP applications, such as sentiment analysis, to identify opinions and sentiment online and understand what customers think about their products and services.
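A minimal sketch of how a lexicon-based sentiment analyser could work is shown below. The positive and negative word lists are made up for illustration; real systems use large lexicons or trained models.

```python
# Hypothetical word lists for this sketch only
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "sad"}

def sentiment(text):
    words = text.lower().split()
    # Score = positive word count minus negative word count
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is excellent"))  # positive
```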
Text classification:
Text classification, also known as text tagging or text categorization, is the process of categorizing unstructured text into organized groups. By using NLP, text classifiers can automatically analyse text and then assign a set of pre-defined tags or categories based on its content.
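As a sketch of how such a classifier can be built, the example below uses scikit-learn (assuming it is installed) to train a Naive Bayes classifier on a few made-up texts, then assign a pre-defined category to new, unseen text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts and their pre-defined categories
texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "win a free prize now", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()          # turns text into word-count vectors
X = vectorizer.fit_transform(texts)
classifier = MultinomialNB().fit(X, labels)

# Classify new text into one of the pre-defined categories
print(classifier.predict(vectorizer.transform(["free pills now"])))  # ['spam']
```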
Virtual Assistants(Smart Assistants) :
Nowadays Google Assistant, Microsoft's Cortana, Apple's Siri, Amazon's Alexa, etc. have become an integral part of our lives. Not only can we talk to them, but they also have the ability to make our lives easier.
These are NLP-based programs automated to communicate in a human voice, mimicking human interaction to help ease day-to-day tasks such as showing weather reports, creating reminders and making shopping lists.
Digital Phone Calls :
Automated systems direct customer calls to a service representative or an online chatbot, which responds to customer requests with helpful information.
Modern chatbots, which you often see as a text box when you open a website or contact customer service, interact with you in that text box; based on the words you use, they either connect you with a customer support executive or redirect you to another webpage.
Chatbot: A chatbot is a computer program that simulates and processes human conversation (written or spoken), allowing humans to interact with digital devices as if they were communicating with a real person.
There are two types of chatbots: script bots (simple chatbots), which are easy to make, have limited functionality and need little or no coding; and smart bots, which are flexible and powerful, have wide functionality, require coding, and use AI and ML.
Examples of chatbots: Mitsuku Bot, Jabberwacky, Rose, CleverBot, Ochatbot.
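A script bot can be sketched in a few lines of Python: it simply matches keywords in the user's message against scripted replies. The rules below are made up for illustration; a smart bot would instead use AI and ML to interpret the message.

```python
# Hypothetical keyword-to-reply script for this sketch
RULES = {
    "hello": "Hello! How can I help you?",
    "price": "Our plans start at Rs 99 per month.",
    "bye":   "Goodbye! Have a nice day.",
}

def reply(message):
    for keyword, answer in RULES.items():
        if keyword in message.lower():
            return answer
    # No keyword matched: hand over to a human, as many real bots do
    return "Let me connect you to a customer support executive."

print(reply("What is the price of your plan?"))
```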
Syntax: Syntax refers to the grammatical structure of a sentence.
HUMAN VS COMPUTER LANGUAGES AND NLP
The main function of both human and computer languages is the same: getting a message across (communication).
Humans communicate through language, which we process all the time: our brain keeps processing the sounds it hears around it and tries to make sense of them.
On the other hand, the computer understands the language of numbers. Everything that is sent to the machine has to be converted to numbers. And while typing, if a single mistake is made, the computer throws an error and does not process that part. The communications made by the machines are very basic and simple.
Human languages are natural and used for communication between people, often varying by culture and region. They can be ambiguous and context-dependent, and are dynamic, changing over time.
Computer languages, on the other hand, are synthetic and used for communication between computers and humans.
HUMAN LANGUAGE V/S COMPUTER LANGUAGE

| Human Language                         | Computer Language                      |
| Has morphology                         | No morphology                          |
| Grammar is both logical and emotional  | Grammar is fixed and self-defining     |
| Varies by culture and region           | Universal                              |
| Evolving                               | Fixed                                  |
| Dynamic                                | Static                                 |
| Uses grammar and syntax rules          | Uses strict syntax and semantic rules  |
CONCEPT OF NLP
NLP takes in natural-language data, in the form of the written and spoken words which humans use in their daily lives, and operates on it.
(a) Text Normalization: Text normalization divides the text into smaller components called tokens (words). The aim of text normalization is to convert the text to a standard form.
Steps for Normalization:
1. Sentence segmentation: The whole text is divided into individual sentences.
2. Tokenization: Each sentence is further divided into tokens.
3. Removing stop words, special characters and numbers: Tokens that add no meaning (stop words), along with special characters and numbers, are removed.
4. Stemming: A technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stem. For example, the stem of the words eating, eats and eaten is eat. Stemming may produce stems that are not real words, e.g. crying -> cry, smiling -> smili, caring -> car, smiles -> smile.
5. Lemmatization: The process of converting a word to its actual root form linguistically (as per the language). The words extracted through lemmatization are called lemmas, e.g. cried -> cry, smiling -> smile, smiled -> smile, caring -> care.
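The sketch below walks through these normalization steps in plain Python. The stop-word list and the suffix-stripping stemmer are deliberately tiny assumptions for illustration; real pipelines use libraries such as NLTK.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"is", "are", "a", "an", "so", "the", "and"}

def stem(word):
    # Crude suffix stripping; like real stemmers, it may produce non-words
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    # 1. Sentence segmentation (naive split on full stops)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    result = []
    for sentence in sentences:
        # 2. Tokenization, plus case normalisation to lower case
        for token in sentence.lower().split():
            # 3. Remove special characters, numbers and stop words
            token = "".join(ch for ch in token if ch.isalpha())
            if token and token not in STOP_WORDS:
                # 4. Stemming
                result.append(stem(token))
    return result

print(normalize("The children are smiling. Smiles are so caring!"))
# ['children', 'smil', 'smile', 'car']
```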
(b) Case normalisation: Convert all the words to the same case (lower case).
STOP WORDS: Stop words are the words in any language which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence, e.g. is, are, a, an, so, etc.
ADVANTAGES OF REMOVING STOP WORDS: 1. The dataset size decreases. 2. The time to train the AI model decreases. 3. The performance of the AI model improves.
(c) Finally, convert to numbers: As computers understand numbers better than alphabets and words, we have to convert the normalised text into numbers.
Bag of Words (BoW): A statistical language model used to analyse text and documents based on word count. It is a representation of text that describes the occurrence of words within a document. A Bag of Words contains two things: (1) a vocabulary of known words, and (2) the frequency of those words.
Steps to Implement BoW Model:
1. Text normalisation: Normalise the text as described above.
2. Design vocabulary: Make the list of words in our model.
3. Create document vectors.
4. Calculate TF-IDF.
Design vocabulary: Make the list of words. The whole textual data from all the documents taken together is known as the corpus.
Dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are all written just once while creating the dictionary.
Create document vectors:
The document vector contains the frequency of each word of the vocabulary in a particular document. For each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. If a word does not occur in that document, put a 0 under it.
Now create a document vector table for all documents
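A minimal sketch of the vocabulary and document-vector steps in Python is shown below; the three documents are made up for illustration.

```python
documents = ["the cat sat on the mat",
             "the dog sat",
             "the cat saw the dog"]

# Design the vocabulary: every unique word in the corpus, written once
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Create a document vector: the count of each vocabulary word in the document
def document_vector(doc):
    words = doc.split()
    return [words.count(term) for term in vocabulary]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'saw', 'the']
for doc in documents:
    print(document_vector(doc))
```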
TF-IDF, or Term Frequency-Inverse Document Frequency, is a technique in data processing that assesses the significance of a word in a document relative to a larger collection of documents (corpus). Two applications of TF-IDF include document ranking for search engines and keyword extraction for identifying important terms in a collection of documents.
Term Frequency: It is the frequency of a word in one document.
Term frequency can easily be found in the document vector table
Calculate TF-IDF:
IDF (Inverse Document Frequency):
(1) Find the document frequency (DF) of each word: the number of documents in which the word occurs.
(2) Divide the total number of documents by the document frequency of each word: IDF(W) = (total number of documents) / DF(W).
(3) Now calculate the TF-IDF of each word as per the formula:
TFIDF(W) = TF(W) * log(IDF(W))
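The sketch below computes TF-IDF exactly as per this formula, reusing the made-up documents from the Bag of Words sketch; the log is taken to base 10, which is a common convention.

```python
import math

documents = ["the cat sat on the mat",
             "the dog sat",
             "the cat saw the dog"]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for words in tokenized for word in words})

def tf_idf(term, words):
    tf = words.count(term)                      # term frequency in this document
    df = sum(term in doc for doc in tokenized)  # documents containing the term
    idf = len(tokenized) / df                   # total documents / document frequency
    return tf * math.log10(idf)

for words in tokenized:
    print({term: round(tf_idf(term, words), 3) for term in vocabulary if term in words})
```

Note that "the", which occurs in every document, gets a TF-IDF of 0: words that are common across the whole corpus carry little information, which is exactly what TF-IDF is designed to capture.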
Applications of TF-IDF:
Document classification: It helps in classifying the type of a document by looking at the frequencies of words in the text.
Keyword extraction: It is also useful for extracting keywords from a text.
Topic modelling: It helps in predicting the topic of a large text.
Stop word filtering: It helps in removing unnecessary words from a text body.
Information retrieval systems: It is helpful for extracting important information from a corpus or large text.