Data Science works with numbers and tabular data, while Computer Vision is all about visual data such as images and videos.
The third domain, Natural Language Processing (commonly called NLP), takes in the data of the natural languages which humans use in their daily lives and operates on it.
NLP is a branch of AI that enables computers to process human language in the form of text or voice data.
Applications of NLP
Automatic Text Summarization:
In this approach, we build algorithms or programs that reduce the size of a text and create a summary of it. This is called automatic text summarization in machine learning.
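As an illustration, below is a minimal sketch of extractive summarization in Python: it scores sentences by the frequency of the words they contain and keeps the top-scoring ones. The stop-word list and the scoring scheme are simplified assumptions for this sketch, not a production method.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"is", "a", "an", "the", "and", "of", "to", "in", "it"}

def tokens(s):
    # Lowercase and strip basic punctuation from each word
    return [w.strip(".,!?").lower() for w in s.split()]

def summarize(text, num_sentences=2):
    # Split the text into sentences (naive split on full stops)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Count how often each non-stop word occurs in the whole text
    freq = Counter(w for w in tokens(text) if w not in STOP_WORDS)
    # Score each sentence by the total frequency of its words
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in tokens(s)),
                    reverse=True)
    # Keep the top-scoring sentences as the summary
    return ". ".join(scored[:num_sentences]) + "."

text = ("NLP helps computers process text. Summarization reduces text size. "
        "A good summary keeps the most important sentences of the text.")
print(summarize(text, num_sentences=1))
```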
Sentiment Analysis:
Sentiment analysis (or opinion mining) is an NLP technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs.
Companies use NLP applications, such as sentiment analysis, to identify opinions and sentiment online and understand what customers think about their products and services.
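A minimal sketch of how a lexicon-based sentiment analyser could work is shown below. The positive and negative word lists are made up for illustration; real systems use large lexicons or trained models.

```python
# Hypothetical word lists for this sketch only
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "sad"}

def sentiment(text):
    words = text.lower().split()
    # Score = positive word count minus negative word count
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is excellent"))  # positive
```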
Text classification:
Text classification, also known as text tagging or text categorization, is the process of categorizing unstructured text into organized groups. By using NLP, text classifiers can automatically analyse text and then assign a set of pre-defined tags or categories based on its content.
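As a sketch of how such a classifier can be built, the example below uses scikit-learn (assuming it is installed) to train a Naive Bayes classifier on a few made-up texts, then assign a pre-defined category to new, unseen text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts and their pre-defined categories
texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "win a free prize now", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()          # turns text into word-count vectors
X = vectorizer.fit_transform(texts)
classifier = MultinomialNB().fit(X, labels)

# Classify new text into one of the pre-defined categories
print(classifier.predict(vectorizer.transform(["free pills now"])))  # ['spam']
```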
Virtual Assistants(Smart Assistants) :
Nowadays Google Assistant, Microsoft's Cortana, Apple's Siri, Amazon's Alexa, etc. have become an integral part of our lives. Not only can we talk to them, but they also have the ability to make our lives easier.
These are NLP-based programs automated to communicate in a human voice, mimicking human interaction to help ease day-to-day tasks such as showing weather reports, creating reminders and making shopping lists.
Digital Phone Calls :
Automated systems direct customer calls to a service representative or an online chatbot, which responds to customer requests with helpful information.
Modern chatbots, which you often see as a text box when you open a website or contact customer service, interact with you in that text box; based on the words you use, they either connect you with a customer support executive or redirect you to another webpage.
Chatbot: A chatbot is a computer program that simulates and processes human conversation (written or spoken), allowing humans to interact with digital devices as if they were communicating with a real person.
There are two types of chatbots: script bots (simple chatbots), which are easy to make, have limited functionality and need little or no coding; and smart bots, which are flexible and powerful, have wide functionality, require coding, and use AI and ML.
Examples of chatbots: Mitsuku Bot, Jabberwacky, Rose, CleverBot, Ochatbot.
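A script bot can be sketched in a few lines of Python: it simply matches keywords in the user's message against scripted replies. The rules below are made up for illustration; a smart bot would instead use AI and ML to interpret the message.

```python
# Hypothetical keyword-to-reply script for this sketch
RULES = {
    "hello": "Hello! How can I help you?",
    "price": "Our plans start at Rs 99 per month.",
    "bye":   "Goodbye! Have a nice day.",
}

def reply(message):
    for keyword, answer in RULES.items():
        if keyword in message.lower():
            return answer
    # No keyword matched: hand over to a human, as many real bots do
    return "Let me connect you to a customer support executive."

print(reply("What is the price of your plan?"))
```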
Syntax: Syntax refers to the grammatical structure of a sentence.
HUMAN VS COMPUTER LANGUAGES AND NLP
The main function of both human and computer languages is the same: getting a message across (communication).
Humans communicate through language, which we process all the time: our brain keeps processing the sounds it hears around it and tries to make sense of them.
On the other hand, the computer understands the language of numbers. Everything that is sent to the machine has to be converted to numbers. And while typing, if a single mistake is made, the computer throws an error and does not process that part. The communications made by the machines are very basic and simple.
Human languages are natural and used for communication between people, often varying by culture and region. They can be ambiguous and context-dependent, and are dynamic, changing over time.
Computer languages, on the other hand, are synthetic and used for communication between computers and humans.
HUMAN LANGUAGE V/S COMPUTER LANGUAGE

| Human Language                         | Computer Language                      |
| Has morphology                         | No morphology                          |
| Grammar is both logical and emotional  | Grammar is fixed and self-defining     |
| Varies by culture and region           | Universal                              |
| Evolving                               | Fixed                                  |
| Dynamic                                | Static                                 |
| Uses grammar and syntax rules          | Uses strict syntax and semantic rules  |
CONCEPT OF NLP
NLP takes in natural-language data, in the form of the written and spoken words which humans use in their daily lives, and operates on it.
(a) Text Normalization: Text normalization divides the text into smaller components called tokens (words). The aim of text normalization is to convert the text to a standard form.
Steps for Normalization:
1. Sentence segmentation: The whole text is divided into individual sentences.
2. Tokenization: Each sentence is further divided into tokens.
3. Removing stop words, special characters and numbers: Tokens that add no meaning (stop words), along with special characters and numbers, are removed.
4. Stemming: A technique used to extract the base form of words by removing affixes from them. It is just like cutting down the branches of a tree to its stem. For example, the stem of the words eating, eats and eaten is eat. Stemming may produce stems that are not real words, e.g. crying -> cry, smiling -> smili, caring -> car, smiles -> smile.
5. Lemmatization: The process of converting a word to its actual root form linguistically (as per the language). The words extracted through lemmatization are called lemmas, e.g. cried -> cry, smiling -> smile, smiled -> smile, caring -> care.
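The sketch below walks through these normalization steps in plain Python. The stop-word list and the suffix-stripping stemmer are deliberately tiny assumptions for illustration; real pipelines use libraries such as NLTK.

```python
# Tiny illustrative stop-word list (an assumption, not a standard list)
STOP_WORDS = {"is", "are", "a", "an", "so", "the", "and"}

def stem(word):
    # Crude suffix stripping; like real stemmers, it may produce non-words
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text):
    # 1. Sentence segmentation (naive split on full stops)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    result = []
    for sentence in sentences:
        # 2. Tokenization, plus case normalisation to lower case
        for token in sentence.lower().split():
            # 3. Remove special characters, numbers and stop words
            token = "".join(ch for ch in token if ch.isalpha())
            if token and token not in STOP_WORDS:
                # 4. Stemming
                result.append(stem(token))
    return result

print(normalize("The children are smiling. Smiles are so caring!"))
# ['children', 'smil', 'smile', 'car']
```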
(b) Case normalisation: Convert all the words to the same case (lower case).
STOP WORDS: Stop words are the words in any language which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence, e.g. is, are, a, an, so, etc.
ADVANTAGES OF REMOVING STOP WORDS: 1. The dataset size decreases. 2. The time to train the AI model decreases. 3. The performance of the AI model improves.
(c) Finally, convert to numbers: As computers understand numbers better than alphabets and words, we have to convert the normalised text into numbers.
Bag of Words (BoW): A statistical language model used to analyse text and documents based on word count. It is a representation of text that describes the occurrence of words within a document. A Bag of Words contains two things: (1) a vocabulary of known words, and (2) the frequency of those words.
Steps to Implement BoW Model:
1. Text normalisation: Normalise the text as described above.
2. Design vocabulary: Make the list of words in our model.
3. Create document vectors.
4. Calculate TF-IDF.
Design vocabulary: Make the list of words. The whole textual data from all the documents taken together is known as the corpus.
Dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are all written just once while creating the dictionary.
Create document vectors:
The document vector contains the frequency of each word of the vocabulary in a particular document. For each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. If a word does not occur in that document, put a 0 under it.
Now create a document vector table for all documents
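A minimal sketch of the vocabulary and document-vector steps in Python is shown below; the three documents are made up for illustration.

```python
documents = ["the cat sat on the mat",
             "the dog sat",
             "the cat saw the dog"]

# Design the vocabulary: every unique word in the corpus, written once
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Create a document vector: the count of each vocabulary word in the document
def document_vector(doc):
    words = doc.split()
    return [words.count(term) for term in vocabulary]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'saw', 'the']
for doc in documents:
    print(document_vector(doc))
```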
TF-IDF, or Term Frequency-Inverse Document Frequency, is a technique in data processing that assesses the significance of a word in a document relative to a larger collection of documents (corpus). Two applications of TF-IDF include document ranking for search engines and keyword extraction for identifying important terms in a collection of documents.
Term Frequency: It is the frequency of a word in one document.
Term frequency can easily be found in the document vector table
Calculate TF-IDF:
IDF (Inverse Document Frequency):
(1) Find the document frequency (DF) of each word: the number of documents in which the word occurs.
(2) Divide the total number of documents by the document frequency of each word: IDF(W) = (total number of documents) / DF(W).
(3) Now calculate the TF-IDF of each word as per the formula:
TFIDF(W) = TF(W) * log(IDF(W))
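The sketch below computes TF-IDF exactly as per this formula, reusing the made-up documents from the Bag of Words sketch; the log is taken to base 10, which is a common convention.

```python
import math

documents = ["the cat sat on the mat",
             "the dog sat",
             "the cat saw the dog"]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for words in tokenized for word in words})

def tf_idf(term, words):
    tf = words.count(term)                      # term frequency in this document
    df = sum(term in doc for doc in tokenized)  # documents containing the term
    idf = len(tokenized) / df                   # total documents / document frequency
    return tf * math.log10(idf)

for words in tokenized:
    print({term: round(tf_idf(term, words), 3) for term in vocabulary if term in words})
```

Note that "the", which occurs in every document, gets a TF-IDF of 0: words that are common across the whole corpus carry little information, which is exactly what TF-IDF is designed to capture.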
Applications of TF-IDF:
Document classification: It helps in classifying the type of a document by looking at the frequencies of words in the text.
Keyword extraction: It is also useful for extracting keywords from a text.
Topic modelling: It helps in predicting the topic of a large text.
Stop word filtering: It helps in removing unnecessary words from a text body.
Information retrieval systems: It is helpful for extracting important information from a corpus or large text.