Getting Started with Natural Language Processing in Python


A significant portion of the data that’s generated today is unstructured. Unstructured data includes social media comments, browsing history, and customer feedback. Have you found yourself in a situation with a bunch of textual data to analyze, and no idea how to proceed? Natural language processing in Python can help.

The objective of this tutorial is to enable you to analyze textual data in Python through the concepts of natural language processing (NLP). You’ll first learn how to tokenize your text into smaller chunks, normalize words to their root forms, and then remove any noise in your documents to prepare them for further analysis.

Let’s get started!

Prerequisites

In this tutorial, we’ll use Python’s nltk library to perform all NLP operations on the text. At the time of writing this tutorial, we’re using version 3.4 of nltk. To install the library, you can use the pip command on the terminal:

pip install nltk==3.4

To check which version of nltk is installed on your system, you can import the library in the Python interpreter and print the version:

import nltk
print(nltk.__version__)

To perform certain actions within nltk in this tutorial, you may have to download specific resources. We’ll describe each resource as and when required.

However, if you’d like to avoid downloading individual resources later in the tutorial and grab them now in one go, run the following command:

python -m nltk.downloader all

Step 1: Convert Text into Tokens

A computer system can’t find meaning in natural language by itself. The first step in processing natural language is to convert the original text into tokens. A token is a sequence of contiguous characters that carries some meaning. It’s up to you to decide how to break a sentence into tokens. For instance, an easy method is to split a sentence by whitespace to break it into individual words.
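For illustration, here’s a minimal sketch of that naive approach using Python’s built-in str.split(), which splits only on whitespace and leaves punctuation attached to the neighboring words:

sentence = "Hi, this is a nice hotel."
# Naive tokenization: split on whitespace only
print(sentence.split())

Running this prints ['Hi,', 'this', 'is', 'a', 'nice', 'hotel.'], with the comma and full stop glued to their words.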

In the NLTK library, you can use the word_tokenize() function to convert a string into tokens. However, you’ll first need to download the punkt resource. Run the following command in the Python interpreter:

nltk.download('punkt')

Next, you need to import word_tokenize from nltk.tokenize to use it:

from nltk.tokenize import word_tokenize
print(word_tokenize("Hi, this is a nice hotel."))

The output of the code is as follows:

['Hi', ',', 'this', 'is', 'a', 'nice', 'hotel', '.']

You’ll notice that word_tokenize doesn’t simply split a string based on whitespace, but also separates punctuation into tokens. It’s up to you if you’d like to retain the punctuation marks in the analysis.
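If you decide to drop the punctuation, one simple option (a minimal sketch, not the only approach) is to filter the tokens against Python’s built-in string.punctuation:

from string import punctuation

tokens = word_tokenize("Hi, this is a nice hotel.")
# Keep only the tokens that aren't punctuation marks
print([token for token in tokens if token not in punctuation])

This leaves ['Hi', 'this', 'is', 'a', 'nice', 'hotel'].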

Step 2: Convert Words to Their Base Forms

When you’re processing natural language, you’ll often notice that there are various grammatical forms of the same word. For instance, “go”, “going” and “gone” are forms of the same verb, “go”.

While your project’s requirements may mean you need to retain words in their various grammatical forms, let’s discuss a way to convert the different grammatical forms of a word into its base form. There are two techniques you can use to do this.

The first technique is stemming. Stemming is a simple algorithm that removes affixes from a word. There are various stemming algorithms available for use in NLTK. We’ll use the Porter algorithm in this tutorial.

We first import PorterStemmer from nltk.stem.porter. Next, we initialize the stemmer and assign it to the stemmer variable, then use its .stem() method to find the base form of a word:

from nltk.stem.porter import PorterStemmer 
stemmer = PorterStemmer()
print(stemmer.stem("going"))

The output of the code above is “go”. If you run the stemmer on the other forms of “go” described above, you’ll notice that the stemmer returns the same base form, “go”. However, because stemming is just a simple algorithm that removes word affixes, it fails on words that are used less commonly in the language.
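You can verify this yourself with a quick loop (a minimal sketch) that prints the stem of each form:

# Stem each form of "go" from the example above
for word in ["go", "going", "gone"]:
    print(stemmer.stem(word))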

For example, when you try the stemmer on the word “constitutes”, it gives an unintuitive result:

print(stemmer.stem("constitutes"))

You’ll notice the output is “constitut”.

This issue is solved by moving on to a more complex approach to finding the base form of a word in a given context. The process is called lemmatization. Lemmatization normalizes a word based on the context and vocabulary of the text. In NLTK, you can lemmatize sentences using the WordNetLemmatizer class.

First, you need to download the wordnet resource from the NLTK downloader in the Python interpreter:

nltk.download('wordnet')

Once it’s downloaded, you need to import the WordNetLemmatizer class and initialize it:

from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

To use the lemmatizer, use the .lemmatize() method. It takes two arguments: the word, and the context. In our example, we’ll use “v” for context. Let’s explore the context further after looking at the output of the .lemmatize() method:

print(lem.lemmatize('constitutes', 'v'))

You’ll notice that the .lemmatize() method correctly converts the word “constitutes” to its base form, “constitute”. You’ll also notice that lemmatization takes longer than stemming, as the algorithm is more complex.
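As a quick check (again, just a sketch), you can run the lemmatizer with the “v” context on the other forms of “go” from earlier, including the irregular past tense “went”:

# Each of these verb forms should lemmatize to "go"
for word in ['going', 'gone', 'went']:
    print(lem.lemmatize(word, 'v'))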

Let’s check how to determine the second argument of the .lemmatize() method programmatically. NLTK has a pos_tag() function that helps in determining the context of a word in a sentence. However, you first need to download the averaged_perceptron_tagger resource through the NLTK downloader:

nltk.download('averaged_perceptron_tagger')

Next, import the pos_tag() function and run it on a sentence:

from nltk.tag import pos_tag
sample = "Hi, this is a nice hotel."
print(pos_tag(word_tokenize(sample)))

You’ll notice that the output is a list of pairs. Each pair consists of a token and its tag, which signifies the context of a token in the overall text. Notice that the tag for a punctuation mark is itself:

[('Hi', 'NNP'),
(',', ','),
('this', 'DT'),
('is', 'VBZ'),
('a', 'DT'),
('nice', 'JJ'),
('hotel', 'NN'),
('.', '.')]

How do you decode the context of each token? The tags come from the Penn Treebank tagset, and a full list of tags and their corresponding meanings is available on the Web. Notice that the tags of all nouns begin with “NN”, and the tags of all verbs begin with “VB”. We can use this information in the second argument of our .lemmatize() method:

def lemmatize_tokens(sentence):
  lemmatizer = WordNetLemmatizer()
  lemmatized_tokens = []
  for word, tag in pos_tag(sentence):
    # Map the Penn Treebank tag to a WordNet part of speech
    if tag.startswith('NN'):
      pos = 'n'
    elif tag.startswith('VB'):
      pos = 'v'
    else:
      pos = 'a'
    lemmatized_tokens.append(lemmatizer.lemmatize(word, pos))
  return lemmatized_tokens

sample = "Legal authority constitutes all magistrates."
print(lemmatize_tokens(word_tokenize(sample)))

The output of the code above is as follows:

['Legal', 'authority', 'constitute', 'all', 'magistrate', '.']

This output is expected, where “constitutes” and “magistrates” have been converted to “constitute” and “magistrate” respectively.
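
To tie both steps together, here’s a short usage example (a sketch) that runs the same helper on the sentence from Step 1. Verbs such as “is” should come back as their base form, “be”:

# Tokenize and lemmatize the sentence from Step 1
print(lemmatize_tokens(word_tokenize("Hi, this is a nice hotel.")))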
