Have you ever considered how search engines recognize that “running” and “runs” both derive from the root word ‘run’?
Have you thought about how chatbots respond meaningfully even when users phrase the same idea with different word forms?
The answer lies in stemming, a fundamental technique in Natural Language Processing (NLP) that reduces a word to its base form by removing affixes, most often suffixes.
Stemming enables machines to analyze text more effectively, which can improve search results, sentiment analysis, and even spam detection.
But how does it work, and why does it matter in NLP? Let’s find out.
What is Stemming?

Stemming is a natural language processing technique that reduces words to their root or base form, also known as the “stem.”
The purpose of stemming is to simplify text by consolidating words with similar meanings, enabling better analysis in various applications such as search engines, text mining, and information retrieval.
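For example, the words “running,” “runs,” and “run” share the same root meaning related to the action of moving quickly.
By converting these variations to the stem “run,” we can streamline data processing, which shrinks the vocabulary and helps related word forms match one another. Irregular forms such as “ran” are generally not reduced to “run” by suffix-stripping stemmers; they need an exception list or a dictionary-based method.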
Step-by-Step Process of Stemming

Step 1: Identify the Word
Begin with a word that consists of a root plus one or more affixes. For instance:
Input Word: “believable”
Step 2: Analyze the Word Structure
Examine the components of the word to separate its root from its affixes. For “believable”:
- Core/root: “believe”
- Suffix: “-able”
Note that the “be-” at the start of “believable” is part of the root, not a prefix, and most English stemmers strip suffixes only.
Step 3: Remove Affixes
The next step involves applying rules to eliminate any recognized affixes. The goal is to reach the stem of the word. In this case, a stemming algorithm such as Porter removes the suffix “-able”, reducing “believable” to “believ”. The result is not a dictionary word, but “believe,” “believes,” and “believing” reduce to the same stem, which is what matters for matching.
Step 4: Apply Stemming Algorithm
This step involves using a specific algorithm designed to remove affixes systematically. Some commonly used stemming algorithms include:
Porter Stemmer: A widely used stemming algorithm that applies a set of rules to remove common suffixes. For instance, it would stem:
- “running” → “run”
- “happiness” → “happi” (the stem need not be a dictionary word)
Snowball Stemmer: An improvement over the Porter Stemmer that is also available for other languages. For English it yields:
- “happiness” → “happi”
- “running” → “run”
Step 5: Return the Reduced Form
Once the algorithm processes the word, it returns the simplified or stemmed version suitable for analysis. Using the Porter Stemmer as an example:
- Output for “running”: “run”
- Output for “fishing”: “fish”
These outputs can vary depending on the algorithm’s design and rules.
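The rule application described in Steps 3 to 5 can be sketched with the first rule group of the Porter algorithm (usually called step 1a), which handles plural endings. This is only a fragment under stated assumptions, not a full stemmer: the name `porter_step_1a` is chosen here for illustration, and all later Porter steps are omitted.

```python
# Step 1a of the Porter algorithm: plural endings only.
# Rules are tried in order and only the first match is applied.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects words ending in "ss")
    ("s", ""),       # cats     -> cat
]


def porter_step_1a(word: str) -> str:
    """Apply the first matching step-1a rule; otherwise return the word unchanged."""
    word = word.lower()
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats", "fishing"]:
    print(w, "->", porter_step_1a(w))
```

The order of the rules matters: checking “ss” before “s” is what keeps “caress” intact. A word such as “fishing” passes through this step unchanged and would only be shortened by a later step.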
Step 6: Handle Irregular Forms
Some words do not follow standard rules, and stemming algorithms occasionally produce “stems” that aren’t actual words; these are still useful for matching. For example:
Input Word: “better”
Stemmed Form (using Porter): “better” (unchanged, since no suffix rule applies and a suffix-stripping algorithm has no way to relate it to its base form “good”).
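One common way to cope with irregular forms is to consult a small exception table before any suffix rule runs. The sketch below assumes a hypothetical table `EXCEPTIONS` and a deliberately trivial stand-in stemmer, `strip_plural_s`; neither comes from a real library.

```python
# Hypothetical exception table, consulted before any suffix rule.
# The entries are illustrative; a real list would be curated per language.
EXCEPTIONS = {"ran": "run", "went": "go", "mice": "mouse"}


def strip_plural_s(word: str) -> str:
    """Stand-in for a real stemmer: drops a single trailing "s" only."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stem_with_exceptions(word: str, stemmer=strip_plural_s) -> str:
    """Return the listed base form if there is one, else fall back to the stemmer."""
    word = word.lower()
    return EXCEPTIONS.get(word, stemmer(word))


for w in ["ran", "runs", "better"]:
    print(w, "->", stem_with_exceptions(w))
```

Words that are neither listed nor matched by a rule, such as “better” here, come back unchanged.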
Step 7: Final Output and Usage
The final output is a list or set of unique stems representing your original set of words. This serves analytic purposes such as:
- Reducing the number of unique tokens, which allows a model to generalize better.
- Combining grammatical variations of the same word, which helps improve search functionality.
Example of Stemming:
Consider the input words: [“connection”, “connects”, “connected”, “connecting”, “connections”]
Stemming Process:
- “connection” → “connect”
- “connects” → “connect”
- “connected” → “connect”
- “connecting” → “connect”
- “connections” → “connect”
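The reduction in unique tokens can be made concrete by grouping the input words by stem. This is a minimal sketch: `toy_stem` and its suffix list are assumptions made for this example (they are not the Porter rules), chosen only so that the five words above collapse to one stem.

```python
from collections import defaultdict

# Toy rules for illustration only (NOT the Porter rule set), longest suffix first.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first (longest) matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of original words that reduce to it."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_stem(w)].append(w)
    return dict(groups)


words = ["connection", "connects", "connected", "connecting", "connections"]
groups = group_by_stem(words)
print(groups)
print(f"{len(set(words))} distinct tokens -> {len(groups)} distinct stem(s)")
```

Listing longer suffixes first is a design choice: it ensures “connections” loses “-ions” in one step instead of stopping at “connection”.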
Types of Stemming Algorithms

1. Porter Stemmer
Description
Developed by Martin Porter in 1980, this is one of the most popular stemming algorithms. It uses a set of rules to iteratively strip suffixes from words to produce stems.

How it Works
The algorithm processes words in multiple steps, where each step applies specific rules to remove common suffixes such as “-ing,” “-ed,” and “-es.”
Example: “running” → “run”, “happiness” → “happi”
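Many Porter rules only fire when the remaining stem is long enough, which the algorithm expresses through a “measure” m: every word is viewed as [C](VC)^m[V], where C and V are runs of consonants and vowels. The sketch below computes m and applies one conditional rule from the algorithm, “(m > 0) EED → EE”. The function names are chosen here for illustration, and this is a fragment, not a complete Porter implementation.

```python
VOWELS = "aeiou"


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: "y" is a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Return m, the number of vowel-run to consonant-run transitions in the stem."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def apply_eed_rule(word: str) -> str:
    """Rule "(m > 0) EED -> EE": shorten only if the part before "eed" has m > 0."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word


for w in ["tree", "trouble", "troubles"]:
    print(w, "has measure", measure(w))
for w in ["agreed", "feed"]:
    print(w, "->", apply_eed_rule(w))
```

The condition is what stops short words from being damaged: “agreed” becomes “agree”, while “feed” is left alone because the part before “eed” has measure 0.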
2. Lovins Stemmer
Description
Created by Julie Beth Lovins in 1968, this was one of the first stemming algorithms used but is less widely adopted today.

How it Works
It works by removing the longest matching suffix from a large set of predefined endings and then applying recoding rules to tidy up the remaining stem. It identifies the root of the word in a single pass.
Example: “fishing” → “fish”, “runner” → “run”
3. Paice & Husk Stemmer
Description
Introduced in 1990 by Paice and Husk at Lancaster University, this is an iterative stemming method driven by a table of rules. It is the same algorithm that is commonly called the Lancaster Stemmer (see item 6).

How it Works
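Each rule names an ending to match, how many characters to remove, an optional replacement string, and whether stemming should continue or stop. Rules are applied repeatedly until none matches, so the algorithm not only strips suffixes but also handles special cases through these pre-defined conditions and replacements.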
Example: “happily” → “happy”
4. Dawson Stemmer
Description
This algorithm is an extension of the Lovins Stemmer that works with a considerably larger list of suffixes.

How it Works
Like the Lovins Stemmer, it removes the longest matching suffix in a single pass; the larger suffix list is intended to cover more word forms, at the cost of more work per lookup.
Example: related forms such as “administered” and “administrator” are meant to be reduced to a common stem. A stemmer only shortens words; it does not map one full word to another.
5. Snowball Stemmer
Description
The English Snowball stemmer, also known as the “Porter2” stemmer, was developed by Martin Porter as an improvement over the original Porter Stemmer. The Snowball framework also provides stemmers for multiple other languages.

How it Works
It applies a more elaborate set of rules and works effectively across different languages, producing more intuitive results than its predecessor.
Example: “running” → “run”, “better” → “better”
6. Lancaster Stemmer
Description
A more aggressive stemming algorithm developed by Chris Paice at Lancaster University; it is an implementation of the Paice & Husk algorithm listed above. It uses a compact table of rules for suffix stripping and tends to be harsher than the Porter Stemmer.

How it Works
It frequently removes more characters and may produce stems that are not actual words. Because it cuts so much, unrelated words are more likely to collapse into the same stem (over-stemming).
Example: “believes” → “believ”, “connection” → “connect”
7. N-Gram Stemmer
Description
This technique represents each word as a set of character n-grams (contiguous sequences of n characters) instead of stripping affixes.

How it Works
It exploits patterns in strings instead of performing suffix stripping: words that share a large proportion of their n-grams are treated as variants of the same term. The similarity it measures is based on character sequences, not on meaning.
Example: For “running” & “runner,” an n-gram model would notice common character sequences to group the words together.
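A minimal sketch of this idea compares the character bigrams of two words with the Dice coefficient. The function names and the choice of bigrams (n = 2) are assumptions made for illustration; a practical system would also need a similarity threshold and a clustering step.

```python
def char_ngrams(word: str, n: int = 2) -> set:
    """Return the set of contiguous character n-grams in a word."""
    word = word.lower()
    return {word[i : i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient: 2 * |shared n-grams| / (|n-grams of a| + |n-grams of b|)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(round(dice_similarity("running", "runner"), 3))
print(round(dice_similarity("running", "walker"), 3))
```

“running” and “runner” share the bigrams “ru”, “un”, and “nn”, so they score well above an unrelated pair such as “running” and “walker”, which share none.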
Comparison of Stemming Algorithms
Stemming Algorithm | Approach | Strengths | Weaknesses
Porter Stemmer | Rule-based, stepwise suffix removal | Popular, balanced accuracy | Sometimes over-stems words |
Lovins Stemmer | Longest suffix removal | Fast and simple | Less accurate |
Paice-Husk Stemmer | Iterative rule-based stripping | More aggressive than Porter | Can remove too much |
Dawson Stemmer | Extended Lovins | Handles more suffixes | Computationally expensive |
Snowball Stemmer | Improved Porter, supports multiple languages | More precise than Porter | Still rule-based |
Lancaster Stemmer | Aggressive truncation | Very fast | Over-stemming issues |
N-Gram Stemmer | Character n-grams | Works well for noisy text | Less traditional stem |
Applications of Stemming in NLP

1. Search Engines and Information Retrieval
Real-Life Example: If you search for “buying shoes,” a search engine that stems both its index and your query can also return results containing “buy,” “buys,” or “shoe,” because these forms reduce to the same stems. Irregular forms such as “bought” and synonyms such as “purchase” are not matched by stemming alone.
Benefit: Improves search accuracy by linking various word forms with a shared root.
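The effect on retrieval can be sketched with a tiny inverted index whose keys are stems instead of raw tokens. Everything here is an assumption made for illustration: the three documents, the function names, and the toy stemmer, which strips only “-ing” and “-s” and is not a real algorithm.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Toy stemmer (illustrative only): strip "-ing" or "-s", keep three letters or more."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents that contain every stemmed query term."""
    postings = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "buying shoes online",
    2: "where to buy a shoe",
    3: "bought new boots",
}
index = build_index(docs)
print(search(index, "buying shoes"))
```

Because the query and the documents pass through the same stemmer, the query “buying shoes” also finds the document that says “buy a shoe”. This toy stemmer only strips suffixes, so the document containing “bought” is not matched.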
2. Text Classification and Sentiment Analysis
Real-Life Example: When analyzing movie reviews from platforms like IMDb or Rotten Tomatoes, stemming can group words like “amazing” and “amazement” under the stem “amaz,” which helps a sentiment analysis model determine whether a review is positive or negative.
Benefit: Ensures consistency in analyzing sentiment, leading to more accurate predictions.
3. Document Clustering and Topic Modeling
Real-Life Example: A news aggregator such as Google News can use stemming to help categorize similar stories. For example, stories containing “political” and “politics” (both reduced to “polit” by the Porter Stemmer) can be categorized under a single topic so that users find similar stories in one place. A form such as “politician” may keep a different stem, depending on the algorithm.
Benefits: Facilitates grouping lots