2. Specialised models vs. LLMs for NLP tasks

In this lecture, we explore the differences between specialized models and Large Language Models (LLMs) in the context of NLP tasks. The field of Natural Language Processing (NLP) has evolved significantly with the advent of LLMs, which have demonstrated remarkable capabilities across a wide range of applications. However, despite their versatility, these models come with significant computational, environmental, and economic costs.

Specialized models, on the other hand, are designed to perform specific NLP tasks efficiently, often requiring fewer resources and providing faster results for well-defined tasks like sentiment analysis, named entity recognition (NER), and intent classification. They can typically outperform LLMs in specific domains when appropriately tuned.

This lecture will focus on comparing the environmental and monetary impacts of LLMs with specialized models, and investigate how specialized models like spaCy or BERT can be used for standard NLP tasks.

Environmental and Monetary Impact of LLMs

As the usage of LLMs like GPT-3 and GPT-4 has grown, the computational, environmental, and economic costs have become increasingly significant. LLMs are energy-intensive due to the immense processing power required for both training and inference, which can lead to higher carbon emissions and greater financial costs. These models are over-parameterized for many simple NLP tasks, such as sentiment analysis or intent classification, where more efficient, specialized models could be employed instead.

From a practical perspective, it does not make sense to rely on LLMs for standardized NLP tasks when highly optimized, task-specific models have been available for decades. These smaller models are energy-efficient and less costly, and they often outperform LLMs in specific applications when fine-tuned to the task at hand. In business settings, where time and cost efficiency are crucial, the use of specialized models like BERT or spaCy makes more economic sense for standard tasks.

A recent study by Luccioni et al. (2024) offers insights into the energy consumption of LLMs compared to smaller, task-specific models. Their research shows that multipurpose, generative models like GPT-3 consume far more energy during inference than task-specific models designed for discrete tasks such as text classification or sentiment analysis. For example, a fine-tuned BERT-based model for sentiment analysis, such as bert-base-multilingual-uncased-sentiment, emits 0.32g of CO2 per 1,000 queries, while larger, multipurpose models like Flan-T5-XL and BLOOMz-7B emit 2.66g and 4.67g CO2, respectively, per 1,000 queries. This stark difference highlights the inefficiency of using large, generative models for tasks that can be handled by much smaller, purpose-built models (Luccioni et al., 2024).
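
To put these figures in perspective, the short back-of-envelope sketch below scales the per-1,000-query emissions quoted above to a yearly total. The workload of one million queries per day is an illustrative assumption, not a value from the study.

# Back-of-envelope comparison based on the per-1,000-query figures from Luccioni et al. (2024)
EMISSIONS_PER_1K_QUERIES_G = {
    "bert-base-multilingual-uncased-sentiment": 0.32,
    "Flan-T5-XL": 2.66,
    "BLOOMz-7B": 4.67,
}

QUERIES_PER_DAY = 1_000_000  # hypothetical workload, chosen for illustration

for model, grams_per_1k in EMISSIONS_PER_1K_QUERIES_G.items():
    yearly_kg = grams_per_1k * (QUERIES_PER_DAY / 1_000) * 365 / 1_000
    print(f"{model}: ~{yearly_kg:,.0f} kg CO2 per year")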

The study also illustrates how the complexity of the task plays a role in energy consumption. Generative tasks like text generation, summarization, and translation are much more energy-intensive compared to discriminative tasks like text classification. Decoder-only models, such as those used in many generative LLMs, are particularly inefficient for tasks with longer outputs, making them less suitable for applications where energy consumption is a concern.

In many real-world scenarios, LLMs are deployed without fully weighing their environmental impact against their utility, particularly for tasks where smaller, specialized models would suffice. As the technology industry leans more toward deploying general-purpose LLMs for a broad array of tasks, it's crucial to consider the environmental and economic trade-offs.

Thus, while LLMs are revolutionary in terms of their capabilities, their use for standard, well-defined NLP tasks is both environmentally and financially inefficient. Deploying smaller, specialized models saves energy, reduces emissions, and lowers operational expenses, making them a more sustainable option for businesses and researchers alike.

Introduction to Specialized Models for NLP

While LLMs offer broad versatility, specialized models are optimized for specific functions like sentiment analysis, named entity recognition (NER), and intent classification. These models often require fewer computational resources, have faster inference times, and can be more accurate in their specific domains when fine-tuned properly. We will explore some of the most popular specialized models and libraries used in NLP, including spaCy and Hugging Face Transformers.

spaCy

spaCy is a powerful, open-source library for advanced NLP in Python, designed for production use. It supports a wide range of NLP tasks and provides efficient, pre-trained models for many languages. spaCy is known for its speed and flexibility, making it suitable for building applications that require processing and understanding large volumes of text.

Key Features of spaCy:

  • Tokenization: spaCy segments text into words, punctuation marks, and other elements, forming the basic building blocks for further NLP processing.

  • Part-of-Speech (POS) Tagging: spaCy assigns part-of-speech tags to tokens, categorizing them as nouns, verbs, adjectives, etc., which is crucial for understanding grammatical structure.

  • Dependency Parsing: This feature helps in understanding the syntactic structure of sentences by identifying relationships between tokens, such as subject-object relationships.

  • Lemmatization: spaCy can reduce words to their base forms (lemmas), such as converting “was” to “be” or “rats” to “rat.”

  • Sentence Boundary Detection (SBD): spaCy can detect and segment individual sentences within a document.

  • Named Entity Recognition (NER): This component labels named entities like people, organizations, and locations in the text.

  • Entity Linking (EL): spaCy disambiguates entities by linking them to unique identifiers in a knowledge base.

  • Similarity Analysis: spaCy provides tools for comparing words, phrases, and documents to measure their similarity.

  • Text Classification: It can assign categories or labels to entire documents or specific parts of a document.

  • Rule-Based Matching: This feature allows finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.

spaCy’s versatility is enhanced through its support for over 75 languages, 84 pre-trained pipelines, and the ability to integrate with custom components. It is highly extensible, supporting custom models in PyTorch, TensorFlow, and other frameworks. For visualization, spaCy includes built-in tools for syntax and NER, making it easier to understand and debug NLP models.
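
To see several of these features in one pass, the minimal sketch below runs a pre-trained English pipeline over a single sentence and prints tokens, POS tags, lemmas, and named entities. It assumes the en_core_web_md model has been installed (e.g., via python -m spacy download en_core_web_md); the example sentence is purely illustrative.

import spacy

# Load a pre-trained English pipeline
nlp = spacy.load("en_core_web_md")

doc = nlp("Apple was founded by Steve Jobs in Cupertino in 1976.")

# Tokenization, POS tagging, and lemmatization
for token in doc:
    print(f"{token.text:<10} POS: {token.pos_:<6} Lemma: {token.lemma_}")

# Named Entity Recognition
for ent in doc.ents:
    print(f"Entity: '{ent.text}', Type: '{ent.label_}'")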

Hugging Face Transformers & BERT

Hugging Face Transformers is a versatile and widely used library that provides seamless access to various state-of-the-art NLP models, such as BERT, GPT, and many others. It enables researchers and developers to fine-tune these models for specific tasks like text classification, machine translation, named entity recognition, and sentiment analysis. The library offers pre-trained models, easy-to-use APIs, and tools for model training and deployment, making advanced NLP technology more accessible and usable across different applications.

One of the key models in this library is the bert-base-multilingual-uncased-sentiment model. This model is a fine-tuned version of the multilingual BERT model designed specifically for sentiment analysis on product reviews. It supports six languages: English, Dutch, German, French, Spanish, and Italian. The model predicts the sentiment of a review as a number of stars (between 1 and 5), making it particularly useful for evaluating customer feedback in multiple languages and contexts.

Although BERT models, including the multilingual version, are more resource-intensive than specialized libraries like spaCy, they offer significantly better accuracy and generalization across diverse linguistic settings. This superior performance is achieved by capturing intricate semantic nuances through their deep learning architecture, making them ideal for more complex and high-stakes sentiment analysis tasks.

It's worth noting that while BERT requires more computational resources compared to traditional NLP methods, it is still considerably more efficient than large-scale decoder-only LLMs like GPT-3 or PaLM, especially for inference tasks. As highlighted by Luccioni et al. (2024), using BERT-based models for specific tasks like sentiment analysis can lead to a more balanced approach, offering high accuracy without the extreme energy consumption associated with deploying massive generative models for such tasks. This balance between performance and resource efficiency makes BERT a practical choice for many real-world applications where both effectiveness and sustainability are concerns.

Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a crucial NLP task that involves determining the emotional tone behind a piece of text. It is widely used in various applications such as customer feedback analysis, social media monitoring, and market research. By classifying text as positive, negative, or neutral, sentiment analysis helps organizations gain insights into public opinion and customer satisfaction. This chapter will explore different approaches and tools for performing sentiment analysis, including specialized models like spaCy with plugins, traditional libraries like NLTK, and advanced methods using Hugging Face Transformers and BERT.

Sample text
text = "spaCy makes NLP tasks so easy! I love using it for my projects."

Using BERT for Sentiment Analysis

The following code demonstrates how to perform sentiment analysis using a pre-trained BERT model from the Hugging Face Transformers library. In this example, we use the nlptown/bert-base-multilingual-uncased-sentiment model, which is fine-tuned for sentiment analysis on product reviews in six different languages, including English, German, French, and Spanish.

  1. Loading the Model:

    from transformers import pipeline

    # Load the pre-trained BERT sentiment analysis model
    model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
    classifier = pipeline("sentiment-analysis", model=model_name)

    Here, the BERT model is loaded using the pipeline function from the Transformers library. This pipeline is specifically configured for sentiment analysis and uses the nlptown/bert-base-multilingual-uncased-sentiment model, which assigns sentiment scores based on a five-star rating system.

  2. Classifying the Sentiment:

    # Classify the sentiment of the input text
    results = classifier(text)

    The input text is passed to the classifier, which predicts the sentiment based on the model's training data. The result contains the predicted label (e.g., '1 star' to '5 stars') and the confidence score, indicating the certainty of the prediction.

  3. Mapping Sentiment Labels:

    # Map the BERT model's sentiment labels to our custom sentiment labels
    label_map = {
        '1 star': 'sentiment_negative',
        '2 stars': 'sentiment_negative',
        '3 stars': 'sentiment_neutral',
        '4 stars': 'sentiment_positive',
        '5 stars': 'sentiment_positive'
    }

    Since the model uses a five-star rating system, we map these ratings to more generalized sentiment categories. For example, '1 star' and '2 stars' are mapped to 'sentiment_negative', while '4 stars' and '5 stars' are mapped to 'sentiment_positive'. This step helps simplify the analysis by grouping similar sentiments together.

  4. Extracting and Displaying Results:

    # Extract the original label, mapped label, and score
    original_label = results[0]['label']
    mapped_label = label_map[original_label]
    score = results[0]['score']
    
    # Print the sentiment result in a clearer format
    print(f"\nSentiment Analysis Result:\n{'-'*30}")
    print(f"Text: '{text}'")
    print(f"Original Label: '{original_label}'")
    print(f"Mapped Label: '{mapped_label}'")
    print(f"Confidence Score: {score:.2f}")

    The code extracts the original sentiment label, maps it to the generalized label, and prints the result along with the confidence score. This gives a clear summary of the sentiment classification.

For the example text, “spaCy makes NLP tasks so easy! I love using it for my projects.”, the model predicted a sentiment of '5 stars' with a confidence score of 0.75. After mapping, this corresponds to a 'sentiment_positive' label, indicating that the text expresses a positive sentiment with moderate confidence.

Output
Sentiment Analysis Result:
------------------------------
Text: 'spaCy makes NLP tasks so easy! I love using it for my projects.'
Original Label: '5 stars'
Mapped Label: 'sentiment_positive'
Confidence Score: 0.75
  • Advantages:

    • Accuracy and Generalization: BERT models are highly accurate and can generalize well across multiple languages and contexts, making them ideal for complex sentiment analysis tasks.

    • Multilingual Support: This model supports multiple languages, which is beneficial for applications dealing with multilingual datasets.

  • Limitations:

    • Resource-Intensive: BERT models are computationally expensive compared to simpler models like those used in spaCy. However, they are still more efficient than large-scale decoder LLMs like GPT-3 or GPT-4 (Luccioni et al., 2024).

    • Latency: The processing time can be longer, making them less suitable for real-time applications in resource-constrained environments.

Training a Custom Sentiment Model with spaCy

The following code shows how to train a custom sentiment analysis model using spaCy. This approach provides more control and flexibility compared to pre-trained models like BERT. Custom training allows us to tailor the model to specific datasets and domains, enhancing its performance for specialized tasks.

  1. Model Setup and Initialization:

    import spacy
    from spacy.training.example import Example
    import random
    
    # Load a blank English model
    nlp = spacy.blank("en")
    
    # Load the pre-trained spaCy model and copy the vocab to the custom model for later merging
    pretrained_nlp = spacy.load("en_core_web_md")
    nlp.vocab = pretrained_nlp.vocab

    We start by initializing a blank English model in spaCy and copying the vocabulary from a pre-trained model (en_core_web_md). Sharing the vocabulary gives the custom model access to the pre-trained word vectors and string store, and it also makes it straightforward to merge the trained sentiment component back into the pre-trained pipeline later.

  2. Creating and Configuring the TextCategorizer:

    # Create a new TextCategorizer
    textcat = nlp.add_pipe("textcat", name="sentiment")
    
    # Add the labels to the model
    textcat.add_label("sentiment_positive")
    textcat.add_label("sentiment_neutral")
    textcat.add_label("sentiment_negative")

    A TextCategorizer component is added to the spaCy pipeline for text classification. We define three labels: sentiment_positive, sentiment_neutral, and sentiment_negative. These labels represent the possible sentiment categories our model will classify text into.

  3. Preparing and Formatting Training Data:

    # Prepare training data in the desired format
    train_data = [
        ("I love this product!", "sentiment_positive"),
        ("This is a disappointment.", "sentiment_negative"),
        ("It's okay.", "sentiment_neutral"),
        ("It's useful.", "sentiment_neutral"),
        ("Absolutely fantastic experience.", "sentiment_positive"),
        ("Terrible service, will not come back.", "sentiment_negative"),
    ]
    
    # Format training data
    formatted_train_data = []
    for train_text, train_label in train_data:
        # Initialize all categories with 0.0
        cats = {"sentiment_positive": 0.0, "sentiment_neutral": 0.0, "sentiment_negative": 0.0}
        # Set the correct label to 1.0
        cats[train_label] = 1.0
        formatted_train_data.append((train_text, {"cats": cats}))

    The training data is formatted as a list of tuples, with each tuple containing a text sample and its corresponding sentiment label. We then convert this data into a format suitable for spaCy’s training process, where each label is represented as a dictionary of categories.

  4. Training the Model:

    # Configuration for training
    EPOCHS = 20  # Number of training epochs
    DROP_RATE = 0.5  # Dropout rate for the TextCategorizer
    SEED = 42
    random.seed(SEED)  # Seed the Python random module
    spacy.util.fix_random_seed(SEED)  # Set the seed for spaCy's internal operations
    
    # Training loop
    optimizer = nlp.begin_training()
    for epoch in range(EPOCHS):
        # Shuffle the training data before each epoch
        random.shuffle(formatted_train_data)
        losses = {}
    
        # Update the model with all training data in each epoch
        for train_text, annotations in formatted_train_data:
            example = Example.from_dict(nlp.make_doc(train_text), annotations)
            nlp.update([example], drop=DROP_RATE, losses=losses)
        
        print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {losses['sentiment']:.4f}")

    The training process involves running multiple epochs (20 in this case), where the model iteratively learns from the training data. A dropout rate is used to prevent overfitting, and the model's loss is monitored to gauge its learning progress. By the end of training, the model is optimized for sentiment classification based on the provided examples.

  5. Saving and Loading the Custom Model:

    # Save the trained model
    nlp.to_disk("en_custom_sentiment_model")
    print("Model saved to 'en_custom_sentiment_model'")

    Once training is complete, the custom sentiment model is saved to disk for future use. This allows us to reload and apply the model to new text data for sentiment analysis.

Using the Trained Model for Sentiment Analysis:

import spacy

# Load the trained custom sentiment model
nlp = spacy.load("en_custom_sentiment_model")

# Process the text with the custom sentiment model
doc = nlp(text)

# Get sentiment scores
polarity_scores = doc.cats

# Determine the sentiment label based on the highest score
sentiment_label = max(polarity_scores, key=polarity_scores.get)

# Print the sentiment result in a formatted way
print(f"Sentiment Analysis Result (Custom Model):\n{'-'*40}")
print(f"Text: '{text}'")
for label, score in polarity_scores.items():
    print(f"{label}: {score:.2f}")
print(f"Sentiment Label: '{sentiment_label}'")

The custom model is loaded, and text is processed to obtain sentiment scores. The model evaluates the text across the three sentiment categories, assigning a probability score to each. The label with the highest score is selected as the final sentiment.

Output:
Sentiment Analysis Result (Custom Model):
----------------------------------------
Text: 'spaCy makes NLP tasks so easy! I love using it for my projects.'
sentiment_positive: 0.60
sentiment_neutral: 0.04
sentiment_negative: 0.37
Sentiment Label: 'sentiment_positive'

Merging the Custom Model with a Pre-Trained spaCy Model:

To enhance the capabilities of the custom sentiment model, we can merge it with a pre-trained spaCy model like en_core_web_md:

import spacy

# Load the pre-trained spaCy model (e.g., en_core_web_md)
pretrained_nlp = spacy.load("en_core_web_md")

# Load the custom sentiment model
custom_nlp = spacy.load("en_custom_sentiment_model")

# Add the custom TextCategorizer from the custom model to the pre-trained model
pretrained_nlp.add_pipe("sentiment", source=custom_nlp, name="sentiment", last=True)

# Optional: Save the merged model
pretrained_nlp.to_disk("en_core_web_md_sentiment")
print("Merged model saved to 'en_core_web_md_sentiment'")

By integrating the custom sentiment component into a robust pre-trained model, we create a more comprehensive NLP solution capable of handling various tasks with enhanced sentiment analysis capabilities.

Key Takeaways:

  • Custom Training Flexibility: Custom models provide the flexibility to be fine-tuned for specific applications, improving their performance in domain-specific contexts.

  • Resource Efficiency: Although custom models are not as sophisticated as large-scale LLMs, they offer a good balance between resource efficiency and accuracy for targeted NLP tasks.

  • Enhanced Capabilities: Merging custom models with pre-trained ones leverages the strengths of both, creating powerful, versatile NLP solutions.

Intent Classification

In this subsection, we will explore how to use spaCy's pre-trained word embeddings and cosine similarity to perform intent classification efficiently. Unlike traditional approaches that rely on pre-defined rules or intent classifiers, this method leverages the inherent similarity between vector representations of words and phrases. By using cosine similarity, we can measure how close the meaning of a user's query is to predefined intent categories, allowing for a high-performance and adaptable solution.
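
Under the hood, spaCy's doc.similarity() is essentially the cosine similarity between the averaged word vectors of the two texts. The minimal sketch below reproduces that computation manually with NumPy so the scores reported later in this section are easier to interpret; it assumes the en_core_web_md model is installed, and the two sentences are taken from the example that follows.

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

doc_a = nlp("I would like to order pizza")
doc_b = nlp("Hello, I want to have pizzas.")

# spaCy's built-in similarity
print(f"doc.similarity: {doc_a.similarity(doc_b):.2f}")

# The same value computed by hand as a cosine similarity of the document vectors
cosine = np.dot(doc_a.vector, doc_b.vector) / (
    np.linalg.norm(doc_a.vector) * np.linalg.norm(doc_b.vector)
)
print(f"Manual cosine:  {cosine:.2f}")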

  1. Loading spaCy and Defining and Precomputing Intents:

    import spacy
    
    # Load a spaCy English model
    nlp = spacy.load('en_core_web_md')
    
    # Define sample intents and their example phrases
    intents = { 
        "Greeting": ["Hello", "Hi", "Hey there"], 
        "Farewell": ["Bye", "Goodbye", "See you later"], 
        "Place_Order": [ 
            "I would like to order pizza",
            "Can I get a burger?",
            "Place an order for food",
        ],
    }
    
    # Pre-compute the vectors for all example phrases using only the make_doc method
    precomputed_vectors = {}
    for intent, examples in intents.items():
        precomputed_vectors[intent] = [nlp.make_doc(example) for example in examples]

    We begin by loading the en_core_web_md spaCy model, which contains pre-trained word vectors for English. Next, we define a set of sample intents along with example phrases representing each intent; these serve as the reference against which the input text will be compared. The example phrases are pre-computed with nlp.make_doc, which only tokenizes the text instead of running the full pipeline; since the word vectors are looked up from the model's vocabulary, the resulting Doc objects can still be used for similarity comparisons.

  2. Processing Input Text:

    # Define a similarity threshold for non-matching intents
    similarity_threshold = 0.7
    
    # Input text for testing
    text = "Hello, I want to have pizzas."
    
    # Process the input text with spaCy
    doc = nlp(text)

    Here, we define an input text sample and process it using the spaCy pipeline. The nlp object transforms the input text into a doc object containing vector embeddings for the entire text, which can be used for similarity comparison.

  3. Calculating Cosine Similarity:

    # List to store the best match for each intent
    intent_matches = []
    
    # For each intent, check similarity with the example phrases
    for intent, example_docs in precomputed_vectors.items():
        best_similarity = 0.0  # Track the highest similarity for this intent
        for example_doc in example_docs:
            # Calculate similarity between input text and precomputed example phrase
            similarity = doc.similarity(example_doc)
            
            # Print similarity for each example (to see the comparison process)
            print(f"Similarity with '{example_doc.text}' ({intent}): {similarity:.2f}")
            
            # Update the best similarity for this intent
            if similarity > best_similarity:
                best_similarity = similarity
        
        # Store the best similarity score for this intent
        intent_matches.append((intent, best_similarity))

    For each predefined intent, we compute the cosine similarity between the input text and each example phrase. The doc.similarity() function calculates the cosine similarity between the vector embeddings of the input text and the example phrases, indicating how closely related they are in meaning.

  4. Determining the Best Matching Intent:

    # Sort the intents based on the best similarity score (highest first)
    intent_matches.sort(key=lambda x: x[1], reverse=True)
    
    # Print the intents in order of similarity
    print("\nIntents ordered by similarity:")
    for intent, score in intent_matches:
        print(f"Intent: '{intent}', Similarity Score: {score:.2f}")
    
    # Get the best matching intent (the one with the highest similarity)
    best_intent, best_score = intent_matches[0]
    
    # Check if the best score is above the similarity threshold
    if best_score >= similarity_threshold:
        print(f"\nBest matching intent: '{best_intent}' with similarity score: {best_score:.2f}")
    else:
        print(f"\nNo matching intent found. Scores are below the threshold of {similarity_threshold}.")

    After calculating the similarity scores for each intent, we sort them in descending order based on their best similarity score. The intent with the highest score is considered the best match. If the best score surpasses a predefined threshold, we identify the intent; otherwise, we conclude that no matching intent was found.

The output provides a list of intents ordered by their similarity scores, showing how closely each intent matches the input text.

Output
Similarity with 'Hello' (Greeting): 0.23
Similarity with 'Hi' (Greeting): 0.11
Similarity with 'Hey there' (Greeting): 0.59
Similarity with 'Bye' (Farewell): -0.01
Similarity with 'Goodbye' (Farewell): 0.06
Similarity with 'See you later' (Farewell): 0.63
Similarity with 'I would like to order pizza' (Place_Order): 0.91
Similarity with 'Can I get a burger?' (Place_Order): 0.69
Similarity with 'Place an order for food' (Place_Order): 0.35

Intents ordered by similarity:
Intent: 'Place_Order', Similarity Score: 0.91
Intent: 'Farewell', Similarity Score: 0.63
Intent: 'Greeting', Similarity Score: 0.59

Best matching intent: 'Place_Order' with similarity score: 0.91

The input text “Hello, I want to have pizzas.” best matches the “Place_Order” intent with a high similarity score of 0.91, indicating that the user intends to place an order.

Key Takeaways:

  • Efficiency and Scalability: This method efficiently handles intent classification with high performance, leveraging spaCy’s robust vector embeddings and similarity measures.

  • Adaptability: The intent classification system can be easily adapted to new use cases by updating the list of intents and example phrases.

  • Limitations: The effectiveness depends on the quality and diversity of the example phrases. Expanding the example set can improve the model's ability to recognize diverse user inputs.

By using this method, businesses can implement an effective intent classification system for chatbots, virtual assistants, or any application that requires quick and reliable understanding of user queries.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying entities within a text. These entities can be people, organizations, dates, locations, and more, providing structured information from unstructured text. NER can be used in various applications, such as information extraction, question answering, and content categorization.

This chapter will explore different approaches to implementing NER using the spaCy library. We will cover the standard spaCy NLP pipeline, followed by custom approaches that leverage rule-based, vector-based, and Part-of-Speech (POS)-based methods for advanced entity recognition tasks.

Let's begin by understanding the standard spaCy NLP pipeline and its components.

Standard spaCy NLP Pipeline

The spaCy library provides a robust, pre-trained pipeline for NLP tasks, including NER. This pipeline consists of several stages that process the input text sequentially, performing various linguistic analyses to extract structured information.

Code Example

Here’s the code for processing an input sentence with the standard spaCy pipeline:

import spacy

# Load a spaCy English model (mid size)
nlp = spacy.load('en_core_web_md')

# Input sentence
text = "In 1898, Marie Curie discovered the chemical element radium in Paris."

# Process the sentence
doc = nlp(text)

Pipeline Stages Explained

  1. Tokenization and tok2vec: The input text is segmented into individual tokens, which can be words, punctuation, or symbols. Each token is then converted into a dense vector representation using the tok2vec layer, capturing contextual information for each token.

  2. Part of Speech (POS) Tagging: Each token is assigned a POS tag, such as noun, verb, or adjective, indicating its grammatical role in the sentence.

  3. Dependency Parsing and Chunking: Dependency parsing identifies relationships between tokens, building a parse tree that shows how words are connected. This step is crucial for understanding the syntactic structure of the sentence. Chunking involves grouping related tokens, like noun phrases or verb phrases, based on the dependency tree.

  4. Lemmatization (Optional): Tokens are reduced to their base forms, known as lemmas (e.g., “running” becomes “run”).

  5. Morphological Analysis (Optional): This step involves analyzing the morphological features of each token, such as number, tense, or gender.

  6. Sentence Boundary Detection (Optional): Identifies the boundaries of sentences within the text. This is especially useful for processing longer documents where multiple sentences need to be handled separately.

  7. Named Entity Classification (NER): Entities in the text are identified and classified into predefined categories like PERSON, DATE, GPE (Geopolitical Entity), etc.

  8. Entity Extraction: Extracted entities are then presented with their corresponding labels.
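
Continuing from the code above, the recognized entities can be inspected directly in text form, and spaCy's built-in displaCy visualizer can render them. This is a sketch of one way to produce visualizations like the figures referenced in the next subsections, not necessarily how the original figures were generated.

# Inspect the extracted entities with their labels
for ent in doc.ents:
    print(f"Entity: '{ent.text}', Type: '{ent.label_}'")

# Optional: render the entity and dependency visualizations (e.g., in a notebook)
from spacy import displacy

displacy.render(doc, style="ent")
displacy.render(doc, style="dep")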

Visualizing Named Entities

The following figure shows a visual representation of the named entities detected in the input text. For example, “1898” is recognized as a DATE, “Marie Curie” as a PERSON, and “Paris” as a GPE (Geopolitical Entity). This visualization aids in understanding how spaCy identifies and categorizes various elements within the sentence.

Visualizing Dependency Parse

The following figure displays the dependency parse tree of the sentence, including POS tags. The parse tree illustrates the syntactic structure, with arrows pointing from heads (main words) to dependents (related words), showing relationships such as subject, object, and modifiers.

With these standard techniques, spaCy provides a powerful and efficient tool for performing NER and other NLP tasks. The following subsections will dive into custom NER implementations, expanding beyond the capabilities of the default spaCy pipeline.

Rule-based Custom Named Entity Recognition (NER)

In this section, we explore how to create a custom Named Entity Recognition (NER) system using a rule-based approach. Unlike statistical models, which rely on large annotated datasets and machine learning algorithms, rule-based NER systems identify entities based on predefined patterns or dictionaries. This method is particularly useful for recognizing domain-specific entities, such as chemical elements or technical terms, where data might be scarce or where precision is critical.

Rule-based NER in spaCy is implemented using the EntityRuler component. This component allows us to add custom patterns to the NLP pipeline, enabling the model to recognize entities that might not be captured by the pre-trained NER model. We define these patterns using a dictionary of entities and their associated labels. The EntityRuler then uses these patterns to identify and classify entities within the text.

Code Explanation

  1. Defining Custom Entities: We start by defining a dictionary of custom entities. In this example, we are focusing on chemical elements, which will be labeled as CHEMICAL_ELEMENT.

    # Define custom entities with their respective items
    custom_entities = {
        "CHEMICAL_ELEMENT": [
            "lithium", "beryllium", "boron", "carbon", "nitrogen", "radium",
            "oxygen", "fluorine", "neon", "sodium", "magnesium", "aluminum", "silicon",
            "phosphorus", "sulfur", "chlorine", "argon", "potassium", "calcium",
            "sodium chloride", "carbon dioxide", "nitric acid", "sulfuric acid"
        ]
    }

    This dictionary maps each chemical element to the label CHEMICAL_ELEMENT, allowing us to recognize these specific terms in the text.

  2. Adding the EntityRuler to the spaCy Pipeline: We then use the EntityRuler component to add these custom patterns to the spaCy pipeline. This is done before the default ner component to ensure that our custom entities are recognized first.

    import spacy
    from spacy.pipeline import EntityRuler
    
    nlp = spacy.load('en_core_web_md')
    
    # Remove existing entity ruler if present
    if "custom_entity_ruler" in nlp.pipe_names:
        nlp.remove_pipe("custom_entity_ruler")
    
    # Create a new EntityRuler and add it to the pipeline
    ruler = nlp.add_pipe("entity_ruler", before="ner", name="custom_entity_ruler")

  3. Defining Patterns: We create a list of patterns for each entity in our dictionary. Each pattern is a dictionary with a label and a matching pattern, which is simply the entity name in this case.

    # Define patterns for all custom entities
    patterns = []
    for entity_label, entity_list in custom_entities.items():
        for entity in entity_list:
            patterns.append({"label": entity_label, "pattern": entity})

  4. Adding Patterns to the EntityRuler: These patterns are then added to the EntityRuler, allowing the model to recognize the defined entities in the text.

    # Add the patterns to the EntityRuler
    ruler.add_patterns(patterns)

  5. Processing the Text: Finally, we process the input text with the modified NLP pipeline, which now includes the custom EntityRuler. The model will recognize and classify the custom entities along with the standard entities.

    # Process the text with the NLP model
    doc = nlp(text)

Output: The resulting output visualization shows the custom entities alongside the standard ones.
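
Since the visualization itself is not reproduced here, the entities can also be listed in text form. With the dictionary above, 'radium' should now be labelled as CHEMICAL_ELEMENT alongside the standard DATE, PERSON, and GPE entities.

# Print the recognized entities (standard and custom) in text form
for ent in doc.ents:
    print(f"Entity: '{ent.text}', Type: '{ent.label_}'")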

Vector-based Custom Named Entity Recognition (NER)

In the previous section, we utilized a rule- and dictionary-based approach for Named Entity Recognition (NER). While this method works well for predefined entities, it has limitations when the input contains terms that are not explicitly included in the dictionary. For instance, if our custom entity list does not include the term “radium”, as shown below, the rule-based approach would fail to recognize it as a chemical element.

# Define custom entities with their respective items without radium
custom_entities = {
    "CHEMICAL_ELEMENT": [
        "lithium", "beryllium", "boron", "carbon", "nitrogen", 
        "oxygen", "fluorine", "neon", "sodium", "magnesium", "aluminum", "silicon",
        "phosphorus", "sulfur", "chlorine", "argon", "potassium", "calcium",
        "sodium chloride", "carbon dioxide", "nitric acid", "sulfuric acid"
    ]
}

In such cases, we can leverage vector-based similarity to identify entities that are semantically similar to those in our custom list, even if they are not an exact match. This approach relies on word embeddings, which represent words in a high-dimensional vector space, capturing their semantic meaning. By comparing the vectors of unknown words to those of known entities, we can detect and classify similar terms.

Implementing Vector-based NER with spaCy

  1. Calculating Vector Similarity: We start by comparing the vector of each token in the input text against the vectors of known chemical elements in our dictionary. If the similarity score exceeds a predefined threshold, we consider the token as similar to a chemical element.

    import spacy
    from spacy.language import Language
    from spacy.tokens import Span
    
    nlp = spacy.load('en_core_web_md')
    
    # Similarity threshold
    similarity_threshold = 0.7
    
    # Define a custom component for the spaCy pipeline
    @Language.component("similarity_entity_recognizer")
    def similarity_entity_recognizer(doc):
        similar_elements = set()  # Use set to avoid duplicate patterns
        
        # Check similarity of each token in the sentence with all custom entity terms
        for token in doc:
            for entity_label, entity_list in custom_entities.items():
                for element in entity_list:
                    element_doc = nlp.make_doc(element)
                    # Compute the similarity with the individual token
                    if token.has_vector and element_doc.vector_norm:
                        similarity = token.similarity(element_doc)
                        if similarity >= similarity_threshold:
                            print(f"'{token.text}' is similar to '{element}' ({entity_label}) with similarity {similarity:.2f}")
                            similar_elements.add((token, entity_label))
        
        # Add custom entities as spans to doc.ents
        doc.ents = list(doc.ents) + [Span(doc, token.i, token.i + 1, label=label) for token, label in similar_elements]
        return doc

    In this step, we loop through each token in the input text and compare it against each entity in the custom dictionary using cosine similarity. If the similarity score is above 0.7 (our threshold), we identify the token as a potential entity.

  2. Adding the Custom Component: Afterward, we add the previously defined component to the spaCy pipeline, placing it before the default ner component. This allows us to capture these similar terms as entities during text processing.

    # Remove the custom component if it already exists
    if "similarity_entity_recognizer" in nlp.pipe_names:
        nlp.remove_pipe("similarity_entity_recognizer")
    
    # Add the custom component to the pipeline before the default 'ner' component
    nlp.add_pipe("similarity_entity_recognizer", before="ner")
    
    # Reprocess the text with the updated NLP model
    doc = nlp(text)

    By doing this, we enable the NLP model to recognize these terms as custom entities in the future.

Output Analysis: The modified NLP model processes the text and identifies both standard and custom entities based on vector similarity. The output shows that even though “radium” was not explicitly included in our custom dictionary, the model successfully identifies it as a CHEMICAL_ELEMENT due to its high similarity score with known elements.

Output
Entity: '1898', Type: 'DATE'
Entity: 'Marie Curie', Type: 'PERSON'
Entity: 'radium', Type: 'CHEMICAL_ELEMENT'
Entity: 'Paris', Type: 'GPE'

Benefits of Vector-based NER

  • Flexibility: Unlike rule-based approaches, vector-based NER does not require exact matches, making it robust to variations in language and terminology.

  • Scalability: This method can be applied to large corpora without extensive manual annotation or dictionary expansion.

  • Domain Adaptation: By leveraging pre-trained embeddings, vector-based NER can be easily adapted to new domains with minimal retraining.

Limitations

  • Accuracy: The performance heavily relies on the quality of the pre-trained word embeddings and the chosen similarity threshold.

  • Computational Cost: Calculating similarity scores for every token against a large set of entities can be computationally expensive.

In summary, vector-based custom NER provides a powerful tool for identifying entities in cases where rule-based methods fall short. By leveraging semantic similarity, this approach enhances the model's ability to recognize entities with flexibility and precision. In the next section, we will further refine our custom NER capabilities using a POS-based custom noun chunker.

POS-based Custom Noun Chunker for Vector-based Custom NER

In the previous section, we explored how vector-based Named Entity Recognition (NER) can help identify entities not present in a predefined dictionary.

# Define custom entities with their respective items without radium
custom_entities = {
    "CHEMICAL_ELEMENT": [
        "lithium", "beryllium", "boron", "carbon", "nitrogen", 
        "oxygen", "fluorine", "neon", "sodium", "magnesium", "aluminum", "silicon",
        "phosphorus", "sulfur", "chlorine", "argon", "potassium", "calcium",
        "sodium chloride", "carbon dioxide", "nitric acid", "sulfuric acid"
    ]
}

However, this approach only works well when the entities are represented as single tokens. In real-world scenarios, entities often span multiple words, such as “the chemical element radium”. Using a single-token approach would miss the complete entity, especially if the exact term isn't in the dictionary.

The built-in noun chunker in spaCy is quite effective at identifying multi-word phrases, such as “the chemical element radium”. It segments the text into meaningful phrases, which can be directly accessed using:

for chunk in doc.noun_chunks:
    print(chunk.text)

In the example above, spaCy would correctly extract “the chemical element radium” as a single chunk. However, this might pose a problem when performing similarity matching with a custom entity list. If the similarity threshold is set too high, the entire chunk might not match any of the entities because it contains additional words that aren’t in the custom list.
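
The effect is easy to check by comparing chunk-level and token-level similarities against a term from the custom list. The sketch below is purely illustrative; the exact scores depend on the model and the chosen threshold.

import spacy

nlp = spacy.load("en_core_web_md")

# Compare a multi-word chunk and a single token against a known entity term
chunk = nlp("the chemical element radium")
token_only = nlp("radium")
reference = nlp("carbon")  # an entry from the custom list

print(f"Chunk vs 'carbon': {chunk.similarity(reference):.2f}")
print(f"Token vs 'carbon': {token_only.similarity(reference):.2f}")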

To address this issue, we will use a custom Part-of-Speech (POS) based noun chunker to identify multi-word entities. This approach allows us to capture more comprehensive chunks of text that could represent complex entities, such as chemical compounds or scientific terms, even when individual words are not found in the dictionary.

Custom POS-based Noun Chunker

A noun chunker segments a sentence into syntactically related groups of words, often corresponding to noun phrases. By customizing the chunking process using a specific POS pattern, we can better capture complex entities.

The following pattern is used to define our custom noun chunker:

pattern = [
    {"POS": "DET", "OP": "?"},  # Optional determiner
    {"POS": "NUM", "OP": "*"},  # Optional numbers
    {"POS": "ADJ", "OP": "*"},  # Optional adjectives
    {"POS": "NOUN", "OP": "*"},  # Optional noun modifiers
    {
        "POS": {"IN": ["NOUN", "PROPN", "PRON"]},
        "OP": "+",
    },  # One or more nouns, proper nouns, or pronouns
    {"POS": "NUM", "OP": "?"},  # Optional numbers
]

This pattern can be broken down as follows:

  1. Determiner (DET): An optional word that typically precedes a noun to specify reference, such as “the” or “a”.

  2. Number (NUM): Optional numeric modifiers that can indicate quantities or orders, like “two” or “3rd”.

  3. Adjective (ADJ): Optional descriptive words that modify nouns, such as “chemical” or “famous”.

  4. Noun (NOUN): Optional noun modifiers that add context or specification to the main noun phrase.

  5. Core Noun (NOUN, PROPN, PRON): One or more core elements that form the primary subject of the chunk. This includes:

    • Nouns (NOUN): Common nouns (e.g., “book”, “element”).

    • Proper Nouns (PROPN): Specific names (e.g., “Microsoft”, “CES”).

    • Pronouns (PRON): Substitutes for nouns (e.g., “she”, “they”).

  6. Trailing Number (NUM): An optional number that may follow the core noun phrase, useful for entities with numeric elements, such as “CES 2024”.

This pattern is designed to capture a wide range of noun phrases, including those with optional determiners, numbers, and adjectives, as well as one or more core nouns. It is flexible enough to recognize complex entities like “the chemical element radium”.

Implementing the Custom Noun Chunker

We first define the pattern in the Matcher component of spaCy, then extract and print all identified noun chunks:

import spacy
from spacy.language import Language
from spacy.tokens import Span, Doc
from spacy.matcher import Matcher
import spacy.util

# Load spaCy model
nlp = spacy.load('en_core_web_md')

# Create and define the custom noun chunker using Matcher outside the component
matcher = Matcher(nlp.vocab)
matcher.add("NOUN_CHUNK", [pattern])

# Extend Doc class with the custom attribute entity_chunks
Doc.set_extension("entity_chunks", default=[], force=True)

# Define the custom entity matcher component
@Language.component("entity_matcher")
def entity_matcher(doc):
    # Use the globally defined matcher to extract matches based on the predefined pattern
    matches = matcher(doc)
    # Create noun chunks from matches
    noun_chunks = [Span(doc, start, end) for match_id, start, end in matches]
    # Filter to keep only the largest, non-overlapping noun chunks
    largest_noun_chunks = spacy.util.filter_spans(noun_chunks)
    
    # Create entity_chunks as a list of (noun_chunk, largest_noun_chunk) pairs
    entity_chunks = []
    for chunk in noun_chunks:
        # Find the largest noun chunk that contains the current noun chunk
        for largest_chunk in largest_noun_chunks:
            if chunk.start >= largest_chunk.start and chunk.end <= largest_chunk.end:
                entity_chunks.append((chunk, largest_chunk))
                break  # Found the corresponding largest chunk, no need to check further

    # Store only entity_chunks in the custom attribute
    doc._.entity_chunks = entity_chunks
    
    return doc

The custom noun chunker identifies complex noun phrases that might represent entities not found in the predefined dictionary. For example, it will extract “the chemical element radium” as a single chunk.

Using Entity Chunks for Similarity Search

After identifying potential noun chunks, we use vector similarity to compare each chunk with known entities. This allows us to find multi-word entities that are semantically similar to our custom entity list:

# Similarity threshold
similarity_threshold = 0.7

# Define the similarity entity recognizer component
@Language.component("similarity_entity_recognizer")
def similarity_entity_recognizer(doc):
    # Use the entity_chunks directly from the custom attribute
    entity_chunks = doc._.entity_chunks

    # Precompute vectors for custom entity elements
    entity_vectors = {
        element: (nlp.make_doc(element), entity_label)
        for entity_label, entity_list in custom_entities.items()
        for element in entity_list
    }

    # Perform similarity search for each chunk and its largest corresponding chunk
    spans = []
    for chunk, largest_chunk in entity_chunks:
        for element, (element_doc, entity_label) in entity_vectors.items():
            if chunk.has_vector:
                similarity = chunk.similarity(element_doc)
                if similarity >= similarity_threshold:
                    print(f"Noun Chunk '{chunk.text}' is similar to '{element}' ({entity_label}) with similarity {similarity:.2f}")
                    spans.append(Span(doc, largest_chunk.start, largest_chunk.end, label=entity_label))
    
    # Add custom entities as spans to doc.ents
    doc.ents = spacy.util.filter_spans(spans)
    return doc

Here, each matching sub-chunk contributes its largest enclosing noun chunk as the entity span, and filter_spans keeps only the largest non-overlapping spans, so the most complete chunk ends up in doc.ents.

Add Custom Components

We are adding the new chunking and entity recognizer components to the pipeline for processing:

# Remove components if they already exist in the pipeline
if "entity_matcher" in nlp.pipe_names:
    nlp.remove_pipe("entity_matcher")
if "similarity_entity_recognizer" in nlp.pipe_names:
    nlp.remove_pipe("similarity_entity_recognizer")

# Add the similarity entity recognizer before the default 'ner' component,
# then insert the entity matcher before the similarity entity recognizer
nlp.add_pipe("similarity_entity_recognizer", before="ner")
nlp.add_pipe("entity_matcher", before="similarity_entity_recognizer")

Output

Reprocess the text to recognize new custom entities:

# Reprocess the text with the updated NLP model
doc = nlp(text)

Finally, we get the complete custom entity value as shown in the following figure:
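
As a text-form check (the figure is not reproduced here), the final entities can be printed as before. Assuming the similarity match succeeds as described above, the full noun chunk should now carry the custom label; depending on how the downstream ner component handles the preset span, standard entities such as DATE or PERSON may appear alongside it.

# Print the final entities in text form
for ent in doc.ents:
    print(f"Entity: '{ent.text}', Type: '{ent.label_}'")

# Expected custom entity, assuming a successful similarity match:
# Entity: 'the chemical element radium', Type: 'CHEMICAL_ELEMENT'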

Conclusion

The POS-based custom noun chunker allows us to capture more complex, multi-word entities that might be overlooked by single-token-based or rule-based approaches. By integrating the chunker with vector-based similarity search, we can effectively recognize and classify sophisticated entities, even when they are not explicitly listed in a predefined dictionary. This method enhances the versatility and accuracy of our custom NER system.

References

Luccioni, A. S., Jernite, Y., & Strubell, E. (2024). Power Hungry Processing: Watts Driving the Cost of AI Deployment? https://doi.org/10.1145/3630106.3658542