2. Specialised models vs. LLMs for NLP tasks
In this lecture, we explore the differences between specialized models and Large Language Models (LLMs) in the context of NLP tasks. The field of Natural Language Processing (NLP) has evolved significantly with the advent of LLMs, which have demonstrated remarkable capabilities across a wide range of applications. However, despite their versatility, these models come with significant computational, environmental, and economic costs.
Specialized models, on the other hand, are designed to perform specific NLP tasks efficiently, often requiring fewer resources and providing faster results for well-defined tasks like sentiment analysis, named entity recognition (NER), and intent classification. They can typically outperform LLMs in specific domains when appropriately tuned.
This lecture will focus on comparing the environmental and monetary impacts of LLMs with specialized models, and investigate how specialized models like spaCy or BERT can be used for standard NLP tasks.
Environmental and Monetary Impact of LLMs
As the usage of LLMs like GPT-3 and GPT-4 has grown, the computational, environmental, and economic costs have become increasingly significant. LLMs are energy-intensive due to the immense processing power required for both training and inference, which can lead to higher carbon emissions and greater financial costs. These models are over-parameterized for many simple NLP tasks, such as sentiment analysis or intent classification, where more efficient, specialized models could be employed instead.
From a practical perspective, it does not make sense to rely on LLMs for standardized NLP tasks when highly optimized, task-specific models have been available for decades. These smaller models are energy-efficient and less costly, and they often outperform LLMs in specific applications when fine-tuned to the task at hand. In business settings, where time and cost efficiency are crucial, the use of specialized models like BERT or spaCy makes more economic sense for standard tasks.
A recent study by Luccioni et al. (2024) offers insights into the energy consumption of LLMs compared to smaller, task-specific models. Their research shows that multipurpose, generative models like GPT-3 consume far more energy during inference than task-specific models designed for discrete tasks such as text classification or sentiment analysis. For example, a fine-tuned BERT-based model for sentiment analysis, such as bert-base-multilingual-uncased-sentiment, emits 0.32g of CO2 per 1,000 queries, while larger, multipurpose models like Flan-T5-XL and BLOOMz-7B emit 2.66g and 4.67g CO2, respectively, per 1,000 queries. This stark difference highlights the inefficiency of using large, generative models for tasks that can be handled by much smaller, purpose-built models (Luccioni et al., 2024).
The study also illustrates how the complexity of the task plays a role in energy consumption. Generative tasks like text generation, summarization, and translation are much more energy-intensive compared to discriminative tasks like text classification. Decoder-only models, such as those used in many generative LLMs, are particularly inefficient for tasks with longer outputs, making them less suitable for applications where energy consumption is a concern.
In many real-world scenarios, LLMs are deployed without fully weighing their environmental impact against their utility, particularly for tasks where smaller, specialized models would suffice. As the technology industry leans more toward deploying general-purpose LLMs for a broad array of tasks, it's crucial to consider the environmental and economic trade-offs.
Thus, while LLMs are revolutionary in terms of their capabilities, their use for standard, well-defined NLP tasks is both environmentally and financially inefficient. Deploying smaller, specialized models saves energy, reduces emissions, and lowers operational expenses, making them a more sustainable option for businesses and researchers alike.
Introduction to Specialized Models for NLP
While LLMs offer broad versatility, specialized models are optimized for specific functions like sentiment analysis, named entity recognition (NER), and intent classification. These models often require fewer computational resources, have faster inference times, and can be more accurate in their specific domains when fine-tuned properly. We will explore some of the most popular specialized models and libraries used in NLP, including spaCy and Hugging Face Transformers.
spaCy
spaCy is a powerful, open-source library for advanced NLP in Python, designed for production use. It supports a wide range of NLP tasks and provides efficient, pre-trained models for many languages. spaCy is known for its speed and flexibility, making it suitable for building applications that require processing and understanding large volumes of text.
Key Features of spaCy:
Tokenization: spaCy segments text into words, punctuation marks, and other elements, forming the basic building blocks for further NLP processing.
Part-of-Speech (POS) Tagging: spaCy assigns part-of-speech tags to tokens, categorizing them as nouns, verbs, adjectives, etc., which is crucial for understanding grammatical structure.
Dependency Parsing: This feature helps in understanding the syntactic structure of sentences by identifying relationships between tokens, such as subject-object relationships.
Lemmatization: spaCy can reduce words to their base forms (lemmas), such as converting “was” to “be” or “rats” to “rat.”
Sentence Boundary Detection (SBD): spaCy can detect and segment individual sentences within a document.
Named Entity Recognition (NER): This component labels named entities like people, organizations, and locations in the text.
Entity Linking (EL): spaCy disambiguates entities by linking them to unique identifiers in a knowledge base.
Similarity Analysis: spaCy provides tools for comparing words, phrases, and documents to measure their similarity.
Text Classification: It can assign categories or labels to entire documents or specific parts of a document.
Rule-Based Matching: This feature allows finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
spaCy’s versatility is enhanced through its support for over 75 languages, 84 pre-trained pipelines, and the ability to integrate with custom components. It is highly extensible, supporting custom models in PyTorch, TensorFlow, and other frameworks. For visualization, spaCy includes built-in tools for syntax and NER, making it easier to understand and debug NLP models.
Hugging Face Transformers & BERT
Hugging Face Transformers is a versatile and widely used library that provides seamless access to various state-of-the-art NLP models, such as BERT, GPT, and many others. It enables researchers and developers to fine-tune these models for specific tasks like text classification, machine translation, named entity recognition, and sentiment analysis. The library offers pre-trained models, easy-to-use APIs, and tools for model training and deployment, making advanced NLP technology more accessible and usable across different applications.
One of the key models in this library is the bert-base-multilingual-uncased-sentiment model. This model is a fine-tuned version of the multilingual BERT model designed specifically for sentiment analysis on product reviews. It supports six languages: English, Dutch, German, French, Spanish, and Italian. The model predicts the sentiment of a review as a number of stars (between 1 and 5), making it particularly useful for evaluating customer feedback in multiple languages and contexts.
Although BERT models, including the multilingual version, are more resource-intensive than specialized libraries like spaCy, they offer significantly better accuracy and generalization across diverse linguistic settings. This superior performance is achieved by capturing intricate semantic nuances through their deep learning architecture, making them ideal for more complex and high-stakes sentiment analysis tasks.
It's worth noting that while BERT requires more computational resources compared to traditional NLP methods, it is still considerably more efficient than large-scale decoder-only LLMs like GPT-3 or PaLM, especially for inference tasks. As highlighted by Luccioni et al. (2024), using BERT-based models for specific tasks like sentiment analysis can lead to a more balanced approach, offering high accuracy without the extreme energy consumption associated with deploying massive generative models for such tasks. This balance between performance and resource efficiency makes BERT a practical choice for many real-world applications where both effectiveness and sustainability are concerns.
Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a crucial NLP task that involves determining the emotional tone behind a piece of text. It is widely used in various applications such as customer feedback analysis, social media monitoring, and market research. By classifying text as positive, negative, or neutral, sentiment analysis helps organizations gain insights into public opinion and customer satisfaction. This chapter will explore different approaches and tools for performing sentiment analysis, including specialized models like spaCy with plugins, traditional libraries like NLTK, and advanced methods using Hugging Face Transformers and BERT.
Using BERT for Sentiment Analysis
The following code demonstrates how to perform sentiment analysis using a pre-trained BERT model from the Hugging Face Transformers library. In this example, we use the nlptown/bert-base-multilingual-uncased-sentiment model, which is fine-tuned for sentiment analysis on product reviews in six different languages, including English, German, French, and Spanish.
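A minimal sketch of what this looks like with the Transformers pipeline API (the example text and the star-to-label mapping follow the description below; the original notebook's code may differ slightly):

```python
from transformers import pipeline

# Load a sentiment-analysis pipeline backed by the multilingual BERT model
classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

text = "spaCy makes NLP tasks so easy! I love using it for my projects."
result = classifier(text)[0]  # e.g. {'label': '5 stars', 'score': 0.75}

# Map the five-star ratings to coarser sentiment categories
# ('3 stars' -> neutral is an assumption; the text only specifies 1-2 and 4-5)
star_to_sentiment = {
    "1 star": "sentiment_negative",
    "2 stars": "sentiment_negative",
    "3 stars": "sentiment_neutral",
    "4 stars": "sentiment_positive",
    "5 stars": "sentiment_positive",
}

label = result["label"]
print(f"{label} -> {star_to_sentiment[label]} (confidence: {result['score']:.2f})")
```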
Loading the Model:
Here, the BERT model is loaded using the pipeline function from the Transformers library. This pipeline is specifically configured for sentiment analysis and uses the nlptown/bert-base-multilingual-uncased-sentiment model, which assigns sentiment scores based on a five-star rating system.
Classifying the Sentiment:
The input text is passed to the classifier, which predicts the sentiment based on the model's training data. The result contains the predicted label (e.g., '1 star' to '5 stars') and the confidence score, indicating the certainty of the prediction.
Mapping Sentiment Labels:
Since the model uses a five-star rating system, we map these ratings to more generalized sentiment categories. For example, '1 star' and '2 stars' are mapped to 'sentiment_negative', while '4 stars' and '5 stars' are mapped to 'sentiment_positive'. This step helps simplify the analysis by grouping similar sentiments together.
Extracting and Displaying Results:
The code extracts the original sentiment label, maps it to the generalized label, and prints the result along with the confidence score. This gives a clear summary of the sentiment classification.
For the example text, “spaCy makes NLP tasks so easy! I love using it for my projects.”, the model predicted a sentiment of '5 stars' with a confidence score of 0.75. After mapping, this corresponds to a 'sentiment_positive' label, indicating that the text expresses a positive sentiment with moderate confidence.
Advantages:
Accuracy and Generalization: BERT models are highly accurate and can generalize well across multiple languages and contexts, making them ideal for complex sentiment analysis tasks.
Multilingual Support: This model supports multiple languages, which is beneficial for applications dealing with multilingual datasets.
Limitations:
Resource-Intensive: BERT models are computationally expensive compared to simpler models like those used in spaCy. However, they are still more efficient than large-scale decoder LLMs like GPT-3 or GPT-4 (Luccioni et al., 2024).
Latency: The processing time can be longer, making them less suitable for real-time applications in resource-constrained environments.
Training a Custom Sentiment Model with spaCy
The following code shows how to train a custom sentiment analysis model using spaCy. This approach provides more control and flexibility compared to pre-trained models like BERT. Custom training allows us to tailor the model to specific datasets and domains, enhancing its performance for specialized tasks.
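A condensed sketch of such a training script in spaCy 3 is shown below; the training sentences are placeholders, and copying the word vectors is one way to implement the vocabulary transfer described in the steps that follow (the original code may differ):

```python
import random
import spacy
from spacy.training import Example

# Blank English pipeline; inherit word vectors from a pre-trained model
nlp = spacy.blank("en")
nlp.vocab.vectors = spacy.load("en_core_web_md").vocab.vectors

# Add a TextCategorizer with the three sentiment labels
textcat = nlp.add_pipe("textcat")
for label in ("sentiment_positive", "sentiment_neutral", "sentiment_negative"):
    textcat.add_label(label)

# Placeholder training data: (text, label) tuples converted to category dicts
train_data = [
    ("I love this product, it works perfectly!", "sentiment_positive"),
    ("It is okay, nothing special.", "sentiment_neutral"),
    ("Terrible experience, I want a refund.", "sentiment_negative"),
]
examples = [
    Example.from_dict(nlp.make_doc(text), {"cats": {l: float(l == label) for l in textcat.labels}})
    for text, label in train_data
]

# Train for 20 epochs with dropout, monitoring the loss each epoch
optimizer = nlp.initialize(lambda: examples)
for epoch in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
    print(f"Epoch {epoch + 1}: loss {losses['textcat']:.3f}")
```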
Model Setup and Initialization:
We start by initializing a blank English model in spaCy and copying the vocabulary from a pre-trained model (en_core_web_md). This helps the custom model inherit useful linguistic features, which will be beneficial during training.
Creating and Configuring the TextCategorizer:
A TextCategorizer component is added to the spaCy pipeline for text classification. We define three labels: sentiment_positive, sentiment_neutral, and sentiment_negative. These labels represent the possible sentiment categories our model will classify text into.
Preparing and Formatting Training Data:
The training data is formatted as a list of tuples, with each tuple containing a text sample and its corresponding sentiment label. We then convert this data into a format suitable for spaCy’s training process, where each label is represented as a dictionary of categories.
Training the Model:
The training process involves running multiple epochs (20 in this case), where the model iteratively learns from the training data. A dropout rate is used to prevent overfitting, and the model's loss is monitored to gauge its learning progress. By the end of training, the model is optimized for sentiment classification based on the provided examples.
Saving and Loading the Custom Model:
Once training is complete, the custom sentiment model is saved to disk for future use. This allows us to reload and apply the model to new text data for sentiment analysis.
Using the Trained Model for Sentiment Analysis:
The custom model is loaded, and text is processed to obtain sentiment scores. The model evaluates the text across the three sentiment categories, assigning a probability score to each. The label with the highest score is selected as the final sentiment.
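Continuing the sketch above, saving, reloading, and applying the model might look like this (the directory name is illustrative):

```python
# Persist the trained pipeline and reload it later for inference
nlp.to_disk("custom_sentiment_model")
nlp_sentiment = spacy.load("custom_sentiment_model")

# doc.cats holds one probability per label; pick the highest-scoring one
doc = nlp_sentiment("spaCy makes NLP tasks so easy! I love using it for my projects.")
print(doc.cats, "->", max(doc.cats, key=doc.cats.get))
```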
Merging the Custom Model with a Pre-Trained spaCy Model:
To enhance the capabilities of the custom sentiment model, we can merge it with a pre-trained spaCy model like en_core_web_md:
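One way to do this in spaCy 3 is to source the trained textcat component into the pre-trained pipeline (a sketch, assuming the custom pipeline from above and compatible word vectors):

```python
# Load the full pre-trained pipeline and pull in the trained textcat component
nlp_full = spacy.load("en_core_web_md")
nlp_full.add_pipe("textcat", source=nlp_sentiment)

doc = nlp_full("spaCy makes NLP tasks so easy! I love using it for my projects.")
print(doc.ents, doc.cats)  # NER, parsing, etc. plus the custom sentiment scores
```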
By integrating the custom sentiment component into a robust pre-trained model, we create a more comprehensive NLP solution capable of handling various tasks with enhanced sentiment analysis capabilities.
Key Takeaways:
Custom Training Flexibility: Custom models provide the flexibility to be fine-tuned for specific applications, improving their performance in domain-specific contexts.
Resource Efficiency: Although custom models are not as sophisticated as large-scale LLMs, they offer a good balance between resource efficiency and accuracy for targeted NLP tasks.
Enhanced Capabilities: Merging custom models with pre-trained ones leverages the strengths of both, creating powerful, versatile NLP solutions.
Intent Classification
In this subsection, we will explore how to use spaCy's pre-trained word embeddings and cosine similarity to perform intent classification efficiently. Unlike traditional approaches that rely on pre-defined rules or intent classifiers, this method leverages the inherent similarity between vector representations of words and phrases. By using cosine similarity, we can measure how close the meaning of a user's query is to predefined intent categories, allowing for a high-performance and adaptable solution.
Loading spaCy and Defining and Precomputing Intents:
We begin by loading the en_core_web_md spaCy model, which contains pre-trained word vectors for English. Next, we define a set of sample intents along with example phrases representing each intent. These intents serve as the reference against which the input text will be compared.
Processing Input Text:
Here, we define an input text sample and process it using the spaCy pipeline. The nlp object transforms the input text into a doc object containing vector embeddings for the entire text, which can be used for similarity comparison.
Calculating Cosine Similarity:
For each predefined intent, we compute the cosine similarity between the input text and each example phrase. The doc.similarity() function calculates the cosine similarity between the vector embeddings of the input text and the example phrases, indicating how closely related they are in meaning.
Determining the Best Matching Intent:
After calculating the similarity scores for each intent, we sort them in descending order based on their best similarity score. The intent with the highest score is considered the best match. If the best score surpasses a predefined threshold, we identify the intent; otherwise, we conclude that no matching intent was found.
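Putting these steps together, a sketch of the whole procedure could look as follows (the intent list, example phrases, and the 0.7 threshold are illustrative assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_md")

# Example intents with representative phrases (placeholders)
intents = {
    "Place_Order": ["I would like to order a pizza.", "Can I get some food delivered?"],
    "Opening_Hours": ["When are you open?", "What are your opening hours?"],
    "Cancel_Order": ["I want to cancel my order.", "Please stop my delivery."],
}
# Precompute Doc objects (and thus vectors) for every example phrase
intent_docs = {name: [nlp(p) for p in phrases] for name, phrases in intents.items()}

doc = nlp("Hello, I want to have pizzas.")

# Best cosine similarity between the input and any example phrase, per intent
scores = {
    name: max(doc.similarity(example) for example in examples)
    for name, examples in intent_docs.items()
}

threshold = 0.7
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")

best_intent, best_score = max(scores.items(), key=lambda item: item[1])
print("Matched intent:", best_intent if best_score >= threshold else "No matching intent found")
```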
The output provides a list of intents ordered by their similarity scores, showing how closely each intent matches the input text.
The input text “Hello, I want to have pizzas.” best matches the “Place_Order” intent with a high similarity score of 0.91, indicating that the user intends to place an order.
Key Takeaways:
Efficiency and Scalability: This method efficiently handles intent classification with high performance, leveraging spaCy’s robust vector embeddings and similarity measures.
Adaptability: The intent classification system can be easily adapted to new use cases by updating the list of intents and example phrases.
Limitations: The effectiveness depends on the quality and diversity of the example phrases. Expanding the example set can improve the model's ability to recognize diverse user inputs.
By using this method, businesses can implement an effective intent classification system for chatbots, virtual assistants, or any application that requires quick and reliable understanding of user queries.
Named Entity Recognition (NER)
Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and classifying entities within a text. These entities can be people, organizations, dates, locations, and more, providing structured information from unstructured text. NER can be used in various applications, such as information extraction, question answering, and content categorization.
This chapter will explore different approaches to implementing NER using the spaCy library. We will cover the standard spaCy NLP pipeline, followed by custom approaches that leverage rule-based, vector-based, and Part-of-Speech (POS)-based methods for advanced entity recognition tasks.
Let's begin by understanding the standard spaCy NLP pipeline and its components.
Standard spaCy NLP Pipeline
The spaCy library provides a robust, pre-trained pipeline for NLP tasks, including NER. This pipeline consists of several stages that process the input text sequentially, performing various linguistic analyses to extract structured information.
Code Example
Here’s the code for processing an input sentence with the standard spaCy pipeline:
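The original snippet is not reproduced here; the following sketch shows an equivalent pipeline run (the example sentence is assumed from the entities discussed below):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")

# Example sentence consistent with the entities discussed below (assumed wording)
doc = nlp("In 1898, Marie Curie discovered the chemical element radium in Paris.")

# Tokens with their POS tags, lemmas, and dependency relations
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.dep_, sep="\t")

# Named entities with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

# Render the entity and dependency visualizations (e.g. in a notebook)
displacy.render(doc, style="ent")
displacy.render(doc, style="dep")
```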
Pipeline Stages Explained
Tokenization and tok2vec: The input text is segmented into individual tokens, which can be words, punctuation, or symbols. Each token is then converted into a dense vector representation using the tok2vec layer, capturing contextual information for each token.
Part of Speech (POS) Tagging: Each token is assigned a POS tag, such as noun, verb, or adjective, indicating its grammatical role in the sentence.
Dependency Parsing and Chunking: Dependency parsing identifies relationships between tokens, building a parse tree that shows how words are connected. This step is crucial for understanding the syntactic structure of the sentence. Chunking involves grouping related tokens, like noun phrases or verb phrases, based on the dependency tree.
Lemmatization (Optional): Tokens are reduced to their base forms, known as lemmas (e.g., “running” becomes “run”).
Morphological Analysis (Optional): This step involves analyzing the morphological features of each token, such as number, tense, or gender.
Sentence Boundary Detection (Optional): Identifies the boundaries of sentences within the text. This is especially useful for processing longer documents where multiple sentences need to be handled separately.
Named Entity Classification (NER): Entities in the text are identified and classified into predefined categories like PERSON, DATE, GPE (Geopolitical Entity), etc.
Entity Extraction: Extracted entities are then presented with their corresponding labels.
Visualizing Named Entities
The following figure shows a visual representation of the named entities detected in the input text. For example, “1898” is recognized as a DATE, “Marie Curie” as a PERSON, and “Paris” as a GPE (Geopolitical Entity). This visualization aids in understanding how spaCy identifies and categorizes various elements within the sentence.
Visualizing Dependency Parse
The following figure displays the dependency parse tree of the sentence, including POS tags. The parse tree illustrates the syntactic structure, with arrows pointing from heads (main words) to dependents (related words), showing relationships such as subject, object, and modifiers.
With these standard techniques, spaCy provides a powerful and efficient tool for performing NER and other NLP tasks. The following subsections will dive into custom NER implementations, expanding beyond the capabilities of the default spaCy pipeline.
Rule-based Custom Named Entity Recognition (NER)
In this section, we explore how to create a custom Named Entity Recognition (NER) system using a rule-based approach. Unlike statistical models, which rely on large annotated datasets and machine learning algorithms, rule-based NER systems identify entities based on predefined patterns or dictionaries. This method is particularly useful for recognizing domain-specific entities, such as chemical elements or technical terms, where data might be scarce or where precision is critical.
Rule-based NER in spaCy is implemented using the EntityRuler component. This component allows us to add custom patterns to the NLP pipeline, enabling the model to recognize entities that might not be captured by the pre-trained NER model. We define these patterns using a dictionary of entities and their associated labels. The EntityRuler then uses these patterns to identify and classify entities within the text.
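A sketch of this setup (the entity dictionary and the example sentence are illustrative; the original list of chemical elements may differ):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")

# Custom entity dictionary mapping terms to the CHEMICAL_ELEMENT label
custom_entities = {
    "hydrogen": "CHEMICAL_ELEMENT",
    "oxygen": "CHEMICAL_ELEMENT",
    "polonium": "CHEMICAL_ELEMENT",
    "uranium": "CHEMICAL_ELEMENT",
}

# Add an EntityRuler before the statistical NER so custom matches take precedence
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [{"label": label, "pattern": term} for term, label in custom_entities.items()]
ruler.add_patterns(patterns)

doc = nlp("In 1898, Marie Curie studied uranium and discovered polonium in Paris.")
displacy.render(doc, style="ent")
```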
Code Explanation
Defining Custom Entities: We start by defining a dictionary of custom entities. In this example, we are focusing on chemical elements, which will be labeled as CHEMICAL_ELEMENT. This dictionary maps each chemical element to the label CHEMICAL_ELEMENT, allowing us to recognize these specific terms in the text.
Adding the EntityRuler to the spaCy Pipeline: We then use the EntityRuler component to add these custom patterns to the spaCy pipeline. This is done before the default ner component to ensure that our custom entities are recognized first.
Defining Patterns: We create a list of patterns for each entity in our dictionary. Each pattern is a dictionary with a label and a matching pattern, which is simply the entity name in this case.
Adding Patterns to the EntityRuler: These patterns are then added to the EntityRuler, allowing the model to recognize the defined entities in the text.
Processing the Text: Finally, we process the input text with the modified NLP pipeline, which now includes the custom EntityRuler. The model will recognize and classify the custom entities along with the standard entities.
Output: The resulting output visualization shows the custom entities alongside the standard ones.
Vector-based Custom Named Entity Recognition (NER)
In the previous section, we utilized a rule- and dictionary-based approach for Named Entity Recognition (NER). While this method works well for predefined entities, it has limitations when the input contains terms that are not explicitly included in the dictionary. For instance, if our custom entity list does not include the term “radium”, as shown below, the rule-based approach would fail to recognize it as a chemical element.
In such cases, we can leverage vector-based similarity to identify entities that are semantically similar to those in our custom list, even if they are not an exact match. This approach relies on word embeddings, which represent words in a high-dimensional vector space, capturing their semantic meaning. By comparing the vectors of unknown words to those of known entities, we can detect and classify similar terms.
Implementing Vector-based NER with spaCy
Calculating Vector Similarity: We start by comparing the vector of each token in the input text against the vectors of known chemical elements in our dictionary. If the similarity score exceeds a predefined threshold, we consider the token as similar to a chemical element.
In this step, we loop through each token in the input text and compare it against each entity in the custom dictionary using cosine similarity. If the similarity score is above 0.7 (our threshold), we identify the token as a potential entity.
Creating a Custom Component: Afterward, we wrap the previously defined similarity check in a custom component and add it to the spaCy pipeline. This allows us to capture these similar terms as entities during text processing.
By doing this, we enable the NLP model to recognize these terms as custom entities in the future.
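A sketch combining both steps, with the token-level similarity check wrapped in a registered pipeline component (the dictionary, the 0.7 threshold, and the example sentence are assumptions based on the text):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

nlp = spacy.load("en_core_web_md")

known_elements = ["hydrogen", "oxygen", "polonium", "uranium"]  # assumed dictionary
element_docs = [nlp(term) for term in known_elements]
SIMILARITY_THRESHOLD = 0.7

@Language.component("chemical_element_matcher")
def chemical_element_matcher(doc):
    new_ents = list(doc.ents)
    for token in doc:
        if not token.has_vector:
            continue
        # Compare the token against every known element vector
        if any(token.similarity(ref) > SIMILARITY_THRESHOLD for ref in element_docs):
            span = Span(doc, token.i, token.i + 1, label="CHEMICAL_ELEMENT")
            # Only add spans that do not overlap an existing entity
            if not any(span.start < e.end and e.start < span.end for e in new_ents):
                new_ents.append(span)
    doc.ents = new_ents
    return doc

nlp.add_pipe("chemical_element_matcher", after="ner")

doc = nlp("Marie Curie also isolated radium from pitchblende.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```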
Output Analysis: The modified NLP model processes the text and identifies both standard and custom entities based on vector similarity. The output shows that even though "radium" was not explicitly included in our custom dictionary, the model successfully identifies it as a CHEMICAL_ELEMENT due to its high similarity score with known elements.
Benefits of Vector-based NER
Flexibility: Unlike rule-based approaches, vector-based NER does not require exact matches, making it robust to variations in language and terminology.
Scalability: This method can be applied to large corpora without extensive manual annotation or dictionary expansion.
Domain Adaptation: By leveraging pre-trained embeddings, vector-based NER can be easily adapted to new domains with minimal retraining.
Limitations
Accuracy: The performance heavily relies on the quality of the pre-trained word embeddings and the chosen similarity threshold.
Computational Cost: Calculating similarity scores for every token against a large set of entities can be computationally expensive.
In summary, vector-based custom NER provides a powerful tool for identifying entities in cases where rule-based methods fall short. By leveraging semantic similarity, this approach enhances the model's ability to recognize entities with flexibility and precision. In the next section, we will further refine our custom NER capabilities using a POS-based custom noun chunker.
POS-based Custom Noun Chunker for Vector-based Custom NER
In the previous section, we explored how vector-based Named Entity Recognition (NER) can help identify entities not present in a predefined dictionary.
However, this approach only works well when the entities are represented as single tokens. In real-world scenarios, entities often span multiple words, such as “the chemical element radium”. Using a single-token approach would miss the complete entity, especially if the exact term isn't in the dictionary.
The built-in noun chunker in spaCy is quite effective at identifying multi-word phrases, such as “the chemical element radium”. It segments the text into meaningful phrases, which can be directly accessed using:
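In spaCy these chunks are exposed through the doc.noun_chunks iterator, for example (assumed example sentence):

```python
doc = nlp("Marie Curie also isolated the chemical element radium from pitchblende.")
for chunk in doc.noun_chunks:
    print(chunk.text)  # multi-word phrases such as "the chemical element radium"
```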
In the example above, spaCy would correctly extract “the chemical element radium” as a single chunk. However, this might pose a problem when performing similarity matching with a custom entity list. If the similarity threshold is set too high, the entire chunk might not match any of the entities because it contains additional words that aren’t in the custom list.
To address this issue, we will use a custom Part-of-Speech (POS) based noun chunker to identify multi-word entities. This approach allows us to capture more comprehensive chunks of text that could represent complex entities, such as chemical compounds or scientific terms, even when individual words are not found in the dictionary.
Custom POS-based Noun Chunker
A noun chunker segments a sentence into syntactically related groups of words, often corresponding to noun phrases. By customizing the chunking process using a specific POS pattern, we can better capture complex entities.
The following pattern is used to define our custom noun chunker:
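The original snippet is not shown here; a Matcher pattern reconstructed from the breakdown below might look like this:

```python
# Token-level pattern for the custom POS-based noun chunker (reconstruction)
noun_chunk_pattern = [
    {"POS": "DET", "OP": "?"},                               # optional determiner ("the", "a")
    {"POS": "NUM", "OP": "*"},                               # optional numbers ("two", "3rd")
    {"POS": "ADJ", "OP": "*"},                               # optional adjectives ("chemical")
    {"POS": "NOUN", "OP": "*"},                              # optional noun modifiers
    {"POS": {"IN": ["NOUN", "PROPN", "PRON"]}, "OP": "+"},   # one or more core nouns
    {"POS": "NUM", "OP": "?"},                               # optional trailing number ("CES 2024")
]
```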
This pattern can be broken down as follows:
Determiner (DET): An optional word that typically precedes a noun to specify reference, such as “the” or “a”.
Number (NUM): Optional numeric modifiers that can indicate quantities or orders, like “two” or “3rd”.
Adjective (ADJ): Optional descriptive words that modify nouns, such as “chemical” or “famous”.
Noun (NOUN): Optional noun modifiers that add context or specification to the main noun phrase.
Core Noun (NOUN, PROPN, PRON): One or more core elements that form the primary subject of the chunk. This includes:
Nouns (NOUN): Common nouns (e.g., “book”, “element”).
Proper Nouns (PROPN): Specific names (e.g., “Microsoft”, “CES”).
Pronouns (PRON): Substitutes for nouns (e.g., “she”, “they”).
Trailing Number (NUM): An optional number that may follow the core noun phrase, useful for entities with numeric elements, such as “CES 2024”.
This pattern is designed to capture a wide range of noun phrases, including those with optional determiners, numbers, and adjectives, as well as one or more core nouns. It is flexible enough to recognize complex entities like “the chemical element radium”.
Implementing the Custom Noun Chunker
We first define the pattern in the Matcher component of spaCy, then extract and print all identified noun chunks:
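Continuing the sketch, using the pattern defined above:

```python
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add("CUSTOM_NOUN_CHUNK", [noun_chunk_pattern])

doc = nlp("Marie Curie also isolated the chemical element radium from pitchblende.")
matches = matcher(doc)

# Each match is a (match_id, start, end) triple; print the corresponding spans
for _, start, end in matches:
    print(doc[start:end].text)
```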
The custom noun chunker identifies complex noun phrases that might represent entities not found in the predefined dictionary. For example, it will extract “the chemical element radium” as a single chunk.
Using Entity Chunks for Similarity Search
After identifying potential noun chunks, we use vector similarity to compare each chunk with known entities. This allows us to find multi-word entities that are semantically similar to our custom entity list:
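A sketch of this comparison, reusing the matches, reference vectors, and threshold from the earlier snippets; spaCy's filter_spans utility is one way to implement the overlap handling described next:

```python
from spacy.util import filter_spans

candidate_spans = []
for _, start, end in matches:
    span = doc[start:end]
    # Compare the whole chunk against every known element vector
    if span.has_vector and max(span.similarity(ref) for ref in element_docs) > SIMILARITY_THRESHOLD:
        candidate_spans.append(span)

# Collapse overlapping sub-chunks, keeping only the longest span
for span in filter_spans(candidate_spans):
    print(span.text, "-> CHEMICAL_ELEMENT")
```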
Here, when the matcher returns several overlapping sub-chunks, we keep only the largest chunk, which covers the most complete entity mention.
Add Custom Components
We are adding the new chunking and entity recognizer components to the pipeline for processing:
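For brevity, the following sketch registers a single component that combines the custom chunking and the similarity-based entity recognition (the component name and details are illustrative):

```python
from spacy.language import Language
from spacy.tokens import Span
from spacy.util import filter_spans

@Language.component("chunk_entity_recognizer")
def chunk_entity_recognizer(doc):
    # Re-run the POS-based chunk matcher and keep chunks similar to known elements
    spans = []
    for _, start, end in matcher(doc):
        span = doc[start:end]
        if span.has_vector and max(span.similarity(ref) for ref in element_docs) > SIMILARITY_THRESHOLD:
            spans.append(Span(doc, start, end, label="CHEMICAL_ELEMENT"))
    # Merge with existing entities, keeping the longest non-overlapping spans
    doc.ents = filter_spans(list(doc.ents) + spans)
    return doc

nlp.add_pipe("chunk_entity_recognizer", last=True)
```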
Output
Reprocess the text to recognize new custom entities:
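For example, continuing the sketches above (assumed sentence):

```python
doc = nlp("Marie Curie also isolated the chemical element radium from pitchblende.")
displacy.render(doc, style="ent")
```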
Finally, we get the complete custom entity value as shown in the following figure:
Conclusion
The POS-based custom noun chunker allows us to capture more complex, multi-word entities that might be overlooked by single-token-based or rule-based approaches. By integrating the chunker with vector-based similarity search, we can effectively recognize and classify sophisticated entities, even when they are not explicitly listed in a predefined dictionary. This method enhances the versatility and accuracy of our custom NER system.
References
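Luccioni, A. S., Jernite, Y., & Strubell, E. (2024). Power Hungry Processing: Watts Driving the Cost of AI Deployment? In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT '24). Association for Computing Machinery.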