3. LLM Selection, Tooling & Monitoring

As Large Language Models (LLMs) become increasingly embedded in various applications, selecting the right model and building a robust deployment pipeline have emerged as crucial steps in leveraging their full potential. This lecture focuses on the decision-making process involved in selecting an LLM, covering both technical and strategic factors. We’ll look at the distinctions between proprietary and open models, evaluate local versus API-based hosting options, and explore practical aspects like model quality, infrastructure costs, and precision.

Choosing the right LLM goes beyond assessing raw performance; it requires understanding the unique demands of the application, the sensitivity of data involved, and the operational constraints. For instance, while proprietary models like GPT-4 and Claude offer high-quality language generation, they may not provide the transparency or flexibility that open models like LLaMA or Mistral offer, especially when fine-tuning or local hosting is required.

Following model selection, the lecture will cover the engineering and monitoring aspects of deploying LLMs in real-world applications. We'll discuss agent engineering—designing model-based agents that respond accurately and efficiently to user prompts—and explore how tooling frameworks and monitoring practices can ensure consistent, controlled, and secure performance. Monitoring tools allow organizations to track the accuracy, latency, and effectiveness of deployed models, providing crucial insights for ongoing optimization and error correction.

Selection of a Large Language Model (LLM)

Choosing the right Large Language Model (LLM) for business or research applications requires a careful evaluation of multiple factors. Each LLM has unique strengths and limitations, which can significantly impact its suitability for specific use cases. The following six criteria provide a structured framework for comparing and selecting an LLM. By systematically assessing these factors, organizations can make more informed decisions, balancing performance, cost, and operational requirements.

Key Criteria for Selecting an LLM:

  1. Proprietary vs. Open Model: Decide between proprietary models, like OpenAI’s GPT series, which are often optimized for general-purpose use, and open models, like Meta’s LLaMA, which allow for more control and customization but may require additional technical resources.

  2. API-only vs. API & Local Use: Consider whether you need a model accessible via API only, which is simpler but can introduce data security concerns, or a model that allows for local hosting, offering greater control over data handling and potentially lower latency.

  3. Maximum Context Length: Evaluate the model's ability to handle long texts in a single input, which can be critical for applications that require reasoning across large documents or processing lengthy conversations without context loss.

  4. Model Quality / Benchmarks: Assess the model's accuracy and quality through benchmarks and testing on relevant NLP tasks. This is especially important for applications where precision is crucial, such as legal or technical document analysis.

  5. API / Infrastructure Costs & Performance: Analyze the cost-effectiveness of each model, including both API usage fees and the infrastructure required for local hosting, to ensure that the chosen solution aligns with budget constraints.

  6. Model Precision: Consider the numerical precision (e.g., FP32, FP16, INT8) at which the model’s parameters are stored and run. Higher precision helps preserve accurate, coherent responses in complex, domain-specific tasks, but comes at a higher computational and memory cost, while quantized formats trade a small amount of quality for efficiency.

1. Proprietary vs. Open Model

When selecting a large language model (LLM) for business or research, a critical decision is whether to use a proprietary or open model. This choice directly influences flexibility, security, cost, and potential applications of the model. Here, we'll delve into the key distinctions and implications for each option.

Proprietary Models

Proprietary models, like GPT-4 (OpenAI), Claude 2 (Anthropic), or Gemini (Google), are typically well-documented and supported, providing stable performance and high accuracy due to extensive fine-tuning and resources devoted by the companies behind them. These models are often hosted on secure infrastructure, ensuring robustness and reliability, but they come with significant trade-offs:

  • Restricted Access: Most proprietary models are accessible only via API, meaning businesses have limited control over where and how the data is processed, which may be a concern for industries with strict data privacy requirements, such as finance or healthcare.

  • High Cost: Since proprietary models are controlled by private companies, usage fees apply, often based on token or query volume. This cost structure can escalate for organizations with high demand, making long-term scaling expensive.

  • Limited Customization: Unlike open models, proprietary LLMs often restrict users from fine-tuning or retraining the model on proprietary data, which can limit their adaptability to specific domains.

Open Models

Open models, such as LLaMA 3.1 (Meta), Mistral 7B (Mistral AI), and BLOOM (BigScience), offer transparency and flexibility, particularly because their weights and often code are publicly available. These models present an alternative to proprietary options, providing certain advantages:

  • Customizability and Control: Open models allow for local deployment, which means businesses can modify, fine-tune, and optimize the model according to specific needs. This is valuable for use cases involving domain-specific knowledge, where open models can be tailored for better performance.

  • Cost-Efficiency for Scale: Organizations with the infrastructure to support local hosting of open models may avoid the recurring costs of API calls associated with proprietary models. However, this option does come with the overhead of managing computational resources, including potentially high hardware and energy costs.

  • Enhanced Transparency: The open-source community typically provides detailed documentation and access to training datasets, making it easier to evaluate the model's underlying biases, limitations, and ethical considerations. This transparency is beneficial for academic research and applications requiring full accountability in decision-making processes.

Examples and Developments

Two major releases recently illustrate the momentum in open models: Meta’s LLaMA series and Mistral by Mistral AI. The release of LLaMA marked a significant milestone, providing an accessible, high-quality open model that allowed developers to experiment with advanced NLP capabilities without relying solely on proprietary solutions. Similarly, Mistral's models, which include even lighter-weight versions, have helped address demand for models that are optimized for performance while remaining open.

Key Considerations in Choosing Between Proprietary and Open Models

The choice between a proprietary and open model depends on several factors:

  • Data Sensitivity: Proprietary models, if used via API, may raise concerns around data sharing. Open models, hosted locally, provide tighter data control.

  • Budget and Infrastructure: For organizations with robust in-house computing infrastructure, open models can be more cost-effective, while proprietary models might suit organizations seeking high performance without the need for self-hosting.

  • Flexibility Needs: Open models provide flexibility for modification, while proprietary models may better suit cases where performance and ease of integration are prioritized.

2. API-only vs. API & Local Use

Using an API-only model, such as OpenAI’s GPT-4, requires all data to be processed through an external server managed by the provider. This model is hosted by the provider, and the user interacts with it via an API (Application Programming Interface).

This approach has several advantages, including the ease of integration and maintenance, as the user does not need to manage the infrastructure or update the model. Moreover, models like GPT-4 are continually optimized by providers, allowing users to leverage the latest advancements without additional configuration.
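
To make the API-only pattern concrete, the following is a minimal sketch using the OpenAI Python SDK (v1.x); the model name is an assumption, and the same pattern applies to other hosted providers.

```python
# Minimal API-only sketch: the prompt is sent to the provider's servers,
# which is exactly what raises the data-privacy concerns discussed below.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any chat-capable model the provider offers
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the key idea of RAG in one sentence."},
    ],
)
print(response.choices[0].message.content)
```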

However, API-only models come with limitations for organizations that handle sensitive data, as all information must be sent to a third-party server, raising data privacy and compliance concerns. For sectors like finance or healthcare, this may be a barrier due to stringent data protection requirements.

In contrast, open models available on platforms like Hugging Face (e.g., Mistral NeMo) allow users to download and run the model locally on their hardware or within a private cloud environment. This approach offers flexibility and control, particularly when dealing with sensitive or proprietary data, as it keeps data processing within the organization’s secure environment.

Running a model locally provides customization opportunities, as the model can be fine-tuned or adapted for specific needs. However, this requires significant technical resources, such as high-performance hardware, and the organization must maintain the model, including any updates or optimizations.
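
A minimal local-hosting sketch with Hugging Face transformers is shown below; the model id is an assumption (check the exact repository name on the Hub), and smaller models may be required if GPU memory is limited.

```python
# Local inference sketch: weights are downloaded once, then all processing
# stays on the organization's own hardware or private cloud.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-Nemo-Instruct-2407",  # assumed Hub id; may differ
    device_map="auto",   # spread layers over available GPU(s), fall back to CPU
    torch_dtype="auto",  # let transformers pick a suitable precision
)

result = generator(
    "Explain retrieval-augmented generation in two sentences.",
    max_new_tokens=120,
)
print(result[0]["generated_text"])
```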

Additionally, local deployment can be more cost-effective over time for large-scale, intensive use cases. Instead of paying for each API call, organizations can leverage their infrastructure once the initial deployment costs are covered.

Key Decision Factors:

  • The choice between these approaches depends largely on data sensitivity, technical capabilities, and budget. API-only solutions are suitable for scenarios requiring rapid deployment with minimal maintenance, but are less suited to applications where data sovereignty and control are paramount.

  • Local deployment is preferable for organizations with the technical resources to manage and secure the infrastructure, particularly in regulated industries or when data privacy is a high priority.

3. Maximum Context Length

The context length of a large language model (LLM) is a crucial factor that influences its ability to retain and process information within a single session. In LLMs, the context length represents the maximum number of tokens (words, parts of words, or punctuation) that the model can consider at once. This becomes particularly relevant when working with extensive texts, complex documents, or conversational applications where continuity is essential.

Key Considerations for Context Length

  1. Token Limitations and Implications: Each model has a predefined context length that caps the number of tokens it can handle in a single input. For instance, OpenAI's GPT-4 and Meta's Llama 3.1 can process up to 128,000 tokens, enabling them to work with extended conversations or detailed documents in a single input context.

  2. Practical Application of Context Length: For example, a typical single page of text (~500 words) requires around 750 tokens, assuming a moderate token-per-word ratio of 1.5 (see the back-of-the-envelope sketch at the end of this subsection). Models like Google Gemini 1.5 Pro extend the context window to as much as 2 million tokens, which is beneficial for applications such as legal document analysis, long-form content creation, and research, where large context windows enhance model accuracy by preserving contextual relevance.

  3. Limitations in Real-world Scenarios: Although some models offer huge context windows, not all APIs support the maximum context length in practice. The infrastructure requirements and latency issues associated with large context windows mean that the effective usable length might be lower. This restriction necessitates strategies like splitting documents, using memory management techniques, or summarizing parts of the text to fit within the context length.

  4. Application to Conversational Models: For conversational applications, context length also affects the continuity of dialogue. In models with smaller context windows, such as earlier iterations of GPT or smaller open-source models, conversations lose continuity as the model forgets earlier parts of the dialogue. Larger context windows mitigate this by enabling more sustained engagement, making these models suitable for dynamic, real-time interactions like customer service or educational tutoring.

In summary, choosing a model with an appropriate context length involves balancing the technical capabilities with the demands of the application. While larger context windows allow for more comprehensive responses and better handling of long documents, they also come with increased computational costs and may require careful infrastructure planning.
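
The back-of-the-envelope arithmetic referenced in point 2 can be sketched as follows; the 1.5 tokens-per-word ratio is a rough English-language heuristic, and exact counts depend on the tokenizer.

```python
# Rough estimate of how many pages of text fit into a given context window.
TOKENS_PER_WORD = 1.5     # heuristic; measure with the actual tokenizer for precision
WORDS_PER_PAGE = 500

tokens_per_page = WORDS_PER_PAGE * TOKENS_PER_WORD   # ~750 tokens per page

context_windows = {
    "GPT-4 / Llama 3.1 (128k)": 128_000,
    "Gemini 1.5 Pro (2M)": 2_000_000,
}
for model, limit in context_windows.items():
    pages = limit / tokens_per_page
    print(f"{model}: roughly {pages:,.0f} pages per context window")
```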

4. Model Quality & Size / Benchmarks

Selecting a Large Language Model (LLM) involves understanding the capabilities and limitations of these models concerning reasoning, understanding, and knowledge representation. Reasoning in LLMs remains largely simulated and non-symbolic, meaning these models don’t perform logical deductions or abstract reasoning in the way a human would. There is currently no evidence to suggest that LLMs have actual understanding; instead, they generate responses based on patterns in data without grasping underlying meaning. Similarly, knowledge in LLMs is not stored as explicit, structured facts. Instead, it exists as materialized information or potential misinformation, emerging only during the inferencing process when the model generates responses.

Understanding Model Size

Larger LLMs, with more parameters, do not hold information in a traditional sense. Instead, they capture complex informational patterns as parametric structures distributed across their neural network architecture. This synthetic form of information, or potentially misinformation, is not a direct repository of knowledge. It becomes meaningful only during inferencing—when the model draws on its encoded patterns to respond to a query. The process of inferencing leverages these learned patterns to produce output that appears knowledgeable but is, in fact, a form of probabilistic simulation based on prior training data.

Rule of Thumb: Larger models with more parameters can potentially “capture” more nuanced patterns and associations within their network. This increased capacity often translates into an ability to handle more intricate language structures, produce more contextually relevant responses, and materialize more information.

However, larger models also demand greater computational resources, and their performance gains need to be balanced against infrastructure costs.

  • Large Models (70–405 billion parameters): These models are designed for in-depth analysis and tasks that require simulated, non-symbolic reasoning over long texts and multimodal data. They are suitable for applications requiring complex pattern recognition, but they are resource-intensive in terms of memory and processing power.

  • Mid-sized Models (8–70 billion parameters): Providing a balance between performance and efficiency, these models perform well in use cases like customer service and retrieval-augmented generation (RAG), where the tasks require moderate depth and precision without the highest resource demands.

  • Small Models (1–3 billion parameters): These models are more efficient and faster, making them suitable for lightweight applications like mobile assistants or edge deployments where resource limitations are a priority.

Types of Benchmarks for Model Evaluation

Benchmarks are essential for assessing an LLM's quality and its suitability for specific applications. They measure the model’s capabilities across various domains, helping users determine the best model for particular tasks.

  1. Dataset-based Benchmarks (Static): These benchmarks use predefined datasets to evaluate a model’s ability to answer questions or complete multiple-choice tasks (a minimal sketch of how such an item becomes a prompt follows this list).

    • MMLU (Massive Multitask Language Understanding): Assesses general language understanding across multiple domains.

    • HellaSwag: Evaluates common-sense reasoning by testing the model’s ability to predict the most plausible scenario.

    • GSM8K: Tests the model’s proficiency in mathematical problem-solving.

    • HumanEval: Measures coding skills, assessing the model’s ability to generate functional code.

    • ToxicChat: Evaluates the model’s safety by measuring its response to sensitive or toxic language.

  2. Live Benchmarks: These benchmarks test models in real-time scenarios, often through user interactions or competitions.

    • Contests / Exams (e.g., Codeforces): Provide models with live programming challenges, testing their ability to adapt to fresh questions and generate solutions in real time.

    • Crowdsourced Benchmarks (e.g., Chatbot Arena): Involves human users interacting with models in live chat scenarios, enabling dynamic assessments of conversational quality, engagement, and accuracy.

  3. Human Preference-based Benchmarks: These benchmarks assess model performance based on human judgments of response quality.

    • LLM Judgment (e.g., MT-Bench, AlpacaEval): Uses an LLM as a judge, acting as a proxy for human preference, to score the model on attributes such as helpfulness, accuracy, and coherence, typically comparing its responses to those generated by other models.
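
As a minimal sketch of how a static, dataset-based benchmark item is turned into a prompt, the snippet below loads one MMLU question from the Hugging Face Hub; the dataset id and field names are assumptions based on the Hub copy of MMLU, and `ask_model` is a hypothetical stand-in for any LLM call.

```python
from datasets import load_dataset

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an actual LLM call (API or local)."""
    return "A"

# Assumed dataset id and subject config; verify on the Hugging Face Hub.
mmlu = load_dataset("cais/mmlu", "college_computer_science", split="test")

item = mmlu[0]
letters = ["A", "B", "C", "D"]
prompt = (
    item["question"]
    + "\n"
    + "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
    + "\nAnswer with a single letter."
)

predicted = ask_model(prompt)
gold = letters[item["answer"]]   # gold label is stored as an index
print(prompt, f"\nPredicted: {predicted}  Gold: {gold}")
```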

Practical Benchmarking Platforms

Various platforms offer accessible interfaces and leaderboards to compare LLMs based on benchmark results:

  • Open LLM Leaderboard: Hosted by Hugging Face, this leaderboard allows users to evaluate models across benchmarks like MMLU, HellaSwag, and GSM8K, providing insight into model performance across various reasoning and understanding tasks.

  • LMSYS Chatbot Arena: A crowdsourced evaluation platform that ranks LLMs based on user interactions in live chat scenarios, measuring engagement, conversational quality, and response accuracy.

Parameters and Model Capabilities

The quality and size of an LLM are fundamentally tied to its parameters, the adjustable values within the neural network that are optimized during training. These parameters shape how the model interprets and generates language, controlling which associations and patterns are prioritized. Parameters can be categorized into:

  • Weights: These regulate the strength of associations between tokens, adjusting how words and phrases are connected.

  • Biases: Constant offsets added to a unit’s weighted input, nudging the model toward certain activations or interpretations (see the parameter-counting sketch below).

As a general rule:

  • Larger models, with more parameters, capture and encode more intricate linguistic structures, enabling them to process nuanced language and handle complex, contextually rich content. However, these models require significant computational resources.

  • Smaller models are faster, less resource-intensive, and often more suitable for applications with simpler language requirements or constrained environments.
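
To make the weights/biases distinction concrete, here is a minimal parameter-counting sketch for a single fully connected layer; the layer sizes are arbitrary illustrative values.

```python
# Parameters of one linear layer mapping d_in inputs to d_out outputs.
d_in, d_out = 4096, 4096

weights = d_in * d_out   # one weight per input-output connection
biases = d_out           # one bias (offset) per output unit
total = weights + biases

print(f"weights: {weights:,}  biases: {biases:,}  total: {total:,}")
# A full LLM stacks hundreds of such layers (plus attention and embedding
# matrices), which is how parameter counts reach the billions.
```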

Summary

Assessing model quality and size through benchmarks is crucial in the selection of an LLM. By examining performance across tasks like reasoning, coding, and conversational engagement, users can align model capabilities with application needs. The key is to balance the pattern recognition capacity of larger models with the resource efficiency of smaller models, ensuring that the selected LLM can effectively meet both the functional requirements and operational constraints of the intended application.

5. API / Infrastructure Costs & Performance

Selecting an LLM provider involves assessing the associated API and infrastructure costs along with the model's performance, especially if the model will be used frequently or in resource-intensive applications. These factors are crucial for ensuring cost-efficiency and optimizing operational expenses in long-term usage.

Key Cost and Performance Metrics

  1. Output Speed: The rate at which tokens are generated, measured in tokens per second. Faster output speeds enhance response times and improve user experience, particularly in applications requiring real-time interaction. For example, the Cerebras WSE-3 demonstrates high output speed, handling up to 2100+ tokens per second, ideal for high-performance needs but resource-intensive.

  2. Latency: Latency measures the time taken from sending an API request to receiving the first token of the response. Lower latency is beneficial in use cases where quick response times are necessary, such as conversational agents and live data processing.

  3. Cost Per Token: This metric, often quoted as cost per million tokens, provides a straightforward way to estimate API costs (see the estimate sketch after this list). Prices vary significantly between providers. Cerebras, for instance, offers a lower cost per token due to high efficiency but requires considerable initial infrastructure investment.

  4. Processor Type: Different providers utilize specialized hardware to optimize LLM performance. For instance:

    • Cerebras WSE-3: Employs a Wafer Scale Engine with numerous cores, optimized for parallel processing. This setup enables high output but demands significant memory resources.

    • NVIDIA H100 GPUs: These GPUs are highly adaptable, handling varied workloads, and are commonly used across different cloud providers, ensuring flexibility.

    • Groq LPU: The Language Processing Unit (LPU) by Groq uses interconnected chips to optimize memory-intensive tasks, balancing efficiency and speed, which can be advantageous for sequential tasks.
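
The cost-per-token metric from point 3 can be turned into a quick budget estimate as sketched below; all prices and volumes are hypothetical placeholders, not quotes from any specific provider.

```python
# Hypothetical monthly API cost estimate; substitute current list prices.
requests_per_day = 10_000
tokens_in, tokens_out = 800, 400        # average tokens per request

price_in_per_m = 2.50                    # USD per million input tokens (placeholder)
price_out_per_m = 10.00                  # USD per million output tokens (placeholder)

daily_cost = requests_per_day * (
    tokens_in / 1e6 * price_in_per_m + tokens_out / 1e6 * price_out_per_m
)
print(f"~${daily_cost:,.2f} per day, ~${daily_cost * 30:,.2f} per month")
```

An estimate like this can then be weighed against the fixed hardware and energy costs of local hosting to find the break-even point for a given workload.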

Balancing Cost and Performance

Choosing between providers and configurations should consider both initial and ongoing costs:

  • Cerebras WSE-3 offers high performance with lower per-token costs but at a higher initial setup cost, suitable for organizations needing to support high-volume and real-time processing.

  • SambaNova SN40L is designed for memory efficiency, offering moderate speed and cost balance for mid-size applications.

  • NVIDIA GPUs like the H100 offer flexibility across various workloads, making them a versatile choice for businesses needing adaptable hardware without specialized setup requirements.

In summary, when evaluating API and infrastructure costs, organizations must consider their specific use cases, balancing speed, efficiency, and long-term costs.

6. Model Precision – Quantization Techniques

In the pursuit of deploying large language models efficiently, model precision, specifically quantization, is a critical area of focus. Quantization refers to the process of reducing the precision of model parameters, shifting from the typical 32-bit floating-point representation (FP32) to lower bit-width formats, such as 16-bit (e.g., BF16 or FP16) and even 8-bit or smaller integer formats. The motivation behind quantization lies in reducing both the computational load and memory requirements, allowing for deployment on less powerful hardware without significantly compromising model performance.

Quantization Methods and Goals:

  • Reduction in Computational Overhead: Lower bit-widths mean less data to process per parameter, resulting in faster computations, especially on CPUs and specific hardware optimized for integer operations rather than floating-point calculations.

  • Memory Efficiency: For models with billions of parameters, storage requirements can be prohibitively high. Quantizing a 32-bit model down to 16-bit halves memory usage, and further reductions are possible with 8-bit or even 4-bit formats, significantly cutting down VRAM demands.
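
The memory argument can be made concrete with a rough footprint calculation; the figures below cover the weights alone (runtime overhead, activations, and the KV cache come on top) and assume an 8-billion-parameter model as an example.

```python
# Approximate weight memory for an 8B-parameter model at different precisions.
params = 8e9
bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gigabytes = params * nbytes / 1024**3
    print(f"{fmt:>9}: ~{gigabytes:.1f} GB")
```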

Example Illustration: As shown in the image comparison below, quantization reduces the color information in an image while retaining its essential details. This metaphorically parallels how model parameters can be simplified without losing core functionality.

(Image comparison adapted by Maarten Grootendorst from an original by Slava Sidorov; Grootendorst’s article on quantization is a recommended companion read.)

Key Quantization Types:

  1. BF16 (Brain Floating Point 16): Unlike standard FP16, BF16 maintains an 8-bit exponent like FP32, allowing for a wider dynamic range, which is beneficial in high-performance settings for large-scale models. BF16 offers a balance between memory efficiency and maintaining model accuracy, making it popular in LLM deployments.

  2. INT8 (8-Bit Integer): By representing model parameters as integers rather than floating-point values, INT8 quantization reduces data requirements further. This is particularly effective on hardware that supports integer arithmetic, where calculations can be significantly faster and less power-intensive than floating-point operations.

  3. Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT):

    • PTQ: Applies quantization after the model has been trained, adjusting weights to lower precision without further training. This is often used when computational resources are limited, and minor accuracy trade-offs are acceptable.

    • QAT: Integrates quantization during the training process, allowing the model to adjust to lower precision, reducing potential accuracy loss. This technique is ideal for more demanding applications where precision is critical.

  4. GGUF Quantization Framework: Designed for flexibility across mixed hardware (CPU/GPU) environments, GGUF enables efficient processing by offloading model layers to CPU if GPU memory is limited. GGUF supports quantization down to as low as 2-bit, 4-bit, and 6-bit formats, allowing high adaptability to hardware constraints, such as local deployment on laptops with limited RAM.
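
As an illustration of the GGUF point above, the following is a minimal sketch using llama-cpp-python; the file name and the number of offloaded layers are assumptions to adjust for the available hardware.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # assumed local 4-bit GGUF file
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=20,   # offload some layers to the GPU, keep the rest on CPU (0 = CPU only)
)

output = llm("Explain quantization in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```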

Conclusion

Quantization is a powerful tool for optimizing large language models, facilitating their deployment on diverse platforms with constrained resources. By intelligently reducing precision, quantization allows models to maintain essential performance characteristics while minimizing computation and memory demands, ensuring feasible, cost-effective application across various hardware environments.

Agent Engineering & Monitoring

In discussing Agent Engineering & Monitoring, we focus on creating structured, enterprise-grade LLM agents with real-time monitoring to ensure reliability, alignment with business objectives, and adaptability in dynamic environments. This approach emphasizes using robust frameworks and monitoring strategies to build and maintain high-quality LLM deployments.

1. LLM Agent Design

  • LLM Tooling: This involves selecting the appropriate tools and frameworks necessary for building, deploying, and iterating on LLM agents. The tooling process typically includes selecting the LLM model, applying prompt engineering, and setting up any required integrations.

  • Monitoring: Real-time monitoring is essential to ensure that deployed agents function as expected. This involves tracking metrics such as response accuracy, latency, and user satisfaction. Effective monitoring allows for identifying issues in performance or behavior, prompting refinements to the agent's setup and potentially necessitating updates in the training data, prompt structure, or model choice.

The LLM Agent Design framework involves Build-Time and Run-Time stages:

  • Build-Time: Here, the foundation of the LLM agent is established. This includes setting up prompt templates, defining safeguarding measures, and determining evaluation metrics. The agent is deployed with initial prompts and guardrails, and specific criteria for performance evaluation are established.

  • Run-Time: This stage includes real-time interactions, prompt transformation, moderation, inferencing, and output safeguarding. The agent’s performance is monitored constantly, allowing for real-time adjustments based on collected metrics and usage patterns.

LLM Agent Design Framework
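
A schematic sketch of this Build-Time / Run-Time split is given below; the helper functions (moderate, call_llm, redact_pii) are hypothetical placeholders for whatever moderation, inference, and safeguarding components are actually used.

```python
# ---- Build-Time: fixed prompt template, guardrails, evaluation criteria ----
PROMPT_TEMPLATE = (
    "You are a support agent for ACME.\n"
    "Only answer questions about ACME products.\n\nUser: {user_input}\nAgent:"
)
BLOCKED_TOPICS = {"medical advice", "legal advice"}

def moderate(text: str, blocked: set[str]) -> bool:
    """Hypothetical guardrail: flag input that touches a blocked topic."""
    return any(topic in text.lower() for topic in blocked)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the actual inference call (API or local)."""
    return "Here is how to reset your ACME router: ..."

def redact_pii(text: str) -> str:
    """Hypothetical output safeguard; a real system would use a PII detector."""
    return text

# ---- Run-Time: transform, moderate, infer, safeguard (and log for monitoring) ----
def handle_request(user_input: str) -> str:
    if moderate(user_input, BLOCKED_TOPICS):
        return "Sorry, I can't help with that topic."
    prompt = PROMPT_TEMPLATE.format(user_input=user_input)
    return redact_pii(call_llm(prompt))

print(handle_request("My router keeps dropping the connection."))
```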

2. LLM Tooling

The LLM tooling landscape comprises various tools and frameworks to support different aspects of agent development and deployment, ranging from basic prompt templates to advanced multi-agent orchestration.

  • Pro-Code Agent Frameworks: Frameworks such as Haystack, LangChain, and LlamaIndex are aimed at users with coding expertise, providing tools for prompt chaining, data retrieval, and conversational AI.

  • No-Code Agent Frameworks: For users seeking ease of use without programming, tools like LangFlow and Flowise enable basic agent creation without complex setups.

  • Playgrounds: Examples include Hugging Face Playground, OpenAI, and Cohere. These platforms allow for direct interaction with models and are valuable for prototyping and testing prompts.

  • Dialogue Frameworks and UI Interfaces: Tools like Voiceflow, botpress, Gradio, and Streamlit support the development of chat and dialogue interfaces, providing a more user-friendly way to test agent responses.

LLM Frameworks

Currently, there are three main pro-code frameworks available:

  • Haystack: An open-source Python framework that integrates seamlessly with major LLM providers and databases, offering stability and production-readiness. Its capabilities in document indexing, RAG workflows, and context-aware query processing make it well-suited for complex applications.

  • LangChain: A general-purpose framework that provides tools for text generation, translation, summarization, and more. It supports the structuring of workflows with prompt chaining (see the sketch after this list), making it well-suited for multi-tool integrations.

  • LlamaIndex: Specializes in search and retrieval applications, ideal for content generation and virtual assistants. It is optimized for retrieval-augmented generation (RAG) workflows with seamless integration of custom data sources.
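
The prompt-chaining idea can be sketched with LangChain's expression language as shown below; package and module names reflect recent LangChain releases and may differ between versions, and the model name is an assumption.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model name

summarize = ChatPromptTemplate.from_template(
    "Summarize the following support ticket in two sentences:\n\n{ticket}"
)
classify = ChatPromptTemplate.from_template(
    "Classify this summary as 'billing', 'technical', or 'other':\n\n{summary}"
)

# Two small chains: the output of the first feeds the second.
summary_chain = summarize | llm | StrOutputParser()
classify_chain = classify | llm | StrOutputParser()

ticket = "My October invoice was charged twice and I cannot reach support."
summary = summary_chain.invoke({"ticket": ticket})
label = classify_chain.invoke({"summary": summary})
print(summary, label, sep="\n")
```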

3. Monitoring

In the context of Agent Engineering & Monitoring, PromptOps is a crucial concept that mirrors the principles of DevOps but is tailored specifically for managing and optimizing prompts in LLM systems. PromptOps encompasses a range of practices focused on continuous monitoring, management, and optimization of prompt interactions with LLMs to ensure sustained performance and reliability.

PromptOps strategies are:

  • Real-Time Monitoring: Capturing detailed metrics on prompt performance, response times, and usage patterns, which can help identify areas for improvement.

  • Iterative Prompt Optimization: Continuously analyzing prompt effectiveness allows teams to refine prompts based on observed outcomes, improving agent performance over time.

  • Feedback Loops: Providing mechanisms to log interactions, track user feedback, and generate insights, which are crucial for making data-driven adjustments to the agent.

Examples of monitoring tools include Lunary AI, LangFuse, LangSmith Hub, Humanloop, PromptLayer, and Weights & Biases, each offering unique capabilities for monitoring, managing, and optimizing model prompts in real time.
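
As a library-agnostic sketch of the kind of per-request record such tools capture, consider the following; the field names are illustrative and do not correspond to the schema of any specific product.

```python
import json
import time
import uuid

def log_llm_call(prompt: str, call_model) -> str:
    """Wrap any LLM call and emit a monitoring record for it."""
    start = time.time()
    response = call_model(prompt)                 # any LLM call: API or local
    record = {
        "trace_id": str(uuid.uuid4()),
        "prompt": prompt,
        "response": response,
        "latency_s": round(time.time() - start, 3),
        "prompt_chars": len(prompt),
        "response_chars": len(response),
        "user_feedback": None,                    # filled in later, e.g. thumbs up/down
    }
    print(json.dumps(record))                     # in practice: send to a monitoring backend
    return response

# Dummy model so the sketch runs standalone.
log_llm_call("Summarize our refund policy.", lambda p: "Refunds are issued within 14 days.")
```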
