This is the fourth and final article in our series on Models-as-a-Service for enterprises. In this article, we will focus on comprehensive security and scalability measures for a Models-as-a-Service (MaaS) platform in an enterprise environment.
Security is a core tenet of MaaS
Security is not an add-on to the MaaS framework; it is a fundamental design principle that addresses a range of potential vulnerabilities and compliance requirements across every connection described in this series. The platform proactively mitigates the risks associated with deploying and accessing LLMs, ensuring data integrity, confidentiality, and adherence to regulatory standards.
1. Platform security for model deployment and management
- Secure and adaptable AI platform with Red Hat OpenShift AI: Red Hat OpenShift AI serves as the bedrock of the MaaS platform, offering a highly secure and flexible environment for the entire AI lifecycle. Organizations have the freedom to choose where to deploy models: on-premises for maximum control, in the public cloud for scalability, or at the edge for localized processing. This versatility is coupled with comprehensive support for model training, fine-tuning, and serving, streamlining the AI workflow.
- End-to-end AI governance enabled by an integrated technology stack: The MaaS solution stack, comprising OpenShift AI, the 3scale API Gateway, and single sign-on (SSO), establishes a comprehensive system for AI governance. This integrated architecture fosters a managed and regulated environment, providing organizations with the visibility and control necessary to oversee every stage of the AI lifecycle, from development to deployment and usage.
- Automated and secure credential handling during deployment: OpenShift AI enhances security by automating the handling of sensitive credentials during model deployment. Connection parameters, such as access keys and secret keys for S3 storage, are automatically injected as environment variables directly into the model runtime or workbench (see the sketch after this list). This eliminates the risky practice of embedding credentials in code, significantly reducing the attack surface and the potential for exposure.
- Configurable token authentication at the model serving endpoint: To fortify security, OpenShift AI offers optional token authentication during model deployment. While this option may be left disabled in workshop or demonstration settings, enabling it adds token-based security at the model serving endpoint, preventing unauthorized access and ensuring that only validated requests are processed.
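As a minimal illustration of the credential-injection pattern, the sketch below shows how code running in a workbench or model runtime can read injected S3 credentials from the environment instead of hard-coding them. The environment variable names, bucket, and object key are assumptions; match them to whatever your data connection actually injects.

```python
import os
import boto3  # AWS SDK for Python; used here to read model artifacts from S3-compatible storage

# Credentials are injected by the platform as environment variables at deployment time,
# so nothing sensitive is ever committed to notebook or model-serving code.
# The variable names below follow the common AWS convention; adjust them to your data connection.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["AWS_S3_ENDPOINT"],
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

# Example: pull a model artifact from the bucket named in the data connection
# (the bucket and object key here are illustrative).
bucket = os.environ.get("AWS_S3_BUCKET", "models")
s3.download_file(bucket, "granite/model.safetensors", "/tmp/model.safetensors")
```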
2. API Gateway with 3scale for secure access and compliance
Secure access and compliance at the API layer are crucial for the connections described in this series. The MaaS architecture relies heavily on the 3scale API Gateway to expose LLM services securely; the gateway provides enterprise-grade control and security for model APIs, acting as a critical intermediary between applications and the LLMs.
- Enforced API authentication with JWT/OAuth2: 3scale enforces robust API authentication using standards such as JWT (JSON Web Tokens) and OAuth2 for all LLM access (see the sketch after this list). This ensures that only authorized applications and developers with valid credentials can interact with the deployed models, effectively preventing unauthorized or malicious access.
- End-to-end encrypted traffic for data privacy: All API traffic to and from the LLM services is secured with encryption, ensuring that data in transit remains private and protected from eavesdropping. This encryption is vital for upholding data confidentiality, especially when dealing with sensitive information.
- Comprehensive audit logs for regulatory compliance: The API Gateway generates detailed audit logs that track all API usage. These logs are essential for demonstrating compliance with regulations such as GDPR, HIPAA, and SOC2, providing organizations with the means to monitor and verify adherence to data security and privacy standards. The logs offer a transparent and auditable record of all interactions with the LLM services.
- Usage policies and governance for cost management and control: 3scale allows administrators to define and enforce rate limits and quotas on API usage. These controls are critical for managing costs, preventing excessive consumption of resources, and monitoring LLM API usage at a granular level (by team or project). This facilitates better planning and cost optimization.
- Developer enablement with integrated security: The self-service developer portal offered by 3scale streamlines LLM API discovery and provides automatically generated API documentation. It also handles the secure management of access credentials, such as API keys, simplifying the integration process for developers while ensuring security is not compromised.
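To make the authentication bullet above concrete, here is a minimal client sketch that calls an LLM endpoint exposed through the gateway with a bearer credential. The gateway URL, model name, and token source are placeholders; whether the gateway expects an OAuth2/OIDC token or an API key depends on how the 3scale product is configured.

```python
import os
import requests

# Hypothetical gateway endpoint exposing an OpenAI-compatible chat completions API.
GATEWAY_URL = "https://llm.apps.example.com/v1/chat/completions"

# Token issued by the identity provider (or an API key, depending on the 3scale plan).
token = os.environ["LLM_API_TOKEN"]

response = requests.post(
    GATEWAY_URL,
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "granite-3-8b-instruct",  # illustrative model name
        "messages": [{"role": "user", "content": "Summarize our Q3 incident report."}],
    },
    timeout=60,
)
response.raise_for_status()  # a 401/403 here means the gateway rejected the credentials
print(response.json()["choices"][0]["message"]["content"])
```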
3. Unified identity management for zero-trust access
The MaaS solution stack integrates an authentication component built on SSO (based on Keycloak) to provide unified identity management for all LLM services. This enables a zero-trust security model.
- Zero-trust security through centralized authentication: Centralized authentication via protocols such as OIDC (OpenID Connect) and SAML (Security Assertion Markup Language) is implemented for all LLM tools, so every access request is verified, adhering to the principle of “never trust, always verify.” The sketch after this list shows a typical OIDC token flow.
- Role-based access control (RBAC) for granular permissions: The platform uses RBAC, enabling fine-grained permissions to LLM services and resources. Access is determined based on user roles, granting only the necessary privileges, which minimizes the risk of unauthorized access.
- Multifactor authentication (MFA) support for enhanced security: For sensitive AI workloads, the platform provides support for Multifactor Authentication (MFA). This adds an extra layer of security, requiring users to provide multiple forms of identification before granting access.
- Enterprise identity integration with existing systems: The platform integrates with existing identity providers such as Active Directory or LDAP, allowing for seamless integration with enterprise infrastructure. This integration streamlines user provisioning and deprovisioning, ensuring that access permissions are always up to date.
- Single sign-on (SSO) for uniform access across hybrid cloud environments: SSO is supported for all internal AI portals. This ensures consistent access policies are enforced across all hybrid cloud environments, simplifying user experience while maintaining stringent security standards.
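The following sketch shows what the centralized authentication flow can look like from a client's point of view: a service account exchanges its client credentials for a short-lived OIDC access token at the Keycloak-based SSO server and presents that token on every call to the gateway. The realm, client ID, and hostname are placeholders for values provisioned by your SSO administrator.

```python
import requests

# Placeholder Keycloak-style token endpoint; the realm and hostname depend on your SSO deployment.
SSO_TOKEN_URL = "https://sso.example.com/realms/maas/protocol/openid-connect/token"

# Exchange client credentials for a short-lived access token ("never trust, always verify":
# every caller authenticates, and the token expires quickly).
token_response = requests.post(
    SSO_TOKEN_URL,
    data={
        "grant_type": "client_credentials",
        "client_id": "inventory-chatbot",  # hypothetical client registered in SSO
        "client_secret": "REDACTED",
    },
    timeout=30,
)
token_response.raise_for_status()
access_token = token_response.json()["access_token"]

# The token is then sent as a bearer credential on every request to the LLM gateway,
# where the gateway validates it before forwarding traffic to the model.
headers = {"Authorization": f"Bearer {access_token}"}
```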
Specific considerations include:
- Model indemnification: The IBM Granite series offers competitively priced models with lower infrastructure requirements, IP indemnification, and an easy-to-use toolkit for model customization and application integration.
- Ensuring model authenticity: While detailed procedures for verifying the source, or provenance, of models are not explicitly outlined, the MaaS approach inherently addresses this concern. Centralizing model management within the organization's IT function provides control over which open source models are deployed and how they are modified with proprietary data, implying an internal vetting process that draws only on trusted sources. The emphasis on complying with existing security, data, and privacy policies by avoiding third-party hosted models further reinforces this.
Scaling the inference service for MaaS for enterprises
The Models-as-a-Service solution leverages vLLM to provide high-performance, scalable, and cost-effective LLM inference and serving. Key scalability aspects include:
- High-throughput and memory-efficient inference: Using vLLM as the inference server enables the platform to handle a substantial volume of requests from applications like AnythingLLM, thanks to a design optimized for high throughput and memory efficiency.
- Efficient memory management with PagedAttention: vLLM's PagedAttention mechanism reduces memory waste by efficiently managing attention key and value memory, allowing larger batches and more concurrent requests on the same hardware, thus enhancing throughput and scalability.
- Continuous batching of incoming requests: The continuous batching feature in vLLM processes incoming requests together, regardless of arrival time or length variations, thereby optimizing GPU utilization and boosting throughput, which is critical for managing unpredictable loads from multiple applications.
- Parallelism and distributed inference: For scaling large models, vLLM supports tensor and pipeline parallelism for distributed inference, enabling efficient distribution of LLMs across multiple hardware accelerators and nodes to serve very large models or a higher number of concurrent requests.
- Quantization for reduced resource consumption: Integrating various quantization techniques (GPTQ, AWQ, INT4, INT8, FP8) reduces a model's memory footprint and computational demands, allowing more models to be hosted or larger batch sizes to be processed on existing hardware, directly impacting the MaaS platform's scalability and cost-effectiveness. The sketch after this list shows how these options are configured in vLLM.
- Optimized kernels and execution: vLLM employs optimized CUDA kernels, including integration with FlashAttention and FlashInfer, and features fast model execution with CUDA/HIP graphs, resulting in faster inference, quicker responses to requests from applications like AnythingLLM, and higher queries per second (QPS).
- Advanced scheduling and caching features: Advanced scheduling, speculative decoding, chunked prefill, and prefix caching within the vLLM framework optimize request processing and reuse intermediate computations. This leads to improved latency and higher throughput, especially for common prompts or longer sequences.
- Disaggregated serving: The llm-d framework, which builds on vLLM, uses disaggregated serving to run prefill and decode on independent instances, enhancing scalability by optimizing resource allocation for different stages of LLM inference.
- Hardware-agnostic performance: Designed to perform efficiently across various hardware, including NVIDIA GPUs, AMD GPUs and CPUs, Intel GPUs and CPUs, Google TPUs, and AWS Neuron, vLLM lets enterprises scale LLM deployments on diverse and potentially more cost-effective hardware, avoiding vendor lock-in and supporting hybrid cloud strategies.
- Continuous performance optimization: Ongoing efforts, such as those on Google TPUs, demonstrate significant performance improvements for models like Llama 3, achieved through optimizations to the ragged attention kernel, KV cache writing, compilation fixes, and communication within the TPU pod, enhancing the scalability and efficiency of LLM serving on specific hardware.
- Centralized API management for scalability and governance: The 3scale API Gateway manages scalability for the LLM services by enabling administrators to set rate limits and quotas, preventing cost overruns and service overload, thus ensuring controlled and predictable scaling of LLM consumption across different teams or projects.
- Automated deployment of new models and products: The MaaS approach streamlines the deployment of new models by automating 3scale configuration via its operator. This leads to faster innovation and deployment speed, enabling enterprises to quickly introduce new LLM capabilities to their applications and scale their AI offerings.
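As a rough sketch of how several of these options come together, the snippet below uses vLLM's offline Python API to load a hypothetical quantized model across two accelerators and submit a batch of prompts that continuous batching schedules together. In the MaaS platform the same engine runs behind the OpenShift AI serving runtime rather than being driven directly from application code, so treat the model identifier and settings as illustrative.

```python
from vllm import LLM, SamplingParams  # vLLM's offline inference API

# Illustrative configuration: a hypothetical AWQ-quantized model sharded across two GPUs.
# In the MaaS platform these options are set on the serving runtime, not in application code.
llm = LLM(
    model="example-org/granite-8b-instruct-awq",  # placeholder model identifier
    tensor_parallel_size=2,       # split the model across 2 accelerators
    quantization="awq",           # assumes an AWQ-quantized checkpoint
    gpu_memory_utilization=0.90,  # leave headroom for the KV cache managed by PagedAttention
)

sampling = SamplingParams(temperature=0.2, max_tokens=256)

# Many prompts submitted at once: vLLM's continuous batching schedules them together
# to keep the accelerators busy and maximize throughput.
prompts = [
    "Summarize the refund policy in two sentences.",
    "Draft a status update for the migration project.",
    "List three risks of storing credentials in source code.",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text.strip())
```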
By incorporating vLLM's advanced optimizations and leveraging the robust capabilities of OpenShift AI and the 3scale API Gateway, the MaaS platform ensures that LLMs are not only accessible but also served efficiently, cost-effectively, and at scale to meet enterprise demands.
Wrap up
This brings us to the end of our series. The Models-as-a-Service (MaaS) platform securely deploys AI models using Red Hat OpenShift AI and the 3scale API Gateway, emphasizing security through platform protections, AI governance, and API authentication. It scales using vLLM for efficient LLM serving, supported by single sign-on for identity management. This infrastructure ensures secure, scalable, and cost-effective deployment, protecting data and adhering to regulations, thus enabling an enterprise-wide AI approach.
If you haven't already, check out the other articles in this series:
- Part 1: Discover the 6 benefits of Models-as-a-Service for enterprises, an introduction to MaaS for enterprises.
- Part 2: Explore broad architectural details and learn why enterprises need MaaS.
- Part 3: Learn about how to implement MaaS in an enterprise and its various components.