Expand Model-as-a-Service for secure enterprise AI

July 17, 2025
Ritesh Shah, Karl Eklund, Juliano Mohr, gmoutier@redhat.com, egranger@redhat.com
Related topics:
Artificial intelligence, Data Science, Open source, Platform engineering, Security, Summit 2025
Related products:
Red Hat 3scale API Management, Red Hat OpenShift GitOps, Red Hat OpenShift, Red Hat Single Sign-On

    This is the fourth and final article in our series on Models-as-a-Service for enterprises. In this article, we will focus on comprehensive security and scalability measures for a Models-as-a-Service (MaaS) platform in an enterprise environment.

    Security is a core tenet of MaaS

    The security underpinning the connections detailed in this article is an inherent part of the MaaS framework. It is a fundamental design principle that addresses a range of potential vulnerabilities and compliance requirements. The platform proactively mitigates risks associated with deploying and accessing LLMs, ensuring data integrity, confidentiality, and adherence to regulatory standards.

    1. Platform security for model deployment and management

    • Secure and adaptable AI platform with Red Hat OpenShift AI: Red Hat OpenShift AI serves as the bedrock of the MaaS platform, offering a highly secure and flexible environment for the entire AI lifecycle. Organizations have the freedom to choose where to deploy models: on-premises for maximum control, in the public cloud for scalability, or at the edge for localized processing. This versatility is coupled with comprehensive support for model training, fine-tuning, and serving, streamlining the AI workflow.
    • End-to-end AI governance enabled by integrated technology stack: The MaaS solution stack, comprising OpenShift AI, the 3scale API Gateway, and single sign-on (SSO), establishes a comprehensive system for AI governance. This integrated architecture fosters a managed and regulated environment, providing organizations with the visibility and control necessary to oversee every stage of the AI lifecycle, from development to deployment and usage.
    • Automated and secure credential handling during deployment: OpenShift AI enhances security by automating the handling of sensitive credentials during model deployment. Connection parameters, such as Access Keys and Secret Keys for S3 storage, are automatically injected as environment variables directly into the model runtime or workbench. This eliminates the risky practice of embedding credentials in code, significantly reducing the attack surface and potential for exposure.
    • Option for configurable token authentication at the model serving endpoint: To fortify security, OpenShift AI offers optional token authentication during model deployment. Although this option is left unchecked for demonstration purposes in some workshop settings, enabling it adds token-based security at the model serving endpoint, preventing unauthorized access and ensuring that only validated requests are processed (see the sketch below).
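
    To make the credential-injection and token-authentication points concrete, here is a minimal Python sketch of what a deployed client might do. The environment variable names, endpoint URL, and payload shape are illustrative assumptions, not a verbatim OpenShift AI contract:

    ```python
    import os

    import requests

    # S3 connection parameters injected by the platform as environment
    # variables at deployment time (names are illustrative); nothing is
    # hard-coded in the source.
    s3_access_key = os.environ["AWS_ACCESS_KEY_ID"]
    s3_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

    # Token-authenticated call to a model serving endpoint (hypothetical URL).
    # With token authentication enabled, requests without a valid bearer
    # token are rejected at the endpoint.
    endpoint = "https://granite.maas.example.com/v1/completions"
    token = os.environ["MODEL_API_TOKEN"]  # illustrative variable name

    resp = requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {token}"},
        json={"model": "granite", "prompt": "Hello", "max_tokens": 64},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())
    ```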

    2. API Gateway with 3scale for secure access and compliance

    An API gateway providing secure access and compliance is crucial for the connections described in this article. The MaaS architecture relies heavily on the 3scale API Gateway to expose LLM services securely; it provides enterprise-grade control and security for model APIs, acting as a critical intermediary between applications and the LLMs.

    • Enforced API authentication with JWT/OAuth2: 3scale enforces robust API authentication using standards like JWT (JSON Web Tokens) and OAuth2 for all LLM access. This ensures that only authorized applications and developers, with valid credentials, can interact with the deployed models, effectively preventing unauthorized or malicious access.
    • End-to-end encrypted traffic for data privacy: All API traffic to and from the LLM services is secured with encryption, ensuring that data in transit remains private and protected from eavesdropping. This encryption is vital for upholding data confidentiality, especially when dealing with sensitive information.
    • Comprehensive audit logs for regulatory compliance: The API Gateway generates detailed audit logs that track all API usage. These logs are essential for demonstrating compliance with regulations such as GDPR, HIPAA, and SOC2, providing organizations with the means to monitor and verify adherence to data security and privacy standards. The logs offer a transparent and auditable record of all interactions with the LLM services.
    • Usage policies and governance for cost management and control: 3scale allows administrators to define and enforce rate limits and quotas on API usage. These controls are critical for managing costs, preventing excessive consumption of resources, and monitoring LLM API usage on a granular level (by team or project). This facilitates better planning and cost optimization.
    • Developer enablement with integrated security: The self-service developer portal offered by 3scale streamlines LLM API discovery and provides automatically generated API documentation. It also handles the secure management of access credentials, like API keys, simplifying integration for developers while ensuring security is not compromised.
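
    The sketch below ties the authentication and rate-limiting bullets together from a client's perspective. It assumes the gateway accepts a credential issued through the developer portal as a bearer token and answers HTTP 429 when a rate limit or quota is exceeded, a common 3scale setup; the URL, key, and payload are hypothetical:

    ```python
    import time

    import requests

    GATEWAY_URL = "https://maas-gateway.example.com/v1/chat/completions"  # hypothetical
    API_KEY = "key-from-the-3scale-developer-portal"  # placeholder credential

    def ask_llm(prompt: str, retries: int = 3) -> dict:
        """Call the LLM behind the gateway, backing off when rate limited."""
        for attempt in range(retries):
            resp = requests.post(
                GATEWAY_URL,
                headers={"Authorization": f"Bearer {API_KEY}"},
                json={"messages": [{"role": "user", "content": prompt}]},
                timeout=60,
            )
            if resp.status_code == 429:  # rate limit or quota enforced by the gateway
                time.sleep(2 ** attempt)  # exponential backoff, then retry
                continue
            resp.raise_for_status()
            return resp.json()
        raise RuntimeError("gateway quota exhausted after retries")
    ```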

    3. Unified identity management for zero-trust access

    The MaaS solution stack integrates an authentication component built on SSO (based on Keycloak) to implement unified identity management for all LLM services. This empowers a zero-trust security model.

    • Zero-trust security through centralized authentication: Centralized authentication via protocols like OIDC (OpenID Connect) and SAML (Security Assertion Markup Language) is implemented for all LLM tools. This ensures every request for access is verified, adhering to the principle of “never trust, always verify.”
    • Role-based access control (RBAC) for granular permissions: The platform uses RBAC, enabling fine-grained permissions to LLM services and resources. Access is determined based on user roles, granting only the necessary privileges, which minimizes the risk of unauthorized access.
    • Multifactor authentication (MFA) support for enhanced security: For sensitive AI workloads, the platform provides support for Multifactor Authentication (MFA). This adds an extra layer of security, requiring users to provide multiple forms of identification before granting access.
    • Enterprise identity integration with existing systems: The platform integrates with existing identity providers such as Active Directory or LDAP, allowing for seamless integration with enterprise infrastructure. This integration streamlines user provisioning and deprovisioning, ensuring that access permissions are always up to date.
    • Single sign-on (SSO) for uniform access across hybrid cloud environments: SSO is supported for all internal AI portals. This ensures consistent access policies are enforced across all hybrid cloud environments, simplifying user experience while maintaining stringent security standards.
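
    As an illustration of the centralized OIDC flow, a service can obtain a JWT from the SSO server with the standard OAuth2 client-credentials grant and present it to the gateway. The host, realm, and client names are hypothetical; the token endpoint path follows the standard Keycloak layout:

    ```python
    import requests

    # Standard Keycloak token endpoint: /realms/<realm>/protocol/openid-connect/token
    TOKEN_URL = "https://sso.example.com/realms/maas/protocol/openid-connect/token"

    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "client_credentials",  # machine-to-machine OAuth2 flow
            "client_id": "llm-app",              # hypothetical client
            "client_secret": "<from-a-secret>",  # loaded from a secret, never embedded in code
        },
        timeout=30,
    )
    resp.raise_for_status()
    access_token = resp.json()["access_token"]  # JWT to present to the 3scale gateway
    ```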

    Specific considerations include:

    • Model indemnification: With the IBM Granite series, you can pick a competitively priced model with lower infrastructure requirements, IP indemnification, and an easy-to-use toolkit for model customization and application integration.
    • Ensuring model authenticity implicitly: While detailed procedures for verifying the authentic source, or provenance, of models are not explicitly outlined, the MaaS approach inherently addresses this. By centralizing model management within the organization's IT function, the organization controls which open source models are deployed and how they are modified with proprietary data. This centralized control implies an internal vetting process for models from trusted sources, and the emphasis on complying with existing security, data, and privacy policies by avoiding third-party hosted models further reinforces it.

    Scaling the inference service for enterprise MaaS

    The Models-as-a-Service solution significantly leverages vLLM to provide high-performance, scalable, and cost-effective large language model (LLM) inference and serving. Key scalability aspects include:

    • High-throughput and memory-efficient inference: Utilizing vLLM as the inference server enables the handling of a substantial volume of requests from applications like AnythingLLM, thanks to its design optimized for high-throughput and memory efficiency.
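
    As a minimal illustration of this, vLLM's offline Python API schedules a whole batch of prompts through a single engine; the model name below is illustrative:

    ```python
    from vllm import LLM, SamplingParams

    # Load a model into vLLM's inference engine (model name is illustrative).
    llm = LLM(model="ibm-granite/granite-3.0-8b-instruct")
    sampling = SamplingParams(temperature=0.7, max_tokens=128)

    # All prompts are scheduled together; continuous batching keeps the GPU
    # saturated instead of serving requests strictly one at a time.
    prompts = [
        "Summarize the benefits of Models-as-a-Service in two sentences.",
        "Explain PagedAttention in one sentence.",
    ]
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)
    ```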

    • Efficient memory management with PagedAttention: vLLM's PagedAttention mechanism reduces memory waste by efficiently managing attention key and value memory, allowing for increased batching and concurrent serving of requests on the same hardware, thus enhancing throughput and scalability.

    • Continuous batching of incoming requests: The continuous batching feature in vLLM processes incoming requests together, regardless of arrival time or length variations, thereby optimizing GPU utilization and boosting throughput, which is critical for managing unpredictable loads from multiple applications.

    • Parallelism and distributed inference: For scaling large models, vLLM supports tensor and pipeline parallelism for distributed inference, enabling efficient distribution of LLMs across multiple hardware accelerators and nodes to serve very large models or a higher number of concurrent requests (a configuration sketch follows the next bullet).

    • Quantization for reduced resource consumption: Integrating various quantization techniques (GPTQ, AWQ, INT4, INT8, FP8) reduces the model's memory footprint and computational demands, allowing more models to be hosted or larger batch sizes to be processed on existing hardware, directly impacting the MaaS platform's scalability and cost-effectiveness.
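
    Both knobs are arguments to the same engine constructor. A sketch, assuming four accelerators and a checkpoint already quantized in AWQ format (the model name is hypothetical):

    ```python
    from vllm import LLM

    llm = LLM(
        model="example-org/granite-8b-awq",  # hypothetical AWQ-quantized checkpoint
        quantization="awq",                  # other supported values include "gptq"
        tensor_parallel_size=4,              # shard the model across 4 accelerators
    )
    ```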

    • Optimized kernels and execution: vLLM employs optimized CUDA kernels, including integration with FlashAttention and FlashInfer, and features fast model execution with CUDA/HIP graphs, resulting in faster inference speeds, quicker response times to requests from applications like AnythingLLM, and higher queries per second (QPS).

    • Advanced scheduling and caching features: Advanced scheduling, speculative decoding, chunked prefill, and prefix caching within the vLLM framework optimize request processing and reuse intermediate computations. This leads to improved latency and higher throughput, especially for common prompts or longer sequences.
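
    These features are likewise enabled through engine arguments. A sketch, useful when many requests share a long system prompt (the model name is illustrative):

    ```python
    from vllm import LLM

    llm = LLM(
        model="ibm-granite/granite-3.0-8b-instruct",  # illustrative
        enable_prefix_caching=True,   # reuse KV-cache blocks across shared prefixes
        enable_chunked_prefill=True,  # interleave long prefills with decode steps
    )
    ```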

    • Disaggregated serving: The llm-d framework, which builds on vLLM, uses disaggregated serving to run prefill and decode on independent instances, enhancing scalability by optimizing resource allocation for different stages of LLM inference.

    • Hardware-agnostic performance: Designed to perform efficiently across various hardware, including NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, Google TPUs, and AWS Neuron, vLLM allows enterprises to scale LLM deployments using diverse and potentially more cost-effective hardware, avoiding vendor lock-in and supporting hybrid cloud strategies.

    • Continuous performance optimization: Ongoing efforts, such as those on Google TPUs, demonstrate significant performance improvements for models like Llama 3, achieved through optimizations to the ragged attention kernel, KV cache writing, compilation fixes, and communication within the TPU pod, enhancing the scalability and efficiency of LLM serving on specific hardware.

    • Centralized API management for scalability and governance: The 3scale API Gateway manages scalability for the LLM services by enabling administrators to set rate limits and quotas, preventing cost overruns and service overload, thus ensuring controlled and predictable scaling of LLM consumption across different teams or projects.

    • Automated deployment of new models and products: The MaaS approach streamlines the deployment of new models by automating 3scale configuration via its operator, speeding innovation and enabling enterprises to quickly introduce new LLM capabilities to their applications and scale their AI offerings.

    By incorporating vLLM's advanced optimizations and leveraging the robust capabilities of OpenShift AI and 3scale API Gateway, the MaaS platform ensures that LLMs are not only accessible but also served efficiently, cost-effectively, and scalably to meet enterprise demands.

    Wrap up

    This brings us to the end of our series. The Models-as-a-Service (MaaS) platform securely deploys AI models using Red Hat OpenShift AI and the 3scale API Gateway, emphasizing security through platform protections, AI governance, and API authentication. It scales using vLLM for efficient LLM serving, supported by single sign-on for identity management. This infrastructure ensures secure, scalable, and cost-effective deployment, protecting data and adhering to regulations, thus enabling an enterprise-wide AI approach.

    If you haven't already, check out the other articles in this series:

    • Part 1: Discover the 6 benefits of Models-as-a-Service for enterprises, an introduction to MaaS for enterprises.
    • Part 2: Explore broad architectural details and learn why enterprises need MaaS.
    • Part 3: Learn about how to implement MaaS in an enterprise and its various components.
