Amazon SageMaker AI Launches /openai/v1 Path for Real-Time Inference | Neural Catalog

Amazon SageMaker AI has expanded its capabilities with the introduction of OpenAI-compatible API support for real-time inference endpoints, allowing users to invoke models without rewriting code or managing complex authentication. The new system exposes an /openai/v1 path, accepting standard Chat Completions requests and returning responses directly from the container, including streaming data. This means applications built for OpenAI’s SDK, LangChain, or Strands Agents can now seamlessly connect to models hosted on SageMaker AI simply by updating the endpoint URL. Giorgio Piatti, AI/ML Engineer at Caffeine.AI, noted that the bearer token feature enables integration with their existing systems, allowing it to work natively with their gateway, Vercel AI SDK, and standard OpenAI clients. This development also offers a path for running agentic workflows entirely on owned infrastructure, keeping data and processes within a user’s AWS account.

OpenAI-Compatible API Support for SageMaker AI Endpoints

Amazon SageMaker now offers a way for running large language models without requiring developers to rewrite existing code. This functionality centers around exposing an /openai/v1 path on SageMaker AI endpoints, enabling invocation of models without the need for a SigV4 wrapper or custom client implementations. The core of this advancement lies in the authentication mechanism. SageMaker AI utilizes bearer token authentication, generated locally by the SageMaker Python SDK. This process constructs a request to the SageMaker AI service, signs it using existing AWS credentials, and encodes the result into a portable token string, all without any network calls during generation. Giorgio Piatti (AI/ML Engineer, Caffeine.AI) highlighted the immediate practical benefits for companies already invested in OpenAI-compatible tools; Caffeine.AI is already leveraging this feature with its LLM gateway, Bifrost, demonstrating a rapid adoption of the new capability.

The generated token, valid for up to 12 hours, though this can be overridden with the expiry parameter, grants access based on the IAM permissions of the user or role creating it, requiring the sagemaker:CallWithBearerToken action permission. CallWithBearerToken requires a wildcard ( "*" ) for the Resource field and does not support resource-level restrictions. This compatibility extends beyond simple model swapping; it unlocks new possibilities for architecting complex AI workflows. Agentic workflows, built with frameworks like Strands Agents or LangChain, can now be executed entirely on SageMaker AI endpoints, providing organizations with greater control over their data and processing. The system supports multi-model hosting on a single endpoint using inference components, allowing for the deployment of models like Llama, Mistral, and smaller classification models, all accessible through the same OpenAI SDK.

Amazon emphasizes the security of this approach, advising users to scope IAM roles to the minimum required permissions and avoid storing tokens persistently. The company advises users to treat tokens with the same care as credentials, noting that token generation is a local operation with no network overhead, making frequent refresh cycles a viable best practice. The system’s design prioritizes minimizing the risk of token leakage and maximizing the token’s remaining validity, offering a secure and efficient solution for integrating LLMs into existing applications.

Bearer Token Authentication for SageMaker AI Endpoints

The increasing number of large language models (LLMs) has created a complex ecosystem of deployment options, but integrating these models into existing infrastructure often demands significant code adaptation. Previously, connecting to Amazon SageMaker AI endpoints required developers to implement SigV4 signature generation, a process that added overhead and complexity to applications. Now, Amazon SageMaker AI has introduced OpenAI-compatible API support, streamlining integration by allowing invocation of models with a simple change to the endpoint URL, eliminating the need for custom clients or SigV4 wrappers. This shift represents a move towards interoperability, reducing friction for developers already familiar with the OpenAI ecosystem. Central to this advancement is a new authentication mechanism leveraging bearer tokens.

Giorgio Piatti (AI/ML Engineer, Caffeine.AI) explains, “The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint — no custom SigV4 signing — so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients.” The implications extend beyond simplified integration, eliminating the need for separate API clients and routing logic, simplifying application architecture. As a security measure, the CallWithBearerToken action requires a wildcard ( "*" ) for the Resource field and does not support resource-level restrictions.

You are a senior code reviewer. Review Python code for correctness, performance, and PEP 8 style. Give a concise review with specific suggestions.

Multi-Model Hosting with Inference Components

Giorgio Piatti, an AI/ML Engineer at Caffeine.AI, is currently leveraging a new capability within Amazon SageMaker AI to streamline the integration of large language models into his company’s workflows. Rather than building bespoke integrations for each provider, Caffeine.AI is utilizing the platform’s OpenAI-compatible API to connect to SageMaker models through their existing LLM gateway, Bifrost. “We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol. The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint — no custom SigV4 signing — so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients,” says Giorgio Piatti (AI/ML Engineer, Caffeine.AI).

Amazon SageMaker AI now facilitates a multi-model hosting environment, allowing users to deploy and serve diverse models, such as Llama for general tasks, a fine-tuned Mistral for specialized applications, and a smaller model for classification, from a single endpoint using inference components. Each model receives its own dedicated resource allocation, yet remains accessible through the standardized OpenAI SDK interface, eliminating the need for separate API clients or complex routing logic within application code. This consolidation is particularly valuable for organizations managing a portfolio of models tailored to different use cases, reducing operational overhead and simplifying deployment. The system allows for the seamless serving of fine-tuned open-source models without necessitating code changes in existing applications.

If a user has refined an open-source model for a specific task, they can deploy it on SageMaker AI and access it using the same OpenAI-compatible interface their applications already utilize; the sole adjustment required is the endpoint URL. As a security measure, CallWithBearerToken requires a wildcard ( "*" ) for the Resource field and does not support resource-level restrictions.

He focuses on helping customers build and optimize their AI/ML inference experience on Amazon SageMaker AI.

Agentic Workflows on Owned Infrastructure

The ability to run complex, multi-step artificial intelligence processes entirely on privately owned infrastructure is rapidly becoming a priority for organizations seeking greater control and data security. Amazon SageMaker AI’s recent introduction of OpenAI-compatible API support directly addresses this need, enabling a shift away from reliance on third-party hosted large language models (LLMs) for agentic workflows. This isn’t simply about replicating functionality; it’s about establishing a fully contained ecosystem where AI agents operate within a defined and managed environment. Now, these same agents can be deployed to run entirely on SageMaker AI endpoints, leveraging a familiar interface without requiring code modifications. This is achieved through the exposure of an /openai/v1 path, allowing applications to invoke models using existing SDKs and tools.

Giorgio Piatti (AI/ML Engineer, Caffeine.AI) says, “The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint — no custom SigV4 signing — so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients.” Organizations can deploy fine-tuned open-source models without altering existing application code, changing only the endpoint URL to direct requests to their privately hosted instance. These time-limited tokens, valid for up to 12 hours, though this can be overridden with the expiry parameter, utilize existing AWS credentials, removing the need for additional secrets or API keys. This localized approach enhances security and control, allowing organizations to maintain complete ownership of their AI infrastructure and data flows.

We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol.

Deploying and Invoking Single-Model Endpoints

The expectation that deploying large language models (LLMs) necessitates complex integrations and bespoke client software is rapidly dissolving. Amazon SageMaker AI has introduced support for OpenAI-compatible APIs on its real-time inference endpoints, a move that dramatically simplifies model invocation for developers already familiar with established frameworks. Rather than requiring custom SigV4 wrappers or code rewrites, SageMaker AI now allows access to hosted models simply by altering the endpoint URL, a level of interoperability previously uncommon in managed AI services. Giorgio Piatti (AI/ML Engineer, Caffeine.AI) is integrating the feature into its Bifrost LLM gateway. Beyond simplified access, the OpenAI-compatible interface unlocks new possibilities for architectural control. The authentication mechanism relies on IAM permissions; the IAM role or user invoking the endpoint needs permission to both sagemaker:InvokeEndpoint and sagemaker:CallWithBearerToken. The generated bearer token, according to Amazon, carries the same authorization as the underlying AWS credentials, and should be treated with the same level of care.

He focuses on enabling generative AI model development and governance on Amazon SageMaker HyperPod.

Source: https://aws.amazon.com/blogs/machine-learning/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints/