Question 1

What is Axiom LLM Gateway?

Accepted Answer

Axiom LLM Gateway is a Kubernetes-native inference gateway that provides governance, policy guardrails, and enterprise-grade control over your AI infrastructure. It sits between your applications and 18+ AI providers, offering a single OpenAI-compatible API endpoint with full control over routing, failover, traffic shaping, rate limiting, quotas, credential management, and observability across your entire AI fleet.

Question 2

Do I need to change my application code to use the LLM Gateway?

Accepted Answer

No. The LLM Gateway exposes an OpenAI-compatible API, so you simply point your existing OpenAI client library at the gateway URL. No proprietary SDKs, no code changes, and no lock-in to the gateway itself. All governance policies, guardrails, and traffic shaping are enforced at the gateway layer — invisible to your application code.

Question 3

Which AI providers does the gateway support?

Accepted Answer

The gateway supports 18+ providers including OpenAI, Anthropic, Google Gemini, Azure OpenAI, AWS Bedrock, Google Vertex AI, Mistral AI, Cohere, Groq, Ollama (self-hosted), OpenRouter, Perplexity, Cerebras, HuggingFace, ElevenLabs, Nebius, xAI (Grok), and Parasail. Streaming (SSE) is supported for all chat completion providers. Governance policies and rate limiting apply uniformly across all providers.

Question 4

How does automatic failover work?

Accepted Answer

You define fallback chains per credential with priority-based ordering. When a primary provider fails due to rate limits, API errors, or outages, the gateway automatically retries with the next fallback in priority order — including cross-provider fallback such as OpenAI primary with Anthropic fallback. All failover events are logged in governance audit trails.

Question 5

What governance and enterprise features are available?

Accepted Answer

Enterprise-grade governance features include policy-based access controls, SSO with SAML 2.0 and OIDC, rate limiting with RPM and TPM quotas, traffic shaping for load distribution, budget controls with automated alerts, governance dashboards with audit trails, and a FinOps dashboard with cost breakdown by provider, credential, and model.

Question 6

How does traffic shaping and rate limiting work?

Accepted Answer

The gateway provides granular traffic shaping through weighted load balancing across credentials, combined with rate limiting guardrails that enforce requests per minute (RPM) and tokens per minute (TPM) quotas. You can set quotas per user, per team, or per credential to prevent runaway usage and enforce organizational policies.

Question 7

How is the LLM Gateway deployed?

Accepted Answer

The LLM Gateway is Kubernetes-native and deploys via Helm chart. It supports horizontal pod autoscaling (HPA), ArgoCD GitOps CI/CD pipelines, and multi-cluster fleet management with consistent governance policies across clusters. Storage is flexible: SQLite for development and edge deployments, PostgreSQL with Redis for production scale with horizontal pod scaling.

Question 8

How does governance and observability work in the LLM Gateway?

Accepted Answer

The gateway instruments itself at every layer for full governance visibility. It exposes a native Prometheus metrics endpoint with bearer token authentication, provides governance dashboards showing who uses which models at what cost, decomposes latency into gateway overhead vs. provider response time, provides structured audit logs in logfmt format, and includes a real-time dashboard showing P50, P95, and P99 latency percentiles along with request rate, token usage, cost rate, and error rate.

Question 9

Does the LLM Gateway support self-hosted models?

Accepted Answer

Yes. Through Ollama integration, the gateway routes to self-hosted models using the same unified API. This means you can mix cloud providers and on-premises models within a single routing configuration, applying the same governance policies, guardrails, and enterprise-grade controls regardless of where your AI inference runs.

Provider	Requests	Tokens (prompt · cache write · cache read · completion)	Avg latency	Cost
anthropic	4.2K	2.5M410.7K · 10.4M · 1,247.0M · 2.1M	8.97s	$743.25
openai	998	119.5M119.3M · 0 · 115.8M · 140.6K	6.08s	$119.49

One endpoint. 18+ providers.
Full control.

LLM Gateway

Scaling AI is hard

Vendor Lock-In

No Governance

Credential Sprawl

Silent Failures

Cost Surprises

Multi-Cluster Sprawl

The difference LLM Gateway makes

Before Axiomstudio

Drop-in replacement

18+ providers supported

Everything you need to govern AI at scale

Unified API Across 18+ Providers

See every token, every dollar

Usage by provider

Kubernetes-native from the ground up

Kubernetes Native GitOps

Flexible Storage

Cache-First Inference

Prometheus Metrics

Embedded Management UI

Multi-Tenant Isolation

Built for enterprise scale

SSO & Enterprise-Grade Control

Semantic Caching

Rate Limiting, Quotas & Traffic Shaping

FinOps & Governance Dashboard

Contact Us

One endpoint. 18+ providers.Full control.