AI Inference Cost Attribution: What AWS, Azure, GCP, OpenAI, and Anthropic Actually Give You

Every major cloud provider offers AI cost attribution — AWS Inference Profiles, Azure deployments, GCP labels — but each is incomplete in a different place. Here is what actually works.

Nishant Thorat

Founder

AI inference cost attribution is the practice of connecting AI model invocation costs to the teams, products, and features generating them. Every major cloud provider offers a mechanism for this — AWS Bedrock Inference Profiles, Azure OpenAI deployments, GCP Vertex AI labels — but each is incomplete in a different place, and none covers the full picture out of the box. This guide explains exactly what each provider gives you, where each falls short, and what to do about it based on your spend level.

---

AI inference costs are growing fast, attributed poorly, and optimized by accident more than by design.

The same teams that instrument every API endpoint, track every database query, and set up alerts for every p99 latency spike — those same teams have no idea which of their AI calls cost the most, or why. Turns out observability stopped at the LLM boundary.

And it's not because the tools don't exist. AWS Bedrock has Inference Profiles that can't be created from the console. Azure tracks costs at the deployment level but not per user or feature. GCP's label-on-request approach is elegant — unless you're using a non-Google model, in which case it silently breaks. Every provider has something. Every provider's something is incomplete in a different place, and documented in a way that makes it look simpler than it is.

What follows is the real version.

---

Why is AI inference so much harder to cost-attribute than regular cloud spend?

If you've spent any time managing cloud costs, you've built mental models around predictable resources: compute instances that run 24/7, storage buckets that grow steadily, databases with consistent load. You can forecast a month out with reasonable confidence.

AI inference laughs at all of that.

Think of traditional cloud costs like a monthly utility bill for a building — predictable, roughly tied to occupancy, easy to attribute to the floors using the most electricity. AI inference is more like a taxi dispatch service running on surge pricing, where every ride (inference call) costs a different amount depending on the passenger count (input tokens), the destination (output tokens), the time of day (peak demand routing), and whether the driver is a Tesla or a minivan (which model you chose). Now multiply that across five different taxi companies — AWS, Azure, GCP, OpenAI, Anthropic — each using different currencies, different meter formats, and different levels of receipt detail.

That's the environment FinOps teams are trying to make sense of right now. According to the FinOps Foundation's State of FinOps 2026 survey, 98% of practitioners are now managing AI spend — up from just 31% two years ago. The Foundation has formally recognized AI as a distinct scope category alongside Public Cloud, SaaS, and Data Centers, publishing dedicated working group papers, KPIs, and a FinOps for AI certification path. AI is no longer a side project that accounting can ignore.

The billing complexity is real. Token pricing varies by model, direction (input vs. output), caching state, context window size, and modality. 71% of companies reported AI/ML cost overruns in 2025, and 48% identified generative AI as their least predictable cloud spending category. Attribution — the ability to say "team X spent Y on model Z for use case W" — is the foundation every optimization conversation depends on. Without it, you're optimizing blindly.
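To make the pricing dimensions concrete, here is a minimal sketch of how a single call's cost falls out of direction and caching state. The `ModelPrice` type, the `request_cost` helper, and every price in it are illustrative assumptions, not any provider's rate card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical USD prices per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_per_mtok: float


def request_cost(price: ModelPrice, input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Cost of one call; cached input tokens are billed at the cached rate."""
    uncached = input_tokens - cached_input_tokens
    if uncached < 0:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    return (
        uncached * price.input_per_mtok
        + cached_input_tokens * price.cached_input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000


# Placeholder prices for illustration only.
EXAMPLE_PRICE = ModelPrice(input_per_mtok=3.0, output_per_mtok=15.0,
                           cached_input_per_mtok=0.3)

# Same prompt and answer size, with and without a warm cache.
cold = request_cost(EXAMPLE_PRICE, input_tokens=8_000, output_tokens=500)
warm = request_cost(EXAMPLE_PRICE, input_tokens=8_000, output_tokens=500,
                    cached_input_tokens=6_000)
print(f"cold=${cold:.4f} warm=${warm:.4f}")
```

Two calls that look identical to the application can differ by a factor of two on the bill, which is why attribution needs token counts by type and not just request counts.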

---

How does the FinOps lifecycle apply to AI inference cost management?

The FinOps Foundation's framework defines three iterative phases: Inform, Optimize, and Operate. For AI workloads, each phase carries distinct challenges that don't exist in traditional cloud cost management.

In the Inform phase, you're trying to answer: who is spending, on which models, at what volume, and what's the trend? This is where cost attribution mechanisms — tags, profiles, labels, projects — do their work. Without them, you can see the total AI bill but can't decompose it. New cost-generating personas complicate this further: product managers using no-code AI tools and third-party SaaS platforms making AI calls on your behalf create spend that never flows through engineering's cost attribution infrastructure.

In the Optimize phase, you're using visibility to make decisions. The FinOps Foundation estimates that **inference optimization addresses 80–90% of total GenAI spend**. The cost differential between model tiers is enormous — premium models can cost 20–100x more per token than economy models in the same family. Routing even a fraction of simpler queries to lighter models has an outsized impact on the total bill. The Foundation's working group on optimizing GenAI usage identifies model selection, semantic caching, batching, and prompt engineering as the four primary levers, in roughly that order of impact.
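A rough way to see why routing has an outsized effect is to compute the blended bill for a given share of traffic moved to a cheaper model. The sketch below uses a hypothetical `blended_cost` helper and made-up per-request costs with a 20x gap, the low end of the range quoted above.

```python
def blended_cost(monthly_requests: int, premium_cost: float,
                 economy_cost: float, economy_share: float) -> float:
    """Monthly bill when economy_share of requests go to the cheaper model."""
    if not 0.0 <= economy_share <= 1.0:
        raise ValueError("economy_share must be between 0 and 1")
    economy_requests = monthly_requests * economy_share
    premium_requests = monthly_requests - economy_requests
    return premium_requests * premium_cost + economy_requests * economy_cost


# Illustrative numbers: 1M requests, $0.02 premium vs $0.001 economy per request.
baseline = blended_cost(1_000_000, 0.02, 0.001, economy_share=0.0)
routed = blended_cost(1_000_000, 0.02, 0.001, economy_share=0.3)
savings_pct = 100 * (baseline - routed) / baseline
print(f"baseline=${baseline:,.0f} routed=${routed:,.0f} savings={savings_pct:.1f}%")
```

Under these assumptions, moving 30% of traffic to the economy model removes roughly 28.5% of the bill, because the economy calls are close to free relative to the premium ones.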

In the Operate phase, you're building governance: enforcement of tagging policies through IaC, budget alerts with automated escalation, regular showback reports, and an AI Investment Council that reviews AI spend against business outcomes. The Foundation recommends weekly forecasting cadences for AI costs due to high volatility — compared to the monthly cycles appropriate for traditional cloud.

Three FinOps personas need different things from AI cost data. Engineers need cost-per-inference displayed alongside technical metrics — they need to know whether a new prompt template costs 3x more than the old one, not just that AI costs went up this month. FinOps practitioners need centralized showback and chargeback reports, anomaly detection for token consumption spikes, and cross-functional coordination. Finance and leadership need unit economics: cost per inference, AI spend as a percentage of revenue, and ROI measured in business outcomes rather than token counts.

With that framework in mind, here's what each provider actually gives you.

---

What do AWS Bedrock Inference Profiles actually give you for cost attribution?

AWS is the only major cloud provider that built a dedicated abstraction layer specifically for AI cost attribution. That makes it the current ceiling — and also a good illustration of how much gap remains even with purpose-built tooling.

System vs. Application Inference Profiles: The distinction that matters

Amazon Bedrock Inference Profiles serve as a model invocation resource — an abstraction that sits between your API calls and the underlying foundation model. Think of them like a named billing account for each AI use case: instead of calling Claude 3 Sonnet directly, you call your claims-processing-profile, which points to Claude 3 Sonnet, carries your cost attribution tags, and routes traffic intelligently across regions.

There are two types, and confusing them is a common and costly mistake.

System-defined inference profiles are pre-created by AWS and handle cross-region traffic routing. When you invoke a model using a system profile ID like us.anthropic.claude-3-sonnet-20240229-v1:0, AWS can route your request to any US region based on available capacity — improving throughput and reducing throttling. The critical limitation: system-defined profiles cannot be tagged. Their ARN deliberately omits the account ID, and AWS Cost Explorer cannot filter by tags that don't exist. For cost attribution purposes, system profiles are invisible.

Application inference profiles are what you want. Created by you via API or CLI (never from the console), they support custom cost allocation tags. Tags like team=data-platform, app=customer-support-bot, and cost-center=sales-engineering flow into AWS Cost and Usage Reports (CUR) once activated in the Billing console.

How to set up Application Inference Profiles for cost attribution

The workflow has several steps that need to happen in the right order:

Step 1: Create the profile via CLI or API (the console doesn't support this):

# Tags are passed as a list of key/value pairs
aws bedrock create-inference-profile \
  --inference-profile-name "claims-processing-claude-sonnet" \
  --description "Claims team - Claude Sonnet for document extraction" \
  --model-source copyFrom="arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0" \
  --tags key=team,value=claims key=app,value=doc-extraction key=env,value=production key=cost-center,value=insurance-ops

Step 2: Activate tags in AWS Billing. Tags must be explicitly activated as cost allocation tags in the Billing console. This step is easy to miss, and activation can take up to 24 hours to propagate, so your first day of data after creation will appear untagged.

Step 3: Route all invocations through the profile ARN:

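import json
import boto3

# Runtime client in the region where the profile was created.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# document_text holds the text to process and is defined elsewhere in your application.
# Pass the profile ARN returned by create-inference-profile as modelId.
response = bedrock_runtime.invoke_model(
    modelId="arn:aws:bedrock:us-east-1:123456789012:application-inference-profile/claims-processing-claude-sonnet",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": document_text}],
        "max_tokens": 1024
    })
)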

For per-request metadata beyond daily CUR aggregates, the `Converse` API's `requestMetadata` parameter attaches key-value pairs that appear in CloudWatch logs:

response = bedrock_runtime.converse(
    modelId="arn:aws:bedrock:...:application-inference-profile/...",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    requestMetadata={
        "requestType": "claim-extraction",
        "claimId": "CLM-2024-789",
        "userId": "agent-42"
    }
)

Enforcement works too: an IAM policy that uses the `bedrock:InferenceProfileArn` condition key can stop engineers from invoking models directly and bypassing attribution.
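One way to express that enforcement is sketched below as a Python helper that builds the policy document. The `profile_only_policy` function, the `Sid` values, and the example ARNs are assumptions for illustration. The policy relies on IAM's default deny, so it only enforces anything if no other attached policy grants broader Bedrock access. Validate it against the current AWS documentation and the IAM policy simulator before relying on it.

```python
import json


def profile_only_policy(profile_arn: str, model_arns: list[str]) -> dict:
    """Allow model invocation through one inference profile and nothing else."""
    invoke_actions = [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Callers may target the tagged profile itself.
                "Sid": "InvokeViaProfile",
                "Effect": "Allow",
                "Action": invoke_actions,
                "Resource": [profile_arn],
            },
            {
                # The underlying models are reachable only via that profile.
                "Sid": "ModelsOnlyThroughProfile",
                "Effect": "Allow",
                "Action": invoke_actions,
                "Resource": model_arns,
                "Condition": {
                    "StringLike": {"bedrock:InferenceProfileArn": profile_arn}
                },
            },
        ],
    }


# Example ARNs are placeholders; substitute the ones from your account.
policy = profile_only_policy(
    profile_arn=("arn:aws:bedrock:us-east-1:123456789012:"
                 "application-inference-profile/EXAMPLE_PROFILE_ID"),
    model_arns=["arn:aws:bedrock:us-east-1::foundation-model/"
                "anthropic.claude-3-sonnet-20240229-v1:0"],
)
print(json.dumps(policy, indent=2))
```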

The real limitations of AWS Bedrock Inference Profiles

Application Inference Profiles are genuinely useful — the most purpose-built AI cost attribution mechanism any cloud provider offers. But the gaps are significant enough to plan around.

Bedrock Agents don't inherit profile tags. If you're using Bedrock Agents for multi-step reasoning workflows, the token costs do not inherit tags from an attached inference profile. This is a confirmed known issue on AWS re:Post. If agents represent meaningful spend — and for complex use cases they often do — this is a significant attribution hole with no clean workaround today.

CUR gives daily aggregated costs, not per-invocation detail. CUR tells you "the claims team spent $847 on Claude Sonnet in the last 24 hours." It doesn't tell you which specific document extractions cost the most. Per-invocation granularity requires CloudWatch, which requires custom dashboards.
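If you do export per-invocation records, the roll-up itself is simple. The sketch below assumes each record has already been parsed into a dict with hypothetical `inputTokens`, `outputTokens`, and `requestMetadata` fields. The real log schema may differ, and the prices are placeholders.

```python
from collections import defaultdict


def cost_by_metadata(records: list[dict], key: str,
                     input_per_mtok: float, output_per_mtok: float) -> dict[str, float]:
    """Sum estimated cost per value of one request-metadata key."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        # Calls made without metadata land in an explicit "untagged" bucket.
        group = rec.get("requestMetadata", {}).get(key, "untagged")
        cost = (rec["inputTokens"] * input_per_mtok
                + rec["outputTokens"] * output_per_mtok) / 1_000_000
        totals[group] += cost
    return dict(totals)


# Hypothetical parsed records; field names are assumptions, not a documented schema.
records = [
    {"requestMetadata": {"requestType": "claim-extraction"},
     "inputTokens": 10_000, "outputTokens": 1_000},
    {"requestMetadata": {"requestType": "claim-extraction"},
     "inputTokens": 20_000, "outputTokens": 2_000},
    {"inputTokens": 1_000, "outputTokens": 100},
]
totals = cost_by_metadata(records, "requestType",
                          input_per_mtok=3.0, output_per_mtok=15.0)
print(totals)
```

Keeping an explicit "untagged" bucket is a deliberate choice here: its size tells you how much spend is escaping attribution.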

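IaC support has lagged. When Application Inference Profiles launched, they were missing from the Terraform AWS provider, and CloudFormation required custom resources. Check whether your provider version now ships a native resource before building workarounds. For IaC-first teams on older tooling, this creates a provisioning bottleneck.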

Console creation isn't supported. If your provisioning workflow relies on console access for a subset of engineers, the API-only requirement creates friction.

The honest summary: for teams with clean application architecture and no Bedrock Agents, AWS Inference Profiles are excellent. For teams using agents, multi-region architectures, or IaC-first workflows, the gaps are real enough to require supplemental tooling.

---

How does Azure OpenAI handle inference cost attribution?

Azure takes a fundamentally different approach — leaning on its existing resource hierarchy and billing infrastructure rather than building a new abstraction. It's more familiar for Azure-native teams, but has its own set of limitations.

Resource hierarchy and deployment-level tracking

Azure OpenAI resources register as Microsoft.CognitiveServices/accounts. In Cost Management, they appear under "Cognitive Services" with a service tier of "Azure Open AI" — a naming quirk that catches many teams off guard. Filtering by "Azure OpenAI" directly yields zero results.

Each model deployment generates separate billing meters for input and output tokens. For example, three deployments (gpt4o-customer-support, gpt4o-mini-classification, o3-analysis) each produce their own meter rows in Cost Analysis, giving deployment-level cost visibility out of the box.

Tags can be applied to Azure OpenAI resources (up to 50 tag name-value pairs per resource). Azure Monitor provides rich per-deployment metrics through the `ModelDeploymentName` dimension, including ProcessedPromptTokens, GeneratedTokens, and AzureOpenAIRequests. However, these are operational metrics — they don't flow directly into billing data, requiring custom pipelines to correlate them with Cost Management exports.

The per-request attribution gap and the APIM gateway pattern

Azure's fundamental ceiling: you cannot attribute costs to individual users or applications within a single deployment natively. If five microservices all call the same deployment, billing data tells you total tokens consumed — not which service consumed how much.

Microsoft's recommended workaround is the Azure API Management (APIM) gateway pattern. APIM sits in front of your Azure OpenAI deployments and applies the `azure-openai-emit-token-metric` policy, capturing per-application and per-user token metrics with custom dimensions. Effective, but APIM adds its own cost and operational overhead, and its throughput limits can become bottlenecks at scale.

Azure OpenAI does not support hard spending limits. Budget alerts can trigger automation via action groups, but cutting off a team's AI spend when they hit their budget requires custom development — not a configuration toggle.

PTU cost allocation: A unique challenge

Azure's Provisioned Throughput Units (PTUs) change the attribution problem significantly. PTUs are capacity reservations priced per unit per hour — you pay for reserved capacity regardless of utilization, similar to Reserved Instances for compute. PTU reservations are not tied to specific deployments and are not interchangeable across Global, Data Zone, and Regional deployment types.

This creates an attribution puzzle: you've pre-paid for capacity, so per-request marginal cost is effectively zero until you exceed reserved throughput. Allocating fixed PTU costs to teams based on consumption requires custom logic built on Azure Monitor token consumption data — there's no native mechanism.
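The simplest version of that custom logic is a proportional split of the fixed PTU bill by each team's token consumption. The sketch below assumes you have already pulled per-team token totals from Azure Monitor. The `allocate_ptu_cost` helper and all numbers are illustrative.

```python
def allocate_ptu_cost(monthly_ptu_cost: float,
                      tokens_by_team: dict[str, int]) -> dict[str, float]:
    """Split a fixed PTU bill across teams in proportion to tokens consumed."""
    total_tokens = sum(tokens_by_team.values())
    if total_tokens == 0:
        # Nothing was consumed, so the reservation stays unallocated.
        return {team: 0.0 for team in tokens_by_team}
    return {
        team: monthly_ptu_cost * tokens / total_tokens
        for team, tokens in tokens_by_team.items()
    }


# Illustrative month: $30,000 of reserved capacity, tokens in millions.
allocation = allocate_ptu_cost(
    30_000.0, {"claims": 600, "support": 300, "search": 100}
)
print(allocation)
```

A proportional split charges teams for idle capacity as well as used capacity. Some organizations prefer to bill idle capacity to a platform cost center instead, which is a policy decision rather than a technical one.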

Azure AI Foundry's Model Router feature routes requests to optimal models based on cost, quality, or balanced mode, demonstrating 4.5–14.2% cost savings by directing simpler prompts to cheaper models. But the routing decision doesn't surface in billing data — attribution requires inspecting response.model in application code.

---

What does GCP Vertex AI offer for AI inference cost attribution?

Google's approach is architecturally the most elegant — and most frustratingly incomplete for multi-model deployments.

Labels on API requests (Google models only)

For Gemini and other Google-native models, Vertex AI supports **custom metadata labels directly on generateContent API requests**. Labels propagate to BigQuery billing exports and can be queried directly:

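from vertexai.generative_models import GenerationConfig, GenerativeModel

# The model name is an example; this assumes vertexai.init() has already run
# and that prompt is defined elsewhere in your application.
model = GenerativeModel("gemini-1.5-pro")

response = model.generate_content(
    contents=prompt,
    generation_config=GenerationConfig(temperature=0.2),
    labels={
        "team": "research",
        "component": "literature-review",
        "environment": "production"
    }
)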

Querying attribution data in BigQuery billing export:

SELECT
  labels.key,
  labels.value,
  SUM(cost) as total_cost
FROM `project.dataset.gcp_billing_export_resource_v1_*`,
UNNEST(labels) as labels
WHERE service.description = 'Vertex AI'
  AND DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY total_cost DESC

Vertex AI Pipelines automatically attach a `vertex-ai-pipelines-run-billing-id` label for pipeline cost tracking, though this label doesn't propagate to Cloud Storage or ML Metadata resources used by the pipeline — another gap to plug manually.

The third-party model problem

Here's what catches teams off guard: adding labels to API requests for third-party or partner models on Vertex AI — including Claude, Llama, and Mistral — results in an error. Label support is only available for Google-native models. This is an open community request with no timeline for resolution as of early 2026.

For multi-model deployments mixing Gemini with Claude on Vertex AI, you can label Gemini calls but cannot apply the same attribution mechanism to Claude. Project-based isolation (separate GCP projects per team or use case) becomes the fallback.

Cost levers worth knowing

Gemini pricing includes several levers with meaningful attribution implications:

  • Batch inference at a 50% discount — significant for non-latency-sensitive workloads, but costs appear under a different SKU
  • Context caching reads at roughly 10% of base input price — savings don't automatically appear as tagged line items, making it hard to attribute caching benefits to the teams using them
  • Long context premium: inputs exceeding 200K tokens on Pro models incur a 2x price premium — important for RAG pipelines with large retrieved chunks, and easy to miss in token-level attribution
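The multipliers in the list above can be turned into a quick estimator for input cost. The sketch below takes the 50%, 10%, and 2x figures from this section. The base price is a placeholder, and the way the levers combine (for example, whether the batch discount stacks with the long context premium) is an assumption to check against the current price list.

```python
LONG_CONTEXT_THRESHOLD = 200_000  # tokens, per the long context bullet above
BATCH_MULTIPLIER = 0.5            # 50% batch discount
CACHE_READ_MULTIPLIER = 0.10      # cached reads at roughly 10% of input price
LONG_CONTEXT_MULTIPLIER = 2.0     # 2x premium above the threshold


def input_cost(tokens: int, base_per_mtok: float, *,
               batch: bool = False, cached_tokens: int = 0) -> float:
    """Estimated input cost in USD; lever stacking is an assumption."""
    if cached_tokens > tokens:
        raise ValueError("cached_tokens cannot exceed tokens")
    rate = base_per_mtok
    if tokens > LONG_CONTEXT_THRESHOLD:
        rate *= LONG_CONTEXT_MULTIPLIER
    if batch:
        rate *= BATCH_MULTIPLIER
    uncached = tokens - cached_tokens
    return (uncached * rate
            + cached_tokens * rate * CACHE_READ_MULTIPLIER) / 1_000_000


BASE = 1.25  # hypothetical USD per million input tokens
standard = input_cost(100_000, BASE)
batched = input_cost(100_000, BASE, batch=True)
cached = input_cost(100_000, BASE, cached_tokens=80_000)
long_ctx = input_cost(250_000, BASE)
print(standard, batched, cached, long_ctx)
```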

---

How do OpenAI, Anthropic, and Cohere handle cost attribution?

Direct API providers operate differently from cloud platforms, with more limited native attribution tooling and faster-updating cost data.

OpenAI

OpenAI's primary cost segregation unit is the Project. Each project has its own API keys and usage data queryable via the Usage API at /v1/organization/usage/completions, grouped by project, user, API key, and model across time buckets from one minute to one day. A separate Costs API at /v1/organization/costs returns actual dollar amounts.

The limitation: there's no arbitrary key-value tagging. Your attribution dimensions are fixed: project name, user ID (a string you can set per request via the user parameter), and API key ID. If you need to attribute costs to a feature flag variant, customer tier, or A/B test, you need to track it externally.
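Tracking it externally can be as small as a ledger keyed by your own dimensions. The sketch below assumes you pass in the `usage` object from each chat completion response (its `prompt_tokens` and `completion_tokens` fields). The `AttributionLedger` class and the dimension names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AttributionLedger:
    """In-memory token ledger keyed by arbitrary attribution dimensions."""
    tokens: Counter = field(default_factory=Counter)

    def record(self, usage: dict, **dimensions: str) -> None:
        """Add one response's token usage under the given dimensions."""
        key = tuple(sorted(dimensions.items()))
        self.tokens[key] += usage["prompt_tokens"] + usage["completion_tokens"]

    def total_for(self, **dimensions: str) -> int:
        """Total tokens across all entries matching every given dimension."""
        wanted = set(dimensions.items())
        return sum(n for key, n in self.tokens.items() if wanted <= set(key))


ledger = AttributionLedger()
ledger.record({"prompt_tokens": 1200, "completion_tokens": 300},
              project="support-bot", feature="summarize", variant="B")
ledger.record({"prompt_tokens": 800, "completion_tokens": 200},
              project="support-bot", feature="summarize", variant="A")
ledger.record({"prompt_tokens": 500, "completion_tokens": 100},
              project="support-bot", feature="classify", variant="A")
print(ledger.total_for(feature="summarize"), ledger.total_for(variant="A"))
```

In production the counter would be replaced by a metrics backend or a warehouse table, but the shape of the record (usage plus your own dimensions) stays the same.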

OpenAI's Scale Tier offers committed token capacity for enterprise customers, changing the attribution problem in the same way Azure PTUs do — fixed capacity costs require custom allocation logic based on consumption ratios.

Anthropic

Anthropic organizes API access through **Workspaces**, each with its own monthly spend limit and usage tracking. The Admin API provides a Usage endpoint (token breakdowns by model, workspace, API key, service tier, and inference_geo) and a Cost endpoint with dollar amounts updated within approximately five minutes of usage — notably faster than cloud providers' 24-hour billing delays.

The inference_geo dimension is a practical differentiator: it tells you which geographic region processed each request, useful for data residency compliance verification beyond just cost.

Priority Tier costs (higher availability and throughput tier) are excluded from the Cost endpoint and must be tracked via the Usage endpoint separately — a gap in attribution completeness.

Cohere

Cohere has the least mature native attribution tooling. Usage is tracked at the organization level with per-request token counts returned in API responses, but there is no public Usage or Cost API comparable to OpenAI's or Anthropic's. Dashboard-level spending limits exist but are coarse.

The more accurate framing: most enterprise Cohere customers access it through AWS Bedrock or Azure AI, where the cloud provider's attribution mechanisms apply. Cohere has effectively offloaded the attribution problem to cloud platforms — a defensible architectural decision that means comparing Cohere's native billing against AWS CUR is apples to oranges.

Cross-provider comparison

| Capability | AWS Bedrock | Azure OpenAI | GCP Vertex AI | OpenAI Direct | Anthropic Direct |
|---|---|---|---|---|---|
| Attribution unit | Application Inference Profile | Resource + Deployment | Project + API Labels | Project | Workspace |
| Custom tags | Yes (app profiles only) | Yes (resources, 50 max) | Yes (Google models only) | No | No |
| Per-request metadata | requestMetadata → CloudWatch only | Custom logging or APIM | Labels on Google model calls | user param only | Workspace + API key |
| Billing data location | CUR + Cost Explorer | Cost Management | BigQuery billing export | Usage/Costs API | Admin API |
| Hard spend limits | Via Budgets + Lambda automation | Not native | Not native | Org-level budget | Per-workspace limits |
| Data freshness | ~24 hours | 24–48 hours | ~24 hours | Near real-time | ~5 minutes |
| Agent/pipeline attribution | Broken (Agents don't inherit tags) | Partial (APIM required) | Partial (labels don't propagate) | N/A | N/A |
| Third-party model labels | N/A | N/A | Not supported | N/A | N/A |

---

Does an LLM gateway actually solve the attribution problem?

For multi-provider environments or per-request attribution requirements that native tools can't meet, the LLM proxy/gateway pattern has become the practical standard. Tools like LiteLLM, Portkey, and Kong AI Gateway sit between your application and the AI providers, stamping every request with rich metadata and providing unified cost dashboards across providers.

A LiteLLM configuration handling multi-provider routing with metadata:

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_region_name: us-east-1
  - model_name: gpt4o-mini
    litellm_params:
      model: azure/gpt-4o-mini
      api_base: https://your-resource.openai.azure.com
      api_key: os.environ/AZURE_API_KEY

router_settings:
  routing_strategy: cost-based-routing

litellm_settings:
  max_budget: 10000      # USD cap for the proxy
  budget_duration: 30d   # reset window, expressed as a duration string

With per-request attribution metadata:

response = litellm.completion(
    model="claude-sonnet",
    messages=messages,
    metadata={
        "team": "claims",
        "feature": "document-extraction",
        "customer_tier": "enterprise",
        "experiment": "v2-prompt"
    }
)

But the gateway pattern isn't a free lunch. It introduces 10–30ms of latency per request, becomes a single point of failure for all AI calls, requires a security review as a privileged component touching API keys, and adds operational complexity that not every team can absorb. For a startup spending $3k/month on OpenAI, this is almost certainly over-engineering. For a 50-person engineering org spending $150k/month across three providers, the attribution value likely justifies the overhead. Know your situation before reaching for this tool.

---

What are the hidden attribution gaps nobody documents?

These don't appear in any provider's documentation, but they show up consistently in real deployments.

Shadow AI is the unattributed iceberg. Every attribution mechanism discussed here assumes you know where your AI calls are coming from. In practice, engineers use personal API keys, product managers spin up ChatGPT Team accounts on corporate cards, and third-party SaaS tools make AI calls on your behalf. A comprehensive attribution strategy requires auditing for AI spend outside official channels — not just better tagging on the spend you already know about.

Multi-turn conversations silently multiply costs. In a standard chat application, every new message resends the entire conversation history as context. A conversation starting at 100 tokens can cost 10x as much per exchange after 50 turns. This "conversation inflation" doesn't appear in per-request token counts in any obvious way — it requires tracking token counts over conversation lifecycle to detect.
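The growth is easy to quantify once you assume the full history is resent on every turn with no truncation or summarization. In the sketch below, the helper names and the 50-token message sizes are illustrative assumptions.

```python
def input_tokens_at_turn(turn: int, user_tokens: int, reply_tokens: int,
                         system_tokens: int = 0) -> int:
    """Input tokens billed on a 1-indexed turn when full history is resent."""
    if turn < 1:
        raise ValueError("turn is 1-indexed")
    history = (turn - 1) * (user_tokens + reply_tokens)
    return system_tokens + history + user_tokens


def cumulative_input_tokens(turns: int, user_tokens: int, reply_tokens: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens billed across the whole conversation."""
    return sum(
        input_tokens_at_turn(t, user_tokens, reply_tokens, system_tokens)
        for t in range(1, turns + 1)
    )


first = input_tokens_at_turn(1, user_tokens=50, reply_tokens=50)
fiftieth = input_tokens_at_turn(50, user_tokens=50, reply_tokens=50)
total = cumulative_input_tokens(50, user_tokens=50, reply_tokens=50)
print(first, fiftieth, total)
```

Per-turn input grows linearly with conversation length, so cumulative input grows quadratically. Tracking tokens per conversation ID, not just per request, is what makes this visible.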

Caching savings don't flow to attribution. AWS Bedrock prompt caching, GCP context caching, and similar features reduce costs for repeated context, but the savings don't automatically appear as tagged line items. You see lower total costs but can't attribute the savings to the teams benefiting from them — making it hard to incentivize good caching behavior.

The 24-hour billing lag makes anomaly detection retroactive. With CUR updated daily and Azure Cost Management updated every 24–48 hours, by the time you see a cost spike in attribution data, it's already happened. Real-time token tracking via CloudWatch (AWS) or Azure Monitor is the only way to catch problems within the same business day.
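A same-day check does not need to be sophisticated. The sketch below flags an hourly token count that sits far above its recent baseline using a simple z-score. The `is_spike` helper, the threshold, and the sample counts are assumptions; the metric feed from CloudWatch or Azure Monitor is left out.

```python
from statistics import mean, pstdev


def is_spike(history: list[float], current: float,
             z_threshold: float = 3.0, min_points: int = 5) -> bool:
    """True when current exceeds the historical mean by z_threshold deviations."""
    if len(history) < min_points:
        return False  # not enough data to define a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return current > mu  # flat baseline: any increase stands out
    return (current - mu) / sigma > z_threshold


# Hypothetical hourly token counts, in thousands.
recent_hours = [100, 110, 95, 105, 90, 100]
print(is_spike(recent_hours, 400), is_spike(recent_hours, 108))
```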

---

How should you approach AI cost attribution based on your current spend level?

The right approach depends significantly on your AI spend level, team size, and operational maturity. The FinOps Foundation's crawl/walk/run maturity model applies directly here — and most organizations overestimate which phase they're in.

Under $20k/month: Crawl

You don't need an LLM gateway. You probably don't need Application Inference Profiles yet. What you need is:

  • Project/workspace isolation per team or product line in each provider
  • Consistent naming conventions on every resource
  • Simple automated cost reports (daily and weekly) aggregated across providers, so anomalies surface without a custom data pipeline
  • Budget alerts at 80% of your monthly AI allocation

The goal at this stage is awareness, not granularity.

$20k–$200k/month: Walk

This is where native attribution tooling starts to matter and where tag standardization becomes critical. Define your tagging taxonomy first, before touching any provider settings:

  • team: which engineering team or business unit
  • product: which product or feature
  • environment: production, staging, development
  • cost-center: finance's allocation code
  • model-tier: premium, standard, economy
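A taxonomy only helps if it is enforced, and a small validator run in CI or at provisioning time is one way to do that. The sketch below encodes the five tags listed above. The `validate_tags` helper and the message wording are illustrative.

```python
REQUIRED_TAGS = ("team", "product", "environment", "cost-center", "model-tier")
ALLOWED_VALUES = {
    "environment": {"production", "staging", "development"},
    "model-tier": {"premium", "standard", "economy"},
}


def validate_tags(tags: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the tags conform."""
    problems = []
    for key in REQUIRED_TAGS:
        if not tags.get(key):
            problems.append(f"missing tag: {key}")
    for key, allowed in ALLOWED_VALUES.items():
        value = tags.get(key)
        if value and value not in allowed:
            problems.append(f"invalid {key}: {value}")
    return problems


problems = validate_tags({
    "team": "claims",
    "product": "doc-extraction",
    "environment": "prod",          # abbreviation, not in the allowed set
    "cost-center": "insurance-ops",
})
print(problems)
```

Rejecting near-misses such as "prod" matters more than it looks: each spelling variant becomes a separate row in every cost report.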

On AWS, create Application Inference Profiles per team × product combination. On Azure, organize resources by team with deployment names encoding the product. On GCP, use API labels for Google models and project isolation for third-party models. On direct providers, use Projects/Workspaces per team.

Multi-Cloud Budgets and Alerts per team prevent runaway spend before it shows up in the monthly bill. Set alerts at 50%, 80%, and 100% of each team's AI budget, and wire them to Slack channels where relevant engineers will see them — not just to a shared finance inbox.
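The threshold logic behind such alerts can be sketched in a few lines. The `crossed_thresholds` helper below is hypothetical, and the notification delivery (Slack or otherwise) is left out.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds: tuple[float, ...] = (0.5, 0.8, 1.0),
                       already_alerted: frozenset = frozenset()) -> list[float]:
    """Thresholds newly crossed by month-to-date spend, lowest first."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in sorted(thresholds)
            if ratio >= t and t not in already_alerted]


# Team is at $8,500 of a $10,000 budget and was already alerted at 50%.
new_alerts = crossed_thresholds(8_500, 10_000, already_alerted=frozenset({0.5}))
print(new_alerts)
```

Passing in the set of thresholds already alerted keeps the check idempotent, so running it hourly does not spam the channel.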

Over $200k/month: Run

At this scale, the LLM gateway pattern is almost certainly worth the operational overhead. Per-request attribution that native tools can't provide, semantic caching to eliminate redundant expensive calls, and active model routing become necessary.

The cost differential between routing appropriately — Claude Haiku for classification, Claude Sonnet for standard tasks, Claude Opus only for complex reasoning — versus defaulting to premium models for everything can represent six-figure monthly savings. Track unit economics: cost per inference, cost per customer interaction, cost per transaction processed. These metrics connect AI spend to business outcomes and make finance conversations productive rather than defensive.
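The unit economics named above reduce to a few divisions once spend and volume are in one place. The sketch below uses a hypothetical `unit_economics` helper and made-up monthly figures.

```python
def unit_economics(ai_spend: float, inferences: int,
                   interactions: int, revenue: float) -> dict[str, float]:
    """Basic AI unit-economics KPIs for one reporting period."""
    if min(inferences, interactions) <= 0 or revenue <= 0:
        raise ValueError("volumes and revenue must be positive")
    return {
        "cost_per_inference": ai_spend / inferences,
        "cost_per_interaction": ai_spend / interactions,
        "ai_spend_pct_of_revenue": 100 * ai_spend / revenue,
    }


# Illustrative month: $150k spend, 30M inferences, 2M customer interactions.
kpis = unit_economics(ai_spend=150_000, inferences=30_000_000,
                      interactions=2_000_000, revenue=5_000_000)
print(kpis)
```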

Monthly cloud cost reports that break down AI spend by team, model, and use case, with trend lines and variance explanations, are what leadership needs to maintain confidence in AI investment. Without them, every unexpected spike becomes a crisis.

---

The honest state of AI cost attribution in 2026

AI cost attribution is still genuinely hard, and most teams are in the crawl phase whether they know it or not.

The 98% of FinOps practitioners managing AI spend says nothing about how well they're managing it. "Managing" might mean "someone noticed the Bedrock line item." The practices described here as "walk" and "run" require deliberate investment, and that investment competes with pressure to ship the next AI feature. The same team that needs to implement Inference Profiles, activate tags, set up monitoring, and build PTU allocation logic is also the team building what the product team wants in production next sprint.

The tooling is improving. AWS added Application Inference Profiles in late 2024. Anthropic added a Usage and Cost API. The FinOps Foundation's FOCUS specification is adding AI-specific fields, and 68% of large cloud spenders are already using or experimenting with FOCUS-formatted data. The gap between what you need and what providers natively offer is narrowing.

But it exists today. Navigating it — knowing which native tools to use, where to supplement with proxy infrastructure, how to standardize tags across providers, how to set up budgets that actually catch anomalies — is the work. The organizations that invest in it now will have a significant advantage as AI spend continues to grow as a fraction of their total cloud bill.

---

Where to start tomorrow morning

Day 1: Audit your current AI spend across every provider. Don't optimize yet. Just enumerate: AWS Bedrock, Azure OpenAI, GCP Vertex AI, OpenAI API, Anthropic API, any others. Which teams are responsible? Can you answer that question? If not, that's your baseline.

Week 1: Define your tagging taxonomy. Get engineering leads and finance in a room for 90 minutes and agree on four or five tags that matter for attribution. This conversation is harder than it sounds because it requires cross-functional agreement — but it's the single most leveraged thing you can do for attribution maturity.

Month 1: Implement the taxonomy on your highest-spend provider first. Don't try to do everything at once. Get AWS Application Inference Profiles in place, or Azure resource tagging standardized, or GCP labels on Gemini calls — whichever represents your biggest AI spend. Do it right there before moving on.

Quarter 1: Set up budget alerts, build your first attribution report, and review it with engineering leads monthly. A conversation that happens monthly with imperfect data is more valuable than a perfect dashboard nobody looks at.

The AI bill is only going to get bigger. The teams that build attribution infrastructure now — before the bill demands it — are the ones that will be able to explain every line item when the question comes.

---

Frequently asked questions about AI inference cost attribution

What is AI inference cost attribution? AI inference cost attribution is the practice of tracking and assigning AI model invocation costs — measured in tokens, PTUs, or API calls — to specific teams, products, features, or business outcomes. Without attribution, you can see your total AI bill but cannot determine who spent what, on which model, or why costs changed.

What are AWS Bedrock Application Inference Profiles and how do they work? Application Inference Profiles are user-created AWS Bedrock resources that wrap a foundation model and support custom cost allocation tags. You create them via CLI or API (not the console), attach tags, activate those tags in AWS Billing, and route all model invocations through the profile ARN. Tags then appear in AWS CUR and Cost Explorer within 24 hours of activation.

Does GCP Vertex AI support cost attribution labels for Claude or Llama? No. GCP Vertex AI's label-on-request feature only works for Google-native models. Adding labels to API calls for third-party models (Claude, Llama, Mistral) on Vertex AI returns an error. Project-level isolation is the recommended workaround.

Why don't Bedrock Agent costs appear in Inference Profile tags? Bedrock Agents do not inherit cost allocation tags from attached application inference profiles. This is a known limitation confirmed on AWS re:Post with no current workaround through native tagging alone.

What's the difference between cost attribution on OpenAI vs. Anthropic? OpenAI attributes costs by Project, user ID, and API key — no arbitrary tagging. Anthropic attributes costs by Workspace, API key, model, service tier, and geographic region. Both providers update billing data significantly faster than cloud providers (near real-time for OpenAI, approximately 5 minutes for Anthropic) but offer fewer attribution dimensions than cloud-native tagging.

When is an LLM gateway worth the operational overhead? Generally when you're spending over $100k/month across multiple AI providers and need per-request attribution that native tools can't provide. Below that threshold, the operational complexity (latency, single point of failure, security surface) typically outweighs the attribution benefits.

What FinOps KPIs should I track for AI costs? The FinOps Foundation recommends: cost per inference, cost per token (input and output separately), cost per business unit of work (per customer case handled, per transaction processed), and AI spend as a percentage of revenue. These connect raw token costs to business outcomes.

---

*CloudYali is a multi-cloud SaaS platform for Cloud Cost Visibility and Management. CloudYali's Custom Cost Reports, Tag Standardization Tracking, and Multi-Cloud Budgets and Alerts are built for exactly the attribution challenges described in this post — across AWS, Azure, and GCP in one place. Learn more at cloudyali.io.*

---

References:

  • FinOps Foundation: FinOps for AI
  • FinOps Foundation: Optimizing GenAI Usage
  • FinOps Foundation: How to Build a Generative AI Cost and Usage Tracker
  • FinOps Foundation: State of FinOps 2026
  • AWS: Set up a model invocation resource using inference profiles
  • AWS: Track, allocate, and manage your generative AI cost and usage with Amazon Bedrock
  • Microsoft: Plan to manage costs for Azure OpenAI
  • GCP: Custom metadata labels for Vertex AI API calls
  • Anthropic: Usage and Cost API
