
Your AI API Calls Are Leaking Money. I Built a Fix

January 2026 · 7 min read

Every team that seriously uses AI APIs hits the same wall eventually.

The OpenAI bill arrives and nobody knows which service caused the spike. A provider goes down and everything breaks. Someone on the team is making the same expensive LLM call hundreds of times because there's no shared cache. Rate limits get hit and requests just... fail.

The naive fix is to add a try/catch around your API calls and hope for the best. The real fix is to stop treating AI APIs like simple HTTP endpoints and start treating them like the external infrastructure dependencies they actually are - with all the resilience, observability, and cost management that implies.

That's why I built Relay: a reverse proxy and API gateway for AI services, written in Go. This post is the architecture walkthrough I wish had existed before I started.

The Core Idea: A Transparent Proxy

Relay sits between your application and OpenAI (or Anthropic, or any other provider). Your application thinks it's talking to OpenAI. Relay intercepts every request, does its work - caching, rate limiting, cost tracking, circuit breaking - then forwards to the real provider.

The key design constraint: zero changes required in client code. You just point your SDK at localhost:8080 instead of api.openai.com:

import openai
openai.api_base = "http://localhost:8080/v1"  # That's it

Everything else continues working exactly as before. This constraint shaped every architectural decision.

The Middleware Chain

The heart of Relay is a layered middleware chain. In cmd/main.go, you can read the chain construction directly:

handler = middleware.TransformMiddleware(transformCfg)(handler)    // Layer A
handler = middleware.NewRateLimiter(rdb, cfgStore)(handler)        // Layer B
handler = middleware.CachingMiddleware(rdb)(handler)               // Layer C
handler = middleware.AuthMiddleware(rdb, true)(handler)            // Layer D
handler = middleware.RequestLoggingMiddleware(store, true)(handler) // Layer E
handler = middleware.TokenCostLogger(cfgStore)(handler)            // Layer F
handler = middleware.RequestLogger(handler)                        // Layer G

This is standard Go middleware composition - each layer wraps the next, onion-style. A request enters at Layer G (outermost), passes through each layer in order, hits the proxy, and the response unwinds back through the same layers in reverse.

The order matters enormously. Auth (D) must run before logging (E) so we know who made the request when we log it. Caching (C) must run before the proxy so we can short-circuit before hitting the upstream. Cost tracking (F) must run before caching so tokens are counted on every client request, including cache hits - not only on the misses that actually reach the upstream.

Getting this order wrong produces subtle, hard-to-debug bugs. I got it wrong twice.

Caching: The Non-Obvious Parts

The caching middleware looks simple on the surface: hash the request body, check Redis, return cached response if found, otherwise proxy and cache the result.

hash := sha256.Sum256(bodyBytes)
key := fmt.Sprintf("cache:%s", hex.EncodeToString(hash[:]))

SHA-256 of the request body is a reasonable cache key for LLM calls. Two requests with identical JSON bodies - same model, same messages, same parameters - will always produce the same hash. This catches the repeated identical calls that inflate costs.
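Because the key hashes the raw bytes, matching is strictly byte-for-byte. A small sketch (illustrative, not Relay's exact code) makes the sensitivity concrete - even an extra space in the JSON produces a different key:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey mirrors the hashing scheme: SHA-256 over the raw request body,
// prefixed so cache entries are namespaced in Redis.
func cacheKey(body []byte) string {
	h := sha256.Sum256(body)
	return "cache:" + hex.EncodeToString(h[:])
}

func main() {
	a := []byte(`{"model":"gpt-4","messages":[{"role":"user","content":"What's 2+2?"}]}`)
	b := []byte(`{"model":"gpt-4","messages":[{"role":"user","content":"What's 2+2?"}]}`)
	c := []byte(`{"model": "gpt-4","messages":[{"role":"user","content":"What's 2+2?"}]}`)

	fmt.Println(cacheKey(a) == cacheKey(b)) // true: byte-identical bodies hit the same entry
	fmt.Println(cacheKey(a) == cacheKey(c)) // false: even whitespace breaks the match
}
```

This exact-match behavior is the flip side of the semantic-caching limitation discussed later.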

The non-obvious part is body handling in Go's http.Handler chain.

r.Body is an io.ReadCloser - once you read it, it's gone. If the caching middleware reads the body to compute the hash, the proxy downstream gets an empty body and the upstream API call fails.

The fix is to read the bytes, then replace the body with a new reader over the same bytes:

bodyBytes, _ := io.ReadAll(r.Body)
r.Body = io.NopCloser(bytes.NewBuffer(bodyBytes)) // Refill

Every middleware that reads the body does this. The cost tracking middleware does it. The request logging middleware does it. Miss this in any one place and you get mystifying empty-body errors that only appear in production.

The other non-obvious part: async caching. After a successful upstream response, Relay saves to Redis in a goroutine:

go func(k string, data []byte) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    rdb.Set(ctx, k, data, time.Hour)
}(key, spy.body.Bytes())

This is deliberate. The user doesn't wait for Redis. If Redis is slow or temporarily unavailable, the user gets their response at normal speed and the cache just doesn't get populated for this request. Caching is an optimization, not a requirement - it should never add latency to the happy path.

Rate Limiting: Two Strategies

Rate limiting has an interesting constraint: it needs to work both with and without Redis.

Without Redis (single instance, local development): use Go's golang.org/x/time/rate package, which implements a token bucket in memory.

With Redis (distributed, multiple instances): use go-redis/redis_rate, which implements the same token bucket algorithm but backed by Redis, so all instances share the same limit state.

if rdb == nil {
    // In-memory limiter
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !limiter.Allow() {
                http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }
}

// Distributed Redis limiter
redisLimiter := redis_rate.NewLimiter(rdb.Redis())

The distributed case also sets a Retry-After header when a request is rejected, telling the client exactly how many seconds to wait before retrying. This is the correct HTTP behavior for 429 responses and most well-behaved API clients will honor it automatically.

One subtlety: the rate limit parameters come from cfgStore.Get() on every request, not cached at startup. This is intentional - hot reload (more on that shortly) needs rate limit changes to take effect immediately without restarting.

Circuit Breakers: Fail Fast, Recover Automatically

The circuit breaker pattern is essential for any proxy that forwards to external services. Without it, a slow or failing upstream makes every request wait for a timeout before failing - turning an upstream outage into a full system slowdown.

Relay uses Sony's gobreaker library, configured per-target:

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:    fmt.Sprintf("target-%s", parsedURL.Host),
    Timeout: 30 * time.Second,
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5
    },
})

After 5 consecutive upstream failures, the circuit opens. For the next 30 seconds, all requests to that target immediately return 503 rather than waiting for a timeout. After 30 seconds, the circuit enters half-open state - it lets one request through as a probe. If that succeeds, the circuit closes and normal operation resumes. If it fails, the circuit opens again for another 30 seconds.

The circuit breaker wraps the proxy call and classifies responses:

_, err = cb.Execute(func() (interface{}, error) {
    proxy.ServeHTTP(rec, r)
    if rec.status >= 500 {
        return nil, fmt.Errorf("upstream error: %d", rec.status)
    }
    return nil, nil
})

Only 5xx responses trip the breaker - 4xx responses (bad requests, auth errors) are client errors, not upstream failures, and shouldn't affect circuit state.

Hot Reload: Config Changes Without Downtime

One feature that turned out to be more useful than expected: configuration hot reload. Edit configs/config.yaml and changes take effect within seconds, with no restart required.

The implementation uses fsnotify via Viper:

v.WatchConfig()
v.OnConfigChange(func(e fsnotify.Event) {
    if err := refresh(v, store); err != nil {
        log.Printf("[CONFIG] reload failed: %v", err)
    } else {
        log.Printf("[CONFIG] reloaded from %s", e.Name)
    }
})

The Store wraps the config with a read-write mutex so concurrent access during a reload is safe:

type Store struct {
    mu  sync.RWMutex
    cfg *Config
}

func (s *Store) Get() *Config {
    s.mu.RLock()
    defer s.mu.RUnlock()
    cpy := *s.cfg  // Return a copy
    return &cpy
}

Returning a copy rather than a pointer to the internal config prevents callers from holding a reference to a config that gets mutated during a reload. This is a subtle correctness issue - handing out s.cfg directly would be a data race waiting to happen.
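The write side is the counterpart the reload callback invokes. A minimal sketch, assuming a Set method and a simplified Config (Relay's actual fields and names may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a stand-in for the real config struct.
type Config struct {
	RateLimit int
}

// Store guards the live config: Get hands out copies under the read lock,
// Set swaps in a freshly parsed config under the write lock.
type Store struct {
	mu  sync.RWMutex
	cfg *Config
}

func (s *Store) Get() *Config {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cpy := *s.cfg // callers get a snapshot, never the live pointer
	return &cpy
}

func (s *Store) Set(c *Config) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cfg = c
}

func main() {
	s := &Store{cfg: &Config{RateLimit: 100}}
	old := s.Get()
	s.Set(&Config{RateLimit: 500}) // simulated hot reload
	// The snapshot taken before the reload is unaffected by it.
	fmt.Println(old.RateLimit, s.Get().RateLimit) // 100 500
}
```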

The practical benefit: during a load spike, you can increase the rate limit ceiling by editing a YAML file. No deployment, no downtime, takes effect in under 5 seconds.

Cost Tracking: Counting Tokens Before They Happen

The cost tracking middleware estimates token counts and costs before the request hits the upstream. This is useful for logging, alerting, and quota management.

count, _ := ai.CountTokens(payload.Model, fullText)
cost := ai.EstimateCost(count, payload.Model, cfg.Models)

Token counting uses the tiktoken-go library, which implements OpenAI's actual tokenizer. This is important - a naive character count would be significantly off (GPT tokenization is not character-based).

The model pricing lives in config.yaml:

models:
  gpt-4: 0.03
  gpt-3.5-turbo: 0.002
  claude-3-opus: 0.015

Since pricing changes and new models appear regularly, keeping it in a hot-reloadable config file rather than hardcoded constants means updates don't require a deployment.
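The arithmetic itself is simple. A sketch of the estimate, assuming the config prices are USD per 1,000 tokens (an assumption about Relay's units, consistent with how providers typically quote pricing; the function name is mine):

```go
package main

import "fmt"

// estimateCost turns a token count into a dollar estimate, assuming the
// per-model prices are USD per 1,000 tokens. Unknown models price as zero.
func estimateCost(tokens int, model string, prices map[string]float64) float64 {
	return float64(tokens) / 1000 * prices[model]
}

func main() {
	prices := map[string]float64{"gpt-4": 0.03, "gpt-3.5-turbo": 0.002}
	fmt.Println(estimateCost(1500, "gpt-4", prices)) // 0.045
}
```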

The token count and cost are stored in the request context so downstream middleware (logging, quota enforcement) can access them without recomputing:

ctx = context.WithValue(ctx, tokenCountContextKey, count)
ctx = context.WithValue(ctx, tokenCostContextKey, cost)

The Load Balancer

For teams running multiple AI providers, Relay includes a load balancer with four strategies: round-robin, weighted, least-latency, and random.

The least-latency strategy is the most interesting. Each target maintains a sliding window of the last 100 response times:

type LatencyTracker struct {
    samples []time.Duration
    maxSize int
}

On each request, the load balancer calculates average latency per target and routes to the fastest one. This naturally adapts to provider performance - if OpenAI is running slow and Anthropic is fast, traffic shifts to Anthropic automatically.

Combined with per-target circuit breakers, this means the load balancer both avoids slow targets and avoids dead ones.

What I'd Do Differently

The usage accounting in auth.go has a race condition. The POST /admin/keys endpoint itself is fine - it returns the key synchronously, so creation latency is just the Redis write latency, which is acceptable for a management API. But usage counts are incremented with a goroutine-launched get-modify-put cycle: two concurrent requests with the same key could both read used: 5, both increment to used: 6, and both write back used: 6, silently losing a count. The fix is Redis's atomic INCR command instead of the get-modify-put pattern.

Semantic caching is the highest-value missing feature. Right now, caching is exact-match - the SHA-256 of the request body must match exactly. Two semantically identical questions ("What's 2+2?" and "What does 2+2 equal?") get separate upstream calls. Embedding-based similarity search would dramatically improve cache hit rate for conversational applications, but adds meaningful complexity (an embedding model, a vector index, a similarity threshold to tune).

The transform middleware's JSON path implementation is fragile. I implemented a simplified dot-notation path parser (messages.0.content) that doesn't handle arrays, nested structures, or edge cases well. A proper JSON pointer implementation (RFC 6901) would be more robust.

The Deployment Story

Relay ships as a single Go binary with a multi-stage Docker build. The final image is Alpine-based and weighs in under 20MB. The included docker-compose.yml gets you a full stack - Relay plus Redis - in one command:

docker-compose up -d

At that point, localhost:8080 is your AI gateway. Point your SDK there and you get caching, rate limiting, cost tracking, and circuit breaking for free.

The Number That Justified Building This

In testing, repeated identical queries against a warm cache resulted in 0ms upstream latency - the response came entirely from Redis. For applications with predictable query patterns (think: FAQ bots, code review tools, document summarizers), cache hit rates of 30-60% are realistic. At GPT-4 pricing, that's a meaningful cost reduction.

More than the cost savings, the observability alone is worth it. Knowing exactly how many tokens each service is consuming, which models are being called, and where latency is coming from - that information is genuinely hard to get from provider dashboards alone.

Relay is open source at https://github.com/ngoyal88/relay. Docker image available at https://hub.docker.com/r/ngoyal88/relay.

Thanks for reading - feel free to reach out!