When you run AI models locally (on Ollama, llama.cpp, vLLM, LM Studio, or any other self-hosted inference server), there is no per-token API bill. Your real cost is electricity and GPU time. CostHQ lets you register local models with a cost-per-hour rate, so you can track compute costs alongside your cloud API spend in the same dashboard.

Quick start

# 1. Register a local model with your GPU's hourly compute cost
cs local-models add ollama/llama3 --cost-per-hour 0.50 --gpu "RTX 4090"

# 2. Start a session as usual
cs start "Local AI work"

# 3. Log usage with --duration instead of --cost
cs log-ai -p ollama -m llama3 --tokens 5000 --duration 2m30s

# 4. Check the session: cost is computed as duration × rate
cs status

Cost formula

cost = (durationSeconds / 3600) × costPerHour
For example, 2 minutes 30 seconds (150 seconds) on a model registered at $0.50/hr:
(150 / 3600) × 0.50 ≈ $0.0208
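Here is the same arithmetic as a small Python function, which you can use to sanity-check the numbers cs status reports. It is a sketch of the formula above, not CostHQ's own code. The function name local_cost is hypothetical.

```python
def local_cost(duration_seconds: float, cost_per_hour: float) -> float:
    """Compute-time cost in USD: (durationSeconds / 3600) × costPerHour."""
    if duration_seconds < 0 or cost_per_hour < 0:
        raise ValueError("duration and rate must be non-negative")
    return duration_seconds / 3600 * cost_per_hour


# 2m30s on a model registered at $0.50/hr
print(f"${local_cost(150, 0.50):.4f}")  # $0.0208
```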

Managing local models

Register a model

cs local-models add <provider>/<model> --cost-per-hour <rate> [--gpu <name>] [--notes <text>]
| Parameter | Description |
| --- | --- |
| <provider>/<model> | Provider and model name, e.g. ollama/llama3, vllm/mistral-7b |
| --cost-per-hour <rate> | USD per hour of GPU compute |
| --gpu <name> | Optional GPU identifier (e.g. "RTX 4090", "M2 Ultra") |
| --notes <text> | Optional notes |

List registered models

cs local-models list [--json]

Remove a model

cs local-models remove <provider>/<model>

Auto-detect Ollama models

If you have Ollama running locally, CostHQ can scan localhost:11434 and register all available models in one command:
cs local-models detect --cost-per-hour 0.50 --gpu "RTX 4090"
This calls Ollama’s GET /api/tags endpoint, collects the name of every locally installed model, and registers each one with the rate you specify.
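You can reproduce the same discovery step against Ollama's REST API. GET /api/tags returns a JSON object whose models array holds one entry per installed model, each with a name field such as llama3:latest. This Python sketch uses only the standard library. This page does not document how cs local-models detect turns those names (including the :tag suffix) into CostHQ keys, so the ollama/<name> key built below is an assumption. The helper names are hypothetical.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Pull the model names out of an Ollama GET /api/tags response body."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def installed_ollama_models(base_url: str = "http://localhost:11434",
                            timeout: float = 2.0) -> list:
    """Query a running Ollama server.

    Raises an OSError subclass (e.g. urllib.error.URLError) if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


def candidate_keys(names: list) -> list:
    # ASSUMPTION: keys use the ollama/<name> form; CostHQ may normalize tags differently
    return [f"ollama/{name}" for name in names]
```

With Ollama running, candidate_keys(installed_ollama_models()) returns one ollama/<name> string per installed model.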

Logging local model usage

Use the standard cs log-ai command with the new --duration flag:
# Duration accepts multiple formats
cs log-ai -p ollama -m llama3 --tokens 5000 --duration 120        # plain seconds
cs log-ai -p ollama -m llama3 --tokens 5000 --duration 2m30s      # human format
cs log-ai -p ollama -m llama3 --tokens 5000 --duration 1h         # hours
cs log-ai -p ollama -m llama3 --tokens 5000 --duration 1h30m      # combined
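Below is a minimal Python sketch of a parser that accepts the four duration formats above. It is not CostHQ's parser. The name parse_duration is hypothetical, and three behaviors are assumptions: decimals are allowed, units must appear in h, m, s order, and anything else is rejected.

```python
import re

_NUMBER = r"\d+(?:\.\d+)?"
# Optional hours, then minutes, then seconds, in that order (e.g. 1h30m, 2m30s, 45s)
_HMS = re.compile(rf"(?:(?P<h>{_NUMBER})h)?(?:(?P<m>{_NUMBER})m)?(?:(?P<s>{_NUMBER})s)?")


def parse_duration(text: str) -> float:
    """Convert '120', '2m30s', '1h', or '1h30m' into seconds."""
    text = text.strip().lower()
    # A bare number is interpreted as seconds
    if re.fullmatch(_NUMBER, text):
        return float(text)
    match = _HMS.fullmatch(text)
    # Reject empty input, unknown units, out-of-order units, and stray characters
    if not text or match is None or not any(match.groupdict().values()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (float(match.group(k) or 0) for k in ("h", "m", "s"))
    return hours * 3600 + minutes * 60 + seconds


for example in ("120", "2m30s", "1h", "1h30m"):
    print(example, parse_duration(example))
```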

The --local flag

CostHQ auto-recognizes these providers as local: ollama, llamacpp, llama.cpp, vllm, lmstudio, localai, jan, koboldcpp. If your provider isn’t in that list, use the --local flag to explicitly tell CostHQ to use compute-time costing:
cs log-ai -p my-custom-server -m meta/llama-3-70b --tokens 8000 --local --duration 3m
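The rule above comes down to one check: the provider is a known local runtime, or --local was passed. The Python sketch below mirrors that rule using the provider list from this page. The function name is hypothetical, and case-insensitive matching is an assumption rather than documented CostHQ behavior.

```python
# Providers this page lists as recognized local runtimes
LOCAL_PROVIDERS = frozenset({
    "ollama", "llamacpp", "llama.cpp", "vllm",
    "lmstudio", "localai", "jan", "koboldcpp",
})


def uses_compute_time_costing(provider: str, local_flag: bool = False) -> bool:
    """True if the call should be priced by duration × rate instead of per token."""
    # ASSUMPTION: provider names are compared case-insensitively
    return local_flag or provider.strip().lower() in LOCAL_PROVIDERS


print(uses_compute_time_costing("ollama"))                  # True
print(uses_compute_time_costing("my-custom-server"))        # False
print(uses_compute_time_costing("my-custom-server", True))  # True
```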

JSON output

When you log a local model call with --json, the pricing object shows source: "local":
{
  "logged": {
    "provider": "ollama",
    "model": "llama3",
    "tokens": 10000,
    "cost": 0.0416666667
  },
  "pricing": {
    "source": "local",
    "modelKnown": true,
    "inputPer1M": 0,
    "outputPer1M": 0,
    "costPerHour": 0.5,
    "durationSeconds": 300
  }
}
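If you consume the --json output in scripts, you can re-derive the cost from the pricing object as a consistency check. This Python sketch relies only on the field names shown in the example above. The helper name local_cost_matches is hypothetical.

```python
import json
import math

# The example output from this page
SAMPLE = """
{
  "logged": {"provider": "ollama", "model": "llama3", "tokens": 10000, "cost": 0.0416666667},
  "pricing": {"source": "local", "modelKnown": true, "inputPer1M": 0, "outputPer1M": 0,
              "costPerHour": 0.5, "durationSeconds": 300}
}
"""


def local_cost_matches(record: dict, rel_tol: float = 1e-6) -> bool:
    """Return True if logged.cost equals (durationSeconds / 3600) × costPerHour."""
    pricing = record["pricing"]
    if pricing.get("source") != "local":
        raise ValueError("pricing.source is not 'local'; no compute-time cost to check")
    expected = pricing["durationSeconds"] / 3600 * pricing["costPerHour"]
    # The sample cost is rounded, so compare with a relative tolerance
    return math.isclose(record["logged"]["cost"], expected, rel_tol=rel_tol)


record = json.loads(SAMPLE)
print(local_cost_matches(record))  # True
```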

Estimating your GPU cost per hour

If you’re not sure what --cost-per-hour rate to use, here are some reference points:
| Hardware | Estimated $/hr | Basis |
| --- | --- | --- |
| RTX 4090 (electricity only) | 0.10–0.20 | ~450 W TDP × local electricity rate |
| RTX 4090 (cloud rental) | 0.40–0.80 | RunPod, Vast.ai spot pricing |
| A100 80GB (cloud) | 1.50–3.00 | AWS, GCP, Azure on-demand |
| M2 Ultra (Mac Studio) | 0.05–0.15 | Apple Silicon power draw |
| H100 (cloud) | 2.50–5.00 | AWS, GCP on-demand |
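To turn the electricity-only row into a number for your own machine, multiply power draw in kilowatts by your price per kWh. This Python sketch assumes by default that the card draws its full rated power (TDP) for the whole session. That ignores the rest of the system and overstates idle time. The utilization parameter and the function name are hypothetical knobs for illustration. The $0.30/kWh price in the example is arbitrary, not a recommendation.

```python
def electricity_cost_per_hour(watts: float, price_per_kwh: float,
                              utilization: float = 1.0) -> float:
    """Estimated USD per hour = (watts / 1000) × utilization × price per kWh."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return watts / 1000 * utilization * price_per_kwh


# RTX 4090 at its ~450 W TDP with electricity at $0.30/kWh
print(f"{electricity_cost_per_hour(450, 0.30):.3f}")  # 0.135
```

The result (0.135) falls inside the 0.10–0.20 range in the table. You would then pass it to cs local-models add as the --cost-per-hour value.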
Pick a rate that reflects your setup. You can update it at any time by running cs local-models add again: the command upserts, so re-registering the same model replaces the old rate.
Local model configurations are stored in ~/.costhq/local-models.json. This file is independent of the session database and can be version-controlled or shared across machines.
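The upsert behavior of cs local-models add can be illustrated with a short Python sketch. This page does not document the layout of ~/.costhq/local-models.json. The flat object keyed by provider/model used here is an assumption, as is the helper name upsert_local_model. The sketch only shows that writing an existing key replaces its entry instead of adding a duplicate.

```python
import json
from pathlib import Path
from typing import Optional


def upsert_local_model(path: Path, key: str, cost_per_hour: float,
                       gpu: Optional[str] = None, notes: Optional[str] = None) -> dict:
    """Insert or replace one model entry in a JSON file (ASSUMED schema, not CostHQ's)."""
    # ASSUMED schema: {"ollama/llama3": {"costPerHour": 0.5, "gpu": "RTX 4090"}, ...}
    models = json.loads(path.read_text()) if path.exists() else {}
    entry = {"costPerHour": cost_per_hour}
    if gpu is not None:
        entry["gpu"] = gpu
    if notes is not None:
        entry["notes"] = notes
    # Same key: in this sketch the old entry (rate, GPU, notes) is replaced wholesale
    models[key] = entry
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(models, indent=2) + "\n")
    return models
```

When experimenting, point the sketch at a scratch file rather than the real config.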