🔥 Prometheus
I'm familiar with the architecture etc., and am currently interested in the querying side of things. So here are my notes from reading Prometheus: Up & Running by Brian Brazil.
Basic queries
Use rate for counters, e.g. rate(prometheus_blah[1m])
Curly braces select labels (i.e. tags), e.g. process_resident_memory_bytes{job="node"}
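The two combine; a sketch assuming the same node job as above (process_cpu_seconds_total is a counter that shows up again further down):
rate(process_cpu_seconds_total{job="node"}[1m])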
Basic alerting
up == 0 returns only results where the condition matches. You can set this as an alert if it stays true over a particular duration, e.g. 1m.
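As a rules-file sketch (the alert name and annotation are illustrative; the for: field is the duration):
groups:
- name: example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    annotations:
      summary: 'Instance {{ $labels.instance }} is down'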
Basic calculations
E.g. the fraction of Hello World requests that raised an exception:
rate(hello_world_exceptions_total[1m])
/
rate(hello_worlds_total[1m])
Conventions
Counters generally end with _total.
Strictly avoid ending anything with _count, _sum (these are for summaries) or _bucket (which is for histograms).
Use the unit in the name, e.g. myapp_requests_processed_bytes_total.
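For instance, a conventionally named counter with prometheus_client (a sketch; the metric name is reused from the calculation above):
import prometheus_client

# A plain event counter, so the name ends with _total
REQUESTS = prometheus_client.Counter(
    'hello_worlds_total',
    'Hello Worlds requested.')

REQUESTS.inc()  # increment once per request served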
Summaries
A Summary has an observe method to which you pass a non-negative size. E.g.
import time
import prometheus_client

LATENCY = prometheus_client.Summary(
    'hello_world_latency_seconds',
    'Time for a request Hello World.')

class blah:
    def get(self):
        start = time.time()
        # do stuff here
        LATENCY.observe(time.time() - start)
Now the /metrics endpoint will show hello_world_latency_seconds, containing a hello_world_latency_seconds_count and a hello_world_latency_seconds_sum. The former is the number of observe() calls made; the latter is the sum of the values passed.
So average latency over the last minute would be:
rate(hello_world_latency_seconds_sum[1m])
/
rate(hello_world_latency_seconds_count[1m])
Here the numerator gives you the total latency observed in that window (say 5s + 10s + 15s) and the denominator gives you the request count (1 + 1 + 1 => 3). The per-second scaling from rate cancels out of the ratio, so the answer would be 30/3, i.e. 10s: the average request latency in this window.
Simplify all this in the code by just using the @LATENCY.time() decorator.
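A minimal sketch of the decorator form, replacing the manual timing above:
class blah:
    @LATENCY.time()  # times the method body and calls observe() for you
    def get(self):
        # do stuff here
        pass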
Histograms
Again you would use an observe method, but here you get quantiles like p95. Using this on a metric hello_world_latency_seconds would yield hello_world_latency_seconds_bucket, which is a set of counters (one per bucket). Use a query like this to extract data out of it:
histogram_quantile(0.95, rate(hello_world_latency_seconds_bucket[1m]))
Default buckets cover latencies from 1ms to 10s, but you can create your own (see the sketch below). A more interesting query:
my_latency_seconds_bucket{le="0.5"}
/ ignoring(le)
my_latency_seconds_bucket{le="+Inf"}
What this does: histogram buckets are cumulative, so the le="0.5" series counts every request that took 500ms or less, and le="+Inf" counts all requests. Dividing the two gives the fraction of requests that completed within 500ms. le is the bucket label here, and ignoring(le) lets the two series match despite their different le values.
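On the instrumentation side, a minimal sketch of defining your own buckets with prometheus_client (these boundaries are illustrative; the library appends the +Inf bucket itself):
import prometheus_client

LATENCY = prometheus_client.Histogram(
    'hello_world_latency_seconds',
    'Time for a request Hello World.',
    buckets=[0.001, 0.01, 0.1, 0.5, 1, 10])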
Note
No further calculations can happen on a quantile, like sum or avg.
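So to aggregate across instances, sum the bucket rates first and take the quantile last; a sketch:
histogram_quantile(0.95, sum without (instance) (rate(hello_world_latency_seconds_bucket[1m])))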
Functions
increase is just syntactic sugar, and displays the rate * range, where the range is something like [5m].
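E.g. these two should give roughly the same answer (300 being the number of seconds in [5m]; metric name reused from earlier):
increase(hello_worlds_total[5m])
rate(hello_worlds_total[5m]) * 300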
rate is for counters, and if your instance restarts, rate will automatically account for it: a counter series like 5, 7, 12 that resets to 0 and starts climbing again is still treated as continuously increasing.
resets counts the number of times this has happened, so it is helpful for detecting how many times your process has restarted. E.g.
resets(process_cpu_seconds_total[1h])
Similarly, changes tells you how often a gauge changed, and is useful for gauges that don't change very often.
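E.g. process_start_time_seconds is a gauge that only changes when the process restarts, so this also counts restarts (my example):
changes(process_start_time_seconds[1d])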
Recording Rules
A way to run queries periodically. This helps speed up dashboards, lets you reuse results elsewhere, and is useful for reducing cardinality: when you have a slow query, split it up with a recording rule and Prometheus will output a new metric that you can use in the outer query.
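A rules-file sketch, recording the average-latency query from the Summaries section under a new metric name (the group name and the level:metric:operations naming style are illustrative):
groups:
- name: example
  rules:
  - record: job:hello_world_latency_seconds:mean1m
    expr: rate(hello_world_latency_seconds_sum[1m]) / rate(hello_world_latency_seconds_count[1m])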