Metrics Server Integration
- Table of Contents
- Introduction
- StatsD Integration
- Prometheus Integration
- Multilog
- GreenArrow Telemetry
Introduction
GreenArrow integrates with both StatsD and Prometheus. Whether you use one integration or both, GreenArrow provides the same telemetry data through each.
StatsD Integration
Configuration
Configuration is done using the following directives:
Prometheus Integration
Configuration
Configuration is done using the following directive:
Timeouts
The HTTP server that is bound to prometheus_listen is configured with 60-second timeouts. If a Prometheus request takes longer than this to fulfill, the client will receive either 408 Request Timeout or 500 Internal Server Error.
Multilog
Every 5 seconds, the hvmail-metrics service logs a summary of the metrics it has recently processed. This can be monitored using tail:
tail -F /var/hvmail/log/metrics/current | tai64nlocal
For more information on GreenArrow’s logs, see Service Logs.
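If you would rather process these log lines in a script than pipe them through tai64nlocal, the timestamps can be decoded directly. The following is a minimal sketch, assuming each line starts with a standard TAI64N label ("@" followed by 24 hexadecimal digits) and assuming the same fixed 10-second offset that tai64nlocal applies. The function names are hypothetical and not part of GreenArrow.

```python
from datetime import datetime, timezone

# Assumption: TAI64 seconds are offset by 2**62, plus the fixed 10 seconds
# that tai64nlocal subtracts (no leap-second table is consulted).
TAI64_EPOCH_OFFSET = 2**62 + 10


def parse_tai64n(label):
    """Return (unix_seconds, nanoseconds) for a TAI64N label."""
    hex_part = label.lstrip("@")
    if len(hex_part) != 24:
        raise ValueError("expected 24 hex digits in TAI64N label")
    seconds = int(hex_part[:16], 16) - TAI64_EPOCH_OFFSET
    nanos = int(hex_part[16:], 16)
    return seconds, nanos


def split_log_line(line):
    """Split a multilog line into (UTC datetime, nanoseconds, message)."""
    label, _, message = line.partition(" ")
    seconds, nanos = parse_tai64n(label)
    return datetime.fromtimestamp(seconds, tz=timezone.utc), nanos, message
```

The message portion is returned unparsed, since the layout of the summary itself is not described here.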
GreenArrow Telemetry
The metric keys below are defined as provided to Prometheus. If a metric has any labels associated with it, they are defined in a list within the metric definition.
When using StatsD, be aware:
- Any labels are appended to the key name, each prefixed by the label name and an underscore, with non-alphanumeric characters replaced by underscores. For example, in the remote_delivery_attempts_active metric, a delivery attempt being made to example.com on the outmta ipaddr-1 would be sent as remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com.
- Timers (e.g. http_request_duration_seconds) are recorded in milliseconds, not seconds (contrary to these metric names).
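The label-to-key rule and the timer unit difference can be sketched as follows. This is an illustration inferred from the single example above, not GreenArrow's implementation: it assumes labels are joined with "." in the order they are listed, that only the label value needs sanitizing, and that "alphanumeric" means ASCII letters and digits. Both function names are hypothetical.

```python
import re


def statsd_key(metric, labels):
    """Build the StatsD key for a Prometheus-style metric name and labels.

    `labels` is a list of (name, value) pairs in definition order.
    """
    parts = [metric]
    for name, value in labels:
        # Replace every non-alphanumeric character in the value with "_".
        sanitized = re.sub(r"[^A-Za-z0-9]", "_", value)
        parts.append(f"{name}_{sanitized}")
    return ".".join(parts)


def statsd_timer_to_seconds(milliseconds):
    """Convert a StatsD timer value (milliseconds) to seconds."""
    return milliseconds / 1000.0


key = statsd_key(
    "remote_delivery_attempts_active",
    [("outmta", "ipaddr-1"), ("throttle", "example.com")],
)
print(key)
```

A helper like this can be useful when building dashboards that need to match StatsD keys against the Prometheus names used in the rest of this page.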
Queue Status
Number of message batches currently in the RAM queue (first delivery attempts).
Maximum number of message batches that will fit in the RAM queue (first delivery attempts).
Total size (in bytes) of messages currently in the RAM queue (first delivery attempts).
Maximum size (in bytes) of messages that will fit in the RAM queue (first delivery attempts).
Number of messages currently in the HTTP Submission API SimpleMH processing queue.
Maximum number of messages that will fit in the HTTP Submission API SimpleMH processing queue.
Total size (in bytes) of messages currently in the HTTP Submission API SimpleMH processing queue.
Maximum size (in bytes) of messages that will fit in the HTTP Submission API SimpleMH processing queue.
Timing (in seconds) of how long it takes a message to make its way through the HTTP Submission API SimpleMH processing queue.
This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.
This metric is disabled by default and can be enabled using export_metric.
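Because the probe runs every 15 seconds regardless of traffic, an alert rule can treat a missing datapoint as a failure in its own right. The sketch below shows one way to express that logic. The thresholds (three missed probes, 10 seconds of latency) and the function name are illustrative assumptions, not recommended values.

```python
def probe_alert_state(last_value, last_seen, now, max_age=45.0, max_latency=10.0):
    """Classify the probe latency metric as "missing", "high", or "ok".

    last_value: most recent latency in seconds, or None if never seen.
    last_seen:  Unix time of the most recent datapoint, or None.
    now:        current Unix time.
    """
    # No datapoint at all, or none recently: the probe is not getting through.
    if last_value is None or last_seen is None or now - last_seen > max_age:
        return "missing"
    # The probe arrives, but takes too long to pass through the queue.
    if last_value > max_latency:
        return "high"
    return "ok"
```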
Number of messages currently in the no-wait message queue. This queue is used for notifications generated by GreenArrow.
Number of bounces currently waiting to be processed.
Maximum number of bounces that will fit in the bounce processing queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the bounce processor a chance to catch up.
Maximum number of bounces that will fit in the bounce processing queue.
Number of FBL messages currently waiting to be processed.
Maximum number of FBL messages that will fit in the FBL processor queue before FBL message injection will slow down. This represents a back-pressure mechanism to give the FBL processor a chance to catch up.
Maximum number of FBL messages that will fit in the FBL processor queue.
Number of bounce messages and FBL notifications waiting to be processed by the Lite Bounce Processor. This queue is filled by recipients matching the lite_bounce_processor_address and lite_fbl_processor_address directives. This includes messages that will generate bounce_lite and scomp_lite events.
Maximum number of bounces that will fit in the Lite Bounce Processor queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the Lite Bounce Processor a chance to catch up.
Maximum number of bounces that will fit in the Lite Bounce Processor queue.
Number of message batches waiting to be injected into this instance that were drained from another instance.
Maximum number of drained message batches that can fit in the incoming drain queue.
Fraction of time (ranging from 0.0 to 1.0) that the disk queue has spent in catch-up mode. Catch-up mode represents degraded disk queue performance, which can result in significant back-pressure on message injection.
The difference between the current time and the next scheduled retry (in seconds). If the next scheduled retry is in the future, this gauge is zero and the disk queue is caught up on scheduling. If the next scheduled retry is in the past, this value is non-zero, meaning that the disk queue is behind in scheduling message delivery attempts.
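As a worked illustration of the scheduling gauge just described, the value is the lag clamped at zero. The function name and the use of Unix timestamps are assumptions made for this sketch.

```python
def disk_queue_scheduling_lag(now, next_retry_at):
    """Seconds the disk queue is behind schedule, never negative.

    now and next_retry_at are Unix timestamps in seconds.
    """
    return max(0.0, now - next_retry_at)


# Next retry is 30 seconds in the future: the queue is caught up.
caught_up = disk_queue_scheduling_lag(1000.0, 1030.0)
# Next retry was due 60 seconds ago: the queue is behind.
behind = disk_queue_scheduling_lag(1000.0, 940.0)
```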
Remote Delivery
Number of remote delivery attempts that are cancelled/rescheduled because they would exceed the throttle limits for a domain. This includes when this happens on the last delivery attempt of the message, which causes the message to bounce.
Number of remote delivery attempts that result in a deferral. This includes deferrals on the last delivery attempt which cause messages to be bounced.
Number of remote delivery attempts that result in the message being dumped from the queue (due to the “dump messages from queue” feature).
Number of remote delivery attempts that result in a failure.
Number of remote delivery attempts that succeed.
The number of remote delivery attempts that are currently active.
outmta |
The name of the IP Address or Relay Server VirtualMTA that is currently being used for this delivery attempt. |
throttle |
The first (in order of definition) domain associated with the explicit throttling rule that is being used for this delivery attempt.
If this delivery attempt is using a non-explicit throttle (i.e. it uses the default max concurrent connections & default max messages per hour)
or is using a Relay Server, this value will be |
The number of remote delivery attempts that have completed.
outmta |
The name of the IP Address or Relay Server VirtualMTA that was used for the delivery attempt(s). |
throttle |
The first (in order of definition) domain associated with the explicit throttling rule that was used for the delivery attempt(s).
For delivery attempts that used a non-explicit throttle (i.e. the default max concurrent connections & default max messages per hour)
or on a Relay Server, this value will be |
result |
The result of these delivery attempt(s). This may be one of the following: |
smtp_status_code |
The 3-digit numeric SMTP status code for these delivery attempt(s). |
Number of remote delivery slots used by messages undergoing a delivery attempt. This includes: DNS lookups, establishing the SMTP connection, transferring the message, waiting for a response, etc.
Number of remote delivery slots used by messages that are waiting to be moved to the disk queue after a deferral.
Note: when IO load due to writing messages to the disk queue is slowing the system down, this number will increase.
Number of remote delivery slots that are unused.
Number of remote delivery slots used by messages that are waiting, due to throttle rules, to be allowed to make a delivery attempt.
Number of remote delivery slots available in the queue. This is the value of the queue.{ram,bounce,disk}.concurrencyremote setting.
Number of messages moved to the disk queue.
New remote messages queued in the system.
Timing (in seconds) of how long it takes a message to make its way from the HTTP Submission API to the point at which GreenArrow is ready to establish a remote network connection for delivery.
This includes the HTTP Submission API SimpleMH processing queue, which is measured with simplemh_http_queue_latency_seconds, and the message progressing through the scheduling/throttling queues and systems in GreenArrow.
This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.
This metric is disabled by default and can be enabled using export_metric.
HTTP Processing
Timing of HTTP requests that have been processed by GreenArrow.
category |
The category of HTTP request. Possible values include: |
Incoming SMTP
Number of open incoming SMTP connections.
service |
The GreenArrow SMTPD service that this gauge represents. Possible values include: |
Maximum number of open incoming SMTP connections supported by this SMTPD service.
service |
The GreenArrow SMTPD service that this gauge represents. Possible values include: |
Total number of new SMTP connections that have been established to this SMTPD service.
service |
The GreenArrow SMTPD service that this gauge represents. Possible values include: |
Remote SMTP Deliveries
The following metrics are exposed by both GreenArrow (MTA) and by GreenArrow Proxy.
- When published by the MTA, they describe SMTP connections/deliveries that are made directly from the MTA, without using GreenArrow Proxy.
- When published by GreenArrow Proxy, they describe SMTP connections/deliveries that are made from that GreenArrow Proxy instance.
The result is that when aggregating these datapoints for a cluster, you get the total number of outgoing SMTP connections/deliveries made from the entire cluster, with no duplicate values counted.
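Since each connection is reported by exactly one publisher, cluster-wide totals are a plain sum across instances. A minimal sketch of that aggregation, grouped by the source_ip label, is shown below. The instance names and sample values are made up for illustration.

```python
from collections import Counter


def cluster_totals(samples):
    """Sum per-instance datapoints into cluster-wide totals per source IP.

    `samples` is an iterable of (instance, source_ip, value) tuples, where
    the instance is either an MTA or a GreenArrow Proxy.
    """
    totals = Counter()
    for _instance, source_ip, value in samples:
        totals[source_ip] += value
    return dict(totals)


# Hypothetical datapoints from one MTA and one proxy.
samples = [
    ("mta-1", "192.0.2.10", 40),
    ("proxy-1", "192.0.2.20", 25),
    ("proxy-1", "192.0.2.10", 5),
]
totals = cluster_totals(samples)
```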
Number of new remote SMTP connections that have been successfully opened.
source_ip |
The source IP address of this network connection. |
Number of new remote SMTP connections that have failed to successfully open (e.g. if the connection was refused, or a network error).
source_ip |
The source IP address of this network connection. |
Number of remote connections that are currently open.
source_ip |
The source IP address of this network connection. |
Number of times open SMTP connections have been reused.
source_ip |
The source IP address of this network connection. |
When emitted by GreenArrow Proxy, this is the number of GreenArrow instances that are currently connected to this GreenArrow Proxy.
When emitted by other GreenArrow instances, this is always 1.
Bounce and FBL Processing
Number of bounce messages that have been processed by the bounce processor.
Number of FBL messages that have been processed by the FBL processor.
Number of bounce messages that have been processed by the lite bounce processor.
Number of FBL messages that have been processed by the lite FBL processor.
Event Processor
Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. For events that need to be retried, this is the time it takes from when the retry is scheduled until it is ready to deliver to its destination.
destination |
The name of the event delivery destination. |
Timing (in seconds) of how long it takes to submit one event batch to the destination.
This metric is not written for event_delivery_logfile destinations.
destination |
The name of the event delivery destination. |
Number of events that were successfully delivered to the destination.
destination |
The name of the event delivery destination. |
Number of events that failed in delivery to the destination.
destination |
The name of the event delivery destination. |
Number of events that failed in delivery to the destination due to a network error (as opposed to, for example, an HTTP 404 response).
destination |
The name of the event delivery destination. |
Number of read operations that have been completed against the GreenArrow events table.
destination |
The name of the event delivery destination. |