Metrics Server Integration
- Table of Contents
- Introduction
- StatsD Integration
- Prometheus Integration
- Multilog
- GreenArrow Telemetry
Introduction
GreenArrow integrates with both StatsD and Prometheus. Regardless of which integration you choose (or both), you’ll receive the same telemetry data from GreenArrow.
StatsD Integration
Configuration
Configuration is done using the following directives:
Prometheus Integration
Configuration
Configuration is done using the following directive:
Timeouts
The HTTP server that is bound to prometheus_listen is configured with 60 second timeouts. If a Prometheus
request takes longer than this to fulfill, the client will receive either 408 Request Timeout or 500 Internal Server Error.
Multilog
Every 5 seconds, the hvmail-metrics service will log a summary of recent
metrics it has processed. This can be monitored using tail:
tail -F /var/hvmail/log/metrics/current | tai64nlocal
For more information on GreenArrow’s logs, see Service Logs.
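Where tai64nlocal is not available (for example, when shipping the log to another host), the TAI64N label at the start of each line can be decoded directly. The following is a minimal sketch, not part of GreenArrow: the function names decode_tai64n and split_log_line are hypothetical, and the code assumes the conventional fixed 10-second TAI offset with no leap-second table. It returns UTC rather than local time.

```python
from datetime import datetime, timedelta, timezone

# 2**62 marks the start of the Unix epoch in TAI64 labels. The extra 10
# seconds is the fixed TAI-UTC offset assumed by this sketch.
TAI64_EPOCH_OFFSET = 2**62 + 10


def decode_tai64n(label):
    """Convert a TAI64N label ('@' + 24 hex digits) to a UTC datetime."""
    hex_digits = label.lstrip("@")
    if len(hex_digits) != 24:
        raise ValueError("expected 24 hex digits after '@'")
    seconds = int(hex_digits[:16], 16) - TAI64_EPOCH_OFFSET
    nanoseconds = int(hex_digits[16:], 16)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only has microsecond resolution, so nanoseconds are truncated.
    return base + timedelta(microseconds=nanoseconds // 1000)


def split_log_line(line):
    """Split a multilog line into (UTC datetime, message text)."""
    label, _, message = line.partition(" ")
    return decode_tai64n(label), message
```

Calling split_log_line on each line read from /var/hvmail/log/metrics/current yields the timestamp and the untouched summary text.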
GreenArrow Telemetry
The metric keys below are defined as provided to Prometheus. If a metric has any labels associated with it, they are defined in a list within the metric definition.
When using StatsD, be aware:
- Any labels are appended to the key name, prefixed by the label name and an underscore, with non-alphanumeric characters replaced by underscores. For example, in the remote_delivery_attempts_active metric, a delivery attempt being made to example.com on the outmta ipaddr-1 would be sent as remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com.
- Timers (e.g. http_request_duration_seconds) are recorded in milliseconds, and the _seconds suffix is transformed to _milliseconds.
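The two StatsD naming rules above can be sketched in Python, which is handy when mapping a Prometheus-style key to the name to look for in a StatsD backend. This is an illustration inferred from the example in the text, not GreenArrow's implementation: the function names are hypothetical, the dot separator and label order are taken from the documented example, and whether the metric name itself is also sanitized is not stated (this sketch leaves it alone).

```python
import re


def statsd_key(metric, labels):
    """Build a StatsD key from a metric name and an ordered list of
    (label_name, label_value) pairs."""
    parts = [metric]
    for name, value in labels:
        # "<label>_<value>", then every non-alphanumeric character becomes "_".
        parts.append(re.sub(r"[^A-Za-z0-9]", "_", f"{name}_{value}"))
    return ".".join(parts)


def statsd_timer_name(metric):
    """Rename a timer's _seconds suffix to _milliseconds."""
    suffix = "_seconds"
    if metric.endswith(suffix):
        return metric[: -len(suffix)] + "_milliseconds"
    return metric


def seconds_to_milliseconds(value):
    """Convert a timer value from seconds to the unit StatsD receives."""
    return value * 1000.0
```

For the example in the text, statsd_key("remote_delivery_attempts_active", [("outmta", "ipaddr-1"), ("throttle", "example.com")]) produces remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com.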
Queue Status
Number of message batches currently in the RAM queue (first delivery attempts).
Maximum number of message batches that will fit in the RAM queue (first delivery attempts).
Total size (in bytes) of messages currently in the RAM queue (first delivery attempts).
Maximum size (in bytes) of messages that will fit in the RAM queue (first delivery attempts).
Number of messages currently in the HTTP Submission API SimpleMH processing queue.
Maximum number of messages that will fit in the HTTP Submission API SimpleMH processing queue.
Total size (in bytes) of messages currently in the HTTP Submission API SimpleMH processing queue.
Maximum size (in bytes) of messages that will fit in the HTTP Submission API SimpleMH processing queue.
Timing (in seconds) of how long it takes a message to make its way through the HTTP Submission API SimpleMH processing queue.
This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.
This metric is disabled by default and can be enabled using export_metric.
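Because the probe is injected every 15 seconds regardless of traffic, a monitor can treat both a missing sample and a high value as alert conditions. The sketch below shows that decision in Python. The function name and both thresholds (60 seconds of silence, 5 seconds of latency) are hypothetical values chosen for illustration and are not GreenArrow defaults.

```python
def probe_alert(last_seen, last_value, now, max_age=60.0, max_latency=5.0):
    """Return "missing", "high", or None for a probe-based latency metric.

    last_seen  -- Unix time of the most recent sample, or None if never seen
    last_value -- latency in seconds reported by that sample
    now        -- current Unix time
    max_age    -- hypothetical: alert if no sample arrived for this long
    max_latency -- hypothetical: alert if the latency exceeds this value
    """
    if last_seen is None or now - last_seen > max_age:
        # Several 15-second probes in a row never made it through the queue.
        return "missing"
    if last_value > max_latency:
        return "high"
    return None
```

A max_age of roughly four probe intervals tolerates an occasional delayed sample while still catching a stalled queue quickly.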
Number of messages currently in the no-wait message queue. This queue is used for notifications generated by GreenArrow.
Number of bounces currently waiting to be processed.
Maximum number of bounces that will fit in the bounce processing queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the bounce processor a chance to catch up.
Maximum number of bounces that will fit in the bounce processing queue.
Number of FBL messages currently waiting to be processed.
Maximum number of FBL messages that will fit in the FBL processor queue before FBL message injection will slow down. This represents a back-pressure mechanism to give the FBL processor a chance to catch up.
Maximum number of FBL messages that will fit in the FBL processor queue.
Number of bounce messages and FBL notifications waiting to be processed by the Lite Bounce Processor. This queue is
filled by recipients matching the lite_bounce_processor_address and lite_fbl_processor_address directives.
This includes messages that will generate bounce_lite and scomp_lite events.
Maximum number of bounces that will fit in the Lite Bounce Processor queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the Lite Bounce Processor a chance to catch up.
Maximum number of bounces that will fit in the Lite Bounce Processor queue.
Number of message batches waiting to be injected into this instance that were drained from another instance.
Maximum number of drained message batches that can fit in the incoming drain queue.
Percentage of time (ranging from 0.0 to 1.0) that the disk queue has spent in catch-up mode. Catch-up mode represents degraded disk queue performance and this can result in significant back-pressure on message injection.
The difference between the current time and the next scheduled retry (in seconds). If the next scheduled retry is in the future, this gauge is zero and the disk queue is caught-up on scheduling. If the next scheduled retry is in the past, this value is non-zero, meaning that the disk queue is behind in scheduling message delivery attempts.
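The behavior of this gauge can be summarized with a short Python sketch. The function name is hypothetical, and the code only restates the description above. It is not GreenArrow's implementation.

```python
def scheduling_lag_seconds(now, next_scheduled_retry):
    """Seconds the disk queue is behind on scheduling retries.

    Both arguments are Unix timestamps. A next retry in the future means
    the queue is caught up, so the lag is clamped to zero.
    """
    return max(0.0, now - next_scheduled_retry)
```

A retry scheduled 60 seconds in the future gives 0.0, while one that was due 90 seconds ago gives 90.0.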
Remote Delivery
Number of remote delivery attempts that are cancelled/rescheduled because they would exceed the throttle limits for a domain. This includes when this happens on the last delivery attempt of the message, which causes the message to bounce.
Number of remote delivery attempts that result in a deferral. This includes deferrals on the last delivery attempt which cause messages to be bounced.
Number of remote delivery attempts that result in the message being dumped from the queue (due to the “dump messages from queue” feature).
Number of remote delivery attempts that result in a failure.
Number of remote delivery attempts that succeed.
Number of remote delivery attempts that began processing.
Number of remote delivery attempts that were a connmaxout due to exceeding a maximum concurrent connections limit.
Number of remote delivery attempts that were a connmaxout due to exceeding a maximum delivery attempts per hour limit.
Number of remote delivery attempts that were a connmaxout due to exceeding the limit of maximum unacknowledged delivery requests to GreenArrow Proxy. This can happen if the GreenArrow Proxy host is under-performing (i.e. exhausting CPU resources) or there is a network problem between the MTA and the GreenArrow Proxy host.
Number of remote delivery attempts from IP Addresses using GreenArrow Proxy resulting in a connmaxout, broken down by whether or not the delivery attempt request was fully dispatched to the throttle decision maker in GreenArrow Proxy, or if a connmaxout was determined locally without the network request.
Delivery attempts not made through GreenArrow Proxy do not report here.
The number of remote delivery attempts that are currently active.
| outmta | The name of the IP Address or Relay Server VirtualMTA that is currently being used for this delivery attempt. |
| throttle | The first (in order of definition) domain associated with the explicit throttling rule that is being used for this delivery attempt. If this delivery attempt is using a non-explicit throttle (i.e. it uses the default max concurrent connections & default max messages per hour) or is using a Relay Server, this value will be |
The number of remote delivery attempts that have completed.
| outmta | The name of the IP Address or Relay Server VirtualMTA that was used for the delivery attempt(s). |
| throttle | The first (in order of definition) domain associated with the explicit throttling rule that was used for the delivery attempt(s). For delivery attempts that used a non-explicit throttle (i.e. the default max concurrent connections & default max messages per hour) or a Relay Server, this value will be |
| result | The result of these delivery attempt(s). This may be one of the following: |
Number of remote delivery slots used by messages undergoing a delivery attempt. This includes: DNS lookups, establishing the SMTP connection, transferring the message, waiting for a response, etc.
Number of remote delivery slots used by messages that are waiting to be moved to the disk queue after a deferral.
Note: when IO load due to writing messages to the disk queue is slowing the system down, this number will increase.
Number of remote delivery slots that are unused.
Number of remote delivery slots used by messages that are waiting, due to throttle rules, to be allowed to make a delivery attempt.
Number of remote delivery slots available in the queue. This is
the value of the queue.{ram,bounce,disk}.concurrencyremote setting.
Number of remote delivery attempts that have started, but this MTA has not yet determined (either locally or from GreenArrow Proxy) whether this delivery attempt may begin.
These metrics are related, representing the same information but from different perspectives:
remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count
Number of remote delivery attempts that have been placed into the backlog, waiting for an opportunity for delivery.
These metrics are related, representing the same information but from different perspectives:
remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
remote_throttle_backlog_backlogged
Number of messages moved to the disk queue.
New remote messages queued in the system.
Timing (in seconds) of how long it takes a message to make its way from the HTTP Submission API to the point at which GreenArrow is ready to establish a remote network connection for delivery.
This includes the HTTP Submission API SimpleMH processing queue, which is measured with simplemh_http_queue_latency_seconds, and the
message progressing through the scheduling/throttling queues and systems in GreenArrow.
This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.
This metric is disabled by default and can be enabled using export_metric.
Number of unacknowledged delivery attempt requests that are currently in-flight to this GreenArrow Proxy.
This key is also emitted with queue omitted. In this case, the value is a sum of the three queues (ram/bounce/disk).
A delivery attempt request is considered to be acknowledged when the MTA hears back a response of either “begin delivery”, “connmaxout”, or “placed in backlog queue” from GreenArrow Proxy.
These metrics are related, representing the same information but from different perspectives:
remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count
The maximum number of unacknowledged delivery attempt requests that can be in-flight at the same time for this queue (ram/bounce/disk) to this GreenArrow Proxy.
This value is normally dynamically calculated, but can be overridden using greenarrow_proxy_max_unacknowledged_requests.
Number of unacknowledged delivery attempt requests that are currently in-flight to this GreenArrow Proxy, as seen by the component that intercepts requests that would receive connmaxout due to backlog capacity.
These metrics are related, representing the same information but from different perspectives:
remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count
Number of delivery attempt requests that are currently “in the backlog” for this GreenArrow Proxy, as seen by the component that intercepts requests that would receive connmaxout due to backlog capacity.
These metrics are related, representing the same information but from different perspectives:
remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
remote_throttle_backlog_backlogged
Duration of how long it takes for a simple network request to be exchanged with GreenArrow Proxy. High values here can indicate a problem with network communication to GreenArrow Proxy.
Duration of how long it takes for a request to make it through GreenArrow Proxy. High values here can indicate a performance bottleneck on GreenArrow Proxy.
HTTP Processing
Timing of HTTP requests that have been processed by GreenArrow.
| category | The category of HTTP request. Possible values include: |
Incoming SMTP
Number of open incoming SMTP connections.
| service | The GreenArrow SMTPD service that this gauge represents. Possible values include: |
Maximum number of open incoming SMTP connections supported by this SMTPD service.
| service | The GreenArrow SMTPD service that this gauge represents. Possible values include: |
Total number of new SMTP connections that have been established to this SMTPD service.
| service | The GreenArrow SMTPD service that this gauge represents. Possible values include: |
Remote SMTP Deliveries
The following metrics are exposed by both GreenArrow (MTA) and by GreenArrow Proxy.
- When published by the MTA, they describe SMTP connections/deliveries that are made directly from the MTA, without using GreenArrow Proxy.
- When published by GreenArrow Proxy, they describe SMTP connections/deliveries that are made from that GreenArrow Proxy instance.
The result is that when aggregating these datapoints for a cluster, you get the total number of outgoing SMTP connections/deliveries made from the entire cluster, with no duplicate values counted.
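Since each connection is reported by exactly one publisher, cluster-wide totals are a plain sum across all MTA and GreenArrow Proxy instances. The Python sketch below illustrates this for a metric carrying a source_ip label. The function name, instance names, IP addresses, and values are all hypothetical sample data.

```python
from collections import Counter


def cluster_totals(samples):
    """Sum one metric across every publisher, grouped by source_ip.

    samples is an iterable of (instance, source_ip, value) tuples collected
    from all MTA and GreenArrow Proxy instances in the cluster.
    """
    totals = Counter()
    for _instance, source_ip, value in samples:
        totals[source_ip] += value
    return dict(totals)


# Hypothetical scrape results: one direct-sending MTA and one Proxy.
samples = [
    ("mta-1", "192.0.2.10", 120),
    ("proxy-1", "192.0.2.20", 300),
    ("proxy-1", "192.0.2.21", 80),
]
per_ip = cluster_totals(samples)
cluster_wide = sum(per_ip.values())
```

No de-duplication step is needed because a delivery made through GreenArrow Proxy is reported only by the Proxy, not by the MTA that requested it.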
Number of new remote SMTP connections that have been successfully opened.
| source_ip | The source IP address of this network connection. |
Number of new remote SMTP connections that have failed to successfully open (e.g. if the connection was refused, or a network error).
| source_ip | The source IP address of this network connection. |
Number of remote connections that are currently open.
| source_ip | The source IP address of this network connection. |
Number of times open SMTP connections have been reused.
| source_ip | The source IP address of this network connection. |
When emitted by GreenArrow Proxy, this is the number of GreenArrow instances that are currently connected to this GreenArrow Proxy.
When emitted by other GreenArrow instances, this is always 1.
Bounce and FBL Processing
Number of bounce messages that have been processed by the bounce processor.
Number of FBL messages that have been processed by the FBL processor.
Number of bounce messages that have been processed by the lite bounce processor.
Number of FBL messages that have been processed by the lite FBL processor.
Event Processor
Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. For events that need to be retried, this is the time it takes from when the retry is scheduled until it is ready to deliver to its destination.
| destination | The name of the event delivery destination. |
Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. Does not include events that are being retried because they could not be delivered on their first attempt.
| destination | The name of the event delivery destination. |
Timing (in seconds) of how long it takes to submit one event batch to the destination.
This metric is not written for event_delivery_logfile destinations.
| destination | The name of the event delivery destination. |
Number of events that were successfully delivered to the destination.
| destination | The name of the event delivery destination. |
Number of events that failed in delivery to the destination.
| destination | The name of the event delivery destination. |
Number of events that failed in delivery to the destination due to a network error (as opposed to, for example, an HTTP 404 response).
| destination | The name of the event delivery destination. |
Number of read operations that have been completed against the GreenArrow events table.
| destination | The name of the event delivery destination. |
Number of events in the queue for this destination that have not yet received a delivery attempt.
This is calculated approximately once per minute; longer queues will increase the duration between calculations.
| destination | The name of the event delivery destination. |
Number of events in the queue for this destination that have received at least one delivery attempt.
This is calculated approximately once per minute; longer queues will increase the duration between calculations.
| destination | The name of the event delivery destination. |
Age (in seconds) of the oldest event waiting for its first delivery attempt. If there are no such events in the queue, this is set to 0.
This is calculated approximately once per minute; longer queues will increase the duration between calculations.
| destination | The name of the event delivery destination. |
Age (in seconds) of the oldest event waiting for a retry. If there are no such events in the queue, this is set to 0.
This is calculated approximately once per minute; longer queues will increase the duration between calculations.
| destination | The name of the event delivery destination. |
