Metrics Server Integration

Table of Contents
Introduction
StatsD Integration
- Configuration
Prometheus Integration
- Configuration
- Timeouts
Multilog
GreenArrow Telemetry

Introduction

GreenArrow integrates with both StatsD and Prometheus. Whether you enable one integration or both, you receive the same telemetry data from GreenArrow.

StatsD Integration

Configuration

Configuration is done using the following directives:

Prometheus Integration

Configuration

Configuration is done using the following directive:

Timeouts

The HTTP server that is bound to prometheus_listen is configured with 60 second timeouts. If a Prometheus request takes longer than this to fulfill, the client will receive either 408 Request Timeout or 500 Internal Server Error.

Multilog

Every 5 seconds, the hvmail-metrics service will log a summary of recent metrics it has processed. This can be monitored using tail:

tail -F /var/hvmail/log/metrics/current | tai64nlocal

For more information on GreenArrow’s logs, see Service Logs.
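If the log needs to be read by a script rather than piped through tai64nlocal, the timestamp labels can be decoded directly. The following is a minimal sketch, assuming the file uses standard multilog TAI64N labels (an "@" followed by 24 hexadecimal digits) and the fixed 10-second offset that daemontools applies when writing them. The function name parse_tai64n is hypothetical, and the result is in UTC rather than local time.

```python
from datetime import datetime, timezone

# Assumption: labels are written as 2**62 + 10 + Unix time, as daemontools does.
TAI64_EPOCH_OFFSET = 2**62 + 10


def parse_tai64n(label):
    """Convert a TAI64N label ('@' + 16 hex second digits + 8 hex nanosecond digits) to a UTC datetime."""
    if not label.startswith("@") or len(label) != 25:
        raise ValueError(f"not a TAI64N label: {label!r}")
    seconds = int(label[1:17], 16) - TAI64_EPOCH_OFFSET
    nanoseconds = int(label[17:25], 16)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only has microsecond resolution, so the remainder is truncated.
    return stamp.replace(microsecond=nanoseconds // 1000)
```

Each line of the log begins with such a label, so splitting a line on its first space and passing the first field to this function yields the time of the entry.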

GreenArrow Telemetry

The metric keys below are listed as they are provided to Prometheus. If a metric has labels, they are listed within the metric definition.

When using StatsD, be aware:

Any labels are appended to the key name, prefixed by the label name and an underscore, with non-alphanumeric characters replaced by underscore. For example, in the remote_delivery_attempts_active metric, a delivery attempt being made to example.com on the outmta ipaddr-1 would be sent as remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com.
Timers (e.g. http_request_duration_seconds) are recorded in milliseconds, and the _seconds suffix is transformed to _milliseconds.
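The two rules above can be sketched in a few lines of Python. This is an illustration only: the function name statsd_key is hypothetical, "non-alphanumeric" is assumed to mean anything outside A-Z, a-z and 0-9, and labels are assumed to be appended in the order shown in each metric definition. Some keys listed below use dotted StatsD names (for example remote.attempts.connmaxout) that do not follow from these two rules, so the statsd: value given for each metric remains the reference.

```python
import re


def statsd_key(name, labels=(), is_timer=False):
    """Build a StatsD key from a Prometheus-style metric name and ordered (label, value) pairs."""
    # Timers are reported in milliseconds, so the suffix changes with the unit.
    if is_timer and name.endswith("_seconds"):
        name = name[: -len("_seconds")] + "_milliseconds"
    parts = [name]
    for label, value in labels:
        # Assumption: only ASCII letters and digits count as alphanumeric.
        safe_value = re.sub(r"[^A-Za-z0-9]", "_", value)
        parts.append(f"{label}_{safe_value}")
    return ".".join(parts)


print(statsd_key("remote_delivery_attempts_active",
                 [("outmta", "ipaddr-1"), ("throttle", "example.com")]))
# prints remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com
```

A timer value would also need its unit converted (seconds multiplied by 1000) before being compared with the Prometheus value.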

Queue Status

name:

ram_queue_message_batches

statsd:

ram_queue_message_batches

type:

gauge

Number of message batches currently in the RAM queue (first delivery attempts).

name:

ram_queue_message_batches_max

statsd:

ram_queue_message_batches_max

type:

gauge

Maximum number of message batches that will fit in the RAM queue (first delivery attempts).

name:

ram_queue_bytes

statsd:

ram_queue_bytes

type:

gauge

Total size (in bytes) of messages currently in the RAM queue (first delivery attempts).

name:

ram_queue_bytes_max

statsd:

ram_queue_bytes_max

type:

gauge

Maximum size (in bytes) of messages that will fit in the RAM queue (first delivery attempts).
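Each RAM queue gauge is paired with a _max gauge, so a fill ratio can be derived from the two. The sketch below is one way to do that, assuming the four gauge values have already been collected into a dictionary; the helper names and the sample numbers are hypothetical.

```python
def utilization(current, maximum):
    """Return current / maximum as a fraction, or None when the maximum is not positive."""
    if maximum <= 0:
        return None
    return current / maximum


def ram_queue_fill(sample):
    """Return the larger of the batch-count and byte-size fill ratios for the RAM queue."""
    ratios = [
        utilization(sample["ram_queue_message_batches"],
                    sample["ram_queue_message_batches_max"]),
        utilization(sample["ram_queue_bytes"], sample["ram_queue_bytes_max"]),
    ]
    known = [ratio for ratio in ratios if ratio is not None]
    return max(known) if known else None


# Hypothetical values, not taken from a real system.
sample = {
    "ram_queue_message_batches": 250,
    "ram_queue_message_batches_max": 1000,
    "ram_queue_bytes": 3_000_000,
    "ram_queue_bytes_max": 4_000_000,
}
print(ram_queue_fill(sample))  # prints 0.75
```

Taking the larger ratio reflects that the queue has two separate limits; which of them is reached first depends on message size.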

name:

simplemh_http_queue_messages

statsd:

simplemh_http_queue_messages

type:

gauge

Number of messages currently in the HTTP Submission API SimpleMH processing queue.

name:

simplemh_http_queue_messages_max

statsd:

simplemh_http_queue_messages_max

type:

gauge

Maximum number of messages that will fit in the HTTP Submission API SimpleMH processing queue.

name:

simplemh_http_queue_bytes

statsd:

simplemh_http_queue_bytes

type:

gauge

Total size (in bytes) of messages currently in the HTTP Submission API SimpleMH processing queue.

name:

simplemh_http_queue_bytes_max

statsd:

simplemh_http_queue_bytes_max

type:

gauge

Maximum size (in bytes) of messages that will fit in the HTTP Submission API SimpleMH processing queue.

name:

simplemh_http_queue_latency_seconds

statsd:

simplemh_http_queue_latency_milliseconds

type:

histogram

Timing (in seconds) of how long it takes a message to make its way through the HTTP Submission API SimpleMH processing queue.

This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.

This metric is disabled by default and can be enabled using export_metric.
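Because the probe is injected every 15 seconds, both a missing sample and a high sample are meaningful. A minimal sketch of that two-part check follows, assuming the caller already knows how long ago the last sample arrived and what its value was. The function name and both thresholds (60 seconds of silence, 5 seconds of latency) are illustrative assumptions, not recommended values.

```python
def probe_alerts(seconds_since_last_sample, latest_latency_seconds,
                 max_silence_seconds=60.0, max_latency_seconds=5.0):
    """Return a list of alert names for the SimpleMH queue latency probe."""
    # No sample at all, or none recently: the probe is not getting through.
    if seconds_since_last_sample is None or seconds_since_last_sample > max_silence_seconds:
        return ["missing"]
    # Samples are arriving, but the queue is taking too long to process them.
    if latest_latency_seconds > max_latency_seconds:
        return ["slow"]
    return []
```

For example, probe_alerts(None, 0.0) reports a missing probe, while probe_alerts(10, 9.0) reports a slow queue.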

name:

raw_nowait_queue_messages

statsd:

raw_nowait_queue_messages

type:

gauge

Number of messages currently in the no-wait message queue. This queue is used for notifications generated by GreenArrow.

name:

bounce_processor_queue_messages

statsd:

bounce_processor_queue_messages

type:

gauge

Number of bounces currently waiting to be processed.

name:

bounce_processor_queue_messages_soft_max

statsd:

bounce_processor_queue_messages_soft_max

type:

gauge

Maximum number of bounces that will fit in the bounce processing queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the bounce processor a chance to catch up.

name:

bounce_processor_queue_messages_max

statsd:

bounce_processor_queue_messages_max

type:

gauge

Maximum number of bounces that will fit in the bounce processing queue.
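The three bounce queue gauges together describe which state the queue is in. The sketch below classifies a sample under the assumption that back-pressure starts once the queue length reaches the soft maximum and that the queue is full once it reaches the maximum; whether GreenArrow compares with "greater than" or "greater than or equal" is not stated here, so the boundaries are approximate. The function name is hypothetical.

```python
def queue_pressure(messages, soft_max, hard_max):
    """Classify a queue as 'normal', 'back-pressure', or 'full' from its gauges."""
    if messages >= hard_max:
        return "full"
    if messages >= soft_max:
        # Injection is being slowed down to let the processor catch up.
        return "back-pressure"
    return "normal"


# Hypothetical gauge values.
print(queue_pressure(messages=1200, soft_max=1000, hard_max=5000))  # prints back-pressure
```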

name:

fbl_processor_queue_messages

statsd:

fbl_processor_queue_messages

type:

gauge

Number of FBL messages currently waiting to be processed.

name:

fbl_processor_queue_messages_soft_max

statsd:

fbl_processor_queue_messages_soft_max

type:

gauge

Maximum number of FBL messages that will fit in the FBL processor queue before FBL message injection will slow down. This represents a back-pressure mechanism to give the FBL processor a chance to catch up.

name:

fbl_processor_queue_messages_max

statsd:

fbl_processor_queue_messages_max

type:

gauge

Maximum number of FBL messages that will fit in the FBL processor queue.

name:

lite_bounce_processor_queue_messages

statsd:

lite_bounce_processor_queue_messages

type:

gauge

Number of bounce messages and FBL notifications waiting to be processed by the Lite Bounce Processor. This queue is filled by recipients matching the lite_bounce_processor_address and lite_fbl_processor_address directives. This includes messages that will generate bounce_lite and scomp_lite events.

name:

lite_bounce_processor_queue_messages_soft_max

statsd:

lite_bounce_processor_queue_messages_soft_max

type:

gauge

Maximum number of bounces that will fit in the Lite Bounce Processor queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the Lite Bounce Processor a chance to catch up.

name:

lite_bounce_processor_queue_messages_max

statsd:

lite_bounce_processor_queue_messages_max

type:

gauge

Maximum number of bounces that will fit in the Lite Bounce Processor queue.

name:

incoming_drain_queue_batches

statsd:

incoming_drain_queue_batches

type:

gauge

Number of message batches waiting to be injected into this instance that were drained from another instance.

name:

incoming_drain_queue_batches_max

statsd:

incoming_drain_queue_batches_max

type:

gauge

Maximum number of drained message batches that can fit in the incoming drain queue.

name:

disk_queue_catchup_percentage

statsd:

disk_queue_catchup_percentage

type:

gauge

Fraction of time (expressed as a value from 0.0 to 1.0) that the disk queue has spent in catch-up mode. Catch-up mode represents degraded disk queue performance, which can result in significant back-pressure on message injection.

name:

disk_queue_scheduling_delay_seconds

statsd:

disk_queue_scheduling_delay_milliseconds

type:

gauge

The difference between the current time and the next scheduled retry (in seconds). If the next scheduled retry is in the future, this gauge is zero and the disk queue is caught up on scheduling. If the next scheduled retry is in the past, this value is non-zero, meaning that the disk queue is behind in scheduling message delivery attempts.
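The description above amounts to a difference clamped at zero. As a minimal sketch, assuming both times are Unix timestamps in seconds (the function name is hypothetical):

```python
def scheduling_delay_seconds(now, next_scheduled_retry):
    """Return how far behind the disk queue is, or 0.0 when the next retry is still in the future."""
    return max(0.0, now - next_scheduled_retry)


print(scheduling_delay_seconds(now=100.0, next_scheduled_retry=130.0))  # prints 0.0
print(scheduling_delay_seconds(now=100.0, next_scheduled_retry=70.0))   # prints 30.0
```

The StatsD key for this gauge ends in _milliseconds, so the same value would appear there multiplied by 1000.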

Remote Delivery

name:

remote_attempts_connmaxout

statsd:

remote.attempts.connmaxout

type:

counter

Number of remote delivery attempts that are cancelled/rescheduled because they would exceed the throttle limits for a domain. This includes when this happens on the last delivery attempt of the message, which causes the message to bounce.

name:

remote_attempts_deferral

statsd:

remote.attempts.deferral

type:

counter

Number of remote delivery attempts that result in a deferral. This includes deferrals on the last delivery attempt which cause messages to be bounced.

name:

remote_attempts_dump

statsd:

remote.attempts.dump

type:

counter

Number of remote delivery attempts that result in the message being dumped from the queue (due to the “dump messages from queue” feature).

name:

remote_attempts_failure

statsd:

remote.attempts.failure

type:

counter

Number of remote delivery attempts that result in a failure.

name:

remote_attempts_success

statsd:

remote.attempts.success

type:

counter

Number of remote delivery attempts that succeed.

name:

remote_attempts_started

statsd:

remote.attempts.started

type:

counter

Number of remote delivery attempts that began processing.

name:

remote_attempts_connmaxout_maxconn

statsd:

remote.attempts.connmaxout.maxconn

type:

counter

Number of remote delivery attempts that were a connmaxout due to exceeding a maximum concurrent connections limit.

name:

remote_attempts_connmaxout_msgperhour

statsd:

remote.attempts.connmaxout.msgperhour

type:

counter

Number of remote delivery attempts that were a connmaxout due to exceeding a maximum delivery attempts per hour limit.

name:

remote_attempts_connmaxout_maxunacknowledged

statsd:

remote.attempts.connmaxout.maxunacknowledged

type:

counter

Number of remote delivery attempts that were a connmaxout due to exceeding the limit of maximum unacknowledged delivery requests to GreenArrow Proxy. This can happen if the GreenArrow Proxy host is under-performing (i.e. exhausting CPU resources) or there is a network problem between the MTA and the GreenArrow Proxy host.

name:

remote_attempts_connmaxout_proxy_{local,remote}

statsd:

remote.attempts.connmaxout.proxy.{local,remote}

type:

counter

Number of remote delivery attempts from IP Addresses using GreenArrow Proxy that resulted in a connmaxout, broken down by whether the delivery attempt request was fully dispatched to the throttle decision maker in GreenArrow Proxy or the connmaxout was determined locally, without the network request.

Delivery attempts not made through GreenArrow Proxy do not report here.

name:

remote_delivery_attempts_active

statsd:

remote_delivery_attempts_active.outmta_{}.throttle_{}

type:

gauge

The number of remote delivery attempts that are currently active.

outmta	The name of the IP Address or Relay Server VirtualMTA that is currently being used for this delivery attempt.
throttle	The first (in order of definition) domain associated with the explicit throttling rule that is being used for this delivery attempt. If this delivery attempt is using a non-explicit throttle (i.e. it uses the default max concurrent connections & default max messages per hour) or is using a Relay Server, this value will be `__default`.

name:

remote_delivery_attempts_total

statsd:

remote_delivery_attempts_total.outmta_{}.throttle_{}.result_{}

type:

counter

The number of remote delivery attempts that have completed.

outmta	The name of the IP Address or Relay Server VirtualMTA that was used for the delivery attempt(s).
throttle	The first (in order of definition) domain associated with the explicit throttling rule that was used for the delivery attempt(s). For delivery attempts that used a non-explicit throttle (i.e. the default max concurrent connections & default max messages per hour) or a Relay Server, this value will be `__default`.
result	The result of these delivery attempt(s). This may be one of the following: `success`, `failure`, `deferral`, `pause`, `dump`, `connmaxout`, `unknown_report_type`, or `unknown_report_code`.

name:

remote_dslots_{ram,bounce,disk}_delivering

statsd:

remote.dslots.{ram,bounce,disk}.delivering

type:

gauge

Number of remote delivery slots used by messages undergoing a delivery attempt. This includes: DNS lookups, establishing the SMTP connection, transferring the message, waiting for a response, etc.

name:

remote_dslots_{ram,bounce,disk}_diskwait

statsd:

remote.dslots.{ram,bounce,disk}.diskwait

type:

gauge

Number of remote delivery slots used by messages that are waiting to be moved to the disk queue after a deferral.

Note: when IO load due to writing messages to the disk queue is slowing the system down, this number will increase.

name:

remote_dslots_{ram,bounce,disk}_free

statsd:

remote.dslots.{ram,bounce,disk}.free

type:

gauge

Number of remote delivery slots that are unused.

name:

remote_dslots_{ram,bounce,disk}_throttlewait

statsd:

remote.dslots.{ram,bounce,disk}.throttlewait

type:

gauge

Number of remote delivery slots used by messages that are waiting, due to throttle rules, to be allowed to make a delivery attempt.

name:

remote_dslots_{ram,bounce,disk}_total

statsd:

remote.dslots.{ram,bounce,disk}.total

type:

gauge

Number of remote delivery slots available in the queue. This is the value of the queue.{ram,bounce,disk}.concurrencyremote setting.

name:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged

statsd:

remote.dslots.{ram,bounce,disk}.throttlewait.unacknowledged

type:

gauge

Number of remote delivery attempts that have started but for which this MTA has not yet determined (either locally or from GreenArrow Proxy) whether the delivery attempt may begin.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count

name:

remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged

statsd:

remote.dslots.{ram,bounce,disk}.throttlewait.acknowledged

type:

gauge

Number of remote delivery attempts that have been placed into the backlog, waiting for an opportunity for delivery.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
remote_throttle_backlog_backlogged

name:

remote_messages_disk_queue_add_total

statsd:

remote.messages.disk_queue.add.count

type:

counter

Number of messages moved to the disk queue.

name:

remote_messages_new_total

statsd:

remote.messages.new.count

type:

counter

Number of new remote messages queued in the system.

name:

delivery_probe_latency_seconds

statsd:

delivery_probe_latency_milliseconds

type:

histogram

Timing (in seconds) of how long it takes a message to make its way from the HTTP Submission API to the point at which GreenArrow is ready to establish a remote network connection for delivery.

This includes the HTTP Submission API SimpleMH processing queue, which is measured with simplemh_http_queue_latency_seconds, and the message progressing through the scheduling/throttling queues and systems in GreenArrow.

This metric is disabled by default and can be enabled using export_metric.

name:

remote_throttle_unacknowledged_requests_count

statsd:

remote_throttle_unacknowledged_requests_count.queue_{ram,bounce,disk}.proxy_{}

type:

gauge

Number of unacknowledged delivery attempt requests that are currently in-flight to this GreenArrow Proxy.

This key is also emitted with queue omitted. In this case, the value is a sum of the three queues (ram/bounce/disk).

A delivery attempt request is considered to be acknowledged when the MTA hears back a response of either “begin delivery”, “connmaxout”, or “placed in backlog queue” from GreenArrow Proxy.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count

name:

remote_throttle_unacknowledged_requests_max

statsd:

remote_throttle_unacknowledged_requests_max.queue_{ram,bounce,disk}.proxy_{}

type:

gauge

The maximum number of unacknowledged delivery attempt requests that can be in-flight at the same time for this queue (ram/bounce/disk) to this GreenArrow Proxy.

This value is normally dynamically calculated, but can be overridden using greenarrow_proxy_max_unacknowledged_requests.

name:

remote_throttle_backlog_unacknowledged

statsd:

remote_throttle_backlog_unacknowledged.proxy_{}

type:

gauge

Number of unacknowledged delivery attempt requests that are currently in-flight to this GreenArrow Proxy, as seen by the component that intercepts requests that would receive connmaxout due to backlog capacity.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count

name:

remote_throttle_backlog_backlogged

statsd:

remote_throttle_backlog_backlogged.proxy_{}

type:

gauge

Number of delivery attempt requests that are currently “in the backlog” for this GreenArrow Proxy, as seen by the component that intercepts requests that would receive connmaxout due to backlog capacity.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
remote_throttle_backlog_backlogged

name:

remote_throttle_ping_duration_seconds

statsd:

remote_throttle_ping_duration_milliseconds.proxy_{}

type:

histogram

How long it takes for a simple network request to be exchanged with GreenArrow Proxy. High values here can indicate a problem with network communication to GreenArrow Proxy.

name:

remote_throttle_round_trip_duration_seconds

statsd:

remote_throttle_round_trip_duration_milliseconds.proxy_{}

type:

histogram

How long it takes for a request to make it through GreenArrow Proxy. High values here can indicate a performance bottleneck on GreenArrow Proxy.

HTTP Processing

name:

http_request_duration_seconds

statsd:

http_request_duration_milliseconds.category_{}

type:

histogram

Timing of HTTP requests that have been processed by GreenArrow.

category

The category of HTTP request.

Possible values include: click, open, inject, engine_stats, engine_api, engine_ui, and other

Incoming SMTP

name:

incoming_smtp_connections_active

statsd:

incoming.smtp.connections.active

type:

gauge

Number of open incoming SMTP connections.

service

The GreenArrow SMTPD service that this gauge represents.

Possible values include: smtpd1, smtpd2, smtpd3

name:

incoming_smtp_connections_max

statsd:

incoming.smtp.connections.max

type:

gauge

Maximum number of open incoming SMTP connections supported by this SMTPD service.

service

The GreenArrow SMTPD service that this gauge represents.

Possible values include: smtpd1, smtpd2, smtpd3

name:

incoming_smtp_connections_total

statsd:

incoming.smtp.connections.total

type:

counter

Total number of new SMTP connections that have been established to this SMTPD service.

service

The GreenArrow SMTPD service that this counter represents.

Possible values include: smtpd1, smtpd2, smtpd3

Remote SMTP Deliveries

The following metrics are exposed by both GreenArrow (MTA) and by GreenArrow Proxy.

When published by the MTA, they describe SMTP connections/deliveries that are made directly from the MTA, without using GreenArrow Proxy.
When published by GreenArrow Proxy, they describe SMTP connections/deliveries that are made from that GreenArrow Proxy instance.

The result is that when aggregating these datapoints for a cluster, you get the total number of outgoing SMTP connections/deliveries made from the entire cluster, with no duplicate values counted.
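Since no connection is counted by more than one instance, a cluster-wide figure is a plain sum across instances. The sketch below shows that aggregation, assuming the samples have already been collected into one dictionary per instance; the instance names and numbers are hypothetical, and the source_ip label is ignored for brevity.

```python
from collections import Counter


def cluster_totals(per_instance_samples):
    """Sum each metric across all MTA and GreenArrow Proxy instances."""
    totals = Counter()
    for samples in per_instance_samples.values():
        # Counter.update adds the values of a mapping to the running totals.
        totals.update(samples)
    return dict(totals)


# Hypothetical samples from one MTA and one GreenArrow Proxy instance.
per_instance = {
    "mta-1": {"remote_connection_new_total": 120, "remote_connection_failed_total": 3},
    "proxy-1": {"remote_connection_new_total": 480, "remote_connection_failed_total": 7},
}
print(cluster_totals(per_instance))
# prints {'remote_connection_new_total': 600, 'remote_connection_failed_total': 10}
```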

name:

remote_connection_new_total

statsd:

remote.connection.new.total

type:

counter

Number of new remote SMTP connections that have been successfully opened.

source_ip

The source IP address of this network connection.

name:

remote_connection_failed_total

statsd:

remote.connection.failed.total

type:

counter

Number of new remote SMTP connections that have failed to successfully open (e.g. if the connection was refused, or a network error).

source_ip

The source IP address of this network connection.

name:

remote_connections_active

statsd:

remote.connections.active

type:

gauge

Number of remote connections that are currently open.

source_ip

The source IP address of this network connection.

name:

remote_connection_reused_total

statsd:

remote.connection.reused.total

type:

counter

Number of times open SMTP connections have been reused.

source_ip

The source IP address of this network connection.

name:

remote_throttle_sessions

statsd:

remote.throttle.sessions

type:

gauge

When emitted by GreenArrow Proxy, this is the number of GreenArrow instances that are currently connected to this GreenArrow Proxy.

When emitted by other GreenArrow instances, this is always 1.

Bounce and FBL Processing

name:

bounce_message_processed_total

statsd:

bounce.message.processed.total

type:

counter

Number of bounce messages that have been processed by the bounce processor.

name:

fbl_message_processed_total

statsd:

fbl.message.processed.total

type:

counter

Number of FBL messages that have been processed by the FBL processor.

name:

lite_bounce_message_processed_total

statsd:

lite.bounce.message.processed.total

type:

counter

Number of bounce messages that have been processed by the lite bounce processor.

name:

lite_fbl_message_processed_total

statsd:

lite_fbl_message_processed_total

type:

counter

Number of FBL messages that have been processed by the lite FBL processor.

Event Processor

name:

event_delivery_ready_latency_seconds

statsd:

event_delivery_ready_latency_milliseconds.destination_{}

type:

gauge

Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. For events that need to be retried, this is the time it takes from when the retry is scheduled until it is ready to deliver to its destination.

destination

The name of the event delivery destination.

name:

event_delivery_first_attempt_ready_latency_seconds

statsd:

event_delivery_first_attempt_ready_latency_seconds.destination_{}

type:

gauge

Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. Does not include events that are being retried because they could not be delivered on their first attempt.

destination

The name of the event delivery destination.

name:

event_delivery_submission_latency_seconds

statsd:

event_delivery_submission_latency_milliseconds.destination_{}

type:

gauge

Timing (in seconds) of how long it takes to submit one event batch to the destination.

This metric is not written for event_delivery_logfile destinations.

destination

The name of the event delivery destination.

name:

event_delivery_delivered_total

statsd:

event_delivery_delivered_total.destination_{}

type:

counter

Number of events that were successfully delivered to the destination.

destination

The name of the event delivery destination.

name:

event_delivery_failed_total

statsd:

event_delivery_failed_total.destination_{}

type:

counter

Number of events that failed in delivery to the destination.

destination

The name of the event delivery destination.

name:

event_delivery_network_failed_total

statsd:

event_delivery_network_failed_total.destination_{}

type:

counter

Number of events that failed in delivery to the destination due to a network error (as opposed to, for example, an HTTP 404 response).

destination

The name of the event delivery destination.

name:

event_delivery_read_operations_total

statsd:

event_delivery_read_operations_total.destination_{}

type:

counter

Number of read operations that have been completed against the GreenArrow events table.

destination

The name of the event delivery destination.

name:

event_delivery_queue_first_attempt_count

statsd:

event_delivery_queue_first_attempt_count.destination_{}

type:

gauge

Number of events in the queue for this destination that have not yet received a delivery attempt.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.

name:

event_delivery_queue_retry_count

statsd:

event_delivery_queue_retry_count.destination_{}

type:

gauge

Number of events in the queue for this destination that have received at least one delivery attempt.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.

name:

event_delivery_queue_first_attempt_age_seconds

statsd:

event_delivery_queue_first_attempt_age_seconds.destination_{}

type:

gauge

Age (in seconds) of the oldest event waiting for its first delivery attempt. If there are no such events in the queue, this is set to 0.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.

name:

event_delivery_queue_retry_age_seconds

statsd:

event_delivery_queue_retry_age_seconds.destination_{}

type:

gauge

Age (in seconds) of the oldest event waiting for a retry. If there are no such events in the queue, this is set to 0.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.