GreenArrow Email Software Documentation

Metrics Server Integration

Introduction

GreenArrow integrates with both StatsD and Prometheus. Whichever integration you choose (one or both), you receive the same telemetry data from GreenArrow.

StatsD Integration

Configuration

Configuration is done using the following directives:

Prometheus Integration

Configuration

Configuration is done using the following directive:

Timeouts

The HTTP server bound to prometheus_listen is configured with 60-second timeouts. If a Prometheus request takes longer than this to fulfill, the client receives either 408 Request Timeout or 500 Internal Server Error.
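As an illustration of how a client might treat these two status codes, here is a minimal Python sketch. The function name, its parameters, and the choice to return None on 408/500 are assumptions made for this example; the URL must be whatever address and path your prometheus_listen configuration actually exposes, which this page does not specify.

```python
import urllib.error
import urllib.request


def scrape_metrics(url, timeout_seconds=60.0, opener=urllib.request.urlopen):
    """Return the metrics text, or None when the server reports a timeout.

    `url`, `timeout_seconds`, and `opener` are hypothetical parameters for
    this sketch; `opener` exists only so the function can be exercised
    without a live server.
    """
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # A request that exceeds the server-side timeout is answered with
        # 408 Request Timeout or 500 Internal Server Error.
        if error.code in (408, 500):
            return None
        # Any other HTTP error is unrelated to the timeout; let it propagate.
        raise
```

A caller could retry or raise an alert when the function returns None, since that indicates the metrics endpoint could not answer within its timeout.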

Multilog

Every 5 seconds, the hvmail-metrics service will log a summary of recent metrics it has processed. This can be monitored using tail:

tail -F /var/hvmail/log/metrics/current | tai64nlocal

For more information on GreenArrow’s logs, see Service Logs.

GreenArrow Telemetry

The metric keys below are defined as provided to Prometheus. If a metric has any labels associated with it, they are defined in a list within the metric definition.

When using StatsD, be aware:

  • Any labels are appended to the key name, prefixed by the label name and an underscore, with each non-alphanumeric character replaced by an underscore. For example, in the remote_delivery_attempts_active metric, a delivery attempt being made to example.com on the outmta ipaddr-1 would be sent as remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com.
  • Timers (e.g. http_request_duration_seconds) are recorded in milliseconds, and the _seconds suffix is transformed to _milliseconds.
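The two rules above can be sketched in Python as follows. This is an illustration of the naming rules as described here, not GreenArrow's own code; the function name and the is_timer flag are hypothetical. Note that some StatsD keys listed below use dot-separated names that differ from the Prometheus name, so always take the base key from the statsd field of each definition.

```python
import re


def statsd_key(name, labels=(), is_timer=False):
    """Build a StatsD key from a base name and ordered (label, value) pairs.

    `name`, `labels`, and `is_timer` are hypothetical parameters for this
    sketch.
    """
    # Timers are recorded in milliseconds, so the suffix changes to match.
    if is_timer and name.endswith("_seconds"):
        name = name[: -len("_seconds")] + "_milliseconds"
    parts = [name]
    for label, value in labels:
        # Replace every non-alphanumeric character in the value.
        safe_value = re.sub(r"[^A-Za-z0-9]", "_", value)
        parts.append(f"{label}_{safe_value}")
    return ".".join(parts)


key = statsd_key(
    "remote_delivery_attempts_active",
    [("outmta", "ipaddr-1"), ("throttle", "example.com")],
)
print(key)
```

With the inputs shown, the printed key matches the example given in the first bullet.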

Queue Status

name:
ram_queue_message_batches
statsd:
ram_queue_message_batches
type:
gauge

Number of message batches currently in the RAM queue (first delivery attempts).

name:
ram_queue_message_batches_max
statsd:
ram_queue_message_batches_max
type:
gauge

Maximum number of message batches that will fit in the RAM queue (first delivery attempts).

name:
ram_queue_bytes
statsd:
ram_queue_bytes
type:
gauge

Total size (in bytes) of messages currently in the RAM queue (first delivery attempts).

name:
ram_queue_bytes_max
statsd:
ram_queue_bytes_max
type:
gauge

Maximum size (in bytes) of messages that will fit in the RAM queue (first delivery attempts).
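One way to use the four RAM queue gauges together is to compute how full the queue is against each limit and watch the larger ratio. The sketch below is a monitoring heuristic, not a description of how GreenArrow enforces the limits; the function and parameter names are hypothetical.

```python
def ram_queue_utilization(batches, batches_max, size_bytes, size_bytes_max):
    """Return the larger fill ratio (0.0 to 1.0), or None if a maximum is not positive.

    Arguments correspond to ram_queue_message_batches,
    ram_queue_message_batches_max, ram_queue_bytes, and ram_queue_bytes_max.
    """
    if batches_max <= 0 or size_bytes_max <= 0:
        return None
    # Report whichever limit the queue is closer to reaching.
    return max(batches / batches_max, size_bytes / size_bytes_max)
```

For example, 50 of 200 batches and 900 of 1000 bytes yields a utilization of 0.9, because the byte limit is the tighter one.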

name:
simplemh_http_queue_messages
statsd:
simplemh_http_queue_messages
type:
gauge

Number of messages currently in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_messages_max
statsd:
simplemh_http_queue_messages_max
type:
gauge

Maximum number of messages that will fit in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_bytes
statsd:
simplemh_http_queue_bytes
type:
gauge

Total size (in bytes) of messages currently in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_bytes_max
statsd:
simplemh_http_queue_bytes_max
type:
gauge

Maximum size (in bytes) of messages that will fit in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_latency_seconds
statsd:
simplemh_http_queue_latency_milliseconds
type:
histogram

Timing (in seconds) of how long it takes a message to make its way through the HTTP Submission API SimpleMH processing queue.

This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.

This metric is disabled by default and can be enabled using export_metric.
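Because the probe runs every 15 seconds regardless of traffic, both a missing metric and a high value are meaningful signals. The following Python sketch shows one way to classify recent samples; the function name, the 60-second staleness window, and the 5-second latency threshold are illustrative assumptions, not GreenArrow recommendations.

```python
def probe_alert(samples, now, max_age_seconds=60.0, max_latency_seconds=5.0):
    """Classify probe latency samples as "missing", "high", or "ok".

    `samples` is a list of (unix_timestamp, latency_seconds) pairs.
    All thresholds are hypothetical defaults for this sketch.
    """
    # Keep only samples recorded within the staleness window.
    recent = [value for timestamp, value in samples if now - timestamp <= max_age_seconds]
    if not recent:
        # No recent probe result: the queue may be stalled.
        return "missing"
    if max(recent) > max_latency_seconds:
        return "high"
    return "ok"
```

With a 60-second window, "missing" means roughly four consecutive probes produced no measurement.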

name:
raw_nowait_queue_messages
statsd:
raw_nowait_queue_messages
type:
gauge

Number of messages currently in the no-wait message queue. This queue is used for notifications generated by GreenArrow.

name:
bounce_processor_queue_messages
statsd:
bounce_processor_queue_messages
type:
gauge

Number of bounces currently waiting to be processed.

name:
bounce_processor_queue_messages_soft_max
statsd:
bounce_processor_queue_messages_soft_max
type:
gauge

Maximum number of bounces that will fit in the bounce processing queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the bounce processor a chance to catch up.

name:
bounce_processor_queue_messages_max
statsd:
bounce_processor_queue_messages_max
type:
gauge

Maximum number of bounces that will fit in the bounce processing queue.
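The three bounce processor gauges can be combined to tell whether back-pressure is likely in effect. This is a minimal sketch under the assumption that injection slows once the queue length reaches the soft maximum; the exact boundary behavior is not stated here, and the function name and state labels are hypothetical.

```python
def bounce_queue_state(queued, soft_max, hard_max):
    """Return "full", "slowed", or "normal" for the bounce processing queue.

    Arguments correspond to bounce_processor_queue_messages,
    bounce_processor_queue_messages_soft_max, and
    bounce_processor_queue_messages_max.
    """
    if queued >= hard_max:
        return "full"
    if queued >= soft_max:
        # Assumed boundary: injection is slowed at or above the soft maximum.
        return "slowed"
    return "normal"
```

The same comparison applies to the FBL and Lite Bounce Processor queues, which expose the same trio of gauges.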

name:
fbl_processor_queue_messages
statsd:
fbl_processor_queue_messages
type:
gauge

Number of FBL messages currently waiting to be processed.

name:
fbl_processor_queue_messages_soft_max
statsd:
fbl_processor_queue_messages_soft_max
type:
gauge

Maximum number of FBL messages that will fit in the FBL processor queue before FBL message injection will slow down. This represents a back-pressure mechanism to give the FBL processor a chance to catch up.

name:
fbl_processor_queue_messages_max
statsd:
fbl_processor_queue_messages_max
type:
gauge

Maximum number of FBL messages that will fit in the FBL processor queue.

name:
lite_bounce_processor_queue_messages
statsd:
lite_bounce_processor_queue_messages
type:
gauge

Number of bounce messages and FBL notifications waiting to be processed by the Lite Bounce Processor. This queue is filled by recipients matching the lite_bounce_processor_address and lite_fbl_processor_address directives. This includes messages that will generate bounce_lite and scomp_lite events.

name:
lite_bounce_processor_queue_messages_soft_max
statsd:
lite_bounce_processor_queue_messages_soft_max
type:
gauge

Maximum number of bounces that will fit in the Lite Bounce Processor queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the Lite Bounce Processor a chance to catch up.

name:
lite_bounce_processor_queue_messages_max
statsd:
lite_bounce_processor_queue_messages_max
type:
gauge

Maximum number of bounces that will fit in the Lite Bounce Processor queue.

name:
incoming_drain_queue_batches
statsd:
incoming_drain_queue_batches
type:
gauge

Number of message batches waiting to be injected into this instance that were drained from another instance.

name:
incoming_drain_queue_batches_max
statsd:
incoming_drain_queue_batches_max
type:
gauge

Maximum number of drained message batches that can fit in the incoming drain queue.

name:
disk_queue_catchup_percentage
statsd:
disk_queue_catchup_percentage
type:
gauge

Fraction of time (ranging from 0.0 to 1.0) that the disk queue has spent in catch-up mode. Catch-up mode represents degraded disk queue performance and can result in significant back-pressure on message injection.

name:
disk_queue_scheduling_delay_seconds
statsd:
disk_queue_scheduling_delay_milliseconds
type:
gauge

The difference between the current time and the next scheduled retry (in seconds). If the next scheduled retry is in the future, this gauge is zero and the disk queue is caught up on scheduling. If the next scheduled retry is in the past, this value is non-zero, meaning that the disk queue is behind in scheduling message delivery attempts.
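The definition above amounts to a difference clamped at zero. A minimal Python sketch, with hypothetical function and parameter names and both times expressed as Unix timestamps:

```python
def scheduling_delay_seconds(now, next_retry_at):
    """Return how far behind schedule the disk queue is, never less than zero."""
    # A retry scheduled in the future means the queue is caught up.
    return max(0.0, now - next_retry_at)
```

For example, a retry that was due 30 seconds ago gives a delay of 30.0, while a retry due 30 seconds from now gives 0.0.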

Remote Delivery

name:
remote_attempts_connmaxout
statsd:
remote.attempts.connmaxout
type:
counter

Number of remote delivery attempts that are cancelled/rescheduled because they would exceed the throttle limits for a domain. This includes when this happens on the last delivery attempt of the message, which causes the message to bounce.

name:
remote_attempts_deferral
statsd:
remote.attempts.deferral
type:
counter

Number of remote delivery attempts that result in a deferral. This includes deferrals on the last delivery attempt which cause messages to be bounced.

name:
remote_attempts_dump
statsd:
remote.attempts.dump
type:
counter

Number of remote delivery attempts that result in the message being dumped from the queue (due to the “dump messages from queue” feature).

name:
remote_attempts_failure
statsd:
remote.attempts.failure
type:
counter

Number of remote delivery attempts that result in a failure.

name:
remote_attempts_success
statsd:
remote.attempts.success
type:
counter

Number of remote delivery attempts that succeed.

name:
remote_attempts_started
statsd:
remote.attempts.started
type:
counter

Number of remote delivery attempts that began processing.

name:
remote_attempts_connmaxout_maxconn
statsd:
remote.attempts.connmaxout.maxconn
type:
counter

Number of remote delivery attempts that were a connmaxout due to exceeding a maximum concurrent connections limit.

name:
remote_attempts_connmaxout_msgperhour
statsd:
remote.attempts.connmaxout.msgperhour
type:
counter

Number of remote delivery attempts that were a connmaxout due to exceeding a maximum delivery attempts per hour limit.

name:
remote_attempts_connmaxout_maxunacknowledged
statsd:
remote.attempts.connmaxout.maxunacknowledged
type:
counter

Number of remote delivery attempts that were a connmaxout due to exceeding the limit of maximum unacknowledged delivery requests to GreenArrow Proxy. This can happen if the GreenArrow Proxy host is under-performing (i.e. exhausting CPU resources) or there is a network problem between the MTA and the GreenArrow Proxy host.

name:
remote_attempts_connmaxout_proxy_{local,remote}
statsd:
remote.attempts.connmaxout.proxy.{local,remote}
type:
counter

Number of remote delivery attempts from IP Addresses using GreenArrow Proxy that resulted in a connmaxout, broken down by whether the delivery attempt request was fully dispatched to the throttle decision maker in GreenArrow Proxy or the connmaxout was determined locally without a network request.

Delivery attempts not made through GreenArrow Proxy do not report here.

name:
remote_delivery_attempts_active
statsd:
remote_delivery_attempts_active.outmta_{}.throttle_{}
type:
gauge

The number of remote delivery attempts that are currently active.

outmta

The name of the IP Address or Relay Server VirtualMTA that is currently being used for this delivery attempt.

throttle

The first (in order of definition) domain associated with the explicit throttling rule that is being used for this delivery attempt. If this delivery attempt is using a non-explicit throttle (i.e. it uses the default max concurrent connections & default max messages per hour) or is using a Relay Server, this value will be __default.

name:
remote_delivery_attempts_total
statsd:
remote_delivery_attempts_total.outmta_{}.throttle_{}.result_{}
type:
counter

The number of remote delivery attempts that have completed.

outmta

The name of the IP Address or Relay Server VirtualMTA that was used for the delivery attempt(s).

throttle

The first (in order of definition) domain associated with the explicit throttling rule that was used for the delivery attempt(s). For delivery attempts that used a non-explicit throttle (i.e. the default max concurrent connections & default max messages per hour) or a Relay Server, this value will be __default.

result

The result of the delivery attempt(s). This may be one of the following: success, failure, deferral, pause, dump, connmaxout, unknown_report_type, or unknown_report_code.
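Because this is a counter, the useful quantity is usually the change between two readings. The sketch below computes each result's share of the attempts completed between two scrapes; it is illustrative only, uses hypothetical names, and does not handle counter resets (for example, after a service restart).

```python
def result_shares(before, after):
    """Return {result: share of attempts} for the interval between two scrapes.

    `before` and `after` map a result label value (e.g. "success") to the
    remote_delivery_attempts_total counter value at each scrape.
    """
    # A result absent from the earlier scrape is treated as starting at zero.
    deltas = {result: after[result] - before.get(result, 0) for result in after}
    total = sum(deltas.values())
    if total == 0:
        return {}
    return {result: delta / total for result, delta in deltas.items()}
```

For example, going from 100 successes and 20 deferrals to 190 successes and 30 deferrals gives a 0.9 success share and a 0.1 deferral share for that interval.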

name:
remote_dslots_{ram,bounce,disk}_delivering
statsd:
remote.dslots.{ram,bounce,disk}.delivering
type:
gauge

Number of remote delivery slots used by messages undergoing a delivery attempt. This includes: DNS lookups, establishing the SMTP connection, transferring the message, waiting for a response, etc.

name:
remote_dslots_{ram,bounce,disk}_diskwait
statsd:
remote.dslots.{ram,bounce,disk}.diskwait
type:
gauge

Number of remote delivery slots used by messages that are waiting to be moved to the disk queue after a deferral.

Note: when IO load due to writing messages to the disk queue is slowing the system down, this number will increase.

name:
remote_dslots_{ram,bounce,disk}_free
statsd:
remote.dslots.{ram,bounce,disk}.free
type:
gauge

Number of remote delivery slots that are unused.

name:
remote_dslots_{ram,bounce,disk}_throttlewait
statsd:
remote.dslots.{ram,bounce,disk}.throttlewait
type:
gauge

Number of remote delivery slots used by messages that are waiting, due to throttle rules, to be allowed to make a delivery attempt.

name:
remote_dslots_{ram,bounce,disk}_total
statsd:
remote.dslots.{ram,bounce,disk}.total
type:
gauge

Number of remote delivery slots available in the queue. This is the value of the queue.{ram,bounce,disk}.concurrencyremote setting.

name:
remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
statsd:
remote.dslots.{ram,bounce,disk}.throttlewait.unacknowledged
type:
gauge

Number of remote delivery attempts that have started but for which this MTA has not yet determined (either locally or from GreenArrow Proxy) whether the delivery attempt may begin.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count

name:
remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
statsd:
remote.dslots.{ram,bounce,disk}.throttlewait.acknowledged
type:
gauge

Number of remote delivery attempts that have been placed into the backlog, waiting for an opportunity for delivery.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
remote_throttle_backlog_backlogged

name:
remote_messages_disk_queue_add_total
statsd:
remote.messages.disk_queue.add.count
type:
counter

Number of messages moved to the disk queue.

name:
remote_messages_new_total
statsd:
remote.messages.new.count
type:
counter

New remote messages queued in the system.

name:
delivery_probe_latency_seconds
statsd:
delivery_probe_latency_milliseconds
type:
histogram

Timing (in seconds) of how long it takes a message to make its way from the HTTP Submission API to the point at which GreenArrow is ready to establish a remote network connection for delivery.

This includes the HTTP Submission API SimpleMH processing queue, which is measured with simplemh_http_queue_latency_seconds, and the message progressing through the scheduling/throttling queues and systems in GreenArrow.

This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.

This metric is disabled by default and can be enabled using export_metric.

name:
remote_throttle_unacknowledged_requests_count
statsd:
remote_throttle_unacknowledged_requests_count.queue_{ram,bounce,disk}.proxy_{}
type:
gauge

Number of unacknowledged delivery attempt requests that are currently in-flight to this GreenArrow Proxy.

This key is also emitted with queue omitted. In this case, the value is a sum of the three queues (ram/bounce/disk).

A delivery attempt request is considered to be acknowledged when the MTA hears back a response of either “begin delivery”, “connmaxout”, or “placed in backlog queue” from GreenArrow Proxy.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count

name:
remote_throttle_unacknowledged_requests_max
statsd:
remote_throttle_unacknowledged_requests_max.queue_{ram,bounce,disk}.proxy_{}
type:
gauge

The maximum number of unacknowledged delivery attempt requests that can be in-flight at the same time for this queue (ram/bounce/disk) to this GreenArrow Proxy.

This value is normally dynamically calculated, but can be overridden using greenarrow_proxy_max_unacknowledged_requests.

name:
remote_throttle_backlog_unacknowledged
statsd:
remote_throttle_backlog_unacknowledged.proxy_{}
type:
gauge

Number of unacknowledged delivery attempt requests that are currently in-flight to this GreenArrow Proxy, as seen by the component that intercepts requests that would receive connmaxout due to backlog capacity.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_unacknowledged
remote_throttle_backlog_unacknowledged
remote_throttle_unacknowledged_requests_count

name:
remote_throttle_backlog_backlogged
statsd:
remote_throttle_backlog_backlogged.proxy_{}
type:
gauge

Number of delivery attempt requests that are currently “in the backlog” for this GreenArrow Proxy, as seen by the component that intercepts requests that would receive connmaxout due to backlog capacity.

These metrics are related, representing the same information but from different perspectives:

remote_dslots_{ram,bounce,disk}_throttlewait_acknowledged
remote_throttle_backlog_backlogged

name:
remote_throttle_ping_duration_seconds
statsd:
remote_throttle_ping_duration_milliseconds.proxy_{}
type:
histogram

Duration of how long it takes for a simple network request to be exchanged with GreenArrow Proxy. High values here can indicate a problem with network communication to GreenArrow Proxy.

name:
remote_throttle_round_trip_duration_seconds
statsd:
remote_throttle_round_trip_duration_milliseconds.proxy_{}
type:
histogram

Duration of how long it takes for a request to make it through GreenArrow Proxy. High values here can indicate a performance bottleneck on GreenArrow Proxy.

HTTP Processing

name:
http_request_duration_seconds
statsd:
http_request_duration_milliseconds.category_{}
type:
histogram

Timing of HTTP requests that have been processed by GreenArrow.

category

The category of HTTP request.

Possible values include: click, open, inject, engine_stats, engine_api, engine_ui, and other

Incoming SMTP

name:
incoming_smtp_connections_active
statsd:
incoming.smtp.connections.active
type:
gauge

Number of open incoming SMTP connections.

service

The GreenArrow SMTPD service that this gauge represents.

Possible values include: smtpd1, smtpd2, smtpd3

name:
incoming_smtp_connections_max
statsd:
incoming.smtp.connections.max
type:
gauge

Maximum number of open incoming SMTP connections supported by this SMTPD service.

service

The GreenArrow SMTPD service that this gauge represents.

Possible values include: smtpd1, smtpd2, smtpd3

name:
incoming_smtp_connections_total
statsd:
incoming.smtp.connections.total
type:
counter

Total number of new SMTP connections that have been established to this SMTPD service.

service

The GreenArrow SMTPD service that this counter represents.

Possible values include: smtpd1, smtpd2, smtpd3

Remote SMTP Deliveries

The following metrics are exposed by both GreenArrow (MTA) and by GreenArrow Proxy.

  • When published by the MTA, they describe SMTP connections/deliveries that are made directly from the MTA, without using GreenArrow Proxy.
  • When published by GreenArrow Proxy, they describe SMTP connections/deliveries that are made from that GreenArrow Proxy instance.

The result is that when aggregating these datapoints for a cluster, you get the total number of outgoing SMTP connections/deliveries made from the entire cluster, with no duplicate values counted.
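Since each connection is reported by exactly one instance, cluster totals can be produced by plain summation. The following Python sketch sums readings per source IP across instances; the function name, the tuple layout, and the instance names in the usage note are hypothetical.

```python
from collections import defaultdict


def aggregate_by_source_ip(samples):
    """Sum metric values per source_ip across all reporting instances.

    `samples` is a list of (instance, source_ip, value) tuples, where
    `instance` is either an MTA or a GreenArrow Proxy host.
    """
    totals = defaultdict(int)
    for _instance, source_ip, value in samples:
        # No de-duplication is needed: the MTA reports only direct
        # connections and each proxy reports only its own.
        totals[source_ip] += value
    return dict(totals)
```

For example, readings of 3 from an MTA and 4 from a proxy for the same source IP combine to 7 for that address.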

name:
remote_connection_new_total
statsd:
remote.connection.new.total
type:
counter

Number of new remote SMTP connections that have been successfully opened.

source_ip

The source IP address of this network connection.

name:
remote_connection_failed_total
statsd:
remote.connection.failed.total
type:
counter

Number of new remote SMTP connections that have failed to successfully open (e.g. if the connection was refused, or a network error).

source_ip

The source IP address of this network connection.

name:
remote_connections_active
statsd:
remote.connections.active
type:
gauge

Number of remote connections that are currently open.

source_ip

The source IP address of this network connection.

name:
remote_connection_reused_total
statsd:
remote.connection.reused.total
type:
counter

Number of times open SMTP connections have been reused.

source_ip

The source IP address of this network connection.

name:
remote_throttle_sessions
statsd:
remote.throttle.sessions
type:
gauge

When emitted by GreenArrow Proxy, this is the number of GreenArrow instances that are currently connected to this GreenArrow Proxy.

When emitted by other GreenArrow instances, this is always 1.

Bounce and FBL Processing

name:
bounce_message_processed_total
statsd:
bounce.message.processed.total
type:
counter

Number of bounce messages that have been processed by the bounce processor.

name:
fbl_message_processed_total
statsd:
fbl.message.processed.total
type:
counter

Number of FBL messages that have been processed by the FBL processor.

name:
lite_bounce_message_processed_total
statsd:
lite.bounce.message.processed.total
type:
counter

Number of bounce messages that have been processed by the lite bounce processor.

name:
lite_fbl_message_processed_total
statsd:
lite_fbl_message_processed_total
type:
counter

Number of FBL messages that have been processed by the lite FBL processor.

Event Processor

name:
event_delivery_ready_latency_seconds
statsd:
event_delivery_ready_latency_milliseconds.destination_{}
type:
gauge

Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. For events that need to be retried, this is the time it takes from when the retry is scheduled until it is ready to deliver to its destination.

destination

The name of the event delivery destination.

name:
event_delivery_first_attempt_ready_latency_seconds
statsd:
event_delivery_first_attempt_ready_latency_seconds.destination_{}
type:
gauge

Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. Does not include events that are being retried because they could not be delivered on their first attempt.

destination

The name of the event delivery destination.

name:
event_delivery_submission_latency_seconds
statsd:
event_delivery_submission_latency_milliseconds.destination_{}
type:
gauge

Timing (in seconds) of how long it takes to submit one event batch to the destination.

This metric is not written for event_delivery_logfile destinations.

destination

The name of the event delivery destination.

name:
event_delivery_delivered_total
statsd:
event_delivery_delivered_total.destination_{}
type:
counter

Number of events that were successfully delivered to the destination.

destination

The name of the event delivery destination.

name:
event_delivery_failed_total
statsd:
event_delivery_failed_total.destination_{}
type:
counter

Number of events that failed in delivery to the destination.

destination

The name of the event delivery destination.

name:
event_delivery_network_failed_total
statsd:
event_delivery_network_failed_total.destination_{}
type:
counter

Number of events that failed in delivery to the destination due to a network error (as opposed to, for example, an HTTP 404 response).

destination

The name of the event delivery destination.

name:
event_delivery_read_operations_total
statsd:
event_delivery_read_operations_total.destination_{}
type:
counter

Number of read operations that have been completed against the GreenArrow events table.

destination

The name of the event delivery destination.

name:
event_delivery_queue_first_attempt_count
statsd:
event_delivery_queue_first_attempt_count.destination_{}
type:
gauge

Number of events in the queue for this destination that have not yet received a delivery attempt.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.

name:
event_delivery_queue_retry_count
statsd:
event_delivery_queue_retry_count.destination_{}
type:
gauge

Number of events in the queue for this destination that have received at least one delivery attempt.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.

name:
event_delivery_queue_first_attempt_age_seconds
statsd:
event_delivery_queue_first_attempt_age_seconds.destination_{}
type:
gauge

Age (in seconds) of the oldest event waiting for its first delivery attempt. If there are no such events in the queue, this is set to 0.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.

name:
event_delivery_queue_retry_age_seconds
statsd:
event_delivery_queue_retry_age_seconds.destination_{}
type:
gauge

Age (in seconds) of the oldest event waiting for a retry. If there are no such events in the queue, this is set to 0.

This is calculated approximately once per minute; longer queues will increase the duration between calculations.

destination

The name of the event delivery destination.


Copyright © 2012–2025 GreenArrow Email