GreenArrow Email Software Documentation

Metrics Server Integration

Introduction

GreenArrow integrates with both StatsD and Prometheus. Whichever integration you choose (or both), you’ll receive the same telemetry data from GreenArrow.

StatsD Integration

Configuration

Configuration is done using the following directives:

Prometheus Integration

Configuration

Configuration is done using the prometheus_listen directive.

Timeouts

The HTTP server that is bound to prometheus_listen is configured with 60-second timeouts. If a Prometheus request takes longer than this to fulfill, the client will receive either 408 Request Timeout or 500 Internal Server Error.

Multilog

Every 5 seconds, the hvmail-metrics service will log a summary of recent metrics it has processed. This can be monitored using tail:

tail -F /var/hvmail/log/metrics/current | tai64nlocal

For more information on GreenArrow’s logs, see Service Logs.

GreenArrow Telemetry

The metric keys below are defined as provided to Prometheus. If a metric has any labels associated with it, they are defined in a list within the metric definition.

When using StatsD, be aware:

  • Any labels are appended to the key name, prefixed by the label name and an underscore, with non-alphanumeric characters replaced by underscore. For example, in the remote_delivery_attempts_active metric, a delivery attempt being made to example.com on the outmta ipaddr-1 would be sent as remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com.
  • Timers (e.g. http_request_duration_seconds) are recorded in milliseconds, not seconds (contrary to these metric names).
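
The two StatsD rules above can be sketched in Python. This is a minimal illustration, not GreenArrow's implementation: the function names statsd_key and timer_ms_to_seconds are hypothetical, and the sketch assumes labels are appended in the order shown in each metric's statsd key.

```python
import re
from typing import Mapping


def statsd_key(base_key: str, labels: Mapping[str, str]) -> str:
    """Build a StatsD key by appending `<label name>_<label value>` segments.

    Non-alphanumeric characters in each label value are replaced by
    underscores, as described in the list above.
    """
    parts = [base_key]
    for name, value in labels.items():
        sanitized = re.sub(r"[^A-Za-z0-9]", "_", value)
        parts.append(f"{name}_{sanitized}")
    return ".".join(parts)


def timer_ms_to_seconds(milliseconds: float) -> float:
    """Convert a StatsD timer value (milliseconds) to seconds."""
    return milliseconds / 1000.0


# The example from the text: outmta "ipaddr-1" delivering to "example.com".
print(statsd_key(
    "remote_delivery_attempts_active",
    {"outmta": "ipaddr-1", "throttle": "example.com"},
))
# prints: remote_delivery_attempts_active.outmta_ipaddr_1.throttle_example_com
```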

Queue Status

name:
ram_queue_message_batches
statsd:
ram.queue.message.batches
type:
gauge

Number of message batches currently in the RAM queue (first delivery attempts).

name:
ram_queue_message_batches_max
statsd:
ram.queue.message.batches.max
type:
gauge

Maximum number of message batches that will fit in the RAM queue (first delivery attempts).

name:
ram_queue_bytes
statsd:
ram.queue.bytes
type:
gauge

Total size (in bytes) of messages currently in the RAM queue (first delivery attempts).

name:
ram_queue_bytes_max
statsd:
ram.queue.bytes.max
type:
gauge

Maximum size (in bytes) of messages that will fit in the RAM queue (first delivery attempts).
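
Each current-value gauge in this section has a matching _max gauge, so one way to use them is to compute how full a queue is. The Python sketch below is only an illustration: the function names are hypothetical, the samples dictionary stands in for values already fetched from your metrics server, and taking the larger of the two ratios is an assumption about which limit matters first.

```python
from typing import Mapping


def queue_utilization(current: float, maximum: float) -> float:
    """Return the fraction of capacity in use (0.0 if the maximum is not positive)."""
    if maximum <= 0:
        return 0.0
    return current / maximum


def ram_queue_pressure(samples: Mapping[str, float]) -> float:
    """Return the higher of the batch-count and byte-size utilization of the RAM queue."""
    by_batches = queue_utilization(
        samples["ram_queue_message_batches"],
        samples["ram_queue_message_batches_max"],
    )
    by_bytes = queue_utilization(
        samples["ram_queue_bytes"],
        samples["ram_queue_bytes_max"],
    )
    return max(by_batches, by_bytes)


# Made-up sample values, for illustration only.
example = {
    "ram_queue_message_batches": 500,
    "ram_queue_message_batches_max": 2000,
    "ram_queue_bytes": 750,
    "ram_queue_bytes_max": 1000,
}
print(ram_queue_pressure(example))  # prints: 0.75
```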

name:
simplemh_http_queue_messages
statsd:
simplemh.http.queue.messages
type:
gauge

Number of messages currently in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_messages_max
statsd:
simplemh.http.queue.messages.max
type:
gauge

Maximum number of messages that will fit in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_bytes
statsd:
simplemh.http.queue.bytes
type:
gauge

Total size (in bytes) of messages currently in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_bytes_max
statsd:
simplemh.http.queue.bytes.max
type:
gauge

Maximum size (in bytes) of messages that will fit in the HTTP Submission API SimpleMH processing queue.

name:
simplemh_http_queue_latency_seconds
statsd:
simplemh.http.queue.latency.seconds
type:
histogram

Timing (in seconds) of how long it takes a message to make its way through the HTTP Submission API SimpleMH processing queue.

This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.

This metric is disabled by default and can be enabled using export_metric.
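
The alerting advice above (alert when the metric is missing, and when it is too high) can be sketched as a small Python check. Only the 15-second probe interval comes from the text. The function name probe_alert, the 5-second latency threshold, and the allowance of four missed probes are hypothetical values chosen for illustration.

```python
from typing import Optional

PROBE_INTERVAL_SECONDS = 15  # from the text: a probe is injected every 15 seconds


def probe_alert(
    seconds_since_last_sample: Optional[float],
    latency_seconds: Optional[float],
    max_latency_seconds: float = 5.0,   # hypothetical threshold
    missed_probes_allowed: int = 4,     # hypothetical tolerance
) -> Optional[str]:
    """Return "missing", "high_latency", or None when the probe looks healthy."""
    stale_after = PROBE_INTERVAL_SECONDS * missed_probes_allowed
    # No sample at all, or no recent sample: the probe itself is not getting through.
    if seconds_since_last_sample is None or seconds_since_last_sample > stale_after:
        return "missing"
    # A recent sample exists but took too long to pass through the queue.
    if latency_seconds is not None and latency_seconds > max_latency_seconds:
        return "high_latency"
    return None


print(probe_alert(None, None))   # prints: missing
print(probe_alert(10, 12.0))     # prints: high_latency
print(probe_alert(10, 0.2))      # prints: None
```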

name:
raw_nowait_queue_messages
statsd:
raw.nowait.queue.messages
type:
gauge

Number of messages currently in the no-wait message queue. This queue is used for notifications generated by GreenArrow.

name:
bounce_processor_queue_messages
statsd:
bounce.processor.queue.messages
type:
gauge

Number of bounces currently waiting to be processed.

name:
bounce_processor_queue_messages_soft_max
statsd:
bounce.processor.queue.messages.soft.max
type:
gauge

Maximum number of bounces that will fit in the bounce processing queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the bounce processor a chance to catch up.

name:
bounce_processor_queue_messages_max
statsd:
bounce.processor.queue.messages.max
type:
gauge

Maximum number of bounces that will fit in the bounce processing queue.

name:
fbl_processor_queue_messages
statsd:
fbl.processor.queue.messages
type:
gauge

Number of FBL messages currently waiting to be processed.

name:
fbl_processor_queue_messages_soft_max
statsd:
fbl.processor.queue.messages.soft.max
type:
gauge

Maximum number of FBL messages that will fit in the FBL processor queue before FBL message injection will slow down. This represents a back-pressure mechanism to give the FBL processor a chance to catch up.

name:
fbl_processor_queue_messages_max
statsd:
fbl.processor.queue.messages.max
type:
gauge

Maximum number of FBL messages that will fit in the FBL processor queue.

name:
lite_bounce_processor_queue_messages
statsd:
lite.bounce.processor.queue.messages
type:
gauge

Number of bounce messages and FBL notifications waiting to be processed by the Lite Bounce Processor. This queue is filled by recipients matching the lite_bounce_processor_address and lite_fbl_processor_address directives. This includes messages that will generate bounce_lite and scomp_lite events.

name:
lite_bounce_processor_queue_messages_soft_max
statsd:
lite.bounce.processor.queue.messages.soft.max
type:
gauge

Maximum number of bounces that will fit in the Lite Bounce Processor queue before bounce message injection will slow down. This represents a back-pressure mechanism to give the Lite Bounce Processor a chance to catch up.

name:
lite_bounce_processor_queue_messages_max
statsd:
lite.bounce.processor.queue.messages.max
type:
gauge

Maximum number of bounces that will fit in the Lite Bounce Processor queue.

name:
incoming_drain_queue_batches
statsd:
incoming.drain.queue.batches
type:
gauge

Number of message batches waiting to be injected into this instance that were drained from another instance.

name:
incoming_drain_queue_batches_max
statsd:
incoming.drain.queue.batches.max
type:
gauge

Maximum number of drained message batches that can fit in the incoming drain queue.

name:
disk_queue_catchup_percentage
statsd:
disk.queue.catchup.percentage
type:
gauge

Fraction of time (ranging from 0.0 to 1.0, where 1.0 means 100%) that the disk queue has spent in catch-up mode. Catch-up mode represents degraded disk queue performance and this can result in significant back-pressure on message injection.

name:
disk_queue_scheduling_delay_seconds
statsd:
disk.queue.scheduling.delay.seconds
type:
gauge

The difference between the current time and the next scheduled retry (in seconds). If the next scheduled retry is in the future, this gauge is zero and the disk queue is caught up on scheduling. If the next scheduled retry is in the past, this value is non-zero, meaning that the disk queue is behind in scheduling message delivery attempts.
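
The definition above amounts to a difference clamped at zero. A minimal Python sketch, assuming both times are expressed as Unix timestamps in seconds (the function name is hypothetical and this is not GreenArrow's code):

```python
def scheduling_delay_seconds(now: float, next_scheduled_retry: float) -> float:
    """Return how far behind schedule the disk queue is, never less than zero."""
    return max(0.0, now - next_scheduled_retry)


# Next retry is 30 seconds in the future: the queue is caught up.
print(scheduling_delay_seconds(1000.0, 1030.0))  # prints: 0.0

# Next retry was due 60 seconds ago: the queue is 60 seconds behind.
print(scheduling_delay_seconds(1000.0, 940.0))   # prints: 60.0
```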

Remote Delivery

name:
remote_attempts_connmaxout
statsd:
remote.attempts.connmaxout
type:
counter

Number of remote delivery attempts that are cancelled/rescheduled because they would exceed the throttle limits for a domain. This includes when this happens on the last delivery attempt of the message, which causes the message to bounce.

name:
remote_attempts_deferral
statsd:
remote.attempts.deferral
type:
counter

Number of remote delivery attempts that result in a deferral. This includes deferrals on the last delivery attempt which cause messages to be bounced.

name:
remote_attempts_dump
statsd:
remote.attempts.dump
type:
counter

Number of remote delivery attempts that result in the message being dumped from the queue (due to the “dump messages from queue” feature).

name:
remote_attempts_failure
statsd:
remote.attempts.failure
type:
counter

Number of remote delivery attempts that result in a failure.

name:
remote_attempts_success
statsd:
remote.attempts.success
type:
counter

Number of remote delivery attempts that succeed.

name:
remote_delivery_attempts_active
statsd:
remote_delivery_attempts_active.outmta_{}.throttle_{}
type:
gauge

The number of remote delivery attempts that are currently active.

outmta

The name of the IP Address or Relay Server VirtualMTA that is currently being used for this delivery attempt.

throttle

The first (in order of definition) domain associated with the explicit throttling rule that is being used for this delivery attempt. If this delivery attempt is using a non-explicit throttle (i.e. it uses the default max concurrent connections & default max messages per hour) or is using a Relay Server, this value will be __default.

name:
remote_delivery_attempts_total
statsd:
remote_delivery_attempts_total.outmta_{}.throttle_{}.result_{}.smtp_status_code_{}
type:
counter

The number of remote delivery attempts that have completed.

outmta

The name of the IP Address or Relay Server VirtualMTA that was used for the delivery attempt(s).

throttle

The first (in order of definition) domain associated with the explicit throttling rule that was used for the delivery attempt(s). For delivery attempts that used a non-explicit throttle (i.e. the default max concurrent connections & default max messages per hour) or a Relay Server, this value will be __default.

result

The result of these delivery attempt(s). This may be one of the following: success, failure, deferral, pause, dump, connmaxout, unknown_report_type, or unknown_report_code.

smtp_status_code

The 3-digit numeric SMTP status code for these delivery attempt(s).
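
Because this counter is split by label, a common use is to collapse it by the result label. The Python sketch below assumes the samples have already been retrieved as (labels, count) pairs. The function name result_share and the sample values are hypothetical. Since Prometheus counters are cumulative, passing the increase between two scrapes rather than raw values gives the shares for that interval.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional, Tuple

Sample = Tuple[Mapping[str, str], float]


def result_share(samples: Iterable[Sample], outmta: Optional[str] = None) -> Dict[str, float]:
    """Return each result's share of completed delivery attempts.

    If `outmta` is given, only samples carrying that outmta label are counted.
    """
    totals: Dict[str, float] = defaultdict(float)
    for labels, count in samples:
        if outmta is not None and labels.get("outmta") != outmta:
            continue
        totals[labels["result"]] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {result: count / grand_total for result, count in totals.items()}


# Made-up sample data, for illustration only.
samples = [
    ({"outmta": "ipaddr-1", "throttle": "example.com", "result": "success", "smtp_status_code": "250"}, 90),
    ({"outmta": "ipaddr-1", "throttle": "example.com", "result": "deferral", "smtp_status_code": "421"}, 10),
    ({"outmta": "ipaddr-2", "throttle": "__default", "result": "failure", "smtp_status_code": "550"}, 100),
]
print(result_share(samples, outmta="ipaddr-1"))
# prints: {'success': 0.9, 'deferral': 0.1}
```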

name:
remote_dslots_{ram,bounce,disk}_delivering
statsd:
remote.dslots.{ram,bounce,disk}.delivering
type:
gauge

Number of remote delivery slots used by messages undergoing a delivery attempt. This includes: DNS lookups, establishing the SMTP connection, transferring the message, waiting for a response, etc.

name:
remote_dslots_{ram,bounce,disk}_diskwait
statsd:
remote.dslots.{ram,bounce,disk}.diskwait
type:
gauge

Number of remote delivery slots used by messages that are waiting to be moved to the disk queue after a deferral.

Note: when IO load due to writing messages to the disk queue is slowing the system down, this number will increase.

name:
remote_dslots_{ram,bounce,disk}_free
statsd:
remote.dslots.{ram,bounce,disk}.free
type:
gauge

Number of remote delivery slots that are unused.

name:
remote_dslots_{ram,bounce,disk}_throttlewait
statsd:
remote.dslots.{ram,bounce,disk}.throttlewait
type:
gauge

Number of remote delivery slots used by messages that are waiting, due to throttle rules, to be allowed to make a delivery attempt.

name:
remote_dslots_{ram,bounce,disk}_total
statsd:
remote.dslots.{ram,bounce,disk}.total
type:
gauge

Number of remote delivery slots available in the queue. This is the value of the queue.{ram,bounce,disk}.concurrencyremote setting.
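
The free and total gauges can be combined into a per-queue slot utilization figure. The Python sketch below is an illustration only: the function name dslot_utilization is hypothetical, and the samples dictionary stands in for gauge values you have already collected under their Prometheus names.

```python
from typing import Mapping

QUEUES = ("ram", "bounce", "disk")  # the three queues named in the metric keys


def dslot_utilization(samples: Mapping[str, float], queue: str) -> float:
    """Return the fraction of remote delivery slots in use for one queue."""
    if queue not in QUEUES:
        raise ValueError(f"unknown queue: {queue!r}")
    total = samples[f"remote_dslots_{queue}_total"]
    free = samples[f"remote_dslots_{queue}_free"]
    if total <= 0:
        return 0.0
    return 1.0 - (free / total)


# Made-up sample values, for illustration only.
example = {"remote_dslots_ram_total": 100, "remote_dslots_ram_free": 25}
print(dslot_utilization(example, "ram"))  # prints: 0.75
```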

name:
remote_messages_disk_queue_add_total
statsd:
remote.messages.disk_queue.add.count
type:
counter

Number of messages moved to the disk queue.

name:
remote_messages_new_total
statsd:
remote.messages.new.count
type:
counter

New remote messages queued in the system.

name:
delivery_probe_latency_seconds
statsd:
delivery_probe_latency_seconds
type:
histogram

Timing (in seconds) of how long it takes a message to make its way from the HTTP Submission API to the point at which GreenArrow is ready to establish a remote network connection for delivery.

This includes the HTTP Submission API SimpleMH processing queue, which is measured with simplemh_http_queue_latency_seconds, and the message progressing through the scheduling/throttling queues and systems in GreenArrow.

This is measured using a probe message that is injected every 15 seconds and is not dependent on messages being injected. This means that alerting on the non-existence of this metric is helpful, as is alerting on it being excessively high.

This metric is disabled by default and can be enabled using export_metric.

HTTP Processing

name:
http_request_duration_seconds
statsd:
http.request.duration.seconds.category_{}
type:
histogram

Timing of HTTP requests that have been processed by GreenArrow.

category

The category of HTTP request.

Possible values include: click, open, inject, engine_stats, engine_api, engine_ui, and other.

Incoming SMTP

name:
incoming_smtp_connections_active
statsd:
incoming.smtp.connections.active
type:
gauge

Number of open incoming SMTP connections.

service

The GreenArrow SMTPD service that this gauge represents.

Possible values include: smtpd1, smtpd2, smtpd3.

name:
incoming_smtp_connections_max
statsd:
incoming.smtp.connections.max
type:
gauge

Maximum number of open incoming SMTP connections supported by this SMTPD service.

service

The GreenArrow SMTPD service that this gauge represents.

Possible values include: smtpd1, smtpd2, smtpd3.

name:
incoming_smtp_connections_total
statsd:
incoming.smtp.connections.total
type:
counter

Total number of new SMTP connections that have been established to this SMTPD service.

service

The GreenArrow SMTPD service that this counter represents.

Possible values include: smtpd1, smtpd2, smtpd3.

Remote SMTP Deliveries

The following metrics are exposed by both GreenArrow (MTA) and by GreenArrow Proxy.

  • When published by the MTA, these metrics describe SMTP connections/deliveries that are made directly from the MTA, without using GreenArrow Proxy.
  • When published by GreenArrow Proxy, these metrics describe SMTP connections/deliveries that are made from that GreenArrow Proxy instance.

The result is that when aggregating these datapoints for a cluster, you get the total number of outgoing SMTP connections/deliveries made from the entire cluster, with no duplicate values counted.
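
Since each connection is reported by exactly one publisher, cluster-wide totals are a plain sum across instances. A minimal Python sketch, assuming the per-instance values have already been collected into a nested dictionary keyed by instance name and then by source_ip (the function name, instance names, and addresses are all hypothetical):

```python
from collections import Counter
from typing import Dict, Mapping


def cluster_totals(per_instance: Mapping[str, Mapping[str, float]]) -> Dict[str, float]:
    """Sum one metric across all MTA and Proxy instances, grouped by source_ip."""
    totals: Counter = Counter()
    for by_source_ip in per_instance.values():
        for source_ip, value in by_source_ip.items():
            totals[source_ip] += value
    return dict(totals)


# Made-up remote_connection_new_total values, for illustration only.
scraped = {
    "mta-1": {"192.0.2.10": 5},
    "proxy-1": {"192.0.2.20": 7, "192.0.2.21": 3},
}
totals = cluster_totals(scraped)
print(sum(totals.values()))  # prints: 15
```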

name:
remote_connection_new_total
statsd:
remote.connection.new.total
type:
counter

Number of new remote SMTP connections that have been successfully opened.

source_ip

The source IP address of this network connection.

name:
remote_connection_failed_total
statsd:
remote.connection.failed.total
type:
counter

Number of new remote SMTP connections that have failed to successfully open (e.g. if the connection was refused, or a network error).

source_ip

The source IP address of this network connection.

name:
remote_connections_active
statsd:
remote.connections.active
type:
gauge

Number of remote connections that are currently open.

source_ip

The source IP address of this network connection.

name:
remote_connection_reused_total
statsd:
remote.connection.reused.total
type:
counter

Number of times open SMTP connections have been reused.

source_ip

The source IP address of this network connection.

name:
remote_throttle_sessions
statsd:
remote.throttle.sessions
type:
gauge

When emitted by GreenArrow Proxy, this is the number of GreenArrow instances that are currently connected to this GreenArrow Proxy.

When emitted by other GreenArrow instances, this is always 1.

Bounce and FBL Processing

name:
bounce_message_processed_total
statsd:
bounce.message.processed.total
type:
counter

Number of bounce messages that have been processed by the bounce processor.

name:
fbl_message_processed_total
statsd:
fbl.message.processed.total
type:
counter

Number of FBL messages that have been processed by the FBL processor.

name:
lite_bounce_message_processed_total
statsd:
lite.bounce.message.processed.total
type:
counter

Number of bounce messages that have been processed by the lite bounce processor.

name:
lite_fbl_message_processed_total
statsd:
lite.fbl.message.processed.total
type:
counter

Number of FBL messages that have been processed by the lite FBL processor.

Event Processor

name:
event_delivery_ready_latency_seconds
statsd:
event.delivery.ready.latency.seconds.destination_{}
type:
histogram

Timing (in seconds) of how long it takes from when an event is generated until the event processor is ready to deliver it to its destination. For events that need to be retried, this is the time it takes from when the retry is scheduled until it is ready to deliver to its destination.

destination

The name of the event delivery destination.

name:
event_delivery_submission_latency_seconds
statsd:
event.delivery.submission.latency.seconds.destination_{}
type:
histogram

Timing (in seconds) of how long it takes to submit one event batch to the destination.

This metric is not written for event_delivery_logfile destinations.

destination

The name of the event delivery destination.

name:
event_delivery_delivered_total
statsd:
event.delivery.delivered.total.destination_{}
type:
counter

Number of events that were successfully delivered to the destination.

destination

The name of the event delivery destination.

name:
event_delivery_failed_total
statsd:
event.delivery.failed.total.destination_{}
type:
counter

Number of events that failed in delivery to the destination.

destination

The name of the event delivery destination.
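
The delivered and failed counters can be combined into a failure ratio per destination. The Python sketch below rests on an assumption that the text does not state outright: that each event delivery outcome is counted in exactly one of event_delivery_delivered_total or event_delivery_failed_total. The function name is hypothetical, and the inputs should be increases over the same time window.

```python
def event_failure_ratio(delivered: float, failed: float) -> float:
    """Return the share of event delivery outcomes that were failures.

    Assumes `delivered` and `failed` are increases of
    event_delivery_delivered_total and event_delivery_failed_total
    over the same window and do not overlap.
    """
    outcomes = delivered + failed
    if outcomes <= 0:
        return 0.0
    return failed / outcomes


# Made-up window: 950 events delivered, 50 failed.
print(event_failure_ratio(950, 50))  # prints: 0.05
```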

name:
event_delivery_network_failed_total
statsd:
event.delivery.network.failed.total.destination_{}
type:
counter

Number of events that failed in delivery to the destination due to a network error (as opposed to, for example, an HTTP 404 response).

destination

The name of the event delivery destination.

name:
event_delivery_read_operations_total
statsd:
event.delivery.read.operations.total.destination_{}
type:
counter

Number of read operations that have been completed against the GreenArrow events table.

destination

The name of the event delivery destination.


Copyright © 2012–2025 GreenArrow Email