An interview with Klaviyo's MTA team

This blog post is an interview with John Moody and Liam Kelly from Klaviyo about deploying and running the GreenArrow Engine MTA at scale. Klaviyo’s modern and intuitive SaaS platform enables business users of any skill level to harness their first-party data from more than 350 integrations to send the right message at the right time across email, SMS, and push notifications. John Moody is the engineering manager of the Klaviyo MTA team. He has been writing code and architecting systems (and leading teams that write code and architect systems) for the past 30 years. Liam Kelly is the Lead Software Engineer for the architecture and build-out of the Klaviyo MTA team, and has over 20 years of experience deploying and managing MTAs in the Email Service Provider (ESP) industry.

Why did Klaviyo decide to run their own MTA?

John Moody: Basically we have this saying at Klaviyo – we empower creators to own their destiny. That is true of us as well: email is such a big part of what we do, we didn’t want to have an external dependency there if we could help it.

Liam Kelly: We had been using a couple of third-party mail vendors to do all of our mail delivery, I think, since Klaviyo’s inception. There were a few main things. Ownership was really the big part of it: the fact that we would own the infrastructure – own the whole thing from message inception to message delivery – was, just as a philosophical goal, something that Klaviyo has always emphasized. Obviously cost is a factor as well.

And then the other, more esoteric parts are things like visibility. If we deliver a message through a third-party vendor, we have a record of the message handoff, but we don’t know anything about what happens to it until they send a webhook event back to us. So there’s just a lot more transparency in running our own MTA. We can see the retry attempts as they’re happening. We can intervene in the throttling rules and things like that as they’re going on. The notion is that better transparency and visibility ultimately give us better control over our deliverability. We can modify things in flight as they’re happening rather than waiting for a failure event to get sent back to us from a third party.

Why did you select GreenArrow as your MTA?

Liam Kelly: We created a whole matrix that we put in a spreadsheet that was an evaluation of features that were “nice to haves” and “must haves.” This included things like: DKIM signing, support for TLS, ingress APIs for SMTP and HTTP, performance, cluster support, configuration management, async bounce handling, feedback loop handling, VirtualMTA or IP Pool handling, engagement tracking (although we did not end up using it), proxy server support, monitoring tools, queue management, and installation packaging.

Based on comparing the feature sets of all of the MTA vendors that we evaluated using our matrix, we decided to do a P.O.C. with GreenArrow. Basically, GreenArrow had everything we needed out of the box. You guys had published numbers for throughput and all your docs out there. I didn’t have to go through pre-sales support staff, so I could just go down our matrix and say, yes, yes, yes, it has all these things. We built the P.O.C. with GreenArrow, and it did everything we needed it to do at a price point we could live with, so there wasn’t really a reason to build another P.O.C.

What is the high-level architecture of your GreenArrow deployment?

Liam Kelly: We host in AWS and use Terraform and Puppet – everything we do is infrastructure as code. We wrote Terraform scripts that manage building up the MTA instances.

Puppet is how we deploy our config files onto the instances. All of our configs are just in a git repo, so if we do a config change we do a pull request, merge it, and Puppet pushes it to the instances, runs a GreenArrow config refresh, and we’re good to go.

The Klaviyo application creates messages that are fully rendered and ready to send to recipients, which get dumped into a queue for GreenArrow. Our team has built a scalable set of consumers in Kubernetes that consume messages from that queue, feed them to the load balancer via regular port 25, and the load balancer feeds them to our GreenArrow instances – which are all in AWS, split across multiple availability zones.
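The hand-off from queue consumer to load balancer described above can be pictured with a short Python sketch. It is only an illustration, not Klaviyo's consumer code: the item fields (`mail_from`, `rcpt_to`, `raw_message`) and the `relay` and `connect` helpers are hypothetical names, and the only detail taken from the interview is that delivery to the load balancer happens over SMTP on port 25.

```python
import smtplib


def connect(host, port=25):
    """Open an SMTP session to the load balancer (host is supplied by the caller)."""
    return smtplib.SMTP(host, port)


def relay(queue_items, smtp):
    """Hand each fully rendered message to the MTA tier over SMTP.

    `smtp` is any object exposing smtplib.SMTP's sendmail() method, which
    makes the function easy to exercise with a stand-in object.
    """
    sent = 0
    for item in queue_items:
        # Messages arrive already rendered, so the raw bytes pass through as-is.
        # Field names here are hypothetical.
        smtp.sendmail(item["mail_from"], [item["rcpt_to"]], item["raw_message"])
        sent += 1
    return sent
```

A real consumer would presumably also acknowledge each queue item only after `sendmail` returns without raising, so that a failed hand-off is retried rather than lost.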

Because they are Terraform-managed, when we need to do something like a GreenArrow upgrade, we simply update the code that’s managing those clusters and do a rolling refresh of the cluster. It terminates all the running instances and spins up new ones with the new GreenArrow version. And, you know, we’re on our way.

Architecturally, we are able to do that because of the persistent path feature. That works because our EC2 instances have EBS volumes that store the mail queues, the logs, and everything else put in the persistent path. So when we need to reap an EC2 instance – because it is unhealthy, or we are upgrading it, or whatever – we can reattach the volume to the new instance, and it’s on its way: our mail queue is just there, and it just works. So that feature essentially enables us to do cloud-native instance management.

And then architecturally, all of those instances deliver through a tier of HAProxy nodes which have the sending IP addresses. So any piece of mail can go through any GreenArrow instance in any Availability Zone: they all know how to talk to every IP in the proxy tier. All of our IP pools have multiple IPs, which are again striped across multiple availability zones on the proxy tier. So, if US East 1A completely goes away, we still have mailers in the other three AZs, and we still have IPs hosted in HAProxy in the other three AZs. We can still deliver mail even if a whole Amazon availability zone were to die.
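To make the striping idea concrete, here is a minimal Python sketch. The IP addresses are documentation addresses, the AZ names other than us-east-1a are assumptions, and the `stripe` and `usable_ips` helpers are hypothetical; nothing here reflects Klaviyo's actual pool layout.

```python
# Hypothetical pool: each sending IP is hosted on a proxy node in one AZ.
POOL = {
    "192.0.2.10": "us-east-1a",
    "192.0.2.11": "us-east-1b",
    "192.0.2.12": "us-east-1c",
    "192.0.2.13": "us-east-1d",
}


def stripe(ips, azs):
    """Assign IPs to AZs round-robin so no single AZ hosts the whole pool."""
    return {ip: azs[i % len(azs)] for i, ip in enumerate(ips)}


def usable_ips(pool, failed_azs):
    """Return the sending IPs whose proxy node sits in a healthy AZ."""
    return sorted(ip for ip, az in pool.items() if az not in failed_azs)
```

With this layout, `usable_ips(POOL, {"us-east-1a"})` still returns three of the four addresses, which is the property the striping is meant to guarantee.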

For all of the operational logs, as well as the delivery attempt log, we run Filebeat on all the instances, which takes all of our logs and pushes them to Logstash. Then in Logstash we do a little bit of manipulation of the logs – things like sampling or filtering, because Splunk, as you know, is famously expensive. Then from there, we send logs from Logstash to both Splunk and S3. So in S3 we have the full set of logs, and in Splunk we have just the ones that we actually want to use in Splunk to do reporting – which is mainly the delivery attempt logs.
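The routing rule described here (everything to S3, only selected logs to Splunk) can be sketched as follows. This is plain Python rather than Logstash configuration, and the `log_type` and `message` fields, the `destinations` function, and the 10 percent default sample rate are all assumptions made for illustration.

```python
import zlib


def destinations(record, sample_percent=10):
    """Decide where one log record goes: S3 always, Splunk selectively."""
    targets = ["s3"]  # S3 keeps the full, unfiltered archive
    if record.get("log_type") == "delivery_attempt":
        # Delivery attempt logs drive the reporting, so they are always indexed.
        targets.append("splunk")
    else:
        # Deterministic sampling: hash the message so the same line always
        # gets the same decision, then keep roughly sample_percent of lines.
        bucket = zlib.crc32(record.get("message", "").encode()) % 100
        if bucket < sample_percent:
            targets.append("splunk")
    return targets
```

Hashing instead of calling a random number generator keeps the decision repeatable, which helps when comparing what landed in Splunk against the full archive in S3.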

In Splunk we aggregate delivery attempt logs across all of our MTAs and build multiple reporting dashboards, for example IP pool health. We can drill down by IP pool or mailbox provider, or see reports on which MX hosts we had trouble with. We can see how we are doing with just Google or Apple or everybody else using Splunk filter magic. We can do all that because the delivery attempt log is such a nicely formatted JSON log.
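The kind of drill-down described above is straightforward precisely because each log line is JSON. The sketch below shows the idea in Python rather than in Splunk's query language; the field names (`ip_pool`, `mx_host`, `status`) and the sample lines are invented for illustration and are not taken from the GreenArrow log format.

```python
import json
from collections import Counter

# Invented sample lines; real delivery attempt logs will have different fields.
SAMPLE = [
    '{"ip_pool": "pool-a", "mx_host": "mx.example.com", "status": "success"}',
    '{"ip_pool": "pool-a", "mx_host": "mx.example.com", "status": "deferral"}',
    '{"ip_pool": "pool-b", "mx_host": "mx.example.net", "status": "success"}',
]


def failure_rate_by(lines, key):
    """Group JSON log lines by `key` and return the non-success share per group."""
    totals, failures = Counter(), Counter()
    for line in lines:
        record = json.loads(line)
        group = record.get(key, "unknown")
        totals[group] += 1
        if record.get("status") != "success":
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}
```

Calling `failure_rate_by(SAMPLE, "ip_pool")` groups by pool, while passing `"mx_host"` gives the per-MX view from the same data.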

We also push events (such as delivery attempt logs, asynchronous bounces, FBL notifications) into SQLite files using the GreenArrow event processor. We’ve got a sidecar process that parses the SQLite data, and pushes that into the Klaviyo event ingest pipeline. And that’s where it then gets merged with the click tracking data that we get from our tracking service. And that all gets merged into a data lake that we then use to drive other kinds of reporting.
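A sidecar of this kind might look roughly like the Python sketch below. The `events` table, its columns, and the `drain_events` function are hypothetical; the interview does not describe the schema the event processor writes or how the real sidecar tracks its position, so treat this as one possible shape rather than the actual implementation.

```python
import json
import sqlite3


def drain_events(conn, last_id, publish, batch_size=500):
    """Forward events newer than `last_id` and return the new cursor position.

    Assumes a hypothetical table: events(id INTEGER PRIMARY KEY,
    event_type TEXT, payload TEXT) where payload holds a JSON document.
    """
    rows = conn.execute(
        "SELECT id, event_type, payload FROM events "
        "WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    for row_id, event_type, payload in rows:
        # `publish` stands in for whatever client feeds the ingest pipeline.
        publish({"type": event_type, "data": json.loads(payload)})
        last_id = row_id  # advance only after a successful publish
    return last_id
```

Persisting the returned cursor between runs would let the sidecar resume where it left off after a restart; if `publish` raises, the cursor is not advanced past the failed event.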

What has the reliability of GreenArrow been like?

Liam Kelly: Reliability has been rock solid. You know, I don’t think we have ever had an outage or even a page that was because of GreenArrow in the year and a half we have been in operation. We’ve gotten paged because of stuff on either side of it.

We still have to manage queues and all of that, but the mailer has been very hands off, which is a good thing. And because we haven’t had to put resources into managing the server itself – because it just works – we then can use those resources to manage things like the deliverability of the email going out. Because I don’t have to worry about the infrastructure, I can pay attention to the deliverability of the messages going over the infrastructure instead.

John Moody: We had something over the weekend where one of our instances crashed, but it was an AWS problem. But the system did what it was supposed to do: it went down, a new instance spun up, reattached the disk, and we did not lose any email.

What has the throughput performance of GreenArrow been like?

John Moody: We hit all of our targets for BFCM (Black Friday/Cyber Monday) for KMTA, and part of that was because GreenArrow was so darn easy to work with: both the platform and the people.

Liam Kelly: Our load testing matched the GreenArrow public documentation. In our throughput testing across 12 instances, we were able to achieve 50,000 messages per second. In our actual load testing we focus on throughput of messages per second rather than messages per hour, because I don’t want our customers to wait an hour for the mail to get delivered.
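For a sense of scale, the quoted figures can be converted with simple arithmetic. The sketch assumes the load is spread evenly across the 12 instances and that the 50,000 messages per second rate is sustained, neither of which the interview states; the one-million-message campaign is a made-up example.

```python
TOTAL_PER_SECOND = 50_000  # from the interview
INSTANCES = 12             # from the interview

# Assumes an even spread across instances.
per_instance_per_second = TOTAL_PER_SECOND / INSTANCES

# Assumes the rate could be sustained for a full hour.
per_hour = TOTAL_PER_SECOND * 3600


def seconds_to_send(message_count):
    """Time to push `message_count` messages at the aggregate rate."""
    return message_count / TOTAL_PER_SECOND


print(round(per_instance_per_second))  # about 4167 messages per second per instance
print(per_hour)                        # 180000000 messages per hour
print(seconds_to_send(1_000_000))      # 20.0 seconds for a one-million-message send
```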

What do you think of the GreenArrow documentation?

Liam Kelly: It is well organized, easy to find, and fairly comprehensive. And it seems to be maintained. My experience with other MTA vendors has been that if I want to find the answer to a question, I have to look at their docs, I have to look at a separate site that has their white papers, and I also have to search their forums. You know, there are multiple places, and what I’m looking for could be in one, two, or none of them. And then I end up having to write to their support to ask where to find a thing in the docs. With GreenArrow, usually I go to the docs page, it has the thing I need, and that’s the transaction – which is great.

What is your experience working with the GreenArrow company and support?

John Moody: Working with the GreenArrow team has been an absolute treat. They have been incredibly responsive to requests, needs, and problems.

Liam Kelly: It’s just outstanding. I find little to improve upon here, to be honest. GreenArrow has been really responsive to hearing about issues that we have had and really tenacious about solving them; nobody has tried to blow us off and tell us why it is not a problem like you sometimes get with vendors. GreenArrow seems to have a genuine desire to improve the product by way of hearing feedback from the users. It has felt, at least from my end, like we are kind of in a partnership with GreenArrow, not just a purchaser of a product from GreenArrow.