This article was originally authored by Srinath Ananthakrishnan, an engineer on the Heroku Runtime Networking Team
Summary
The following story outlines a recent issue we saw while migrating one of our internal systems to a new EC2 substrate, which broke one of our customers' use cases. We also outline how we went about discovering the root of the issue, how we fixed it, and how we enjoyed solving a complex problem that helped keep the Heroku customer experience as simple and straightforward as possible!
History 📖
Heroku has been leveraging AWS and EC2 since the very early days. All these years, the Common Runtime has been running on EC2 Classic, and while there have always been talks about moving to the more performant and feature-rich VPC architecture that AWS offers, we hadn't been able to invest the time and personnel to make it a reality until very recently. The results of that effort were captured in a previous blog post titled Faster Dynos for All.
While our Common Runtime contains many critical components, including the instance fleet that runs app containers, our routers, and several other control plane components, one of the often overlooked yet critical components is Rendezvous, our bidirectional proxy server that enables Heroku Run sessions to containers. This is the component that lets customers run what are called one-off dynos, which serve a wide range of use cases, from a simple prompt to execute or test a piece of code to complex CI scenarios.
Architecture of Rendezvous 💪
Rendezvous has been a single-instance server from time immemorial. It is a sub-200 line Ruby script that runs on an EC2 instance with an EIP attached to it. The Ruby process receives TLS connections directly, performs TLS termination, and proxies bidirectional connections that match a given hash.
Every Heroku Run/one-off dyno invocation involves two parties: the client, which is usually the Heroku CLI or a custom implementation that uses the Heroku API, and the dyno on one of Heroku's instances deep in the cloud. The existence of Rendezvous is necessitated by one of the painful yet essential warts of the Internet: NATs.
Both the client and the dyno are behind NATs and there's no way for them to talk to each other directly through these pesky devices. To work around this, the Heroku API returns an attach_url as part of the create_dyno request which lets the client reach the dyno. The attach_url also contains a 64-bit hash that identifies this specific session in Rendezvous. The same attach_url with the exact same hash is passed by our dyno management system to an agent that runs on our EC2 instance fleet and is responsible for the lifecycle of dynos.
Once both systems receive the attach_url with the same hash, they make a TLS connection to the host, which is a specific instance of Rendezvous. Once the TLS session is established, each side sends the hash as its first message, which lets Rendezvous identify which session the connection belongs to. Once the two sides of the session are established, Rendezvous splices them together, creating a bidirectional session between the CLI/user and the dyno.
A unique use-case of Rendezvous 💡
While the majority of customers use Rendezvous via heroku run commands executed through the CLI, some clients have more sophisticated needs and start containers arbitrarily via the Heroku API. These clients programmatically create a dyno via the API and also establish a session to the attach_url themselves.
One of our customers utilized Rendezvous in a unique way, running an app in a Private Space that received client HTTP requests and, within the context of a request, issued another request to the Heroku API and then to Rendezvous. They needed to serve requests across multiple customers of their own and, to ensure isolation between them, they opted to run each individual customer's request inside a one-off dyno. The tasks in these one-off dyno runs were expected to take a few seconds and were usually well within the 30s maximum response time limit imposed by the Heroku router.
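The general shape of that pattern looks roughly like the sketch below. This is a hedged illustration, not the customer's code: the endpoint and headers follow the public Heroku Platform API, but the app name, command, environment variable, attach_url parsing, and first-message framing are all assumptions made for the example.
require "net/http"
require "json"
require "socket"
require "openssl"
require "uri"

app_name = "example-app"                                # hypothetical app name
api = URI("https://api.heroku.com/apps/#{app_name}/dynos")

req = Net::HTTP::Post.new(api)
req["Authorization"] = "Bearer #{ENV["HEROKU_API_KEY"]}"   # hypothetical env var holding an API token
req["Accept"]        = "application/vnd.heroku+json; version=3"
req["Content-Type"]  = "application/json"
req.body = { command: "bin/run_task", attach: true }.to_json   # hypothetical one-off command

res  = Net::HTTP.start(api.host, api.port, use_ssl: true) { |http| http.request(req) }
dyno = JSON.parse(res.body)

# attach_url looks roughly like rendezvous://<host>:<port>/<hash>; the exact
# format and first-message framing here are assumptions for illustration.
attach = URI.parse(dyno["attach_url"])
tcp = TCPSocket.new(attach.host, attach.port)
ssl = OpenSSL::SSL::SSLSocket.new(tcp).tap(&:connect)
ssl.puts(attach.path.delete_prefix("/"))   # first message: the session hash
puts ssl.read                              # stream the dyno's output until it exits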
Oh! Something’s broken!
In July 2021, we moved Rendezvous into AWS VPCs as part of our effort to evacuate EC2 Classic. We chose instances of a similar generation to the one we had in Classic. As part of this effort, we also wanted to remove a couple of architectural shortcomings of Rendezvous: ingress through a single EIP and manual certificate management for TLS termination.
Based on experience with other routing projects, we decided to leverage Network Load Balancers that AWS offers. From a performance perspective, these were also significantly better - our internal tests revealed that NLBs offered 5-7x more throughput in comparison to the EIP approach. We also decided to leverage the NLB’s TLS termination capabilities which allowed us to stop managing our own certificate and private key manually and rely on AWS ACM to take care of renewals in the future.
While the move was largely a success, and most customers didn't notice it because their heroku run sessions continued to work after the transition, our unique customer immediately hit H12s on the app that spawns one-off dynos. Almost immediately, we traced the issue to Rendezvous sessions taking longer than the 30s limit imposed by the Heroku Router. We temporarily switched their app back to the classic path and sat down to investigate.
Where’s the problem! 😱
Our first hunch was that TLS termination on the NLB wasn't happening as expected, but our investigation revealed that TLS was terminated appropriately and the client was able to make progress after that. The next line of investigation was Rendezvous itself. The new VPC-based instances were supposed to be faster, so the slowdown was something of a mystery. We even tried an instance type that supported 100 Gbps networking, but the issue persisted. As part of this effort, we had also upgraded the Ruby version that Rendezvous runs on, and, you guessed it, we attempted a downgrade as well. That proved inconclusive too.
All along, we also suspected the problem could be in the language runtime of the recipient of the connection: perhaps the bytes were available in the runtime's userspace buffer but the API call was never notified, or there was a race condition. We attempted to mimic the data pattern between the client and the process in the one-off dyno by writing our own sample applications, in two different languages with very different runtimes. Both ended up exhibiting the same issue in the new environment. 🤔
We even briefly considered altering the Heroku Router’s timeout from 30s, but it largely felt like spinning a roulette wheel since we weren’t absolutely sure where the problem was.
Nailing it down! 🔍
As part of the troubleshooting effort, we added more logging to the agent that runs on every EC2 instance and is responsible for maintaining the connection between Rendezvous and the dyno. This agent negotiates TLS with Rendezvous, establishes the connection, sets up a pty terminal connection on the dyno side, and wires up stdin/stdout/stderr channels to it. The client sends requests in fixed-size byte chunks, which the agent streams to the dyno; the agent also receives bytes from the dyno and streams them back to Rendezvous, which relays them to the client. The logs showed traffic flowing back and forth between the dyno and Rendezvous when connections worked. In the failing case, however, there were no logs indicating traffic coming from the dyno after a certain point, and the last log line showed bytes being sent to the dyno.
Digging more, we identified two issues with this piece of code:
- This piece of code was single threaded - i.e. a single thread was performing an IO.select on both the TCP socket on the Rendezvous side and the terminal reader on the dyno side.
- While #1 itself is not a terrible problem, it became a problem with the use of NLBs, which are more performant and have different TLS frame characteristics.
#2 meant that the NLB could potentially send much larger TLS frames than the classic setup, where the Rendezvous Ruby process itself performed TLS termination.
The snippet of code that had the bug was as follows.
# tcp_socket can be used with IO.select
# ssl_socket is after openssl has its say
# pty_reader and pty_writer are towards the dyno
def rendezvous_channel(tcp_socket, ssl_socket, pty_reader, pty_writer)
  if o = IO.select([tcp_socket, pty_reader], nil, nil, IDLE_TIMEOUT)
    if o.first.first == pty_reader
      # read from the pty_reader and write to ssl_socket
    elsif o.first.first == tcp_socket
      # read from the ssl_socket and write to pty_writer
    end
  end
end
Since the majority of the bytes flowed from the client, this thread would read from the ssl_socket and write them to the pty_writer. With Classic, these would have been small TLS frames, which meant that an IO.select readability notification corresponded to a single read from the SSL socket, which would in turn read from the TCP socket.
However, with the shards, the TLS frames from the NLB end up being larger, and a previous read from the ssl_socket could have already pulled more bytes off the tcp_socket, which could leave IO.select blocked until the IDLE_TIMEOUT passed. That isn't a problem if the IDLE_TIMEOUT is a relatively small number, but since it was larger than the 30s limit imposed by the Heroku Router, IO.select blocking here let that timer elapse, resulting in H12s.
In fact, the Ruby docs for IO.select specifically talk about this issue:
The most likely situation is that OpenSSL::SSL::SSLSocket buffers some data. IO.select doesn't see the buffer. So IO.select can block when OpenSSL::SSL::SSLSocket#readpartial doesn't block.
According to the Linux kernel on the instance, there were no bytes left to read from the tcp_socket, while there were still bytes waiting to be read from OpenSSL's buffers, since we had only read partially the last time around.
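The pattern the docs point toward is to attempt the nonblocking read first, and only fall back to IO.select when OpenSSL genuinely needs more bytes from the wire. The sketch below illustrates that pattern under assumed names and buffer sizes; it is not the code we shipped.
# Drain whatever OpenSSL has buffered before waiting on the raw socket.
def read_from_ssl(ssl_socket, tcp_socket, timeout)
  loop do
    begin
      return ssl_socket.read_nonblock(16 * 1024)   # also drains OpenSSL's internal buffer
    rescue IO::WaitReadable
      # Nothing buffered in OpenSSL; it is now safe to wait on the raw TCP socket.
      IO.select([tcp_socket], nil, nil, timeout) or return nil   # nil means a genuine idle timeout
    rescue IO::WaitWritable
      # A TLS renegotiation may need a write before the read can proceed.
      IO.select(nil, [tcp_socket], nil, timeout) or return nil
    rescue EOFError
      return nil
    end
  end
end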
The fix 💜
Once we had identified the issue, the fix was rather straightforward. We made the code dual threaded, one thread for each side of the connection, and also fixed the way we read from the sockets and used IO.select. With this code change, we ensured that we would never block indefinitely while there were bytes lying around waiting to be read.
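A simplified sketch of that two-threaded shape, reusing the socket names from the snippet above, might look like the following. It is not the exact patch, and idle-timeout handling is omitted for brevity.
# Blocking copy loop for one direction; readpartial consumes OpenSSL's
# buffered bytes first, so nothing can be stranded behind an IO.select.
def pump(src, dst)
  loop do
    dst.write(src.readpartial(16 * 1024))
  end
rescue EOFError, IOError
  dst.close rescue nil
end

def rendezvous_channel(ssl_socket, pty_reader, pty_writer)
  [
    Thread.new { pump(ssl_socket, pty_writer) },   # client -> dyno
    Thread.new { pump(pty_reader, ssl_socket) },   # dyno -> client
  ].each(&:join)
end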
We deployed this fix to our staging environments and, after thorough testing, we moved the customer over to the VPC-based Rendezvous. The customer subsequently confirmed that the issue was resolved, and all our remote offices erupted in roars of cheer. It was 🍰 time.
Conclusion
Computers are fun, computers are hard!
Try to run a platform and you’ll often say, oh my god!
Gratifying and inspiring it is, to run our stack
For if you lose their trust, it’s hard to get it back ...
Running a platform makes you appreciate Hyrum's Law more every day. Customers find interesting ways to use your platform, and they sure do keep you on your toes to ensure you provide best-in-class service. At Heroku we have always taken pride in our mission to make life easy for customers, and we are grateful to have had the opportunity to demonstrate that yet again as part of this endeavor.
Thanks are in order for all the folks who tirelessly worked on identifying this issue and fixing it. In alphabetical order - David Murray, Elizabeth Cox, Marcus Blankenship, Srinath Ananthakrishnan, Thomas Holmes, Tilman Holschuh and Will Farrington.