This article is the first in a series of posts I'm writing about running various SaaS products and websites for the last 8 years. I'll be sharing some of the issues I've dealt with, lessons I've learned, mistakes I've made, and maybe a few things that went right. Let me
This article is the first in a series of posts I'm writing about running various SaaS products and websites for the last 8 years. I'll be sharing some of the issues I've dealt with, lessons I've learned, mistakes I've made, and maybe a few things that went right. Let me know what you think!
Back in 2019 or 2020, I had decided to rewrite the entire backend for Block Sender, a SaaS application that helps users create better email blocks, among other features. In the process, I added a few new features and upgraded to much more modern technologies. I ran the tests, deployed the code, manually tested everything in production, and other than a few random odds and ends, everything seemed to be working great. I wish this was the end of the story, but...
A few weeks later, I was notified by a customer (which is embarrassing in itself) that the service wasn't working and they were getting lots of should-be-blocked emails in their inbox, so I investigated. Many times this issue is due to Google removing the connection from our service to the user's account, which the system handles by notifying the user via email and asking them to reconnect, but this time it was something else.
It looked like the backend worker that handles checking emails against user blocks kept crashing every 5-10 minutes. The weirdest part - there were no errors in the logs, memory was fine, but the CPU would occasionally spike at seemingly random times. So for the next 24 hours (with a 3-hour break to sleep - sorry customers ????), I had to manually restart the worker every time it crashed. For some reason, the Elastic Beanstalk service was waiting far too long to restart, which is why I had to do it manually.
Debugging issues in production is always a pain, especially since I couldn't reproduce the issue locally, let alone figure out what was causing it. So like any "good" developer, I just started logging everything and waited for the server to crash again. Since the CPU was spiking periodically, I figured it wasn't a macro issue (like when you run out of memory) and was probably being caused by a specific email or user. So I tried to narrow it down:
Was it crashing on a certain email ID or type?
Was it crashing for a given customer?
Was it crashing at some regular interval?
After hours of this, and staring at logs longer than I'd care to, eventually, I did narrow it down to a specific customer. From there, the search space narrowed quite a bit - it was most likely a blocking rule or a specific email our server kept retrying on. Luckily for me, it was the former, which is a far easier problem to debug given that we're a very privacy-focused company and don't store or view any email data.
Before we get into the exact problem, let's first talk about one of Block Sender's features. At the time I had many customers asking for wildcard blocking, which would allow them to block certain types of email addresses that followed the same pattern. For example, if you wanted to block all emails from marketing email addresses, you could use the wildcard marketing@* and it would block all emails from any address that started with marketing@.
One thing I didn't think about is that not everyone understands how wildcards work. I assumed that most people would use them in the same way I do as a developer, using one * to represent any number of characters. Unfortunately, this particular user had assumed you needed to use one wildcard for each character you wanted to match. In their case, they wanted to block all emails from a certain domain (which is a native feature Block Sender has, but they must not have realized it, which is a whole problem in itself). So instead of using *@example.com, they used **********@example.com.
POV: Watching your users use your app...
To handle wildcards on our worker server, we're using the Node.js library matcher, which helps with glob matching by turning it into a regular expression. This library would then turn **********@example.com into something like the following regex:
/[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*[\s\S]*@example\.com/i
If you have any experience with regex, you know that they can get very complicated very quickly, especially on a computational level. Matching the above expression to any reasonable length of text becomes very computationally expensive, which ended up tying up the CPU on our worker server. This is why the server would crash every few minutes; it would get stuck trying to match a complex regular expression to an email address. So every time this user received an email, in addition to all of the retries we built in to handle temporary failures, it would crash our server.
So how did I fix this? Obviously, the quick fix was to find all blocks with multiple wildcards in succession and correct them. But I also needed to do a better job of sanitizing user input. Any user could enter a regex and take down the entire system with a ReDoS attack.
Handling this particular case was fairly simple - remove successive wildcard characters:
block = block.replace(/\*+/g, '*')
But that still leaves the app open to other types of ReDoS attacks. Luckily there are a number of packages/libraries to help us with these types as well:
safe-regex
rxxr2
Using a combination of the solutions above, and other safeguards, I've been able to prevent this from happening again. But it was a good reminder that you can never trust user input, and you should always sanitize it before using it in your application. I wasn't even aware this was a potential issue until it happened to me, so hopefully, this helps someone else avoid the same problem.
Have any questions, comments, or want to share a story of your own? Reach out on Twitter!