I run this site on AWS and have been having issues with the site going down periodically.
I monitor site outages using AWS Route 53 Alerts.
Whenever the site went down, I was also unable to connect over SSH. The error I was getting was a familiar one.
ssh_exchange_identification: Connection reset by peer
It was frustrating because normally when I have a web outage, I troubleshoot on the box. And in these cases I couldn’t SSH into the system either. I also couldn’t connect via SSM.
The issue usually just went away, or sometimes I would bounce the box and it would be fine.
So the other day, after it happened again (it seems to happen like once every couple of days for a few minutes), I decided to dig deeper.
I was thinking web (NGINX, Cloudflare, resources, anything) and was ready to fix whatever. But then I came across this on Stack Overflow.
Have also seen this happen when server was under heavy load from for example, brute force attack. Increase the amount of connections sshd can run.
I was like, huh? No way. There’s no way you can DoS an AWS box with SSH brute force attempts. Right?
…
Right?
So I removed my SSH listener, just to see what would happen.
It’s been a couple of days and I’ve not received a downtime alert since.
This is just insane to me. I don’t see how this is possibly a thing.
To be clear, I’m not talking about SSH being DoS’d—I’m talking about the box being DoS’d. More reading pointed to a possible TCP starvation issue. Or at least that’s what it sounded like.
I checked the logs and it didn’t look like some massive number of SSH attempts.
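One way to put a number on "massive" is to count failed SSH attempts per minute and per source address. The following is a minimal sketch, not the method used in this post. It assumes the traditional syslog timestamp format ("Mmm dd hh:mm:ss") and the usual sshd messages "Failed password" and "Invalid user". The function name `failed_attempts` is hypothetical, and the log path is passed on the command line because it varies by distribution (commonly /var/log/auth.log or /var/log/secure).

```python
import re
import sys
from collections import Counter

# Matches the two sshd messages most often produced by brute-force attempts.
# A single attempt can log both lines, so treat the counts as an upper bound.
FAILED = re.compile(r"sshd\[\d+\]: (Failed password|Invalid user) .* from (\S+)")


def failed_attempts(lines):
    """Return (attempts per minute, attempts per source address)."""
    per_minute = Counter()
    per_source = Counter()
    for line in lines:
        match = FAILED.search(line)
        if not match:
            continue
        # The first 12 characters of a syslog line are "Mmm dd hh:mm".
        per_minute[line[:12]] += 1
        per_source[match.group(2)] += 1
    return per_minute, per_source


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ssh_attempts.py /var/log/auth.log  (path is an assumption)
    with open(sys.argv[1], errors="replace") as handle:
        minutes, sources = failed_attempts(handle)
    print("Busiest minutes:", minutes.most_common(5))
    print("Busiest sources:", sources.most_common(5))
```

If the busiest minutes line up with the Route 53 downtime alerts, that would support the brute-force theory. If they don't, the cause is probably elsewhere.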
Has anyone else heard of some light SSH brute forcing leading to a TCP lockup of an entire system?
July 26, 2020: My current theory is that I’m experiencing TCP exhaustion. I have since tweaked my sysctl variables according to a Chartbeat engineering post and upped the memory on the box (from 4GB to 8GB). I have only had one blip since and will continue to monitor it.
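Before changing any sysctl values, it helps to record what the box is currently running with. The sketch below reads a few settings straight from /proc/sys on Linux. The three keys listed are an assumption: they are commonly tuned connection-queue settings, not a record of which variables were actually changed here. The helper names `sysctl_path` and `read_sysctl` are hypothetical.

```python
from pathlib import Path

# Assumed examples of commonly tuned queue settings; adjust to taste.
KEYS = [
    "net.core.somaxconn",
    "net.ipv4.tcp_max_syn_backlog",
    "net.core.netdev_max_backlog",
]


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root, *key.split("."))


def read_sysctl(key, root="/proc/sys"):
    """Return the current value as a string, or None if the key is absent."""
    path = sysctl_path(key, root)
    return path.read_text().strip() if path.exists() else None


if __name__ == "__main__":
    for key in KEYS:
        print(key, "=", read_sysctl(key))
```

Saving this output before and after a change makes it easier to tell which tweak, if any, reduced the blips.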
The command I’m using to monitor is nstat -az | grep -i listen.
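The same counters can be read without nstat, which is handy for logging them on a schedule. This is a minimal sketch that parses /proc/net/netstat on Linux, the file these counters come from, and keeps only the names containing "listen", such as TcpExtListenOverflows and TcpExtListenDrops. The function names `parse_netstat` and `listen_counters` are hypothetical.

```python
from pathlib import Path


def parse_netstat(text):
    """Parse /proc/net/netstat-style text into a {name: value} dict."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # The file comes in pairs: a line of counter names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        for name, number in zip(names.split(), numbers.split()):
            # Join prefix and name the way nstat prints them.
            counters[prefix + name] = int(number)
    return counters


def listen_counters(text):
    """Keep only the counters whose name mentions 'listen'."""
    return {
        name: value
        for name, value in parse_netstat(text).items()
        if "listen" in name.lower()
    }


if __name__ == "__main__":
    source = Path("/proc/net/netstat")
    if source.exists():
        for name, value in listen_counters(source.read_text()).items():
            print(name, value)
```

These counters only ever increase until reboot, so what matters is whether they jump during an outage window, not their absolute value.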