Symptoms
You SSH into sw-infrarunbook-01 for a routine check and something feels off. The login banner takes a beat longer than usual to appear. You run a quick
tail -f /var/log/auth.log and the output is scrolling faster than you can read it. Or maybe you got paged at 2 AM because a disk-usage alert fired — and the culprit turns out to be a 4 GB auth log that didn't exist yesterday.
Here's what a system under active brute force looks like in practice:
- /var/log/auth.log (Debian/Ubuntu) or /var/log/secure (RHEL/Rocky/AlmaLinux) is growing at an unusual rate — sometimes gigabytes per day
- Thousands of "Failed password for invalid user" entries from a handful of rotating source IPs
- Attempts targeting predictable usernames: root, admin, ubuntu, pi, test, deploy, git, oracle
- SSH login delays for legitimate users during peak attack windows
- fail2ban either not running, not installed, or showing zero banned IPs despite the noise
- CPU and I/O spikes with no obvious process to blame
- Source IPs tracing back to cloud provider ranges in regions you don't operate in
This is a brute force attack. Automated scanners sweep the entire routable IPv4 space continuously, probing every host they find on port 22. If your server is reachable, it's being probed — the only question is whether your defenses are configured to absorb or stop it. Let's work through every reason this happens and exactly how to fix it.
Root Cause 1: Auth Log Shows Repeated Failures and Nobody Is Watching
Why It Happens
The auth log is the ground truth for authentication events on a Linux system. Failed SSH attempts, PAM errors, sudo invocations — all of it ends up there. The problem isn't that the log exists; it's that most teams provision a server, open port 22, and never wire up any alerting against auth.log. The log fills silently, an attack runs for days or weeks, and nobody notices until disk fills or a compliance audit surfaces it.
Attackers know this. Sustained low-and-slow campaigns are specifically designed to stay under the radar — a few hundred attempts per hour from rotating IPs. Each individual source IP stays quiet enough to avoid triggering naive rate-limit rules, but collectively they grind through millions of password combinations around the clock.
How to Identify It
Start with the raw failure count over the last 24 hours:
grep "Failed password" /var/log/auth.log | wc -l
On a healthy server this should be in the single digits to low hundreds. If you see output like this, you have a problem:
147382
Get a breakdown of the attacking IPs to understand the scope:
grep "Failed password" /var/log/auth.log | awk '{print $(NF-3)}' | sort | uniq -c | sort -rn | head -20
8421 185.224.128.43
6203 45.33.32.156
4891 103.99.0.122
3317 198.51.100.77
2108 45.142.212.100
1892 194.165.16.11
Then check which usernames are being targeted — this tells you whether it's a credential-stuffing campaign or just blind dictionary blasting. Extract the username from the "Failed password" lines rather than grepping "Invalid user", since the latter only covers accounts that don't exist and would miss attempts against real accounts like root:
grep "Failed password" /var/log/auth.log | grep -oP 'for (invalid user )?\K\S+' | sort | uniq -c | sort -rn | head -20
9821 root
4302 admin
2198 ubuntu
1847 pi
983 test
741 deploy
620 git
507 oracle
How to Fix It
Beyond observing the log reactively, you need active monitoring. Forward auth.log to a SIEM, configure logwatch, or use a simple alerting rule that fires when "Failed password" exceeds a threshold in a rolling window. Tools like Loki + Grafana, Graylog, or even a cron job that mails you a daily summary work fine. The key is visibility before the problem becomes a crisis.
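The cron-and-mail option can be as small as the sketch below. The script path, threshold, and recipient are illustrative — tune the number to your host's baseline and swap mail for whatever notifier you actually run:

```shell
#!/bin/sh
# /etc/cron.daily/ssh-failure-summary (illustrative path) — mail a
# summary whenever failure volume crosses a threshold.
THRESHOLD=500               # tune to your host's normal background noise
LOG=/var/log/auth.log       # /var/log/secure on RHEL-family systems
COUNT=$(grep -c "Failed password" "$LOG")
if [ "$COUNT" -gt "$THRESHOLD" ]; then
    {
        echo "Failed SSH logins in $LOG: $COUNT (threshold: $THRESHOLD)"
        echo "Top offending IPs:"
        grep "Failed password" "$LOG" | awk '{print $(NF-3)}' \
            | sort | uniq -c | sort -rn | head -10
    } | mail -s "SSH brute force warning: $(hostname)" root
fi
```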
For immediate investigation when auth.log has already rotated, check compressed archives:
zgrep "Failed password" /var/log/auth.log.*.gz | wc -l
To see the attack timeline and identify peak hours (the log is already chronological, so identical time buckets are adjacent and uniq -c needs no extra sort):
grep "Failed password" /var/log/auth.log | awk '{print $1, $2, substr($3, 1, 2) ":00"}' | uniq -c | sort -rn | head -20
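To hunt specifically for low-and-slow sources, bucket failures per source IP per hour rather than looking at raw totals — a steady trickle across many consecutive buckets is the signature. A sketch, assuming the Debian log path and standard syslog timestamps:

```shell
# Count failed attempts per (hour, source IP) bucket. A low-and-slow
# source shows modest counts spread across many consecutive buckets
# instead of one large burst.
grep "Failed password" /var/log/auth.log \
  | awk '{print $1, $2, substr($3, 1, 2) "h", $(NF-3)}' \
  | sort | uniq -c | sort -rn | head -20
```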
Root Cause 2: fail2ban Not Installed
Why It Happens
This one comes up constantly. Someone spins up a VPS, installs OpenSSH, opens port 22, and ships it. fail2ban isn't part of the default installation on any major distribution — you have to install it deliberately. In my experience, this is the single biggest gap between servers that handle brute force gracefully and servers that get hammered into the ground. Without fail2ban, every failed authentication attempt is free. There's no penalty for being wrong a thousand times in a row.
fail2ban works by tailing log files for patterns and using iptables or nftables to temporarily ban offending IPs once a threshold is crossed. It's not a silver bullet — rotating botnets can exhaust its ban list — but it dramatically raises the cost of an attack and handles the long tail of automated scanners that dominate internet noise.
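Because fail2ban is entirely pattern-driven, it's worth confirming that the bundled sshd filter actually matches your log format. The fail2ban-regex tool (shipped with the package) does a dry run without banning anything:

```shell
# Dry-run the stock sshd filter against the live log; the summary
# reports how many lines matched the failure patterns.
fail2ban-regex /var/log/auth.log /etc/fail2ban/filter.d/sshd.conf
```

If the match count is near zero while auth.log is clearly full of failures, the filter and your log format disagree, and the jail will sit idle no matter how it's configured.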
How to Identify It
which fail2ban-client
No output means it's not installed. You can also check systemd directly:
systemctl status fail2ban
Unit fail2ban.service could not be found.
Or if it's installed but not running or not enabled:
● fail2ban.service - Fail2Ban Service
Loaded: loaded (/lib/systemd/system/fail2ban.service; disabled; vendor preset: enabled)
Active: inactive (dead)
How to Fix It
On Debian/Ubuntu:
apt install fail2ban -y
systemctl enable fail2ban
systemctl start fail2ban
On RHEL/Rocky/AlmaLinux (fail2ban lives in EPEL):
dnf install epel-release -y
dnf install fail2ban -y
systemctl enable fail2ban
systemctl start fail2ban
Don't modify /etc/fail2ban/jail.conf directly — that file gets overwritten on package upgrades. Create a local override at /etc/fail2ban/jail.local instead:
[DEFAULT]
bantime = 86400
findtime = 600
maxretry = 5
backend = auto
[sshd]
enabled = true
port = ssh
logpath = %(sshd_log)s
This bans any IP that fails 5 times within 10 minutes for a full 24 hours. I push bantime to 86400 on any host with no legitimate reason to see auth failures from unknown sources. Restart fail2ban so it picks up the new jail, then verify the jail is active:
systemctl restart fail2ban
fail2ban-client status sshd
Status for the jail: sshd
|- Filter
| |- Currently failed: 3
| |- Total failed: 142
| `- File list: /var/log/auth.log
`- Actions
|- Currently banned: 7
|- Total banned: 31
`- Banned IP list: 185.224.128.43 45.33.32.156 103.99.0.122 ...
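If a legitimate user trips the jail — a mistyped password from a flaky VPN egress, say — you don't have to wait out the bantime. fail2ban-client can lift a ban directly; the IP below is a placeholder:

```shell
# Remove a specific ban from the sshd jail (203.0.113.5 is an example IP)
fail2ban-client set sshd unbanip 203.0.113.5

# Confirm the updated ban list afterwards
fail2ban-client status sshd
```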
Root Cause 3: MaxAuthTries Too High
Why It Happens
The default value of MaxAuthTries in OpenSSH is 6. That means an attacker gets 6 password attempts per TCP connection before the server closes it. This sounds restrictive, but nothing stops the attacker from immediately opening a new TCP connection and trying 6 more. With enough threads, they can run thousands of attempts per minute against a single host even with the default in place.
Worse, I've seen environments where an admin bumped MaxAuthTries to 20 or higher because they were seeing "Too many authentication failures" errors during debugging — a client-side issue caused by an SSH agent offering too many keys — and never walked it back after resolving the real problem. That setting then sits in production, handing attackers a wide-open window.
How to Identify It
grep -i "maxauthtries" /etc/ssh/sshd_config
No output means the default of 6 is in effect. If you see something like this, it needs fixing immediately:
MaxAuthTries 20
Always check the effective running configuration rather than just the config file, because Include directives can pull in other files:
sshd -T | grep maxauthtries
maxauthtries 20
How to Fix It
Set MaxAuthTries to 3 in /etc/ssh/sshd_config. Combined with fail2ban, this means an attacker gets 3 guesses per connection, and fail2ban cuts the IP off once it's tripped that threshold enough times:
MaxAuthTries 3
Reload sshd without dropping existing sessions:
systemctl reload sshd
Verify the change is live:
sshd -T | grep maxauthtries
maxauthtries 3
While you're in sshd_config, also tighten LoginGraceTime. The default is 120 seconds — two full minutes that a half-open unauthenticated connection can hold a slot in the daemon. Dropping it to 30 seconds reduces resource consumption during flood attacks:
LoginGraceTime 30
Root Cause 4: Root Login Allowed
Why It Happens
Modern OpenSSH defaults PermitRootLogin to "prohibit-password", which only allows root login via public key authentication. But I've seen countless production servers — especially older systems, systems migrated from aging cloud images, or boxes that started life as quick dev environments and quietly became important — where PermitRootLogin is set to "yes". That means root can log in with a password, and since every Linux system has a root account, attackers don't even need to guess a valid username. They already know one.
The volume this generates is staggering. Even "prohibit-password" carries risk if key management is loose, which is why the safest posture is to disable root login entirely and require admins to SSH as a named user and escalate with sudo.
How to Identify It
grep -i "permitrootlogin" /etc/ssh/sshd_config
PermitRootLogin yes
Or from the effective running configuration:
sshd -T | grep permitrootlogin
permitrootlogin yes
Confirm just how much noise root login is generating:
grep "Failed password for root" /var/log/auth.log | wc -l
53847
That number alone explains why this matters.
How to Fix It
Before making this change, verify you have a non-root user with sudo access and a working SSH key. If you lock yourself out here, recovery requires console access — not fun at 3 AM. On sw-infrarunbook-01, confirm the admin account first:
id infrarunbook-admin
uid=1001(infrarunbook-admin) gid=1001(infrarunbook-admin) groups=1001(infrarunbook-admin),27(sudo)
cat /home/infrarunbook-admin/.ssh/authorized_keys
Once confirmed, edit /etc/ssh/sshd_config and set:
PermitRootLogin no
Reload sshd:
systemctl reload sshd
From a separate terminal session, confirm you can still log in as infrarunbook-admin with your key before closing the root session. This step is not optional — I have seen engineers lock themselves out of cloud VMs by skipping it.
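That verification can be made explicit with a single command from your workstation, run before the root session closes. The key path and address are examples, and sudo -n proves escalation works without an interactive prompt — if sudo requires a password on this host, drop the sudo part and test it interactively instead:

```shell
# Prove both key login and sudo escalation still work for the admin user
ssh -i ~/.ssh/id_ed25519 -o IdentitiesOnly=yes \
    infrarunbook-admin@192.168.10.50 'sudo -n true && echo "key + sudo OK"'
```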
Root Cause 5: Key-Based Auth Not Enforced
Why It Happens
This is the architectural root of most SSH brute force problems. Password authentication over SSH means that as long as an attacker can reach your SSH port, they can attempt logins indefinitely — rate limits and fail2ban only slow them down. Key-based authentication eliminates the attack surface entirely. Without the private key, no amount of password guessing succeeds, period.
Password auth ships enabled by default in OpenSSH because it's easier for new users to get started. But "easier" cuts both ways — it's easier for attackers too. Every server with a public IP should have PasswordAuthentication set to no. There's no legitimate argument for leaving it enabled on a production host.
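If a host still relies on passwords, stage the keys before enforcing anything. A minimal sketch — the key filename and target address are examples:

```shell
# Generate a modern ed25519 keypair on your workstation (no passphrase
# here for brevity; in practice use one, or an ssh-agent)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "infrarunbook-admin"

# Install the public key on the server while password auth still works
ssh-copy-id -i ~/.ssh/id_ed25519.pub infrarunbook-admin@192.168.10.50
```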
How to Identify It
sshd -T | grep passwordauthentication
passwordauthentication yes
That single line means a brute force attack against your server has a chance of succeeding. Also check ChallengeResponseAuthentication, which can act as an alternate path to password-based auth on some PAM configurations. On OpenSSH 8.7 and later the option was renamed KbdInteractiveAuthentication, so match both spellings:
sshd -T | grep -E "challengeresponse|kbdinteractive"
challengeresponseauthentication yes
Both need to be disabled.
How to Fix It
Confirm your key-based login is working from a separate terminal before making this change. This is non-negotiable. Open /etc/ssh/sshd_config and set:
PasswordAuthentication no
ChallengeResponseAuthentication no
UsePAM yes
UsePAM yes stays enabled because PAM handles account restrictions, session setup, and access controls — it doesn't re-enable password auth when PasswordAuthentication is explicitly set to no.
Reload and verify:
systemctl reload sshd
sshd -T | grep -E "passwordauthentication|challengeresponse|kbdinteractive"
passwordauthentication no
challengeresponseauthentication no
Now test from a machine without an authorized key. You should see:
infrarunbook-admin@192.168.10.50: Permission denied (publickey).
That's exactly right. Password auth is gone. The brute force attack is still generating log noise, but it can't succeed — and fail2ban will silence even the noise once IPs cross the retry threshold.
Root Cause 6: No AllowUsers or AllowGroups Restriction
Why It Happens
Even with password auth disabled, an unconstrained sshd_config allows any account on the system to attempt key-based login. If a compromised CI/CD pipeline accidentally writes an authorized key to a service account's home directory, or if an attacker gains a foothold through another vulnerability and plants a key, there's nothing at the sshd layer blocking that account from gaining SSH access.
AllowUsers and AllowGroups enforce an explicit allowlist. Only accounts you name can SSH in — everything else is rejected before any other check runs. It's a cheap control with a high-value payoff.
How to Identify It
sshd -T | grep -E "allowusers|allowgroups"
No output means there's no allowlist in effect. Any account on the system can attempt SSH authentication.
How to Fix It
In /etc/ssh/sshd_config, add an explicit allowlist. For a single admin account:
AllowUsers infrarunbook-admin
For team environments, group-based control scales better:
AllowGroups sshusers
groupadd sshusers
usermod -aG sshusers infrarunbook-admin
After reload, any SSH attempt from an account not in the allowlist returns:
Permission denied (publickey).
Attackers probing common usernames like ubuntu, pi, or deploy will hit this wall immediately, even if somehow those accounts exist on the system.
Root Cause 7: No Firewall-Level Rate Limiting
Why It Happens
fail2ban reacts after the fact. It reads the log, detects a pattern, then issues a ban. During the window between the first failed attempt and the ban being applied, the attack continues unimpeded. Against a large botnet with thousands of IPs, fail2ban may be banning addresses faster than it can process them while the log still grows. Firewall-level rate limiting is a proactive layer that throttles new TCP connections before they even reach sshd — before any log entry is written, before any auth attempt is processed.
How to Identify It
iptables -L INPUT -n --line-numbers | grep -E "dpt:22|recent"
No rate-limiting rules means new connection attempts hit sshd unbounded. Every bot on the internet gets full, unthrottled access to your auth stack.
How to Fix It
Using iptables with the recent module, add a rate limit for port 22:
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -m recent --set --name SSH
iptables -A INPUT -p tcp --dport 22 -m state --state NEW -m recent --update \
--seconds 60 --hitcount 10 --name SSH -j DROP
This drops any source IP that makes more than 10 new SSH connections in 60 seconds. Legitimate users won't notice — they don't open 10 fresh connections per minute. Brute force bots stall immediately.
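The recent module exposes its tracking table through procfs, which is handy for confirming the rule is actually seeing traffic. The table name matches the --name SSH used in the rules above:

```shell
# Each line is a tracked source IP with its recent connection timestamps
cat /proc/net/xt_recent/SSH
```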
On systems using nftables (most modern distros ship with it), assuming an inet filter table with an input chain already exists:
nft add rule inet filter input tcp dport 22 ct state new \
meter ssh_meter { ip saddr limit rate 5/minute } accept
nft add rule inet filter input tcp dport 22 ct state new drop
Persist these rules through reboots — iptables-save only prints the ruleset to stdout, so pair it with the iptables-persistent package (netfilter-persistent save) or redirect it into /etc/iptables/rules.v4 — or manage the rules through your distro's preferred frontend, ufw on Ubuntu or firewalld on RHEL-family systems. Don't set this and forget it; verify after any firewall management tool update that the rules survived.
Root Cause 8: SSH Exposed on Default Port Without Obscurity Controls
Why It Happens
Port 22 is scanned continuously. Within minutes of a new public IP address appearing on the internet, automated scanners have already probed it for SSH. Moving SSH to a non-standard port won't stop a determined attacker — a full port scan will find it — but it eliminates the enormous volume of automated noise that targets port 22 specifically. In my experience, moving SSH off port 22 drops auth.log failure counts by 90% or more overnight. It's security through obscurity and shouldn't be your only layer, but as one component of a defense-in-depth posture it's a free noise reduction you'd be foolish to skip.
How to Identify It
sshd -T | grep ^port
port 22
How to Fix It
Edit /etc/ssh/sshd_config:
Port 2222
On SELinux-enabled systems, tell SELinux about the new port before reloading sshd (semanage ships in policycoreutils-python-utils on RHEL-family distros):
semanage port -a -t ssh_port_t -p tcp 2222
Update your firewall to allow the new port:
# ufw
ufw allow 2222/tcp
ufw delete allow 22/tcp
# firewalld
firewall-cmd --add-port=2222/tcp --permanent
firewall-cmd --remove-service=ssh --permanent
firewall-cmd --reload
Reload sshd, then update your SSH client config (~/.ssh/config), any automation that connects to this host, jump host definitions, and Ansible inventory. If sw-infrarunbook-01 functions as a bastion, update downstream jump configurations on every client that routes through it. A port change touches more config than it looks like — take a few minutes to audit everything before the change lands.
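On the client side, a Host block in ~/.ssh/config keeps the port change invisible to day-to-day usage. A sketch — the address and key path are examples:

```
# ~/.ssh/config on each client machine
Host sw-infrarunbook-01
    HostName 192.168.10.50
    Port 2222
    User infrarunbook-admin
    IdentityFile ~/.ssh/id_ed25519
```

After this, a plain ssh sw-infrarunbook-01 works exactly as before the port change.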
Prevention
Defense in depth is the only real answer. No single control is sufficient. A well-hardened SSH configuration on sw-infrarunbook-01 looks like this in aggregate — this is the baseline I'd apply to any host facing the internet:
# /etc/ssh/sshd_config — hardened baseline for solvethenetwork.com
Port 2222
PermitRootLogin no
MaxAuthTries 3
LoginGraceTime 30
PasswordAuthentication no
ChallengeResponseAuthentication no
UsePAM yes
AllowUsers infrarunbook-admin
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
X11Forwarding no
AllowAgentForwarding no
AllowTcpForwarding no
PrintMotd no
AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server
Always validate sshd_config syntax before reloading — a syntax error here will prevent sshd from starting on next restart, which can lock you out permanently on a headless system:
sshd -t && echo "Config OK"
Beyond sshd_config, a complete prevention posture covers several additional areas. Key rotation: rotate SSH keys whenever a team member leaves. Audit all authorized_keys files across your fleet quarterly — keys accumulate faster than anyone tracks them, and forgotten keys are a persistent access risk.
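That quarterly audit is scriptable. A sketch that inventories every authorized_keys file under /root and /home with its key count and trailing comment fields — adjust the paths if your fleet keeps home directories elsewhere:

```shell
# Inventory authorized_keys across local accounts: path, key count,
# and each key's trailing comment (often a user@host hint).
for f in /root/.ssh/authorized_keys /home/*/.ssh/authorized_keys; do
    [ -f "$f" ] || continue
    count=$(grep -cv -e '^#' -e '^[[:space:]]*$' "$f")
    printf '%s: %s key(s)\n' "$f" "$count"
    grep -v -e '^#' -e '^[[:space:]]*$' "$f" | awk '{print "  comment: " $NF}'
done
```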
Centralized log shipping: get auth.log into a SIEM or centralized logging system. Correlating failed logins across multiple hosts lets you spot distributed campaigns that stay below per-host thresholds. A botnet spreading 50 attempts per host across 500 hosts won't trip fail2ban anywhere, but it lights up immediately in a cross-host view.
Two-factor authentication: for bastion hosts and any server with elevated access, consider adding TOTP via libpam-google-authenticator or a similar module. Even a compromised private key can't authenticate without the second factor. This is particularly valuable for the hosts an attacker would most want to reach.
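With libpam-google-authenticator installed and each admin enrolled via the google-authenticator command, the wiring is two small config changes. A sketch of the relevant lines — verify option names against your distro's OpenSSH version before rolling out, and keep a working session open while you test:

```
# /etc/pam.d/sshd — append the TOTP module
auth required pam_google_authenticator.so

# /etc/ssh/sshd_config — require a key AND a TOTP code
KbdInteractiveAuthentication yes
AuthenticationMethods publickey,keyboard-interactive
```

With AuthenticationMethods set this way, a valid key alone is no longer sufficient — sshd prompts for the verification code after the key check succeeds.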
SSH Certificate Authority: in environments managing more than a handful of hosts, replace static authorized_keys with an internal SSH CA. Issue short-lived certificates with defined principals and expiry windows. Revocation becomes a centralized operation rather than a file-editing exercise across every host in your fleet. OpenSSH has built-in CA support and it's more straightforward to deploy than people expect.
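The moving parts are small enough to sketch with stock ssh-keygen — the key filenames, identity, and principal below are examples:

```shell
# 1. Create the CA keypair (guard the private half carefully)
ssh-keygen -t ed25519 -f ssh_user_ca -N "" -C "solvethenetwork user CA"

# 2. Sign an admin's existing public key: certificate identity
#    "infrarunbook-admin", same principal, valid for 8 hours
ssh-keygen -s ssh_user_ca -I infrarunbook-admin -n infrarunbook-admin \
    -V +8h ~/.ssh/id_ed25519.pub

# 3. On each server, trust the CA instead of per-user authorized_keys
#    by adding to sshd_config:
#      TrustedUserCAKeys /etc/ssh/ssh_user_ca.pub

# Inspect the issued certificate (principals, validity window, serial)
ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub
```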
Source IP allowlisting: if your SSH hosts don't need to be reachable from arbitrary internet IPs, restrict port 22 (or your custom port) to known source ranges at the firewall or cloud security group level. A management network, VPN egress range, or office IP block combined with an explicit deny-all is far stronger than any application-layer defense. Application-layer tools like fail2ban exist to handle the cases where network-level allowlisting isn't feasible — but network-level allowlisting should always be your first option when it is.
The goal is to make your SSH surface so hostile to automated attacks that scanners move on to easier targets — and to make any meaningful breach attempt detectable before it becomes a crisis. With key-only auth enforced, fail2ban actively banning, root login disabled, MaxAuthTries at 3, and centralized log monitoring in place, you've addressed the vast majority of real-world brute force scenarios. Stack network-level controls on top of that and you're in genuinely good shape.
