Symptoms
You log into your BIG-IP and the CPU graph looks wrong. Maybe a monitoring alert woke you up at 2 AM, maybe an application team is calling about timeouts — whatever brought you here, the first thing you see is TMM CPU pegged well above comfortable levels. The box is still forwarding traffic, but it is struggling.
Typical indicators of high CPU on a BIG-IP include TMM utilization consistently above 70–80%, connection establishment times climbing, health monitors beginning to fail and marking pool members down, and the management plane becoming sluggish. TMSH commands that normally execute instantly start hanging. The GUI spins. Syslog fills with "TMM CPU threshold exceeded" messages. In the worst cases, you start seeing virtual servers go red as the BIG-IP can no longer respond to its own health probes.
Start your investigation here:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # top -b -n 1 | grep -E "(tmm|Cpu)"
And for a proper TMM-level view:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show sys tmm-info
Sys::TMM Information
---------------------------------------------------------------------------
TMM CPU Memory Connections Conn-Rate PVA-Client PVA-Server
---------------------------------------------------------------------------
0 89% 1.2G/2.0G 42311 12400/s 0 0
1 91% 1.1G/2.0G 41987 11800/s 0 0
2 34% 0.8G/2.0G 18203 5200/s 0 0
3 32% 0.8G/2.0G 17901 5100/s 0 0
Notice two things in this output. TMM 0 and TMM 1 are saturated while TMM 2 and 3 are relatively idle — that uneven distribution is a clue about flow hashing, which I will cover later. Also notice that PVA-Client and PVA-Server are both zero across every thread. That means no flows are being hardware-offloaded, which is a significant problem on its own. Let us work through the most common root causes one by one.
Root Cause 1: iRule Processing Overhead
iRules are powerful. They are also one of the easiest ways to accidentally saturate your BIG-IP CPU. Every iRule event that fires runs in the TMM context — inline, blocking, consuming CPU cycles right there in the fast path. If your iRule is doing heavy string manipulation, calling external data groups on every request, using HTTP::payload to inspect request bodies, or running nested conditionals across thousands of transactions per second, it adds up fast.
In my experience, the worst offenders are iRules that were written years ago to solve a specific problem and then got copy-pasted across dozens of virtual servers. Nobody went back to profile them. A rule that costs 2 microseconds at 100 requests per second becomes a CPU hog at 200,000 requests per second. The math is unforgiving.
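That unforgiving math is worth making explicit. Here is a minimal sketch, using the same hypothetical 2-microsecond cost and request rates from the paragraph above:

```python
# Rough illustration of how per-request iRule cost scales with traffic.
# The 2-microsecond cost and the request rates are the hypothetical
# figures from the text, not measurements from any specific platform.

def irule_cpu_fraction(cost_us_per_exec: float, requests_per_sec: float) -> float:
    """Fraction of one CPU core consumed by a single iRule event."""
    busy_us_per_sec = cost_us_per_exec * requests_per_sec
    return busy_us_per_sec / 1_000_000  # 1e6 microseconds in a second

# The same 2-microsecond rule at two traffic levels:
low = irule_cpu_fraction(2, 100)        # negligible
high = irule_cpu_fraction(2, 200_000)   # a large bite out of one core

print(f"at 100 req/s:  {low:.4%} of a core")
print(f"at 200k req/s: {high:.2%} of a core")
```

The rule itself never changed; only the multiplier did. That is why a rule that profiled as harmless on a quiet virtual server can dominate CPU after it is copy-pasted onto a busy one.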
To identify which iRules are contributing to CPU load, use the built-in statistics collection. Review them across all rules at once:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show ltm rule all-stats
Ltm::Rule Event: /Common/inspect_uri:HTTP_REQUEST
Priority 500
Executions 14823441
Failures 0
Aborts 0
CPU cycles (min) 212
CPU cycles (mean) 8843
CPU cycles (max) 482001
CPU cycles (total) 131204879363
Ltm::Rule Event: /Common/legacy_header_insert:HTTP_REQUEST
Priority 500
Executions 14820193
Failures 0
Aborts 0
CPU cycles (min) 88
CPU cycles (mean) 142
CPU cycles (max) 1893
CPU cycles (total) 2104467426
Compare those two rules. The legacy_header_insert rule executes in a mean of 142 cycles — fast and cheap. The inspect_uri rule has a mean of 8,843 cycles and a max of 482,001. That maximum is a red flag; it means at least one execution caused a severe stall. With nearly 15 million executions counted, this rule is consuming enormous CPU. That is where you start.
Common optimizations: move static lookup data into data groups (hash lookup instead of linear string matching), replace regex patterns with [string match] prefix checks where possible since string operations are significantly cheaper than regex compilation and execution, avoid HTTP::payload unless you genuinely need to buffer and inspect the request body, and ensure your rule events fire at the correct context — using HTTP_REQUEST rather than CLIENT_ACCEPTED so the rule only activates when HTTP data is actually present.
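The regex-versus-prefix tradeoff generalizes beyond Tcl. The sketch below uses Python as a stand-in for iRule logic; the URI and pattern are invented for the example, and the point is that a plain prefix check classifies the same traffic with far less work per call:

```python
import re
import timeit

# Illustrative stand-in for the "regex vs. plain prefix check" tradeoff.
# The URI and pattern are made up for this example.
uri = "/api/v2/orders/12345/items"
pattern = re.compile(r"^/api/v2/orders/\d+")

def with_regex(u: str) -> bool:
    return pattern.match(u) is not None

def with_prefix(u: str) -> bool:
    return u.startswith("/api/v2/orders/")

# Both classify this URI the same way...
assert with_regex(uri) and with_prefix(uri)

# ...but the prefix check does much less work per call. Relative timings
# vary by runtime, so print them rather than assert on them.
print("regex :", timeit.timeit(lambda: with_regex(uri), number=100_000))
print("prefix:", timeit.timeit(lambda: with_prefix(uri), number=100_000))
```

When the two checks are not exactly equivalent (here the regex also requires digits after the prefix), decide whether the looser prefix match is acceptable before swapping one for the other.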
Also check what is actually attached to your high-traffic virtual servers:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh list ltm virtual /Common/vs_web_443 rules
ltm virtual /Common/vs_web_443 {
rules {
/Common/inspect_uri
/Common/legacy_header_insert
/Common/old_redirect_rule
}
}
Three rules on one virtual server, all firing on every HTTP request. Profile each one. If a rule is no longer serving its original purpose — or if it was a one-time fix for something that has since been resolved elsewhere — remove it. Dead iRules attached to production virtual servers are a remarkably common finding during performance audits, and the fix is literally a single tmsh modify command away.
Root Cause 2: SSL Offload Overloading
SSL termination is one of the primary value propositions of putting a BIG-IP in the path, but it comes with a real CPU cost. RSA key exchanges are computationally expensive. TLS 1.3 with ECDHE is cheaper than TLS 1.2 with RSA 4096, but full handshakes still cost meaningful cycles regardless of version. When your SSL TPS climbs faster than your hardware acceleration can absorb, the overflow lands directly on the TMM software CPU.
I have seen this happen after a routine certificate renewal where someone switched from a 2048-bit RSA cert to a 4096-bit RSA cert without understanding the CPU implication — within an hour the device was struggling under what looked like normal traffic. I have also seen it happen when session resumption was accidentally disabled on the client SSL profile, forcing a full handshake on every single connection instead of resuming from the session cache.
Check your SSL profile statistics to understand the current state:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show ltm profile client-ssl /Common/clientssl_web
Ltm::Client SSL Profile: /Common/clientssl_web
Handshake Failures 142
Renegotiations 0
Session Cache Current Entries 24871
Session Cache Hits 18432
Session Cache Lookups 42301
Session Cache Overflows 3891
Connections (TLS 1.2) 287441
Connections (TLS 1.3) 94321
Current Connections 1823
Total Connections 381904
Avg TPS 4831
That session cache overflow count of 3,891 is worth addressing immediately. Cache overflows mean those clients are falling back to full handshakes, each one costing substantially more CPU than a resumed session. Increase the cache size and extend the cache timeout if clients are connecting more frequently than the current timeout allows:
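Before tuning, it helps to turn the raw counters above into ratios. The numbers below are copied from the sample output; the 80% hit-rate target is a common rule of thumb rather than an F5-documented threshold:

```python
# Quick health math on the client-ssl statistics shown above.
hits = 18432
lookups = 42301
overflows = 3891

hit_rate = hits / lookups
print(f"session cache hit rate: {hit_rate:.1%}")          # well under an 80% target
print(f"overflows per lookup:   {overflows / lookups:.1%}")

# Every miss falls back to a full handshake, so the miss count is a
# direct proxy for extra handshake CPU.
misses = lookups - hits
print(f"full handshakes forced: {misses}")
```

A hit rate in the low 40s on a device doing thousands of TPS means the majority of connections are paying full-handshake cost, which is exactly the state the cache-size and cache-timeout changes below are meant to correct.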
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh modify ltm profile client-ssl /Common/clientssl_web cache-size 65536 cache-timeout 3600
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh save sys config
Also verify whether hardware SSL acceleration is functional on your platform. On BIG-IP hardware with dedicated crypto ASICs (the i4000, i5000, and i7000 series all have hardware crypto), a non-functional crypto module means every SSL operation is handled in software:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show net hardware field-fmt | grep -A5 crypto
crypto {
ssl-hw {
status enabled
}
}
If ssl-hw shows disabled or the section is absent entirely, all SSL work is hitting the TMM CPU in software. On BIG-IP Virtual Edition deployments, software SSL is always the case — the only levers available are right-sizing the vCPU allocation for your expected SSL TPS and aggressive session caching.
Review your cipher suite configuration as well. RSA 4096 key exchanges are roughly 4–8 times more expensive than RSA 2048. If security policy permits, moving to ECDSA certificates with P-256 curves gives equivalent or better security at a dramatically lower CPU cost per handshake. This is a meaningful change on devices handling thousands of new TLS connections per second.
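A back-of-the-envelope handshake budget makes the key-type decision concrete. The per-handshake costs below are illustrative placeholders — real values depend on the platform and crypto stack — but the ratios track the rough RSA-2048-to-RSA-4096 penalty described above:

```python
# Assumed per-handshake CPU costs, in milliseconds of one core.
# These are illustrative placeholders, not benchmark results.
COST_MS = {
    "rsa2048": 1.0,     # baseline assumption: 1 ms per full handshake
    "rsa4096": 6.0,     # ~6x the baseline, within the 4-8x range cited
    "ecdsa_p256": 0.3,  # assumed fraction of the RSA-2048 cost
}

def cores_needed(tps: float, key_type: str) -> float:
    """CPU cores consumed by full handshakes alone at a given TPS."""
    return tps * COST_MS[key_type] / 1000.0

tps = 4831  # the Avg TPS from the sample profile output above
for key_type in COST_MS:
    print(f"{key_type:10s}: {cores_needed(tps, key_type):5.2f} cores of handshake work")
```

Whatever the exact constants are on your hardware, the shape of the result is the same: at thousands of full handshakes per second, the key type choice is the difference between a workload the device absorbs and one it cannot.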
Root Cause 3: Too Many Concurrent Connections
The connection table on a BIG-IP is not free. Every established connection — client-side and server-side — consumes memory and requires periodic processing for timer management, keepalives, and state tracking. When concurrent connections climb into the millions, the sheer overhead of managing that state starts consuming measurable CPU cycles, separate from and in addition to the actual traffic forwarding work.
This is distinct from connection rate. You can have a moderate new connection rate and still accumulate millions of concurrent connections if timeouts are too permissive, if clients are abandoning sessions without proper TCP teardown, or if the application tier is holding connections open far longer than typical. I have seen BIG-IP devices with normal new-connection rates but 3–4 million concurrent connections simply because nobody had ever tuned the idle timeout away from the default.
Check your current connection table state:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show sys connection count
Sys::Connections
Connections: 2847392
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show ltm virtual /Common/vs_web_443 | grep -i connection
Current Connections 847392
Maximum Connections 1243019
Total Connections 48293847
If you have nearly a million connections on a single virtual server and cannot explain why, look at your TCP profile timeout settings:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh list ltm profile tcp /Common/tcp-wan-optimized | grep -E "(idle|close|fin|time-wait)"
close-wait-timeout 5
fin-wait-2-timeout 300
idle-timeout 300
time-wait-recycle enabled
time-wait-timeout 2000
An idle-timeout of 300 seconds means an inactive connection holds a connection table slot for five full minutes. For most web application workloads, 60 seconds is more than sufficient. Tightening this value flushes stale connections much faster and reduces the size of the table the system is constantly iterating over.
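Little's law puts numbers on the timeout change: steady-state table size is roughly arrival rate times average connection lifetime. The 3,000-connections-per-second rate below is hypothetical, chosen only to show the scale of the effect:

```python
# Little's law: steady-state concurrent connections ~= arrival rate x
# average lifetime. The arrival rate here is a hypothetical example,
# not taken from the device output above.
def steady_state_connections(new_conns_per_sec: float, avg_lifetime_sec: float) -> float:
    return new_conns_per_sec * avg_lifetime_sec

rate = 3000  # hypothetical new connections per second

# Mostly-idle connections live until the idle timeout reaps them:
print(steady_state_connections(rate, 300))  # table entries at a 300s timeout
print(steady_state_connections(rate, 60))   # table entries at a 60s timeout
```

Cutting the idle timeout from 300 to 60 seconds shrinks the idle-dominated portion of the table by a factor of five at the same traffic level, with no change to active connections.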
OneConnect is the other major lever. When enabled, BIG-IP multiplexes many client-side connections over a smaller pool of persistent server-side connections. A virtual server handling 50,000 simultaneous clients without OneConnect may have 50,000 client-side and 50,000 corresponding server-side connections in the table. With OneConnect enabled and 200 persistent server connections, you serve those same 50,000 clients while managing a fraction of the total state. The CPU reduction on HTTP workloads can be dramatic:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh modify ltm virtual /Common/vs_web_443 profiles add { /Common/oneconnect { } }
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh save sys config
Confirm that your backend pool members support HTTP/1.1 connection reuse and do not rely on connection-level session state before enabling this in production. Most modern application servers handle connection multiplexing correctly, but it is worth verifying before you flip the switch on a production virtual server.
Root Cause 4: Memory Pressure
Memory and CPU problems on a BIG-IP are tightly coupled in ways that are not always obvious. When the system runs low on available memory, TMM starts dropping connections to reclaim it. The kernel may begin swapping, which destroys I/O performance and indirectly drives CPU load higher as processes wait on swap I/O. In severe cases, the kernel starts killing processes to survive. If TMM itself gets killed and restarted, you get a brief traffic interruption followed by a CPU spike as every client that was connected tries to reconnect simultaneously.
Memory pressure can be caused by connection table bloat (see the previous section), oversized buffer allocations in HTTP or TCP profiles, large iRule data groups loaded into TMM memory, or simply having provisioned too many BIG-IP modules for the available physical RAM on the chassis.
Check memory state with both the BIG-IP native command and the underlying Linux view:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show sys memory
Sys::Memory (bytes)
TMM Memory Used 3.8G
TMM Memory Total 4.0G
Other Memory Used 2.1G
Other Memory Total 4.0G
Swap Used 842M
Swap Total 1.0G
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # free -m
total used free shared buff/cache available
Mem: 16033 14982 312 48 739 803
Swap: 1023 842 181
TMM memory at 95% of its allocation and 842 MB of swap actively in use — this system is in trouble. Swap usage on a production BIG-IP should normally be zero. Seeing any swap activity is a warning; seeing hundreds of megabytes of swap in use means you are already past the warning stage and into damage-control territory.
Short-term actions: tighten connection idle timeouts to flush stale state faster, check whether large data groups can be reduced, and verify that the maxrejectrate threshold has not been inadvertently set to a value that causes the system to hold more failed connection state than necessary.
Longer-term: review which BIG-IP modules are provisioned. Every module allocated at nominal level takes a share of system memory, whether it is actively processing traffic or not:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh list sys provision
sys provision afm {
level nominal
}
sys provision asm {
level nominal
}
sys provision ltm {
level nominal
}
sys provision apm {
level nominal
}
If AFM, ASM, and APM are all provisioned at nominal but you are only actively using LTM and ASM, deprovision what you do not need. Changing provisioning requires a controlled reboot — schedule it in a maintenance window. But it is often the correct long-term answer when memory is consistently constrained and you have modules sitting idle consuming resources.
Root Cause 5: Hardware Forwarding Disabled
This one catches people off guard, especially engineers who come from a pure software networking background. BIG-IP hardware platforms include a Packet Velocity Accelerator — essentially an ASIC or FPGA capable of forwarding established flows entirely in hardware without involving TMM CPU at all. When a flow is offloaded to the PVA, it consumes essentially zero TMM CPU for the bulk of its lifetime. TMM only handles the initial connection setup and the final teardown. The PVA does everything in between.
If PVA hardware forwarding is disabled — deliberately as a workaround for some other issue, accidentally through a configuration change, or implicitly because an attached profile is incompatible with hardware offload — every single packet of every established flow goes through TMM software processing. On a busy device carrying multiple gigabits of sustained traffic, that is the difference between 20% TMM CPU and 95% TMM CPU. It is that significant.
First confirm whether the PVA hardware is present and enabled at the platform level:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show net hardware field-fmt | grep -A4 pva
pva {
status enabled
version 9.4
}
Now check whether your specific virtual servers are configured to use it:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh list ltm virtual /Common/vs_web_443 pva-acceleration
ltm virtual /Common/vs_web_443 {
pva-acceleration none
}
There it is. The PVA is present and enabled at the hardware level, but this virtual server is explicitly configured with pva-acceleration none. Every packet hits TMM in software. Confirm by checking the per-thread TMM info — the PVA counters should tell the whole story:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh show sys tmm-info
Sys::TMM Information
---------------------------------------------------------------------------
TMM CPU Memory Connections PVA-Client PVA-Server
---------------------------------------------------------------------------
0 89% 1.2G/2.0G 42311 0 0
1 91% 1.1G/2.0G 41987 0 0
2 34% 0.8G/2.0G 18203 0 0
3 32% 0.8G/2.0G 17901 0 0
Zero PVA offloads across all threads on a device handling 100,000+ concurrent connections. Everything is in software. Re-enable hardware forwarding:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh modify ltm virtual /Common/vs_web_443 pva-acceleration full
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh save sys config
Before doing this in production, understand why it was disabled in the first place. Check your change history and ticket system. PVA is sometimes set to none as a workaround for a specific platform bug or a known incompatibility with an attached feature. If there is no documented reason, re-enabling it is generally safe — but verify in a maintenance window where you can watch for unintended side effects. After re-enabling, watch the PVA-Client and PVA-Server counters in tmsh show sys tmm-info; you should see those numbers start climbing within minutes as the PVA begins offloading established flows, and you should see a corresponding drop in TMM CPU.
A few things will always prevent PVA offload regardless of the pva-acceleration setting. SSL profiles prevent hardware offload because the PVA cannot perform TLS decryption. iRules that fire on per-packet events (like CLIENT_DATA or SERVER_DATA) prevent offload because packet-level events require TMM involvement for each packet. Some APM and ASM inspection features similarly force software processing. For virtual servers where hardware offload is structurally unavailable, optimizing the other factors — iRule efficiency, SSL session caching, connection timeouts — becomes even more critical.
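A rough model shows why the offload ratio dominates TMM packet load. Every number below — flows, packets per flow, and the setup/teardown packet count — is an assumed value for illustration, not a measurement:

```python
# Rough model of PVA offload impact: TMM only touches the setup and
# teardown packets of an offloaded flow. All counts here are assumed
# values for illustration.
def tmm_packets(flows: int, pkts_per_flow: int, offloaded_fraction: float,
                setup_teardown_pkts: int = 10) -> int:
    """Packets TMM must process in software."""
    offloaded = int(flows * offloaded_fraction)
    software = flows - offloaded
    # Offloaded flows still cost TMM their handshake and teardown packets.
    return software * pkts_per_flow + offloaded * setup_teardown_pkts

flows, pkts = 100_000, 500
print(tmm_packets(flows, pkts, 0.0))   # no offload: every packet in software
print(tmm_packets(flows, pkts, 0.9))   # 90% offload: an order of magnitude less
```

Under these assumptions, going from zero to 90% offload cuts software-processed packets by almost 90%, which is consistent with the dramatic CPU difference described above.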
Root Cause 6: Logging Overhead
This one flies under the radar during performance audits. High-Speed Logging, request logging profiles, and ASM verbose logging can generate enormous log volumes. A request logging profile attached to a high-traffic virtual server that logs every URL, every request header, and every response code is formatting and transmitting a complete log record for every single transaction. At 50,000 requests per second, that is a non-trivial workload — both the string formatting inside TMM and the I/O path to the logging destination.
Look for request logging profiles attached to virtual servers:
[infrarunbook-admin@bigip-01:Active:Standalone] ~ # tmsh list ltm virtual /Common/vs_web_443 profiles
ltm virtual /Common/vs_web_443 {
profiles {
/Common/http { }
/Common/clientssl_web { context clientside }
/Common/serverssl { context serverside }
/Common/oneconnect { }
/Common/request-log-verbose { }
}
}
That request-log-verbose profile is a candidate. Check what it is logging, at what granularity, and where it is sending records. If the destination is a remote syslog over TCP, you also have the risk of blocking I/O behavior when the syslog buffer fills under load. For high-volume virtual servers, switch to UDP-based HSL which is non-blocking, reduce log verbosity to only what operations teams actually use, or implement request sampling — logging 1–5% of transactions at random still provides representative data for analysis without the full CPU overhead of logging every request.
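The sampling decision itself is trivial to implement. The sketch below shows the selection logic in Python; in an iRule the same idea would typically be a rand()-based guard in Tcl around the logging call, but treat that as an approach to adapt, not a drop-in snippet:

```python
import random

# Minimal sketch of request sampling: log a fixed fraction of
# transactions instead of all of them. The 2% rate is an example value.
def should_log(sample_rate: float, rng: random.Random) -> bool:
    return rng.random() < sample_rate

rng = random.Random(42)  # seeded so the example is reproducible
sampled = sum(should_log(0.02, rng) for _ in range(100_000))
print(f"logged {sampled} of 100000 requests (~2%)")
```

Sampling keeps the per-request CPU cost of logging proportional to the sample rate while still giving operations teams a statistically representative view of traffic.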
Root Cause 7: TMM Thread Imbalance
Modern BIG-IP hardware runs one TMM thread per CPU core assigned to the TMM process. Traffic is distributed across those threads using Receive Side Scaling, hashing on a tuple of source IP, destination IP, source port, and destination port. Under normal conditions the load spreads roughly evenly. In practice, this breaks down when traffic lacks entropy in the hash inputs.
The most common trigger is a large fraction of traffic originating from a NAT gateway behind which thousands of clients share a small pool of public IP addresses. With only a handful of distinct source IPs, the RSS hash has limited entropy to work with and consistently routes many of those flows to the same one or two TMM threads. You end up with threads 0 and 1 sitting at 90% CPU while threads 2 and 3 idle along at 30% — exactly the pattern in the opening output of this article.
There is no universal single-command fix for RSS imbalance. If you control the upstream NAT, using a larger pool of source IPs provides more entropy for the hash algorithm. On some platforms you can tune the RSS hash key via db variables to try to improve distribution. In other cases this is fundamentally a capacity planning problem — the effective parallelism of your BIG-IP is lower than the core count because the traffic pattern defeats the distribution mechanism, and you need to account for that when sizing the device.
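The pigeonhole effect behind the imbalance can be shown with a toy model. This is deliberately simplified: SHA-256 stands in for the real hash, the thread count and addresses are made up, and it models the degenerate case where the effective hash input collapses to a handful of distinct values (for example, a tiny NAT source pool in a mode that keys mostly on addresses):

```python
import hashlib

# Toy model of hash-based thread selection. With only N distinct hash
# inputs, at most N of the available threads can ever receive traffic.
# SHA-256 is a stand-in for the real RSS hash; addresses are examples.
THREADS = 4

def pick_thread(hash_key: str) -> int:
    digest = hashlib.sha256(hash_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % THREADS

def distribution(keys: list, flows_per_key: int) -> list:
    buckets = [0] * THREADS
    for key in keys:
        buckets[pick_thread(key)] += flows_per_key
    return buckets

nat_pool = ["203.0.113.1", "203.0.113.2"]              # 2 distinct inputs
print("2 hash keys:  ", distribution(nat_pool, 25_000))  # >=2 threads get nothing

wide = [f"198.51.100.{i}" for i in range(1, 255)]       # 254 distinct inputs
print("254 hash keys:", distribution(wide, 200))
```

With two distinct inputs, at least two of the four buckets must stay empty no matter how good the hash is — which is the saturated-threads-next-to-idle-threads pattern from the tmm-info output at the top of this article.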
Prevention
Most high CPU incidents on BIG-IP are preventable. The recurring pattern is that someone makes a change — attaches a new iRule, renews a cert with a larger key size, disables PVA as a quick troubleshooting step and forgets to re-enable it, deploys a verbose logging profile to all virtual servers — and the CPU impact does not become visible until traffic peaks hours or days later. Nobody was watching the CPU trend during the window when the change was made.
The single most effective prevention habit is proper baselining. Know what normal TMM CPU utilization looks like on your device across different hours of the day and days of the week. Use SNMP polling or streaming telemetry to track per-thread TMM CPU continuously over time. When CPU starts trending up, you want to catch it at 60% and have a conversation about it — not be woken up at 2 AM when it hits 95% during peak traffic.
For iRules, establish a review gate before attaching new rules to production virtual servers. Use tmsh show ltm rule all-stats to profile rules in a staging environment first. Set a team standard that any rule with mean CPU cycles above a defined threshold requires an architecture review before production deployment. This takes ten minutes and has saved many teams from hard-to-diagnose performance regressions.
For SSL, keep session cache sizing appropriate to your concurrent client count. Calculate the expected number of unique clients connecting within your cache timeout window and set cache-size accordingly. Monitor session cache hit rates — target above 80% for typical web workloads — and treat falling hit rates as an early warning indicator. Review cipher configurations at least annually; older cipher lists often include RSA-based key exchange algorithms that can still end up negotiated and that are significantly more expensive than their ECDHE equivalents.
Keep connection timeouts tuned to realistic values for your workload. Review your application's actual connection lifetime patterns and configure idle timeouts to reflect them. Enable OneConnect on HTTP virtual servers where the backend supports connection multiplexing — it is one of the highest-leverage configuration changes you can make for connection-heavy workloads, and the operational risk is low on properly functioning HTTP backends.
On hardware platforms, monitor PVA offload rates as a first-class metric alongside CPU and memory. If you modify a profile, attach an iRule, or enable a security feature on a virtual server and the PVA client and server counters drop to zero, that change is now costing you in software processing overhead. Make the tradeoff consciously. And whenever you disable hardware forwarding as a troubleshooting step, note it in your ticket immediately and schedule re-evaluation during the next change window. Temporary workarounds have a way of becoming permanent configurations on busy teams.
Finally, align module provisioning with actual usage. Provisioned modules you are not actively using consume memory that could otherwise serve live connections. Review provisioning annually as part of your lifecycle management process, and treat any change as a controlled reboot event that requires a maintenance window — not something to squeeze in between meetings.
