A production issue today reminded me of one of those little IT ‘truths’ that I firmly believe in. It applies to security teams as much as operations.
IT should not be worked at 100% capacity.
There are a few reasons I believe in this.
1. find errors and problems.
Troubleshooting an issue or performing a specific task is really not the time to tackle the other problems you stumble across along the way; those need slack time of their own. For instance, today’s production-level issue revealed that one server was having a problem writing the correct time in the web logs. I also found that my IPS had not been logging alerts for a week on just one interface. These are issues that really should not go unnoticed for terribly long.
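As a rough sketch of the kind of check that might have caught the quiet IPS interface or the bad web log timestamps sooner, something like the following could run on a schedule. Everything here is an assumption for illustration: the function name `check_log_sources`, the source names, and the thresholds are hypothetical, and how you collect the newest timestamp per source depends entirely on your own tooling.

```python
from datetime import datetime, timedelta


def check_log_sources(last_seen, now,
                      max_silence=timedelta(hours=24),
                      max_skew=timedelta(minutes=5)):
    """Flag log sources that look silent or have an implausible clock.

    last_seen maps a (hypothetical) source name to the timestamp of its
    newest log entry. Thresholds are illustrative defaults, not advice.
    """
    problems = {}
    for name, newest in last_seen.items():
        if newest - now > max_skew:
            # Newest entry claims to be from the future: likely a clock issue.
            problems[name] = "timestamp is in the future"
        elif now - newest > max_silence:
            # Nothing logged recently: the source may have stopped logging.
            problems[name] = "no entries within the allowed window"
    return problems


# Example with made-up data: one healthy source, one silent, one skewed.
now = datetime(2024, 1, 8, 12, 0)
last_seen = {
    "ips-eth0": datetime(2024, 1, 8, 11, 55),
    "ips-eth1": datetime(2024, 1, 1, 9, 0),
    "web01-access": datetime(2024, 1, 8, 15, 0),
}
problems = check_log_sources(last_seen, now)
```

The point is not the script itself but having the spare time to write and maintain checks like it.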
2. learn and monitor trends.
Not knowing what is normal on a server is not a great place to be, especially when an issue appears and knowing the normal baseline would be very important. Sure, logging and monitoring automation can help, but a lot of admin work still goes by gut feeling and not just by historical data.
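For the part that automation can help with, here is a minimal sketch of comparing a current reading against a baseline built from past samples. It assumes you already have a list of historical values for some metric; the name `is_unusual` and the threshold of three standard deviations are my own illustrative choices, not a recommendation.

```python
from statistics import mean, stdev


def is_unusual(history, current, threshold=3.0):
    """Return True if current is far from the historical baseline.

    "Far" here means more than `threshold` sample standard deviations
    from the mean of `history`, a deliberately simple rule of thumb.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold


# Made-up CPU percentages from previous days at the same hour.
cpu_history = [40, 42, 38, 41, 39]
normal_day = is_unusual(cpu_history, 41)
bad_day = is_unusual(cpu_history, 90)
```

A rule this crude will not replace the gut feeling, but it gives that feeling a number to argue with.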
3. remain practiced with the tools and information at hand.
It is frustrating to be thrust into a situation where you know you’ve done it before, but can’t quite remember how to troubleshoot something. It is kind of like pen-testing, which is best done constantly so you can be efficient and know what you’re going after, without bumbling around trying to remember the syntax of those commands or which exploit package will shovel you that shell. Or that filter in Wireshark to show you exactly and only what you need? Do it, do it, do it. Practice until you know it well enough that it can be whipped out when the pressure is on.
4. bandwidth to react to issues.
When two high-priority issues come in to one admin, does one issue sit there until the first one has been attended to, or until a priority difference is assigned to each? What if both are show-stopper issues?
Yes, fine, there are plenty of us who spend some time visiting gaming sites and reading blogs during quiet periods of our week (fleeting as they are!), but this is also why I truly believe in hiring people who geek out about technology and security: when the fires are low, that’s what they’ll play with. They’ll look to automate troublesome tasks, improve anything that isn’t working optimally, and otherwise keep their fingers firmly on the pulse of the kingdom.
Ok, yes, I will actually concede that there are exceptions to not working the staff at 100% capacity, but this is largely a corporate culture exception where IT is expected to do just enough to keep the lights on, even if the wiring is exposed in back and getting long in the tooth. As much as an approach like that pains me, it is reality in places and the realist in me still does accept that. But you can guess which type of environment I would be happier in. 🙂