There’s a lot of noise on Twitter. But sometimes there are threads that harken back to the days of quality hacker and infosec forums. Like this one from @InfoSecMBz:
I love these sorts of thought exercises.
Going from next to nothing for 5,000 servers to real configuration management is going to be a multi-year process, and will probably encompass a full lifecycle for those particular machines. It just is, unless someone has the go-ahead to scorched-earth burn it down and rebuild, or to slam in a standard and deal with the broken assets and resources through a few years of pain (and admin burnout).
And let’s just be real here if we’re talking to this client. There are mature shops who do lots of things correctly but still have poor configuration management. On a made-up scale of 1-10 on the road to mature IT and security practices, configuration management is probably around 4-5 to start, and 7-8 to really own.
Below are some of my bullet items. And yes, I know there’s a whole thread to cheat from, but in true thought exercise spirit, I’ve tried to minimize spoiling myself on all the other answers right away.
0. Discuss the scope, deliverables, definitions. Wanting to do “configuration management” can mean different things, and for a project that could take years, the specifics really need to be discussed.
- For instance, is the desire to have a hardened configuration baseline?
- Or just a checklist on how every server is built at a basic level?
- Is it necessary to know and profile all software installed?
- Does this include configuration management for all features, roles, and software (e.g., IIS, Tomcat, Apache)?
- What is the expectation to build ongoing processes, audits, and checks to ensure compliance? Is this even about compliance?
- What is the driver for the customer asking for this? Is it to adhere to a specific requirement, or to eliminate an identified risk to operations and technical debt? Did someone read an article or talk to a peer?
- What is the vision of the future? Someone at some point needs a 1-year, 3-year, 5-year vision of how the environment is managed. “In the future, I want all servers to have a documented build procedure and security configuration automatically enforced continuously and all changes known and tracked.” Vision statements help contain scope, determine deliverables, and help define success.
I would start by breaking out some of the layers of “configuration management.” My assumption here is that this post will cover the first two items and leave the others for future maturity.
- There is OS level configuration management, including patching.
- Then there is management of software.
- Then there is configuration management of things that live within the OS (software, features, services, server components…).
- And then there is configuration management of custom applications, code, web apps.
- Lastly, I also consider networking devices to be a separate discussion.
If a customer truly does not know what they want, I would say what they want is threefold:
- They want to know their inventory/assets.
- They want to patch and know their patch coverage metrics.
- They want to know how to build/rebuild their servers to minimize ops risk/cost.
00. Plan the project. At this point, there should be an effort made to plan out the project. The items listed below are not meant to be done one by one, only moving to the next after finishing the first; a project run that way has no real chance of completing successfully. Instead, many of these items can run in parallel for a long period of time. There should also be milestones and maturity levels that are achieved as they progress. And there are questions of how to move forward. Should we tackle the whole environment at once, or should we tackle small swaths first? If we do a small group first, we can more quickly produce proofs of concept, and possibly pull in other lines of servers later on. Or maybe we just stand up a good environment, and as server lifecycles kick in and servers fall off, their services can be brought back up in the “good” environment. All of the above are ways to go, and an idea should be formulated at this point on options to move forward and track progress.
1. Inventory. This needs to start with some level of asset inventory to capture what is present in the environment. What OS and version, where is it located on the network, what general role does it play (database server, web server, file server, VM host…), physical or virtual, and a stab at who the owner of the system is. This should be a blend of technical and non-technical effort and is meant to be broad strokes rather than fine-grained and painstakingly detailed. On the tech side: scanning the known networks*, looking at firewall logs, looking at load balancer configurations, looking at routing tables and ARP tables, and dumping lists of VMs from VM hosts (a rough sketch of seeding an inventory this way follows the note below). On the non-technical side: interviews with staff who own the servers and interviews with staff who use resources that need to be known. All of this information will fuel further steps. And I want to stress that very few of the subsequent steps will see true success without this step being taken seriously.
(* This may be a good time to also have the customer introduce a baseline vulnerability scanning function. There is a lot of overlap here with a vulnerability scanner that scans the network for assets, tries to log in and do various checks, and enumerate patch levels and software installed. Or it might be time to implement a real asset CMDB or other system. Keep in mind each OS family will need some master “source of truth” for asset inventory.)
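To make that concrete, here is a minimal sketch of seeding an inventory CSV with a simple TCP connect sweep using nothing but the Python standard library. The subnet, probe ports, and output file are made-up examples, and in practice a real scanner (nmap, a vulnerability scanner, the CMDB’s own discovery) does this far better; the point is just that an initial “what is even out there” list can be bootstrapped cheaply.

```python
# Minimal sketch: seed an asset inventory CSV from a TCP connect sweep.
# The subnet, ports, and output path are placeholders for illustration only.
import csv
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor

SUBNET = "10.0.10.0/24"            # assumed example network
PROBE_PORTS = [22, 80, 443, 3389]  # SSH, HTTP, HTTPS, RDP as rough OS/role hints

def probe(host):
    """Return a rough inventory row if the host answers on any probe port."""
    open_ports = []
    for port in PROBE_PORTS:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                open_ports.append(port)
        except OSError:
            continue
    if not open_ports:
        return None
    return {"ip": host, "open_ports": " ".join(map(str, open_ports)), "owner": "", "role": ""}

def main():
    hosts = [str(ip) for ip in ipaddress.ip_network(SUBNET).hosts()]
    with ThreadPoolExecutor(max_workers=64) as pool:
        results = [row for row in pool.map(probe, hosts) if row]
    with open("inventory_seed.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["ip", "open_ports", "owner", "role"])
        writer.writeheader()
        writer.writerows(results)

if __name__ == "__main__":
    main()
```

The owner and role columns start empty on purpose; those get filled in by the interviews and the non-technical side of this step.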
From here, we can start splintering off some additional tasks to run in parallel or otherwise. For the sake of simplicity, I’ll just put these in a rough order, but some things will start together, and end at different times.
2. Determine external accessibility. The point here is to quickly identify the most at-risk systems, both to prioritize their uptime and to prioritize getting them in line and known. Most likely these are the systems that most need to be up, recoverable, and secure. This will require interviews, perimeter firewall reviews, load balancer device reviews, and even router device reviews to map out all interfaces sitting on the public Internet and how those map back to actual assets on the internal networks.
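As a rough illustration of that mapping exercise, here is a small Python sketch that joins hypothetical NAT rules and load balancer VIP pools against the inventory from step 1. The data structures are invented placeholders; the real input would be firewall and load balancer exports.

```python
# Minimal sketch: join exported NAT/VIP rules against the asset inventory to
# see which internal servers are actually reachable from the Internet.
# All addresses and structures below are invented placeholders.
nat_rules = [
    {"public": "203.0.113.10:443", "internal": "10.0.10.21"},
    {"public": "203.0.113.11:443", "internal": "10.0.10.35"},
]
vip_pools = {
    "203.0.113.20:443": ["10.0.10.40", "10.0.10.41", "10.0.10.99"],
}
inventory = {"10.0.10.21", "10.0.10.35", "10.0.10.40", "10.0.10.41"}  # from step 1

exposed = {}  # internal IP -> list of public endpoints
for rule in nat_rules:
    exposed.setdefault(rule["internal"], []).append(rule["public"])
for public, members in vip_pools.items():
    for member in members:
        exposed.setdefault(member, []).append(public)

for internal, endpoints in sorted(exposed.items()):
    known = "known" if internal in inventory else "NOT IN INVENTORY"
    print(f"{internal} ({known}) exposed via {', '.join(endpoints)}")
```

Anything flagged as NOT IN INVENTORY is doubly interesting: it is Internet-facing and nobody captured it in step 1.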
3. Start making patching plans. In the scenario above, they don’t know a thing about security. This tells me they likely don’t have good patching. And this is going to have to be dealt with before tackling real configuration management. Based on the OS versions in play in the environment, plans of attack need to be made to automatically patch those servers. If this is a Windows environment, for instance, WSUS or SCCM or some other tool needs to manage the patching. This step is really about planning, but eventually it will get into deploying in phases as well. Don’t overlook that major service packs and version upgrades technically count as patches.
4. Find existing documentation and configuration practices. Someone’s been building these servers, and someone or something has been maintaining them to some degree. Existing practices and tools should be discovered. Staff who build servers from a checklist should expose those checklists. If builds are done from memory, they need to be written down. If some servers are Windows and there is a skeleton of Group Policies, those need to be exposed. If they are Linux systems and some are governed by Puppet, the extent of that governance needs to be exposed. If possible, start collecting documentation into a central location that admins can reference and maintain, and where differences can be exposed.
4a. Training and evangelism. At this point, I would also start looking at training and evangelizing proper IT practices with the admins and managers I interview. From a security perspective, I find a good security-minded sysadmin to be worth 3-4 security folks. Sysadmins help keep things in check by design. They’re the ones who will adhere to and promote the controls. If the admin teams are not on board with these changes, all of the later processes will break down the moment security isn’t constantly watching.
5. Change management. Chances are this environment does not have good change management practices. At a minimum for our purposes at the start of this big project, we need to know when servers are stood up and when they are decommissioned. This way we have a chance to maintain our inventory efforts from earlier. If there is no process, start getting someone to implement one (anything from a manual announcement, to an automated step in deployments, to picking servers up with the network scanning iterations). One side goal here is to use the earlier network scanning for inventory to compare against what is exposed through change management. If a server is built that is a surprise, it can be treated as a rogue system and removed until change management authorizes it. This process helps reduce shadow IT and the technical debt that comes from unknown resources in play. It also helps drive the ability to know what percentage of the whole is covered by later controls and processes. If you don’t absolutely know the total number of servers, you can’t say what your patch percentage is, for example!
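Here is a minimal sketch of that comparison: take the latest scan results, take the systems announced through change management, and diff the two sets. The file names and column name are assumptions for illustration.

```python
# Minimal sketch: flag "surprise" servers by comparing the latest scan results
# against the list of systems authorized through change management.
import csv

def load_ips(path, column):
    """Load a set of IPs from one column of a CSV export."""
    with open(path, newline="") as fh:
        return {row[column].strip() for row in csv.DictReader(fh) if row.get(column)}

scanned = load_ips("inventory_seed.csv", "ip")         # from the earlier scan sketch
authorized = load_ips("change_mgmt_assets.csv", "ip")  # assumed export from the CM process

rogue = scanned - authorized    # on the wire, but never announced
missing = authorized - scanned  # announced, but never seen on the wire

print(f"Rogue systems to investigate/remove: {sorted(rogue)}")
print(f"Authorized systems not seen in scans: {sorted(missing)}")
```

Both lists matter: rogue systems are the shadow IT, and the “missing” side is either stale change records or scanning blind spots.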
6. Analyze inventory. At this point it should be appropriate to analyze the inventory and see what we’re dealing with. How many families of OS are present, what versions are they, what service pack levels, and what patch levels? Which systems are running unsupported operating systems? We should have some pretty charts showing the most common OSes in place. And these charts can help us direct where our efforts should focus. For instance, if 80% of our environment is Windows, we should probably focus our efforts there.
We should also start looking at the major types of servers, such as web, file, storage, database, and other usage we have, and what percentage each makes up.
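A quick sketch of that analysis, assuming the inventory has landed in a CSV with os_family, os_version, and role columns (those column names are my own placeholders):

```python
# Minimal sketch: summarize the inventory by OS family/version and by role so
# effort can be focused where the numbers are. Column names are assumptions.
import csv
from collections import Counter

with open("inventory.csv", newline="") as fh:
    rows = list(csv.DictReader(fh))

total = len(rows)
by_os = Counter(f"{r['os_family']} {r['os_version']}" for r in rows)
by_role = Counter(r["role"] or "unknown" for r in rows)

print("OS breakdown:")
for os_name, count in by_os.most_common():
    print(f"  {os_name}: {count} ({count / total:.0%})")

print("Role breakdown:")
for role, count in by_role.most_common():
    print(f"  {role}: {count} ({count / total:.0%})")
```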
7. Baseline configuration scan for each OS family/version. This might take some effort, but this is about seeing the damage we’re looking at. This does not have to be a scan that gets every server, but from the inventory analysis above, we should be able to pick out enough representative servers to scan with a tool we introduce and get an idea of what our current configuration landscape looks like.
Bonus points on this item if a standard has been identified and used as the comparison to see drift, but I wouldn’t consider that necessary quite yet. This is all about getting a baseline scan that we can look at a few years from now and see just how much improvement we’ve made.
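As a stand-in for that baseline, here is a minimal Python sketch that captures a point-in-time snapshot per server and writes it to JSON. The fields collected here are just placeholders; the real checks would come from whatever configuration or vulnerability scanning tool gets introduced.

```python
# Minimal sketch: capture a point-in-time configuration snapshot for a sample
# of representative servers so improvement can be measured later.
# The fields gathered here are placeholder stand-ins for real checks.
import json
import platform
import socket
from datetime import datetime, timezone

snapshot = {
    "hostname": socket.gethostname(),
    "collected_at": datetime.now(timezone.utc).isoformat(),
    "os_family": platform.system(),
    "os_release": platform.release(),
    # ... real checks (services, local accounts, audit policy, etc.) go here
}

with open(f"baseline_{snapshot['hostname']}.json", "w") as fh:
    json.dump(snapshot, fh, indent=2)
```

Even a crude snapshot like this, archived per server, gives the “look how far we’ve come” comparison a few years out.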
8. Interview owners and expand inventory data. Start chasing orphaned servers (shadow or dead IT?) and get them assigned an owner. This also helps determine who is really accountable for various servers. This usually isn’t the admins, but the managers of those admins, who will be the ones who end up needing to know about and authorize changes such as patches and configuration changes. Try to figure out whether certain owners will be easy to work with and others difficult, to help prioritize how to tackle getting servers in line.
Just to note, at this point, we’ve still not really made any changes or done anything that should have impacted any services or servers. That will start to change now.
9. Patch. Expand change management scope to include patching approval and cadence. Synthesize asset information on system owners and patching capabilities. Determine a technology and process to handle OS patches, and start getting them deployed. This may take several iterations and tests before it starts moving smoothly, which may be half a year for this many servers. Try to make sure progress can be tracked as a percentage of servers patched and up to date, measured against your full expected inventory.
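A small sketch of that metric, assuming the patching tool can export its view of the world to a CSV (the file names and the fully_patched column are assumptions):

```python
# Minimal sketch: track patch coverage as a percentage of the full expected
# inventory, not just the machines the patching tool already knows about.
import csv

with open("inventory.csv", newline="") as fh:
    expected = {row["hostname"] for row in csv.DictReader(fh)}

with open("patch_tool_export.csv", newline="") as fh:
    patched = {
        row["hostname"]
        for row in csv.DictReader(fh)
        if row.get("fully_patched", "").lower() == "yes"
    }

covered = expected & patched
print(f"Patch coverage: {len(covered)}/{len(expected)} ({len(covered) / len(expected):.0%})")
print(f"Unaccounted for: {sorted(expected - patched)[:10]} ...")
```

The denominator is the whole point: it comes from the inventory and change management work, not from the patching tool’s own list.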
10. OS Upgrades. At some point in this large environment, systems will need to be replaced or upgraded, or there will no longer be patches available for them (unsupported). Start to plan and document the process to do server upgrades or replacements. This can be wrapped into lifecycle management practices. The changes from this tie into the change management process. And if you have really good build documentation, not just for base servers but also for the roles of the servers you’re upgrading, you can morph server “upgrades” into server “replacement with newer, fresh versions.” This helps combat the technical debt that comes from servers upgraded over many years where no one knows how to actually build that server if it got dumped.
11. Compare baseline against configuration knowledge. Think about comparing against a known secure configuration standard to find the delta. CIS Benchmarks are great for this, and this step is only about comparing against something, not yet making a process to get closer. For the most part, this is about comparing your baseline against the configurations you think your servers should be meeting, based on interviews and how staff have built servers in the past. Actively start leveraging change management and configuration management tools to make changes on non-compliant servers. A major deliverable from this step should be your first versions of configuration standards.
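Here is a minimal sketch of producing that delta against a handful of desired settings. The setting names and values are invented examples rather than real benchmark IDs, and the snapshot file is the hypothetical output of the earlier baseline sketch.

```python
# Minimal sketch: diff a collected configuration snapshot against a desired
# standard (e.g., values transcribed from a CIS benchmark) to produce the delta.
# Setting names, values, and the snapshot file name are invented examples.
import json

desired = {
    "password_min_length": 14,
    "smbv1_enabled": False,
    "rdp_nla_required": True,
}

with open("baseline_webserver01.json") as fh:  # hypothetical host from the snapshot sketch
    actual = json.load(fh)

for setting, want in desired.items():
    have = actual.get(setting, "<not collected>")
    status = "OK" if have == want else "DRIFT"
    print(f"{status:5} {setting}: expected {want!r}, found {have!r}")
```

The “not collected” case is worth keeping visible: it tells you which parts of the standard your scanning can’t even measure yet.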
Only now do we get to actual “configuration management.”
12. Implement configuration management capabilities. For the largest and easiest swaths of OS families/versions, start implementing configuration management tooling. For Windows, make sure systems are joined to the domain, and analyze and standardize Group Policy. Create a starting point and get missing systems in order. For Linux versions, get them into and managed by Puppet or some other solution.
13. Enforce standard server builds. The situation here is that servers are now patched properly, configuration enforcement tools are in place, and build processes are exposed. This means teams and admins should only be building servers based on known configurations and standards. This is a people process: making sure all admins are doing things the same way.
14. Implement a process to improve the configurations. There are many ways to do this, but it all comes down to having a process to choose a configuration setting to change, test it, track it, announce it, and implement it widely. This can be based on an end-goal configuration, or just picking various settings and making them better, making systems more hardened.
Keep in mind this does not mean having perfectly secure configurations. You can try for that, that’s fine, but it’s about having a process to continuously move in that direction.
Further steps will tackle the other scopes, such as installed software, roles/features, settings within installed software, etc.
Lastly, this project should only be walked away from with the customer’s awareness that most of the above steps introduce new functional processes that have to remain in place continuously in order to succeed. Once these processes are neglected, that function will break down, configuration drift will occur, and progress will reset.