Years ago, I became aware of the book The Phoenix Project (Kim, Behr, Spafford) and added it to my wishlist, but never actually picked it up. I remedied that over the past couple of weeks by grabbing it on Kindle and reading through it. Rather than post a reaction or my thoughts on the book (at least for now), I just want to tell a small personal story that this book made me think about again.
Back around 2007, I worked as a sysadmin, and one of my main duties was supporting the servers that hosted our critical web sites. Thankfully we were already well into the virtualization takeover, but we were still using Microsoft's Network Load Balancing (NLB) to spread load across about 7 Windows Server 2003/IIS 6 web servers in one data center (the outfitted closet behind my desk). These sites ran .NET code using all sorts of virtual directories and COM objects tucked into corners of the servers, and other things I've thankfully lost all memory of!
We had dev, test, and production environments, if I recall correctly. Deployments to dev and test took place Tuesday and Thursday afternoons and required several hours of manual work and testing, during which the entire environment was inaccessible to everyone because of the work needed to install and configure IIS and the COM components. Part of the COM install was handled by a homegrown tool built by someone I never knew, and which was no longer supportable; the rest was manual labor. And because the resources were shared, if one team needed a deployment, every other team pretty much had to feel that outage too.
When I took this over, I immediately started doing a few things that seemed natural to me. First, I made a clear checklist to follow for each deployment (know your work!), removing the need to remember each step. Then I started automating the pieces I knew how to automate, using batch scripts to move files around.
At this same time, my company was in the implementation stages of a company-wide DR/BCP project. We added a second data center, and my server farm was about to grow from 7 production web servers (plus about 4 dev and test servers) to 50 or more. We were also plugging in dedicated hardware load balancers as a much-needed upgrade from NLB, and we needed to solve the file replication challenges of supporting two data centers that had to fail over to each other. Exciting times!
But this expansion meant I needed a new solution for deployments. Devops was still not really a thing. PowerShell had just recently come out, and I decided to learn it in support of the coming build-out. I mean, no one wants to spend hours and hours doing tasks that a monkey could do on servers.
So I created a PowerShell script to perform these deployments automatically. A copy ran perpetually on every production web server; they would all “check in” to a common configuration file and “elect” a master, which then controlled a separate installation configuration file. When I needed something deployed, the master would orchestrate installation kick-offs across the other servers in a predefined sequence. When a server received a command to install, the script would delete everything in IIS, remove all the supporting components, and then build it all back up, every time. I had around 100 sites on these servers, and it was pretty glorious to watch them all run through these installs for a few hours. I minimized downtime where possible (database changes sometimes made it unavoidable) by using the load balancer to know when a server should stop receiving traffic and when it was safe to receive it again. All of this was replicated to separate (and expanded) dev and test environments as well as to the servers at the DR site. Flipping over to a DR test was really just a matter of changing DNS and waiting a bit while the database failed over too (these being pre-Availability Group days).
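The original script is long gone, but to give a flavor of the check-in and election mechanics, here's a minimal sketch of the idea; the share path, file layout, heartbeat timing, and election rule are all hypothetical stand-ins, not the actual code:

```powershell
# Minimal sketch of the check-in/election idea; the share path, file
# layout, and timings here are hypothetical, not the originals.
$checkinPath = '\\fileserver\deploy\checkin.csv'

# Each server writes a heartbeat on a timer.
"$env:COMPUTERNAME,$(Get-Date -Format o)" | Add-Content -Path $checkinPath

# Election: the alphabetically-first server with a fresh heartbeat wins.
$cutoff = (Get-Date).AddMinutes(-5)
$live = @(Import-Csv -Path $checkinPath -Header Server, Seen |
    Where-Object { [datetime]$_.Seen -gt $cutoff } |
    Sort-Object -Property Server -Unique)

if ($live.Count -gt 0 -and $live[0].Server -eq $env:COMPUTERNAME) {
    # The master reads the installation config and walks the farm in a
    # predefined order, signaling one server at a time to wipe and rebuild.
    Write-Host 'Elected master; orchestrating the deployment sequence.'
}
```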
I solved quite a few problems with this setup. I reduced the admin time each deployment required, and I reduced overall deployment time. Deployments could be scheduled and run unattended at any time (weekends, nights). Outage windows were greatly reduced when they were even necessary; most of the time, by orchestrating traffic through the load balancer, I could let devs do seamless deployments whenever they wanted. I could scale this up (to an extent) to accommodate our expanded environments. I also achieved server consistency: not only were human hands removed from deployments, but because every IIS server was rebuilt from scratch, I eliminated those inconsistencies admins introduce when troubleshooting something, getting interrupted, and never getting back to set things back to how they should be. With a few networking exceptions, my dev environment was comparable to the production environment, so if a developer could get their code to run in dev early in the dev cycle, it would also run in prod (none of this “it works on my laptop!” crap). As a side benefit, no one could add something to a server that wasn't part of the known build procedure, since the script would either wipe it out or simply not know to include it. And the script and its configuration file were a self-documenting record of what was needed.
Things were good, but they got better as time went on. When we migrated to Windows Server 2008 and IIS 7, I completely rewrote the script. I removed the need to pass a “master” token around and decoupled the script from the servers: it ran on a dedicated system and used remote sessions to make changes on the web servers. I also decoupled the actual copying of web code from my scripts and made better use of DFSR, which let developers make simple file changes within seconds if they wanted to. That, in turn, pushed management of the “dev first, then test, then prod” pipeline into development hands, taking me out of that decision structure. I also made sure my script could install pieces and parts of sites rather than the whole server, if desired (while still keeping the ability to do a full clean and install). When we moved to Windows Server 2012 and IIS 8, I again made smaller changes to improve support.
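Conceptually, the centralized model looked something like the sketch below, assuming PowerShell remoting is enabled and using the IIS 7+ WebAdministration module; the server list, site name, and paths are illustrative, and the load balancer steps were vendor-specific calls I've omitted:

```powershell
# Sketch of the centralized rebuild; server list, site name, and paths
# are illustrative assumptions, not the real configuration.
$webServers = Get-Content -Path '\\fileserver\deploy\prod-servers.txt'

foreach ($server in $webServers) {
    # (Drain this node at the load balancer first; that API was vendor-specific.)

    Invoke-Command -ComputerName $server -ScriptBlock {
        Import-Module WebAdministration

        # Full clean: remove every existing site, then rebuild from config.
        Get-Website | ForEach-Object { Remove-Website -Name $_.Name }

        # Content itself arrives via DFSR; this only recreates the IIS metadata.
        New-Website -Name 'ExampleSite' -Port 80 -PhysicalPath 'D:\sites\ExampleSite'
    }

    # (Return the node to the load balancer before moving to the next one.)
}
```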
By the time I was done with the last iteration of my scripts, it was about 2013, and we ran that infrastructure until I left in 2016. We never dove too hard into devops, since we never really had to. I had somewhat naturally found those concepts by improving delivery, improving consistency, reducing risk, and reducing the pain I felt during deployments and while cleaning up mistakes. No one should have to be forced into constant heroic efforts to keep the lights on.
Many of those lessons are buried in The Phoenix Project, which is really the same story: an IT shop in a (rather busy) company also discovering how devops improves IT operations. It doesn't take an oracle like Erik, the threat of a business falling over, or fancy production-floor studies and jargon to figure out how to ease your pain and make things better. If you allow it, it should happen (to a degree) on its own, as people manage their little fiefdoms more efficiently and reduce their own personal pain.
Had I remained with that company, I'm pretty sure I'd have next dumped my homegrown PowerShell scripts and done one of two things: either continued with my fiefdom and implemented more established devops tooling like Ansible to manage the environment, or married up to the developers and their chosen packaging and deployment pipeline (their issue being that they couldn't get every team to agree on just one).
The Phoenix Project has many more nuances; it's like taking the IT issues of 50 companies over 5 years each and compacting them all down into one year at a single company. It's a little silly, but it illustrates all the pain that eventually led so many teams and engineers down the general path of devops. Which is still really just about keeping things in line with the whole point of IT: automation.