Imagine you are the head of a dynamic IT department, with several people reporting to you and even more projects and goals to deliver.

Now, take a breath, or two, and think. Is your department effective enough to deliver what it is supposed to?

I was in that exact situation a few months ago, starting as an IT Operations Director in a big multinational company, in a highly complex environment with several critical projects already knocking on my door from day one at work.

The IT department I joined is a mixed environment (a bastard-hybrid, as I hate calling it): a not-so-flexible combination of in-house, core-outsourced and per-project-outsourced resources, and not necessarily human ones only.

Therefore, the first question I put to upper management was: “How do you measure the department’s effectiveness?”.

Needless to say, I have never been in a more silent room than that one for the next two minutes.

Then someone said: “We have a ticketing system, with SLA times, that monitors everything”. Sure, but that covers only a small percentage of the department; what about the externals, vendors, outsourced staff and so on? You cannot force third-party vendors to comply with your internal SLAs; they have their own metrics to achieve.

Things were getting too complex to discuss further, so, leaving the meeting, I was determined to show them how we could measure the effectiveness of the IT department, and that can only happen when you really understand the daily BAU (business as usual).

I then decided to draft a set of metrics and KPIs, but it turned out there are literally thousands of them that could be set and monitored, and I lost count somewhere between KPI 130 and 150, because the analysis was going far deeper into detail than it should.

After several painkillers, I decided to put myself in the position of each group within the IT department and identify the daily tasks that could be counted as goals.

First, I focused on Infrastructure, as it is by itself one of the hottest potatoes in any business. I discussed with most of the system engineers, both in-house and outsourced, what their daily tasks and completion percentages would be, in order to have a standard metric for them. This was one of the most diverse groups I interviewed, initially producing more than 50 or 60 KPIs; thankfully, we decided to trim the list and keep only the following:

  • Deployment of systems (virtual): The KPI is delivery time, measured from the moment the requester opens the ticket until the ticket is resolved (not closed, as further requirements might arise post-delivery).
  • Device hardening: Similar to the above, with a strict delivery time and small variations between OS categories or middleware applications, as these may raise exemption requests from the technical or business owners.
  • Patch management: A very broad category for KPIs, as it relies heavily on downtime windows and systems availability. I decided to keep four sub-categories (servers, workstations, hardware, networks) and average their KPIs afterwards for a better overall estimate (see the sketch after this list). In general, we opted for a monthly patching cycle for servers and workstations, a quarterly cycle for hardware, while networks are ad hoc, meaning firmware is patched whenever a new release becomes available, even if only days have passed since the last patch.
  • Backup: This has been agreed with the outsourcing company supporting us; in general, we defined the backup scope per system, frequency and retention period. I would have liked a few more KPIs on it, such as a restoration time limit or the number of successful restore tests per month, and I am considering proposing them.
  • Latest antivirus definition updates: A fairly easy KPI, achieved by monitoring a central management console to make sure all systems are configured to receive the latest updates. A report can be generated easily, and if something is not up to date, it can be brought up to date with a single click. Nice to have.
  • Internal and external surface protection: This mostly concerns firewalling and protection; we decided to add KPIs for both surfaces so we are constantly alerted if something is attacked or is not protected as expected. It has fit well into our KPI list, and we recently extended it to include both web filtering and IDS/IPS.
  • Penetration tests/vulnerability scans: The KPIs here are the number of executions and the findings-to-remediations ratio. We decided to monitor how the report scores decrease as remediations land, and gave this a dedicated KPI so we can evaluate whether these actions are heading in the right direction.
  • Disaster recovery activities: This is the first year we implement this KPI, but we decided to include it by relating the expected duration of the DR exercise to the issues audited/reported during it.
  • SIEM KPI: As an organization with a hybrid SIEM, both outsourced and in-house, we decided to include KPIs for it, covering detection and response, reporting and remediation, engineers’ response to SIEM-required actions, and the team’s crisis management.
  • Systems availability: I should have put this KPI first, as it is one of the most critical ones, providing information on business continuity and services availability. For our convenience, we set a basic SLA ratio of 99.5% for server and infrastructure availability, 99% for core application availability and 99% for human-factor availability. Everything is measured via Tivoli Enterprise Monitor, Instana and Turbonomic (custom), while human availability is provided by the HR department and combined in our calculation platform. A minimal sketch of these roll-ups follows the list.
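
To make these roll-ups a bit more concrete, here is a minimal Python sketch of the two calculations mentioned above: averaging the patch sub-categories into one KPI, and checking measured availability against the SLA targets. The function names and sample figures are my own illustrative assumptions, not the actual calculation platform:

    # Illustrative sketch only: function names and sample figures are assumptions,
    # not the actual calculation platform mentioned above.

    PATCH_CATEGORIES = ("servers", "workstations", "hardware", "networks")

    # SLA targets taken from the list above: 99.5% infrastructure, 99% core apps, 99% human factor.
    SLA_TARGETS = {"infrastructure": 99.5, "core_applications": 99.0, "human_factor": 99.0}


    def patch_kpi(compliance_by_category):
        """Average the four patch-management sub-categories into a single KPI (percent)."""
        return sum(compliance_by_category[c] for c in PATCH_CATEGORIES) / len(PATCH_CATEGORIES)


    def availability_pct(uptime_minutes, period_minutes):
        """Availability expressed as a percentage of the measurement period."""
        return 100.0 * uptime_minutes / period_minutes


    def sla_met(measured):
        """Compare measured availability per area against its SLA target."""
        return {area: measured[area] >= target for area, target in SLA_TARGETS.items()}


    # Hypothetical figures for a 30-day month (43,200 minutes).
    print(patch_kpi({"servers": 98.0, "workstations": 95.5, "hardware": 99.0, "networks": 97.0}))  # 97.375
    month_minutes = 30 * 24 * 60
    print(sla_met({
        "infrastructure": availability_pct(43_050, month_minutes),     # ~99.65% -> True
        "core_applications": availability_pct(42_700, month_minutes),  # ~98.84% -> False
        "human_factor": 99.2,  # supplied by HR as a percentage
    }))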

Second, I evaluated the already existing ticketing service to see whether I could keep something from that side. It turned out that we could keep several KPIs, such as the following (a small sketch of the time-based averages appears after the list):

  • Number of requests
  • Number of resolved requests
  • Number of unresolved requests
  • Number of provisions
  • Number of deprovisions
  • Number of general security waivers
  • Number of server resource upgrade requests
  • Number of user requests (resolved/unresolved)
  • Number of tickets pending approval/pending support/pending further information
  • Number of tickets rejected
  • Average first response time
  • Average response time
  • Average resolution time
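
For the time-based averages in that list, a small sketch like the one below conveys the idea; the ticket fields and the sample data are hypothetical, not the schema of our real ticketing system:

    # Hypothetical sketch of the time-based ticket KPIs; the Ticket fields and sample data
    # are illustrative, not the actual ticketing system's schema.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from statistics import mean
    from typing import Optional


    @dataclass
    class Ticket:
        opened: datetime
        first_response: Optional[datetime]  # None if nobody has responded yet
        resolved: Optional[datetime]        # None if still unresolved


    def avg_hours(deltas):
        """Average a list of durations, expressed in hours."""
        return mean(d.total_seconds() for d in deltas) / 3600 if deltas else 0.0


    def ticket_kpis(tickets):
        return {
            "total": len(tickets),
            "resolved": sum(t.resolved is not None for t in tickets),
            "unresolved": sum(t.resolved is None for t in tickets),
            "avg_first_response_h": avg_hours([t.first_response - t.opened
                                               for t in tickets if t.first_response]),
            "avg_resolution_h": avg_hours([t.resolved - t.opened
                                           for t in tickets if t.resolved]),
        }


    # Two made-up tickets: one resolved in 6 hours, one still pending.
    start = datetime(2022, 10, 1, 9, 0)
    sample = [
        Ticket(start, start + timedelta(hours=1), start + timedelta(hours=6)),
        Ticket(start, start + timedelta(hours=2), None),
    ]
    print(ticket_kpis(sample))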

Last, I went full business and had a short discussion with the CIO and the technical owners about four critical points:

  • What IT strategy has been designed, and whether it is compatible with the business or not.
  • What the near-future expectations are in terms of transformation of both infrastructure and applications (including architecture wherever it applies).
  • What budget is provided per year, with a short analysis of whether it is considered successful or not in terms of both CAPEX and OPEX expectations (exceeded or not).
  • Future expansion of the IT department and key roles that should be included.

Additionally, a short discussion with the Compliance Department revealed a few more KPIs to be added:

  • Percentage of compliance policies implemented or pending implementation
  • Security/Risk Assessments conducted
  • RACI implementation
  • Compliance with the various ISO standards

Summing all of this up, I was really deliberating whether adding all these KPIs would genuinely provide value or would be more of a burden, with numerous people involved in a gigantic process flow just to calculate all of the above, which possibly nobody would care about.

Luckily, it turned out to be an essential metric system, because shortly afterwards the new management requested that most of them (along with a hundred more) be added to a new KPI system, which is now monitored almost daily.

What were our findings? You might be surprised: we were not as effective as we expected our department to be, and there are areas where we can improve further, both to provide end users with a high-quality experience and to keep all infrastructure- and security-related indicators in compliance.

We set a generic SLA target of 99.5% across KPIs, above which we could consider our activities on the safe side. Apparently, more work is needed to achieve it.

Similarly, although we set a limit on monthly tickets received, this limit is constantly exceeded (for the record, we consider a maximum of 450 tickets per month a viable service-delivery level, based on complexity and criticality, which works out to an average of 10-15 per user per month). Raising this limit by 50 is under evaluation.
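
As a rough illustration of that check, a sketch along these lines would do; only the 450 ceiling and the proposed +50 come from the text above, everything else is assumed:

    # Rough illustration of the monthly ticket-volume check; only the ceiling of 450
    # (and the +50 under evaluation) come from the text, the other figures are assumptions.
    MONTHLY_TICKET_CEILING = 450                      # viable service-delivery level
    PROPOSED_CEILING = MONTHLY_TICKET_CEILING + 50    # the increase under evaluation


    def volume_report(tickets_this_month, active_users):
        per_user = tickets_this_month / active_users
        status = "within limit" if tickets_this_month <= MONTHLY_TICKET_CEILING else "limit exceeded"
        return (f"{tickets_this_month} tickets ({per_user:.1f} per user): {status}; "
                f"proposed ceiling would be {PROPOSED_CEILING}")


    # Assumed figures: 30 active users generating 520 tickets in a month.
    print(volume_report(520, 30))
    # -> "520 tickets (17.3 per user): limit exceeded; proposed ceiling would be 500"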

There is no limit on user provisions and deprovisions, although an increase has been observed over the last two months. This will affect our licensing scheme, especially for O365/Azure; still, we scored 100% thanks to our in-house, custom-made provisioning/deprovisioning process.

Last, considering these as factors that can impact the KPIs, we saw over the last month that, although BAU requests for security, resource upgrades and so on remained at normal levels, the aforementioned increase in requests (from users and developers) pushed the ticketing KPIs into negative territory, as many tickets were left unserviced or in pending status.

If you find yourself in my situation, I’d suggest being proactive and implementing some of the above.
