A Success Story
A solar farm company operating sites in more than 30 countries was testing different operations technologies and monitoring software. The operational data were not harmonized and couldn’t be used for predictive analysis. A Network Operation Control Team (NeOConT) had been set up one year before our project started and was still being staffed. Its task was to monitor more than 30’000 different hosts and up to 2’000 processes and services.
There was no clear operating concept, only a basic set of a few procedures. No communication rules with other teams were defined. NeOConT created no reports and did no ticket follow-ups. All it did was watch the monitoring software and raise a general-purpose ticket whenever an alert appeared. Almost all of the sensors were configured with default settings only, which didn’t reflect the workload patterns of the hosts and processes.
Overall, too many things were left without proper monitoring, and the company had very limited knowledge of what was happening in the IT infrastructure of its assets.
Today, the company has a fully operational NeOConT with well-established operating procedures, clear communication lines, defined methods for collecting the operational data, and an automated reporting system that produces and distributes all reports requested by management.
As a result of the training program set up for NeOConT, the company now has well-trained personnel with broad knowledge of the technologies used in its infrastructure and of the different monitoring solutions, as well as practical skills for working on various issues.
A “personal projects program” was also established. It allows team members to use the company’s equipment and dedicated working hours for personal projects, so that they can learn, try new things and start an innovation initiative.
The steps to achieve these benefits were broadly divided into three groups:
First, the “dark spots” of the company’s IT infrastructure were identified: what was not known about the infrastructure and processes, together with ideas on how to collect and process this knowledge.
In parallel, the “instability generators” in the infrastructure were identified: the most frequent causes of downtime and how they affect production as a whole.
Finally, we examined how issues can be detected and what can be done to reduce downtime.
An IT audit of the company’s infrastructure was done and the needs of NeOConT were identified.
Main questions at start:
When the project started, NeOConT was in the middle of testing different monitoring tools.
In the end, Prometheus and Grafana were chosen as the central data management and visualization tools, and they are used for the production-line yield dashboard. Several tools were connected as specialized data providers, e.g. cAdvisor for Docker.
There are also custom-made scripts in PHP, Python and PowerShell, developed by NeOConT and integrated into the monitoring system.
The main task was to prioritize the services, to capture, model and map the asset data to a harmonized data structure, and to produce the proper reports.
First, we identified the most important services and hosts and created an adequate naming convention, which the IT team then implemented. A priority matrix was created with unified names and procedures. All existing procedures were reviewed, and based on them we created a training plan for NeOConT so that the team would have the skills necessary to execute the procedures.
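The source does not show what the priority matrix looked like. As a rough illustration only, a matrix of this kind is often modelled as a lookup from impact and urgency to a priority level. The levels, the `P1` to `P4` labels and the mapping below are assumptions made for this sketch, not the company’s actual matrix.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical matrix: (impact, urgency) -> priority, where P1 is most critical.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): "P1",
    (Level.HIGH, Level.MEDIUM): "P2",
    (Level.MEDIUM, Level.HIGH): "P2",
    (Level.HIGH, Level.LOW): "P3",
    (Level.MEDIUM, Level.MEDIUM): "P3",
    (Level.LOW, Level.HIGH): "P3",
    (Level.MEDIUM, Level.LOW): "P4",
    (Level.LOW, Level.MEDIUM): "P4",
    (Level.LOW, Level.LOW): "P4",
}


def priority(impact: Level, urgency: Level) -> str:
    """Look up the priority for a service or incident in the matrix."""
    return PRIORITY_MATRIX[(impact, urgency)]
```

With such a table, every team reads the same priority from the same two inputs, which is the point of agreeing on unified names and procedures first.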
Most of the procedures were written by me and the team lead, with support from other teams depending on which team was responsible for the system in question. Once the service priorities and names were ready, the contact points and communication lines with other teams were established, and the procedures were created, it was time to take on the reports.
At the beginning, NeOConT had no clear requests from management for reports, so different levels of reporting were defined:
urgent – (incidents) usually automatic email notifications for different occasions.
daily – internal NeOConT report for the shift change, covering events and actions during the last 24 hours: downtime, tickets opened, items to follow up, etc.
weekly-1 – the weekly outage report covers issues that caused downtime or prevented employees or devices from using the company’s products. These issues are defined at different levels in the priority matrix. The report contains a brief summary of each incident, the teams involved, the actions taken, etc.
weekly-2 – the missing RCA report lists documented and reported events for which the relevant team has not provided a root cause analysis (RCA).
monthly – the monthly KPI report. Based on the month’s downtime reports, it calculates the availability of services and applications as a percentage, by both time and location. It also covers the SLA of tickets to NeOConT, the number, duration and resolution of calls to the NeOConT hotline, and other metrics required by management.
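The availability figure in the monthly KPI report can be illustrated with a small calculation. This is a minimal sketch, not the team’s actual script: the `Outage` record, its field names and the function name are hypothetical, and it assumes that the outages recorded for one service do not overlap in time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Outage:
    """Hypothetical downtime record taken from the downtime reports."""
    service: str
    start: datetime
    end: datetime


def availability_percent(outages, service, period_start, period_end):
    """Return the time-based availability of one service as a percentage.

    Assumes the outages of a service do not overlap; overlapping
    records would be counted twice.
    """
    period = (period_end - period_start).total_seconds()
    if period <= 0:
        raise ValueError("period_end must be after period_start")

    downtime = 0.0
    for outage in outages:
        if outage.service != service:
            continue
        # Count only the part of the outage inside the reporting period.
        start = max(outage.start, period_start)
        end = min(outage.end, period_end)
        if end > start:
            downtime += (end - start).total_seconds()

    return 100.0 * (1.0 - downtime / period)
```

For example, 7.2 hours of downtime in a 30-day month (720 hours) gives an availability of about 99%. Availability by location could be computed the same way by filtering on a location field instead of the service name.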
In general, the reporting templates and the data gathering solution are based on Python using Selenium. Data are automatically collected from Confluence, then prepared and processed to calculate the different KPIs.
As a next step, it is planned to move all reporting to Jira, as it already provides the needed functionality.
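The processing step can be sketched as follows, assuming the HTML of a Confluence page has already been fetched (for example from Selenium’s `driver.page_source`) and that events are kept in a simple table. The column names `Event`, `Team` and `RCA` are hypothetical, as is the idea that the missing RCA report is derived from such a table. Only the Python standard library is used here.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell texts of every table row in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def events_missing_rca(page_html):
    """Return the table rows (as dicts) whose hypothetical RCA column is empty."""
    parser = TableRows()
    parser.feed(page_html)
    if not parser.rows:
        return []
    header, *body = parser.rows  # the first row is treated as the header
    records = [dict(zip(header, row)) for row in body]
    return [record for record in records if not record.get("RCA", "")]
```

Keeping the parsing separate from the browser automation means the KPI logic can be checked against saved HTML without opening a browser.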