Simplified onboarding using configuration rules

14-Apr-2024 10:37 PM by Geoffrin Edwin

If your business is growing, then so too must your IT infrastructure. Servers, VMs, databases, nodes, pods, containers, and all of your other digital resources spin up and down in accordance with your business's needs. The catch is that all of these infrastructure elements have to be monitored without that becoming a herculean task for your team.

Here are some pain points that arise every time a server or VM is added:

Setting up a monitoring platform for the resource.
Assigning safe resource utilization limits for CPU, memory, and disk.
Monitoring critical processes and services running in it.
Routing the alerts to the proper directly responsible individuals (DRIs).
And most importantly, setting what not to monitor.

Configuration rules will help you solve all these problems and more. Every time you widen your monitoring net to cover a new device, or want to target a configuration change at a particular set of devices, configuration rules come to your rescue. This document will show you how.

Before we dive head-first, see if this scenario sounds familiar.

You are trying to set up your new server monitoring tool. The plan is to configure it to meet requirements like these:

Alerts for outages happening only in the production servers.
Outage alerts should reach you through an instant messaging application (like Slack or Teams).
Exclude your user acceptance testing (UAT) setup from monitoring.
Observability for critical files, directories, and ports.
Alerts when a process goes down.
Ensuring that disk partition alerts are not triggered for harmless cases (like snap getting filled).

Why does this sound painful? Because out-of-the-box monitoring tools are pretty straightforward: they handle the basics, such as generating an alert when a server goes down. But how do you bend one to suit your needs without breaking it or paying a consultant (whose invoice could potentially bankrupt you)?

Let's say your SREs and sysadmins have found the time and patience to set up your monitoring tool to satisfy all the above conditions. Good job! But good luck replicating the same process for the remaining thousands of servers out there.

To end this dreadful process, Site24x7 has configuration rules. With configuration rules, any configuration change to a monitor can be pushed to only the server monitors that satisfy certain conditions; for example, only the monitors tagged as USWest1 or falling under a specific IP range.

The use cases are plentiful. But first, let us start with something simple.

A typical use case

Consider an example organization that has 20,000 servers across on-premises locations and all three major public cloud providers (AWS, Azure, and GCP). The servers in the cloud are spread across different locations. The databases are set to full recovery, so log files growing too big is a known issue. The app servers are under threat of a memory leak. The VMs keep scaling up and down: every time a new VM is spawned, it has to be monitored, and when it is terminated, monitoring should stop.

So, an ideal monitoring setup for this environment should:

Segregate the monitors with unique identifiers. For example, monitors for Azure VMs and monitors for AWS VMs should be grouped separately, and application servers and database servers should be tagged for identification. There should be an option to group and tag the monitors based on a variety of identifiers.
Monitor each server according to its use case. Database and caching servers are to be monitored for connectivity and disk use, while application servers are to be monitored for CPU and memory utilization.
Hold back alerts for momentary spikes on some servers.
Prevent alert fatigue, which is dangerous. The appropriate team or person should get the alerts, not the entire sysadmin team.
Alert when the "mysqld" process is down on database servers, and when a critical Java process or any process in a particular path is down on app servers.

13,000+ organizations handle the above problems without breaking a sweat. How?

With ManageEngine Site24x7. Let's address the above scenario piece by piece.

Monitor groups and tags

Site24x7 allows you to group your servers. For example, if you have 20,000 servers, with 10,000 of them in Azure and the remainder in AWS, you can create two monitor groups named "Azure" and "AWS". The best part about this method is that with configuration rules, you don't need to create a monitor group by hand every single time. Set a rule to create monitor groups, and Site24x7 will handle it for you. Server monitors can be grouped based on many parameters, including host name, IP address, and OS type.

Alerts

Servers are utilized for various reasons, meaning the alerts associated with each server should also be tailor-made; however, tailor-made does not necessarily mean made manually every time. Create a threshold profile for each type of server you have just once, assign rules to dictate which profile has to be applied to a server, and you are sorted for life. Any new or existing monitor will comply with those rules; configuration rules can never be bent or broken.

Want to understand threshold profiles better? Think of a threshold profile as a template containing the triggers for alerts. Set the thresholds for the performance and health metrics just once. The profile can then be associated with hundreds or thousands of servers as per their capacity and usage.

Here is an example of how alert thresholds are usually set. With Site24x7, there are three severity levels for alerts: Down, Critical, and Trouble. Let's say you want to create a threshold profile for a compute VM that is prone to getting memory utilization spikes but is critical to a business process. You would set these thresholds:

"Trouble" alert at "90%" memory utilization.
"Critical" alert at "95%" memory utilization.
Poll frequency (i.e., data fetching frequency) set as "1 minute".
Poll value set as "2" polls, which gives decent leeway to filter out momentary spikes. This way the alerts are triggered only when the limits are breached for two consecutive data collection cycles (i.e., 2 minutes).
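To make the two-consecutive-polls idea concrete, here is a minimal Python sketch of the logic. It is not Site24x7 code or its API: the ThresholdProfile class, the evaluate function, the "OK" status label, and the choice to treat a reading equal to the limit as a breach are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThresholdProfile:
    """Hypothetical model of a threshold profile (not a Site24x7 object)."""
    trouble_pct: float = 90.0
    critical_pct: float = 95.0
    poll_count: int = 2  # consecutive breaching polls needed before alerting


def evaluate(samples, profile):
    """Return the status after each poll: 'OK', 'Trouble', or 'Critical'."""
    statuses = []
    for i in range(len(samples)):
        # Look at the most recent `poll_count` readings, including this one.
        window = samples[max(0, i - profile.poll_count + 1): i + 1]
        enough_polls = len(window) == profile.poll_count
        if enough_polls and all(s >= profile.critical_pct for s in window):
            statuses.append("Critical")
        elif enough_polls and all(s >= profile.trouble_pct for s in window):
            statuses.append("Trouble")
        else:
            statuses.append("OK")
    return statuses


# One memory utilization reading per minute (made-up numbers).
readings = [70, 96, 72, 91, 93, 96, 97]
print(evaluate(readings, ThresholdProfile()))
# ['OK', 'OK', 'OK', 'OK', 'Trouble', 'Trouble', 'Critical']
```

Note how the lone 96% reading in the second minute raises nothing: it is a momentary spike, and the next poll is back to normal.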

This is just one way to configure alerts for a server or VM. With Site24x7, you are armed with options to set alert limits for more than 80 health and performance metrics of your servers.

Alert fatigue

If a server hits a memory utilization spike, it should be enough to alert the DRI who is working the current shift, not the one who just finished their shift and went to bed. How would this be possible? With our notification profiles and on-call schedules.

Think of notification profiles and on-call schedules as templates where you enter who should be alerted at what time and also when to escalate. For example, user alert group "DBAdminAlpha" should get alerts from 8am to 8pm and "DBAdminBeta" should get alerts from 8pm to 8am. If the technicians haven't acknowledged the alert within a set timeframe, say 5 minutes, then the next alert goes to the manager, say "DBAngryBoss", who sits at the escalation level. Yes, this will make life much easier.
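The routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Site24x7 implements notification profiles: the SHIFTS table and both function names are hypothetical, and the sketch assumes that the on-call group keeps receiving the alert after it has been escalated.

```python
from datetime import time

# Hypothetical schedule built from the example above.
SHIFTS = [
    (time(8, 0), time(20, 0), "DBAdminAlpha"),
    (time(20, 0), time(8, 0), "DBAdminBeta"),  # this shift wraps past midnight
]
ESCALATION_CONTACT = "DBAngryBoss"
ESCALATION_AFTER_MINUTES = 5


def on_call_group(now):
    """Return the user alert group on shift at the given time of day."""
    for start, end, group in SHIFTS:
        if start <= end:
            if start <= now < end:
                return group
        elif now >= start or now < end:  # shift crosses midnight
            return group
    raise ValueError("no shift covers this time")


def recipients(now, minutes_unacknowledged):
    """Return who should be notified about an alert that is still open."""
    notified = [on_call_group(now)]
    # Assumption: escalation adds the manager instead of replacing the group.
    if minutes_unacknowledged >= ESCALATION_AFTER_MINUTES:
        notified.append(ESCALATION_CONTACT)
    return notified


print(recipients(time(9, 30), 0))  # ['DBAdminAlpha']
print(recipients(time(23, 0), 6))  # ['DBAdminBeta', 'DBAngryBoss']
```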

Using Site24x7's configuration rules

To try configuration rules, you require:

A valid Site24x7 administrator account with a few server monitors (the more the merrier).
And that's it.

Create it

Log in to your Site24x7 account, and in the left pane, click Admin. Under the Inventory section, click Configuration Rules. In the top-right corner of the page, click Add Rule. Time to start configuring!

Name it

Provide an apt name for the rule. "Dave1" and "Rule1" would technically work as names, but let's go with something that most of us would immediately understand by looking at it. Good examples are "DBServersCPURule" and "Prod Servers USWest1 ECom". Optionally, you can also provide a brief description to help others understand why this rule was created.

Prioritize it

When set to Yes, the toggle "Stop Executing Other Rules" prevents lower-priority rules (more on priorities later) from running on the set of monitors targeted by the rule we are creating now.

Configure it

Select Server monitor in the Criteria drop-down menu. You can also apply configuration rules for all other monitor types in the list; but for now, we will stick to server monitoring.

There is a "+" icon next to the Criteria field. Click the "+" icon to add more fields that let you set conditions on which monitors the rule will be applied to. Each click adds one more field with a drop-down menu. You can now narrow down the server monitors the rule runs on, using criteria such as only the server monitors tagged as "USWest1", only the server monitors that contain the string "Apache" in the monitor name, only the servers running a Linux OS, or only the servers that fall under a specific IP range.
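If it helps to think of criteria as code, each one is a small yes-or-no check on a monitor. The Python sketch below models that idea under clearly stated assumptions: the ServerMonitor class and the helper functions are hypothetical and are not part of any Site24x7 API, and the sketch assumes that a monitor must satisfy every criterion to be targeted.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class ServerMonitor:
    """Hypothetical stand-in for the attributes of a server monitor."""
    name: str
    ip: str
    os: str
    tags: set = field(default_factory=set)


def has_tag(tag):
    return lambda m: tag in m.tags


def name_contains(text):
    return lambda m: text in m.name


def runs_os(os_name):
    return lambda m: m.os.lower() == os_name.lower()


def in_ip_range(cidr):
    network = ipaddress.ip_network(cidr)
    return lambda m: ipaddress.ip_address(m.ip) in network


def matches(monitor, criteria):
    """Assumption: a monitor is targeted only if it meets every criterion."""
    return all(check(monitor) for check in criteria)


# Made-up monitors and IP range, for illustration only.
criteria = [has_tag("USWest1"), runs_os("Linux"), in_ip_range("10.20.0.0/16")]
web1 = ServerMonitor("Apache-web-01", "10.20.4.17", "Linux", {"USWest1", "Prod"})
db1 = ServerMonitor("mysql-db-01", "10.30.1.5", "Linux", {"USWest1"})

print(matches(web1, criteria))  # True
print(matches(db1, criteria))   # False: outside the IP range
```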

Now that we have set where the rules have to be run, the next step is to set the actions to be performed. You can select which threshold profile is to be associated with the monitors that satisfy the conditions we have just set. Or you can associate a notification profile, add the monitors to a monitor group, tag them with a specific tag, change the data collection frequency, add a resource check profile to monitor files, directories, or ports, enable a process monitor, exclude specific disk partitions from monitoring, or associate a log monitoring profile. There is a barrage of other options to benefit from as well, each of which we encourage you to utilize.

Automate it

Next is the optional IT Automation template, where you set auto-remediation actions for when any of your servers undergoes performance degradation. For example, once a server's memory utilization breaches 95%, you can have it restarted automatically via IT Automation.

Run it

You can either click Save to run this rule whenever a new server monitor is created, or click Save and Run Rule to run this rule immediately on existing server monitors as well. Once the rule is saved, you will be returned to the Configuration Rules page, where the first column is labeled "Priority." The Priority field is editable (i.e., click the priority number to assign the sequence). If the "Stop Executing Other Rules" toggle is set to Yes for a rule, rules of lower priority will not be applied to the monitors it targets.
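Here is a rough Python sketch of how priorities and the "Stop Executing Other Rules" toggle interact. It is a model of the behavior described in this article, not Site24x7's implementation: the Rule class and rules_applied function are hypothetical, and the sketch assumes that a smaller priority number means the rule runs earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical model of a configuration rule."""
    name: str
    priority: int                      # assumption: 1 runs before 2
    applies_to: Callable[[dict], bool]
    stop_executing_other_rules: bool = False


def rules_applied(monitor, rules):
    """Return the names of the rules that run on a monitor, in order."""
    applied = []
    for rule in sorted(rules, key=lambda r: r.priority):
        if not rule.applies_to(monitor):
            continue
        applied.append(rule.name)
        if rule.stop_executing_other_rules:
            break  # lower-priority rules are skipped for this monitor
    return applied


RULES = [
    Rule("DBServersCPURule", 2, lambda m: "Database" in m["tags"]),
    Rule("Prod Servers USWest1 ECom", 1, lambda m: "USWest1" in m["tags"],
         stop_executing_other_rules=True),
]

ecom_db = {"name": "ecom-db-01", "tags": {"USWest1", "Database"}}
other_db = {"name": "reports-db-01", "tags": {"Database"}}

print(rules_applied(ecom_db, RULES))   # ['Prod Servers USWest1 ECom']
print(rules_applied(other_db, RULES))  # ['DBServersCPURule']
```

The first monitor matches both rules, but only the higher-priority rule runs on it because that rule has the toggle set to Yes.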

What's next?

Configuration rules are loved by our vast customer base, owing to the plethora of setup configurations they provide. Setting up your server monitors correctly will save hours of work at later stages, and configuration rules are there to do just that. Set configuration rules once, then watch Site24x7 configure all server monitors as per your IT infrastructure requirements and business SLAs.

If you are an existing customer, we encourage you to utilize configuration rules to get the best out of our observability platform.

If you haven't onboarded yet, then we want to welcome you! We are loved by our customers because we monitor every facet of your digital resources—from servers to the tiniest containers, from websites to network devices, and cloud resources along with your applications. One browser tab to rule (monitor) them all!

Want our dedicated team to help you learn how Site24x7 could boost your observability prowess? Let our product experts help you with a personalized demo. See you soon!
