Business Service Management on a Budget: How to Do It for Under $50K with Cribl

Last edited: March 4, 2025

Modern enterprises run on IT, but how do you ensure your technology truly serves your business goals? That's where Business Service Management (BSM) comes in. BSM, a practice closely related to IT Service Management (ITSM), focuses on keeping customer-facing IT services available and performant. BSM empowers organizations to map the complex relationships between IT infrastructure and revenue streams, quantify the business impact of technical disruptions, and drive IT decisions based on accurate business metrics.

The following questions can be addressed with BSM:

  • What revenue-generating applications are impacted by an infrastructure outage?

  • How does a slow database backup impact user response time when making online purchases?

  • How does a denial of service attack impact a user's access to the website?

Effective BSM starts with collecting the right infrastructure and application metrics and mapping them to underlying business services. This helps teams identify the root cause faster, measure the business impact of outages or performance degradations, and optimize user experience, which ultimately leads to more revenue.

Some important KPIs to consider include:

  • Service availability: What is the service's uptime in the customer's eyes (e.g., can I access the login page)?

  • Response time: Is the application performant?

  • Transactions completed

  • Errors

  • Infrastructure performance (e.g., servers, Kubernetes pods, network, etc.)

With Cribl's suite of products, it's possible to implement a robust BSM solution for under $50K annually. In this blog, we'll walk through the following components and their estimated costs for a service consisting of 250 servers.

Collecting System and Kubernetes Metrics with Cribl Edge, and Application Traces with OpenTelemetry

For this scenario, we have a typical 3-tier application deployed across 250 AWS EC2 instances with auto-scaling. Deploying Cribl Edge on all these hosts provides a vendor-agnostic means of capturing local system CDM (CPU, disk, memory) metrics, including process-level metrics. Application metrics such as transaction volumes, response times, and errors can be extracted from access logs by Cribl Edge. Instrumenting parts of the application with an OpenTelemetry agent can provide deeper-dive metrics useful for troubleshooting exactly where in the flow an application is slowing down.
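To make the access-log idea concrete, here's a minimal Python sketch of the kind of extraction involved. (Cribl Edge does this with Pipeline functions rather than code; the log format and field names below are illustrative assumptions.)

import re

# Combined-format access log with a response-time field appended
# (an assumption; your web tier's log format may differ).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) (?P<resp_ms>\d+)'
)

def extract_transaction_metrics(line: str) -> dict | None:
    """Turn one access-log line into the fields a BSM rollup needs."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    status = int(m.group("status"))
    return {
        "path": m.group("path"),
        "status": status,
        "failed": status >= 400,             # error transactions
        "bytes": int(m.group("bytes")),      # payload size
        "resp_ms": int(m.group("resp_ms")),  # response time
    }

sample = '10.0.0.7 - - [31/Oct/2024:20:15:02 +0000] "GET /checkout HTTP/1.1" 200 5120 87'
print(extract_transaction_metrics(sample))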

Configuring standard metrics in Cribl Edge is easy and managed at the Fleet level. Simply navigate to the Fleet, click the 'Collect' tab, and select 'System Metrics' for Linux or 'Windows Metrics' for Windows, per the following screenshots.

[Screenshots: enabling the System Metrics (Linux) and Windows Metrics sources at the Fleet level in Cribl Edge]

OpenTelemetry (OTel) traces can capture transaction errors and latency across tiers and be sent directly to Cribl Stream. Configuring Cribl Stream is similarly easy for capturing OpenTelemetry logs, metrics, and traces. In this scenario, the application needs to produce traces and application metrics only. To configure, navigate to Sources in Cribl Stream, select OpenTelemetry, and select the listening port and protocol. (There are various other options available. Be sure to consult the Cribl documentation for more details.)

[Screenshot: configuring the OpenTelemetry Source in Cribl Stream]
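On the application side, instrumentation can be as small as the following Python sketch, which exports OTLP traces to the Stream listener configured above. The endpoint, service name, and span attributes are placeholder assumptions for illustration:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the exporter at the Cribl Stream OTel Source (hypothetical
# host/port; use the listener address and protocol you configured above).
exporter = OTLPSpanExporter(endpoint="stream-leader.example.com:4317", insecure=True)

provider = TracerProvider(resource=Resource.create({"service.name": "ecom-web"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ecommerce")

# Each traced transaction arrives in Cribl Stream with latency and status.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("business_service", "Ecommerce")
    # ... call the mid-tier and database here ...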

Both the data from the Cribl Edge agents and OpenTelemetry agents flow through Cribl Stream. By using Cribl Cloud, data originating from the Cribl Edge agents consumes about 20% fewer credits than data collected by other agents and processed in Cribl Stream. On average, a Cribl Edge agent reporting metrics every 10 seconds generates 30 MB/day/host of standard metrics, plus an additional 4 MB/day/host if we also track 3 critical process sets. Extrapolate that to 250 servers, and that's 8.5 GB/day. For a distributed application averaging 500 transactions/second at roughly 300 bytes per log entry, that's 13 GB/day of web-tier access logs. Assuming the mid-tier and database tiers each produce similar log volumes, that brings our totals to:

  • Standard metrics every 10 seconds: 30 MB/day/host

  • 3 critical process sets included with standard metrics: 4 MB/day/host

  • Application transaction logs for 500 transactions/second: 13 GB/day/tier

Annually, Cribl Edge would require 47 GB/day × 0.21 credit/GB × 365 days = 3,603 annual Cribl Cloud credits.

Because this data is captured with Cribl Edge and shipped to Cribl Stream via Cribl_HTTP, it is counted only once; it is not billed again as it passes through Stream.

OTel traces can be very voluminous. We'll estimate a conservatively high 40 GB/day for 500 transactions/second for OpenTelemetry traces.

By using a hybrid Cribl Cloud worker group, OpenTelemetry data through Cribl Stream would annually require 40 GB/day × 0.26 credit/GB × 365 days = 3,796 annual Cribl Cloud credits.

This brings the total for collecting metrics, logs, and traces to:

3,603 + 3,796 = 7,399 annual Cribl Cloud credits. At a cost of $1/credit, that is a steal at under $7,500 annually!
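If you want to sanity-check these numbers against your own environment, the arithmetic fits in a few lines of Python. The volumes and credit rates are this post's estimates, not published pricing:

# Reproduce the collection-cost estimate above; swap in your own
# volumes and contracted rates.
HOSTS = 250
MB_PER_HOST_DAY = 30 + 4                 # standard metrics + 3 process sets
TX_PER_SEC, BYTES_PER_TX, TIERS = 500, 300, 3

metrics_gb = HOSTS * MB_PER_HOST_DAY / 1000                   # 8.5 GB/day
logs_gb = TIERS * TX_PER_SEC * BYTES_PER_TX * 86_400 / 1e9    # ~38.9 GB/day
edge_gb = round(metrics_gb + logs_gb)                         # ~47 GB/day
otel_gb = 40                                                  # conservatively high traces

edge_credits = edge_gb * 0.21 * 365      # Edge-collected rate: 0.21 credits/GB
otel_credits = otel_gb * 0.26 * 365      # hybrid Stream rate: 0.26 credits/GB
print(f"{edge_credits:,.0f} + {otel_credits:,.0f} = "
      f"{edge_credits + otel_credits:,.0f} credits/yr")
# -> 3,603 + 3,796 = 7,399 credits/yr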

Service Mapping Metadata

Cribl Edge has additional benefits that help with infrastructure-to-service mapping. All events and metrics produced by Cribl Edge include metadata about the environment. This includes:

  • All Kubernetes metadata is captured in every event (if Cribl Edge is deployed as a K8s DaemonSet); annotations can also be specified for capture.

  • AWS tags (if Edge is running on EC2 instances or EKS)

  • Operating system metadata

The screenshot below illustrates the AWS tags captured in every event by Cribl Edge.

[Screenshot: AWS tags captured in an event by Cribl Edge]

In this case, let's assume tagging best practices are followed and all EC2 instances carry Business Service and Server Farm tags. To pass these fields on from Cribl Edge, add them to all collected logs and metrics, per the following screenshot.

[Screenshot: adding Business Service and Server Farm fields to collected logs and metrics in Cribl Edge]

Host_Prefix,Server_Farm,BusinessService
ecomweb,Ecommerce Web Farm,Ecommerce
ecomtomcat,Ecommerce Middleware,Ecommerce
ecomproxy,Ecommerce Proxy,Ecommerce
ecomdb,Ecommerce Database,Ecommerce

This pre-populates outgoing events with the context needed to build service maps and aggregations later.
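Conceptually, the enrichment is a prefix match of hostname against that lookup. Here's a minimal Python sketch of the idea; in practice, a Lookup function in a Cribl Pipeline applies the CSV for you:

# Hypothetical stand-in for the host-prefix lookup shown above.
SERVICE_LOOKUP = {
    "ecomweb":    ("Ecommerce Web Farm",   "Ecommerce"),
    "ecomtomcat": ("Ecommerce Middleware", "Ecommerce"),
    "ecomproxy":  ("Ecommerce Proxy",      "Ecommerce"),
    "ecomdb":     ("Ecommerce Database",   "Ecommerce"),
}

def enrich(event: dict) -> dict:
    """Stamp farm and business_service onto an event by host prefix."""
    host = event.get("host", "")
    for prefix, (farm, service) in SERVICE_LOOKUP.items():
        if host.startswith(prefix):
            event["farm"] = farm
            event["business_service"] = service
            break
    return event

print(enrich({"host": "ecomweb-042", "node_cpu_percent_active_all": 71.2}))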

Cribl Stream

As mentioned previously, Cribl Stream will be the entry point for OpenTelemetry data and act as the OTel collector for this scenario. It will also process all data collected from Cribl Edge. Data passed on from Cribl Edge will not be counted twice. Remember, previous estimates came in at under 7,500 annual credits for all our data collection so far.

Cribl Lake

To perform business service management analytics, the data needs to persist for a short time so aggregations and service-level correlation can run. Cribl Lake can be that place, and it is easy to set up as a Cribl Stream destination. While the data is only needed in Cribl Lake long enough to perform business service aggregation, it can be persisted longer to facilitate other analytics, including capacity planning and investigations. With Cribl Lake consuming only 0.05 credits/GB of storage per month, it can be very economical to store data in Cribl Lake long term. In this scenario, we'll keep the raw data for 6 months and service-aggregated data for 2 years.

When Cribl Stream writes to Cribl Lake, directory structures will include year, month, day, and hour. This allows for summarization as frequently as every 5 minutes. Along with subdirectories for business service, server farm, and data type, this allows for faster summarizations and lower search-credit consumption. An example of a complete directory structure would be:
s3://CriblCloudInstanceName/2024/10/31/20/Ecommerce/Web_Farm/metrics.processset/file.gz
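As a quick illustration, that partition key can be derived entirely from an enriched event's timestamp and service fields. The function below is a hypothetical sketch, not Cribl's implementation:

from datetime import datetime, timezone

def lake_path(event: dict, bucket: str = "CriblCloudInstanceName") -> str:
    """Build the year/month/day/hour + service partition key for an event."""
    ts = datetime.fromtimestamp(event["_time"], tz=timezone.utc)
    return (
        f"s3://{bucket}/{ts:%Y/%m/%d/%H}/"
        f"{event['business_service']}/{event['farm']}/{event['sourcetype']}/"
    )

evt = {"_time": 1730404800, "business_service": "Ecommerce",
       "farm": "Web_Farm", "sourcetype": "metrics.processset"}
print(lake_path(evt))
# -> s3://CriblCloudInstanceName/2024/10/31/20/Ecommerce/Web_Farm/metrics.processset/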

Estimated credits for Cribl Lake to keep raw data for 6 months would be:

87 GB/day (47 GB/day via Cribl Edge plus 40 GB/day of OTel traces) × 0.1 gzip compression ratio × 180 days retained × 0.05 credits/GB/month × 12 months ≈ 940 credits/year at steady state. The first 6 months will cost a fraction of that, as the stored volume ramps up.

Conservative estimates for data summarized every 5 minutes and kept for 2 years:

100 KB of results/search × 5 concurrent searches × 12 runs/hour × 24 hours/day × 0.1 compression × 1 GB/1,000,000 KB = 0.0144 GB/day

For 2 years of such data, that's about 10.5 GB of storage.

The credits for storing 10.5 GB of data in Cribl Lake for 1 year are: 10.5 GB × 0.05 credits/GB/month × 12 months = a measly 6.3 credits/year!

This brings the maximum annual Cribl Lake cost to 940 + 6.3 = 946.3 annual Cribl Cloud credits.
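Here is the same steady-state storage math in Python, using the assumed compression ratio and storage rate from above:

# Once 180 days of raw data are retained, stored volume stays roughly
# constant and is billed monthly.
RAW_GB_DAY = 87            # 47 GB/day via Edge + 40 GB/day of OTel traces
COMPRESSION = 0.1          # assumed gzip ratio
RATE = 0.05                # credits per GB per month

raw_stored_gb = RAW_GB_DAY * COMPRESSION * 180       # 1,566 GB on disk
raw_credits = raw_stored_gb * RATE * 12              # ~940 credits/yr

summary_gb_day = 100 * 5 * 12 * 24 * COMPRESSION / 1e6   # 0.0144 GB/day
summary_credits = summary_gb_day * 730 * RATE * 12       # 2-year retention, ~6.3/yr

print(f"raw: {raw_credits:,.0f} credits/yr, summaries: {summary_credits:.1f} credits/yr")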

So far, we're at a total of 8,345.3 credits/year for collecting and storing the data.

Cribl Search performs the necessary metric aggregation and mapping to business services, typically every 5 minutes. This is more frequent than many other BSM solutions, which perform the mapping every 15 minutes or longer.

In our scenario, we're leveraging Cribl Edge to expose metadata about the monitored hosts and Cribl Pipelines to enrich the events with business service context. Alternatively, the service hierarchy can be pulled from a CMDB such as ServiceNow into a Cribl Search lookup; here, though, the Cribl Edge metadata already populates the service hierarchy in each event.

Here are sample searches to summarize the key metrics.

Consolidated search of CPU & memory for Linux & Windows

dataset="Endpoint_Agents" _source in (metrics.cpu,metrics.memory) (node_cpu_percent_active_all="" or node_memory_Used_percent= or node_vmstat_pgmajfault=* or windows_cpu_percent_all_total=* or node_memory_MemAvailable_bytes=* or windows_memory_available_bytes=*) //merge Windows and Linux cpu// | extend cpu=iff(node_cpu_percent_active_all,node_cpu_percent_active_all,cpu), cpu=iff(windows_cpu_percent_all_total,windows_cpu_percent_all_total,cpu) | extend memAvail=iff(node_memory_MemAvailable_bytes,node_memory_MemAvailable_bytes,memAvail), memAvail=iff(windows_memory_available_bytes,windows_memory_available_bytes,memAvail) | summarize cpu_avg=round(avg(cpu),2), cpu_stdev=stdev(cpu), memAvail_GB_avg=round(avg(memAvail)/1024/1024/1024,3), memAvail_stdev=round(stdev(node_memory_MemAvailable_bytes)/1024/1024/1024,3), //Linux only// mem_percent=round(avg(node_memory_Used_percent),2), mem_stdev=stdev(node_memory_Used_percent), pgmajfaultstoal=sum(node_vmstat_pgmajfault) by business_service, farm, host

Process-set-level summary metrics (note: memory is represented differently on Linux vs. Windows)

dataset="Endpoint_Agents" _source=metrics.process
| extend process_cpu_usage=iff(process_cpu_time_total,process_cpu_time_total,process_cpu_usage)
| extend process_faults=iff(process_page_faults_total,process_page_faults_total,process_faults), process_faults=iff(process_major_page_faults,process_major_page_faults,process_faults)
| extend process=iff(_os startswith 'windows',process_exe_path+process_cmdline,process_cmdline)
| summarize count(), avg_cpu_usage=round(avg(process_cpu_usage),2), stdev(process_cpu_usage), avg_mem_usage=round(avg(process_memory_usage),2), stdev_memory_usage=stdev(process_memory_usage), total_major_page_faults=round(sum(process_major_page_faults),2) by host, process_set, process
| summarize process_set_cpu_sum=round(sum(avg_cpu_usage),2), process_set_cpu_stdev=round(sum(stdev_cpu_usage),2), process_set_mem_sum=round(sum(avg_mem_usage),2), process_set_mem_stdev=round(sum(stdev_memory_usage),2), process_set_maj_page_faults=sum(process_major_page_faults) by business_service, farm, host, process_set

Transaction volumes by URL, successful and failed, from access logs

dataset="cribl_search_sample" dataSource="access_combined" request_method=GET (status=200 or status>=400)
| extend business_service="Ecommerce", farm="Web farm"
| lookup http_codes on status
| summarize Total_Trans=count(), Successful_Trans=countif(status == 200), Failed_Trans=countif(status >=400),
Percent_Failed_Trans=round(Failed_Trans/Total_Trans* 100,1), Megabytes=round(sum(bytes)/1024/1024,2)
by request_uri_path, request_method,host, farm, business_service

The summarized results would persist in a dedicated Cribl Lake dataset by appending the following to each search: | extend searchname=<name_of_scheduled_search> | export to lake BSM_Summary

In this case, searchname should be a Lake accelerated field. Dashboards would be based on results in that Cribl Lake dataset. Within the dashboards, you can define base searches such as: Dataset=BSM_Summary searchname=Transactions and different charts can summarize different metrics.

Estimated credits for Cribl Search: 3 searches every 5 minutes to pull CPU, memory, transaction counts, transaction duration, and error counts. Based on preliminary testing of 250 servers every 5 minutes, searches would consume roughly 0.12 credits each. Your mileage may vary.

3 scheduled searches * 12 searches/hour * 24 hours/day * 365 days/year * 0.12 credits/search = 37,844 annual Cribl Cloud credits for Cribl Search.
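And the same search-cost estimate in code, with the per-search cost as the variable worth measuring in your own environment:

# Scheduled-search cost: 3 searches, each running every 5 minutes.
SEARCHES = 3
RUNS_PER_HOUR = 12
CREDITS_PER_SEARCH = 0.12   # from preliminary testing; measure your own

annual = SEARCHES * RUNS_PER_HOUR * 24 * 365 * CREDITS_PER_SEARCH
print(f"{annual:,.0f} credits/yr")   # ~37,843, matching the figure above modulo rounding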

So far, our calculated annual Cribl Cloud credits include:

  • Cribl Edge: 3,603 credits/yr

  • Cribl Stream: 3,796 credits/yr

  • Cribl Lake: 946.3 max credits/yr

  • Cribl Search summarizations: 37,844 credits/yr

This brings our total to 46,189 credits, well under $50,000 annually!

Wrap up

In summary, Cribl provides a cost-effective means of capturing telemetry data, forwarding it to popular analytic backends, and performing analytics such as business service management with Cribl Search. Sign up today for a free Cribl.Cloud account, and explore the flexibility, control, and choice the Cribl portfolio provides.

