Why Security Analytics Should Start With the End in Mind

While many organizations collect data to ‘see what they get’, successful implementations take an analytics-first approach

When it comes to detection and response, the trend from the past decade is to gather as much data as possible. Log data. Network data. Endpoint data. Application data. Gather it all.

The trend is logical.

  • It is unclear where the next threat indicator will come from.
  • If we analyze user, endpoint, and application behaviors through log, network, and endpoint analysis, we are in the best position to identify and remediate the threat.

This logic also aligns with falling cloud storage and processing costs, which allow longer retention periods and a broader range of log data, along with more advanced analytics such as statistical analysis and machine learning.

Figure 1: Data First Implementation Model

The premise of collecting any and all data and then running analytics ‘to see what you get’ is faulty and does not produce effective results. Engineers designing the self-driving car didn’t load it up with every kind of sensor to see what sense it could then make of the outside world. Instead, they used focused sensors specific to the questions they needed to answer. How many of you have tried to write a query against log data only to discover that the data wasn’t there, or wasn’t in the right format, or that the query itself became so complex that performance became an issue? Take this query to discover PowerShell command execution as an example:

index=powershell LogName="Windows Powershell" (EventCode=500)
| eval MessageA=split(Message,"Details:")
| eval Short_Message=mvindex(MessageA,0)
| eval MessageB=mvindex(MessageA,1)
| eval MessageB=replace(MessageB,"[\n\r]","!")
| eval MessageC=split(MessageB,"!!!!")
| eval Message1=mvindex(MessageC,0)
| eval Message2=mvindex(MessageC,1)
| eval Message3=mvindex(MessageC,2)
| eval MessageD=split(Message3,"!!")
| eval Message4=mvindex(MessageD,3)
| eval Message4=split(Message4,"=")
| eval PS_Version=mvindex(Message4,1)
| eval Message5=mvindex(MessageD,4)
| eval Message6=mvindex(MessageD,5)
| eval Message7=mvindex(MessageD,6)
| eval Message7=split(Message7,"=")
| eval Command_Name=mvindex(Message7,1)
| eval Message8=mvindex(MessageD,7)
| eval Message8=split(Message8,"=")
| eval Command_Type=mvindex(Message8,1)
| eval Message9=mvindex(MessageD,8)
| eval Message9=split(Message9,"=")
| eval Script_Name=mvindex(Message9,1)
| eval Message10=mvindex(MessageD,9)
| eval Message10=split(Message10,"=")
| eval Command_Path=mvindex(Message10,1)
| eval Message11=mvindex(MessageD,10)
| eval Message11=split(Message11,"=")
| eval Command_Line=mvindex(Message11,1)
| table _time, EventCode, Short_Message, PS_Version, Command_Name, Command_Type, Script_Name, Command_Path, Command_Line

Whether it runs as an ad hoc search (unstructured data) or is used to classify a common event (structured data), this query exemplifies an ineffective process. It performs poorly, requires a high degree of ongoing tuning, and is prone to omitting information (false negatives), especially after version changes, OS changes, and similar updates, because every field is extracted by its position in the message.
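To make the fragility concrete, here is a minimal Python sketch of the alternative: extracting fields by key name rather than by position, and reporting any expected field that is absent. The sample message, its key names, and the helper names `parse_details` and `missing_fields` are illustrative assumptions, not a verbatim Windows event or any product's API.

```python
import re

# Hypothetical, simplified event text. Real events carry more keys,
# and their order may differ between versions.
SAMPLE_MESSAGE = (
    'Command "Get-Process" is Started.\n'
    "Details:\n"
    "\tHostVersion=5.1\n"
    "\tCommandName=Get-Process\n"
    "\tCommandType=Cmdlet\n"
    "\tScriptName=\n"
    "\tCommandLine=Get-Process -Name foo=bar\n"
)

# One "key=value" pair per line; the value may itself contain "=".
KEY_VALUE = re.compile(r"^\s*(\w+)=(.*)$")


def parse_details(message):
    """Return the key=value pairs that follow the 'Details:' marker."""
    _, _, details = message.partition("Details:")
    fields = {}
    for line in details.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def missing_fields(fields, required):
    """List the required keys that the parser did not find."""
    return [name for name in required if name not in fields]


if __name__ == "__main__":
    parsed = parse_details(SAMPLE_MESSAGE)
    print(parsed["CommandName"], "|", parsed["CommandLine"])
    print("missing:", missing_fields(parsed, ["CommandName", "CommandPath"]))
```

Because lookups are by name, a reordered or newly added line does not shift every downstream field, and a field that disappears shows up as an explicit gap instead of a silently wrong value.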

Whether we are talking about commercial or open source implementations, the ‘see what you get’ approach does not align with security use cases.

  • The ‘T’ in ETL is “transform”. Data needs to be transformed to answer questions. If you want to create a histogram of successful user authentications, the data needs to be transformed to recognize that ‘logon’, ‘login’, and ‘authentication success’ are all the same. Downstream analytics (e.g. machine-learning data features, correlations) depend on the outcome of the transform step, so parsing errors, missing data, or other defects in that step put them at risk. For example, one company was in the process of terminating and prosecuting an employee for introducing malware into its supply chain system, only to realize that its SIEM was not accounting for DHCP IP address changes.
  • Data collection is not ‘free’. Even with excellent free and open source tools such as Elastic Beats, Fluentd, or the collector provided by your SIEM, collecting, processing, and storing data comes with a large amount of administrative overhead, regardless of whether your SIEM is SaaS or on-prem. Data-first implementations lose themselves in the management of data collection and tuning before they even solve a security use case. Updates to data sources often change log formats or API calls, creating a constant cat-and-mouse game of spotting log sources that no longer deliver the full set of content or whose content is no longer parsed correctly. While it’s relatively easy to recognize a silent log source (zero logs from a data source), it’s incredibly difficult to recognize that a specific log event from a data source has changed format and rendered downstream parsing ineffective. This is critical if machine learning or correlation rules rely on classifications or events from parsed log data. While you can transform on read instead of on write, doing so is both disk I/O and processing intensive, which may cause long delays in producing results from advanced analytics.
  • Not all data is equivalent. While collecting all the data makes logical sense since threat evidence may lie anywhere, it also creates excess noise that impacts analytics and search performance. The excess volume of data may also render dashboards useless because of the limited span of time they can represent (e.g. dashboards that can only present the last 4 hours of data instead of the last 7 days). Heterogeneity of data across environments also makes sharing analytics impractical without heavy customization and ongoing tuning. This is one of the reasons the off-the-shelf correlation rules provided by your SIEM vendor are rarely useful as shipped.
  • Signal-to-noise. Collecting more data across more data sources adversely impacts the signal-to-noise ratio. The precision-recall curve describes the balancing act: allow more events and alerts for greater visibility at the cost of more false positives, or tune tightly for a small number of events and alerts at the risk of missing activity (false negatives). Allowing more data to ‘see what you get’ increases the administrative burden of tuning alert criteria and increases false positives, which reduces SOC effectiveness.
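The transform and format-drift points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the label mapping, the `unmapped` class, and the function names are hypothetical, and a real pipeline would normalize far more than one field.

```python
from collections import Counter

# Hypothetical mapping from source-specific labels to one canonical class.
CANONICAL_ACTIONS = {
    "logon": "auth_success",
    "login": "auth_success",
    "authentication success": "auth_success",
}


def normalize_action(raw):
    """Map a raw action label to its canonical class, or 'unmapped'."""
    return CANONICAL_ACTIONS.get(raw.strip().lower(), "unmapped")


def summarize(events):
    """Count successful authentications per user from (user, raw_action)
    pairs and report the share of events the transform could not map."""
    per_user = Counter()
    unmapped = 0
    total = 0
    for user, raw_action in events:
        total += 1
        action = normalize_action(raw_action)
        if action == "auth_success":
            per_user[user] += 1
        else:
            unmapped += 1
    unmapped_ratio = unmapped / total if total else 0.0
    return per_user, unmapped_ratio


if __name__ == "__main__":
    sample = [
        ("alice", "Logon"),
        ("alice", "login"),
        ("bob", "Authentication Success"),
        ("bob", "LOGIN_OK"),  # a label the mapping does not know
    ]
    histogram, ratio = summarize(sample)
    print(dict(histogram), ratio)
```

Tracking the unmapped share alongside the histogram is one simple way to notice that a source has changed its format: the source is not silent, but the fraction of events the transform fails to classify rises.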

Instead of starting with a collect-the-data-first mentality, security-focused implementations should start with the end in mind, just as in the self-driving car example above.

Figure 2: Analytics First Implementation Model

Note, examples are simplified for this exercise:

Table 1 — Analytics First Steps Explained

Organizations taking this use-case approach will see a greater return on investment, with demonstrable evidence of program success to share with key stakeholders.

20+ years in cybersecurity bringing products to market at TippingPoint, HP, and LogRhythm. Currently VP of Marketing @SpyderbatInc.