Why Security Analytics Should Start With the End in Mind

Photo by Suzanne D. Williams on Unsplash
  • It is unclear where the next threat indicator will come from.
  • If we analyze user, endpoint, and application behaviors through log, network, and endpoint analysis, we are in the best position to identify and remediate the threat.
Figure 1: Data First Implementation Model
index=powershell LogName=”Windows Powershell” (EventCode=500) | eval MessageA=split(Message,”Details:”) | Eval Short_Message=mvindex(MessageA,0) | Eval MessageB=mvindex(MessageA,1) | eval MessageB = replace (MessageB,”[\n\r]”,”!”) | eval MessageC=split(MessageB,”!!!!”) | Eval Message1=mvindex(MessageC,0) | Eval Message2=mvindex(MessageC,1) | Eval Message3=mvindex(MessageC,2) | eval MessageD=split(Message3,”!!”) | Eval Message4=mvindex(MessageD,3) | eval Message4=split(Message4,”=”) | eval PS_Version=mvindex(Message4,1) | Eval Message5=mvindex(MessageD,4) | Eval Message6=mvindex(MessageD,5) | Eval Message7=mvindex(MessageD,6) | eval Message7=split(Message7,”=”) | eval Command_Name=mvindex(Message7,1) | Eval Message8=mvindex(MessageD,7) | eval Message8=split(Message8,”=”) | eval Command_Type=mvindex(Message8,1) | Eval Message9=mvindex(MessageD,8) | eval Message9=split(Message9,”=”) | eval Script_Name=mvindex(Message9,1)| Eval Message10=mvindex(MessageD,9) | eval Message10=split(Message10,”=”) | eval Command_Path=mvindex(Message10,1) | Eval Message11=mvindex(MessageD,10) | eval Message11=split(Message11,”=”) | eval Command_Line=mvindex(Message11,1) | table _time EventCode, Short_Message, PS_Version, Command_Name, Command_Type, Script_Name, Command_Path, Command_Line
  • The ‘T’ in ETL is “transform”. Data needs to be transformed to answer questions. If you want to create a histogram of successful user authentications, the data needs to be transformed to recognize that ‘logon’,’login’, and ‘authentication success’ are all the same. There is a risk when down-stream analytics (e.g. machine-learning data features, correlations) are dependent on the outcome from the transform step due to parsing errors, missing data, or other defects. For example, a company that was in the process of terminating and prosecuting an employee for introducing malware into its supply chain system only to realize their SIEM was not accounting for DHCP IP address changes.
  • Data collection is not ‘free’. Even with excellent free and open source tools such as Elastic Beat, FluentD, or the collector provided by your SIEM, collecting, processing, and storing data comes with a large amount of administrative overhead regardless if your SIEM is SaaS or on-prem. Data-first implementations lose themselves in the management of data collection and tuning before they even solve a security use case. Updates to data sources often change log formats or API calls, creating a constant cat & mouse game of recognizing log sources no longer collecting the full set of content or parsing log content correctly. While it’s relatively easy to recognize a silent log source (zero logs from a data source), it’s incredibly difficult to recognize a specific log event from a data source has altered its format and rendered downstream parsing ineffective. This is critical if machine learning or correlation rules rely on classifications or events from parsed log data. While you can transform on read instead of write, it’s both disk I/O and processing intensive that may cause long delays producing results from advanced analytics.
  • All data is not equivalent. While collecting all the data makes logical sense since threat evidence may lie anywhere, it also creates excess noise that impacts analytics and search performance. The excess volume of data may also render dashboards useless because of the limited span of time represented (e.g. dashboards that can only present that last 4 hours of data instead of the last 7 days). Heterogeneity of data across environments also makes sharing analytics impractical without involving high amounts of customization and ongoing tuning. This is one of the reasons the off-the-shelf correlation rules provided by your SIEM vendor are rarely useful without customization and ongoing tuning.
  • Signal-to-Noise. With more data collected across more data sources, it adversely impacts the signal-to-noise ratio. The Precision Recall Curve describes the balancing act between allowing more visibility with events/alerts but allowing more false positives, or fine tuning to have a small number of events and alerts but risk missing an activity (false negative). Allowing more data to ‘see what you get’ creates more administrative burden on tuning alert criteria and increases false positives impacting SOC effectiveness.
Figure 2: Analytics First Implementation Model
Table 1 — Analytics First Steps Explained



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Seth Goldhammer

Seth Goldhammer

20+ years in cybersecurity bringing products to market at TippingPoint, HP, and LogRhythm. Currently VP of Marketing @SpyderbatInc.