Automating your security data pipeline using a strict data model

This post was written by SenseOn’s CTO, James Mistry.
Know all about the challenges of wrestling with big datasets whose definition is unclear? Go straight to the gory details!

The “More is Better” approach

Security platforms collect a lot of data. A SIEM, for example, might ingest endpoint events, firewall logs, a variety of application logs and threat detection results from a range of other products. The SIEM may accomplish this using schemaless database technology, like Elasticsearch or Splunk. This gives users the flexibility to easily store whatever data they want by pointing the data source at their SIEM (for example, using syslog) with little additional configuration. Getting the data in is easy, but there are downsides to this approach.

Users are going to start paying as soon as their data is stored, although they won’t get value from the data until they’ve started using it (to detect threats or investigate incidents, for example). This is convenient for the vendor if the customer is on a consumption-based pricing plan.

Using the data – whether in automated or manual analysis – requires an understanding of what the data means. This understanding is difficult to achieve when dealing with data coming from multiple third-party products, especially when the data is inherently unstructured (like log messages) and subject to change in future software updates.

To illustrate this, consider the following log messages generated from OpenVPN:
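
For illustration, a failed handshake typically produces a sequence like the following in the OpenVPN server log (the address and port here are invented, but the message texts are genuine OpenVPN output):

```
203.0.113.10:51994 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
203.0.113.10:51994 TLS Error: TLS handshake failed
203.0.113.10:51994 SIGUSR1[soft,tls-error] received, client-instance restarting
```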

At first glance, it seems obvious what the log is telling us: a TLS handshake has failed. The first two fields look like the related IP address and port (separated by a “:”), with the message following. But all three messages appear to relate to the same failure (the IP and port are identical in each), and this raises some questions: are these three separate problems or a single one? Which message, if any, reliably indicates the failure? And would the same failure always produce these exact messages?

OpenVPN is open-source, so the definitive way to answer these questions is to look at the source code.

It’s clear that there are multiple paths through the code that end up emitting the “TLS handshake failed” message, but also many failure conditions which result in different messages. A thorough understanding of each would be required to know how relevant they are to revealing actionable security issues. On the face of it, some of these are not relevant (“TLS Error: TLS key negotiation failed to occur within %d seconds (check your network connectivity)”), while some might be (“Peer tried unsupported key-method 1”). While handshake errors seem to be reliably marked with the text “TLS handshake failed”, fatal errors are not exclusively represented by messages starting “Fatal TLS error”. Some messages defined by OpenVPN as fatal are prefixed “FATAL:” and others don’t mention this at all in the message.

Whether or not fatal errors as a broad category are of interest is a separate question and depends on the user’s and application’s definitions of “fatal” (is the error fatal to the application or to the TLS connection, for example?).

An initial read of the source code directly responsible for producing those messages raises at least as many questions as it answers. Now consider that we’ve just been looking at one category of log messages, from one module in a single application. Also consider that in the period September 2022 to November 2022, ssl.c (one of those source files) had 15 commits (changes) against it.

Most organisations simply don’t have the security engineering resources to build reliable automations on top of this kind of data at scale, let alone keep up with the data changing frequently. Despite this, a common approach taken by security teams is to start by ingesting a wide range of unstructured data and then work backwards from the data to infer what it means and how it can be used. This takes time, it almost inevitably results in paying for more storage than is needed, and it’s rarely possible to do it accurately. Instead of vendor documentation and guarantees about how the data is structured, users often find themselves having to rely on guesswork and experimentation to figure it out. This also becomes an ongoing effort as the underlying products producing the data change over time. Security teams end up paying dearly in time spent triaging high volumes of alerts and maintaining automations as applications and environments change around them.

DWYNCOT: Describe what you need, collect only that

Choosing what goes in

At SenseOn, we take a threat-led approach in choosing what to put into our data model. This means that our decisions are based on the value of the data – how will it improve the breadth and depth of our ability to detect and respond to threats? We balance this against the cost in terms of building the collection capability (e.g. a change to our endpoint software), and the cost of processing and storing the resulting data.

We avoid including sources of data which are unstructured (such as application logs), preferring to observe data from sources that provide strong guarantees about structure (such as documented operating system APIs). When possible, we choose real-time data over state snapshots to help make sure our product reacts to the most up-to-date information available.

Strict data model

In the most basic terms, a data model is something that describes the structure and meaning of data. At SenseOn, we refer to ours as a “strict” data model. By this we mean that our data model not only serves as documentation about the telemetry (raw security data) we collect and use, but is also built into the product to actively ensure that collected telemetry is compliant with how we’ve described it.


We achieve this by defining the data model in a machine-readable form which can be used within our platform when telemetry is produced, ingested and used. We do this using Protobuf schemas as the single source of truth for our telemetry.

Protobuf provides a way of describing data schemas and evolving them in a backwards-compatible manner over time. It also makes it easy to serialise and deserialise records from many different programming languages and platforms. Compared with text-based formats for representing structured data, such as JSON and XML, Protobuf’s binary serialisation format is more compact (reducing the storage required), faster to serialise and deserialise (reducing the compute required) and guaranteed to conform to its associated schema.

Here’s a snippet from one of our Protobuf schemas describing process start events:
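
(The snippet below is an illustrative reconstruction rather than SenseOn’s actual schema; the package, message and field names, and the field numbering, are all assumptions.)

```protobuf
syntax = "proto3";

package telemetry;

// Telemetry emitted when a process starts on a monitored endpoint.
message ProcessStart {
  // Time the process started, in nanoseconds since the Unix epoch.
  uint64 timestamp_ns = 1;

  // Operating system identifier of the new process.
  uint32 pid = 2;

  // Identifier of the parent process, if known.
  uint32 parent_pid = 3;

  // Absolute path to the executable image.
  string executable_path = 4;

  // Full command line the process was started with.
  string command_line = 5;
}
```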

Data model automation

For us, another key feature of Protobuf – or more specifically, the Protobuf compiler – is its plugin interface. This allows us to create programs which receive the schema parsing results from the Protobuf compiler and interpret the schemas programmatically. We call these programs “schema generators”, and we’ve built a framework for creating them internally. We use schema generators to automatically translate our schemas (the machine-readable descriptions of our telemetry) into other forms we can use directly in our platform.
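
As a sketch of how a generator might hook into the Protobuf compiler’s plugin interface (a hypothetical minimal example, not SenseOn’s framework; it assumes the `protobuf` Python package is installed): a plugin receives a serialised `CodeGeneratorRequest` on stdin and returns a `CodeGeneratorResponse` on stdout. This toy plugin emits a Markdown documentation stub for each message in the parsed schemas.

```python
import sys

from google.protobuf.compiler import plugin_pb2


def generate(request: plugin_pb2.CodeGeneratorRequest) -> plugin_pb2.CodeGeneratorResponse:
    """Emit a Markdown documentation stub for each message in the request."""
    response = plugin_pb2.CodeGeneratorResponse()
    for proto_file in request.proto_file:
        # protoc also sends dependencies; only document the files we were
        # explicitly asked to generate output for.
        if proto_file.name not in request.file_to_generate:
            continue
        out = response.file.add()
        out.name = proto_file.name.replace(".proto", "_doc.md")
        lines = [f"# Schemas in {proto_file.name}"]
        for message in proto_file.message_type:
            lines.append(f"## {message.name}")
            for field in message.field:
                lines.append(f"- `{field.name}`")
        out.content = "\n".join(lines) + "\n"
    return response


def main() -> None:
    # protoc hands the plugin a serialised CodeGeneratorRequest on stdin
    # and expects a serialised CodeGeneratorResponse on stdout.
    request = plugin_pb2.CodeGeneratorRequest.FromString(sys.stdin.buffer.read())
    sys.stdout.buffer.write(generate(request).SerializeToString())
```

Installed on the `PATH` under a name like `protoc-gen-doc`, a plugin of this shape would be invoked by `protoc` via a `--doc_out` flag, receiving the parsed schemas rather than raw `.proto` text.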

For example, from our Protobuf schemas we generate ORM (object relational mapping) models expressed in Python. These define the structure of the corresponding data in schemas specific to the database technologies we use, translating between Protobuf types and database engine types and including database-specific schema components such as index definitions. These generated ORM models are used both to automatically perform database migrations after schemas change (for example, by adding a new column), as well as to provide a programmatic interface to stored telemetry.

Below is a snippet from the ORM model generated for process start events by one of our schema generators:
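
(Illustrative sketch only: SenseOn’s generated models aren’t reproduced here, and every name below is an assumption. A schema generator might emit a SQLAlchemy-style model along these lines for a process start schema.)

```python
from sqlalchemy import BigInteger, Column, DateTime, Index, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ProcessStartEvent(Base):
    """Generated mapping from the process start telemetry schema."""

    __tablename__ = "process_start_events"

    id = Column(BigInteger, primary_key=True)
    # Protobuf types are translated to the corresponding database
    # engine types by the schema generator.
    timestamp = Column(DateTime, nullable=False)
    pid = Column(Integer, nullable=False)
    parent_pid = Column(Integer)
    executable_path = Column(String, nullable=False)
    command_line = Column(String)

    # Database-specific schema components, such as index definitions,
    # are emitted alongside the column mappings.
    __table_args__ = (Index("ix_process_start_timestamp", "timestamp"),)
```

Given models like this, the table metadata can drive migrations after a schema change (for example, creating a newly added column), as well as serve as the programmatic interface to stored telemetry.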

We also use schema generators to automatically generate application code (such as our databasing service), specialised interfaces to our telemetry (such as variant interfaces for accessing telemetry created from a group of related schemas) and user-facing documentation (extracted from parsed comments in the Protobuf schemas).

TL;DR: why should I care?

Using schemas to describe our data, and building automation on top of them, saves us huge amounts of effort: