FOSS Project Spotlight: Sawmill, the Data Processing Project

Sawmill Logo

If you're into centralized logging, you are probably familiar with the ELK Stack: Elasticsearch, Logstash and Kibana. Just in case you're not, ELK (or Elastic Stack, as it's being renamed these days) is a package of three open-source components, each responsible for a different task or stage in a data pipeline.

Logstash is responsible for aggregating the data from your different data sources and processing it before sending it off for indexing and storage in Elasticsearch. This is a key role. How you process your log data directly impacts your analysis work. If your logs are not structured correctly and you have not configured Logstash correctly, your logs will not be parsed in a way that enables you to query and visualize them in Kibana.

Logz.io used to rely heavily on Logstash for ingesting data from our customers, running multiple Logstash instances at any given time. However, we began to experience some pain points that ultimately led us down the path to the project that is the subject of this article: Sawmill.

Explaining the Motivation

Over time, and as our data pipelines became more complex and heavy, we began to encounter serious performance issues. Our Logstash configuration files became extremely complicated, which resulted in extremely long startup times. Processing also was taking too long, especially in the case of long log messages and in cases where there was a mismatch between the configuration and the actual log message.

The above points resulted in serious stability issues, with Logstash coming to a halt or sometimes crashing. The worst thing about it was that troubleshooting was a huge challenge. We lacked visibility and felt a growing need for a way to monitor key performance metrics.

There were additional issues we encountered, such as dynamic configuration reload and the ability to apply business logic, but suffice it to say, Logstash was simply not cutting it for us.

Introducing Sawmill

Before diving into Sawmill, it's important to point out that Logstash has developed since the time we began working on this project, with new features that help deal with some of the pain points described above.

So, what is Sawmill?

Sawmill is an open-source Java library for enriching, transforming and filtering JSON documents.

For Logstash users, the best way to understand Sawmill is as a replacement of the filter section in the Logstash configuration file. Unlike Logstash, Sawmill does not have any inputs or outputs to read and write data. It is responsible only for data transformation.

Using Sawmill pipelines, you can use your groks, geoip, user-agent resolving, add or remove fields/tags and more, in a descriptive manner, using configuration files or builders, in a simple DSL, allowing you to change transformations dynamically.

Sawmill Key Features

Here's a list of the key features and processing capabilities that Sawmill supports:

Using Sawmill

Here is a basic example illustrating how to use Sawmill:


Doc doc = new Doc(myLog);
PipelineExecutor pipelineExecutor = new PipelineExecutor();
pipelineExecutor.execute(pipeline, doc);

As you can see, there are a few entities in Sawmill:

Sawmill Configuration

A Sawmill pipeline can get built from a HOCON string (Human-Optimized Config Object Notation).

Here is a simple configuration snippet, to get the feeling of it:


{
"steps": [{
    "grok": {
        "config": {
            "field": "message",
            "overwrite": ["message"],
"patterns":["%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
           }
        }
    }]
}

Which is equivalent to the following in HOCON:


steps: [{
    grok.config: {
            field : "message"
            overwrite : ["message"]
            patterns :
["%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}"]
           }
    }]

To understand how to use Sawmill, here's a simple example showing GeoIP resolution:


package io.logz.sawmill;

import io.logz.sawmill.Doc;
import io.logz.sawmill.ExecutionResult;
import io.logz.sawmill.Pipeline;
import io.logz.sawmill.PipelineExecutor;

import static io.logz.sawmill.utils.DocUtils.createDoc;

public class SawmillTesting {

    public static void main(String[] args) {

        Pipeline pipeline = new Pipeline.Factory().create(
                "{ steps :[{\n" +
                "    geoIp: {\n" +
                "      config: {\n" +
                "        sourceField: \"ip\"\n" +
                "        targetField: \"geoip\"\n" +
                "        tagsOnSuccess: [\"geo-ip\"]\n" +
                "      }\n" +
                "    }\n" +
                "  }]\n" +
                "}");

        Doc doc = createDoc("message", "testing geoip resolving", 
         ↪"ip", "172.217.11.174");
        ExecutionResult executionResult = new
PipelineExecutor().execute(pipeline, doc);

        if (executionResult.isSucceeded()) {
            System.out.println("Success! result 
             ↪is:"+doc.toString());
        }
    }
}

End Results

We've been using Sawmill successfully in our ingest pipelines for more than a year now, processing the huge amounts of log data shipped to us by our users.

We know Sawmill is still missing some key features, and we are looking forward to getting contributions from the community. We also realize that at the end of the day, Sawmill was developed for our specific needs and might not be relevant for your use case. Still, we'd love to get your feedback.

—Daniel Berman