Taking System Monitoring to the Next Level: an Interview with Scalyr CEO Steve Newman

As computing ecosystems become more complex, monitoring and analyzing those often disconnected moving parts becomes increasingly challenging. By Petros Koutoupis

Today's data center has evolved from a single supplier producing and selling all-in-one offerings, such as the days when EMC, NetApp, HP or even Sun owned your data center and you chose a vendor and stuck with it. Those same vendors provided you with the required tools to monitor, analyze and troubleshoot their entire stack.

Shifting focus to the present, the landscape now appears to be quite different. Instead, you will find environments of mixed offerings provided by an assortment of vendors, both large and small. Proprietary machines work side by side with off-the-shelf commodity devices hosting software-defined software. Half of your applications may be hosted in virtual machines over a hypervisor or just spun up in a container. How does a modern data-center administrator or DevOps professional manage such an environment?

An assortment of platforms and frameworks exist that provide such capabilities, but they're not all one and the same. In some cases, those same tools will need to be coupled with others to produce something useful (for example, ELK: Elasticsearch + Logstash + Kibana). Unfortunately, this arrangement just adds to the complication and frustration when attempting to diagnose or discover problems in your computing environment.

Putting an end to this level of complexity, one company stands out among the rest: Scalyr. Scalyr develops and offers a complete suite of server monitoring, log management, visualization and analysis tools, which integrate with cloud services. I recently had the pleasure of chatting with Scalyr CEO Steve Newman.

His is not a household name, like Steve Jobs or Bill Gates, but you will be familiar with his work and contributions to cloud-enabled technologies. Although this is likely to change with Scalyr, Steve is best known for his work with Writely, a technology that later was acquired by Google and relabeled as Google Docs. In our conversation, Steve and I took the opportunity to discuss Scalyr, its solution and the problem it solves.

Steve Newman headshot

Steve Newman, Scalyr CEO

Petros Koutoupis: Tell me a bit about yourself. Who is Steve Newman?

Steve Newman: I am an engineer by both training and background and have spent most of my career in the startup environment. This is because I enjoy building things. I was at Google for a number of years following an acquisition, and while the experience itself was great, the startup bug in me drove me to Scalyr.

PK: So, now you founded a company called Scalyr. Please tell us, what is Scalyr?

SN: Scalyr is a log management platform for engineers responsible for software development. We collect logs from applications, services, containers and systems, and make that data available to help engineering teams track down problems and generally manage the complexity of modern development and operations.

PK: And why? What problem(s) does your product solve?

SN: Traditional log management tools are complex and often very slow at scale. This leads to a "gatekeeper" approach to log management, where only a select few acquire the expertise (and have the patience!) to access this critical data. Logs become a tool of last resort, hindering the team's ability to rapidly or proactively address issues.

My co-founder Steven Czerwinski and I first experienced this problem ourselves at Google. We were leading an infrastructure project supporting Google Docs, Drive, Photos and other related applications. There were a lot of moving parts, and the engineering team spent close to half of its time simply tracking down problems. We started Scalyr in 2011 to create the tool we wished we'd had at Google—one that would allow us to make sense of the flood of telemetry data and quickly understand why a complex system is misbehaving.

PK: How does Scalyr work?

SN: We're a fully managed SaaS solution. Logs are sent to our centrally hosted search cluster, using our agent or an array of API integrations. Engineers then use our web app (or APIs) to analyze, visualize and explore the logs.

The critical component is the back end. We've built the back-end software stack from scratch, optimizing for the data access patterns that arise in log management. Some interesting aspects of our approach are:

1) Unlike other log management solutions, we don't use indexes. Keyword indexes are optimized for finding the "best ten matches", in a corpus comprised of slowly evolving, human language text such as web pages or product descriptions. Log management use cases are very different, with small units of text (individual log messages), constantly updated, full of record IDs and other non-words that balloon the vocabulary size. Most important, log management queries generally visualize a complete data set, rather than stop after ten high-ranking results. Keyword indexes don't help much there, and they are complex, expensive to maintain and often impose multi-minute ingestion delays.

We've taken a much simpler approach, building a streamlined, columnar data store that's optimized for log data. The basic idea is that we just store logs and scan them, like good old-fashioned grep. We then use a lot of tricks to minimize the amount of data that needs to be scanned; for instance, when querying specific fields of a log, the columnar data layout means that we need to scan only those fields.

2) We process queries one at a time, globally. This allows each customer to use our entire search cluster, with aggregate search performance of 1.5 terabytes per second. It's fast enough (96% of queries complete in less than one second) that queries almost never wait in line—we finish each query before the next one arrives. The nice thing about this approach is that there's an economy of scale: as our customer base grows, performance increases.

3) We've built a separate back end for repetitive queries, such as those used in dashboards or alerting rules. This part of the system is based on a time series database, with a custom time series for each query. The ingestion engine automatically updates these time series as log messages arrive. This means we don't need to execute queries to display a dashboard or evaluate an alerting rule—the relevant data has been precomputed in the time series. In database terms, we're automatically creating materialized views where needed.

PK: What makes Scalyr stand out or competitive with existing solutions?

SN: Speed, simplicity and scalability.

Speed was central to our mission from day one. Logs are a massively useful, detailed data source, but when it takes minutes (or longer) to run a query, engineers avoid using them. We satisfy most queries in a fraction of a second. We also ingest data in real time: new logs are available for querying within a few seconds.

Simplicity goes hand in hand with speed, which is best measured as the time from a question in an engineer's head to an answer on his or her screen. The fastest back end in the world is of little use if you spent five minutes wrestling with the query language. We rely on our performance, as well as our ability to parse logs and extract structured data on ingestion, to provide a set of visual exploration tools that allows engineers to get answers without becoming query language experts. There's a query language, but you can dive in without knowing anything about it.

Finally, customers often choose us for our scalability. We continue to work well not only as server count and data volume increase, but as a team grows. Scalyr works just as quickly whether it's three or 1,000 people looking at logs. This helps teams move away from the gatekeeper model of traditional log management to the concurrent, collaborative engineering model that modern organizations are increasingly adopting.

With Scalyr, companies for the first time have a log management platform that does not charge ruthless licensing fees for new users, involve long ramp-up times or require learning arcane query languages. It's meant for teams who need to move fast.

PK: Who would benefit from using Scalyr?

SN: Our sweet spot is organizations where the application is critical to the business. From B2B software to online retailers, dating platforms and media companies, every business' competitive advantage increasingly is the technology stack and the speed at which that stack can evolve. Scalyr is critical to enabling that. Some of our customers include NBCUniversal, OkCupid, Zalando and ReturnPath.

Within those organizations, the primary Scalyr user usually comes from engineering or DevOps. But Scalyr is simple enough to use that we often see usage spread to other roles like support, which can search logs to track down specific client problems. Some of our customers have upwards of 1000 individuals with Scalyr logins. Typically, half the users are active on a weekly basis, which is huge engagement compared to traditional log management platforms.

PK: How easily does Scalyr integrate into current production environments?

SN: It's our mission to meet customers where they are, so we support many different models. Some run their own servers or virtual servers. Some are on Kubernetes, while others are serverless. Regardless, setup first involves retrieving logs. The most common way of doing this is with a lightweight agent that customers can install as a container, sidecar or whatever they need. We also have API integrations to retrieve logs directly from the wide array of cloud services in use today.

Once the logs are flowing in, you're off and running, but an important further step is to set up parsing rules. This allows us to extract structured data, unlocking the full power of the analysis and visualization tools. To make this as easy as possible, we've built three generations of parsing engines. The current engine is so easy to use, we've literally put a button in the product that tells our support team to set up the rules for you. Of course, being engineers, many of our customers prefer to do it themselves.

Conclusion

With today's internet, the "app" increasingly is the business (think of Uber, Airbnb, Amazon and so on), and getting to the bottom of downtime is crucial. This process is made more difficult than ever as the system or code is increasingly distributed with the use of containers, serverless and other technologies. That is where Scalyr comes in with its log analysis platform. It is crazy fast and easy to use. To learn more about this wonderful product, visit https://www.scalyr.com.

About the Author

Petros Koutoupis, LJ Editor at Large, is currently a senior platform architect at IBM for its Cloud Object Storage division (formerly Cleversafe). He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for well over a decade and has helped pioneer the many technologies unleashed in the wild today.