Kraken Log Storage
Motivation
Traditional log monitoring systems used relational databases or ISAM files for archiving logs. Such a system needs to collect and search massive logs (much more than 10000 logs/sec) in realtime. However, old database engine's paradigm cannot support these kinds of issues well. They use fully structured data format, limited column size, b+tree index, heavy transaction, and so on.
Why are they problems?
- Loss of original details
- You necessarily normalize original logs for fixed database schema.
- Logs should not be updated or deleted
- An exception is that log files can be deleted or moved to other medium periodically.
- Low throughput and limited concurrency
- You don't need whole ACID properties, you just want to make sure logs are written.
- B+tree index do not support concurrent read/write accesses well.
- No full-text search, limited search and stats functions
- Many search engines cannot support realtime incremental indexing. Their indexing is performed in a batch fashion.
- Storage consumption
- Many gigabytes of logs are generated per day. You need to archive logs in compressed format and access in realtime.
- Blocking API
- In general, log table contains very large data. When you start query, you cannot easily cancel, pause, or resume query. You should wait until long query processing is completed.
- Limited distributed processing
- How can I run my mapreduce query?
Kraken Log Storage exploits log-specific constraints and solves these problems.
Design
Principles
Append Only Design
At first time, I tried to use b-link tree algorithm. However, it was too complicated, hard to implement, and fragile to exceptional cases. You can see postgresql implementor's note if you want. It was not the solution for high performance incremental read/write operations. And it also waste page free spaces.
At August 2008, I learned erlang and found CouchDB. CouchDB introduced "crash-only" and "document-oriented" design to me. But at that time, I have no idea how to implement indexing. Log should be searched in order, but sometimes logs may come in out of order. Incremental full-text search was also problem. These can be solved using snapshot and zig-zag join. I will explain this later.
Using "append-only" design, you can write logs very fast, and full-text index also can be built in incremental fashion. Kraken Log Storage buffers log data until it finds "out of order" or cannot allocate more memory in page, and then it flushes page to disk file. Many writer threads can writes to memory buffer, only one thread (per tablet) writes to disk, and many readers can concurrently access to memory buffer and data file without intervention.
Split By Timeline
Unlike other business data, logs are purged or moved to other medium periodically. (e.g. dvd or tape) Searching is naturally for date ranges. For fast and easy maintenance, Kraken Log Storage uses this file system layout:
${kraken.data.dir}
+- kraken-log-storage
+- tables
+- log
+- 1
+- 2011-08-13.idx, 2011-08-13.dat
+- 2011-08-12.idx, 2011-08-12.dat
+- ...
+- 2
+- query
+- fbl8994568864142969407.buf
+- ...
In $kraken.data.dir/kraken-log-storage, there are tables metadata file, log, query directories. tables file maintains table id to table name mappings with some metadata. In "log" directory, there are table directories. .idx and .dat files resides in table directory. .idx contains log id and file offset (before compressed) and .dat contains actual log data per day. query directory contains temporary query result files.
Realtime Compression
Kraken Log Storage support realtime compression and decompression. In general, log query is full scan by date range or point query by log id. You can use variable page size easily in append-only design. Page header includes page size, original data size, min/max time, and log id. When log storage flushes page, it compresses page and write original size and compressed size (i.e. page size). When you want to read specific log or scan page, log storage can look page headers and decide which one should be decompressed.
Just using java.util.zip, you can save 80~90% of spaces and reduce I/O cost dramatically. You can adjust compression level. Higher level is good for space saving, but need more cpu time.
Semi-structured Format
Kraken Log Storage uses Kraken Codec for binary data encoding and decoding. It serializes primitive and well-known collection objects (e.g. Set, List, Map) to binary data, and deserializes binary data to typed object graph. You can insert log data as object, and retrieve object like other object database. For example, you can save and retrieve SNMP trap variable bindings as Map<String, Object>.
Because Kraken Codec is a pure java implementation, it's performance is bound to JVM. For example, JProfiler shows codec's string encoding/decoding is main bottleneck. Kraken Codec tries to avoid string serialization using string-to-binary cache. (about x2 speed up)
Zig-zag Join
Zig-zag join is a multi-way merge join. See following example:
Stream A: 3, 9, 12, 15 Stream B: 1, 4, 9, 10 Stream C: 2, 9, 20
If each stream is sorted, you can merge all and generate one result (i.e. 9). For above example, each stream represents postings list for specific token, and each number represents log id. Zig-zag join works like this:
- Select minimum value from all streams
- All streams read values until they find selected minumum value
- If all streams find selected minimum value, add it to result list
- If all streams fails to find common minimum value, throw away the selected value and set next minimum value
- Repeat until meet any end of stream
Minimum can be maximum if you are searching in descending order. Zig-zag join can be used to handle complex boolean query. You can snapshot inverted index in memory and flush to disk incrementally, and search each snapshot files or memory buffer. You can find logs from huge haystack using inverted index and apply other search functions. (e.g. regex)
Again, inverted indexing can benefit from realtime compression and decompresion technique.
Asynchronous API
Kraken Log Storage provides callback for asynchronous support. You can retrieve partial result and cancel table scan.
package org.krakenapps.logstorage;
public interface LogCallback {
void onLog(Log log);
}
public interface LogSearchCallback extends LogCallback {
void interrupt();
boolean isInterrupted();
}
In addition to this, query processing engine provides background query and server push using Kraken Message Bus. User interface can print first page of search result immediately, and draw trend graph incrementally. (See LogQueryService)
User-Friendly Query
Kraken Log Storage provides splunk like query syntax. For example:
table duration:1w ssl-auth | stats count(code) as count by code | sort -count
This query means that search ssl-auth table for 1 week, count by code, and show result in descending order. Let's see next example:
table duration:1w ssl-auth | eval code != success | stats count(_time) as failed by login
| sort limit:30 -failed | lookup login2user login as login_name output name
This query means that search ssl-auth table for 1 week, and show most failed 30 users. lookup command convert login field to actual user name using login2user mapping table, therefore you can easily identify and analyze failed logins.
Query syntax is implemented using Kraken BNF.
File Formats
Log Index File
Log Data File
API
If you did not experienced OSGi, iPOJO, or Kraken Core yet, visit each link first.
LogStorage interface
LogQueryService interface
LogQueryCommand abstract class
Search Commands
table
eval
stats
timechart
sort
rename
lookup
fields
Authors
- delmitz ( delmitz@nchovy.com)
- Query processing and compression engine
- xeraph ( xeraph@nchovy.com)
- stania ( stania@nchovy.com)
See Also
Attachments
-
blink.pdf
(1.3 MB) -
added by xeraph 6 months ago.
Efficient Locking for Concurrent Operations on B-Trees, Lehman, Yao
-
blink-tree.zip
(93.2 KB) -
added by xeraph 6 months ago.
PostgreSQL B-link tree implementation and notes
