wiki:KrakenLogStorage
Last modified 5 months ago

Kraken Log Storage

Motivation

Traditional log monitoring systems used relational databases or ISAM files for archiving logs. Such a system needs to collect and search massive logs (much more than 10000 logs/sec) in realtime. However, old database engine's paradigm cannot support these kinds of issues well. They use fully structured data format, limited column size, b+tree index, heavy transaction, and so on.

Why are they problems?

  • Loss of original details
    • You necessarily normalize original logs for fixed database schema.
  • Logs should not be updated or deleted
    • An exception is that log files can be deleted or moved to other medium periodically.
  • Low throughput and limited concurrency
    • You don't need whole ACID properties, you just want to make sure logs are written.
    • B+tree index do not support concurrent read/write accesses well.
  • No full-text search, limited search and stats functions
    • Many search engines cannot support realtime incremental indexing. Their indexing is performed in a batch fashion.
  • Storage consumption
    • Many gigabytes of logs are generated per day. You need to archive logs in compressed format and access in realtime.
  • Blocking API
    • In general, log table contains very large data. When you start query, you cannot easily cancel, pause, or resume query. You should wait until long query processing is completed.
  • Limited distributed processing
    • How can I run my mapreduce query?

Kraken Log Storage exploits log-specific constraints and solves these problems.

Design

Principles

Append Only Design

At first time, I tried to use b-link tree algorithm. However, it was too complicated, hard to implement, and fragile to exceptional cases. You can see postgresql implementor's note if you want. It was not the solution for high performance incremental read/write operations. And it also waste page free spaces.

At August 2008, I learned  erlang and found  CouchDB. CouchDB introduced "crash-only" and "document-oriented" design to me. But at that time, I have no idea how to implement indexing. Log should be searched in order, but sometimes logs may come in out of order. Incremental full-text search was also problem. These can be solved using snapshot and zig-zag join. I will explain this later.

Using "append-only" design, you can write logs very fast, and full-text index also can be built in incremental fashion. Kraken Log Storage buffers log data until it finds "out of order" or cannot allocate more memory in page, and then it flushes page to disk file. Many writer threads can writes to memory buffer, only one thread (per tablet) writes to disk, and many readers can concurrently access to memory buffer and data file without intervention.

Split By Timeline

Unlike other business data, logs are purged or moved to other medium periodically. (e.g. dvd or tape) Searching is naturally for date ranges. For fast and easy maintenance, Kraken Log Storage uses this file system layout:

${kraken.data.dir}
+- kraken-log-storage
  +- tables
  +- log
    +- 1
      +- 2011-08-13.idx, 2011-08-13.dat
      +- 2011-08-12.idx, 2011-08-12.dat
      +- ...
    +- 2
  +- query
    +- fbl8994568864142969407.buf
    +- ...

In $kraken.data.dir/kraken-log-storage, there are tables metadata file, log, query directories. tables file maintains table id to table name mappings with some metadata. In "log" directory, there are table directories. .idx and .dat files resides in table directory. .idx contains log id and file offset (before compressed) and .dat contains actual log data per day. query directory contains temporary query result files.

Realtime Compression

Kraken Log Storage support realtime compression and decompression. In general, log query is full scan by date range or point query by log id. You can use variable page size easily in append-only design. Page header includes page size, original data size, min/max time, and log id. When log storage flushes page, it compresses page and write original size and compressed size (i.e. page size). When you want to read specific log or scan page, log storage can look page headers and decide which one should be decompressed.

Just using java.util.zip, you can save 80~90% of spaces and reduce I/O cost dramatically. You can adjust compression level. Higher level is good for space saving, but need more cpu time.

Semi-structured Format

Kraken Log Storage uses Kraken Codec for binary data encoding and decoding. It serializes primitive and well-known collection objects (e.g. Set, List, Map) to binary data, and deserializes binary data to typed object graph. You can insert log data as object, and retrieve object like other object database. For example, you can save and retrieve SNMP trap variable bindings as Map<String, Object>.

Because Kraken Codec is a pure java implementation, it's performance is bound to JVM. For example, JProfiler shows codec's string encoding/decoding is main bottleneck. Kraken Codec tries to avoid string serialization using string-to-binary cache. (about x2 speed up)

Zig-zag Join

Zig-zag join is a multi-way merge join. See following example:

Stream A: 3, 9, 12, 15
Stream B: 1, 4, 9, 10
Stream C: 2, 9, 20

If each stream is sorted, you can merge all and generate one result (i.e. 9). For above example, each stream represents postings list for specific token, and each number represents log id. Zig-zag join works like this:

  1. Select minimum value from all streams
  2. All streams read values until they find selected minumum value
  3. If all streams find selected minimum value, add it to result list
  4. If all streams fails to find common minimum value, throw away the selected value and set next minimum value
  5. Repeat until meet any end of stream

Minimum can be maximum if you are searching in descending order. Zig-zag join can be used to handle complex boolean query. You can snapshot inverted index in memory and flush to disk incrementally, and search each snapshot files or memory buffer. You can find logs from huge haystack using inverted index and apply other search functions. (e.g. regex)

Again, inverted indexing can benefit from realtime compression and decompresion technique.

Asynchronous API

Kraken Log Storage provides callback for asynchronous support. You can retrieve partial result and cancel table scan.

package org.krakenapps.logstorage;

public interface LogCallback {
	void onLog(Log log);
}

public interface LogSearchCallback extends LogCallback {
	void interrupt();

	boolean isInterrupted();
}

In addition to this, query processing engine provides background query and server push using Kraken Message Bus. User interface can print first page of search result immediately, and draw trend graph incrementally. (See  LogQueryService)

User-Friendly Query

Kraken Log Storage provides splunk like query syntax. For example:

table duration:1w ssl-auth | stats count(code) as count by code | sort -count

This query means that search ssl-auth table for 1 week, count by code, and show result in descending order. Let's see next example:

table duration:1w ssl-auth | eval code != success | stats count(_time) as failed by login 
                           | sort limit:30 -failed | lookup login2user login as login_name output name

This query means that search ssl-auth table for 1 week, and show most failed 30 users. lookup command convert login field to actual user name using login2user mapping table, therefore you can easily identify and analyze failed logins.

Query syntax is implemented using Kraken BNF.

File Formats

Log Index File

Log Data File

API

If you did not experienced  OSGi,  iPOJO, or Kraken Core yet, visit each link first.

LogStorage interface

LogQueryService interface

LogQueryCommand abstract class

Search Commands

table

eval

stats

timechart

sort

rename

lookup

fields

File Layouts

Log Index File (*.idx)

Log index file (*.idx) contains file pointers to data blocks in "Log Data File (*.dat)". Log storage can find data block for log ID query in O(log n) time. "n" is block count, and one block contains 640KB data by default. Log block size can be configured.

  • File Header
    • Magic string (16 bytes): "NCHOVY_BEAST_IDX"
    • BOM (2 bytes): 0xFEFF
    • Version (2 bytes): "2" hard-coded
    • Header Size (2 bytes; 4-bytes aligned)
      • Total header byte count including extension
    • Extension (variable)
  • Index Blocks: an index block size is 4 + log count * 4
    • Log count (4 bytes)
    • Log data offset postings (4 bytes * log count): offset from each data block

Log Data File (*.dat)

  • File Header
    • Magic string (16 bytes): "NCHOVY_BEAST_DAT"
    • BOM (2 bytes): 0xFEFF
    • Version (2 bytes): "2" hard-coded
    • Header Size (2 bytes; 4-bytes aligned)
      • Total header byte count including extension
    • Extension (variable)
      • block size (4 byte)
      • option strings separated by comma ("deflater" if compression is used)
  • Data Blocks
    • Data Block Header (24 bytes)
      • Min log date of logs in a block (8 bytes)
      • Max log date of logs in a block (8 bytes)
      • Original block length (4 bytes)
      • Compressed block length (4 bytes)
    • Data Block (compressed block length)
      • Log Records: a log record contains id, date, data length and data
        • Log ID (8 bytes)
        • Date (8 bytes)
        • Log Data Length (4 bytes)
        • Log Data (length equals to Log Data Length)

Commands

  • logstorage.tables
    • Print all log tables ([id] table name: last log date format)
      kraken@Future kraken> logstorage.tables
      Tables
      --------
      [4] ssl-flow: 2012-03-11
      [5] ssl-ca: 2012-02-20
      [1] xtm: 2012-03-11
      [2] ssl-auth: 2012-03-11
      [3] ssl-access: 2012-03-11
      
  • logstorage.table [table name]
    • Print all table metadata and storage consumption
      kraken@Future kraken> logstorage.table xtm
      Table xtm
      
      Table Metadata
      ----------------
      
      Storage Consumption
      ---------------------
      145,181,793 bytes
      
  • logstorage.createTable [table name]
    • create new log table
      kraken@Future kraken> logstorage.createTable sample
      table created
      
  • logstorage.dropTable [table name]
    • drop log table (all *.idx and *.dat log files will be removed from disk!)
      kraken@Future bin> logstorage.dropTable sample
      table dropped
      
  • logstorage.writers
    • Show realtime online writer statuses
    • One log writer consume (block size + block size / 64) memory for log buffering
      • default block size is 640KB
    • Log writer can be evicted from memory when idle time is over
      • you can change idle timeout using logstorage.setParameter log_max_idle_time 30000 (millisecond unit)
  • kraken@Future kraken> logstorage.writers
    Online Writers
    -----------------
    table=xtm, day=2012-03-11, last write=2012-03-11 15:25:56
    
  • logstorage.benchmark
    • Run benchmark test using an real world apache httpd log sample
      2011-08-22 17:30:23 Google 111.222.33.44 GET /search q=cache:xgLxoOQBOoIJ: krakenapps.org/+krakenapps&cd=1&hl=en&ct=clnk&source=www.google.com 80 - 123.234.34.45 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1 404 0 3
      
    • v1.9.6 benchmark result on Linux 2.6.20, Intel Xeon 5140 2.33GHz, Mem 2GB, SAMSUNG HD103UJ (1TB, 7200RPM)
      kraken@Future bin> logstorage.benchmark
      === Test #1 ===
      text(write): 1000000 log/13726 ms (72854 log/s)
      text(read): 1000000 log/9214 ms (108530 log/s)
      map(write): 1000000 log/34498 ms (28987 log/s)
      map(read): 1000000 log/16496 ms (60620 log/s)
      
  • logstorage.parameters
    • Print current logstorage configurations. Log storage uses default value if config is null.
parameter name default value description
log_check_interval 10000 flush and idle check interval in milliseconds
log_max_idle_time 600000 max idle timeout in milliseconds. evicts log writer if idle time is over
log_flush_interval 60000 flush interval in milliseconds. force log flush for every interval
log_block_size 655360 (640KB) log data block size. block size is corresponds to uncompressed data size. block size modification is not supported yet
min_free_disk_space_type Percentage size unit for metering lack of disk free spaces. "Percentage" or "Megabyte"
min_free_disk_space_value 10 log storage stops logging or deletes old log files if free space is lower than this configuration
disk_lack_action RemoveOldLog "StopLogging" or "RemoveOldLog"
kraken@Future kraken> logstorage.parameters
log_storage_dir: null
log_check_interval: null
log_max_idle_time: null
log_flush_interval: null
log_block_size: null
min_free_disk_space_type: Percentage
min_free_disk_space_value: 10
disk_lack_action: RemoveOldLog
  • logstorage.setParameter [key] [value]
    • set logstorage parameter
    • you should call logstorage.reload after set parameter
  • logstorage.reload
    • reload log storage engine

Authors

See Also

Attachments

  • blink.pdf Download (1.3 MB) - added by xeraph 22 months ago. Efficient Locking for Concurrent Operations on B-Trees, Lehman, Yao
  • blink-tree.zip Download (93.2 KB) - added by xeraph 22 months ago. PostgreSQL B-link tree implementation and notes