Log collection with multi-line logs is always a headache. Developers are often unwilling to output logs as JSON, so the logs have to be re-structured at collection time.

Since log collectors are implemented in different ways and to different standards, how multi-line logs are handled varies from collector to collector. For example, if we use Fluentd as our log collector, we can use its multiline parser to handle multi-line logs.

The multiline parser uses the formatN and format_firstline parameters to parse the logs. format_firstline is used to detect the starting line of the multiline log. formatN, where N is in the range [1..20], is a list of Regexp formats for multiline logs.
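The mechanism behind format_firstline can be sketched in a few lines of Python (a hypothetical illustration of the idea, not Fluentd's actual implementation): each incoming line is tested against the first-line pattern; a match starts a new event, and non-matching lines are appended to the current one.

```python
import re

# Hypothetical sketch of first-line detection; not Fluentd's real code.
# Same pattern we will later use as format_firstline.
FIRST_LINE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")

def group_events(lines):
    """A line matching FIRST_LINE starts a new event;
    every other line is appended to the current event."""
    events = []
    for line in lines:
        if FIRST_LINE.match(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events

# Abridged versions of the test log lines below.
lines = [
    "2022-06-20 17:28:27.871 DEBUG 6 ... LoadPlan(entity=...)",
    " - Returns",
    " - QuerySpaces",
    "2022-06-20 19:32:47.062 DEBUG 7 ... Closing connections",
]
for event in group_events(lines):
    print(repr(event))  # two events: the continuation lines fold into the first
```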

## Test data

For example, we now have multiple lines of log data as shown below.

```
2022-06-20 19:32:07.264 DEBUG 7 [TID:bb0e9b6d1d704755a93ea1529265bb99.68.16557246000000125] --- [ scheduling-4] o.s.d.redis.core.RedisConnectionUtils : Closing Redis Connection.
2022-06-20 19:32:07.264 DEBUG 7 [TID:bb0e9b6d1d704755a93ea1529265bb99.68.16557246000000125] --- [ scheduling-4] io.lettuce.core.RedisChannelHandler : dispatching command AsyncCommand [type=DEL, output=IntegerOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]
2022-06-20 17:28:27.871 DEBUG 6 [TID:N/A] --- [ main] o.h.l.p.build.spi.LoadPlanTreePrinter : LoadPlan(entity=com.xxxx.entity.ScheduledLeadsInvalid)
 - Returns
 - EntityReturnImpl(entity=com.xxxx.entity.ScheduledLeadsInvalid, querySpaceUid=, path=com.xxxx.entity.ScheduledLeadsInvalid)
 - QuerySpaces
 - EntityQuerySpaceImpl(uid=, entity=com.xxxx.entity.ScheduledLeadsInvalid)
 - SQL table alias mapping - scheduledl0_
 - alias suffix - 0_
 - suffixed key columns - {id1_51_0_}
2022-06-20 19:32:47.062 DEBUG 7 [TID:N/A] --- [nection-cleaner] h.i.c.PoolingHttpClientConnectionManager : Closing connections idle longer than 60000 MILLISECONDS
```

First create a fluentd directory, with an etc directory for the Fluentd configuration files and a logs directory for the logs, and save the above test logs as logs/test.log.

```shell
$ mkdir fluentd
$ cd fluentd
# Create the etc directory for the fluentd config files and the logs directory for the logs.
$ mkdir -p etc logs
```

## General parsing

Then create a Fluentd configuration file etc/fluentd_basic.conf for parsing the logs, with the following contents.

```
<source>
  @type tail
  path /fluentd/logs/*.log
  pos_file /fluentd/logs/test.log.pos
  tag test.logs
  read_from_head true
  <parse>
    @type regexp
    expression /^(?<timestamp>[^ ]* [^ ]*) (?<level>[^\s]+) (?<pid>[^s+]+) \[TID:(?<tid>[,a-z0-9A-Z./]+)\] --- \[(?<thread>.*)\] (?<message>[\s\S]*)/
  </parse>
</source>

<match **>
  @type stdout
</match>
```

Then we use the Docker image to start Fluentd and parse our logs.

```shell
$ docker run --rm -v $(pwd)/etc:/fluentd/etc -v $(pwd)/logs:/fluentd/logs fluent/fluentd:v1.14-1 -c /fluentd/etc/fluentd_basic.conf -v
fluentd -c /fluentd/etc/fluentd_basic.conf -v
2022-06-20 12:31:17 +0000 [info]: fluent/log.rb:330:info: parsing config file is succeeded path="/fluentd/etc/fluentd_basic.conf"
2022-06-20 12:31:17 +0000 [info]: fluent/log.rb:330:info: gem 'fluentd' version '1.14.3'
2022-06-20 12:31:17 +0000 [warn]: fluent/log.rb:351:warn: define <match fluent.*> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
...
```

From the parsing results above, we can see that some lines do not match the regular expression while others parse normally; for example, the following is the result of parsing the first log line.

```json
{"timestamp":"2022-06-20 19:32:07.264","level":"DEBUG","pid":"7","tid":"bb0e9b6d1d704755a93ea1529265bb99.68.16557246000000125","thread":" scheduling-4","message":"o.s.d.redis.core.RedisConnectionUtils : Closing Redis Connection."}
```

What fails to match are the multi-line logs: Fluentd treats each physical line as a separate log entry, which is obviously not what we expect.
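A quick check in Python shows why: the same regular expression used in the Fluentd config (transcribed here into Python's `(?P<name>...)` named-group syntax) matches a complete first line but fails on a continuation line, so per-line parsing drops the continuations.

```python
import re

# The expression from the Fluentd config, transcribed to Python.
pattern = re.compile(
    r"^(?P<timestamp>[^ ]* [^ ]*) (?P<level>[^\s]+) (?P<pid>[^s+]+) "
    r"\[TID:(?P<tid>[,a-z0-9A-Z./]+)\] --- \[(?P<thread>.*)\] (?P<message>[\s\S]*)"
)

first = ("2022-06-20 19:32:07.264 DEBUG 7 "
         "[TID:bb0e9b6d1d704755a93ea1529265bb99.68.16557246000000125] "
         "--- [ scheduling-4] o.s.d.redis.core.RedisConnectionUtils : Closing Redis Connection.")
continuation = " - Returns"

m = pattern.match(first)
print(m.group("level"))             # DEBUG
print(pattern.match(continuation))  # None -- continuation lines don't parse
```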

## Multi-line parser

What we want is to treat a multi-line log as a single entry, so here we need the multiline parser. Create a new configuration file etc/fluentd_multline.conf for multi-line log processing, with the following content.

```
<source>
  @type tail
  path /fluentd/logs/*.log
  pos_file /fluentd/logs/test.log.pos
  tag test.logs
  read_from_head true
  <parse>
    @type multiline
    format_firstline /\d{4}-\d{1,2}-\d{1,2}/
    format1 /^(?<timestamp>[^ ]* [^ ]*) (?<level>[^\s]+) (?<pid>[^s+]+) \[TID:(?<tid>[,a-z0-9A-Z./]+)\] --- \[(?<thread>.*)\] (?<message>[\s\S]*)/
  </parse>
</source>

<match **>
  @type stdout
</match>
```

Here format_firstline /\d{4}-\d{1,2}-\d{1,2}/ matches the beginning of each log entry (a line starting with a date), and format1 parses the first line. If you have more lines with known structure to match, you can continue configuring matching rules for the second line with format2, and so on. Restart Fluentd with the configuration above.
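As an aside before looking at the output: if the continuation lines themselves had a fixed shape, format2 onwards could parse them into their own fields. A hypothetical sketch of such a parse section (our test logs don't need it, since the message group in format1 already swallows the continuation lines):

```
<parse>
  @type multiline
  format_firstline /\d{4}-\d{1,2}-\d{1,2}/
  # format1 parses the first physical line of each event
  format1 /^(?<timestamp>[^ ]* [^ ]*) (?<level>[^\s]+) (?<message>.*)\n/
  # format2 parses the second physical line (here: an exception class and text)
  format2 /^(?<exception>[^:]+): (?<detail>.*)/
</parse>
```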

```shell
$ docker run --rm -v $(pwd)/etc:/fluentd/etc -v $(pwd)/logs:/fluentd/logs fluent/fluentd:v1.14-1 -c /fluentd/etc/fluentd_multline.conf -v
fluentd -c /fluentd/etc/fluentd_multline.conf -v
2022-06-20 12:41:58 +0000 [info]: fluent/log.rb:330:info: parsing config file is succeeded path="/fluentd/etc/fluentd_multline.conf"
2022-06-20 12:41:58 +0000 [info]: fluent/log.rb:330:info: gem 'fluentd' version '1.14.3'
2022-06-20 12:41:58 +0000 [warn]: fluent/log.rb:351:warn: define <match fluent.*> to capture fluentd logs in top level is deprecated. Use <label @FLUENT_LOG> instead
...
```

You can see that the logs now come out correctly: the previous multi-line log has been parsed into a single entry, as we expected.

```json
{"timestamp":"2022-06-20 17:28:27.871","level":"DEBUG","pid":"6","tid":"N/A","thread":" main","message":"o.h.l.p.build.spi.LoadPlanTreePrinter : LoadPlan(entity=com.xxxx.entity.ScheduledLeadsInvalid)\n - Returns\n - EntityReturnImpl(entity=com.xxxx.entity.ScheduledLeadsInvalid, querySpaceUid=, path=com.xxxx.entity.ScheduledLeadsInvalid)\n - QuerySpaces\n - EntityQuerySpaceImpl(uid=, entity=com.xxxx.entity.ScheduledLeadsInvalid)\n - SQL table alias mapping - scheduledl0_\n - alias suffix - 0_\n - suffixed key columns - {id1_51_0_}"}
```
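A quick way to convince yourself that the \n escapes in the message field are real embedded newlines is to load the record as JSON (the record below is abridged to keep the example short):

```python
import json

# Abridged version of the record printed by fluentd above.
record = ('{"timestamp":"2022-06-20 17:28:27.871","level":"DEBUG","pid":"6",'
          '"tid":"N/A","thread":" main",'
          '"message":"LoadPlan(entity=...)\\n - Returns\\n - QuerySpaces"}')

event = json.loads(record)
# The message field holds the whole multi-line body as one string.
print(event["message"].count("\n"))      # 2 embedded newlines
print(event["message"].splitlines()[1])  # " - Returns"
```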

Of course, the whole process is not complicated; the only troublesome part is having to write regular expressions to match the logs, which is probably what most people struggle with.