In this post I would like to summarize two tools I have been using in my recent work: the benchmarking tool JMH and the Java runtime monitoring tool Arthas. Both have helped me in real work, and writing this up also deepens my own familiarity with them. For each tool I will first briefly introduce its typical usage scenarios, and then use a real performance problem I ran into at work to walk through its actual usage in detail. Without further ado, let's get to the point.

Problem Description

So that the examples later can exercise both tools end to end, I will first briefly describe the actual performance problem we hit during development, and then show how we used the two tools to successfully locate the performance bottleneck.

The problem is as follows: in order to support a configurable loss rate, we replaced the original log4j2 Async + custom Appender approach and moved the asynchronous logic into our own revamped Appender. However, we found that the modified logging performed far worse than the previous Async + custom Appender approach. For privacy reasons I won't use the company's actual code here; instead I use an alternative setup that reproduces the same problem. Let's skip the specific configuration file for now and look at the performance test code and results.
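For context, the core idea of the revamped appender, tolerating a bounded loss rate by dropping log events instead of blocking when a queue fills up, can be sketched roughly like this. This is my own simplified illustration (DroppingAsyncSink and every name in it are made up), not the actual production code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Simplified illustration of an appender that tolerates a loss rate:
// events go into a bounded queue; when the queue is full the event is
// dropped instead of blocking the business thread.
public class DroppingAsyncSink {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    public DroppingAsyncSink(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Called on the business thread: never blocks.
    public boolean append(String event) {
        if (queue.offer(event)) {
            return true;
        }
        dropped.incrementAndGet(); // accept the loss, keep latency low
        return false;
    }

    // Called by a background writer thread to drain events.
    public String poll() {
        return queue.poll();
    }

    public long droppedCount() {
        return dropped.get();
    }

    public static void main(String[] args) {
        DroppingAsyncSink sink = new DroppingAsyncSink(2);
        for (int i = 0; i < 5; i++) {
            sink.append("msg " + i);
        }
        // capacity 2, 5 appends, no consumer: 3 events are dropped
        System.out.println("dropped = " + sink.droppedCount()); // prints "dropped = 3"
    }
}
```

The real appender would of course hand the queue to a writer thread; the point here is only the non-blocking offer plus a drop counter.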

Code

package com.bryantchang.appendertest;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AppenderTest {

    private static final String LOGGER_NAME_DEFAULT = "defaultLogger";
    private static final String LOGGER_NAME_INCLUDE = "includeLocationLogger";
    // switch this to LOGGER_NAME_DEFAULT to test the other configuration
    private static final Logger LOGGER = LoggerFactory.getLogger(LOGGER_NAME_INCLUDE);
    public static final long BATCH = 10000;

    public static void main(String[] args) throws InterruptedException {
        while(true) {
            long start, end;
            start = System.currentTimeMillis();
            for (int i = 0; i < BATCH; i++) {
                LOGGER.info("msg is {}", i);
            }
            end = System.currentTimeMillis();
            System.out.println("duration of " + LOGGER_NAME_INCLUDE +  " is " + (end - start) + "ms");
            Thread.sleep(1000);
        }
    }
}

The code logic is simple: it calls logger.info to print 10,000 logs per batch and records the elapsed time of each batch. The output of the two runs is compared below.

(screenshot: console timing output of the first run)

Comparison of results:

(screenshot: console timing output of the second run)

From these two screenshots we can see that, for the same logic, the elapsed times of the two programs differ by tens of times, yet the only visible difference seems to be the name of the logger. Analyzing the results of this experiment, two questions arise:

  • Is the above testing approach standard and rigorous?
  • If it really is a performance problem, which method in the two programs causes such a large performance gap?

The two tools below will be used to answer these two questions respectively.

Introduction to JMH

The first question is whether the test method is standard. The code above is written in a dead-loop, manual stopwatch style. If we wanted to add a multi-threaded test we would need a thread pool, and we would have to maintain the test harness logic on top of the code under test. A benchmarking framework takes care of all this for us; JMH is exactly such a framework for Java. Here is the official definition of JMH.

JMH is a performance benchmarking framework for Java and other languages targeting the JVM.

Its sweet spot is benchmarking small, self-contained operations, such as comparing the put and get performance of collections like ArrayList vs. LinkedList. Here we need to measure the time it takes to write a batch of logs, which also fits JMH's scenarios well. Below is the test code: the method to be benchmarked, the benchmark definitions, and the main function.

Methods to be tested

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogBenchMarkWorker {

    private LogBenchMarkWorker() {}

    private static class LogBenchMarkWorkerInstance {
        private static final LogBenchMarkWorker instance = new LogBenchMarkWorker();
    }

    public static LogBenchMarkWorker getInstance() {
        return LogBenchMarkWorkerInstance.instance;
    }

    private static final String LOGGER_DEFAULT_NAME = "defaultLogger";
    private static final String LOGGER_INCLUDE_LOCATION = "includeLocationLogger";
    private static final Logger LOGGER = LoggerFactory.getLogger(LOGGER_DEFAULT_NAME);
    private static final long BATCH_SIZE = 10000;

    public void logBatch() {
        for (int i = 0; i < BATCH_SIZE; i++) {
            LOGGER.info("msg is {}", i);
        }
    }
}

As you can see, the method to be tested is very simple: it prints 10,000 logs in a single batch, so no extra explanation is needed. Let's look at the benchmark section next.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Threads;

public class LogBenchMarkMain {

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1)
    @Threads(1)
    public void testLog1() {
        LogBenchMarkWorker.getInstance().logBatch();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1)
    @Threads(4)
    public void testLog4() {
        LogBenchMarkWorker.getInstance().logBatch();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1)
    @Threads(8)
    public void testLog8() {
        LogBenchMarkWorker.getInstance().logBatch();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1)
    @Threads(16)
    public void testLog16() {
        LogBenchMarkWorker.getInstance().logBatch();
    }
}

In this code we encounter several annotations unique to JMH, which I briefly describe below.

  • @Benchmark: placed on a method to mark it as the smallest unit of measurement, called an "op" in JMH terms
  • @BenchmarkMode: the measurement mode; JMH provides several:
      • Throughput: overall throughput
      • AverageTime: average time per call, i.e. the time each op takes
      • SampleTime: sampling, reporting op times at various percentiles, e.g. the 99% and 99.99% op times
  • @Fork: number of forks; if set to 2, JMH forks two processes to run the test
  • @Threads: easy to understand; the number of threads executing the method concurrently

In the code above I defined four methods to be tested. Their @Fork and @BenchmarkMode settings are the same (average time per op), but the number of threads differs between the four. Besides these parameters there are a few more, which I will cover in the main function; the main code is shown below.

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Main {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LogBenchMarkMain.class.getSimpleName())
                .warmupIterations(5)
                .measurementIterations(5)
                .output("logs/BenchmarkCommon.log")
                .build();
        new Runner(options).run();
    }
}

We can see that in the main function we set up JMH's basic configuration. The meaning of each configuration parameter is as follows.

  • include: the name of the class containing the benchmarks; regular expressions are allowed
  • warmupIterations: number of warmup iterations. Warmup is needed because of the JIT: as code runs, the JVM optimizes it dynamically, so we run a few warmup rounds until execution stabilizes before actually measuring
  • measurementIterations: number of measured iterations
  • output: where the benchmark report is written

I ran the tests with each of the two loggers separately; the benchmark report comparison is below.

Using the normal logger

LogBenchMarkMain.testLog1   avgt    5  0.006 ± 0.001   s/op
LogBenchMarkMain.testLog16  avgt    5  0.062 ± 0.026   s/op
LogBenchMarkMain.testLog4   avgt    5  0.006 ± 0.002   s/op
LogBenchMarkMain.testLog8   avgt    5  0.040 ± 0.004   s/op

Using the includeLocation logger

LogBenchMarkMain.testLog1   avgt    5  0.379 ± 0.007   s/op
LogBenchMarkMain.testLog16  avgt    5  1.784 ± 0.670   s/op
LogBenchMarkMain.testLog4   avgt    5  0.378 ± 0.003   s/op
LogBenchMarkMain.testLog8   avgt    5  0.776 ± 0.070   s/op

Here the performance gap is immediately apparent: the logger with INCLUDE_LOCATION performs dramatically worse than the normal logger. At this point we have to stop and ask: what is causing the slowdown?

Arthas: what exactly is my code doing at runtime?

Arthas is a Java diagnostic tool open-sourced by Alibaba, similar to greys but with more powerful features: for example, it can directly produce flame graphs of Java method calls. Both tools are built on Java's powerful bytecode instrumentation. I am no JVM expert, so I won't expand on the implementation details; all we need to do is stand on the shoulders of giants and use these excellent performance monitoring tools skillfully.
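To make the idea concrete: conceptually, an instrumentation-based monitor behaves as if the target method were wrapped with timing code. Here is a minimal sketch of that effect (MethodMonitor and all names in it are mine, not how Arthas is actually implemented):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Conceptual sketch only: Arthas rewrites bytecode at runtime, but the
// observable effect is equivalent to wrapping the target method with
// timing code like this.
public class MethodMonitor {
    private final AtomicLong calls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    // "Instrumented" invocation: record call count and elapsed time.
    public void invoke(Runnable target) {
        long start = System.nanoTime();
        try {
            target.run();
        } finally {
            calls.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    public long calls() {
        return calls.get();
    }

    public double avgMillis() {
        long n = calls.get();
        return n == 0 ? 0.0 : totalNanos.get() / (double) n / TimeUnit.MILLISECONDS.toNanos(1);
    }

    public static void main(String[] args) {
        MethodMonitor monitor = new MethodMonitor();
        for (int i = 0; i < 5; i++) {
            monitor.invoke(() -> { /* the method under observation */ });
        }
        System.out.println("calls=" + monitor.calls() + " avg=" + monitor.avgMillis() + "ms");
    }
}
```

This is essentially what the monitor command reports per statistics cycle: call count plus average elapsed time, gathered without touching the business code's source.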

Hands-on

Talk is cheap, show me the code. Since this is a tool, let's go straight to practice: run the program from the beginning of this post locally, and then walk through how to use Arthas.

We first find the process id of the program via jps, and then attach Arthas to the running program (it enhances the bytecode of the live process) by starting it with the as.sh script that ships with Arthas:

./as.sh pid

This is the startup screen of Arthas; we can use the help command to see the features the tool supports.

(screenshot: Arthas startup screen and help output)

As you can see, Arthas supports viewing the current JVM state, viewing thread states, monitoring the invocation of specific methods, tracing, generating flame graphs with the profiler, and more: a full range of functions. We will only use a few of the most common commands here; if you are interested in the others, see the official Arthas website for details. This article mainly introduces the following features.

  • Decompile code
  • Monitor the invocation of a method
  • View the call and return value of a method
  • Trace a method

Monitor method calls

The main command for this is monitor. According to the official website, the common usage is:

monitor -c duration className methodName

Here duration is the statistics interval in seconds, i.e. one reporting cycle; className is the fully qualified name of the class; and methodName is the name of the method. The method we want to observe is the info method of the Logger class, invoked through the two different loggers respectively. The class here is org.slf4j.Logger and the method is info, so our monitoring statement is:

monitor -c 5 org.slf4j.Logger info

The monitoring results are as follows

  • Using the normal appender

    (screenshot: monitor statistics for the normal appender)

  • Using the includeLocation appender

    (screenshot: monitor statistics for the includeLocation appender)

We can see that the logging method using the includeLocation appender takes roughly three times as long as the one using the normal appender, which makes us wonder how the time is spent inside the individual steps of these two calls. This brings us to the second command, trace.

trace command & jad command

The log4j2 configuration used by the two programs is as follows.

<?xml version="1.0" encoding="UTF-8"?>
<!--status: level of log4j2's internal status logging; monitorInterval: interval in seconds at which the configuration file is re-read-->
<configuration status="warn" monitorInterval="30">
    <!--define all appenders -->
    <appenders>
        <!--configuration for console output -->
        <Console name="console" target="SYSTEM_OUT">
            <!--the console outputs only messages at this level or above (onMatch); everything else is denied (onMismatch) -->
            <ThresholdFilter level="warn" onMatch="ACCEPT" onMismatch="DENY"/>
            <!--log line format -->
            <PatternLayout pattern="[%d{HH:mm:ss.SSS}] [%-5p] %l - %m%n"/>
        </Console>


        <Async name="AsyncDefault" blocking="false" includeLocation="false">
            <AppenderRef ref="fileAppender"/>
        </Async>

        <Async name="AsyncIncludeLocation" blocking="false" includeLocation="true">
            <AppenderRef ref="fileAppender"/>
        </Async>

        <!--this file receives all messages; the log is cleared on every program run, controlled by the append attribute, which is handy for ad-hoc testing -->
        <!--append="true" appends messages to the file, "false" overwrites the file's content; the default is true -->
        <File name="fileAppender" fileName="log/test.log" append="false">
            <PatternLayout pattern="[%d{HH:mm:ss.SSS}] [%-5p] %l - %m%n"/>
        </File>

        <!--a ThresholdFilter selectively outputs messages at or above a given level; onMatch="ACCEPT" onMismatch="DENY" means accept on match, otherwise deny -->
        <File name="ERROR" fileName="logs/error.log">
            <ThresholdFilter level="error" onMatch="ACCEPT" onMismatch="DENY"/>
            <PatternLayout pattern="[%d{yyyy.MM.dd 'at' HH:mm:ss z}] [%-5p] %l - %m%n"/>
        </File>

        <!--this prints all messages; whenever the size limit is exceeded, that chunk of log is compressed and archived into a folder named by year-month -->
        <RollingFile name="rollingFile" fileName="logs/app.log"
                     filePattern="logs/$${date:yyyy-MM}/web-%d{yyyy-MM-dd}.log.gz">
            <PatternLayout pattern="[%d{yyyy-MM-dd HH:mm:ss}] [%-5p] %l - %m%n"/>
            <Policies>
                <TimeBasedTriggeringPolicy modulate="true" interval="1"/>
            </Policies>
            <DefaultRolloverStrategy>
                <Delete basePath="logs" maxDepth="2">
                    <IfFileName glob="*/*.log.gz" />
                    <IfLastModified age="7d" />
                </Delete>
            </DefaultRolloverStrategy>
        </RollingFile>
    </appenders>

    <!--then define the loggers; an appender only takes effect once a logger references it -->
    <loggers>
        <!--filter out some useless DEBUG messages from spring and mybatis-->
        <logger name="defaultLogger" additivity="false">
            <appender-ref ref="AsyncDefault"></appender-ref>
        </logger>

        <logger name="includeLocationLogger" additivity="false">
            <appender-ref ref="AsyncIncludeLocation"></appender-ref>
        </logger>

        <!--a default root logger -->
        <root level="INFO">

        </root>
    </loggers>

</configuration>

Both programs use an Async appender wrapping the same File appender. Looking at the fileAppender, there is no difference between the two; the only difference is a single option on the Async appender, includeLocation: false in one and true in the other. We won't discuss the meaning of this parameter yet. We now suspect the problem lies in the AsyncAppender, but how do we verify that? This is where the trace command is useful.

The basic usage of the trace command is similar to monitor; the main argument is -n, the number of invocations to trace.

trace -n trace_times className methodName

As I mentioned in my previous blog post about Log4j2, the core method of any appender is its append method. So let's trace the append method in each of the two programs.

trace -n 5 org.apache.logging.log4j.core.appender.AsyncAppender append

The trace results are as follows.

  • Using the normal appender

    (screenshot: trace output for the normal appender)

  • Using the includeLocation appender

    (screenshot: trace output for the includeLocation appender)

We can immediately see that the hot methods in the two traces are different. In the appender with includeLocation enabled, the most time-consuming method is createMemento in the org.apache.logging.log4j.core.impl.Log4jLogEvent class. So how can we find out what this method actually does? This brings in our next common command, jad, which decompiles the code of a given method. Here we jad the createMemento method mentioned above; the command is very simple:

jad org.apache.logging.log4j.core.impl.Log4jLogEvent createMemento

The results are as follows.

(screenshot: decompiled source of the createMemento method)

We find an includeLocation parameter in this method, which matches the only difference we saw between the two appender configurations, so at this point we have a guess. To verify it, let's introduce the next command: watch.

watch command

The watch command can observe the arguments, return value, and other details of a method invocation. We use it to inspect the arguments of createMemento: if the two programs pass different arguments, we can basically conclude that this is the cause. The command is as follows:

watch org.apache.logging.log4j.core.impl.Log4jLogEvent createMemento "params" -x 2 -n 5 -b -f

The meaning of the parameters here is as follows.

  • -x: depth to which the watched object is expanded
  • -n: number of invocations to capture
  • -b: observe the state before the method is invoked
  • -f: observe the state after the method returns

Here params stands for the method's parameter list; other watch expressions are detailed on the official website.

The final watch result is as follows.

  • Using the normal logger

    (screenshot: watch output for the normal logger, includeLocation is false)

  • Using the includeLocation logger

    (screenshot: watch output for the includeLocation logger, includeLocation is true)

As expected, the two arguments are indeed one true and one false. Let's take a quick look at how this parameter is passed in by decompiling the append method of AsyncAppender with jad.

(screenshot: decompiled source of AsyncAppender.append)

We find that this includeLocation is a property of the appender, exactly the one configured in our XML. Checking the analysis on the official website, we see that this parameter can make logging 5 to 10 times slower.

(screenshot: log4j2 documentation note on the cost of includeLocation)

However, to get to the bottom of it, I also followed the code statically.

This includeLocation is used in the event's createMemento method when the event is serialized into a LogEventProxy; the code is as follows.

public LogEventProxy(final LogEvent event, final boolean includeLocation) {
    this.loggerFQCN = event.getLoggerFqcn();
    this.marker = event.getMarker();
    this.level = event.getLevel();
    this.loggerName = event.getLoggerName();

    final Message msg = event.getMessage();
    this.message = msg instanceof ReusableMessage
            ? memento((ReusableMessage) msg)
            : msg;
    this.timeMillis = event.getTimeMillis();
    this.thrown = event.getThrown();
    this.thrownProxy = event.getThrownProxy();
    this.contextData = memento(event.getContextData());
    this.contextStack = event.getContextStack();
    this.source = includeLocation ? event.getSource() : null;
    this.threadId = event.getThreadId();
    this.threadName = event.getThreadName();
    this.threadPriority = event.getThreadPriority();
    this.isLocationRequired = includeLocation;
    this.isEndOfBatch = event.isEndOfBatch();
    this.nanoTime = event.getNanoTime();
}

If includeLocation is true, the getSource method will be called. Following it in, the code is as follows.

    public StackTraceElement getSource() {
        if (source != null) {
            return source;
        }
        if (loggerFqcn == null || !includeLocation) {
            return null;
        }
        source = Log4jLogEvent.calcLocation(loggerFqcn);
        return source;
    }

    public static StackTraceElement calcLocation(final String fqcnOfLogger) {
        if (fqcnOfLogger == null) {
            return null;
        }
        // LOG4J2-1029 new Throwable().getStackTrace is faster than Thread.currentThread().getStackTrace().
        final StackTraceElement[] stackTrace = new Throwable().getStackTrace();
        StackTraceElement last = null;
        for (int i = stackTrace.length - 1; i > 0; i--) {
            final String className = stackTrace[i].getClassName();
            if (fqcnOfLogger.equals(className)) {
                return last;
            }
            last = stackTrace[i];
        }
        return null;
    }

We can see that it walks the entire call stack to find the line of code that invoked the logger, so the performance cost is easy to imagine. Let's monitor it with Arthas and verify.
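As an aside, this stack-walking technique can be reproduced in isolation. The sketch below mimics calcLocation with made-up class and method names (CallerLocator and Facade are mine, not log4j2's): it materializes the full stack trace and scans it for the facade's fully qualified class name; the frame just above it is the business caller.

```java
public class CallerLocator {

    // Pretend this nested class is the logging facade (like log4j2's Logger);
    // the frame that called into it is what we want to report.
    static class Facade {
        static StackTraceElement location() {
            return scan(Facade.class.getName());
        }

        // Same technique as Log4jLogEvent.calcLocation: walk the stack from
        // the oldest frame and return the frame sitting just before the
        // facade's own frame. Materializing the stack is the expensive part.
        private static StackTraceElement scan(final String fqcn) {
            final StackTraceElement[] stackTrace = new Throwable().getStackTrace();
            StackTraceElement last = null;
            for (int i = stackTrace.length - 1; i > 0; i--) {
                if (fqcn.equals(stackTrace[i].getClassName())) {
                    return last; // the frame that called into the facade
                }
                last = stackTrace[i];
            }
            return null;
        }
    }

    // A business method calling through the facade; location() should
    // report this frame as the caller.
    static StackTraceElement callFacade() {
        return Facade.location();
    }

    public static void main(String[] args) {
        StackTraceElement caller = callFacade();
        System.out.println("caller method = " + caller.getMethodName()); // prints "caller method = callFacade"
    }
}
```

Every single log call pays for building and scanning the whole stack trace, which is exactly why includeLocation is so costly.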

First we trace the createMemento method.

trace -n 5 org.apache.logging.log4j.core.impl.Log4jLogEvent createMemento

(screenshot: trace output of createMemento)

We find Log4jLogEvent's serialize() among the hot methods of org.apache.logging.log4j.core.impl.Log4jLogEvent; keep tracing.

trace -n 5 org.apache.logging.log4j.core.impl.Log4jLogEvent serialize

(screenshot: trace output of serialize)

The hotspot is the constructor of org.apache.logging.log4j.core.impl.Log4jLogEvent$LogEventProxy; continue tracing.

trace -n 5 org.apache.logging.log4j.core.impl.Log4jLogEvent$LogEventProxy <init>

(screenshot: trace output of the LogEventProxy constructor)

Found the getSource method, continue.

trace -n 5 org.apache.logging.log4j.core.LogEvent getSource

(screenshot: trace output of getSource)

The hotspot is finally located: it is the calcLocation function of org.apache.logging.log4j.core.impl.Log4jLogEvent, the same code we reached in the static walkthrough.

So far, by combining JMH and Arthas, we have located a real performance problem. What I have introduced is only the tip of the iceberg; I hope you will learn the other common commands through the official website and practice them hands-on. After a few rounds of practice, these tools will feel familiar.