February 6, 2023

SapMachine Vitals

SAP’s own Java VM SapMachine comes with a handy monitoring tool you may not know: SapMachine Vitals.

Think sysstat tools - pidstat, iostat, etc. - plus JVM stats rolled into one and baked into our JVM for your convenience.

Curious? After me.


SapMachine Vitals give you a gapless history of metrics: JVM, process, and system data, periodically sampled and buffered inside the JVM. No need to go back in time to start collecting data - it's right there. The record buffer spans ten days, with the last hour kept in higher resolution.

Our developers use SapMachine Vitals for the first-response analysis of JVM issues. Its charm lies in its convenience: while you could get all this data with a combination of different tools, Vitals are just very practical and painless.

Enabling the Vitals

Nothing to do: SapMachine Vitals are enabled by default.

Their overhead is very low - they cost about 300 KB of memory, a bit more for very long-running JVMs. The CPU overhead is negligible. Note that you can still disable them manually with -XX:-EnableVitals.
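
For example, to rule Vitals out as a factor during a measurement run, you could start the JVM with them switched off (my-java-program standing in for your application):

java -XX:-EnableVitals my-java-program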

Accessing the Vitals Report

The Vitals report can be accessed via jcmd, via the error log, or at JVM exit.

Via jcmd

To print the Vitals report, use:

jcmd <pid> VM.vitals

The VM.vitals command has several useful suboptions. help is your friend:

jcmd <pid> help VM.vitals

Syntax : VM.vitals [options]

Options: (options must be specified using the <key> or <key>=<value> syntax)
        scale : [optional] Memory usage in which to scale. Valid values are: k, m, g (fixed scale) or "dynamic" for a dynamically chosen scale. (STRING, dynamic)
        csv : [optional] csv format. (BOOLEAN, false)
        no-legend : [optional] Omit legend. (BOOLEAN, false)
        reverse : [optional] Reverse printing order. (BOOLEAN, false)
        now : [optional] sample now values (BOOLEAN, false)

Example: Print report in CSV format, history ascending (oldest data first), memory sizes in MB:

jcmd <pid> VM.vitals csv scale=m reverse
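
Or, combining two of the suboptions from the help output above, take a fresh sample right away and omit the legend:

jcmd <pid> VM.vitals now no-legend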

Via error log (hs_err_<pid>.log)

Vitals are part of the hs_err_<pid>.log (following the NMT section).

That can be highly useful for investigating JVM issues: Vitals show you how JVM and system stats developed right up to the crash point. Since the history buffer spans the last ten days, you usually get a gap-free history, which makes spotting issues like slowly growing resource leaks a lot easier.

At JVM exit

Start the JVM with -XX:+PrintVitalsAtExit, and it will print the Vitals report to stdout just before it exits.

Its companion -XX:+DumpVitalsAtExit instead dumps two files into the current directory: a text file and a CSV file (sapmachine_vitals_<pid>.txt|.csv). Their names can be overridden with -XX:VitalsFile=<filename>.
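
For example, a run that dumps the report files under a custom name could look like this (my-vitals standing in for whatever file name you prefer):

java -XX:+DumpVitalsAtExit -XX:VitalsFile=my-vitals my-java-program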

Both switches are very handy for quick analysis, especially if your program is short-lived.

For example, want to know the RSS envelope of your program? Its heap usage, how many classes it loads, how many threads it spawns? Answering these questions would normally require several tools and could get fiddly. For an RSS envelope, for instance, you would fire up pidstat -e and then have fun isolating the relevant data from the pidstat output. With SapMachine, you just run:

java -XX:+DumpVitalsAtExit my-java-program

And since -XX:+DumpVitalsAtExit gives you a CSV file too, you can drop that one directly into Excel to get footprint diagrams in no time.
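
If you prefer the command line over a spreadsheet, a sketch like the following could pull out a single metric - assuming the CSV's header row uses the same short metric names as the legend shown further down (e.g., rss-all):

awk -F, 'NR==1 { for (i=1; i<=NF; i++) if ($i == "rss-all") col = i } NR>1 { print $col }' sapmachine_vitals_<pid>.csv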

For very short-lived programs, consider increasing the Vitals sample frequency. By default, Vitals take one data sample every 10 seconds. Using -XX:VitalsSampleInterval, you can change that interval:

java -XX:+DumpVitalsAtExit -XX:VitalsSampleInterval=1 my-java-program

Note that the Vitals report always contains at least one sample taken at JVM exit, regardless of how short-lived the program was.

How to read the Vitals report

The Vitals report has three sub-sections: system (and container) stats, process stats, and JVM stats:

------------system------------
           avail: Memory available without swapping [host] [krn]
            comm: Committed memory [host]
             crt: Committed-to-Commit-Limit ratio (percent) [host]
            swap: Swap space used [host]
              si: Number of pages swapped in [host] [delta]
              so: Number of pages swapped out [host] [delta]
               p: Number of processes
               t: Number of threads
              tr: Number of threads running
              tb: Number of threads blocked on disk IO
          cpu-us: CPU user time [host]
          cpu-sy: CPU system time [host]
          cpu-id: CPU idle time [host]
          cpu-st: CPU time stolen [host]
          cpu-gu: CPU time spent on guest [host]
      cgroup-lim: cgroup memory limit [cgrp]
     cgroup-slim: cgroup memory soft limit [cgrp]
      cgroup-usg: cgroup memory usage [cgrp]
     cgroup-kusg: cgroup kernel memory usage (cgroup v1 only) [cgrp]
-----------process------------
            virt: Virtual size
         rss-all: Resident set size, total
        rss-anon: Resident set size, anonymous memory [krn]
        rss-file: Resident set size, file mappings [krn]
         rss-shm: Resident set size, shared memory [krn]
            swdo: Memory swapped out
       cheap-usd: C-Heap, in-use allocations (may be unavailable if RSS > 4G) [glibc]
      cheap-free: C-Heap, bytes in free blocks (may be unavailable if RSS > 4G) [glibc]
          cpu-us: Process cpu user time
          cpu-sy: Process cpu system time
           io-of: Number of open files
           io-rd: IO bytes read from storage or cache
           io-wr: IO bytes written
             thr: Number of native threads
-------------jvm--------------
       heap-comm: Java Heap Size, committed
       heap-used: Java Heap Size, used
       meta-comm: Meta Space Size (class+nonclass), committed
       meta-used: Meta Space Size (class+nonclass), used
        meta-csc: Class Space Size, committed [cs]
        meta-csu: Class Space Size, used [cs]
       meta-gctr: GC threshold
            code: Code cache, committed
         nmt-mlc: Memory malloced by hotspot [nmt]
         nmt-map: Memory mapped by hotspot [nmt]
          nmt-gc: NMT "gc" (GC-overhead, malloc and mmap) [nmt]
         nmt-oth: NMT "other" (typically DBB or Unsafe.allocateMemory) [nmt]
         nmt-ovh: NMT overhead [nmt]
        jthr-num: Number of java threads
          jthr-nd: Number of non-daemon java threads
         jthr-cr: Threads created [delta]
         jthr-st: Total reserved size of java thread stacks [nmt] [linux]
        cldg-num: Classloader Data
       cldg-anon: Anonymous CLD
         cls-num: Classes (instance + array)
           cls-ld: Classes loaded [delta]
         cls-uld: Classes unloaded [delta]

By default, the report prints its records in reverse chronological order (youngest first). The last hour is shown in 10-second intervals, dropping down to 10-minute intervals after that.

So, what can I do with it?

It won’t replace a full profiler (of course). But it can alleviate the need for more complex tools when doing first-response analysis. Some examples:

My JVM process shows high RSS. My JVM crashed with a native OOM error. My JVM keeps getting OOM killed. What’s going on?

SapMachine Vitals gives you:

  • Machine-wide memory stats: how much memory is available and committed, how far we are overcommitted, and swap activity

  • Container memory statistics and limits

  • Process memory stats: virtual size, RSS (sub-categories, too, on newer kernels), process swap size

  • What NMT thinks we allocated - both C-heap and virtual memory

  • What we really allocated according to glibc - which can be hilariously off from the NMT numbers (think libc-internal overhead or non-JVM allocations)

  • How much memory the C-heap retains internally

  • JVM memory areas: Java heap, code heap, and Metaspace sizes

  • Some of the more useful NMT sub-categories, e.g., the combined size of GC-related structures outside the Java heap or how much we use for DirectByteBuffers.

These data points help narrow down the cause of OOMs and help us avoid “Tuning By Folklore.” For example, a high RSS may be caused by issues outside the core JVM, in which case NMT cannot tell you anything and would be the wrong first-responder tool.
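
A note on the NMT-tagged columns ([nmt] in the legend): they rely on Native Memory Tracking data. If they come up empty in your runs, NMT may need to be switched on explicitly, for example:

java -XX:NativeMemoryTracking=summary -XX:+PrintVitalsAtExit my-java-program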

JVM ran out of Metaspace. What’s going on?

Vitals show you the number of class loaders and classes, class-loading activity, Metaspace sizes, and the Metaspace GC threshold.

All this data can point you in the right direction:

  • If the number of classes makes sense to you and they use up the expected amount of space [3], you are either dealing with Metaspace fragmentation (on Java versions older than 16; later versions defragment a lot better) or Metaspace is just plain too small.

  • If the number of classes seems abnormally high, you may have a class loader leak or misjudged how many classes the application loads.

Depending on the answers, your next analysis tool may be jcmd VM.metaspace [4] or a Java profiler like JMC.
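
For the Metaspace route, the invocation follows the same jcmd pattern as VM.vitals:

jcmd <pid> VM.metaspace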

Other insights

We see the CPU time spent in kernel and userland. We also see the JVM’s IO activity. These data points give us a first indication of whether we face a problem with the OS or whether the JVM itself burns CPU - for the latter, the next step would be to check the GC activity.

Vitals tell us how many threads are alive - both Java threads and native threads created outside the JVM. The latter are interesting in the context of a JVM embedded in a native application. We also see the thread creation activity.

We also see the number of open file handles and how many bytes we have read and written.

Why is this SapMachine-only? Why not contribute it to mainline OpenJDK?

We tried. Unfortunately, we were unable to convince upstream [5].

But since Vitals proved so valuable as a lightweight first-responder tool, we kept the feature downstream-only in SapMachine.