
 Demonstrations of trace.


 trace probes functions you specify and displays trace messages if a particular
 condition is met. You can control the message format to display function
 arguments and return values.

 For example, suppose you want to trace all commands being exec'd across the
 system:

 # trace 'sys_execve "%s", arg1'
 PID    COMM         FUNC             -
 4402   bash         sys_execve       /usr/bin/man
 4411   man          sys_execve       /usr/local/bin/less
 4411   man          sys_execve       /usr/bin/less
 4410   man          sys_execve       /usr/local/bin/nroff
 4410   man          sys_execve       /usr/bin/nroff
 4409   man          sys_execve       /usr/local/bin/tbl
 4409   man          sys_execve       /usr/bin/tbl
 4408   man          sys_execve       /usr/local/bin/preconv
 4408   man          sys_execve       /usr/bin/preconv
 4415   nroff        sys_execve       /usr/bin/locale
 4416   nroff        sys_execve       /usr/bin/groff
 4418   groff        sys_execve       /usr/bin/grotty
 4417   groff        sys_execve       /usr/bin/troff
 ^C

 The probe specifier sys_execve is shorthand for p::sys_execve: an entry probe
 (the default type) in a kernel function (the default location) called
 sys_execve. Next, the format string to print is simply "%s", which prints a
 string. Finally, the
 value to print is the first argument to the sys_execve function, which happens
 to be the command that is exec'd. The above trace was generated by executing
 "man ls" in a separate shell. As you see, man executes a number of additional
 programs to finally display the man page.

 Next, suppose you are looking for large reads across the system. Let's trace
 the read system call and inspect the third argument, which is the number of
 bytes to be read:

 # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
 PID    COMM         FUNC             -
 4490   dd           sys_read         read 1048576 bytes
 4490   dd           sys_read         read 1048576 bytes
 4490   dd           sys_read         read 1048576 bytes
 4490   dd           sys_read         read 1048576 bytes
 ^C

 During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4".
 The individual reads are visible, with the custom format message printed for
 each read. The parenthesized expression "(arg3 > 20000)" is a filter that is
 evaluated for each invocation of the probe before printing anything.

 The event message filter (-f) is useful when you are interested only in
 specific events. For example, if a program opens thousands of files and you
 only want to see the "temp" files and print their stacks:

 # trace 'do_sys_open "%s", arg2@user' -UK -f temp
 PID     TID     COMM            FUNC             -
 9557    9557    a.out           do_sys_open      temp.1
         do_sys_open+0x1 [kernel]
         do_syscall_64+0x5b [kernel]
         entry_SYSCALL_64_after_hwframe+0x44 [kernel]
         __open_nocancel+0x7 [libc-2.17.so]
         __libc_start_main+0xf5 [libc-2.17.so]
 9558    9558    a.out           do_sys_open      temp.2
         do_sys_open+0x1 [kernel]
         do_syscall_64+0x5b [kernel]
         entry_SYSCALL_64_after_hwframe+0x44 [kernel]
         __open_nocancel+0x7 [libc-2.17.so]
         __libc_start_main+0xf5 [libc-2.17.so]

 The process name filter (-n) is ported from tools/opensnoop:

 # trace 'do_sys_open "%s", arg2@user' -UK -n out
 PID     TID     COMM            FUNC             -
 9557    9557    a.out           do_sys_open      temp.1
         do_sys_open+0x1 [kernel]
         do_syscall_64+0x5b [kernel]
         entry_SYSCALL_64_after_hwframe+0x44 [kernel]
         __open_nocancel+0x7 [libc-2.17.so]
         __libc_start_main+0xf5 [libc-2.17.so]

 You can also trace user functions. For example, let's simulate the bashreadline
 script, which attaches to the readline function in bash and prints its return
 value, effectively snooping all bash shell input across the system:

 # trace 'r:bash:readline "%s", retval'
 PID    COMM         FUNC             -
 2740   bash         readline         echo hi!
 2740   bash         readline         man ls
 ^C

 The special retval keyword stands for the function's return value, and can
 be used only in a retprobe, specified by the 'r' prefix. The next component
 of the probe is the library that contains the desired function. It's OK to
 specify executables too, as long as they can be found in the PATH. Or, you
 can specify the full path to the executable (e.g. "/usr/bin/bash").
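
 The parts of a probe specifier described so far (probe type, library,
 function) can be illustrated with a small Python sketch. The
 parse_probe_target helper is hypothetical and is not trace's actual parser.
 It handles only the entry/return forms used in this file and ignores
 tracepoints, USDT probes, signatures, filters and format strings.

```python
def parse_probe_target(spec):
    """Split '[type]:[library]:function' into its parts (simplified sketch)."""
    parts = spec.split(":")
    if len(parts) == 1:
        # 'sys_execve': entry probe in a kernel function.
        kind, library, function = "p", "", parts[0]
    elif len(parts) == 2:
        # 'c:open': entry probe in a library or executable.
        kind, library, function = "p", parts[0], parts[1]
    elif len(parts) == 3:
        # 'r:bash:readline' or '::sys_execve'; an empty type means entry probe.
        kind, library, function = parts
        kind = kind or "p"
    else:
        raise ValueError("unsupported probe specifier: %r" % spec)
    if kind not in ("p", "r"):
        raise ValueError("this sketch only handles p and r probes: %r" % spec)
    return kind, library, function


for spec in ("sys_execve", "::sys_execve", "c:open", "r:bash:readline"):
    print(spec, "->", parse_probe_target(spec))
```

 An empty library component stands for the kernel, which is why sys_execve
 and ::sys_execve resolve to the same probe in this sketch.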

 Sometimes it can be useful to see where in code the events happen. There are
 flags to print the kernel stack (-K), the user stack (-U) and optionally
 include the virtual address in the stacks as well (-a):

 # trace.py -U -a 'r::sys_futex "%d", retval'
 PID     TID     COMM            FUNC             -
 793922  793951  poller          sys_futex        0
         7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so]
               627fef folly::FunctionScheduler::run()+0x46f [router]
         7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21]
         7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so]
         7f6c7223fa7d clone+0x6d [libc-2.23.so]

 Multiple probes can be combined on the same command line. For example, let's
 trace failed read and write calls on the libc level, and include a time column:

 # trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \
         'r:c:write ((int)retval < 0) "write failed: %d", retval' -T
 TIME     PID    COMM         FUNC             -
 05:31:57 3388   bash         write            write failed: -1
 05:32:00 3388   bash         write            write failed: -1
 ^C

 Note that the retval variable must be cast to int before comparing to zero.
 The reason is that the default type for argN and retval is an unsigned 64-bit
 integer, which can never be smaller than 0.
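
 The effect of the cast can be reproduced outside of trace. The following
 Python sketch uses ctypes to mimic the two C types. The helper names as_u64
 and as_int32 are made up for this illustration, and the sketch assumes a
 32-bit int, as on x86_64.

```python
import ctypes


def as_u64(value):
    """Reinterpret a Python integer as an unsigned 64-bit value."""
    return ctypes.c_uint64(value).value


def as_int32(value):
    """Keep the low 32 bits and reinterpret them as a signed int."""
    return ctypes.c_int32(value & 0xFFFFFFFF).value


raw_retval = as_u64(-1)            # a return value of -1, seen as unsigned 64-bit
print(raw_retval)                  # a huge positive number
print(raw_retval < 0)              # the uncast comparison never matches
print(as_int32(raw_retval) < 0)    # the comparison after the (int) cast matches
```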

 trace also has some basic support for kernel tracepoints. For example, let's
 trace the block:block_rq_complete tracepoint and print out the number of sectors
 transferred:

 # trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
 TIME     PID    COMM         FUNC             -
 01:23:51 0      swapper/0    block_rq_complete sectors=8
 01:23:55 10017  kworker/u64: block_rq_complete sectors=1
 01:23:55 0      swapper/0    block_rq_complete sectors=8
 ^C

 If you want to trace a call in a short-lived process, use the -s option. It
 takes a comma-separated list of libraries/executables to use for symbol
 resolution:
 # trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
 Note: Kernel bpf will report stack map with ip/build_id
 PID     TID     COMM            FUNC
 4175    4175    ping            inet_pton
         inet_pton+0x136340 [libc.so.6]
         getaddrinfo+0xfb510 [libc.so.6]
         _init+0x2a08 [ping]

 During the trace, 'ping -c1 google.com' was executed to obtain the above results.

 To discover the tracepoint structure format (which you can refer to as the "args"
 pointer variable), use the tplist tool. For example:

 # tplist -v block:block_rq_complete
 block:block_rq_complete
     dev_t dev;
     sector_t sector;
     unsigned int nr_sector;
     int errors;
     char rwbs[8];

 This output tells you that you can use "args->dev", "args->sector", etc. in your
 predicate and trace arguments.


 More and more high-level libraries are instrumented with USDT probe support.
 These probes can be traced by trace just like kernel tracepoints. For example,
 trace new threads being created along with their start function, and include
 a time column and the CPU on which each event happened:

 # trace 'u:pthread:pthread_create "%U", arg3' -T -C
 TIME     CPU PID     TID     COMM            FUNC             -
 13:22:01 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
 13:22:01 5   21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
 13:22:03 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
 13:22:04 15  21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
 13:22:07 25  2627    2629    automount       pthread_create   expire_proc_indirect+0x0 [automount]
 13:22:07 4   21360   21414   osqueryd        pthread_create   [unknown] [osqueryd]
 ^C

 The "%U" format specifier tells trace to resolve arg3 as a user-space symbol,
 if possible. Similarly, use "%K" for kernel symbols.

 Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's
 trace Ruby methods being called (this requires a version of Ruby built with
 the --enable-dtrace configure flag):

 # trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T
 TIME     PID    COMM         FUNC             -
 12:08:43 18420  irb          method__entry    IRB::Context.verbose?
 12:08:43 18420  irb          method__entry    RubyLex.ungetc
 12:08:43 18420  irb          method__entry    RuxyLex.debug?
 ^C

 In the previous invocation, arg1 and arg2 are the class name and method name
 for the Ruby method being invoked.

 You can also trace exported functions from shared libraries, or an imported
 function on the actual executable:

 # sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
 # tput -V

 PID    TID    COMM         FUNC             -
 21720  21720  tput         curses_version   Version=ncurses 6.0.20160709
 ^C


 Occasionally, it can be useful to filter specific strings. For example, you
 might be interested in open() calls that open a specific file:

 # trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T
 TIME     PID    COMM         FUNC             -
 01:43:15 10938  cat          open             opening test.txt
 01:43:20 10939  cat          open             opening test.txt
 ^C


 In the preceding example, as well as in many others, readability may be
 improved by providing the function's signature, which names the arguments and
 lets you access structure sub-fields, which is hard with the "arg1", "arg2"
 convention. For example:

 # trace 'p:c:open(char *filename) "opening %s", filename'
 PID    TID    COMM         FUNC             -
 17507  17507  cat          open             opening FAQ.txt
 ^C

 # trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
 PID    TID    COMM         FUNC             -
 777    785    automount    SyS_nanosleep    sleep for 500000000 ns
 777    785    automount    SyS_nanosleep    sleep for 500000000 ns
 777    785    automount    SyS_nanosleep    sleep for 500000000 ns
 777    785    automount    SyS_nanosleep    sleep for 500000000 ns
 ^C

 Remember to use the -I argument to include the appropriate header file. We didn't
 need to do that here because `struct timespec` is used internally by the tool,
 so it always includes this header file.

 To aggregate identical traces and count them, specify -A together with
 -M EVENTS. A typical workflow:
 1. 'top' shows that system CPU utilization is unusually high.
 2. 'irqtop' shows that timer interrupts are more frequent than normal.
 3. 'funccount -i 1 clockevents_program_event' confirms how often the kernel
    programs the timer.
 4. 'trace clockevents_program_event -K -A -M 1000' shows where the timer is
    being set from.

 1294576 1294584 CPU 0/KVM       clockevents_program_event
         clockevents_program_event+0x1 [kernel]
         hrtimer_start_range_ns+0x209 [kernel]
         start_sw_timer+0x173 [kvm]
         restart_apic_timer+0x6c [kvm]
         kvm_set_msr_common+0x442 [kvm]
         __kvm_set_msr+0xa2 [kvm]
         kvm_emulate_wrmsr+0x36 [kvm]
         vcpu_enter_guest+0x326 [kvm]
         kvm_arch_vcpu_ioctl_run+0xcc [kvm]
         kvm_vcpu_ioctl+0x22f [kvm]
         do_vfs_ioctl+0xa1 [kernel]
         ksys_ioctl+0x60 [kernel]
         __x64_sys_ioctl+0x16 [kernel]
         do_syscall_64+0x59 [kernel]
         entry_SYSCALL_64_after_hwframe+0x44 [kernel]
 -->COUNT 271
 ...
 This shows that 271 of the most recent 1000 timer-setting events (~27%) came
 from this stack.
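
 The idea behind the aggregated count can be sketched in Python. This is not
 how trace implements -A. It only shows what "count identical events out of
 the last 1000" means. The event tuples are hand-made for this illustration,
 and the second stack is entirely made up.

```python
from collections import Counter


def aggregate(events):
    """Count identical (comm, func, stack) events."""
    return Counter(events)


# Hypothetical events: 271 from a shortened KVM stack, 729 from a made-up one.
kvm_stack = ("CPU 0/KVM", "clockevents_program_event",
             ("hrtimer_start_range_ns", "start_sw_timer"))
other_stack = ("swapper/0", "clockevents_program_event",
               ("tick_program_event",))
events = [kvm_stack] * 271 + [other_stack] * 729

counts = aggregate(events)
share = 100.0 * counts[kvm_stack] / len(events)
print("-->COUNT %d (%.1f%% of %d events)" % (counts[kvm_stack], share, len(events)))
```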

 As another example, let's trace open syscalls for a specific process. By
 default, tracing is system-wide, but the -p switch overrides this:

 # trace -p 2740 'do_sys_open "%s", arg2@user' -T
 TIME     PID    COMM         FUNC             -
 05:36:16 15872  ls           do_sys_open      /etc/ld.so.cache
 05:36:16 15872  ls           do_sys_open      /lib64/libselinux.so.1
 05:36:16 15872  ls           do_sys_open      /lib64/libcap.so.2
 05:36:16 15872  ls           do_sys_open      /lib64/libacl.so.1
 05:36:16 15872  ls           do_sys_open      /lib64/libc.so.6
 05:36:16 15872  ls           do_sys_open      /lib64/libpcre.so.1
 05:36:16 15872  ls           do_sys_open      /lib64/libdl.so.2
 05:36:16 15872  ls           do_sys_open      /lib64/libattr.so.1
 05:36:16 15872  ls           do_sys_open      /lib64/libpthread.so.0
 05:36:16 15872  ls           do_sys_open      /usr/lib/locale/locale-archive
 05:36:16 15872  ls           do_sys_open      /home/vagrant
 ^C

 In this example, we traced the "ls ~" command as it was opening its shared
 libraries and then accessing the /home/vagrant directory listing.


 Lastly, if a high-frequency event is traced you may overflow the perf ring
 buffer. This shows as "Lost N samples":

 # trace sys_open
 5087   5087   pgrep        sys_open
 5087   5087   pgrep        sys_open
 5087   5087   pgrep        sys_open
 5087   5087   pgrep        sys_open
 5087   5087   pgrep        sys_open
 Lost 764896 samples
 Lost 764896 samples
 Lost 764896 samples

 The perf ring buffer size can be changed with -b. The value is the per-CPU
 buffer size, measured in pages. It must be a power of two and defaults to
 64 pages.
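
 To get a feel for the numbers, the following Python sketch checks the
 power-of-two rule and converts a page count to bytes. The helper names are
 hypothetical, and the 4096-byte page size is an assumption (it is common on
 x86_64 but differs on some systems).

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0


def per_cpu_buffer_bytes(pages, page_size=4096):
    """Per-CPU ring buffer size in bytes for a given -b value."""
    if not is_power_of_two(pages):
        raise ValueError("buffer pages must be a power of two: %d" % pages)
    return pages * page_size


print(per_cpu_buffer_bytes(64))     # default of 64 pages
print(per_cpu_buffer_bytes(1024))   # a larger buffer for high-frequency events
print(is_power_of_two(100))         # 100 would not be a valid value
```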

 # trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp
 PID     TID     COMM            FUNC             -
 1855611 1863183 worker          sys_setsockopt   found

 In this example we catch setsockopt syscalls that change the IPv4 IP_TOS
 value, but only when the new TOS value equals 108. We use the STRCMP helper
 in binary mode (the --bin_cmp flag) to compare the optval array against the
 int value 108 (a parameter of the setsockopt call) in its hex representation
 (little-endian format).
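
 The byte list used in the STRCMP call can be derived rather than written by
 hand. This Python sketch packs a 32-bit int in little-endian order and
 prints it in the same brace notation. The function name is made up for this
 illustration.

```python
import struct


def int_as_bin_cmp_literal(value):
    """Render a 32-bit int as a little-endian byte list, e.g. {0x6C, ...}."""
    raw = struct.pack("<i", value)
    return "{" + ", ".join("0x%02X" % byte for byte in raw) + "}"


literal = int_as_bin_cmp_literal(108)
print(literal)
```

 For 108 this prints {0x6C, 0x00, 0x00, 0x00}, the value used in the command
 above.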

 Advanced users can insert kprobes or uprobes after a certain offset into a
 function, rather than at its start. This is useful for tracing register
 values at different points in the execution of a function. Let's consider the
 following example:

 #include <stdio.h>

 int main()
 {
 	int val = 0xdead;
 	printf("%d\n", val);
 	val = 0xbeef;
 	printf("%d\n", val);
 }

 After compiling the code with -O3 optimization the object code looks
 like the following (with GCC 10 and x86_64 architecture):

 objdump --disassemble=main --prefix-addresses a.out

 0000000000001060 <main> endbr64
 0000000000001064 <main+0x4> sub    $0x8,%rsp
 0000000000001068 <main+0x8> mov    $0xdead,%edx
 000000000000106d <main+0xd> mov    $0x1,%edi
 0000000000001072 <main+0x12> xor    %eax,%eax
 0000000000001074 <main+0x14> lea    0xf89(%rip),%rsi
 000000000000107b <main+0x1b> callq  0000000000001050 <__printf_chk@plt>
 0000000000001080 <main+0x20> mov    $0xbeef,%edx
 0000000000001085 <main+0x25> lea    0xf78(%rip),%rsi
 000000000000108c <main+0x2c> xor    %eax,%eax
 000000000000108e <main+0x2e> mov    $0x1,%edi
 0000000000001093 <main+0x33> callq  0000000000001050 <__printf_chk@plt>
 0000000000001098 <main+0x38> xor    %eax,%eax
 000000000000109a <main+0x3a> add    $0x8,%rsp
 000000000000109e <main+0x3e> retq

 The 0xdead and later the 0xbeef values are moved into the edx register.
 As the disassembly shows, the edx register contains the 0xdead value
 after the 0xd offset and 0xbeef after the 0x25 offset. To verify this
 with trace, let's insert probes at those offsets. The following
 command inserts two uprobes, one after the 0xd offset and another
 after the 0x25 offset of the main function. Each probe prints the
 value of the edx register, which shows the expected values.

 trace 'p:/tmp/a.out:main+0xd "%x", ctx->dx' 'p:/tmp/a.out:main+0x25 "%x", ctx->dx'
 PID     TID     COMM            FUNC             -
 25754   25754   a.out           main             dead
 25754   25754   a.out           main             beef
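
 To double-check which instruction each probe lands on, the offsets can be
 added to the address of main from the objdump listing. The Python sketch
 below does that arithmetic. parse_offset_target is a hypothetical helper,
 not part of trace.

```python
def parse_offset_target(target):
    """Split 'function+0xNN' into (function, offset); offset defaults to 0."""
    function, _, offset = target.partition("+")
    return function, (int(offset, 16) if offset else 0)


MAIN_ADDRESS = 0x1060  # address of main in the objdump listing

for target in ("main+0xd", "main+0x25"):
    function, offset = parse_offset_target(target)
    print("%s -> 0x%x" % (target, MAIN_ADDRESS + offset))
```

 The results, 0x106d and 0x1085, are the addresses of the instructions that
 immediately follow the two mov instructions loading edx.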


 USAGE message:

 usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [--uid UID] [-v]
              [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C]
              [-c CGROUP_PATH] [-n NAME] [-f MSG_FILTER] [-B]
              [-s SYM_FILE_LIST] [-K] [-U] [-a] [-I header]
              probe [probe ...]

 Attach to functions and print trace messages.

 positional arguments:
   probe                 probe specifier (see examples)

 optional arguments:
   -h, --help            show this help message and exit
   -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
                         number of pages to use for perf_events ring buffer
                         (default: 64)
   -p PID, --pid PID     id of the process to trace (optional)
   -L TID, --tid TID     id of the thread to trace (optional)
   --uid UID             id of the user to trace (optional)
   -v, --verbose         print resulting BPF program code before executing
   -Z STRING_SIZE, --string-size STRING_SIZE
                         maximum size to read from strings
   -S, --include-self    do not filter trace's own pid from the trace
   -M MAX_EVENTS, --max-events MAX_EVENTS
                         number of events to print before quitting
   -t, --timestamp       print timestamp column (offset from trace start)
   -u, --unix-timestamp  print UNIX timestamp instead of offset from trace
                         start, requires -t
   -T, --time            print time column
   -C, --print_cpu       print CPU id
   -c CGROUP_PATH, --cgroup-path CGROUP_PATH
                         cgroup path
   -n NAME, --name NAME  only print process names containing this name
   -f MSG_FILTER, --msg-filter MSG_FILTER
                         only print the msg of event containing this string
   -B, --bin_cmp         allow to use STRCMP with binary values
   -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST
                         comma separated list of symbol files to use for symbol
                         resolution
   -K, --kernel-stack    output kernel stack trace
   -U, --user-stack      output user stack trace
   -a, --address         print virtual address in stacks
   -I header, --include header
                         additional header files to include in the BPF program
                         as either full path, or relative to current working
                         directory, or relative to default kernel header search
                         path
   -A, --aggregate       aggregate amount of each trace

 EXAMPLES:

 trace do_sys_open
         Trace the open syscall and print a default trace message when entered
 trace kfree_skb+0x12
         Trace the kfree_skb kernel function after the instruction on the 0x12 offset
 trace 'do_sys_open "%s", arg2@user'
         Trace the open syscall and print the filename being opened @user is
         added to arg2 in kprobes to ensure that char * should be copied from
         the userspace stack to the bpf stack. If not specified, previous
         behaviour is expected.

 trace 'do_sys_open "%s", arg2@user' -n main
         Trace the open syscall and only print event that process names containing "main"
 trace 'do_sys_open "%s", arg2@user' --uid 1001
         Trace the open syscall and only print event that processes with user ID 1001
 trace 'do_sys_open "%s", arg2@user' -f config
         Trace the open syscall and print the filename being opened filtered by "config"
 trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
         Trace the read syscall and print a message for reads >20000 bytes
 trace 'r::do_sys_open "%llx", retval'
         Trace the return from the open syscall and print the return value
 trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
         Trace the open() call from libc only if the flags (arg2) argument is 42
 trace 'c:malloc "size = %d", arg1'
         Trace malloc calls and print the size being allocated
 trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
         Trace the write() call from libc to monitor writes to STDOUT
 trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
         Trace returns from __kmalloc which returned a null pointer
 trace 'r:c:malloc (retval) "allocated = %x", retval'
         Trace returns from malloc and print non-NULL allocated buffers
 trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
         Trace the block_rq_complete kernel tracepoint and print # of tx sectors
 trace 'u:pthread:pthread_create (arg4 != 0)'
         Trace the USDT probe pthread_create when its 4th argument is non-zero
 trace 'u:pthread:libpthread:pthread_create (arg4 != 0)'
         Ditto, but the provider name "libpthread" is specified.
 trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
         Trace the nanosleep syscall and print the sleep duration in ns
 trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone'
         Trace nanosleep/clone syscall calls only under workload.service
         cgroup hierarchy.
 trace -I 'linux/fs.h' \
       'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
         Trace the uprobe_register inode mapping ops, and the symbol can be found
         in /proc/kallsyms
 trace -I 'kernel/sched/sched.h' \
       'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
         Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
         in kernel/sched/sched.h which is in kernel source tree and not in kernel-devel
         package.  So this command needs to run at the kernel source tree root directory
         so that the added header file can be found by the compiler.
 trace -I 'net/sock.h' \
       'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
         Trace udpv6 sendmsg calls only if socket's destination port is equal
         to 53 (DNS; 13568 in big endian order)
 trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
         Trace the number of users accessing the file system of the current task
 trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
         Trace inet_pton system call and use the specified libraries/executables for
         symbol resolution.
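
 A note on the udpv6_sendmsg example: the value 13568 is what port 53 looks
 like when its network-order (big-endian) bytes are read as a little-endian
 16-bit integer. The Python sketch below reproduces the conversion. The
 function name is made up for this illustration.

```python
import struct


def network_port_on_little_endian_host(port):
    """Read a big-endian 16-bit port as if it were little-endian."""
    return struct.unpack("<H", struct.pack(">H", port))[0]


dns_port = network_port_on_little_endian_host(53)
print(dns_port)
```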
	Demonstrations of trace.


	trace probes functions you specify and displays trace messages if a particular
	condition is met. You can control the message format to display function
	arguments and return values.

	For example, suppose you want to trace all commands being exec'd across the
	system:

	# trace 'sys_execve "%s", arg1'
	PID COMM FUNC -
	4402 bash sys_execve /usr/bin/man
	4411 man sys_execve /usr/local/bin/less
	4411 man sys_execve /usr/bin/less
	4410 man sys_execve /usr/local/bin/nroff
	4410 man sys_execve /usr/bin/nroff
	4409 man sys_execve /usr/local/bin/tbl
	4409 man sys_execve /usr/bin/tbl
	4408 man sys_execve /usr/local/bin/preconv
	4408 man sys_execve /usr/bin/preconv
	4415 nroff sys_execve /usr/bin/locale
	4416 nroff sys_execve /usr/bin/groff
	4418 groff sys_execve /usr/bin/grotty
	4417 groff sys_execve /usr/bin/troff
	^C

	The ::sys_execve syntax specifies that you want an entry probe (which is the
	default), in a kernel function (which is the default) called sys_execve. Next,
	the format string to print is simply "%s", which prints a string. Finally, the
	value to print is the first argument to the sys_execve function, which happens
	to be the command that is exec'd. The above trace was generated by executing
	"man ls" in a separate shell. As you see, man executes a number of additional
	programs to finally display the man page.

	Next, suppose you are looking for large reads across the system. Let's trace
	the read system call and inspect the third argument, which is the number of
	bytes to be read:

	# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
	PID COMM FUNC -
	4490 dd sys_read read 1048576 bytes
	4490 dd sys_read read 1048576 bytes
	4490 dd sys_read read 1048576 bytes
	4490 dd sys_read read 1048576 bytes
	^C

	During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4".
	The individual reads are visible, with the custom format message printed for
	each read. The parenthesized expression "(arg3 > 20000)" is a filter that is
	evaluated for each invocation of the probe before printing anything.

	Event message filter is useful while you only interesting the specific event.
	Like the program open thousands file and you only want to see the "temp" file
	and print stack.

	# trace 'do_sys_open "%s", arg2@user' -UK -f temp
	PID TID COMM FUNC -
	9557 9557 a.out do_sys_open temp.1
	do_sys_open+0x1 [kernel]
	do_syscall_64+0x5b [kernel]
	entry_SYSCALL_64_after_hwframe+0x44 [kernel]
	__open_nocancel+0x7 [libc-2.17.so]
	__libc_start_main+0xf5 [libc-2.17.so]
	9558 9558 a.out do_sys_open temp.2
	do_sys_open+0x1 [kernel]
	do_syscall_64+0x5b [kernel]
	entry_SYSCALL_64_after_hwframe+0x44 [kernel]
	__open_nocancel+0x7 [libc-2.17.so]
	__libc_start_main+0xf5 [libc-2.17.so]

	Process name filter is porting from tools/opensnoop

	# trace 'do_sys_open "%s", arg2@user' -UK -n out
	PID TID COMM FUNC -
	9557 9557 a.out do_sys_open temp.1
	do_sys_open+0x1 [kernel]
	do_syscall_64+0x5b [kernel]
	entry_SYSCALL_64_after_hwframe+0x44 [kernel]
	__open_nocancel+0x7 [libc-2.17.so]
	__libc_start_main+0xf5 [libc-2.17.so]

	You can also trace user functions. For example, let's simulate the bashreadline
	script, which attaches to the readline function in bash and prints its return
	value, effectively snooping all bash shell input across the system:

	# trace 'r:bash:readline "%s", retval'
	PID COMM FUNC -
	2740 bash readline echo hi!
	2740 bash readline man ls
	^C

	The special retval keyword stands for the function's return value, and can
	be used only in a retprobe, specified by the 'r' prefix. The next component
	of the probe is the library that contains the desired function. It's OK to
	specify executables too, as long as they can be found in the PATH. Or, you
	can specify the full path to the executable (e.g. "/usr/bin/bash").

	Sometimes it can be useful to see where in code the events happen. There are
	flags to print the kernel stack (-K), the user stack (-U) and optionally
	include the virtual address in the stacks as well (-a):

	# trace.py -U -a 'r::sys_futex "%d", retval'
	PID TID COMM FUNC -
	793922 793951 poller sys_futex 0
	7f6c72b6497a __lll_unlock_wake+0x1a [libpthread-2.23.so]
	627fef folly::FunctionScheduler::run()+0x46f [router]
	7f6c7345f171 execute_native_thread_routine+0x21 [libstdc++.so.6.0.21]
	7f6c72b5b7a9 start_thread+0xd9 [libpthread-2.23.so]
	7f6c7223fa7d clone+0x6d [libc-2.23.so]

	Multiple probes can be combined on the same command line. For example, let's
	trace failed read and write calls on the libc level, and include a time column:

	# trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \
	'r:c:write ((int)retval < 0) "write failed: %d", retval' -T
	TIME PID COMM FUNC -
	05:31:57 3388 bash write write failed: -1
	05:32:00 3388 bash write write failed: -1
	^C

	Note that the retval variable must be cast to int before comparing to zero.
	The reason is that the default type for argN and retval is an unsigned 64-bit
	integer, which can never be smaller than 0.

	trace has also some basic support for kernel tracepoints. For example, let's
	trace the block:block_rq_complete tracepoint and print out the number of sectors
	transferred:

	# trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T
	TIME PID COMM FUNC -
	01:23:51 0 swapper/0 block_rq_complete sectors=8
	01:23:55 10017 kworker/u64: block_rq_complete sectors=1
	01:23:55 0 swapper/0 block_rq_complete sectors=8
	^C

	Suppose that you want to trace a system-call in a short-lived process, you can use
	the -s option to trace. The option is followed by list of libraries/executables to
	use for symbol resolution.
	# trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
	Note: Kernel bpf will report stack map with ip/build_id
	PID TID COMM FUNC
	4175 4175 ping inet_pton
	inet_pton+0x136340 [libc.so.6]
	getaddrinfo+0xfb510 [libc.so.6]
	_init+0x2a08 [ping]

	During the trace, 'ping -c1 google.com' was executed to obtain the above results

	To discover the tracepoint structure format (which you can refer to as the "args"
	pointer variable), use the tplist tool. For example:

	# tplist -v block:block_rq_complete
	block:block_rq_complete
	dev_t dev;
	sector_t sector;
	unsigned int nr_sector;
	int errors;
	char rwbs[8];

	This output tells you that you can use "args->dev", "args->sector", etc. in your
	predicate and trace arguments.


	More and more high-level libraries are instrumented with USDT probe support.
	These probes can be traced by trace just like kernel tracepoints. For example,
	trace new threads being created and their function name, include time column
	and on which CPU it happened:

	# trace 'u:pthread:pthread_create "%U", arg3' -T -C
	TIME CPU PID TID COMM FUNC -
	13:22:01 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
	13:22:01 5 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
	13:22:03 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
	13:22:04 15 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
	13:22:07 25 2627 2629 automount pthread_create expire_proc_indirect+0x0 [automount]
	13:22:07 4 21360 21414 osqueryd pthread_create [unknown] [osqueryd]
	^C

	The "%U" format specifier tells trace to resolve arg3 as a user-space symbol,
	if possible. Similarly, use "%K" for kernel symbols.

	Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's
	trace Ruby methods being called (this requires a version of Ruby built with
	the --enable-dtrace configure flag):

	# trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T
	TIME PID COMM FUNC -
	12:08:43 18420 irb method__entry IRB::Context.verbose?
	12:08:43 18420 irb method__entry RubyLex.ungetc
	12:08:43 18420 irb method__entry RuxyLex.debug?
	^C

	In the previous invocation, arg1 and arg2 are the class name and method name
	for the Ruby method being invoked.

	You can also trace exported functions from shared libraries, or an imported
	function on the actual executable:

	# sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval'
	# tput -V

	PID TID COMM FUNC -
	21720 21720 tput curses_version Version=ncurses 6.0.20160709
	^C


	Occasionally, it can be useful to filter specific strings. For example, you
	might be interested in open() calls that open a specific file:

	# trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T
	TIME PID COMM FUNC -
	01:43:15 10938 cat open opening test.txt
	01:43:20 10939 cat open opening test.txt
	^C


	In the preceding example, as well as in many others, readability may be
	improved by providing the function's signature, which names the arguments and
	lets you access structure sub-fields, which is hard with the "arg1", "arg2"
	convention. For example:

	# trace 'p:c:open(char *filename) "opening %s", filename'
	PID TID COMM FUNC -
	17507 17507 cat open opening FAQ.txt
	^C

	# trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
	PID TID COMM FUNC -
	777 785 automount SyS_nanosleep sleep for 500000000 ns
	777 785 automount SyS_nanosleep sleep for 500000000 ns
	777 785 automount SyS_nanosleep sleep for 500000000 ns
	777 785 automount SyS_nanosleep sleep for 500000000 ns
	^C

	Remember to use the -I argument include the appropriate header file. We didn't
	need to do that here because `struct timespec` is used internally by the tool,
	so it always includes this header file.

	To aggregate identical trace events and print a count for each, specify -A
	together with -M EVENTS. A typical workflow:
	1. 'top' shows that sys CPU utilization is unusually high.
	2. 'irqtop' shows that timer interrupts are more frequent than normal.
	3. 'funccount -i 1 clockevents_program_event' confirms how frequently the
	   kernel programs the timer.
	4. 'trace clockevents_program_event -K -A -M 1000' shows which code paths
	   set the timer.

	1294576 1294584 CPU 0/KVM       clockevents_program_event
	        clockevents_program_event+0x1 [kernel]
	        hrtimer_start_range_ns+0x209 [kernel]
	        start_sw_timer+0x173 [kvm]
	        restart_apic_timer+0x6c [kvm]
	        kvm_set_msr_common+0x442 [kvm]
	        __kvm_set_msr+0xa2 [kvm]
	        kvm_emulate_wrmsr+0x36 [kvm]
	        vcpu_enter_guest+0x326 [kvm]
	        kvm_arch_vcpu_ioctl_run+0xcc [kvm]
	        kvm_vcpu_ioctl+0x22f [kvm]
	        do_vfs_ioctl+0xa1 [kernel]
	        ksys_ioctl+0x60 [kernel]
	        __x64_sys_ioctl+0x16 [kernel]
	        do_syscall_64+0x59 [kernel]
	        entry_SYSCALL_64_after_hwframe+0x44 [kernel]
	-->COUNT 271
	...
	This tells us that 271 of the most recent 1000 timer-setting events (~27%)
	came from this call path.
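
	The percentage is simply the aggregated count divided by the -M limit. As a
	rough illustration of that bookkeeping (a sketch only, not the tool's actual
	implementation), the following Python snippet counts identical events among
	the first N and reports one event's share. The event labels are hypothetical
	stand-ins for whole stack traces:

```python
from collections import Counter
from itertools import islice


def aggregate(events, max_events):
    """Count identical events among the first max_events items."""
    return Counter(islice(events, max_events))


# Hypothetical stand-in data: each label represents one distinct stack trace.
events = ["kvm_timer_path"] * 271 + ["other_path"] * 729
counts = aggregate(events, 1000)
total = sum(counts.values())
share = counts["kvm_timer_path"] / total
print(f"{counts['kvm_timer_path']} of {total} events ({share:.1%})")
```

	With the stand-in data above this prints "271 of 1000 events (27.1%)".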

	As a final example, let's trace open syscalls for a specific process. By
	default, tracing is system-wide, but the -p switch overrides this:

	# trace -p 2740 'do_sys_open "%s", arg2@user' -T
	TIME     PID    COMM         FUNC             -
	05:36:16 15872  ls           do_sys_open      /etc/ld.so.cache
	05:36:16 15872  ls           do_sys_open      /lib64/libselinux.so.1
	05:36:16 15872  ls           do_sys_open      /lib64/libcap.so.2
	05:36:16 15872  ls           do_sys_open      /lib64/libacl.so.1
	05:36:16 15872  ls           do_sys_open      /lib64/libc.so.6
	05:36:16 15872  ls           do_sys_open      /lib64/libpcre.so.1
	05:36:16 15872  ls           do_sys_open      /lib64/libdl.so.2
	05:36:16 15872  ls           do_sys_open      /lib64/libattr.so.1
	05:36:16 15872  ls           do_sys_open      /lib64/libpthread.so.0
	05:36:16 15872  ls           do_sys_open      /usr/lib/locale/locale-archive
	05:36:16 15872  ls           do_sys_open      /home/vagrant
	^C

	In this example, we traced the "ls ~" command as it was opening its shared
	libraries and then accessing the /home/vagrant directory listing.


	Lastly, if a high-frequency event is traced you may overflow the perf ring
	buffer. This shows as "Lost N samples":

	# trace sys_open
	5087    5087    pgrep           sys_open
	5087    5087    pgrep           sys_open
	5087    5087    pgrep           sys_open
	5087    5087    pgrep           sys_open
	5087    5087    pgrep           sys_open
	Lost 764896 samples
	Lost 764896 samples
	Lost 764896 samples

	The perf ring buffer size can be changed with -b. The size is per CPU and is
	measured in pages. The value must be a power of two and defaults to 64 pages.
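
	To pick a valid -b value, it can help to check the power-of-two constraint
	and see how many bytes per CPU a given page count amounts to. The following
	Python sketch does both; the helper names are made up for illustration, and
	the page size is taken from the local system unless passed explicitly:

```python
import mmap


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero, negatives and everything else."""
    return n > 0 and (n & (n - 1)) == 0


def ring_buffer_bytes(pages: int, page_size: int = mmap.PAGESIZE) -> int:
    """Per-CPU buffer size in bytes for a page count that would be passed to -b."""
    if not is_power_of_two(pages):
        raise ValueError(f"{pages} is not a power of two")
    return pages * page_size


# The default of 64 pages, assuming 4 KiB pages.
print(ring_buffer_bytes(64, page_size=4096))  # 262144
```

	Under the 4 KiB page assumption, the default of 64 pages is 256 KiB per CPU,
	so doubling -b until the "Lost N samples" messages stop is a reasonable way
	to size the buffer.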

	# trace.py 'sys_setsockopt(int fd, int level, int optname, char* optval, int optlen)(level==0 && optname == 1 && STRCMP("{0x6C, 0x00, 0x00, 0x00}", optval))' -U -M 1 --bin_cmp
	PID     TID     COMM            FUNC             -
	1855611 1863183 worker          sys_setsockopt   found

	In this example we catch setsockopt syscalls that change the IPv4 IP_TOS
	value, but only when the new TOS value equals 108. We use the STRCMP helper
	in binary mode (--bin_cmp flag) to compare the optval array against the int
	value 108 (the parameter of the setsockopt call) in its hex representation
	(little-endian format).
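
	The byte list passed to STRCMP can be derived mechanically. The Python sketch
	below packs an integer as a 4-byte little-endian int and prints it in the
	brace-delimited form used in the command above; the helper name is
	hypothetical, and a 4-byte int is assumed:

```python
import struct


def strcmp_byte_literal(value: int) -> str:
    """Render a 32-bit int as little-endian bytes, e.g. "{0x6C, 0x00, 0x00, 0x00}"."""
    raw = struct.pack("<i", value)  # "<i" = little-endian, 4-byte signed int
    return "{" + ", ".join(f"0x{byte:02X}" for byte in raw) + "}"


print(strcmp_byte_literal(108))  # {0x6C, 0x00, 0x00, 0x00}
```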

	Advanced users can insert kprobes or uprobes at a certain offset within a
	function, rather than at its entry. This is useful for inspecting register
	values at different points during the execution of a function. Let's
	consider the following example:

	#include <stdio.h>

	int main()
	{
	        int val = 0xdead;
	        printf("%d\n", val);
	        val = 0xbeef;
	        printf("%d\n", val);
	}

	After compiling the code with -O3 optimization the object code looks
	like the following (with GCC 10 and x86_64 architecture):

	objdump --disassemble=main --prefix-addresses a.out

	0000000000001060 <main> endbr64
	0000000000001064 <main+0x4> sub $0x8,%rsp
	0000000000001068 <main+0x8> mov $0xdead,%edx
	000000000000106d <main+0xd> mov $0x1,%edi
	0000000000001072 <main+0x12> xor %eax,%eax
	0000000000001074 <main+0x14> lea 0xf89(%rip),%rsi
	000000000000107b <main+0x1b> callq 0000000000001050 <__printf_chk@plt>
	0000000000001080 <main+0x20> mov $0xbeef,%edx
	0000000000001085 <main+0x25> lea 0xf78(%rip),%rsi
	000000000000108c <main+0x2c> xor %eax,%eax
	000000000000108e <main+0x2e> mov $0x1,%edi
	0000000000001093 <main+0x33> callq 0000000000001050 <__printf_chk@plt>
	0000000000001098 <main+0x38> xor %eax,%eax
	000000000000109a <main+0x3a> add $0x8,%rsp
	000000000000109e <main+0x3e> retq

	The 0xdead and later the 0xbeef values are moved into the edx register.
	As the disassembly shows, edx holds 0xdead once execution reaches offset
	0xd and 0xbeef once it reaches offset 0x25. To verify this with trace,
	let's insert probes at those offsets. The following command inserts two
	uprobes, one at offset 0xd and another at offset 0x25 of the main
	function. Each probe prints the value of the edx register, which should
	show the expected values.

	trace 'p:/tmp/a.out:main+0xd "%x", ctx->dx' 'p:/tmp/a.out:main+0x25 "%x", ctx->dx'
	PID     TID     COMM            FUNC             -
	25754   25754   a.out           main             dead
	25754   25754   a.out           main             beef
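
	The offsets used in the probe specifiers are just the instruction addresses
	from the disassembly minus the address of main. As a small sketch, the
	following Python snippet derives them and assembles the two probe strings;
	the helper names are hypothetical, and the addresses are the ones shown in
	the objdump output above:

```python
def probe_offset(instr_addr: int, func_addr: int) -> int:
    """Offset of an instruction relative to the start of its function."""
    return instr_addr - func_addr


def uprobe_spec(binary: str, func: str, offset: int, fmt: str, expr: str) -> str:
    """Build a probe specifier of the form p:BINARY:FUNC+OFFSET "FMT", EXPR."""
    return f'p:{binary}:{func}+{offset:#x} "{fmt}", {expr}'


MAIN_ADDR = 0x1060  # address of <main> in the disassembly
# Addresses of the instructions that follow each mov into %edx.
specs = [
    uprobe_spec("/tmp/a.out", "main", probe_offset(addr, MAIN_ADDR), "%x", "ctx->dx")
    for addr in (0x106D, 0x1085)
]
for spec in specs:
    print(spec)
```

	The offsets have to be recomputed whenever the binary is rebuilt, since a
	different compiler version or optimization level will usually move the
	instructions.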


	USAGE message:

	usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [--uid UID] [-v]
	             [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C]
	             [-c CGROUP_PATH] [-n NAME] [-f MSG_FILTER] [-B]
	             [-s SYM_FILE_LIST] [-K] [-U] [-a] [-I header]
	             probe [probe ...]

	Attach to functions and print trace messages.

	positional arguments:
	  probe                 probe specifier (see examples)

	optional arguments:
	  -h, --help            show this help message and exit
	  -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
	                        number of pages to use for perf_events ring buffer
	                        (default: 64)
	  -p PID, --pid PID     id of the process to trace (optional)
	  -L TID, --tid TID     id of the thread to trace (optional)
	  --uid UID             id of the user to trace (optional)
	  -v, --verbose         print resulting BPF program code before executing
	  -Z STRING_SIZE, --string-size STRING_SIZE
	                        maximum size to read from strings
	  -S, --include-self    do not filter trace's own pid from the trace
	  -M MAX_EVENTS, --max-events MAX_EVENTS
	                        number of events to print before quitting
	  -t, --timestamp       print timestamp column (offset from trace start)
	  -u, --unix-timestamp  print UNIX timestamp instead of offset from trace
	                        start, requires -t
	  -T, --time            print time column
	  -C, --print_cpu       print CPU id
	  -c CGROUP_PATH, --cgroup-path CGROUP_PATH
	                        cgroup path
	  -n NAME, --name NAME  only print process names containing this name
	  -f MSG_FILTER, --msg-filter MSG_FILTER
	                        only print the msg of event containing this string
	  -B, --bin_cmp         allow to use STRCMP with binary values
	  -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST
	                        comma separated list of symbol files to use for symbol
	                        resolution
	  -K, --kernel-stack    output kernel stack trace
	  -U, --user-stack      output user stack trace
	  -a, --address         print virtual address in stacks
	  -I header, --include header
	                        additional header files to include in the BPF program
	                        as either full path, or relative to current working
	                        directory, or relative to default kernel header search
	                        path
	  -A, --aggregate       aggregate amount of each trace

	EXAMPLES:

	trace do_sys_open
	        Trace the open syscall and print a default trace message when entered
	trace kfree_skb+0x12
	        Trace the kfree_skb kernel function after the instruction on the 0x12 offset
	trace 'do_sys_open "%s", arg2@user'
	        Trace the open syscall and print the filename being opened. @user is
	        added to arg2 in kprobes to ensure that the char * is copied from
	        the userspace stack to the bpf stack. If not specified, the previous
	        behaviour applies.

	trace 'do_sys_open "%s", arg2@user' -n main
	        Trace the open syscall and only print events for processes whose names contain "main"
	trace 'do_sys_open "%s", arg2@user' --uid 1001
	        Trace the open syscall and only print events for processes with user ID 1001
	trace 'do_sys_open "%s", arg2@user' -f config
	        Trace the open syscall and print the filename being opened, filtered by "config"
	trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
	        Trace the read syscall and print a message for reads >20000 bytes
	trace 'r::do_sys_open "%llx", retval'
	        Trace the return from the open syscall and print the return value
	trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
	        Trace the open() call from libc only if the flags (arg2) argument is 42
	trace 'c:malloc "size = %d", arg1'
	        Trace malloc calls and print the size being allocated
	trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3'
	        Trace the write() call from libc to monitor writes to STDOUT
	trace 'r::__kmalloc (retval == 0) "kmalloc failed!"'
	        Trace returns from __kmalloc which returned a null pointer
	trace 'r:c:malloc (retval) "allocated = %x", retval'
	        Trace returns from malloc and print non-NULL allocated buffers
	trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
	        Trace the block_rq_complete kernel tracepoint and print # of tx sectors
	trace 'u:pthread:pthread_create (arg4 != 0)'
	        Trace the USDT probe pthread_create when its 4th argument is non-zero
	trace 'u:pthread:libpthread:pthread_create (arg4 != 0)'
	        Ditto, but the provider name "libpthread" is specified.
	trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec'
	        Trace the nanosleep syscall and print the sleep duration in ns
	trace -c /sys/fs/cgroup/system.slice/workload.service '__x64_sys_nanosleep' '__x64_sys_clone'
	        Trace nanosleep/clone syscall calls only under the workload.service
	        cgroup hierarchy.
	trace -I 'linux/fs.h' \
	      'p::uprobe_register(struct inode *inode) "a_ops = %llx", inode->i_mapping->a_ops'
	        Trace the uprobe_register inode mapping ops; the symbol can be found
	        in /proc/kallsyms
	trace -I 'kernel/sched/sched.h' \
	      'p::__account_cfs_rq_runtime(struct cfs_rq *cfs_rq) "%d", cfs_rq->runtime_remaining'
	        Trace the cfs scheduling runqueue remaining runtime. The struct cfs_rq is defined
	        in kernel/sched/sched.h, which is in the kernel source tree and not in the
	        kernel-devel package, so this command needs to run from the kernel source tree
	        root directory so that the added header file can be found by the compiler.
	trace -I 'net/sock.h' \
	      'udpv6_sendmsg(struct sock *sk) (sk->sk_dport == 13568)'
	        Trace udpv6 sendmsg calls only if the socket's destination port is equal
	        to 53 (DNS; the port is stored in network byte order, which reads as
	        13568 on a little-endian host)
	trace -I 'linux/fs_struct.h' 'mntns_install "users = %d", $task->fs->users'
	        Trace the number of users accessing the file system of the current task
	trace -s /lib/x86_64-linux-gnu/libc.so.6,/bin/ping 'p:c:inet_pton' -U
	        Trace the inet_pton libc function and use the specified libraries/executables
	        for symbol resolution.
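
	The udpv6_sendmsg example above compares sk->sk_dport against 13568 rather
	than 53 because the kernel keeps the port in network (big-endian) byte order.
	The Python sketch below shows how to compute the constant for any port,
	assuming a little-endian host such as x86_64; the helper name is
	hypothetical:

```python
import struct


def port_filter_constant(port: int) -> int:
    """Value a little-endian host reads from a 16-bit port stored big-endian."""
    network_bytes = struct.pack("!H", port)  # "!H" = network order, unsigned 16-bit
    return int.from_bytes(network_bytes, "little")


print(port_filter_constant(53))  # 13568
```

	The conversion is its own inverse, so applying it to 13568 gives back 53.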