diff options
Diffstat (limited to 'kernel/Documentation/trace')
17 files changed, 7594 insertions, 0 deletions
diff --git a/kernel/Documentation/trace/coresight.txt b/kernel/Documentation/trace/coresight.txt new file mode 100644 index 000000000..77d14d51a --- /dev/null +++ b/kernel/Documentation/trace/coresight.txt @@ -0,0 +1,299 @@ + Coresight - HW Assisted Tracing on ARM + ====================================== + + Author: Mathieu Poirier <mathieu.poirier@linaro.org> + Date: September 11th, 2014 + +Introduction +------------ + +Coresight is an umbrella of technologies allowing for the debugging of ARM +based SoC. It includes solutions for JTAG and HW assisted tracing. This +document is concerned with the latter. + +HW assisted tracing is becoming increasingly useful when dealing with systems +that have many SoCs and other components like GPU and DMA engines. ARM has +developed a HW assisted tracing solution by means of different components, each +being added to a design at synthesis time to cater to specific tracing needs. +Compoments are generally categorised as source, link and sinks and are +(usually) discovered using the AMBA bus. + +"Sources" generate a compressed stream representing the processor instruction +path based on tracing scenarios as configured by users. From there the stream +flows through the coresight system (via ATB bus) using links that are connecting +the emanating source to a sink(s). Sinks serve as endpoints to the coresight +implementation, either storing the compressed stream in a memory buffer or +creating an interface to the outside world where data can be transferred to a +host without fear of filling up the onboard coresight memory buffer. + +At typical coresight system would look like this: + + ***************************************************************** + **************************** AMBA AXI ****************************===|| + ***************************************************************** || + ^ ^ | || + | | * ** + 0000000 ::::: 0000000 ::::: ::::: @@@@@@@ |||||||||||| + 0 CPU 0<-->: C : 0 CPU 0<-->: C : : C : @ STM @ || System || + |->0000000 : T : |->0000000 : T : : T :<--->@@@@@ || Memory || + | #######<-->: I : | #######<-->: I : : I : @@@<-| |||||||||||| + | # ETM # ::::: | # PTM # ::::: ::::: @ | + | ##### ^ ^ | ##### ^ ! ^ ! . | ||||||||| + | |->### | ! | |->### | ! | ! . | || DAP || + | | # | ! | | # | ! | ! . | ||||||||| + | | . | ! | | . | ! | ! . | | | + | | . | ! | | . | ! | ! . | | * + | | . | ! | | . | ! | ! . | | SWD/ + | | . | ! | | . | ! | ! . | | JTAG + *****************************************************************<-| + *************************** AMBA Debug APB ************************ + ***************************************************************** + | . ! . ! ! . | + | . * . * * . | + ***************************************************************** + ******************** Cross Trigger Matrix (CTM) ******************* + ***************************************************************** + | . ^ . . | + | * ! * * | + ***************************************************************** + ****************** AMBA Advanced Trace Bus (ATB) ****************** + ***************************************************************** + | ! =============== | + | * ===== F =====<---------| + | ::::::::: ==== U ==== + |-->:: CTI ::<!! === N === + | ::::::::: ! == N == + | ^ * == E == + | ! &&&&&&&&& IIIIIII == L == + |------>&& ETB &&<......II I ======= + | ! &&&&&&&&& II I . + | ! I I . + | ! I REP I<.......... + | ! I I + | !!>&&&&&&&&& II I *Source: ARM ltd. + |------>& TPIU &<......II I DAP = Debug Access Port + &&&&&&&&& IIIIIII ETM = Embedded Trace Macrocell + ; PTM = Program Trace Macrocell + ; CTI = Cross Trigger Interface + * ETB = Embedded Trace Buffer + To trace port TPIU= Trace Port Interface Unit + SWD = Serial Wire Debug + +While on target configuration of the components is done via the APB bus, +all trace data are carried out-of-band on the ATB bus. The CTM provides +a way to aggregate and distribute signals between CoreSight components. + +The coresight framework provides a central point to represent, configure and +manage coresight devices on a platform. This first implementation centers on +the basic tracing functionality, enabling components such ETM/PTM, funnel, +replicator, TMC, TPIU and ETB. Future work will enable more +intricate IP blocks such as STM and CTI. + + +Acronyms and Classification +--------------------------- + +Acronyms: + +PTM: Program Trace Macrocell +ETM: Embedded Trace Macrocell +STM: System trace Macrocell +ETB: Embedded Trace Buffer +ITM: Instrumentation Trace Macrocell +TPIU: Trace Port Interface Unit +TMC-ETR: Trace Memory Controller, configured as Embedded Trace Router +TMC-ETF: Trace Memory Controller, configured as Embedded Trace FIFO +CTI: Cross Trigger Interface + +Classification: + +Source: + ETMv3.x ETMv4, PTMv1.0, PTMv1.1, STM, STM500, ITM +Link: + Funnel, replicator (intelligent or not), TMC-ETR +Sinks: + ETBv1.0, ETB1.1, TPIU, TMC-ETF +Misc: + CTI + + +Device Tree Bindings +---------------------- + +See Documentation/devicetree/bindings/arm/coresight.txt for details. + +As of this writing drivers for ITM, STMs and CTIs are not provided but are +expected to be added as the solution matures. + + +Framework and implementation +---------------------------- + +The coresight framework provides a central point to represent, configure and +manage coresight devices on a platform. Any coresight compliant device can +register with the framework for as long as they use the right APIs: + +struct coresight_device *coresight_register(struct coresight_desc *desc); +void coresight_unregister(struct coresight_device *csdev); + +The registering function is taking a "struct coresight_device *csdev" and +register the device with the core framework. The unregister function takes +a reference to a "strut coresight_device", obtained at registration time. + +If everything goes well during the registration process the new devices will +show up under /sys/bus/coresight/devices, as showns here for a TC2 platform: + +root:~# ls /sys/bus/coresight/devices/ +replicator 20030000.tpiu 2201c000.ptm 2203c000.etm 2203e000.etm +20010000.etb 20040000.funnel 2201d000.ptm 2203d000.etm +root:~# + +The functions take a "struct coresight_device", which looks like this: + +struct coresight_desc { + enum coresight_dev_type type; + struct coresight_dev_subtype subtype; + const struct coresight_ops *ops; + struct coresight_platform_data *pdata; + struct device *dev; + const struct attribute_group **groups; +}; + + +The "coresight_dev_type" identifies what the device is, i.e, source link or +sink while the "coresight_dev_subtype" will characterise that type further. + +The "struct coresight_ops" is mandatory and will tell the framework how to +perform base operations related to the components, each component having +a different set of requirement. For that "struct coresight_ops_sink", +"struct coresight_ops_link" and "struct coresight_ops_source" have been +provided. + +The next field, "struct coresight_platform_data *pdata" is acquired by calling +"of_get_coresight_platform_data()", as part of the driver's _probe routine and +"struct device *dev" gets the device reference embedded in the "amba_device": + +static int etm_probe(struct amba_device *adev, const struct amba_id *id) +{ + ... + ... + drvdata->dev = &adev->dev; + ... +} + +Specific class of device (source, link, or sink) have generic operations +that can be performed on them (see "struct coresight_ops"). The +"**groups" is a list of sysfs entries pertaining to operations +specific to that component only. "Implementation defined" customisations are +expected to be accessed and controlled using those entries. + +Last but not least, "struct module *owner" is expected to be set to reflect +the information carried in "THIS_MODULE". + +How to use +---------- + +Before trace collection can start, a coresight sink needs to be identify. +There is no limit on the amount of sinks (nor sources) that can be enabled at +any given moment. As a generic operation, all device pertaining to the sink +class will have an "active" entry in sysfs: + +root:/sys/bus/coresight/devices# ls +replicator 20030000.tpiu 2201c000.ptm 2203c000.etm 2203e000.etm +20010000.etb 20040000.funnel 2201d000.ptm 2203d000.etm +root:/sys/bus/coresight/devices# ls 20010000.etb +enable_sink status trigger_cntr +root:/sys/bus/coresight/devices# echo 1 > 20010000.etb/enable_sink +root:/sys/bus/coresight/devices# cat 20010000.etb/enable_sink +1 +root:/sys/bus/coresight/devices# + +At boot time the current etm3x driver will configure the first address +comparator with "_stext" and "_etext", essentially tracing any instruction +that falls within that range. As such "enabling" a source will immediately +trigger a trace capture: + +root:/sys/bus/coresight/devices# echo 1 > 2201c000.ptm/enable_source +root:/sys/bus/coresight/devices# cat 2201c000.ptm/enable_source +1 +root:/sys/bus/coresight/devices# cat 20010000.etb/status +Depth: 0x2000 +Status: 0x1 +RAM read ptr: 0x0 +RAM wrt ptr: 0x19d3 <----- The write pointer is moving +Trigger cnt: 0x0 +Control: 0x1 +Flush status: 0x0 +Flush ctrl: 0x2001 +root:/sys/bus/coresight/devices# + +Trace collection is stopped the same way: + +root:/sys/bus/coresight/devices# echo 0 > 2201c000.ptm/enable_source +root:/sys/bus/coresight/devices# + +The content of the ETB buffer can be harvested directly from /dev: + +root:/sys/bus/coresight/devices# dd if=/dev/20010000.etb \ +of=~/cstrace.bin + +64+0 records in +64+0 records out +32768 bytes (33 kB) copied, 0.00125258 s, 26.2 MB/s +root:/sys/bus/coresight/devices# + +The file cstrace.bin can be decompressed using "ptm2human", DS-5 or Trace32. + +Following is a DS-5 output of an experimental loop that increments a variable up +to a certain value. The example is simple and yet provides a glimpse of the +wealth of possibilities that coresight provides. + +Info Tracing enabled +Instruction 106378866 0x8026B53C E52DE004 false PUSH {lr} +Instruction 0 0x8026B540 E24DD00C false SUB sp,sp,#0xc +Instruction 0 0x8026B544 E3A03000 false MOV r3,#0 +Instruction 0 0x8026B548 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B54C E59D3004 false LDR r3,[sp,#4] +Instruction 0 0x8026B550 E3530004 false CMP r3,#4 +Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1 +Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c +Timestamp Timestamp: 17106715833 +Instruction 319 0x8026B54C E59D3004 false LDR r3,[sp,#4] +Instruction 0 0x8026B550 E3530004 false CMP r3,#4 +Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1 +Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c +Instruction 9 0x8026B54C E59D3004 false LDR r3,[sp,#4] +Instruction 0 0x8026B550 E3530004 false CMP r3,#4 +Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1 +Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c +Instruction 7 0x8026B54C E59D3004 false LDR r3,[sp,#4] +Instruction 0 0x8026B550 E3530004 false CMP r3,#4 +Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1 +Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c +Instruction 7 0x8026B54C E59D3004 false LDR r3,[sp,#4] +Instruction 0 0x8026B550 E3530004 false CMP r3,#4 +Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1 +Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c +Instruction 10 0x8026B54C E59D3004 false LDR r3,[sp,#4] +Instruction 0 0x8026B550 E3530004 false CMP r3,#4 +Instruction 0 0x8026B554 E2833001 false ADD r3,r3,#1 +Instruction 0 0x8026B558 E58D3004 false STR r3,[sp,#4] +Instruction 0 0x8026B55C DAFFFFFA true BLE {pc}-0x10 ; 0x8026b54c +Instruction 6 0x8026B560 EE1D3F30 false MRC p15,#0x0,r3,c13,c0,#1 +Instruction 0 0x8026B564 E1A0100D false MOV r1,sp +Instruction 0 0x8026B568 E3C12D7F false BIC r2,r1,#0x1fc0 +Instruction 0 0x8026B56C E3C2203F false BIC r2,r2,#0x3f +Instruction 0 0x8026B570 E59D1004 false LDR r1,[sp,#4] +Instruction 0 0x8026B574 E59F0010 false LDR r0,[pc,#16] ; [0x8026B58C] = 0x80550368 +Instruction 0 0x8026B578 E592200C false LDR r2,[r2,#0xc] +Instruction 0 0x8026B57C E59221D0 false LDR r2,[r2,#0x1d0] +Instruction 0 0x8026B580 EB07A4CF true BL {pc}+0x1e9344 ; 0x804548c4 +Info Tracing enabled +Instruction 13570831 0x8026B584 E28DD00C false ADD sp,sp,#0xc +Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc} +Timestamp Timestamp: 17107041535 diff --git a/kernel/Documentation/trace/events-kmem.txt b/kernel/Documentation/trace/events-kmem.txt new file mode 100644 index 000000000..194800410 --- /dev/null +++ b/kernel/Documentation/trace/events-kmem.txt @@ -0,0 +1,107 @@ + Subsystem Trace Points: kmem + +The kmem tracing system captures events related to object and page allocation +within the kernel. Broadly speaking there are five major subheadings. + + o Slab allocation of small objects of unknown type (kmalloc) + o Slab allocation of small objects of known type + o Page allocation + o Per-CPU Allocator Activity + o External Fragmentation + +This document describes what each of the tracepoints is and why they +might be useful. + +1. Slab allocation of small objects of unknown type +=================================================== +kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kfree call_site=%lx ptr=%p + +Heavy activity for these events may indicate that a specific cache is +justified, particularly if kmalloc slab pages are getting significantly +internal fragmented as a result of the allocation pattern. By correlating +kmalloc with kfree, it may be possible to identify memory leaks and where +the allocation sites were. + + +2. Slab allocation of small objects of known type +================================================= +kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kmem_cache_free call_site=%lx ptr=%p + +These events are similar in usage to the kmalloc-related events except that +it is likely easier to pin the event down to a specific cache. At the time +of writing, no information is available on what slab is being allocated from, +but the call_site can usually be used to extrapolate that information. + +3. Page allocation +================== +mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_free page=%p pfn=%lu order=%d +mm_page_free_batched page=%p pfn=%lu order=%d cold=%d + +These four events deal with page allocation and freeing. mm_page_alloc is +a simple indicator of page allocator activity. Pages may be allocated from +the per-CPU allocator (high performance) or the buddy allocator. + +If pages are allocated directly from the buddy allocator, the +mm_page_alloc_zone_locked event is triggered. This event is important as high +amounts of activity imply high activity on the zone->lock. Taking this lock +impairs performance by disabling interrupts, dirtying cache lines between +CPUs and serialising many CPUs. + +When a page is freed directly by the caller, the only mm_page_free event +is triggered. Significant amounts of activity here could indicate that the +callers should be batching their activities. + +When pages are freed in batch, the also mm_page_free_batched is triggered. +Broadly speaking, pages are taken off the LRU lock in bulk and +freed in batch with a page list. Significant amounts of activity here could +indicate that the system is under memory pressure and can also indicate +contention on the zone->lru_lock. + +4. Per-CPU Allocator Activity +============================= +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d + +In front of the page allocator is a per-cpu page allocator. It exists only +for order-0 pages, reduces contention on the zone->lock and reduces the +amount of writing on struct page. + +When a per-CPU list is empty or pages of the wrong type are allocated, +the zone->lock will be taken once and the per-CPU list refilled. The event +triggered is mm_page_alloc_zone_locked for each page allocated with the +event indicating whether it is for a percpu_refill or not. + +When the per-CPU list is too full, a number of pages are freed, each one +which triggers a mm_page_pcpu_drain event. + +The individual nature of the events is so that pages can be tracked +between allocation and freeing. A number of drain or refill pages that occur +consecutively imply the zone->lock being taken once. Large amounts of per-CPU +refills and drains could imply an imbalance between CPUs where too much work +is being concentrated in one place. It could also indicate that the per-CPU +lists should be a larger size. Finally, large amounts of refills on one CPU +and drains on another could be a factor in causing large amounts of cache +line bounces due to writes between CPUs and worth investigating if pages +can be allocated and freed on the same CPU through some algorithm change. + +5. External Fragmentation +========================= +mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d + +External fragmentation affects whether a high-order allocation will be +successful or not. For some types of hardware, this is important although +it is avoided where possible. If the system is using huge pages and needs +to be able to resize the pool over the lifetime of the system, this value +is important. + +Large numbers of this event implies that memory is fragmenting and +high-order allocations will start failing at some time in the future. One +means of reducing the occurrence of this event is to increase the size of +min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where +pageblock_size is usually the size of the default hugepage size. diff --git a/kernel/Documentation/trace/events-nmi.txt b/kernel/Documentation/trace/events-nmi.txt new file mode 100644 index 000000000..c03c8c89f --- /dev/null +++ b/kernel/Documentation/trace/events-nmi.txt @@ -0,0 +1,43 @@ +NMI Trace Events + +These events normally show up here: + + /sys/kernel/debug/tracing/events/nmi + +-- + +nmi_handler: + +You might want to use this tracepoint if you suspect that your +NMI handlers are hogging large amounts of CPU time. The kernel +will warn if it sees long-running handlers: + + INFO: NMI handler took too long to run: 9.207 msecs + +and this tracepoint will allow you to drill down and get some +more details. + +Let's say you suspect that perf_event_nmi_handler() is causing +you some problems and you only want to trace that handler +specifically. You need to find its address: + + $ grep perf_event_nmi_handler /proc/kallsyms + ffffffff81625600 t perf_event_nmi_handler + +Let's also say you are only interested in when that function is +really hogging a lot of CPU time, like a millisecond at a time. +Note that the kernel's output is in milliseconds, but the input +to the filter is in nanoseconds! You can filter on 'delta_ns': + +cd /sys/kernel/debug/tracing/events/nmi/nmi_handler +echo 'handler==0xffffffff81625600 && delta_ns>1000000' > filter +echo 1 > enable + +Your output would then look like: + +$ cat /sys/kernel/debug/tracing/trace_pipe +<idle>-0 [000] d.h3 505.397558: nmi_handler: perf_event_nmi_handler() delta_ns: 3236765 handled: 1 +<idle>-0 [000] d.h3 505.805893: nmi_handler: perf_event_nmi_handler() delta_ns: 3174234 handled: 1 +<idle>-0 [000] d.h3 506.158206: nmi_handler: perf_event_nmi_handler() delta_ns: 3084642 handled: 1 +<idle>-0 [000] d.h3 506.334346: nmi_handler: perf_event_nmi_handler() delta_ns: 3080351 handled: 1 + diff --git a/kernel/Documentation/trace/events-power.txt b/kernel/Documentation/trace/events-power.txt new file mode 100644 index 000000000..21d514ced --- /dev/null +++ b/kernel/Documentation/trace/events-power.txt @@ -0,0 +1,96 @@ + + Subsystem Trace Points: power + +The power tracing system captures events related to power transitions +within the kernel. Broadly speaking there are three major subheadings: + + o Power state switch which reports events related to suspend (S-states), + cpuidle (C-states) and cpufreq (P-states) + o System clock related changes + o Power domains related changes and transitions + +This document describes what each of the tracepoints is and why they +might be useful. + +Cf. include/trace/events/power.h for the events definitions. + +1. Power state switch events +============================ + +1.1 Trace API +----------------- + +A 'cpu' event class gathers the CPU-related events: cpuidle and +cpufreq. + +cpu_idle "state=%lu cpu_id=%lu" +cpu_frequency "state=%lu cpu_id=%lu" + +A suspend event is used to indicate the system going in and out of the +suspend mode: + +machine_suspend "state=%lu" + + +Note: the value of '-1' or '4294967295' for state means an exit from the current state, +i.e. trace_cpu_idle(4, smp_processor_id()) means that the system +enters the idle state 4, while trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()) +means that the system exits the previous idle state. + +The event which has 'state=4294967295' in the trace is very important to the user +space tools which are using it to detect the end of the current state, and so to +correctly draw the states diagrams and to calculate accurate statistics etc. + +2. Clocks events +================ +The clock events are used for clock enable/disable and for +clock rate change. + +clock_enable "%s state=%lu cpu_id=%lu" +clock_disable "%s state=%lu cpu_id=%lu" +clock_set_rate "%s state=%lu cpu_id=%lu" + +The first parameter gives the clock name (e.g. "gpio1_iclk"). +The second parameter is '1' for enable, '0' for disable, the target +clock rate for set_rate. + +3. Power domains events +======================= +The power domain events are used for power domains transitions + +power_domain_target "%s state=%lu cpu_id=%lu" + +The first parameter gives the power domain name (e.g. "mpu_pwrdm"). +The second parameter is the power domain target state. + +4. PM QoS events +================ +The PM QoS events are used for QoS add/update/remove request and for +target/flags update. + +pm_qos_add_request "pm_qos_class=%s value=%d" +pm_qos_update_request "pm_qos_class=%s value=%d" +pm_qos_remove_request "pm_qos_class=%s value=%d" +pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld" + +The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY"). +The second parameter is value to be added/updated/removed. +The third parameter is timeout value in usec. + +pm_qos_update_target "action=%s prev_value=%d curr_value=%d" +pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x" + +The first parameter gives the QoS action name (e.g. "ADD_REQ"). +The second parameter is the previous QoS value. +The third parameter is the current QoS value to update. + +And, there are also events used for device PM QoS add/update/remove request. + +dev_pm_qos_add_request "device=%s type=%s new_value=%d" +dev_pm_qos_update_request "device=%s type=%s new_value=%d" +dev_pm_qos_remove_request "device=%s type=%s new_value=%d" + +The first parameter gives the device name which tries to add/update/remove +QoS requests. +The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY"). +The third parameter is value to be added/updated/removed. diff --git a/kernel/Documentation/trace/events.txt b/kernel/Documentation/trace/events.txt new file mode 100644 index 000000000..75d25a1d6 --- /dev/null +++ b/kernel/Documentation/trace/events.txt @@ -0,0 +1,496 @@ + Event Tracing + + Documentation written by Theodore Ts'o + Updated by Li Zefan and Tom Zanussi + +1. Introduction +=============== + +Tracepoints (see Documentation/trace/tracepoints.txt) can be used +without creating custom kernel modules to register probe functions +using the event tracing infrastructure. + +Not all tracepoints can be traced using the event tracing system; +the kernel developer must provide code snippets which define how the +tracing information is saved into the tracing buffer, and how the +tracing information should be printed. + +2. Using Event Tracing +====================== + +2.1 Via the 'set_event' interface +--------------------------------- + +The events which are available for tracing can be found in the file +/sys/kernel/debug/tracing/available_events. + +To enable a particular event, such as 'sched_wakeup', simply echo it +to /sys/kernel/debug/tracing/set_event. For example: + + # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event + +[ Note: '>>' is necessary, otherwise it will firstly disable + all the events. ] + +To disable an event, echo the event name to the set_event file prefixed +with an exclamation point: + + # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event + +To disable all events, echo an empty line to the set_event file: + + # echo > /sys/kernel/debug/tracing/set_event + +To enable all events, echo '*:*' or '*:' to the set_event file: + + # echo *:* > /sys/kernel/debug/tracing/set_event + +The events are organized into subsystems, such as ext4, irq, sched, +etc., and a full event name looks like this: <subsystem>:<event>. The +subsystem name is optional, but it is displayed in the available_events +file. All of the events in a subsystem can be specified via the syntax +"<subsystem>:*"; for example, to enable all irq events, you can use the +command: + + # echo 'irq:*' > /sys/kernel/debug/tracing/set_event + +2.2 Via the 'enable' toggle +--------------------------- + +The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy +of directories. + +To enable event 'sched_wakeup': + + # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable + +To disable it: + + # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable + +To enable all events in sched subsystem: + + # echo 1 > /sys/kernel/debug/tracing/events/sched/enable + +To enable all events: + + # echo 1 > /sys/kernel/debug/tracing/events/enable + +When reading one of these enable files, there are four results: + + 0 - all events this file affects are disabled + 1 - all events this file affects are enabled + X - there is a mixture of events enabled and disabled + ? - this file does not affect any event + +2.3 Boot option +--------------- + +In order to facilitate early boot debugging, use boot option: + + trace_event=[event-list] + +event-list is a comma separated list of events. See section 2.1 for event +format. + +3. Defining an event-enabled tracepoint +======================================= + +See The example provided in samples/trace_events + +4. Event formats +================ + +Each trace event has a 'format' file associated with it that contains +a description of each field in a logged event. This information can +be used to parse the binary trace stream, and is also the place to +find the field names that can be used in event filters (see section 5). + +It also displays the format string that will be used to print the +event in text mode, along with the event name and ID used for +profiling. + +Every event has a set of 'common' fields associated with it; these are +the fields prefixed with 'common_'. The other fields vary between +events and correspond to the fields defined in the TRACE_EVENT +definition for that event. + +Each field in the format has the form: + + field:field-type field-name; offset:N; size:N; + +where offset is the offset of the field in the trace record and size +is the size of the data item, in bytes. + +For example, here's the information displayed for the 'sched_wakeup' +event: + +# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format + +name: sched_wakeup +ID: 60 +format: + field:unsigned short common_type; offset:0; size:2; + field:unsigned char common_flags; offset:2; size:1; + field:unsigned char common_preempt_count; offset:3; size:1; + field:int common_pid; offset:4; size:4; + field:int common_tgid; offset:8; size:4; + + field:char comm[TASK_COMM_LEN]; offset:12; size:16; + field:pid_t pid; offset:28; size:4; + field:int prio; offset:32; size:4; + field:int success; offset:36; size:4; + field:int cpu; offset:40; size:4; + +print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, + REC->prio, REC->success, REC->cpu + +This event contains 10 fields, the first 5 common and the remaining 5 +event-specific. All the fields for this event are numeric, except for +'comm' which is a string, a distinction important for event filtering. + +5. Event filtering +================== + +Trace events can be filtered in the kernel by associating boolean +'filter expressions' with them. As soon as an event is logged into +the trace buffer, its fields are checked against the filter expression +associated with that event type. An event with field values that +'match' the filter will appear in the trace output, and an event whose +values don't match will be discarded. An event with no filter +associated with it matches everything, and is the default when no +filter has been set for an event. + +5.1 Expression syntax +--------------------- + +A filter expression consists of one or more 'predicates' that can be +combined using the logical operators '&&' and '||'. A predicate is +simply a clause that compares the value of a field contained within a +logged event with a constant value and returns either 0 or 1 depending +on whether the field value matched (1) or didn't match (0): + + field-name relational-operator value + +Parentheses can be used to provide arbitrary logical groupings and +double-quotes can be used to prevent the shell from interpreting +operators as shell metacharacters. + +The field-names available for use in filters can be found in the +'format' files for trace events (see section 4). + +The relational-operators depend on the type of the field being tested: + +The operators available for numeric fields are: + +==, !=, <, <=, >, >=, & + +And for string fields they are: + +==, !=, ~ + +The glob (~) only accepts a wild card character (*) at the start and or +end of the string. For example: + + prev_comm ~ "*sh" + prev_comm ~ "sh*" + prev_comm ~ "*sh*" + +But does not allow for it to be within the string: + + prev_comm ~ "ba*sh" <-- is invalid + +5.2 Setting filters +------------------- + +A filter for an individual event is set by writing a filter expression +to the 'filter' file for the given event. + +For example: + +# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup +# echo "common_preempt_count > 4" > filter + +A slightly more involved example: + +# cd /sys/kernel/debug/tracing/events/signal/signal_generate +# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter + +If there is an error in the expression, you'll get an 'Invalid +argument' error when setting it, and the erroneous string along with +an error message can be seen by looking at the filter e.g.: + +# cd /sys/kernel/debug/tracing/events/signal/signal_generate +# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter +-bash: echo: write error: Invalid argument +# cat filter +((sig >= 10 && sig < 15) || dsig == 17) && comm != bash +^ +parse_error: Field not found + +Currently the caret ('^') for an error always appears at the beginning of +the filter string; the error message should still be useful though +even without more accurate position info. + +5.3 Clearing filters +-------------------- + +To clear the filter for an event, write a '0' to the event's filter +file. + +To clear the filters for all events in a subsystem, write a '0' to the +subsystem's filter file. + +5.3 Subsystem filters +--------------------- + +For convenience, filters for every event in a subsystem can be set or +cleared as a group by writing a filter expression into the filter file +at the root of the subsystem. Note however, that if a filter for any +event within the subsystem lacks a field specified in the subsystem +filter, or if the filter can't be applied for any other reason, the +filter for that event will retain its previous setting. This can +result in an unintended mixture of filters which could lead to +confusing (to the user who might think different filters are in +effect) trace output. Only filters that reference just the common +fields can be guaranteed to propagate successfully to all events. + +Here are a few subsystem filter examples that also illustrate the +above points: + +Clear the filters on all events in the sched subsystem: + +# cd /sys/kernel/debug/tracing/events/sched +# echo 0 > filter +# cat sched_switch/filter +none +# cat sched_wakeup/filter +none + +Set a filter using only common fields for all events in the sched +subsystem (all events end up with the same filter): + +# cd /sys/kernel/debug/tracing/events/sched +# echo common_pid == 0 > filter +# cat sched_switch/filter +common_pid == 0 +# cat sched_wakeup/filter +common_pid == 0 + +Attempt to set a filter using a non-common field for all events in the +sched subsystem (all events but those that have a prev_pid field retain +their old filters): + +# cd /sys/kernel/debug/tracing/events/sched +# echo prev_pid == 0 > filter +# cat sched_switch/filter +prev_pid == 0 +# cat sched_wakeup/filter +common_pid == 0 + +6. Event triggers +================= + +Trace events can be made to conditionally invoke trigger 'commands' +which can take various forms and are described in detail below; +examples would be enabling or disabling other trace events or invoking +a stack trace whenever the trace event is hit. Whenever a trace event +with attached triggers is invoked, the set of trigger commands +associated with that event is invoked. Any given trigger can +additionally have an event filter of the same form as described in +section 5 (Event filtering) associated with it - the command will only +be invoked if the event being invoked passes the associated filter. +If no filter is associated with the trigger, it always passes. + +Triggers are added to and removed from a particular event by writing +trigger expressions to the 'trigger' file for the given event. + +A given event can have any number of triggers associated with it, +subject to any restrictions that individual commands may have in that +regard. + +Event triggers are implemented on top of "soft" mode, which means that +whenever a trace event has one or more triggers associated with it, +the event is activated even if it isn't actually enabled, but is +disabled in a "soft" mode. That is, the tracepoint will be called, +but just will not be traced, unless of course it's actually enabled. +This scheme allows triggers to be invoked even for events that aren't +enabled, and also allows the current event filter implementation to be +used for conditionally invoking triggers. + +The syntax for event triggers is roughly based on the syntax for +set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands' +section of Documentation/trace/ftrace.txt), but there are major +differences and the implementation isn't currently tied to it in any +way, so beware about making generalizations between the two. + +6.1 Expression syntax +--------------------- + +Triggers are added by echoing the command to the 'trigger' file: + + # echo 'command[:count] [if filter]' > trigger + +Triggers are removed by echoing the same command but starting with '!' +to the 'trigger' file: + + # echo '!command[:count] [if filter]' > trigger + +The [if filter] part isn't used in matching commands when removing, so +leaving that off in a '!' command will accomplish the same thing as +having it in. + +The filter syntax is the same as that described in the 'Event +filtering' section above. + +For ease of use, writing to the trigger file using '>' currently just +adds or removes a single trigger and there's no explicit '>>' support +('>' actually behaves like '>>') or truncation support to remove all +triggers (you have to use '!' for each one added.) + +6.2 Supported trigger commands +------------------------------ + +The following commands are supported: + +- enable_event/disable_event + + These commands can enable or disable another trace event whenever + the triggering event is hit. When these commands are registered, + the other trace event is activated, but disabled in a "soft" mode. + That is, the tracepoint will be called, but just will not be traced. + The event tracepoint stays in this mode as long as there's a trigger + in effect that can trigger it. + + For example, the following trigger causes kmalloc events to be + traced when a read system call is entered, and the :1 at the end + specifies that this enablement happens only once: + + # echo 'enable_event:kmem:kmalloc:1' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger + + The following trigger causes kmalloc events to stop being traced + when a read system call exits. This disablement happens on every + read system call exit: + + # echo 'disable_event:kmem:kmalloc' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger + + The format is: + + enable_event:<system>:<event>[:count] + disable_event:<system>:<event>[:count] + + To remove the above commands: + + # echo '!enable_event:kmem:kmalloc:1' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger + + # echo '!disable_event:kmem:kmalloc' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger + + Note that there can be any number of enable/disable_event triggers + per triggering event, but there can only be one trigger per + triggered event. e.g. sys_enter_read can have triggers enabling both + kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc + versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if + bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they + could be combined into a single filter on kmem:kmalloc though). + +- stacktrace + + This command dumps a stacktrace in the trace buffer whenever the + triggering event occurs. + + For example, the following trigger dumps a stacktrace every time the + kmalloc tracepoint is hit: + + # echo 'stacktrace' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + The following trigger dumps a stacktrace the first 5 times a kmalloc + request happens with a size >= 64K + + # echo 'stacktrace:5 if bytes_req >= 65536' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + The format is: + + stacktrace[:count] + + To remove the above commands: + + # echo '!stacktrace' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + # echo '!stacktrace:5 if bytes_req >= 65536' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + The latter can also be removed more simply by the following (without + the filter): + + # echo '!stacktrace:5' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + Note that there can be only one stacktrace trigger per triggering + event. + +- snapshot + + This command causes a snapshot to be triggered whenever the + triggering event occurs. + + The following command creates a snapshot every time a block request + queue is unplugged with a depth > 1. If you were tracing a set of + events or functions at the time, the snapshot trace buffer would + capture those events when the trigger event occurred: + + # echo 'snapshot if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To only snapshot once: + + # echo 'snapshot:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To remove the above commands: + + # echo '!snapshot if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + # echo '!snapshot:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + Note that there can be only one snapshot trigger per triggering + event. + +- traceon/traceoff + + These commands turn tracing on and off when the specified events are + hit. The parameter determines how many times the tracing system is + turned on and off. If unspecified, there is no limit. + + The following command turns tracing off the first time a block + request queue is unplugged with a depth > 1. If you were tracing a + set of events or functions at the time, you could then examine the + trace buffer to see the sequence of events that led up to the + trigger event: + + # echo 'traceoff:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To always disable tracing when nr_rq > 1 : + + # echo 'traceoff if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To remove the above commands: + + # echo '!traceoff:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + # echo '!traceoff if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + Note that there can be only one traceon or traceoff trigger per + triggering event. diff --git a/kernel/Documentation/trace/ftrace-design.txt b/kernel/Documentation/trace/ftrace-design.txt new file mode 100644 index 000000000..dd5f916b3 --- /dev/null +++ b/kernel/Documentation/trace/ftrace-design.txt @@ -0,0 +1,382 @@ + function tracer guts + ==================== + By Mike Frysinger + +Introduction +------------ + +Here we will cover the architecture pieces that the common function tracing +code relies on for proper functioning. Things are broken down into increasing +complexity so that you can start simple and at least get basic functionality. + +Note that this focuses on architecture implementation details only. If you +want more explanation of a feature in terms of common code, review the common +ftrace.txt file. + +Ideally, everyone who wishes to retain performance while supporting tracing in +their kernel should make it all the way to dynamic ftrace support. + + +Prerequisites +------------- + +Ftrace relies on these features being implemented: + STACKTRACE_SUPPORT - implement save_stack_trace() + TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h + + +HAVE_FUNCTION_TRACER +-------------------- + +You will need to implement the mcount and the ftrace_stub functions. + +The exact mcount symbol name will depend on your toolchain. Some call it +"mcount", "_mcount", or even "__mcount". You can probably figure it out by +running something like: + $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount + call mcount +We'll make the assumption below that the symbol is "mcount" just to keep things +nice and simple in the examples. + +Keep in mind that the ABI that is in effect inside of the mcount function is +*highly* architecture/toolchain specific. We cannot help you in this regard, +sorry. Dig up some old documentation and/or find someone more familiar than +you to bang ideas off of. Typically, register usage (argument/scratch/etc...) +is a major issue at this point, especially in relation to the location of the +mcount call (before/after function prologue). You might also want to look at +how glibc has implemented the mcount function for your architecture. It might +be (semi-)relevant. + +The mcount function should check the function pointer ftrace_trace_function +to see if it is set to ftrace_stub. If it is, there is nothing for you to do, +so return immediately. If it isn't, then call that function in the same way +the mcount function normally calls __mcount_internal -- the first argument is +the "frompc" while the second argument is the "selfpc" (adjusted to remove the +size of the mcount call that is embedded in the function). + +For example, if the function foo() calls bar(), when the bar() function calls +mcount(), the arguments mcount() will pass to the tracer are: + "frompc" - the address bar() will use to return to foo() + "selfpc" - the address bar() (with mcount() size adjustment) + +Also keep in mind that this mcount function will be called *a lot*, so +optimizing for the default case of no tracer will help the smooth running of +your system when tracing is disabled. So the start of the mcount function is +typically the bare minimum with checking things before returning. That also +means the code flow should usually be kept linear (i.e. no branching in the nop +case). This is of course an optimization and not a hard requirement. + +Here is some pseudo code that should help (these functions should actually be +implemented in assembly): + +void ftrace_stub(void) +{ + return; +} + +void mcount(void) +{ + /* save any bare state needed in order to do initial checking */ + + extern void (*ftrace_trace_function)(unsigned long, unsigned long); + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + + /* restore any bare state */ + + return; + +do_trace: + + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + ftrace_trace_function(frompc, selfpc); + + /* restore all state needed by the ABI */ +} + +Don't forget to export mcount for modules ! +extern void mcount(void); +EXPORT_SYMBOL(mcount); + + +HAVE_FUNCTION_GRAPH_TRACER +-------------------------- + +Deep breath ... time to do some real work. Here you will need to update the +mcount function to check ftrace graph function pointers, as well as implement +some functions to save (hijack) and restore the return address. + +The mcount function should check the function pointers ftrace_graph_return +(compare to ftrace_stub) and ftrace_graph_entry (compare to +ftrace_graph_entry_stub). If either of those is not set to the relevant stub +function, call the arch-specific function ftrace_graph_caller which in turn +calls the arch-specific function prepare_ftrace_return. Neither of these +function names is strictly required, but you should use them anyway to stay +consistent across the architecture ports -- easier to compare & contrast +things. + +The arguments to prepare_ftrace_return are slightly different than what are +passed to ftrace_trace_function. The second argument "selfpc" is the same, +but the first argument should be a pointer to the "frompc". Typically this is +located on the stack. This allows the function to hijack the return address +temporarily to have it point to the arch-specific function return_to_handler. +That function will simply call the common ftrace_return_to_handler function and +that will return the original return address with which you can return to the +original call site. + +Here is the updated mcount pseudo code: +void mcount(void) +{ +... + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + ++#ifdef CONFIG_FUNCTION_GRAPH_TRACER ++ extern void (*ftrace_graph_return)(...); ++ extern void (*ftrace_graph_entry)(...); ++ if (ftrace_graph_return != ftrace_stub || ++ ftrace_graph_entry != ftrace_graph_entry_stub) ++ ftrace_graph_caller(); ++#endif + + /* restore any bare state */ +... + +Here is the pseudo code for the new ftrace_graph_caller assembly function: +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +void ftrace_graph_caller(void) +{ + /* save all state needed by the ABI */ + + unsigned long *frompc = &...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + /* passing frame pointer up is optional -- see below */ + prepare_ftrace_return(frompc, selfpc, frame_pointer); + + /* restore all state needed by the ABI */ +} +#endif + +For information on how to implement prepare_ftrace_return(), simply look at the +x86 version (the frame pointer passing is optional; see the next section for +more information). The only architecture-specific piece in it is the setup of +the fault recovery table (the asm(...) code). The rest should be the same +across architectures. + +Here is the pseudo code for the new return_to_handler assembly function. Note +that the ABI that applies here is different from what applies to the mcount +code. Since you are returning from a function (after the epilogue), you might +be able to skimp on things saved/restored (usually just registers used to pass +return values). + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +void return_to_handler(void) +{ + /* save all state needed by the ABI (see paragraph above) */ + + void (*original_return_point)(void) = ftrace_return_to_handler(); + + /* restore all state needed by the ABI */ + + /* this is usually either a return or a jump */ + original_return_point(); +} +#endif + + +HAVE_FUNCTION_GRAPH_FP_TEST +--------------------------- + +An arch may pass in a unique value (frame pointer) to both the entering and +exiting of a function. On exit, the value is compared and if it does not +match, then it will panic the kernel. This is largely a sanity check for bad +code generation with gcc. If gcc for your port sanely updates the frame +pointer under different optimization levels, then ignore this option. + +However, adding support for it isn't terribly difficult. In your assembly code +that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument. +Then in the C version of that function, do what the x86 port does and pass it +along to ftrace_push_return_trace() instead of a stub value of 0. + +Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer. + + +HAVE_FTRACE_NMI_ENTER +--------------------- + +If you can't trace NMI functions, then skip this option. + +<details to be filled> + + +HAVE_SYSCALL_TRACEPOINTS +------------------------ + +You need very few things to get the syscalls tracing in an arch. + +- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig). +- Have a NR_syscalls variable in <asm/unistd.h> that provides the number + of syscalls supported by the arch. +- Support the TIF_SYSCALL_TRACEPOINT thread flags. +- Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace + in the ptrace syscalls tracing path. +- If the system call table on this arch is more complicated than a simple array + of addresses of the system calls, implement an arch_syscall_addr to return + the address of a given system call. +- If the symbol names of the system calls do not match the function names on + this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and + implement arch_syscall_match_sym_name with the appropriate logic to return + true if the function name corresponds with the symbol name. +- Tag this arch as HAVE_SYSCALL_TRACEPOINTS. + + +HAVE_FTRACE_MCOUNT_RECORD +------------------------- + +See scripts/recordmcount.pl for more info. Just fill in the arch-specific +details for how to locate the addresses of mcount call sites via objdump. +This option doesn't make much sense without also implementing dynamic ftrace. + + +HAVE_DYNAMIC_FTRACE +------------------- + +You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so +scroll your reader back up if you got over eager. + +Once those are out of the way, you will need to implement: + - asm/ftrace.h: + - MCOUNT_ADDR + - ftrace_call_adjust() + - struct dyn_arch_ftrace{} + - asm code: + - mcount() (new stub) + - ftrace_caller() + - ftrace_call() + - ftrace_stub() + - C code: + - ftrace_dyn_arch_init() + - ftrace_make_nop() + - ftrace_make_call() + - ftrace_update_ftrace_func() + +First you will need to fill out some arch details in your asm/ftrace.h. + +Define MCOUNT_ADDR as the address of your mcount symbol similar to: + #define MCOUNT_ADDR ((unsigned long)mcount) +Since no one else will have a decl for that function, you will need to: + extern void mcount(void); + +You will also need the helper function ftrace_call_adjust(). Most people +will be able to stub it out like so: + static inline unsigned long ftrace_call_adjust(unsigned long addr) + { + return addr; + } +<details to be filled> + +Lastly you will need the custom dyn_arch_ftrace structure. If you need +some extra state when runtime patching arbitrary call sites, this is the +place. For now though, create an empty struct: + struct dyn_arch_ftrace { + /* No extra data needed */ + }; + +With the header out of the way, we can fill out the assembly code. While we +did already create a mcount() function earlier, dynamic ftrace only wants a +stub function. This is because the mcount() will only be used during boot +and then all references to it will be patched out never to return. Instead, +the guts of the old mcount() will be used to create a new ftrace_caller() +function. Because the two are hard to merge, it will most likely be a lot +easier to have two separate definitions split up by #ifdefs. Same goes for +the ftrace_stub() as that will now be inlined in ftrace_caller(). + +Before we get confused anymore, let's check out some pseudo code so you can +implement your own stuff in assembly: + +void mcount(void) +{ + return; +} + +void ftrace_caller(void) +{ + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + +ftrace_call: + ftrace_stub(frompc, selfpc); + + /* restore all state needed by the ABI */ + +ftrace_stub: + return; +} + +This might look a little odd at first, but keep in mind that we will be runtime +patching multiple things. First, only functions that we actually want to trace +will be patched to call ftrace_caller(). Second, since we only have one tracer +active at a time, we will patch the ftrace_caller() function itself to call the +specific tracer in question. That is the point of the ftrace_call label. + +With that in mind, let's move on to the C code that will actually be doing the +runtime patching. You'll need a little knowledge of your arch's opcodes in +order to make it through the next section. + +Every arch has an init callback function. If you need to do something early on +to initialize some state, this is the time to do that. Otherwise, this simple +function below should be sufficient for most people: + +int __init ftrace_dyn_arch_init(void) +{ + return 0; +} + +There are two functions that are used to do runtime patching of arbitrary +functions. The first is used to turn the mcount call site into a nop (which +is what helps us retain runtime performance when not tracing). The second is +used to turn the mcount call site into a call to an arbitrary location (but +typically that is ftracer_caller()). See the general function definition in +linux/ftrace.h for the functions: + ftrace_make_nop() + ftrace_make_call() +The rec->ip value is the address of the mcount call site that was collected +by the scripts/recordmcount.pl during build time. + +The last function is used to do runtime patching of the active tracer. This +will be modifying the assembly code at the location of the ftrace_call symbol +inside of the ftrace_caller() function. So you should have sufficient padding +at that location to support the new function calls you'll be inserting. Some +people will be using a "call" type instruction while others will be using a +"branch" type instruction. Specifically, the function is: + ftrace_update_ftrace_func() + + +HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER +------------------------------------------------ + +The function grapher needs a few tweaks in order to work with dynamic ftrace. +Basically, you will need to: + - update: + - ftrace_caller() + - ftrace_graph_call() + - ftrace_graph_caller() + - implement: + - ftrace_enable_ftrace_graph_caller() + - ftrace_disable_ftrace_graph_caller() + +<details to be filled> +Quick notes: + - add a nop stub after the ftrace_call location named ftrace_graph_call; + stub needs to be large enough to support a call to ftrace_graph_caller() + - update ftrace_graph_caller() to work with being called by the new + ftrace_caller() since some semantics may have changed + - ftrace_enable_ftrace_graph_caller() will runtime patch the + ftrace_graph_call location with a call to ftrace_graph_caller() + - ftrace_disable_ftrace_graph_caller() will runtime patch the + ftrace_graph_call location with nops diff --git a/kernel/Documentation/trace/ftrace.txt b/kernel/Documentation/trace/ftrace.txt new file mode 100644 index 000000000..572ca9236 --- /dev/null +++ b/kernel/Documentation/trace/ftrace.txt @@ -0,0 +1,2846 @@ + ftrace - Function Tracer + ======================== + +Copyright 2008 Red Hat Inc. + Author: Steven Rostedt <srostedt@redhat.com> + License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, + John Kacur, and David Teigland. +Written for: 2.6.28-rc2 +Updated for: 3.10 + +Introduction +------------ + +Ftrace is an internal tracer designed to help out developers and +designers of systems to find what is going on inside the kernel. +It can be used for debugging or analyzing latencies and +performance issues that take place outside of user-space. + +Although ftrace is typically considered the function tracer, it +is really a frame work of several assorted tracing utilities. +There's latency tracing to examine what occurs between interrupts +disabled and enabled, as well as for preemption and from a time +a task is woken to the task is actually scheduled in. + +One of the most common uses of ftrace is the event tracing. +Through out the kernel is hundreds of static event points that +can be enabled via the debugfs file system to see what is +going on in certain parts of the kernel. + + +Implementation Details +---------------------- + +See ftrace-design.txt for details for arch porters and such. + + +The File System +--------------- + +Ftrace uses the debugfs file system to hold the control files as +well as the files to display output. + +When debugfs is configured into the kernel (which selecting any ftrace +option will do) the directory /sys/kernel/debug will be created. To mount +this directory, you can add to your /etc/fstab file: + + debugfs /sys/kernel/debug debugfs defaults 0 0 + +Or you can mount it at run time with: + + mount -t debugfs nodev /sys/kernel/debug + +For quicker access to that directory you may want to make a soft link to +it: + + ln -s /sys/kernel/debug /debug + +Any selected ftrace option will also create a directory called tracing +within the debugfs. The rest of the document will assume that you are in +the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate +on the files within that directory and not distract from the content with +the extended "/sys/kernel/debug/tracing" path name. + +That's it! (assuming that you have ftrace configured into your kernel) + +After mounting debugfs, you can see a directory called +"tracing". This directory contains the control and output files +of ftrace. Here is a list of some of the key files: + + + Note: all time values are in microseconds. + + current_tracer: + + This is used to set or display the current tracer + that is configured. + + available_tracers: + + This holds the different types of tracers that + have been compiled into the kernel. The + tracers listed here can be configured by + echoing their name into current_tracer. + + tracing_on: + + This sets or displays whether writing to the trace + ring buffer is enabled. Echo 0 into this file to disable + the tracer or 1 to enable it. Note, this only disables + writing to the ring buffer, the tracing overhead may + still be occurring. + + trace: + + This file holds the output of the trace in a human + readable format (described below). + + trace_pipe: + + The output is the same as the "trace" file but this + file is meant to be streamed with live tracing. + Reads from this file will block until new data is + retrieved. Unlike the "trace" file, this file is a + consumer. This means reading from this file causes + sequential reads to display more current data. Once + data is read from this file, it is consumed, and + will not be read again with a sequential read. The + "trace" file is static, and if the tracer is not + adding more data,they will display the same + information every time they are read. + + trace_options: + + This file lets the user control the amount of data + that is displayed in one of the above output + files. Options also exist to modify how a tracer + or events work (stack traces, timestamps, etc). + + options: + + This is a directory that has a file for every available + trace option (also in trace_options). Options may also be set + or cleared by writing a "1" or "0" respectively into the + corresponding file with the option name. + + tracing_max_latency: + + Some of the tracers record the max latency. + For example, the time interrupts are disabled. + This time is saved in this file. The max trace + will also be stored, and displayed by "trace". + A new max trace will only be recorded if the + latency is greater than the value in this + file. (in microseconds) + + tracing_thresh: + + Some latency tracers will record a trace whenever the + latency is greater than the number in this file. + Only active when the file contains a number greater than 0. + (in microseconds) + + buffer_size_kb: + + This sets or displays the number of kilobytes each CPU + buffer holds. By default, the trace buffers are the same size + for each CPU. The displayed number is the size of the + CPU buffer and not total size of all buffers. The + trace buffers are allocated in pages (blocks of memory + that the kernel uses for allocation, usually 4 KB in size). + If the last page allocated has room for more bytes + than requested, the rest of the page will be used, + making the actual allocation bigger than requested. + ( Note, the size may not be a multiple of the page size + due to buffer management meta-data. ) + + buffer_total_size_kb: + + This displays the total combined size of all the trace buffers. + + free_buffer: + + If a process is performing the tracing, and the ring buffer + should be shrunk "freed" when the process is finished, even + if it were to be killed by a signal, this file can be used + for that purpose. On close of this file, the ring buffer will + be resized to its minimum size. Having a process that is tracing + also open this file, when the process exits its file descriptor + for this file will be closed, and in doing so, the ring buffer + will be "freed". + + It may also stop tracing if disable_on_free option is set. + + tracing_cpumask: + + This is a mask that lets the user only trace + on specified CPUs. The format is a hex string + representing the CPUs. + + set_ftrace_filter: + + When dynamic ftrace is configured in (see the + section below "dynamic ftrace"), the code is dynamically + modified (code text rewrite) to disable calling of the + function profiler (mcount). This lets tracing be configured + in with practically no overhead in performance. This also + has a side effect of enabling or disabling specific functions + to be traced. Echoing names of functions into this file + will limit the trace to only those functions. + + This interface also allows for commands to be used. See the + "Filter commands" section for more details. + + set_ftrace_notrace: + + This has an effect opposite to that of + set_ftrace_filter. Any function that is added here will not + be traced. If a function exists in both set_ftrace_filter + and set_ftrace_notrace, the function will _not_ be traced. + + set_ftrace_pid: + + Have the function tracer only trace a single thread. + + set_graph_function: + + Set a "trigger" function where tracing should start + with the function graph tracer (See the section + "dynamic ftrace" for more details). + + available_filter_functions: + + This lists the functions that ftrace + has processed and can trace. These are the function + names that you can pass to "set_ftrace_filter" or + "set_ftrace_notrace". (See the section "dynamic ftrace" + below for more details.) + + enabled_functions: + + This file is more for debugging ftrace, but can also be useful + in seeing if any function has a callback attached to it. + Not only does the trace infrastructure use ftrace function + trace utility, but other subsystems might too. This file + displays all functions that have a callback attached to them + as well as the number of callbacks that have been attached. + Note, a callback may also call multiple functions which will + not be listed in this count. + + If the callback registered to be traced by a function with + the "save regs" attribute (thus even more overhead), a 'R' + will be displayed on the same line as the function that + is returning registers. + + If the callback registered to be traced by a function with + the "ip modify" attribute (thus the regs->ip can be changed), + an 'I' will be displayed on the same line as the function that + can be overridden. + + function_profile_enabled: + + When set it will enable all functions with either the function + tracer, or if enabled, the function graph tracer. It will + keep a histogram of the number of functions that were called + and if run with the function graph tracer, it will also keep + track of the time spent in those functions. The histogram + content can be displayed in the files: + + trace_stats/function<cpu> ( function0, function1, etc). + + trace_stats: + + A directory that holds different tracing stats. + + kprobe_events: + + Enable dynamic trace points. See kprobetrace.txt. + + kprobe_profile: + + Dynamic trace points stats. See kprobetrace.txt. + + max_graph_depth: + + Used with the function graph tracer. This is the max depth + it will trace into a function. Setting this to a value of + one will show only the first kernel function that is called + from user space. + + printk_formats: + + This is for tools that read the raw format files. If an event in + the ring buffer references a string (currently only trace_printk() + does this), only a pointer to the string is recorded into the buffer + and not the string itself. This prevents tools from knowing what + that string was. This file displays the string and address for + the string allowing tools to map the pointers to what the + strings were. + + saved_cmdlines: + + Only the pid of the task is recorded in a trace event unless + the event specifically saves the task comm as well. Ftrace + makes a cache of pid mappings to comms to try to display + comms for events. If a pid for a comm is not listed, then + "<...>" is displayed in the output. + + snapshot: + + This displays the "snapshot" buffer and also lets the user + take a snapshot of the current running trace. + See the "Snapshot" section below for more details. + + stack_max_size: + + When the stack tracer is activated, this will display the + maximum stack size it has encountered. + See the "Stack Trace" section below. + + stack_trace: + + This displays the stack back trace of the largest stack + that was encountered when the stack tracer is activated. + See the "Stack Trace" section below. + + stack_trace_filter: + + This is similar to "set_ftrace_filter" but it limits what + functions the stack tracer will check. + + trace_clock: + + Whenever an event is recorded into the ring buffer, a + "timestamp" is added. This stamp comes from a specified + clock. By default, ftrace uses the "local" clock. This + clock is very fast and strictly per cpu, but on some + systems it may not be monotonic with respect to other + CPUs. In other words, the local clocks may not be in sync + with local clocks on other CPUs. + + Usual clocks for tracing: + + # cat trace_clock + [local] global counter x86-tsc + + local: Default clock, but may not be in sync across CPUs + + global: This clock is in sync with all CPUs but may + be a bit slower than the local clock. + + counter: This is not a clock at all, but literally an atomic + counter. It counts up one by one, but is in sync + with all CPUs. This is useful when you need to + know exactly the order events occurred with respect to + each other on different CPUs. + + uptime: This uses the jiffies counter and the time stamp + is relative to the time since boot up. + + perf: This makes ftrace use the same clock that perf uses. + Eventually perf will be able to read ftrace buffers + and this will help out in interleaving the data. + + x86-tsc: Architectures may define their own clocks. For + example, x86 uses its own TSC cycle clock here. + + To set a clock, simply echo the clock name into this file. + + echo global > trace_clock + + trace_marker: + + This is a very useful file for synchronizing user space + with events happening in the kernel. Writing strings into + this file will be written into the ftrace buffer. + + It is useful in applications to open this file at the start + of the application and just reference the file descriptor + for the file. + + void trace_write(const char *fmt, ...) + { + va_list ap; + char buf[256]; + int n; + + if (trace_fd < 0) + return; + + va_start(ap, fmt); + n = vsnprintf(buf, 256, fmt, ap); + va_end(ap); + + write(trace_fd, buf, n); + } + + start: + + trace_fd = open("trace_marker", WR_ONLY); + + uprobe_events: + + Add dynamic tracepoints in programs. + See uprobetracer.txt + + uprobe_profile: + + Uprobe statistics. See uprobetrace.txt + + instances: + + This is a way to make multiple trace buffers where different + events can be recorded in different buffers. + See "Instances" section below. + + events: + + This is the trace event directory. It holds event tracepoints + (also known as static tracepoints) that have been compiled + into the kernel. It shows what event tracepoints exist + and how they are grouped by system. There are "enable" + files at various levels that can enable the tracepoints + when a "1" is written to them. + + See events.txt for more information. + + per_cpu: + + This is a directory that contains the trace per_cpu information. + + per_cpu/cpu0/buffer_size_kb: + + The ftrace buffer is defined per_cpu. That is, there's a separate + buffer for each CPU to allow writes to be done atomically, + and free from cache bouncing. These buffers may have different + size buffers. This file is similar to the buffer_size_kb + file, but it only displays or sets the buffer size for the + specific CPU. (here cpu0). + + per_cpu/cpu0/trace: + + This is similar to the "trace" file, but it will only display + the data specific for the CPU. If written to, it only clears + the specific CPU buffer. + + per_cpu/cpu0/trace_pipe + + This is similar to the "trace_pipe" file, and is a consuming + read, but it will only display (and consume) the data specific + for the CPU. + + per_cpu/cpu0/trace_pipe_raw + + For tools that can parse the ftrace ring buffer binary format, + the trace_pipe_raw file can be used to extract the data + from the ring buffer directly. With the use of the splice() + system call, the buffer data can be quickly transferred to + a file or to the network where a server is collecting the + data. + + Like trace_pipe, this is a consuming reader, where multiple + reads will always produce different data. + + per_cpu/cpu0/snapshot: + + This is similar to the main "snapshot" file, but will only + snapshot the current CPU (if supported). It only displays + the content of the snapshot for a given CPU, and if + written to, only clears this CPU buffer. + + per_cpu/cpu0/snapshot_raw: + + Similar to the trace_pipe_raw, but will read the binary format + from the snapshot buffer for the given CPU. + + per_cpu/cpu0/stats: + + This displays certain stats about the ring buffer: + + entries: The number of events that are still in the buffer. + + overrun: The number of lost events due to overwriting when + the buffer was full. + + commit overrun: Should always be zero. + This gets set if so many events happened within a nested + event (ring buffer is re-entrant), that it fills the + buffer and starts dropping events. + + bytes: Bytes actually read (not overwritten). + + oldest event ts: The oldest timestamp in the buffer + + now ts: The current timestamp + + dropped events: Events lost due to overwrite option being off. + + read events: The number of events read. + +The Tracers +----------- + +Here is the list of current tracers that may be configured. + + "function" + + Function call tracer to trace all kernel functions. + + "function_graph" + + Similar to the function tracer except that the + function tracer probes the functions on their entry + whereas the function graph tracer traces on both entry + and exit of the functions. It then provides the ability + to draw a graph of function calls similar to C code + source. + + "irqsoff" + + Traces the areas that disable interrupts and saves + the trace with the longest max latency. + See tracing_max_latency. When a new max is recorded, + it replaces the old trace. It is best to view this + trace with the latency-format option enabled. + + "preemptoff" + + Similar to irqsoff but traces and records the amount of + time for which preemption is disabled. + + "preemptirqsoff" + + Similar to irqsoff and preemptoff, but traces and + records the largest time for which irqs and/or preemption + is disabled. + + "wakeup" + + Traces and records the max latency that it takes for + the highest priority task to get scheduled after + it has been woken up. + Traces all tasks as an average developer would expect. + + "wakeup_rt" + + Traces and records the max latency that it takes for just + RT tasks (as the current "wakeup" does). This is useful + for those interested in wake up timings of RT tasks. + + "nop" + + This is the "trace nothing" tracer. To remove all + tracers from tracing simply echo "nop" into + current_tracer. + + +Examples of using the tracer +---------------------------- + +Here are typical examples of using the tracers when controlling +them only with the debugfs interface (without using any +user-land utilities). + +Output format: +-------------- + +Here is an example of the output format of the file "trace" + + -------- +# tracer: function +# +# entries-in-buffer/entries-written: 140080/250280 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath + bash-1977 [000] .... 17284.993653: __close_fd <-sys_close + bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd + sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify + bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock + bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd + bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock + bash-1977 [000] .... 17284.993657: filp_close <-__close_fd + bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close + sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath + -------- + +A header is printed with the tracer name that is represented by +the trace. In this case the tracer is "function". Then it shows the +number of events in the buffer as well as the total number of entries +that were written. The difference is the number of entries that were +lost due to the buffer filling up (250280 - 140080 = 110200 events +lost). + +The header explains the content of the events. Task name "bash", the task +PID "1977", the CPU that it was running on "000", the latency format +(explained below), the timestamp in <secs>.<usecs> format, the +function name that was traced "sys_close" and the parent function that +called this function "system_call_fastpath". The timestamp is the time +at which the function was entered. + +Latency trace format +-------------------- + +When the latency-format option is enabled or when one of the latency +tracers is set, the trace file gives somewhat more information to see +why a latency happened. Here is a typical trace. + +# tracer: irqsoff +# +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: __lock_task_sighand +# => ended at: _raw_spin_unlock_irqrestore +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand + ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore + ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore + ps-6143 2d..1 306us : <stack trace> + => trace_hardirqs_on_caller + => trace_hardirqs_on + => _raw_spin_unlock_irqrestore + => do_task_stat + => proc_tgid_stat + => proc_single_show + => seq_read + => vfs_read + => sys_read + => system_call_fastpath + + +This shows that the current tracer is "irqsoff" tracing the time +for which interrupts were disabled. It gives the trace version (which +never changes) and the version of the kernel upon which this was executed on +(3.10). Then it displays the max latency in microseconds (259 us). The number +of trace entries displayed and the total number (both are four: #4/4). +VP, KP, SP, and HP are always zero and are reserved for later use. +#P is the number of online CPUs (#P:4). + +The task is the process that was running when the latency +occurred. (ps pid: 6143). + +The start and stop (the functions in which the interrupts were +disabled and enabled respectively) that caused the latencies: + + __lock_task_sighand is where the interrupts were disabled. + _raw_spin_unlock_irqrestore is where they were enabled again. + +The next lines after the header are the trace itself. The header +explains which is which. + + cmd: The name of the process in the trace. + + pid: The PID of that process. + + CPU#: The CPU which the process was running on. + + irqs-off: 'd' interrupts are disabled. '.' otherwise. + Note: If the architecture does not support a way to + read the irq flags variable, an 'X' will always + be printed here. + + need-resched: + 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set, + 'n' only TIF_NEED_RESCHED is set, + 'p' only PREEMPT_NEED_RESCHED is set, + '.' otherwise. + + hardirq/softirq: + 'H' - hard irq occurred inside a softirq. + 'h' - hard irq is running + 's' - soft irq is running + '.' - normal context. + + preempt-depth: The level of preempt_disabled + +The above is mostly meaningful for kernel developers. + + time: When the latency-format option is enabled, the trace file + output includes a timestamp relative to the start of the + trace. This differs from the output when latency-format + is disabled, which includes an absolute timestamp. + + delay: This is just to help catch your eye a bit better. And + needs to be fixed to be only relative to the same CPU. + The marks are determined by the difference between this + current trace and the next trace. + '$' - greater than 1 second + '#' - greater than 1000 microsecond + '!' - greater than 100 microsecond + '+' - greater than 10 microsecond + ' ' - less than or equal to 10 microsecond. + + The rest is the same as the 'trace' file. + + Note, the latency tracers will usually end with a back trace + to easily find where the latency occurred. + +trace_options +------------- + +The trace_options file (or the options directory) is used to control +what gets printed in the trace output, or manipulate the tracers. +To see what is available, simply cat the file: + + cat trace_options +print-parent +nosym-offset +nosym-addr +noverbose +noraw +nohex +nobin +noblock +nostacktrace +trace_printk +noftrace_preempt +nobranch +annotate +nouserstacktrace +nosym-userobj +noprintk-msg-only +context-info +latency-format +sleep-time +graph-time +record-cmd +overwrite +nodisable_on_free +irq-info +markers +function-trace + +To disable one of the options, echo in the option prepended with +"no". + + echo noprint-parent > trace_options + +To enable an option, leave off the "no". + + echo sym-offset > trace_options + +Here are the available options: + + print-parent - On function traces, display the calling (parent) + function as well as the function being traced. + + print-parent: + bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul + + noprint-parent: + bash-4000 [01] 1477.606694: simple_strtoul + + + sym-offset - Display not only the function name, but also the + offset in the function. For example, instead of + seeing just "ktime_get", you will see + "ktime_get+0xb/0x20". + + sym-offset: + bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 + + sym-addr - this will also display the function address as well + as the function name. + + sym-addr: + bash-4000 [01] 1477.606694: simple_strtoul <c0339346> + + verbose - This deals with the trace file when the + latency-format option is enabled. + + bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ + (+0.000ms): simple_strtoul (kstrtoul) + + raw - This will display raw numbers. This option is best for + use with user applications that can translate the raw + numbers better than having it done in the kernel. + + hex - Similar to raw, but the numbers will be in a hexadecimal + format. + + bin - This will print out the formats in raw binary. + + block - When set, reading trace_pipe will not block when polled. + + stacktrace - This is one of the options that changes the trace + itself. When a trace is recorded, so is the stack + of functions. This allows for back traces of + trace sites. + + trace_printk - Can disable trace_printk() from writing into the buffer. + + branch - Enable branch tracing with the tracer. + + annotate - It is sometimes confusing when the CPU buffers are full + and one CPU buffer had a lot of events recently, thus + a shorter time frame, were another CPU may have only had + a few events, which lets it have older events. When + the trace is reported, it shows the oldest events first, + and it may look like only one CPU ran (the one with the + oldest events). When the annotate option is set, it will + display when a new CPU buffer started: + + <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on + <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on + <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore +##### CPU 2 buffer started #### + <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle + <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog + <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock + + userstacktrace - This option changes the trace. It records a + stacktrace of the current userspace thread. + + sym-userobj - when user stacktrace are enabled, look up which + object the address belongs to, and print a + relative address. This is especially useful when + ASLR is on, otherwise you don't get a chance to + resolve the address to object/file/line after + the app is no longer running + + The lookup is performed when you read + trace,trace_pipe. Example: + + a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 +x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] + + + printk-msg-only - When set, trace_printk()s will only show the format + and not their parameters (if trace_bprintk() or + trace_bputs() was used to save the trace_printk()). + + context-info - Show only the event data. Hides the comm, PID, + timestamp, CPU, and other useful data. + + latency-format - This option changes the trace. When + it is enabled, the trace displays + additional information about the + latencies, as described in "Latency + trace format". + + sleep-time - When running function graph tracer, to include + the time a task schedules out in its function. + When enabled, it will account time the task has been + scheduled out as part of the function call. + + graph-time - When running function graph tracer, to include the + time to call nested functions. When this is not set, + the time reported for the function will only include + the time the function itself executed for, not the time + for functions that it called. + + record-cmd - When any event or tracer is enabled, a hook is enabled + in the sched_switch trace point to fill comm cache + with mapped pids and comms. But this may cause some + overhead, and if you only care about pids, and not the + name of the task, disabling this option can lower the + impact of tracing. + + overwrite - This controls what happens when the trace buffer is + full. If "1" (default), the oldest events are + discarded and overwritten. If "0", then the newest + events are discarded. + (see per_cpu/cpu0/stats for overrun and dropped) + + disable_on_free - When the free_buffer is closed, tracing will + stop (tracing_on set to 0). + + irq-info - Shows the interrupt, preempt count, need resched data. + When disabled, the trace looks like: + +# tracer: function +# +# entries-in-buffer/entries-written: 144405/9452052 #P:4 +# +# TASK-PID CPU# TIMESTAMP FUNCTION +# | | | | | + <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up + <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89 + <idle>-0 [002] 23636.756055: enqueue_task <-activate_task + + + markers - When set, the trace_marker is writable (only by root). + When disabled, the trace_marker will error with EINVAL + on write. + + + function-trace - The latency tracers will enable function tracing + if this option is enabled (default it is). When + it is disabled, the latency tracers do not trace + functions. This keeps the overhead of the tracer down + when performing latency tests. + + Note: Some tracers have their own options. They only appear + when the tracer is active. + + + +irqsoff +------- + +When interrupts are disabled, the CPU can not react to any other +external event (besides NMIs and SMIs). This prevents the timer +interrupt from triggering or the mouse interrupt from letting +the kernel know of a new mouse event. The result is a latency +with the reaction time. + +The irqsoff tracer tracks the time for which interrupts are +disabled. When a new maximum latency is hit, the tracer saves +the trace leading up to that latency point so that every time a +new maximum is reached, the old saved trace is discarded and the +new trace is saved. + +To reset the maximum, echo 0 into tracing_max_latency. Here is +an example: + + # echo 0 > options/function-trace + # echo irqsoff > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # ls -ltr + [...] + # echo 0 > tracing_on + # cat trace +# tracer: irqsoff +# +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: run_timer_softirq +# => ended at: run_timer_softirq +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq + <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq + <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq + <idle>-0 0dNs3 25us : <stack trace> + => _raw_spin_unlock_irq + => run_timer_softirq + => __do_softirq + => call_softirq + => do_softirq + => irq_exit + => smp_apic_timer_interrupt + => apic_timer_interrupt + => rcu_idle_exit + => cpu_idle + => rest_init + => start_kernel + => x86_64_start_reservations + => x86_64_start_kernel + +Here we see that that we had a latency of 16 microseconds (which is +very good). The _raw_spin_lock_irq in run_timer_softirq disabled +interrupts. The difference between the 16 and the displayed +timestamp 25us occurred because the clock was incremented +between the time of recording the max latency and the time of +recording the function that had that latency. + +Note the above example had function-trace not set. If we set +function-trace, we get a much larger output: + + with echo 1 > options/function-trace + +# tracer: irqsoff +# +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: ata_scsi_queuecmd +# => ended at: ata_scsi_queuecmd +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd + bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave + bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd + bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev + bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev + bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd + bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd + bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd + bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat +[...] + bash-2042 3d..1 67us : delay_tsc <-__delay + bash-2042 3d..1 67us : add_preempt_count <-delay_tsc + bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc + bash-2042 3d..1 67us : add_preempt_count <-delay_tsc + bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc + bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue + bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd + bash-2042 3d..1 120us : <stack trace> + => _raw_spin_unlock_irqrestore + => ata_scsi_queuecmd + => scsi_dispatch_cmd + => scsi_request_fn + => __blk_run_queue_uncond + => __blk_run_queue + => blk_queue_bio + => generic_make_request + => submit_bio + => submit_bh + => __ext3_get_inode_loc + => ext3_iget + => ext3_lookup + => lookup_real + => __lookup_hash + => walk_component + => lookup_last + => path_lookupat + => filename_lookup + => user_path_at_empty + => user_path_at + => vfs_fstatat + => vfs_stat + => sys_newstat + => system_call_fastpath + + +Here we traced a 71 microsecond latency. But we also see all the +functions that were called during that time. Note that by +enabling function tracing, we incur an added overhead. This +overhead may extend the latency times. But nevertheless, this +trace has provided some very helpful debugging information. + + +preemptoff +---------- + +When preemption is disabled, we may be able to receive +interrupts but the task cannot be preempted and a higher +priority task must wait for preemption to be enabled again +before it can preempt a lower priority task. + +The preemptoff tracer traces the places that disable preemption. +Like the irqsoff tracer, it records the maximum latency for +which preemption was disabled. The control of preemptoff tracer +is much like the irqsoff tracer. + + # echo 0 > options/function-trace + # echo preemptoff > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # ls -ltr + [...] + # echo 0 > tracing_on + # cat trace +# tracer: preemptoff +# +# preemptoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: do_IRQ +# => ended at: do_IRQ +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ + sshd-1991 1d..1 46us : irq_exit <-do_IRQ + sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ + sshd-1991 1d..1 52us : <stack trace> + => sub_preempt_count + => irq_exit + => do_IRQ + => ret_from_intr + + +This has some more changes. Preemption was disabled when an +interrupt came in (notice the 'h'), and was enabled on exit. +But we also see that interrupts have been disabled when entering +the preempt off section and leaving it (the 'd'). We do not know if +interrupts were enabled in the mean time or shortly after this +was over. + +# tracer: preemptoff +# +# preemptoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: wake_up_new_task +# => ended at: task_rq_unlock +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task + bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq + bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair + bash-1994 1d..1 1us : source_load <-select_task_rq_fair + bash-1994 1d..1 1us : source_load <-select_task_rq_fair +[...] + bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt + bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter + bash-1994 1d..1 13us : add_preempt_count <-irq_enter + bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt + bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt + bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt + bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock + bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt +[...] + bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event + bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt + bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit + bash-1994 1d..2 36us : do_softirq <-irq_exit + bash-1994 1d..2 36us : __do_softirq <-call_softirq + bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq + bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq + bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq + bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock + bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq +[...] + bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks + bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq + bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable + bash-1994 1dN.2 82us : idle_cpu <-irq_exit + bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit + bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit + bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock + bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock + bash-1994 1.N.1 104us : <stack trace> + => sub_preempt_count + => _raw_spin_unlock_irqrestore + => task_rq_unlock + => wake_up_new_task + => do_fork + => sys_clone + => stub_clone + + +The above is an example of the preemptoff trace with +function-trace set. Here we see that interrupts were not disabled +the entire time. The irq_enter code lets us know that we entered +an interrupt 'h'. Before that, the functions being traced still +show that it is not in an interrupt, but we can see from the +functions themselves that this is not the case. + +preemptirqsoff +-------------- + +Knowing the locations that have interrupts disabled or +preemption disabled for the longest times is helpful. But +sometimes we would like to know when either preemption and/or +interrupts are disabled. + +Consider the following code: + + local_irq_disable(); + call_function_with_irqs_off(); + preempt_disable(); + call_function_with_irqs_and_preemption_off(); + local_irq_enable(); + call_function_with_preemption_off(); + preempt_enable(); + +The irqsoff tracer will record the total length of +call_function_with_irqs_off() and +call_function_with_irqs_and_preemption_off(). + +The preemptoff tracer will record the total length of +call_function_with_irqs_and_preemption_off() and +call_function_with_preemption_off(). + +But neither will trace the time that interrupts and/or +preemption is disabled. This total time is the time that we can +not schedule. To record this time, use the preemptirqsoff +tracer. + +Again, using this trace is much like the irqsoff and preemptoff +tracers. + + # echo 0 > options/function-trace + # echo preemptirqsoff > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # ls -ltr + [...] + # echo 0 > tracing_on + # cat trace +# tracer: preemptirqsoff +# +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: ata_scsi_queuecmd +# => ended at: ata_scsi_queuecmd +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd + ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd + ls-2230 3...1 111us : <stack trace> + => sub_preempt_count + => _raw_spin_unlock_irqrestore + => ata_scsi_queuecmd + => scsi_dispatch_cmd + => scsi_request_fn + => __blk_run_queue_uncond + => __blk_run_queue + => blk_queue_bio + => generic_make_request + => submit_bio + => submit_bh + => ext3_bread + => ext3_dir_bread + => htree_dirblock_to_tree + => ext3_htree_fill_tree + => ext3_readdir + => vfs_readdir + => sys_getdents + => system_call_fastpath + + +The trace_hardirqs_off_thunk is called from assembly on x86 when +interrupts are disabled in the assembly code. Without the +function tracing, we do not know if interrupts were enabled +within the preemption points. We do see that it started with +preemption enabled. + +Here is a trace with function-trace set: + +# tracer: preemptirqsoff +# +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: schedule +# => ended at: mutex_unlock +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / +kworker/-59 3...1 0us : __schedule <-schedule +kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch +kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq +kworker/-59 3d..2 1us : deactivate_task <-__schedule +kworker/-59 3d..2 1us : dequeue_task <-deactivate_task +kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task +kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task +kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair +kworker/-59 3d..2 2us : update_min_vruntime <-update_curr +kworker/-59 3d..2 3us : cpuacct_charge <-update_curr +kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge +kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge +kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair +kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair +kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair +kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair +kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair +kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair +kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule +kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping +kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule +kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task +kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair +kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair +kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity + ls-2269 3d..2 7us : finish_task_switch <-__schedule + ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch + ls-2269 3d..2 8us : do_IRQ <-ret_from_intr + ls-2269 3d..2 8us : irq_enter <-do_IRQ + ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter + ls-2269 3d..2 9us : add_preempt_count <-irq_enter + ls-2269 3d.h2 9us : exit_idle <-do_IRQ +[...] + ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock + ls-2269 3d.h2 20us : irq_exit <-do_IRQ + ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit + ls-2269 3d..3 21us : do_softirq <-irq_exit + ls-2269 3d..3 21us : __do_softirq <-call_softirq + ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq + ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip + ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip + ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr + ls-2269 3d.s5 31us : irq_enter <-do_IRQ + ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter +[...] + ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter + ls-2269 3d.s5 32us : add_preempt_count <-irq_enter + ls-2269 3d.H5 32us : exit_idle <-do_IRQ + ls-2269 3d.H5 32us : handle_irq <-do_IRQ + ls-2269 3d.H5 32us : irq_to_desc <-handle_irq + ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq +[...] + ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll + ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action + ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq + ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable + ls-2269 3d..3 159us : idle_cpu <-irq_exit + ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit + ls-2269 3d..3 160us : sub_preempt_count <-irq_exit + ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock + ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock + ls-2269 3d... 186us : <stack trace> + => __mutex_unlock_slowpath + => mutex_unlock + => process_output + => n_tty_write + => tty_write + => vfs_write + => sys_write + => system_call_fastpath + +This is an interesting trace. It started with kworker running and +scheduling out and ls taking over. But as soon as ls released the +rq lock and enabled interrupts (but not preemption) an interrupt +triggered. When the interrupt finished, it started running softirqs. +But while the softirq was running, another interrupt triggered. +When an interrupt is running inside a softirq, the annotation is 'H'. + + +wakeup +------ + +One common case that people are interested in tracing is the +time it takes for a task that is woken to actually wake up. +Now for non Real-Time tasks, this can be arbitrary. But tracing +it none the less can be interesting. + +Without function tracing: + + # echo 0 > options/function-trace + # echo wakeup > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup +# +# wakeup latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H + <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d..3 15us : __schedule <-schedule + <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H + +The tracer only traces the highest priority task in the system +to avoid tracing the normal circumstances. Here we see that +the kworker with a nice priority of -20 (not very nice), took +just 15 microseconds from the time it woke up, to the time it +ran. + +Non Real-Time tasks are not that interesting. A more interesting +trace is to concentrate only on Real-Time tasks. + +wakeup_rt +--------- + +In a Real-Time environment it is very important to know the +wakeup time it takes for the highest priority task that is woken +up to the time that it executes. This is also known as "schedule +latency". I stress the point that this is about RT tasks. It is +also important to know the scheduling latency of non-RT tasks, +but the average schedule latency is better for non-RT tasks. +Tools like LatencyTop are more appropriate for such +measurements. + +Real-Time environments are interested in the worst case latency. +That is the longest latency it takes for something to happen, +and not the average. We can have a very fast scheduler that may +only have a large latency once in a while, but that would not +work well with Real-Time tasks. The wakeup_rt tracer was designed +to record the worst case wakeups of RT tasks. Non-RT tasks are +not recorded because the tracer only records one worst case and +tracing non-RT tasks that are unpredictable will overwrite the +worst case latency of RT tasks (just run the normal wakeup +tracer for a while to see that effect). + +Since this tracer only deals with RT tasks, we will run this +slightly differently than we did with the previous tracers. +Instead of performing an 'ls', we will run 'sleep 1' under +'chrt' which changes the priority of the task. + + # echo 0 > options/function-trace + # echo wakeup_rt > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup +# +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep + <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d..3 5us : __schedule <-schedule + <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep + + +Running this on an idle system, we see that it only took 5 microseconds +to perform the task switch. Note, since the trace point in the schedule +is before the actual "switch", we stop the tracing when the recorded task +is about to schedule in. This may change if we add a new marker at the +end of the scheduler. + +Notice that the recorded task is 'sleep' with the PID of 2389 +and it has an rt_prio of 5. This priority is user-space priority +and not the internal kernel priority. The policy is 1 for +SCHED_FIFO and 2 for SCHED_RR. + +Note, that the trace data shows the internal priority (99 - rtprio). + + <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep + +The 0:120:R means idle was running with a nice priority of 0 (120 - 20) +and in the running state 'R'. The sleep task was scheduled in with +2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94) +and it too is in the running state. + +Doing the same with chrt -r 5 and function-trace set. + + echo 1 > options/function-trace + +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep + <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup + <idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr + <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup + <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up + <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up + <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up + <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer + <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock + <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt + <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt + <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event + <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event + <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event + <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt + <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit + <idle>-0 3dN.2 9us : idle_cpu <-irq_exit + <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit + <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit + <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit + <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle + <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit + <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle + <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit + <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz + <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock + <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load + <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel + <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel + <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 + <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave + <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16 + <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer + <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram + <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event + <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event + <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event + <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel + <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit + <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward + <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward + <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 + <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns + <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns + <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 + <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave + <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns + <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns + <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns + <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event + <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event + <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event + <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns + <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit + <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks + <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle + <idle>-0 3.N.. 25us : schedule <-cpu_idle + <idle>-0 3.N.. 25us : __schedule <-preempt_schedule + <idle>-0 3.N.. 26us : add_preempt_count <-__schedule + <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule + <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch + <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch + <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule + <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq + <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule + <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task + <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task + <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt + <idle>-0 3d..3 29us : __schedule <-preempt_schedule + <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep + +This isn't that big of a trace, even with function tracing enabled, +so I included the entire trace. + +The interrupt went off while when the system was idle. Somewhere +before task_woken_rt() was called, the NEED_RESCHED flag was set, +this is indicated by the first occurrence of the 'N' flag. + +Latency tracing and events +-------------------------- +As function tracing can induce a much larger latency, but without +seeing what happens within the latency it is hard to know what +caused it. There is a middle ground, and that is with enabling +events. + + # echo 0 > options/function-trace + # echo wakeup_rt > current_tracer + # echo 1 > events/enable + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep + <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002 + <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8 + <idle>-0 2.N.2 2us : power_end: cpu_id=2 + <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2 + <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0 + <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000 + <idle>-0 2.N.2 5us : rcu_utilization: Start context switch + <idle>-0 2.N.2 5us : rcu_utilization: End context switch + <idle>-0 2d..3 6us : __schedule <-schedule + <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep + + +function +-------- + +This tracer is the function tracer. Enabling the function tracer +can be done from the debug file system. Make sure the +ftrace_enabled is set; otherwise this tracer is a nop. +See the "ftrace_enabled" section below. + + # sysctl kernel.ftrace_enabled=1 + # echo function > current_tracer + # echo 1 > tracing_on + # usleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: function +# +# entries-in-buffer/entries-written: 24799/24799 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write + bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock + bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify + bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify + bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify + bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock + bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock + bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify +[...] + + +Note: function tracer uses ring buffers to store the above +entries. The newest data may overwrite the oldest data. +Sometimes using echo to stop the trace is not sufficient because +the tracing could have overwritten the data that you wanted to +record. For this reason, it is sometimes better to disable +tracing directly from a program. This allows you to stop the +tracing at the point that you hit the part that you are +interested in. To disable the tracing directly from a C program, +something like following code snippet can be used: + +int trace_fd; +[...] +int main(int argc, char *argv[]) { + [...] + trace_fd = open(tracing_file("tracing_on"), O_WRONLY); + [...] + if (condition_hit()) { + write(trace_fd, "0", 1); + } + [...] +} + + +Single thread tracing +--------------------- + +By writing into set_ftrace_pid you can trace a +single thread. For example: + +# cat set_ftrace_pid +no pid +# echo 3111 > set_ftrace_pid +# cat set_ftrace_pid +3111 +# echo function > current_tracer +# cat trace | head + # tracer: function + # + # TASK-PID CPU# TIMESTAMP FUNCTION + # | | | | | + yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return + yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range + yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel + yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel + yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll + yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll +# echo > set_ftrace_pid +# cat trace |head + # tracer: function + # + # TASK-PID CPU# TIMESTAMP FUNCTION + # | | | | | + ##### CPU 3 buffer started #### + yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait + yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry + yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry + yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit + yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit + +If you want to trace a function when executing, you could use +something like this simple program: + +#include <stdio.h> +#include <stdlib.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <fcntl.h> +#include <unistd.h> +#include <string.h> + +#define _STR(x) #x +#define STR(x) _STR(x) +#define MAX_PATH 256 + +const char *find_debugfs(void) +{ + static char debugfs[MAX_PATH+1]; + static int debugfs_found; + char type[100]; + FILE *fp; + + if (debugfs_found) + return debugfs; + + if ((fp = fopen("/proc/mounts","r")) == NULL) { + perror("/proc/mounts"); + return NULL; + } + + while (fscanf(fp, "%*s %" + STR(MAX_PATH) + "s %99s %*s %*d %*d\n", + debugfs, type) == 2) { + if (strcmp(type, "debugfs") == 0) + break; + } + fclose(fp); + + if (strcmp(type, "debugfs") != 0) { + fprintf(stderr, "debugfs not mounted"); + return NULL; + } + + strcat(debugfs, "/tracing/"); + debugfs_found = 1; + + return debugfs; +} + +const char *tracing_file(const char *file_name) +{ + static char trace_file[MAX_PATH+1]; + snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); + return trace_file; +} + +int main (int argc, char **argv) +{ + if (argc < 1) + exit(-1); + + if (fork() > 0) { + int fd, ffd; + char line[64]; + int s; + + ffd = open(tracing_file("current_tracer"), O_WRONLY); + if (ffd < 0) + exit(-1); + write(ffd, "nop", 3); + + fd = open(tracing_file("set_ftrace_pid"), O_WRONLY); + s = sprintf(line, "%d\n", getpid()); + write(fd, line, s); + + write(ffd, "function", 8); + + close(fd); + close(ffd); + + execvp(argv[1], argv+1); + } + + return 0; +} + +Or this simple script! + +------ +#!/bin/bash + +debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts` +echo nop > $debugfs/tracing/current_tracer +echo 0 > $debugfs/tracing/tracing_on +echo $$ > $debugfs/tracing/set_ftrace_pid +echo function > $debugfs/tracing/current_tracer +echo 1 > $debugfs/tracing/tracing_on +exec "$@" +------ + + +function graph tracer +--------------------------- + +This tracer is similar to the function tracer except that it +probes a function on its entry and its exit. This is done by +using a dynamically allocated stack of return addresses in each +task_struct. On function entry the tracer overwrites the return +address of each function traced to set a custom probe. Thus the +original return address is stored on the stack of return address +in the task_struct. + +Probing on both ends of a function leads to special features +such as: + +- measure of a function's time execution +- having a reliable call stack to draw function calls graph + +This tracer is useful in several situations: + +- you want to find the reason of a strange kernel behavior and + need to see what happens in detail on any areas (or specific + ones). + +- you are experiencing weird latencies but it's difficult to + find its origin. + +- you want to find quickly which path is taken by a specific + function + +- you just want to peek inside a working kernel and want to see + what happens there. + +# tracer: function_graph +# +# CPU DURATION FUNCTION CALLS +# | | | | | | | + + 0) | sys_open() { + 0) | do_sys_open() { + 0) | getname() { + 0) | kmem_cache_alloc() { + 0) 1.382 us | __might_sleep(); + 0) 2.478 us | } + 0) | strncpy_from_user() { + 0) | might_fault() { + 0) 1.389 us | __might_sleep(); + 0) 2.553 us | } + 0) 3.807 us | } + 0) 7.876 us | } + 0) | alloc_fd() { + 0) 0.668 us | _spin_lock(); + 0) 0.570 us | expand_files(); + 0) 0.586 us | _spin_unlock(); + + +There are several columns that can be dynamically +enabled/disabled. You can use every combination of options you +want, depending on your needs. + +- The cpu number on which the function executed is default + enabled. It is sometimes better to only trace one cpu (see + tracing_cpu_mask file) or you might sometimes see unordered + function calls while cpu tracing switch. + + hide: echo nofuncgraph-cpu > trace_options + show: echo funcgraph-cpu > trace_options + +- The duration (function's time of execution) is displayed on + the closing bracket line of a function or on the same line + than the current function in case of a leaf one. It is default + enabled. + + hide: echo nofuncgraph-duration > trace_options + show: echo funcgraph-duration > trace_options + +- The overhead field precedes the duration field in case of + reached duration thresholds. + + hide: echo nofuncgraph-overhead > trace_options + show: echo funcgraph-overhead > trace_options + depends on: funcgraph-duration + + ie: + + 0) | up_write() { + 0) 0.646 us | _spin_lock_irqsave(); + 0) 0.684 us | _spin_unlock_irqrestore(); + 0) 3.123 us | } + 0) 0.548 us | fput(); + 0) + 58.628 us | } + + [...] + + 0) | putname() { + 0) | kmem_cache_free() { + 0) 0.518 us | __phys_addr(); + 0) 1.757 us | } + 0) 2.861 us | } + 0) ! 115.305 us | } + 0) ! 116.402 us | } + + + means that the function exceeded 10 usecs. + ! means that the function exceeded 100 usecs. + # means that the function exceeded 1000 usecs. + $ means that the function exceeded 1 sec. + + +- The task/pid field displays the thread cmdline and pid which + executed the function. It is default disabled. + + hide: echo nofuncgraph-proc > trace_options + show: echo funcgraph-proc > trace_options + + ie: + + # tracer: function_graph + # + # CPU TASK/PID DURATION FUNCTION CALLS + # | | | | | | | | | + 0) sh-4802 | | d_free() { + 0) sh-4802 | | call_rcu() { + 0) sh-4802 | | __call_rcu() { + 0) sh-4802 | 0.616 us | rcu_process_gp_end(); + 0) sh-4802 | 0.586 us | check_for_new_grace_period(); + 0) sh-4802 | 2.899 us | } + 0) sh-4802 | 4.040 us | } + 0) sh-4802 | 5.151 us | } + 0) sh-4802 | + 49.370 us | } + + +- The absolute time field is an absolute timestamp given by the + system clock since it started. A snapshot of this time is + given on each entry/exit of functions + + hide: echo nofuncgraph-abstime > trace_options + show: echo funcgraph-abstime > trace_options + + ie: + + # + # TIME CPU DURATION FUNCTION CALLS + # | | | | | | | | + 360.774522 | 1) 0.541 us | } + 360.774522 | 1) 4.663 us | } + 360.774523 | 1) 0.541 us | __wake_up_bit(); + 360.774524 | 1) 6.796 us | } + 360.774524 | 1) 7.952 us | } + 360.774525 | 1) 9.063 us | } + 360.774525 | 1) 0.615 us | journal_mark_dirty(); + 360.774527 | 1) 0.578 us | __brelse(); + 360.774528 | 1) | reiserfs_prepare_for_journal() { + 360.774528 | 1) | unlock_buffer() { + 360.774529 | 1) | wake_up_bit() { + 360.774529 | 1) | bit_waitqueue() { + 360.774530 | 1) 0.594 us | __phys_addr(); + + +The function name is always displayed after the closing bracket +for a function if the start of that function is not in the +trace buffer. + +Display of the function name after the closing bracket may be +enabled for functions whose start is in the trace buffer, +allowing easier searching with grep for function durations. +It is default disabled. + + hide: echo nofuncgraph-tail > trace_options + show: echo funcgraph-tail > trace_options + + Example with nofuncgraph-tail (default): + 0) | putname() { + 0) | kmem_cache_free() { + 0) 0.518 us | __phys_addr(); + 0) 1.757 us | } + 0) 2.861 us | } + + Example with funcgraph-tail: + 0) | putname() { + 0) | kmem_cache_free() { + 0) 0.518 us | __phys_addr(); + 0) 1.757 us | } /* kmem_cache_free() */ + 0) 2.861 us | } /* putname() */ + +You can put some comments on specific functions by using +trace_printk() For example, if you want to put a comment inside +the __might_sleep() function, you just have to include +<linux/ftrace.h> and call trace_printk() inside __might_sleep() + +trace_printk("I'm a comment!\n") + +will produce: + + 1) | __might_sleep() { + 1) | /* I'm a comment! */ + 1) 1.449 us | } + + +You might find other useful features for this tracer in the +following "dynamic ftrace" section such as tracing only specific +functions or tasks. + +dynamic ftrace +-------------- + +If CONFIG_DYNAMIC_FTRACE is set, the system will run with +virtually no overhead when function tracing is disabled. The way +this works is the mcount function call (placed at the start of +every kernel function, produced by the -pg switch in gcc), +starts of pointing to a simple return. (Enabling FTRACE will +include the -pg switch in the compiling of the kernel.) + +At compile time every C file object is run through the +recordmcount program (located in the scripts directory). This +program will parse the ELF headers in the C object to find all +the locations in the .text section that call mcount. (Note, only +white listed .text sections are processed, since processing other +sections like .init.text may cause races due to those sections +being freed unexpectedly). + +A new section called "__mcount_loc" is created that holds +references to all the mcount call sites in the .text section. +The recordmcount program re-links this section back into the +original object. The final linking stage of the kernel will add all these +references into a single table. + +On boot up, before SMP is initialized, the dynamic ftrace code +scans this table and updates all the locations into nops. It +also records the locations, which are added to the +available_filter_functions list. Modules are processed as they +are loaded and before they are executed. When a module is +unloaded, it also removes its functions from the ftrace function +list. This is automatic in the module unload code, and the +module author does not need to worry about it. + +When tracing is enabled, the process of modifying the function +tracepoints is dependent on architecture. The old method is to use +kstop_machine to prevent races with the CPUs executing code being +modified (which can cause the CPU to do undesirable things, especially +if the modified code crosses cache (or page) boundaries), and the nops are +patched back to calls. But this time, they do not call mcount +(which is just a function stub). They now call into the ftrace +infrastructure. + +The new method of modifying the function tracepoints is to place +a breakpoint at the location to be modified, sync all CPUs, modify +the rest of the instruction not covered by the breakpoint. Sync +all CPUs again, and then remove the breakpoint with the finished +version to the ftrace call site. + +Some archs do not even need to monkey around with the synchronization, +and can just slap the new code on top of the old without any +problems with other CPUs executing it at the same time. + +One special side-effect to the recording of the functions being +traced is that we can now selectively choose which functions we +wish to trace and which ones we want the mcount calls to remain +as nops. + +Two files are used, one for enabling and one for disabling the +tracing of specified functions. They are: + + set_ftrace_filter + +and + + set_ftrace_notrace + +A list of available functions that you can add to these files is +listed in: + + available_filter_functions + + # cat available_filter_functions +put_prev_task_idle +kmem_cache_create +pick_next_task_rt +get_online_cpus +pick_next_task_fair +mutex_lock +[...] + +If I am only interested in sys_nanosleep and hrtimer_interrupt: + + # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter + # echo function > current_tracer + # echo 1 > tracing_on + # usleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: function +# +# entries-in-buffer/entries-written: 5/5 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath + <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt + usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt + <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt + <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt + +To see which functions are being traced, you can cat the file: + + # cat set_ftrace_filter +hrtimer_interrupt +sys_nanosleep + + +Perhaps this is not enough. The filters also allow simple wild +cards. Only the following are currently available + + <match>* - will match functions that begin with <match> + *<match> - will match functions that end with <match> + *<match>* - will match functions that have <match> in it + +These are the only wild cards which are supported. + + <match>*<match> will not work. + +Note: It is better to use quotes to enclose the wild cards, + otherwise the shell may expand the parameters into names + of files in the local directory. + + # echo 'hrtimer_*' > set_ftrace_filter + +Produces: + +# tracer: function +# +# entries-in-buffer/entries-written: 897/897 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel + <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer + <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit + <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 + <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt + <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter + <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem + +Notice that we lost the sys_nanosleep. + + # cat set_ftrace_filter +hrtimer_run_queues +hrtimer_run_pending +hrtimer_init +hrtimer_cancel +hrtimer_try_to_cancel +hrtimer_forward +hrtimer_start +hrtimer_reprogram +hrtimer_force_reprogram +hrtimer_get_next_event +hrtimer_interrupt +hrtimer_nanosleep +hrtimer_wakeup +hrtimer_get_remaining +hrtimer_get_res +hrtimer_init_sleeper + + +This is because the '>' and '>>' act just like they do in bash. +To rewrite the filters, use '>' +To append to the filters, use '>>' + +To clear out a filter so that all functions will be recorded +again: + + # echo > set_ftrace_filter + # cat set_ftrace_filter + # + +Again, now we want to append. + + # echo sys_nanosleep > set_ftrace_filter + # cat set_ftrace_filter +sys_nanosleep + # echo 'hrtimer_*' >> set_ftrace_filter + # cat set_ftrace_filter +hrtimer_run_queues +hrtimer_run_pending +hrtimer_init +hrtimer_cancel +hrtimer_try_to_cancel +hrtimer_forward +hrtimer_start +hrtimer_reprogram +hrtimer_force_reprogram +hrtimer_get_next_event +hrtimer_interrupt +sys_nanosleep +hrtimer_nanosleep +hrtimer_wakeup +hrtimer_get_remaining +hrtimer_get_res +hrtimer_init_sleeper + + +The set_ftrace_notrace prevents those functions from being +traced. + + # echo '*preempt*' '*lock*' > set_ftrace_notrace + +Produces: + +# tracer: function +# +# entries-in-buffer/entries-written: 39608/39608 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open + bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last + bash-1994 [000] .... 4342.324897: ima_file_check <-do_last + bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check + bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement + bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action + bash-1994 [000] .... 4342.324899: do_truncate <-do_last + bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate + bash-1994 [000] .... 4342.324899: notify_change <-do_truncate + bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change + bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time + bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time + +We can see that there's no more lock or preempt tracing. + + +Dynamic ftrace with the function graph tracer +--------------------------------------------- + +Although what has been explained above concerns both the +function tracer and the function-graph-tracer, there are some +special features only available in the function-graph tracer. + +If you want to trace only one function and all of its children, +you just have to echo its name into set_graph_function: + + echo __do_fault > set_graph_function + +will produce the following "expanded" trace of the __do_fault() +function: + + 0) | __do_fault() { + 0) | filemap_fault() { + 0) | find_lock_page() { + 0) 0.804 us | find_get_page(); + 0) | __might_sleep() { + 0) 1.329 us | } + 0) 3.904 us | } + 0) 4.979 us | } + 0) 0.653 us | _spin_lock(); + 0) 0.578 us | page_add_file_rmap(); + 0) 0.525 us | native_set_pte_at(); + 0) 0.585 us | _spin_unlock(); + 0) | unlock_page() { + 0) 0.541 us | page_waitqueue(); + 0) 0.639 us | __wake_up_bit(); + 0) 2.786 us | } + 0) + 14.237 us | } + 0) | __do_fault() { + 0) | filemap_fault() { + 0) | find_lock_page() { + 0) 0.698 us | find_get_page(); + 0) | __might_sleep() { + 0) 1.412 us | } + 0) 3.950 us | } + 0) 5.098 us | } + 0) 0.631 us | _spin_lock(); + 0) 0.571 us | page_add_file_rmap(); + 0) 0.526 us | native_set_pte_at(); + 0) 0.586 us | _spin_unlock(); + 0) | unlock_page() { + 0) 0.533 us | page_waitqueue(); + 0) 0.638 us | __wake_up_bit(); + 0) 2.793 us | } + 0) + 14.012 us | } + +You can also expand several functions at once: + + echo sys_open > set_graph_function + echo sys_close >> set_graph_function + +Now if you want to go back to trace all functions you can clear +this special filter via: + + echo > set_graph_function + + +ftrace_enabled +-------------- + +Note, the proc sysctl ftrace_enable is a big on/off switch for the +function tracer. By default it is enabled (when function tracing is +enabled in the kernel). If it is disabled, all function tracing is +disabled. This includes not only the function tracers for ftrace, but +also for any other uses (perf, kprobes, stack tracing, profiling, etc). + +Please disable this with care. + +This can be disable (and enabled) with: + + sysctl kernel.ftrace_enabled=0 + sysctl kernel.ftrace_enabled=1 + + or + + echo 0 > /proc/sys/kernel/ftrace_enabled + echo 1 > /proc/sys/kernel/ftrace_enabled + + +Filter commands +--------------- + +A few commands are supported by the set_ftrace_filter interface. +Trace commands have the following format: + +<function>:<command>:<parameter> + +The following commands are supported: + +- mod + This command enables function filtering per module. The + parameter defines the module. For example, if only the write* + functions in the ext3 module are desired, run: + + echo 'write*:mod:ext3' > set_ftrace_filter + + This command interacts with the filter in the same way as + filtering based on function names. Thus, adding more functions + in a different module is accomplished by appending (>>) to the + filter file. Remove specific module functions by prepending + '!': + + echo '!writeback*:mod:ext3' >> set_ftrace_filter + +- traceon/traceoff + These commands turn tracing on and off when the specified + functions are hit. The parameter determines how many times the + tracing system is turned on and off. If unspecified, there is + no limit. For example, to disable tracing when a schedule bug + is hit the first 5 times, run: + + echo '__schedule_bug:traceoff:5' > set_ftrace_filter + + To always disable tracing when __schedule_bug is hit: + + echo '__schedule_bug:traceoff' > set_ftrace_filter + + These commands are cumulative whether or not they are appended + to set_ftrace_filter. To remove a command, prepend it by '!' + and drop the parameter: + + echo '!__schedule_bug:traceoff:0' > set_ftrace_filter + + The above removes the traceoff command for __schedule_bug + that have a counter. To remove commands without counters: + + echo '!__schedule_bug:traceoff' > set_ftrace_filter + +- snapshot + Will cause a snapshot to be triggered when the function is hit. + + echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter + + To only snapshot once: + + echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter + + To remove the above commands: + + echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter + echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter + +- enable_event/disable_event + These commands can enable or disable a trace event. Note, because + function tracing callbacks are very sensitive, when these commands + are registered, the trace point is activated, but disabled in + a "soft" mode. That is, the tracepoint will be called, but + just will not be traced. The event tracepoint stays in this mode + as long as there's a command that triggers it. + + echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \ + set_ftrace_filter + + The format is: + + <function>:enable_event:<system>:<event>[:count] + <function>:disable_event:<system>:<event>[:count] + + To remove the events commands: + + + echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \ + set_ftrace_filter + echo '!schedule:disable_event:sched:sched_switch' > \ + set_ftrace_filter + +- dump + When the function is hit, it will dump the contents of the ftrace + ring buffer to the console. This is useful if you need to debug + something, and want to dump the trace when a certain function + is hit. Perhaps its a function that is called before a tripple + fault happens and does not allow you to get a regular dump. + +- cpudump + When the function is hit, it will dump the contents of the ftrace + ring buffer for the current CPU to the console. Unlike the "dump" + command, it only prints out the contents of the ring buffer for the + CPU that executed the function that triggered the dump. + +trace_pipe +---------- + +The trace_pipe outputs the same content as the trace file, but +the effect on the tracing is different. Every read from +trace_pipe is consumed. This means that subsequent reads will be +different. The trace is live. + + # echo function > current_tracer + # cat trace_pipe > /tmp/trace.out & +[1] 4153 + # echo 1 > tracing_on + # usleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: function +# +# entries-in-buffer/entries-written: 0/0 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + + # + # cat /tmp/trace.out + bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write + bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock + bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify + bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify + bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify + bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock + bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock + bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify + bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath + + +Note, reading the trace_pipe file will block until more input is +added. + +trace entries +------------- + +Having too much or not enough data can be troublesome in +diagnosing an issue in the kernel. The file buffer_size_kb is +used to modify the size of the internal trace buffers. The +number listed is the number of entries that can be recorded per +CPU. To know the full size, multiply the number of possible CPUs +with the number of entries. + + # cat buffer_size_kb +1408 (units kilobytes) + +Or simply read buffer_total_size_kb + + # cat buffer_total_size_kb +5632 + +To modify the buffer, simple echo in a number (in 1024 byte segments). + + # echo 10000 > buffer_size_kb + # cat buffer_size_kb +10000 (units kilobytes) + +It will try to allocate as much as possible. If you allocate too +much, it can cause Out-Of-Memory to trigger. + + # echo 1000000000000 > buffer_size_kb +-bash: echo: write error: Cannot allocate memory + # cat buffer_size_kb +85 + +The per_cpu buffers can be changed individually as well: + + # echo 10000 > per_cpu/cpu0/buffer_size_kb + # echo 100 > per_cpu/cpu1/buffer_size_kb + +When the per_cpu buffers are not the same, the buffer_size_kb +at the top level will just show an X + + # cat buffer_size_kb +X + +This is where the buffer_total_size_kb is useful: + + # cat buffer_total_size_kb +12916 + +Writing to the top level buffer_size_kb will reset all the buffers +to be the same again. + +Snapshot +-------- +CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature +available to all non latency tracers. (Latency tracers which +record max latency, such as "irqsoff" or "wakeup", can't use +this feature, since those are already using the snapshot +mechanism internally.) + +Snapshot preserves a current trace buffer at a particular point +in time without stopping tracing. Ftrace swaps the current +buffer with a spare buffer, and tracing continues in the new +current (=previous spare) buffer. + +The following debugfs files in "tracing" are related to this +feature: + + snapshot: + + This is used to take a snapshot and to read the output + of the snapshot. Echo 1 into this file to allocate a + spare buffer and to take a snapshot (swap), then read + the snapshot from this file in the same format as + "trace" (described above in the section "The File + System"). Both reads snapshot and tracing are executable + in parallel. When the spare buffer is allocated, echoing + 0 frees it, and echoing else (positive) values clear the + snapshot contents. + More details are shown in the table below. + + status\input | 0 | 1 | else | + --------------+------------+------------+------------+ + not allocated |(do nothing)| alloc+swap |(do nothing)| + --------------+------------+------------+------------+ + allocated | free | swap | clear | + --------------+------------+------------+------------+ + +Here is an example of using the snapshot feature. + + # echo 1 > events/sched/enable + # echo 1 > snapshot + # cat snapshot +# tracer: nop +# +# entries-in-buffer/entries-written: 71/71 #P:8 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120 + sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120 +[...] + <idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120 + + # cat trace +# tracer: nop +# +# entries-in-buffer/entries-written: 77/77 #P:8 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120 + snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120 +[...] + + +If you try to use this snapshot feature when current tracer is +one of the latency tracers, you will get the following results. + + # echo wakeup > current_tracer + # echo 1 > snapshot +bash: echo: write error: Device or resource busy + # cat snapshot +cat: snapshot: Device or resource busy + + +Instances +--------- +In the debugfs tracing directory is a directory called "instances". +This directory can have new directories created inside of it using +mkdir, and removing directories with rmdir. The directory created +with mkdir in this directory will already contain files and other +directories after it is created. + + # mkdir instances/foo + # ls instances/foo +buffer_size_kb buffer_total_size_kb events free_buffer per_cpu +set_event snapshot trace trace_clock trace_marker trace_options +trace_pipe tracing_on + +As you can see, the new directory looks similar to the tracing directory +itself. In fact, it is very similar, except that the buffer and +events are agnostic from the main director, or from any other +instances that are created. + +The files in the new directory work just like the files with the +same name in the tracing directory except the buffer that is used +is a separate and new buffer. The files affect that buffer but do not +affect the main buffer with the exception of trace_options. Currently, +the trace_options affect all instances and the top level buffer +the same, but this may change in future releases. That is, options +may become specific to the instance they reside in. + +Notice that none of the function tracer files are there, nor is +current_tracer and available_tracers. This is because the buffers +can currently only have events enabled for them. + + # mkdir instances/foo + # mkdir instances/bar + # mkdir instances/zoot + # echo 100000 > buffer_size_kb + # echo 1000 > instances/foo/buffer_size_kb + # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb + # echo function > current_trace + # echo 1 > instances/foo/events/sched/sched_wakeup/enable + # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable + # echo 1 > instances/foo/events/sched/sched_switch/enable + # echo 1 > instances/bar/events/irq/enable + # echo 1 > instances/zoot/events/syscalls/enable + # cat trace_pipe +CPU:2 [LOST 11745 EVENTS] + bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist + bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave + bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist + bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist + bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock + bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype + bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist + bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist + bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics + bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics + bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process +[...] + + # cat instances/foo/trace_pipe + bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 + bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 + <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003 + <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120 + rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 + bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 + bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 + bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120 + kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001 + kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120 +[...] + + # cat instances/bar/trace_pipe + migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX] + <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX] + bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER] + bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU] + bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER] + bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER] + bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU] + bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU] + sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4 + sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled + sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0 + sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled +[...] + + # cat instances/zoot/trace +# tracer: nop +# +# entries-in-buffer/entries-written: 18996/18996 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1998 [000] d... 140.733501: sys_write -> 0x2 + bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1) + bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1 + bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0) + bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1 + bash-1998 [000] d... 140.733510: sys_close(fd: a) + bash-1998 [000] d... 140.733510: sys_close -> 0x0 + bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8) + bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0 + bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8) + bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0 + +You can see that the trace of the top most trace buffer shows only +the function tracing. The foo instance displays wakeups and task +switches. + +To remove the instances, simply delete their directories: + + # rmdir instances/foo + # rmdir instances/bar + # rmdir instances/zoot + +Note, if a process has a trace file open in one of the instance +directories, the rmdir will fail with EBUSY. + + +Stack trace +----------- +Since the kernel has a fixed sized stack, it is important not to +waste it in functions. A kernel developer must be conscience of +what they allocate on the stack. If they add too much, the system +can be in danger of a stack overflow, and corruption will occur, +usually leading to a system panic. + +There are some tools that check this, usually with interrupts +periodically checking usage. But if you can perform a check +at every function call that will become very useful. As ftrace provides +a function tracer, it makes it convenient to check the stack size +at every function call. This is enabled via the stack tracer. + +CONFIG_STACK_TRACER enables the ftrace stack tracing functionality. +To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled. + + # echo 1 > /proc/sys/kernel/stack_tracer_enabled + +You can also enable it from the kernel command line to trace +the stack size of the kernel during boot up, by adding "stacktrace" +to the kernel command line parameter. + +After running it for a few minutes, the output looks like: + + # cat stack_max_size +2928 + + # cat stack_trace + Depth Size Location (18 entries) + ----- ---- -------- + 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac + 1) 2704 160 find_busiest_group+0x31/0x1f1 + 2) 2544 256 load_balance+0xd9/0x662 + 3) 2288 80 idle_balance+0xbb/0x130 + 4) 2208 128 __schedule+0x26e/0x5b9 + 5) 2080 16 schedule+0x64/0x66 + 6) 2064 128 schedule_timeout+0x34/0xe0 + 7) 1936 112 wait_for_common+0x97/0xf1 + 8) 1824 16 wait_for_completion+0x1d/0x1f + 9) 1808 128 flush_work+0xfe/0x119 + 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20 + 11) 1664 48 input_available_p+0x1d/0x5c + 12) 1616 48 n_tty_poll+0x6d/0x134 + 13) 1568 64 tty_poll+0x64/0x7f + 14) 1504 880 do_select+0x31e/0x511 + 15) 624 400 core_sys_select+0x177/0x216 + 16) 224 96 sys_select+0x91/0xb9 + 17) 128 128 system_call_fastpath+0x16/0x1b + +Note, if -mfentry is being used by gcc, functions get traced before +they set up the stack frame. This means that leaf level functions +are not tested by the stack tracer when -mfentry is used. + +Currently, -mfentry is used by gcc 4.6.0 and above on x86 only. + +--------- + +More details can be found in the source code, in the +kernel/trace/*.c files. diff --git a/kernel/Documentation/trace/function-graph-fold.vim b/kernel/Documentation/trace/function-graph-fold.vim new file mode 100644 index 000000000..0544b504c --- /dev/null +++ b/kernel/Documentation/trace/function-graph-fold.vim @@ -0,0 +1,42 @@ +" Enable folding for ftrace function_graph traces. +" +" To use, :source this file while viewing a function_graph trace, or use vim's +" -S option to load from the command-line together with a trace. You can then +" use the usual vim fold commands, such as "za", to open and close nested +" functions. While closed, a fold will show the total time taken for a call, +" as would normally appear on the line with the closing brace. Folded +" functions will not include finish_task_switch(), so folding should remain +" relatively sane even through a context switch. +" +" Note that this will almost certainly only work well with a +" single-CPU trace (e.g. trace-cmd report --cpu 1). + +function! FunctionGraphFoldExpr(lnum) + let line = getline(a:lnum) + if line[-1:] == '{' + if line =~ 'finish_task_switch() {$' + return '>1' + endif + return 'a1' + elseif line[-1:] == '}' + return 's1' + else + return '=' + endif +endfunction + +function! FunctionGraphFoldText() + let s = split(getline(v:foldstart), '|', 1) + if getline(v:foldend+1) =~ 'finish_task_switch() {$' + let s[2] = ' task switch ' + else + let e = split(getline(v:foldend), '|', 1) + let s[2] = e[2] + endif + return join(s, '|') +endfunction + +setlocal foldexpr=FunctionGraphFoldExpr(v:lnum) +setlocal foldtext=FunctionGraphFoldText() +setlocal foldcolumn=12 +setlocal foldmethod=expr diff --git a/kernel/Documentation/trace/histograms.txt b/kernel/Documentation/trace/histograms.txt new file mode 100644 index 000000000..6f2aeabf7 --- /dev/null +++ b/kernel/Documentation/trace/histograms.txt @@ -0,0 +1,186 @@ + Using the Linux Kernel Latency Histograms + + +This document gives a short explanation how to enable, configure and use +latency histograms. Latency histograms are primarily relevant in the +context of real-time enabled kernels (CONFIG_PREEMPT/CONFIG_PREEMPT_RT) +and are used in the quality management of the Linux real-time +capabilities. + + +* Purpose of latency histograms + +A latency histogram continuously accumulates the frequencies of latency +data. There are two types of histograms +- potential sources of latencies +- effective latencies + + +* Potential sources of latencies + +Potential sources of latencies are code segments where interrupts, +preemption or both are disabled (aka critical sections). To create +histograms of potential sources of latency, the kernel stores the time +stamp at the start of a critical section, determines the time elapsed +when the end of the section is reached, and increments the frequency +counter of that latency value - irrespective of whether any concurrently +running process is affected by latency or not. +- Configuration items (in the Kernel hacking/Tracers submenu) + CONFIG_INTERRUPT_OFF_LATENCY + CONFIG_PREEMPT_OFF_LATENCY + + +* Effective latencies + +Effective latencies are actually occuring during wakeup of a process. To +determine effective latencies, the kernel stores the time stamp when a +process is scheduled to be woken up, and determines the duration of the +wakeup time shortly before control is passed over to this process. Note +that the apparent latency in user space may be somewhat longer, since the +process may be interrupted after control is passed over to it but before +the execution in user space takes place. Simply measuring the interval +between enqueuing and wakeup may also not appropriate in cases when a +process is scheduled as a result of a timer expiration. The timer may have +missed its deadline, e.g. due to disabled interrupts, but this latency +would not be registered. Therefore, the offsets of missed timers are +recorded in a separate histogram. If both wakeup latency and missed timer +offsets are configured and enabled, a third histogram may be enabled that +records the overall latency as a sum of the timer latency, if any, and the +wakeup latency. This histogram is called "timerandwakeup". +- Configuration items (in the Kernel hacking/Tracers submenu) + CONFIG_WAKEUP_LATENCY + CONFIG_MISSED_TIMER_OFSETS + + +* Usage + +The interface to the administration of the latency histograms is located +in the debugfs file system. To mount it, either enter + +mount -t sysfs nodev /sys +mount -t debugfs nodev /sys/kernel/debug + +from shell command line level, or add + +nodev /sys sysfs defaults 0 0 +nodev /sys/kernel/debug debugfs defaults 0 0 + +to the file /etc/fstab. All latency histogram related files are then +available in the directory /sys/kernel/debug/tracing/latency_hist. A +particular histogram type is enabled by writing non-zero to the related +variable in the /sys/kernel/debug/tracing/latency_hist/enable directory. +Select "preemptirqsoff" for the histograms of potential sources of +latencies and "wakeup" for histograms of effective latencies etc. The +histogram data - one per CPU - are available in the files + +/sys/kernel/debug/tracing/latency_hist/preemptoff/CPUx +/sys/kernel/debug/tracing/latency_hist/irqsoff/CPUx +/sys/kernel/debug/tracing/latency_hist/preemptirqsoff/CPUx +/sys/kernel/debug/tracing/latency_hist/wakeup/CPUx +/sys/kernel/debug/tracing/latency_hist/wakeup/sharedprio/CPUx +/sys/kernel/debug/tracing/latency_hist/missed_timer_offsets/CPUx +/sys/kernel/debug/tracing/latency_hist/timerandwakeup/CPUx + +The histograms are reset by writing non-zero to the file "reset" in a +particular latency directory. To reset all latency data, use + +#!/bin/sh + +TRACINGDIR=/sys/kernel/debug/tracing +HISTDIR=$TRACINGDIR/latency_hist + +if test -d $HISTDIR +then + cd $HISTDIR + for i in `find . | grep /reset$` + do + echo 1 >$i + done +fi + + +* Data format + +Latency data are stored with a resolution of one microsecond. The +maximum latency is 10,240 microseconds. The data are only valid, if the +overflow register is empty. Every output line contains the latency in +microseconds in the first row and the number of samples in the second +row. To display only lines with a positive latency count, use, for +example, + +grep -v " 0$" /sys/kernel/debug/tracing/latency_hist/preemptoff/CPU0 + +#Minimum latency: 0 microseconds. +#Average latency: 0 microseconds. +#Maximum latency: 25 microseconds. +#Total samples: 3104770694 +#There are 0 samples greater or equal than 10240 microseconds +#usecs samples + 0 2984486876 + 1 49843506 + 2 58219047 + 3 5348126 + 4 2187960 + 5 3388262 + 6 959289 + 7 208294 + 8 40420 + 9 4485 + 10 14918 + 11 18340 + 12 25052 + 13 19455 + 14 5602 + 15 969 + 16 47 + 17 18 + 18 14 + 19 1 + 20 3 + 21 2 + 22 5 + 23 2 + 25 1 + + +* Wakeup latency of a selected process + +To only collect wakeup latency data of a particular process, write the +PID of the requested process to + +/sys/kernel/debug/tracing/latency_hist/wakeup/pid + +PIDs are not considered, if this variable is set to 0. + + +* Details of the process with the highest wakeup latency so far + +Selected data of the process that suffered from the highest wakeup +latency that occurred in a particular CPU are available in the file + +/sys/kernel/debug/tracing/latency_hist/wakeup/max_latency-CPUx. + +In addition, other relevant system data at the time when the +latency occurred are given. + +The format of the data is (all in one line): +<PID> <Priority> <Latency> (<Timeroffset>) <Command> \ +<- <PID> <Priority> <Command> <Timestamp> + +The value of <Timeroffset> is only relevant in the combined timer +and wakeup latency recording. In the wakeup recording, it is +always 0, in the missed_timer_offsets recording, it is the same +as <Latency>. + +When retrospectively searching for the origin of a latency and +tracing was not enabled, it may be helpful to know the name and +some basic data of the task that (finally) was switching to the +late real-tlme task. In addition to the victim's data, also the +data of the possible culprit are therefore displayed after the +"<-" symbol. + +Finally, the timestamp of the time when the latency occurred +in <seconds>.<microseconds> after the most recent system boot +is provided. + +These data are also reset when the wakeup histogram is reset. diff --git a/kernel/Documentation/trace/kprobetrace.txt b/kernel/Documentation/trace/kprobetrace.txt new file mode 100644 index 000000000..d68ea5fc8 --- /dev/null +++ b/kernel/Documentation/trace/kprobetrace.txt @@ -0,0 +1,172 @@ + Kprobe-based Event Tracing + ========================== + + Documentation is written by Masami Hiramatsu + + +Overview +-------- +These events are similar to tracepoint based events. Instead of Tracepoint, +this is based on kprobes (kprobe and kretprobe). So it can probe wherever +kprobes can probe (this means, all functions body except for __kprobes +functions). Unlike the Tracepoint based event, this can be added and removed +dynamically, on the fly. + +To enable this feature, build your kernel with CONFIG_KPROBE_EVENT=y. + +Similar to the events tracer, this doesn't need to be activated via +current_tracer. Instead of that, add probe points via +/sys/kernel/debug/tracing/kprobe_events, and enable it via +/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled. + + +Synopsis of kprobe_events +------------------------- + p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe + r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe + -:[GRP/]EVENT : Clear a probe + + GRP : Group name. If omitted, use "kprobes" for it. + EVENT : Event name. If omitted, the event name is generated + based on SYM+offs or MEMADDR. + MOD : Module name which has given SYM. + SYM[+offs] : Symbol+offset where the probe is inserted. + MEMADDR : Address where the probe is inserted. + + FETCHARGS : Arguments. Each probe can have up to 128 args. + %REG : Fetch register REG + @ADDR : Fetch memory at ADDR (ADDR should be in kernel) + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) + $stackN : Fetch Nth entry of stack (N >= 0) + $stack : Fetch stack address. + $retval : Fetch return value.(*) + +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types + (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield + are supported. + + (*) only for return probe. + (**) this is useful for fetching a field of data structures. + +Types +----- +Several types are supported for fetch-args. Kprobe tracer will access memory +by given type. Prefix 's' and 'u' means those types are signed and unsigned +respectively. Traced arguments are shown in decimal (signed) or hex (unsigned). +String type is a special type, which fetches a "null-terminated" string from +kernel space. This means it will fail and store NULL if the string container +has been paged out. +Bitfield is another special type, which takes 3 parameters, bit-width, bit- +offset, and container-size (usually 32). The syntax is; + + b<bit-width>@<bit-offset>/<container-size> + + +Per-Probe Event Filtering +------------------------- + Per-probe event filtering feature allows you to set different filter on each +probe and gives you what arguments will be shown in trace buffer. If an event +name is specified right after 'p:' or 'r:' in kprobe_events, it adds an event +under tracing/events/kprobes/<EVENT>, at the directory you can see 'id', +'enabled', 'format' and 'filter'. + +enabled: + You can enable/disable the probe by writing 1 or 0 on it. + +format: + This shows the format of this probe event. + +filter: + You can write filtering rules of this event. + +id: + This shows the id of this probe event. + + +Event Profiling +--------------- + You can check the total number of probe hits and probe miss-hits via +/sys/kernel/debug/tracing/kprobe_profile. + The first column is event name, the second is the number of probe hits, +the third is the number of probe miss-hits. + + +Usage examples +-------------- +To add a probe as a new event, write a new definition to kprobe_events +as below. + + echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events + + This sets a kprobe on the top of do_sys_open() function with recording +1st to 4th arguments as "myprobe" event. Note, which register/stack entry is +assigned to each function argument depends on arch-specific ABI. If you unsure +the ABI, please try to use probe subcommand of perf-tools (you can find it +under tools/perf/). +As this example shows, users can choose more familiar names for each arguments. + + echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events + + This sets a kretprobe on the return point of do_sys_open() function with +recording return value as "myretprobe" event. + You can see the format of these events via +/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format. + + cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format +name: myprobe +ID: 780 +format: + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1;signed:0; + field:int common_pid; offset:4; size:4; signed:1; + + field:unsigned long __probe_ip; offset:12; size:4; signed:0; + field:int __probe_nargs; offset:16; size:4; signed:1; + field:unsigned long dfd; offset:20; size:4; signed:0; + field:unsigned long filename; offset:24; size:4; signed:0; + field:unsigned long flags; offset:28; size:4; signed:0; + field:unsigned long mode; offset:32; size:4; signed:0; + + +print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip, +REC->dfd, REC->filename, REC->flags, REC->mode + + You can see that the event has 4 arguments as in the expressions you specified. + + echo > /sys/kernel/debug/tracing/kprobe_events + + This clears all probe points. + + Or, + + echo -:myprobe >> kprobe_events + + This clears probe points selectively. + + Right after definition, each event is disabled by default. For tracing these +events, you need to enable it. + + echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable + echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable + + And you can see the traced information via /sys/kernel/debug/tracing/trace. + + cat /sys/kernel/debug/tracing/trace +# tracer: nop +# +# TASK-PID CPU# TIMESTAMP FUNCTION +# | | | | | + <...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0 + <...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe + <...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6 + <...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 + <...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10 + <...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 + + + Each line shows when the kernel hits an event, and <- SYMBOL means kernel +returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel +returns from do_sys_open to sys_open+0x1b). + diff --git a/kernel/Documentation/trace/mmiotrace.txt b/kernel/Documentation/trace/mmiotrace.txt new file mode 100644 index 000000000..664e7386d --- /dev/null +++ b/kernel/Documentation/trace/mmiotrace.txt @@ -0,0 +1,164 @@ + In-kernel memory-mapped I/O tracing + + +Home page and links to optional user space tools: + + http://nouveau.freedesktop.org/wiki/MmioTrace + +MMIO tracing was originally developed by Intel around 2003 for their Fault +Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel, +Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau +project in mind. Since then many people have contributed. + +Mmiotrace was built for reverse engineering any memory-mapped IO device with +the Nouveau project as the first real user. Only x86 and x86_64 architectures +are supported. + +Out-of-tree mmiotrace was originally modified for mainline inclusion and +ftrace framework by Pekka Paalanen <pq@iki.fi>. + + +Preparation +----------- + +Mmiotrace feature is compiled in by the CONFIG_MMIOTRACE option. Tracing is +disabled by default, so it is safe to have this set to yes. SMP systems are +supported, but tracing is unreliable and may miss events if more than one CPU +is on-line, therefore mmiotrace takes all but one CPU off-line during run-time +activation. You can re-enable CPUs by hand, but you have been warned, there +is no way to automatically detect if you are losing events due to CPUs racing. + + +Usage Quick Reference +--------------------- + +$ mount -t debugfs debugfs /sys/kernel/debug +$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer +$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & +Start X or whatever. +$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker +$ echo nop > /sys/kernel/debug/tracing/current_tracer +Check for lost events. + + +Usage +----- + +Make sure debugfs is mounted to /sys/kernel/debug. +If not (requires root privileges): +$ mount -t debugfs debugfs /sys/kernel/debug + +Check that the driver you are about to trace is not loaded. + +Activate mmiotrace (requires root privileges): +$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer + +Start storing the trace: +$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & +The 'cat' process should stay running (sleeping) in the background. + +Load the driver you want to trace and use it. Mmiotrace will only catch MMIO +accesses to areas that are ioremapped while mmiotrace is active. + +During tracing you can place comments (markers) into the trace by +$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker +This makes it easier to see which part of the (huge) trace corresponds to +which action. It is recommended to place descriptive markers about what you +do. + +Shut down mmiotrace (requires root privileges): +$ echo nop > /sys/kernel/debug/tracing/current_tracer +The 'cat' process exits. If it does not, kill it by issuing 'fg' command and +pressing ctrl+c. + +Check that mmiotrace did not lose events due to a buffer filling up. Either +$ grep -i lost mydump.txt +which tells you exactly how many events were lost, or use +$ dmesg +to view your kernel log and look for "mmiotrace has lost events" warning. If +events were lost, the trace is incomplete. You should enlarge the buffers and +try again. Buffers are enlarged by first seeing how large the current buffers +are: +$ cat /sys/kernel/debug/tracing/buffer_size_kb +gives you a number. Approximately double this number and write it back, for +instance: +$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb +Then start again from the top. + +If you are doing a trace for a driver project, e.g. Nouveau, you should also +do the following before sending your results: +$ lspci -vvv > lspci.txt +$ dmesg > dmesg.txt +$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt +and then send the .tar.gz file. The trace compresses considerably. Replace +"pciid" and "nick" with the PCI ID or model name of your piece of hardware +under investigation and your nickname. + + +How Mmiotrace Works +------------------- + +Access to hardware IO-memory is gained by mapping addresses from PCI bus by +calling one of the ioremap_*() functions. Mmiotrace is hooked into the +__ioremap() function and gets called whenever a mapping is created. Mapping is +an event that is recorded into the trace log. Note that ISA range mappings +are not caught, since the mapping always exists and is returned directly. + +MMIO accesses are recorded via page faults. Just before __ioremap() returns, +the mapped pages are marked as not present. Any access to the pages causes a +fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace +marks the page present, sets TF flag to achieve single stepping and exits the +fault handler. The instruction that faulted is executed and debug trap is +entered. Here mmiotrace again marks the page as not present. The instruction +is decoded to get the type of operation (read/write), data width and the value +read or written. These are stored to the trace log. + +Setting the page present in the page fault handler has a race condition on SMP +machines. During the single stepping other CPUs may run freely on that page +and events can be missed without a notice. Re-enabling other CPUs during +tracing is discouraged. + + +Trace Log Format +---------------- + +The raw log is text and easily filtered with e.g. grep and awk. One record is +one line in the log. A record starts with a keyword, followed by keyword- +dependent arguments. Arguments are separated by a space, or continue until the +end of line. The format for version 20070824 is as follows: + +Explanation Keyword Space-separated arguments +--------------------------------------------------------------------------- + +read event R width, timestamp, map id, physical, value, PC, PID +write event W width, timestamp, map id, physical, value, PC, PID +ioremap event MAP timestamp, map id, physical, virtual, length, PC, PID +iounmap event UNMAP timestamp, map id, PC, PID +marker MARK timestamp, text +version VERSION the string "20070824" +info for reader LSPCI one line from lspci -v +PCI address map PCIDEV space-separated /proc/bus/pci/devices data +unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID + +Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual +is a kernel virtual address. Width is the data width in bytes and value is the +data value. Map id is an arbitrary id number identifying the mapping that was +used in an operation. PC is the program counter and PID is process id. PC is +zero if it is not recorded. PID is always zero as tracing MMIO accesses +originating in user space memory is not yet supported. + +For instance, the following awk filter will pass all 32-bit writes that target +physical addresses in the range [0xfb73ce40, 0xfb800000[ + +$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 && +adr < 0xfb800000) print; }' + + +Tools for Developers +-------------------- + +The user space tools include utilities for: +- replacing numeric addresses and values with hardware register names +- replaying MMIO logs, i.e., re-executing the recorded writes + + diff --git a/kernel/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/kernel/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl new file mode 100644 index 000000000..0a120aae3 --- /dev/null +++ b/kernel/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl @@ -0,0 +1,418 @@ +#!/usr/bin/perl +# This is a POC (proof of concept or piece of crap, take your pick) for reading the +# text representation of trace output related to page allocation. It makes an attempt +# to extract some high-level information on what is going on. The accuracy of the parser +# may vary considerably +# +# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --prepend-parent Report on the parent proc and PID +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_PAGE_ALLOC => 1; +use constant MM_PAGE_FREE => 2; +use constant MM_PAGE_FREE_BATCHED => 3; +use constant MM_PAGE_PCPU_DRAIN => 4; +use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; +use constant MM_PAGE_ALLOC_EXTFRAG => 6; +use constant EVENT_UNKNOWN => 7; + +# Constants used to track state +use constant STATE_PCPU_PAGES_DRAINED => 8; +use constant STATE_PCPU_PAGES_REFILLED => 9; + +# High-level events extrapolated from tracepoints +use constant HIGH_PCPU_DRAINS => 10; +use constant HIGH_PCPU_REFILLS => 11; +use constant HIGH_EXT_FRAGMENT => 12; +use constant HIGH_EXT_FRAGMENT_SEVERE => 13; +use constant HIGH_EXT_FRAGMENT_MODERATE => 14; +use constant HIGH_EXT_FRAGMENT_CHANGED => 15; + +my %perprocesspid; +my %perprocess; +my $opt_ignorepid; +my $opt_read_procstat; +my $opt_prepend_parent; + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, + 'prepend-parent' => \$opt_prepend_parent, +); + +# Defaults for dynamically discovered regex's +my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; + +# Dyanically discovered regex +my $regex_fragdetails; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + $regex = $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + if ($line =~ /^print fmt:\s"(.*)",.*/) { + $regex = $1; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected '$key' != '$expected'"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if (defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} +$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", + $regex_fragdetails_default, + "page", "pfn", + "alloc_order", "fallback_order", "pageblock_order", + "alloc_migratetype", "fallback_migratetype", + "fragmenting", "change_ownership"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to math stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +sub parent_info($$) { + my $pid = $_[0]; + my $statline = $_[1]; + my $ppid; + + if ($pid == 0) { + return "NOPARENT-0"; + } + + if ($statline !~ /$regex_statppid/o) { + die("Failed to match stat line process ppid:: $statline"); + } + + # Read the ppid stat line + $ppid = $1; + return guess_process_pid($ppid, read_statline($ppid)); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $tracepoint = $4; + + if ($opt_read_procstat || $opt_prepend_parent) { + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + $statline = read_statline($pid); + + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + + if ($opt_prepend_parent) { + $process_pid = parent_info($pid, $statline) . " :: $process_pid"; + } + } + + # Unnecessary in this script. Uncomment if required + # $cpus = $2; + # $timestamp = $3; + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_page_alloc") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; + } elsif ($tracepoint eq "mm_page_free") { + $perprocesspid{$process_pid}->{MM_PAGE_FREE}++ + } elsif ($tracepoint eq "mm_page_free_batched") { + $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}++; + } elsif ($tracepoint eq "mm_page_pcpu_drain") { + $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; + } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; + } elsif ($tracepoint eq "mm_page_alloc_extfrag") { + + # Extract the details of the event now + $details = $5; + + my ($page, $pfn); + my ($alloc_order, $fallback_order, $pageblock_order); + my ($alloc_migratetype, $fallback_migratetype); + my ($fragmenting, $change_ownership); + + if ($details !~ /$regex_fragdetails/o) { + print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; + next; + } + + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; + $page = $1; + $pfn = $2; + $alloc_order = $3; + $fallback_order = $4; + $pageblock_order = $5; + $alloc_migratetype = $6; + $fallback_migratetype = $7; + $fragmenting = $8; + $change_ownership = $9; + + if ($fragmenting) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; + if ($fallback_order <= 3) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; + } else { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; + } + } + if ($change_ownership) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + # Catch a full pcpu drain event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && + $tracepoint ne "mm_page_pcpu_drain") { + + $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + + # Catch a full pcpu refill event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && + $tracepoint ne "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); + + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); + + foreach $process_pid (keys %stats) { + # Dump final aggregates + if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { + $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { + $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", + $process_pid, + $stats{$process_pid}->{MM_PAGE_ALLOC}, + $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, + $stats{$process_pid}->{MM_PAGE_FREE}, + $stats{$process_pid}->{MM_PAGE_FREE_BATCHED}, + $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, + $stats{$process_pid}->{HIGH_PCPU_DRAINS}, + $stats{$process_pid}->{HIGH_PCPU_REFILLS}, + $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, + $stats{$process_pid}->{EVENT_UNKNOWN}); + } +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; + $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; + $perprocess{$process}->{MM_PAGE_FREE} += $perprocesspid{$process_pid}->{MM_PAGE_FREE}; + $perprocess{$process}->{MM_PAGE_FREE_BATCHED} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}; + $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; + $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; + $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; + $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; + $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; + $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/kernel/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/kernel/Documentation/trace/postprocess/trace-vmscan-postprocess.pl new file mode 100644 index 000000000..8f961ef2b --- /dev/null +++ b/kernel/Documentation/trace/postprocess/trace-vmscan-postprocess.pl @@ -0,0 +1,757 @@ +#!/usr/bin/perl +# This is a POC for reading the text representation of trace output related to +# page reclaim. It makes an attempt to extract some high-level information on +# what is going on. The accuracy of the parser may vary +# +# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1; +use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2; +use constant MM_VMSCAN_KSWAPD_WAKE => 3; +use constant MM_VMSCAN_KSWAPD_SLEEP => 4; +use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5; +use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6; +use constant MM_VMSCAN_LRU_ISOLATE => 7; +use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8; +use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9; +use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10; +use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11; +use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12; +use constant EVENT_UNKNOWN => 13; + +# Per-order events +use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11; +use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12; +use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13; +use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14; + +# Constants used to track state +use constant STATE_DIRECT_BEGIN => 15; +use constant STATE_DIRECT_ORDER => 16; +use constant STATE_KSWAPD_BEGIN => 17; +use constant STATE_KSWAPD_ORDER => 18; + +# High-level events extrapolated from tracepoints +use constant HIGH_DIRECT_RECLAIM_LATENCY => 19; +use constant HIGH_KSWAPD_LATENCY => 20; +use constant HIGH_KSWAPD_REWAKEUP => 21; +use constant HIGH_NR_SCANNED => 22; +use constant HIGH_NR_TAKEN => 23; +use constant HIGH_NR_RECLAIMED => 24; +use constant HIGH_NR_FILE_SCANNED => 25; +use constant HIGH_NR_ANON_SCANNED => 26; +use constant HIGH_NR_FILE_RECLAIMED => 27; +use constant HIGH_NR_ANON_RECLAIMED => 28; + +my %perprocesspid; +my %perprocess; +my %last_procmap; +my $opt_ignorepid; +my $opt_read_procstat; + +my $total_wakeup_kswapd; +my ($total_direct_reclaim, $total_direct_nr_scanned); +my ($total_direct_nr_file_scanned, $total_direct_nr_anon_scanned); +my ($total_direct_latency, $total_kswapd_latency); +my ($total_direct_nr_reclaimed); +my ($total_direct_nr_file_reclaimed, $total_direct_nr_anon_reclaimed); +my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async); +my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async); +my ($total_kswapd_nr_scanned, $total_kswapd_wake); +my ($total_kswapd_nr_file_scanned, $total_kswapd_nr_anon_scanned); +my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async); +my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async); +my ($total_kswapd_nr_reclaimed); +my ($total_kswapd_nr_file_reclaimed, $total_kswapd_nr_anon_reclaimed); + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, +); + +# Defaults for dynamically discovered regex's +my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)'; +my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)'; +my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)'; +my $regex_kswapd_sleep_default = 'nid=([0-9]*)'; +my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; +my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) file=([0-9]*)'; +my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)'; +my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; +my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)'; + +# Dyanically discovered regex +my $regex_direct_begin; +my $regex_direct_end; +my $regex_kswapd_wake; +my $regex_kswapd_sleep; +my $regex_wakeup_kswapd; +my $regex_lru_isolate; +my $regex_lru_shrink_inactive; +my $regex_lru_shrink_active; +my $regex_writepage; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])(\s*[dX.][Nnp.][Hhs.][0-9a-fA-F.]*|)\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + print("WARNING: Event $event format string not found\n"); + return $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + $line =~ s/, REC->.*//; + if ($line =~ /^print fmt:\s"(.*)".*/) { + $regex = $1; + $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%ld/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Can't handle the print_flags stuff but in the context of this + # script, it really doesn't matter + $regex =~ s/\(REC.*\) \? __print_flags.*//; + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected for event $event '$key' != '$expected'\n"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if (defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} + +$regex_direct_begin = generate_traceevent_regex( + "vmscan/mm_vmscan_direct_reclaim_begin", + $regex_direct_begin_default, + "order", "may_writepage", + "gfp_flags"); +$regex_direct_end = generate_traceevent_regex( + "vmscan/mm_vmscan_direct_reclaim_end", + $regex_direct_end_default, + "nr_reclaimed"); +$regex_kswapd_wake = generate_traceevent_regex( + "vmscan/mm_vmscan_kswapd_wake", + $regex_kswapd_wake_default, + "nid", "order"); +$regex_kswapd_sleep = generate_traceevent_regex( + "vmscan/mm_vmscan_kswapd_sleep", + $regex_kswapd_sleep_default, + "nid"); +$regex_wakeup_kswapd = generate_traceevent_regex( + "vmscan/mm_vmscan_wakeup_kswapd", + $regex_wakeup_kswapd_default, + "nid", "zid", "order"); +$regex_lru_isolate = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_isolate", + $regex_lru_isolate_default, + "isolate_mode", "order", + "nr_requested", "nr_scanned", "nr_taken", + "file"); +$regex_lru_shrink_inactive = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_shrink_inactive", + $regex_lru_shrink_inactive_default, + "nid", "zid", + "nr_scanned", "nr_reclaimed", "priority", + "flags"); +$regex_lru_shrink_active = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_shrink_active", + $regex_lru_shrink_active_default, + "nid", "zid", + "lru", + "nr_scanned", "nr_rotated", "priority"); +$regex_writepage = generate_traceevent_regex( + "vmscan/mm_vmscan_writepage", + $regex_writepage_default, + "page", "pfn", "flags"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to math stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +# Convert sec.usec timestamp format +sub timestamp_to_ms($) { + my $timestamp = $_[0]; + + my ($sec, $usec) = split (/\./, $timestamp); + return ($sec * 1000) + ($usec / 1000); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $timestamp = $4; + $tracepoint = $5; + + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + if ($process eq "") { + $process = $last_procmap{$pid}; + $process_pid = "$process-$pid"; + } + $last_procmap{$pid} = $process; + + if ($opt_read_procstat) { + $statline = read_statline($pid); + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + } + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") { + $timestamp = timestamp_to_ms($timestamp); + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++; + $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp; + + $details = $6; + if ($details !~ /$regex_direct_begin/o) { + print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n"; + print " $details\n"; + print " $regex_direct_begin\n"; + next; + } + my $order = $1; + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++; + $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order; + } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") { + # Count the event itself + my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}; + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++; + + # Record how long direct reclaim took this time + if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) { + $timestamp = timestamp_to_ms($timestamp); + my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER}; + my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}); + $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency"; + } + } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") { + $details = $6; + if ($details !~ /$regex_kswapd_wake/o) { + print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n"; + print " $details\n"; + print " $regex_kswapd_wake\n"; + next; + } + + my $order = $2; + $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order; + if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) { + $timestamp = timestamp_to_ms($timestamp); + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++; + $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp; + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++; + } else { + $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++; + $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++; + } + } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") { + + # Count the event itself + my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}; + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++; + + # Record how long kswapd was awake + $timestamp = timestamp_to_ms($timestamp); + my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER}; + my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}); + $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency"; + $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0; + } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") { + $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++; + + $details = $6; + if ($details !~ /$regex_wakeup_kswapd/o) { + print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n"; + print " $details\n"; + print " $regex_wakeup_kswapd\n"; + next; + } + my $order = $3; + $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++; + } elsif ($tracepoint eq "mm_vmscan_lru_isolate") { + $details = $6; + if ($details !~ /$regex_lru_isolate/o) { + print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n"; + print " $details\n"; + print " $regex_lru_isolate/o\n"; + next; + } + my $isolate_mode = $1; + my $nr_scanned = $4; + my $file = $6; + + # To closer match vmstat scanning statistics, only count isolate_both + # and isolate_inactive as scanning. isolate_active is rotation + # isolate_inactive == 1 + # isolate_active == 2 + # isolate_both == 3 + if ($isolate_mode != 2) { + $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned; + if ($file == 1) { + $perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED} += $nr_scanned; + } else { + $perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED} += $nr_scanned; + } + } + } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") { + $details = $6; + if ($details !~ /$regex_lru_shrink_inactive/o) { + print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n"; + print " $details\n"; + print " $regex_lru_shrink_inactive/o\n"; + next; + } + + my $nr_reclaimed = $4; + my $flags = $6; + my $file = 0; + if ($flags =~ /RECLAIM_WB_FILE/) { + $file = 1; + } + $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed; + if ($file) { + $perprocesspid{$process_pid}->{HIGH_NR_FILE_RECLAIMED} += $nr_reclaimed; + } else { + $perprocesspid{$process_pid}->{HIGH_NR_ANON_RECLAIMED} += $nr_reclaimed; + } + } elsif ($tracepoint eq "mm_vmscan_writepage") { + $details = $6; + if ($details !~ /$regex_writepage/o) { + print "WARNING: Failed to parse mm_vmscan_writepage as expected\n"; + print " $details\n"; + print " $regex_writepage\n"; + next; + } + + my $flags = $3; + my $file = 0; + my $sync_io = 0; + if ($flags =~ /RECLAIM_WB_FILE/) { + $file = 1; + } + if ($flags =~ /RECLAIM_WB_SYNC/) { + $sync_io = 1; + } + if ($sync_io) { + if ($file) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++; + } + } else { + if ($file) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++; + } + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + # Work out latencies + printf("\n") if !$opt_ignorepid; + printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid; + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] && + !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) { + next; + } + + printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid; + my $index = 0; + while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] || + defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) { + + if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid; + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); + $total_direct_latency += $latency; + } else { + printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid; + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]); + $total_kswapd_latency += $latency; + } + $index++; + } + print "\n" if !$opt_ignorepid; + } + + # Print out process activity + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled"); + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { + next; + } + + $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; + $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; + $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_direct_nr_file_scanned += $stats{$process_pid}->{HIGH_NR_FILE_SCANNED}; + $total_direct_nr_anon_scanned += $stats{$process_pid}->{HIGH_NR_ANON_SCANNED}; + $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; + $total_direct_nr_file_reclaimed += $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED}; + $total_direct_nr_anon_reclaimed += $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED}; + $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + + $total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; + + my $index = 0; + my $this_reclaim_delay = 0; + while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); + $this_reclaim_delay += $latency; + $index++; + } + + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f", + $process_pid, + $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, + $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, + $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_FILE_SCANNED}, + $stats{$process_pid}->{HIGH_NR_ANON_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, + $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED}, + $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}, + $this_reclaim_delay / 1000); + + if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; + if ($count != 0) { + print "direct-$order=$count "; + } + } + } + if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; + if ($count != 0) { + print "wakeup-$order=$count "; + } + } + } + + print "\n"; + } + + # Print out kswapd activity + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO"); + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { + next; + } + + $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; + $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_kswapd_nr_file_scanned += $stats{$process_pid}->{HIGH_NR_FILE_SCANNED}; + $total_kswapd_nr_anon_scanned += $stats{$process_pid}->{HIGH_NR_ANON_SCANNED}; + $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; + $total_kswapd_nr_file_reclaimed += $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED}; + $total_kswapd_nr_anon_reclaimed += $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED}; + $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; + + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u", + $process_pid, + $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, + $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, + $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_FILE_SCANNED}, + $stats{$process_pid}->{HIGH_NR_ANON_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, + $stats{$process_pid}->{HIGH_NR_FILE_RECLAIMED}, + $stats{$process_pid}->{HIGH_NR_ANON_RECLAIMED}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}); + + if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; + if ($count != 0) { + print "wake-$order=$count "; + } + } + } + if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]; + if ($count != 0) { + print "rewake-$order=$count "; + } + } + } + printf("\n"); + } + + # Print out summaries + $total_direct_latency /= 1000; + $total_kswapd_latency /= 1000; + print "\nSummary\n"; + print "Direct reclaims: $total_direct_reclaim\n"; + print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; + print "Direct reclaim file pages scanned: $total_direct_nr_file_scanned\n"; + print "Direct reclaim anon pages scanned: $total_direct_nr_anon_scanned\n"; + print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n"; + print "Direct reclaim file pages reclaimed: $total_direct_nr_file_reclaimed\n"; + print "Direct reclaim anon pages reclaimed: $total_direct_nr_anon_reclaimed\n"; + print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n"; + print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n"; + print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n"; + print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n"; + print "Wake kswapd requests: $total_wakeup_kswapd\n"; + printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency; + print "\n"; + print "Kswapd wakeups: $total_kswapd_wake\n"; + print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; + print "Kswapd file pages scanned: $total_kswapd_nr_file_scanned\n"; + print "Kswapd anon pages scanned: $total_kswapd_nr_anon_scanned\n"; + print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n"; + print "Kswapd file pages reclaimed: $total_kswapd_nr_file_reclaimed\n"; + print "Kswapd anon pages reclaimed: $total_kswapd_nr_anon_reclaimed\n"; + print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n"; + print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n"; + print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n"; + print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n"; + printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency; +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; + $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; + $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; + $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; + $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; + $perprocess{$process}->{HIGH_NR_FILE_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED}; + $perprocess{$process}->{HIGH_NR_ANON_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED}; + $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED}; + $perprocess{$process}->{HIGH_NR_FILE_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_FILE_RECLAIMED}; + $perprocess{$process}->{HIGH_NR_ANON_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_ANON_RECLAIMED}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; + + for (my $order = 0; $order < 20; $order++) { + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; + $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; + $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; + + } + + # Aggregate direct reclaim latencies + my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END}; + my $rd_index = 0; + while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) { + $perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]; + $rd_index++; + $wr_index++; + } + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; + + # Aggregate kswapd latencies + my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP}; + my $rd_index = 0; + while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) { + $perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]; + $rd_index++; + $wr_index++; + } + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/kernel/Documentation/trace/ring-buffer-design.txt b/kernel/Documentation/trace/ring-buffer-design.txt new file mode 100644 index 000000000..ff747b6fa --- /dev/null +++ b/kernel/Documentation/trace/ring-buffer-design.txt @@ -0,0 +1,955 @@ + Lockless Ring Buffer Design + =========================== + +Copyright 2009 Red Hat Inc. + Author: Steven Rostedt <srostedt@redhat.com> + License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, + and Frederic Weisbecker. + + +Written for: 2.6.31 + +Terminology used in this Document +--------------------------------- + +tail - where new writes happen in the ring buffer. + +head - where new reads happen in the ring buffer. + +producer - the task that writes into the ring buffer (same as writer) + +writer - same as producer + +consumer - the task that reads from the buffer (same as reader) + +reader - same as consumer. + +reader_page - A page outside the ring buffer used solely (for the most part) + by the reader. + +head_page - a pointer to the page that the reader will use next + +tail_page - a pointer to the page that will be written to next + +commit_page - a pointer to the page with the last finished non-nested write. + +cmpxchg - hardware-assisted atomic transaction that performs the following: + + A = B iff previous A == C + + R = cmpxchg(A, C, B) is saying that we replace A with B if and only if + current A is equal to C, and we put the old (current) A into R + + R gets the previous A regardless if A is updated with B or not. + + To see if the update was successful a compare of R == C may be used. + +The Generic Ring Buffer +----------------------- + +The ring buffer can be used in either an overwrite mode or in +producer/consumer mode. + +Producer/consumer mode is where if the producer were to fill up the +buffer before the consumer could free up anything, the producer +will stop writing to the buffer. This will lose most recent events. + +Overwrite mode is where if the producer were to fill up the buffer +before the consumer could free up anything, the producer will +overwrite the older data. This will lose the oldest events. + +No two writers can write at the same time (on the same per-cpu buffer), +but a writer may interrupt another writer, but it must finish writing +before the previous writer may continue. This is very important to the +algorithm. The writers act like a "stack". The way interrupts works +enforces this behavior. + + + writer1 start + <preempted> writer2 start + <preempted> writer3 start + writer3 finishes + writer2 finishes + writer1 finishes + +This is very much like a writer being preempted by an interrupt and +the interrupt doing a write as well. + +Readers can happen at any time. But no two readers may run at the +same time, nor can a reader preempt/interrupt another reader. A reader +cannot preempt/interrupt a writer, but it may read/consume from the +buffer at the same time as a writer is writing, but the reader must be +on another processor to do so. A reader may read on its own processor +and can be preempted by a writer. + +A writer can preempt a reader, but a reader cannot preempt a writer. +But a reader can read the buffer at the same time (on another processor) +as a writer. + +The ring buffer is made up of a list of pages held together by a linked list. + +At initialization a reader page is allocated for the reader that is not +part of the ring buffer. + +The head_page, tail_page and commit_page are all initialized to point +to the same page. + +The reader page is initialized to have its next pointer pointing to +the head page, and its previous pointer pointing to a page before +the head page. + +The reader has its own page to use. At start up time, this page is +allocated but is not attached to the list. When the reader wants +to read from the buffer, if its page is empty (like it is on start-up), +it will swap its page with the head_page. The old reader page will +become part of the ring buffer and the head_page will be removed. +The page after the inserted page (old reader_page) will become the +new head page. + +Once the new page is given to the reader, the reader could do what +it wants with it, as long as a writer has left that page. + +A sample of how the reader page is swapped: Note this does not +show the head page in the buffer, it is for demonstrating a swap +only. + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |-->| |-->| | + | |<--| |<--| | + +---+ +---+ +---+ + ^ | ^ | + | +-------------+ | + +-----------------+ + + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ v + | +---+ +---+ +---+ + | | |-->| |-->| | + | | |<--| |<--| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +-------------+ | | + | +-----------------+ | + +------------------------------------+ + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + + + +It is possible that the page swapped is the commit page and the tail page, +if what is in the ring buffer is less than what is held in a buffer page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +This case is still valid for this algorithm. +When the writer leaves the page, it simply goes into the ring buffer +since the reader page still points to the next location in the ring +buffer. + + +The main pointers: + + reader page - The page used solely by the reader and is not part + of the ring buffer (may be swapped in) + + head page - the next page in the ring buffer that will be swapped + with the reader page. + + tail page - the page where the next write will take place. + + commit page - the page that last finished a write. + +The commit page only is updated by the outermost writer in the +writer stack. A writer that preempts another writer will not move the +commit page. + +When data is written into the ring buffer, a position is reserved +in the ring buffer and passed back to the writer. When the writer +is finished writing data into that position, it commits the write. + +Another write (or a read) may take place at anytime during this +transaction. If another write happens it must finish before continuing +with the previous write. + + + Write reserve: + + Buffer page + +---------+ + |written | + +---------+ <--- given back to writer (current commit) + |reserved | + +---------+ <--- tail pointer + | empty | + +---------+ + + Write commit: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ <--- next position for write (current commit) + | empty | + +---------+ + + + If a write happens after the first reserve: + + Buffer page + +---------+ + |written | + +---------+ <-- current commit + |reserved | + +---------+ <--- given back to second writer + |reserved | + +---------+ <--- tail pointer + + After second writer commits: + + + Buffer page + +---------+ + |written | + +---------+ <--(last full commit) + |reserved | + +---------+ + |pending | + |commit | + +---------+ <--- tail pointer + + When the first writer commits: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ + |written | + +---------+ <--(last full commit and tail pointer) + + +The commit pointer points to the last write location that was +committed without preempting another write. When a write that +preempted another write is committed, it only becomes a pending commit +and will not be a full commit until all writes have been committed. + +The commit page points to the page that has the last full commit. +The tail page points to the page with the last write (before +committing). + +The tail page is always equal to or after the commit page. It may +be several pages ahead. If the tail page catches up to the commit +page then no more writes may take place (regardless of the mode +of the ring buffer: overwrite and produce/consumer). + +The order of pages is: + + head page + commit page + tail page + +Possible scenario: + tail page + head page commit page | + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +There is a special case that the head page is after either the commit page +and possibly the tail page. That is when the commit (and tail) page has been +swapped with the reader page. This is because the head page is always +part of the ring buffer, but the reader page is not. Whenever there +has been less than a full page that has been committed inside the ring buffer, +and a reader swaps out a page, it will be swapping out the commit page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + +In this case, the head page will not move when the tail and commit +move back into the ring buffer. + +The reader cannot swap a page into the ring buffer if the commit page +is still on that page. If the read meets the last commit (real commit +not pending or reserved), then there is nothing more to read. +The buffer is considered empty until another full commit finishes. + +When the tail meets the head page, if the buffer is in overwrite mode, +the head page will be pushed ahead one. If the buffer is in producer/consumer +mode, the write will fail. + +Overwrite mode: + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + +Note, the reader page will still point to the previous head page. +But when a swap takes place, it will use the most recent head page. + + +Making the Ring Buffer Lockless: +-------------------------------- + +The main idea behind the lockless algorithm is to combine the moving +of the head_page pointer with the swapping of pages with the reader. +State flags are placed inside the pointer to the page. To do this, +each page must be aligned in memory by 4 bytes. This will allow the 2 +least significant bits of the address to be used as flags, since +they will always be zero for the address. To get the address, +simply mask out the flags. + + MASK = ~3 + + address & MASK + +Two flags will be kept by these two bits: + + HEADER - the page being pointed to is a head page + + UPDATE - the page being pointed to is being updated by a writer + and was or is about to be a head page. + + + reader page + | + v + +---+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above pointer "-H->" would have the HEADER flag set. That is +the next page is the next page to be swapped out by the reader. +This pointer means the next page is the head page. + +When the tail page meets the head pointer, it will use cmpxchg to +change the pointer to the UPDATE state: + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +"-U->" represents a pointer in the UPDATE state. + +Any access to the reader will need to take some sort of lock to serialize +the readers. But the writers will never take a lock to write to the +ring buffer. This means we only need to worry about a single reader, +and writes only preempt in "stack" formation. + +When the reader tries to swap the page with the ring buffer, it +will also use cmpxchg. If the flag bit in the pointer to the +head page does not have the HEADER flag set, the compare will fail +and the reader will need to look for the new head page and try again. +Note, the flags UPDATE and HEADER are never set at the same time. + +The reader swaps the reader page as follows: + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |--->| |--->| | + | |<---| |<---| | + +---+ +---+ +---+ + ^ | ^ | + | +---------------+ | + +-----H-------------+ + +The reader sets the reader page next pointer as HEADER to the page after +the head page. + + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | +---+ +---+ +---+ + | | |--->| |--->| | + | | |<---| |<---| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +---------------+ | | + | +-----H-------------+ | + +--------------------------------------+ + +It does a cmpxchg with the pointer to the previous head page to make it +point to the reader page. Note that the new pointer does not have the HEADER +flag set. This action atomically moves the head page forward. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | |<--| |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + +After the new head page is set, the previous pointer of the head page is +updated to the reader page. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------H-----------+ <--- New head page + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + +Another important point: The page that the reader page points back to +by its previous pointer (the one that now points to the new head page) +never points back to the reader page. That is because the reader page is +not part of the ring buffer. Traversing the ring buffer via the next pointers +will always stay in the ring buffer. Traversing the ring buffer via the +prev pointers may not. + +Note, the way to determine a reader page is simply by examining the previous +pointer of the page. If the next pointer of the previous page does not +point back to the original page, then the original page is a reader page: + + + +--------+ + | reader | next +----+ + | page |-------->| |<====== (buffer page) + +--------+ +----+ + | | ^ + | v | next + prev | +----+ + +------------->| | + +----+ + +The way the head page moves forward: + +When the tail page meets the head page and the buffer is in overwrite mode +and more writes take place, the head page must be moved forward before the +writer may move the tail page. The way this is done is that the writer +performs a cmpxchg to convert the pointer to the head page from the HEADER +flag to have the UPDATE flag set. Once this is done, the reader will +not be able to swap the head page from the buffer, nor will it be able to +move the head page, until the writer is finished with the move. + +This eliminates any races that the reader can have on the writer. The reader +must spin, and this is why the reader cannot preempt the writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The following page will be made into the new head page. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the new head page has been set, we can set the old head page +pointer back to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the head page has been moved, the tail page may now move forward. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above are the trivial updates. Now for the more complex scenarios. + + +As stated before, if enough writes preempt the first write, the +tail page may make it all the way around the buffer and meet the commit +page. At this time, we must start dropping writes (usually with some kind +of warning to the user). But what happens if the commit was still on the +reader page? The commit page is not part of the ring buffer. The tail page +must account for this. + + + reader page commit page + | | + v | + +---+ | + | |<----------+ + | | + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + tail page + +If the tail page were to simply push the head page forward, the commit when +leaving the reader page would not be pointing to the correct page. + +The solution to this is to test if the commit page is on the reader page +before pushing the head page. If it is, then it can be assumed that the +tail page wrapped the buffer, and we must drop new writes. + +This is not a race condition, because the commit page can only be moved +by the outermost writer (the writer that was preempted). +This means that the commit will not move while a writer is moving the +tail page. The reader cannot swap the reader page if it is also being +used as the commit page. The reader can simply check that the commit +is off the reader page. Once the commit page leaves the reader page +it will never go back on it unless a reader does another swap with the +buffer page that is also the commit page. + + +Nested writes +------------- + +In the pushing forward of the tail page we must first push forward +the head page if the head page is the next page. If the head page +is not the next page, the tail page is simply updated with a cmpxchg. + +Only writers move the tail page. This must be done atomically to protect +against nested writers. + + temp_page = tail_page + next_page = temp_page->next + cmpxchg(tail_page, temp_page, next_page) + +The above will update the tail page if it is still pointing to the expected +page. If this fails, a nested write pushed it forward, the current write +does not need to push it. + + + temp page + | + v + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Nested write comes in and moves the tail page forward: + + tail page (moved by nested writer) + temp page | + | | + v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The above would fail the cmpxchg, but since the tail page has already +been moved forward, the writer will just try again to reserve storage +on the new tail page. + +But the moving of the head page is a bit more complex. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But if a nested writer preempts here, it will see that the next +page is a head page, but it is also nested. It will detect that +it is nested and will save that information. The detection is the +fact that it sees the UPDATE flag instead of a HEADER or NORMAL +pointer. + +The nested writer will set the new head page pointer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But it will not reset the update back to normal. Only the writer +that converted a pointer from HEAD to UPDATE will convert it back +to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the nested writer finishes, the outermost writer will convert +the UPDATE pointer to NORMAL. + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +It can be even more complex if several nested writes came in and moved +the tail page ahead several pages: + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Next writer comes in, and sees the update and sets up the new +head page. + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The nested writer moves the tail page forward. But does not set the old +update page to NORMAL because it is not the outermost writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Another writer preempts and sees the page after the tail page is a head page. +It changes it from HEAD to UPDATE. + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The writer will move the head page forward: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But now that the third writer did change the HEAD flag to UPDATE it +will convert it to normal: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +Then it will move the tail page, and return back to the second writer. + + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The second writer will fail to move the tail page because it was already +moved, so it will try again and add its data to the new tail page. +It will return to the first writer. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The first writer cannot know atomically if the tail page moved +while it updates the HEAD page. It will then update the head page to +what it thinks is the new head page. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Since the cmpxchg returns the old value of the pointer the first writer +will see it succeeded in updating the pointer from NORMAL to HEAD. +But as we can see, this is not good enough. It must also check to see +if the tail page is either where it use to be or on the next page: + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +If tail page != A and tail page != B, then it must reset the pointer +back to NORMAL. The fact that it only needs to worry about nested +writers means that it only needs to check this after setting the HEAD page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Now the writer can update the head page. This is also why the head page must +remain in UPDATE and only reset by the outermost writer. This prevents +the reader from seeing the incorrect head page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + diff --git a/kernel/Documentation/trace/tracepoint-analysis.txt b/kernel/Documentation/trace/tracepoint-analysis.txt new file mode 100644 index 000000000..058cc6c9d --- /dev/null +++ b/kernel/Documentation/trace/tracepoint-analysis.txt @@ -0,0 +1,327 @@ + Notes on Analysing Behaviour Using Events and Tracepoints + + Documentation written by Mel Gorman + PCL information heavily based on email from Ingo Molnar + +1. Introduction +=============== + +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without +creating custom kernel modules to register probe functions using the event +tracing infrastructure. + +Simplistically, tracepoints represent important events that can be +taken in conjunction with other tracepoints to build a "Big Picture" of +what is going on within the system. There are a large number of methods for +gathering and interpreting these events. Lacking any current Best Practises, +this document describes some of the methods that can be used. + +This document assumes that debugfs is mounted on /sys/kernel/debug and that +the appropriate tracing options have been configured into the kernel. It is +assumed that the PCL tool tools/perf has been installed and is in your path. + +2. Listing Available Events +=========================== + +2.1 Standard Utilities +---------------------- + +All possible events are visible from /sys/kernel/debug/tracing/events. Simply +calling + + $ find /sys/kernel/debug/tracing/events -type d + +will give a fair indication of the number of events available. + +2.2 PCL (Performance Counters for Linux) +------- + +Discovery and enumeration of all counters and events, including tracepoints, +are available with the perf tool. Getting a list of available events is a +simple case of: + + $ perf list 2>&1 | grep Tracepoint + ext4:ext4_free_inode [Tracepoint event] + ext4:ext4_request_inode [Tracepoint event] + ext4:ext4_allocate_inode [Tracepoint event] + ext4:ext4_write_begin [Tracepoint event] + ext4:ext4_ordered_write_end [Tracepoint event] + [ .... remaining output snipped .... ] + + +3. Enabling Events +================== + +3.1 System-Wide Event Enabling +------------------------------ + +See Documentation/trace/events.txt for a proper description on how events +can be enabled system-wide. A short example of enabling all events related +to page allocation would look something like: + + $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done + +3.2 System-Wide Event Enabling with SystemTap +--------------------------------------------- + +In SystemTap, tracepoints are accessible using the kernel.trace() function +call. The following is an example that reports every 5 seconds what processes +were allocating the pages. + + global page_allocs + + probe kernel.trace("mm_page_alloc") { + page_allocs[execname()]++ + } + + function print_count() { + printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") + foreach (proc in page_allocs-) + printf("%-25d %s\n", page_allocs[proc], proc) + printf ("\n") + delete page_allocs + } + + probe timer.s(5) { + print_count() + } + +3.3 System-Wide Event Enabling with PCL +--------------------------------------- + +By specifying the -a switch and analysing sleep, the system-wide events +for a duration of time can be examined. + + $ perf stat -a \ + -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + sleep 10 + Performance counter stats for 'sleep 10': + + 9630 kmem:mm_page_alloc + 2143 kmem:mm_page_free + 7424 kmem:mm_page_free_batched + + 10.002577764 seconds time elapsed + +Similarly, one could execute a shell and exit it as desired to get a report +at that point. + +3.4 Local Event Enabling +------------------------ + +Documentation/trace/ftrace.txt describes how to enable events on a per-thread +basis using set_ftrace_pid. + +3.5 Local Event Enablement with PCL +----------------------------------- + +Events can be activated and tracked for the duration of a process on a local +basis using PCL such as follows. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched ./hackbench 10 + Time: 0.909 + + Performance counter stats for './hackbench 10': + + 17803 kmem:mm_page_alloc + 12398 kmem:mm_page_free + 4827 kmem:mm_page_free_batched + + 0.973913387 seconds time elapsed + +4. Event Filtering +================== + +Documentation/trace/ftrace.txt covers in-depth how to filter events in +ftrace. Obviously using grep and awk of trace_pipe is an option as well +as any script reading trace_pipe. + +5. Analysing Event Variances with PCL +===================================== + +Any workload can exhibit variances between runs and it can be important +to know what the standard deviation is. By and large, this is left to the +performance analyst to do it by hand. In the event that the discrete event +occurrences are useful to the performance analyst, then perf can be used. + + $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free + -e kmem:mm_page_free_batched ./hackbench 10 + Time: 0.890 + Time: 0.895 + Time: 0.915 + Time: 1.001 + Time: 0.899 + + Performance counter stats for './hackbench 10' (5 runs): + + 16630 kmem:mm_page_alloc ( +- 3.542% ) + 11486 kmem:mm_page_free ( +- 4.771% ) + 4730 kmem:mm_page_free_batched ( +- 2.325% ) + + 0.982653002 seconds time elapsed ( +- 1.448% ) + +In the event that some higher-level event is required that depends on some +aggregation of discrete events, then a script would need to be developed. + +Using --repeat, it is also possible to view how events are fluctuating over +time on a system-wide basis using -a and sleep. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + -a --repeat 10 \ + sleep 1 + Performance counter stats for 'sleep 1' (10 runs): + + 1066 kmem:mm_page_alloc ( +- 26.148% ) + 182 kmem:mm_page_free ( +- 5.464% ) + 890 kmem:mm_page_free_batched ( +- 30.079% ) + + 1.002251757 seconds time elapsed ( +- 0.005% ) + +6. Higher-Level Analysis with Helper Scripts +============================================ + +When events are enabled the events that are triggering can be read from +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary +options exist as well. By post-processing the output, further information can +be gathered on-line as appropriate. Examples of post-processing might include + + o Reading information from /proc for the PID that triggered the event + o Deriving a higher-level event from a series of lower-level events. + o Calculating latencies between two events + +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example +script that can read trace_pipe from STDIN or a copy of a trace. When used +on-line, it can be interrupted once to generate a report without exiting +and twice to exit. + +Simplistically, the script just reads STDIN and counts up events but it +also can do more such as + + o Derive high-level events from many low-level events. If a number of pages + are freed to the main allocator from the per-CPU lists, it recognises + that as one per-CPU drain even though there is no specific tracepoint + for that event + o It can aggregate based on PID or individual process number + o In the event memory is getting externally fragmented, it reports + on whether the fragmentation event was severe or moderate. + o When receiving an event about a PID, it can record who the parent was so + that if large numbers of events are coming from very short-lived + processes, the parent process responsible for creating all the helpers + can be identified + +7. Lower-Level Analysis with PCL +================================ + +There may also be a requirement to identify what functions within a program +were generating events within the kernel. To begin this sort of analysis, the +data must be recorded. At the time of writing, this required root: + + $ perf record -c 1 \ + -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + ./hackbench 10 + Time: 0.894 + [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] + +Note the use of '-c 1' to set the event period to sample. The default sample +period is quite high to minimise overhead but the information collected can be +very coarse as a result. + +This record outputted a file called perf.data which can be analysed using +perf report. + + $ perf report + # Samples: 30922 + # + # Overhead Command Shared Object + # ........ ......... ................................ + # + 87.27% hackbench [vdso] + 6.85% hackbench /lib/i686/cmov/libc-2.9.so + 2.62% hackbench /lib/ld-2.9.so + 1.52% perf [vdso] + 1.22% hackbench ./hackbench + 0.48% hackbench [kernel] + 0.02% perf /lib/i686/cmov/libc-2.9.so + 0.01% perf /usr/bin/perf + 0.01% perf /lib/ld-2.9.so + 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so + # + # (For more details, try: perf report --sort comm,dso,symbol) + # + +According to this, the vast majority of events triggered on events +within the VDSO. With simple binaries, this will often be the case so let's +take a slightly different example. In the course of writing this, it was +noticed that X was generating an insane amount of page allocations so let's look +at it: + + $ perf record -c 1 -f \ + -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + -p `pidof X` + +This was interrupted after a few seconds and + + $ perf report + # Samples: 27666 + # + # Overhead Command Shared Object + # ........ ....... ....................................... + # + 51.95% Xorg [vdso] + 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 + 0.09% Xorg /lib/i686/cmov/libc-2.9.so + 0.01% Xorg [kernel] + # + # (For more details, try: perf report --sort comm,dso,symbol) + # + +So, almost half of the events are occurring in a library. To get an idea which +symbol: + + $ perf report --sort comm,dso,symbol + # Samples: 27666 + # + # Overhead Command Shared Object Symbol + # ........ ....... ....................................... ...... + # + 51.95% Xorg [vdso] [.] 0x000000ffffe424 + 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 + 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc + 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f + 0.01% Xorg [kernel] [k] read_hpet + 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path + 0.00% Xorg [kernel] [k] ftrace_trace_userstack + +To see where within the function pixmanFillsse2 things are going wrong: + + $ perf annotate pixmanFillsse2 + [ ... ] + 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) + : } + : + : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ + : _mm_store_si128 (__m128i *__P, __m128i __B) : { + : *__P = __B; + 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) + 0.00 : 34ef5: ff + 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) + 0.00 : 34efd: ff + 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) + 0.00 : 34f05: ff + 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) + 0.00 : 34f0d: ff + 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) + 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) + 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) + 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) + +At a glance, it looks like the time is being spent copying pixmaps to +the card. Further investigation would be needed to determine why pixmaps +are being copied around so much but a starting point would be to take an +ancient build of libpixmap out of the library path where it was totally +forgotten about from months ago! diff --git a/kernel/Documentation/trace/tracepoints.txt b/kernel/Documentation/trace/tracepoints.txt new file mode 100644 index 000000000..a3efac621 --- /dev/null +++ b/kernel/Documentation/trace/tracepoints.txt @@ -0,0 +1,145 @@ + Using the Linux Kernel Tracepoints + + Mathieu Desnoyers + + +This document introduces Linux Kernel Tracepoints and their use. It +provides examples of how to insert tracepoints in the kernel and +connect probe functions to them and provides some examples of probe +functions. + + +* Purpose of tracepoints + +A tracepoint placed in code provides a hook to call a function (probe) +that you can provide at runtime. A tracepoint can be "on" (a probe is +connected to it) or "off" (no probe is attached). When a tracepoint is +"off" it has no effect, except for adding a tiny time penalty +(checking a condition for a branch) and space penalty (adding a few +bytes for the function call at the end of the instrumented function +and adds a data structure in a separate section). When a tracepoint +is "on", the function you provide is called each time the tracepoint +is executed, in the execution context of the caller. When the function +provided ends its execution, it returns to the caller (continuing from +the tracepoint site). + +You can put tracepoints at important locations in the code. They are +lightweight hooks that can pass an arbitrary number of parameters, +which prototypes are described in a tracepoint declaration placed in a +header file. + +They can be used for tracing and performance accounting. + + +* Usage + +Two elements are required for tracepoints : + +- A tracepoint definition, placed in a header file. +- The tracepoint statement, in C code. + +In order to use tracepoints, you should include linux/tracepoint.h. + +In include/trace/events/subsys.h : + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM subsys + +#if !defined(_TRACE_SUBSYS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SUBSYS_H + +#include <linux/tracepoint.h> + +DECLARE_TRACE(subsys_eventname, + TP_PROTO(int firstarg, struct task_struct *p), + TP_ARGS(firstarg, p)); + +#endif /* _TRACE_SUBSYS_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> + +In subsys/file.c (where the tracing statement must be added) : + +#include <trace/events/subsys.h> + +#define CREATE_TRACE_POINTS +DEFINE_TRACE(subsys_eventname); + +void somefct(void) +{ + ... + trace_subsys_eventname(arg, task); + ... +} + +Where : +- subsys_eventname is an identifier unique to your event + - subsys is the name of your subsystem. + - eventname is the name of the event to trace. + +- TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the + function called by this tracepoint. + +- TP_ARGS(firstarg, p) are the parameters names, same as found in the + prototype. + +- if you use the header in multiple source files, #define CREATE_TRACE_POINTS + should appear only in one source file. + +Connecting a function (probe) to a tracepoint is done by providing a +probe (function to call) for the specific tracepoint through +register_trace_subsys_eventname(). Removing a probe is done through +unregister_trace_subsys_eventname(); it will remove the probe. + +tracepoint_synchronize_unregister() must be called before the end of +the module exit function to make sure there is no caller left using +the probe. This, and the fact that preemption is disabled around the +probe call, make sure that probe removal and module unload are safe. + +The tracepoint mechanism supports inserting multiple instances of the +same tracepoint, but a single definition must be made of a given +tracepoint name over all the kernel to make sure no type conflict will +occur. Name mangling of the tracepoints is done using the prototypes +to make sure typing is correct. Verification of probe type correctness +is done at the registration site by the compiler. Tracepoints can be +put in inline functions, inlined static functions, and unrolled loops +as well as regular functions. + +The naming scheme "subsys_event" is suggested here as a convention +intended to limit collisions. Tracepoint names are global to the +kernel: they are considered as being the same whether they are in the +core kernel image or in modules. + +If the tracepoint has to be used in kernel modules, an +EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be +used to export the defined tracepoints. + +If you need to do a bit of work for a tracepoint parameter, and +that work is only used for the tracepoint, that work can be encapsulated +within an if statement with the following: + + if (trace_foo_bar_enabled()) { + int i; + int tot = 0; + + for (i = 0; i < count; i++) + tot += calculate_nuggets(); + + trace_foo_bar(tot); + } + +All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled() +function defined that returns true if the tracepoint is enabled and +false otherwise. The trace_<tracepoint>() should always be within the +block of the if (trace_<tracepoint>_enabled()) to prevent races between +the tracepoint being enabled and the check being seen. + +The advantage of using the trace_<tracepoint>_enabled() is that it uses +the static_key of the tracepoint to allow the if statement to be implemented +with jump labels and avoid conditional branches. + +Note: The convenience macro TRACE_EVENT provides an alternative way to + define tracepoints. Check http://lwn.net/Articles/379903, + http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362 + for a series of articles with more details. diff --git a/kernel/Documentation/trace/uprobetracer.txt b/kernel/Documentation/trace/uprobetracer.txt new file mode 100644 index 000000000..f1cf9a34a --- /dev/null +++ b/kernel/Documentation/trace/uprobetracer.txt @@ -0,0 +1,159 @@ + Uprobe-tracer: Uprobe-based Event Tracing + ========================================= + + Documentation written by Srikar Dronamraju + + +Overview +-------- +Uprobe based trace events are similar to kprobe based trace events. +To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y. + +Similar to the kprobe-event tracer, this doesn't need to be activated via +current_tracer. Instead of that, add probe points via +/sys/kernel/debug/tracing/uprobe_events, and enable it via +/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled. + +However unlike kprobe-event tracer, the uprobe event interface expects the +user to calculate the offset of the probepoint in the object. + +Synopsis of uprobe_tracer +------------------------- + p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe + r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe) + -:[GRP/]EVENT : Clear uprobe or uretprobe event + + GRP : Group name. If omitted, "uprobes" is the default value. + EVENT : Event name. If omitted, the event name is generated based + on PATH+OFFSET. + PATH : Path to an executable or a library. + OFFSET : Offset where the probe is inserted. + + FETCHARGS : Arguments. Each probe can have up to 128 args. + %REG : Fetch register REG + @ADDR : Fetch memory at ADDR (ADDR should be in userspace) + @+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH) + $stackN : Fetch Nth entry of stack (N >= 0) + $stack : Fetch stack address. + $retval : Fetch return value.(*) + +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types + (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield + are supported. + + (*) only for return probe. + (**) this is useful for fetching a field of data structures. + +Types +----- +Several types are supported for fetch-args. Uprobe tracer will access memory +by given type. Prefix 's' and 'u' means those types are signed and unsigned +respectively. Traced arguments are shown in decimal (signed) or hex (unsigned). +String type is a special type, which fetches a "null-terminated" string from +user space. +Bitfield is another special type, which takes 3 parameters, bit-width, bit- +offset, and container-size (usually 32). The syntax is; + + b<bit-width>@<bit-offset>/<container-size> + + +Event Profiling +--------------- +You can check the total number of probe hits and probe miss-hits via +/sys/kernel/debug/tracing/uprobe_profile. +The first column is event name, the second is the number of probe hits, +the third is the number of probe miss-hits. + +Usage examples +-------------- + * Add a probe as a new uprobe event, write a new definition to uprobe_events +as below: (sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash) + + echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + + * Add a probe as a new uretprobe event: + + echo 'r: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + + * Unset registered event: + + echo '-:bash_0x4245c0' >> /sys/kernel/debug/tracing/uprobe_events + + * Print out the events that are registered: + + cat /sys/kernel/debug/tracing/uprobe_events + + * Clear all events: + + echo > /sys/kernel/debug/tracing/uprobe_events + +Following example shows how to dump the instruction pointer and %ax register +at the probed text address. Probe zfree function in /bin/zsh: + + # cd /sys/kernel/debug/tracing/ + # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp + 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh + # objdump -T /bin/zsh | grep -w zfree + 0000000000446420 g DF .text 0000000000000012 Base zfree + + 0x46420 is the offset of zfree in object /bin/zsh that is loaded at + 0x00400000. Hence the command to uprobe would be: + + # echo 'p:zfree_entry /bin/zsh:0x46420 %ip %ax' > uprobe_events + + And the same for the uretprobe would be: + + # echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> uprobe_events + +Please note: User has to explicitly calculate the offset of the probe-point +in the object. We can see the events that are registered by looking at the +uprobe_events file. + + # cat uprobe_events + p:uprobes/zfree_entry /bin/zsh:0x00046420 arg1=%ip arg2=%ax + r:uprobes/zfree_exit /bin/zsh:0x00046420 arg1=%ip arg2=%ax + +Format of events can be seen by viewing the file events/uprobes/zfree_entry/format + + # cat events/uprobes/zfree_entry/format + name: zfree_entry + ID: 922 + format: + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; + field:int common_pid; offset:4; size:4; signed:1; + field:int common_padding; offset:8; size:4; signed:1; + + field:unsigned long __probe_ip; offset:12; size:4; signed:0; + field:u32 arg1; offset:16; size:4; signed:0; + field:u32 arg2; offset:20; size:4; signed:0; + + print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2 + +Right after definition, each event is disabled by default. For tracing these +events, you need to enable it by: + + # echo 1 > events/uprobes/enable + +Lets disable the event after sleeping for some time. + + # sleep 20 + # echo 0 > events/uprobes/enable + +And you can see the traced information via /sys/kernel/debug/tracing/trace. + + # cat trace + # tracer: nop + # + # TASK-PID CPU# TIMESTAMP FUNCTION + # | | | | | + zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79 + zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0 + zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79 + zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0 + +Output shows us uprobe was triggered for a pid 24842 with ip being 0x446420 +and contents of ax register being 79. And uretprobe was triggered with ip at +0x446540 with counterpart function entry at 0x446420. |