Why use performance counters?

Performance counters help you find bottlenecks in real applications. They report accurate metrics about how busy or idle the GPU is.

How to use performance counters?

GALLIUM_HUD

GALLIUM_HUD is an environment variable in Mesa that can be used to monitor performance counters through an on-screen overlay.

To get the list of available queries (which depends on your chipset), run 'GALLIUM_HUD="help" glxgears'.

Once you have that list, select the queries you want to monitor, for example 'inst_executed', and start monitoring with 'GALLIUM_HUD="inst_executed" glxgears'.
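To monitor several queries at once, the value of GALLIUM_HUD can list more than one name. The sketch below assumes the usual Mesa HUD syntax, where names separated by commas are drawn as separate graphs. The helper name hud_spec is hypothetical, and 'active_cycles' is only an example that must appear in the 'help' output for your chipset.

```shell
# Hypothetical helper: join query names with commas so that each
# one is drawn as its own graph by the HUD.
# The function body runs in a subshell, so changing IFS does not leak.
hud_spec() (
    IFS=,
    printf '%s\n' "$*"
)

# Show the value that would be passed to GALLIUM_HUD.
hud_spec inst_executed active_cycles

# Typical use (not executed here, since it needs a display):
#   GALLIUM_HUD="$(hud_spec inst_executed active_cycles)" glxgears
```

Building the value in a helper keeps the list of queries in one place when you switch between applications.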

apitrace

apitrace is a tool for tracing graphics APIs like OpenGL, but it can also be used to replay a trace and monitor perf counters per frame or per draw call.

To get the list of available queries, run 'glretrace --list-metrics'. You will see different backends, such as GL_AMD_performance_monitor and opengl. Each backend can expose several groups of queries, such as "MP counters" and "Performance metrics".

Once you have that list, select the queries you want to monitor before replaying the trace, then run, for example, 'glretrace --pframes=GL_AMD_performance_monitor:inst_executed'.
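The replay step can be wrapped in a small helper. This is a minimal sketch, assuming that glretrace takes the trace file as its last argument and that your apitrace build supports '--pframes'. The helper name pframes_cmd and the file name glxgears.trace are hypothetical. The helper only prints the command so it can be reviewed before being run.

```shell
# Hypothetical helper: print a glretrace command line that monitors
# one metric per frame.
# Arguments: backend name, metric name, trace file.
pframes_cmd() {
    backend=$1
    metric=$2
    trace=$3
    printf 'glretrace --pframes=%s:%s %s\n' "$backend" "$metric" "$trace"
}

# A trace is recorded first, for example with:
#   apitrace trace glxgears
# Then the replay command can be generated and inspected:
pframes_cmd GL_AMD_performance_monitor inst_executed glxgears.trace
```

Pass the backend and metric names exactly as they appear in the list of available queries.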

Global perf counters

These performance counters are global. They are configured through PCOUNTER by writing values directly to MMIO. The kernel-space interface is already merged, but the user-space one is still a work in progress (WIP).

Status

Hardware events NV50
geom_primitive_in_count WIP
geom_primitive_out_count WIP
geom_vertex_in_count WIP
geom_vertex_out_count WIP
gld_128b WIP
gld_32b WIP
gld_64b WIP
gld_coherent WIP
gld_incoherent WIP
gld_request WIP
gld_total WIP
gpu_idle WIP
gst_128b WIP
gst_32b WIP
gst_64b WIP
gst_coherent WIP
gst_incoherent WIP
gst_request WIP
gst_total WIP
input_assembler_waits_for_fb WIP
input_assembler_busy WIP
local_load WIP
local_store WIP
rasterizer_tiles_in_count WIP
rasterizer_tiles_killed_by_zcull_count WIP
rop_busy WIP
rop_samples_killed_by_earlyz_count WIP
rop_samples_killed_by_latez_count WIP
rop_waits_for_fb WIP
rop_waits_for_shader WIP
setup_line_count WIP
setup_point_count WIP
setup_primitive_count WIP
setup_primitive_culled_count WIP
setup_triangle_count WIP
stream_out_busy WIP
tex_cache_hit WIP
tex_cache_miss WIP
tex_waits_for_fb WIP
vertex_attribute_count WIP

MP perf counters

These performance counters are per-context. They are configured through the command stream, and we use a compute shader to read back the values (i.e. the $pm0..$pm7 special registers).

Status

Hardware events SM20 [1][2] SM21 [2] SM30 SM35 SM50
active_cycles DONE DONE DONE DONE DONE
active_ctas N/A N/A N/A N/A DONE
active_warps DONE DONE WIP WIP DONE
atom_cas_count N/A N/A DONE DONE N/A
atom_count DONE DONE DONE DONE DONE
branch/divergent_branch DONE DONE DONE N/A DONE
{gld,gst}_request DONE DONE DONE DONE N/A
global_atom_cas N/A N/A N/A N/A DONE
global_{load,store} N/A N/A N/A N/A DONE
global_{ld,st}_mem_divergence_replays N/A N/A DONE DONE N/A
global_store_transaction TODO TODO DONE DONE N/A
gred_count DONE DONE DONE DONE DONE
inst_executed DONE DONE DONE DONE DONE
inst_issued (and variants) DONE DONE DONE DONE DONE
l1_global_load_{hit,miss} TODO TODO DONE DONE N/A
__l1_global_{load,store}_transactions N/A N/A DONE DONE N/A
l1_local_{load,store}_{hit,miss} TODO TODO DONE DONE N/A
l1_shared_{load,store}_transactions N/A N/A DONE DONE N/A
local_{load,store} DONE DONE DONE DONE DONE
local_{load,store}_transactions N/A N/A DONE DONE N/A
not_predicated_off_thread_inst_executed N/A N/A N/A DONE DONE
prof_trigger_{00-07} DONE DONE DONE DONE DONE
shared_atom N/A N/A N/A N/A DONE
shared_atom_cas N/A N/A N/A N/A DONE
shared_{ld,st}_transactions N/A N/A N/A N/A DONE
shared_{load,store} DONE DONE DONE DONE DONE
shared_{load,store}_bank_conflict N/A N/A N/A N/A DONE
shared_{load,store}_replay N/A N/A DONE DONE N/A
sm_cta_launched TODO TODO DONE DONE DONE
thread_inst_executed (and variants) DONE DONE N/A DONE DONE
threads_launched DONE DONE DONE DONE N/A
uncached_global_load_transaction TODO TODO DONE DONE N/A
warps_launched DONE DONE DONE DONE DONE

Notes

[1] MP perf counters on GF100/GF110 (SM20) are buggy because of a context-switch problem that still needs to be fixed. Results might be wrong, so be careful!

[2] TODO means those perf counters are exposed through PCOUNTER.

Metrics

Status

Name SM20 SM21 SM30 SM35 SM50
achieved_occupancy DONE DONE DONE DONE DONE
alu_fu_utilization TODO TODO TODO TODO TODO
atomic_replay_overhead TODO TODO TODO TODO TODO
atomic_throughput TODO TODO TODO TODO TODO
atomic_transactions TODO TODO TODO TODO TODO
atomic_transactions_per_request TODO TODO TODO TODO TODO
branch_efficiency DONE DONE DONE N/A DONE
cf_executed TODO TODO TODO TODO TODO
cf_fu_utilization TODO TODO TODO TODO TODO
cf_issued TODO TODO TODO TODO TODO
dram_read_throughput TODO TODO TODO TODO TODO
dram_read_transactions TODO TODO TODO TODO TODO
dram_utilization TODO TODO TODO TODO TODO
dram_write_throughput TODO TODO TODO TODO TODO
dram_write_transactions TODO TODO TODO TODO TODO
eligible_warps_per_cycle TODO TODO TODO TODO TODO
flop_count_dp TODO TODO TODO TODO TODO
flop_count_dp_add TODO TODO TODO TODO TODO
flop_count_dp_fma TODO TODO TODO TODO TODO
flop_count_dp_mul TODO TODO TODO TODO TODO
flop_count_sp TODO TODO TODO TODO TODO
flop_count_sp_add TODO TODO TODO TODO TODO
flop_count_sp_fma TODO TODO TODO TODO TODO
flop_count_sp_mul TODO TODO TODO TODO TODO
flop_count_sp_special TODO TODO TODO TODO TODO
flop_dp_efficiency TODO TODO TODO TODO TODO
flop_sp_efficiency TODO TODO TODO TODO TODO
gld_efficiency TODO TODO TODO TODO TODO
gld_requested_throughput TODO TODO TODO TODO TODO
gld_throughput TODO TODO TODO TODO TODO
gld_transactions TODO TODO TODO TODO TODO
gld_transactions_per_request TODO TODO TODO TODO TODO
global_cache_replay_overhead TODO TODO TODO TODO TODO
gst_efficiency TODO TODO TODO TODO TODO
gst_requested_throughput TODO TODO TODO TODO TODO
gst_throughput TODO TODO TODO TODO TODO
gst_transactions TODO TODO TODO TODO TODO
gst_transactions_per_request TODO TODO TODO TODO TODO
inst_bit_convert TODO TODO TODO TODO TODO
inst_compute_ld_st TODO TODO TODO TODO TODO
inst_control TODO TODO TODO TODO TODO
inst_executed TODO TODO TODO TODO TODO
inst_fp_32 TODO TODO TODO TODO TODO
inst_fp_64 TODO TODO TODO TODO TODO
inst_integer TODO TODO TODO TODO TODO
inst_inter_thread_communication TODO TODO TODO TODO TODO
inst_issued N/A DONE DONE DONE DONE
inst_misc TODO TODO TODO TODO TODO
inst_per_warp DONE DONE DONE DONE DONE
inst_replay_overhead DONE DONE DONE DONE DONE
ipc DONE DONE DONE DONE DONE
issued_ipc DONE DONE DONE DONE DONE
issue_slots N/A DONE DONE DONE DONE
issue_slot_utilization DONE DONE DONE DONE DONE
l1_cache_global_hit_rate TODO TODO TODO TODO TODO
l1_cache_local_hit_rate TODO TODO TODO TODO TODO
l1_shared_utilization TODO TODO TODO TODO TODO
l2_atomic_throughput TODO TODO TODO TODO TODO
l2_atomic_transactions TODO TODO TODO TODO TODO
l2_l1_read_hit_rate TODO TODO TODO TODO TODO
l2_l1_read_throughput TODO TODO TODO TODO TODO
l2_l1_read_transactions TODO TODO TODO TODO TODO
l2_l1_write_throughput TODO TODO TODO TODO TODO
l2_l1_write_transactions TODO TODO TODO TODO TODO
l2_read_throughput TODO TODO TODO TODO TODO
l2_read_transactions TODO TODO TODO TODO TODO
l2_tex_read_transactions TODO TODO TODO TODO TODO
l2_texture_read_hit_rate TODO TODO TODO TODO TODO
l2_texture_read_throughput TODO TODO TODO TODO TODO
l2_utilization TODO TODO TODO TODO TODO
l2_write_throughput TODO TODO TODO TODO TODO
l2_write_transactions TODO TODO TODO TODO TODO
ldst_executed TODO TODO TODO TODO TODO
ldst_fu_utilization TODO TODO TODO TODO TODO
ldst_issued TODO TODO TODO TODO TODO
local_load_throughput TODO TODO TODO TODO TODO
local_load_transactions TODO TODO TODO TODO TODO
local_load_transactions_per_request TODO TODO TODO TODO TODO
local_memory_overhead TODO TODO TODO TODO TODO
local_replay_overhead TODO TODO TODO TODO TODO
local_store_throughput TODO TODO TODO TODO TODO
local_store_transactions TODO TODO TODO TODO TODO
local_store_transactions_per_request TODO TODO TODO TODO TODO
shared_efficiency TODO TODO TODO TODO TODO
shared_load_throughput TODO TODO TODO TODO TODO
shared_load_transactions TODO TODO TODO TODO TODO
shared_load_transactions_per_request TODO TODO TODO TODO TODO
shared_replay_overhead N/A N/A DONE DONE TODO
shared_store_throughput TODO TODO TODO TODO TODO
shared_store_transactions TODO TODO TODO TODO TODO
shared_store_transactions_per_request TODO TODO TODO TODO TODO
sm_efficiency TODO TODO TODO TODO TODO
stall_exec_dependency TODO TODO TODO TODO TODO
stall_inst_fetch TODO TODO TODO TODO TODO
stall_memory_dependency TODO TODO TODO TODO TODO
stall_memory_throttle TODO TODO TODO TODO TODO
stall_other TODO TODO TODO TODO TODO
stall_pipe_busy TODO TODO TODO TODO TODO
stall_sync TODO TODO TODO TODO TODO
stall_texture TODO TODO TODO TODO TODO
sysmem_read_throughput TODO TODO TODO TODO TODO
sysmem_read_transactions TODO TODO TODO TODO TODO
sysmem_utilization TODO TODO TODO TODO TODO
sysmem_write_throughput TODO TODO TODO TODO TODO
sysmem_write_transactions TODO TODO TODO TODO TODO
tex_cache_hit_rate TODO TODO TODO TODO TODO
tex_cache_throughput TODO TODO TODO TODO TODO
tex_cache_transactions TODO TODO TODO TODO TODO
tex_fu_utilization TODO TODO TODO TODO TODO
tex_utilization TODO TODO TODO TODO TODO
warp_execution_efficiency TODO TODO DONE DONE DONE
warp_nonpred_execution_efficiency N/A N/A N/A DONE DONE