Why use performance counters?
How to use performance counters?
    1. GALLIUM_HUD
    2. apitrace
Global perf counters
    1. Status
MP perf counters
    1. Status
    2. Notes
Metrics
    1. Status

Why use performance counters?

Performance counters help you find bottlenecks in real applications. They report accurate metrics on how busy or idle the GPU is.

How to use performance counters?

GALLIUM_HUD

GALLIUM_HUD is a Mesa environment variable that enables a heads-up display, which draws performance counters as graphs on top of the running application.

To get the list of available queries (which depends on your chipset), run 'GALLIUM_HUD="help" glxgears'.

Once you have that list, select the queries you want to monitor, for example 'inst_executed', and start monitoring with 'GALLIUM_HUD="inst_executed" glxgears'.
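
As a sketch of the two steps above, the following shell snippet builds a GALLIUM_HUD value from several counter names and prints the resulting commands instead of running them. It assumes that '+' plots counters in the same graph and ',' puts them in separate graphs, as described in the GALLIUM_HUD help text. 'hud_spec' is a hypothetical helper, and whether 'active_cycles' is exposed depends on your chipset.

```shell
#!/bin/sh
# Hypothetical helper: join counter names with a separator.
#   '+' -> counters share one graph (assumed GALLIUM_HUD syntax)
#   ',' -> each counter gets its own graph (assumed GALLIUM_HUD syntax)
hud_spec() {
    sep=$1
    shift
    spec=$1
    shift
    for name in "$@"; do
        spec="$spec$sep$name"
    done
    printf '%s\n' "$spec"
}

# Print the commands rather than running them, so the snippet has no side effects.
echo "GALLIUM_HUD=\"help\" glxgears"
echo "GALLIUM_HUD=\"$(hud_spec + inst_executed active_cycles)\" glxgears"
echo "GALLIUM_HUD=\"$(hud_spec , inst_executed active_cycles)\" glxgears"
```

Copy one of the printed lines into a terminal to start monitoring. Check the names against the output of the "help" query first, since the available counters differ between chipsets.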

apitrace

apitrace is a tool for tracing graphics APIs like OpenGL, but it can also replay a trace and monitor perf counters per frame or per draw call.

To get the list of available queries, run 'glretrace --list-metrics'. The output lists different backends, such as GL_AMD_performance_monitor and opengl. Each backend can expose several groups of queries, such as "MP counters" and "Performance metrics".

Once you have that list, select the queries you want to monitor and pass them when replaying the trace, for example 'glretrace --pframes=GL_AMD_performance_monitor:inst_executed'.
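
The whole apitrace workflow could look like the sketch below, which only prints the commands so that it is safe to run anywhere. The trace file name 'glxgears.trace' is an assumption (use whatever file 'apitrace trace' actually wrote), 'profile_opt' is a hypothetical helper, and '--pdrawcalls' is assumed to be the per-draw-call counterpart of '--pframes'.

```shell
#!/bin/sh
# Assumed names: adjust them to your trace file and to the output of --list-metrics.
trace="glxgears.trace"
backend="GL_AMD_performance_monitor"
counter="inst_executed"

# Hypothetical helper: build the profiling option for a given granularity
# ("frames" or "drawcalls").
profile_opt() {
    printf '%s%s=%s:%s\n' '--p' "$1" "$backend" "$counter"
}

# 1. Record a trace of the application.
echo "apitrace trace glxgears"
# 2. List the backends and queries available for replay.
echo "glretrace --list-metrics $trace"
# 3. Replay the trace while sampling the counter per frame, then per draw call.
echo "glretrace $(profile_opt frames) $trace"
echo "glretrace $(profile_opt drawcalls) $trace"
```

Per-frame sampling gives a quick overview, while per-draw-call sampling helps to narrow a bottleneck down to individual draws.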

Global perf counters

These performance counters are global. They are configured through PCOUNTER by writing values directly to MMIO. The kernelspace interface is already merged but the userspace one is still WIP.

Status

Hardware events	NV50
geom_primitive_in_count	WIP
geom_primitive_out_count	WIP
geom_vertex_in_count	WIP
geom_vertex_out_count	WIP
gld_128b	WIP
gld_32b	WIP
gld_64b	WIP
gld_coherent	WIP
gld_incoherent	WIP
gld_request	WIP
gld_total	WIP
gpu_idle	WIP
gst_128b	WIP
gst_32b	WIP
gst_64b	WIP
gst_coherent	WIP
gst_incoherent	WIP
gst_request	WIP
gst_total	WIP
input_assembler_waits_for_fb	WIP
input_assembler_busy	WIP
local_load	WIP
local_store	WIP
rasterizer_tiles_in_count	WIP
rasterizer_tiles_killed_by_zcull_count	WIP
rop_busy	WIP
rop_samples_killed_by_earlyz_count	WIP
rop_samples_killed_by_latez_count	WIP
rop_waits_for_fb	WIP
rop_waits_for_shader	WIP
setup_line_count	WIP
setup_point_count	WIP
setup_primitive_count	WIP
setup_primitive_culled_count	WIP
setup_triangle_count	WIP
stream_out_busy	WIP
tex_cache_hit	WIP
tex_cache_miss	WIP
tex_waits_for_fb	WIP
vertex_attribute_count	WIP

MP perf counters

These performance counters are per-context. They are configured through the command stream, and we use a compute shader to read back the values (i.e. the $pm0..$pm7 special registers).

Status

Hardware events	SM20¹²	SM21²	SM30	SM35	SM50
active_cycles	DONE	DONE	DONE	DONE	DONE
active_ctas	N/A	N/A	N/A	N/A	DONE
active_warps	DONE	DONE	WIP	WIP	DONE
atom_cas_count	N/A	N/A	DONE	DONE	N/A
atom_count	DONE	DONE	DONE	DONE	DONE
branch/divergent_branch	DONE	DONE	DONE	N/A	DONE
{gld,gst}_request	DONE	DONE	DONE	DONE	N/A
global_atom_cas	N/A	N/A	N/A	N/A	DONE
global_{load,store}	N/A	N/A	N/A	N/A	DONE
global_{ld,st}_mem_divergence_replays	N/A	N/A	DONE	DONE	N/A
global_store_transaction	TODO	TODO	DONE	DONE	N/A
gred_count	DONE	DONE	DONE	DONE	DONE
inst_executed	DONE	DONE	DONE	DONE	DONE
inst_issued (and variants)	DONE	DONE	DONE	DONE	DONE
l1_global_load_{hit,miss}	TODO	TODO	DONE	DONE	N/A
__l1_global_{load,store}_transactions	N/A	N/A	DONE	DONE	N/A
l1_local_{load,store}_{hit,miss}	TODO	TODO	DONE	DONE	N/A
l1_shared_{load,store}_transactions	N/A	N/A	DONE	DONE	N/A
local_{load,store}	DONE	DONE	DONE	DONE	DONE
local_{load,store}_transactions	N/A	N/A	DONE	DONE	N/A
not_predicated_off_thread_inst_executed	N/A	N/A	N/A	DONE	DONE
prof_trigger_{00-07}	DONE	DONE	DONE	DONE	DONE
shared_atom	N/A	N/A	N/A	N/A	DONE
shared_atom_cas	N/A	N/A	N/A	N/A	DONE
shared_{ld,st}_transactions	N/A	N/A	N/A	N/A	DONE
shared_{load,store}	DONE	DONE	DONE	DONE	DONE
shared_{load,store}_bank_conflict	N/A	N/A	N/A	N/A	DONE
shared_{load,store}_replay	N/A	N/A	DONE	DONE	N/A
sm_cta_launched	TODO	TODO	DONE	DONE	DONE
thread_inst_executed (and variants)	DONE	DONE	N/A	DONE	DONE
threads_launched	DONE	DONE	DONE	DONE	N/A
uncached_global_load_transaction	TODO	TODO	DONE	DONE	N/A
warps_launched	DONE	DONE	DONE	DONE	DONE

Notes

¹ MP perf counters on GF100/GF110 (SM20) are buggy because of a context-switch problem that still needs to be fixed. Results might be wrong, so be careful!

² TODO means those perf counters are exposed through PCOUNTER.

Metrics

Status

Name	SM20	SM21	SM30	SM35	SM50
achieved_occupancy	DONE	DONE	DONE	DONE	DONE
alu_fu_utilization	TODO	TODO	TODO	TODO	TODO
atomic_replay_overhead	TODO	TODO	TODO	TODO	TODO
atomic_throughput	TODO	TODO	TODO	TODO	TODO
atomic_transactions	TODO	TODO	TODO	TODO	TODO
atomic_transactions_per_request	TODO	TODO	TODO	TODO	TODO
branch_efficiency	DONE	DONE	DONE	N/A	DONE
cf_executed	TODO	TODO	TODO	TODO	TODO
cf_fu_utilization	TODO	TODO	TODO	TODO	TODO
cf_issued	TODO	TODO	TODO	TODO	TODO
dram_read_throughput	TODO	TODO	TODO	TODO	TODO
dram_read_transactions	TODO	TODO	TODO	TODO	TODO
dram_utilization	TODO	TODO	TODO	TODO	TODO
dram_write_throughput	TODO	TODO	TODO	TODO	TODO
dram_write_transactions	TODO	TODO	TODO	TODO	TODO
eligible_warps_per_cycle	TODO	TODO	TODO	TODO	TODO
flop_count_dp	TODO	TODO	TODO	TODO	TODO
flop_count_dp_add	TODO	TODO	TODO	TODO	TODO
flop_count_dp_fma	TODO	TODO	TODO	TODO	TODO
flop_count_dp_mul	TODO	TODO	TODO	TODO	TODO
flop_count_sp	TODO	TODO	TODO	TODO	TODO
flop_count_sp_add	TODO	TODO	TODO	TODO	TODO
flop_count_sp_fma	TODO	TODO	TODO	TODO	TODO
flop_count_sp_mul	TODO	TODO	TODO	TODO	TODO
flop_count_sp_special	TODO	TODO	TODO	TODO	TODO
flop_dp_efficiency	TODO	TODO	TODO	TODO	TODO
flop_sp_efficiency	TODO	TODO	TODO	TODO	TODO
gld_efficiency	TODO	TODO	TODO	TODO	TODO
gld_requested_throughput	TODO	TODO	TODO	TODO	TODO
gld_throughput	TODO	TODO	TODO	TODO	TODO
gld_transactions	TODO	TODO	TODO	TODO	TODO
gld_transactions_per_request	TODO	TODO	TODO	TODO	TODO
global_cache_replay_overhead	TODO	TODO	TODO	TODO	TODO
gst_efficiency	TODO	TODO	TODO	TODO	TODO
gst_requested_throughput	TODO	TODO	TODO	TODO	TODO
gst_throughput	TODO	TODO	TODO	TODO	TODO
gst_transactions	TODO	TODO	TODO	TODO	TODO
gst_transactions_per_request	TODO	TODO	TODO	TODO	TODO
inst_bit_convert	TODO	TODO	TODO	TODO	TODO
inst_compute_ld_st	TODO	TODO	TODO	TODO	TODO
inst_control	TODO	TODO	TODO	TODO	TODO
inst_executed	TODO	TODO	TODO	TODO	TODO
inst_fp_32	TODO	TODO	TODO	TODO	TODO
inst_fp_64	TODO	TODO	TODO	TODO	TODO
inst_integer	TODO	TODO	TODO	TODO	TODO
inst_inter_thread_communication	TODO	TODO	TODO	TODO	TODO
inst_issued	N/A	DONE	DONE	DONE	DONE
inst_misc	TODO	TODO	TODO	TODO	TODO
inst_per_warp	DONE	DONE	DONE	DONE	DONE
inst_replay_overhead	DONE	DONE	DONE	DONE	DONE
ipc	DONE	DONE	DONE	DONE	DONE
issued_ipc	DONE	DONE	DONE	DONE	DONE
issue_slots	N/A	DONE	DONE	DONE	DONE
issue_slot_utilization	DONE	DONE	DONE	DONE	DONE
l1_cache_global_hit_rate	TODO	TODO	TODO	TODO	TODO
l1_cache_local_hit_rate	TODO	TODO	TODO	TODO	TODO
l1_shared_utilization	TODO	TODO	TODO	TODO	TODO
l2_atomic_throughput	TODO	TODO	TODO	TODO	TODO
l2_atomic_transactions	TODO	TODO	TODO	TODO	TODO
l2_l1_read_hit_rate	TODO	TODO	TODO	TODO	TODO
l2_l1_read_throughput	TODO	TODO	TODO	TODO	TODO
l2_l1_read_transactions	TODO	TODO	TODO	TODO	TODO
l2_l1_write_throughput	TODO	TODO	TODO	TODO	TODO
l2_l1_write_transactions	TODO	TODO	TODO	TODO	TODO
l2_read_throughput	TODO	TODO	TODO	TODO	TODO
l2_read_transactions	TODO	TODO	TODO	TODO	TODO
l2_tex_read_transactions	TODO	TODO	TODO	TODO	TODO
l2_texture_read_hit_rate	TODO	TODO	TODO	TODO	TODO
l2_texture_read_throughput	TODO	TODO	TODO	TODO	TODO
l2_utilization	TODO	TODO	TODO	TODO	TODO
l2_write_throughput	TODO	TODO	TODO	TODO	TODO
l2_write_transactions	TODO	TODO	TODO	TODO	TODO
ldst_executed	TODO	TODO	TODO	TODO	TODO
ldst_fu_utilization	TODO	TODO	TODO	TODO	TODO
ldst_issued	TODO	TODO	TODO	TODO	TODO
local_load_throughput	TODO	TODO	TODO	TODO	TODO
local_load_transactions	TODO	TODO	TODO	TODO	TODO
local_load_transactions_per_request	TODO	TODO	TODO	TODO	TODO
local_memory_overhead	TODO	TODO	TODO	TODO	TODO
local_replay_overhead	TODO	TODO	TODO	TODO	TODO
local_store_throughput	TODO	TODO	TODO	TODO	TODO
local_store_transactions	TODO	TODO	TODO	TODO	TODO
local_store_transactions_per_request	TODO	TODO	TODO	TODO	TODO
shared_efficiency	TODO	TODO	TODO	TODO	TODO
shared_load_throughput	TODO	TODO	TODO	TODO	TODO
shared_load_transactions	TODO	TODO	TODO	TODO	TODO
shared_load_transactions_per_request	TODO	TODO	TODO	TODO	TODO
shared_replay_overhead	N/A	N/A	DONE	DONE	TODO
shared_store_throughput	TODO	TODO	TODO	TODO	TODO
shared_store_transactions	TODO	TODO	TODO	TODO	TODO
shared_store_transactions_per_request	TODO	TODO	TODO	TODO	TODO
sm_efficiency	TODO	TODO	TODO	TODO	TODO
stall_exec_dependency	TODO	TODO	TODO	TODO	TODO
stall_inst_fetch	TODO	TODO	TODO	TODO	TODO
stall_memory_dependency	TODO	TODO	TODO	TODO	TODO
stall_memory_throttle	TODO	TODO	TODO	TODO	TODO
stall_other	TODO	TODO	TODO	TODO	TODO
stall_pipe_busy	TODO	TODO	TODO	TODO	TODO
stall_sync	TODO	TODO	TODO	TODO	TODO
stall_texture	TODO	TODO	TODO	TODO	TODO
sysmem_read_throughput	TODO	TODO	TODO	TODO	TODO
sysmem_read_transactions	TODO	TODO	TODO	TODO	TODO
sysmem_utilization	TODO	TODO	TODO	TODO	TODO
sysmem_write_throughput	TODO	TODO	TODO	TODO	TODO
sysmem_write_transactions	TODO	TODO	TODO	TODO	TODO
tex_cache_hit_rate	TODO	TODO	TODO	TODO	TODO
tex_cache_throughput	TODO	TODO	TODO	TODO	TODO
tex_cache_transactions	TODO	TODO	TODO	TODO	TODO
tex_fu_utilization	TODO	TODO	TODO	TODO	TODO
tex_utilization	TODO	TODO	TODO	TODO	TODO
warp_execution_efficiency	TODO	TODO	DONE	DONE	DONE
warp_nonpred_execution_efficiency	N/A	N/A	N/A	DONE	DONE