Hi. We have a Dragonboard820c running a 4.14 kernel. We monitor the temperature of the Snapdragon 820 via /sys/class/thermal/thermal_zone0/temp. If the chip temperature is in the range of 7C to ~12C, the board will sometimes crash. The dmesg log shows a lot of these messages:
GPU Crash when running in temp range 9C to ~12C
Posted: Thu, 2022-02-10 06:23
[email protected]-alip:/sys/class/thermal# cat thermal_zone0/temp
7300
[email protected]-alip:/sys/class/thermal# cat thermal_zone0/temp
7000
[email protected]-alip:/sys/class/thermal#
[ 295.414124] msm 900000.mdss: gpu fault ring 1 fence 13b85 status C0040101 rb 1aad/1aad ib1 00000000034CA000/0000 ib2 0000000003274000/0000
[ 295.414181] msm 900000.mdss: A530: hangcheck recover!
[ 295.430749] msm 900000.mdss: gpu fault ring 1 fence 13b88 status C00401C3 rb 1b10/1b1c ib1 0000000003876000/0000 ib2 0000000003274000/0000
[ 295.430809] msm 900000.mdss: A530: hangcheck recover!
[ 295.442982] msm 900000.mdss: A530: offending task: X:flush_queue0 (/usr/lib/xorg/Xorg -nolisten tcp -auth /var/run/sddm/{7f7de7ac-f4b2-4c4d-a384-ee1fe3330703} -background none -noreset -displayfd 17 -seat seat0 vt7)
[ 295.448138] revision: 530 (5.3.0.2)
[ 295.466960] rb 0: fence: 0/0
[ 295.470400] rptr: 39
[ 295.473553] rb wptr: 39
[ 295.476325] rb 1: fence: 80773/80776
[ 295.478820] rptr: 6848
[ 295.482403] rb wptr: 6940
[ 295.485183] rb 2: fence: 0/0
[ 295.487871] rptr: 0
[ 295.490883] rb wptr: 0
[ 295.493340] rb 3: fence: 0/0
[ 295.495768] rptr: 0
[ 295.498870] rb wptr: 0
[ 295.501329] CP_SCRATCH_REG0: 0
[ 295.503758] CP_SCRATCH_REG1: 0
[ 295.506859] CP_SCRATCH_REG2: 80773
[ 295.509925] CP_SCRATCH_REG3: 0
[ 295.513305] CP_SCRATCH_REG4: 0
[ 295.516346] CP_SCRATCH_REG5: 487267
[ 295.519382] CP_SCRATCH_REG6: 487276
[ 295.522746] CP_SCRATCH_REG7: 487284
[ 295.604584] msm 900000.mdss: gpu fault ring 1 fence 13ba4 status C0040101 rb 0456/0456 ib1 0000000003AF2000/0000 ib2 0000000003AF3000/0000
[ 295.604639] msm 900000.mdss: A530: hangcheck recover!
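For anyone comparing logs, here is a minimal Python sketch that pulls the ring, fence, and status fields out of the "gpu fault" lines. The field layout is inferred only from the lines pasted above, not from the driver source, and `parse_gpu_fault` is a made-up helper name. The fence appears to be printed in hex here: 13b85 is 80773 in decimal, which matches the "rb 1: fence: 80773/80776" line in the dump.

```python
import re

# Pattern inferred from the sample "gpu fault" lines in this post (assumption:
# other kernel versions may print these fields in a different layout).
FAULT_RE = re.compile(
    r"gpu fault ring (?P<ring>\d+) fence (?P<fence>[0-9a-fA-F]+) "
    r"status (?P<status>[0-9a-fA-F]+)"
)


def parse_gpu_fault(line):
    """Return ring, fence and status as integers, or None if the line does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    return {
        "ring": int(match.group("ring")),
        # Fence and status are treated as hexadecimal (assumption, see lead-in).
        "fence": int(match.group("fence"), 16),
        "status": int(match.group("status"), 16),
    }


sample = (
    "[ 295.414124] msm 900000.mdss: gpu fault ring 1 fence 13b85 "
    "status C0040101 rb 1aad/1aad ib1 00000000034CA000/0000 "
    "ib2 0000000003274000/0000"
)
fault = parse_gpu_fault(sample)
print(fault["ring"], fault["fence"], hex(fault["status"]))  # 1 80773 0xc0040101
```

Piping `dmesg` output through this line by line would make it easier to count faults per ring and to see which fence values recur.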
Our procedure to reproduce this "gpu fault ring" message is to start the glxgears application while the Dragonboard820c is at room temperature in a temperature chamber, then lower the chamber temperature until the value read from /sys/class/thermal/thermal_zone0/temp is in the critical range (7C to 12C). We slowly lower the temperature through the full range until we see the dmesg output above. We also notice that glxgears becomes much choppier and sometimes crashes. Has anyone seen anything like this on the Qualcomm Flight Pro?
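To make the procedure above easier to repeat, the temperature could be logged alongside dmesg with something like the following Python sketch. It assumes the sysfs value is in millidegrees Celsius (the readings 7300 and 7000 above correspond to 7.3C and 7.0C). The function names, the one-second interval, and the 7C to 12C bounds are illustrative choices, not part of any existing tool.

```python
import time
from pathlib import Path

# Thermal zone path taken from the post; the bounds are the range reported there.
ZONE = Path("/sys/class/thermal/thermal_zone0/temp")
LOW_C, HIGH_C = 7.0, 12.0


def read_temp_c(path=ZONE):
    """Read a sysfs thermal zone value (millidegrees Celsius) and return degrees Celsius."""
    return int(Path(path).read_text().strip()) / 1000.0


def in_suspect_range(temp_c, low=LOW_C, high=HIGH_C):
    """True when the temperature lies in the range where the faults were observed."""
    return low <= temp_c <= high


def log_temps(path=ZONE, interval_s=1.0, samples=10):
    """Print one timestamped reading per interval, flagging the suspect range."""
    for _ in range(samples):
        temp_c = read_temp_c(path)
        flag = "  <-- in suspect range" if in_suspect_range(temp_c) else ""
        print(f"{time.monotonic():.1f}s {temp_c:.1f}C{flag}", flush=True)
        time.sleep(interval_s)
```

Calling `log_temps(samples=600)` on the board while glxgears runs would give roughly ten minutes of readings. Note that `time.monotonic()` has an arbitrary starting point, so its values will not line up directly with the dmesg timestamps.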
Thanks,
Kim