A long time ago I ran into a case where a RAC environment also showed very high gc cr/current grant 2-way wait events; see the earlier post "Performance tuning 'gc cr&current grant 2-way' event (after the hosts were given more CPUs)". That case was caused by the two hosts of the same cluster having different CPU counts, which led to different numbers of LMS processes and therefore unbalanced serving capacity between the two instances. The case shared here is a different, rather special situation.
Source: SMARTDBOPS
GC-class waits remained persistently high on both nodes:
NODE1
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
enq: TX – index contention | 27,057 | 32,531 | 1202 | 25.09 | Concurrency |
DB CPU | 26,320 | 20.30 | |||
enq: TX – row lock contention | 165,277 | 9,237 | 56 | 7.12 | Application |
gc current grant 2-way | 61,769 | 9,208 | 149 | 7.10 | Cluster |
log file sync | 1,937,689 | 6,747 | 3 | 5.20 | Commit |
NODE2
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
---|---|---|---|---|---|
gc buffer busy acquire | 84,112 | 31,365 | 373 | 35.95 | Cluster |
gc current grant 2-way | 24,906 | 11,217 | 450 | 12.86 | Cluster |
gc current block 2-way | 71,984 | 8,190 | 114 | 9.39 | Cluster |
gc cr grant 2-way | 34,748 | 6,061 | 174 | 6.95 | Cluster |
enq: TX – index contention | 3,067 | 4,735 | 1544 | 5.43 | Concurrency |
Looking at node 1 alone, you would conclude this is concurrency contention (enq: TX - index contention / row lock contention), plus gc cr/current grant 2-way, an event that means the block is not in any instance's buffer cache and the requester has been granted permission to read it from disk.
There are a few golden rules for diagnosing RAC problems:
1. Do not be fooled by the top events
In a RAC environment the root cause may be hidden on another node and may not appear in that node's Top 5 events at all; the victim wait events can rank higher than the root-cause event itself. Instance performance is tightly intertwined: a problem with a background process on one node (such as LMS) can surface as Application-class wait events on every other node.
The gc buffer busy xx wait events are usually a symptom seen before the real root cause of a RAC performance problem is found. Put simply, a session is waiting for the BL lock on a block, and the time all other processes spend waiting for that block's BL lock to become available is charged to gc buffer busy xx.
ASH and block dump trace files can be used to pinpoint gc buffer busy waits; the key is to identify the object, the object type and the block type (see the ASH query sketch after the table below).
Object type | Block type | Possible issue(s) |
---|---|---|
Table | Segment header | Freelists, freelist groups, deletes from one node, ASSM bugs etc |
Table | Segment blocks | Heavy insert workload, heavy update to few blocks, SQL performance issues scanning few objects aggressively. |
Index | Leaf block | Numerous inserts on indexed columns, sequence or sysdate generated keys |
Index | Root / branch block | Insert + numerous index scans, concurrent inserts and deletes etc |
Undo | Undo header block | Numerous short transactions |
Undo | Undo block | Long pending transaction and CR block generation |
Dictionary | SEQ$ (object) | Sequences with nocache or order set and excessive access to sequence. |
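To locate the objects involved, here is a minimal ASH sketch (assuming the Diagnostics Pack is licensed; 11.2 column names) that groups gc buffer busy samples by object so the result can be matched against the table above:

```sql
-- Which objects are behind the gc buffer busy waits (last hour, all instances)?
-- current_obj# maps to dba_objects.object_id; values <= 0 usually mean undo or an unknown object.
SELECT ash.inst_id,
       ash.event,
       o.owner,
       o.object_name,
       o.object_type,
       COUNT(*) AS samples        -- each ASH sample represents roughly 1 second of waiting
FROM   gv$active_session_history ash
LEFT   JOIN dba_objects o ON o.object_id = ash.current_obj#
WHERE  ash.event LIKE 'gc buffer busy%'
AND    ash.sample_time > SYSDATE - 1/24
GROUP  BY ash.inst_id, ash.event, o.owner, o.object_name, o.object_type
ORDER  BY samples DESC;
```

current_file# and current_block# from the same view can then be fed into a block dump to confirm whether the hot block is a segment header, an index leaf/branch block, or an undo block.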
2. Rule out infrastructure problems
gc cr/current grant 2-way: when the block is not in any instance's buffer cache, the wait time is charged to these events; the foreground process sends a request to the remote LMS, and the remote LMS replies "read from disk". Heavy gc 2-way/3-way waits generally indicate queuing caused by high concurrency; the minimum transfer time for a single block or message is usually < 2 ms (< 0.5 ms on Exadata). Check the Wait Event Histogram section of the AWR report to see how the 2-way/3-way events are distributed across the latency buckets, and rule out network problems.
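As part of ruling out the infrastructure, it is also worth confirming which interfaces the instances are actually using for the interconnect; a small sketch (the view exists in 11.2, and the private interconnect should show IS_PUBLIC = 'NO'):

```sql
-- Which NIC/IP each instance is using for cluster traffic, and where that setting came from
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;
```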
3. Identify the problem instance
SELECT INSTANCE || '->' || inst_id transfer,
       class,
       cr_block cr_blk,
       TRUNC(cr_block_time / cr_block / 1000, 2) avg_cr,
       current_block cur_blk,
       TRUNC(current_block_time / current_block / 1000, 2) avg_cur
  FROM gv$instance_cache_transfer
 WHERE cr_block > 0
   AND current_block > 0
 ORDER BY INSTANCE, inst_id, class
/
4. Check the sending side's metrics, not just the receiver's
From the sending (serving) side, the receive time seen by the requester breaks down as:
gc cr block receive time=
Time to send message to a remote LMS process by FG
+ Time taken by LMS to build block (statistics: gc cr block build time)
+ LMS wait for LGWR latency ( statistics: gc cr block flush time)
+ LMS send time (Statistics: gc cr block send time)
+ Wire latency.
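The send-side components listed above are exposed as cumulative system statistics, so they can be compared between the serving instances without an AWR report; a minimal sketch (statistic names as they appear in 11gR2):

```sql
-- Cumulative sender-side GC service times per instance (since instance startup)
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('gc cr block build time',
                'gc cr block flush time',
                'gc cr block send time',
                'gc current block pin time',
                'gc current block flush time',
                'gc current block send time')
ORDER  BY inst_id, name;
```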
5. Use histograms to spot outliers, not just averages
Check the Wait Event Histogram section of the AWR report.
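The same distribution is also available from the dynamic views between AWR snapshots; a sketch:

```sql
-- Latency buckets for the gc grant events on every instance
SELECT inst_id, event, wait_time_milli, wait_count
FROM   gv$event_histogram
WHERE  event IN ('gc cr grant 2-way', 'gc current grant 2-way')
ORDER  BY inst_id, event, wait_time_milli;
```

If most waits sit in the 1-2 ms buckets and only a small tail is slow, the averages are being skewed by outliers rather than by a uniformly slow interconnect.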
Case analysis
Back to the case introduced at the beginning, where AAS showed the GC wait problem: this is an 11.2.0.3 two-node RAC database on HP-UX 11.31.
# instance 1
Global Cache Load Profile
Statistic | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 1,018.74 | 1.08 |
Global Cache blocks served: | 78.53 | 0.08 |
GCS/GES messages received: | 904.33 | 0.96 |
GCS/GES messages sent: | 2,532.56 | 2.68 |
DBWR Fusion writes: | 3.07 | 0.00 |
Estd Interconnect traffic (KB) | 18,227.67 |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access – local cache %: | 99.48 |
Buffer access – remote cache %: | 0.39 |
Buffer access – disk %: | 0.14 |
Global Cache and Enqueue Services – Workload Characteristics
Avg global enqueue get time (ms): | 2.9 |
Avg global cache cr block receive time (ms): | 6.8 |
Avg global cache current block receive time (ms): | 3.8 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 0.5 |
Avg global cache cr block flush time (ms): | 29.3 |
Avg global cache current block pin time (ms): | 1.4 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 2.1 |
Avg global cache current block flush time (ms): | 30.8 |
Global Cache and Enqueue Services – Messaging Statistics
Avg message sent queue time (ms): | 250.4 (abnormally high: the issue on this instance) |
Avg message sent queue time on ksxp (ms): | 6.8 |
Avg message received queue time (ms): | 3.0 |
Avg GCS message process time (ms): | 1.6 |
Avg GES message process time (ms): | 0.0 |
% of direct sent messages: | 43.52 |
% of indirect sent messages: | 45.21 |
% of flow controlled messages: | 11.27 |
TIPS:
The Average message sent queue time in AWR comes from (G)V$GES_STATISTICS. It is the time it takes to place an indirect message on the message queue and send it (the GCS/GES communication layer). We also need to look at the KSXP message timing to determine whether there is a problem in the IPC layer: IPC time is normally < 1 ms, and X$KSXPCLIENT can be queried for more detail. A high value here can also mean the CPUs were extremely busy (100%) or the interconnect peaked during that period, which should be confirmed against OSWatcher data.
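The counters behind that AWR section can be pulled directly from the view; a sketch (the exact statistic names vary slightly between versions, hence the LIKE filters):

```sql
-- GES message/queue counters that feed the AWR "Messaging Statistics" section
SELECT inst_id, name, value
FROM   gv$ges_statistics
WHERE  LOWER(name) LIKE '%queue%'
   OR  LOWER(name) LIKE '%ksxp%'
ORDER  BY inst_id, name;
```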
# instance 2
Global Cache Load Profile
Statistic | Per Second | Per Transaction |
---|---|---|
Global Cache blocks received: | 96.24 | 0.75 |
Global Cache blocks served: | 1,043.19 | 8.17 |
GCS/GES messages received: | 3,607.20 | 28.26 |
GCS/GES messages sent: | 1,342.88 | 10.52 |
DBWR Fusion writes: | 3.09 | 0.02 |
Estd Interconnect traffic (KB) | 19,197.66 |
Global Cache Efficiency Percentages (Target local+remote 100%)
Buffer access – local cache %: | 99.43 |
Buffer access – remote cache %: | 0.20 |
Buffer access – disk %: | 0.37 |
Global Cache and Enqueue Services – Workload Characteristics
Avg global enqueue get time (ms): | 3.5 |
Avg global cache cr block receive time (ms): | 509.7 |
Avg global cache current block receive time (ms): | 96.6 |
Avg global cache cr block build time (ms): | 0.0 |
Avg global cache cr block send time (ms): | 0.0 |
Global cache log flushes for cr blocks served %: | 1.0 |
Avg global cache cr block flush time (ms): | 1.3 |
Avg global cache current block pin time (ms): | 0.2 |
Avg global cache current block send time (ms): | 0.0 |
Global cache log flushes for current blocks served %: | 0.1 |
Avg global cache current block flush time (ms): | 1.7 |
Global Cache and Enqueue Services – Messaging Statistics
Avg message sent queue time (ms): | 0.0 |
Avg message sent queue time on ksxp (ms): | 142.6 |
Avg message received queue time (ms): | 0.0 |
Avg GCS message process time (ms): | 0.0 |
Avg GES message process time (ms): | 0.0 |
% of direct sent messages: | 73.94 |
% of indirect sent messages: | 18.43 |
% of flow controlled messages: | 7.63 |
Tips:
Avg global cache cr block receive time (ms): the time from when the requesting instance sends the message to the mastering instance until the CR block arrives. It includes the time the holding instance spends building the consistent-read image. CR block receive time should normally be no more than 15 ms.
Avg global cache current block receive time (ms): the time from when the requesting instance sends the message to the mastering instance (2-way get), and in some cases on to the holding instance (3-way get), until the current block arrives. It includes the log flush time on the holding instance. Current block receive time should normally be no more than 30 ms.
Avg message sent queue time on ksxp (ms): the time for the remote IPC layer to acknowledge the message and return an ACK. It is usually a small number of milliseconds, directly reflects network latency, and is generally under 1 ms. The send time is measured when a block is handed to the IPC layer to be sent: the requests are queued in a completion queue, and the send time runs from when a block is handed off to the IPC layer until the system call returns a completion, i.e. a status indicating that the send request has been successfully passed on to the network and control returns to the cache layer.
If this value is high, the cluster interconnect or the LMS processes may have a problem.
Run an IPC diagnostic dump:
oradebug setmypid
oradebug ipc
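The ipc dump is written to the session's trace file; its location can be printed from the same oradebug session (a standard oradebug command):

```sql
SQL> oradebug tracefile_name
```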
1) If the network configuration is sound, check whether the TCP and UDP buffers are running short:
root# netstat -s | grep overflow     (the overflow counters should normally be 0)
root# netstat -i
2) Check the LMS process load: ps aux | grep lms
3) Check whether certain segments are generating an excessive number of global CR requests:
select STATISTIC_NAME stat,
       OWNER,
       OBJECT_NAME obj,
       sum(value) val
  from v$segment_statistics
 where STATISTIC_NAME like 'global%'
   and value > 10000
 group by STATISTIC_NAME, OWNER, OBJECT_NAME
 order by val desc;
4) Check whether the CR request counts in v$cr_block_server are high over a period of time:
select CR_REQUESTS cr,
       CURRENT_REQUESTS cur,
       DATA_REQUESTS data,
       UNDO_REQUESTS undo,
       TX_REQUESTS tx
  from v$cr_block_server;
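Related to the light-work rule discussed at the end of this post, the same view also records how often LMS declined to do the CR work itself; a sketch (column list as in 11.2):

```sql
-- CR serving workload per instance, including light-work rule activations
SELECT inst_id,
       cr_requests,
       current_requests,
       data_requests,
       undo_requests,
       tx_requests,
       light_works      -- number of times the light-work rule was invoked
FROM   gv$cr_block_server;
```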
Confirming the problem node
WAIT time | NODE1 | NODE2 |
---|---|---|
Avg global cache cr block receive time (ms) | 6.8 | 509.7 |
Avg global cache current block receive time (ms) | 3.8 | 96.6 |
Note:
In a RAC environment a slow node causes gc performance problems on the other nodes: node 2's block receive times are huge because node 1 is slow at serving the blocks it holds. By that logic, node 1 is confirmed as the problem node.
# Check the LMS processes' CPU usage at the time
# node1
oracle@qdty1:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms
F    S UID    PID   PPID  C   PRI NI ADDR             SZ    WCHAN            STIME    TTY   TIME     COMD
1401 S oracle 5117  1     237 178 20 e000001c934ba340 94816 e000000792a7f050 Jul 13   ?     15776:29 ora_lms1_tyjc1
1401 S grid   23810 1     0   178 20 e000001c9de0e040 90895 e000000792a7f1d0 Jan 25   ?     1550:30  asm_lms0_+ASM1
1401 R oracle 5119  1     254 178 20 e000001ce7c1b9c0 94816 -                Jul 13   ?     16214:17 ora_lms2_tyjc1
401  S oracle 9366  7730  0   154 20 e000001cdb304c40 30    e000000da0da6472 10:50:38 pts/4 0:00     grep lms
1401 R oracle 5115  1     252 178 20 e000001cf7042a00 94752 -                Jul 13   ?     16329:36 ora_lms0_tyjc1

# node 2
oracle@qdty2:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms|grep -v grep
F    S UID    PID   PPID C  PRI NI ADDR             SZ    WCHAN            STIME  TTY TIME     COMD
1401 S grid   12516 1    0  178 20 e000001b43146400 90854 e000000792d71348 Jan 25 ?   1461:04  asm_lms0_+ASM2
1401 S oracle 6446  1    24 178 20 e000001ef09b0d00 95666 e000000792d710d0 Jul 13 ?   16447:18 ora_lms2_tyjc2
1401 S oracle 6444  1    30 178 20 e000001ebe6c7d00 95666 e000000792d71248 Jul 13 ?   17583:22 ora_lms1_tyjc2
1401 S oracle 6442  1    43 178 20 e000001e7bc97c80 95666 e000000792d71240 Jul 13 ?   17764:57 ora_lms0_tyjc2
Note:
This is HP-UX 11.31, so in the ps output on node 1 the C (CPU utilization) column can exceed 100; even so, it shows that the LMS processes on node 1 are extremely busy on CPU, while node 2 is normal.
# node1
Instance Activity Stats
Statistic | Total | per Second | per Trans |
---|---|---|---|
data blocks consistent reads – undo records applied | 3.7 TB | 1.1 GB | 1.1 MB |
rollback changes – undo records applied | 4,074,774 | 1,124.49 | 1.19 |
transaction tables consistent reads – undo records applied | 100.1 MB | 28.3 KB | 0 B |
# node 2
Instance Activity Stats
Statistic | Total | per Second | per Trans |
---|---|---|---|
data blocks consistent reads – undo records applied | 4.6 GB | 2.7 MB | 21.6 KB |
rollback changes – undo records applied | 1,043 | 0.59 | 0.00 |
transaction tables consistent reads – undo records applied | 112.0 KB | 0 B | 0 B |
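For comparison outside an AWR snapshot window, the same three counters can be read live on both instances; a small sketch (statistic names as spelled in v$sysstat, with plain hyphens):

```sql
-- Undo records applied for CR reads and rollbacks, cumulative per instance
SELECT inst_id, name, value
FROM   gv$sysstat
WHERE  name IN ('data blocks consistent reads - undo records applied',
                'rollback changes - undo records applied',
                'transaction tables consistent reads - undo records applied')
ORDER  BY inst_id, name;
```

The huge gap between node 1 (3.7 TB worth of undo records applied) and node 2 (4.6 GB) points at CR block generation from undo as the work keeping node 1's LMS processes busy.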
If it is not an application SQL problem and not a network problem, why is LMS using so much CPU? A search on MOS quickly shows that we hit bug 14672057.
The cause: starting with 11.2.0.2, the parameter _max_cr_rollbacks was introduced to address the earlier bug 9569316. It defaults to 0 and stands for the maximum number of CR block rollbacks; I could not find much published explanation of it. When a lot of time is being spent generating CR blocks from undo, setting the parameter to a non-zero value enables the light-work rule, which helps relieve pressure on the LMS processes. The bug only affects 11.2.0.2 and 11.2.0.3; since the workaround is simply to change "_max_cr_rollbacks" dynamically, no one-off patch was provided, and the bug is fixed in 11.2.0.4. After the change below was made, the GC waits and the enq waits disappeared instantly and the system returned to normal, so the problem was not the index split issue suggested by node 1's top events.
SQL> alter system set "_max_cr_rollbacks"=200;

System altered.
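To verify the current setting of the hidden parameter before and after the change, the usual x$ sketch can be used (run as SYS; hidden parameters should only be changed under Oracle Support's guidance):

```sql
-- Current value of _max_cr_rollbacks on this instance
SELECT i.ksppinm  AS parameter,
       c.ksppstvl AS value,
       c.ksppstdf AS is_default
FROM   x$ksppi  i
JOIN   x$ksppcv c ON c.indx = i.indx
WHERE  i.ksppinm = '_max_cr_rollbacks';
```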
# LMS processes on node 1 after the change
oracle@qdty1:/home/oracle:11G> ps -efl|head -n 1;ps -efl|grep lms|grep -v grep
F    S UID    PID   PPID C  PRI NI ADDR             SZ    WCHAN            STIME  TTY TIME     COMD
1401 S oracle 5117  1    55 178 20 e000001c934ba340 94816 e000000792a7f0d0 Jul 13 ?   15778:04 ora_lms1_tyjc1
1401 S grid   23810 1    0  178 20 e000001c9de0e040 90895 e000000792a7f350 Jan 25 ?   1550:30  asm_lms0_+ASM1
1401 S oracle 5119  1    76 178 20 e000001ce7c1b9c0 94816 e000000792a7f1c8 Jul 13 ?   16216:40 ora_lms2_tyjc1
1401 S oracle 5115  1    80 178 20 e000001cf7042a00 94752 e000000792a7f3c0 Jul 13 ?   16331:29 ora_lms0_tyjc1
That concludes this case: Troubleshooting high "gc current grant 2-way" and "gc cr grant 2-way", part 2.