A customer's database ran into a serious performance problem: the application saw widespread timeouts and a sharp rise in the number of sessions. Repeatedly killing sessions did not fully resolve the issue, and the system only returned to normal after a restart.
From the AWR report covering the time of the fault, the database's main waits during the problem period were Global transaction acquire instance locks and enq: TX - row lock contention.
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class
Global transaction acquire instance locks | 5,342 | 5,343 | 1000 | 74.09 | Configuration
enq: TX - row lock contention | 5 | 1,437 | 287308 | 19.92 | Application
DB CPU | | 331 | | 4.59 |
direct path read | 37,708 | 72 | 2 | 0.99 | User I/O
log file sync | 7,817 | 12 | 2 | 0.16 | Commit
The TX - row lock contention wait is very common, and it is easy to understand how this wait event blocks the application. Global transaction acquire instance locks, however, is not a common wait; taken literally, it means a global transaction is trying to acquire an instance lock. In terms of wait time, it accounted for nearly 75% of DB time.
Of course, the most severe wait in the Top 5 is not necessarily the root cause of the problem. Analyzing the ASH data for the problem period shows that the global transaction lock wait appeared first, and the row lock waits only started afterwards:
SQL> select to_char(sample_time, 'hh24miss'), session_id, event, blocking_session
2 from dba_hist_active_sess_history
3 where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
4 and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
5 and instance_number = 1
6 order by 1, 2;
TO_CHA SESSION_ID EVENT BLOCKING_SESSION
------ ---------- ------------------------------------------ ----------------
065204 69 Global transaction acquire instance locks
065214 69 Global transaction acquire instance locks
065224 69 Global transaction acquire instance locks
065224 535
065234 69 Global transaction acquire instance locks
065234 232
065234 535
065234 763 Global transaction acquire instance locks
065244 69 Global transaction acquire instance locks
065244 535 direct path read
065244 695 enq: TX - row lock contention 763
065244 763 Global transaction acquire instance locks
065254 69 Global transaction acquire instance locks
065254 535
065254 695 enq: TX - row lock contention 763
065254 763 Global transaction acquire instance locks
065304 136 Global transaction acquire instance locks
065304 695 enq: TX - row lock contention 763
065304 763 Global transaction acquire instance locks
065314 136 Global transaction acquire instance locks
065314 695 enq: TX - row lock contention 763
065314 763 Global transaction acquire instance locks
065324 69 Global transaction acquire instance locks
065324 136 Global transaction acquire instance locks
065324 695 enq: TX - row lock contention 763
065324 763 Global transaction acquire instance locks
065334 69 Global transaction acquire instance locks
065334 136 Global transaction acquire instance locks
065334 695 enq: TX - row lock contention 763
065344 69 Global transaction acquire instance locks
065344 136 Global transaction acquire instance locks
065344 434 enq: TX - row lock contention 763
065344 695 enq: TX - row lock contention 763
065354 69 Global transaction acquire instance locks
065354 136 Global transaction acquire instance locks
065354 434 enq: TX - row lock contention 763
065354 695 enq: TX - row lock contention 763
065404 69 Global transaction acquire instance locks
065404 136 Global transaction acquire instance locks
065404 434 enq: TX - row lock contention 763
065404 695 enq: TX - row lock contention 763
.
.
.
065944 302 enq: TX - row lock contention 763
065944 336 enq: TX - row lock contention 763
065944 434 enq: TX - row lock contention 763
065944 695 enq: TX - row lock contention 763
065954 71 Global transaction acquire instance locks
065954 302 enq: TX - row lock contention 763
065954 336 enq: TX - row lock contention 763
065954 434 enq: TX - row lock contention 763
065954 695 enq: TX - row lock contention 763
216 rows selected.
The first session to get into trouble was session 763, which was waiting on Global transaction acquire instance locks and was not reported as being blocked by any other session. Row lock waits then began to appear, and every session waiting on enq: TX - row lock contention was blocked by session 763.
Clearly, although what eventually hung the business on a large scale was the enq: TX - row lock contention wait, the root of the problem was the Global transaction acquire instance locks wait.
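To pin down what session 763 itself was doing, ASH can be filtered to that one session. The sketch below reuses the same time window and adds a few columns (sql_id, program, xid) that help tie the wait to a specific global transaction; it is only an illustrative query against the standard view, not output from this case:
select to_char(sample_time, 'hh24miss') stime,
       sql_id, program, xid, event
  from dba_hist_active_sess_history
 where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
   and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
   and instance_number = 1
   and session_id = 763
 order by 1;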
At almost the same time, the alert log started to fill with errors:
Thu Oct 18 06:53:33 2018
opiodr aborting process unknown ospid (26428) as a result of ORA-24756
Thu Oct 18 06:53:36 2018
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
The wait first appeared in the database at 06:52, and since the wait goes through a timeout before the error is raised, the waits in the database are closely related to the ORA-24756 errors in the alert log.
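One way to line the two up is to count ASH samples per minute and event over the same window and compare those minutes with the timestamps in the alert log (a quick sketch against the view already used above):
select to_char(sample_time, 'hh24:mi') sample_minute,
       event,
       count(*) samples
  from dba_hist_active_sess_history
 where sample_time >= to_date('201810180652', 'yyyymmddhh24mi')
   and sample_time <= to_date('201810180700', 'yyyymmddhh24mi')
   and instance_number = 1
   and event in ('Global transaction acquire instance locks',
                 'enq: TX - row lock contention')
 group by to_char(sample_time, 'hh24:mi'), event
 order by 1, 2;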
Searching MOS with ORA-24756 as the keyword turns up an article that is very close to the current symptoms: PMON cleanup fails with ORA-24756: transaction does not exist (Doc ID 2299927.1).
The errors described in that document match the current problem, and the database version is also close. Oracle attributes the error to PMON getting stuck while cleaning up a transaction that no longer exists. Oracle lists several similar bugs, but each of them was either closed as not reproducible or suspended, so there is no definitive conclusion for this kind of problem:
Bug 20676168 - PMON GOT CONTINUOUS ORA-24756 DURING SHUTDOWN <<<<<<<<<< Closed, Could Not Reproduce
Bug 16317766 - EXADATA : ORA-24756: TRANSACTION DOES NOT EXIST <<<<<<<<<<<< Suspended, Req'd Info not Avail
Bug 9757979 - PMON CLEANUP FAILS WITH ORA-24756: TRANSACTION DOES NOT EXIST <<<<<<<<<<< Closed, Could Not Reproduce
Both the Global transaction acquire instance locks waits and the stream of ORA-24756 errors in the alert log point to Oracle's lock handling and global transaction processing.
Since 11g, global transactions in RAC have been maintained automatically by the GTXn background processes. Whether these processes are started is controlled by the initialization parameter global_txn_processes, whose default value is 1, meaning a GTXn process is started by default when the database starts.
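As a quick check on a running system, the parameter and the presence of the GTXn background processes can be confirmed from SQL*Plus (a minimal sketch using standard views; in RAC, gv$process shows every instance):
show parameter global_txn_processes
-- one row per running GTXn process per instance; no rows means the process is not running
select inst_id, pname, spid
  from gv$process
 where pname like 'GTX%'
 order by inst_id;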
We asked the customer whether the application or the database had been changed recently. The application had not changed at all and no similar problem had ever occurred before, but shortly before the incident a switchover drill between the primary database and its Data Guard standby had been carried out, and the fault appeared as soon as normal business resumed after the drill. The switchover drill was therefore the prime suspect.
A detailed review of the alert log showed that when the database was switched back to the primary role after the Data Guard drill, the GTXn process was not started:
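When a role transition is under suspicion, the current state of the database is worth confirming first (a trivial sketch against v$database, nothing specific to this case):
select database_role, open_mode, switchover_status
  from v$database;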
Thu Oct 18 02:36:18 2018
alter database commit to switchover to physical standby with session shutdown
ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY [Process Id: 4366] (orcl1)
.
.
.
Switchover: Primary controlfile converted to standby controlfile succesfully.
Switchover: Complete - Database shutdown required
Completed: alter database commit to switchover to physical standby with session shutdown
Thu Oct 18 02:45:27 2018
.
.
.
RFS[8]: Assigned to RFS process 3216
RFS[8]: Identified database type as 'physical standby': Client is ARCH pid 11990
Thu Oct 18 03:11:13 2018
alter database open
AUDIT_TRAIL initialization parameter is changed to OS, as DB is NOT compatible for database opened with read-only access
This instance was first to open
Beginning standby crash recovery.
Serial Media Recovery started
.
.
.
Thu Oct 18 03:11:13 2018
SMON: enabling cache recovery
Dictionary check beginning
Dictionary check complete
Database Characterset is ZHS16GBK
No Resource Manager plan active
Starting background process GTX0
Thu Oct 18 03:11:14 2018
GTX0 started with pid=51, OS id=5041
replication_dependency_tracking turned off (no async multimaster replication found)
Physical standby database opened for read only access.
Completed: alter database open
.
.
.
Thu Oct 18 04:57:19 2018
alter database commit to switchover to primary with session shutdown
ALTER DATABASE SWITCHOVER TO PRIMARY (orcl1)
Maximum wait for role transition is 15 minutes.
Switchover: Media recovery is still active
Role Change: Canceling MRP – no more redo to apply
.
.
.
Switchover: Complete - Database mounted as primary
Completed: alter database commit to switchover to primary with session shutdown
Thu Oct 18 04:57:39 2018
alter database open
This instance was first to open
Picked broadcast on commit scheme to generate SCNs
.
.
.
Thu Oct 18 04:57:44 2018
QMNC started with pid=41, OS id=22585
LOGSTDBY: Validating controlfile with logical metadata
LOGSTDBY: Validation complete
Completed: alter database open
.
.
.
Thu Oct 18 06:53:33 2018
opiodr aborting process unknown ospid (26428) as a result of ORA-24756
Thu Oct 18 06:53:36 2018
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
ORA-24756: transaction does not exist
Errors in file /oracle/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_pmon_30688.trc:
.
.
.
Starting background process GTX0
Thu Oct 18 09:51:20 2018
GTX0 started with pid=36, OS id=32470
Starting background process RCBG
Thu Oct 18 09:51:20 2018
The alert log excerpts above show that the primary executed a SWITCHOVER at 02:36 and the switchover drill began.
At 03:11 the database was opened; the alert log notes that it was now a physical standby opened read-only, and the lines that follow show the background process GTX0 being started.
At 04:57 a SWITCHOVER was executed again. This time the database was not fully shut down but was converted directly to MOUNT mode, and the OPEN completed at 04:57:44. Nowhere in this part of the alert log, however, is there any sign of a GTXn process being started.
The next time GTX0 appears in the alert log is at 09:51, when instance 1 was restarted after the fault.
The problem is now clear: during the switch from physical standby back to primary, the GTXn process was not started, so Oracle could not handle the locks needed by distributed (global) transactions. Foreground sessions therefore piled up on the Global transaction acquire instance locks wait, and PMON could not clean up sessions and transactions properly.
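To see whether global transactions really are involved at a moment like this, the standard views can be queried directly (a sketch only; both views exist in any 11g database, and no rows simply means nothing is pending or active):
-- in-doubt distributed transactions awaiting resolution
select local_tran_id, global_tran_id, state, fail_time
  from dba_2pc_pending;
-- currently active global (XA) transaction branches
select formatid, globalid, branchid, state
  from v$global_transaction;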
Because instance 2 was shut down and restarted as part of the SWITCHOVER, its GTXn process came up normally, which is why the fault no longer occurred once the application was moved to instance 2.
To avoid this kind of problem, one measure is to restart the database once more after it has been switched back to the PRIMARY role at the end of a Data Guard switchover drill, to make sure all the necessary background processes are started. On the application side, if tables in another database really must be modified, this should be done through a separate connection to that database rather than by updating remote tables directly over a database link.
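For reference, the pattern the second recommendation warns against looks like the sketch below (remote_db, orders and the column names are hypothetical, used only for illustration). DML against a remote table over a database link turns the local transaction into a distributed transaction, which is exactly the kind of global transaction handling that broke down in this incident:
-- to be avoided: remote DML over a database link inside a local transaction
-- (remote_db is a hypothetical database link name)
update orders@remote_db
   set status = 'DONE'
 where order_id = 1001;
commit;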