MMNL生成非常大的trace文件,包含ASH dump, MMON进程不存在

栏目: 数据库 · 发布时间: 5年前

内容简介:前几日一套11.2.0.4.7 on suse 11 2nodes2 RAC中节点2的oracle软件目录使用率高,发现mmnl后台进程生成了64GB的trace文件, 包含大量的ash dump信息, 下面简单的记录一下该问题。Mmnl trace file=========================

前几日一套11.2.0.4.7 on suse 11 2nodes2 RAC中节点2的oracle软件目录使用率高,发现mmnl后台进程生成了64GB的trace文件, 包含大量的ash dump信息, 下面简单的记录一下该问题。

Mmnl trace file

=========================

*** 2018-08-22 20:33:13.512
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(74) (!= 73) at 22-AUG-18 08.53.28.796 PM

*** 2018-08-22 20:53:28.796
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(75) (!= 74) at 22-AUG-18 09.03.37.184 PM

*** 2018-08-22 21:03:37.184
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kewa_sampler: Re-signalling MMON(76) (!= 75) at 22-AUG-18 09.13.36.996 PM

*** 2018-08-22 21:13:36.996
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453

*** 2018-08-22 21:23:37.692
kewa_sampler: Re-signalling MMON(77) (!= 76) at 22-AUG-18 09.23.37.692 PM
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kgskmetricupd: could not get change-schema latch
kgskmetricupd: could not get change-schema latch
kewa_sampler: Re-signalling MMON(78) (!= 77) at 23-AUG-18 12.23.40.741 AM

*** 2018-08-23 00:23:40.741
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
kgskmetricupd: could not get change-schema latch
kgskmetricupd: could not get change-schema latch

*** 2018-08-23 01:02:31.559
kgskmetricupd: could not get change-schema latch

*** 2018-08-23 01:04:30.886
kgskmetricupd: could not get change-schema latch
kewa_sampler: Re-signalling MMON(79) (!= 78) at 23-AUG-18 01.23.57.348 AM

*** 2018-08-23 01:23:57.349
kewa_sampler_cb: Exception handler 453
kewa_sampler_cb: Cleared error 453
ASH dump

*** 2018-08-23 01:38:45.614
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP HEADER BEGIN>>>
****************
SCRIPT TO IMPORT
****************
------------------------------------------
Step 1: Create destination table <ashdump>
------------------------------------------
CREATE TABLE ashdump AS
SELECT * FROM SYS.WRH$_ACTIVE_SESSION_HISTORY WHERE rownum < 0
----------------------------------------------------------------
Step 2: Create the SQL*Loader control file <ashldr.ctl> as below
----------------------------------------------------------------
load data
infile * "str '\n####\n'"
append
into table ashdump
fields terminated by ',' optionally enclosed by '"'
(
SNAP_ID CONSTANT 0 ,
DBID ,
INSTANCE_NUMBER ,

...
...
...
<<<ACTIVE SESSION HISTORY - PROCESS TRACE DUMP END>>>

Note:

Trace文件中包含大旱的ASH DUMP信息,MMNL后台进程负责当前活动会话历史记录(ASH)进行采样,存储在循环SGA内存缓冲区中的信息将通过快照或紧急刷新刷新到AWR。

如果循环缓冲区已满三分之二,则ASH紧急冲洗将启动。如果

由于某种原因无法将信息写入AWR

,则ASH会将其内容(如果没有重新启动,则每12小时一次)转储到MMNL跟踪文件。

将信息写入跟踪而不是AWR的可能原因包括:

ASH缓冲区太小<===首先检查ASH缓冲区是否太小

MMON进程没有响应<====然后,检查MMON进程没有响应的原因

1, 首先确认当前的ASH 大小是否紧张?

如果因为ash buffer size小而紧急刷新,可以从db alert log确认,也可以使用下面的 sql 确认:

SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info;

TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT
---------- --------------- -------------------------
 134217728           20529                      6828

Note:

这里的紧急刷新(AWR_FLUSH_EMERGENCY_COUNT)次数不少,可以调大该参数(可在线调整),确认ash buffer从v$ash_info.total_size, 默认“_ash_size”并非当前值,但是修改该内存大小是修改”_ash_size”参数值,可以每次尝试增长50%观察使用情况。

SQL> alter system set "_ash_size"=100m;
alter system set "_ash_size"=100m
*
ERROR at line 1:
ORA-02097: parameter cannot be modified because specified value is invalid
ORA-02097: parameter cannot be modified because specified value is invalid

SQL> @pd ash_size
Show all parameters and session values from x$ksppi/x$ksppcv...

      INDX I_HEX NAME                                               VALUE                          DESCRIPTION
---------- ----- -------------------------------------------------- ------------------------------ ----------------------------------------------------------------------
      2647   A57 _olap_dimension_corehash_size                      30                             OLAP Dimension In-Core Hash Table Maximum Memory Use
      2731   AAB _kffmap_hash_size                                  1024                           size of kffmap_hash table
      2732   AAC _kffmop_hash_size                                  2048                           size of kffmop_hash table
      2840   B18 _ash_size                                          1048618                        To set the size of the in-memory Active Session History buffers

SQL> alter system set "_ash_size"=200m;
System altered.

SQL> select total_size,awr_flush_count,awr_flush_emergency_count from v$ash_info;

TOTAL_SIZE AWR_FLUSH_COUNT AWR_FLUSH_EMERGENCY_COUNT
---------- --------------- -------------------------
 209715200           20531                      6829

Note:

如果调整的值小于当前值会提示ora-2097错误,但是在12C之前的版本超过了“_ash_size”的最大上限254Mb,同样会提示ora-2097错误。 但是从12c以后版本_ash_size值可以调整到大于254MB.

2. 检查MMON

既然写了这么多的ASH dump说明很多次是flush AWR失败的,  负责AWR Snapshot收集的是MMON进程, 然后检查MMON trace无日志文件。

oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmon
grid 8576 1 0 2016 ? 03:29:29 asm_mmon_+ASM2

oracle@ANBOB-2:/oracle/app/oracle/diag/rdbms/ANBOB/ANBOB2/trace>ps -ef|grep mmnl
grid 8578 1 0 2016 ? 2-17:07:26 asm_mmnl_+ASM2
oracle 9176 1 1 2016 ? 11-20:41:46 ora_mmnl_ANBOB2
 
SQL> col systimestamp form a35
SQL> col most_recent_snap_time form a25
SQL> col snap_interval form a17
SQL> select systimestamp, most_recent_snap_time, snap_interval from wrm$_wr_control where dbid = (select dbid from v$database);

SYSTIMESTAMP MOST_RECENT_SNAP_TIME SNAP_INTERVAL
----------------------------------- ------------------------- -----------------
24-AUG-18 09.36.27.638363 AM +08:00 24-AUG-18 09.00.33.122 AM +00000 01:00:00.0

SQL> select thread#,open_time from v$thread;

THREAD# OPEN_TIME
---------- -----------------
1 20161225 17:46:00
2 20160226 14:31:03

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ WRITE

SQL> col instance_number form 999 head INST
SQL>  col begin_interval_time form a25
SQL>  col flush_elapsed form a17
SQL>  col status form 999
SQL>  col error_count form 999 head ERR
SQL>  col snap_flag form 999 head SNAP
SQL> select * from
      (select snap_id,
      instance_number,
      begin_interval_time,
      flush_elapsed,
      status,
      error_count,
      snap_flag
      from wrm$_snapshot
      where dbid = (select dbid from v$database)
      and instance_number=2   --# instance 2
      order by snap_id desc)
      where rownum <= 10
      order by snap_id ;
 -- none --

Note:

该DB实例的MMON进程都不存在,且近段时间均为生成AWR snapshot.

AWR没有生成SNAPSHOT的原因:

1,数据库配置问题和当前状态

1.1 实例参数STATISTICS_LEVEL

1.2 数据库没打开

1.3 数据库打开没有在READ WRITE模式

1.4 配置的间隔时间问题

2,数据库MMON后台进程问题

2.1 MMON进程未启动

2.2 MMON进程挂起

2.3 MMON在收集某项时时间异常久,没有完成

2.4 SYSAUX表空间可用空间紧张

实例1的AWR正常,只有实例2的存在这现象,此时可以尝试手动生成AWR Snapshot,检查是否有报错或是某项收集时间长。

SQL> exec dbms_workload_repository.create_snapshot;
SQL> exec dbms_workload_repository.create_snapshot;
SQL> 

SQL> select * from
  2        (select snap_id,
  3        instance_number,
  4        begin_interval_time,
  5        flush_elapsed,
  6        status,
  7        error_count,
  8        snap_flag
  9        from wrm$_snapshot
 10        where dbid = (select dbid from v$database)
 11        and instance_number=2   --# instance 2
 12        order by snap_id desc)
 13        where rownum <= 10
 14        order by snap_id ;

  Snap Id INST BEGIN_INTERVAL_TIME       FLUSH_ELAPSED     STATUS  ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
    22765    2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2      0    0    1
    22766    2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3      0    0    1

Note:

手动是可以生成,说明在AWR实际收集是没有问题的,只是调度进程出了问题, 当然如果手动收集时间都异常久可以配置AWR debug trace查看具体日志

Gathering MMON action trace and snapshot flush trace:
SQL> alter session set "_swrf_test_action" = 28;
Session altered. 

SQL> alter session set "_swrf_test_action" = 10; 
Session altered.

等待AWR snapshot生成后,关闭TRACE对应_ swrf_test_action 值使用29和11, 同时还有 awr_test dump event level(1,2,3,4)可以trace AWR的过程。该事件虽然是alter session, 但影响是实例级的。

对于本案例主要是因为MMON进程不存在,配置EVENT已没有意义, 下一步查看db alert log 查找MMON进程没有启动或停止的原因。

# alert log
Sat Sep 16 05:56:38 2017
VKRM started with pid=87, OS id=44418
Sat Sep 16 07:24:10 2017
Thread 2 advanced to log sequence 3377 (LGWR switch)
  Current log# 3 seq# 3377 mem# 0: +DATADG/yygk/redo03.log
Sat Sep 16 09:14:04 2017
ORA-25455 encountered when generating server alert SMG-3500
Sat Sep 16 09:15:04 2017
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1]
Errors in file /oracle/app/oracle/diag/rdbms/yygk/yygk2/trace/yygk2_mmon_9174.trc  (incident=288273):
ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] []
Incident details in: /oracle/app/oracle/diag/rdbms/yygk/yygk2/incident/incdir_288273/yygk2_mmon_9174_i288273.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Sat Sep 16 09:15:06 2017
Dumping diagnostic data in directory=[cdmp_20170916091506], requested by (instance=2, osid=9174 (MMON)), summary=[incident=288273].
Sat Sep 16 14:58:12 2017
Thread 2 advanced to log sequence 3378 (LGWR switch)
  Current log# 4 seq# 3378 mem# 0: +DATADG/yygk/redo04.log
Sun Sep 17 01:56:39 2017
Closing Resource Manager plan via scheduler window
Clearing Resource Manager plan via parameter
Sun Sep 17 05:56:39 2017
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Sun Sep 17 05:56:39 2017
Starting background process VKRM
Sun Sep 17 05:56:39 2017


oracle@anbob-2:/home/oracle>vi /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc
Dump file /oracle/app/oracle/diag/rdbms/anbob/anbob2/incident/incdir_288273/anbob2_mmon_9174_i288273.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/db
System name:    Linux
Node name:      anbob-2
Release:        3.0.76-0.11-default
Version:        #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990)
Machine:        x86_64
Instance name: anbob2
Redo thread mounted by this instance: 2
Oracle process number: 34
Unix process pid: 9174, image: oracle@anbob-2 (MMON)


*** 2017-09-16 09:15:05.098
*** SESSION ID:(4795.1) 2017-09-16 09:15:05.098
*** CLIENT ID:() 2017-09-16 09:15:05.098
*** SERVICE NAME:(SYS$BACKGROUND) 2017-09-16 09:15:05.098
*** MODULE NAME:() 2017-09-16 09:15:05.098
*** ACTION NAME:() 2017-09-16 09:15:05.098

Dump continued from file: /oracle/app/oracle/diag/rdbms/anbob/anbob2/trace/anbob2_mmon_9174.trc
ORA-07445: exception encountered: core dump [kocdsgt()+368] [SIGSEGV] [ADDR:0x0] [PC:0x9929640] [SI_KERNEL(general_protection)] []

========= Dump for incident 288273 (ORA 7445 [kocdsgt()+368]) ========
----- Beginning of Customized Incident Dump(s) -----
Exception [type: SIGSEGV, SI_KERNEL(general_protection)] [ADDR:0x0] [PC:0x9929640, kocdsgt()+368] [flags: 0x0, count: 1]
Registers:
%rax: 0x00007f14f45e3640 %rbx: 0x0000000000000000 %rcx: 0x0000000000000008
%rdx: 0x0140000700000100 %rdi: 0x00007f14f4325568 %rsi: 0x0000000000000000
%rsp: 0x00007fffbec8d8c0 %rbp: 0x00007fffbec8db60  %r8: 0x0000000000000000
 %r9: 0x00007f14f4325d70 %r10: 0x00007f14f4326570 %r11: 0x0000000000000000
%r12: 0x00007f14f4344e20 %r13: 0x0000000000002010 %r14: 0x0000000000000000
%r15: 0x00007fffbec8dca8 %rip: 0x0000000009929640 %efl: 0x0000000000010203
  kocdsgt()+352 (0x9929630) add $-48,%rdx
  kocdsgt()+356 (0x9929634) mov %rdx,-0x40(%rbp)
  kocdsgt()+360 (0x9929638) mov -0x40(%rbp),%rax
  kocdsgt()+364 (0x992963c) mov 0x8(%rax),%rdx
> kocdsgt()+368 (0x9929640) movzwl (%rdx),%ebx
  kocdsgt()+371 (0x9929643) mov -0x38(%rbp),%rcx
  kocdsgt()+375 (0x9929647) movzwl 0xc(%rcx),%esi
  kocdsgt()+379 (0x992964b) cmp %esi,%ebx
  kocdsgt()+381 (0x992964d) jnz 0x992982b
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.

----- Call Stack Trace -----

location             type  
-------------------- ------
skdstdst()+41        <  ksedst1()+103        <  ksedst()+39          <  dbkedDefDump()+2746  <  
ksedmp()+41          <  ssexhd()+2462        <  __sighandler()       <  kocdsgt()+368        signal
kocioc()+648         <  kokdn2p()+520        <  kodin2p()+176        <  kocapm()+1051        <  koioapm()+525     <  
OCITypeByName()+816  <  keltgtyp()+75        <  kpotcuexec()+254     <  rpiswu2()+1776       <  
OCIKSwitchUserWithPriv()+200    <      keltenq()+583        <  
keltpost()+1812      <  kelr_issue_alert()309   kelr_monitor_metrics()+3631   <              
ksb_run_managed_action()+384  <                ksbcti()+2490        <  ksbabs()+1735        <  
kebm_mmon_main()+209 <                         ksbrdp()+1045        <  opirip()+623         <  opidrv()+603         <  
sou2o()+103          <  opimai_real()+250    <  ssthrdmain()+265     <  
main()+201

Note:

根据DB ALERT LOG可以发现该实例启动后,过了一段时间MMON进程因ORA-7445内部错误挂掉后就一直没有再启动,AWR SNAPSHOT也一直没有再生成。 MOS中有个相似的CASE。 Bug 22494766 : ORA-7445[KOCCNGT+74] ON MMON PROCESS AFTER ORA-25455

解决方法

1,重启实例

或者

2. 临时启动并禁用限制模式

ALTER SYSTEM enable restricted session;

ALTER SYSTEM disable restricted session;

因为这是一个新交维的边缘库,平时没有什么新连接和活动,但是重启实例目前是不可以的。 启用限制模式只影响新连接请求,已有连接无影响,我们采用方案2.

— db alert log

2018-08-24 09:01:23.236000 +08:00
Thread 2 advanced to log sequence 6820 (LGWR switch)
  Current log# 4 seq# 6820 mem# 0: +DATADG/anbob/redo04.log
2018-08-24 10:25:59.102000 +08:00
Stopping background process MMNL
2018-08-24 10:26:00.101000 +08:00
Starting background process MMON
MMON started with pid=35, OS id=76908
Starting background process MMNL
MMNL started with pid=120, OS id=76910
ALTER SYSTEM enable restricted session;
ALTER SYSTEM disable restricted session;
2018-08-24 10:26:01.811000 +08:00
minact-scn: Inst 2 is a slave inc#:40 mmon proc-id:76908 status:0x2
minact-scn status: grec-scn:0x0ccf.d5bb5cdf gmin-scn:0x0ccf.d5b7d696 gcalc-scn:0x0ccf.d5bb3a01
Some DDE async actions failed or were cancelled

oracle@anbob-2:/home/oracle>ps -ef|grep ora_m|grep -v grep
oracle    9144     1  0  2016 ?        02:07:06 ora_mman_anbob2
oracle    9184     1  0  2016 ?        01:55:18 ora_mark_anbob2
oracle   76908     1  0 10:25 ?        00:01:35 ora_mmon_anbob2
oracle   76910     1  1 10:25 ?        00:05:52 ora_mmnl_anbob2

SQL> select * from
  2        (select snap_id,
  3        instance_number,
  4        begin_interval_time,
  5        flush_elapsed,
  6        status,
  7        error_count,
  8        snap_flag
  9        from wrm$_snapshot
 10        where dbid = (select dbid from v$database)
 11        and instance_number=2   --# instance 2
 12        order by snap_id desc)
 13        where rownum <= 10 14 order by snap_id ; Snap Id INST BEGIN_INTERVAL_TIME FLUSH_ELAPSED STATUS ERR SNAP --------- ---- ------------------------- ----------------- ------ ---- ---- 22765 2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2 0 0 1 22766 2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3 0 0 1 
SQL> /

  Snap Id INST BEGIN_INTERVAL_TIME       FLUSH_ELAPSED     STATUS  ERR SNAP
--------- ---- ------------------------- ----------------- ------ ---- ----
    22765    2 16-SEP-17 08.56.39.415 AM +00000 00:00:12.2      0    0    1
    22766    2 24-AUG-18 09.51.14.766 AM +00000 00:00:03.3      0    0    1
    22767    2 24-AUG-18 09.52.24.661 AM +00000 00:00:05.1      0    0    0
    22768    2 24-AUG-18 11.00.09.946 AM +00000 00:00:04.0      0    0    0
    22769    2 24-AUG-18 12.00.22.008 PM +00000 00:00:03.2      0    0    0
    22770    2 24-AUG-18 01.00.33.819 PM +00000 00:00:05.1      0    0    0
    22771    2 24-AUG-18 02.00.45.857 PM +00000 00:00:04.0      0    0    0
    22772    2 24-AUG-18 03.00.57.960 PM +00000 00:00:04.0      0    0    0
    22773    2 24-AUG-18 04.00.09.952 PM +00000 00:00:03.9      0    0    0
    22774    2 24-AUG-18 05.00.22.065 PM +00000 00:00:04.1      0    0    0

后期确认数据库实例已经可以正常收集AWR,MMON记启动,同时MMNL进程已重启,(KILL两进程对实例可用性无影响,该进程可以自动重启)。MMNL TRACE没有再生成那么大量的日志。

— enjoy it —


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

代码之美

代码之美

Grey Wilson / 聂雪军 / 机械工业出版社 / 2008年09月 / 99.00元

《代码之美》介绍了人类在一个奋斗领域中的创造性和灵活性:计算机系统的开发领域。在每章中的漂亮代码都是来自独特解决方案的发现,而这种发现是来源于作者超越既定边界的远见卓识,并且识别出被多数人忽视的需求以及找出令人叹为观止的问题解决方案。 《代码之美》33章,有38位作者,每位作者贡献一章。每位作者都将自己心目中对于“美丽的代码”的认识浓缩在一章当中,张力十足。38位大牛,每个人对代码之美都有自......一起来看看 《代码之美》 这本书的介绍吧!

CSS 压缩/解压工具
CSS 压缩/解压工具

在线压缩/解压 CSS 代码

在线进制转换器
在线进制转换器

各进制数互转换器

HTML 编码/解码
HTML 编码/解码

HTML 编码/解码