Skip to main content

IPC Send timeout detected - The Story of a Database

Some errors make us panic, nervous, and beyond belief when we positively attempt to resolve a panic situation. Specific to case of Clusterware where it attempts to resolve a conflict by removing dead resources from the cluster but it suffers because of lack of additional resources to complete the task, and makes the whole situation even worst so everything that has in contact with the key resource also suffers. The key resource what we are talking is a RAC Database and the Clusterware under a panic situation is Oracle Clusterware.

Here's the environment
  1. Oracle RDBMS
  2. Oracle GI
  3. 3 Node RAC Database with vault function enabled
  4. NetApp NFS Filer based Storage

What was the issue?

On Instance 3
I have noticed "IPC Send timeout".
Fri Aug 14 04:46:05 2015
IPC Send timeout detected. Receiver ospid 20545 [
Fri Aug 14 04:49:55 2015
And then, Oracle has started suspending MMON actions one by one as we noticed below messages in the log
Fri Aug 14 10:49:07 2015
Suspending MMON action 'AWR Auto CPU USAGE Task' for 82800 seconds
Fri Aug 14 07:22:11 2015
Suspending MMON action 'tablespace alert monitor' for 82800 seconds

On Instance 1
Suspending MMON slave action kttetslv_ for 82800 seconds
Fri Aug 14 07:57:53 2015
IPC Send timeout detected. Sender: ospid 4812 [oracle@phyeimrac01-tst (J000)]
Receiver: inst 3 binc 4 ospid 20545
Fri Aug 14 08:03:11 2015

This usually occurs when there is high resource contention (Locks/CPU/Memory/Storage/Network), where processes queue increases while the demand for resource increases, and global enqueue and cache services stop responding to each other in a RAC environment. In the end, it gets to a situation where it can't make any actions except for being hung, and keep trying to resolve itself. Hence, we see "Inter Process Communication" Send timeout detected.

I came to know it was because of massive load on CPUs, found from AWR report. Here's the screenshot from AWR report.

How was it resolved?

Well, during the issue, I was completely helpless to see what's going on. Nothing worked period. srvctl/crsctl, well, nothing worked. Fortunately, a connection to SQLPLUS prompt worked on unaffected instances in the cluster i.e., instance 1 and instance 2 but again, GV$ views did not work. And, strange enough, Instance eviction was not happening at all while everything was just hanging. The real culprits, Application people keep complaining "Man, performance is terrible, please do something, it's not moving at all" :). How many times we hear this, well, quite often during benchmark phase of a development project. They keep changing the code left right center and think what they do will work. Eventually, it creates a nightmare for DBA.

As we have Golden Gate configured on node 1 (its VIP runs on node1), I first brought down all replicats and then brought down instance 1 and instance 2 using "SQLPLUS prompt by "SHUT ABORT" becuase srvctl does not respond at all. It makes sense because "srvctl" in clusterware environment is driven by clusterware for mutual benefit of clusterware resources. At this stage, Two down, one to go. Well, to bring down the instance 3, the only solution to reboot the node 3. So, communicated to the server team to reboot the node 3.

Here comes the real devil - "file 1 needs to be either taken out of backup mode or media recovered"

Node 3 was rebooted, you think we are all set now? Eh? Well, here came the devil, Database was not coming up, all instances were failing to open the Database with the error below.

Abort recovery for domain 0
Errors in file /u01/app/oracle/diag/rdbms/teim/teim1/trace/teim1_ora_11744.trc:
ORA-10873: file 1 needs to be either taken out of backup mode or media recovered
ORA-01110: data file 1: '/oracle/oradata/teim/data1/system01.dbf'
ORA-10873 signalled during: ALTER DATABASE OPEN /* db agent *//* {2:30594:28372} */... 

Because I still did not know what else has happened at this time, For a moment, I thought, what the heck I did, Technically, I have just aborted all instances and expected SMON to perform Instance Crash Recovery by replaying what's in redo (roll forward committed changes/rollback uncommitted changes, and make everything consistent), and it should be all set for operations again. But, why the heck did it ask me to perform a media recovery. Not even in my dreams could I have ever imagined this.

Ok, this had to be troubleshooted now. Panic over panic, eh? Don't be panic. Come on, Oracle!!

While digging into the issue
We use NetApp snapshots for Database backup, it is basically a traditional Oracle hot backup. Yes, the one that makes "redolog" life miserable :).  The script we have in place performs like this

  1. First put the Database in BEGIN BACKUP mode - Done 
  2. Make a log switch to archive - Done 
  3. "grep" for "oradata" from "df -h" and get the list of "Database file system" volumes - Hanging
  4. Create snapshots of each volume with the date - Forget it
  5. Make a log switch again to archive - Forget it
  6. End backup with "ALTER DATABASE END BACKUP" - Forget it

Well, the step 3 was hanging because there were some stale NFS mounts. This was the devil. So, the script did not manage to END the backup, and left all Database files in "ACTIVE BACKUP" mode.
We were not aware of this while resolving the other issue "IPC Send timeout detected". Of course, why did we think of the backup issue? Nope, we do not.

Alright, no need to be panic. follow the below to bring the Database up again

Bring the Database to mount state
$srvctl start database -d <db_unique_name> -o mount
End the backup
Make sure none of the files in BACKUP mode
SQL>select * from gv$backup;
Bring the Database down
$srvctl stop database -d <db_unique_name>
Bring the Database up
$srvctl start database -d <db_unique_name>

Database is up, take a deep breath :)

I hope you have enjoyed reading this post as much as I have enjoyed resolving the issue. Please feel free to comment and share.


Popular posts from this blog

Setting NLS_LANG on Windows for Oracle Database

Sometimes, it is difficult to deal with Windows Platform as it drains the hell out of us!! Recently, I have come across a situation where one of my client's requirement was to input Arabic language into Oracle Database [or] read/retrieve the output into Various client Applications such as PL/SQL Developer/SQL Developer/TOAD etc.. Inputting the non-english language into Database  has never been difficult as we are given plenty of language options within our beloved Oracle Database, but, the problem lies within the complex Windows OS when user wanted to view the data in his/her beloved language, in Applications such as PL/SQL Developer or SQL Developer etc.. This post is all about it.

Character Set, Character encoding, & Code point Yes, it is a group of characters that is recognised by the Hardware through the OS Interface. Every character is allocated a number, called a code point, these code points will be represented in the computer by one or more bytes. So, every character h…

Rebuilding Standby Database Manually (11g and up)

One of my remote customers has been facing severe network outages between Production and DR sites to reasons unknown and challenges with certain geographic limitations, the bandwidth between sites is limited to 2MBPS and the replication between sites is aching slow. So, Customer has had tentative requirements to rebuild Standby more often between sites until the network issue is fully resolved.

This paper outlines the procedure to rebuild the STANDBY manually provided we have following things in place already between/on sites. And this procedure has a special case for One node RAC, Oracle dynamically applies SID naming convention in case of an online relocation. Due to this fact, I have to mention which SID to export before logging to the Database via SQLPLUS/RMAN/ASMCMD
(PR refers to - Production/Primary and PR1/PR2 are two nodes in RAC,  DR refers to - Disaster Recovery/Standby and DR1/DR2 are two nodes in RAC at DR Site)

A working Dataguard Configuration between Sites (All Paramete…