November 19, 2010

Recipe: handling database communication breakdown in Rails and Django

This recipe is for users who manually handle persistent (shared) database connection, with Oracle, although it's probably easy to adapt for other database backends with small adjustments. I needed to implement this both in Ruby (Rails' ActiveRecord) and Python (django.db) daemons, both working in the same setting. I thought the solution was elegant enough to be cleaned up into a recipe.

Automated database connection pooling for Python (nor Ruby) is not supported with Oracle 10g. Only 11g introduced Database Resident Connection Pooling (DRCP). It probably makes this recipe redundant for Python users with 11g. Oracle has official support for using Python with Oracle Database 11g, using the cx_Oracle driver. I'm not sure but I think Oracle doesn't have any form of official support for Ruby, but the de-facto ruby-oci8 driver works well with ActiveRecord.

Without automated connection pooling you have two decent options. Either open and close a new connection for each request, or save open connections into shared memory. The latter has obvious performance benefits if the expected request rate is high, but exposes a number of other problems this recipe aims to address. A connection can abruptly terminate because of e.g. a network glitch or maintenance downtime, and the sysadmins deploying your software will be happier having to deal with less services to reboot.

The handler should catch exceptions only related to network or database resource failures, whereas in other cases (SQL typo?) the original exception should not be reacted upon. Figuring out the proper ORA error codes to use was an exciting task, I hope I got them right ;). Those guys at Oracle sure love to make things easy. Since I use a regular expression to match the error message string, a multitude of errors can be matched with a single rule. Fortunately, all errors dealing with Transparent Network Substrate are marked by prefix "TNS". In my tests most errors were indeed caught with the single "ORA-.....: TNS" regexp.

Python and Ruby have much in common, both could let me write this very expressively with about the same LOC. There is one small detail I'd like to bring up. These two expressions both compare the codes listed with the exception (error message in variable "message"), and return true if the message matches to one of the given ORA error codes.

Python

# Example message="ORA-03113: end-of-file on communication channel"
any(filter(lambda oracode: re.search('ORA-'+oracode,message), re.split(' *\n *| +(?=\d)',"""
    .....: TNS
    01000 01001 01014 01033 01034 01037 01089 01090
    011.. 015.. 016..
    031..
    28547
    30678
    """)[1:-1]))
Ruby
# Example message="ORA-01034: ORACLE not available"
%w{ .....:\ TNS
    01000 01001 01014 01033 01034 01037 01089 01090
    011.. 015.. 016..
    031..
    28547
    30678
}.select {|oracode| message[/ORA-#{oracode}/]}.any?
Especially take notice of Ruby's %w{multiline string} versus Python's re.split(' *\n *| +(?=\d)',"""multiline string""")[1:-1]. This just for the sake of the visual appearance of input data on the screen. Ruby makes it easier to write aesthetic code. Perhaps Python has an idiom for this, though, it just didn't pass my mind.

example_oci_exception_handler.py

# -*- coding: utf-8 -*-
"""
You may use this file under the terms of the BSD license as follows:

Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:

   1. Redistributions of source code must retain the above copyright notice, this list of
      conditions and the following disclaimer.

   2. Redistributions in binary form must reproduce the above copyright notice, this list
      of conditions and the following disclaimer in the documentation and/or other materials
      provided with the distribution.

THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL  OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the
authors and should not be interpreted as representing official policies, either expressed
or implied, of copyright holder.
"""
import re
import sys
from time import sleep
from myapp.settings import settings # your django settings module
from django.core.management import setup_environ
setup_environ(settings)
import cx_Oracle
from django.db import connection
from django.db.utils import DatabaseError
from logging import getLogger
log = getLogger(__name__)

class ExampleOCIExceptionHandler():
    """Recipe for handling lost Oracle database connection.
    
    Depends on Django ORM, and that proper settings.py exist and is initialized properly.
    
    Selects the network and connection errors, waits until the server is reachable again,
    and calls the method again during upon which the exception happened.
    
    This is very useful if the connection, for some reason, is shared between requests.
    Oracle 10g does not support connection pooling with cx_Oracle.
    """

    @staticmethod
    def dispatch(message_data):
        """Processes request.

        In case of database connection failure, the waitForOCIConnection() loop is called
        and connection should be re-established. In case of other Oracle errors,
        the error is raised again.
        """
        try:
            # do some database operations...

        except (cx_Oracle.Error, DatabaseError):
            """
            Catch lost database connection.

            Handles all TNS errors:
              ORA-xxxxx: TNS errors

            And all errors from these ranges:
              ORA-011xx: Database file errors
              ORA-015xx: Execution errors
              ORA-016xx: Execution errors
              ORA-031xx: communication errors

            Along with these specific errors:
              ORA-01000: maximum open cursors exceeded
              ORA-01001: invalid cursor
              ORA-01014: ORACLE shutdown in progress
              ORA-01033: ORACLE initialization or shutdown in progress
              ORA-01034: ORACLE not available
              ORA-01037: cannot allocate sort work area cursor; too many cursors
              ORA-01089: immediate shutdown in progress - no operations are permitted
              ORA-01090: shutdown in progress - connection is not permitted
              ORA-28547: connection to server failed, probable Oracle Net admin error
              ORA-30678: too many open connections

            """
            msg = str(sys.exc_info()[1]).rstrip()

            # is the error about broken connection?
            if any(filter(lambda oracode: re.search('ORA-'+oracode,msg), re.split(' *\n *| +(?=\d)',"""
                .....: TNS
                01000 01001 01014 01033 01034 01037 01089 01090
                011.. 015.. 016..
                031..
                28547
                30678
                """)[1:-1]) # ignore outermost items as they are empty strings
            ):
                log.error(msg)
                ExampleOCIExceptionHandler.waitForOCIConnection()
                # enter recursion and call this method again..
                sleep(2)
                return ExampleOCIExceptionHandler.dispatch(message_data)

            else:
                log.debug(msg)
                log.debug("Error seems to be not about lost connection, raising again ..")
                raise

    @staticmethod
    def waitForOCIConnection(time_to_wait=60):
        """Creates a new database connection, loops until one is established."""
        # nullify the connection first, since it can't discover that the socket is gone
        connection.connection = None
        while not connection._valid_connection():
            try:
                log.info("Attempt to establish OCI connection to %s ..." % [
                    settings.DATABASES['default']['NAME']])
                # in django parlance, calling _cursor() opens the database connection
                cursor = connection._cursor()
                sleep(2)
                return connection._valid_connection()

            except (cx_Oracle.Error, DatabaseError):
                exctype, message = sys.exc_info()[:2]
                log.error(str(message).rstrip())
                log.info("Retrying OCI connection in %i seconds" % time_to_wait)
                sleep(time_to_wait)
                return ExampleOCIExceptionHandler.waitForOCIConnection(time_to_wait)


example_oci_exception_handler.rb
# encoding: utf-8
=begin
You may use this file under the terms of the BSD license as follows:

Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:

   1. Redistributions of source code must retain the above copyright notice, this list of
      conditions and the following disclaimer.

   2. Redistributions in binary form must reproduce the above copyright notice, this list
      of conditions and the following disclaimer in the documentation and/or other materials
      provided with the distribution.

THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL  OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the
authors and should not be interpreted as representing official policies, either expressed
or implied, of copyright holder.
=end
require 'rubygems'
require 'active_record'

# Recipe for handling lost Oracle database connection with ActiveRecord.
#
# Depends on RAILS_ROOT and RAILS_ENV variables to be set and RAILS_ROOT/config/database.yml file to exist.
#
# Selects the network and connection errors, waits until the server is reachable again,
# and calls the method again during upon which the exception happened.
#
# This is very useful if the connection, for some reason, is shared between requests.
class ExampleOCIExceptionHandler

  # Processes request.
  #
  # In case of database connection failure, the waitForOCIConnection() loop is called
  # and connection should be re-established. In case of other Oracle errors,
  # the error is raised again.
  def self.dispatch(message_data)
    begin
      # do some database operations...

    # Catch lost database connection.
    #
    # Handles all TNS errors:
    #   ORA-xxxxx: TNS errors
    #
    # And all errors from these ranges:
    #   ORA-011xx: Database file errors
    #   ORA-015xx: Execution errors
    #   ORA-016xx: Execution errors
    #   ORA-031xx: communication errors
    #
    # Along with these specific errors:
    #   ORA-01000: maximum open cursors exceeded
    #   ORA-01001: invalid cursor
    #   ORA-01014: ORACLE shutdown in progress
    #   ORA-01033: ORACLE initialization or shutdown in progress
    #   ORA-01034: ORACLE not available
    #   ORA-01037: cannot allocate sort work area cursor; too many cursors
    #   ORA-01089: immediate shutdown in progress - no operations are permitted
    #   ORA-01090: shutdown in progress - connection is not permitted
    #   ORA-28547: connection to server failed, probable Oracle Net admin error
    #   ORA-30678: too many open connections
    #
    rescue OCIError, ActiveRecord::StatementInvalid
      msg = $!.message
      
      # is the error about broken connection?
      if %w{
        .....: TNS
        01000 01001 01014 01033 01034 01037 01089 01090
        011.. 015.. 016..
        031..
        28547
        30678
      }.select{|oracode| msg[/ORA-#{oracode}/]}.any?
      
        logger.error msg
        waitForOCIConnection()
        # enter recursion and call this method again..
        return dispatch(message_data)

      else
        logger.debug msg
        logger.debug("Error seems to be not about lost connection, raising again ..")
        raise $!
      end
    end
  end


  # Attempts to establish database connection.
  # Loops until ActiveRecord is connected.
  #
  # Set time_to_wait in seconds.
  def self.waitForOCIConnection(time_to_wait=60)
    begin
      database_configuration = YAML.load_file(File.join(RAILS_ROOT,"config","database.yml"))
      ActiveRecord::Base.configurations = database_configuration
      logger.info(
        "Attempt to establish OCI connection to %s ..." % database_configuration[RAILS_ENV]['database'])
      ActiveRecord::Base.establish_connection(database_configuration[RAILS_ENV])
      ActiveRecord::Base.connection # essential to open connection
      return ActiveRecord::Base.connected?

    rescue OCIError, ActiveRecord::StatementInvalid
      logger.error($!.message)
      logger.info("Retrying OCI connection in %i seconds" % time_to_wait)
      sleep time_to_wait
      return waitForOCIConnection(time_to_wait)
    end
  end
end