Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
Clear All
new posts

    3.1d crashing server - too many open cursors

    3.1d Nightly 10/26/2012

    This didn't happen on 3.0 but is happening in 3.1d. As users use the app, our web app and J2EE container crashes (using Glassfish). The culprit seems to be "too many open cursors".

    In the open cursor history in Oracle, we are seeing the following query being called constantly until the server crashes:
    "select 1 from dual".

    A search through our entire app only has 1 file with this in it:
    ISC_DBConsole.js

    Any suggestions?

    #2
    I seem to remember having a problem like this once, and if memory serves, we had the default connection pooling enabled along with the Oracle UCP pool. Something like that anyway.

    Comment


      #3
      Thanks but I don't believe that is the cause. SOmething has changed in the developer console with 3.1d.

      After removing the inherit for that in our gwt bootstrap, our connections don't keep increasing.

      I'll wait for a fix before we re-enable it.

      Comment


        #4
        "select 1 from dual" is a configurable "ping test" query used to verify whether a connection is still open or has been closed by Oracle.

        We would concur with the explanation that adding two levels of pool would create this issue, and there is no other known way to create it.

        About the Developer Console, we're not sure why you think it's related, but if you are doing something like adding and removing the GWT <inherits> and seeing the issue appear or disappear, that's almost certainly a false symptom. Adding that inherit doesn't change server behavior. Instead, something in your build process is most likely picking up a different copy of server.properties when you change the .gwt.xml file.

        Comment


          #5
          I'll test more tomorrow but here is all I know:
          - Open Cursor History on Oracle shows "select 1 from dual"
          - There is 0 lines of code anywhere in our web app with that in it (compiled, uncompiled, archived, exploded, etc) that has that code in it... except for DBConsole.js
          - Removing the inherits in gwt.xml fixes the issue (I'll re-verify tomorrow... it wasn't an extensive test but we did push that to production so we'll see if the server crashes since)
          - I'll post my server.properties file

          We don't have a build process that changes server.properties. This happens in Eclipse dev mode as well as production and the server.properties are the same either way.

          I hear what you are saying... and it makes perfect sense. But if I verify tomorrow that removing that inherits makes the problem stay away... I think it warrants a double-check on your end for sanity's sake.

          Code:
          webRoot: __AUTODETECT__
          gwtModuleName: drmsWebApp
          isomorphicPathRootRelative: $gwtModuleName/sc
          
          #database config
          sql.defaultDatabase: DRMS
          sql.DRMS.database.type: oracle
          sql.DRMS.driver.networkProtocol: tcp
          sql.DRMS.driver: oracle.jdbc.pool.OracleDataSource
          sql.DRMS.interface.type: dataSource
          sql.DRMS.database.supportsSQLLimit: true
          sql.DRMS.interface.credentialsInURL: false
          sql.DRMS.driver.serverName: localhost
          sql.DRMS.driver.databaseName: drms
          sql.DRMS.driver.portNumber: 1521
          sql.DRMS.driver.driverType: thin
          #user and pw scrubbed
          
          project.datasources: $webRoot/ds/core, $webRoot/ds/ext, $webRoot/ds
          project.ui: $webRoot/shared/ui
          project.apps: $webRoot/shared/app
          
          RPCManager.enabledBuiltinMethods: *
          FilesystemDataSource.enabled: true
          
          messaging.keepaliveInterval: 3000
          messaging.keepaliveReestablishDelay: 1000
          messaging.connectTimeout: 4000
          messaging.connectionTTL: 120000
          messaging.flushBufferSize: 8096
          messaging.dispatcherImplementer: com.isomorphic.messaging.LocalMessageDispatcher

          Comment


            #6
            To clarify, "select 1 from dual" is present in framework.properties in the smartgwtee.jar.

            We can say for a certainty that if SmartGWT is issuing those queries, it's coming from this source and not DBConsole.js, which wouldn't even be executing in your application. That's why we're pretty sure it's a red herring that changing the Developer Console <inherits> could cause this issue on it's own.

            Note that 'select 1 from dual' is a widely recognized and widely recommend "ping test" for Oracle and other DBs, which basically all pooling implementations need to do, so it could be coming from "oracle.jdbc.pool.OracleDataSource" as well.

            Comment


              #7
              Thanks. We'll take a look again tomorrow.

              Comment


                #8
                Important Update

                Okay, I retested some stuff per your suggestion and you are correct.... but there is definitely an issue with 3.1.

                The error was still happening whether we imported the Tools or not.

                However, when moving to 3.0, the issue goes away and it is pretty dramatic.

                Here is the query you can run:
                Code:
                SELECT s.machine, oc.user_name, oc.sql_text, count(1) 
                FROM v$open_cursor oc, v$session s
                WHERE oc.sid = s.sid
                GROUP BY user_name, sql_text, machine
                HAVING COUNT(1) > 2
                ORDER BY count(1) DESC;
                In 3.1, this query will return tons of rows. And, when navigating the app, the totals just keep increasing (they never get closed). What seems to be happening is that whenever a DataSource.fetch() is called, cursor(s) are opened and never closed but calling the same fetch won't open new cursors. However, the 'select 1 from dual' will keep increasing and never close even when calling the same fetches.

                For example: You can test this by calling invalidateCache on a listGrid bound to a DataSource. The first time you call it, you'll see some cursors open up but it will never add more. However, the 'select 1 from dual' entries associated with that invalidateCache will continue to keep piling up.

                Another thing we've noticed is SQL-related log messages have changed from 3.0 to 3.1. In 3.0, we use to only see this message:
                Code:
                === 2012-10-31 13:21:59,808 [10-9] DEBUG PoolableSQLConnectionFactory - [builtinApplication.someDataSource_fetch] Returning pooled Connection
                In 3.1, we see logs that state things like 'About to close poolable connection with hashcode 'xyz'' in addition to other log messages not seen in 3.0.

                When using 3.0, and running the same open cursor query shown above, we can't even refresh the query fast enough to show 1 open cursor... let alone hundreds. It is like smartGwt is closing them so fast that it appears as if there are never any open cursors (what we would expect).

                As a sanity check, I've used 2 different server.properties entries for our SQL connection:
                Code:
                sql.DRMS.driver: oracle.jdbc.pool.OracleDataSource
                sql.DRMS.interface.type: dataSource
                Code:
                sql.DRMS.driver: oracle.jdbc.driver.OracleDriver
                sql.DRMS.interface.type: driverManager
                We see identical results in 3.1 and 3.0 when using both types.

                Comment


                  #9
                  OK, thanks, we'll take a look. Your log messages look like they are coming from SQLDataSource, but can we just confirm which of our built-in DS's you are using? Just SQLDS or are you using Hibernate or JPA DataSource as well?

                  Comment


                    #10
                    Yes - SQLDataSource.

                    Thanks for taking a look.

                    Comment


                      #11
                      We checked this out, what we found was that cursors were not being leaked, but were, in the specific case of our "ping test", only being recovered by garbage collection.

                      We couldn't reproduce this leading to an actual crash, and the behavior in this area hasn't changed between 3.0 and 3.1.

                      Our best speculation is that if you had a machine configured with a very large main memory and relatively small connection pool and just attempted continuous queries, you might hit this error.

                      The difference between your 3.0 vs 3.1 environment, and your environments and our test environments, might be any combination of driver version difference, Oracle version difference, JVM vendor or version (affecting GC frequency), operating system, more/less system memory taken up by other allocated Java objects or deployed applications, or just about any other factor involving memory usage patterns or DB behavior.

                      Either way, we've changed the code for both 3.0 and 3.1 to explicitly close() instead of relying on GC. Please let us know whether this fixes your issue.

                      Comment


                        #12
                        Thanks for looking into it. I'll try the nightly available tomorrow and see if it fixes the issue in 3.1.

                        As for the environments, the only difference is 3.0 to 3.1.

                        We have very controlled environments. The only difference in the auto-deploy and auto-build scripts was the 1 line where we use 3.1 instead of 3.0 .jar files.

                        The issue was there whether we used virtualized boxes, physical boxes, developer mode, hosted mode, clustered mode, etc.

                        Comment


                          #13
                          Good to know, but just for completeness, this does not suggest a regression given our other findings.

                          In fact, the most plausible cause we can imagine is that 3.1 uses less memory than 3.0 so garbage collection happens less frequently.

                          Comment

                          Working...
                          X