Hazelcast Clustering Plugin Readme

Overview

The Hazelcast plugin adds support for running multiple redundant Openfire servers together in a cluster. By running Openfire as a cluster, you can distribute the connection load among several servers, while also providing failover in the event that one of your servers fails. This plugin is a drop-in replacement for the original Openfire clustering plugin, using the open source Hazelcast In-Memory Data Grid data distribution framework in lieu of an expensive proprietary third-party product.

The current Hazelcast release is version 3.9.2.

Clustering vs. Federation

XMPP is designed to scale in ways that are similar to email. Each Openfire installation supports a single XMPP domain, and a server-to-server (S2S) protocol as described in the specification is provided to link multiple XMPP domains together. This is known as federation. It represents a powerful way to "scale out" XMPP, as it allows an XMPP user to communicate securely with any user in any other such federated domain. These federations may be public or private as appropriate. Federated domains may exchange XMPP stanzas across the Internet (WAN) and may even discover one another using DNS-based service lookup and address resolution.

By contrast, clustering is a technique used to "scale up" a single XMPP domain. The server members within a cluster all share an identical configuration. Each member will allow any user within the domain to connect, authenticate, and exchange stanzas. Clustered servers all share a single database and must reside within the same low-latency (LAN) network. This type of deployment provides runtime redundancy and supports more users and connections (within a single domain) than a single server could.

For very large Openfire deployments, a combination of federation and clustering will provide the best results. Whereas a single clustered XMPP domain will be able to support tens or even hundreds of thousands of users, a federated deployment will be needed to reach true Internet scale of millions of concurrent XMPP connections.

Installation

To create an Openfire cluster, you should have at least two Openfire servers, and each server must have the Hazelcast plugin installed. To install Hazelcast, simply drop the hazelcast.jar into $OPENFIRE_HOME/plugins along with any other plugins you may have installed. You may also use the Plugins page from the admin console to install the plugin. Note that all servers in a given cluster must be configured to share a single external database (not the Embedded DB).

By default during the Openfire startup/initialization process, the servers will discover each other by exchanging UDP (multicast) packets via a configurable IP address and port. However, be advised that many other initialization options are available and may be used if your network does not support multicast communication (see Configuration below).

After the Hazelcast plugin has been deployed to each of the servers, use the radio button controls located on the Clustering page in the admin console to activate/enable the cluster. You only need to enable clustering once; the change will be propagated to the other servers automatically. After refreshing the Clustering page you will be able to see all the servers that have successfully joined the cluster.

Note that Hazelcast and the earlier clustering plugins (clustering.jar and enterprise.jar) are mutually exclusive. You will need to remove any existing older clustering plugin(s) before installing Hazelcast into your Openfire server(s).

With your cluster up and running, you will now want some form of load balancer to distribute the connection load among the members of your Openfire cluster. There are several commercial and open source alternatives for this. For example, if you are using the HTTP/BOSH Openfire connector to connect to Openfire, the Apache web server (httpd) plus the corresponding proxy balancer module (mod_proxy_balancer) could provide a workable solution. Some other popular options include the F5 LTM (commercial) and HAProxy (open source), among many more.

A simple round-robin DNS configuration can help distribute XMPP connections across multiple Openfire servers in a cluster. Although popular as a lightweight, low-cost way to provide basic scalability, this approach is not considered adequate for true load balancing, nor does it provide high availability (HA) from a client perspective.

Upgrading the Hazelcast Plugin

The process of upgrading the Hazelcast plugin requires a few additional steps when compared with a traditional plugin due to the cross-server dependencies within a running cluster. Practically speaking, all the members of the cluster need to be running the same version of the plugin to prevent various errors and data synchronization issues.

Option 1: Offline

NOTE: This upgrade procedure is neat and tidy, but will incur a brief service outage.

  1. Shut down Openfire on all servers in the cluster.
  2. For the first server in the cluster, perform the following steps:
    1. Remove the existing plugins/hazelcast.jar
    2. Remove (recursively) the plugins/hazelcast directory
    3. Copy the updated hazelcast.jar into the plugins directory
    4. Restart Openfire to unpack and install the updated plugin
  3. Repeat these steps for the remaining servers in the cluster.
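
The per-server file operations in step 2 above (remove the old jar, remove the unpacked directory, copy in the new jar) can be sketched as a small shell function. This is a minimal sketch, not part of the plugin: the function name upgrade_hazelcast_plugin and its two arguments are hypothetical, and it assumes Openfire is already stopped on that server. Stopping and restarting the service is left out because the commands vary by platform.

```shell
#!/bin/sh
# Hypothetical helper: replace the Hazelcast plugin on ONE stopped server.
# $1 = Openfire home directory, $2 = path to the updated hazelcast.jar
# (stage the new jar outside the plugins directory).
# The body runs in a subshell, so its variables do not leak to the caller.
upgrade_hazelcast_plugin() (
    openfire_home=$1
    new_jar=$2
    plugins_dir="$openfire_home/plugins"

    # Refuse to delete anything unless both inputs look sane.
    [ -d "$plugins_dir" ] || { echo "no plugins directory: $plugins_dir" >&2; exit 1; }
    [ -f "$new_jar" ] || { echo "new jar not found: $new_jar" >&2; exit 1; }

    rm -f "$plugins_dir/hazelcast.jar"           # remove the existing jar
    rm -rf "$plugins_dir/hazelcast"              # remove the unpacked directory
    cp "$new_jar" "$plugins_dir/hazelcast.jar"   # copy in the updated jar
)
```

A call might look like upgrade_hazelcast_plugin /opt/openfire /tmp/hazelcast.jar, where both paths are placeholders for your own installation. Restart Openfire afterwards so the updated plugin is unpacked and installed.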

Option 2: Online

NOTE: Using this approach you should be able to continue servicing XMPP connections during the upgrade.

  1. Shut down Openfire on all servers except one.
  2. Using the Plugins page from the online server, remove the existing Hazelcast plugin.
  3. Upload the new Hazelcast plugin and confirm it is installed (refresh the page if necessary).
  4. Use the "Offline" steps above to upgrade and restart the remaining servers.

Option 3: Split-Brain

NOTE: Use this approach if you only have access to the Openfire console. Note however that users may not be able to communicate with each other during the upgrade (if they are connected to different servers).

  1. From the Clustering page in the Openfire admin console, disable clustering. This will disable clustering for all members of the cluster.
  2. For each server, update the Hazelcast plugin using the Plugins page.
  3. After upgrading the plugin on all servers, use the Clustering page to enable clustering. This will activate clustering for all members of the cluster.

Configuration

There are several configuration options built into the Hazelcast plugin as Openfire system properties:

  1. hazelcast.startup.retry.count (1): Number of times to retry initialization if the cluster fails to start on the first attempt.
  2. hazelcast.startup.retry.seconds (10): Number of seconds to wait between subsequent attempts to start the cluster.
  3. hazelcast.max.execution.seconds (30): Maximum time to wait when running a synchronous task across members of the cluster.
  4. hazelcast.config.xml.filename (hazelcast-cache-config.xml): Name of the Hazelcast configuration file. By overriding this value you can easily install a custom cache configuration file in the Hazelcast plugin /classes/ directory, in the directory named via the hazelcast.config.xml.directory property (described below), or in the classpath of your own custom plugin.
  5. hazelcast.config.xml.directory ({OPENFIRE_HOME}/conf): Directory that will be added to the plugin's classpath. This allows a custom Hazelcast configuration file to be located outside the Openfire home directory.
  6. hazelcast.config.jmx.enabled (false): Enables JMX support for the Hazelcast cluster if JMX has been enabled via the Openfire admin console. Refer to the Hazelcast JMX docs for additional information.

Note: The default hazelcast-cache-config.xml file included with the plugin includes the file conf/hazelcast-local-config.xml, which is preserved between plugin updates. Keep any local changes in this file.

The Hazelcast plugin uses the XML configuration builder to initialize the cluster from the XML file conf/hazelcast-local-config.xml. By default the cluster members will attempt to discover each other via UDP multicast at the following location (the Hazelcast defaults):

  IP Address: 224.2.2.3
  Port: 54327

These values can be overridden in the conf/hazelcast-local-config.xml file via the multicast-group and multicast-port elements. Many other initialization and discovery options exist, as documented in the Hazelcast configuration docs noted above. For example, to set up a two-node cluster using well-known DNS name/port values, try the following alternative:
...
<join>
    <multicast enabled="false"/>
    <tcp-ip enabled="true">
      <member>of-node-a.example.com</member>
      <member>of-node-b.example.com</member>
    </tcp-ip>
</join>
...

Please refer to the Hazelcast reference manual for more information.
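
For clusters with more than two nodes, the member list in the TCP/IP example above grows by one line per server. The following is a minimal sketch of a shell helper that prints the same fragment for any list of host names. The function name tcpip_join_fragment is hypothetical and the host names are placeholders. The output still has to be placed by hand in conf/hazelcast-local-config.xml, inside the network section where Hazelcast expects the join element.

```shell
#!/bin/sh
# Hypothetical helper: print a <join> fragment that disables multicast
# discovery and lists every argument as a <tcp-ip> member.
tcpip_join_fragment() {
    [ "$#" -gt 0 ] || { echo "usage: tcpip_join_fragment host..." >&2; return 1; }
    echo '<join>'
    echo '    <multicast enabled="false"/>'
    echo '    <tcp-ip enabled="true">'
    for host in "$@"; do
        printf '      <member>%s</member>\n' "$host"
    done
    echo '    </tcp-ip>'
    echo '</join>'
}

# Example: a three-node cluster (host names are placeholders).
tcpip_join_fragment of-node-a.example.com of-node-b.example.com of-node-c.example.com
```

Every server in the cluster should receive the same member list so that each node can find the others at startup.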

A Word About Garbage Collection

Hazelcast is quite sensitive to delays that may be caused by long-running GC cycles which are typical of servers using a default runtime JVM configuration. In most cases it will be preferable to activate the concurrent garbage collector (CMS) or the new G1 garbage collector to minimize blocking within the JVM. When using CMS, you may be able to counter the effects of heap fragmentation by using JMX to invoke System.gc() when the cluster is relatively idle (e.g. overnight). This has the effect of temporarily interrupting the concurrent GC algorithm in favor of the default GC to collect and compact the heap.

In addition, the runtime characteristics of your Openfire cluster will vary greatly depending on the number and type of clients that are connected, and which XMPP services you are using in your deployment. However, note that because many of the objects allocated on the heap are of the short-lived variety, increasing the proportion of young generation (eden) space may also have a positive impact on performance. As an example, the following OPENFIRE_OPTS have been shown to be suitable in a three-node cluster of servers (four CPUs each), supporting approximately 50k active users:

OPENFIRE_OPTS="-Xmx4G -Xms4G -XX:NewRatio=1 -XX:SurvivorRatio=4 
               -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseParNewGC
               -XX:+CMSParallelRemarkEnabled -XX:CMSFullGCsBeforeCompaction=1 
               -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly 
               -XX:+PrintGCDetails -XX:+PrintPromotionFailure"

This GC configuration will also emit helpful GC diagnostic information to the console to aid further tuning and troubleshooting as appropriate for your deployment.
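
As a counterpart to the CMS settings above, the G1 collector mentioned earlier can be selected with far fewer flags. The following is a minimal sketch only: the heap sizes are copied from the CMS example, the pause target is an assumed starting value to tune, and this configuration has not been validated against the 50k-user cluster described above. -XX:+UseG1GC and -XX:MaxGCPauseMillis are standard HotSpot options. Note that CMS was removed from the JDK in Java 14, so the CMS flags apply only to older JVMs.

```shell
# Assumed starting point, not a measured recommendation:
# same 4G heap as the CMS example, G1 with a 200 ms pause-time goal.
OPENFIRE_OPTS="-Xmx4G -Xms4G -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```

Whichever collector you choose, review the GC logs under realistic load before settling on final values.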