Thursday, January 24, 2013

WS-BA Participant Completion Race Condition: Part One

Overview

The WS-BA participant-completion protocol has a benign race condition that, in unusual circumstances, can cause some business activities to be cancelled when they could otherwise have closed. This is safe, as no inconsistency arises, but it can be annoying for users. This post explains why it can happen, under what conditions, and what you can do to tolerate it. It gives an overview of the issue and should provide enough detail for most developers. A follow-up post will get into the nitty-gritty details of what's really going on.

What's happening, in a nutshell

Imagine a scenario where the client begins a business activity and then invokes a Web service. If the Web service uses participant completion, it will notify the coordinator when it has completed its work and then return control to the client. This notification is asynchronous, so it's possible that the client will then ask the coordinator to close the activity before the coordinator processes (or even receives) the completed notification from the participant. In this situation the coordinator will cancel the activity as not all participants (from its perspective) have completed their work. As a result all completed participants are compensated (including, eventually, the participant with the late 'completed' notification) and the client receives a "TransactionRolledBackException".
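To make the ordering concrete, here is a minimal sketch of the race as a toy model. It is not XTS code: ToyCoordinator, onCompleted() and close() are hypothetical names invented for illustration, and latches are used to force each ordering deterministically, where a real deployment leaves it to timing.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

/** Toy model of the race; none of these types are part of the XTS API. */
class CompletionRaceSketch {

    /** Hypothetical coordinator tracking a single enlisted participant. */
    static final class ToyCoordinator {
        private final AtomicBoolean participantCompleted = new AtomicBoolean(false);

        /** Called when the participant's 'completed' notification is processed. */
        void onCompleted() {
            participantCompleted.set(true);
        }

        /** Closes only if the participant is known to have completed; otherwise cancels. */
        String close() {
            return participantCompleted.get() ? "CLOSED" : "CANCELLED";
        }
    }

    /** Runs one activity, forcing 'completed' to arrive before or after the close request. */
    static String run(boolean completedArrivesBeforeClose) {
        ToyCoordinator coordinator = new ToyCoordinator();
        CountDownLatch closeProcessed = new CountDownLatch(1);
        CountDownLatch completedDelivered = new CountDownLatch(1);
        ExecutorService transport = Executors.newSingleThreadExecutor();
        try {
            // The participant hands 'completed' to an asynchronous transport
            // and returns control to the client straight away.
            transport.execute(() -> {
                try {
                    if (!completedArrivesBeforeClose) {
                        closeProcessed.await(); // notification is still in flight
                    }
                    coordinator.onCompleted();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    completedDelivered.countDown();
                }
            });
            if (completedArrivesBeforeClose) {
                completedDelivered.await();
            }
            String outcome = coordinator.close(); // the client asks to close
            closeProcessed.countDown();
            completedDelivered.await(); // the late notification still arrives
            return outcome;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while simulating the race", e);
        } finally {
            transport.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed first: " + run(true));
        System.out.println("close first:     " + run(false));
    }
}
```

In this model the participant did the same work in both runs; only the arrival order of the two messages at the coordinator differs.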

When is it most likely to happen?

Typically this happens when the client, coordinator and participant are running inside the same VM. This scenario is unlikely to happen in production, but can happen regularly during development where a single VM is used to keep things simple.

How do I know if this is affecting my application?

If the client occasionally receives a TransactionRolledBackException when calling UserBusinessActivity#close(), but none of the machines involved in running the transaction have crashed, you could be affected by this, especially if you are running the client, coordinator and participant(s) in the same server.

We've now added a log message to help you identify this. To see it, you will need either to build from the current source (the 4.17 or master branch on GitHub) or to wait for the JBossTS 4.17.4 or Narayana 5.0.0.M2 release. This is the log message to look out for:

WARN  [com.arjuna.mw.wstx] (TaskWorker-2) ARJUNA045062: Coordinator cancelled the activity

This message is only an indication that you are seeing this issue, because the coordinator can also elect to cancel the activity for other reasons. For example, network problems might mean the coordinator cannot tell the web service to close the activity.

Why can't this be avoided?

The short answer is that, to avoid this, the protocol would need to make the completed notification synchronous. That would throttle throughput by slowing down both the participant and the coordinator and by holding sockets open for longer.

What can the application do to tolerate this?

A real, distributed deployment will rarely see this problem because communication latency between client, participant and coordinator will dominate the race condition. Even if it does happen, your application should tolerate it. Transaction rollbacks and activity cancellations are inevitable in a distributed environment and can happen for many reasons. When handling a TransactionRolledBackException you can either retry the transaction or activity, or notify the caller of the failure. Which you choose will depend on the requirements of your application.
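The retry option could look roughly like the sketch below. It is self-contained rather than tied to the XTS API: ActivityCancelledException is a hypothetical stand-in for the rollback exception thrown by close(), and ActivityAttempt and runWithRetry are names made up for this example. Each attempt is assumed to begin a fresh activity, do the work and close it, and retrying is only appropriate if the work is safe to repeat after compensation.

```java
/** Sketch of bounded retry around a whole business activity; not part of the XTS API. */
class RetrySketch {

    /** Hypothetical stand-in for the rollback exception raised when close() fails. */
    static final class ActivityCancelledException extends RuntimeException {
        ActivityCancelledException(String message) {
            super(message);
        }
    }

    /** One full attempt: begin the activity, invoke the services, close. */
    @FunctionalInterface
    interface ActivityAttempt<T> {
        T run();
    }

    /**
     * Runs the attempt up to maxAttempts times, retrying only when the
     * activity was cancelled. The last failure is rethrown so the caller
     * can be notified once the retries are exhausted.
     */
    static <T> T runWithRetry(ActivityAttempt<T> attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ActivityCancelledException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.run();
            } catch (ActivityCancelledException e) {
                last = e; // cancelled: the work was compensated, so try again
            }
        }
        throw last;
    }
}
```

A caller would wrap the whole begin/invoke/close sequence in the lambda passed to runWithRetry, keeping maxAttempts small so that a persistent failure is reported rather than retried indefinitely.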

In part two, I'll get into the details of what's happening.
