Monday, December 21, 2009

JGroups 2.8.0.GA released

I'm happy to announce that JGroups 2.8.0 is finally GA !

It has taken us almost a year since the last major release (2.7 was released in January), but in our defense, 2.8.0.GA contains a lot of new features and I think they are worth the wait. We also released a number of 2.6.x versions in 2009, which are used in the JBoss Enterprise Application Platform (EAP).

Before I get into a summary of some of the new features (a detailed list can be found at [1]), I'd like to thank all the developers, users and contributors of JGroups. Without this healthy community, producing code, bug reports, patches, documentation and user stories, JGroups wouldn't be anywhere close to where it is today !

So a big thanks to everyone involved, Happy Holidays and a great start into 2010 !

Here's a short list of features that made it into 2.8.0.GA (here are the release notes):
  • Logical addresses: decouples physical addresses (which can change) from logical ones. Eliminates reincarnation issues. This alone is worth 2.8, as it eliminates a big source of problems !
  • Logical names: allow for meaningful channel names, logical names stay with a channel for its lifetime, even after reconnecting it
  • Improved merging / no more shunning: shunning was replaced by merging. Now we have a much simpler model: JOIN - LEAVE - MERGE. The merging algorithm was improved to take 'weird' (e.g. asymmetric) merges into account
  • Better IPv6 support
  • Better support for defaults for addresses: based on the type of the stack (IPv4, IPv6), we perform sanity checks and set default addresses of the correct type
  • FILE_PING / S3_PING: new discovery protocols, file-based and Amazon S3 based. The latter protocol can be used as a replacement for GossipRouter on EC2
  • Speaking of which: major overhaul of GossipRouter
  • Ability to have multiple protocols of the same class in the same stack
  • Ability to override message bundling on a per-message basis
  • Much improved and faster UNICAST
  • XSD schema for protocol configurations
  • STREAMING_STATE_TRANSFER now doesn't need to use TCP, but can also use the configured transport, e.g. UDP
  • RpcDispatcher: additional methods returning a Future rather than blocking
  • Probe.sh: ability to invoke methods cluster-wide. E.g. run message stability on all nodes: probe.sh invoke=STABLE.runMessageGarbageCollection
  • Logging
    • Removal of commons-logging.jar: JGroups now has ZERO dependencies !
    • Configure logging level at runtime, e.g. through JMX (jconsole) or probe.sh, or programmatically. Use case: set logging for NAKACK from "warn" to "trace" for a unit test, then reset it back to "warn"
    • Ability to set custom log provider. This allows for support of new logging frameworks (JGroups ships with support for log4j and JDK logging)
Enjoy !
Bela, Vladimir and Richard

[1] http://javagroups.cvs.sourceforge.net/viewvc/javagroups/JGroups/doc/ReleaseNotes-2.8.txt?revision=1.10&view=markup&pathrev=Branch_JGroups_2_8

[2] http://community.jboss.org/wiki/Support

Thursday, November 05, 2009

IPv6 addresses in JGroups

I finished code to support scoped IPv6 link-local addresses [1]. A link-local address is an address that's not guaranteed to be unique on a given host (although in most cases it will be), so it can be assigned on different interfaces of the same host.

To differentiate between interfaces, a scope-id can be added, e.g. fe80::216:cbff:fea9:c3b5%en0 or fe80::216:cbff:fea9:c3b5%3, where the %X suffix denotes the interface.

Note that this is only relevant for TCP sockets; multicast and datagram sockets are not affected.

Now, on the server side, we can bind to a scoped or unscoped link-local socket, e.g.

ServerSocket srv_sock=new ServerSocket(7500, 50, InetAddress.getByName("fe80::216:cbff:fea9:c3b5"))

binds to an unscoped link-local address, and

ServerSocket srv_sock=new ServerSocket(7500, 50, InetAddress.getByName("fe80::216:cbff:fea9:c3b5%en0"))

binds to the scoped equivalent.

This is all fine, but on the client side, we cannot use scoped link-local addresses, e.g.

Socket sock=new Socket(InetAddress.getByName("fe80::216:cbff:fea9:c3b5%en0"), 7500)

fails !

The reason is that a scope-id "en0" does not mean anything on a client, which might run on a different host.

The correct code is

Socket sock=new Socket(InetAddress.getByName("fe80::216:cbff:fea9:c3b5"), 7500),

with the scope-id removed.

JGroups runs into this problem, too: whenever we have a bind_addr which is a scoped link-local IPv6 address, certain discovery protocols (e.g. MPING, TCPGOSSIP) will return the scoped addresses, and the joiners will then try to connect to the existing members using the scoped addresses.

To fix this, all Socket.connect() calls in JGroups have been replaced with Util.connect(Socket, SocketAddress, port). This method checks for scoped link-local IPv6 addresses and simply removes the scope-id from the destination address, so the connect() call will work.
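
To illustrate the idea, here's a minimal sketch of such a helper (this is not the actual Util.connect() code; only the JDK classes are real, the rest is assumed):

import java.io.IOException;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: strip the scope-id from a scoped link-local IPv6 destination before connecting
public class ScopedConnect {
    public static void connect(Socket sock, InetSocketAddress dest, int timeout) throws IOException {
        InetAddress addr=dest.getAddress();
        if(addr instanceof Inet6Address && addr.isLinkLocalAddress() && ((Inet6Address)addr).getScopeId() != 0) {
            // rebuilding the address from its raw bytes drops the scope-id
            InetAddress unscoped=InetAddress.getByAddress(addr.getAddress());
            dest=new InetSocketAddress(unscoped, dest.getPort());
        }
        sock.connect(dest, timeout);
    }
}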

Note that this problem doesn't occur with global IPv6 addresses.

I need to test whether this solution works on other operating systems, too, e.g. on Windows, Solaris and MacOS.

OK, I'm off to http://www.davidoffswissindoors.ch, hope to see some good tennis !

[1] http://www.jboss.org/community/wiki/IPv6

Wednesday, October 28, 2009

JGroups 2.8.0.CR3 released

Unfortunately, a little later than estimated, but better late than never ! The reason is that I got sidetracked by EAP 5 performance testing and also by the good feedback from the community (you !) on CR2, and the associated bug reports.

This version contains bug fixes, mostly around IPv6 versus IPv4 addresses. We now try to be smart and attempt to find out the type of stack used, and then default undefined IP addresses to addresses of the correct type. Note that IPv6 support is not yet 100% done; I'm continuing to work on this for either CR4 or GA. More on this topic in a later post...

CR3 also added a new feature, which is marshaller pools in the transport. When we send messages, they're either bundled and sent as a batch of messages, or not. In either case, the marshalling of a message or message list is done in an output buffer for which we have to acquire a lock. When we have heavy message sending, e.g. through multiple sender threads, that lock is heavily contended.

Not that this is a big issue, because the sender side is almost never the culprit in slow performance (the receiver side is !). Still, I've introduced a marshaller pool, which provides N output streams (default=2) rather than 1. The property marshaller_pool_size defines how many output streams we want in the pool, and marshaller_pool_initial_size the initial size of each output stream (in bytes).

Note that, for UDP, each output stream can grow up to 65535 bytes, so take that into account when allocating a large number of streams.

In my perf tests, I haven't found that increasing the pool size makes a difference to performance; however, if you use many threads which send messages concurrently, it does make a difference.

2.8.0.CR3 can be downloaded from http://sourceforge.net/projects/javagroups/files/JGroups/2.8.0.CR3.
Enjoy !

Friday, September 18, 2009

JGroups 2.6.13.CR2 released

OK, going from CR1 to CR2 doesn't seem like a big deal, and certainly not worth posting as a blog entry ?

You might wonder if I have nothing better to do (like biking in the French Alps) :-)

But actually, there have been significant changes since CR1, so please read on !

CR2 only contains 3 JIRA issues:
  1. Backport of NAKACK from head
  2. Backport of UNICAST from head and
  3. Removal of UNICAST contention issues
#1 is a partial backport of NAKACK from head (2.8) to the 2.6 branch. This version doesn't acquire locks for incoming messages anymore, but uses a CAS (compare-and-swap) operation to decide whether to process a message, or not.

What used to happen when a message from P was received was that we grabbed the receiver window for P and added the message. Then we grabbed the lock associated with P's window and - once acquired - removed as many messages as possible and passed them up to the application sequentially. Sequential order is always respected unless a message is tagged as OOB (out-of-band).

So here's what happened: say we received 10 multicast messages from B and 3 from A. Both A's and B's messages would be delivered in parallel with respect to each other, but sequentially for a given sender. So A's message #34 would always get delivered before #35 before #36 and so on...

However, say we have to process 10 messages from B: 1 2 3 4 5 6 7 8 9 10:
  • Every message would get into NAKACK on a separate thread
  • All the 10 messages would get added into B's receiver window
  • The thread with message #3 would grab the lock
  • All other threads would block, trying to acquire the lock
  • The thread with the lock would remove #1 and pass it up the stack, then #2, then #3 and so on, until it passed #10 up the stack to the application
  • Now it releases the lock
  • All other 9 threads now compete for the lock, but every single thread will return because there are no more messages in the receiver window
This is a terrible waste: we've wasted 9 threads; for the duration of removing and passing up 10 messages, these threads could have been put to better use, e.g. processing other messages !

For example, if our total thread pool only had 10 threads, and 1 of them was processing messages and 9 were blocked on lock acquisition, if a message from a different sender came in (which could be delivered in parallel to B's messages), then no thread would be available !

So the simple but effective change was to replace the lock on the receiver window with a CAS: when a thread tries to remove messages, it simply sets the CAS from false to true. If it succeeds, it goes into the removal loop and sets the CAS back to false when done. Otherwise, the thread simply returns, because it knows that someone else will be processing the message it just added.
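
Here's a small, self-contained sketch of that pattern; a ConcurrentLinkedQueue stands in for the receiver window and the names are mine, not the actual NAKACK code:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

public class CasDelivery {
    private final Queue<String> window=new ConcurrentLinkedQueue<String>();  // per-sender receiver window
    private final AtomicBoolean processing=new AtomicBoolean(false);

    public void receive(String msg) {
        window.add(msg);                            // every receiver thread adds its message
        if(!processing.compareAndSet(false, true))  // only one thread wins the CAS...
            return;                                 // ...the others return to the thread pool right away
        try {
            String m;
            while((m=window.poll()) != null)        // the winner drains the window and passes messages up
                deliver(m);
        }
        finally {
            processing.set(false);                  // a production version would re-check the window here,
        }                                           // so a message added just after the loop isn't left sitting
    }

    private void deliver(String m) {
        System.out.println("delivered " + m);
    }
}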

Result: we've returned 9 threads to the thread pool, ready to serve other messages, without even locking !

The net effect is faster performance and smaller thread pools. As a rule of thumb, a thread pool's max threads can now be around the number of cluster nodes: if every node sends messages, we only need 1 thread per sender to process all of that sender's messages...


#2 has 2 changes: same as above (locks replaced by CAS) and the changes outlined in the design document. The latter changes simplify UNICAST a lot and also handle the cases of asymmetrical connection closings. This was also back-ported from head (2.8)


#3 UNICAST contention issues
We used to have 2 big fat locks in UNICAST, which severely impacted performance on high unicast message volumes. The bottleneck was detected as part of our EAP testing for JBoss.

This has been fixed and is getting forward-ported to CVS head.

I guess these 3 changes make 2.6.13.CR2 worth trying out; in some cases they should make a real difference in performance !

Enjoy,

Monday, August 24, 2009

2.8.0.CR1 released

I just released 2.8.0.CR1, it can be downloaded from SourceForge (binary and source).

This version is pretty stable, and I expect a GA soon. The only open issues are currently a few IPv6-related issues and one to fix spurious merges.

The release notes are here.

Enjoy,

Monday, August 17, 2009

2.6.12.GA released

Just uploaded to SourceForge, the JIRA issues are at https://jira.jboss.org/jira/secure/IssueNavigator.jspa?reset=true&pid=10053&fixfor=12313820.

In a nutshell, 2.6.12 contains only 4 issues:
  • GossipRouter consumed 40% CPU without doing anything: fixed
  • S3_PING is a new Amazon S3 based discovery protocol for running JGroups on EC2 / S3
  • There was a memory leak in the GMS protocol on high member churn (high rate of joins and leaves)
  • FLUSH could lock up the entire cluster when the initial flush phase ran into a timeout. Thanks to Rado and Brian for discovering this bug by adding all weird combinations of failure scenarios to their merciless tests... :-) And kudos to Vladimir for investigating the (60MB !) logs, finding the offending code and fixing it, all within 60 minutes !
2.6.12.GA can be downloaded from SourceForge in binary and source versions.

Enjoy,

Thursday, July 16, 2009

Bike tour Nice

Executive summary: very nice tour with 600km, 8 mountain passes and ca 17'000m of climbing, but unfortunately cut short by the weather, so the total is only 8 instead of 10 passes.

That's the stats, if you want to know more, read on...

By the way: a 'bike' is a bicycle, *not* a motorbike ! :-)



Day 1: FRI July 10th 2009


I took the 9am flight from Zurich and arrived in Nice at 10:00am. My biggest concern was to assemble the bike and get out of the airport as soon as possible because it must be busy this time of the season (in France, vacation time started July 3rd).

However, I was pleasantly surprised when I found that NCE even had a bike assembly station inside the airport, with a stand and tools. Who would have thought ?

As you can see, I only had 2-3 kilos of baggage with me, attached to the saddle (no back pack).

Alors, I assembled my bike, passed customs and took off. First, the ride was along the shore (at 0 meters elevation), then through Nice, with a bit too much traffic and off I went into the mountains.

Once I found D19 towards Tourrette-Levens, traffic eased and the long but steady climb began. The ride was mostly through wooded areas, with lots of ups and downs and curves. Almost no traffic anymore, only other bikers.

After the Col de St. Martin (1500m), I spent the night at La Bolline, where I had booked a hotel. Very small place, but the good thing is it's quite high up so no air conditioning was needed. As a matter of fact, I spent all nights at altitudes over 1000m, so it was not too hot and not too cold. Just perfect !

The only thing that wasn't perfect was that the French (at least in the South) start their repas (dinner) at 7:30pm, so when I arrived (usually very hungry), I still had to wait for a few hours !

I 'fixed' this by having a late and long lunch (called dejeuner in France), so I would survive until 7:30pm...


Day 2: SAT July 11th 2009

Big day: today I wanted to ride 2 passes, the Col de Bonnette and the Col de Vars.
But first things first. In the morning, there was a nice downhill from La Bolline to the junction with D2205.



From there on, the climb to the Bonnette started. In the picture, coming down from Valdeblore, I took a right and started my ascent to the Bonnette. This point is at ca. 500m, and the Bonnette at 2802m, so a long climb of 2300m !

None of the passes I did are very steep, but the climbs are very long, at maybe 5-8%, and that wears you out, too ! I prefer steep climbs and long downhill rides :-)

The Col De Bonnette is the highest pass in Europe, but only because of a trick: some resourceful people (probably from the tourist office) added a loop around the top of Bonnette in the 60's, which added a few tens of meters, so Bonnette would surpass the Col de L'Iseran !

In the picture, one can see the loop starting and going around the top clock-wise.

In Jausiers, I unfortunately had a big ham and cheese toast, with a few cokes and beers and - as the experienced athletes among you will know - this was somewhat detrimental to my effort to climb the Col de Vars ! Only 800m to climb, but I had to walk my bike and push because of (a) my stomach and (b) cramps.

Note to self: tonight have loads of salt to avoid the cramps (binds the water in the body) and next time have spaghetti or something with carbs (and salt) for lunch !

Anyway, I made it to the top (even biking the last 1.5kms) and after a nice downhill spent the night at Vars (a mountain resort).


Day 3: SUN July 12th 2009

With 3 passes under my belt and 7 to go, I had a nice downhill ride (that's the advantage of spending the night halfway up the mountain, you always have a downhill the next day!) to Guillestre. From here, the most beautiful pass, the Col de L'Izoard, started. In the picture below, you can see Brunissard, looking back.

The road climbed nicely through a dense forest and later passed the famous Casse Desert (Broken Desert), which looks like a piece of the moon right before the summit of the Izoard pass.

At the top of the Izoard, there was a concession stand, offering drinks and souvenirs, and there were many motorbikes and bikes (and cars, too). However, the downhill was fantastic: roads in great shape and winding curves, excellent to cruise down.

The next pass (#5) was the Montgenevre, starting from Briancon. At only 600 meters of climbing, it would have been a nice pass, but the traffic was overwhelming. Maybe because it was Sunday, everybody (and their grandmothers) was on their motorbikes. And sometimes they just love to accelerate when passing bikers (the real ones :-))...

At least the hotel at the top was nice (Le Chalet Blanc), had a nice TV and so I could watch the Tour De France for that day.


Day 4: MON July 13th 2009


The day started with a nice downhill to Cesana Torinese, and then on to Susa, Italy. But that was it in terms of niceness: the next climb up Mont Cenis was hard, because it started from 500m and went all the way up to 2100m, for a climb of 1600m.

In addition, once I could see the top (a bunch of hotels) through the fog and got closer, I found out that there was a dam behind them: another 200m of climbing. Then I found out that the road didn't go alongside the lake but climbed another hill, adding yet another 200m !

At the top, it was very cold and so I just put on a long wind breaker and rode the downhill into Lanslevillard where I had lunch (no ham/cheese toast though !).

I decided to ride a little further to Bessans and took a hotel there.


Day 5: TUE July 14th 2009

Pass #7 was the Col de L'Iseran (2770m), which is the second highest mountain pass in Europe, surpassed only by the Bonnette (albeit through a trick, as mentioned above). However, I started at 1677m, so the actual climbing was only 1100m, much better than the 2300m for the Bonnette.
As one can see, I took the obligatory picture of the sign at the top of the pass, to prove that I was there :-)
Well, actually, I'm not in the picture, so... hmmm :-)

The downhill ride was very pleasant, through Val d'Isere, which is a famous winter sports resort and down to Montvalezan.

The Col de L'Iseran was pass # 7.

From here, I took a shortcut to the Petit St. Bernard, although it had a steep climb of up to 16%. The ride up the St. Bernard would have been nice because it features a mild climb of mostly 5% or less, but the wind made it hard to reach the old monastery at the top.


Here's the picture of me at the top.

The weather forecast had rain and storms for the evening, so I didn't stay too long at the top and made my way down to La Thuile, Italy (the Pt. St. Bernard is the border between France and Italy) and took a hotel.

Petit St. Bernard was pass #8.


Day 6: WED July 15th 2009


Unfortunately, it rained the whole night and the forecast for where I wanted to go (Gr. St. Bernard and Furka (Valais) in Switzerland) was very bad (rain and thunderstorms), so I decided to finish this tour by heading west.

When the rain stopped, I rode down to Pre-Saint-Didier (Italy) and up the hill to Courmayeur (Italy). There I took a bus through the Mont Blanc tunnel into Chamonix (France again).

From Chamonix, I rode more or less along the A40 back to Geneva (Switzerland), where I took the train back home. The A40 is partially a 2-lane highway, but due to the lack of other roads (very narrow valley), they allow bikes to use it. Hmm, not very pleasant to be passed by heavy trucks, with you going at 65km/h and the trucks at 100km/h...

Anyway, this was a great tour: friendly people everywhere, I practiced my French, and the scenery was beautiful. I might just do it again some day ! Next time though, maybe with an iPhone (or even better, an Android HTC phone !) and lots of plan B's: in some places there was no public transportation (besides the bi-weekly bus :-)) and I would have been stuck had it started to rain...

If someone is interested in the exact tour, I have it on bikemap.net, let me know.
Cheers,
Bela

Tuesday, July 07, 2009

Tour de France

From July 10th - 18th, I'll be on my own Tour De France: from Nice (France) back to my home town of Kreuzlingen (Switzerland).

I'm flying to Nice this Friday (July 10th), and will then bike (= bicycle) back, over some of the highest passes in Europe, e.g. Restefond-Bonnette (2802m, the highest pass in Europe), Iseran (the 2nd highest), Izoard, Cenis, Vars, Pt. St. Bernhard, Gr. St. Bernhard, Furka, etc...

Google says this tour is ca 1000 kilometers, and my guess is this will be over 20'000m of climbing (if anyone knows how to measure the total climbing in Google Earth, please let me know !).

Last year I biked a similar tour, from Graz (Austria) back home, but that tour was shorter (850km) and the mountains lower...

I hope (!) to be back on the 20th :-)

Answering questions, bug fixing and all such stuff will be suspended during this time !
Cheers,

Friday, June 19, 2009

Jazoon and JBossWorld

I'll be speaking at the Jazoon (in Zurich, June 24th, next week) and JBossWorld (in Chicago, Sept 2) conferences. At Jazoon, I'll talk about a memcached implementation in Java, at JBossWorld I'll talk about large JBoss clusters.

Hope to see some of you there !

Wednesday, June 17, 2009

Shunning has been shunned

I finally completed the MERGE4 functionality, which now handles asymmetrical merges and greatly improves the usefulness of JGroups in mobile networks. I've blogged about this earlier this year.

The new merge functionality also allowed me to trash shunning, which is great, because I've always had problems explaining the difference between shunning and merging. Merging would usually be needed when we had real network partitions, whereas shunning would be needed when only a single member was expelled from the group (e.g. because it failed to respond to heartbeats, but hasn't really crashed).

However, with FD_ALL, there could be a scenario where everybody shunned everybody else (a shun-fest :-)), and so all the cluster nodes would leave and re-join the cluster, possibly even multiple times. Clearly not a desirable scenario, even though it didn't lead to incorrect results !

The new model is now much simpler: we have members join, leave and merge. The latter happens on a network partition, for example. In the old model, when a member was unresponsive, it was shunned and subsequently rejoined. In the new model, there's simply going to be a merge between the group which found that member unresponsive and the (now newly responsive) member.

Since I also improved merging speed and correctness (wrt concurrent merges), I suggest downloading 2.8.beta2 (which I'll upload to SourceForge shortly) and giving it a try.

One thing that I'll have to talk about (in my next post) is what to do with merging. For example, if we have shared state and it diverged during a network partition, how can the application make sure that the merge doesn't cause inconsistent state ?

More on this later, enjoy,

Wednesday, May 20, 2009

Testing a protocol in isolation

Protocols are the most important feature of JGroups. They provide the actual functionality in any stack, such as retransmission of lost messages, ordering, flow control, state transfer, fragmentation and so on.

While we do have a sizeable number of unit tests in JGroups, we don't have many tests which test just 1 protocol in isolation.

Take a look at GMS_MergeTest (CVS head); that's how we need to test protocols in the future. GMS_MergeTest tests (concurrent) merging, and only concurrent merging.

The following features are in this test:

  • Injection of views: we inject the subpartitioned views directly into the cluster, rather than waiting for a failure detector to kick in. Remember, we test merging and NOT failure detection. The good thing is that (a) this is much faster and (b) we can really focus on 1 protocol/feature rather than multiple ones
  • Injection of MERGE events: rather than relying on MERGE2 to (after some time) generate a MERGE event, we directly inject MERGE events into the merge leader(s). Same as above: this is much faster and we don't test MERGE2, but GMS/CoordGmsImpl
  • Temporary enabling/disabling of logging for GMS: injectMerge() enables TRACE logging for GMS, so we can see what's going on *selectively*, and don't get a huge TRACE log for the stuff that's going on before
  • SHARED_LOOPBACK: well, I might as well use UDP, but SHARED_LOOPBACK never loses any messages, therefore no retransmissions. Again, the focus is on testing merging, and not merging with packet loss
  • No failure detection, no merge protocol in the stack: this allows us to set a breakpoint in a debugger and step around for as long as we want to, without FD suspecting and excluding us

The use of these features allows us to really focus on testing one feature in isolation. We need to go further in this direction; therefore, if you have existing tests you want to modify, go ahead ! New tests should follow this paradigm whenever possible.

Monday, May 04, 2009

First alpha of 2.8

I've uploaded a first alpha of 2.8 to SourceForge [1].

The major new features are logical addresses, improved support for asymmetrical merges, GossipRouter changes, and a new shared dir based discovery protocol.

Despite its name, alpha3 is already very stable, and I wanted to release it so I can get some useful feedback to go into beta1. There are still ca 20 JIRA issues open, but I expect to close them and release the GA by summer '09.

The remaining tasks are hardening of the merging code (e.g. to better handle concurrent merges), dropping commons-logging (so we can ship jgroups.jar without any other JAR dependency !), testing asymmetric merges some more and writing documentation for these new features.

I want to thank Vladimir and Richard for their hard work on 2.8 !

Enjoy !

[1] https://sourceforge.net/project/showfiles.php?group_id=6081&package_id=94868#

Friday, April 24, 2009

FILE_PING: new discovery protocol based on shared storage

I've just created a first version of FILE_PING in 2.6 and 2.8. This is a new discovery protocol which uses a shared directory into which all nodes of a cluster write their addresses.

New nodes can read the contents of that directory and then send their discovery requests to all nodes found in the dir.

When a node leaves, it'll remove its address from the directory again.
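
To illustrate the mechanism (just a sketch, not the actual FILE_PING code; the file naming and content format are made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SharedDirDiscovery {
    private final Path dir;          // the shared directory, e.g. on NFS or SMB
    private final String name;       // this node's logical name, e.g. "node-4"
    private final String address;    // this node's address, e.g. "192.168.1.5:7800"

    public SharedDirDiscovery(Path dir, String name, String address) {
        this.dir=dir; this.name=name; this.address=address;
    }

    public void register() throws IOException {          // called when joining the cluster
        Files.createDirectories(dir);
        Files.write(dir.resolve(name + ".node"), address.getBytes(StandardCharsets.UTF_8));
    }

    public List<String> discover() throws IOException {  // read all entries to find the other members
        List<String> members=new ArrayList<String>();
        try(DirectoryStream<Path> files=Files.newDirectoryStream(dir, "*.node")) {
            for(Path f: files)
                members.add(new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
        }
        return members;
    }

    public void unregister() throws IOException {         // called when leaving the cluster
        Files.deleteIfExists(dir.resolve(name + ".node"));
    }
}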

When would someone use FILE_PING, e.g. over TCPGOSSIP and GossipRouter ?

When IP multicasting is not enabled, or cannot be used for other reasons, we have to resort to either TCPPING, which lists nodes statically in the config, or TCPGOSSIP, which retrieves initial membership information from external process(es), the GossipRouter(s).

The latter solution is a bit cumbersome since an additional process has to be maintained.

FILE_PING is a simple solution to replace GossipRouter, so we don't have to maintain that external process.

However, note that performance will most likely not be better: a shared directory e.g. on NFS or SMB requires a round trip for a read or write, too. So if we have 10 nodes which wrote their information to file, then we have to make 10 round trips via SMB to fetch that information, compared to 1 round trip to the GossipRouter(s) !

So FILE_PING is an option for developers who prefer to take the perf hit (maybe in the order of a few additional milliseconds per discovery phase) over having to maintain an external GossipRouter process.

FILE_PING is part of 2.6.10, which will be released early next week, or it can be downloaded from CVS (2.6 branch) or here. In the latter case, place FILE_PING.java into the src/org/jgroups/protocols directory and execute the 'jar' target in the build.xml Ant script of your JGroups src distro.

Wednesday, April 01, 2009

Those damn edge cases !

While JGroups is over 10 years old and very mature, sometimes I still run into cases that aren't handled. While the average user won't run into edge cases because we test the normal cases very well, if you do run into one, in the best case, you have 'undefined' behavior (whatever that means !), in the worst case, you're hosed.

Here's one.

The other week I was at a (European) army, for a week of JGroups consulting. They have a system which runs JGroups nodes over flappy links, radio and satellite networks. Sometimes, a link between 2 nodes A and B can even turn asymmetric, meaning A can send to B, but B not to A !

It turns out that they have a lot of partitions (e.g. when a satellite link goes down), followed by subsequent remerging when the link is restored. Sometimes, members would not be able to communicate with each other after the merge.

This was caused by an edge case in UNICAST which doesn't handle overlapping partitions.

A non-overlapping partition is a partition where a cluster of {A,B,C,D} falls apart into 2 (or more) subclusters of {A,B} and {C,D}, or {A}, {B}, {C}, {D}. The latter case can easily be reproduced when you kill a switch connecting the 4 nodes.

An overlapping partition is when the cluster falls apart into subclusters that overlap, e.g. {A,B,C} and {C,D}. This can happen with asymmetrical links (which never happens with a regular switch !), or with FD when many nodes are killed at the same time and a merge occurs before all dead nodes have been removed from the cluster.

If this sounds obscure, it actually is !

But anyway, here's what happens at the UNICAST level.

Warning: rough road ahead...

UNICAST keeps state for each connection. E.g. if A sends a unicast message to B, A maintains the last sequence number (seqno) sent to B (e.g. #25) and B maintains the highest seqno received from A (#25). The same holds for messages from B to A; let's say B's last message to A was #7.

Now we have a network partition, which creates a new view {A,B} at A and {B} at B. So, in other words, B unilaterally excluded A from its view, but A didn't exclude B. The reason is that A can communicate with B, but B cannot communicate with A.

Now, you might ask, wait a minute ! If A can communicate with B, why can't B communicate with A ?

This doesn't happen with switches, but here we're talking about separate up and down links over radios, and if a radio up-link goes down, that just means we cannot send, but still receive (through the down-link) !

Let's now look at what happens:

When B receives the new view {B}, it removes the entry for A from its connection table. It therefore loses the memory that its last message to A was #7.

On the other side, A doesn't remove its connection entry for B, which is still at #25.

When the partition heals and a merge ensues, A sends a message to B. The message's seqno is #25, the next message to B will be #26 and so on.

On the receiver side, B creates a new connection table entry for A with seqno #1. When A#25 and A#26 are received, they're stored in the table, but not passed up to the application because we expect messages #1-#24 from A first.

This is terrible because A will never send messages #1-#24 ! Because B will simply store all messages from A, it will run out of memory at some point, unless there's another view excluding A !

Doom also lurks in the reverse direction: when B sends #1 to A, A expects #7 and therefore discards messages #1-#6 from B !

This is bad and caused me to enhance the design for UNICAST. The new design includes connection IDs, so we'll reset an old connection when a new connection ID is received, and receivers asking senders for the initial seqnos if they have no entry for a given sender.
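
To make this concrete, here's a rough sketch of receiver-side state keyed by a connection ID (names and structure are mine, not the actual UNICAST code; retransmission and buffering are omitted):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReceiverConnections {
    static class Entry {
        final short connId;        // identifies one incarnation of the sender->receiver connection
        long highestDelivered;     // highest seqno delivered so far on this connection
        Entry(short connId) {this.connId=connId;}
    }

    private final Map<String,Entry> connections=new ConcurrentHashMap<String,Entry>();

    // called for every received unicast message carrying (sender, connId, seqno)
    public void receive(String sender, short connId, long seqno) {
        Entry e=connections.get(sender);
        if(e == null || e.connId != connId) {
            // unknown sender or new connection ID: drop the stale state and start over,
            // instead of waiting forever for messages that will never be sent
            e=new Entry(connId);
            connections.put(sender, e);
        }
        if(seqno == e.highestDelivered + 1)
            e.highestDelivered=seqno;   // in-order delivery; out-of-order handling omitted
    }
}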

This change will not affect current users, running systems which are connected via switches/VLANs etc. But it will remove a showstopper for running JGroups in rough environments like the one described above.

The design for the new enhanced UNICAST protocol can be found at [1].

[1] http://javagroups.cvs.sourceforge.net/viewvc/javagroups/JGroups/doc/design/UNICAST.new.txt?revision=1.1.2.5&view=markup&pathrev=Branch_JGroups_2_6_MERGE

Monday, March 16, 2009

Status update and training

Here's a quick status update.

Logical addresses are almost done and work with the UDP and TCP transports. I have yet to support TUNNEL (GossipRouter), but because Vladimir made a fair amount of changes to GR/TUNNEL on CVS head, I'll have to merge the logical addresses branch back to head first, before I can apply logical addresses to CVS head.

Because I have some other important changes in 2.8 (e.g. anycasting support), I decided to curtail the scope of 2.8 a bit and move some stuff to the newly created 2.9. For example, currently there can only be 1 physical address associated with 1 logical address (UUID). Multiple physical addresses will be supported in 2.9.

Also, canonicalization of UUIDs into shorts was pushed into 2.9. This is largely an optimization.

Speaking of optimizations, performance of the branch looks promising ! Although we now ship both dest and src addresses with a Message, which makes the serialized message a bit bigger (for IPv4, *not* IPv6 !), UUIDs take up less space in memory and thus I got a throughput increase from 105MBytes/sec to 113MBytes/sec ! These are only preliminary results, and I have yet to run a full perf test.

I'll probably merge the branch back to head and then make the necessary changes to TUNNEL/GR this week. Then we might release an alpha, for folks to try out logical addresses.

The thing is that I'll be traveling a bit for the next couple of weeks: next week I'm near Amsterdam and April 6-9 I'll be in Munich, teaching the JBoss Clustering course (JB439). If you want to meet over a beer, or even join the course, drop me an email !
Cheers,

Monday, February 16, 2009

What's cool about logical addresses ?

Finally, logical addresses (https://jira.jboss.org/jira/browse/JGRP-129) will get implemented (in 2.8) !

For those of you who've used JGroups, you'll know that the identity of a node was always its IP address and the port on which the receiver thread was listening, e.g. 192.168.1.5:7800.

While this gives you a relatively compact and readable address (you can deduce from the address on which host it resides), there's also a problem: this type of address is not unique over space and time.

Let's look at an example.

Say we have a cluster of {A,B,C}. C's address is 192.168.1.5:7800. Let's assume A has sent 25 messages to C and C has multicast 104 messages. We're using sequence numbers (seqnos) to order messages, attached to a message via a header.

So the next message that C will multicast is #105 and the next message it expects from A is #26.

This is state that is maintained by the respective protocols in a JGroups stack.

Now let's assume C is killed and restarted. Or C is shunned, therefore leaves the channel and then automatically (if configured) reconnects. Let's also assume that the failure detection protocol has not yet kicked in and therefore A and B will not have received a view {A,B} which excludes C.

Now C rejoins the cluster. Because this is a reincarnation of C, it creates a new protocol stack, and all the state mentioned above is gone. The reincarnated C now sends #1 as next seqno and expects #1 from A as well.

There are 2 things that happen now:
  1. When C multicasts its next message with seqno #1, both A and B will drop it. A drops it because it expects C's next message to be #105, not #1. As a matter of fact A will drop the first 104 messages from C !
  2. A multicasts a message with seqno #26. However, C expects #1 from A and therefore buffers message #26. As a matter of fact, C will buffer all messages from A until it receives #1 which will not happen ! Consequence: C will run out of memory at some point. Even worse: C will prevent stability messages from purging messages seen by all cluster nodes, so in the worst case, all cluster nodes will run out of memory !
OK, while this is somewhat of an edge case and can be remedied by (a) waiting some time before restarting a node and/or (b) not pinning down ports, the fact is still that when this happens, it wreaks havoc.

So how are logical addresses going to change this ?

A logical address consists of
  • an org.jgroups.util.UUID (copied from java.util.UUID and relieved of some useless fields) and
  • a logical name
The logical name is given to a channel when the channel is created, e.g.
JChannel channel=new JChannel("node-4", "/home/bela/udp.xml");

This means that the channel's address will always get printed as "node-4". Under the covers, however, we use a UUID (for equals() and hashCode()), which is unique over space and time. The UUID is recreated on channel connect, so the above reincarnation issue will not happen.

The logical name is syntactic sugar, because if we have views consisting of UUIDs (16 bytes), that's not a pretty sight, so views like {"node-1", "node-2", "node-3", "node-4"} look much better.

Note that the user will be able to pick whether to see UUIDs or logical names.

Also, if null is passed as logical name, JGroups will create a logical name (e.g. using the host name and a counter).

A UUID will get mapped to one or more physical addresses. The mapping is maintained by the transport and there will be an ARP-like protocol (handled by Discovery) to fetch the initial mappings, and to fetch a mapping if not available.
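
Conceptually, the transport then maintains something like the cache below (a sketch with stand-in JDK types; the real implementation uses org.jgroups.util.UUID and its own physical address type):

import java.net.InetSocketAddress;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class LogicalAddressCache {
    private final ConcurrentMap<UUID,InetSocketAddress> cache=new ConcurrentHashMap<UUID,InetSocketAddress>();

    public void add(UUID logical, InetSocketAddress physical) {
        cache.put(logical, physical);    // learned via discovery responses
    }

    public InetSocketAddress get(UUID logical) {
        return cache.get(logical);       // a null result would trigger a WHO-HAS (ARP-like) request
    }

    public void remove(UUID logical) {
        cache.remove(logical);           // e.g. when the member leaves the cluster
    }
}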

The detailed design is described in http://javagroups.cvs.sourceforge.net/viewvc/javagroups/JGroups/doc/design/LogicalAddresses.txt?revision=1.12&view=markup.

So the most important aspect of logical addresses is that they decouple the identity of a JGroups node from its physical address.

This opens up interesting possibilities.

We might for example associate multiple physical addresses with a UUID, and load balance over the physical addresses. We could open multiple sockets, and associate each (receiver) socket's physical address with the UUID. We could even change this at runtime: e.g. if a NIC is down and we get exceptions on the socket, simply create another socket, remove the old association across the cluster (there's a call for this) and associate the new physical address with the UUID.

Another possibility is to implement NATting, firewalling or STUNning this way !

I'll probably make the picking of a physical address given a UUID pluggable, so developers can even provide their own address translation in the transport !

This change is overdue and I'm happy that work has finally started on this. If you want to follow this, the branch is Branch_JGroups_2_8_LogicalAddresses.

Wednesday, January 21, 2009

ReplCache: storing your data in the cloud with variable replication

Some time ago, I wrote a prototype of a cache which distributes its elements (key-value pairs) across all cluster nodes. This is done by computing the consistent hash of a key K and picking a cluster node based on the hash mod N where N is the cluster size. So any given element will only ever be stored once in the cluster.

This is great because it maximizes use of the aggregated memory of the 'cloud' (a.k.a. all cluster nodes). For example, if we have 10 nodes, and each node has 1 GB of memory, then the aggregated (cloud) memory is 10 GB. This is similar to a logical volume manager (e.g. LVM in Linux), where we 'see' a virtual volume, the size of which can grow or shrink, and which hides the mapping to physical disks.

So, if we pick a good consistent hash algorithm, then for 1'000 elements in a cluster of 10 nodes, each node stores on average 100 elements. A good consistent hash also ensures that rehashing on view changes is minimal.
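
As a toy illustration of the mapping (plain hashing rather than a real consistent hash, and not the actual PartitionedHashMap code):

import java.util.List;

public class SimpleKeyMapper {
    // pick the node that stores a given key: hash(key) mod cluster size.
    // A real consistent hash would also minimize rehashing when the view changes.
    public static <A> A pickNode(Object key, List<A> members) {
        int index=(key.hashCode() & 0x7fffffff) % members.size();
        return members.get(index);
    }
}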

Now, the question is what we do when a node crashes. All elements stored by that node are gone, and have to be re-read from somewhere, usually a database.

To provide highly available data and minimize access to the database, a common technique is to replicate data. For example, if we replicate K to all 10 nodes, then we can tolerate 9 nodes going down and will still have K available.

However, this comes at a cost: if everyone replicates all of its elements to all cluster nodes, then we can effectively only use 1/N of the 'cloud memory' (10 GB), which is 1 GB... So we trade access to the large cloud memory for availability.

This is like RAID: if we have 2 disks of 500 GB each, then we can use them as RAID 0 or JBOD (Just a Bunch of Disks) and have 1 TB available for our data. If one of the disks crashes, we lose data that resides on that disk. If we happen to have a file F with 10 blocks, and 5 were stored on the crashed disk, then F is gone.

If we use RAID 1, then the contents of disk-1 are mirrored onto disk-2 and vice versa. This is great, because we can now lose 1 disk and still have all of our data available. However, we now have only 500 GB of disk space available for our data !

Enter ReplCache. This is a prototype I've been working on for the last 2 weeks.

ReplCache allows for variable replication, which means we can tell it on a put(key, value, K) how many copies (replication count) of that element should be stored in the cloud. A replication count K can be:
  • K == 1: the element is stored only once. This is the same as what PartitionedHashMap does
  • K == -1: the element is stored on all nodes in the cluster
  • K > 1: the element is stored on K nodes only. ReplCache makes sure to always have K instances of an element available, and if K drops because a node leaves or crashes, ReplCache might copy or move the element to bring K back up to the original value
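
A hedged usage sketch (package and method signatures are assumed for illustration; check the ReplCache javadoc for the real API):

import org.jgroups.blocks.ReplCache;   // package assumed

public class ReplCacheExample {
    public static void main(String[] args) throws Exception {
        ReplCache<String,String> cache=new ReplCache<String,String>("/home/bela/udp.xml", "replcache-cluster");
        cache.start();
        cache.put("id", "322649", (short)1, 0);        // K == 1: stored on exactly one node
        cache.put("two", "Bela", (short)2, 0);         // K == 2: kept on 2 nodes
        cache.put("everywhere", "x", (short)-1, 0);    // K == -1: replicated to all nodes
        System.out.println(cache.get("two"));
        cache.stop();
    }
}
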
So why is ReplCache better than PartitionedHashMap ?

ReplCache is a superset of PartitionedHashMap, which means it can be used as a PartitionedHashMap: just use K == 1 for all elements to be inserted !

The more important feature, however, is that ReplCache can use more of the available cloud memory and that it allows a user to define availability as a quality of service per data element ! Data that can be re-read from the DB can be stored with K == 1. Data that should be highly available should use K == -1, and data which should be more or less highly available, but can still be read from the DB (but maybe that's costly), should be stored with K > 1.

Compare this to RAID: once we've configured RAID 1, then all data written to disk-1 will always be mirrored to disk-2, even data that could be trashed on a crash, for example data in /tmp.

With ReplCache, the user (who knows his/her data best) takes control and defines QoS for each element !

Below is a screenshot of 2 ReplCache instances (started with java org.jgroups.demos.ReplCacheDemo -props /home/bela/udp.xml) which shows that we've added some data:


It shows that both instances have key "everywhere", because it is replicated to all cluster nodes due to K == -1. The same goes for key "two": because K == 2, it is stored on both instances, as we only have 2 cluster nodes.
There are 2 keys with K == 1: "id" and "name". Both are stored on instance 2, but that's coincidence. For K keys and N cluster nodes, every node should store approximately K/N keys.

ReplCache is experimental, and serves as a prototype to play with data partitioning/striping for JBossCache.
ReplCache is in the JGroups CVS (head) and the code can be downloaded here. To run the demo, execute:
java -jar replcachedemo.jar

For the technical details, the design is here.

There is a nice 5 minute demo at http://www.jgroups.org/demos.html.

Feedback is appreciated, use the JGroups mailing lists !

Enjoy !

Monday, January 05, 2009

JGroups 2.7 released

Finally, after almost a year of development, I released 2.7.0.GA this morning. It can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=6081&package_id=94868&release_id=651542.

Although 2.7 has 211 JIRA issues (bugfixes, tasks or features), most of the bug fixes have been back-ported to the 2.6 branch. Why ? Because 2.6.7 is the version that ships with JBoss 5, and we made sure JGroups works optimally with it.

So what's new ?

There are almost no new features ! (Can you tell I'm not a marketing person ? :-))

Most work (besides bug fixes) went into refactoring, e.g. we converted our test suite from JUnit to TestNG, allowing for parallel test execution and thus reduced our testing time from 2.5 hours to 15 minutes !

Another change was that all properties are now set using JSR 175 annotations, so we could remove a lot of boilerplate code from protocol implementations. In my opinion, the more code I can remove (without impacting functionality), the better !

Using annotations for properties also allows us to automatically generate documentation for the properties of all protocols.

I also marked unsupported or experimental classes/methods with @Unsupported or @Experimental annotations.

We were able to increase performance a bit, compared to 2.6, but 2.6 is already quite fast, so unless you need those additional 5-10%, go for 2.6.7.

In a nutshell, 2.7 serves as the groundwork for 2.8, which will have many new features.