Initially Lotus were clueless - their first suggestion was that we had a corruption on the NAB and that we should just push out a new copy. I told them we wouldn't do that unless they could provide a better reason than "we think it might fix it" - it was not a trivial exercise. I wanted to know why it had only just started happening and had not been a problem for over a year.
Then they thought the server might be caching old (wrong) addresses to other servers (have a look at a server doc and look for $Saved.... fields) - but the IP network had not changed.
Our connection documents were using the server cluster names as destinations. On a vague hunch I created server groups and used these in place of the cluster names in the connection docs. The routers seemed happy with this and declined further misbehaviour. So, the router must treat groups and cluster names differently - thought I. No, said Lotus, you have dodgy $Saved addresses in your server docs. What they suggested was to stop all servers, purge all the $Saved fields from the server docs and bring up the servers one at a time so they all got the changed docs and we could manage replication.
Again, I refused as I wanted some form of assurance that this was the problem - but they only said it was something to try and we should try it. We had mail routing quite happily using the server groups, so I was not keen to make a lot of effort to satisfy one of their hunches.
After I sent a shirty mail telling them that I wouldn't try their latest suggestion at our expense, as we had so far tried four other "fixes" with a 100% failure rate, and AGAIN explaining the problem, the issue was passed to someone (who knew more than how to search kbase) very deep in the bowels of the support centre. Note that this was now eight weeks after the problem started happening and six weeks after I put in the fix to the connection docs.
Well, it seems that the server does have some hard limits when using clusters. Cluster.ncf can only retain a fixed number of clusters. If you use clusters as destinations for mail routing and the number of clusters in your network goes over that limit, then bad routing karma comes to visit. Groups are managed differently and are not affected. "Try this magic notes.ini setting and this will fix your problem," came the reply.
The routing tasks now have a wider horizon, able to see further and discern their more distant brethren, providing a merry conduit for communication.
And I get to say "I told you so" to Lotus.