Musings of Dave: May 2013

There's a myth I want to dispel. It's a myth that I've had peddled at me by people working at major utilities who should know better and by companies selling scada products. "Telemetry", they say, "is the same as SCADA".

This comes with the caveat that it needs to be a proper "enterprise-class" scada system of course, but basically they believe that a system designed to operate a single plant is going to work as a telemetry system. It won't. Don't believe them when they tell you this. Not only is it wrong, but this myth is dangerous – I've seen two (actually, I think I've seen three) utility companies plump for "Enterprise class scada" when they wanted to buy a telemetry system, and it didn't work out well for them.

If you're looking to procure a telemetry system, and someone's pushing a scada - then read the following and ask some pertinent questions - it might save you some serious heartache.

Knowledge that it's multiple assets
The first point I wanted to highlight is that telemetry systems, by their very nature, know they are monitoring multiple assets. Often thousands of separate sites are monitored, and telemetry systems are happy with that.

However, scada systems seem to find this concept tricky to grasp. They would like to think they are controlling one very large plant - and though that's a valid paradigm for operation of your network of assets, it's not a paradigm that is applicable in all circumstances. There are just too many assets to always think of them as being a single big plant, and telemetry systems know that and let you break it down into bite-size pieces.

Multiple assets means you need the ability to group assets together in some way (e.g. geographically or functionally), and have the same asset appear in multiple groups ("all sites in the South region"; "all sites of type x"). It means reports, alarms, mimics and so on all need to be groupable and searchable in a sensible manner because you have thousands of them to worry about. Telemetry systems do this well and have it baked in to the way they work - and that just does not seem to be the case with scada systems.
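To make the grouping point a bit more concrete, here's a minimal sketch in Python of assets that sit in several groups at once. The class, the function, the tag format and the site names are all made up for illustration - they aren't taken from any real telemetry product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    groups: frozenset = frozenset()  # an asset can be in any number of groups


def sites_in(assets, *wanted):
    """Return the names of assets that belong to every group in `wanted`."""
    needed = set(wanted)
    return sorted(a.name for a in assets if needed <= a.groups)


# Hypothetical sites, tagged both geographically and functionally.
ASSETS = [
    Asset("Hilltop Reservoir", frozenset({"region:south", "type:reservoir"})),
    Asset("Mill Lane Pumping Station", frozenset({"region:south", "type:pumping"})),
    Asset("Northgate Pumping Station", frozenset({"region:north", "type:pumping"})),
]

south = sites_in(ASSETS, "region:south")
south_pumps = sites_in(ASSETS, "region:south", "type:pumping")
```

The same site turns up under "all sites in the South region" and under "all sites of type x" without being configured twice, which is the behaviour you'd want reports, alarms and mimics to share.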

I have to say that typically Telemetry systems are quite bad at the "operate your network as a single large plant" thing - and that's something they should get better at. Having said that, operating something as complex as a water distribution network or an electricity network as a single large plant is probably outside the domain of the rule-based kind of control typically available and needs something more specialist – think modelling engines and statistical control.

Things change
Seems obvious, but scada systems seem to hate it: things change.

By that I mean that the configuration of the monitored sites and the points on them are in constant flux. Every day something somewhere will change - some asset being telemetered will be added, modified or taken away and the telemetry has to be updated with the new configuration for that site.

There's plenty that can be written about how best to do configuration management, but the nub of the issue here is that generally scada systems don't like you making changes to the configuration of the running scada – it can be poorly supported, perform badly and will be a poorly exercised piece of functionality that is more likely to be buggy than the rest of the system.

Telemetry systems, on the other hand, know full well they are dealing with points across multiple assets and expect the config to change all the time and don't blink an eye when you kill off 1000 points and a whole site.

Devolving the intelligence to the outstation
Telemetry loves its outstations. The dubious nature of wide-area communications links and the sheer number of sites and monitored points has meant that historically we've deployed outstations to the field and left the comms switched.

That means that the outstation recognises when something has happened that the central system needs to know about and calls in. The outstation takes on the responsibility of logging changes to monitored points, and uploads logged data at the end of the day.
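As a rough sketch of that behaviour (in Python, and very much a toy): the outstation below logs every reading locally, calls in straight away when a point goes into alarm, and otherwise saves everything for an end-of-day upload. The simple high-limit alarm and the call_in callback are assumptions for illustration, not how any particular outstation works.

```python
class Outstation:
    """Toy outstation: logs locally, calls in only on alarm or at end of day."""

    def __init__(self, high_limit, call_in):
        self.high_limit = high_limit  # assumed: a simple high-limit alarm
        self.call_in = call_in        # stands in for dialling the central system
        self.log = []                 # readings held locally until uploaded
        self.in_alarm = False

    def sample(self, timestamp, value):
        self.log.append((timestamp, value))
        now_in_alarm = value > self.high_limit
        if now_in_alarm and not self.in_alarm:
            # Something the centre needs to know about: call in now.
            self.call_in("alarm", [(timestamp, value)])
        self.in_alarm = now_in_alarm

    def end_of_day(self):
        # Routine upload of everything logged, then clear the local log.
        self.call_in("daily_upload", list(self.log))
        self.log.clear()


calls = []
rtu = Outstation(high_limit=5.0, call_in=lambda kind, data: calls.append((kind, data)))
for t, level in enumerate([3.2, 4.1, 5.6, 5.8, 4.0]):
    rtu.sample(t, level)
rtu.end_of_day()
```

Five readings, but only two calls to the centre: one when the level first crosses the limit and one for the daily upload.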

Hence, our central system needs to understand that it is connected to an outstation that is going to do some of the thinking for it.

This seems really hard for a scada to understand: the scada wants to be connected to the instrument (or at least the PLC the instrument lands on); and the scada wants to be the one that logs the data and spots the alarm condition. My experience is that scada systems hate the devolution of intelligence to an outstation and really struggle to work in that mode.

There is an argument - I think it's a bogus one - that the outstation is no longer required, and the central scada really can be connected to all the instrumentation just like it is at site. Sure: your comms is perfect - you don’t really need to log that critical regulatory data at the site because you can always get the data back to the centre. I doubt it: even as high-speed, always-on broadband and wireless connections roll out, simply expecting perfect comms for all assets all of the time, regardless of class or geographic location, is just naive.

So the outstation is staying - or at least that local intelligence is. We might get rid of the outstation on some larger sites where the local scada can do the job of the outstation, but basically something local will be doing the thinking, and the central system needs to understand that and work with it.

Communications
On a related note the wide-area communications topic is a real elephant in the scada/telemetry room. We'd love to think comms is going to be good and will get better, but let's be honest: wide area comms is generally poor. When it comes to wide area comms across thousands of sites, then it is always really poor.

Something will always be down; something will always be slow to respond - it's just a characteristic of having thousands of comms links to remote assets: it will never all be working properly.

Telemetry systems understand this implicitly, but scada systems just don't get it. Making them work against large numbers of properly remote assets always seems to be hard. The communications need to be managed - and from what I've seen, a scada just assumes the comms will be there for it to use without it needing to worry about them.

I've seen real problems caused by a scada system that thought it was connected directly to the instrument and wouldn't give a positive acknowledgement that your operation was successful - it just told you if it failed.

That would have been fine if it was connected directly to the instrument - you’d have got something back pretty much instantly; but that wasn't the case.

Imagine you hit "Start" - and then… nothing! Nothing for ages while in the background a modem will try to connect, fail, back-off, retry, connect, negotiate, authenticate, send the command, and finally get a response. If the operation fails then at least you’ll know about it (finally), but if the operation succeeds the best you can do is wait around until the maximum time you think it is reasonable for the operation to have failed in – and in the absence of the failure notification assume it worked.

This is just awful: it's impossible for the users to really know if their operation worked or not - and when that operation is a command to a piece of critical national infrastructure, that’s just not good enough.
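What I'd want instead can be sketched in a few lines of Python. Each command carries an explicit outcome, and a timeout moves it to "unknown" rather than quietly counting as success. The names and the four states here are my own illustration, not any vendor's API.

```python
from enum import Enum


class Outcome(Enum):
    PENDING = "pending"      # sent, nothing heard back yet
    CONFIRMED = "confirmed"  # positive acknowledgement received
    FAILED = "failed"        # a failure was reported
    UNKNOWN = "unknown"      # timed out: success is NOT assumed


class Command:
    def __init__(self, name, sent_at, timeout_s):
        self.name = name
        self.sent_at = sent_at
        self.timeout_s = timeout_s
        self.outcome = Outcome.PENDING

    def on_reply(self, ok):
        # An explicit reply always wins, even if it arrives after the timeout.
        self.outcome = Outcome.CONFIRMED if ok else Outcome.FAILED

    def check(self, now):
        # Silence past the timeout means "we don't know", never "it worked".
        if self.outcome is Outcome.PENDING and now - self.sent_at >= self.timeout_s:
            self.outcome = Outcome.UNKNOWN
        return self.outcome


cmd = Command("start pump", sent_at=0, timeout_s=120)
early = cmd.check(now=30)    # still waiting on the modem
late = cmd.check(now=120)    # timed out with no reply
cmd.on_reply(ok=True)        # the acknowledgement finally arrives
final = cmd.check(now=130)
```

The point of the UNKNOWN state is that the operator sees the difference between "it worked" and "nobody told me it didn't".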

Alarm management interface
Finally I turn to one of my favourite topics: alarm management. Alarm management in telemetry is just not the same as alarm management in scada. I'm aware that a forthcoming alarm management best practice guide will even include a separate section for telemetry alarms because there are so many differences.

A real telemetry system knows its users are going to be managing alarms from many assets concurrently. A scada system, well, doesn't. Remember, it thinks it’s operating one great big plant. So the operators charged with managing the alarms need to be able to group alarms and define areas of interest to enable them to only see alarms that they are interested in.

What’s more those operators are working 24/7, and potentially from different control centres - so they'll log out and another will log in and take over from where the previous operator was.

That places demands on the alarm management interface that scada systems just don't expect - what? Do dynamic grouping of alarms? Allow redefinition of those groups on-the-fly in the running system? Mark the steps in the alarm resolution life-cycle? Manage handover of an alarm from one user to another? Who the hell needs to do all that? In telemetry world these kinds of functions are vital to successful alarm management and to date, I've not seen a scada that can hack it.

While I'm banging on about this, those alarm handlers are not the people who are running a plant, where an alarm is an interruption to their day job. In telemetry world, the alarm handler's job is to manage alarms - and (generally) not to respond to the alarms themselves, but to hand off the alarms to operations staff who can do something about the problem.

Given these two key differences: alarm situations across multiple assets and the different role of the alarm handler, the whole dynamic of alarm handling in telemetry is different from scada; and scada systems just don't come up to the mark.

That alarm life-cycle after the alarm has been acknowledged is something that the user needs help in managing. I have to say that I think even telemetry systems could do it better, but my big bugbear with scada systems is that once the user has clicked "acknowledge" on the alarm the scada washes its hands of the alarm and figures its job is over. That's just wrong: in telemetry, that's where the job of rectifying the problem starts, and the alarm's life-cycle management needs to reflect that.
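Here's a minimal sketch, in Python, of what a life-cycle that carries on past "acknowledge" might look like, including a shift-change handover. The state names and the allowed transitions are assumptions of mine for illustration - a real system would have its own.

```python
# Assumed life-cycle: acknowledging is a step on the way, not the end.
ALLOWED = {
    "raised": {"acknowledged"},
    "acknowledged": {"handed_off"},   # passed to operations staff
    "handed_off": {"in_progress"},
    "in_progress": {"resolved"},
    "resolved": set(),
}


class Alarm:
    def __init__(self, site, text):
        self.site = site
        self.text = text
        self.state = "raised"
        self.owner = None
        self.history = [("raised", None)]  # audit trail of (state, user)

    def advance(self, new_state, user):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state} to {new_state}")
        self.state = new_state
        self.owner = user
        self.history.append((new_state, user))

    def hand_over(self, new_owner):
        # Shift change: same state, different person responsible.
        self.owner = new_owner
        self.history.append((self.state, new_owner))


alarm = Alarm("Mill Lane Pumping Station", "Wet well level high")
alarm.advance("acknowledged", "alice")
alarm.hand_over("bob")              # alice logs out, bob takes over
alarm.advance("handed_off", "bob")  # bob passes it to operations
```

After "acknowledged" there are still three states to go, and the history shows who was holding the alarm at each step.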

Conclusions
So - some people think a decent enterprise scada will do the job of a telemetry system. Well - I have to admit I don't really see any reason why there should be a difference between them, but the big scada vendors need to get savvy about the telemetry market - where there are demands on the system that don't appear, to date, to have been considered properly.

And if you are looking to procure a new telemetry system, then beware the big scada vendors who'll show you fancy graphics and a big list of customer successes - ask some questions based on the above and make sure they really can deliver. Let me know if you have a success!

I've had these floating around for a while, but figured I'd bash them out here to make them "public". Also I was recently at an IT Strategy conference, and noticed that a Gartner model of IT strategy included at least two of these in the "success evaluation" section. I was proud.

These were inspired by mealy-mouthed project managers that claimed another "success" regardless of the usability of what was delivered - pointing at the Critical Success Factors they'd managed to influence to be the metrics used to gauge success.

So these are Dave's Critical Failure Factors - if you get a "no" on any of these, then I think you failed; and I don't care what other metrics you have that show you succeeded - you didn't...

1) Did you improve the business's capability to operate?
This could be rephrased as "Are my customers happier now than they were?". Can the users of whatever it was you delivered do what they used to be able to do more easily? Can they do new things that they wanted to be able to do, and do them in the way they wanted to do them?

I'd like to carry out a quick surveymonkey survey of the users to gauge whether what was delivered made their lives better - almost a single question: "Is the world a better place now this delivery has happened?" would suffice, but I figure I could bulk it out a bit with a bit of imagination.

2) Did you deliver a supportable, strategic asset?
There are two key words in here: "supportable" and "strategic". They're closely related, which is why this is a single Critical Failure Factor rather than two separate ones - things that are strategic tend to be more supportable. Conversely - if you've delivered something unsupportable, it's probably not strategic either.

"Supportable" means "Did you make life better for the people in support who have to look after the service you delivered?". Does it work? Is it stable? Does it need babying along and continually standing up when it falls over? Does it all work together with everything else, or are there daily tweaks to make it talk to legacy system x? Again, I'd do a quick surveymonkey, asking essentially the same question: "Is the world a better place now this delivery has happened?" - only this time I'd survey the people charged with supporting what you delivered.

Remember, your project is an investment the company is making - the company has been out, convinced the bank that the project will increase the value of the company and borrowed the money to do it. You are delivering a new asset - something that increases the value of the company so much that they can pay back that money they borrowed to do the project.

Those guys in support are not an asset, they are a cost. They cost the company money just to keep the lights on, and your delivery of something that needs more of them or more time from them is a straightforward cost that the business can do without.

"Strategic" means "is it aligned to the strategic roadmap?" - or perhaps better: "Did this delivery take us in the right direction?". You might have a nice strategy, but typically that strategy will be delivered by individual projects, as projects provide the energy and structure necessary to make something happen.

Jim Crookes, chief architect at BT, said "It's a lot easier to tack your way forward than to row into the teeth of the wind..."*. There's a good chance that factors such as money and time will mean that the project does not deliver exactly in line with strategy, but you can deliver something that takes you in the right direction. So - go ask the architects: "Is the world a better place now this delivery has happened?"

3) Did you finish?
This should be the easiest - and to be honest, if what you're doing is delivering a brand new capability on a greenfield site, then it can be. "Did you finish?" means things like "Is it handed over to support?" and "Are the users all trained up and all using the system?". It means "Can the delivery team walk away?".

But nine times out of ten you're not delivering a brand new shiny capability with no old system to migrate people off and decommission - and that's where this, I swear, becomes the hardest Failure Factor to achieve, and the hardest thing to actually get built into a project as PMs are darn scared of this one.

In these, overwhelmingly more common, cases "Did you finish?" means "did you actually migrate all the users off the old system and onto the new one, and (critically) TURN OFF the old system?".

Has the old system gone? Gone properly - software turned off and hardware decommissioned. Not nearly gone - how often do you hear this: "Oh well, yes we've delivered but we've had to leave the old system running for a bit while we ..." <-- FAIL.

You've finished when you've finished - the new system is in, everyone's using it and the thing it was supposed to replace has been properly and totally decommissioned.

Feel free to adapt, adopt - or just ignore if you're a project manager :)

* Jim Crookes, quoted in Enterprise Architecture As Strategy - Ross, Weill & Robertson; Harvard Business Press 2006

Wednesday, 8 May 2013

Telemetry - it ain't just big scada you know...

Thursday, 2 May 2013

Project success - Dave's Critical Failure Factors