ATC Failure

ubuysa

The BSOD Doctor
I hope this doesn't get deleted because I don't want to get into the details of whose fault this ATC foul-up was, but I am interested in a particular technical detail, and I would like to keep this thread focussed ONLY on this one technical detail please.

There is one aspect that keeps being repeated and I wanted to get other techies' views on it, especially those who make their living from writing code. The oft-repeated cause of this problem is...
Rolfe [Martin Rolfe, NATS CEO] said the chaos was triggered when NATS received an “unusual piece of data” it could not process

Now I didn't write a lot of (non-system related) code in my time, but validating the data was always the first thing you code for when using user-input data. Checking that each field had the right data type, the right number of characters, etc. etc. Does anyone else puzzle over this "unusual piece of data" cause? It just doesn't seem credible to me. What do others think?

Please stay on topic to avoid the thread being deleted.
 

TonyCarter

VALUED CONTRIBUTOR
Validation should be the first step, whether that means each field is validated as each field is input, whether validation only occurs when the whole form has been input, or even at the very latest when the record has been completed and is about to be transmitted.

Even if the 'foreign' system allows for longer strings/strange characters for their own use, both ends of the system should be checking for 'translation' errors when data needs to be shared/transmitted.
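To make the "check every field on receipt" idea concrete, here is a minimal Python sketch. The record layout (`callsign`, `altitude_ft`, `route`) and every rule in it are invented for illustration; nothing here reflects the real NATS or Eurocontrol message formats.

```python
import re

# Hypothetical field rules, invented for illustration only.
CALLSIGN_RE = re.compile(r"[A-Z0-9]{2,7}")
WAYPOINT_RE = re.compile(r"[A-Z]{2,5}")


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    errors = []

    callsign = record.get("callsign")
    if not isinstance(callsign, str) or not CALLSIGN_RE.fullmatch(callsign):
        errors.append("callsign: expected 2-7 upper-case letters or digits")

    altitude = record.get("altitude_ft")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(altitude, int) or isinstance(altitude, bool):
        errors.append("altitude_ft: expected an integer")
    elif not 0 <= altitude <= 60000:
        errors.append("altitude_ft: out of range 0-60000")

    route = record.get("route")
    if not isinstance(route, list) or not route:
        errors.append("route: expected a non-empty list of waypoints")
    elif not all(isinstance(w, str) and WAYPOINT_RE.fullmatch(w) for w in route):
        errors.append("route: every waypoint must be 2-5 upper-case letters")

    return errors
```

The receiving end would call `validate_record` before doing anything else with the data and reject or log a record when the returned list is not empty. Note that this only catches malformed fields; a record where every field is individually valid but the combination is unexpected would still pass.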
 

sck451

MOST VALUED CONTRIBUTOR
The number one rule of programming is surely "never trust user data". I can't comprehend** how an essential piece of software can be brought down by malformed data.

** actually, I can, but I'm not sure if a discussion of government IT procurement would be off-topic!
 

TonyCarter

VALUED CONTRIBUTOR
** actually, I can, but I'm not sure if a discussion of government IT procurement would be off-topic!
I actually left out the bit about me working on projects with a large IT company for both NATS and Eurocontrol (over 20 years ago though).
 

SpyderTracks

We love you Ukraine
It's trying to bypass ownership. It should absolutely have been verified as part of the project migration before it was passed to BAU support. For this guy to say what he has, it's clear he knows nothing about computing.
 

Scott

Behold The Ford Mondeo
Moderator
What should happen and what does happen are very different beasts. I would have thought that, with this level of reliance required, this would never happen though.

It's one of these things where I doubt we will ever get the whole story.

With regard to the technical query, IME it happens all the time. We use a system at work called SAP. It's global software and runs a huge number of companies' ERP. Once adopted it becomes completely intrinsic to the business, so any downtime causes major upset. It's database driven, with huge datasets, various table layouts etc. Even at the very basics of SQL, field validation and restriction is a primary consideration, so there's no way anything should ever be brought to a halt by erroneous input.

Unsurprisingly we were brought to our knees a number of years ago. Daily operations on the system completely halted. Being in a large company, getting something fixed at our business level took some time as there's various levels of IT to wade through to get the issue fixed. Turned out that at some point the requirements for the validation had changed slightly. There was a check done to make sure that none of the current data would fail the new validation, but surprise surprise something was missed. Literally one entry when pulled created an error, it was an error that wasn't handled or considered so the entire reporting run fell over and stopped. This created a chain of reporting errors as it was one of the low level queries that was failing, that all others then relied upon.

Again, this took months to fix properly. It wasn't even a member of the IT team that found and corrected the issue, it was a SAP superuser within the company who was asked to "take a look" as no one else could figure it out.
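The failure mode in that story, where one bad entry raises an error nobody handled and the whole reporting run stops, can be sketched in a few lines of Python. This is not how SAP works internally; `run_report` and `net_price` are hypothetical names, and the idea shown is simply to set the failing row aside and carry on.

```python
def run_report(rows, summarise):
    """Apply `summarise` to each row; set aside rows that fail instead of aborting."""
    results = []
    rejected = []
    for index, row in enumerate(rows):
        try:
            results.append(summarise(row))
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the row number and reason so someone can fix the data later.
            rejected.append((index, repr(exc)))
    return results, rejected


def net_price(row):
    # Hypothetical per-row calculation; raises on missing or non-numeric data.
    return float(row["price"]) * int(row["quantity"])


rows = [
    {"price": "2.50", "quantity": "4"},
    {"price": "n/a", "quantity": "1"},  # the one entry that slipped past validation
    {"price": "1.00", "quantity": "3"},
]
results, rejected = run_report(rows, net_price)
```

Here the second row ends up in `rejected` and the other two are still processed. Whether skipping a bad record is acceptable depends on the system: for a report it usually is, while a safety-critical system may deliberately stop rather than continue with data it cannot interpret.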

There's a real ignorance when it comes to IT imo. I have my head in my hands on a daily basis at some of the misinformation and lack of understanding that our own IT team have. There's so much red tape and the left hand never knows what the right hand is doing. The captain at the wheel has a limited knowledge so anyone with any sort of knowledge looks like a master to them.

On the face of it.... these things shouldn't happen. Having witnessed the car crashes that I've seen with alleged IT professionals though, nothing surprises me.

There is the potential that this is all smoke and mirrors, with the actual cause being some sort of malicious attack. That can't be ruled out, but I genuinely believe an oversight in data validation could be the real cause.
 