Security in complex systems is always a tricky business. Consider production Grid infrastructures as an example. Establishing working trust relationships between the users and the infrastructure, and between the systems themselves, is a mammoth task. Solving problems with such systems is just as hard, as I’ve previously found when developing EU-wide Grid interoperability demonstrators of open standards. They appear like dragons: huge, daunting, and difficult to defeat.
The UK National Grid Service asked Steve (well, the Institute really) to help them out with their SARoNGS system. Our arrangement was very effective. The Software Sustainability Institute provided development effort for the investigation, whilst the NGS fixed issues and offered the in-depth systems knowledge that only they could provide.
So what is SARoNGS all about? The Shibboleth Access to Resources on the NGS service greatly simplifies authentication to NGS resources by accepting institutional Shibboleth credentials. It’s great for users, because they don’t need to apply for, own and use an X.509 certificate. However, it appeared that the automatically generated SARoNGS certificates were being rejected by the NGS’s Workload Management Service (WMS). In short, you could no longer use SARoNGS certificates to submit jobs through the WMS without seeing a rather ominous error light up the screen:
Connection failed: CA certificate verification failed
We were warned: here be dragons. But we ploughed on, heedless of the danger.
You may have heard of software decay. This can occur when the environment around a piece of software changes, which leads to failures in the system as a whole. For example, an update to a dependent library or to the operating system could cause a problem. Updating one of the ubiquitous jar files in Java, only to find some of the API functions have become deprecated, can also cause grief. The good news is that there are things you can do to avoid this problem, some of which I’ve looked at already.
Security problems are often esoteric and difficult to solve. The problem could be a software dependency issue, say a newly updated security library with a bug that incorrectly interprets certificate attributes. It could also be a problem with the way in which trust relationships are defined. Sophisticated production Grid systems often trust a veritable legion of Certificate Authorities (CAs). Each CA has its own CA trust certificates, Certificate Revocation Lists (CRLs – a list of certificates not to trust) and signing policies. (I won’t get into how VOMS fits into the picture in this post, but if you’re interested, let me know.) Sorting out certificate problems can be like looking for a needle in a haystack… in a tornado. However, once identified, these issues are often easy to fix.
Systems can also fail when you haven’t changed anything at all, and this was the case with the first problem we found with SARoNGS.
Time is an important concept in security. For example, the NGS proxy certificates have a limited lifetime to reduce their vulnerability if the proxy is compromised. CRLs must also be kept up to date. The problem with expired CRLs is that they can cause the entire authentication step to fail, and this is what had happened with SARoNGS: a CRL in a critical location had expired. We updated the CRL and the first dragon lay dead on our screens!
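If you want to check for this kind of expiry yourself, here is a minimal sketch in Python. It assumes you have already asked OpenSSL for the CRL's expiry with something like `openssl crl -in some.crl -noout -nextupdate`, which prints a single `nextUpdate=...` line. The function names and the sample dates are my own illustration, not part of any NGS tooling.

```python
from datetime import datetime, timezone


def parse_next_update(line):
    """Parse a line such as 'nextUpdate=Mar 14 12:00:00 2024 GMT'.

    Assumes the date format printed by `openssl crl -noout -nextupdate`.
    """
    _, _, value = line.strip().partition("=")
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def crl_is_expired(line, now=None):
    """Return True if the CRL's nextUpdate time has already passed."""
    now = now or datetime.now(timezone.utc)
    return now >= parse_next_update(line)


if __name__ == "__main__":
    # Hypothetical sample line; in practice, feed in the openssl output.
    sample = "nextUpdate=Mar 14 12:00:00 2024 GMT"
    print("expired" if crl_is_expired(sample) else "still valid")
```

Running a check like this over every CRL in the trust directory on a schedule would flag an expired CRL before it breaks authentication.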
When establishing trust in Grid systems, you need to decide which certificates to trust and where in the system to trust them. The second problem with SARoNGS was caused by two different signing policies being in use simultaneously. Some sites were intentionally configured to trust SARoNGS, and others were not. However, the installation of an update using the International Grid Trust Federation (IGTF) bundle meant that the UK e-Science signing policy reverted to the IGTF default: do not trust SARoNGS certificates. Again, an easy problem to fix, but a difficult one to identify. Once SARoNGS trust was reinstated in the signing policy (we used a modified NGS IGTF+ bundle) the problem was resolved and the last dragon soundly defeated.
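To make the signing policy problem concrete, here is a rough Python sketch of the check a Grid service performs: does the certificate's subject match one of the `cond_subjects` patterns in the CA's signing policy file? The two policy snippets and all the distinguished names below are hypothetical examples I made up for illustration, not the real UK e-Science or SARoNGS entries, and the parser only handles double-quoted patterns with simple `*` wildcards.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical "default" policy: only one subject namespace is allowed.
DEFAULT_POLICY = """
access_id_CA   X509    '/C=UK/O=Example/CN=Example CA'
pos_rights     globus  CA:sign
cond_subjects  globus  '"/C=UK/O=eScience/*"'
"""

# Hypothetical modified policy that also allows a SARoNGS-style namespace.
MODIFIED_POLICY = """
access_id_CA   X509    '/C=UK/O=Example/CN=Example CA'
pos_rights     globus  CA:sign
cond_subjects  globus  '"/C=UK/O=eScience/*" "/C=UK/O=Example/OU=SARoNGS/*"'
"""


def allowed_subjects(policy_text):
    """Collect the double-quoted patterns from every cond_subjects line."""
    patterns = []
    for line in policy_text.splitlines():
        line = line.strip()
        if line.startswith("cond_subjects"):
            patterns.extend(re.findall(r'"([^"]+)"', line))
    return patterns


def is_trusted(subject_dn, policy_text):
    """Return True if the subject DN matches any allowed pattern."""
    return any(fnmatchcase(subject_dn, p) for p in allowed_subjects(policy_text))
```

With these made-up policies, a subject under the SARoNGS-style namespace is rejected by the default policy and accepted by the modified one. That is the shape of the problem we hit when the update quietly replaced one file with the other.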
And so the legend goes, the dragons of SARoNGS were slain. If you ever find yourself developing Grid software and run into a security brick wall, why not take a look at those conspicuous-looking CRL and signing policy files? They could be dragons. And dragons need slaying!