Skills Transcend Language

A common tendency when looking for software developers is to focus closely on their background in specific programming languages. As previously discussed in Why Do You Hire Programmers?, unnecessary focus on a specific language can lead to other, more appropriate options being overlooked.

A second hazard is that looking for language-specific experience implicitly assumes that programming languages stand alone, with little relation to one another, and that learning a new one is a major undertaking.

These issues are two sides of the same coin, both flawed in their failure to recognize that, while programming languages are the means used to develop software, they are not the primary skills of software development.

The Right Tool For The Job

A commonly-used metaphor, and one which would have been appropriate to the points made in Why Do You Hire Programmers?, is that of a carpenter’s toolbox. “When you want to build a house, you look for a carpenter who builds houses, not a carpenter who has 5 years of experience using hammers and 2 years using nailguns with preference given to candidates who are also familiar with screwdrivers.”

While this metaphor does make a good point, that the craftsman’s skills go beyond the specific tools used and that you should trust him to use the right tool at the right time, I still have minor issues with it in that it preserves the misconception that each software development tool is fundamentally different from the others. Although there is some overlap between the skills of using hammers, nailguns, or screwdrivers, they are, by and large, three distinct activities. The skills of writing software in C, Java, or Smalltalk, on the other hand, overlap by much more than they differ.

Something A Little Less Prosaic

A much more apt metaphor would be to compare the programmer to the poet.

Skilled poets can write in many forms. A master of the sonnet can also be expected to competently produce ballads or limericks when called for and would likely be able to write a mean haiku with very little practice. The core skills of handling meter and rhyme or choosing the right words to paint a vivid image apply broadly across all poetic forms.

Similarly, a well-rounded developer is proficient in a variety of programming languages and can apply his core skills in logic and program structure to new languages with a minimum of difficulty. The techniques do vary from one language to another, with some placing more emphasis on one type of logic and others emphasizing structure more heavily, just as rhyme is everything to a limerick and irrelevant to a haiku, but this is generally a minor obstacle, if any at all.

Learning The Language

There is a common case where this falls apart, however: Programmers who only know a single language. Although it’s not inevitable, this often indicates someone who has solely learned that language without also learning the more general techniques of software development and how they apply beyond the one language.

The primary skills needed in software development are ways of thinking, of finding solutions for problems, and of structuring those solutions. They are not the vocabulary or syntax of particular programming languages. You will often be much better off with someone who has extensive experience with a wide range of technologies, but doesn’t know your particular language, than with someone who has extensive experience with your language, but has never used anything else.

     

And Now A Brief Word From Our Author

Just two quick updates today:

1. A Correction To Last Tuesday’s Post

In Off the Record: Passwords, I recommended the use of SHA1 rather than MD5 hashes when storing passwords. Since then, I have encountered a persuasive argument in favor of abandoning both of them and using bcrypt instead, as it’s designed to be less time-efficient, thus dramatically reducing the number of potential passwords that a brute-force attack can attempt in a given amount of time.

As this can make your passwords much more secure against attackers while still keeping single-hash generation running at a reasonable pace, I have revised that post to recommend the use of bcrypt over SHA1 where available.

2. I’m Not Disappearing

New posts to this blog have slowed down substantially over the last couple of weeks because the stockpile of posts which I prepared prior to launching the blog has all been published. My new writing process, which I expect to work out much better in the long run, is taking longer than the old one to produce completed posts. If all goes as it appears that it will, the rate of new posts should pick up again by the end of next week, if not sooner.

     

Off the Record: Passwords

In 1999, I accepted a programming job with a company selling voicemail service. When it came time for the boss to demo the company’s product for me in full, he wanted to show me some feature that needed my PIN to be entered. Rather than having me enter it, he turned to his computer, brought up my account information, and read the PIN off the screen.

Although I didn’t say a word, I just about died on the spot. He wasn’t aware of it, but he now knew the PIN for my bank account!

Granted, I should have known better than to reuse a PIN (and I do know better now!), but it’s a very common practice nonetheless, especially as the number of PINs and passwords that each of us needs to keep track of seems to be increasing daily.

Back To The Present

More recently, there has been discussion on the debian-user mailing list of ways to log the passwords entered on unsuccessful login attempts. While this seems innocuous at first glance, and an effective way of seeing what passwords are being tried by brute-force attackers, it will also catch passwords entered incorrectly by legitimate users. These mistyped passwords will generally only be off by a letter or two, thus giving anyone with access to these logs the ability to much more easily guess the correct version.

Despite the potential advantages in being able to revise your password requirements to improve their strength against the latest dictionary attacks, the issue of password reuse remains. These kinds of system logs are generally locked down well enough that they’re only accessible to system administrators, whose legitimate powers are already broad enough that knowing other users’ passwords would not allow them to do any additional damage locally. Granting them knowledge of others’ passwords, however, potentially allows them to impersonate those users on other systems.

Trusting them with this information is not necessary for them to perform their legitimate duties. It allows them to do additional harm, but does not expand their powers for good. Therefore, it should be kept from them and any security-conscious sysadmin or developer should recognize this. (Personally, I have, on several occasions, cut off users who were about to tell me their passwords and explained that I would not allow them to do so.)

Avoiding Password Leakage

To maintain secrecy, user-entered passwords should never be recorded in plaintext. I can’t think of any situations in which there would be a legitimate need to be able to recover an actual user password, but, should such a case exist, there are plenty of good, reversible encryption methods which can be used to store it, safe from prying eyes or casual glances.

The proper and generally-accepted method of dealing with passwords is to pass them through a non-reversible cryptographic hashing function, such as bcrypt (see note 2; this post originally recommended SHA1, see note 1), and then store the hash rather than the password itself. Use of reversible encryption would still leave the password vulnerable to anyone with access to the encryption key and the will to use it. A bcrypt hash allows you to determine whether the entered password matches the correct password without ever knowing (or being able to know) what the correct password is.
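As a rough illustration of storing a hash and checking a login attempt against it, here is a minimal Python sketch. bcrypt is not part of Python’s standard library, so this sketch substitutes PBKDF2 (hashlib.pbkdf2_hmac), another deliberately slow function, purely to stay self-contained. The function names and the iteration count are assumptions made for the example, not recommendations from the original post.

```python
import hashlib
import hmac
import os

# Assumed work factor for this sketch; higher values make each guess slower.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest). Only these are stored, never the password."""
    salt = os.urandom(16)  # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(attempt, stored)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))  # True
print(verify_password("correct hors", salt, stored))   # False
```

Nothing in the stored pair lets the system, or its administrators, read the password back. All it can do is answer whether a given attempt matches.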

Afterword: Honeypots and Testing

I did mention, above, that there are potential benefits to being able to see what passwords are being attempted by attackers. If this information is needed, it can be obtained without compromising the security of your actual users’ passwords by setting up a honeypot system with no valid user accounts (or at least none which accept password-based logins over the network). Since no legitimate logins will be attempted, you can safely record the passwords without giving away information about legitimate users’ accounts.

The other situation in which I’ve seen it suggested that it is appropriate to record passwords is when testing or debugging software under development to ensure that input (including the password) is being received and processed correctly. Safeguarding actual user credentials is again easily done by means of not having any actual user data (or at least not any actual passwords) in the testing environment.

 


1 MD5 has been widely used for many years, but weaknesses are finally being found in it, so SHA1 is now considered a better choice for this purpose.

2 Shortly after making this post, I ran across this article, which makes an excellent case for using bcrypt rather than MD5 or SHA1. If a bcrypt library is available in the environment where you’re operating, use it. If it’s not available, see if you can get it added.

     

Optimizing Software From 20,000 Feet

“The First Rule of Program Optimization: Don’t do it.
The Second Rule of Program Optimization (for experts only!): Don’t do it yet.”

- Michael A. Jackson

If you spend much time with people who have any involvement with software development, you’re going to run across a conversation about optimizing software. Either the program is too big or too slow or too chatty on the network or too something and somebody wants to do a little optimization to improve on that.

It usually starts with the natural impulse of geeks to tinker with their systems, but project leads, marketing departments, and managers are often drawn in by the promise of a better product.

Should you find yourself in that position, contemplating either undertaking an optimization project or authorizing one, stop and consider what the proposed optimization is and how much it will actually improve things.

Optimizing Line-By-Line Is Rarely Worth It

Micro-optimizations are generally not worth the time it takes to make them. Shaving off a couple of milliseconds from a function that only runs once is entirely pointless - it will produce no perceptible improvement, yet it comes at a potentially high cost in development time. Simple optimizations of this type are often already done automatically by the compiler or interpreter behind the scenes anyhow, while more complex attempts may backfire.

The primary exception to this is when the code in question will be repeated many, many times in rapid succession. (Programmers often call this a “tight loop”.) If the code will run 10,000 times, then making it take a millisecond less each time will save a total of 10 seconds. If your user is sitting and waiting for the task to complete, that 10 seconds is an eternity. On the other hand, if it’s part of a 6-hour non-interactive process to close out your monthly books, the 10 seconds saved is meaningless despite the repetition.
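The arithmetic behind that comparison can be checked in a few lines of Python. The figures are the ones given above; the variable names are only for illustration.

```python
# Figures from the text: 10,000 repetitions, 1 ms saved on each.
runs = 10_000
saved_per_run_ms = 1

total_saved_s = runs * saved_per_run_ms / 1000
batch_job_s = 6 * 60 * 60  # the 6-hour non-interactive process, in seconds

share_of_batch = total_saved_s / batch_job_s
print(f"saved {total_saved_s:.0f} s, {share_of_batch:.3%} of the batch job")
# saved 10 s, 0.046% of the batch job
```

The same 10 seconds is the entire wait for the interactive user, but less than a twentieth of a percent of the monthly close.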

Optimizing Algorithms Works Much Better

Several years ago, as I was just starting my programming career, I had a job doing data entry on a system which needed to run a check for duplicate records before posting completed tasks into its archival database. For performance reasons, it checked only those active records which were marked complete - and it still took nearly half an hour to run.

Eventually, the programmer who wrote the system left the company and I inherited his responsibilities for it. One of the first things I did was take a hard look at the duplicate checking code.

The duplicate check, as originally written, looked at every single record in the archival database for each active record that was being checked. Testing 10 active records for duplication against a 10,000-record archive database required 100,000 record-level comparisons. No wonder it was slow. It used an extremely inefficient algorithm.

Within a day, I had rewritten it with a better algorithm which only needed to make one pass over each database. With 1,000 active records and 10,000 archived, it could test every active record (not just the completed ones) with only around 11,000 record-level comparisons. It also used the databases more efficiently, bringing total run time down to roughly 15 seconds. Vastly improved performance, plus a more thorough check. A truly worthwhile optimization!
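To make the difference concrete, here is a small Python sketch of the two approaches. It is not the original code. It assumes each record can be reduced to a single hashable key, and it uses a set lookup as one possible way of touching each record only once. The function names are hypothetical, and the second function counts record-level operations rather than strict comparisons.

```python
def naive_check(active, archive):
    """Compare every active record against every archived record."""
    comparisons = 0
    duplicates = []
    for record in active:
        found = False
        for archived in archive:
            comparisons += 1
            if record == archived:
                found = True
        if found:
            duplicates.append(record)
    return duplicates, comparisons


def single_pass_check(active, archive):
    """Index each archived record once, then look up each active record once."""
    operations = 0
    seen = set()
    for archived in archive:
        seen.add(archived)
        operations += 1
    duplicates = []
    for record in active:
        operations += 1
        if record in seen:
            duplicates.append(record)
    return duplicates, operations


archive = list(range(10_000))
print(naive_check(list(range(9_995, 10_005)), archive)[1])        # 100000
print(single_pass_check(list(range(9_500, 10_500)), archive)[1])  # 11000
```

The single-pass version handles a hundred times as many active records with roughly a ninth of the work, because its cost grows with the sum of the two record counts rather than their product.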

The Most Important Optimization

If I were to revise the original version of that duplicate check today, I could do even better. I’m confident that I could get it under 5 seconds and probably down into the 1-2 second range, while still running on the same, now horribly outdated, hardware and software.

Why are such extreme improvements possible? Because, at the time, I was able to devise a better algorithm than the original programmer had and because I now have several more years of experience behind me than I did then.

If presented with my revised version, though, I would argue against further optimization of that code. Since it was just run once a week, it wouldn’t be worth the effort involved in bringing it down from 15 seconds to 5, maybe not even if it could get down to 1. Rewriting to eliminate the need to run the check at all might be worthwhile, but making the check faster would not.

The most important optimization is to optimize the skills and experience of your developers. Software development is not a commodity product. Getting someone with the skill to choose the right algorithms and the experience to know when an optimization wouldn’t produce sufficient improvement to justify the time invested will get you better results, in less time, and often for a lower overall cost.

     

Email Address Validation

In Validation Vexation, I wrote a bit about ways that validation rules for user-entered data can go awry by being too narrowly-defined. This post adds three more principles for dealing with data validation which are primarily focused on the results of the validation rather than the rules used to do it. The examples used are rather specific to handling email addresses, but the principles themselves still apply much more broadly.

Incorrect Validation Is Worse Than No Validation

A recurrent question I’ve seen on a couple different programming sites is “What regular expression can I use to validate an email address?” They’ve gotten a lot of responses explaining ways to examine an email address and reject it if it’s “invalid”. The only problem is that every one of them that tried to be more restrictive than just checking for the presence of an “@” was wrong.

The most common flaw was failing to accept “+” in the local-part of an address, causing many perfectly valid, deliverable email addresses to be rejected. There were also several which failed on top-level domains (TLDs) which were four or more characters long, such as .info or .museum.

On the flip side, there were users accustomed to using disposable email addresses of the form username+foo@domain.com or with addresses automatically generated from their surname of O’Malley - including the apostrophe - complaining about sites refusing to accept their “invalid” (but syntactically correct and perfectly functional) email addresses.
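The failure is easy to reproduce. The pattern below is a hypothetical example of the kind of “strict” expression that gets suggested. It is not taken from any particular site, and the example.com and example.museum addresses are made up for the demonstration.

```python
import re

# Hypothetical over-strict pattern: no "+" or "'" allowed in the local-part,
# and a top-level domain of only 2 or 3 letters.
STRICT = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,3}$")


def strict_accepts(address: str) -> bool:
    """True if the over-strict pattern lets the address through."""
    return STRICT.match(address) is not None


addresses = [
    "plain@example.com",
    "username+foo@domain.com",
    "o'malley@example.com",
    "curator@example.museum",
]
for address in addresses:
    # Columns: address, strict pattern result, bare "@" check result
    print(address, strict_accepts(address), "@" in address)
```

Only the first address passes the strict pattern, while all four pass the bare “@” check.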

Rejecting these addresses may only shut out a percent or two of the world, but is that really an acceptable cost to pay in an attempt to shut out truly bogus addresses? Particularly given that…

Valid Does Not Imply Correct

Let’s say that you’ve managed to find yourself a perfect validation algorithm which is 100% accurate at determining whether email addresses are syntactically valid, even for the O’Malleys. You’re getting only good email addresses, right?

Wrong.

abcde@fghi.jkl is syntactically valid, but utterly bogus. There isn’t even a .jkl TLD!

“OK,” you say, “I’ll just add a DNS lookup to my perfect validator so that it rejects domains that don’t exist or don’t have a mail exchanger (MX) record defined.”

Still no good. dave.sherohman@whitehouse.gov is syntactically valid. It points to a domain that exists and accepts mail. It’s still no good.

“No problem. I’ll connect to the mail server and validate that the user account exists.”

I can still give you a bad email address. Try president@whitehouse.gov on for size. The account exists, but it’s certainly not mine.

If There’s Any Chance Of An Incorrect Rejection, Don’t Pre-Validate

The only way to authoritatively validate the correctness of an email address is to send email to the address and request that the user respond to it in some manner to confirm receipt.

You may still benefit from doing a preliminary validation to avoid the trouble of sending out mail that could never possibly be received, but this pre-validation cannot reasonably be considered authoritative and should, therefore, only be used to filter out the most blatantly incorrect cases. Attempting to make it more restrictive will only introduce an unnecessary risk of rejecting genuine, correct, deliverable addresses.

The cost of performing the authoritative validation is low and the cost of rejecting an address which is incorrectly flagged as invalid by a flawed pre-validation is high, so don’t pre-validate beyond the point at which you are absolutely, 100% certain that it will not fail anything which might pass the authoritative validation.
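One way this could look in practice is sketched below in Python: a deliberately loose pre-check, followed by a confirmation token that only the real recipient can return. The function names and the in-memory dictionary are assumptions for illustration, and actually sending the message is left out.

```python
import secrets

# token -> address awaiting confirmation (in-memory stand-in for a database)
pending = {}


def plausible(address: str) -> bool:
    """Reject only the blatant cases: no "@", or nothing on one side of it."""
    local, at, domain = address.rpartition("@")
    return bool(local and at and domain)


def start_confirmation(address: str) -> str:
    """Create the token that would be mailed to the address."""
    if not plausible(address):
        raise ValueError("address could never receive mail")
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return token


def confirm(token: str):
    """Return the confirmed address, or None if the token is unknown or used."""
    return pending.pop(token, None)


token = start_confirmation("o'malley+news@example.museum")
print(confirm(token))  # o'malley+news@example.museum
print(confirm(token))  # None
```

The pre-check here cannot reject anything that the confirmation step might accept, which is the property the rule above asks for.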