In Validation Vexation, I wrote a bit about ways that validation rules for user-entered data can go awry by being too narrowly defined. This post adds three more principles for data validation, focused primarily on the results of the validation rather than the rules used to perform it. The examples are specific to handling email addresses, but the principles themselves apply much more broadly.
Incorrect Validation Is Worse Than No Validation
A recurrent question I’ve seen on a couple of different programming sites is “What regular expression can I use to validate an email address?” These questions have gotten a lot of responses explaining ways to examine an email address and reject it if it’s “invalid”. The only problem is that every response that tried to be more restrictive than just checking for the presence of an “@” was wrong.
The most common flaw was failing to accept “+” in the local-part of an address, causing many perfectly valid, deliverable email addresses to be rejected. There were also several which failed on top-level domains (TLDs) which were four or more characters long, such as .info or .museum.
On the flip side, there were users complaining about sites refusing to accept their “invalid” (but syntactically correct and perfectly functional) email addresses. Some were accustomed to using disposable addresses of the form username+foo@domain.com; others had addresses automatically generated from their surname of O’Malley, apostrophe included.
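To make the failure modes above concrete, here is a minimal Python sketch. The pattern is a hypothetical stand-in for the kind of over-restrictive regular expression described above, not one quoted from any particular answer, and the sample addresses (other than username+foo@domain.com) are made up for illustration.

```python
import re

# Hypothetical over-restrictive pattern: no "+" or apostrophe allowed in the
# local-part, and the TLD must be exactly two or three letters.
NAIVE_PATTERN = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,3}")


def naive_is_valid(address):
    """Return True only if the whole address matches the naive pattern."""
    return NAIVE_PATTERN.fullmatch(address) is not None


# Illustrative addresses; all but the first are syntactically valid
# yet rejected by the naive pattern.
samples = [
    "username@domain.com",
    "username+foo@domain.com",
    "o'malley@domain.com",
    "curator@example.museum",
    "someone@example.info",
]

for address in samples:
    verdict = "accepted" if naive_is_valid(address) else "REJECTED"
    print(f"{address}: {verdict}")
```

Only the first sample passes. The other four are rejected for the reasons described above: a “+” or an apostrophe in the local-part, or a TLD longer than three characters.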
Rejecting these addresses may only shut out a percent or two of the world, but is that really an acceptable price to pay in an attempt to shut out truly bogus addresses? Particularly given that…
Valid Does Not Imply Correct
Let’s say that you’ve managed to find yourself a perfect validation algorithm which is 100% accurate at determining whether email addresses are syntactically valid, even for the O’Malleys. You’re getting only good email addresses, right?
Wrong.
abcde@fghi.jkl is syntactically valid, but utterly bogus. There isn’t even a .jkl TLD!
“OK,” you say, “I’ll just add a DNS lookup to my perfect validator so that it rejects domains that don’t exist or don’t have a mail exchanger (MX) record defined.”
Still no good. dave.sherohman@whitehouse.gov is syntactically valid. It points to a domain that exists and accepts mail. It’s still no good.
“No problem. I’ll connect to the mail server and validate that the user account exists.”
I can still give you a bad email address. Try president@whitehouse.gov on for size. The account exists, but it’s certainly not mine.
If There’s Any Chance Of An Incorrect Rejection, Don’t Pre-Validate
The only way to authoritatively validate the correctness of an email address is to send email to the address and request that the user respond to it in some manner to confirm receipt.
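One common way to implement this kind of confirmation is to mail the user a link containing a one-time token. The following Python sketch shows only the bookkeeping, under some loud assumptions: the names start_confirmation and confirm are hypothetical, the in-memory dictionary stands in for whatever storage a real site would use, and actually sending the email is left out entirely.

```python
import secrets

# In-memory stand-ins for persistent storage (an assumption for this sketch).
pending = {}       # token -> address awaiting confirmation
confirmed = set()  # addresses whose owners have responded


def start_confirmation(address):
    """Record the address as pending and return a token to mail to it."""
    token = secrets.token_urlsafe(32)  # unguessable, URL-safe random token
    pending[token] = address
    return token


def confirm(token):
    """Confirm the address tied to this token; return False if unknown."""
    address = pending.pop(token, None)  # pop makes each token single-use
    if address is None:
        return False
    confirmed.add(address)
    return True
```

The address is treated as correct only once confirm succeeds, which can only happen if whoever entered the address was also able to read the mail sent to it.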
You may still benefit from doing a preliminary validation to avoid the trouble of sending out mail that could never possibly be received, but this pre-validation cannot reasonably be considered authoritative and should, therefore, only be used to filter out the most blatantly incorrect cases. Attempting to make it more restrictive will only introduce an unnecessary risk of rejecting genuine, correct, deliverable addresses.
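As a rough illustration of how little such a pre-validation needs to do, here is a Python sketch that checks only for an “@” with something on either side of it. The function name looks_plausible is hypothetical, and requiring non-empty text on both sides goes slightly beyond a bare “@” check, so treat that part as an assumption.

```python
def looks_plausible(address):
    """Reject only addresses that could never possibly receive mail."""
    # Split on the last "@", since a quoted local-part may itself contain "@".
    local, at, domain = address.strip().rpartition("@")
    # Require the "@" itself plus non-empty text on both sides of it.
    return bool(at and local and domain)
```

Addresses such as username+foo@domain.com or one containing an apostrophe pass this check; only input with no “@”, or with nothing before or after it, is turned away before the confirmation email is sent.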
The cost of performing the authoritative validation is low, while the cost of rejecting an address that a flawed pre-validation has incorrectly flagged as invalid is high. So don’t pre-validate beyond the point at which you are absolutely, 100% certain that the pre-validation will not reject anything that might pass the authoritative validation.