As a follow-up to my Top 25 Most Dangerous Programming Errors post, I want to expand on the error that has me the most worried. It turns out it's at the top of the list: Improper Input Validation. The folks who wrote the list call it "the number one killer of healthy software." It's highly prevalent, yet it's downright easy to defend against. Here are some tips for fighting the good fight.
Understand where your inputs come from
When you're building a web application, your inputs come from four obvious sources:
- URLs
- Query Strings
- Forms
- Cookies
... and a couple of not-so-obvious ones:
- Other sites or resources on either your network or the web
- Your own data
URLs may not be all that obvious to some people. Traditionally, URLs were paths to files on a file system, and those paths don't leave much room for malicious activity: if you specify anything other than a known path, you get a 404 error. Fair enough. But with the rise of RESTful development, more sites are using parameterized URLs. You'll see addresses like http://www.flickr.com/sackville (sorry, I haven't posted any photos yet). That username in the URL is input, and it gets processed. Somewhere behind that page, a SQL query presumably runs with 'sackville' as a parameter. That's a prime attack point.
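To make the URL case concrete, here is a minimal sketch of treating a path segment as input. Everything in it is an assumption for illustration: the users table, the find_user helper, and the username rule are hypothetical, and it uses Python's built-in sqlite3 module rather than any particular site's stack. The value is checked against a whitelist first and then bound as a query parameter instead of being concatenated into the SQL text.

```python
import re
import sqlite3

# Hypothetical rule: 3 to 20 lowercase letters, digits, or underscores.
USERNAME_RE = re.compile(r"[a-z0-9_]{3,20}")

def find_user(conn, path):
    """Return the user row for a path like '/sackville', or None."""
    username = path.lstrip("/")
    # Reject anything that does not fully match the whitelist.
    if not USERNAME_RE.fullmatch(username):
        return None
    # Bind the value as a parameter; never paste it into the SQL string.
    cur = conn.execute("SELECT name FROM users WHERE name = ?", (username,))
    return cur.fetchone()

# Throwaway in-memory database so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT PRIMARY KEY)")
conn.execute("INSERT INTO users VALUES ('sackville')")

print(find_user(conn, "/sackville"))       # a known user
print(find_user(conn, "/x' OR '1'='1"))    # fails the whitelist, no query is run
```

The two defenses are independent: the whitelist keeps junk out of the app, and the bound parameter keeps whatever does get through from being interpreted as SQL.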
Query strings and form inputs should be pretty obvious: they're prime targets for attack.
Cookies, which you may think of as innocuous because they're usually written by your app (rather than by a user), are just as vulnerable as any other input. The reason is simple: anything anywhere in your client's HTTP request (and that includes all of the above) should be suspect. Hackers use custom-built web browsers that emulate your favorite commercial browser, but submit customized requests. They bypass all of your client-side security measures, and submit their own URLs, query strings, form fields, and cookies. We'll get to that "bypass all of your client-side security measures" issue in a moment, and you'll understand why you still need them.
As for your own data and the other resources your app integrates with, the best strategy is to assume that someone has already gotten to those systems, and that their data has been compromised. If it's going into your system, validate it as if it were from an untrusted source. If you're displaying it, encode or escape it (that's another topic altogether).
Define what is OK
Your app doesn't have to accept all forms of input, and in fact it should not. Your responsibility as a developer is to strictly define exactly what sorts of input are acceptable for every single input your app receives. Take a whitelist approach, not a blacklist approach. That means you explicitly define what input is allowed, rather than defining what is not allowed. If a character doesn't make your whitelist, it doesn't make it into the app.
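Here is a small sketch of the difference between the two approaches. The field (a display name) and both rules are made up for illustration; the point is only that the blacklist has to anticipate every bad character, while the whitelist rejects anything it was not told about.

```python
import re

# Blacklist: try to enumerate the bad characters. It is easy to miss one.
BLACKLIST = set("<>'\";")

def passes_blacklist(value):
    return not any(ch in BLACKLIST for ch in value)

# Whitelist: enumerate only what is allowed; everything else is rejected.
# Hypothetical rule: 1 to 30 ASCII letters, digits, spaces, or hyphens.
WHITELIST_RE = re.compile(r"[A-Za-z0-9 \-]{1,30}")

def passes_whitelist(value):
    return WHITELIST_RE.fullmatch(value) is not None

probe = "Bob | whoami"
print(passes_blacklist(probe))   # the pipe character was never blacklisted
print(passes_whitelist(probe))   # the pipe character was never whitelisted
```

The probe slips past the blacklist because nobody thought to list the pipe character, and it fails the whitelist for the same reason.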
You'll want to define more than one whitelist. Keep one for each type of data your app receives (phone numbers, email addresses, etc.). Err on the side of a more narrow definition of what is OK. If your users complain about a character not being accepted, add that character rather than an entire class of characters. Design your app to limit the variety of input that is accepted: if you are storing US-only phone numbers, there's no reason for them to be letters, and you don't need to store more than 10 digits (save the dashes and parentheses for display).
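One way to keep a separate whitelist per data type is a small registry of validators. This is a sketch, not a complete library: the type names, the patterns, and the helper names are all hypothetical. The phone validator follows the US-only example above by accepting the usual display punctuation on the way in, but handing back only the 10 digits for storage.

```python
import re

def normalize_us_phone(raw):
    """Return the 10 digits of a US phone number, or None if invalid."""
    # Accept only digits plus common display characters.
    if not re.fullmatch(r"[0-9()\-. ]{10,20}", raw):
        return None
    digits = re.sub(r"[^0-9]", "", raw)
    return digits if len(digits) == 10 else None

def format_us_phone(digits):
    """Add the punctuation back for display only."""
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

# One narrow whitelist per data type; names and patterns are illustrative.
VALIDATORS = {
    "zip_code": lambda v: v if re.fullmatch(r"[0-9]{5}", v) else None,
    "us_phone": normalize_us_phone,
}

def validate(kind, raw):
    """Return the cleaned value, or None if it fails that type's whitelist."""
    return VALIDATORS[kind](raw)

stored = validate("us_phone", "(555) 867-5309")
print(stored)                    # digits only, ready to store
print(format_us_phone(stored))   # punctuation restored for display
```

If users complain that a format is rejected, the fix is to widen that one pattern by that one character, which keeps each whitelist as narrow as it can be.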
If your users are choosing from a list your app provides, then there's no reason why you should accept anything that's not on that list. If you can display the list, then you can (and should) validate against it.
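Validating against a list you already display can be as simple as a membership test. In this sketch the SHIPPING_OPTIONS mapping and the validate_choice helper are hypothetical; the idea is that the same structure that renders the drop-down also decides what the server will accept.

```python
# Hypothetical list the app renders as a drop-down: key -> label.
SHIPPING_OPTIONS = {
    "ground": "Ground (5-7 days)",
    "express": "Express (2 days)",
    "overnight": "Overnight",
}

def validate_choice(submitted, options=SHIPPING_OPTIONS):
    """Return the submitted key only if it is one the app offered."""
    if isinstance(submitted, str) and submitted in options:
        return submitted
    raise ValueError("value is not one of the offered options")

print(validate_choice("express"))

try:
    validate_choice("free")      # never offered, so never accepted
except ValueError as exc:
    print("rejected:", exc)
```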
Check, check, and then check
Validation should happen in three places: on the client (before submitting input), on the server (as input is received), and at application boundaries (as input is received by one application from another).
As I mentioned above, client-side data validation is not enough. Your app needs to validate -- on the server side -- any data coming in from the client. But that's no reason not to validate on the client side. Client-side validation is your first defense, and can greatly enhance the user experience (by eliminating the round trip to the server just to validate input). There's another huge gain that client-side validation can give you: if you receive invalid input from your client, and catch it on the server using a validation algorithm that is supposed to match the client-side validation exactly, then you have detected an attack on your system. You can respond as you see fit. I recommend something tactical and nuclear.
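A rough sketch of that detection idea follows. It assumes the server-side pattern is kept identical to the one the client-side script enforces; the quantity field, the SuspiciousInput exception, and the logger name are all invented for the example. When input fails a check the client should already have applied, the server records where it came from before rejecting it.

```python
import logging
import re

logger = logging.getLogger("validation")

# Assumed to be the same pattern the client-side script enforces: 1 to 999.
QUANTITY_RE = re.compile(r"[1-9][0-9]{0,2}")

class SuspiciousInput(Exception):
    """Input failed a check the client should already have enforced."""

def validate_quantity(raw, remote_addr):
    if QUANTITY_RE.fullmatch(raw):
        return int(raw)
    # A well-behaved client could not have sent this, so record it first.
    logger.warning("client-side validation bypassed from %s: %r", remote_addr, raw)
    raise SuspiciousInput("quantity failed server-side validation")

print(validate_quantity("12", "203.0.113.9"))

try:
    validate_quantity("-1", "203.0.113.9")
except SuspiciousInput as exc:
    print("flagged:", exc)
```

A browser with scripting turned off can trip the same check, so how hard to respond to a single failure is a judgment call; repeated failures from one address are a much stronger signal.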
Server-side validation should start, as stated above, with a validation algorithm that matches the client-side validation. Your business logic may then provide even further validation opportunities, based on the results of some server-side processing.
Any system your server communicates with should perform its own validation of input crossing application boundaries. This includes the app-server-to-DB-server boundary. Your SQL should be validating input, though this is generally limited to data-type checking, data-length checking, and referential integrity checks.
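To show what validation at the database boundary can look like, here is a sketch using SQLite through Python's built-in sqlite3 module. That is a stand-in chosen so the example runs anywhere, not the SQL Server setup I describe below, and the tables, columns, and try_insert helper are hypothetical. The schema itself enforces data type, length, and referential integrity, so bad data is refused even if the application layer lets it through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        phone TEXT NOT NULL
              CHECK (length(phone) = 10 AND phone NOT GLOB '*[^0-9]*')
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        quantity    INTEGER NOT NULL
                    CHECK (typeof(quantity) = 'integer'
                           AND quantity BETWEEN 1 AND 999)
    );
""")

ADD_CUSTOMER = "INSERT INTO customers (id, phone) VALUES (?, ?)"
ADD_ORDER = "INSERT INTO orders (customer_id, quantity) VALUES (?, ?)"

def try_insert(sql, params):
    """Run one insert; report whether the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

print(try_insert(ADD_CUSTOMER, (1, "5558675309")))   # valid phone
print(try_insert(ADD_CUSTOMER, (2, "555-867-53")))   # fails the CHECK constraint
print(try_insert(ADD_ORDER, (1, 3)))                 # valid order
print(try_insert(ADD_ORDER, (99, 3)))                # no such customer
```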
Use your tools
Validation is, of course, a common practice, and we have many validation tools at our disposal. I use SQL Server in most of my development, and just using stored procedures for all of my DB interaction gives me a head start -- invalid data types and lengths are automatically rejected. Using foreign key relationships between tables ensures referential integrity. As a .NET developer, I get a vast array of input validation tools built into the framework I'm building with, on both the client and server sides. In any modern language you can use regular expressions to build your validation whitelists. Search for regular expressions for the data type you need and you'll find some well-written, proven patterns for common data types.
Proper input validation requires some work. It takes some time. But really all we need to do is establish some good habits, build up a set of tools we can use on every project, and make validation into just another thing we always do.
Ss.