As a web programmer, not a day goes by without needing to touch SQL in some form. Some days you're lucky enough to be modeling a schema, but other days leave you struggling to piece together a query that's over 100 lines long, just to return one number! There are an unlimited number of resources on creating quality, normalized database schemas, but unfortunately, not near as many dealing with breaking down and creating a complex query.
The software development process has a very extensive lifecycle, starting with project definition & requirements, then stepping through architecture and high-level design. Eventually, you get to the pseudocode phase, which happens right before you start coding. Constructing a single, complex SQL query can follow a somewhat streamlined version of that process.
Definition
In most cases, this step is simple, and the definition is provided to you. A few days ago, I had a client of mine ask me to create a report that determines customer value. There it is, the query's meaning is defined. I'll use this customer value example for the remaining steps.
Requirements
The requirements get a little more detailed. While the query definition will typically be only a sentence long, the requirements will provide more detail, including:
- What information is expected in the result set?
Keep the number of fields in the result set as small as possible. Using the asterisk to return every field in a table is a bad practice, no matter who you ask. I've been known to defend certain practices that are typically labelled as "bad", but in this case, there's just no argument. If you have a bunch of "select all field" queries running rampant throughout your code, what happens when you add a blob field to your table?
Database queries pull information from disk and store it in memory. With memory typically being a lot more precious than disk space, we want to keep that result set as small as possible. It also ensures that you keep the routine that called the database query on a need-to-know basis, not providing it any information is doesn't need. For example, if the primary key field is not needed in the result set, don't include it.
For the customer value query, we'll need the following fields returned:
customer first & last name, phone number, email address, years the customer has been active, total number of orders, orders per year (the customer value) - What criteria allows a record to be included in the result set?
This question isn't always necessary, but it usually is. At times, the requirements may simply call for a list of all customers, in which case, criteria is removed from the equation. For our needs, we don't want just any customer, but one that has returned to place additional orders after their first order.
The customer must have placed more than one order.
- How should the records be organized?
While the answer to this question is, many times, a simple response such as alphabetically by last name, often times you'll be required to provide more advanced sorting. What if the first and last names are combined into one field, you'll need to use a string function to grab the last name before applying the sort. Defining the sorting fields and method beforehand make the next step that much easier.
Order the report by the number of orders a customer placed per year, descending (highest number of customer orders per year will be the first row in the result)
- How many records are needed in the final result set?
This, of course, goes back to our previous mention of memory usage. The larger the result set, the more space it consumes within the calling programs allocated memory. A keyword search is a common culprit of unnecessarily lengthy result sets. How many users are willing to browse through 1,000 search results? A hard limit of 100 search results is probably all we need in this case, perhaps even less.
We're interested in the top 100 most valuable customers.
High-Level Design / Pseudocode
In the case of a database query, at this point, you should be in good shape to start creating some pseudocode. It will no doubt be a waste of your time to step through every piece of the software lifecycle for a single query. The software cycle was designed for large projects. Individual components of that project don't require a reiteration through the entire process. Designing a query is a lot like creating a function or method. Once you know that function's purpose, you can jump into the pseudocode process. Creating pseudocode for a query may not be a concept you're used to, but why should a query be any different? To start, you can define the basic components that make up the query.
- Determine what fields from which tables need to be returned (selected)
- Add FROM statement, using table with most selected fields first
- Add JOINs for remaining tables that have returned fields
- Create GROUP clause (if needed)
- Create WHERE clause
- Examine the WHERE clause for tables which have not yet been JOINED, then JOIN them as necessary
- Add LIMIT and ORDER clauses last, as they are least coupled with the other clauses
Let's use the pieces identified above to write some pseudocode for the requirements defined above:
SELECT customer first name, customer last name, phone number, email address, total orders, years since registered, orders per year
FROM customers, orders
GROUP BY customer
HAVING order count > 1
ORDER BY order per year DESC
Everything here looks straightforward, except orders per year. Writing pseudocode often reveals areas that need further definition. So, we decide to define orders per year as follows:
Years = The difference, in days, between today and the day the customer signed up to the site, divided by 365.
Orders Per Year = Total Number of Orders, divided by Years
The Moment of Truth
It's time to write the query. I use "Moment of Truth", because this is where we find out how well our high-level design / pseudocode was thought out.
If you're working from a command-line, good luck. You may be a hardcore, life-time Linux administrator, but the command prompt is no place to test multi-line queries (or anything multi-line for that matter). Open up your favorite database workbench/query tool. I personally use PHPMyAdmin. I like it because it sits on a remote server, so I can access it from any PC. If you do take this approach, make sure to secure your access to PHPMyAdmin via SSL (create a self-signed SSL certificate if you have to), and consider restricting access only to your IP address (use Apache Allow, Deny directives).
SELECT u.userid, u.firstname, u.lastname, u.email, count(1) as orders,
(DateDiff(now(), u.ts_created) / 365) as years,
Round((count(1) / (DateDiff(now(), u.ts_created) / 365)), 1) as orders_per_year
FROM users u
INNER JOIN orders o
ON o.userid = u.userid
GROUP BY u.userid
HAVING count(1) > 1
ORDER BY orders_per_year DESC
LIMIT 100