SQL: Join Performance
Welcome to the Join Performance lesson!
This lesson is shown as static text below. However, it's designed to be used interactively. Click the button below to start!
In a previous lesson, we used
JOINto list all pairs of people and their cats. We could have done that by looping in JavaScript, querying to find each person's cats. In this lesson, we'll see why we chose to write aJOINinstead of using JavaScript.(In this lesson, our database will always be set up with a
peopletable and acatstable. The cats table has anowner_id, which is a foreign key to a person. Amir has a cat named Ms. Fluff, and Betty has a cat named Keanu. You'll see this in the setup for each example, but it's always the same, so you can skip reading it.)First, let's get every person-and-cat pair using JavaScript.
Here's a code problem:
Finish the function
peopleAndCats()so that it creates a list of each pair of cat and person. For every cat, callresults.push({person: person.name, cat: cat.name})to build up a list of results.exec(`CREATE TABLE people (id INTEGER PRIMARY KEY NOT NULL,name TEXT NOT NULL);CREATE TABLE cats (owner_id INTEGER REFERENCES people(id) NOT NULL,name TEXT NOT NULL);INSERT INTO people (id, name) VALUES (100, 'Amir');INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');INSERT INTO people (id, name) VALUES (200, 'Betty');INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');`);function peopleAndCats() {const results = [];const people = exec(`SELECT * FROM people`);for (const person of people) {const catsOwnedByPerson = exec(`SELECT * FROM cats WHERE cats.owner_id = ?`, [person.id]);for (const cat of catsOwnedByPerson) {results.push({person: person.name, cat: cat.name});}}return results;}peopleAndCats();- Goal:
[{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]- Yours:
[{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]
This is correct, in the sense that it will return the right results. But it has a big problem: its performance is terrible!
What happens if we have 10,000 people? Our outer loop
for (const person of people)will run 10,000 times! We'll do a total of 10,001 queries: one query to find all of the people, then a separate cat query for each of the 10,000 people.This is a very common problem when querying any kind of database, SQL or otherwise. It's called an "N+1 problem": we do 1 person query, then N (10,000) cat queries. That's too many queries; we can't afford to put that much load on the database for a single pageview on our site.
We'll fix this in stages. First, let's reduce the number of queries. We'll get all of the people, then get all of the cats. Then we'll write a nested loop:
- For each person:
- For each cat:
- If this person is this cat's owner:
- Add the cat to the results.
- If this person is this cat's owner:
- For each cat:
- For each person:
Here's a code problem:
Add an entry to the results array for each person and cat where
cat.owner_idequalsperson.id. (You can useresults.push({person: person.name, cat: cat.name})to add an entry to the results.) You won't need to add any more database queries.exec(`CREATE TABLE people (id INTEGER PRIMARY KEY NOT NULL,name TEXT NOT NULL);CREATE TABLE cats (owner_id INTEGER REFERENCES people(id) NOT NULL,name TEXT NOT NULL);INSERT INTO people (id, name) VALUES (100, 'Amir');INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');INSERT INTO people (id, name) VALUES (200, 'Betty');INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');`);function peopleAndCats() {const results = [];const people = exec(`SELECT * FROM people`);const cats = exec(`SELECT * FROM cats`);for (const person of people) {for (const cat of cats) {if (cat.owner_id === person.id) {results.push({person: person.name, cat: cat.name});}}}return results;}peopleAndCats();- Goal:
[{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]- Yours:
[{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]
We reduced the number of queries to 2!
But this still has a performance problem. What if we have 10,000 people and 10,000 cats, but we only want to retrieve Amir and his cat Ms. Fluff? We could do it by calling our
peopleAndCatsfunction, then filtering its results. But we'll end up retrieving a total of 20,000 rows from the database even though we only need 2.Also, even though we've reduced the number of queries, we've made our JavaScript very slow. The inner loop that looks at each cat runs 10,000 times for every iteration of the outer loop. With 10,000 people and 10,000 cats, the comparison of
owner_idtoperson.idwill run a total of 100,000,000 times (10,000 * 10,000). All of that just to find out that Amir owns Ms. Fluff. This method does not work in real systems.Here's where relational databases show their usefulness. We can use a
JOINto get the same result.Here's a code problem:
Write a
SELECT ... FROM ... JOIN ... ON ...query to get a list of all cats and people. Remember to selectpeople.name AS person, cats.name AS catto get the right column names.exec(`CREATE TABLE people (id INTEGER PRIMARY KEY NOT NULL,name TEXT NOT NULL);CREATE TABLE cats (owner_id INTEGER REFERENCES people(id) NOT NULL,name TEXT NOT NULL);INSERT INTO people (id, name) VALUES (100, 'Amir');INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');INSERT INTO people (id, name) VALUES (200, 'Betty');INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');`);exec(`SELECTpeople.name AS person,cats.name AS catFROM peopleJOIN catsON people.id = cats.owner_id`);- Goal:
[{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]- Yours:
[{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]
We got the same result in a single database query! This fixes the two performance problems we've talked about so far: it requires only one query, and it doesn't have any nested loops.
What's really going on in the
JOIN, though? So far, we've been thinking aboutJOINs as nested loops, similar to the ones that we wrote in JavaScript above. Aren't we just asking the database to do 100,000,000 (10,000 * 10,000) iterations, which will be slow?Fortunately, no! Nested loops are a perfect mental model for how
JOINworks, but they're only a mental model. In reality, the database will optimize the query, rebuilding it to be more efficient while still giving the same results.Here's a concrete example. (Your answer here should be identical to the one above, with an extra
WHEREadded.)Here's a code problem:
Write a
JOINquery to find all cats that are owned by a person with anameof "Amir".exec(`CREATE TABLE people (id INTEGER PRIMARY KEY NOT NULL,name TEXT NOT NULL);CREATE TABLE cats (owner_id INTEGER REFERENCES people(id) NOT NULL,name TEXT NOT NULL);INSERT INTO people (id, name) VALUES (100, 'Amir');INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');INSERT INTO people (id, name) VALUES (200, 'Betty');INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');`);exec(`SELECTpeople.name AS person,cats.name AS catFROM peopleJOIN catsON people.id = cats.owner_idWHERE people.name = 'Amir'`);- Goal:
[{person: 'Amir', cat: 'Ms. Fluff'}]- Yours:
[{person: 'Amir', cat: 'Ms. Fluff'}]
In the example above,
WHERE people.name = ...tells the database that only one person matters. ThenON people.id = cats.owner_idtells it that it only needs to consider cats owned by that person. By understanding both of those limitations, the database can execute the query more intelligently. It will do something similar toexec(`SELECT * FROM cats WHERE owner_id = ?`, [amir.id]), selecting only the cats that are relevant.This is impressive, but it's still a simple example. The bigger the query gets, the harder it is for a human to optimize manually. But the database has no such limitation; it will happily optimize any query that we come up with.
Imagine that we're joining across 7 different tables instead of just 2. (That's not extremely common, but it does happen.)
Each table has 10,000 records. With 7 nested loops, we'd require 10,000,000,000,000,000,000,000,000,000 iterations (10,000 to the 7th power). That would take something like 3,000,000,000,000,000 years.
If our database is set up properly, an 7-table join with a
WHEREthat matches only one row will execute in less than a millisecond. That lets us have our cake while eating it! We get to think about joins using a simple conceptual model: nested loops with anifinside. But the query actually executes in a much more intelligent way.(Full disclosure: database optimizers aren't perfect. In real databases, we give them certain kinds of manual hints that we'll learn about in later lessons. In rare cases, the optimizer does a bad job and we have to change our queries significantly to work around it. But in 99.9% of cases, the optimizer will do what you want!)