Execute Program

SQL: Join Performance

Welcome to the Join Performance lesson!

This lesson is shown as static text below. However, it's designed to be used interactively. Click the button below to start!

  • In a previous lesson, we used JOIN to list all pairs of people and their cats. We could have done that by looping in JavaScript, querying to find each person's cats. In this lesson, we'll see why we chose to write a JOIN instead of using JavaScript.

  • (In this lesson, our database will always be set up with a people table and a cats table. The cats table has an owner_id, which is a foreign key to a person. Amir has a cat named Ms. Fluff, and Betty has a cat named Keanu. You'll see this in the setup for each example, but it's always the same, so you can skip reading it.)

  • First, let's get every person-and-cat pair using JavaScript.

  • Here's a code problem:

    Finish the function peopleAndCats() so that it creates a list of each pair of cat and person. For every cat, call results.push({person: person.name, cat: cat.name}) to build up a list of results.

    exec(`
    CREATE TABLE people (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL
    );
    CREATE TABLE cats (
    owner_id INTEGER REFERENCES people(id) NOT NULL,
    name TEXT NOT NULL
    );

    INSERT INTO people (id, name) VALUES (100, 'Amir');
    INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');
    INSERT INTO people (id, name) VALUES (200, 'Betty');
    INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');
    `);

    function peopleAndCats() {
    const results = [];
    const people = exec(`SELECT * FROM people`);
    for (const person of people) {
    const catsOwnedByPerson = exec(`
    SELECT * FROM cats WHERE cats.owner_id = ?
    `, [person.id]);
    for (const cat of catsOwnedByPerson) {
    results.push({person: person.name, cat: cat.name});
    }
    }
    return results;
    }
    peopleAndCats();
    Goal:
    [{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]
    Yours:
    [{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]Pass Icon
  • This is correct, in the sense that it will return the right results. But it has a big problem: its performance is terrible!

  • What happens if we have 10,000 people? Our outer loop for (const person of people) will run 10,000 times! We'll do a total of 10,001 queries: one query to find all of the people, then a separate cat query for each of the 10,000 people.

  • This is a very common problem when querying any kind of database, SQL or otherwise. It's called an "N+1 problem": we do 1 person query, then N (10,000) cat queries. That's too many queries; we can't afford to put that much load on the database for a single pageview on our site.

  • We'll fix this in stages. First, let's reduce the number of queries. We'll get all of the people, then get all of the cats. Then we'll write a nested loop:

    • For each person:
      • For each cat:
        • If this person is this cat's owner:
          • Add the cat to the results.
  • Here's a code problem:

    Add an entry to the results array for each person and cat where cat.owner_id equals person.id. (You can use results.push({person: person.name, cat: cat.name}) to add an entry to the results.) You won't need to add any more database queries.

    exec(`
    CREATE TABLE people (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL
    );
    CREATE TABLE cats (
    owner_id INTEGER REFERENCES people(id) NOT NULL,
    name TEXT NOT NULL
    );
    INSERT INTO people (id, name) VALUES (100, 'Amir');
    INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');
    INSERT INTO people (id, name) VALUES (200, 'Betty');
    INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');
    `);

    function peopleAndCats() {
    const results = [];
    const people = exec(`SELECT * FROM people`);
    const cats = exec(`SELECT * FROM cats`);
    for (const person of people) {
    for (const cat of cats) {
    if (cat.owner_id === person.id) {
    results.push({person: person.name, cat: cat.name});
    }
    }
    }
    return results;
    }
    peopleAndCats();
    Goal:
    [{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]
    Yours:
    [{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]Pass Icon
  • We reduced the number of queries to 2!

  • But this still has a performance problem. What if we have 10,000 people and 10,000 cats, but we only want to retrieve Amir and his cat Ms. Fluff? We could do it by calling our peopleAndCats function, then filtering its results. But we'll end up retrieving a total of 20,000 rows from the database even though we only need 2.

  • Also, even though we've reduced the number of queries, we've made our JavaScript very slow. The inner loop that looks at each cat runs 10,000 times for every iteration of the outer loop. With 10,000 people and 10,000 cats, the comparison of owner_id to person.id will run a total of 100,000,000 times (10,000 * 10,000). All of that just to find out that Amir owns Ms. Fluff. This method does not work in real systems.

  • Here's where relational databases show their usefulness. We can use a JOIN to get the same result.

  • Here's a code problem:

    Write a SELECT ... FROM ... JOIN ... ON ... query to get a list of all cats and people. Remember to select people.name AS person, cats.name AS cat to get the right column names.

    exec(`
    CREATE TABLE people (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL
    );
    CREATE TABLE cats (
    owner_id INTEGER REFERENCES people(id) NOT NULL,
    name TEXT NOT NULL
    );

    INSERT INTO people (id, name) VALUES (100, 'Amir');
    INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');
    INSERT INTO people (id, name) VALUES (200, 'Betty');
    INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');
    `);
    exec(`
    SELECT
    people.name AS person,
    cats.name AS cat
    FROM people
    JOIN cats
    ON people.id = cats.owner_id
    `);
    Goal:
    [{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]
    Yours:
    [{person: 'Amir', cat: 'Ms. Fluff'}, {person: 'Betty', cat: 'Keanu'}]Pass Icon
  • We got the same result in a single database query! This fixes the two performance problems we've talked about so far: it requires only one query, and it doesn't have any nested loops.

  • What's really going on in the JOIN, though? So far, we've been thinking about JOINs as nested loops, similar to the ones that we wrote in JavaScript above. Aren't we just asking the database to do 100,000,000 (10,000 * 10,000) iterations, which will be slow?

  • Fortunately, no! Nested loops are a perfect mental model for how JOIN works, but they're only a mental model. In reality, the database will optimize the query, rebuilding it to be more efficient while still giving the same results.

  • Here's a concrete example. (Your answer here should be identical to the one above, with an extra WHERE added.)

  • Here's a code problem:

    Write a JOIN query to find all cats that are owned by a person with a name of "Amir".

    exec(`
    CREATE TABLE people (
    id INTEGER PRIMARY KEY NOT NULL,
    name TEXT NOT NULL
    );
    CREATE TABLE cats (
    owner_id INTEGER REFERENCES people(id) NOT NULL,
    name TEXT NOT NULL
    );

    INSERT INTO people (id, name) VALUES (100, 'Amir');
    INSERT INTO cats (owner_id, name) VALUES (100, 'Ms. Fluff');
    INSERT INTO people (id, name) VALUES (200, 'Betty');
    INSERT INTO cats (owner_id, name) VALUES (200, 'Keanu');
    `);
    exec(`
    SELECT
    people.name AS person,
    cats.name AS cat
    FROM people
    JOIN cats
    ON people.id = cats.owner_id
    WHERE people.name = 'Amir'
    `);
    Goal:
    [{person: 'Amir', cat: 'Ms. Fluff'}]
    Yours:
    [{person: 'Amir', cat: 'Ms. Fluff'}]Pass Icon
  • In the example above, WHERE people.name = ... tells the database that only one person matters. Then ON people.id = cats.owner_id tells it that it only needs to consider cats owned by that person. By understanding both of those limitations, the database can execute the query more intelligently. It will do something similar to exec(`SELECT * FROM cats WHERE owner_id = ?`, [amir.id]), selecting only the cats that are relevant.

  • This is impressive, but it's still a simple example. The bigger the query gets, the harder it is for a human to optimize manually. But the database has no such limitation; it will happily optimize any query that we come up with.

  • Imagine that we're joining across 7 different tables instead of just 2. (That's not extremely common, but it does happen.)

  • Each table has 10,000 records. With 7 nested loops, we'd require 10,000,000,000,000,000,000,000,000,000 iterations (10,000 to the 7th power). That would take something like 3,000,000,000,000,000 years.

  • If our database is set up properly, an 7-table join with a WHERE that matches only one row will execute in less than a millisecond. That lets us have our cake while eating it! We get to think about joins using a simple conceptual model: nested loops with an if inside. But the query actually executes in a much more intelligent way.

  • (Full disclosure: database optimizers aren't perfect. In real databases, we give them certain kinds of manual hints that we'll learn about in later lessons. In rare cases, the optimizer does a bad job and we have to change our queries significantly to work around it. But in 99.9% of cases, the optimizer will do what you want!)