Execute Program

SQL: Join Mistakes

Welcome to the Join Mistakes lesson!

This lesson is shown as static text below. However, it's designed to be used interactively. Click the button below to start!

  • Let's examine some subtle mistakes that we might make with SQL joins. First, what happens if we forget to add the ON condition to an inner join?

  • With no ON, an inner join returns every combination of the left table's rows with the right table's rows. If the left table has N rows and the right table has M rows, then the join will have N*M total rows.

  • However, we won't notice that mistake if we only test with simple cases! For example, imagine that we're joining users with their comments, but we forget the ON. If we only test with one user and one comment, then the results will look correct.

  • >
    exec(`
    CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
    );
    CREATE TABLE comments (
    user_id INTEGER REFERENCES users(id) NOT NULL,
    comment_text TEXT NOT NULL
    );
    -- Amir has written a comment.
    INSERT INTO users (name) VALUES ('Amir');
    INSERT INTO comments (
    user_id,
    comment_text
    ) VALUES (1, 'Ms. Fluff needs a bath!');
    -- "Test" our JOIN by querying all of the users and comments.
    SELECT
    users.name AS name,
    comments.comment_text AS comment_text
    FROM users JOIN comments;
    `);
    Result:
    [{name: 'Amir', comment_text: 'Ms. Fluff needs a bath!'}]Pass Icon
  • The ON is missing, but our test doesn't notice! There's a great rule of thumb in testing that goes something like: "If you're testing code that deals with arrays, write separate tests for 0 elements, 1 element, and many elements." We can imagine a similar process for testing joins. When joining tables, it's a good idea to test multiple cases, like:

    • There are no rows at all.
    • Only the left table has rows.
    • Only the right table has rows.
    • Both tables have multiple rows.
  • (Sometimes you'll want to add even more cases: "what happens if the left table has multiple rows that match one row on the right?" Etc. As usual, the amount of testing should match your confidence in the code.)

  • Testing is deep and subtle, but there's a baseline rule that you should always use when writing a join: never test a join by inserting only one row into each table. That "test" is too loose, like the one above. Almost any join will pass it, even if it's the wrong join!

  • That rule will catch other mistakes as well. For example, what happens if we accidentally flip the order of the tables in a LEFT JOIN?

  • In the next example, we meant to select from users LEFT JOIN comments. However, we mixed up left and right joins, so we accidentally selected from comments LEFT JOIN users. (It's common to mix up the different join types, so this is a realistic mistake.)

  • We only test the join with one user and one comment. It could be a LEFT JOIN, a RIGHT JOIN, an INNER JOIN, or an OUTER JOIN. They're all the same for that test case, but only one is correct!

  • >
    exec(`
    CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
    );
    CREATE TABLE comments (
    user_id INTEGER REFERENCES users(id) NOT NULL,
    comment_text TEXT NOT NULL
    );
    -- Amir has written a comment.
    -- This is a bad test case: it won't notice if we switch the tables!
    INSERT INTO users (name) VALUES ('Amir');
    INSERT INTO comments (
    user_id,
    comment_text
    ) VALUES (1, 'Ms. Fluff needs a bath!');
    -- This join contains an error: the table order is flipped.
    -- But our test can't tell!
    SELECT
    users.name AS name,
    comments.comment_text AS comment_text
    FROM comments LEFT JOIN users
    ON users.id = comments.user_id
    `);
    Result:
    [{name: 'Amir', comment_text: 'Ms. Fluff needs a bath!'}]Pass Icon
  • One final type of join mistake. So far, all of our examples have used foreign keys in the ON. Usually, that's a good idea. Joining on foreign keys means that we don't have to worry about cases where the join condition column has a value, but it doesn't match any value in the other table.

  • However, the database will let us write any ON that we want. It doesn't care whether there's a foreign key or not. For example, it will let us join against the wrong table's ID.

  • (Be careful with this example. It will successfully produce one joined row of "user" with comment, but the "user"'s name will be Ms. Fluff, not Amir!)

  • >
    exec(`
    CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
    );
    CREATE TABLE cats (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
    );
    CREATE TABLE comments (
    user_id INTEGER REFERENCES users(id) NOT NULL,
    comment_text TEXT NOT NULL
    );
    -- Ms. Fluff is a cat (with ID 1).
    INSERT INTO cats (name) VALUES ('Ms. Fluff');
    -- Amir (user ID 1) has written a comment.
    INSERT INTO users (name) VALUES ('Amir');
    INSERT INTO comments (
    user_id,
    comment_text
    ) VALUES (1, 'Ms. Fluff needs a bath!');
    -- Select all of the comments joined with "users".
    -- But we accidentally join against cats' IDs instead of users' IDs.
    -- The database won't stop us!
    -- It will happily match up comments' "user_id"s with cats' "id"s.
    SELECT
    name,
    comment_text
    FROM cats JOIN comments
    ON id = comments.user_id;
    `);
    Result:
    [{name: 'Ms. Fluff', comment_text: 'Ms. Fluff needs a bath!'}]Pass Icon
  • We can tell that this is wrong because no cat who has ever existed would say that they need a bath!

  • Like our previous examples, more thorough testing will highlight this kind of mistake.

  • There's another method to spot it as well. In past lessons, we've recommended listing the columns in a SELECT on separate lines, like we did in this example. For joins, we've also recommended explicitly naming those columns' tables, like SELECT users.name. However, we intentionally didn't follow that advice above: we selected only the names of the columns without specifying which tables they came from.

  • Qualifying the column names with their tables makes it easier to spot this kind of mistake. In order for this mistake to happen, we'd have to write SELECT cats.name. Just seeing the table name a second time gives us a chance to notice the mistake.

  • Hopefully, we type the correct table name the second time: users.name. If we do that while selecting from a join that doesn't include users, we'll get an error. (You can type "error" to indicate an error.")

  • >
    exec(`
    CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
    );
    CREATE TABLE cats (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
    );
    CREATE TABLE comments (
    user_id INTEGER REFERENCES users(id) NOT NULL,
    comment_text TEXT NOT NULL
    );
    -- Ms. Fluff is a cat (with ID 1).
    INSERT INTO cats (name) VALUES ('Ms. Fluff');
    -- Amir (user ID 1) has written a comment.
    INSERT INTO users (name) VALUES ('Amir');
    INSERT INTO comments (
    user_id,
    comment_text
    ) VALUES (1, 'Ms. Fluff needs a bath!');
    -- Select all of the comments joined with "users".
    -- But we accidentally join against cats' IDs instead of users' IDs.
    -- The database won't stop us!
    -- It will happily match up comments' "user_id"s with cats' "id"s.
    --
    -- Naming the tables explicitly gives us a second chance to notice.
    SELECT
    users.name,
    comments.comment_text
    FROM cats JOIN comments
    ON id = comments.user_id;
    `);
    Result:
    Error: no such column: users.namePass Icon
  • SQL is quite good at preventing mistakes. That's what all of its constraint types are for. However, it can't save us from every mistake. Hopefully this lesson helps you to catch some mistakes earlier. We can reduce the advice here into two minimal rules:

    • When testing a join, use more than one row in each table.
    • When selecting from a join, qualify your selects with table names (like SELECT users.name) to make sure that you're selecting the right things.