Transaction Processing
Now let's move on to an important feature in any database system: transaction processing.
A transaction is a group of one or more SQL commands treated as a unit. PostgreSQL promises that all commands within a transaction will complete or that none of them will complete. If any command within a transaction does not complete, PostgreSQL will roll back all changes made within the transaction.
PostgreSQL makes use of transactions to ensure database consistency. Transactions are needed to coordinate updates made by two or more concurrent users. Changes made by a transaction are not visible to other users until the transaction is committed. When you commit a transaction, you are telling PostgreSQL that all the changes made within the transaction are logically complete, the changes should be made permanent, and the changes should be exposed to other users. When you roll back a transaction, you are telling PostgreSQL that the changes made within the transaction should be discarded and not made visible to other users.
To start a new transaction, execute a BEGIN13 command. To complete the transaction and have PostgreSQL make your changes permanent, execute the COMMIT command. If you want PostgreSQL to revert all changes made within the current transaction, execute the ROLLBACK command.
It's important to realize that all SQL commands execute within a transaction. If you don't explicitly BEGIN a transaction, PostgreSQL will automatically execute each command within its own transaction.
Persistence
I used to think that single-command transactions were pretty useless: I was wrong. Single-command transactions are important because a single command can access multiple rows. Consider the following: Let's add a new constraint to the customers table.
movies=# ALTER TABLE customers ADD CONSTRAINT movies-# balance_exceeded CHECK( balance <= 50 );
This constraint ensures that no customer is allowed to have a balance exceeding $50.00. Just to prove that it works, let's try setting a customer's balance to some value greater than $50.00:
movies=# UPDATE CUSTOMERS SET balance = 100 where customer_id = 1; ERROR: ExecReplace: rejected due to CHECK constraint balance_exceeded
You can see that the UPDATE is rejected. What happens if you try to update more than one row? First, let's look at the data already in the customers table:
movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 8 | Wink Wankel | 555-1000 | 1988-12-25 | 0.00 (5 rows)
Now, try to UPDATE every row in this table:
movies=# UPDATE customers SET balance = balance + 40; ERROR: ExecReplace: rejected due to CHECK constraint balance_exceeded
This UPDATE command is rejected because adding $40.00 to the balance for Rubin, William violates the balance_exceeded constraint. The question is, were any of the customers updated before the error occurred? The answer is: probably. You don't really know for sure because any changes made before the error occurred are rolled back. The net effect is that no changes were made to the database:
movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 8 | Wink Wankel | 555-1000 | 1988-12-25 | 0.00 (5 rows)
If some of the changes persisted while others did not, you would have to somehow find the persistent changes yourself and revert them. You can see that single-command transactions are far from useless. It took me awhile to learn that lesson.
What about multicommand transactions? PostgreSQL treats a multicommand transaction in much the same way that it treats a single-command transaction. A transaction is atomic, meaning that all the commands within the transaction are treated as a single unit. If any of the commands fail to complete, PostgreSQL reverts the changes made by other commands within the transaction.
Transaction Isolation
I mentioned earlier in this section that the changes made within a transaction are not visible to other users until the transaction is committed. To be a bit more precise, uncommitted changes made in one transaction are not visible to other transactions14.
Transaction isolation helps to ensure consistent data within a database. Let's look at a few of the problems solved by transaction isolation.
Consider the following transactions:
User: bruce |
Time |
User: sheila |
BEGIN TRANSACTION |
T1 |
BEGIN TRANSACTION |
UPDATE customers |
T2 |
|
SET balance = balance - 3 |
|
|
WHERE customer_id = 2; |
|
|
|
T3 |
SELECT SUM( balance ) |
|
|
FROM customers; |
|
T4 |
COMMIT TRANSACTION; |
ROLLBACK TRANSACTION; |
T5 |
|
At time T1, bruce and sheila each begin a new transaction. bruce updates the balance for customer 3 at time T1. At time T3, sheila computes the SUM() of the balances for all customers, completing her transaction at time T4. At time T5, bruce rolls back his transaction, discarding all changes within his transaction. If these transactions were not isolated from each other, sheila would have an incorrect answer: Her answer was calculated using data that was rolled back.
This problem is known as the dirty read problem: without transaction isolation, sheila would read uncommitted data. The solution to this problem is known as READ COMMITTED. READ COMMITTED is one of the two transaction isolation levels supported by PostgreSQL. A transaction running at the READ COMMITTED isolation level is not allowed to read uncommitted data. I'll show you how to change transaction levels in a moment.
There are other data consistency problems that are avoided by isolating transactions from each other. In the following scenario, sheila will receive two different answers within the same transaction:
User: bruce |
Time |
User: sheila |
BEGIN TRANSACTION; |
T1 |
BEGIN TRANSACTION; |
|
T2 |
SELECT balance |
|
|
FROM customers |
|
|
WHERE customer_id = 2; |
UPDATE customers |
|
|
SET balance = 20 |
|
|
WHERE customer_id = 2; |
T3 |
|
COMMIT TRANSACTION; |
T4 |
|
|
T5 |
SELECT balance |
|
|
FROM customers |
|
|
WHERE customer_id = 2; |
|
T6 |
COMMIT TRANSACTION; |
Again, bruce and sheila each start a transaction at time T1. At T2, sheila finds that customer 2 has a balance of $15.00. bruce changes the balance for customer 2 from $15.00 to $20.00 at time T3 and commits his change at time T4. At time T5, sheila executes the same query that she executed earlier in the transaction, but this time she finds that the balance is $20.00. In some applications, this isn't a problem; in others, this interference between the two transactions is unacceptable. This problem is known as the non-repeatable read.
Here is another type of problem:
User: bruce |
Time |
User: sheila |
BEGIN TRANSACTION; |
T1 |
BEGIN TRANSACTION; |
|
T2 |
SELECT * FROM customers; |
INSERT INTO customers VALUES |
T3 |
|
( |
|
|
6, |
|
|
'Neville, Robert', |
|
|
'555-9999', |
|
|
'1971-03-20', |
|
|
0.00 |
|
|
); |
|
|
COMMIT TRANSACTION; |
T4 |
|
|
T5 |
SELECT * FROM customers; |
|
T6 |
COMMIT TRANSACTION; |
In this example, sheila again executes the same query twice within a single transaction. This time, bruce has inserted a new row in between the sheila's queries. Notice that this is not a case of a dirty readbruce has committed his change before sheila executes her second query. At time T5, sheila finds a new row. This is similar to the non-repeatable read, but this problem is known as the phantom read problem.
The answer to both the non-repeatable read and the phantom read is the SERIALIZABLE transaction isolation level. A transaction running at the SERIALIZABLE isolation level is only allowed to see data committed before the transaction began.
In PostgreSQL, transactions usually run at the READ COMMITTED isolation level. If you need to avoid the problems present in READ COMMITTED, you can change isolation levels using the SET TRANSACTION command. The syntax for the SET TRANSACTION command is
SET TRANSACTION ISOLATION LEVEL { READ COMMITTED | SERIALIZABLE };
The SET TRANSACTION command affects only the current transaction (and it must be executed before the first DML15 command within the transaction). If you want to change the isolation level for your session (that is, change the isolation level for future transactions), you can use the SET SESSION command:
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL { READ COMMITTED | SERIALIZABLE }
Multi-Versioning and Locking
Most commercial (and open-source) databases use locking to coordinate multiuser updates. If you are modifying a table, that table is locked against updates and queries made by other users. Some databases perform page-level or row-level locking to reduce contention, but the principle is the sameother users must wait to read the data you have modified until you have committed your changes.
PostgreSQL uses a different model called multi-versioning, or MVCC for short (locks are still used, but much less frequently than you might expect). In a multi-versioning system, the database creates a new copy of the rows you have modified. Other users see the original values until you commit your changesthey don't have to wait until you finish. If you roll back a transaction, other users are not affectedthey did not have access to your changes in the first place. If you commit your changes, the original rows are marked as obsolete and other transactions running at the READ COMMITTED isolation level will see your changes. Transactions running at the SERIALIZABLE isolation level will continue to see the original rows. Obsolete data is not automatically removed from a PostgreSQL database. It is hidden, but not removed. You can remove obsolete rows using the VACUUM command. The syntax of the VACUUM command is
VACUUM [ VERBOSE ] [ ANALYZE ] [ table ]
I'll talk about the VACUUM command in more detail in the next chapter.
The MVCC transaction model provides for much higher concurrency than most other models. Even though PostgreSQL uses multiple versions to isolate transactions, it is still necessary to lock data in some circumstances.
Try this experiment. Open two psql sessions, each connected to the movies database. In one session, enter the following commands:
movies=# BEGIN WORK; BEGIN movies=# INSERT INTO customers VALUES movies-# ( 5, 'Manyjars, John', '555-8000', '1960-04-02', 0 ); INSERT
In the other session, enter these commands:
movies=# BEGIN WORK; BEGIN movies=# INSERT INTO customers VALUES movies-# ( 6, 'Smallberries, John', '555-8001', '1960-04-02', 0 ); INSERT
When you press the Enter (or Return) key, this INSERT statement completes immediately. Now, enter this command into the second session:
movies=# INSERT INTO customers VALUES movies-# ( 5, 'Gomez, John', '555-8000', '1960-04-02', 0 );
This time, when you press Enter, psql hangs. What is it waiting for? Notice that in the first session, you already added a customer whose customer_id is 5, but you have not yet committed this change. In the second session, you are also trying to insert a customer whose customer_id is 5. You can't have two customers with the same customer_id (because you have defined the customer_id column to be the unique PRIMARY KEY). If you commit the first transaction, the second session would receive a duplicate value error. If you roll back the first transaction, the second insertion will continue (because there is no longer a constraint violation). PostgreSQL won't know which result to give you until the transaction completes in the first session.