Talk:Three-phase commit protocol

I have removed large parts of the article and corrected the citation of the seminal report by Skeen, which was incorrectly a reference to a follow-up theoretical analysis by Skeen and Stonebraker. The claims made in the previous version of the Wikipedia article were theoretically unsound, and I wasted quite some time trying to convince myself otherwise, until I finally gave up and read the original papers. Contrary to what the article previously claimed, it is not possible to place an upper bound on the time it takes to resolve a distributed transaction without violating the basic soundness criterion. Such a bound would imply that one could solve the Two Generals' Problem in finite time. Indeed, it wasn't hard to find an example of a network partition where the timeout-based protocol would cause two cohorts to respectively commit and abort the same transaction. This defeats the whole point of the protocol, as it is no better than just sending the transaction to all cohorts and hoping for the best.

I haven't replaced the description with a better one. The seminal technical report by Skeen is publicly available and very readable, and I don't think I can describe it any better than he does. Note in particular that his description does not involve the use of timeouts at all: it is a quorum-based algorithm, and timeouts would be an implementation detail used to detect failures. Ulrik Rasmussen (talk) 08:52, 10 October 2019 (UTC)

The protocol presented on the page at present conforms to the Skeen article which actually differs slightly from the description given at. Specifically, the state transition on the cohort from prepared to committed only happens when receiving a commit message from the coordinator in the original article. Was there a change to the protocol in the meantime?

Agreed, and it's definitely not a slight discrepancy; the description of the cohort states matches neither the diagram shown nor the state diagram in the source material. I'd rather someone more knowledgeable about the subject matter commit a change though. --130.15.80.105 (talk) 16:07, 31 March 2009 (UTC)

I reformatted the protocol description at the bottom of the page to look similar to two-phase commit. Hope nobody minds. gba 18:56, 4 March 2006 (UTC)

Ah, I understood 3PC only after reading the description in Tanenbaum's book. Both the picture and the description contain fundamental mistakes.

1. Picture: on the coordinator's side, "Finalizing commit. Timeout causes abort". The coordinator MUST commit, because the cohorts are committed.

2. Coordinator's action, item #3: "However if the coordinator times out while waiting for an acknowledgement from a cohort, it will abort the transaction." This is also invalid, because after the prepared state the whole system has no way back. The transaction can only be committed, sooner or later.

The basic idea of the algorithm:

1. Both the coordinator and the cohorts change phases together, and only after all parties have entered the previous phase.

2. There is a "point of no return", after which the transaction can no longer be rolled back, only committed. If a party fails after that point, it will commit the transaction later, upon recovery.
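
The two points above can be sketched roughly as follows. This is only an illustration, not Skeen's protocol: the phase names, the `POINT_OF_NO_RETURN` constant, and both helper functions are made up for this example, and having a party roll back when it fails before that point is an assumption of the sketch.

```python
from enum import IntEnum


class Phase(IntEnum):
    # Hypothetical phase names, ordered by progress through the protocol.
    INITIAL = 0
    WAITING = 1
    PREPARED = 2
    COMMITTED = 3


# Assumption: once a party is prepared, rollback is no longer possible.
POINT_OF_NO_RETURN = Phase.PREPARED


def advance(coordinator, cohorts):
    """Move the coordinator one phase forward only if every cohort has
    already entered the coordinator's current phase (point 1)."""
    if coordinator < Phase.COMMITTED and all(c >= coordinator for c in cohorts):
        return Phase(coordinator + 1)
    return coordinator


def outcome_after_recovery(phase_at_failure):
    """Decide what a recovering party does with the transaction (point 2)."""
    if phase_at_failure >= POINT_OF_NO_RETURN:
        return "commit"
    # Assumption: a party that failed before the point rolls back.
    return "abort"
```

With these definitions, the coordinator stays in `WAITING` as long as any cohort is still in `INITIAL`, and a party that failed while `PREPARED` commits when it comes back.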

Shmuma (talk) 10:48, 23 April 2010 (UTC)

Atomicity reliability
Since this is my first time posting a message to a Wikipedia discussion, I won't edit the article myself. I suggest this to the original author. Please correct me if I'm wrong!

You could make the two-phase commit protocol non-blocking in the same way as with the three-phase: by introducing timeouts. The problem with both the blocking and the non-blocking variant is the same: you can never be sure of the atomicity.

Consider the. If Cohort(i) sends an ACK message that gets lost because the link to the Cohort breaks, the Cohort will time out and commit the local transaction, while the Coordinator, not having received the ACK, will time out and abort. Even if the link gets restored, you can't later abort (roll back) the part already committed on the Cohort.
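
To make the scenario concrete, here is a minimal sketch. It assumes the timeout rules as read from the state diagram: a cohort that has sent its ACK commits on timeout, while a coordinator still waiting for ACKs aborts on timeout. The function names are made up for illustration.

```python
def cohort_on_timeout(ack_sent):
    # Assumed rule: a cohort that has already ACKed commits when it times out.
    return "commit" if ack_sent else "abort"


def coordinator_on_timeout(acks_received, cohort_count):
    # Assumed rule: the coordinator aborts if any ACK is still missing.
    return "commit" if acks_received == cohort_count else "abort"


def lost_ack_scenario():
    """One cohort sends its ACK, but the link breaks before it arrives."""
    cohort_decision = cohort_on_timeout(ack_sent=True)
    coordinator_decision = coordinator_on_timeout(acks_received=0, cohort_count=1)
    return cohort_decision, coordinator_decision
```

Calling `lost_ack_scenario()` gives `("commit", "abort")`: the two sides reach different outcomes for the same transaction.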

The basic problem with most kinds of commit protocols is called the Two Generals' Problem. If you add more and more layers of acknowledgements (acknowledgements to acknowledgements), the system gets more reliable but never perfect. On the downside, execution gets slower and slower.

Regards, Igorecan 13:14, 11 April 2006 (UTC)

Regarding your comments, Igorecan, I have to disagree. In the situation you mention, in 2PC with timeouts, this is how I believe it would go (according to my reading of Lampson93):
 * Cohort(i) sends an ACK message that gets lost. By the time any cohort can send an ACK, it has already been decided by the coordinator whether the transaction is committing or aborting. So this Cohort is ACKing a commit message in your scenario.
 * The Coordinator, knowing that this is a committed transaction, will time out on Cohort(i)'s response, and will again send a COMMIT message to Cohort(i).
 * Cohort(i), upon receiving a COMMIT message for a transaction that has already been committed, will know its ACK message was lost, and will resend the ACK.
 * This process might repeat many times until both the COMMIT message and the ACK message have been transmitted.
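
A rough sketch of this retry loop, under my own assumptions: the `Cohort` class, `deliver_commit`, and the `losses` list (one flag per message sent, `True` meaning that message is lost) are invented for illustration and are not taken from Lampson93. The `max_rounds` cap only keeps the sketch finite.

```python
class Cohort:
    def __init__(self):
        self.committed = False
        self.commit_messages_seen = 0

    def on_commit(self):
        # Committing is idempotent: a repeated COMMIT changes nothing
        # and simply triggers another ACK.
        self.commit_messages_seen += 1
        self.committed = True
        return "ACK"


def deliver_commit(cohort, losses, max_rounds=100):
    """Resend COMMIT until an ACK gets through; return the number of rounds.

    The outcome (commit) is already decided before this loop starts;
    only the delivery of that decision is retried.
    """
    loss_flags = iter(losses)
    for round_no in range(1, max_rounds + 1):
        if next(loss_flags, False):
            continue  # COMMIT lost: coordinator times out and resends
        reply = cohort.on_commit()
        if reply == "ACK" and not next(loss_flags, False):
            return round_no  # ACK arrived: coordinator can stop resending
        # ACK lost: coordinator times out and resends COMMIT
    raise RuntimeError("no ACK received within max_rounds")
```

For example, with `losses = [True, False, True, False, False]` the first COMMIT is lost, the second is delivered but its ACK is lost, and the third round succeeds, so the cohort sees COMMIT twice and the coordinator never changes its decision.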

Thanks, Nels Beckman 14:28, 7 September 2006 (UTC)

Hello again after a long time! First of all, I was speaking about the 3PC, as described by the state automata on the . Notice the P1 (coordinator), and the Pi (cohort) states. For one the timeout leads to abort and for the other it leads to commit. Isn't that wrong? I didn't study your link, but this confuses me: "By the time any cohort can send an ACK, it has already been decided by the coordinator whether the transaction is committing or aborting" - if the coordinator has decided, then what is the need for further ACKs or NACKs? I believe the timeouts necessarily introduce a degree of uncertainty whether both will actually abort or commit.

Regards, Igorecan (talk) 00:20, 15 November 2008 (UTC)

Figure
The figure is nice, but is inconsistent with the text, in that it shows "participants" rather than "cohorts". —Preceding unsigned comment added by Yagibear (talk • contribs) 22:51, 31 October 2007 (UTC)
 * Ah, good point. I'll update it sometime within the next few days. If I forget, feel free to send me an email to remind me. --Tjohns ✎ 18:46, 11 March 2008 (UTC)
 * ✅ Feel free to let me know if any other changes need to be made. Tjohns ✎ 06:49, 21 March 2008 (UTC)

- It appears that the figure is still inconsistent with the text. In particular, the figure seems to indicate that, for a cohort that has ACK'd a pre-commit but not received a do-commit, a timeout will cause a commit to take place. However the text says "In the prepared state, if the cohort receives an abort message from the coordinator, fails, or times out waiting for a commit, it aborts." 98.212.216.20 (talk) 18:38, 22 April 2008 (UTC)

modes of failure
I wanted to use this article as a brief introduction to the kinds of problems that must be considered in distributed consensus, but was disappointed by the brevity of the explanation of how this is an improvement over the two-phase commit. I think the discussion is fine as a definition for those already familiar with the domain, but needs a little more justification for pedagogical use. I will take a shot at this, and would welcome improvement from anyone.

MarkKampe (talk) 18:55, 13 March 2010 (UTC)

Unreferenced quote
This passage:

"Three-phase commit assumes a network with bounded delay and nodes with bounded response times; In most practical systems with unbounded network delay and process pauses, it cannot guarantee atomicity."

is a direct quote from Martin Kleppmann's Designing Data-Intensive Applications, p. 359.

I am not a regular contributor, so I am not sure what the right approach is to fix it.