User:FrankYang43338/sandbox

Overview[edit]

A Coherence-Free Processor (CFP)^[1]^[2]contains a Cache Collaboration Protocol that permits it to delegate tasks to another CFP without requiring cache coherence. As a result, CFPs are scalable. Scalable enables you to add core processors to your device just as you can connect drives and memory. CFPs would permit having a core processor dedicated to each AI GPU. Current computers are not scalable because adding core processors creates coherence. A CFP allows you to simply add more core processors.

Coherence is Binomial[edit]

The process of cache synchronization is known as coherence. There are a few communication protocols that may be used to synchronize. The illustration on the right shows how coherency synchronizes the caches. Hardware developers have been trying to reduce this coherence overhead. Their efforts have proven fruitless because the overhead is binomial. Binomial overhead causes coherence overhead to explode almost geometrically when cores are added. (See Birthday problem )

The issue has to be resolved before the problem becomes binomial. Software tasks are able to share data without having a binomial problem. Prior to 1973, this was done with a software lock such as the Test and Set instruction. In 1973 this function was replaced with an uninterruptible Compare and Swap instruction which also eliminated the lock in many cases. The swap is conditional and only occurs when the comparison is true. The instruction is uninterruptible because both steps must complete prior to allowing access by another task. However the hardware cannot resolve the binomial problem because the solution requires information that the software does not pass to the hardware. Passing that information requires a new instruction.

System Architecture[edit]

(Note: For clarity, this section assumes processor affinity. Swapping to another processor can be implemented by immediate cast-out to eliminate residual cache data.)

Software[edit]

The CFP runs existing software. However a new instruction is required to maintain current price/performance. The new instruction designates whether the allocation is for shared memory or exclusive memory. The typical user program contains no shared data, so the default memory allocation would be exclusive. Since system programs work with shared data, utilizing the exclusive cache would require allocating both.

The new allocation provides one immediate benefit. Cache coherence is currently performed on all data. It only needs to be performed on shared data.

Hardware[edit]

The CFP requires three methods of handling data, all without coherence.

1) Exclusive Cache[edit]

1) An exclusive cache is used for memory that is not shared. It might also contain static common data items, such as program code, that are shared but do not change (shared R/O). The contents of an exclusive cache do not require coherence.

2) Shared Memory[edit]

Shared memory is not stored in a cache and updates are directly performed in shared memory. Bypassing the cache results in no coherence.

3) Atomic Hardware Synchronization Mechanism[edit]

The atomic hardware synchronization mechanism is currently implemented with coherence because it is performed in the cache. The CFP replaces this atomic hardware synchronization mechanism with one that performs the atomic instruction directly on main memory. Serialization replaces coherence.

Migration[edit]

Current software logic is maintained, the change is to a transparent cache. Shared memory is always consistent because it is never stored in a cache. Converting to the new allocation instruction is solely for performance.

Performance[edit]

CFP performance compares favorably to a multiprocessor because data in the exclusive cache have no coherence. The multiprocessor has the advantage of performing shared reads from a cache, however the software avoids repeat reads of shared data because it is subject to change. Software should use snapshot copies. Shared writes are similar but a multiprocessor write-thru causes coherence in the form of invalidation.

Incremental Growth[edit]

A CFP permits incremental processor growth with a constant price/performance. No new technology is required for performance.

Multitasking[edit]

CFPs provide faster performance by permitting the delegation of tasks residing in the multitasking queue to other CFPs. This reduces elapsed time though not execution time. CFPs multitask when there are insufficient processors. CFPs enable the addition of devices that contain their own core processor, thereby not impacting the performance of the existing system.

CFP Advantages (enabled by sufficient processors)[edit]

Processor affinity, no load balancing. Load balancing creates coherence.
Database managers can run on dedicated processors which enables execution without locks.
CFPs that run tasks do not require the entire operating system.
Time tasks spend in a multitasking queue can be reduced to zero. Elapsed time might equal execution time.
CFPs can reside on different chips, because they do not have coherence.
Any device that connects to a bus could contain a CFP.

References[edit]

^ Yang, Frank. "CFP: Coherence-Free Processor Design". Journal of Computer Science and Technology. doi:10.1007/s11390-023-3964-5. {{cite journal}}: External link in |website= (help)
^ WIPO Patent

[1] Yang, Frank. "CFP: Coherence-Free Processor Design". Journal of Computer Science and Technology. doi:10.1007/s11390-023-3964-5. {{cite journal}}: External link in |website= (help)

[2] WIPO Patent

[1]

[2]