User:FrankYang43338/sandbox

Disclaimer
This is not yet an official wiki page due to COI. It has been peer reviewed and published but not modeled It is currently remains an algorithm.

Overview
A Coherence-Free Processor (CFP) contains a cache collaboration protocol that permits it to delegate tasks to another CFP without requiring cache coherence. CFPs are thread-safe core processors, utilizing the same algorithm as the software. As a result, CFPs are scalable. Scalable enables you to add core processors to your device just as you can connect drives and memory. CFPs permit having a core processor attached to each AI GPU. Current computers are not scalable because adding core processors creates coherence. A CFP allows you to simply add more core processors.

Coherence is Binomial


Each core processor has a cache and cache coherence synchronizes multiple caches. The illustration on the right shows how coherency synchronizes the caches. There are a few communication protocols that may be used to synchronize. Hardware developers have focused on reducing this coherence overhead. Their efforts have proven fruitless because the overhead is binomial. Binomial overhead causes coherence overhead to explode almost geometrically when cores are added. (See Birthday problem )

The issue must be resolved before the problem becomes binomial. Software tasks are able to share data without having a binomial problem. Prior to 1973, this was done with a software lock such as the Test and Set instruction. In 1973 this function was replaced with an uninterruptible Compare and Swap instruction which also eliminated the lock in many cases. The swap is conditional and only occurs when the comparison is true. The instruction is uninterruptible because both steps must complete prior to allowing access by another task. However the hardware cannot resolve the binomial problem because the solution requires information that the software does not pass to the hardware. Passing that information requires a new memory allocation instruction.

System Architecture
(Note: For clarity, this section assumes processor affinity. Swapping to another processor can be implemented by immediate cast-out to eliminate residual cache data. More to the point is that having sufficient processors eliminates the need for swapping)

Software
The CFP runs existing software. However a new instruction is required to maintain current price/performance. The new instruction designates whether the allocation is for shared memory or exclusive memory. The typical user program contains no shared data, so the default memory allocation would be exclusive. Since system programs work with shared data, utilizing the exclusive cache would require allocating both.

The new allocation provides one immediate benefit. Cache coherence is currently performed on all data. However exclusive data does not require coherence. The prior mentioned change to compare and swap from test and set was more complicated because it involved a change to the software algorithm.

Hardware


The CFP requires three methods of handling data, all without coherence.

1) Exclusive Cache
1) An exclusive cache is used for memory that is not shared. It might also contain static common data items, such as program code, that are shared but do not change (shared R/O). The contents of an exclusive cache do not require coherence. It is a coherent cache.

2) Shared Memory
Shared memory is not stored in a cache and updates are directly performed in shared memory. Bypassing the cache results in no coherence. It is coherent memory. (Snoopy is also write-through.)

3) Atomic Hardware Synchronization Mechanism
The atomic hardware synchronization mechanism is currently implemented with coherence because it is performed in the cache. The CFP replaces this atomic hardware synchronization mechanism with one that performs the atomic instruction directly on main memory. Serialization replaces coherence. It is currently thread-safe but implemented with coherence. Implementing this mechanism without coherence permits a thread-safe core processor.

Migration
Current software logic is maintained, the change is to a transparent cache. Shared memory is coherent because it is never stored in a cache. Converting to the new allocation instruction is solely for performance.

Performance
CFP performance compares favorably to a multiprocessor because data in the exclusive cache have no coherence. The multiprocessor has the advantage of performing shared reads from a cache, however the software avoids repeat reads of shared data because it is subject to change. When software must reuse shared memory, it either uses a static snapshot copy or a mutex. Shared writes are similar but a current multiprocessor write-thru causes coherence in the form of cache invalidation.

Incremental Growth
A CFP permits incremental processor growth with a constant price/performance. No new technology is required for performance.

Multitasking
CFPs provide faster performance by permitting the delegation of tasks residing in the multitasking queue to other CFPs. This reduces elapsed time though not execution time. CFPs can multitask when there are insufficient processors. CFPs enable the addition of devices that contain their own core processor, thereby not impacting and potentially improving the performance of the existing system.

CFP Advantages (enabled by sufficient processors)
Processor affinity, no load balancing. Load balancing creates coherence. Database managers can run on dedicated processors which enables execution without locks. CFPs that run tasks do not require the entire operating system. Time tasks spend in a multitasking queue can be reduced to zero. Elapsed time might equal execution time. CFPs can reside on different chips, because they do not have coherence. Any device that connects to a bus could contain a CFP.