
atomic load/store

Your program's execution time is limited by the speed of system memory. Or as I would say, it is memory bound. (Your system memory is almost certainly much slower than your processor.)

I'm curious to know... Why do you think that an atomic write to memory does not synchronize with another processor? Is it because you specified std::memory_order_relaxed? It must still write to memory, don't you think? 
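To make that question concrete, here is a minimal sketch in which a value stored with std::memory_order_relaxed still reaches a second thread. The helper name relaxed_handoff and the "0 means empty" convention are my own assumptions for illustration, not anything from your code. Relaxed ordering gives up ordering guarantees relative to other memory operations, not atomicity; the standard only asks that the store become visible "in a reasonable amount of time", which in practice it does.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: a writer thread publishes `value` with a relaxed
// store, and the calling thread spins on relaxed loads until it sees it.
// Assumes `value` is non-zero, because 0 is used as the "empty" marker.
int relaxed_handoff(int value) {
    assert(value != 0);
    std::atomic<int> slot{0};  // 0 means "nothing written yet"

    std::thread writer([&slot, value] {
        // Relaxed: atomic, but no ordering with respect to other memory.
        slot.store(value, std::memory_order_relaxed);
    });

    int seen = 0;
    while ((seen = slot.load(std::memory_order_relaxed)) == 0) {
        std::this_thread::yield();  // let the writer run
    }

    writer.join();
    return seen;
}
```

Calling relaxed_handoff(42) from main should return 42 once the store has propagated. What relaxed does not promise is anything about other, non-atomic data written before the store.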

In a multi-processor environment the threads must constantly send information back and forth to one another (via system memory), resulting in an obscene number of L1 cache misses.

The same hardware running a single thread will experience no such delay: there is no arithmetic going on (in other words, the test is still memory bound), all the memory operations can remain in L1 cache, and no write-through is required (because no other processor in the system accesses that memory).
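If you want to see the single-thread versus multi-thread difference for yourself, something like the following rough sketch should do. The helper hammer_shared_counter and its parameters are hypothetical names of mine, and I use fetch_add rather than a plain store only so that the final count can be checked; the cache-line traffic between cores is the same idea.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: `num_threads` threads each perform `iters` relaxed
// increments on ONE shared atomic counter. Returns the elapsed wall-clock
// time in seconds and writes the final counter value to `total`.
double hammer_shared_counter(unsigned num_threads, std::uint64_t iters,
                             std::uint64_t& total) {
    assert(num_threads > 0);
    std::atomic<std::uint64_t> counter{0};

    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                // Every thread touches the same cache line, so with more
                // than one thread the line bounces between cores.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    auto stop = std::chrono::steady_clock::now();
    total = counter.load(std::memory_order_relaxed);
    return std::chrono::duration<double>(stop - start).count();
}
```

Call it once with num_threads set to 1 and once with 2 (same iters) and compare the returned times. I would expect the two-thread run to take noticeably longer per increment on most multi-core machines, but the actual numbers depend entirely on your hardware, so treat it as an experiment rather than a prediction.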

You are likely observing the slow speed of your system memory due to cache misses and the wonders of multi-processor system architecture.

I suspect your penalties will be even more severe if non-uniform memory is involved, because the second thread may execute on a CPU that has its data in non-local memory.

(If you think I don't understand your scenario correctly, please feel free to set me straight.)


