Try to partition the work so that each thread can load its own "piece" of the problem into its cache, work on it independently, and then arrive at a result. This may mean keeping a copy of sdata per thread; you then only have to resolve conflicts at the seams of your partitioning, not on every single operation.
Naive partitioning as you have used, where each thread greedily acquires the next of the billion computations, is a poor way to split the problem. You'd be better off computing the steps for i=0-499999999 on one thread and i=500000000-999999999 on the other, as sketched below. Dividing up the work to make efficient use of the threads is called partitioning, and it is the important part of designing a parallel solution.
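Here is a minimal sketch of that two-way split (the names kN, lo, and hi are stand-ins I've made up, and I'm assuming the work is a simple running sum like your loop). Each thread accumulates into its own private variable, so the only shared-state interaction is the single addition at the end:

```cpp
#include <cstdint>
#include <thread>

int main() {
    constexpr std::uint64_t kN = 1'000'000'000;   // hypothetical total, as in your loop

    // One private "sdata" per thread; no shared writes inside the hot loops.
    // (A real version might also pad these apart to avoid false sharing.)
    std::uint64_t lo = 0, hi = 0;
    std::thread t1([&lo] { for (std::uint64_t i = 0;      i < kN / 2; ++i) lo += i; });
    std::thread t2([&hi] { for (std::uint64_t i = kN / 2; i < kN;     ++i) hi += i; });
    t1.join();
    t2.join();

    std::uint64_t total = lo + hi;                // the single point of contention
    (void)total;
}
```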
Now, let's say the small values of i finish faster than the large values of i for whatever reason. Then thread 1 will finish early while thread 2 is still busy, which is inefficient. We need to chop the work into chunks so that each thread can do a big chunk at a time. So instead of two tasks of half a billion operations each (with only one point of contention to resolve their results at the end), or a billion tasks of one operation each (with a billion points of contention, which you've demonstrated is bad), we split the difference: say, a thousand tasks of a million operations each, giving only a thousand points of contention. That way contention is tiny compared to the arithmetic we need to do, and when one thread finishes before the other, the leftover is at most a million operations, so things stay mostly efficient. Something like the sketch below.
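A minimal sketch of that chunking idea, again assuming a running-sum workload (kChunk, next, and worker are names I've invented for illustration). Threads grab whole chunks off a shared counter, so the shared state is touched roughly a thousand times rather than a billion:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
    constexpr std::uint64_t kN      = 1'000'000'000;
    constexpr std::uint64_t kChunk  = 1'000'000;       // a million ops per chunk
    constexpr std::uint64_t kChunks = kN / kChunk;     // a thousand chunks

    std::atomic<std::uint64_t> next{0};                // contended once per chunk, not per op
    std::atomic<std::uint64_t> total{0};

    auto worker = [&] {
        std::uint64_t c;
        while ((c = next.fetch_add(1)) < kChunks) {    // grab the next whole chunk
            std::uint64_t local = 0;                   // thread-private accumulator
            for (std::uint64_t i = c * kChunk; i < (c + 1) * kChunk; ++i)
                local += i;
            total.fetch_add(local);                    // one shared write per chunk
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < std::thread::hardware_concurrency(); ++t)
        pool.emplace_back(worker);
    for (auto& th : pool) th.join();
}
```

Because threads pull chunks dynamically, a slow thread simply takes fewer chunks; the worst-case imbalance at the end is one chunk's worth of work.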
You can also do dynamic and adaptive partitioning. The PPL does a not-too-shabby job of automatically partitioning your work when you use a parallel_for. It also makes sure that the right number of threads are running, and can detect whether they are blocked, idle, or running (taking into consideration your system architecture and current load), so it minimizes the number of threads created and destroyed as well as the number of context switches.
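For instance (Windows-only, and assuming the same summing workload), concurrency::parallel_for from <ppl.h> handles the chunking for you, and concurrency::combinable gives each thread its own private copy of the accumulator, combined once at the end:

```cpp
#include <cstdint>
#include <functional>
#include <ppl.h>

int main() {
    constexpr std::uint64_t kN = 1'000'000'000;

    concurrency::combinable<std::uint64_t> sum;        // one private copy per thread
    concurrency::parallel_for(std::uint64_t{0}, kN, [&sum](std::uint64_t i) {
        sum.local() += i;                              // touches only this thread's copy
    });

    // Resolve the per-thread copies in a single final step.
    std::uint64_t total = sum.combine(std::plus<std::uint64_t>());
    (void)total;
}
```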
Your scenario demonstrates why partitioning is important: you have partitioned the work so poorly that you contend for shared memory a billion times, which dwarfs the insignificant amount of arithmetic you actually did. That is why it performs worse the more threads you add.
(I realize this is just an experiment you are doing to discuss an issue and doesn't represent a real scenario. Your loop is a do-nothing time waster as is, and I'm sure you intended it to be. An arithmetic series can be summed in constant time anyway, e.g. 0 + 1 + … + n = n(n+1)/2.)