Four dumb questions I'd like to ask hardware designers

Given that an x86 CPU doesn't touch memory in anything less than 64-byte increments (since that's the cache line size), and that the slow part of a lock cmpxchg8b/cmpxchg16b instruction is acquiring exclusive use of the relevant cache line rather than actually comparing and swapping the operands, how come there isn't a lock cmpxchg64b instruction, just to make lock-free algorithms easier?
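To make "easier" concrete, here's a minimal sketch of the kind of thing a wider CAS would help with: a lock-free stack head that packs a pointer plus an ABA counter into 16 bytes so a single lock cmpxchg16b can update both at once. The names here are mine, the code is just illustrative, and whether the compare_exchange below actually lowers to cmpxchg16b depends on your compiler and flags (e.g. -mcx16, possibly -latomic).

    #include <atomic>
    #include <cstdint>

    struct Node { Node* next; int value; };

    // 16-byte head: pointer plus a generation counter to defeat the ABA problem.
    struct alignas(16) Head {
        Node*    top;
        uint64_t aba;
    };

    std::atomic<Head> head{Head{nullptr, 0}};

    void push(Node* n) {
        Head old_h = head.load(std::memory_order_relaxed);
        Head new_h;
        do {
            n->next = old_h.top;
            new_h   = Head{n, old_h.aba + 1};
            // On x86-64 this is at best a lock cmpxchg16b: 16 bytes is the
            // widest atomic RMW the ISA offers, even though the whole cache
            // line has to be owned exclusively either way.
        } while (!head.compare_exchange_weak(old_h, new_h,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

With a hypothetical lock cmpxchg64b you could swing a whole cache line's worth of state in one shot instead of squeezing everything into 16 bytes.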

Regarding the frequency vs. cost-efficiency trade-off: are current GPUs (which seem to run somewhere around 1.0 to 1.5 GHz rather than up in the 3 GHz+ range) sitting pretty much at the optimal point for cost efficiency?
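For context on why the trade-off exists at all, here's a crude back-of-envelope toy, not a measurement of any real GPU: dynamic power goes roughly as C·V²·f, and near the top of the clock range the supply voltage has to rise with frequency, so perf-per-watt falls as clocks climb. Every number below is made up purely for illustration.

    #include <cstdio>

    int main() {
        const double base_f = 1.0;   // GHz (made-up GPU-ish baseline)
        const double base_v = 0.80;  // V   (made-up)
        const double scales[] = {1.0, 2.0, 3.0};
        for (double s : scales) {
            double f = base_f * s;
            double v = base_v * s;    // crude "voltage rises with clock" assumption
            double p = v * v * f;     // P ~ C*V^2*f, arbitrary units
            std::printf("f = %.1f GHz  relative power = %5.2f  perf/W = %.2f\n",
                        f, p, f / p);
        }
    }

Under that (very rough) model, perf-per-watt only goes down as frequency rises; the question is whether the 1.0 to 1.5 GHz range is where the real curve says to stop.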

How could REP MOVSB ever be slower than memcpy() on an x86 CPU that has a microcode unit? If some memcpy() implementation were faster than REP MOVSB, couldn't REP MOVSB be sped up to match it simply by having its microcode issue the same micro-ops that the faster memcpy() would have issued?
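For concreteness, here's a sketch of the two code paths being compared. The inline-asm wrapper is my own (GCC/Clang syntax, x86-64 only), and timing is deliberately left out, since the question is about what the hardware could do rather than about any particular measurement.

    #include <cstring>
    #include <cstddef>
    #include <vector>

    // Copy n bytes with rep movsb: RCX bytes from [RSI] to [RDI].
    static void copy_rep_movsb(void* dst, const void* src, std::size_t n) {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }

    int main() {
        std::vector<unsigned char> src(1 << 20, 0xAB), a(1 << 20), b(1 << 20);
        std::memcpy(a.data(), src.data(), src.size());      // library code path
        copy_rep_movsb(b.data(), src.data(), src.size());   // microcoded path
        return a == b ? 0 : 1;                              // sanity check only
    }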

Do multicore ARM systems, by virtue of their weaker memory model, see a practical improvement over x86 in how much power or performance they spend on inter-core coordination?
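To make the question concrete, here's a tiny producer/consumer sketch of my own using plain std::atomic. On x86-64 the release store and acquire load compile to ordinary movs, because TSO already enforces the ordering; on AArch64 they become stlr/ldar, and everything else is free to be reordered. What I'm asking is whether that extra latitude actually shows up as less coherence traffic, power, or stalling in real multicore workloads.

    #include <atomic>

    int payload = 0;
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                  // plain store
        ready.store(true, std::memory_order_release);  // x86-64: mov; AArch64: stlr
    }

    int consumer() {
        while (!ready.load(std::memory_order_acquire)) // x86-64: mov; AArch64: ldar
            ;                                          // spin (illustrative only)
        return payload;                                // ordering guarantees 42 is visible
    }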