Software distributed-shared-memory (DSM) systems provide an appealing target for parallelizing compilers due to their flexibility. Previous studies demonstrate such systems can provide performance comparable to messagepassing compilers for dense-matrix kernels. However, synchronization and load imbalance are significant sources of overhead. In this paper, we investigate the impact of compilation techniques for eliminating barrier synchronization overhead in software DSMs. Our compile-time barrier elimination algorithm extends previous techniques in three ways: (1) we perform inexpensive communication analysis through local subscript analysis when using chunk iteration partitioning for parallel loops; (2) we exploit delayed updates in lazy-release-consistency DSMs to eliminate barriers guarding only anti-dependences; (3) when possible we replace barriers with customized nearest-neighbor synchronization. Experiments on an IBM SP-2 indicate these techniques can improve parallel performance by 20% on average and by up to 60% for some applications.