Coperformers in musical ensembles continuously adapt the timing of their actions to maintain interpersonal coordination. The current study used a dyadic finger-tapping task to investigate whether such mutual adaptive timing is dominated by assimilation (i.e., copying relative timing, akin to mimicry) or by compensation (i.e., local error correction). The task was intended to approximate the demands that arise when coperformers coordinate complementary parts with a rhythm section in an ensemble. In two experiments, paired musicians (the coperformers) were required to tap in alternation, in synchrony with an auditory pacing signal (the rhythm section). Serial dependencies between successive asynchronies, measured between the alternating individuals' taps and the pacing tones, revealed greater evidence for temporal assimilation than for compensation. Manipulating the availability of visual and auditory feedback across experiments showed that this assimilation was strongest when coactors' taps triggered sounds, whereas the effects of visual information were negligible. These results suggest that interpersonal temporal assimilation was mediated by perception–action coupling in the auditory modality. Mutual temporal assimilation may facilitate coordination in musical ensembles by automatically increasing stylistic compatibility between coperformers, thereby helping them to sound cohesive.