Upright and inverted audiovisual video clips of a monkey producing a ‘coo’ and a human imitating this vocalization were presented at a range of stimulus onset asynchronies. Participants made temporal order judgments regarding which modality stream appeared to have been presented first. The results showed that inverting the dynamic human visual display led to a significant difference in the point of subjective simultaneity, with the inverted human faces requiring more time to be processed than the upright displays. No such inversion effect was found for the monkey visual displays. These results demonstrate that the effect of inversion on the temporal perception of audiovisual speech stimuli is driven by the viewing of a human face rather than by the integration of audiovisual speech.