The development of attention to dynamic faces versus objects providing synchronous audiovisual versus silent visual stimulation was assessed in a large sample of infants. Maintaining attention to the faces and voices of people speaking is critical for perceptual, cognitive, social, and language development. However, no studies have systematically assessed whether, when, or how attention to speaking faces emerges and changes across infancy. Two measures of attention maintenance, habituation time (HT) and look-away rate (LAR), were derived from cross-sectional data of 2- to 8-month-old infants (N = 801). Results indicated that attention to audiovisual faces and voices was maintained across age, whereas attention to each of the other event types (audiovisual objects, silent dynamic faces, silent dynamic objects) declined across age. This pattern reveals a gradually emerging advantage in attention maintenance (longer HTs, lower LARs) for audiovisual speaking faces compared with the other 3 event types. At 2 months, infants showed no attentional advantage for faces, although they attended more to audiovisual than to silent visual events; at 3 months, they attended more to dynamic faces than to objects (in the presence or absence of voices); and by 4 to 5 and 6 to 8 months, significantly greater attention emerged to temporally coordinated faces and voices of people speaking compared with all other event types. Our results indicate that selective attention to coordinated faces and voices over other event types emerges gradually across infancy, likely as a function of experience with multimodal, redundant stimulation from person and object events.