This paper presents a study of models for audiovisual (AV) fusion in a noisy-vowel recognition task. We progressively elaborate audiovisual models so that they respect the major principle demonstrated by human subjects in speech perception experiments (the “synergy” principle): audiovisual identification should always be more efficient than auditory-alone or visual-alone identification. We first recall that the efficiency of audiovisual speech recognition systems depends on the level at which they fuse sound and image: four AV architectures are presented, and two are selected for the remainder of the study. Second, we show the importance of providing the fusion process with a contextual input linked to the Signal-to-Noise Ratio (SNR). We then propose an original approach using an efficient nonlinear dimension reduction algorithm (“curvilinear component analysis”) to improve the performance of the two AV architectures. Furthermore, we show that this approach allows an easy and efficient estimation of the reliability of the audio sensor as a function of SNR, that this estimate can be used to control the AV fusion process, and that it significantly improves AV performance. Altogether, nonlinear dimension reduction, context estimation and control of the fusion process enable us to satisfy the “synergy” criterion for the two most widely used architectures.
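To make the idea of SNR-controlled fusion concrete, the following is a minimal illustrative sketch (not the paper's exact models): a weighted log-linear combination of audio and visual classifier posteriors, in which a scalar `audio_reliability` weight stands in for the SNR-derived context estimate, down-weighting the audio stream as noise increases. The function name, the toy three-vowel posteriors, and the log-linear rule are all assumptions made for illustration.

```python
import numpy as np

def fuse_av(p_audio, p_video, audio_reliability):
    """Weighted log-linear fusion of audio and visual posteriors.

    audio_reliability in [0, 1] plays the role of the SNR-dependent
    context weight: near 1 in clean conditions, near 0 in heavy noise.
    """
    # Log-linear (product-of-experts) combination with modality weights.
    log_p = (audio_reliability * np.log(p_audio + 1e-12)
             + (1.0 - audio_reliability) * np.log(p_video + 1e-12))
    # Renormalize to obtain a fused posterior distribution.
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

# Toy example with three vowel classes (hypothetical posteriors):
p_audio = np.array([0.2, 0.5, 0.3])   # noisy audio weakly favors class 1
p_video = np.array([0.6, 0.3, 0.1])   # lipreading favors class 0
print(fuse_av(p_audio, p_video, audio_reliability=0.2))
```

With low audio reliability the fused decision follows the visual stream, and with high reliability it follows the audio stream, which is the qualitative behavior a context-controlled fusion process is meant to achieve.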