Cross-modal correspondences influence perceptual performance in adults, infants, and even nonhuman primates across a variety of sensory modalities, including in tasks involving speeded detection and categorization. However, it remains unclear whether and how such correspondences modulate post-perceptual processes, such as working memory (WM). We investigated this issue using an audiovisual two-back task. In Experiment 1, three kinds of correspondence were used: audio/visual numerosity, pitch/shape, and pitch/elevation, each presented congruently (e.g., for numerosity: 3 tones along with 3 shapes) or incongruently (e.g., 3 tones/2 shapes). Participants attended to the visual modality, the auditory modality, or both simultaneously. The results revealed faster target-detection latencies following congruent as compared to incongruent stimulation, especially for numerosity congruence. In Experiment 2, we focused on numerosity, varying the correspondence of the unattended modality so that the stimuli were congruent at both sample (e.g., 3 tones/3 shapes) and target (e.g., 3 tones/3 shapes), only at sample (sample: 3 tones/3 shapes; target: 3 tones/2 shapes), only at target (sample: 3 tones/2 shapes; target: 3 tones/3 shapes), or at neither. To investigate the role of information format, we also included “symbolic” quantities (i.e., visually/auditorily presented digits). The results confirmed the congruence effects, specifically when the correspondence operated at the target display, thus affecting response selection. This experiment also revealed modality-specific effects: task-irrelevant digits affected performance only in the auditory modality, whereas task-irrelevant quantities affected it only when presented visually. Overall, these findings highlight the impact of cross-modal correspondences on WM, shedding new light on the link between perceptual and post-perceptual stages of human information processing.