Conjoint processing of text and pictures is assumed to possess an inherent asymmetry, because text and pictures serve fundamentally different but complementary functions. Conjoint processing is assumed to start with general, coherence-oriented mental model construction. When certain tasks have to be solved, the mental model is adjusted to the task requirements by adaptive mental model elaboration. We hypothesized that, due to different constraints on cognitive processing, initial mental model construction is more text-driven than picture-driven, whereas adaptive mental model elaboration is more picture-driven than text-driven. We also hypothesized that there are more transitions between text and picture during initial model construction than during adaptive model elaboration, and more task–picture transitions than task–text transitions during adaptive mental model elaboration. To test these hypotheses, we selected 6 text–picture units from textbooks on biology and geography, each combined with 3 comprehension items of different complexity. The units and corresponding items were presented to 204 students in Grades 5 to 8 from the higher and the lower tiers of the German school system. The participants were required to answer the presented items one by one. Their eye movements were analyzed in terms of fixations and transitions between texts, pictures, and items as dependent variables. The independent variables were school tier, grade, and order of presentation. The results confirmed our hypotheses. We presume that the benefits of learning from text and pictures are due to the inherent asymmetry, which allows the learner to combine the specific advantages of both forms of representation.