Real-time motion management plays an important role in image-guided radiation therapy interventions. Forecasting future 4D deformations from in-plane image acquisitions is fundamental for accurate dose delivery and tumor targeting. However, anticipating visual representations is challenging: predictions must be made from limited dynamics, and complex deformations are inherently high-dimensional. In addition, existing 3D tracking approaches typically need both template and search volumes as inputs, which are not available during real-time treatments. In this work, we propose an attention-based temporal prediction network in which features extracted from input images are treated as tokens for the predictive task. Moreover, we employ a set of learnable queries, conditioned on prior knowledge, to predict future latent representations of deformations. Specifically, the conditioning scheme is based on estimated time-wise prior distributions computed from future images available during the training stage. Finally, we propose a new framework to address the problem of temporal 3D local tracking using cine 2D images as inputs, employing latent vectors as gating variables to refine the motion fields over the tracked region. The tracker module is anchored on a 4D motion model, which provides both the latent vectors and the volumetric motion estimates to be refined. Our approach avoids auto-regression and leverages spatial transformations to generate the forecasted images. The tracking module reduces the error by 63% compared to a conditional-based transformer 4D motion model, yielding a mean error of 1.5 ± 1.1 mm. Furthermore, for the studied cohort of abdominal 4D MRI images, the proposed method predicts future deformations with a mean geometrical error of 1.2 ± 0.7 mm.
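
As an illustration of predicting all future latent representations in a single pass rather than auto-regressively, the following is a minimal NumPy sketch in which a set of queries cross-attends to tokens extracted from the input images. It is not the network described here: the function name `predict_future_latents`, the additive conditioning on a prior mean, the single attention head, the dimensions, and the random weights standing in for learned parameters are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_future_latents(tokens, queries, prior_mean, w_k, w_v):
    """Single-head cross-attention from future-time queries to image tokens.

    tokens:     (T_in, d)  features of the observed input images
    queries:    (T_out, d) one query per future time step
    prior_mean: (T_out, d) time-wise prior statistics (hypothetical form)
    """
    # Assumption: conditioning is a simple addition of the prior mean.
    q = queries + prior_mean
    k = tokens @ w_k
    v = tokens @ w_v
    # Each future step attends over all observed tokens at once,
    # so no prediction is fed back as input (no auto-regression).
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (T_out, T_in)
    return attn @ v, attn

rng = np.random.default_rng(0)
d, t_in, t_out = 16, 4, 3            # illustrative sizes only
tokens = rng.normal(size=(t_in, d))
queries = rng.normal(size=(t_out, d))      # would be learned in practice
prior_mean = rng.normal(size=(t_out, d))   # would be estimated during training
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

latents, attn = predict_future_latents(tokens, queries, prior_mean, w_k, w_v)
print(latents.shape, attn.shape)
```

In this sketch each row of `attn` is a distribution over the observed time steps, and each row of `latents` is the predicted latent vector for one future time step. A decoder and a spatial transformation step would then be needed to turn these vectors into motion fields and warped images; both are outside the scope of this example.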