Automatic craniomaxillofacial (CMF) landmark localization from cone-beam computed tomography (CBCT) images is challenging because 1) the number of landmarks present in an image may vary due to deformities and traumatic defects, and 2) the CBCT images used in clinical practice are typically large. In this paper, we propose a two-stage, coarse-to-fine deep learning method that addresses these challenges with both speed and accuracy in mind. Specifically, we first use a 3D Faster R-CNN to roughly locate landmarks in down-sampled CBCT images that contain varying numbers of landmarks. By converting the landmark point detection problem into a generic object detection problem, our 3D Faster R-CNN is formulated to detect virtual, fixed-size objects, each represented by a small bounding box whose center indicates the approximate location of a landmark. Based on these rough landmark locations, we then crop 3D patches from the high-resolution images and feed them to a multi-scale UNet that regresses heatmaps, from which the refined landmark locations are finally derived. We evaluated the proposed approach by detecting up to 18 landmarks on a real clinical dataset of CMF CBCT images with various conditions. Experiments show that our approach achieves state-of-the-art accuracy of 0.89 ± 0.64 mm in an average time of 26.2 seconds per volume.
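The geometric bookkeeping around the two networks can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names, the box size of 8 voxels, the patch size of 32 voxels, and the use of a plain argmax to read a landmark off a heatmap are all assumptions made for illustration, and the coordinates are assumed to be already expressed in full-resolution voxel units. The detection and regression networks themselves are omitted.

```python
import numpy as np

def landmark_to_box(center, size=8.0):
    """Wrap a landmark in a fixed-size cubic box (z0, y0, x0, z1, y1, x1).
    The box size is a hypothetical value, not taken from the paper."""
    c = np.asarray(center, dtype=float)
    half = size / 2.0
    return np.concatenate([c - half, c + half])

def box_to_landmark(box):
    """Recover the approximate landmark as the center of a detected box."""
    b = np.asarray(box, dtype=float)
    return (b[:3] + b[3:]) / 2.0

def crop_patch(volume, center, patch=32):
    """Crop a cubic patch around `center`, shifted to stay inside the volume.
    Returns the patch and the voxel coordinates of its origin."""
    c = np.round(np.asarray(center)).astype(int)
    shape = np.array(volume.shape)
    origin = np.clip(c - patch // 2, 0, np.maximum(shape - patch, 0))
    z, y, x = origin
    return volume[z:z + patch, y:y + patch, x:x + patch], origin

def heatmap_to_landmark(heatmap, origin):
    """Take the heatmap peak (simple argmax, an assumption) and map it
    back to full-volume voxel coordinates."""
    idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.asarray(idx) + origin
```

In this sketch, a coarse detection is turned into a landmark estimate with `box_to_landmark`, a patch is cropped around it with `crop_patch`, and the refined location is read from the regressed heatmap with `heatmap_to_landmark`. Because the number of detected boxes is not fixed, the same steps apply whether a volume contains all landmarks or only some of them.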