Hand pose estimation, formulated as an inverse problem, is typically solved by optimizing an energy function over pose parameters using a 'black box' image generation procedure, with little knowledge of either the relationships between the parameters or the form of the energy function. In this paper, we show significant improvement over such black box optimization by exploiting high-level knowledge of the parameter structure and using a local surrogate energy function. Our new framework, hierarchical sampling optimization (HSO), consists of a sequence of discriminative predictors organized into a kinematic hierarchy. Each predictor is conditioned on its ancestors and generates a set of samples over a subset of the pose parameters, from which the highly efficient surrogate energy selects only one. The selected partial poses are concatenated to form a full-pose hypothesis. Repeating this process generates several hypotheses, from which the full energy function selects the best result. Under the same kinematic hierarchy, we propose two methods for generating the samples, one based on decision forests and one on convolutional neural networks, and study two optimization methods for optimizing over these samples. Experimental evaluations on three publicly available datasets show that our method is particularly effective in low-compute scenarios, where it significantly outperforms all other state-of-the-art methods.
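The control flow of HSO described above can be illustrated with a minimal sketch. It is not the implementation from this paper: the three-level hierarchy, the Gaussian sampler standing in for the discriminative predictors, and the squared-distance energies standing in for the image-based surrogate and full energies are all hypothetical placeholders chosen so that the example runs on its own.

```python
import random

# Hypothetical kinematic hierarchy: (name, number of pose parameters) per level.
HIERARCHY = [("wrist", 3), ("finger_base", 2), ("finger_tip", 2)]


def sample_partial_poses(level, ancestors, n_samples, rng):
    """Stand-in for a discriminative predictor conditioned on its ancestors.

    Assumption: a real predictor (decision forest or CNN) would use the image;
    here we just draw Gaussian samples centred on the mean ancestor parameter.
    """
    dim = HIERARCHY[level][1]
    centre = sum(ancestors) / len(ancestors) if ancestors else 0.0
    return [[centre + rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_samples)]


def surrogate_energy(level, partial, target):
    """Toy local energy: squared error on this level's parameters only."""
    start = sum(dim for _, dim in HIERARCHY[:level])
    local_target = target[start:start + len(partial)]
    return sum((p - t) ** 2 for p, t in zip(partial, local_target))


def full_energy(pose, target):
    """Toy full energy: squared error over the whole pose vector."""
    return sum((p - t) ** 2 for p, t in zip(pose, target))


def hso(target, n_hypotheses=5, n_samples=20, seed=0):
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(n_hypotheses):
        pose = []  # partial poses selected so far, root first
        for level in range(len(HIERARCHY)):
            samples = sample_partial_poses(level, pose, n_samples, rng)
            # The surrogate energy keeps exactly one sample per level.
            best = min(samples,
                       key=lambda s: surrogate_energy(level, s, target))
            pose = pose + best  # concatenate into the growing hypothesis
        hypotheses.append(pose)
    # The full energy chooses among the complete hypotheses.
    return min(hypotheses, key=lambda p: full_energy(p, target))
```

Calling `hso([0.0] * 7)` returns a seven-parameter pose. In this toy setting the target vector plays the role of the observation; in the actual problem both energies would instead compare rendered and observed images.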