The prediction of the binding free energy between a ligand and a protein is an important component of virtual screening and lead optimization in drug discovery. To assess the quality of current binding free energy estimation programs, we examined the performance of FlexX, X-Score, AutoDock, and BLEEP in several situations: cocrystallized complex structures, cross docking of ligands to their non-cocrystallized receptors, docking of thermally unfolded receptor decoys to their ligands, and complex structures with "randomized" ligand decoys. In no case was there a satisfactory correlation between the experimental and estimated binding free energies over all the datasets tested. Meanwhile, the correlation between experimental and predicted binding affinities was itself strongly correlated with the correlation between ligand molecular weight and binding affinity. In some cases the programs also ranked the ligands' binding affinities correctly even though native interactions between the ligands and their receptors were essentially lost because of receptor deformation or ligand randomization, and the programs could not decisively discriminate randomized ligand decoys from their native ligands. These results suggest that the tested programs miss important components needed to capture specific ligand binding interactions accurately.
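The "correlation between correlations" mentioned above can be made concrete with a minimal sketch. It assumes, purely for illustration, that each dataset yields one molecular weight versus experimental affinity correlation and one predicted versus experimental affinity correlation, and that the two sets of coefficients are then compared across datasets. The Pearson coefficient, the function names, and the dictionary keys (`mw`, `experimental`, `predicted`) are hypothetical choices, not taken from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def dataset_correlations(datasets):
    """Return two lists of per-dataset correlation coefficients.

    Each dataset is a dict (hypothetical layout) with per-ligand lists:
      'mw'           ligand molecular weights
      'experimental' experimental binding affinities
      'predicted'    affinities estimated by a scoring program
    """
    r_mw = [pearson(d["mw"], d["experimental"]) for d in datasets]
    r_pred = [pearson(d["predicted"], d["experimental"]) for d in datasets]
    return r_mw, r_pred


def correlation_of_correlations(datasets):
    """Correlate the MW-affinity coefficients with the
    predicted-experimental coefficients across datasets."""
    r_mw, r_pred = dataset_correlations(datasets)
    return pearson(r_mw, r_pred)
```

Under these assumptions, a value of `correlation_of_correlations` close to 1 would mean that a program appears to predict affinities well mainly on those datasets where molecular weight alone already tracks affinity.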