Accurately determining the binding affinity of a ligand for a protein is important for drug design, development, and screening. With the advent of accessible protein structure prediction methods such as AlphaFold, several approaches have been developed that use information derived from 3D structure for a variety of downstream tasks. However, binding-affinity prediction methods that do consider protein structure generally fail to take full advantage of this 3D structural information, often using it only to define nearest-neighbor graphs based on inter-residue or inter-atomic distances. Here, we present a joint architecture, CASTER-DTA (Cross-Attention with Structural Target Equivariant Representations for Drug-Target Affinity), that uses an SE(3)-equivariant graph neural network to learn more robust protein representations alongside a standard graph neural network that learns molecular representations. We further augment these representations with an attention-based mechanism by which individual residues in a protein can attend to atoms in a ligand, and vice versa, improving interpretability. We show that the equivariant graph neural networks in our architecture enable CASTER-DTA to approach and exceed state-of-the-art performance in predicting drug-target affinity without external information such as protein language model embeddings. We demonstrate this on the Davis and KIBA datasets, common benchmarks for drug-target affinity prediction, and discuss future steps to further improve performance.
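The residue-to-atom attention described above can be illustrated with a minimal sketch of bidirectional scaled dot-product cross-attention between a protein's residue embeddings and a ligand's atom embeddings. This is an assumption-laden simplification, not the paper's implementation: the embedding width, the single shared projection matrices (`Wq`, `Wk`, `Wv`), and the single-head formulation are all hypothetical choices for illustration; the actual model may use separate learned projections per direction, multiple heads, and different dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: each row of `queries`
    attends over the rows of `context`. Returns the attended
    representations and the attention weights (for interpretability)."""
    Q = queries @ Wq
    K = context @ Wk
    V = context @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8                               # hypothetical embedding width
residues = rng.normal(size=(5, d))  # 5 protein residue embeddings (from the protein GNN)
atoms = rng.normal(size=(7, d))     # 7 ligand atom embeddings (from the molecular GNN)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Residues attend to atoms, and atoms attend to residues.
res_ctx, res_w = cross_attention(residues, atoms, Wq, Wk, Wv)
atom_ctx, atom_w = cross_attention(atoms, residues, Wq, Wk, Wv)
```

The attention weight matrices (`res_w`, `atom_w`) indicate which ligand atoms each residue focuses on and vice versa, which is what makes this mechanism useful for interpretability.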