Many regions of the Epstein–Barr virus (EBV) genome, including both repeated and unique sequences, contribute to the geographical variation observed between strains. Here we use a large alignment of curated EBV genome sequences to identify major sites of variation in the genome of type 1 EBV strains; the CAO deletion in latent membrane protein 1 (LMP1) is the most frequent major indel present in the unique regions of EBV strains from various parts of the world. Principal component analysis was used to identify patterns of sequence variation and nucleotide positions that can distinguish EBV strains of some different geographical origins. Viral genome sequence variation also affects interpretation of genetic content; known genes, origins of replication and gene expression control regions account for most of the viral genome, but there are still a few sections of unknown function. One of these EBV genome regions contains a large inverted repeat sequence (invR) within the IR-1 major internal repeat array. We deleted this invR sequence and showed that this abolished the ability of the virus to transform human B cells into lymphoblastoid cell lines. This article is part of the theme issue 'Silent cancer agents: multi-disciplinary modelling of human DNA oncoviruses'.
Keywords: Epstein–Barr virus; cell transformation; repeats; strain variation.