Background: Short tandem repeats (STRs) are highly variable regions of the genome with consequences for evolution and disease. In humans, variable tandem repeats (TRs) cause many neurological disorders and are associated with cancer; altered TR length can, for example, affect the age of disease onset. TRs in promoters and other regulatory regions of genes can affect gene expression and consequently protein function. There are currently no resources available for identifying STRs in regulatory regions of human genes.
Aims: This study aimed to identify and characterise STRs in the upstream regulatory region of human genes on a genome-wide scale and establish a resource to allow the interrogation of these STRs.
Methods: We took the table of genome-wide STRs identified by the Tandem Repeats Finder program and filtered it by STR length, repeat purity and location relative to transcription start sites. Using a series of nested table joins in the UCSC Genome Browser database, we produced a table of all the STRs present in the 3 kilobase regulatory region at the 5’-end of all human genes.
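The filtering and join described in the Methods can be illustrated with a minimal Python sketch. This is not the study's actual pipeline, which used UCSC table joins. The record fields, the threshold values (MAX_PERIOD, MIN_COPIES, MIN_PER_MATCH) and the rule that an STR must lie entirely inside the upstream window are all assumptions made for illustration; only the 3 kilobase window comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    """One tandem repeat record (hypothetical fields, loosely modelled on TRF output)."""
    chrom: str
    start: int        # 0-based, half-open interval [start, end)
    end: int
    period: int       # length of the repeat unit in bp
    copy_num: float   # number of copies of the unit
    per_match: int    # percent identity between copies (repeat purity)


@dataclass(frozen=True)
class Gene:
    """One gene record with strand and transcript bounds (hypothetical fields)."""
    name: str
    chrom: str
    strand: str       # "+" or "-"
    tx_start: int
    tx_end: int


UPSTREAM_BP = 3000    # 3 kb regulatory window, as stated in the Methods
MAX_PERIOD = 6        # assumed cut-off for a "short" repeat unit
MIN_COPIES = 3.0      # assumed minimum copy number
MIN_PER_MATCH = 90    # assumed purity threshold (percent)


def upstream_window(gene):
    """Return the [lo, hi) window upstream of the transcription start site."""
    if gene.strand == "+":
        tss = gene.tx_start
        return max(0, tss - UPSTREAM_BP), tss
    # On the minus strand the TSS is at tx_end and upstream runs to higher coordinates.
    tss = gene.tx_end
    return tss, tss + UPSTREAM_BP


def passes_filters(repeat):
    """Keep short, sufficiently long and pure repeats (thresholds are assumptions)."""
    return (
        repeat.period <= MAX_PERIOD
        and repeat.copy_num >= MIN_COPIES
        and repeat.per_match >= MIN_PER_MATCH
    )


def strs_in_regulatory_regions(repeats, genes):
    """Pair each gene with the filtered STRs lying fully inside its upstream window."""
    kept = [r for r in repeats if passes_filters(r)]
    hits = []
    for gene in genes:
        lo, hi = upstream_window(gene)
        for r in kept:
            if r.chrom == gene.chrom and r.start >= lo and r.end <= hi:
                hits.append((gene.name, r))
    return hits
```

The nested loop is quadratic and only suitable for small inputs; a genome-wide run would more plausibly sort by chromosome and position or use an interval index, which is the role the database joins play in the study.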
Results and Conclusions: By creating this table of STRs, we have identified 5436 STRs within the 5’-regulatory region of 4457 human genes. The details of these STRs will be made accessible in the Short Tandem Repeats in Regulatory Regions Table, or STaRRRT. This table has revealed that 10.3% of genes in the human genome have at least one STR in their upstream regulatory region. Analysis has shown that mismatch repair and neural genes are enriched for STRs in this regulatory region. These genes are also significantly over-represented in cellular and neurological processes. This is consistent with the role of tandem repeats in neurological disorders and could potentially lead to the identification of targets for diagnosing and treating certain neurological diseases.