The original Human Genome Project and the subsequent study of now many thousands of individuals worldwide have provided a vast amount of DNA sequence information. With this information in hand, one can begin to characterize the types and frequencies of polymorphic variation found in the human genome and to generate catalogues of human DNA sequence diversity around the globe. DNA polymorphisms can be classified according to how the DNA sequence varies between the different alleles.
Single Nucleotide Polymorphisms
The simplest and most common of all polymorphisms are single nucleotide polymorphisms (SNPs). A locus characterized by a SNP usually has only two alleles, corresponding to the two different bases occupying that particular location in the genome. As mentioned previously, SNPs are common and are observed on average once every 1000 bp in the genome. However, the distribution of SNPs is uneven around the genome; many more SNPs are found in noncoding parts of the genome, in introns and in sequences that are some distance from known genes. Nonetheless, there is still a significant number of SNPs that do occur in genes and other known functional elements in the genome. For the set of protein-coding genes, over 100,000 exonic SNPs have been documented to date. Approximately half of these do not alter the predicted amino acid sequence of the encoded protein and are thus termed synonymous, whereas the other half do alter the amino acid sequence and are said to be nonsynonymous. Other SNPs introduce or change a stop codon, and yet others alter a known splice site; such SNPs are candidates to have significant functional consequences.
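The distinction between synonymous and nonsynonymous SNPs can be illustrated with a short sketch. It assumes the standard genetic code and a SNP described by its reference codon on the coding strand, a zero-based position within that codon, and the alternate base. The function name `classify_coding_snp` and its three labels are illustrative only and are not a standard annotation scheme.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change within one codon (hypothetical helper)."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"           # encoded amino acid unchanged
    if "*" in (ref_aa, alt_aa):
        return "stop gained or lost"  # introduces or changes a stop codon
    return "nonsynonymous"            # encoded amino acid changed


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, Glu -> Glu
print(classify_coding_snp("GAG", 1, "T"))  # GAG -> GTG, Glu -> Val
print(classify_coding_snp("TGG", 2, "A"))  # TGG -> TGA, Trp -> stop
```

The three calls return `synonymous`, `nonsynonymous`, and `stop gained or lost`, respectively. SNPs that alter splice sites are not covered by this sketch, because they cannot be classified from a single codon.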
The significance for health of the vast majority of SNPs is unknown and is the subject of ongoing research. The fact that SNPs are common does not mean that they are without effect on health or longevity. What it does mean is that any effect of common SNPs is likely to involve a relatively subtle altering of disease susceptibility rather than a direct cause of serious illness.
Insertion-Deletion Polymorphisms
A second class of polymorphism is the result of variations caused by insertion or deletion (in/dels or simply indels) of anywhere from a single base pair up to approximately 1000 bp, although larger indels have been documented as well. Over a million indels have been described, numbering in the hundreds of thousands in any one individual’s genome. Approximately half of all indels are referred to as “simple” because they have only two alleles – that is, the presence or absence of the inserted or deleted segment.
Other indels, however, are multiallelic due to variable numbers of the segment of DNA that is inserted in tandem at a particular location, thereby constituting what is referred to as a microsatellite. Microsatellites consist of stretches of DNA composed of units of two, three, or four nucleotides, such as TGTGTG, CAACAACAA, or AAATAAATAAAT, repeated between one and a few dozen times at a particular site in the genome. The different alleles in a microsatellite polymorphism are the result of differing numbers of repeated nucleotide units contained within any one microsatellite and are therefore sometimes also referred to as short tandem repeat (STR) polymorphisms. A microsatellite locus often has many alleles (repeat lengths) that can be rapidly evaluated by standard laboratory procedures to distinguish different individuals and to infer familial relationships. Many tens of thousands of microsatellite polymorphic loci are known throughout the human genome. Microsatellites are thus a particularly useful group of indels: determining the alleles at multiple microsatellite loci is currently the method of choice for DNA fingerprinting used for identity testing.
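The idea that a microsatellite allele is defined by its repeat count, and that genotypes at several loci together form a DNA fingerprint, can be sketched as follows. This is a minimal illustration that assumes each allele is available as a sequence string. The helper names, the locus names, and the example sequences are all hypothetical. Laboratory typing usually measures fragment length rather than counting repeats in sequence.

```python
import re


def repeat_count(sequence: str, unit: str) -> int:
    """Length, in repeat units, of the longest uninterrupted tandem run of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit.upper())})+", sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def genotype(allele_sequences, unit):
    """Sorted pair of repeat counts for the two alleles carried at one locus."""
    return tuple(sorted(repeat_count(seq, unit) for seq in allele_sequences))


# Made-up alleles at two made-up loci: flanking sequence around a repeat tract.
sample_1 = {
    "locus_A": genotype(["GG" + "TG" * 12 + "CC", "GG" + "TG" * 15 + "CC"], "TG"),
    "locus_B": genotype(["AT" + "CAA" * 7 + "GC", "AT" + "CAA" * 7 + "GC"], "CAA"),
}
sample_2 = {
    "locus_A": genotype(["GG" + "TG" * 12 + "CC", "GG" + "TG" * 13 + "CC"], "TG"),
    "locus_B": genotype(["AT" + "CAA" * 7 + "GC", "AT" + "CAA" * 9 + "GC"], "CAA"),
}

print(sample_1)              # {'locus_A': (12, 15), 'locus_B': (7, 7)}
print(sample_1 == sample_2)  # False: the two profiles differ at both loci
```

Because each locus can have many alleles, the chance that two unrelated individuals share the same genotype at every locus falls rapidly as more loci are typed, which is what makes such profiles useful for identity testing.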
Mobile Element Insertion Polymorphisms
Nearly half of the human genome consists of families of repetitive elements that are dispersed around the genome. Although most of the copies of these repeats are stationary, some of them are mobile and contribute to human genetic diversity through retrotransposition, a process that involves transcription into an RNA, reverse transcription into a DNA sequence, and insertion into another site in the genome. Although most mobile element polymorphisms are found in nongenic regions of the genome, a small proportion of them are found within genes. At least 5000 of these polymorphic loci have an insertion frequency of greater than 10% in various populations.
Copy Number Variants
Another important type of human polymorphism includes copy number variants (CNVs). CNVs are conceptually related to indels and microsatellites but consist of variation in the number of copies of larger segments of the genome, ranging in size from 1000 bp to many hundreds of kilobase pairs. Variants larger than 500 kb are found in 5% to 10% of individuals in the general population, whereas variants encompassing more than 1 Mb are found in 1% to 2%. The largest CNVs are sometimes found in regions of the genome characterized by repeated blocks of homologous sequences called segmental duplications (or segdups).
Smaller CNVs in particular may have only two alleles (i.e., the presence or absence of a segment), similar to indels in that regard. Larger CNVs tend to have multiple alleles due to the presence of different numbers of copies of a segment of DNA in tandem. In terms of genome diversity between individuals, the amount of DNA involved in CNVs vastly exceeds the amount that differs because of SNPs. The content of any two human genomes can differ by as much as 50 to 100 Mb because of copy number differences at CNV loci.
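A rough calculation shows why CNVs account for more variable DNA than SNPs. The sketch below takes the SNP density (about one per 1000 bp) and the CNV range (50 to 100 Mb) from the text. The haploid genome size of about 3 billion bp is an assumption introduced here, and the result is an order-of-magnitude estimate only.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid genome size
SNP_SPACING_BP = 1_000          # from the text: about one SNP per 1000 bp
CNV_DIFFERENCE_BP = (50_000_000, 100_000_000)  # from the text: 50 to 100 Mb

# Each SNP involves a single base, so SNP count equals bases involved.
snp_bases = GENOME_SIZE_BP // SNP_SPACING_BP

# How many times more DNA is involved in CNVs than in SNPs.
fold_low = CNV_DIFFERENCE_BP[0] / snp_bases
fold_high = CNV_DIFFERENCE_BP[1] / snp_bases

print(f"Bases involved in SNPs: about {snp_bases:,}")
print(f"CNV excess over SNPs: about {fold_low:.0f}- to {fold_high:.0f}-fold")
```

Under these assumptions, SNPs account for about 3 million bp, so copy number differences involve roughly 17 to 33 times as much DNA.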
Notably, the variable segment at many CNV loci can include from one to several dozen genes, and thus CNVs are frequently implicated in traits that involve altered gene dosage. When a CNV is frequent enough to be polymorphic, it represents a background of common variation that must be understood if alterations in copy number observed in patients are to be interpreted properly. As with all DNA polymorphisms, the significance of different CNV alleles in health and disease susceptibility is the subject of intensive investigation.
Inversion Polymorphisms
A final group of polymorphisms to be discussed is inversions, which differ in size from a few base pairs to large regions of the genome (up to several megabase pairs) that can be present in either of two orientations in the genomes of different individuals. Most inversions are characterized by regions of sequence homology at the edges of the inverted segment, implicating a process of homologous recombination in the origin of the inversions. In their balanced form, inversions, regardless of orientation, do not involve a gain or loss of DNA, and the inversion polymorphisms (with two alleles corresponding to the two orientations) can achieve substantial frequencies in the general population.
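The statement that a balanced inversion involves no gain or loss of DNA can be made concrete with a small sketch. It models a chromosome as a string of bases read along one strand, so that an inverted segment appears as its reverse complement. The function names and the example sequence are illustrative, and the model ignores the flanking homologous sequences and the mechanism by which inversions arise.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment: str) -> str:
    """Return the segment as read from the opposite strand."""
    return segment.translate(COMPLEMENT)[::-1]


def invert_segment(sequence: str, start: int, end: int) -> str:
    """Invert sequence[start:end] in place (zero-based, end exclusive)."""
    return sequence[:start] + reverse_complement(sequence[start:end]) + sequence[end:]


reference = "AAACCGTTT"
inverted = invert_segment(reference, 3, 6)  # the segment CCG becomes CGG

print(inverted)                                     # AAACGGTTT
print(len(inverted) == len(reference))              # True: no DNA gained or lost
print(invert_segment(inverted, 3, 6) == reference)  # True: two orientations only
```

Inverting the same segment a second time restores the original sequence, which matches the description of an inversion polymorphism as having two alleles corresponding to the two orientations.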