Coding sequences (protein-coding genes)
Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human proteins, although several biological processes (e.g. DNA rearrangements and alternative pre-mRNA splicing) can produce many more unique proteins than there are protein-coding genes. The complete modular protein-coding capacity of the genome is contained within the exome, which consists of the exon DNA sequences that can be translated into proteins. Because of its biological importance, and because it constitutes less than 2% of the genome, sequencing of the exome was the first major milestone of the Human Genome Project.
The number of protein-coding genes within the human genome remains a subject of active investigation. A 2012 analysis of the human genome based on in vitro gene expression in multiple cell lines identified 20,687 protein-coding genes. Historically, estimates of the number of protein-coding genes have varied widely, from as many as 2,000,000 in the late 1960s to approximately 40,000. Remarkably, the number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. This mismatch between gene number and organismal complexity may result from the extensive use of alternative pre-mRNA splicing in humans, which provides the ability to build a very large number of modular proteins through the selective incorporation of exons.
Protein-coding genes are distributed unevenly across the chromosomes, with an especially high gene density within chromosomes 19, 11, and 1 (Table 1). Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content. The significance of these nonrandom patterns of gene density is not well understood.
The size of protein-coding genes within the human genome shows enormous variability (Table 2). For example, the gene for histone H1a (HIST1H1A) is relatively small and simple: it lacks introns and encodes an mRNA of 781 nt and a protein of 215 amino acids (648 nt open reading frame). Dystrophin (DMD) is the largest protein-coding gene in the human reference genome, spanning a total of 2.2 Mb, while Titin (TTN) has the longest coding sequence (80,780 bp), the largest number of exons (364), and the longest single exon (17,106 bp). Over the whole genome, the median size of an exon is 122 bp (mean = 145 bp), the median number of exons is 7 (mean = 8.8), and the median coding sequence encodes 367 amino acids (mean = 447 amino acids; Table 21 in ).
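The histone H1a figures above can be checked with simple arithmetic: an open reading frame spends three nucleotides per amino acid plus three for the stop codon. The following is a minimal sketch of that check; the function name `orf_length_nt` is hypothetical and chosen only for illustration, and the calculation assumes the quoted 648 nt includes the stop codon.

```python
# Illustrative check of the histone H1a numbers quoted in the text.
# Assumption: the open reading frame (ORF) length includes one stop codon.

STOP_CODON_NT = 3  # a single stop codon terminates the reading frame


def orf_length_nt(n_amino_acids: int) -> int:
    """Return the ORF length in nucleotides, stop codon included."""
    return 3 * n_amino_acids + STOP_CODON_NT


h1a_amino_acids = 215  # protein length from the text
h1a_mrna_nt = 781      # mRNA length from the text

h1a_orf_nt = orf_length_nt(h1a_amino_acids)
h1a_untranslated_nt = h1a_mrna_nt - h1a_orf_nt  # remainder outside the ORF

print(f"ORF length: {h1a_orf_nt} nt")
print(f"mRNA outside the ORF: {h1a_untranslated_nt} nt")
```

Under that assumption the 215 amino acids account for the 648 nt open reading frame, and the rest of the 781 nt mRNA lies outside it.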
Protein | Chrom | Gene | Gene length (bp) | Exons | Exon length (bp) | Intron length (bp) | Alt splicing |
---|---|---|---|---|---|---|---|
Breast cancer type 2 susceptibility protein | 13 | BRCA2 | 83,736 | 27 | 11,386 | 72,350 | yes |
Cystic fibrosis transmembrane conductance regulator | 7 | CFTR | 202,881 | 27 | 4,440 | 198,441 | yes |
Cytochrome b | MT | MTCYB | 1,140 | 1 | 1,140 | 0 | no |
Dystrophin | X | DMD | 2,220,381 | 79 | 10,500 | 2,209,881 | yes |
Glyceraldehyde-3-phosphate dehydrogenase | 12 | GAPDH | 4,444 | 9 | 1,425 | 3,019 | yes |
Hemoglobin beta subunit | 11 | HBB | 1,605 | 3 | 626 | 979 | no |
Histone H1A | 6 | HIST1H1A | 781 | 1 | 781 | 0 | no |
Titin | 2 | TTN | 281,434 | 364 | 104,301 | 177,133 | yes |
Table 2. Examples of human protein-coding genes. Chrom, chromosome. Alt splicing, alternative pre-mRNA splicing. (Data source: Ensembl genome browser release 68, July 2012)
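The length columns of Table 2 are related by simple arithmetic: for each gene, the total exon length and total intron length sum to the gene length. The sketch below, which copies the values from the table into a hypothetical `GENES` dictionary, verifies that relation and computes the fraction of each gene occupied by introns. The helper names are illustrative and do not come from Ensembl or any other tool.

```python
# Values copied from Table 2: (gene length, exon count, exon length, intron length),
# all lengths in base pairs. The dictionary layout is an illustrative choice.
GENES = {
    "BRCA2":    (83_736, 27, 11_386, 72_350),
    "CFTR":     (202_881, 27, 4_440, 198_441),
    "MTCYB":    (1_140, 1, 1_140, 0),
    "DMD":      (2_220_381, 79, 10_500, 2_209_881),
    "GAPDH":    (4_444, 9, 1_425, 3_019),
    "HBB":      (1_605, 3, 626, 979),
    "HIST1H1A": (781, 1, 781, 0),
    "TTN":      (281_434, 364, 104_301, 177_133),
}


def is_consistent(gene: str) -> bool:
    """Check that exon length + intron length equals the gene length."""
    length, _exons, exon_len, intron_len = GENES[gene]
    return exon_len + intron_len == length


def intron_fraction(gene: str) -> float:
    """Fraction of the gene's span that is intronic."""
    length, _exons, _exon_len, intron_len = GENES[gene]
    return intron_len / length


for name in GENES:
    status = "ok" if is_consistent(name) else "MISMATCH"
    print(f"{name:9s} intron fraction = {intron_fraction(name):6.1%}  [{status}]")
```

Running this shows the contrast described in the text: intronless genes such as HIST1H1A and MTCYB have an intron fraction of zero, while nearly all of the DMD locus is intronic.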