Human Genome - Coding Sequences (protein-coding Genes)


Protein-coding sequences represent the most widely studied and best understood component of the human genome. These sequences ultimately lead to the production of all human proteins, although several biological processes (e.g., DNA rearrangements and alternative pre-mRNA splicing) can produce many more unique proteins than there are protein-coding genes. The complete modular protein-coding capacity of the genome is contained within the exome, which consists of the exon sequences that can be translated into proteins. Because of its biological importance, and because it constitutes less than 2% of the genome, sequencing of the exome was the first major milestone of the Human Genome Project.

The number of protein-coding genes within the human genome remains a subject of active investigation. A 2012 analysis of the human genome based on in vitro gene expression in multiple cell lines identified 20,687 protein-coding genes. Historically, estimates of the number of protein-coding genes have varied widely, from as many as 2,000,000 in the late 1960s to approximately 40,000. Remarkably, the number of human protein-coding genes is not significantly larger than that of many less complex organisms, such as the roundworm and the fruit fly. This mismatch between gene count and organismal complexity may be explained by the extensive use of alternative pre-mRNA splicing in humans, which makes it possible to build a very large number of modular proteins through the selective incorporation of exons.
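To illustrate how selective exon incorporation can multiply the number of distinct products from a single gene, here is a deliberately simplified sketch. It assumes a toy model in which the first and last exons are always retained and every internal exon is independently included or skipped; real splicing is regulated and far more constrained, so the count is an upper bound rather than a prediction. The function names and exon labels are hypothetical.

```python
from itertools import combinations


def count_isoforms(n_internal_exons: int) -> int:
    """Upper bound on isoforms if each internal exon is independently optional."""
    return 2 ** n_internal_exons


def enumerate_isoforms(exons):
    """List every exon combination that keeps the first and last exon.

    Exon order is preserved, because combinations() yields items in input order.
    """
    if len(exons) < 2:
        raise ValueError("need at least a first and a last exon")
    first, *internal, last = exons
    isoforms = []
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            isoforms.append([first, *kept, last])
    return isoforms


# Toy gene with four exons, two of them internal (assumed labels).
toy_gene = ["E1", "E2", "E3", "E4"]
for isoform in enumerate_isoforms(toy_gene):
    print("-".join(isoform))
```

Under this toy model a gene with the median count of 7 exons (5 internal) would allow at most `count_isoforms(5)` = 32 combinations, which shows why a modest number of genes need not limit protein diversity.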

Protein-coding genes are distributed unevenly across the chromosomes, with an especially high gene density within chromosomes 19, 11, and 1 (Table 1). Each chromosome contains various gene-rich and gene-poor regions, which may be correlated with chromosome bands and GC-content. The significance of these nonrandom patterns of gene density is not well understood.

The size of protein-coding genes within the human genome shows enormous variability (Table 2). For example, the gene for histone H1a (HIST1H1A) is relatively small and simple, lacking introns and encoding an mRNA of 781 nt and a 215 amino acid protein (648 nt open reading frame). Dystrophin (DMD) is the largest protein-coding gene in the human reference genome, spanning a total of 2.2 Mb, while Titin (TTN) has the longest coding sequence (80,780 bp), the largest number of exons (364), and the longest single exon (17,106 bp). Over the whole genome, the median size of an exon is 122 bp (mean = 145 bp), the median number of exons is 7 (mean = 8.8), and the median coding sequence encodes 367 amino acids (mean = 447 amino acids; Table 21 in ).
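The histone H1a figures quoted above can be reconciled with simple codon arithmetic. The sketch below assumes that the quoted open reading frame length includes the terminal stop codon, a common convention that the text does not state explicitly; the function name is illustrative only.

```python
def protein_length_from_orf(orf_nt: int, includes_stop: bool = True) -> int:
    """Number of amino acids encoded by an open reading frame of orf_nt nucleotides."""
    if orf_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = orf_nt // 3
    # The stop codon terminates translation and does not add an amino acid.
    return codons - 1 if includes_stop else codons


mrna_nt = 781  # mRNA length quoted in the text
orf_nt = 648   # open reading frame length quoted in the text

print(protein_length_from_orf(orf_nt))  # 215
print(mrna_nt - orf_nt)                 # 133 nt of untranslated sequence
```

Under that assumption, 648 nt corresponds to 216 codons, of which 215 encode amino acids, and the remaining 133 nt of the mRNA are untranslated.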

Protein	Chrom	Gene	Length (bp)	Exons	Exon length (bp)	Intron length (bp)	Alt splicing
Breast cancer type 2 susceptibility protein	13	BRCA2	83,736	27	11,386	72,350	yes
Cystic fibrosis transmembrane conductance regulator	7	CFTR	202,881	27	4,440	198,441	yes
Cytochrome b	MT	MTCYB	1,140	1	1,140	0	no
Dystrophin	X	DMD	2,220,381	79	10,500	2,209,881	yes
Glyceraldehyde-3-phosphate dehydrogenase	12	GAPDH	4,444	9	1,425	3,019	yes
Hemoglobin beta subunit	11	HBB	1,605	3	626	979	no
Histone H1A	6	HIST1H1A	781	1	781	0	no
Titin	2	TTN	281,434	364	104,301	177,133	yes

Table 2. Examples of human protein-coding genes. Chrom, chromosome. Alt splicing, alternative pre-mRNA splicing. (Data source: Ensembl genome browser release 68, July 2012)
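As a quick consistency check on Table 2, the sketch below confirms that each gene's length equals its exon length plus its intron length, and reports the exonic fraction of each gene. The values are copied from the table; the dictionary layout and function name are assumptions made for illustration.

```python
# (gene length, total exon length, total intron length), all in base pairs,
# copied from Table 2.
GENES = {
    "BRCA2": (83_736, 11_386, 72_350),
    "CFTR": (202_881, 4_440, 198_441),
    "MTCYB": (1_140, 1_140, 0),
    "DMD": (2_220_381, 10_500, 2_209_881),
    "GAPDH": (4_444, 1_425, 3_019),
    "HBB": (1_605, 626, 979),
    "HIST1H1A": (781, 781, 0),
    "TTN": (281_434, 104_301, 177_133),
}


def exonic_fraction(gene: str) -> float:
    """Fraction of the gene's length that is exonic."""
    length, exon_bp, _intron_bp = GENES[gene]
    return exon_bp / length


for gene, (length, exon_bp, intron_bp) in GENES.items():
    # Exon and intron lengths should add up to the full gene length.
    assert exon_bp + intron_bp == length, gene
    print(f"{gene:10s} {exonic_fraction(gene):6.1%} exonic")
```

The intronless genes (MTCYB, HIST1H1A) are entirely exonic, whereas less than 1% of DMD is exonic, which quantifies the variability described above.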
