# Genomics Homework1

chr2        37428774        37458740        uc002rpz.3        0        -        37428906        37458710        0        16        272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186,        0,1152,1318,9367,10211,10703,12232,13254,14508,14684,15352,18753,20748,21538,25912,29780,

The first three required BED fields are:

1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature; however, the number in position format will be represented. For example, the first 100 bases of chromosome 1 are defined as chrom=1, chromStart=0, chromEnd=100, and span the bases numbered 0-99 in our software (not 0-100), but will represent the position notation chr1:1-100.

The 9 additional optional BED fields are:

1. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
2. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
3. strand - Defines the strand. Either “.” (=no strand) or “+” or “-”.
4. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
5. thickEnd - The ending position at which the feature is drawn thickly (for example the stop codon in gene displays).
6. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to “On”, this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
7. blockCount - The number of blocks (exons) in the BED line.
8. blockSizes - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
9. blockStarts - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
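
The field definitions above can be checked against the homework transcript with a few lines of Python. This is only a minimal sketch: the names `Bed12`, `parse_bed12` and `exon_intervals` are made up for illustration, and the parser assumes whitespace-separated fields with no spaces inside the name.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The BED12 line for uc002rpz.3 used throughout this homework.
BED_LINE = (
    "chr2 37428774 37458740 uc002rpz.3 0 - 37428906 37458710 0 16 "
    "272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186, "
    "0,1152,1318,9367,10211,10703,12232,13254,14508,14684,15352,"
    "18753,20748,21538,25912,29780,"
)


@dataclass
class Bed12:
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: str
    block_count: int
    block_sizes: List[int]
    block_starts: List[int]


def parse_int_list(text: str) -> List[int]:
    # BED lists usually carry a trailing comma, so skip empty items.
    return [int(item) for item in text.split(",") if item]


def parse_bed12(line: str) -> Bed12:
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    record = Bed12(
        chrom=fields[0],
        chrom_start=int(fields[1]),
        chrom_end=int(fields[2]),
        name=fields[3],
        score=int(fields[4]),
        strand=fields[5],
        thick_start=int(fields[6]),
        thick_end=int(fields[7]),
        item_rgb=fields[8],
        block_count=int(fields[9]),
        block_sizes=parse_int_list(fields[10]),
        block_starts=parse_int_list(fields[11]),
    )
    # blockSizes and blockStarts should both have blockCount items.
    if not (len(record.block_sizes) == record.block_count == len(record.block_starts)):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    return record


def exon_intervals(record: Bed12) -> List[Tuple[int, int]]:
    # blockStarts are relative to chromStart; the result is 0-based, half-open.
    return [
        (record.chrom_start + start, record.chrom_start + start + size)
        for start, size in zip(record.block_starts, record.block_sizes)
    ]


record = parse_bed12(BED_LINE)
print(record.chrom, record.strand, record.block_count)  # chr2 - 16
print(exon_intervals(record)[0])  # (37428774, 37429046)
```

The blocks are listed in chromosome order, so for a negative-strand transcript the first block in the list is the 3' end of the transcript.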

From <https://genome.ucsc.edu/FAQ/FAQformat.html>

From the BED format we can answer:

• Extract this transcript as BED12 format (5’)

chr2 37428774 37458740 uc002rpz.3 0 - 37428906 37458710 0 16 272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186, 0,1152,1318,9367,10211,10703,12232,13254,14508,14684,15352,18753,20748,21538,25912,29780,

• Chromosome (5’)

chr2

• Strand (5’)

The strand field of the BED line is “-”, so this transcript is on the negative strand.

• Exon number (5’)

This equals blockCount, which is 16 in this case.

Then we set the output format to sequence and click get output. Choose “protein” and submit. We then get the protein sequence in FASTA format:

uc002rpz.3 MAAVKEPLEFHAKRPWRPEEAVEDPDEEDEDNTSEAENGFSLEEVLRLGGTKQDYLMLAT LDENEEVIDGGKKGAIDDLQQGELEAFIQNLNLAKYTKASLVEEDEPAEKENSSKKEVKI PKINNKNTAESQRTSVNKVKNKNRPEPHSDENGSTTPKVKKDKQNIFEFFERQTLLLRPG GKWYDLEYSNEYSLKPQPQDVVSKYKTLAQKLYQHEINLFKSKTNSQKGASSTWMKAIVS SGTLGDRMAAMILLIQDDAVHTLQFVETLVNLVKKKGSKQQCLMALDTFKELLITDLLPD NRKLRIFSQRPFDKLEQLSSGNKDSRDRRLILWYFEHQLKHLVAEFVQVLETLSHDTLVT TKTRALTVAHELLCNKPEEEKALLVQVVNKLGDPQNRIATKASHLLETLLCKHPNMKGVV SGEVERLLFRSNISSKAQYYAICFLNQMALSHEESELANKLITVYFCFFRTCVKKKDVES KMLSALLTGVNRAYPYSQTGDDKVREQIDTLFKVLHIVNFNTSVQALMLLFQVMNSQQTI SDRYYTALYRKMLDPGLMTCSKQAMFLNLVYKSLKADIVLRRVKAFVKRLLQVTCQQMPP FICGALYLVSEILKAKPGLRSQLDDHPESDDEENFIDANDDEDMEKFTDADKETEIVKKL ETEETVPETDVETKKPEVASWVHFDNLKGGKQLNKYDPFSRNPLFCGAENTSLWELKKLS VHFHPSVALFAKTILQGNYIQYSGDPLQDFTLMRFLDRFVYRNPKPHKGKENTDSVVMQP KRKHFIKDIRHLPVNSKEFLAKEESQIPVDEVFFHRYYKKVAVKEKQKRDADEESIEDVD DEEFEELIDTFEDDNCFSSGKDDMDFAGNVKKRTKGAKDNTLDEDSEGSDDELGNLDDDE VSLGSMDDEEFAEVDEDGGTFMDVLDDESESVPELEVHSKVSTKKSKRKGTDDFDFAGSF QGPRKKKRNLNDSSLFVSAEEFGHLLDENMGSKFDNIGMNAMANKDNASLKQLRWEAERD DWLHNRDAKSIIKKKKHFKKKRIKTTQKTKKQRK
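
FASTA output from the Table Browser is wrapped over several lines, so the sequence has to be joined before it can be used as a single string. As an alternative to an online cleaner, here is a minimal Python sketch; `read_fasta_text` and the record name `toy_protein` are hypothetical, and the toy input is just the first 20 residues of the sequence above split over two lines.

```python
from typing import Dict, List


def read_fasta_text(text: str) -> Dict[str, str]:
    """Return {record name: sequence} with all line breaks removed."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word after '>' as the record name.
            words = line[1:].split()
            name = words[0] if words else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}


example = ">toy_protein\nMAAVKEPLEF\nHAKRPWRPEE\n"
print(read_fasta_text(example))  # {'toy_protein': 'MAAVKEPLEFHAKRPWRPEE'}
```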

From <https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=763976383_GbLRUAHtXz4mJsZuaxa6O4lzsEDg&hgta_geneSeqType=protein&hgta_doGenePredSequence=submit>

Warning: Be careful, this output sequence contains EOL (‘\n’) characters. You can run it through a sequence cleaner, e.g. http://www.detaibio.com/sms2/filter_protein.html

This is the answer for the CDS protein sequence.

• Extract the sequence of the exon 12 (5’), highlight this block (5’) and take a screenshot as pdf format (5’)

Click get sequence, and you will get some FASTA format data. Find the No. 12 exon (or block, depending on how you name it).

Warning: the No. 12 exon is xxx_11, because the subscript starts from zero.

You can highlight a block by shift + click and drag.

Click the gene graphic and you will enter a page full of all kinds of data:

https://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc002rpz.3&hgg_prot=uc002rpz.3&hgg_chrom=chr2&hgg_start=37428774&hgg_end=37458740&hgg_type=knownGene&db=hg19&hgsid=763976383_GbLRUAHtXz4mJsZuaxa6O4lzsEDg

Does it have isoforms?

uc003qmu.2

From <https://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc003qmu.2&hgg_prot=uc003qmu.2&hgg_chrom=chr6&hgg_start=149979288&hgg_end=150039392&hgg_type=knownGene&db=hg19&hgsid=764000265_VoPfi1D1ZReEprAdTtMipArUJMoR>

Take this one for example: it does. But uc002rpz.3 **doesn’t**. Or you can click it. If a gene has multiple isoforms, there are transcript variants. To be sure, search LATS1 to see how many isoforms it has.

• mRNA length (5’)

This size includes the polyA tail; for the mRNA length without the polyA tail, add up all the blockSizes.

• TSS position (5’)

What is the TSS position? The transcription start site is the starting point of the process of creating a complementary RNA copy of a sequence of DNA.

From <https://en.wikipedia.org/wiki/TSS>

For this transcript the TSS is on the chromEnd side, because this is a negative strand gene (I don’t know if I can say it like this).

• CDS end site position (5’)

• CDS length (hint: not including introns) (5’)

This is equal to the ORF size: 3165

• Get all transcript IDs within 100,000 bp upstream of this transcript TSS (UCSC Genes annotation) (5’)

For this negative strand gene, upstream of the gene is downstream on the chromosome (higher coordinates).

https://genome.ucsc.edu/cgi-bin/hgTables

Warning: Do not include the transcript itself.

Or you can do it this way
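
Separately from the browser steps, the coordinate questions can be cross-checked directly from the BED12 fields. The sketch below is not the homework's official method, and all function names are made up for illustration. It assumes that positions are reported as 1-based coordinates (as in the chr1:1-100 notation), that the TSS of a negative-strand transcript is its last base on the chromosome, and that the thick region covers the whole CDS.

```python
# Values copied from the BED12 line of uc002rpz.3 (0-based, half-open).
CHROM_START, CHROM_END = 37428774, 37458740
STRAND = "-"
THICK_START, THICK_END = 37428906, 37458710
BLOCK_SIZES = [272, 82, 59, 84, 197, 58, 98, 67, 69, 103, 54, 89, 184, 232, 1493, 186]
BLOCK_STARTS = [0, 1152, 1318, 9367, 10211, 10703, 12232, 13254,
                14508, 14684, 15352, 18753, 20748, 21538, 25912, 29780]


def mrna_length(block_sizes):
    # Exons only, no introns and no polyA tail.
    return sum(block_sizes)


def cds_length(chrom_start, thick_start, thick_end, block_starts, block_sizes):
    # Sum the overlap between each exon and the thick (coding) region.
    total = 0
    for start, size in zip(block_starts, block_sizes):
        exon_start = chrom_start + start
        exon_end = exon_start + size
        overlap = min(exon_end, thick_end) - max(exon_start, thick_start)
        if overlap > 0:
            total += overlap
    return total


def tss_1based(chrom_start, chrom_end, strand):
    # Negative strand: transcription starts at the highest coordinate.
    return chrom_end if strand == "-" else chrom_start + 1


def cds_end_1based(thick_start, thick_end, strand):
    # Negative strand: the CDS ends at the lowest coding coordinate.
    return thick_start + 1 if strand == "-" else thick_end


def upstream_window_1based(chrom_start, chrom_end, strand, flank=100_000):
    # Region of `flank` bases immediately upstream of the TSS (TSS excluded).
    if strand == "-":
        return chrom_end + 1, chrom_end + flank
    return max(1, chrom_start - flank + 1), chrom_start


print(mrna_length(BLOCK_SIZES))  # 3327
print(cds_length(CHROM_START, THICK_START, THICK_END, BLOCK_STARTS, BLOCK_SIZES))  # 3165
print(tss_1based(CHROM_START, CHROM_END, STRAND))  # 37458740
print(cds_end_1based(THICK_START, THICK_END, STRAND))  # 37428907
print(upstream_window_1based(CHROM_START, CHROM_END, STRAND))  # (37458741, 37558740)
```

The CDS length of 3165 agrees with the ORF size above. Under these assumptions, the upstream window chr2:37458741-37558740 is the region to enter in the Table Browser when listing neighbouring UCSC Genes transcripts.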