Summary of the bioinformatics slides and notes. Code with explanations, and worked exercise sessions with explanations. Covers the various databases and frequently used code.
Bioinformatics
Contents
Lesson 1: intro + databases
Gene database
Protein database
Lesson 2: databases
Ontologies
Gene expression
Phenotypes/Diseases
Model Organism databases
Lesson 3: genome browsers + SQL
Genome browsers
Homology
Database architectures
Lesson 4: Linux + Jupyter
Navigating the file system
Additional Jupyter Notebook notes
Lesson 5: EMBOSS + BedTools exercises
Lesson 6: Gene prediction
Lessons 7+8: Python
Lesson 9: Alignment, pattern matching, gene set analysis
Exercise session 1
Information retrieval
CpG islands
Unknown sequence study
Exercise session 2
Python CpG island
miRNA
Lesson 1: intro + databases
Gene database
- Entrez Gene
o Part of NCBI: https://www.ncbi.nlm.nih.gov/gene/
o In the transcript graphic, each line is a transcript isoform (arising from alternative
promoters and alternative splicing); look at the exons, introns, non-coding exons
(light green: 5’ UTR, 3’ UTR), and coding exons (dark green)
o Each transcript has a unique NM_ identifier = RefSeq identifier
o Each NM transcript corresponds to a unique NP_ protein entry
o More details about each NM/NP and links to the sequence in Entrez
Nucleotide are at the bottom of the Gene page
▪ Entrez Nucleotide contains all nucleotide sequences
▪ Search Nucleotide db with NM_000564
▪ The number after the dot “.” in an accession is the version number
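As a small illustration of the accession/version convention above, the following Python sketch splits an identifier at the dot. The function name `split_accession` is made up for this example, and the version number 2 in the usage line is a placeholder, not the actual current version of NM_000564.

```python
from typing import Optional, Tuple


def split_accession(accession: str) -> Tuple[str, Optional[int]]:
    """Split 'NM_000564.2' into ('NM_000564', 2).

    The version is None when the accession has no '.version' suffix.
    """
    base, dot, version = accession.strip().partition(".")
    return base, (int(version) if dot else None)


# The version below is a hypothetical placeholder for illustration only.
with_version = split_accession("NM_000564.2")
without_version = split_accession("NM_000564")
```

Searching with the bare accession (no version) normally leads to the latest version of the record, which is why it helps to keep the two parts separate.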
- RefSeq
o https://www.ncbi.nlm.nih.gov/refseq/
o Many sequences were/are represented more than once in GenBank
o RefSeq = curated “secondary” database that aims to provide a
comprehensive, integrated, nonredundant set of sequences
o Goal is to provide a reference sequence for each molecule in the central dogma
(DNA, mRNA, and protein)
o Each RefSeq represents a single, naturally occurring molecule from one
organism
o Nucleotide and protein sequences in RefSeq are explicitly linked to one
another
o Distinct accession number: 2+6 format (two letters, underscore, six-digit number; newer
accessions can have more digits)
▪ NT_123456 (genomic contigs), NM_123456 (mRNAs), NP_123456
(proteins)
▪ XM_123456 (model mRNAs), XP_123456 (model proteins):
computational predictions
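The prefix scheme above can be checked mechanically. The sketch below is only an illustration: the names `REFSEQ_PREFIXES` and `classify_refseq` are invented here, the table contains only the five prefixes listed in these notes, and the pattern accepts six or more digits as an assumption because newer accessions can be longer than the 2+6 format.

```python
import re

# Only the prefixes mentioned in these notes; RefSeq defines more.
REFSEQ_PREFIXES = {
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (computational prediction)",
    "XP": "model protein (computational prediction)",
}

# Two letters, underscore, digits, optional ".version".
# Assumption: six OR MORE digits are allowed.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return a description of the molecule type, or None if the
    string does not look like a RefSeq accession."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1), "prefix not listed in these notes")


kind_mrna = classify_refseq("NM_123456")
kind_model = classify_refseq("XP_123456.1")
```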
To view the data, download the record in GenBank format (.gb) as a text file and open it in a text editor
such as Visual Studio Code or Jupyter Notebook.
- How to download:
o Click “Send to” (upper right of the screen)
o Select “Complete Record” and “File”
o Choose GenBank format or FASTA (FASTA has no GenBank header or feature table)
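A FASTA download contains only a description line starting with ">" followed by the sequence lines. A minimal plain-Python reader could look like the sketch below; the function name `read_fasta` and the example record are hypothetical and do not come from a real database entry.

```python
def read_fasta(text):
    """Return {description: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]          # description line without the ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical record used only to show the file layout.
example = """>demo_sequence hypothetical record
ATGGCC
TTAA
"""
fasta_records = read_fasta(example)
```

For real work a library parser is usually a safer choice; this sketch only shows how little structure the format has compared with a GenBank record.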
- In the FEATURES section
o The sequence has a coding sequence (CDS) made up of five exons
▪ The first exon begins at base 201 and ends at base 224
▪ It is then joined to the second exon, which runs from base 1550 to base 1920, and so forth.
o Each comma in the CDS location line marks a splice junction, and each “..” denotes
the range of bases between the two coordinates.
o The gene product is eukaryotic initiation factor 4E-II, and the gene name is
eIF4E
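The CDS location line described above can be read programmatically. The sketch below parses a join(...) string into exon coordinates and derives the intron between them. The helper names `parse_join` and `introns_between` are invented for this example, and the location string contains only the two exons whose coordinates are given in these notes; the remaining three exons are not listed here, so they are left out rather than guessed. Locations using complement(...) or partial markers such as "<" are not handled.

```python
import re


def parse_join(location):
    """Turn 'join(201..224,1550..1920)' into [(201, 224), (1550, 1920)]."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):          # each comma is a splice junction
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append((int(match.group(1)), int(match.group(2))))
    return segments


def introns_between(segments):
    """Coordinates of the gaps between consecutive exons."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(segments, segments[1:])]


# Truncated illustration: only the first two of the five exons.
exons = parse_join("join(201..224,1550..1920)")
coding_length = sum(end - start + 1 for start, end in exons)
introns = introns_between(exons)
```

Coordinates are 1-based and inclusive, which is why each exon length is end - start + 1.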
- EMBL-EBI
o https://www.ebi.ac.uk/
o European database
o DBFETCH provides an easy way to retrieve entries from various databases at
the EMBL-EBI
o Format: https://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=refseqn&id=NM_000231&format=fasta&style=raw
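A DBFETCH request is just a URL with a database name, an entry identifier, a format, and a style. The sketch below only builds such a URL and does not send a request. It assumes that the endpoint path is `https://www.ebi.ac.uk/Tools/dbfetch/dbfetch` and that the parameters are separated with "&"; the function name `build_dbfetch_url` is invented for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint path; check the EMBL-EBI documentation before relying on it.
DBFETCH_ENDPOINT = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def build_dbfetch_url(db, entry_id, fmt="fasta", style="raw"):
    """Compose a DBFETCH URL from its four parameters."""
    query = urlencode({"db": db, "id": entry_id, "format": fmt, "style": style})
    return f"{DBFETCH_ENDPOINT}?{query}"


# refseqn and NM_000231 are the database name and identifier used in these notes.
url = build_dbfetch_url("refseqn", "NM_000231")
```

The resulting string can be pasted into a browser or passed to a download tool to retrieve the entry as plain text.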
Protein database
- UniProt: https://www.uniprot.org/
o Offers each entry in General Feature Format (GFF), a plain-text file
▪ Click “Download”
▪ Choose the GFF format
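A GFF file is tab-separated, with nine standard columns per feature (sequence ID, source, type, start, end, score, strand, phase, attributes) and comment lines starting with "#". The sketch below reads such text into named tuples. The names `GffFeature` and `parse_gff` and the two example features are hypothetical; keeping only the first nine columns is an assumption made so that lines with extra trailing fields are still accepted.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff(text):
    """Parse GFF text into a list of GffFeature tuples."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        cols = cols[:9]  # assumption: ignore any extra trailing columns
        cols[3], cols[4] = int(cols[3]), int(cols[4])  # start and end positions
        features.append(GffFeature(*cols))
    return features


# Hypothetical features, not taken from a real UniProt entry.
example = (
    "##gff-version 3\n"
    "DEMO01\tUniProtKB\tChain\t1\t120\t.\t.\t.\tID=demo_chain\n"
    "DEMO01\tUniProtKB\tDomain\t15\t80\t.\t.\t.\tNote=hypothetical domain\n"
)
features = parse_gff(example)
domain_length = features[1].end - features[1].start + 1
```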
- Protein sequences in databases can be derived from translation of nucleotide
sequences (secondary databases)
o e.g., RefSeq NM_ to RefSeq NP_
o e.g., TrEMBL
o Go to the protein database by following one of the NP_ isoforms
- There are also curated databases: experts enhance the original data by adding new
information
o e.g., SwissProt (in the UniProt knowledgebase)
▪ Information from literature
▪ Curator-evaluated computational analysis/predictions
- 3D structures
o https://www.ncbi.nlm.nih.gov/structure/ or UniProt → Structure
Lesson 2: databases
Ontologies
- Gene ontology (GO)
o https://geneontology.org/ or https://www.ebi.ac.uk/QuickGO/ (human gene symbols are usually
capitalized)
▪ Downloading data from QuickGO
• Click “Export”
• Choose the format: gene association file (then add .txt to the file name)
• Adjust the number of annotations
o Specific purpose: “Annotation of genes and proteins in genomic and protein
databases”
o Facilitate complex queries
o Applicable to all species
o Databases involved:
▪ FlyBase (Drosophila)
▪ MGI (Mouse)
▪ SGD (S. cerevisiae)
▪ TAIR (Arabidopsis)
▪ TIGR (microbes including prokaryotes)
▪ SWISS-PROT (several thousand species inc. human)
▪ PSU (P. falciparum)
▪ ZFIN (zebrafish)
▪ PAMGO (plant pathogens)
o GO structure