This is one of the chapters of the Bioinformatics basic course at Tsinghua University. You may also find it here
The related jupyter file
Python Tutorial
Install Anaconda Python
URL: (https://www.anaconda.com/download/)
- Easy install of data science packages (binary distribution)
- Package management with conda
Install Python packages using conda:
conda install h5py
Update a package to the latest version:
conda update h5py
Install Python packages using pip:
pip install h5py
Update a package using pip:
pip install --upgrade h5py
Python language tips
Compatibility between Python 3.x and Python 2.x
Biggest difference: print is a function rather than a statement in Python 3
This does not work in Python 3:
print 1, 2, 3
Solution: use the __future__ module
from __future__ import print_function
Second biggest difference: some package/function names in the standard library were renamed
Python 2 => Python 3
cStringIO => io.StringIO
Solution: use the six module
- Refer to (https://docs.python.org/3/library/__future__.html) for usage of the __future__ module.
- Refer to (https://pythonhosted.org/six/) for usage of the six module.
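When six is not installed, a try/except import is a common fallback for renamed modules. The sketch below is an alternative to six, not the six API itself, and uses only the standard library.

```python
# Try the Python 2 module name first, then fall back to the Python 3 name.
try:
    from cStringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

# The rest of the code can use StringIO without caring about the version.
buf = StringIO()
buf.write('ACGT\n')
content = buf.getvalue()
print(content)
```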
Get away from IndentationError
Python uses indentation (tabs or spaces) to delimit code blocks
# use a tab
Best practice: always use 4 spaces. Most editors let you choose whether to indent with spaces (soft tabs) or tabs.
In the vim editor, use :set list to make tabs visible and spot inconsistent indentation.
Add Shebang and encoding at the beginning of executable scripts
Create a file named welcome.py:
#! /usr/bin/env python
Then set the python script as executable:
chmod +x welcome.py
Now you can run the script without specifying the Python interpreter:
./welcome.py
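A minimal sketch of what welcome.py might contain, assuming a UTF-8 source file. The welcome function and its message are hypothetical and only illustrate where the shebang and the encoding declaration go.

```python
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# Line 1 (shebang) tells the shell which interpreter runs the script.
# Line 2 declares the source encoding; Python 2 needs it when the file
# contains non-ASCII characters.

def welcome(name):
    # Hypothetical helper: build the greeting text
    return 'Welcome to Python, {}!'.format(name)

if __name__ == '__main__':
    print(welcome('world'))
```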
All variables, functions, and classes are dynamic objects
class MyClass():
All Python variables are pointers/references
a = [1, 2, 3]
Use deepcopy if you really want to COPY a variable
from copy import deepcopy
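A short sketch of the difference between assigning a reference and making a deep copy. The nested list is an illustrative value chosen here, not taken from the original example.

```python
from copy import deepcopy

a = [1, [2, 3]]
b = a              # b refers to the same list object as a
c = deepcopy(a)    # c is an independent copy, including the nested list

b[0] = 100         # also visible through a
c[1][0] = -1       # does not affect a

print(a)           # [100, [2, 3]]
print(b is a)      # True
print(c)           # [1, [-1, 3]]
```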
What if I accidentally overwrite my builtin functions?
You can refer to (https://docs.python.org/2/library/functions.html) for builtin functions in the standard library.
A = [1, 2, 3, 4]
Note: in Python 3, you should import from builtins rather than __builtin__
from builtins import sum
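One way this can play out, as a sketch for Python 3: the name sum is accidentally reused as an accumulator and then restored from the builtins module. The variable names accumulated and total are illustrative.

```python
import builtins  # this module is named __builtin__ in Python 2

A = [1, 2, 3, 4]

sum = 0                 # oops: the name sum now refers to an int
for x in A:
    sum += x
accumulated = sum       # calling sum(A) here would raise TypeError

sum = builtins.sum      # restore the original builtin function
total = sum(A)
print(accumulated, total)
```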
int is of arbitrary precision in Python!
In Python:
print(2**10000)
In R:
print(2^10000)
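A small sketch showing that the Python result is exact, whereas a double-precision language such as R overflows to Inf for the same expression. The variable names are illustrative.

```python
x = 2 ** 10000           # exact integer, no overflow
digits = str(x)

print(len(digits))       # 3011 decimal digits
print(x % 10)            # 6, the exact last digit
print(float(2 ** 100))   # floats, in contrast, have limited precision
```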
Easiest way to swap values of two variables
In C/C++:
int a = 1, b = 2, t;
In Python:
a = 1
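The Python idiom relies on tuple packing and unpacking, so no temporary variable is needed. A minimal sketch:

```python
a = 1
b = 2

# The right-hand side builds the tuple (2, 1), which is then unpacked.
a, b = b, a

print(a, b)   # 2 1
```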
List comprehension
Use for-loops:
a = []
Use list comprehension:
a = [i + 10 for i in range(10)]
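For comparison, a sketch of both forms side by side. The loop body is an assumption about what the for-loop version does, and the filtering example is an extra illustration.

```python
# For-loop version
a_loop = []
for i in range(10):
    a_loop.append(i + 10)

# Equivalent list comprehension
a_comp = [i + 10 for i in range(10)]

# A comprehension can also filter with a trailing if
even = [i for i in range(10) if i % 2 == 0]

print(a_comp)
print(even)
```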
Dict comprehension
Use for-loops:
a = {}
Use dict comprehension:
a = {i: chr(ord('A') + i) for i in range(10)}
print(a)
For the one-liners
Use ‘;’ instead of ‘\n’:
# print the first column of each line
For more examples of one-liners, please refer to (https://wiki.python.org/moin/Powerful%20Python%20One-Liners).
Read from standard input
import sys
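A sketch of line-by-line reading that works with sys.stdin or any other file-like object. The helper first_columns and the demo data are hypothetical; an in-memory stream stands in for standard input so the example runs without a pipe.

```python
import io
import sys

def first_columns(stream):
    """Yield the first whitespace-separated field of each non-empty line."""
    for line in stream:
        fields = line.split()
        if fields:
            yield fields[0]

def main():
    # In a real script, iterate over sys.stdin directly.
    for col in first_columns(sys.stdin):
        print(col)

# Demo with an in-memory stream instead of standard input
demo = io.StringIO('chr1\t100\t200\nchr2\t300\t400\n')
result = list(first_columns(demo))
print(result)
```

Calling main() at the end of a script would let you run it as `cat file.txt | ./script.py`.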
Order of dict keys may NOT be what you expect (before Python 3.7; since Python 3.7, dicts preserve insertion order)
a = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6}
Use enumerate() to add a number during iteration
A = ['a', 'b', 'c', 'd']
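A minimal sketch of enumerate(), including its optional start argument. The names pairs and numbered are illustrative.

```python
A = ['a', 'b', 'c', 'd']

# enumerate() yields (index, item) pairs, counting from 0 by default
pairs = []
for i, x in enumerate(A):
    pairs.append((i, x))

# The start argument changes the first index
numbered = list(enumerate(A, start=1))

print(pairs)
print(numbered)
```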
Reverse a list
# a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
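Three common ways to reverse a list, sketched below. Which of them the original example used is not known, so treat this as an overview.

```python
a = list(range(10))    # [0, 1, ..., 9]

b = a[::-1]            # slicing creates a new reversed list
c = list(reversed(a))  # reversed() returns an iterator over a
a.reverse()            # reverses a in place and returns None

print(a)
```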
Strings are immutable in Python
a = 'ABCDF'
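A sketch of what happens when you try to modify a string in place, and two ways to build a corrected string instead. The variable names are illustrative.

```python
a = 'ABCDF'

error = None
try:
    a[4] = 'E'             # item assignment is not supported for str
except TypeError as e:
    error = str(e)

# Build a new string instead of modifying the old one
b = a[:4] + 'E'

# Or go through a list, which is mutable
chars = list(a)
chars[4] = 'E'
c = ''.join(chars)

print(error)
print(b, c)
```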
Tuples are hashable while lists are not
# create dict using tuples as keys
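A minimal sketch with made-up (chromosome, position) keys: tuples work as dict keys, while a list key raises TypeError.

```python
# Tuples of hashable items can be used as dict keys
d = {('chr1', 100): 'geneA', ('chr2', 200): 'geneB'}
value = d[('chr1', 100)]

# Lists are mutable and therefore unhashable
try:
    d[['chr1', 100]] = 'geneC'
    list_key_ok = True
except TypeError:
    list_key_ok = False

print(value, list_key_ok)
```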
Use itertools
Nested loops in a more concise way:
A = [1, 2, 3]
Get all combinations of a list:
A = ['A', 'B', 'C', 'D']
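A sketch of itertools.product as a replacement for nested loops and itertools.combinations for unordered pairs. The second list B and the pair size 2 are assumptions made for illustration.

```python
import itertools

A = [1, 2, 3]
B = ['x', 'y']

# Nested loops written out explicitly
nested = [(a, b) for a in A for b in B]

# The same pairs from itertools.product
prod = list(itertools.product(A, B))

# All 2-element combinations, in the order the items appear
combos = list(itertools.combinations(['A', 'B', 'C', 'D'], 2))

print(prod)
print(combos)
```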
Convert iterables to lists
import itertools
Use the zip() function to transpose nested lists/tuples/iterables
records = [
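A sketch of transposing with zip(*...), using made-up records since the original list is not shown in full.

```python
# Each record is one row: (chromosome, position, strand)
records = [
    ('chr1', 100, '+'),
    ('chr2', 200, '-'),
    ('chr3', 300, '+'),
]

# zip(*records) passes each row as a separate argument,
# so the result groups values by column.
chroms, starts, strands = zip(*records)

print(chroms)
print(starts)
print(strands)
```

Applying zip again to the columns gives back the rows.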
Global and local variables
# a is global
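A sketch of the three typical cases, assuming the code runs at module level: reading a global, shadowing it with a local, and rebinding it with the global statement. The function names are hypothetical.

```python
a = 1  # a is global

def read_global():
    return a          # no assignment here, so a is looked up globally

def shadow():
    a = 2             # assignment creates a new local variable a
    return a

def modify():
    global a          # rebind the module-level a
    a = 3

r1 = read_global()
r2 = shadow()
after_shadow = a      # still 1: shadow() did not touch the global
modify()
after_modify = a      # now 3

print(r1, r2, after_shadow, after_modify)
```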
Use defaultdict
Use dict:
d = {}
Use defaultdict:
from collections import defaultdict
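A sketch comparing the two approaches on a grouping task. The (chromosome, gene) pairs are made-up data.

```python
from collections import defaultdict

pairs = [('chr1', 'geneA'), ('chr2', 'geneB'), ('chr1', 'geneC')]

# Plain dict: the key must be initialized before appending
d = {}
for chrom, gene in pairs:
    if chrom not in d:
        d[chrom] = []
    d[chrom].append(gene)

# defaultdict: a missing key is created on first access by calling list()
dd = defaultdict(list)
for chrom, gene in pairs:
    dd[chrom].append(gene)

print(d)
print(dict(dd))
```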
Use generators
Example: read a large FASTA file
def append_extra_line(f):
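One possible generator-based FASTA reader, sketched under the assumption that records start with a '>' header line. The function read_fasta is hypothetical and is not the original append_extra_line implementation. Only one record is held in memory at a time.

```python
import io

def read_fasta(lines):
    """Yield (name, sequence) pairs one record at a time."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header ends the previous record
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the last record after the input is exhausted
    if name is not None:
        yield name, ''.join(chunks)

# Demo with an in-memory file; an open file object works the same way
demo = io.StringIO('>seq1\nACGT\nTTAA\n>seq2\nGGCC\n')
records = list(read_fasta(demo))
print(records)
```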
Turn off annoying KeyboardInterrupt and BrokenPipeError tracebacks
Without exception handling (press Ctrl+C):
import time
With exception handling (press Ctrl+C):
import time
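A sketch of a wrapper that turns both exceptions into a quiet exit status instead of a traceback. The names run_quietly and main are hypothetical, the exit status 1 is an arbitrary choice, and BrokenPipeError exists only in Python 3.

```python
import sys
import time

def main():
    # Hypothetical long-running job
    for i in range(3):
        print(i)
        time.sleep(0.01)

def run_quietly(func):
    """Run func and return an exit status instead of showing a traceback."""
    try:
        func()
    except KeyboardInterrupt:
        # Raised when the user presses Ctrl+C
        return 1
    except BrokenPipeError:
        # Raised when the reader of a pipe (e.g. head) exits early
        return 1
    return 0
```

In a script, the last line could be `sys.exit(run_quietly(main))`.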
Class and instance variables
class MyClass():
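A sketch of the usual pitfall: a mutable class variable is shared by all instances, while attributes set in __init__ belong to each instance. The attribute names shared and own are illustrative.

```python
class MyClass():
    shared = []          # class variable: one object shared by all instances

    def __init__(self):
        self.own = []    # instance variable: created for each instance

x = MyClass()
y = MyClass()

x.shared.append(1)       # mutates the class-level list, visible through y
x.own.append(2)          # only affects x

print(y.shared)          # [1]
print(y.own)             # []
```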
Useful Python packages for data analysis
Browser-based interactive programming in Python: jupyter
URL: (http://jupyter.org/)
Start jupyter notebook
jupyter notebook --no-browser
Jupyter notebooks manager
Jupyter process manager
Jupyter notebook
Integrate with matplotlib
Browser-based text editor
Browser-based terminal
Display image
Display dataframe
Display audio
Embedded markdown
Python packages for scientific computing
Vector arithmetic: numpy
URL: (http://www.numpy.org/)
Example:
import numpy as np
Numerical analysis (probability distribution, signal processing, etc.): scipy
URL: (https://www.scipy.org/)
scipy.stats contains a large number of probability distributions:
Unified interface for all probability distributions:
Just-in-time (JIT) compiler for vector arithmetic: numba
URL: (https://numba.pydata.org/)
Compile Python for-loops to native code to achieve performance similar to C/C++ code.
Example:
from numba import jit
Library for symbolic computation: sympy
URL: (http://www.sympy.org/en/index.html)
Operation on data frames: pandas
URL: (http://pandas.pydata.org/pandas-docs/stable/)
Example:
import pandas as pd
Basic graphics and plotting: matplotlib
URL: (https://matplotlib.org/contents.html)
Statistical data visualization: seaborn
URL: (https://seaborn.pydata.org/)
Interactive programming in Python: ipython
URL: (http://ipython.org/ipython-doc/stable/index.html)
Statistical tests: statsmodels
URL: (https://www.statsmodels.org/stable/index.html)
Machine learning algorithms: scikit-learn
URL: (http://scikit-learn.org/)
Example:
from sklearn.datasets import make_classification
Natural language analysis: gensim
URL: (https://radimrehurek.com/gensim/)
HTTP library: requests
URL: (http://docs.python-requests.org/en/master/)
Lightweight Web framework: flask
URL: (http://flask.pocoo.org/)
Deep learning framework: tensorflow
URL: (http://tensorflow.org/)
High-level deep learning framework: keras
URL: (https://keras.io/)
Operation on sequence and alignment formats: biopython
URL: (http://biopython.org/)
from Bio import SeqIO
from Bio import SeqIO
Operation on genomic formats (BigWig, etc.): bx-python
Operation on HDF5 files: h5py
URL: (https://www.h5py.org/)
Save data to an HDF5 file:
import h5py
h5ls -r dataset.h5
/ Group
Read data from an HDF5 file:
import h5py
Mixed C/C++ and python programming: cython
URL: (http://cython.org/)
import numpy as np
Progress bar: tqdm
URL: (https://pypi.python.org/pypi/tqdm)
Example Python scripts
View a table in a pretty way
The original table is ugly:
head -n 15 metadata.tsv
Output:
File accession File format Output type Experiment accession Assay Biosample term id
Now display the table more clearly:
head -n 15 metadata.tsv | tvi -d $'\t' -j center
Output:
File accession File format Output type Experiment accession Assay Biosample term id
You can also get some help by typing tvi -h:
usage: tvi [-h] [-d DELIMITER] [-j {left,right,center}] [-s SEPARATOR]
tvi.py
#! /usr/bin/env python
Generate a random FASTA file
seqgen.py
#! /usr/bin/env python
Weekly tasks
All files you need for completing the tasks can be found at: weekly_tasks.zip
Task 1: run examples (Python tips, numpy, pandas) in this tutorial
Install Anaconda on your PC. Try to understand example code and run in Jupyter or IPython.
Task 2: write a Python program to convert a GTF file to BED12 format
- Please refer to (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) for BED12 format and refer to (https://www.ensembl.org/info/website/upload/gff.html) for GTF format.
GTF example:
chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "MIR1302-2HG"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
BED12 example:
chr1 67522353 67532326 ENST00000230113 0 + 0 0 0 5 45,60,97,64,221, 0,5024,7299,7961,9752,
- The GTF file is weekly_tasks/gencode.v27.long_noncoding_RNAs.gtf.
- Each line in the output file is a transcript, with the 4th column as the transcript ID.
- The version number of the transcript ID should be stripped (e.g. ENST00000473358.1 => ENST00000473358).
- The output file is sorted first by transcript IDs and then by chromosome in lexicographical order.
- Columns 5, 7, 8 and 9 in the BED12 file should be set to 0.
- Please do NOT use any external tools (e.g. sort, awk, etc.) in your program other than Python.
- An example output can be found in weekly_tasks/transcripts.bed.
Hint: use dict, list, tuple, str.split, re.match, sorted.
Task 3: write a Python program to add a prefix to all directories
- Each prefix is a two-digit number starting from 00, followed by ‘-’. Numbers less than 10 should be padded with a leading ‘0’.
- The files/directories should be numbered according to the lexicographical order.
For example, if the original directory structure is:
.
then you should get the following directory structure after renaming:
.
- The original directories can be found in weekly_tasks/original_dirs.
- The root directory (i.e. original_dirs) should not be renamed.
- You can use the tree command to display the directory structure as shown above.
- An example result can be found in weekly_tasks/renamed_dirs.
Hint: use os.listdir, os.rename, str.format, sorted, yield.