|
SeqAn3 3.3.0
The Modern C++ library for sequence analysis.
|
Learning Objective:
In this tutorial, you will learn how to combine the components of previous tutorials to create your very first SeqAn application: a read mapper!
| Difficulty | High |
|---|---|
| Duration | 90 Minutes |
| Prerequisite tutorials | All |
| Recommended reading |
Read mapping is a common task in bioinformatics and is often the first step of an in-depth analysis of Next Generation Sequencing data. Its aim is to identify positions where a query sequence (read) matches with up to e errors to a reference sequence.
In this example, we will implement a read mapper step by step and make use of what we have learned in the previous tutorials. As it is common practice with read mappers, we will first create an indexer that creates an index from the reference and stores it to disk. After this, we will implement the actual read mapper that will use the stored index and map the reads.
We provide an example reference and an example query file.
As a first step, we want to parse command line arguments for our indexer. If you get into trouble, you can take a peek at the Argument Parser Tutorial or the API documentation of the seqan3::argument_parser for help.
Follow the best practice and create:
run_program that prints the parsed argumentscmd_arguments that stores the argumentsinitialise_argument_parser to add meta data and options to the parsermain function that parses the arguments and calls run_programUse validators where applicable!
Your main may look like this:
As a next step, we want to use the parsed file name to read in our reference data. This will be done using seqan3::sequence_file_input class. As a guide, you can take a look at the Sequence I/O Tutorial.
To do this, you should create:
reference_storage_t that stores the sequence information for both reference and query information within member variablesread_reference that fills a reference_storage_t object with information from the files and prints the reference IDsYou should also perform the following changes in run_program:
storage of type reference_storage_tread_referenceThis is the signature of read_reference:
This is the reference_storage_t:
Here is the complete program:
Now that we have the necessary sequence information, we can create an index and store it. Read up on the Index Tutorial if you have any questions.
create_index:index_path and storage as parametersWe also need to change:
run_program to now also call create_indexrun_program and read_reference to not print the debug output anymoreThis is the signature of create_index:
Here is the complete program:
Again, we want to parse command line arguments for our read mapper as a first step. If you get into trouble, you can take a peek at the Argumet Parser Tutorial or the API documentation of the seqan3::argument_parser for help.
Follow the best practice and create:
run_program that prints the parsed argumentscmd_arguments that stores the argumentsinitialise_argument_parser to add meta data and options to the parsermain function that parses the arguments and calls run_programUse validators where applicable!
Your main may look like this:
We also want to read the reference in the read mapper. This is done the same way as for the indexer. We can now load the index and conduct a search. Read up on the Search Tutorial if you have any questions.
To do this, you should:
read_reference function and the reference_storage_t struct from the indexermap_reads that loads the index and prints the results of the search (allowing all error types) for the first 20 queriesYou should also perform the following changes in run_program:
storage of type reference_storage_tread_reference and map_readsThis is the signature of map_reads:
Here is the complete program:
We can now use the obtained positions to align each query against the reference. Refer to the Alignment Tutorial if you have any questions.
map_reads to:This is the alignment config:
Here is the complete program:
Finally, we can write our results into a SAM file.
map_reads to write the alignment results into a SAM file. Additionally, there should be no more debug output.
Try to write all available information into the SAM file. We can introduce a naive mapping quality by using mapping quality = 60 + alignment score.
This is the sam_file_output construction:
Here is the complete program: