This page contains all the scripts developed to test the method and to obtain the results.
Here we describe how to use all the scripts, with some examples on the test data provided.
Computation of fuzzy read counts
Computation of fuzzy fold change
Computation of DE under/same/over possibilities
Appendix: how to convert mapping output to formatted input table
The provided scripts were developed with different languages, in order to to test the method described in the paper. We use Linux Shell scripts, Python scripts and mainly R scripts.
The input of the workflow is a set of files, one for each sequenced sample, formatted as tables containing the following columns:
- ReadID (string without whitespaces)
- GeneID (string without whitespaces)
- MappingScore (positive number)
This file is obtained by properliy extracting these information from the output of the mapping performed on the files containing the reads. We start from this file format in order to leave the user free to use the preferred sequencing platform or mapping algorithm. Some examples of how to convert the mapping output to our input format are showed in Appendix A.
Example: technical_rep1.tsv (Tab Separed Values)
H7HV99101A000M OTTHUMG00000183546|TRIM28 9888.0
H7HV99101A001A OTTHUMG00000150691|PSMD8 9813.93026087
H7HV99101A002W OTTHUMG00000018964|TRIM8 9182.52
H7HV99101A003Z OTTHUMG00000151985|DCBLD2 9448.43103448
H7HV99101A004T OTTHUMG00000035011|ARNT 486.5
H7HV99101A004T OTTHUMG00000035635|NR5A2 486.85
H7HV99101A004T OTTHUMG00000168750|SRGAP1 600.0
...
Some datasets are provided to test the application. The compressed folder contains both the data and the output obtained with the computations. The datasets are the same used for the "Results and discussion" section in the paper.
The first dataset is composed by two files, which are technical replicates obtained by sequencing the same sample two times with the Roche 454 sequencer. The FASTA files were mapped against Vega genes with BLAST.
- technical_rep1.tsv
- technical_rep2.tsv
The second dataset is composed by two files, which are used as case and control for a test without replicates. The FASTQ files were mapped against Vega genes with bowtie.
The third dataset is composed by 8 files, 4 biological replicates for case and 4 biological replicates for control. The FASTQ files were mapped against Vega genes with bowtie.
- case_rep1.tsv
- case_rep2.tsv
- case_rep3.tsv
- case_rep4.tsv
- control_rep1.tsv
- control_rep2.tsv
- control_rep3.tsv
- control_rep4.tsv
The test data were downloaded from the SRA study SRP055874
Computation of fuzzy read counts
This script computes the four characteristics values for the fuzzy read counts (A,B,C,D), counting the number of uniquely mapping reads, best matches and further matches from the mapping output.
Script: mapping_output_2_trapezoids.sh (calls python/compute_scores.py)
Language: Shell + Python
Input: tsv file with columns ReadID, GeneID, MappingScore (formatted as described here)
Output: tsv file with columns GeneID, Centroid, AValue, BValue, CValue, DValue
Example command:
mapping_output_2_trapezoids.sh case.tsv case.fuzzy
Example output: case.fuzzy (included in multidea_test_data.zip)
This script is useful when the experiment includes technical replicates that must be merged. The merging can be done before the mapping, but if the files have been already mapped and the fuzzy read counts have been already computed, this script can merge the fuzzy read couns by summing their values for each gene. The output is still a file containing fuzzy read counts.
Script: merge_technical_replicates.R
Language: R
Input: tsv files with fuzzy read counts (see Computation of fuzzy fold change)
Output: tsv file with merged technical replicates
Example command:
Rscript merge_technical_replicates.R merged_result.fuzzy technical_rep1.fuzzy technical_rep2.fuzzy technical_rep3.fuzzy ...
This script normalize the samples adopting the strategy proposed by DESeq2, in which each sample is scaled according to a model based on the geometric mean of values obtained for each gene across all the samples.
Script: normalize_samples.R
Language: R
Input: tsv files with fuzzy read counts (see Computation of fuzzy fold change), and a suffix for normalized files
Output: tsv files (one for each input) with normalized fuzzy read counts
Example command:
Rscript normalize_samples.R ".norm" control.fuzzy case.fuzzy
This script is useful when the experiment includes biological replicates that must be merged. The merging transforms the fuzzy read counts of each gene in order to cover all the possible read count values. This step must be run on normalized fuzzy read counts, and the output is still a file containing fuzzy read counts.
Script: merge_biological_replicates.R
Language: R
Input: tsv files with fuzzy read counts (see Computation of fuzzy fold change)
Output: tsv file with merged biological replicates
Example command:
Rscript merge_biological_replicates.R merged_result.fuzzy biological_rep1.fuzzy biological_rep2.fuzzy biological_rep3.fuzzy ...
Computation of fuzzy fold change
This script computes fuzzy fold change on two normalized control and case fuzzy read counts.
Script: fuzzy_fold_change.R
Input: tsv files with fuzzy read counts (control and case, normalized)
Output: tsv file with columns GeneID, Centroid, AValue, BValue, CValue, DValue of fuzzy fold change
Example command:
Rscript fuzzy_fold_change.R fuzzy_fc.fuzzy control.fuzzy.norm case.fuzzy.norm
Computation of DE under/same/over possibilities
This final computation is composed by 3 scripts. The first fits the hyperbolas that act as smooth boundaries of same-expression data. The second prepares the uncertain read counts for the computation of the DE possibilities with a user-defined accuracy. The third computes the possibilities with the preferred accuracy. The value for the accuracy increments the points in which the possibilities are evaluated for a fuzzy read count, and it can have an integer positive value. If set to 1, the possibilities are computed only on centroids, if 2 they are computed on A,B,C,D and centroid (5^2 points), if set to n>=3, the possibilities are computed in ((n-1)*3+1)^2 points.
Scripts: fit_hyperbolas.R, prepare_points.R, compute_possibilities.R
Input: tsv files with fuzzy read counts (control and case, normalized), a value for the accuracy
Output: tsv file with GeneID, UnderPoss, SamePoss, OverPoss
Example commands:
Rscript fit_hyperbolas.R control.fuzzy.norm case.fuzzy.norm hyperbolas_info
Rscript prepare_points.R control.fuzzy.norm case.fuzzy.norm 3 prepared3.tsv #_3_is_the_accuracy
Rscript compute_possibilities.R prepared3.tsv hyperbolas_info possibilities3.tsv
Appendix: how to convert mapping output to formatted input table
In this appendix we show some examples of how to convert the output of a mapping tool to a table with the columns ReadID, GeneID, MappingScore, using some linux commands.
Bowtie output
In order to obtain an easy to convert output, we suggest to use Bowtie with its default output format, as in the example:
bowtie database.index samle.fastq mapping_output.tsv
The output contains only the mapping reads, one mapping for each line. The ReadID is the first column, the GeneID is the third column. The eigth column contains the number of mismatches. We can extract a Score for the mapping by counting the number of mapping nucleotides. This can be done by subtracting the number of mismatches from the sequence lenght (the sequence is in the fifth column). This is an example in Linux Shell code to reformat bowtie output:
awk '-F\t' '{ if (NF == 8) { split($8, Mismatches, ","); Mis = length(Mismatches); } else { Mis = 0 } printf("%s\t%s\t%u\n", $1, $3, length($5) - Mis); }' mapping_output.tsv > read_gene_score.tsv
Arianna Consiglio(1§), Corrado Mencar(2), Giorgio Grillo(1), Flaviana Marzano(1), Mariano Francesco Caratozzolo(1), Sabino Liuni(1)
(1) Institute for Biomedical Technologies of Bari - ITB, National Research Council, Bari, 70126, IT
(2) Department of Informatics, University of Bari Aldo Moro, Bari, 70121, IT.
(§) Corresponding author, arianna.consiglio@ba.itb.cnr.it