BIOINFORMATICS FACILITIES @ ITB BARI

The provided scripts were developed with different languages, in order to to test the method described in the paper. We use Linux Shell scripts, Python scripts and mainly R scripts.

Formatting the input

The input of the workflow is a set of files, one for each sequenced sample, formatted as tables containing the following columns:

- ReadID (string without whitespaces)
- GeneID (string without whitespaces)
- MappingScore (positive number)

This file is obtained by properliy extracting these information from the output of the mapping performed on the files containing the reads. We start from this file format in order to leave the user free to use the preferred sequencing platform or mapping algorithm. Some examples of how to convert the mapping output to our input format are showed in Appendix A.

Example: technical_rep1.tsv (Tab Separed Values)

H7HV99101A000M OTTHUMG00000183546|TRIM28 9888.0
H7HV99101A001A OTTHUMG00000150691|PSMD8 9813.93026087
H7HV99101A002W OTTHUMG00000018964|TRIM8 9182.52
H7HV99101A003Z OTTHUMG00000151985|DCBLD2 9448.43103448
H7HV99101A004T OTTHUMG00000035011|ARNT 486.5
H7HV99101A004T OTTHUMG00000035635|NR5A2 486.85
H7HV99101A004T OTTHUMG00000168750|SRGAP1 600.0
...

Test data

Some datasets are provided to test the application. The compressed folder contains both the data and the output obtained with the computations. The datasets are the same used for the "Results and discussion" section in the paper.

The first dataset is composed by two files, which are technical replicates obtained by sequencing the same sample two times with the Roche 454 sequencer. The FASTA files were mapped against Vega genes with BLAST.

- technical_rep1.tsv
- technical_rep2.tsv

The second dataset is composed by two files, which are used as case and control for a test without replicates. The FASTQ files were mapped against Vega genes with bowtie.

- case.tsv
- control.tsv

The third dataset is composed by 8 files, 4 biological replicates for case and 4 biological replicates for control. The FASTQ files were mapped against Vega genes with bowtie.

- case_rep1.tsv
- case_rep2.tsv
- case_rep3.tsv
- case_rep4.tsv
- control_rep1.tsv
- control_rep2.tsv
- control_rep3.tsv
- control_rep4.tsv

The test data were downloaded from the SRA study SRP055874

Computation of fuzzy read counts

This script computes the four characteristics values for the fuzzy read counts (A,B,C,D), counting the number of uniquely mapping reads, best matches and further matches from the mapping output.

Script: mapping_output_2_trapezoids.sh (calls python/compute_scores.py)

Language: Shell + Python

Input: tsv file with columns ReadID, GeneID, MappingScore (formatted as described here)

Output: tsv file with columns GeneID, Centroid, AValue, BValue, CValue, DValue

Example command:

mapping_output_2_trapezoids.sh case.tsv case.fuzzy

Example output: case.fuzzy (included in multidea_test_data.zip)

Merging technical replicates

This script is useful when the experiment includes technical replicates that must be merged. The merging can be done before the mapping, but if the files have been already mapped and the fuzzy read counts have been already computed, this script can merge the fuzzy read couns by summing their values for each gene. The output is still a file containing fuzzy read counts.

Script: merge_technical_replicates.R

Language: R

Input: tsv files with fuzzy read counts (see Computation of fuzzy fold change)

Output: tsv file with merged technical replicates

Example command:

Rscript merge_technical_replicates.R merged_result.fuzzy technical_rep1.fuzzy technical_rep2.fuzzy technical_rep3.fuzzy ...

Normalization of samples

This script normalize the samples adopting the strategy proposed by DESeq2, in which each sample is scaled according to a model based on the geometric mean of values obtained for each gene across all the samples.

Script: normalize_samples.R

Language: R

Input: tsv files with fuzzy read counts (see Computation of fuzzy fold change), and a suffix for normalized files

Output: tsv files (one for each input) with normalized fuzzy read counts

Example command:

Rscript normalize_samples.R ".norm" control.fuzzy case.fuzzy

Merging biological replicates

This script is useful when the experiment includes biological replicates that must be merged. The merging transforms the fuzzy read counts of each gene in order to cover all the possible read count values. This step must be run on normalized fuzzy read counts, and the output is still a file containing fuzzy read counts.

Script: merge_biological_replicates.R

Language: R

Input: tsv files with fuzzy read counts (see Computation of fuzzy fold change)

Output: tsv file with merged biological replicates

Example command:

Rscript merge_biological_replicates.R merged_result.fuzzy biological_rep1.fuzzy biological_rep2.fuzzy biological_rep3.fuzzy ...

Computation of fuzzy fold change

This script computes fuzzy fold change on two normalized control and case fuzzy read counts.

Script: fuzzy_fold_change.R

Input: tsv files with fuzzy read counts (control and case, normalized)

Output: tsv file with columns GeneID, Centroid, AValue, BValue, CValue, DValue of fuzzy fold change

Example command:

Rscript fuzzy_fold_change.R fuzzy_fc.fuzzy control.fuzzy.norm case.fuzzy.norm

Computation of DE under/same/over possibilities

This final computation is composed by 3 scripts. The first fits the hyperbolas that act as smooth boundaries of same-expression data. The second prepares the uncertain read counts for the computation of the DE possibilities with a user-defined accuracy. The third computes the possibilities with the preferred accuracy. The value for the accuracy increments the points in which the possibilities are evaluated for a fuzzy read count, and it can have an integer positive value. If set to 1, the possibilities are computed only on centroids, if 2 they are computed on A,B,C,D and centroid (5^2 points), if set to n>=3, the possibilities are computed in ((n-1)*3+1)^2 points.

Scripts: fit_hyperbolas.R, prepare_points.R, compute_possibilities.R

Input: tsv files with fuzzy read counts (control and case, normalized), a value for the accuracy

Output: tsv file with GeneID, UnderPoss, SamePoss, OverPoss

Example commands:

Rscript fit_hyperbolas.R control.fuzzy.norm case.fuzzy.norm hyperbolas_info

Rscript prepare_points.R control.fuzzy.norm case.fuzzy.norm 3 prepared3.tsv #_3_is_the_accuracy

Rscript compute_possibilities.R prepared3.tsv hyperbolas_info possibilities3.tsv

Appendix: how to convert mapping output to formatted input table

In this appendix we show some examples of how to convert the output of a mapping tool to a table with the columns ReadID, GeneID, MappingScore, using some linux commands.

Bowtie output

In order to obtain an easy to convert output, we suggest to use Bowtie with its default output format, as in the example:

bowtie database.index samle.fastq mapping_output.tsv

The output contains only the mapping reads, one mapping for each line. The ReadID is the first column, the GeneID is the third column. The eigth column contains the number of mismatches. We can extract a Score for the mapping by counting the number of mapping nucleotides. This can be done by subtracting the number of mismatches from the sequence lenght (the sequence is in the fifth column). This is an example in Linux Shell code to reformat bowtie output:

awk '-F\t' '{ if (NF == 8) { split($8, Mismatches, ","); Mis = length(Mismatches); } else { Mis = 0 } printf("%s\t%s\t%u\n", $1, $3, length($5) - Mis); }' mapping_output.tsv > read_gene_score.tsv

Arianna Consiglio(1§), Corrado Mencar(2), Giorgio Grillo(1), Flaviana Marzano(1), Mariano Francesco Caratozzolo(1), Sabino Liuni(1)

(1) Institute for Biomedical Technologies of Bari - ITB, National Research Council, Bari, 70126, IT
(2) Department of Informatics, University of Bari Aldo Moro, Bari, 70121, IT.
(§) Corresponding author, arianna.consiglio@ba.itb.cnr.it

BIOINFORMATICS FACILITIES @ ITB BARI

Nav view search

Navigation

Font size

Search

MultiDEA scripts

A fuzzy method for RNA-Seq differential expression analysis in presence of multireads

Manual