Update parameters.tsv

hariszaf · Nov 25, 2019 · commit 7634974
1 parent 4df25c5
Showing 1 changed file with 106 additions and 28 deletions.
diff --git a/parameters.tsv b/parameters.tsv
@@ -18,7 +18,7 @@
 #
 ## Give each unique experiment a NAME, so that a single output file will be created for each of them.
 #
-outputFolderName	testingSingularity
+outputFolderName	16S_final_test
 #
 #
 ## If the names of the samples under analysis are not in an ENA format (e.g "ERR1021912") but they are more like as
@@ -41,20 +41,20 @@ EnaData	Yes
 #############################################################
 #
 ## Performs an adaptive quality trim, balancing the benefits of retaining longer reads against the costs of retaining bases with errors.
-## It needs to be ser either as 'Yes' or 'No'.
+## It needs to be set either as 'Yes' or 'No'.
 maxInfo	Yes
 #
 #
 ############ for MAXINFO ####################
 ## Specifies the read length which is likely to allow the location of the read within the target sequence to be determined.
 ## It needs an integer to be set as a value.
-targetLength	200
+targetLength	100
 #
 #
 ## This value, specifies the balance between preserving as much read length as possible vs. removal of incorrect bases.
 ## It can take values between 0 and 1.
 ## A low value of this parameter (<0.2) favours longer reads, while a high value (>0.8) favours read correctness.
-strictness	0.3
+strictness	0.8
 #
 #
 ############ for ILLUMINACLIP ##################
@@ -64,32 +64,32 @@ adapters	TruSeq2-PE.fa
 #
 ## This parameter specifies the maximum mismatch count which will still allow a full match to be performed.
 ## It needs an integer to be set as a value.
-seedMismatches	1
+seedMismatches	0
 #
 #
 ## This parameter specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
 ## It needs an integer to be set as a value.
-palindromeClipThreshold	30
+palindromeClipThreshold	20
 #
 #
 ## It specifies how accurate the match between any adapter etc. sequence must be against a read.
 ## It needs an integer to be set as a value.
-simpleClipThreshold	10
+simpleClipThreshold	30
 #
 #
 ############ for LEADING  ##########################
 ## The LEADING module removes low-quality bases from the beginning of a read.
 ## As long as a base has a value below this threshold (value of the 'leading' parameter) the base is removed and the next base will be investigated.
 ## It needs an integer to be set as a value.
-leading	3
+leading	20
 #
 #
 ############ for TRAILING  #############################
 ## This module of Trimmomatic removes low quality bases from the end.
 ## As long as a base has a value below this threshold (value of the 'trailing' parameter), the base is removed
 ## and the next base (which as trimmomatic is starting from the 3' prime end, would be base preceding the just removed base) will be investigated.
 ## It needs an integer to be set as a value.
-trailing	3
+trailing	2
 #
 #
 ############  for MINLEN  ################################
@@ -100,7 +100,7 @@ minlen	100
 #
 #
 ## Finally, you need to set how many threads you want Trimmomatic to use.
-threadsTrimmomatic	20
+threadsTrimmomatic	8
 #
 ###############################################################
 #########   BayesHammer (from SPAdes: v3.13.0)  ###############	
@@ -127,7 +127,34 @@ pandaseqAlgorithm	simple_bayesian
 #
 ## PANDAseq is an I/O-bound algorithm. That means it spends considerable time handling the input and output files,
 ## while the processing itself is quite fast. However, it does support multithreading and here you can set the number of threads it is going to use.
-pandaseqThreads	20
+pandaseqThreads	8
+#
+#
+## The 'minlen' parameter sets the minimum length for a sequence, after primers are removed. 
+## By default, all sequences are kept. With this option, sequences shorter than desired can be discarded.
+## In case you need to use this parameter, be sure you leave a tab after 'pandaseqMinlen' and set it like this: '-l 80'
+## If you do not want to use this parameter, please remove everything after 'pandaseqMinlen'.
+pandaseqMinlen	
+#
+#
+## The 'minoverlap' parameter sets the minimum overlap between forward and reverse reads. 
+## By default, this is at least one nucleotide of overlap. 
+## Raising this number does not generally increase the quality of the output as alignments with small overlaps tend to score poorly and are discarded anyway.
+minoverlap	1
+#
+#
+## The 'threshold' parameter sets the score, between zero and one, that a sequence must meet to be kept in the output. 
+## Any alignments lower than this will be discarded as low quality. 
+## Increasing this number will not necessarily prevent uncalled bases (Ns) from appearing in the final sequence. 
+## It is also used as the threshold to match primers, if primers are supplied. The default value is 0.6.
+threshold	0.6
+#
+#
+## The '-N' parameter eliminates all sequences with uncalled nucleotides in the output. 
+## Otherwise, during assembly, uncalled bases (Ns) from unpaired regions may be emitted.
+## If you need -N to be on in your analysis, please add '-N' after 'elimination'. Please make sure you leave a tab.
+## If you do not want the parameter to be on, please make sure there is nothing after the 'elimination' parameter.
+elimination	
 #
 #
 ## PEMA performs the PANDAseq algorithm, with the -a and the -B parameters also on.
@@ -150,14 +177,14 @@ pandaseqThreads	20
 ## VSEARCH is the main algorithm used for a lot of steps in the case of the 16S marker gene.
 ## Set how many threads you want VSEARCH to use.
 ## It needs an integer to be set as a value.
-vsearchThreads	20
+vsearchThreads	8
 #
 #
 ## Here you need to set a score about the clustering step of the VSEARCH algorithm. 
 ## Do not add a read into a certain cluster if the pairwise identity with its centroid, is lower than the value of the 'vsearchId' parameter.
 ## The pairwise identity is defined as the number of (matching columns) / (alignment length - terminal gaps).
 ## It needs a real number to be set as a value, ranging from 0.0 to 1.0 .
-vsearchId	0.98
+vsearchId	0.95
 #
 #
 ################################################################################################
@@ -176,60 +203,110 @@ gene	gene_16S
 ###  Here are some parameters needed when the metabarcoding analysis is about the 16S marker gene     ####
 ##########################################################################################################
 #
+#
 ## If your marker gene is 16S, you can choose between 2 different approaches of taxonomy assignment (alignment & phylogenetic based)
 ## An alignment based taxonomy assignment - set as 'alignment' -  which is based on SILVA and CREST (version 3.0).
 ## However, you can also get a phylogenetic based assignment, by putting 'phylogeny' in this parameter. In that case, a reference tree we created is being used as well as the RAxML 
 taxonomyAssignmentMethod	alignment
 #
 #
+## If you choose phylogeny-based taxonomy assignment, then you will need to run PaPaRa.
+## Please fill in how many cores PaPaRa is able to use.
+numberOfCoresForPapara	7
+#
+#
 ## When you use the alignment-based taxonomy assignment, then the LCAClassifier from the CREST algorithm, uses a Silva version for the assignment. 
-## PEMA allows you to choose between the two last version of Silva. Hence, set the "silvaVersion" parameter either as 'silva_128' or as 'silva_128'
+## PEMA allows you to choose between the two latest versions of Silva. Hence, set the "silvaVersion" parameter either as 'silva_128' or as 'silva_132'
 ## depending on the version of your choice.
 silvaVersion	silva_132
 #
-## As you may need a series of taxonomy assignment when you use the alignment-based method, please give another name in your taxonomy output folder of the CREST algorithm, each time you are about to use it. CREST creates an output folder every time and if a folder with the same already exists, it is going to abort the task! 
-taxonomyFolderName	16S_taxon_assign
-#
 #
-#					the following parameters is only for the case that the Phyloseq R package is about to run 		
+## As you may need a series of taxonomy assignments when you use the alignment-based method, please give a different name to your
+## taxonomy output folder of the CREST algorithm, each time you are about to use it.
+## CREST creates an output folder every time and, if a folder with the same name already exists, it aborts the task!
+## You need to set the value of this parameter in case of 16S/18S rRNA and ITS marker genes.
+taxonomyFolderName	its_taxon_assign
 #
+#########################################################################################################################################
+#																	#
+#		the following parameters is only for the case that the Phyloseq R package is about to run 				#
+#																	#
+#########################################################################################################################################
 #
 ## If you wish to use Phyloseq to analyse your returned data, set the following parameter 'phyloseq' to 'Yes'. To do that, PEMA needs an MSA returned by the MAFFT (v7.427) aligner and a phylogenetic tree of the OTUs found, which is built by the RAxML-ng algorithm.
 ## Please remember that in order to use phyloseq, a "metadata.tsv" file must be part of your analysis folder.
-phyloseq	Yes
+phyloseq	No
 #
 #
 ## The phyloseq object can handle phylogenetic trees as well. PEMA uses RAxML-ng in order to build such trees. Do you want to create such a tree with your OTUs? In case you build this once, you can use it as many times as you want.
 tree	No
 #
 ## In case you are about to use the phyloseq option, then a phylogeny tree has to be built. Hence, PEMA invokes the RAxML-ng algorithm
 ## which is able to run on more than one thread. Please set the number of threads RAxML is able to use.
-raxmlThreads	10
+raxmlThreads	5
 #
 #
 ## You can also set the number of the parsimony-based starting trees for the RAxML-ng 
-parsTrees	3
+parsTrees	1
 #
 #
 ## And finally, the number of the bootstrap trees
-bootstrapTrees	100
+bootstrapTrees	1
 #
 #
 ##########################################################################################################
-###  Here are some parameters needed when the the metabarcoding analysis is about the COI marker gene ####
+#######################     For the case of the ITS marker gene     ######################################
 ##########################################################################################################
 #
-### if your marker gene is  COI, you can choose between 2 different approaches of clustering. Depending on which of them you choose
+## For the case of ITS there is an extra problem with respect to the primers used.
+## Please complete the next two variables with the primers you used
+#
+forwardITSPrimer	GATGAAGAACGYAGYRAA
+#
+reverseITSPrimer	CTBTTVCCKCTTCACTCG
+#
+#
+##########################################################################################################
+########     Here are some parameters needed with respect to clustering algorithms     ###############
+##########################################################################################################
+#
+#
+## For the case of the 16S and 18S rRNA marker genes, you can either get an OTU-table using the VSEARCH algorithm
+## or you can get an ASV-table by taking advantage of the SWARM algorithm. Please fill in which of those
+## two you prefer to run (write "swarm" or "vsearch" after algo_).
+clusteringAlgoFor16S_18SrRNA	algo_vsearch
+#
+#
+## If your marker gene is  COI or ITS, you can choose between 2 different approaches of clustering. Depending on which of them you choose
 ## you get either a robust output in a short time (SWARM) or a non-robust output (CROP) that requires considerably more time.
 ## CROP is a Bayesian algorithm and that is why its output is non-robust.
 ## By default, the SWARM (v2.2.2) algorithm runs for the clustering.
 ## You have to change it to 'CROP' if you want the CROP algorithm to do the clustering step.
-clusteringAlgo	algo_SWARM
+clusteringAlgoForCOI_ITS	algo_SWARM
 #
 #
-# in case of SWARM, the user needs to speeecify the value of "d" parameter
-# d is the number of missmatches
-d	10
+## In case of SWARM, the user needs to specify the value of the "d" parameter:
+## the maximum number of differences allowed between two amplicons, meaning that two amplicons
+## will be grouped if they have d (or fewer) differences. This is swarm's most important
+## parameter.
+d	20
+#
+#
+## when using the option --fastidious (-f), define the minimum mass of a large ASV. 
+## By default, an ASV with a mass of 3 or more is considered large. 
+## Conversely, an ASV is small if it has a mass of less than 3, meaning that it is composed of either one amplicon of abundance 2, or two amplicons of abundance 1. 
+## Any positive value greater than 1 can be specified. Using higher boundary values will speed up the second pass, but also reduce the taxonomical resolution of swarm
+## results. Default mass of a large OTU is 3.
+boundary	3
+#
+## you also need to set the number of threads that Swarm is able to use
+swarmThreads	20
+#
+#
+## SWARM tends to create a great number of ASVs, especially when d takes a low value.
+## Would you like to remove the singletons (ASVs that appear only once with abundance equal to 1) ?
+#
+removeSingletons	Yes
 #
 #
 ## PEMA invokes the UCHIME_DENOVO3 algorithm for the chimera removal in the case of the COI marker gene.
@@ -246,3 +323,4 @@ abskew	2
 emptyRawDataFile	Yes
 emptyCheckpoints	Yes
 #
+#
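
To illustrate how the Trimmomatic values set in this commit fit together, here is a minimal Python sketch that reads a tab-separated parameters file of this shape and assembles the Trimmomatic step arguments. It is not PEMA's own parser: the function names (parse_parameters, trimmomatic_steps, check_ranges), the order of the steps, and the inline SAMPLE excerpt are assumptions made for this example. The step syntax itself (ILLUMINACLIP:adapters:seedMismatches:palindromeClipThreshold:simpleClipThreshold, MAXINFO:targetLength:strictness, LEADING, TRAILING, MINLEN) follows Trimmomatic's documented format.

```python
import io


def parse_parameters(handle):
    """Read 'name<TAB>value' lines; skip blank lines and '#' comments.

    A name with nothing after the tab (e.g. 'pandaseqMinlen' or
    'elimination' when switched off) is kept with an empty string value.
    """
    params = {}
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        name, _, value = line.partition("\t")
        params[name.strip()] = value.strip()
    return params


def trimmomatic_steps(params):
    """Build Trimmomatic step arguments (step order is an assumption)."""
    steps = [
        "ILLUMINACLIP:{adapters}:{seedMismatches}:"
        "{palindromeClipThreshold}:{simpleClipThreshold}".format(**params)
    ]
    if params.get("maxInfo") == "Yes":
        steps.append("MAXINFO:{targetLength}:{strictness}".format(**params))
    steps.append("LEADING:{leading}".format(**params))
    steps.append("TRAILING:{trailing}".format(**params))
    steps.append("MINLEN:{minlen}".format(**params))
    return steps


def check_ranges(params):
    """Report values that the file's comments say must lie in [0, 1]."""
    problems = []
    for name in ("strictness", "threshold", "vsearchId"):
        value = params.get(name, "")
        if value == "":
            continue
        if not 0.0 <= float(value) <= 1.0:
            problems.append(f"{name} must be between 0 and 1, got {value}")
    return problems


# Hypothetical excerpt using the values on the '+' side of this diff.
SAMPLE = "\n".join([
    "## comment lines start with '#'",
    "adapters\tTruSeq2-PE.fa",
    "maxInfo\tYes",
    "targetLength\t100",
    "strictness\t0.8",
    "seedMismatches\t0",
    "palindromeClipThreshold\t20",
    "simpleClipThreshold\t30",
    "leading\t20",
    "trailing\t2",
    "minlen\t100",
    "pandaseqMinlen\t",
    "threshold\t0.6",
    "elimination\t",
    "vsearchId\t0.95",
])

params = parse_parameters(io.StringIO(SAMPLE))
print(" ".join(trimmomatic_steps(params)))
print(check_ranges(params))
```

Keeping empty values as empty strings mirrors the convention described in the comments for 'pandaseqMinlen' and 'elimination', where leaving nothing after the tab switches the option off.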