NucleoSeq


Authors: Roman Jaksik, Joanna Rzeszowska-Wolny

NucleoSeq is a user-friendly Windows application for downloading, storing, and analyzing the sequences of thousands of mRNA transcripts. It is aimed mainly at users without a bioinformatics background and makes a range of sequence analyses easy to perform.

Sequences can be provided by the user in the popular FASTA format or downloaded automatically from the RefSeq database based on a RefSeq ID, gene symbol, Ensembl ID, or EntrezGene ID, with identifiers resolved using the HUGO Gene Nomenclature Committee database. It is possible to analyze not only an entire transcript but also individual regions such as the coding sequence or the 5’- and 3’-untranslated regions. Downloaded sequences are stored in a local database from which they can be retrieved for further analyses, which avoids downloading the same data repeatedly. Because the sequences are updated very frequently, there is also an option to remove sequences older than a specified date or to clear the entire database so that new sequences can be loaded in their place.
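To illustrate the kind of input described above, the following is a minimal Python sketch of reading FASTA text and splitting a transcript into its 5’-UTR, coding sequence, and 3’-UTR. It is not NucleoSeq's own code; the function names `parse_fasta` and `split_regions` and the 0-based, end-exclusive `cds_start`/`cds_end` coordinates are assumptions made for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def split_regions(sequence, cds_start, cds_end):
    """Split a transcript into regions.

    cds_start and cds_end are hypothetical 0-based coordinates
    (end-exclusive) of the coding sequence within the transcript.
    """
    return {
        "5utr": sequence[:cds_start],
        "cds": sequence[cds_start:cds_end],
        "3utr": sequence[cds_end:],
    }
```

With this sketch, `split_regions("AAATGCCCTAAGG", 2, 11)` separates a two-nucleotide 5’-UTR, the coding sequence `ATGCCCTAA`, and a two-nucleotide 3’-UTR.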

The sequence analyses provided include basic features such as length and nucleotide composition, and analysis of simple motifs including regions consisting of a single repeated nucleotide, AT/GC repeats, and class III AU-rich elements (which have no specific sequence pattern). It is also possible to search for user-defined motifs composed of either a repeated sequence pattern or a degenerate sequence that allows alternative nucleotides at a single position; for example, [AT]GCC (or WGCC in IUPAC notation) will find both AGCC and TGCC motifs.
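The degenerate-motif search described above can be sketched in Python by translating IUPAC codes into regular-expression character classes. This is only an illustration under the assumption of a DNA alphabet (T rather than U); the helper names `iupac_to_regex`, `find_motif`, and `composition` are invented for the example and do not describe how NucleoSeq itself is implemented.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'WGCC' into a regex such as '[AT]GCC'."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_motif(sequence, motif):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


def composition(sequence):
    """Return the fraction of each nucleotide in a non-empty sequence."""
    seq = sequence.upper()
    return {nt: seq.count(nt) / len(seq) for nt in "ACGT"}


print(find_motif("AGCCTTGCCAGCC", "WGCC"))  # prints [0, 5, 9]
```

Here WGCC matches AGCC at positions 0 and 9 and TGCC at position 5, mirroring the [AT]GCC example in the text.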

The application can also work with motifs provided as a position weight matrix (PWM) in the JASPAR database format, which allows the occurrence of transcription factor binding sites to be analyzed with a user-defined specificity cutoff. Additionally, downloaded sequences can be exported to other applications if more sophisticated methods of analysis are needed. A further feature allows motifs to be searched for in selected sequences after randomization (conserving the length and overall nucleotide composition) or in completely random sequences, for which the user needs to provide only the length. This makes it possible to establish whether the number of motifs found results from specific sequence features or from random nucleotide placement, which may or may not also be related to differences in the distribution of specific nucleotides among the sequences.
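As a rough sketch of these two ideas, the Python code below scans a sequence with a PWM built from JASPAR-style count rows and produces a shuffled control sequence that keeps the original length and nucleotide composition. The scoring scheme (log-odds against a uniform background, a pseudocount of 0.5, and a cutoff applied to the score rescaled between the matrix minimum and maximum) is a common convention assumed here for illustration; the source does not specify how NucleoSeq computes its specificity cutoff. All function names are hypothetical.

```python
import math
import random


def pwm_from_counts(counts, pseudocount=0.5, background=0.25):
    """Build log-odds columns from count rows {'A': [...], 'C': [...], ...}."""
    pwm = []
    for i in range(len(counts["A"])):
        total = sum(counts[nt][i] for nt in "ACGT") + 4 * pseudocount
        pwm.append({
            nt: math.log2(((counts[nt][i] + pseudocount) / total) / background)
            for nt in "ACGT"
        })
    return pwm


def scan_pwm(sequence, pwm, cutoff=0.8):
    """Return 0-based starts of windows whose relative score is >= cutoff."""
    seq = sequence.upper()
    lo = sum(min(col.values()) for col in pwm)  # worst possible score
    hi = sum(max(col.values()) for col in pwm)  # best possible score
    hits = []
    for start in range(len(seq) - len(pwm) + 1):
        window = seq[start:start + len(pwm)]
        if any(nt not in "ACGT" for nt in window):
            continue  # skip windows containing ambiguous characters
        score = sum(col[nt] for col, nt in zip(pwm, window))
        if (score - lo) / (hi - lo) >= cutoff:
            hits.append(start)
    return hits


def shuffled(sequence, rng):
    """Return a random permutation of the sequence (same length and composition)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)
```

Comparing the number of hits in a real sequence with the hit counts obtained from many calls to `shuffled` (for example with `rng = random.Random(1)` for reproducibility) gives an empirical impression of how many motifs would be expected from composition alone.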

The results can be presented as the overall number of motifs in an analyzed sequence, as the percentage of the sequence length that contains selected motifs, or as the location of each annotated motif, so that a distribution map can be built for a single sequence or a group of sequences. Since sequence downloading and analysis can be time-consuming, the application allows partial results to be viewed or exported during calculations, and all results can be saved automatically to prevent data loss if the system shuts down abnormally.
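The three result types mentioned above can be derived from a list of motif start positions, as in the following Python sketch. It assumes fixed-length motifs and counts overlapping occurrences only once when computing coverage; whether NucleoSeq treats overlaps this way is not stated in the source, and the names `coverage_percent` and `summarize` are hypothetical.

```python
def coverage_percent(seq_length, starts, motif_length):
    """Percentage of the sequence covered by motifs, merging overlapping hits."""
    covered = 0
    last_end = 0
    for start in sorted(starts):
        end = start + motif_length
        if end > last_end:
            # Count only the part of this hit not already covered.
            covered += end - max(start, last_end)
            last_end = end
    return 100.0 * covered / seq_length


def summarize(seq_length, starts, motif_length):
    """Collect the motif count, coverage percentage, and sorted hit locations."""
    return {
        "count": len(starts),
        "percent_covered": coverage_percent(seq_length, starts, motif_length),
        "locations": sorted(starts),
    }


print(summarize(20, [0, 2, 10], 4))
# prints {'count': 3, 'percent_covered': 50.0, 'locations': [0, 2, 10]}
```

In this example the hits at positions 0 and 2 overlap, so together they cover six nucleotides rather than eight, and the three hits cover 10 of 20 positions.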

