•seqjoin software•

Introduction

seqjoin is used to predict the complete cDNA insert sequence of partially sequenced cDNA clones. The clones' partial experimental sequences are matched against a database of complete cDNA sequences. If a match is found, the clone's insert sequence is predicted from the vector sequence, the sequence of the matched database cDNA entry and the experimental, partial clone sequence. seqjoin is based on the output of the sequence analysis programs phred, phrap and cross_match, and also uses the Emboss package.

The prediction of the complete cDNA inserts by the seqjoin program uses a set of rules and assumptions. The experimental clone sequences (= tag sequences) are assumed to be derived from the 5'-end and to contain a small stretch of vector sequence. The tag sequences are aligned to the vector sequence and to a full-length cDNA sequence database using the program cross_match. The part of the tag sequence that aligns to the vector sequence is removed and replaced by the vector sequence. Thus any sequencing errors introduced in this range are eliminated.

The remainder of the tag sequence is replaced by the complete (or 5'-truncated) full-length database sequence, provided that the alignment of the experimental to the database sequence suggests that they originate from the same transcript. If the alignments of the tag sequence to the vector and to the database entry are not adjacent, the gap is closed with the experimental sequence, provided that the sequence quality in this range is sufficient.
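
The joining rule described above can be sketched as a small shell function. This is only an illustration under stated assumptions, not the code of seqjoin.pl: the function name join_insert, its argument order and the 1-based tag coordinates are hypothetical, and the check of the phred quality within the gap is left out.

```shell
#!/bin/sh
# Hypothetical sketch of the joining rule; not taken from seqjoin.pl.
# join_insert VECTOR TAG DB VEC_END DB_TAG_START DB_START
#   VECTOR        vector sequence up to the 5' cloning site
#   TAG           experimental tag sequence
#   DB            matched full-length database sequence
#   VEC_END       last tag position (1-based) aligned to the vector
#   DB_TAG_START  first tag position (1-based) aligned to the database entry
#   DB_START      first database position (1-based) covered by that alignment;
#                 a value above 1 means the clone is 5'-truncated
join_insert() {
    vector=$1; tag=$2; db=$3
    vec_end=$4; db_tag_start=$5; db_start=$6

    # Tag bases between the two alignments close the gap, if there is one.
    gap=""
    if [ "$db_tag_start" -gt $((vec_end + 1)) ]; then
        gap=$(printf '%s\n' "$tag" | cut -c "$((vec_end + 1))-$((db_tag_start - 1))")
    fi

    # Everything from the start of the database alignment onwards comes
    # from the database entry, not from the tag.
    rest=$(printf '%s\n' "$db" | cut -c "${db_start}-")

    printf '%s%s%s\n' "$vector" "$gap" "$rest"
}
```

For example, `join_insert ATGGGATCCGTCGAC TCCGTCGACGGATGGCTAAG ATGGCTAAGCTTTGA 9 12 1` joins the vector sequence, the two gap bases GG taken from the tag, and the complete database entry. The tag bases that aligned to the vector or to the database entry never reach the result, which is how sequencing errors in those ranges are dropped.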

Differences between the experimental and the database sequences are evaluated using the quality measures provided by the phred program. A set of rules is applied to distinguish between sequencing errors, substitutions representing single nucleotide polymorphisms (SNPs), stretches of substitutions suggesting alternative splicing, and insertions or deletions that either represent polymorphisms or suggest that the aligned sequences are alternative splice forms.

If cross_match identifies more than one alignment of the experimental to the database sequence, alternative splice forms are assumed. While single substitutions are taken into account, single deletions or insertions are ignored. We assume that single deletions or insertions, which lead to frame shifts, represent sequencing errors rather than real polymorphisms. Alternative splicing prevents insert prediction by the seqjoin program.
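
The rules above might be arranged roughly as in the following sketch. It is an assumption-laden illustration, not the logic of seqjoin.pl: the function names, the category labels and the phred cutoff of 20 are all hypothetical. How a stretch of differences is told apart from a polymorphic insertion or deletion is not stated here, so the sketch simply treats every high-quality difference longer than one base as possible alternative splicing.

```shell
#!/bin/sh
# Hypothetical sketch of the discrepancy rules; not taken from seqjoin.pl.
QUALITY_CUTOFF=20   # assumed phred cutoff, chosen for illustration only

# classify_discrepancy KIND LENGTH QUALITY
#   KIND     S (substitution), I (insertion) or D (deletion)
#   LENGTH   number of consecutive bases affected
#   QUALITY  lowest phred score of the tag bases involved
classify_discrepancy() {
    kind=$1; length=$2; quality=$3

    if [ "$quality" -lt "$QUALITY_CUTOFF" ]; then
        echo sequencing_error        # unreliable base call
    elif [ "$length" -gt 1 ]; then
        echo alternative_splicing    # stretch of differences: no prediction
    elif [ "$kind" = S ]; then
        echo snp                     # single substitution is taken into account
    else
        echo ignored                 # single indel, assumed frame-shift error
    fi
}

# classify_alignments COUNT
#   More than one alignment of a tag to its database entry is read as
#   alternative splicing.
classify_alignments() {
    if [ "$1" -gt 1 ]; then
        echo alternative_splicing
    else
        echo single_transcript
    fi
}
```

With these assumed settings, `classify_discrepancy S 1 40` reports a SNP, while `classify_discrepancy D 1 40` reports a single deletion that is ignored.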

The seqjoin program produces a number of output files. A file of Emboss commands, emboss.script, is prepared; it is executed later to perform the actual sequence manipulation and joining steps. For each alignment found by cross_match, a comment line is entered into the file seqjoin.stat.all. This file contains details on the sequence joining and indicates which alignments and predicted insert sequences might require additional manual inspection. For alignments that the program could not use to predict the insert sequence, a comment indicating the reason is given.

Instructions

  • Download the seqjoin.pl Perl script.
  • Get the programs phred, phrap, cross_match and the phredPhrap script from The University of Washington.
  • Download and install the Emboss package.
  • Create a directory named seq and place the seqjoin.pl script in this directory. Alternatively, put seqjoin.pl somewhere in your PATH.
    Download seqjoin-example.zip to get example files in a correct directory tree for seqjoin.
  • Place all clone sequence trace files in a directory seq/chromat_dir.
  • Create a directory named start in the same location as chromat_dir. In this directory, prepare a vector sequence file in Fasta format and name it vector.fasta. The sequence of the vector should start with the 3' cloning site on the vector and end with the 5' site. For example, if the inserts were cloned into the SalI and NotI sites of the vector, the vector sequence should read like this:
    GCGGCCGC.....GTCGAC
    Prepare a second sequence file that stretches from the translation start on the vector to the 5' cloning site. This sequence should be called vector_start.fasta and should start with ATG.
  • Download a Fasta file of a database of complete cDNA sequences. Place the file into the directory seq/start/ and rename it to database.fasta.
    Example: human cDNA sequence files of the Ensembl database, available from ftp://ftp.ensembl.org/pub/current_human/data/fasta/cdna.
  • Edit the phredPhrap script and enter the location of the cloning vector sequence file.
  • In the start directory, run phredPhrap.
    $ phredPhrap
    It will create a number of files, including seq.fasta, seq.fasta.screen and seq.fasta.screen.qual.
  • Compare the clone sequences to the vector sequence and the database of complete cDNA sequences with cross_match.
    $ nice cross_match -alignments -tags -discrep_lists seq.fasta vector.fasta > crossmatch_vector
    $ nice cross_match -alignments -tags -discrep_lists seq.fasta.screen database.fasta > crossmatch_database
  • Start seqjoin:
    $ ../seqjoin.pl
    This will create the files seqjoin.stat.all, seqjoin.stat.predict and emboss.script. Execute the commands in the file emboss.script:
    $ sh -v emboss.script
    Note: This runs very slowly, because the Emboss program seqret is slow at extracting sequences from the large Ensembl Fasta file. We recommend registering the Ensembl cDNA Fasta file as an Emboss database and replacing database.fasta: in emboss.script with <name_of_embossdb>:.
    The script will create the file seqjoin.result, which contains the predicted clone insert sequences in Fasta format.
  • Translate the cDNA sequences in seqjoin.result with the Emboss program transeq:
    $ transeq seqjoin.result sp
    Trim the protein sequences in sp to remove everything after the first stop codon:
    $ perl -ne 'if(/^>/){$w=0;}if($_=~s/\*.*//){if($w==0 and length($_)>1){print;}$w=1;}if($w==0){print;}' sp > seqjoin.pep
    The file seqjoin.pep contains the protein sequences encoded by the clones.
  • The files seqjoin.stat.all and seqjoin.stat.predicted contain tabular information on the predicted sequences that seqjoin produced.
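
Before running phredPhrap, it can help to confirm that the directory tree matches the steps above. The following is a minimal sketch, assuming the layout described in this list. The function name check_layout is hypothetical, and the script is not part of the seqjoin distribution.

```shell
#!/bin/sh
# Hypothetical pre-flight check; check_layout is not part of seqjoin.
# check_layout SEQ_DIR
#   Prints one line per problem and returns 1 if any problem was found.
check_layout() {
    seq_dir=$1
    status=0

    # Trace files are expected in seq/chromat_dir.
    [ -d "$seq_dir/chromat_dir" ] || { echo "missing directory: chromat_dir"; status=1; }

    # The three Fasta files are expected in seq/start.
    for f in vector.fasta vector_start.fasta database.fasta; do
        [ -s "$seq_dir/start/$f" ] || { echo "missing or empty file: start/$f"; status=1; }
    done

    # The first sequence line of vector_start.fasta should begin with ATG.
    if [ -s "$seq_dir/start/vector_start.fasta" ]; then
        first=$(grep -v '^>' "$seq_dir/start/vector_start.fasta" | head -n 1 | tr 'a-z' 'A-Z')
        case $first in
            ATG*) ;;
            *) echo "vector_start.fasta does not start with ATG"; status=1 ;;
        esac
    fi

    return "$status"
}
```

Called as `check_layout seq` from the directory that contains seq, it prints nothing and returns 0 when chromat_dir and the three Fasta files are in place. It does not check the orientation of vector.fasta or the location of seqjoin.pl.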

© 2003 by V. Sievert, Konrad Büssow. Last changed 12 Sep 2006.