*** OBJECT CREATION AND OUTPUT ***
setting default output format to fasta
aln in Default format:
> R R desc
abcdef
> S S desc
ghijkl
> T T desc
mnopqr
> U U desc
stuvwx
> V V desc
yzabcd
aln in raw format:
abcdef
ghijkl
mnopqr
stuvwx
yzabcd
aln in fasta format:
> R R desc
abcdef
> S S desc
ghijkl
> T T desc
mnopqr
> U U desc
stuvwx
> V V desc
yzabcd
*** SLICING ***
seqs(1,3,1,2) =
ab
gh
mn
seqs([ 1,2,3 ], [ 1,4 ])
ad
gj
mp
*** ADVANCED SLICING ***
seqs([3,2,3,1], [1,3..5]) =
mopq
gijk
mopq
acde
seqs([ 1,2,3 ], \&has_purine), only the columns for which has_purine returns 1
acf
gil
mor
Same aln slice in fasta format:
> R R desc
acf
> S S desc
gil
> T T desc
mor
seqs({ids=>'R T'}) =
abcdef
mnopqr
*** MAPPING ***
map_r(\&has_purine, [ 1,2,3 ])
111
map_c(\&has_purine, [ 1,4 ])
10
map_r(\&id, [ 1,2,3 ])
abcdef
ghijkl
mnopqr
It's the same as seqs([ 1,2,3 ]) :
abcdef
ghijkl
mnopqr
0/1 indicators for purine
101001
0/1 indicators for pyrimidine
111010
*** TRIGGERING ERRORS ***
__TGGTGCTTCAAC
TG__TTG--TCAAC
DDTCCCGCGTDD
__TGGTGCTTCAAC
TG__TTG--TCAAC
Offending characters:DD____
Offending characters:____
Equalized lengths:
TCCCGCGTDD----
TGGTGCTTCAAC__
__TTG--TCAACTG
Padded lengths to 20:
TCCCGCGTDD----------
TGGTGCTTCAAC__------
__TTG--TCAACTG------
*** TESTING ACCESSORS AND COPYING ***
seqs()
abcdef
ghijkl
mnopqr
stuvwx
yzabcd
seqs(2)
ghijkl
mnopqr
stuvwx
yzabcd
seqs(2,4)
ghijkl
mnopqr
stuvwx
seqs(2,4,3)
ijkl
opqr
uvwx
seqs(2,4,3,5)
ijk
opq
uvw
_strs()
abcdef
ghijkl
mnopqr
stuvwx
yzabcd
seqs([2,4])
ghijkl
stuvwx
seqs([],[2,4])
bd
hj
np
tv
zb
_arys([2,3,4])
ghijkl
mnopqr
stuvwx
seqs([2,4],[3])
i
u
layout
> R R desc
abcdef
> S S desc
ghijkl
> T T desc
mnopqr
> U U desc
stuvwx
> V V desc
yzabcd
layout of duplicate:
> R R desc
abcdef
> S S desc
ghijkl
> T T desc
mnopqr
> U U desc
stuvwx
> V V desc
yzabcd
Checking for equality, 1 expected: 1
Checking for equality, 0 expected: 0
Checking for equality, 1 expected: 1
set bad Id= bad Id
set correct new Id= correct_id
set old desc=
artif.fasta to new desc= new desc for aln
type= unknown
set new numbering= 4
Row ids= RSTUV
Column ids= 123456
Row descriptions= R descS descT descU descV desc
Column descriptions=
New row ids= rstuv
New column ids= abcdef
New row descriptions= r descs desct descu descv desc
New column descriptions= a descb descc descdesc desce descf desc
seqs(2,4,7,9)
jkl
pqr
vwx
seqs(2,4,2,4), columns -2,-1,0:
klg
qrm
wxs
Current format of $aln:raw
abcdef
ghijkl
mnopqr
stuvwx
yzabcd
> r r desc
abcdef
> s s desc
ghijkl
> t t desc
mnopqr
> u u desc
stuvwx
> v v desc
yzabcd
Modified alignment:
> r r desc
abcdef
> s s desc
ghijkl
> t t desc
mnopqr
> u u desc
stuvwx
> v v desc
yzabcd
Original copy:
> R R desc
abcdef
> S S desc
ghijkl
> T T desc
mnopqr
> U U desc
stuvwx
> V V desc
yzabcd
The current alignment has 6 columns and 5 rows.
*** CREATING A NEW ALIGNMENT BY SLICING AN OLD ONE ***
Set up new alignment from the modified alignment, aln(2,4,7,9) is an alignment with rows 2-4 and columns 4-6 (designated 7-9 since the numbering starts with column no. 4)
Modified alignment:
> s s desc
jkl
> t t desc
pqr
> u u desc
vwx
Still got the old sequence names, and column _default_names_ from the ones that start with 1 in the old alignment; accessing seqs({ids=>'s Junk u'},{ids=>'c Junk d e'}) where column c and ``Junk'' won't be found and col. d and f are printed
jl
vx
Creating empty alignment
*** CONSENSUS, (IN)VARIABLE SITES, REVERSE, COMPLEMENT, GAP-FREE SITES ***
Set new numbering= 0
Current alignment:
> 1
AAATT--TCAACAG
> 2
AATAA-TTCCACCC
> 3
ATAAC--TCACGC-
Consensus of the columns, >= 75% invariability (default)
A!!!!-!TC!!!!!
Consensus of the columns, >= 60% invariability
AAAA!--TCAACC!
Consensus of the first 10 columns, consensus residue must be invariable
A!!!!-!TC!
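The three consensus lines above are consistent with a simple per-column majority rule. The sketch below is written in Python for illustration only; it is not the module's own code. The function name `column_consensus`, its parameters, and the rule itself (report the most frequent character of a column, gaps included, when its share of the rows reaches the threshold, else print `!`) are assumptions inferred from the output shown.

```python
from collections import Counter


def column_consensus(rows, threshold=0.75, no_consensus="!"):
    """Return one consensus character per alignment column.

    Assumed rule (inferred from the output above, not from the module):
    the most frequent character of a column, gaps included, is reported
    when its share of the rows is >= threshold; otherwise no_consensus.
    """
    consensus = []
    for column in zip(*rows):  # walk the columns of equal-length rows
        residue, count = Counter(column).most_common(1)[0]
        share = count / len(column)
        consensus.append(residue if share >= threshold else no_consensus)
    return "".join(consensus)


rows = ["AAATT--TCAACAG", "AATAA-TTCCACCC", "ATAAC--TCACGC-"]
print(column_consensus(rows))                         # A!!!!-!TC!!!!!
print(column_consensus(rows, threshold=0.60))         # AAAA!--TCAACC!
print(column_consensus([r[:10] for r in rows], 1.0))  # A!!!!-!TC!
```

With three rows, a residue shared by two rows has a share of about 67%, which passes the 60% threshold but not the 75% default; that is why columns 2-4 and 10-13 switch from `!` to a residue between the first two lines.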
Using longer alignment:
CGCGGTTGGTTCACCCACCGGACC
CGCGGTTGGTTCACCCACCGGACC
CGCGGTTGGTTCACCCACCGGACC
CGCGGTTGGTTCACTCATCGAATA
TTATTTAATCATTTGGAGTACGAT
TCGTTGCAGCAATTTGGGTACCAT
TCGTTGCATCGATTTGGGTACTAT
TTATTGCATCAATTTGGGTACTAT
TCAATCCTTCGATTTGGGTACCAT
TTATTCCAACAGTTTGGGTACTGT
GCGTACCTTCAATTTGGGGCCCAT
GCGTACCTTCAATTTGGGTACCAT
GCGTACCTTCAATTTGGGTACCAT
TCGTACTTTCAATTTGGGTACTAT
GCAGACCTTCAATTTCGGTACCAT
CCGTTCCTCCAAGTTGGGGCCTAT
TTATACCATCCGATAGGTTACTCC
Consensus of the columns, default
!!!!!!!!!C!!!T!!!!!!C!!!
Consensus of the columns, >= 60% invariability
!!!T!!C!!C!!TTTGGGTAC!AT
Consensus of the columns, >= 40% invariability
TCGTTCCTTCAATTTGGGTAC!AT
Using shorter alignment:
AAATT--TCAACAG
AATAA-TTCCACCC
ATAAC--TCACGC-
Variable columns 2 3 4 5 7 10 11 12 13 14 only, i.e. deleting columns that are invariable
AATT-AACAG
ATAATCACCC
TAAC-ACGC-
Invariable columns 1 6 8 9 only, i.e. deleting columns that are variable
A-TC
A-TC
A-TC
Variable columns only, i.e. deleting columns with >= 60% invariability; print rows 1+3 only
TG
C-
Invariable columns only, i.e. only columns with >= 60% invariability; print rows 1+3 only
AAAT--TCAACA
ATAA--TCACGC
Variable columns only, i.e. deleting columns with >= 20% invariability
For the current example, an empty array should be returned
Only columns 1 2 3 4 5 8 9 10 11 12 13 that are gap_free, print rows 1+3 only
AAATTTCAACA
ATAACTCACGC
Only columns 1 2 3 4 5 7 8 9 10 11 12 13 14 that do not have gaps exclusively, print rows 1+3 only (row 2 has a non-gap character in position 6 !)
AAATT-TCAACAG
ATAAC-TCACGC-
Original sequences 1 and 3, using the 'remove_gaps' function
AAATTTCAACAG
ATAACTCACGC
Rows in reverse
GACAACT--TTAAA
CCCACCTT-AATAA
-CGCACT--CAATA
Rows in complement
TTTAA--AGTTGTC
TTATT-AAGGTGGG
TATTG--AGTGCG-
Rows 1+3 in reverse complement
CTGTTGA--AATTT
-GCGTGA--GTTAT
Copied alignment:
> 1
AAATT--TCAACAG
> 2
AATAA-TTCCACCC
> 3
ATAAC--TCACGC-
Alignment sliced inplace, {ids=>'2 3'},[1..6] :
> 2
AATAA-
> 3
ATAAC-
Alignment after gap_free_cols inplace:
> 2
AATAA
> 3
ATAAC
Alignment after var_sites inplace:
> 2
ATA
> 3
TAC
Alignment after invar_sites inplace, nothing left :
> 2
> 3
Copied alignment:
> 1
AAATT--TCAACAG
> 2
AATAA-TTCCACCC
> 3
ATAAC--TCACGC-
Alignment after revcom inplace; select first and last row by id:
> 1
CTGTTGA--AATTT
> 3
-GCGTGA--GTTAT
Alignment after reverse inplace; select first and last row by function:
> 1
TTTAA--AGTTGTC
> 3
TATTG--AGTGCG-
Alignment after complement inplace; select first and last row by index list:
> 1
AAATT--TCAACAG
> 3
ATAAC--TCACGC-
*** NOW THE SAME, USING USER-SUPPLIED FUNCTIONS: CONSENSUS, (IN)VARIABLE SITES, REVERSE, COMPLEMENT, ETC ***
Current alignment:
> 1
AAATT--TCAACAG
> 2
AATAA-TTCCACCC
> 3
ATAAC--TCACGC-
Dominated sites:
01111010011111
Gap-free sites:
11111001111110
User-supplied consensus per column, threshold 0.34 (see above):
AAAAN--TCAACCN
User-supplied consensus per column, threshold 0.75:
ANNNN-NTCNNNNN
User-supplied consensus of rows (;-)
ACN
Note that in case of a tie, the winning residue is selected arbitrarily
User-supplied consensus of the columns that have purine
AAAANAACCN
Rows in reverse
GACAACT--TTAAA
CCCACCTT-AATAA
-CGCACT--CAATA
Rows in complement
TTTAA--AGTTGTC
TTATT-AAGGTGGG
TATTG--AGTGCG-
Testing parsing and writing of long files:
Differences after several parse/write iterations:
1c1
< > BALANUS
---
> > BALANUS , 1969 bases, 69EFFB75 checksum.
3c3
< > CHTHAMALUS
---
> > CHTHAMALU , 1969 bases, 2ED12D6 checksum.
5c5
< > TETRACLITA
---
> > TETRACLIT , 1969 bases, 690831BC checksum.
7c7
< > CHELONIBIA
---
> > CHELONIBI , 1969 bases, FFFF786 checksum.
9c9
< > CALANTICA
---
> > CALANTICA , 1969 bases, B56E4354 checksum.
11c11
< > LEPAS
---
> > LEPAS , 1969 bases, DF221691 checksum.
13c13
< > OCTOLASMIS
---
> > OCTOLASMI , 1969 bases, A8722E42 checksum.
15c15
< > LOXOTHYLACUS
---
> > LOXOTHYLA , 1969 bases, E0D07B7B checksum.
17c17
< > TRYPETESA
---
> > TRYPETESA , 1969 bases, CF1CC84F checksum.
19c19
< > BERNDTIA
---
> > BERNDTIA , 1969 bases, B72AB092 checksum.
21c21
< > ULOPHYSEMA
---
> > ULOPHYSEM , 1969 bases, F6D6797F checksum.
23c23
< > BRANCHINECTA
---
> > BRANCHINE , 1969 bases, 8507864F checksum.
Parsing clustal format via clustal.
aln in MSF format, via readseq:
/var/tmp/maaa002Xx MSF: 25 Type: N January 01, 1776 12:00 Check: 2752 ..
Name: BS1-fragment Len: 25 Check: 14E8 Weight: 1.00
Name: BS2-fragment Len: 25 Check: 16F1 Weight: 1.00
Name: BS3-fragment Len: 25 Check: 154E Weight: 1.00
Name: BS4-fragment Len: 25 Check: 17B9 Weight: 1.00
//
BS1-fragment tisctgsssn igagnhvkwy qqlpg
BS2-fragment vtisctgtss nigsitvnwy qqlpg
BS3-fragment lrlscsssgf ifssyamywv rqapg
BS4-fragment lsltctvsgt sfddyystwv rqppg
Parsing msf via readseq.
aln in PAUP format, via readseq:
#NEXUS
[/var/tmp/oaaa002Xx -- data title]
[Name: BS1-fragment Len: 25 Check: 38ED3DD2]
[Name: BS2-fragment Len: 25 Check: B9F1855C]
[Name: BS3-fragment Len: 25 Check: 862150F]
[Name: BS4-fragment Len: 25 Check: 3758D80E]
begin data;
dimensions ntax=4 nchar=25;
format datatype=protein interleave missing=-;
matrix
BS1-fragm TISCTGSSSNIGAGNHVKWY QQLPG
BS2-fragm VTISCTGTSSNIGSITVNWY QQLPG
BS3-fragm LRLSCSSSGFIFSSYAMYWV RQAPG
BS4-fragm LSLTCTVSGTSFDDYYSTWV RQPPG
;
end;
Parsing paup via readseq.
aln in PIR format, via readseq; note the problem with the first 6 bases:
\\\
ENTRY BS1
TITLE BS1 31 bases, 353AFCC checksum., 31 bases, 353AFCC checksum.
SEQUENCE
5 10 15 20 25 30
1 - f r a g m T I S C T G S S S N I G A G N H V K W Y Q Q L P
31 G
///
ENTRY BS2
TITLE BS2 31 bases, B9502F87 checksum., 31 bases, B9502F87 checksum.
SEQUENCE
5 10 15 20 25 30
1 - f r a g m V T I S C T G T S S N I G S I T V N W Y Q Q L P
31 G
///
ENTRY BS3
TITLE BS3 31 bases, 7BF16DF8 checksum., 31 bases, 7BF16DF8 checksum.
SEQUENCE
5 10 15 20 25 30
1 - f r a g m L R L S C S S S G F I F S S Y A M Y W V R Q A P
31 G
///
ENTRY BS4
TITLE BS4 31 bases, C9F5FCCA checksum., 31 bases, C9F5FCCA checksum.
SEQUENCE
5 10 15 20 25 30
1 - f r a g m L S L T C T V S G T S F D D Y Y S T W V R Q P P
31 G
///
Parsing pir via readseq.
aln in ASN.1 format, via readseq:
Bioseq-set ::= {
  seq-set {
    seq {
      id { local id 1 },
      descr { title "BS1-fragment , 25 bases, 38ED3DD2 checksum." },
      inst {
        repr raw, mol aa, length 25, topology linear,
        seq-data iupacaa "TISCTGSSSNIGAGNHVKWYQQLPG"
      } } ,
    seq {
      id { local id 2 },
      descr { title "BS2-fragment , 25 bases, B9F1855C checksum." },
      inst {
        repr raw, mol aa, length 25, topology linear,
        seq-data iupacaa "VTISCTGTSSNIGSITVNWYQQLPG"
      } } ,
    seq {
      id { local id 3 },
      descr { title "BS3-fragment , 25 bases, 862150F checksum." },
      inst {
        repr raw, mol aa, length 25, topology linear,
        seq-data iupacaa "LRLSCSSSGFIFSSYAMYWVRQAPG"
      } } ,
    seq {
      id { local id 4 },
      descr { title "BS4-fragment , 25 bases, 3758D80E checksum." },
      inst {
        repr raw, mol aa, length 25, topology linear,
        seq-data iupacaa "LSLTCTVSGTSFDDYYSTWVRQPPG"
      } } ,
  } }
aln in pretty format, via readseq:
TISCTGSSSN IGAGNHVKWY QQLPG
VTISCTGTSS NIGSITVNWY QQLPG
LRLSCSSSGF IFSSYAMYWV RQAPG
LSLTCTVSGT SFDDYYSTWV RQPPG
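The pretty-format rows above show each 25-residue sequence split into space-separated blocks of ten. A minimal Python sketch of that grouping follows; it is inferred from the output alone and is not readseq's implementation. The helper name `blocks_of_ten` and its `block` parameter are hypothetical.

```python
def blocks_of_ten(sequence, block=10):
    """Split a sequence into space-separated blocks of `block` characters.

    Hypothetical helper: it only mimics the grouping visible in the
    pretty-format rows above; the last block may be shorter.
    """
    chunks = [sequence[i:i + block] for i in range(0, len(sequence), block)]
    return " ".join(chunks)


print(blocks_of_ten("TISCTGSSSNIGAGNHVKWYQQLPG"))  # TISCTGSSSN IGAGNHVKWY QQLPG
```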