Dgrep

= Aggregated results =

http://133.39.115.67/~kazuo-h/DDBJ_parsed/table.html

New version: http://133.39.115.67/~kazuo-h/DDBJ_parsed/FeatureQualMax/table.html

= Grepping /home/o0gasawa/data/DDBJ_parsed/parsed/ddbj.88 on the supercomputer =

KAZUO HARA 13:56, 26 April 2012 (JST) FOR ANNOTATORS

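Because the text to be searched is large, an ordinary grep (= cat $OGAD/* | grep cancer) is slow (about 18 minutes).

The text is split into 369 files, so running grep on all 369 files simultaneously on the supercomputer and then aggregating the results cuts the time to about 1 minute.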
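The split-and-aggregate idea can be illustrated on a single machine. The following is only a minimal sketch, not the actual dgrep or grep.sh scripts: the function names `grep_file` and `parallel_grep` and the `workers` parameter are hypothetical, and a local thread pool stands in for the supercomputer's job distribution, so it will not reproduce the timings quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def grep_file(path, keyword):
    """Return the lines of one file that contain the keyword (plain substring match)."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if keyword in line:
                hits.append(line.rstrip("\n"))
    return hits


def parallel_grep(directory, keyword, workers=8):
    """Search every file in a directory concurrently and merge the hits.

    Hypothetical stand-in for distributing one grep per file across nodes.
    """
    paths = sorted(p for p in Path(directory).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so hits come out in sorted file order.
        per_file = pool.map(lambda p: grep_file(p, keyword), paths)
        return [line for hits in per_file for line in hits]
```

Calling `parallel_grep(directory, "cancer")` on a directory holding the split files would return the matching lines as one list, which corresponds to the aggregation step.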

Usage (to grep for the keyword "cancer"): /home/kahara-d/dgrep/dgrep cancer

Script dgrep:

Script grep.sh:

= Data file format, by OGA =

Format description

Each line has 3 columns, tab-separated (\t).
 * 1) accession number
 * 2) predicate
 * 3) value

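As exceptions, the following lines have 4 columns because they are hierarchical.
 * Lines whose predicate starts with feature: and lines whose predicate starts with qualifier:
 * Lines whose predicate is line:REFERENCE and lines whose predicate starts with reference: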
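The 3-column versus 4-column rule can be sketched as a small parser. This is an illustration under assumptions rather than part of the original tooling: the name `parse_line` and the returned tuple layout are hypothetical, and the extra column of the hierarchical lines is treated as an integer serial number.

```python
HIERARCHICAL_PREFIXES = ("feature:", "qualifier:", "reference:")


def parse_line(line):
    """Split one parsed-data line into (accession, predicate, serial, value).

    serial is None for ordinary 3-column lines; value is "" when a line
    carries no value column at all.
    """
    fields = line.rstrip("\n").split("\t")
    accession, predicate = fields[0], fields[1]
    rest = fields[2:]
    hierarchical = (
        predicate.startswith(HIERARCHICAL_PREFIXES)
        or predicate == "line:REFERENCE"
    )
    if hierarchical:
        # 4-column line: the third column is the serial number.
        serial = int(rest[0])
        value = "\t".join(rest[1:])
    else:
        # 3-column line: everything after the predicate is the value.
        serial = None
        value = "\t".join(rest)
    return accession, predicate, serial, value
```

For example, `parse_line("AB000100\tfeature:CDS\t2\t121..912")` yields the accession, the predicate `feature:CDS`, the serial number 2 and the location string as separate items.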

The correspondence between feature: and qualifier: lines is expressed as shown below.

AB000100	feature:source	1	1..2992
AB000100	qualifier:/clone_lib	1	"constructed in pBluescript II KS-"
AB000100	qualifier:/db_xref	1	"taxon:1140"
AB000100	qualifier:/mol_type	1	"genomic DNA"
AB000100	qualifier:/organism	1	"Synechococcus elongatus PCC 7942"
AB000100	qualifier:/strain	1	"PCC 7942"


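 * In a line whose 2nd column starts with feature:, the 3rd column holds the serial number of that feature within the entry.
 * In a line whose 2nd column starts with qualifier:, the 3rd column likewise holds the serial number of the feature within the entry.
 * Matching the numbers in the 3rd column therefore links each feature to the qualifiers that belong to it.

The same applies to the relationship between line:REFERENCE lines and lines starting with reference:.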
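Matching on the 3rd column can be sketched as follows. This is a minimal illustration, assuming the tab-separated layout described above; the name `group_features` and the shape of the returned dictionary are hypothetical. Qualifier values are collected in lists because a qualifier may occur more than once in a feature.

```python
def group_features(lines):
    """Map (accession, serial) to a feature and the qualifiers that belong to it."""
    features = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 4:
            continue  # 3-column lines cannot be feature: or qualifier: lines
        accession, predicate, serial, value = fields
        kind, _, name = predicate.partition(":")
        if kind not in ("feature", "qualifier"):
            continue  # e.g. line:REFERENCE and reference: lines
        entry = features.setdefault(
            (accession, int(serial)),
            {"feature": None, "location": None, "qualifiers": {}},
        )
        if kind == "feature":
            entry["feature"], entry["location"] = name, value
        else:
            entry["qualifiers"].setdefault(name, []).append(value)
    return features
```

The same approach would work for line:REFERENCE and reference: lines by changing the two predicate kinds that are accepted.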

Parsed data
AB000100	line:ACCESSION__primary	AB000100
AB000100	line:BASE_COUNT	654 a         759 c          785 g          794 t
AB000100	line:COMMENT	\n
AB000100	line:DEFINITION	Synechococcus elongatus PCC 7942 genes for intrinsic membrane protein, malK-like protein, cyanase, complete cds.
AB000100	line:FEATURES	Location/Qualifiers
AB000100	line:KEYWORDS	.
AB000100	line:LINEAGE	Bacteria; Cyanobacteria; Chroococcales; Synechococcus.
AB000100	line:LOCUS__accession	AB000100
AB000100	line:LOCUS__date	15-MAY-2009
AB000100	line:LOCUS__division	BCT
AB000100	line:LOCUS__length	2992
AB000100	line:LOCUS__moltype	DNA
AB000100	line:LOCUS__topology	linear
AB000100	line:ORGANISM	Synechococcus elongatus PCC 7942
AB000100	line:ORIGIN
AB000100	line:SOURCE	Synechococcus elongatus PCC 7942
AB000100	line:VERSION	AB000100.1
AB000100	line:REFERENCE	1	1 (bases 1 to 2992)
AB000100	reference:AUTHORS	1	Omata,T.
AB000100	reference:JOURNAL	1	Submitted (26-DEC-1996) to the DDBJ/EMBL/GenBank databases. Contact:Tatsuo Omata School of Agricultural Sciences, Nagoya University, Department of Applied Biological Sciences; Chikusa, Nagoya, Aichi 464-01, Japan
AB000100	reference:TITLE	1	Direct Submission
AB000100	line:REFERENCE	2	2
AB000100	reference:AUTHORS	2	Harano,Y., Suzuki,I., Maeda,S., Kaneko,T., Tabata,S. and Omata,T.
AB000100	reference:JOURNAL	2	J. Bacteriol. 179, 5744-5750 (1997)
AB000100	reference:TITLE	2	Identification and nitrogen regulation of the cyanase gene from the cyanobacteria Synechocystis sp. strain PPC 6803 and Synechococcus sp. strain PPC 7942
AB000100	feature:source	1	1..2992
AB000100	qualifier:/clone_lib	1	"constructed in pBluescript II KS-"
AB000100	qualifier:/db_xref	1	"taxon:1140"
AB000100	qualifier:/mol_type	1	"genomic DNA"
AB000100	qualifier:/organism	1	"Synechococcus elongatus PCC 7942"
AB000100	qualifier:/strain	1	"PCC 7942"
AB000100	feature:CDS	2	121..912
AB000100	qualifier:/codon_start	2	1
AB000100	qualifier:/gene	2	"cynB"
AB000100	qualifier:/product	2	"intrinsic membrane protein"
AB000100	qualifier:/protein_id	2	"BAA21794.1"
AB000100	qualifier:/transl_table	2	11
AB000100	qualifier:/translation	2	"MVRTPVPLYLRWAVSILSVLAFLAIWQIAAASGFLGKTFPGSLR TLQDLFGWLSDPFFDNGPNDLGIGWNLLISLRRVAIGYLLATVVAIPLGIAIGMSALA SSIFSPFVQLLKPVSPLAWLPIGLFLFRDSELTGVFVILISSLWPTLINTAFGVANVN PDFLKVSQSLGASRWRTILKVILPAALPSIIAGMRISMGIAWLVIVAAEMLLGTGIGY FIWNEWNNLSLPNIFSAIIIIGIVGILLDQGFRFLENQFSYAGNR"
AB000100	feature:CDS	3	916..1785
AB000100	qualifier:/codon_start	3	1
AB000100	qualifier:/gene	3	"cynD"
AB000100	qualifier:/product	3	"malK-like protein"
AB000100	qualifier:/protein_id	3	"BAA21795.1"
AB000100	qualifier:/transl_table	3	11
AB000100	qualifier:/translation	3	"MISEAVPAKEETGQAQLLIEQVGKVFTVNSPSLLDRLRQRSPKR YVALEDVNLTIASNTFVSIIGPSGCGKSTLLNLIAGLDLPTSGQILLDGQRIRSPGPD RGIVFQNYALMPWMTALENVIFAVETARPNLSKSQAREVAREHLELVGLTKAADRYPG QISGGMKQRVAIARALSIRPKLLLMDEPFGALDALTRGYLQEEVLRIWEANKLSVVLI THSIDEALLLSDRIVVMSRGPRATIREVIDLPAVRPRQRSVIEEDERFVKIKLRLEEH LFNETRAVEEASV"
AB000100	feature:CDS	4	1796..2236
AB000100	qualifier:/EC_number	4	"4.2.1.104"
AB000100	qualifier:/codon_start	4	1
AB000100	qualifier:/gene	4	"cynS"
AB000100	qualifier:/product	4	"cyanase"
AB000100	qualifier:/protein_id	4	"BAA19515.1"
AB000100	qualifier:/transl_table	4	11
AB000100	qualifier:/translation	4	"MTSAITEQLLKAKKAKGITFTELEQLLGRDEVWIASVFYRQSTA SPEEAEKLLTALGLDLALADELTTPPVKGCLEPVIPTDPLIYRFYEIMQVYGLPLKDV IQEKFGDGIMSAIDFTLDVDKVEDPKGDRVKVTMCGKFLAYKKW"
AB000106	line:ACCESSION__primary	AB000106
AB000106	line:BASE_COUNT	328 a         313 c          423 g          279 t
AB000106	line:COMMENT	\n
AB000106	line:DEFINITION	Sphingomonas sp. 16S ribosomal RNA.
AB000106	line:FEATURES	Location/Qualifiers
AB000106	line:KEYWORDS	16S rRNA.
AB000106	line:LINEAGE	Bacteria; Proteobacteria; alpha subdivision; Zymomonas group; Sphingomonas.
AB000106	line:LOCUS__accession	AB000106
AB000106	line:LOCUS__date	05-FEB-1999
AB000106	line:LOCUS__division	BCT
AB000106	line:LOCUS__length	1343
AB000106	line:LOCUS__moltype	rRNA
AB000106	line:LOCUS__topology	linear
AB000106	line:ORGANISM	Sphingomonas sp.
AB000106	line:ORIGIN
AB000106	line:SOURCE	Sphingomonas sp.
AB000106	line:VERSION	AB000106.1

(remainder omitted)

Data before parsing (DDBJ flat file)
LOCUS      AB000100                2992 bp    DNA     linear   BCT 15-MAY-2009
DEFINITION Synechococcus elongatus PCC 7942 genes for intrinsic membrane protein, malK-like protein, cyanase, complete cds.
ACCESSION  AB000100
VERSION    AB000100.1
KEYWORDS   .
SOURCE     Synechococcus elongatus PCC 7942
  ORGANISM Synechococcus elongatus PCC 7942
           Bacteria; Cyanobacteria; Chroococcales; Synechococcus.
REFERENCE  1  (bases 1 to 2992)
  AUTHORS  Omata,T.
  TITLE    Direct Submission
  JOURNAL  Submitted (26-DEC-1996) to the DDBJ/EMBL/GenBank databases. Contact:Tatsuo Omata School of Agricultural Sciences, Nagoya University, Department of
           Applied Biological Sciences; Chikusa, Nagoya, Aichi 464-01, Japan
REFERENCE  2
  AUTHORS  Harano,Y., Suzuki,I., Maeda,S., Kaneko,T., Tabata,S. and Omata,T.
  TITLE    Identification and nitrogen regulation of the cyanase gene from the cyanobacteria Synechocystis sp. strain PPC 6803 and Synechococcus sp. strain PPC 7942
  JOURNAL  J. Bacteriol. 179, 5744-5750 (1997)
COMMENT
FEATURES            Location/Qualifiers
     source         1..2992
                    /clone_lib="constructed in pBluescript II KS-"
                    /db_xref="taxon:1140"
                    /mol_type="genomic DNA"
                    /organism="Synechococcus elongatus PCC 7942"
                    /strain="PCC 7942"
     CDS            121..912
                    /codon_start=1
                    /gene="cynB"
                    /product="intrinsic membrane protein"
                    /protein_id="BAA21794.1"
                    /transl_table=11
                    /translation="MVRTPVPLYLRWAVSILSVLAFLAIWQIAAASGFLGKTFPGSLR
                    TLQDLFGWLSDPFFDNGPNDLGIGWNLLISLRRVAIGYLLATVVAIPLGIAIGMSALA
                    SSIFSPFVQLLKPVSPLAWLPIGLFLFRDSELTGVFVILISSLWPTLINTAFGVANVN
                    PDFLKVSQSLGASRWRTILKVILPAALPSIIAGMRISMGIAWLVIVAAEMLLGTGIGY
                    FIWNEWNNLSLPNIFSAIIIIGIVGILLDQGFRFLENQFSYAGNR"
     CDS            916..1785
                    /codon_start=1
                    /gene="cynD"
                    /product="malK-like protein"
                    /protein_id="BAA21795.1"
                    /transl_table=11
                    /translation="MISEAVPAKEETGQAQLLIEQVGKVFTVNSPSLLDRLRQRSPKR
                    YVALEDVNLTIASNTFVSIIGPSGCGKSTLLNLIAGLDLPTSGQILLDGQRIRSPGPD
                    RGIVFQNYALMPWMTALENVIFAVETARPNLSKSQAREVAREHLELVGLTKAADRYPG
                    QISGGMKQRVAIARALSIRPKLLLMDEPFGALDALTRGYLQEEVLRIWEANKLSVVLI
                    THSIDEALLLSDRIVVMSRGPRATIREVIDLPAVRPRQRSVIEEDERFVKIKLRLEEH
                    LFNETRAVEEASV"
     CDS            1796..2236
                    /codon_start=1
                    /EC_number="4.2.1.104"
                    /gene="cynS"
                    /product="cyanase"
                    /protein_id="BAA19515.1"
                    /transl_table=11
                    /translation="MTSAITEQLLKAKKAKGITFTELEQLLGRDEVWIASVFYRQSTA
                    SPEEAEKLLTALGLDLALADELTTPPVKGCLEPVIPTDPLIYRFYEIMQVYGLPLKDV
                    IQEKFGDGIMSAIDFTLDVDKVEDPKGDRVKVTMCGKFLAYKKW"
BASE COUNT         654 a          759 c          785 g          794 t
ORIGIN
        1 ctgcagccgc cgactgaaat ctatcgggaa gaaaagctcg cttacgacac ctttaacccg
       61 caggatccag tcgcttacct cgcatctcaa aagcagaaat acgggagata aacacaactt
      121 atggtgagaa ctcctgtacc gctttaccta cgttgggcgg tctccatcct cagcgtgctt
      181 gcgttcctag ccatttggca aattgcggca gcttcaggat ttttaggcaa aacttttcct
      241 ggctccctgc gcactttgca ggatttgttt ggatggcttt cagatccctt ctttgataac
      301 ggccccaatg acttagggat tggctggaac ttactgatta gtttgcgtcg cgttgcgatc
      361 ggctacctgc tggcaacagt tgttgcaatt cctttgggga ttgcaatcgg tatgtcggcg
      421 ctagcttcca gtattttttc gccctttgtg caactcctga agccagtttc acctttggcc
      481 tggttgccga ttggtctctt cttattccga gattcggaat tgacgggtgt ttttgtcatc
      541 ctgatttcga gtctgtggcc aacgttgatc aacacagcgt ttggggtggc gaatgtcaat
      601 cctgactttt tgaaggtttc gcaatctttg ggagctagtc gttggcgcac gattctgaag
      661 gtgattctgc ccgcagcatt gcccagcatc atcgcgggaa tgcggatcag catgggcatt
      721 gcttggctgg tcattgtggc agcagagatg ctgttgggaa caggaattgg ctatttcatt
      781 tggaatgagt ggaataacct atcacttcct aatattttct cggccatcat catcattggg
      841 attgttggca ttcttctcga ccaaggcttc cgttttcttg agaaccagtt ttcttacgca
      901 ggcaaccgat aacccatgat ttctgaagct gtgccagcca aggaggagac agggcaggct
      961 caattgctga ttgagcaagt tggcaaagtt tttactgtca attcaccttc tctcctcgat
     1021 cgccttcgac agcgatcgcc caaacgctac gttgcattag aagatgtcaa cctcacgatc
     1081 gcgtcgaaca catttgtctc gattattggc ccttcgggtt gtggtaaatc aacccttctc
     1141 aacttgattg ctggccttga tttaccaacg tctggccaga ttctgctgga tggtcaacgc
     1201 attcgatcgc cggggcccga tcgtggcatc gtcttccaga actatgccct gatgccctgg
     1261 atgaccgcgc ttgagaatgt catctttgca gttgaaacgg cgcgcccaaa cctgagcaaa
     1321 tcccaagctc gcgaagtggc acgagagcat ctagagctgg tgggtttaac caaagctgcc
     1381 gatcgctatc cgggccaaat ttcagggggg atgaaacagc gcgtagcgat cgcccgtgcc
     1441 ctctccatcc gtcctaagct cctgctgatg gatgaaccct ttggtgcctt ggatgccctc
     1501 acccgtggct acctccaaga agaagtgctg cggatttggg aagccaacaa actgagtgtg
     1561 gtgctcatca ctcacagtat tgatgaagca ctgctgcttt ccgatcgcat tgtggtgatg
     1621 tctcgtgggc cacgagccac tattcgagaa gtgattgatt taccagccgt tcgccctcgg
     1681 caacggtctg tgatcgaaga agatgagcgc ttcgtcaaaa tcaaattgcg ccttgaagaa
     1741 catttgttca acgagacgcg tgcagttgaa gaagccagtg tttaggagaa ttccaatgac
     1801 ctcagcgatt actgaacaac ttctgaaagc gaaaaaagca aagggaatta cctttactga
     1861 gcttgagcaa ttacttggac gggatgaagt ctggattgcg agtgtgttct accgtcaatc
     1921 tacggcttcg cctgaagagg cagaaaagct actgactgct ctgggcttag atctggcctt
     1981 ggctgatgag ttgacgactc cgccggtcaa aggttgtttg gaaccggtga ttccaactga
     2041 tccgttgatc tatcgcttct acgaaatcat gcaggtctat ggcttgcccc tcaaggatgt
     2101 tatccaagaa aaatttggcg atggcatcat gagtgcgatt gatttcacct tagatgtcga
     2161 taaggttgaa gatcccaaag gcgatcgcgt taaggtcacg atgtgtggca agttcttggc
     2221 gtacaagaag tggtaaatac tgctagctaa tcaagcttca attcttgatc actggaggag
     2281 agaggtttcc gcttctctcc ttttttgatt ggaattctct cattaactac gataccgctc
     2341 tgcactgaat gacctcgagc tgagtggaag gtagctcgcc gccgatgata atggcgcctc
     2401 tggaagagtt tggctaagct gtggacggcg atcgcggttg tctgtctgtg ctatgccctt
     2461 gatttcggtg acccgactca agcttagaaa tgttctttat ttgccccgct tgcttccctt
     2521 ctcgttgcga tcgacgtggc aggctaaacg agcgcctggc aatctgggcg ttaagctgtt
     2581 gcaggatcgt aacttggctt tttggacctg caccgcttgg acggatgaag gagccatgcg
     2641 tcggttcatg agagcggatg cccacgggca ggccatgacg aaattgatgg attggtgcag
     2701 cgaagcctca gtcgtccatt ggcagcagga tcagccagac ttgcccgact ggcaggaagc
     2761 tcaccgccgc atgatcgcgg aggggcgccc ctccaaagtg aaccatcctt cggctgccca
     2821 ccaagcattt caggtcgatc cgccgcgccg cgcctagctc agtgactgcg gtcgcgctgt
     2881 cttgcatcat tgcttcgctc taccagcccg gatcgctggc acagtccacg gtgatctcac
     2941 ccgaggcggc atcgggaatc gcagtgatac agccgcagac tggctcgcca tc
//