Collection of retrofit examples

Examples described in the release notes
Extracted from "9. Release history" of the release notes

-- Since release 81 -- TPA category data have been excluded from the DDBJ periodical release: Since September 2002 (DDBJ release 51), the DDBJ periodical releases had included TPA category data. However, this is potentially confusing, because the TPA category is not primary nucleotide sequence data. DDBJ has therefore stopped including TPA data. TPA data are available from a separate FTP site. See the following site for details. URL; http://www.ddbj.nig.ac.jp/whatsnew/whatsnew2009-e.html#090828

-- Since release 80 -- The format of the SOURCE line in the DDBJ flat file has been changed: The SOURCE lines in some DDBJ flat files now include a common name, as in the GenBank flat file. The change is shown below.

Old (-rel. 79)

Format:  SOURCE      <scientific name> [<organelle>]
Example: SOURCE     Homo sapiens mitochondrion

New (rel. 80-)

Format:  SOURCE     [<organelle>] <scientific name> [(<common name>)]
Example: SOURCE     mitochondrion Homo sapiens (human)

See also '2. DDBJ flat file format'.
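The rel. 80 example can be read as an optional organelle, then the scientific name, then an optional common name in parentheses. The sketch below builds a SOURCE line under that reading. The function name and the 11-column keyword field are assumptions taken from the example above, not from the format specification.

```python
from typing import Optional


def format_source_line(scientific_name: str,
                       organelle: Optional[str] = None,
                       common_name: Optional[str] = None) -> str:
    """Build a SOURCE line in the rel. 80 ordering (hypothetical helper)."""
    parts = []
    if organelle:
        parts.append(organelle)              # organelle comes first when present
    parts.append(scientific_name)
    if common_name:
        parts.append(f"({common_name})")     # common name goes in parentheses
    # Keyword field width of 11 is assumed from the example in this document.
    return "SOURCE".ljust(11) + " ".join(parts)


print(format_source_line("Homo sapiens", "mitochondrion", "human"))
```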

-- Since release 79 -- A new line, DBLINK, has replaced PROJECT line:

Following the agreement at the INSD collaborative meeting in 2008, the scope of the project ID has expanded to include projects that are not necessarily targeted to the sequencing of a complete genome. In addition, there are other resources such as the Trace Assembly Archive at the NCBI and the like.

Therefore, we have decided to replace the PROJECT line by a new line format, "DBLINK".

The replacement is illustrated in the following;

From the use of the PROJECT line (-release 78);
---
LOCUS      AP000000             4700000 bp    DNA     circular BCT 27-FEB-2009
DEFINITION Escherichia coli DDBJ genomic DNA, complete genome.
ACCESSION  AP000000
VERSION    AP000000.1
PROJECT    GenomeProject:99999
KEYWORDS.
---

To the DBLINK line format (release 79-);
---
LOCUS      AP000000             4700000 bp    DNA     circular BCT 27-FEB-2009
DEFINITION Escherichia coli DDBJ genomic DNA, complete genome.
ACCESSION  AP000000
VERSION    AP000000.1
DBLINK     Project:99999
KEYWORDS.
---
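A minimal sketch of the line-level rewrite shown in this example, assuming every PROJECT line has the form `PROJECT    GenomeProject:<digits>`. The function name is hypothetical, and lines that do not match are returned unchanged.

```python
import re

# Matches a PROJECT line as it appears in the -release 78 example.
_PROJECT_RE = re.compile(r"^PROJECT(\s+)GenomeProject:(\d+)\s*$")


def project_to_dblink(line: str) -> str:
    """Rewrite a PROJECT line as a DBLINK line (hypothetical helper)."""
    match = _PROJECT_RE.match(line)
    if match is None:
        return line                      # not a PROJECT line: leave untouched
    # Keep the value in the same column as in the original line.
    value_column = len("PROJECT") + len(match.group(1))
    return "DBLINK".ljust(value_column) + "Project:" + match.group(2)


print(project_to_dblink("PROJECT    GenomeProject:99999"))
```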

-- Since release 75 -- A new division for assembled mRNA sequences, Transcriptome Shotgun Assembly (TSA), has been included since the release 75.

With new sequencing technologies in use, INSDC has faced many requests to accept assembled EST sequences. These sequence data have become more useful than they used to be, although they may not be correctly assembled or may not exist in nature. Therefore, INSDC decided to collect assembled EST sequences and classify them into the new division 'TSA'.

TSA sequences are shotgun assemblies of primary sequences deposited in the EST division of INSDC, the Trace Archive (TA) or the Short-Read Archive (SRA). Two specific keywords, "TSA" and "Transcriptome Shotgun Assembly", are present in all TSA entries. The new division code, "TSA", is also given in the LOCUS line of all TSA entries.

No format changes in the flat file are anticipated for the TSA division; however, note that TSA entries make use of the same PRIMARY line that is described for entries in the TPA category. The PRIMARY block contains references to the underlying reads/transcripts that are assembled to construct a TSA record.

Note that it is required for a TSA submission to submit sequence data of primary transcripts to the EST division of INSDC, TA, or SRA. More information about how to submit a TSA entry is provided via the following URL; http://www.ddbj.nig.ac.jp/sub/tsa-e.html

-- Since release 73 -- Introduction of the sequence data from the Korean Intellectual Property Office:

The nucleotide sequence data transferred from the Korean Intellectual Property Office (KIPO) have been included in the DDBJ release. See also '3. Division categories' and '3.1. Notice for patent related sequence data'.

-- Since release 72 -- Deletion of E-mail address, phone and fax numbers from DDBJ flat file:

To comply with the Japanese law on the protection of personal information, DDBJ deleted phone numbers, fax numbers, and E-mail addresses from the flat files of the entries submitted to DDBJ. This should also help to protect DDBJ releases against SPAM mail senders. DDBJ retrofitted almost all entries submitted to DDBJ (not those submitted to GenBank or EMBL) by DDBJ periodical release 72.

Previously, the submitter information was described in the JOURNAL line of REFERENCE 1 as follows;
---
REFERENCE  1  (bases 1 to 1200)
  AUTHORS  Mishima,T.
  TITLE    Direct Submission
  JOURNAL  Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
           Taro Mishima, DNA Data Bank of Japan, National Institute of
           Genetics; 1111, Yata, Mishima, Shizuoka 411-8540, Japan
           (E-mail:ddbj@ddbj.nig.ac.jp, URL:http://www.ddbj.nig.ac.jp/,
           Tel:81-12-345-6789, Fax:81-12-345-9876)
---

After the deletion of the information in question, a DDBJ flat file is one of the following two types;

Type 1: Phone and fax numbers and E-mail address are deleted.
---
REFERENCE  1  (bases 1 to 1200)
  AUTHORS  Mishima,T.
  TITLE    Direct Submission
  JOURNAL  Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
           Contact:Taro Mishima
           DNA Data Bank of Japan, National Institute of Genetics;
           1111, Yata, Mishima, Shizuoka 411-8540, Japan
           URL   :http://www.ddbj.nig.ac.jp/
---

Type 2: When the submitters wish to keep their contact information disclosed, it is described as follows;
---
REFERENCE  1  (bases 1 to 1200)
  AUTHORS  Mishima,T.
  TITLE    Direct Submission
  JOURNAL  Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
           Contact:Taro Mishima
           DNA Data Bank of Japan, National Institute of Genetics;
           1111, Yata, Mishima, Shizuoka 411-8540, Japan
           URL   :http://www.ddbj.nig.ac.jp/
           E-mail :ddbj@ddbj.nig.ac.jp
           Phone  :81-12-345-6789
           Fax   :81-12-345-9876
---
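The difference between the two types can be sketched as a filter over the contact lines. This assumes the contact block is already split into one string per line and that the private fields carry the labels `E-mail`, `Phone` and `Fax`, as in the Type 2 example. The names below are hypothetical, and this is not DDBJ's retrofit procedure.

```python
# Labels of the fields that Type 1 omits (taken from the Type 2 example).
PRIVATE_LABELS = {"E-mail", "Phone", "Fax"}


def redact_contact_lines(lines: list) -> list:
    """Drop E-mail, Phone and Fax lines, keeping everything else."""
    kept = []
    for line in lines:
        # The label is whatever precedes the first colon, ignoring padding.
        label = line.split(":", 1)[0].strip()
        if label in PRIVATE_LABELS:
            continue
        kept.append(line)
    return kept


type2 = [
    "Contact:Taro Mishima",
    "URL   :http://www.ddbj.nig.ac.jp/",
    "E-mail :ddbj@ddbj.nig.ac.jp",
    "Phone  :81-12-345-6789",
    "Fax   :81-12-345-9876",
]
for line in redact_contact_lines(type2):
    print(line)
```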

-- Since release 69 -- Introduction of the project ID at the PROJECT line in the DDBJ flat file: Following the agreement at the INSD collaborative meeting in 2006, INSDC has started to assign project IDs to submissions from sequencing projects. The project ID is described as follows;

A unique identifier, assigned at the time of the submission by a sequencing project that informed INSDC of the submission beforehand. It is recommended that the submitter quotes the assigned project ID in all communication with INSDC databases to allow for easier and faster tracking of issues. The project ID field provides an umbrella identifier that points to all related sequence data for the project.

The PROJECT line contains the INSDC-assigned ID for the sequencing project. It appears between the VERSION and KEYWORDS lines in DDBJ flat files from DDBJ periodical release 69 onward, as shown below. See also '2. DDBJ flat file format'.

ACCESSION  AB012345
VERSION    AB012345.1
PROJECT    GenomeProject:123
KEYWORDS.

Termination of providing the index files for each category:

-- Since release 68 -- Split of files: We changed the maximum file size from 300 MB to 1.5 GB, because network capacity has increased remarkably. Each file named ddbj***##.seq is at most 1.5 GB in size. See also the sections '7. File categories' and '10. File list'.

-- Since release 64 -- Split of index files: In the present release, some of the index files (ddbjacc.idx, ddbjjou.idx, and ddbjkey.idx) have exceeded 2 GB in file size. Each of them is therefore recorded in multiple ddbj****.idx files of at most 1.5 GB each. See also 7., 8.2., 8.3., 8.4. and 10.

-- Since release 62 -- Release version number is introduced: DDBJ has started to include the item 'version' in its release note, which indicates a version of its periodical release. It is expressed like '62.0', in which the digit(s) after the period is the version number. The version number is added because released data are sometimes revised when urgent and necessary corrections arise. The number is increased by one every time a revised periodical release is made public, until the next release.

Introduction of ENV division: Recently, the submissions of the sequences derived from environmental samples have rapidly increased. To accommodate such submissions, a new division, ENV, has been created (See also '3.1. Division categories'). This division contains the sequences obtained via direct molecular isolation such as PCR, DGGE, or any anonymous method. In the past, the sequences derived from environmental samples belonged to taxonomic divisions, mainly BCT. At DDBJ, the retrofit to transfer relevant entries from taxonomic divisions to the ENV division starts in the present release, and ends by the next periodical release. Please note that during this transitional period, some entries to be eventually placed in the ENV division will be found in other divisions.

Strand information is removed: The strand information of LOCUS line in the flat file has been removed as shown below. See also '2.1. LOCUS line'.

Old (-rel. 61):
  44-44    space
  45-47    spaces, ss- (single-stranded), ds- (double-stranded), or
           ms- (mixed-stranded)

New (rel. 62-):
  44-47    spaces
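A minimal sketch of this retrofit on a single LOCUS line, assuming the column numbers above are 1-based, so that columns 45-47 map to the Python slice `[44:47]`. The function name is hypothetical, and the line used in the usage example is synthetic.

```python
# Strand codes that occupied columns 45-47 up to release 61.
STRAND_CODES = {"ss-", "ds-", "ms-"}


def strip_strand_field(line: str) -> str:
    """Blank out the strand field of a LOCUS line, keeping its length."""
    if not line.startswith("LOCUS"):
        return line                      # only LOCUS lines carry the field
    if line[44:47] not in STRAND_CODES:
        return line                      # already blank, or too short
    return line[:44] + "   " + line[47:]


# Synthetic line: padding up to column 44, then a strand code and a molecule type.
old_line = "LOCUS      AB000001".ljust(44) + "ds-" + "DNA"
new_line = strip_strand_field(old_line)
print(repr(new_line[43:50]))
```

Replacing the three characters with spaces, rather than deleting them, keeps every later column where it was.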

-- Since release 61 -- The style of release note (this file) has been changed.

Some entries use a range format for the secondary accession numbers in the ACCESSION line, in order to shorten the list of past secondary accession numbers. For example;
--
Before; ACCESSION  AB000802 D85885 D85886 D85887
After;  ACCESSION  AB000802 D85885-D85887
--
See also '2.3. ACCESSION line'.
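The Before/After example can be sketched as follows. The assumptions are that the first accession is the primary one and is never merged, that a run needs the same letter prefix and digit width with numbers increasing by one, and that runs of two or more are collapsed. The source shows only a run of three, so the threshold is a guess, and the function names are hypothetical.

```python
import re

_ACC_RE = re.compile(r"^([A-Z]+)(\d+)$")


def _is_successor(prev: str, cur: str) -> bool:
    """True if cur directly follows prev (same prefix, same width, +1)."""
    a, b = _ACC_RE.match(prev), _ACC_RE.match(cur)
    if not a or not b:
        return False
    return (a.group(1) == b.group(1)
            and len(a.group(2)) == len(b.group(2))
            and int(b.group(2)) == int(a.group(2)) + 1)


def compress_accession_line(accessions: list) -> str:
    """Render an ACCESSION line with consecutive secondaries as ranges."""
    primary, secondary = accessions[0], accessions[1:]
    parts = [primary]
    i = 0
    while i < len(secondary):
        j = i
        # Extend the run while the next accession is the direct successor.
        while j + 1 < len(secondary) and _is_successor(secondary[j], secondary[j + 1]):
            j += 1
        parts.append(f"{secondary[i]}-{secondary[j]}" if j > i else secondary[i])
        i = j + 1
    return "ACCESSION  " + " ".join(parts)


print(compress_accession_line(["AB000802", "D85885", "D85886", "D85887"]))
```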

-- Since release 60 -- The cross-reference to the H-invitational has been included.

-- Since release 56 -- The three data banks have agreed that the maximum length limitation (350 kb) of a submitted sequence be relaxed.

The BASE COUNT line of the DDBJ flat file format has been changed in response to the relaxation of the maximum sequence length per entry that had been applied at the DDBJ/EMBL/GenBank International Nucleotide Sequence Databases. Previously, 6 digits were allocated for each count of a, c, g, t and other bases in the sequence. In the new flat file format, 9 digits are allocated for each count of a, c, g and t, and the count of other bases is removed. In accordance with the relaxation of the sequence length limitation, GenBank had already dropped the BASE COUNT line from its flat file format as of GenBank Release 138 (Oct. 2003). DDBJ has decided to keep the BASE COUNT line in its flat file format, on the view that GC content is still important information for characterizing a sequence. The changes in the BASE COUNT line are shown below.

Old (-rel. 55):
1    6    11   16   21   26   31   36   41   46   51   56   61   66   71
|    |    |    |    |    |    |    |    |    |    |    |    |    |    |
BASE COUNT   123456 a 123456 c 123456 g 123456 t 123456 others

New (rel. 56-):
1    6    11   16   21   26   31   36   41   46   51   56   61   66   71
|    |    |    |    |    |    |    |    |    |    |    |    |    |    |
BASE COUNT    123456789 a    123456789 c    123456789 g    123456789 t
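A minimal sketch of producing a rel. 56-style BASE COUNT line from a sequence. Each count is right-justified in 9 digits, as stated above. The four-space gap between fields is copied from the sample line and may not match the exact column positions of the real format. Other bases (for example `n`) are simply not reported, and the function names are hypothetical.

```python
from collections import Counter


def base_count_line(sequence: str) -> str:
    """Format a, c, g, t counts with 9 digits each (spacing is assumed)."""
    counts = Counter(sequence.lower())
    fields = "".join(f"    {counts[base]:>9} {base}" for base in "acgt")
    return "BASE COUNT" + fields


def gc_fraction(sequence: str) -> float:
    """GC content over the a, c, g, t bases only; 0.0 for an empty count."""
    counts = Counter(sequence.lower())
    total = sum(counts[base] for base in "acgt")
    if total == 0:
        return 0.0
    return (counts["g"] + counts["c"]) / total


print(base_count_line("aacgtn"))
print(gc_fraction("aacgtn"))
```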

-- Since release 54 -- '/sequenced_mol' qualifier has been changed to '/mol_type' qualifier. We accordingly completed retrofitting the pertinent entries. This change was made on the agreement at the INSD collaborative meeting in 2002.

-- Since release 51 -- The format of LOCUS line in the flat file has been changed as shown below to adjust to the GenBank format.
--
Old (-rel. 50):
LOCUS      AB000001      660 bp    DNA             PLN       01-FEB-2001
New (rel. 51-):
LOCUS      AB000001                 660 bp    DNA     linear   PLN 01-FEB-2001
--

-- Since release 45 -- The HTC (High Throughput cDNA) division has been included. This is to include unfinished high throughput cDNA sequences, each of which has 5'UTR and 3'UTR at both ends and part of a coding region. The sequence may also include introns. When the sequence becomes finished later, it moves to the corresponding taxonomic division. The sequence is accompanied with a keyword, HTC (High Throughput cDNA), which is dropped when the sequence is finished and moved to a taxonomic division.

-- Since release 41 -- The CON division has been included. This division is to show the order of related sequences in a genome, and expressed by join and the accession numbers of the sequences. The contents of the CON division are compiled by the three data banks not by the data submitter.

-- Since release 40 -- The RNA division was terminated.

-- Since release 37 -- The three data banks include the item VERSION in the flat file, which indicates a version of a submitted nucleotide sequence. It is expressed like AB123456.1, in which the digit(s) after the period is the version number. VERSION was added because a released sequence is sometimes revised by the submitter, so the accession number alone cannot specify the sequence in question, which causes trouble for users. The number is increased by one every time a revised sequence is made public.

Accordingly, the translated protein sequence is accompanied by a /protein_id, expressed like BAA12345.1, in which the digit(s) after the period is again a version number. The number is increased by one when the corresponding nucleotide sequence is revised and the protein sequence changes as a result, and the revised protein sequence is made public.
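The `accession.version` notation used by both VERSION and /protein_id can be handled with a small helper. This is a sketch that assumes the version is the run of digits after the last period, and the function names are hypothetical.

```python
def parse_versioned_id(identifier: str) -> tuple:
    """Split 'AB123456.1' into ('AB123456', 1)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


def next_version(identifier: str) -> str:
    """Return the identifier a revised, newly public sequence would carry."""
    accession, version = parse_versioned_id(identifier)
    return f"{accession}.{version + 1}"


print(parse_versioned_id("AB123456.1"))
print(next_version("BAA12345.1"))
```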

-- Since release 31 -- We have started adopting the unified taxonomy database to unify the biological source of the sequence. The database is made up of scientific names, IDs of unidentified organisms, synthetic constructs, etc.

-- Since release 30 -- NID and PID were terminated. This change was made on the agreement at the INSD collaborative meeting in 1999.

-- Since release 28 -- The HTG (High Throughput Genomic sequence) has been included. We terminated the ORG (Organelle) division.

-- Since release 27 -- The GSS division has been included. GSS stands for Genome Survey Sequence, which is similar to EST, except that GSS is genomic DNA whereas EST is cDNA.

-- Since release 25 -- DDBJ release contains amino acid sequences that were translated from the corresponding nucleotide sequences of the database.

-- Since release 22 -- The HUM division has been included. We have the human (HUM) division solely for human sequences and the primate (PRI) division for non-human primate sequences.

-- Since release 12 -- The EST (Expressed Sequence Tag) division has been included.

-- Since release 10 -- The sequences submitted to GenBank or EMBL have been included in the release.