2007-2011 Summary

=1. Why Don't Japanese Big Projects Share Their Data? - A Diagnosis and Prescription=

http://farm8.staticflickr.com/7064/6879064953_7c0b574fe9_m.jpg http://farm8.staticflickr.com/7200/6879064989_a0d663912f_m.jpg http://farm8.staticflickr.com/7203/6879065033_6e82587a01_m.jpg http://farm8.staticflickr.com/7069/6879065067_010d1d14b9_m.jpg http://farm8.staticflickr.com/7232/7155295879_f9d6d929e3.jpg http://farm8.staticflickr.com/7060/6890214551_5bc0372257.jpg http://farm8.staticflickr.com/7045/6890214595_e73c6e9d9a_b.jpg
 * After an enormous commitment to the worldwide Human Genome Project in the 1990s, each of four ministries of Japan's government independently started "post-genome" projects, based on the nation's 5-year science and technology plan.
 * The annual budget spent on post-genome projects amounted to almost 10 times the total spent on the Human Genome Project in Japan (as part of the Millennium Project, 2000-2005).
 * However, there has been criticism that the data generated by the government's projects are withheld so separately by the contractors that re-use by the taxpayers is almost impossible (reported in the Daily Yomiuri, 2005-02-15).
 * In response to this criticism, the scientific board for the Cabinet decided to look into the status of the output data of the government's science projects (2005-2007).
 * By leading a 3-year investigation project for the board, we looked into the situation from all aspects and concluded that the focus of the disorder lies in the present "contracts" for such projects: they mention neither the ownership nor the fate of the output data. Under such a contract, the contractors (universities and agencies) can withhold or even sell the data if they wish. Moreover, Japan's Bayh-Dole Act (1998) explicitly states that "the contractor may control the intellectual property". This is in contrast to the original US version, where small grant receivers are protected. Such a statement would never encourage the early release of data at the contractors' free will.
 * We diagnosed the pathogenesis of this dyscoordination between the law and "national" projects as follows.
 * The laws privatizing Japan's science, namely the Science and Technology Basic Law, Japan's Bayh-Dole Act, the TLO Law, and the National University Corporation Law, were introduced as a "package" by the Obuchi and Koizumi cabinets around the year 2000.
 * This is obviously a copy of the 1980s US system, which successfully brought large license revenues to universities and made bio-ventures flourish. The cabinets' intention was probably to reproduce a similar outcome. Ironically, however, in the internet age, when opening and sharing data is the key to innovation, the same set of policies worked against innovation. Bad timing.
 * We prescribed that Japan should introduce a strong data-sharing policy, especially for government projects, as fast as possible to correct this dyscoordination, which leads to inefficient investment (reported in the Nikkei newspaper, 2008-02-10).
 * In response to this report, some ministries added a note about the sharing of output data to their new contracts, but it has no legal power yet.
 * NIG's umbrella organization swiftly responded by starting a new institute functioning as a public library of shared data, the Database Center for Life Sciences (DBCLS), with Dr. Toshihisa Takagi as its director, in 2006.
 * To enable efficient collaboration between this new task and DDBJ, NIG concurrently appointed Dr. Takagi as a professor in DDBJ, and K. Okubo was invited as a professor of DBCLS. In 2007, this institute was selected as the main contractor of the Ministry of Education's 5-year database integration project.
 * As part of this project, several data archiving services were started in DDBJ, supervised by H. Sugawara and Y. Nakamura: raw data archiving for NGS (DRA) and a trace data archive.

=Knowledge/Data Representation and Sharing Tools= http://farm8.staticflickr.com/7051/6888917331_4afb728d9e_m.jpg http://farm8.staticflickr.com/7060/6889007635_3e3bb95beb_m.jpg http://farm8.staticflickr.com/7197/6888952849_8980e02d36_m.jpg http://farm8.staticflickr.com/7067/6889349615_68909d697b_m.jpg
 * Researchers who OWN big data are reluctant to share it, whereas researchers with small data cannot afford to share it elaborately. To assist them, we designed various open-source tools to enhance creativity and the representation and sharing of knowledge and data. For tools 1-3, the designs were embodied and coded through the effort of DBCLS (Database Center for Life Sciences) under contract with the Ministry of Education, Culture, Sports, Science and Technology.
 * 1) Wired-Marker: allows the user to ISSUE, share, and publish a link to ANY SEGMENT of a web article by mouse action. The resolution of bibliographic references to articles and web materials can thus be greatly enhanced. A free add-on for Firefox, downloaded more than 500,000 times, with 16,467 active users. K. Okubo, T. Tamura, T. Takagi. https://addons.mozilla.org/ja/firefox/addon/wired-marker/
 * 2) BodyParts3D: an anatomically segmented, canonical 3D human model with 1,532 parts as of today, shared under CC-BY. There are hundreds of derivatives in Wikipedia and Wikimedia Commons; search YouTube for "bodyparts3d". T. Mitsuhashi, K. Fujieda, S. Kawamoto, T. Takagi, K. Okubo. http://lifesciencedb.jp/bp3d/
 * 3) Anatomography: a 3D image-rendering server for body parts that allows users to pick and combine body parts to generate their own avatar data. The rendered image is embeddable as a URL with query parameters, allowing on-the-fly regeneration of the image in a web context. Developed and maintained in the DBCLS project. http://lifesciencedb.jp/ag/
 * 4) Duckbill and the R Graphical Manual: Duckbill is a unique script-language library created by O. Ogasawara to make documents executable. As an initial application, the example code in all manual documents of the R statistical system was run in batch operation, and the results were organized into a database to enhance the searchability of R libraries. This database has been highly acclaimed by users worldwide; it receives about 50,000 unique IPs/month (about 200,000 page views/month), three times more than DDBJ (17,000 unique IPs/month). http://rgm2.lab.nig.ac.jp/RGM2/
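The embeddable-URL idea behind Anatomography (item 3 above) can be sketched as follows. This is a minimal illustration only: the base URL comes from the text, but the query parameter names ("parts", "w", "h") and the part identifiers are assumptions for illustration, not the documented API of lifesciencedb.jp.

```python
from urllib.parse import urlencode

# Base URL from the text; the query-parameter scheme below is ASSUMED.
BASE_URL = "http://lifesciencedb.jp/ag/"

def embed_url(part_ids, width=320, height=240):
    """Build a URL a web page can embed to regenerate the image on the fly."""
    params = {"parts": ",".join(part_ids), "w": width, "h": height}
    return BASE_URL + "?" + urlencode(params)

# Placeholder part identifiers, not real BodyParts3D IDs:
print(embed_url(["PART_A", "PART_B"]))
```

Because the whole rendering request lives in the URL, the image can be regenerated server-side whenever the link is dereferenced, which is what makes it usable in any web context.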

=Replacing the supercomputer / DDBJ operation system (still a memo)=

1) The keyword search of the whole DDBJ/INSDC was changed from a commercial appliance, which we could not financially scale, to Apache Solr. 2) A DB for managing sequence 'version' and annotation 'revision' was constructed on Berkeley DB. 3) The time-consuming processes for transforming GenBank data format into DDBJ format have been completely rewritten from scratch.
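A keyword search against an Apache Solr index is issued through Solr's standard HTTP "select" handler. The sketch below only composes such a request URL; the host, port, and core name ("ddbj") are assumptions for illustration, while `q`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Compose a Solr select-handler URL. Host/port/core are ILLUSTRATIVE
# assumptions; q (query), rows (result count), wt (response format)
# are standard Solr parameters.
def solr_select_url(base, core, keyword, rows=10):
    params = urlencode({"q": keyword, "rows": rows, "wt": "json"})
    return f"{base}/solr/{core}/select?{params}"

url = solr_select_url("http://localhost:8983", "ddbj", "BRCA1")
print(url)
# In practice this URL is fetched (e.g. with urllib.request.urlopen) and the
# JSON response's "response" -> "docs" list holds the matching entries.
```

Since Solr shards and replicates over commodity Linux nodes, this search path scales horizontally without per-core appliance licensing.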
 * NIG has been equipped with a supercomputer system by the Ministry of Education as one of three large-scale computer centers dedicated to biology since 1995.
 * This center provides computational resources for DDBJ operation and, at the same time, various services for biological and medical research at universities.
 * The present system was introduced in 2007 and is to be replaced in 2012. In the present system, different tasks are carried out on different architectures.
 * For log-in use by general academic users, a big SMP server (Fujitsu PRIMEPOWER) running Solaris is dedicated. For DDBJ operation, mission-critical servers with a commercial RDB are used, along with many cores of a commercial appliance hosting a large-memory XML search engine, Shunsaku, for high-speed keyword search.
 * Linux cluster servers are maintained for open use and molecular-biology application software, and one big storage system of 0.75 petabytes is shared among them via NFS.
 * Planning the architecture for the coming replacement in 2012, we started its design and preparation.
 * Over the last 10 years, we have learned that the computational resources and services required in life science are very hard to predict.
 * For example, no one expected that the 1000-fold reduction in sequencing cost, which correlates inversely with data production, would actually be achieved in the last 5 years, and no one predicted the wide spread of microarrays 10 years ago.
 * Accordingly, we concluded that the next supercomputer system should have a very flexible architecture: a huge number of cores in monotonous clusters, mainly with volume-class memory. Disks large enough to accommodate re-sequencing data from large-scale cohort studies should be attached with the fastest connectivity, for processing millions of independent data sets in parallel.
 * To achieve this, we should abandon all expensive appliances, commercial middleware, and DBMSs that require licensing costs and specific environments. Above all, the multi-core XML search engine appliance must be replaced by a flexible system: it can theoretically scale to any data increase, but we have already failed, financially, to scale it.
 * To satisfy the above conditions, we started to restructure the software systems for the majority of DDBJ operations. As an academic service provider, we completely shifted to open-source software on Linux for the major services described below.
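The workload the monotonous-cluster design targets is "embarrassingly parallel": millions of independent records, each processed with no inter-task communication, so throughput scales with core count. The sketch below illustrates that pattern with Python's multiprocessing pool; the toy task (GC content of a read) is an illustrative stand-in, not part of the actual DDBJ pipeline.

```python
from multiprocessing import Pool

# Toy per-record task: fraction of G/C bases in a sequence. Each call is
# independent, so records can be fanned out across workers freely.
def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    reads = ["GATTACA", "GGCC", "ATAT"]   # stand-in for millions of reads
    with Pool(processes=2) as pool:       # one worker per core on a real node
        print(pool.map(gc_content, reads))
```

On a cluster the same fan-out is done across nodes by a batch scheduler rather than a single process pool, but the scaling argument is identical: no shared state, so adding commodity cores adds throughput linearly.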

=Complete renewal of software systems for DDBJ operation=
 * To achieve the above-mentioned renewal of our computer system, we had to abandon most of the software systems, because they operate neither on Linux nor without their commercial DBMS. In short, DDBJ has been renting a huge software system with custom modifications for the last 25 years of operation. Since 2010, we have been redesigning and reconstructing the software systems for the majority of DDBJ operations. The key concept is open-source software on Linux.
 * Keyword search of the whole DDBJ (moved to Apache Solr).
 * A DB for managing sequence 'version' and annotation 'revision', constructed on Berkeley DB.
 * A DB that helps annotators.
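The version/revision DB above keeps two counters per entry, because a sequence's 'version' and its annotation's 'revision' advance independently. A minimal sketch of such a key-value layout is shown below; the "accession:field" key scheme is an assumption for illustration, and the stdlib `dbm.dumb` module merely stands in for the Berkeley DB API used in the actual system.

```python
import dbm.dumb  # portable stdlib stand-in; the production store is Berkeley DB

# Key scheme (ASSUMED for illustration): "<accession>:<field>" -> counter.
# Keeping the two counters under separate keys lets an annotation revision
# be bumped without touching the sequence version, and vice versa.
with dbm.dumb.open("/tmp/ddbj_versions_demo", "c") as db:
    db[b"AB000001:seq_version"] = b"2"   # the sequence itself changed once
    db[b"AB000001:ann_revision"] = b"5"  # the annotation was revised more often
    print(db[b"AB000001:seq_version"], db[b"AB000001:ann_revision"])
```

Berkeley DB offers the same byte-keyed get/put interface with transactions and much higher throughput, which is why it suits an operation pipeline that must update millions of entries without a commercial RDB.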