GenBank/RefSeq Update Deployment
This page describes how the GenBank/RefSeq update process is deployed.
This is a proposed setup, not currently implemented. The following
system setup is required:
- Create a user genbank on the cluster, on the round-robin servers,
and on the GBDB server (hgnfs1). Enable sudo to genbank for markd.
- There is currently sufficient disk space on
/cluster/store5/ for GenBank files and alignments for human,
mouse and rat. However, disk space should be monitored and may need to be
increased.
- Set up an rsync server on eieio, accessible from the GBDB server.
- Set up the /somewhere/genbank/ directory on the GBDB server, owned
by genbank, preferably on the same file system as /gbdb/ (but not
under /gbdb/). NFS-export it and mount it on the round-robin servers
as /genbank/. It should also be available as /genbank/ on the GBDB
server.
Download/Processing/Alignment (build)
- These three steps are collectively known as the build phase.
- The GenBank root directory is currently at:
/cluster/store5/genbank/
-
Estimates of disk space requirements:
- download/ - 50-75 GB, depending on how many previous releases are
maintained. Once a new release is downloaded and processed
(quarterly), old downloaded files can be archived.
- processed/ - 25-50 GB. Processed files must be maintained as long as
some database is using sequences from them.
- aligned/ - ~3 GB per release per genome assembly.
- Cluster-accessible temporary work space - ~2 GB.
Note that these replace data currently kept in other locations; however,
the downloads now include the HTG sequences, which add several
gigabytes of data.
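As a rough check on the figures above, the upper-bound estimate can be
added up and compared with the free space on the target file system. The
following is a minimal sketch only: the function names and the example
counts (two releases, three genome assemblies) are illustrative
assumptions, not part of the deployment plan.

```python
import shutil

def estimate_disk_gb(releases, assemblies,
                     download_gb=75, processed_gb=50,
                     aligned_gb_per_release_assembly=3, work_gb=2):
    """Upper-bound disk estimate in GB, using the figures listed above."""
    aligned_gb = aligned_gb_per_release_assembly * releases * assemblies
    return download_gb + processed_gb + aligned_gb + work_gb

def has_room(path, needed_gb):
    """Return True if the file system holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb

# Hypothetical example: 2 releases kept, 3 genome assemblies.
needed = estimate_disk_gb(releases=2, assemblies=3)
print(needed)  # 75 + 50 + 3*2*3 + 2 = 145
```

A periodic call such as has_room("/cluster/store5/", needed) is one way
the disk space could be monitored.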
-
The download, processing, and alignment steps run on the GenBank
build server, which should have the following attributes:
- Should have the GenBank root directories on a local
filesystem.
- Should have at least two CPUs.
- Must be able to
rsh to kkr1u00 and
kk.
kkstore is probably the best candidate.
- A dedicated user,
genbank, allows multiple people to
manage the jobs.
- A cron job will start the process daily at 1am.
Round-Robin Database Update
-
In order to update the databases on the round-robin servers, each
server must have access to the
processed/ and
aligned/ directories. FASTA files under the
processed/ directory must be copied into the
/gbdb/genbank/ directory. Since these directories are
large, they are maintained on the GBDB server for access by
the round-robin servers.
- The GBDB server exports a
/genbank/
directory to the round-robin servers, which contains the
processed/ and aligned/ directories.
- If possible, the
/gbdb/ and /genbank/
directories should be on the same physical file system on the GBDB
server. This way, the FASTA files under the /gbdb/
directory can be hard links to the ones under the
processed/ directory, saving significant disk space. If
this is not possible, the FASTA files will be copied.
- A process running on the GBDB server must be able to rsync
files from the GenBank root on the cluster.
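The hard-link-or-copy behaviour described above can be sketched as
follows. This is an illustration under stated assumptions, not the
actual update code: the function name link_or_copy is hypothetical, and
the sketch simply falls back to a copy whenever the link attempt fails
(for example, when the two directories are on different file systems).

```python
import os
import shutil

def link_or_copy(src, dst):
    """Hard-link src to dst; fall back to copying if linking fails.

    Returns "link" or "copy" to report which method was used.
    """
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    if os.path.exists(dst):
        os.remove(dst)  # replace a stale entry left by an earlier update
    try:
        os.link(src, dst)
        return "link"
    except OSError:
        # Typically EXDEV: src and dst are on different file systems.
        shutil.copy2(src, dst)
        return "copy"
```

Returning the method used makes it easy to log how much of an update was
linked rather than copied.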
-
A cron job on the GBDB server polls (with rsync) the GenBank build
server to determine whether new alignments are ready.
- Copy new processed/ and aligned/ files to the /genbank/ hierarchy
in two passes: one to get the data files, and a second to get the
index files.
- Update the
/gbdb/genbank/ hierarchy with the new
FASTA files. If /genbank/ and /gbdb/
are on the same file system, these will be hard links.
- Flag copy as complete.
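The two-pass copy followed by a completion flag might look like the
sketch below. Everything named here is an assumption made for
illustration: the rsync source "buildhost::genbank/", the "*.idx"
pattern for index files, and the flag file name "copy.done" do not come
from this page. The first pass excludes index files; the second pass
runs without the exclusion, so it picks up the index files (and any data
files that changed in between).

```python
import subprocess
from pathlib import Path

# Hypothetical placeholders, not values defined by this document.
SOURCE = "buildhost::genbank/"   # rsync module on the build server (assumed)
DEST = "/genbank/"
INDEX_PATTERN = "*.idx"          # assumed naming of index files
FLAG_NAME = "copy.done"          # assumed name of the completion flag

def build_passes(source=SOURCE, dest=DEST, index_pattern=INDEX_PATTERN):
    """Return the two rsync command lines: data files first, then the rest."""
    data_pass = ["rsync", "-a", "--exclude", index_pattern, source, dest]
    index_pass = ["rsync", "-a", source, dest]
    return [data_pass, index_pass]

def copy_and_flag(dest=DEST, run=subprocess.run):
    """Run both passes, then create the flag file in dest."""
    for cmd in build_passes(dest=dest):
        run(cmd, check=True)       # an rsync failure raises before flagging
    Path(dest, FLAG_NAME).touch()  # flag the copy as complete
```

Writing the flag only after both passes succeed means a reader of
/genbank/ never sees the flag for a partial copy.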
-
Each round-robin server periodically examines the
/genbank/ directory to see if a copy has completed.
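One way a round-robin server could make this check is to compare the
modification time of a completion flag file with the time of its own
last database load. This is a sketch under assumptions: the existence
and path of a flag file, and the idea of recording a last-load
timestamp, are hypothetical and not specified on this page.

```python
import os

def copy_is_new(flag_path, last_loaded_mtime):
    """Return True if the flag file exists and is newer than the last load."""
    try:
        return os.stat(flag_path).st_mtime > last_loaded_mtime
    except FileNotFoundError:
        return False  # no copy has been flagged as complete yet
```

A server would call this on each polling cycle and start its database
update only when it returns True.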