ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES

Searching, retrieving, and parsing data from NCBI databases through the Unix command line.

INTRODUCTION

Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (biomedical and life science literature, nucleotide and protein sequence, molecular structure, gene, genome assembly, gene expression, clinical variation, etc.) from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.

EDirect also includes an argument-driven function that simplifies the extraction of data from document summaries or other results that are in structured XML format. This can eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

PROGRAMMATIC ACCESS

Several underlying network services provide access to different facets of Entrez. These include searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports. The same functionalities are available on the web or when using programmatic methods.

EDirect navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.

All EDirect commands are designed to work on large sets of data. There is no need to write a script to loop over records one at a time. Intermediate results are stored on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile file:

  export NCBI_API_KEY=user_api_key_goes_here

Each program also has a -help command that prints detailed information about available arguments.

NAVIGATION FUNCTIONS

Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query to obtain the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:

  esearch -db pubmed -query "selective serotonin reuptake inhibitor"

Search terms can be also qualified with bracketed field names:

  esearch -db nucleotide -query "insulin [PROT] AND rodents [ORGN]"

Elink looks up precomputed neighbors within a database, or finds associated records in other databases:

  elink -related

  elink -target gene

Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:

  efilter -molecule genomic -location chloroplast -country sweden

Efetch downloads selected records or reports in a designated format:

  efetch -format abstract

ENTREZ EXPLORATION

Individual query commands are connected by a Unix vertical bar pipe symbol:

  esearch -db pubmed -query "tn3 transposition immunity" | efetch -format medline

PubMed related articles are calculated by a statistical algorithm using the title, abstract, and medical subject headings (MeSH terms). These connections between papers can be used for knowledge discovery.

Lycopene cyclase converts lycopene to beta-carotene, the immediate precursor of vitamin A. An initial search on the enzyme results in 232 articles. Looking up precomputed neighbors returns 14,387 PubMed papers, some of which might be expected to discuss adjacent steps in the biosynthetic pathway:

  esearch -db pubmed -query "lycopene cyclase" |
  elink -related |
  elink -target protein |
  efilter -organism mouse |
  efetch -format fasta

Linking to the protein database finds 251,887 sequence records, each of which has standardized organism information from the NCBI taxonomy. Limiting to proteins in mice returns 39 records. (Animals do not encode the genes involved in carotene biosynthesis, except in aphids and their ilk, apparently obtained by horizontal gene transfer from fungi.) Records are then retrieved in FASTA format:

  ...
  >NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
  MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
  SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
  TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
  KGKSPVKHAEVFCSISSRSLLSPSYYHSFGVTENYVVFLEQPFKLDILKMATAYMRGVSWASCMSFDRED
  KTYIHIIDQRTRKPVPTKFYTDPMVVFHHVNAYEEDGCVLFDVIAYEDSSLYQLFYLANLNKDFEEKSRL
  TSVPTLRRFAVPLHVDKDAEVGSNLVKVSSTTATALKEKDGHVYCQPEVLYEGLELPRINYAYNGKPYRY
  IFAAEVQWSPVPTKILKYDILTKSSLKWSEESCWPAEPLFVPTPGAKDEDDGVILSAIVSTDPQKLPFLL
  ILDAKSFTELARASVDADMHLDLHGLFIPDADWNAVKQTPAETQEVENSDHPTDPTAPELSHSENDFTAG
  HGGSSL
  ...

As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal.

STRUCTURED DATA EXTRACTION

The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element.

Formatting arguments allow extensive customization of the output. The line break between -pattern objects can be changed with -ret, and the tab character between -element fields can be replaced by -tab. The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab argument. The -sep value also applies to unrelated -element arguments that are grouped with commas. The following query:

  efetch -db pubmed -id 6271474,1413997,16589597 -format docsum |
  xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name

returns a table with individual author names separated by vertical bars:

  6271474     1981        Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
  1413997     1992 Oct    Mortimer RK|Contopoulou CR|King JS
  16589597    1954 Dec    Garber ED

Selection arguments are specialized derivatives of -element. Among these are positional commands (-first and -last) and numeric processing operations (including -num, -len, -sum, -min, -max, and -avg). There are also functions that perform sequence coordinate conversion (-0-based, -1-based, and -ucsc-based).

NESTED EXPLORATION

Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, visiting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.

PubmedArticle XML contains the MeSH terms applied to a publication. Each MeSH term can have its own unique set of qualifiers. A single level of nested exploration within the current pattern:

  esearch -db gene -query "beta-carotene oxygenase 1" -organism human |
  elink -target pubmed | efilter -released last_year | efetch -format xml |
  xtract -pattern PubmedArticle -element MedlineCitation/PMID \
    -block MeshHeading \
      -pfc "\n" -sep "/" -element DescriptorName,QualifierName

retains the proper association of subheadings for each MeSH term:

  30396924
  Age Factors
  Animals
  Cell Cycle Proteins/deficiency/genetics/metabolism
  Cellular Senescence/physiology
  ...

A second level (-subset) would be needed to print major topic attributes next to their parent subheadings.

CONDITIONAL EXECUTION

Conditional processing arguments (-if and -unless) restrict exploration by object name and value. These may be used in conjunction with string or numeric constraints:

  esearch -db pubmed -query "Casadaban MJ [AUTH]" |
  efetch -format xml |
  xtract -pattern PubmedArticle -if "#Author" -lt 6 \
    -block Author -if LastName -is-not Casadaban \
      -sep ", " -tab "\n" -element LastName,Initials |
  sort-uniq-count-rank

to select papers with fewer than 6 authors and print a table of the most frequent coauthors:

  11    Chou, J
  8     Cohen, SN
  7     Groisman, EA
  ...

SAVING DATA IN VARIABLES

A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:

  efetch -db pubmed -id 3201829,6301692,781293 -format xml |
  xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
    -block Author -element "&PMID" \
      -sep " " -tab "\n" -element Initials,LastName

producing a list of authors, with the PubMed Identifier (PMID) in the first column of each row:

  3201829    JR Johnston
  3201829    CR Contopoulou
  3201829    RK Mortimer
  6301692    MA Krasnow
  6301692    NR Cozzarelli
  781293     MJ Casadaban

The variable can be used even though the original object is no longer visible inside the -block section.

SEQUENCE QUALIFIERS

The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).

The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function is provided for generating the appropriate nested extraction commands from feature and qualifier names on the command line. For example, processing the results of a search on cone snail venom:

  esearch -db protein -query "conotoxin" -feature mat_peptide |
  efetch -format gpc |
  xtract -insd complete mat_peptide "%peptide" product peptide |
  grep -i conotoxin | sort -t $'\t' -u -k 2,2n

returns the accession, length, name, and sequence for a sample of neurotoxic peptides:

  ADB43131.1    15    conotoxin Cal 1b      LCCKRHHGCHPCGRT
  ADB43128.1    16    conotoxin Cal 5.1     DPAPCCQHPIETCCRR
  AIC77105.1    17    conotoxin Lt1.4       GCCSHPACDVNNPDICG
  ADB43129.1    18    conotoxin Cal 5.2     MIQRSQCCAVKKNCCHVG
  ADD97803.1    20    conotoxin Cal 1.2     AGCCPTIMYKTGACRTNRCR
  AIC77085.1    21    conotoxin Bt14.8      NECDNCMRSFCSMIYEKCRLK
  ADB43125.1    22    conotoxin Cal 14.2    GCPADCPNTCDSSNKCSPGFPG
  AIC77154.1    23    conotoxin Bt14.19     VREKDCPPHPVPGMHKCVCLKTC
  ...

INSTALLATION

EDirect consists of a set of scripts and programs that are downloaded to the user's computer.

EDirect will run on Unix and Macintosh computers that have the Perl language installed, and under the Cygwin Unix-emulation environment on Windows PCs.

To install the EDirect software, copy the following commands and paste them into a terminal window:

  cd ~
  /bin/bash
  perl -MNet::FTP -e \
    '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
     $ftp->login; $ftp->binary;
     $ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
  gunzip -c edirect.tar.gz | tar xf -
  rm edirect.tar.gz
  builtin exit
  export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
  ./edirect/setup.sh

This downloads several scripts into an "edirect" folder in the user's home directory. The setup.sh script then downloads any missing Perl modules, and may print an additional command for updating the PATH environment variable in the user's configuration file. Copy that command, if present, and paste it into the terminal window to complete the installation process. The editing instructions will look something like:

  echo "export PATH=\$PATH:\$HOME/edirect" >> $HOME/.bash_profile

DOCUMENTATION

Documentation for EDirect is on the web at:

  http://www.ncbi.nlm.nih.gov/books/NBK179288

Information on how to obtain an API Key is described in this NCBI blogpost:

  https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities

Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.
