ncbi-gene

Koza ingest for NCBI Gene data, transforming gene information into Biolink model format.

Data Source

NCBI Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.

Data is downloaded from: https://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz

Output

This ingest produces: - Gene nodes - Gene entities with NCBI Gene IDs, symbols, names, and taxon information

Biolink Properties Captured

biolink:Gene
id
symbol
name
full_name
description
in_taxon
in_taxon_label
provided_by (["infores:ncbi-gene"])

Postprocessing

Output is split by taxon into separate files in output/by_taxon/.

Usage

# Install dependencies
just install

# Run full pipeline
just run

# Or run steps individually
just download      # Download gene_info data
just transform-all # Run Koza transform
just postprocess   # Split output by taxon
just test          # Run tests

Requirements

Python 3.10+
uv package manager
just command runner

Environment Variables

NCBI_API_KEY - NCBI API key for higher rate limits (optional)
NCBI_MAIL - Email for NCBI E-utilities identification (optional)

Citation

National Center for Biotechnology Information (NCBI)[Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988] - [cited 2024 Dec]. Available from: https://www.ncbi.nlm.nih.gov/

License

BSD-3-Clause