The laboratory mouse has served for decades as an informative animal model system for investigating the genetic and genomic basis of cancer in humans. Although thousands of mouse models have been generated, compiling and aggregating relevant data and knowledge about these models is hampered by a general lack of compliance, in the published literature, with nomenclature and annotation standards for genes, alleles, mouse strains and cancer types. The Mouse Models of Human Cancer database (MMHCdb) is an expertly curated, comprehensive knowledgebase of diverse types of mouse models of human cancer, including inbred mouse strains, genetically engineered mouse models, patient-derived xenografts, and mouse genetic diversity panels such as the Collaborative Cross. The MMHCdb is a FAIR-compliant knowledgebase that enforces nomenclature and annotation standards, and supports the completeness and accuracy of searches for mouse models of human cancer and associated data. The resource facilitates the analysis of the impact of genetic background on the incidence and presentation of different tumor types, and aids in the assessment of different mouse strains as models of human cancer biology and treatment response.
The laboratory mouse has long been an important model system for the study of the genetic and genomic basis of human disease and biology. Inbred mice have been used to study the pathobiology of human disease since the early 1900s (Naf et al., 2002). Investigations on the effects of genetic background on tumor incidence and predisposition were among the earliest lines of cancer research using inbred mice (Little and Tyzzer, 1916; Tyzzer, 1909). Although mouse models do not fully recapitulate all aspects of human biology, their genetic and physiological similarities to humans and their experimental tractability have yielded mechanistic insights into human diseases and novel therapeutic strategies (Sharpless and Depinho, 2006; Kopetz et al., 2012; McGonigle and Ruggeri, 2014; Sundberg et al., 2013). The landscape of mouse models of human cancer has evolved dramatically over the years in response to the advent of precision genome-editing technologies, improved immunodeficient hosts for xenograft models, and the availability of panels of genetically diverse mice (Kopetz et al., 2012; Justice et al., 2011; Abate-Shen and Pandolfi, 2013; Liu et al., 2017; Threadgill et al., 2011).
Managing the knowledge about the ever-changing nature of mouse models of human cancer, and the growing corpus of publications and heterogeneous data associated with these models, is key to ensuring that an appropriate model is used for a specific research question or application. However, searching for and aggregating information about mouse models can be a daunting challenge, in part because well-established nomenclature guidelines and persistent identifiers for genes, alleles and strains are not often used in the published scientific literature or by data repositories. Using natural language processing of scientific journal articles, Chen et al. (2005) found that up to 85.1% of extracted mouse gene names were ambiguous. They also found that 74.7% of gene symbols in 50 randomly selected abstracts were synonyms instead of the official nomenclature. For example, the name and symbol for the mouse gene transformation related protein 53 (Trp53) were never used, whereas the synonym p53 was used frequently. The synonyms for the gene Erbb2 include Her2 and Neu, which are commonly used in the literature. The official symbol and/or persistent gene identifier for Erbb2 (e.g. MGI:95410 or NCBI Gene ID:13866) are rarely used in publications. The use of official nomenclature is particularly important for unambiguous identification of models in cases in which gene symbols can refer to multiple different genes or are synonyms for genes in other species. The symbol P60, for example, is a synonym for mouse genes Ppr1 and Stip1, and a synonym for human genes ARHGEF5, SRC, SQSTM1, IFIT3, IFIT3B and TNFRSF1A. P130 is a synonym for the mouse genes Nolc1, Rab3gap1 and Rbl2. Simple keyword searches of PubMed or Google using ambiguous gene symbols require users to resolve nomenclature ambiguity manually.
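The ambiguity problem described above can be illustrated with a minimal sketch of synonym resolution. The table below uses only the gene symbols and synonyms mentioned in the text; the data structure, the function names (`build_index`, `resolve`) and the case-insensitive matching are assumptions made for illustration and do not represent how MMHCdb or MGI resolve nomenclature.

```python
from collections import defaultdict

# Illustrative synonym table built from examples in the text.
# The structure and names are hypothetical, not MMHCdb code.
OFFICIAL_SYMBOLS = {
    "Trp53": ["p53"],
    "Erbb2": ["Her2", "Neu"],
    "Ppr1": ["P60"],
    "Stip1": ["P60"],
}


def build_index(official_symbols):
    """Map each lower-cased name (official or synonym) to its official symbols."""
    index = defaultdict(set)
    for symbol, synonyms in official_symbols.items():
        index[symbol.lower()].add(symbol)
        for synonym in synonyms:
            index[synonym.lower()].add(symbol)
    return index


def resolve(name, index):
    """Return (status, sorted official symbols) for a reported gene name."""
    matches = sorted(index.get(name.lower(), set()))
    if not matches:
        return "unresolved", matches
    if len(matches) == 1:
        return "resolved", matches
    # More than one official symbol shares this name: manual review is needed.
    return "ambiguous", matches


INDEX = build_index(OFFICIAL_SYMBOLS)

for reported in ("p53", "Neu", "P60"):
    print(reported, resolve(reported, INDEX))
```

In this sketch, p53 and Neu each resolve to a single official symbol, whereas P60 is flagged as ambiguous because it maps to both Ppr1 and Stip1.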
Another common practice in the literature that complicates aggregation of knowledge for a specific model and comparison of data across mouse models is the use of generic symbols for knockout and conditional alleles without reference to the official allele name and symbol (e.g. P53− instead of a specific allele such as Trp53tm1Tyj). The Mouse Models of Human Cancer database (MMHCdb) contains records for 403 mouse strains that were reported in the literature as P53−. The lack of official nomenclature and persistent identifiers makes it difficult for researchers to quickly determine whether an engineered allele is germline or induced somatically. Often the strain background of the mouse model is not indicated, which is particularly problematic for reproducibility of research results as the same allele on different genetic backgrounds can result in very different cancer phenotypes (Chan et al., 2021; Hunter et al., 2018; Levine, 2017; Lifsted et al., 1998; Reilly, 2016; Svendsen et al., 2011). MMHCdb curators address this problem by using references in the source publication to identify the precise strain and allele and by correcting the database record accordingly. Unofficial nomenclatures for these entities are recorded as synonyms. If the references are ambiguous, the curators contact the communicating author to confirm the correct official strain/allele nomenclature. If the strain background cannot be determined, the strain is given the designation of ‘[not specified]’. If an allele reported in the literature cannot be resolved to an official symbol, the data for the model are not included in the MMHCdb.
The MMHCdb was launched in 1998 as the Mouse Tumor Biology database (MTB), with the goal of providing web-based access to published and unpublished data on the pathobiology of cancer in genetically defined strains of laboratory mice (Bult et al., 1999). The MMHCdb is a contributing data resource to the Mouse Genome Informatics (MGI) consortium hosted at The Jackson Laboratory (Ringwald et al., 2022). The MMHCdb leverages annotation standards pioneered by MGI, including standardized genetic nomenclature for genes and mouse strains and bio-ontologies for annotation of gene function and phenotype. The initial focus of the database was on inbred and hybrid mouse strains and genetically engineered mouse models (GEMMs). As the types of mouse models have changed, so has the range of in vivo models represented in the resource (Bult et al., 2015). In 2019 the database name was changed from the ‘Mouse Tumor Biology database’ to the ‘Mouse Models of Human Cancer database’ to better reflect the translational and clinical relevance of mouse models. In 2020, the look and feel of the website was overhauled, and advanced search capabilities using faceted search interfaces were implemented.
The MMHCdb is unique among open-access community databases and knowledgebases centered on the laboratory mouse because of the breadth of cancer types represented in the resource and the detail provided about the types and frequencies of tumors observed in different cancer models. The Mouse Genome Database (MGD) (Blake et al., 2021) is the source of official nomenclature for genes, alleles and genotypes in the MMHCdb, but the MGD does not provide information about the detailed characteristics of cancer models. For example, it is possible to search the MGD to find genotypes associated with increased or decreased incidence or susceptibility to specific cancers, but information on the tumor types that are typically observed in a cancer model and their frequency is unique to the MMHCdb. The Mouse Phenome Database (MPD) (Bogue et al., 2023) and the International Mouse Phenotyping Consortium database (IMPC) (Groza et al., 2023) store baseline phenotype data collected from standardized phenotyping pipelines for different mouse strains and for genetically engineered lines of mice, but neither resource has substantial data for cancer phenotypes. Although the patient-derived models in MMHCdb are limited to those available from The Jackson Laboratory, MMHCdb collaborates with the European Bioinformatics Institute on the Patient-Derived Cancer Model Finder resource (PDCM) (Perova et al., 2022). The PDCM compiles information on patient-derived cancer models from repositories around the world and currently indexes over 4800 PDX models for more than 400 cancer types.
The MMHCdb currently includes data on over 60,000 mouse tumor models covering more than 1200 tumor classifications. These data have been acquired from more than 25,000 references and include over 7200 pathology records containing >6600 images with annotations. The images include 2596 JPGs and 3980 Zoomify TIFFs. The MMHCdb also contains 110 high-resolution whole-slide scans (Hamamatsu NDPI format) covering lung adenomas, lymphomas and leukemias. Whole-slide scans are presented online using the Open Microscopy Environment Remote Objects (OMERO) web-viewer. Pathology images in MMHCdb are made available to the research community for display with the permission of the submitting investigators or publishers. The cancer models in the MMHCdb are presented in over 110,000 tumor frequency records. Each of these records includes information on tissue of origin, tumor classification, frequency, genetic background and allelic composition. The MMHCdb also includes data on over 400 PDX models with more than 3000 histology and immunohistochemistry images.
The two primary types of in vivo cancer models in the MMHCdb are mouse models (inbred strains and GEMMs) and human-in-mouse models (i.e. PDXs). A cancer model in MMHCdb is defined as a unique combination of organ of origin, tumor classification, organ affected, strain (background plus genotype) and tumor-inducing agent(s). Information and data in the MMHCdb are acquired from peer-reviewed scientific literature, direct submission by research laboratories, and through downloads from related resources including PathBase (Schofield et al., 2010) and the Gene Expression Omnibus (GEO) (Clough and Barrett, 2016). Publications with information relevant to MMHCdb are identified through the application of a machine-learning classifier that scans publications from more than 120 scientific journals and identifies papers that are likely to be relevant to the resource (Ringwald et al., 2022). All data in the MMHCdb are directly attributed to a reference, either a primary literature reference or a reference created for a submission or download, to identify the original source of the data.
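The definition of a cancer model as a unique combination of five attributes can be sketched as an immutable record whose equality is determined by those attributes. This is an illustrative data structure only: the class name, field names and the placeholder agent are assumptions, and the actual MMHCdb relational schema is not described in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CancerModel:
    """Hypothetical record keyed by the five attributes named in the text."""
    organ_of_origin: str
    tumor_classification: str
    organ_affected: str
    strain: str  # background plus genotype
    agents: frozenset = frozenset()  # tumor-inducing agent(s), if any


def distinct_models(records):
    """Collapse records describing the same combination into one model."""
    return set(records)


# Example values; the strain strings reuse names mentioned in the article,
# and "agent X" is a placeholder, not a curated treatment.
a = CancerModel("Mammary gland", "carcinoma", "Mammary gland",
                "FVB/N Tg(Wap-HRAS)69Lln")
b = CancerModel("Mammary gland", "carcinoma", "Mammary gland",
                "FVB/N Tg(Wap-HRAS)69Lln")
c = CancerModel("Mammary gland", "carcinoma", "Mammary gland",
                "FVB/N Tg(Wap-HRAS)69Lln", frozenset({"agent X"}))

print(len(distinct_models([a, b, c])))
```

Because the record is frozen and hashable, two reports with identical attributes collapse to a single model, whereas adding a tumor-inducing agent yields a distinct one.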
Nomenclature and ontology standards
Information and data acquired for MMHCdb are reviewed manually to ensure adherence to genetic nomenclature and annotation standards. Genes, alleles and mouse strains are named according to the rules established by the International Committee on Standardized Genetic Nomenclature for Mice. These entities are also linked to relevant records in the MGD, which provides MMHCdb users with information on additional phenotypes and disease models. Unofficial names and symbols used in publications are maintained as synonyms so that searches of MMHCdb using either official or unofficial nomenclature will return appropriate records.
Names of tumors in MMHCdb consist of two components: the organ of origin and a classification term (Bult et al., 2000). Both of the components rely on published community standards and terminologies, including Stedman's Medical Dictionary (Stedman et al., 1990), Pathology of the Mouse (Maronpot et al., 1999), Pathology of Tumours in Laboratory Animals, Volume II (Mohr and Turusov, 1994), International Classification of Rodent Tumors: The Mouse (Mohr, 2001), and Pathobiology of the Aging Mouse: Volumes 1 and 2 (Mohr et al., 1996a,b).
The MMHCdb is freely available without registration. Summaries of mouse models for 20 cancer types with the highest mortality rates in humans are available via hypertext links from a human cancer overview table on the MMHCdb home page (Fig. 1). Also available from the home page are a Quick Search dialog box, a project-specific news feed, and links to other search tools and data resources (Fig. 1).
In addition to help documentation available from the ‘About Us’ menu, MMHCdb has a YouTube channel with short video tutorials about using the resource. Links to the instructional videos are available under the ‘Other Resources’ and ‘Help’ menus on the MMHCdb home page.
Faceted searching of MMHCdb is supported on the Advanced Search form. Search terms within each facet are presented as a picklist. Typing a term in the text box associated with each facet will narrow the list of terms using an autocomplete function. Search results update dynamically in response to changes in facet choices. The number of records a search term is associated with in the database is provided to give users a sense of the volume of data available. Search results include summary data for each matching model including model name, treatment status, strain name and type, tumor frequency range, and additional information or data available for the model. The results link to a detailed description of data available for a model and the publication(s) associated with the model. Strain names in MMHCdb are linked to reports that list all of the tumor models associated with the strain. An example of a faceted search for lung adenocarcinoma models for which there are pathology images available is shown in Fig. 2.
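The faceted search behavior described above (filtering by facet selections, reporting per-term record counts and narrowing a picklist by typed prefix) can be sketched as follows. The records, field names and functions are invented for illustration and are not drawn from the MMHCdb implementation.

```python
from collections import Counter

# Invented example records; field names are placeholders.
MODELS = [
    {"organ": "Lung", "classification": "adenocarcinoma", "has_images": True},
    {"organ": "Lung", "classification": "adenocarcinoma", "has_images": False},
    {"organ": "Lung", "classification": "adenoma", "has_images": True},
    {"organ": "Mammary gland", "classification": "carcinoma", "has_images": True},
]


def filter_models(models, selections):
    """Keep models that match every selected facet value."""
    return [m for m in models
            if all(m[facet] == value for facet, value in selections.items())]


def facet_counts(models, facet):
    """Count how many records carry each term of a facet."""
    return Counter(m[facet] for m in models)


def suggest_terms(models, facet, prefix):
    """Narrow a facet's picklist to terms starting with the typed prefix."""
    prefix = prefix.lower()
    return sorted({m[facet] for m in models
                   if m[facet].lower().startswith(prefix)})


hits = filter_models(MODELS, {"organ": "Lung",
                              "classification": "adenocarcinoma",
                              "has_images": True})
print(len(hits), facet_counts(MODELS, "organ"),
      suggest_terms(MODELS, "classification", "aden"))
```

Recomputing the counts on the filtered list after each selection would give the dynamic updating behavior described in the text.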
Use case 1: impact of genetic background on tumor types and frequencies
The genetic background of a mouse model can significantly affect the observed disease-related phenotypes, including the types and frequencies of tumors that are characteristic of a cancer model (Doetschman, 2009; Montagutelli, 2000). The same allele on different backgrounds can result in very different cancer characteristics and, therefore, impact the choice of model for a specific research application. For example, a human HRAS transgene, Tg(Wap-HRAS)69Lln, expressed on a mixed C57BL/6 and SJL background, subline 69-2, results in mammary gland carcinomas at a frequency of 45-50% by 1 year of age (Nielsen et al., 1991; Nielsen et al., 1995). However, mice carrying the same transgene expressed on an inbred FVB/N strain background, subline 69-2 crossed to FVB/N for two generations creating subline 69-2F, develop mammary gland tumors at a frequency of 100% by 3 months of age (Nielsen et al., 1995). On a C57BL/6J background, 100% of mice heterozygous for the ApcMin allele develop tumors throughout the intestine, particularly in the small intestine. When crossed onto an FVB/NJ background, only 7% of ApcMin heterozygous mice develop intestinal tumors (Svendsen et al., 2011). Mice heterozygous for the Trp53tm1Tyj allele develop mammary tumors on the BALB/c background, but not the C57BL/6J background (Reilly, 2016). In a survey of breast cancer models based on the mammary tumor virus promoter-driven polyoma middle T oncogene [Tg(MMTV-PyVT)634Mul], the parental transgenic FVB/N was crossed to 27 wild-type inbred strain backgrounds. The metastatic burden for the F1 progeny of these crosses varied by 40-fold depending on the background (Hunter et al., 2018; Lifsted et al., 1998).
As important as genetic background is in identifying appropriate disease models, finding this information through searches of the primary literature is time consuming and error prone. In the MMHCdb, users can quickly review the impact of genetic background on cancer phenotypes in two ways. First, a curated summary table of the frequency of spontaneous tumors for inbred strains from published and unpublished sources is available under the Searches/Tools menu on the home page (Fig. 3A). Second, data from curated publications that specifically mention the impact of genetic background are presented as a summary table with color coding of reported tumor frequency. Fig. 3B shows the results from a survey of tumor susceptibility in which tumor type and frequency were documented for F1 offspring of mice homozygous for Trp53tm3.1Glo on a C57BL/6 background that were crossed to seven different wild-type backgrounds (Chan et al., 2021; Levine, 2017). The survey results indicated that 21-30% of C3H, DBA, NOD and SWR F1 hybrid mice developed lymphomas/lymphoid hyperplasia, while only 4% of BALB/c F1 hybrids were observed to have this pathology. Twenty percent of A/J F1 hybrid mice developed lipomas, which were rarely observed in other backgrounds.
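A color-coded summary table of this kind amounts to binning each reported frequency into a band and arranging the bands by genetic background and tumor type. The sketch below uses two frequencies quoted above (4% and 20%); the band boundaries, labels and function names are assumptions and do not reflect the color scheme actually used in MMHCdb.

```python
# Hypothetical band boundaries: (inclusive upper bound in percent, label).
BANDS = [(0, "none"), (10, "low"), (50, "moderate"), (100, "high")]


def band(percent):
    """Return the label of the first band whose upper bound covers percent."""
    if not 0 <= percent <= 100:
        raise ValueError("frequency must be between 0 and 100 percent")
    for upper, label in BANDS:
        if percent <= upper:
            return label


# (background, tumor type, reported frequency in percent), from the text.
RECORDS = [
    ("BALB/c F1", "lymphoma/lymphoid hyperplasia", 4),
    ("A/J F1", "lipoma", 20),
]


def summary_table(records):
    """Build a nested mapping: background -> tumor type -> frequency band."""
    table = {}
    for background, tumor, percent in records:
        table.setdefault(background, {})[tumor] = band(percent)
    return table


print(summary_table(RECORDS))
```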
Use case 2: susceptibility to gastric cancer using Collaborative Cross mice
The Collaborative Cross is a genetically diverse panel of recombinant inbred lines created by repeated crossing of eight inbred founder strains, which collectively comprise nearly 90% of the known genetic variation present in laboratory mice (Threadgill et al., 2011; Reilly, 2016). Collaborative Cross mice are a particularly useful experimental resource for identifying disease modifiers and genes associated with variation in cancer susceptibility. Wang et al. (2019) used 18 Collaborative Cross mouse strains to examine the variation in tumor incidence and spectrum between strains, which led to the identification of a novel model for gastric cancer. In the MMHCdb, users can search for results on Collaborative Cross studies using the Strain Type=Collaborative Cross facet on the Advanced Search form. Fig. 4 shows the curated summary table for the Wang et al. (2019) study, revealing that gastric tumors were detected in one of the 18 Collaborative Cross strains examined (CC0036/Unc).
Use case 3: PDX models
PDXs have been used extensively for preclinical efficacy studies of single agent and combination cancer therapies (Kopetz et al., 2012; Lai et al., 2017). PDXs are generated through orthotopic or subcutaneous implantation of human tumor tissue into transplant-compliant immunodeficient mouse hosts (Ireson et al., 2019). The over 400 PDX models currently represented in the MMHCdb were generated by The Jackson Laboratory PDX Resource. These, and thousands of other PDX models from repositories around the world, are also included in the PDCM database that is maintained as a collaboration between MMHCdb and the European Bioinformatics Institute (EBI) (Perova et al., 2022). Deidentified model information such as age, sex and race of patient, host mouse, implantation method, tumor diagnosis and location, pathology annotations and images, and immunohistochemistry data are included. When available, the MMHCdb also provides links to repositories of genomic data such as the GEO, including information on genomic sequence, copy-number variants, gene expression and tumor mutation burden. The PDX data also include preclinical study data, including treatment regimens and growth curves presented in multiple graphical formats.
Users can search for PDX models in the MMHCdb using web forms on the PDX Search Portal (Fig. 5A). Search criteria supported include model identifiers, cancer type, organ system, treatment results, tumor genome properties and allelic variants. Alternatively, users can search for models that match multiple molecular criteria using the ‘PDX Like Me’ query language (Fig. 5B). PDX Like Me is modeled after the cBioPortal's Onco Query Language (Cerami et al., 2012). Fig. 5B shows the results of a search for PDX models that have amplified KRAS, a TP53 A159V mutation, a deletion of the ALB gene, and high expression of the KIT gene.
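Matching models against several molecular criteria at once can be sketched as a set-containment test per criterion. The example below encodes the four criteria from the Fig. 5B search (amplified KRAS, TP53 A159V, ALB deletion and high KIT expression); the model identifiers, record layout and function names are hypothetical, and the sketch does not reproduce the syntax of the PDX Like Me query language.

```python
# Hypothetical PDX records; identifiers and contents are invented.
PDX_MODELS = {
    "MODEL-A": {
        "amplified": {"KRAS"},
        "deleted": {"ALB"},
        "mutations": {("TP53", "A159V")},
        "high_expression": {"KIT"},
    },
    "MODEL-B": {
        "amplified": {"KRAS"},
        "deleted": set(),
        "mutations": {("TP53", "R175H")},
        "high_expression": set(),
    },
}

# The four criteria used in the example search described in the text.
CRITERIA = {
    "amplified": {"KRAS"},
    "deleted": {"ALB"},
    "mutations": {("TP53", "A159V")},
    "high_expression": {"KIT"},
}


def matches(model, criteria):
    """True if the model satisfies every required item in every field."""
    return all(required <= model.get(field, set())
               for field, required in criteria.items())


def find_models(models, criteria):
    """Return the sorted identifiers of models meeting all criteria."""
    return sorted(model_id for model_id, model in models.items()
                  if matches(model, criteria))


print(find_models(PDX_MODELS, CRITERIA))
```

Relaxing the criteria to a single field, for example only `{"amplified": {"KRAS"}}`, returns both hypothetical models.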
MMHCdb implementation details
Public web interface
The primary curatorial interface for data entry is implemented as a Java Swing desktop application.
The backend for MMHCdb is a highly normalized relational database running on Postgres.
Application programming interface
The MMHCdb application programming interface (API) is implemented as JSON-based web services. The API allows access to the MMHCdb data in a platform- and language-independent manner that is sufficiently flexible to serve the diverse needs of bioinformaticians.
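Because the services return JSON, a client in any language can consume them with a standard JSON parser. The sketch below parses an invented payload: the field names and structure are placeholders rather than the actual MMHCdb response schema, and no real endpoint is called. The two frequencies are the ApcMin values quoted earlier in the article.

```python
import json

# Invented payload; field names are placeholders, not the MMHCdb schema.
SAMPLE_RESPONSE = """
{"models": [
  {"strain": "C57BL/6J", "organ": "Intestine", "frequency_percent": 100},
  {"strain": "FVB/NJ", "organ": "Intestine", "frequency_percent": 7}
]}
"""


def frequencies_by_strain(payload):
    """Parse a JSON document and map each strain to its tumor frequency."""
    data = json.loads(payload)
    return {m["strain"]: m["frequency_percent"] for m in data["models"]}


print(frequencies_by_strain(SAMPLE_RESPONSE))
```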
The MMHCdb was implemented in Java, Swing and Struts. These are mature, stable and widely used technologies with well-maintained codebases. New technologies are evaluated regularly as the functional features of the database evolve over time.
The MMHCdb serves as an expertly curated knowledgebase for mouse models of human cancer. The two primary goals of the resource are to (1) facilitate aggregation of heterogeneous information generated by different laboratories for the same model through the enforcement of nomenclature and metadata annotation standards, and (2) highlight the impact of genetic background on the variation in the types of tumors and in the frequency of those types typical for a specific model.
In vivo mouse models have deeply informed our understanding of cancer biology, and the nature and use of these models is constantly evolving. Inbred mice and mice with spontaneous or chemically induced mutations have been used to study the basic genetic principles of gene function in cancer for many years (Reilly, 2016). Advancements in cellular and genome engineering technologies support the rapid generation of models carrying targeted mutations and conditional alleles with tissue- and temporal-specific expression of genes (Puccini et al., 2013; Guerin et al., 2020; Ran et al., 2013). PDX models have proven to be an exceptional platform for preclinical evaluation of the efficacy of novel cancer treatments (Kopetz et al., 2012). Mouse genetic diversity resources such as the Collaborative Cross are proving to be a powerful resource for investigating the genetic basis of cancer susceptibility, as well as susceptibility to adverse cancer treatment responses (Wang et al., 2019; Zeiss et al., 2019). Future plans for the MMHCdb include the continued adaptation of the resource to accommodate new types of mouse models of cancer as well as the implementation of tools and interfaces to support comparisons of mouse and human tumor genomics and treatment response data. These efforts will ensure that the MMHCdb continues to serve as a unique resource supporting research into the basic biology and genetics of human cancer and translational research.
The authors thank Andrew Currier of The Jackson Laboratory Digital Experience group for his contributions to the redesign of the MMHCdb web site and Danielle Meier of The Jackson Laboratory Creative group for her design of the MMHCdb logo.
Conceptualization: D.A.B., D.M.K., J.P.S., E.L.J., J.E.R., S.B.N., C.J.B.; Methodology: D.A.B., D.M.K., J.P.S., E.L.J., S.B.N., C.J.B.; Software: J.E.R., S.B.N.; Validation: D.A.B., D.M.K., J.P.S., E.L.J., J.E.R., S.B.N., C.J.B.; Formal analysis: D.A.B., D.M.K., J.P.S., E.L.J., J.E.R., S.B.N., C.J.B.; Investigation: C.J.B.; Resources: D.A.B., D.M.K., J.P.S., E.L.J., J.E.R., S.B.N., C.J.B.; Data curation: D.A.B., D.M.K., J.P.S., E.L.J., C.J.B.; Writing - original draft: D.A.B.; Writing - review & editing: D.A.B., D.M.K., E.L.J., S.B.N., C.J.B.; Visualization: C.J.B.; Supervision: C.J.B.; Project administration: C.J.B.; Funding acquisition: C.J.B.
This work was supported by the National Cancer Institute (R01 CA089713). Open Access funding provided by The Jackson Laboratory. Deposited in PMC for immediate release.
All relevant data can be found within the article.
The authors declare no competing or financial interests.