Human lung tissue and peripheral blood
Freshly resected lung tissue was procured intraoperatively from patients undergoing lobectomy for focal lung tumours. Normal lung tissues (approximately 5 cm3) were obtained from uninvolved regions and annotated for the specific lung lobe and location along the airway or periphery. Pathological evaluation (by G.B.) confirmed normal histology of the profiled regions, except for areas of very mild emphysema in patient 1. Patient 1 was a 75-year-old male with a remote history of smoking, diagnosed with early stage adenocarcinoma who underwent left upper lobe (LUL) lobectomy; two blocks of normal tissue were obtained from lung periphery (distal 1a and 1b). Patient 2 was a 46-year-old male, non-smoker with a right middle lobe (RML) endobronchial carcinoid, who underwent surgical resection of the right upper and middle lobes; two blocks of tissue were selected from mid-bronchial region (medial 2) and periphery (distal 2) of right upper lobe (RUL). Patient 3 was a 51-year-old female, non-smoker with mild adult-onset asthma and a left lower lobe (LLL) endobronchial typical carcinoid, who underwent LLL lobectomy; three tissue blocks were resected from the bronchus (proximal 3), mid-bronchial (medial 2), and periphery (distal 3) of the LLL. All tissues were received and immediately placed in cold PBS and transported on ice directly to the research lab for single cell dissociation procedures. Peripheral blood was collected from patients 1 and 3 in EDTA tubes. For bulk RNA-seq of canonical immune populations, whole blood from healthy human donors was obtained commericially (AllCells) in EDTA tubes. Patient tissues were obtained under a protocol approved by Stanford University’s Human Subjects Research Compliance Office (IRB 15166) and informed consent was obtained from each patient before surgery. All experiments followed applicable regulations and guidelines.
Mouse lung tissue
Lung tissue for Tabula Muris Senis40 was obtained as previously described. We obtained additional tissue from two mice expressing Cre recombinase and two expressing oestrogen-inducible Cre recombinase (Cre-ERT2) for conditional cell-specific labelling in vivo with the gene-targeted alleles FVB-Tbx4-LME-cre41,42 (lung stroma) and B6.129-Axin2-cre-ERT241, respectively. Cre-dependent reporter alleles Rosa26ZsGreen1, which expresses cytosolic ZsGreen1 following Cre-mediated recombination, and Rosa26mTmG, which expresses membrane-targeted green fluorescent protein (mGFP) after recombination and membrane-targeted tdTomato (mTomato) in all other tissues, were used to label cells expressing Tbx4 and Axin2, respectively43,44. Induction of the Axin2-cre-ERT2 allele was done by intraperitoneal injection of tamoxifen (3 mg) once a day for three days as described25. All mouse experiments followed applicable regulations and guidelines and were approved by the Institutional Animal Care and Use Committee at Stanford University (Protocol 9780).
Isolation of lung and blood cells
Individual human lung samples were dissected, minced, and placed in digestion media (400 μg ml−1 liberase DL (Sigma 5466202001) and 100 μg ml−1 elastase (Worthington LS006365) in RPMI (Gibco 72400120) in a gentleMACS c-tube (Miltenyi 130-096-334). Samples were partially dissociated by running ‘m_lung_01’ on a gentleMACS Dissociator (Miltenyi 130-093-235), incubated on a Nutator at 37 °C for 30 min, and then dispersed to a single cell suspension by running ‘m_lung_02’. Processing buffer (5% fetal bovine serum in PBS) and DNase I (100 μg ml−1, Worthington LS006344) were then added and the samples rocked at 37 °C for 5 min. Samples were then placed at 4 °C for the remainder of the protocol. Cells were filtered through a 100-μm filter, pelleted (300g, 5 min, 4 °C), and resuspended in ACK red blood cell lysis buffer (Gibco A1049201) for 3 min, after which the buffer was inactivated by adding excess processing buffer. Cells were then filtered through a 70-μm strainer (Fisherbrand 22363548), pelleted again (300g, 5 min, 4 °C), and resuspended in magnetic activated cell sorting (MACS) buffer (0.5% BSA, 2 mM EDTA in PBS) with Human FcR Blocking Reagent (Miltenyi 130-059-901) to block non-specific binding of antibodies (see below).
Immune cells, including granulocytes, were isolated from peripheral blood using a high density ficoll gradient45. In brief, peripheral blood was diluted tenfold with FACS buffer (2% FBS in PBS), carefully layered on an RT Ficoll gradient (Sigma HISTOPAQUE-1119), and centrifuged at 400g for 30 min at room temperature. The buffy coat was carefully removed, diluted fivefold with FACS buffer, pelleted (300g, 5 min, 4 °C), and incubated in ice cold FACS buffer containing DNase I (Worthington LS006344) for 10 min at 4 °C. Clumps were separated by gentle pipetting to create a single-cell suspension.
Mouse lung samples were processed into single cell suspensions as previously described2. In brief, each lung was dissected, minced, and placed in gentleMACS c-tubes (Miltenyi 130-096-334) with digestion buffer (400 μg ml−1 liberase DL (Sigma 5466202001) in RPMI (Gibco 72400120)). The minced tissue was partially dissociated by running ‘m_lung_01’ on a gentleMACS Dissociator (Miltenyi 130-093-235), incubated at 37 °C on a nutator for 30 min, completely dissociated on a gentleMACS by running ‘m_lung_02’, and kept at 4 °C or on ice for the remainder of the protocol. Cells were washed with 5% FBS in PBS, centrifuged at 300g for 5 min, resuspended in 5% FBS in PBS, filtered through a 70-μm strainer (Fisherbrand 22363548), and centrifuged again and resuspended in FACS buffer (2% FBS in PBS).
Magnetic separation of lung tissue compartments
Immune and endothelial cells were overrepresented in our previous mouse single-cell suspensions. To partially deplete these populations in our human samples, we stained cells isolated from lung with MACS microbeads conjugated to CD31 and CD45 (Miltenyi 130-045-801, 130-091-935) then passed them through an LS MACS column (Miltenyi, 130-042-401) on a MidiMACS Separator magnet (Miltenyi, 130-042-302). Cells retained on the column were designated ‘immune and endothelial enriched’. The flowthrough cells were then split, with 80% immunostained for FACS (see below) and the remaining 20% stained with EPCAM microbeads (Miltenyi 130-061-101). EPCAM stained cells were passed through another LS column. Cells retained on the column were labelled ‘epithelial enriched’, and cells that flowed through were designated ‘stromal’.
Flow cytometry and cell sorting
Lysis plates for single-cell mRNA sequencing were prepared as previous described2. 96-well lysis plates were used for cells from the blood and mouse samples and contained 4 μl of lysis buffer instead of 0.4 μl.
After negative selection against immune and endothelial cells by MACS, the remaining human lung cells were incubated with FcR Block (Becton Dickinson 564219) for 5 min and stained with directly conjugated anti-human CD45 (Biolegend 304006) and EPCAM (eBioscience 25-9326-42) antibodies on a Nutator for 30 min at the manufacturer’s recommended concentration. Cells were then pelleted (300g, 5 min, 4 °C), washed with FACS buffer three times, then incubated with cell viability marker Sytox blue (1:3,000, ThermoFisher S34857) and loaded onto a Sony SH800S cell sorter. Living single cells (Sytox blue-negative) were sorted into lysis plates based on three gates: EPCAM+CD45− (designated epithelial), EPCAM−CD45+ (designated immune), and EPCAM−CD45– (designated endothelial or stromal).
Immune cells from subject matched blood were incubated with FcR Block and Brilliant Violet buffer (BD 563794) for 20 min and then stained with directly conjugated anti-human CD3 (BD 563548), CD4 (BD 340443), CD8 (BD 340692), CD14 (BD 557831), CD19 (Biolegend 302234), CD47 (BD 563761), CD56 (BD 555516), and CD235a (BD 559944) antibodies for 30 min at the manufacturer’s recommended concentration. Cells were pelleted (300g, 5 min, 4 °C), washed with FACS buffer twice, and then incubated with the viability marker propidium iodide and loaded onto a BD FACSAria II cell sorter. Living (propidium iodide-negative) single, non-red blood (CD235a−) cells were sorted into lysis plates along with specific immune populations: B cells (CD19+CD3−), CD8+ T cells (CD8+), CD4+ T cells (CD4+), natural killer cells (CD19−CD3−CD56+CD14−), classical monocytes (CD19−CD3−CD56−CD14+). After sorting, plates were quickly sealed, vortexed, spun down for 1 min at 1,000g, snap frozen on dry ice, and stored at −80 until cDNA synthesis.
Mouse cells were incubated with the viability marker DAPI and loaded onto a BD Influx cell sorter. Living (DAPI-negative) single cells were sorted into lysis plates based on presence or absence of the fluorescent lineage label (mEGFP for Axin2-cre-ERT2, ZsGreen1 for Tbx4-LME-cre).
Immune cells for bulk mRNA sequencing were incubated with FcR Block for 20 min and then stained with one of six panels of directly conjugated antibodies for 30 min at the manufacturers recommended concentration: anti-human CD16 (BD 558122), CD123 (BD 560826), CCR3 (R&D FAB155F), ITGB7 (BD 551082), CD3 (BD 555341), CD14 (Invitrogen MHCD1406), CD19 (BD 555414), and CD56 (BD 555517) (basophils, neutrophils and eosinophils); anti-human CD16 (BD 558122), CD14 (BD 347497), CD4 (BD 340443), CD3 (BD 555341), CD8 (BD 555368), CD19 (BD 555414), and CD56 (BD 555517) (classical and nonclassical monocytes); anti-human CD16 (BD 558122), CD1c (Miltenyi Biotec 130-098-007), CD11c (BD 340544), CCR3 (R&D FAB155F), CD123 (BD 560826), HLA-DR (BD 335796), CD3 (BD 555341), CD4 (BD 555348), CD8 (BD 555368), CD14 (Invitrogen MHCD1406), CD19 (BD 555414), and CD56 (BD 555517) (pDCs, mDCs, CD16+ dendritic cells); anti-human IgM/IgD (BD 555778), CD19 (BD 557835), CD27 (BD 558664), CD20 (BD 335794), CD3 (BD 555341), CD4 (BD 555348), CD14 (Invitrogen MHCD1406), and CD56 (BD 555517) (B cells); anti-human CD16 (BD 558122), CD57 (BD 347393), CD56 (BD 557747), CD3 (BD 555341), CD4 (BD 555348), CD14 (Invitrogen MHCD1406), and CD19 (BD 555414) (natural killer cells); and anti-human CD45RA (Biolegend 304118), CCR7 (R&D FAB197F), CD62L (BD 555544), CD45RO (BD Pharmingen 560608), CD4 (BD 340443), CD8 (BD 340584), CD11b (BD 555389), CD14 (Invitrogen MHCD1406), CD19 (BD 555414), CD56 (BD 555517) (T cells). Cells were washed with FACS buffer twice, incubated with the viability marker propidium iodide and loaded onto a BD FACSAria II cell sorter. Approximately 40,000 cells from 21 canonical immune populations (Supplementary Table 3) were sorted in duplicate into Trizol LS (Invitrogen 10296010).
After sorting, all plates and samples were quickly sealed, vortexed, spun down for 1 min at 1,000g and then snap frozen on dry ice and stored at −80 °C until cDNA synthesis.
Single-cell mRNA sequencing
mRNA from single cells sorted from human and mouse lungs and human blood into lysis plates was reverse transcribed to cDNA and amplified as previously described2. Illumina sequencing libraries for cDNA from single cells were prepared as previously described2. In brief, cDNA libraries were prepared using the Nextera XT Library Sample Preparation kit (Illumina, FC-131-1096). Nextera tagmentation DNA buffer (Illumina) and Tn5 enzyme (Illumina) were added, and the sample was incubated at 55 °C for 10 min. The reaction was neutralized by adding Neutralize Tagment Buffer (Illumina) and centrifuging at room temperature at 3,220g for 5 min. Mouse samples were then indexed via PCR by adding i5 indexing primer, i7 indexing primer, and Nextera NPM mix (Illumina). Human samples were similarly indexed via PCR using custom, dual-unique indexing primers (IDT)2.
Following library preparation, wells of each library plate were pooled using a Mosquito liquid handler (TTP Labtech), then purified twice using 0.7x AMPure beads (Fisher A63881). Library pool quality was assessed by capillary electrophoresis on a Tapestation system (Agilent) with either a high sensitivity or normal D5000 ScreenTape assay kit (Agilent) or Fragment analyser (AATI), and library cDNA concentrations were quantified by qPCR (Kapa Biosystems KK4923) on a CFX96 Touch Real-Time PCR Detection System (Biorad). Plate pools were normalized and combined equally to make each sequencing sample pool. A PhiX control library was spiked in at 1% before sequencing. Human libraries were sequenced on a NovaSeq 6000 (Illumina) and mouse libraries on a NextSeq 500 (Illumina).
Cells isolated from each compartment (immune and endothelial enriched, epithelial enriched, stromal) and subject blood were captured in droplet emulsions using a Chromium Single-Cell instrument (10x Genomics) and libraries were prepared using the 10x Genomics 3′ Single Cell V2 protocol as previously described2. All 10x libraries were pooled and sequenced on a NovaSeq 6000 (Illumina).
Immune cell bulk mRNA sequencing
Total RNA from bulk-sorted canonical immune populations was reverse transcribed to cDNA, amplified, and prepared as sequencing libraries as previously described45. Libraries were sequenced on a NextSeq 500 (Illumina).
Mouse and human lungs were collected as previously described25,46. After inflation, lungs were removed en bloc, fixed in 4% paraformaldehyde (PFA) overnight at 4 °C with gentle rocking, then cryo-embedded in Optimal Cutting Temperature compound (OCT, Sakura) and sectioned using a cryostat (Leica) onto Superfrost Plus Microscope Slides (Fisherbrand). Immunohistochemistry was performed using primary antibodies raised against the following antigens and used at the indicated dilutions to stain slides overnight at 4 °C: anti-proSP-C (rabbit, Chemicon AB3786, 1:250 dilution), HES1 (rabbit, Cell Signaling 11988S clone D6P2U, 1:100), MUC-1 (hamster, Thermo Scientific HM1630, clone MH1, 1:250), Ki67 (rat, DAKO M7249 clone MIB-1, 1:100), and keratin-5 (chicken, Biolegend 905901, 1:100). Primary antibodies were detected with Alexa Fluor-conjugated secondary antibodies (Jackson ImmunoResearch) unless otherwise noted, then mounted in Vectashield containing DAPI (5 μg ml−1, Vector labs). Images were acquired with a laser scanning confocal fluorescence microscope (Zeiss LSM780) and processed with Fiji (v.2.0) and Imaris (v.9.2.0, Oxford Instruments). Immunostaining experiments were performed on at least two human or mouse participants distinct from the donors used for sequencing, and quantifications were based on at least 10 fields of view in each.
Single molecule in situ hybridization
Samples were fixed in either 10% neutral buffered formalin, dehydrated with ethanol and embedded in paraffin wax or fixed in 4% paraformaldehyde and embedded in OCT compound. Sections from paraffin (5 μm) and OCT (20 μm) blocks were processed using standard pre-treatment conditions for each per the RNAscope multiplex fluorescent reagent kit version 2 (Advanced Cell Diagnostics) assay protocol. TSA-plus fluorescein, Cy3 and Cy5 fluorophores were used at 1:500 dilution. Micrographs were acquired with a laser scanning confocal fluorescence microscope (Zeiss LSM780) and processed with ImageJ and Imaris (version 9.2.0, Oxford Instruments). smFISH experiments were performed on at least two human or mouse participants distinct from the donors used for sequencing, and quantifications were based on at least 10 fields of view in each. For smFISH, fields of view were scored manually, calling a cell positive for each gene probed if its nucleus had at least three associated expression puncta. Proprietary (Advanced Cell Diagnostics) probes used were: KRT5 (547901-C2), SERPINB3 (828601-C3), SFTPC (452561-C2), WIF1 (429391), CLDN5 (517141-C2, 517141-C3), MYC (311761-C3), ACKR1 (525131, 525131-C2), COL1A2 (432721), GPC3 (418091-C2), SERPINF1 (564391-C3), C20rf85 (560841-C3), DHRS9 (467261), GJA5 (471431), CCL21 (474371-C2), COX4I2 (570351-C3), APOE (433091-C2), ACGT2 (828611-C2), ASPN (404481), IGSF21 (572181-C3), GPR34 (521021), EREG (313081), GPR183 (458801-C2), TREM2 (420491-C3), CHI3L1 (408121), MYRF (499261), AGER (470121-C3), TBX5 (564041), KCNK3 (536851), ACVRL1 (559221), SERPINA1 (435441), HHIP (464811), SLC7A10 (497081-C2), FGFR4 (443511), PI16 (451311-C2), SERPINF1 (310731), HHIP (448441-C3), SFTPC (314101-C2), NKX2-1 (434721-C3), and MYRF (524061).
Sequencing read alignments and quality control
Reads from single cells isolated using 10x chromium were demultiplexed and then aligned to the GRCh38.p12 human reference (from 10x Genomics) using Cell Ranger (version 2.0, 10x Genomics). Cells with fewer than 500 genes detected or 1,000 UMIs were excluded from further analyses.
Reads from single cells isolated by flow cytometry were demultiplexed using bcl2fastq (v.18.104.22.1686, Illumina), pruned for low nucleotide quality scores and adaptor sequences using skewer (v.0.2.2), and aligned to either (depending on organism) the GRCh38.p12 human reference genome with both the gencode-vH29 and NCBI-108 annotations or the GRCm38.p6 mouse reference genome with the NCBI-106 annotation (with fluorescent genes mEGFP, tdTomato, and ZsGreen1 supplemented) using STAR (v.2.6.1d) in two-pass mapping mode, in which the first pass identifies novel splice junctions and the second pass aligns reads after rebuilding the genome index with the novel junctions. The number of reads mapping to each annotated gene were calculated by STAR during the second pass alignment, and cells with fewer than 500 genes detected or 50,000 mapped reads were excluded from later analyses. Reads from mRNA sequencing of canonical immune populations were demultiplexed, aligned and quantified using the same pipeline.
Cell clustering, doublet calling, and annotation
Expression profiles of cells from different subjects and different capture approaches (10x and SS2) were clustered separately using the R software package Seurat (v.2.3)47. In brief, counts (SS2) and UMIs (10x) were normalized across cells, scaled per million (SS2) or per 10,000 (10x), and converted to log scale using the ‘NormalizeData’ function. These values were converted to z-scores using the ‘ScaleData’ command and highly variable genes were selected with the ‘FindVariableGenes’ function with a dispersion cutoff of 0.5. Principle components were calculated for these selected genes and then projected onto all other genes with the ‘RunPCA’ and ‘ProjectPCA’ commands. Clusters of similar cells were detected using the Louvain method for community detection including only biologically meaningful principle components (see below) to construct the shared nearest neighbour map and an empirically set resolution, as implemented in the ‘FindClusters’ function.
When clustering all cells from a single subject at once, we found that the first principal components defining heterogeneity represented differences in tissue compartment, but some cell types within a compartment (for example, basal, goblet club, neuroendocrine and ionocyte) had a tendency to co-cluster. Clusters were therefore grouped based on expression of tissue compartment markers (for example, EPCAM, CLDN5, COL1A2 and PTPRC) using the ‘SubsetData’ command and the same procedure (from ‘ScaleData’ onwards) was applied iteratively to each tissue compartment until the markers enriched in identified clusters, identified using the ‘MAST’ statistical framework48 implemented in the ‘FindMarkers’ command, were no longer biologically meaningful (for example, clusters distinguished by dissociation-induced genes30, ribosomal genes, mitochondrial genes, or ambient RNA released by abundant cells such as RBCs31). Doublets were identified by searching for cells with substantial and coherent expression profiles from two or more tissue compartments and/or cell types.
To assign clusters identities, we first compiled a list of all established lung cell types, their abundances, their classical markers, and any RNA markers (when available) (Supplementary Table 1). RNA markers for canonical immune populations (Supplementary Table 3) were obtained from bulk mRNA sequencing by correlating the average expression (each captured in duplicate) with a test vector where the target population position equaled 10 and all others equaled 0 (see GitHub for details). Clusters were assigned a canonical identity based on enriched expression of these marker genes. Pearson correlations were calculated between the average expression profiles from each immune cluster for all cells in the SS2 with the average bulk profiles using the ‘cor’ function in R. There were no clusters that lacked expression of canonical marker genes. When two or more clusters were assigned the same identity, we first determined whether their tissue locations differed substantially (for example, proximal versus distal, alveolar versus adventitial) and prepended these locations when applicable. When both clusters localized to the same tissue region (for example, capillary endothelial cells or AT2 cells), we next compared their differentially expressed genes head-to-head to identify differences in molecular functions. These functional differences were also prepended, when applicable (for example, signalling AT2 versus AT2, proliferating basal versus basal). If the clusters could not be resolved by location or function, we prepended a representative marker gene to their ‘canonical’ identity (for example, IGSF21+ dendritic, EREG+ dendritic, and TREM2+ dendritic). Cells from different subjects with the same annotation were merged into a single group for all downstream analyses.
Approximately 35,000 mouse lung and blood cell expression profiles by SS2 and 10x from Tabula Muris Senis2 were combined with 522 cells isolated from Axin2-Cre-ERT2> Rosa26mTmG (A.N.N.) and Tbx4-LME-Cre > Rosa26ZsGreen1 (K.J.T.) mice and amplified by SS2. Cells were stratified by technology (10x versus SS2), re-clustered and re-annotated using the strategy described above for human lung cells.
Re-annotation of existing human lung single cell RNA sequencing datasets
UMI tables were obtained from the Gene Expression Omnibus (GSE122960 for ref. 18, GSE130148 for ref. 19), clustered, and annotated using the strategy described above. New annotations for each cell are available on GitHub (see below).
Cell type pairwise correlations
We obtained average expression profiles for each cell type from all cells in the 10x dataset, supplemented with the average expression profile from neutrophils in the SS2 dataset, and calculated pairwise Pearson correlation coefficients using the ‘cor’ function in R.
Identification of proliferation signature
Expression profiles from matched proliferating and quiescent cell types were compared head-to-head using the ‘MAST’ statistical framework implemented in the ‘FindMarkers’ command in Seurat. Differentially-expressed genes common in each proliferating cell type were converted to z-scores using the ‘ScaleData’ command in Seurat, and summed to create a proliferation score for each cell in the 10x dataset.
Identification of immune egression signatures
Blood and tissue expression profiles for each immune cell type were compared head-to-head using the ‘MAST’ statistical framework implemented in the ‘FindMarkers’ command in Seurat. Differentially-expressed genes common in each subject were screened for dissociation artefact and contamination by red blood cells. Genes specific to tissue immune cells were binned based on their breadth of expression (lymphocyte, myeloid or both), converted to z-scores using the ‘ScaleData’ command in Seurat, and summed to create an egression score for each cell in the 10x dataset.
Identification of enriched marker genes, transcription factors, and disease genes
Differentially expressed genes for each annotated cell type relative to the other cells within its tissue compartment were identified using the ‘FindMarkers’ command in Seurat with the ‘MAST’ statistical framework after downsampling each cell type to 100 (SS2) or 500 (10x) cells. To obtain the most sensitive and specific markers for each cell type, we ranked enriched genes, with a P value less than 10−5 and a sensitivity greater than 0.4, by their Matthews correlation coefficients (MCCs) calculated for each cell type from all cells in the 10x data set (numbers available in Supplementary Table 2). To measure the utility of using multiple markers in assigning cell identities, we calculated MCC scores for all possible combinations of each cell type’s top five marker genes.
Enriched genes were annotated as transcription factors or genes associated with pulmonary pathology based on lists compiled from The Animal Transcription Factor Database (http://bioinfo.life.hust.edu.cn/AnimalTFDB), The Online Mendelian Inheritance in Man Catalog (OMIM)49, and Genome Wide Association Studies (GWAS) obtained from the EMBL-EBI Catalog50 (EFO IDs 0000270, 0000341, 0000464, 0000571, 0000702, 0000707, 0000708, 0000768, 0001071, 0003060, 0003106, 0004244, 0004312, 0004313, 0004314, 0004647, 0004713, 0004806, 0004829, 0005220, 0005297, 0006505, 0006953, 0007627, 0007744, 0007944, 0008431, 0009369, 0009370; GO IDs 0031427, 0097366; Orphanet IDs 586 182098; log(p-value) < -20, statistical tests vary in indicated studies). Viral entry genes were obtained from Gene Ontology (GO:0046718) and then curated and associated with their cognate virus(es) based on literature citations available in our GitHub repository.
Cellular interaction and hormone target mapping
Interactions between cell types were predicted using CellPhoneDB (‘statistical_analysis’ method) with all cells in the SS2 dataset, as previously described6. For our targeted analyses, we curated the chemokine receptor-ligand interaction map and list of hormone receptors from an extensive literature search (available on GitHub, see below).
Human and mouse gene alignment, cell type correlation, and gene expression comparisons
The gene expression matrices from our human SS2 cells and the Tabula Muris Senis SS2 cells, supplemented with the 522 mouse cells from Axin2-creER > mTmG and Tbx4-Cre > ZsGreen1 described above, were collapsed to HomologyIDs obtained from the Mouse Genome Informatics database to enable direct comparison. We obtained mean expression profiles for each cell type from all cells in the SS2 dataset and calculated pairwise Pearson correlation coefficients using the ‘cor’ function in R. We defined species-specific gene expression as those enriched 20-fold in either direction (mouse > human or human > mouse) with a P value less than 10−5 (calculated by ‘MAST’ as above) from all cells for the indicated types in the SS2 dataset. Correlations and age-specific genes were obtained the same manner using all cells from 3-month and 24-month in the combined SS2 mouse dat set.
To compare the expression pattern of each gene across species we binarized genes as expressed (1) or not expressed (0) in each cell type’s average expression profile calculated from all mouse and human SS2 cells of the types compared above. A cell type ‘expressed’ a gene if the median of that gene’s non-zero expression values across the constituent cells was greater than the median of every non-zero expression value for all other genes plus or minus two standard deviations (varied in 0.25 increments) and if the percentage of cells within the cell type with non-zero expression values was greater than the median percent of non-zero expression values for all other genes plus or minus two standard decisions (varied in 0.25 increments). These cutoffs were varied independently to ensure genes were robustly categorized. We then ordered these gene vectors to match homologous cell types between species with at least five cells and combined them to a single vector for each gene (V = (a − b) + 2ab, in which a is the ordered human vector and b is the ordered mouse vector) that indicated for each cell type whether: Both mouse and human expressed the gene (2), only human (1), only mouse (−1), or neither (0). We then classified genes by the following: conserved if any element of V equaled 2 and all other elements equaled 0; type 2 if any element equaled 2 and any other equaled 1 or −1; not expressed if all elements equaled 0; type 3 if elements were both positive and negative; and type 1 if elements were either positive or negative and 0.
Statistics and reproducibility
All heat maps and plots with single cell expression data include every cell from indicated types (numbers available in Supplementary Table 2 for human and Supplementary Table 6 for mouse) for sequencing technology specified (SS2 or 10x), unless otherwise stated. Scatter plots were generated with ggplot2’s ‘geom_point’ function. Dot plots were generated using a modified version of Seurat’s ‘DotPlot’ function (available on GitHub). Violin plots were created with Seurat’s ‘VlnPlot’ function and show proportion of single cells at indicated expression levels. Box-and-whisker plots were generated with ggplot2’s ‘geom_boxplot’ function; lower and upper hinges correspond to first and third quartiles, whiskers extend from hinge to the largest or smallest value no further than 1.5 times the interquartile range. Data beyond whiskers are shown as outlying points. Correlations use Pearson’s coefficient. Differentially expressed genes were identified using the ‘MAST’ statistical framework48 implemented in Seurat’s ‘FindMarkers’ function. Immunostaining and smFISH experiments were performed on at least 2 human or mouse subjects distinct from the donors used for sequencing, and quantifications were based on at least 10 fields of view in each. For smFISH, fields of view were scored manually, calling a cell positive for each gene probed if its nucleus had at least three associated expression puncta. No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.