Integrate scRNA-seq datasets#

!lamin load test-scrna
💡 found cached instance metadata: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅ loaded instance: testuser1/test-scrna

import lamindb as ln
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
✅ loaded instance: testuser1/test-scrna (lamindb 0.50.1)
ln.track()
💡 notebook imports: anndata==0.9.2 lamindb==0.50.1 lnschema_bionty==0.29.2 pandas==1.5.3
🌱 saved: Transform(id='agayZTonayqAz8', name='Integrate scRNA-seq datasets', short_name='20-scrna-1', stem_id='agayZTonayqA', version='0', type=notebook, updated_at=2023-08-08 17:03:08, created_by_id='DzTjkKse')
🌱 saved: Run(id='aJ56SHp5fkgEeGpkinhn', run_at=2023-08-08 17:03:08, transform_id='agayZTonayqAz8', created_by_id='DzTjkKse')

Query files based on metadata#

ln.File.filter(tissues__name__icontains="lymph node").distinct().df()
id                    storage_id  key   suffix  accessor  description  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
aLCyyKidg521XKPiW0IV  0rwQCaix    None  .h5ad   AnnData   Detmar22     17342743  rk5lSoJvz6PHRRjmcB919w  md5        Nv48yAceNSh8z8  1kFBK8ELqbiA9eLY9nx0  2023-08-08 17:02:18  DzTjkKse
7ASnz4e3JQn1teGM7dLH  0rwQCaix    None  .h5ad   AnnData   Conde22      28061905  3cIcmoqp1MxjX8NlRkKGlQ  md5        Nv48yAceNSh8z8  1kFBK8ELqbiA9eLY9nx0  2023-08-08 17:02:50  DzTjkKse
ln.File.filter(cell_types__name__icontains="monocyte").distinct().df()
id                    storage_id  key   suffix  accessor  description            size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
7ASnz4e3JQn1teGM7dLH  0rwQCaix    None  .h5ad   AnnData   Conde22                28061905  3cIcmoqp1MxjX8NlRkKGlQ  md5        Nv48yAceNSh8z8  1kFBK8ELqbiA9eLY9nx0  2023-08-08 17:02:50  DzTjkKse
pbZQhgw7AZbYkRV5P6iS  0rwQCaix    None  .h5ad   AnnData   10x reference pbmc68k  589484    eKVXV5okt5YRYjySMTKGEw  md5        Nv48yAceNSh8z8  1kFBK8ELqbiA9eLY9nx0  2023-08-08 17:02:58  DzTjkKse
ln.File.filter(labels__name="female").distinct().df()
id                    storage_id  key   suffix  accessor  description  size      hash                    hash_type  transform_id    run_id                updated_at           created_by_id
aLCyyKidg521XKPiW0IV  0rwQCaix    None  .h5ad   AnnData   Detmar22     17342743  rk5lSoJvz6PHRRjmcB919w  md5        Nv48yAceNSh8z8  1kFBK8ELqbiA9eLY9nx0  2023-08-08 17:02:18  DzTjkKse
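
The filters above use Django-style field lookups: `tissues__name__icontains="lymph node"` follows the `tissues` relation, reads each linked record's `name`, and matches it as a case-insensitive substring, while `.distinct()` drops rows repeated by the join. The following plain-Python sketch imitates those semantics on hypothetical in-memory records (the file names and tissue labels are invented for illustration); it does not query a lamindb instance.

```python
# Hypothetical records standing in for files and their linked tissue names.
files = [
    {"description": "file_a", "tissues": ["Lymph Node", "inguinal lymph node"]},
    {"description": "file_b", "tissues": ["omentum", "mesenteric lymph node"]},
    {"description": "file_c", "tissues": ["blood"]},
]


def filter_icontains(records, relation, needle):
    """Return each record's description once if any linked name contains `needle`."""
    needle = needle.lower()
    matches = []
    for record in records:
        # any() reports a record at most once, even when several linked names
        # match; this plays the role of .distinct() after a join.
        if any(needle in name.lower() for name in record[relation]):
            matches.append(record["description"])
    return matches


lymph_node_files = filter_icontains(files, "tissues", "lymph node")
print(lymph_node_files)
```

Here `file_a` has two matching tissue names but appears only once in the result, which is the effect `.distinct()` has in the queries above.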

Intersect measured genes between two datasets#

file1 = ln.File.filter(description="Conde22").one()
file2 = ln.File.filter(description="10x reference pbmc68k").one()
file1.describe()
💡 File(id=7ASnz4e3JQn1teGM7dLH, key=None, suffix=.h5ad, accessor=AnnData, description=Conde22, size=28061905, hash=3cIcmoqp1MxjX8NlRkKGlQ, hash_type=md5, created_at=2023-08-08 17:02:50.465201+00:00, updated_at=2023-08-08 17:02:50.465235+00:00)

Provenance:
    🗃️ storage: Storage(id='0rwQCaix', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/datatype/test-scrna', type='local', updated_at=2023-08-08 17:03:05, created_by_id='DzTjkKse')
    📔 transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='10-scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-08 17:02:58, created_by_id='DzTjkKse')
    🚗 run: Run(id='1kFBK8ELqbiA9eLY9nx0', run_at=2023-08-08 17:01:57, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-08 17:03:05)
Features:
  🗺️ var (X):
    🔗 index (36503, bionty.Gene.id): ['DR2rND9sNMZk', 'DGR0nS20xED7', 'fQ4hPIrbi78N', '0eSATeQoHyok', 'QFJc6WfWbXC6'...]
  🗺️ external:
    🔗 species (1, bionty.Species): ['human']
  🗺️ obs (metadata):
    🔗 cell_type (32, bionty.CellType): ['CD8-positive, alpha-beta memory T cell', 'gamma-delta T cell', 'non-classical monocyte', 'regulatory T cell', 'mast cell']
    🔗 assay (3, bionty.ExperimentalFactor): ["10x 5' v1", "10x 5' v2", "10x 3' v3"]
    🔗 tissue (17, bionty.Tissue): ['omentum', 'ileum', 'caecum', 'transverse colon', 'duodenum']
    🔗 donor (12, core.Label): ['A29', 'D503', 'A36', '637C', 'A31']
file1.view_lineage()
https://d33wubrfki0l68.cloudfront.net/b56033a24414ad9bb38fe205b46dd1849951ef35/9ad61/_images/940ed44c77c494978fb67c2ecd5fcf1403881afd8cbd2c09fa86dc4f8d57d21d.svg
file2.describe()
💡 File(id=pbZQhgw7AZbYkRV5P6iS, key=None, suffix=.h5ad, accessor=AnnData, description=10x reference pbmc68k, size=589484, hash=eKVXV5okt5YRYjySMTKGEw, hash_type=md5, created_at=2023-08-08 17:02:58.546749+00:00, updated_at=2023-08-08 17:02:58.546783+00:00)

Provenance:
    🗃️ storage: Storage(id='0rwQCaix', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/datatype/test-scrna', type='local', updated_at=2023-08-08 17:03:05, created_by_id='DzTjkKse')
    📔 transform: Transform(id='Nv48yAceNSh8z8', name='Curate & link scRNA-seq datasets', short_name='10-scrna', stem_id='Nv48yAceNSh8', version='0', type='notebook', updated_at=2023-08-08 17:02:58, created_by_id='DzTjkKse')
    🚗 run: Run(id='1kFBK8ELqbiA9eLY9nx0', run_at=2023-08-08 17:01:57, transform_id='Nv48yAceNSh8z8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-08 17:03:05)
Features:
  🗺️ var (X):
    🔗 index (695, bionty.Gene.id): ['4DDdKM0LEjQ0', 'W9i7vtUUqAxD', '4Pa8WI5dcVfb', 'ONVhI8qWHKkI', 'vdWoHAHsKucN'...]
  🗺️ obs (metadata):
    🔗 cell_type (9, bionty.CellType): ['conventional dendritic cell', 'effector memory CD4-positive, alpha-beta T cell, terminally differentiated', 'CD38-negative naive B cell', 'CD16-positive, CD56-dim natural killer cell, human', 'CD14-positive, CD16-negative classical monocyte']
file2.view_lineage()
https://d33wubrfki0l68.cloudfront.net/0e3df7199e57c666b736c738c9912fad5c9ce21b/1dbd6/_images/7e82cc11da33c67d72d5afe18befec17ebbc435021aa6d82b7a9bf529f2bf747.svg
file1_adata = file1.load()
file2_adata = file2.load()
💡 adding file 7ASnz4e3JQn1teGM7dLH as input for run aJ56SHp5fkgEeGpkinhn, adding parent transform Nv48yAceNSh8z8
💡 adding file pbZQhgw7AZbYkRV5P6iS as input for run aJ56SHp5fkgEeGpkinhn, adding parent transform Nv48yAceNSh8z8
file2_adata.obs.cell_type.head()
index
GCAGGGCTGGATTC-1                                       dendritic cell
CTTTAGTGGTTACG-6                                B cell, CD19-positive
TGACTGGAACCATG-7                                       dendritic cell
TCAATCACCCTTCG-8                                B cell, CD19-positive
CGTTATACAGTACC-8    effector memory CD4-positive, alpha-beta T cel...
Name: cell_type, dtype: category
Categories (9, object): ['CD8-positive, CD25-positive, alpha-beta regul..., 'effector memory CD4-positive, alpha-beta T ce..., 'cytotoxic T cell', 'CD38-negative naive B cell', ..., 'B cell, CD19-positive', 'conventional dendritic cell', 'CD16-positive, CD56-dim natural killer cell, ..., 'dendritic cell']

Here we compute shared genes without loading files:

file1_genes = file1.features["var"]
file2_genes = file2.features["var"]

shared_genes = file1_genes & file2_genes
shared_genes.list("symbol")[:10]
['EFHD2',
 'GSTK1',
 'IL2RG',
 'NUDCD2',
 'XCL1',
 'TMEM176B',
 'APEX1',
 'MRPL9',
 'MPHOSPH9',
 'DUSP2']
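
Conceptually, `file1_genes & file2_genes` keeps the gene records that are linked to both files. The sketch below shows the same idea with plain Python sets over hypothetical gene identifiers; it is an analogy for the result, not a description of how lamindb evaluates the query.

```python
# Hypothetical gene identifiers; the real feature sets hold 36503 and 695 genes.
file1_gene_ids = {"gene_a", "gene_b", "gene_c", "gene_d"}
file2_gene_ids = {"gene_b", "gene_d", "gene_e"}

# Set intersection keeps only the identifiers measured in both datasets.
shared_gene_ids = file1_gene_ids & file2_gene_ids

# Sets are unordered, so sort before displaying or slicing the first entries.
print(sorted(shared_gene_ids)[:10])
```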

We also need to convert the Ensembl gene IDs that index file1 to gene symbols, using the genes linked to file2 as the mapping, so that the two datasets can be concatenated:

mapper = (
    pd.DataFrame(file2_genes.values_list("ensembl_gene_id", "symbol"))
    .drop_duplicates(0)
    .set_index(0)[1]
)
mapper.head()
0
ENSG00000104852     SNRNP70
ENSG00000111832       RWDD1
ENSG00000179344    HLA-DQB1
ENSG00000129625       REEP5
ENSG00000130520        LSM4
Name: 1, dtype: object
file1_adata.var.rename(index=mapper, inplace=True)
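
The mapper construction and the `rename` call can be traced on a small example. The sketch below uses hypothetical ID and symbol pairs (`ENSG_A`, `SYM_A`, and so on are invented) to show what `drop_duplicates(0).set_index(0)[1]` produces and how `rename` treats index labels that are missing from the mapper.

```python
import pandas as pd

# Hypothetical (ensembl_gene_id, symbol) pairs, in the shape values_list(...)
# returns; the first ID appears twice to show why duplicates are dropped.
pairs = [
    ("ENSG_A", "SYM_A"),
    ("ENSG_A", "SYM_A_ALT"),
    ("ENSG_B", "SYM_B"),
]

# Columns are labeled 0 and 1 by default: keep the first row per ID in column 0,
# index by that column, then select column 1 to get a Series mapping ID -> symbol.
mapper = pd.DataFrame(pairs).drop_duplicates(0).set_index(0)[1]

# A hypothetical var table indexed by IDs, one of which has no mapping.
var = pd.DataFrame({"n_cells": [10, 20, 30]}, index=["ENSG_A", "ENSG_B", "ENSG_C"])

# rename() replaces labels found in the mapper and leaves the others unchanged.
var = var.rename(index=mapper)
print(var.index.tolist())
```

Labels without an entry in the mapper keep their original value, so genes that only one dataset knows about are not matched by name in a later concatenation.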

Intersect cell types#

file1_celltypes = file1.cell_types.all()
file2_celltypes = file2.cell_types.all()

shared_celltypes = file1_celltypes & file2_celltypes
shared_celltypes_names = shared_celltypes.list("name")
shared_celltypes_names
['conventional dendritic cell',
 'CD16-positive, CD56-dim natural killer cell, human']

We can now subset the two datasets by shared cell types:

file1_adata_subset = file1_adata[
    file1_adata.obs["cell_type"].isin(shared_celltypes_names)
]
file1_adata_subset.obs["cell_type"].value_counts()
CD16-positive, CD56-dim natural killer cell, human    114
conventional dendritic cell                             7
Name: cell_type, dtype: int64
file2_adata_subset = file2_adata[
    file2_adata.obs["cell_type"].isin(shared_celltypes_names)
]
file2_adata_subset.obs["cell_type"].value_counts()
CD16-positive, CD56-dim natural killer cell, human    3
conventional dendritic cell                           2
Name: cell_type, dtype: int64
adata_concat = ad.concat(
    [file1_adata_subset, file2_adata_subset],
    label="file",
    keys=[file1.description, file2.description],
)
adata_concat
AnnData object with n_obs × n_vars = 126 × 695
    obs: 'cell_type', 'file'
    obsm: 'X_umap'
adata_concat.obs.value_counts()
cell_type                                           file                 
CD16-positive, CD56-dim natural killer cell, human  Conde22                  114
conventional dendritic cell                         Conde22                    7
CD16-positive, CD56-dim natural killer cell, human  10x reference pbmc68k      3
conventional dendritic cell                         10x reference pbmc68k      2
dtype: int64
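
`ad.concat` aligns variables with `join="inner"` by default, which is why the result keeps the 695 genes present in both subsets, and `label="file"` together with `keys=...` records which file each cell came from. As a rough analogy only, the pandas sketch below applies an inner join to hypothetical cell-by-gene tables; it is not the anndata implementation, and all names in it are invented.

```python
import pandas as pd

# Hypothetical expression tables: rows are cells, columns are genes.
x1 = pd.DataFrame(
    [[1, 0, 2], [0, 3, 1]],
    index=["cell_1", "cell_2"],
    columns=["GENE_A", "GENE_B", "GENE_C"],
)
x2 = pd.DataFrame([[5, 4]], index=["cell_3"], columns=["GENE_B", "GENE_C"])

# join="inner" keeps only the columns (genes) present in every table;
# keys/names add an index level recording which table each row came from.
combined = pd.concat(
    [x1, x2],
    join="inner",
    keys=["dataset_1", "dataset_2"],
    names=["file", "cell"],
)

print(combined.columns.tolist())
print(combined.index.get_level_values("file").value_counts().to_dict())
```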
!lamin delete test-scrna
!rm -r ./test-scrna
💡 deleting instance testuser1/test-scrna
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-scrna.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
🔶     consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/datatype/test-scrna