Curate & link multi-modal data
!lamin init --storage ./test-multimodal --schema bionty
💡 creating schemas: core==0.44.3 bionty==0.29.2
🌱 saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-08 17:04:09)
🌱 saved: Storage(id='DxdUnaLi', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/datatype/test-multimodal', type='local', updated_at=2023-08-08 17:04:09, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-multimodal
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
lb.settings.species = "human"
ln.settings.verbosity = 3
✅ loaded instance: testuser1/test-multimodal (lamindb 0.50.1)
🌱 set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-08 17:04:12, bionty_source_id='nc4C', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.50.1 lnschema_bionty==0.29.2
🌱 saved: Transform(id='yMWSFirS6qv2z8', name='Curate & link multi-modal data', short_name='40-multimodal', stem_id='yMWSFirS6qv2', version='0', type=notebook, updated_at=2023-08-08 17:04:13, created_by_id='DzTjkKse')
🌱 saved: Run(id='VTwgtWGbso4ZM3jIk14e', run_at=2023-08-08 17:04:13, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
MuData object
Let's use a MuData object:
mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  var: 'name'
  4 modalities
    rna: 200 x 173
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    adt: 200 x 4
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    hto: 200 x 12
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    gdo: 200 x 111
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
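As a quick sanity check on the summary above, the per-modality variable counts should add up to the global n_vars, since the global variable axis spans all modalities. The following is a minimal sketch that hard-codes the shapes printed above instead of reading them from mdata; the name modality_shapes is hypothetical and not part of any library.

```python
# Shapes copied by hand from the MuData summary above: (n_obs, n_vars) per modality.
modality_shapes = {
    "rna": (200, 173),
    "adt": (200, 4),
    "hto": (200, 12),
    "gdo": (200, 111),
}

# In this subset, every modality describes the same number of cells.
n_obs_values = {n_obs for n_obs, _ in modality_shapes.values()}

# The global variable count is the sum of the per-modality variable counts.
total_n_vars = sum(n_vars for _, n_vars in modality_shapes.values())

print(n_obs_values, total_n_vars)
```

Both values agree with the first line of the summary (200 observations, 300 variables).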
First we register the file:
file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
🌱 storing file 'rwZMC3EJL4Y3dueEZF4o' with key '.lamindb/rwZMC3EJL4Y3dueEZF4o.h5mu'
Register features
Now let's register 3 of the feature sets this data contains:
rna
adt
obs (metadata)
modalities
For the two modalities rna and adt, we use bionty tables as the reference:
mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
💡 using global setting species = human
✅ validated 93 Gene records from Bionty on symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, ...
🔶 ambiguous validation in Bionty for 11 records: HLA-DQB1-AS1, CTAGE15, CRYAB, CTRB2, LGALS9C, NPHS1, THPO, PCDHB11, XG, TBC1D3G, TUBB1
🔶 did not validate 96 Gene records for symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, AP003419.16, C14orf177, C1orf65, CASC1, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, ...
🔶 ignoring non-validated features: AC002066.1,AC004019.13,AC005150.1,AC006042.7,AC011558.5,AC026471.6,AC073934.6,AC091132.1,AC092295.4,AC092687.5,AE000662.93,AL132989.1,AP000442.4,AP003419.16,C14orf177,C1orf65,CASC1,CTA-373H7.7,CTB-134F13.1,CTB-31O20.9,CTC-467M3.1,CTC-498J12.1,CTD-2562J17.2,CTD-3012A18.1,CTD-3065B20.2,CTD-3193O13.8,FAM65C,HIST1H4K,IBA57-AS1,KIAA1239,LARGE,NBPF16,RP1-1J6.2,RP11-110I1.14,RP11-113K21.4,RP11-120C12.3,RP11-12D24.10,RP11-12J10.3,RP11-134K13.4,RP11-136I14.5,RP11-138C9.1,RP11-146I2.1,RP11-152H18.3,RP11-17J14.2,RP11-186N15.3,RP11-187A9.3,RP11-214C8.2,RP11-219B4.7,RP11-231C14.4,RP11-235C23.5,RP11-247A12.8,RP11-265N6.2,RP11-268G12.1,RP11-2H8.2,RP11-304L19.11,RP11-307N16.6,RP11-324E6.9,RP11-325L7.1,RP11-32B5.7,RP11-335O4.3,RP11-346D14.1,RP11-365N19.2,RP11-379B18.5,RP11-3D4.2,RP11-403P17.5,RP11-408E5.4,RP11-415J8.5,RP11-434D9.1,RP11-465N4.4,RP11-473O4.4,RP11-496I9.1,RP11-524H19.2,RP11-532F6.4,RP11-536K7.5,RP11-546K22.3,RP11-624M8.1,RP11-703G6.1,RP11-717H13.1,RP11-745O10.2,RP11-75C10.9,RP11-760N9.1,RP11-778D9.12,RP11-80H5.7,RP11-835E18.5,RP11-867G23.3,RP11-973N13.4,RP11-982M15.2,RP11-9M16.2,RP13-582O9.7,RP3-327A19.5,RP3-337O18.9,RP5-827C21.6,RP5-855F16.1,TMEM75,U52111.14,XX-CR54.1
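To make the log above easier to read, here is a minimal sketch of what validating values against a reference amounts to conceptually: each value either matches an entry in a reference table or it does not, and only the matches are kept. This is an illustration only, not how lamindb or Bionty implement validation (it ignores species and ambiguous matches, for instance). The helper partition_by_reference and the set toy_reference are hypothetical names, and the toy reference holds just three symbols that the log reports as validated.

```python
def partition_by_reference(values, reference):
    """Split values into (validated, not_validated), preserving input order."""
    validated = [v for v in values if v in reference]
    not_validated = [v for v in values if v not in reference]
    return validated, not_validated


# First five rna variable names, as printed earlier in this section.
var_names = ["RP5-827C21.6", "XX-CR54.1", "SH2D6", "RP11-379B18.5", "RP11-778D9.12"]

# Toy stand-in for a reference table of gene symbols (NOT the Bionty registry).
toy_reference = {"SH2D6", "GABRA1", "SPACA1"}

validated, not_validated = partition_by_reference(var_names, toy_reference)
print(validated)
print(not_validated)
```

Of these five names, only SH2D6 ends up in the validated list, which matches the log: the other four appear among the ignored, non-validated features.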
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
💡 using global setting species = human
✅ validated 4 CellMarker records from Bionty on name: CD86, PDL1, PDL2, CD366
Link them to the file:
file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")
metadata
The 3rd feature set is the obs:
obs = mdata["rna"].obs
We're only interested in a single metadata column:
ln.Feature(name="gene_target", type="category").save()
feature_set_obs = ln.FeatureSet.from_df(obs, "metadata")
✅ validated 1 Feature record on name: gene_target
🔶 did not validate 18 Feature records for names: G2M.Score, HTO_classification, MULTI_ID, NT, Phase, S.Score, guide_ID, nCount_ADT, nCount_GDO, nCount_HTO, nCount_RNA, nFeature_ADT, nFeature_HTO, nFeature_RNA, orig.ident, percent.mito, perturbation, replicate
🔶 ignoring non-validated features: G2M.Score,HTO_classification,MULTI_ID,NT,Phase,S.Score,guide_ID,nCount_ADT,nCount_GDO,nCount_HTO,nCount_RNA,nFeature_ADT,nFeature_HTO,nFeature_RNA,orig.ident,percent.mito,perturbation,replicate
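The counts in this log follow from the fact that only gene_target was registered as a Feature before the feature set was built. The sketch below reproduces that bookkeeping in plain Python, assuming the 19 obs column names shown in the MuData summary; it illustrates the filtering idea only and is not the implementation of ln.FeatureSet.from_df. The names obs_columns and registered_features are hypothetical.

```python
# The 19 obs column names, copied from the MuData summary earlier in this page.
obs_columns = [
    "orig.ident", "nCount_RNA", "nFeature_RNA", "nCount_HTO", "nFeature_HTO",
    "nCount_GDO", "nCount_ADT", "nFeature_ADT", "percent.mito", "MULTI_ID",
    "HTO_classification", "guide_ID", "gene_target", "NT", "perturbation",
    "replicate", "S.Score", "G2M.Score", "Phase",
]

# Only one feature name was registered before building the feature set.
registered_features = {"gene_target"}

# Columns with a registered name are kept; all others are reported and ignored.
validated = sorted(c for c in obs_columns if c in registered_features)
not_validated = sorted(c for c in obs_columns if c not in registered_features)

print(len(validated), len(not_validated))
```

This gives 1 validated and 18 non-validated column names, in line with the log. Registering more ln.Feature records before building the feature set would move the corresponding columns into the validated group.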
file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], "symbol")
ln.save(gene_targets)
file.add_labels(gene_targets)
💡 using global setting species = human
✅ validated 35 Gene records from Bionty on symbol: IFNGR1, IFNGR1, CAV1, IRF7, IRF7, IRF7, ATF2, NFKBIA, NFKBIA, STAT1, STAT1, SPI1, JAK2, JAK2, STAT2, STAT2, IFNGR2, IFNGR2, IFNGR2, CD86, ...
🔶 ambiguous validation in Bionty for 10 records: IFNGR1, IRF7, NFKBIA, STAT1, JAK2, STAT2, IFNGR2, SMAD4, STAT3, TNFRSF14
🔶 did not validate 2 Gene records for symbols: MARCH8, NT
🌱 linked labels 'IFNGR1', 'IFNGR1', 'CAV1', 'IRF7', 'IRF7', 'IRF7', 'ATF2', 'NFKBIA', 'NFKBIA', 'STAT1', 'STAT1', 'SPI1', 'JAK2', 'JAK2', 'STAT2', 'STAT2', 'IFNGR2', 'IFNGR2', 'IFNGR2', 'CD86', 'STAT5A', 'SMAD4', 'SMAD4', 'ETV7', 'IRF1', 'UBE2L6', 'PDCD1LG2', 'BRD4', 'POU2F2', 'STAT3', 'STAT3', 'TNFRSF14', 'TNFRSF14', 'CUL3', 'CMTM6', 'MARCH8', 'NT' to feature 'gene_target', linked feature 'gene_target' to registry 'bionty.Gene'
labels = []
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels += ln.Label.from_values(obs[col])
🔶 did not validate 8 Label records for names: Lane7, Lane4, Lane2, Lane5, Lane3, Lane8, Lane1, Lane6
🔶 did not validate 2 Label records for names: Perturbed, NT
🔶 did not validate 3 Label records for names: rep3, rep1, rep2
🔶 did not validate 3 Label records for names: G1, S, G2M
🔶 did not validate 78 Label records for names: MARCH8g2, IFNGR1g3, MARCH8g4, CAV1g4, IRF7g1, ATF2g1, NFKBIAg2, STAT1g2, SPI1g1, JAK2g3, NTg7, IFNGR1g4, NTg1, STAT2g2, IFNGR2g2, CD86g2, IFNGR2g1, STAT5Ag2, IFNGR1g2, IFNGR1g1, ...
Because none of these labels seem like something we'd want to track in the registry or validate, we don't link them to the file.
file.features
'rna': FeatureSet(id='8nn7IISox8zKtE1a7HOI', n=93, type='float', registry='bionty.Gene', hash='6Sd_y8RL6Uy6JQCuHM6Y', updated_at=2023-08-08 17:04:18, created_by_id='DzTjkKse')
'adt': FeatureSet(id='KaHiQ2OVMqLMqXuNPRyX', n=4, type='float', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-08 17:04:18, created_by_id='DzTjkKse')
'obs': FeatureSet(id='hvA2vNU077jIRCDVXZl2', name='metadata', n=1, registry='core.Feature', hash='xRjsyam7QDMxaNONkTxP', updated_at=2023-08-08 17:04:19, created_by_id='DzTjkKse')
file.describe()
💡 File(id=rwZMC3EJL4Y3dueEZF4o, key=None, suffix=.h5mu, accessor=MuData, description=Sub-sampled MuData from Papalexi21, size=606320, hash=RaivS3NesDOP-6kNIuaC3g, hash_type=md5, created_at=2023-08-08 17:04:14.130072+00:00, updated_at=2023-08-08 17:04:14.130120+00:00)
Provenance:
  storage: Storage(id='DxdUnaLi', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/datatype/test-multimodal', type='local', updated_at=2023-08-08 17:04:09, created_by_id='DzTjkKse')
  transform: Transform(id='yMWSFirS6qv2z8', name='Curate & link multi-modal data', short_name='40-multimodal', stem_id='yMWSFirS6qv2', version='0', type='notebook', updated_at=2023-08-08 17:04:14, created_by_id='DzTjkKse')
  run: Run(id='VTwgtWGbso4ZM3jIk14e', run_at=2023-08-08 17:04:13, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
  created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-08 17:04:09)
Features:
  rna:
    index (93, bionty.Gene.id): ['4lbxA8yBpezg', 'JX35T6MPAehY', '127GIit19cYJ', 'E0XKlZowRYQs', 'WefD8C4raEa5'...]
  adt:
    index (4, bionty.CellMarker.id): ['BK30rjK34sZd', 'kbrA7wdDuqDK', 'L0m6f7FPiDeg', '82nG0xqSuEQD'...]
  obs (metadata):
    gene_target (37, bionty.Gene): ['STAT2', 'CD86', 'TNFRSF14', 'STAT1', 'STAT3']
file.view_lineage()
!lamin delete test-multimodal
💡 deleting instance testuser1/test-multimodal
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
🔶 consider manually delete your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/datatype/test-multimodal