PrismEXP

An appyter interface to compute gene function predictions

This Appyter calculates predictions using the PrismEXP algorithm. It utilizes precomputed correlation scores and a pretrained model. The correlation scores are computed for 51 gene expression clusters in ARCHS4.

'SPP2'

'GO_Biological_Process_2018'

In [1] ¶

import h5py as h5
import s3fs
import numpy as np
import pandas as pd
from tqdm import tqdm
from typing import List, Dict
import urllib.request
import json
import hashlib
import shutil
import ssl
import os
import sys
import re
import itertools
import pickle
import warnings
import scipy.stats as st
from lightgbm import LGBMRegressor
from sklearn.metrics import roc_auc_score, roc_curve, auc
from matplotlib import pyplot as plt
from IPython.display import display, FileLink, Markdown, HTML

In [2] ¶

LIBRARY_LIST_URL="https://maayanlab.cloud/speedrichr/api/listlibs"
LIBRARY_DOWNLOAD_URL="https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName="
S3_URL="s3://mssm-prismx-100/"
GMT_EXAMPLE="https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=KEGG_2019_Mouse"

In [3] ¶

# Use the selected user input
GENE = 'SPP2'
LIBRARY = 'GO_Biological_Process_2018'
ENRICHR_MODE = True

¶

S3 Data Query ¶

The following functions are used to load gene correlation directly from precomputed correlation matrices stored on S3.

In [4] ¶


def loadGenesS3():
    genes = 0
    s3 = s3fs.S3FileSystem(anon=True)
    with h5.File(s3.open(S3_URL+"correlation_0.h5", 'rb'), 'r', lib_version='latest') as f:
        genes = np.array([s.decode("UTF-8") for s in f["meta/genes"]])
    return [x.upper() for x in list(genes)]

def loadCorrelationS3(gene, genes, cormat):
    cor = 0
    s3 = s3fs.S3FileSystem(anon=True)
    with h5.File(s3.open(S3_URL+"correlation_"+str(cormat)+".h5", 'rb'), 'r', lib_version='latest') as f:
        idx = genes.index(gene.upper())
        cor = np.array(f["data/correlation"][idx,:]).astype(np.float64)
    return(cor)

def geneCorrelation(gene):
    cormats = list(range(100))
    cormats.append("global")
    genes = loadGenesS3()
    results = []
    pbar = tqdm(cormats)
    for i in pbar:
        pbar.set_description("Retrieve %s" % i)
        results.append(loadCorrelationS3(gene, genes, i))
    results = pd.DataFrame(np.array(results).T, index=genes)
    return(results)

¶

Load GMT ¶

These functions are used to load Enrichr gene set libraries.

In [5] ¶

def getDataPath() -> str:
    path = os.path.join(
        os.path.dirname(__file__),
        'data/'
    )
    return(path)

def listLibraries():
    return(loadJSON(LIBRARY_LIST_URL)["library"])

def loadLibrary(library: str, overwrite: bool = False) -> str:
    ssl._create_default_https_context = ssl._create_unverified_context
    if not os.path.exists("gmts/"+library or overwrite):
        os.makedirs("gmts", exist_ok=True)
        print("Download Enrichr geneset library")
        urllib.request.urlretrieve(LIBRARY_DOWNLOAD_URL+library, "gmts/"+library+".gmt")
    else:
        print("File cached. To reload use loadLibrary(\""+library+"\", overwrite=True) instead.")
    lib, rlib, ugenes = readGMT("gmts/"+library+".gmt")
    print("# genesets: "+str(len(lib))+"\n# unique genes: "+str(len(ugenes)))
    return("gmts/"+library)

def printLibraries():
    libs = listLibraries()
    for i in range(0, len(libs)):
        print(str(i)+" - "+libs[i])

def loadJSON(url):
    context = ssl._create_unverified_context()
    req = urllib.request.Request(url)
    r = urllib.request.urlopen(req, context=context).read()
    return(json.loads(r.decode('utf-8')))

def readGMT(gmtFile: str, backgroundGenes: List[str] = [""], verbose=False) -> List:
    file = open(gmtFile, 'r')
    lines = file.readlines()
    library = {}
    for line in lines:
        sp = line.strip().upper().split("\t")
        sp2 = [re.sub(",.*", "",value) for value in sp[2:]]
        if len(backgroundGenes) > 1:
            backgroundGenes = [x.upper() for x in backgroundGenes]
            library[sp[0]] = [value for value in sp2 if value in backgroundGenes]
        else:
            library[sp[0]] = sp2
    ugenes = list(set(list(itertools.chain.from_iterable(library.values()))))
    ugenes.sort()
    rev_library = {}
    for ug in ugenes:
        rev_library[ug] = []
    for se in library.keys():
        for ge in library[se]:
            rev_library[ge].append(se)
    if verbose:
        print("Library loaded. Library contains "+str(len(library))+" gene sets. "+str(len(ugenes))+" unique genes found.")
    return [library, rev_library, ugenes]

¶

Initialize Gene Set Library ¶

Load library from Enrichr or use custom GMT file.

In [6] ¶

if ENRICHR_MODE:
    libraryPath = loadLibrary(LIBRARY)
else:
    os.makedirs("gmts", exist_ok=True)
    shutil.copy2(LIBRARY, "gmts/"+LIBRARY)
    LIBRARY = LIBRARY.replace(".gmt", "")

library, rev_library, ugenes = readGMT("gmts/"+LIBRARY+".gmt", loadGenesS3())

¶

Download Enrichr geneset library
# genesets: 5103

# unique genes: 14433

¶

Load Gene Correlation ¶

Gene correlation matrices are precalculated for 101 gene expression matrices from ARCHS4. The matrices are stored in AWS S3 and can be queried for target genes of interest. The extraction can take a moment to finish.

In [7] ¶

correlation = geneCorrelation(GENE)
correlation

¶

	0	1	2	3	4	5	6	7	8	9	...	91	92	93	94	95	96	97	98	99	100
A1BG	0.014061	-0.006161	-0.023270	0.013496	0.035187	-0.144653	0.073975	-0.034302	0.030594	0.010345	...	-0.028610	0.089539	-0.011902	-0.025391	-0.044281	0.035034	0.702148	0.155518	-0.046021	0.148560
A1BG-AS1	0.047333	0.034546	0.022079	-0.121887	-0.366699	-0.143555	0.027695	-0.002596	-0.144653	-0.012680	...	-0.073364	-0.073914	-0.077148	-0.099487	-0.083801	-0.020950	0.399658	-0.113281	-0.349121	-0.107361
A1CF	0.034271	0.019882	0.017105	0.121216	0.303223	0.075684	0.115967	0.014435	0.069092	0.014740	...	0.102905	0.110107	0.016891	0.105408	-0.021042	0.044250	0.245239	0.210693	0.173340	0.268066
A2M	0.047516	-0.138306	0.031021	-0.049316	-0.326904	-0.043518	0.106689	-0.004215	0.072998	0.000233	...	-0.011024	0.012398	0.070007	0.015778	-0.076477	0.014206	0.343994	0.176025	0.304688	0.018036
A2M-AS1	0.043884	0.010223	0.028976	-0.026871	-0.162842	-0.025833	0.040070	-0.020874	0.066589	0.000164	...	-0.036469	0.025818	0.019974	0.049164	-0.031204	0.017746	0.413330	-0.001584	0.004059	0.015823
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZYG11A	0.153076	0.009117	0.030334	0.249390	0.113831	0.004910	-0.006187	0.029861	0.231445	0.027328	...	-0.015175	-0.000297	-0.032776	0.101196	0.017807	0.004955	-0.222412	-0.051514	-0.177490	0.107422
ZYG11B	0.137695	0.037506	-0.028885	0.117493	0.258301	0.205200	0.014717	-0.022903	0.071472	-0.025940	...	0.134644	0.018631	0.027908	0.119141	0.128540	0.026535	0.035034	0.002989	0.300049	-0.087463
ZYX	-0.083740	0.134521	-0.065796	0.076599	-0.332764	-0.121765	0.001639	-0.029114	-0.074158	-0.042419	...	-0.175903	-0.014992	-0.043823	-0.170166	-0.015961	-0.031250	-0.380859	-0.041992	-0.327148	-0.264893
ZZEF1	0.034637	0.047028	0.025635	0.043304	0.128662	0.024490	0.037872	0.002951	0.191772	0.004707	...	-0.032593	0.014328	-0.046875	0.103027	0.200684	0.011566	-0.215698	0.018097	0.373291	-0.202759
ZZZ3	0.205566	0.003960	0.011383	-0.093262	0.319580	0.198364	0.019318	0.019882	0.091797	-0.016235	...	0.164185	0.014252	0.037750	0.196167	0.039215	-0.001623	-0.062408	-0.052032	0.363770	-0.135864

38721 rows × 101 columns

¶

Retrieve global: 100%|██████████| 101/101 [01:10<00:00,  1.43it/s]

In [8] ¶

def getAverageCorrelation(correlation: pd.DataFrame, library: Dict):
    avgCor = []
    for ll in list(library.keys()):
        avgCor.append(correlation.loc[[x.upper() for x in library[ll]],:].mean(axis=0))
    avgCor = pd.DataFrame(np.array(avgCor).T, columns=list(library.keys()))
    avgCor = avgCor.transpose()
    return(avgCor)

¶

Avg Correlation Scores ¶

The average correlation scores are equivalent to the predictions performed by Geneshot. The values are used as prediction features of the PrismEXP model.

In [9] ¶

avgCor = getAverageCorrelation(correlation, library)
avgCor

¶

	0	1	2	3	4	5	6	7	8	9	...	91	92	93	94	95	96	97	98	99	100
POSITIVE REGULATION OF POSTTRANSCRIPTIONAL GENE SILENCING (GO:0060148)	0.028530	0.013445	-0.030144	-0.017905	0.019764	0.054787	0.009315	-0.007164	0.093160	-0.020029	...	0.009828	0.042973	-0.008694	0.074541	0.030839	-0.003144	-0.185386	-0.022912	0.099177	-0.162162
REGULATION OF CELL CYCLE PROCESS (GO:0010564)	0.014668	0.010467	-0.004776	-0.021308	-0.050975	0.000755	-0.010102	0.000393	-0.007776	-0.017524	...	-0.027919	0.010951	-0.022117	-0.032659	-0.015857	-0.002574	-0.129623	-0.052344	-0.073341	-0.156181
ANGIOTENSIN-ACTIVATED SIGNALING PATHWAY (GO:0038166)	-0.007368	0.060458	0.011759	-0.081643	-0.065805	0.000578	0.021345	0.022021	0.003771	0.010385	...	0.006148	0.022616	0.004354	-0.041443	-0.033170	-0.007148	0.013990	0.035081	-0.015564	-0.045323
DNA-TEMPLATED TRANSCRIPTION, TERMINATION (GO:0006353)	0.028010	0.011250	-0.045756	-0.081207	-0.072320	0.031386	-0.017687	-0.017492	-0.017730	-0.029685	...	-0.027694	0.038335	-0.027002	-0.025652	-0.028719	-0.020379	-0.080913	-0.111904	-0.152055	-0.187510
REGULATION OF PROTEIN SUMOYLATION (GO:0033233)	0.007242	-0.060966	-0.030385	-0.037269	-0.054569	-0.004475	0.002390	-0.000588	0.004087	-0.025772	...	-0.016046	0.017532	-0.008485	0.033519	-0.007733	-0.017392	-0.109492	-0.050744	-0.159023	-0.122839
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
POSITIVE REGULATION OF MITOCHONDRIAL MEMBRANE POTENTIAL (GO:0010918)	-0.079254	0.036510	-0.030518	-0.041734	-0.276245	-0.091207	0.009480	-0.009392	-0.097407	-0.014893	...	-0.123800	0.068288	-0.043008	-0.162608	-0.096602	-0.012653	0.005414	-0.108073	-0.259852	-0.201010
TRNA AMINOACYLATION FOR PROTEIN TRANSLATION (GO:0006418)	0.019276	0.024175	-0.014499	-0.062649	-0.093210	-0.005923	-0.016060	-0.009635	-0.073474	-0.016146	...	-0.052658	0.013889	-0.042471	-0.038830	-0.075434	-0.018588	-0.074383	-0.104355	-0.228878	-0.177006
TRYPTOPHAN METABOLIC PROCESS (GO:0006568)	-0.000142	0.041794	0.026452	-0.109806	-0.022050	-0.003532	0.066672	0.026117	-0.014624	0.019155	...	0.009325	0.018199	0.018393	-0.020126	0.001466	0.021132	0.305872	0.101912	0.122851	0.101353
GUANOSINE-CONTAINING COMPOUND METABOLIC PROCESS (GO:1901068)	0.021357	0.001004	-0.025620	-0.059650	-0.069775	0.039845	-0.011044	-0.006068	-0.033152	-0.022824	...	0.017245	-0.001832	-0.018181	-0.038015	-0.012296	-0.016532	-0.065089	-0.015273	0.033738	-0.137028
TRNA MODIFICATION (GO:0006400)	0.031546	0.021881	-0.004469	-0.061516	-0.038072	0.006053	-0.013484	-0.007689	-0.051600	-0.008988	...	-0.026328	0.043957	-0.054939	0.012906	-0.050238	-0.007103	-0.075182	-0.087967	-0.195260	-0.159961

5103 rows × 101 columns

In [10] ¶

def loadPrismXModel():
    os.makedirs("model", exist_ok=True)
    urllib.request.urlretrieve("https://"+S3_URL.replace("s3:", "").replace("/", "")+".s3.amazonaws.com/prismxmodel.pkl", "model/prismxmodel.pkl")
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        model = pickle.load(open("model/prismxmodel.pkl", 'rb'))
    return(model)

def prismxPredictions(avgcor, model, verbose: bool=False) -> pd.DataFrame:
    avgcor = avgcor.fillna(0)
    predictions = pd.DataFrame(model.predict(avgcor))
    predictions.index = avgcor.index
    predictions.columns = ["predictions"]
    mean = np.mean(predictions[:], axis=0)
    std = np.std(predictions[:], axis=0)
    zscore = (predictions - mean)/std
    predictions["z-score"] = zscore
    pvalue = 1-st.norm.cdf(zscore)
    predictions["p-value"] = pvalue
    predictions["bonferroni"] = np.where(pvalue*len(pvalue) >= 1, 1, pvalue*len(pvalue))
    return predictions

¶

Apply PrismEXP ¶

The pretrained model is loaded and applied on the correlation features of 51 correlation matrices.

In [11] ¶

model = loadPrismXModel()
predictions = prismxPredictions(avgCor, model)
predictions.to_csv(GENE+"_"+LIBRARY+"_predictions.tsv", sep="\t")
top_predictions = pd.DataFrame(predictions.sort_values(by=["predictions"], ascending=False).iloc[0:20:])

In [12] ¶

def plotROC(fpr, tpr, auc):
    plt.figure(figsize=(12, 7), dpi= 300)
    plt.plot([0, 1], [0, 1], linestyle='--', label='baseline')
    plt.plot(fpr, tpr, linestyle='--', label='PrismEXP')
    plt.text(0.65, 0.1, "AUC: "+str(np.round(auc, decimals=3)), fontsize=18)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.savefig(GENE+"_"+LIBRARY+"_ROC.pdf")  
    plt.savefig(GENE+"_"+LIBRARY+"_ROC.png", dpi=300)
    plt.show()

def calculateGeneAUC(prediction: pd.DataFrame, gene: str, rev_library: Dict, minLibSize: int=1) -> List[float]:
    if gene in list(rev_library.keys()):
        gold = [i in rev_library[gene] for i in prediction.index]
        fpr, tpr, _ = roc_curve(list(gold), list(prediction.iloc[:,0]))
        roc_auc = auc(fpr, tpr)
        plotROC(fpr, tpr, roc_auc)
        return(auc)
    else:
        print("Not enough information to calcualte AUC")
        return(0)

¶

Prediction Validation ¶

The ROC curve shows how well previously known annotations for the gene have been recovered by PrismEXP. For this, all gene sets in the library are ranked by the prediction score in descending order. Previously known associations should rank high. The AUC can vary by gene and gene set library.

In [13] ¶

calculateGeneAUC(predictions, GENE, rev_library)

¶

Not enough information to calcualte AUC

¶

Top Predictions ¶

The top 20 predictions are shown below. The p-values are calculated from the z-scores of the gene set library.

In [14] ¶

top_predictions

¶

	predictions	z-score	p-value	bonferroni
LIPID HYDROXYLATION (GO:0002933)	1.012105	7.982877	6.661338e-16	3.399281e-12
OMEGA-HYDROXYLASE P450 PATHWAY (GO:0097267)	0.899964	7.012590	1.169731e-12	5.969137e-09
DRUG CATABOLIC PROCESS (GO:0042737)	0.878997	6.831172	4.211187e-12	2.148969e-08
POSITIVE REGULATION OF CYTOLYSIS (GO:0045919)	0.810521	6.238696	2.206170e-10	1.125808e-06
PEPTIDYL-AMINO ACID MODIFICATION (GO:0018193)	0.803662	6.179344	3.218416e-10	1.642357e-06
PRIMARY ALCOHOL CATABOLIC PROCESS (GO:0034310)	0.799828	6.146172	3.968755e-10	2.025256e-06
EXOGENOUS DRUG CATABOLIC PROCESS (GO:0042738)	0.795264	6.106682	5.086189e-10	2.595482e-06
REGULATION OF SARCOMERE ORGANIZATION (GO:0060297)	0.789959	6.060782	6.773067e-10	3.456296e-06
MONOTERPENOID METABOLIC PROCESS (GO:0016098)	0.788461	6.047821	7.340897e-10	3.746060e-06
TERPENOID METABOLIC PROCESS (GO:0006721)	0.732701	5.565360	1.308058e-08	6.675019e-05
ANDROGEN METABOLIC PROCESS (GO:0008209)	0.725586	5.503805	1.858403e-08	9.483429e-05
TRYPTOPHAN METABOLIC PROCESS (GO:0006568)	0.721988	5.472674	2.216479e-08	1.131069e-04
MANGANESE ION TRANSPORT (GO:0006828)	0.716715	5.427046	2.864714e-08	1.461863e-04
SULFATION (GO:0051923)	0.698839	5.272374	6.733514e-08	3.436112e-04
STEROID CATABOLIC PROCESS (GO:0006706)	0.697626	5.261879	7.129529e-08	3.638199e-04
CELLULAR MACROMOLECULAR COMPLEX ASSEMBLY (GO:0034622)	0.692257	5.215425	9.169820e-08	4.679359e-04
ORGANOPHOSPHATE CATABOLIC PROCESS (GO:0046434)	0.682374	5.129914	1.449369e-07	7.396132e-04
EPOXYGENASE P450 PATHWAY (GO:0019373)	0.680279	5.111791	1.595590e-07	8.142295e-04
ETHANOL CATABOLIC PROCESS (GO:0006068)	0.679597	5.105884	1.646255e-07	8.400837e-04
ETHANOL METABOLIC PROCESS (GO:0006067)	0.679597	5.105884	1.646255e-07	8.400837e-04

¶

Download Files ¶

Download full prediction table and high resolution ROC plots. (If plots are opened in the browser the images might require a refresh of the browser page to display)

In [15] ¶

display(FileLink(GENE+"_"+LIBRARY+"_predictions.tsv", result_html_prefix=str('Download prediction table: ')))
display(FileLink(GENE+"_"+LIBRARY+"_ROC.pdf", result_html_prefix=str('Download PDF: ')))
display(FileLink(GENE+"_"+LIBRARY+"_ROC.png", result_html_prefix=str('Download PNG: ')))

¶

Download prediction table: SPP2_GO_Biological_Process_2018_predictions.tsv

¶

Path (SPP2_GO_Biological_Process_2018_ROC.pdf) doesn't exist. It may still be in the process of being generated, or you may have the incorrect path.

¶

Path (SPP2_GO_Biological_Process_2018_ROC.png) doesn't exist. It may still be in the process of being generated, or you may have the incorrect path.

¶

References ¶

[1] Kuleshov, Maxim V., Matthew R. Jones, Andrew D. Rouillard, Nicolas F. Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev et al. "Enrichr: a comprehensive gene set enrichment analysis web server 2016 update." Nucleic acids research 44, no. W1 (2016): W90-W97.

[2] Lachmann, Alexander, Brian M. Schilder, Megan L. Wojciechowicz, Denis Torre, Maxim V. Kuleshov, Alexandra B. Keenan, and Avi Ma’ayan. "Geneshot: search engine for ranking genes from arbitrary text queries." Nucleic acids research 47, no. W1 (2019): W571-W577.

[3] Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hoyjin J. Lee, Lily Wang, Moshe C. Silverstein, and Avi Ma’ayan. "Massive mining of publicly available RNA-seq data from human and mouse." Nature communications 9, no. 1 (2018): 1-10.

[4] Wang, Yuhang, Fillia S. Makedon, James C. Ford, and Justin Pearlman. "HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data." Bioinformatics 21, no. 8 (2005): 1530-1537.

¶