PrismEXP

An appyter interface to compute gene function predictions

This Appyter calculates predictions using the PrismEXP algorithm. It utilizes precomputed correlation scores and a pretrained model. The correlation scores are computed for 51 gene expression clusters in ARCHS4.

'LINC01089'

'GO_Biological_Process_2018'

In [1] ¶

import h5py as h5
import s3fs
import numpy as np
import pandas as pd
from tqdm import tqdm
from typing import List, Dict
import urllib.request
import json
import hashlib
import shutil
import ssl
import os
import sys
import re
import itertools
import pickle
import warnings
import scipy.stats as st
from lightgbm import LGBMRegressor
from sklearn.metrics import roc_auc_score, roc_curve, auc
from matplotlib import pyplot as plt
from IPython.display import display, FileLink, Markdown, HTML

In [2] ¶

LIBRARY_LIST_URL="https://maayanlab.cloud/speedrichr/api/listlibs"
LIBRARY_DOWNLOAD_URL="https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName="
S3_URL="s3://mssm-prismx-100/"
GMT_EXAMPLE="https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=KEGG_2019_Mouse"

In [3] ¶

# Use the selected user input
GENE = 'LINC01089'
LIBRARY = 'GO_Biological_Process_2018'
ENRICHR_MODE = True

¶

S3 Data Query ¶

The following functions are used to load gene correlation directly from precomputed correlation matrices stored on S3.

In [4] ¶


def loadGenesS3():
    genes = 0
    s3 = s3fs.S3FileSystem(anon=True)
    with h5.File(s3.open(S3_URL+"correlation_0.h5", 'rb'), 'r', lib_version='latest') as f:
        genes = np.array([s.decode("UTF-8") for s in f["meta/genes"]])
    return [x.upper() for x in list(genes)]

def loadCorrelationS3(gene, genes, cormat):
    cor = 0
    s3 = s3fs.S3FileSystem(anon=True)
    with h5.File(s3.open(S3_URL+"correlation_"+str(cormat)+".h5", 'rb'), 'r', lib_version='latest') as f:
        idx = genes.index(gene.upper())
        cor = np.array(f["data/correlation"][idx,:]).astype(np.float64)
    return(cor)

def geneCorrelation(gene):
    cormats = list(range(100))
    cormats.append("global")
    genes = loadGenesS3()
    results = []
    pbar = tqdm(cormats)
    for i in pbar:
        pbar.set_description("Retrieve %s" % i)
        results.append(loadCorrelationS3(gene, genes, i))
    results = pd.DataFrame(np.array(results).T, index=genes)
    return(results)

¶

Load GMT ¶

These functions are used to load Enrichr gene set libraries.

In [5] ¶

def getDataPath() -> str:
    path = os.path.join(
        os.path.dirname(__file__),
        'data/'
    )
    return(path)

def listLibraries():
    return(loadJSON(LIBRARY_LIST_URL)["library"])

def loadLibrary(library: str, overwrite: bool = False) -> str:
    ssl._create_default_https_context = ssl._create_unverified_context
    if not os.path.exists("gmts/"+library or overwrite):
        os.makedirs("gmts", exist_ok=True)
        print("Download Enrichr geneset library")
        urllib.request.urlretrieve(LIBRARY_DOWNLOAD_URL+library, "gmts/"+library+".gmt")
    else:
        print("File cached. To reload use loadLibrary(\""+library+"\", overwrite=True) instead.")
    lib, rlib, ugenes = readGMT("gmts/"+library+".gmt")
    print("# genesets: "+str(len(lib))+"\n# unique genes: "+str(len(ugenes)))
    return("gmts/"+library)

def printLibraries():
    libs = listLibraries()
    for i in range(0, len(libs)):
        print(str(i)+" - "+libs[i])

def loadJSON(url):
    context = ssl._create_unverified_context()
    req = urllib.request.Request(url)
    r = urllib.request.urlopen(req, context=context).read()
    return(json.loads(r.decode('utf-8')))

def readGMT(gmtFile: str, backgroundGenes: List[str] = [""], verbose=False) -> List:
    file = open(gmtFile, 'r')
    lines = file.readlines()
    library = {}
    for line in lines:
        sp = line.strip().upper().split("\t")
        sp2 = [re.sub(",.*", "",value) for value in sp[2:]]
        if len(backgroundGenes) > 1:
            backgroundGenes = [x.upper() for x in backgroundGenes]
            library[sp[0]] = [value for value in sp2 if value in backgroundGenes]
        else:
            library[sp[0]] = sp2
    ugenes = list(set(list(itertools.chain.from_iterable(library.values()))))
    ugenes.sort()
    rev_library = {}
    for ug in ugenes:
        rev_library[ug] = []
    for se in library.keys():
        for ge in library[se]:
            rev_library[ge].append(se)
    if verbose:
        print("Library loaded. Library contains "+str(len(library))+" gene sets. "+str(len(ugenes))+" unique genes found.")
    return [library, rev_library, ugenes]

¶

Initialize Gene Set Library ¶

Load library from Enrichr or use custom GMT file.

In [6] ¶

if ENRICHR_MODE:
    libraryPath = loadLibrary(LIBRARY)
else:
    os.makedirs("gmts", exist_ok=True)
    shutil.copy2(LIBRARY, "gmts/"+LIBRARY)
    LIBRARY = LIBRARY.replace(".gmt", "")

library, rev_library, ugenes = readGMT("gmts/"+LIBRARY+".gmt", loadGenesS3())

¶

Download Enrichr geneset library
# genesets: 5103

# unique genes: 14433

¶

Load Gene Correlation ¶

Gene correlation matrices are precalculated for 101 gene expression matrices from ARCHS4. The matrices are stored in AWS S3 and can be queried for target genes of interest. The extraction can take a moment to finish.

In [7] ¶

correlation = geneCorrelation(GENE)
correlation

¶

	0	1	2	3	4	5	6	7	8	9	...	91	92	93	94	95	96	97	98	99	100
A1BG	-0.078064	-0.006565	-0.038208	0.155029	0.237061	-0.084717	0.024292	-0.009850	0.312744	0.015854	...	0.061249	-0.031067	0.004425	0.125488	0.071655	0.000419	0.102844	0.112122	-0.006329	-0.033875
A1BG-AS1	-0.141479	0.002529	-0.011139	-0.031586	0.312744	0.110901	0.018082	0.003407	-0.140747	0.049561	...	0.197266	0.044128	-0.001400	-0.013306	0.099365	0.019073	0.273682	0.158203	0.130737	0.177856
A1CF	0.062500	0.050629	-0.007236	0.015549	-0.072998	-0.003386	-0.041809	-0.003250	0.034851	-0.056671	...	-0.073608	0.006592	-0.007843	0.122192	-0.146851	-0.069153	-0.314453	-0.041077	-0.166382	-0.109436
A2M	-0.127930	-0.166748	-0.019043	0.112000	0.099243	-0.088806	-0.035828	-0.032074	-0.049927	-0.037659	...	0.187744	0.086548	-0.041016	0.231445	0.020386	-0.023560	-0.103821	-0.015945	0.107300	0.066223
A2M-AS1	0.105835	0.003344	0.003853	0.018005	0.083984	-0.069519	0.001346	0.016037	0.027298	0.027878	...	0.261963	-0.010368	-0.035370	0.289795	0.093018	0.008957	0.012383	0.233765	-0.049438	0.061829
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZYG11A	-0.077026	0.010559	-0.019867	0.060089	-0.057861	0.103943	-0.002546	0.035309	0.159668	-0.045258	...	-0.130249	-0.109680	-0.020569	-0.241089	-0.136719	-0.001708	-0.066650	-0.009239	-0.142212	-0.085938
ZYG11B	-0.178101	0.080872	-0.033081	0.017258	-0.455078	-0.155029	-0.016037	0.016907	0.061768	-0.020905	...	-0.326416	-0.020599	-0.043121	0.170410	-0.302490	-0.000885	-0.330322	0.014969	-0.363037	0.124817
ZYX	-0.022614	0.170410	-0.005089	-0.113037	0.210449	0.133545	0.018219	-0.018677	0.019363	0.019745	...	0.212769	0.417236	0.019608	0.049316	0.039612	-0.027115	0.107117	0.162842	0.031067	0.303955
ZZEF1	0.157227	0.029800	0.016190	-0.002901	0.204956	0.120972	0.002829	0.010681	0.043488	-0.019257	...	0.152832	-0.017807	-0.016953	0.628906	0.054260	-0.016129	-0.084595	0.155273	0.161011	0.270996
ZZZ3	-0.246338	0.154175	0.029068	-0.153076	-0.509277	-0.048157	-0.011620	0.008774	0.044922	-0.024918	...	-0.291016	0.029495	-0.039307	-0.011200	0.006355	-0.000510	-0.285645	0.097290	-0.414062	0.144531

38721 rows × 101 columns

¶

Retrieve global: 100%|██████████| 101/101 [01:30<00:00,  1.11it/s]

In [8] ¶

def getAverageCorrelation(correlation: pd.DataFrame, library: Dict):
    avgCor = []
    for ll in list(library.keys()):
        avgCor.append(correlation.loc[[x.upper() for x in library[ll]],:].mean(axis=0))
    avgCor = pd.DataFrame(np.array(avgCor).T, columns=list(library.keys()))
    avgCor = avgCor.transpose()
    return(avgCor)

¶

Avg Correlation Scores ¶

The average correlation scores are equivalent to the predictions performed by Geneshot. The values are used as prediction features of the PrismEXP model.

In [9] ¶

avgCor = getAverageCorrelation(correlation, library)
avgCor

¶

	0	1	2	3	4	5	6	7	8	9	...	91	92	93	94	95	96	97	98	99	100
POSITIVE REGULATION OF POSTTRANSCRIPTIONAL GENE SILENCING (GO:0060148)	-0.047904	0.047157	0.009664	-0.072569	-0.159220	-0.011856	-0.001483	0.000112	-0.058082	-0.008082	...	-0.026425	0.025710	-0.011838	0.001329	-0.043085	-0.002395	-0.111916	0.018154	-0.084856	0.151921
REGULATION OF CELL CYCLE PROCESS (GO:0010564)	-0.024116	0.012065	0.007913	-0.002002	-0.050237	-0.031833	-0.007974	-0.003765	-0.029473	0.000526	...	0.023078	-0.002176	-0.007133	-0.091126	0.043201	0.001967	0.040453	0.049989	-0.048331	0.123720
ANGIOTENSIN-ACTIVATED SIGNALING PATHWAY (GO:0038166)	0.018334	0.058921	-0.013917	-0.039557	0.015252	0.028551	-0.011902	-0.002770	-0.122648	0.017391	...	0.002906	0.030363	-0.025193	-0.220912	-0.045924	-0.010670	-0.041119	-0.064610	-0.002627	0.040052
DNA-TEMPLATED TRANSCRIPTION, TERMINATION (GO:0006353)	0.002982	0.019203	0.020950	-0.038063	-0.012806	0.026153	-0.006091	0.000304	0.034575	0.000805	...	0.008419	0.000581	0.008603	-0.156055	0.140696	0.007125	0.093243	0.088050	0.010586	0.152198
REGULATION OF PROTEIN SUMOYLATION (GO:0033233)	-0.032882	-0.069005	0.003750	-0.021812	-0.000083	-0.036688	-0.008479	0.001705	0.011248	-0.008182	...	0.044316	0.037429	-0.008483	0.013826	0.031829	-0.010222	0.047991	0.060884	0.023367	0.093822
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
POSITIVE REGULATION OF MITOCHONDRIAL MEMBRANE POTENTIAL (GO:0010918)	0.007772	0.051959	0.025124	-0.031354	0.140884	0.051343	-0.005771	-0.009717	-0.025576	0.027154	...	0.136784	0.078908	0.025777	-0.140328	0.218770	0.020377	0.198151	0.088826	-0.033337	0.180868
TRNA AMINOACYLATION FOR PROTEIN TRANSLATION (GO:0006418)	-0.117703	0.036884	0.003500	-0.021754	-0.089685	-0.034542	-0.018777	0.000537	-0.038181	-0.002750	...	-0.068384	-0.000246	0.003725	-0.255207	0.131439	-0.000591	0.076741	0.097652	-0.039619	0.126190
TRYPTOPHAN METABOLIC PROCESS (GO:0006568)	-0.010589	0.035971	-0.013135	-0.022052	-0.005585	-0.019770	-0.000825	-0.008690	-0.072685	-0.017034	...	0.023837	0.013752	-0.006267	0.036244	-0.010660	0.001597	-0.069653	-0.012646	0.090252	-0.021786
GUANOSINE-CONTAINING COMPOUND METABOLIC PROCESS (GO:1901068)	-0.106287	0.018205	0.013382	-0.020149	-0.088438	-0.075312	-0.000874	-0.005951	-0.039151	-0.004695	...	-0.115871	0.058095	0.000103	-0.138862	-0.017191	-0.000083	-0.099719	0.058921	-0.104123	0.125642
TRNA MODIFICATION (GO:0006400)	-0.066722	0.030451	0.001362	-0.062949	-0.042465	-0.017996	-0.015938	0.002405	0.004143	0.000143	...	-0.062400	-0.002377	0.010917	-0.146757	0.135090	0.005917	0.067695	0.113063	-0.013959	0.125337

5103 rows × 101 columns

In [10] ¶

def loadPrismXModel():
    os.makedirs("model", exist_ok=True)
    urllib.request.urlretrieve("https://"+S3_URL.replace("s3:", "").replace("/", "")+".s3.amazonaws.com/prismxmodel.pkl", "model/prismxmodel.pkl")
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        model = pickle.load(open("model/prismxmodel.pkl", 'rb'))
    return(model)

def prismxPredictions(avgcor, model, verbose: bool=False) -> pd.DataFrame:
    avgcor = avgcor.fillna(0)
    predictions = pd.DataFrame(model.predict(avgcor))
    predictions.index = avgcor.index
    predictions.columns = ["predictions"]
    mean = np.mean(predictions[:], axis=0)
    std = np.std(predictions[:], axis=0)
    zscore = (predictions - mean)/std
    predictions["z-score"] = zscore
    pvalue = 1-st.norm.cdf(zscore)
    predictions["p-value"] = pvalue
    predictions["bonferroni"] = np.where(pvalue*len(pvalue) >= 1, 1, pvalue*len(pvalue))
    return predictions

¶

Apply PrismEXP ¶

The pretrained model is loaded and applied on the correlation features of 51 correlation matrices.

In [11] ¶

model = loadPrismXModel()
predictions = prismxPredictions(avgCor, model)
predictions.to_csv(GENE+"_"+LIBRARY+"_predictions.tsv", sep="\t")
top_predictions = pd.DataFrame(predictions.sort_values(by=["predictions"], ascending=False).iloc[0:20:])

In [12] ¶

def plotROC(fpr, tpr, auc):
    plt.figure(figsize=(12, 7), dpi= 300)
    plt.plot([0, 1], [0, 1], linestyle='--', label='baseline')
    plt.plot(fpr, tpr, linestyle='--', label='PrismEXP')
    plt.text(0.65, 0.1, "AUC: "+str(np.round(auc, decimals=3)), fontsize=18)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.savefig(GENE+"_"+LIBRARY+"_ROC.pdf")  
    plt.savefig(GENE+"_"+LIBRARY+"_ROC.png", dpi=300)
    plt.show()

def calculateGeneAUC(prediction: pd.DataFrame, gene: str, rev_library: Dict, minLibSize: int=1) -> List[float]:
    if gene in list(rev_library.keys()):
        gold = [i in rev_library[gene] for i in prediction.index]
        fpr, tpr, _ = roc_curve(list(gold), list(prediction.iloc[:,0]))
        roc_auc = auc(fpr, tpr)
        plotROC(fpr, tpr, roc_auc)
        return(auc)
    else:
        print("Not enough information to calcualte AUC")
        return(0)

¶

Prediction Validation ¶

The ROC curve shows how well previously known annotations for the gene have been recovered by PrismEXP. For this, all gene sets in the library are ranked by the prediction score in descending order. Previously known associations should rank high. The AUC can vary by gene and gene set library.

In [13] ¶

calculateGeneAUC(predictions, GENE, rev_library)

¶

Not enough information to calcualte AUC

¶

Top Predictions ¶

The top 20 predictions are shown below. The p-values are calculated from the z-scores of the gene set library.

In [14] ¶

top_predictions

¶

	predictions	z-score	p-value	bonferroni
MITOTIC G2/M TRANSITION CHECKPOINT (GO:0044818)	0.554333	6.319960	1.308151e-10	6.675496e-07
MRNA SPLICE SITE SELECTION (GO:0006376)	0.521577	5.883586	2.007361e-09	1.024356e-05
PROTEIN DEACETYLATION (GO:0006476)	0.512762	5.766140	4.055393e-09	2.069467e-05
RNA 3'-END PROCESSING (GO:0031123)	0.491285	5.480025	2.126328e-08	1.085065e-04
MITOTIC G2 DNA DAMAGE CHECKPOINT (GO:0007095)	0.485201	5.398975	3.351145e-08	1.710089e-04
HISTONE LYSINE METHYLATION (GO:0034968)	0.465652	5.138538	1.384418e-07	7.064686e-04
NEGATIVE REGULATION OF MRNA SPLICING, VIA SPLICEOSOME (GO:0048025)	0.454867	4.994863	2.943874e-07	1.502259e-03
RESPONSE TO LIGHT STIMULUS (GO:0009416)	0.449716	4.926239	4.191371e-07	2.138857e-03
MRNA 3'-END PROCESSING (GO:0031124)	0.441173	4.812428	7.455377e-07	3.804479e-03
POSITIVE REGULATION OF INTRACELLULAR STEROID HORMONE RECEPTOR SIGNALING PATHWAY (GO:0033145)	0.438947	4.782772	8.644715e-07	4.411398e-03
NEGATIVE REGULATION OF TRANSCRIPTION, DNA-TEMPLATED (GO:0045892)	0.438937	4.782636	8.650565e-07	4.414384e-03
REGULATION OF GENE EXPRESSION (GO:0010468)	0.436923	4.755806	9.882833e-07	5.043210e-03
REGULATION OF CELLULAR MACROMOLECULE BIOSYNTHETIC PROCESS (GO:2000112)	0.432196	4.692831	1.347251e-06	6.875020e-03
NEGATIVE REGULATION OF NUCLEIC ACID-TEMPLATED TRANSCRIPTION (GO:1903507)	0.430712	4.673064	1.483698e-06	7.571313e-03
REGULATION OF TRANSCRIPTION, DNA-TEMPLATED (GO:0006355)	0.429928	4.662614	1.561093e-06	7.966256e-03
NEGATIVE REGULATION OF CELLULAR MACROMOLECULE BIOSYNTHETIC PROCESS (GO:2000113)	0.428903	4.648956	1.668095e-06	8.512289e-03
WNT SIGNALING PATHWAY, CALCIUM MODULATING PATHWAY (GO:0007223)	0.428317	4.641161	1.732284e-06	8.839846e-03
REGULATION OF NUCLEIC ACID-TEMPLATED TRANSCRIPTION (GO:1903506)	0.420272	4.533974	2.894213e-06	1.476917e-02
STEROID HORMONE MEDIATED SIGNALING PATHWAY (GO:0043401)	0.395059	4.198089	1.345886e-05	6.868057e-02
REGULATION OF ALTERNATIVE MRNA SPLICING, VIA SPLICEOSOME (GO:0000381)	0.393508	4.177424	1.474146e-05	7.522565e-02

¶

Download Files ¶

Download full prediction table and high resolution ROC plots. (If plots are opened in the browser the images might require a refresh of the browser page to display)

In [15] ¶

display(FileLink(GENE+"_"+LIBRARY+"_predictions.tsv", result_html_prefix=str('Download prediction table: ')))
display(FileLink(GENE+"_"+LIBRARY+"_ROC.pdf", result_html_prefix=str('Download PDF: ')))
display(FileLink(GENE+"_"+LIBRARY+"_ROC.png", result_html_prefix=str('Download PNG: ')))

¶

Download prediction table: LINC01089_GO_Biological_Process_2018_predictions.tsv

¶

Path (LINC01089_GO_Biological_Process_2018_ROC.pdf) doesn't exist. It may still be in the process of being generated, or you may have the incorrect path.

¶

Path (LINC01089_GO_Biological_Process_2018_ROC.png) doesn't exist. It may still be in the process of being generated, or you may have the incorrect path.

¶

References ¶

[1] Kuleshov, Maxim V., Matthew R. Jones, Andrew D. Rouillard, Nicolas F. Fernandez, Qiaonan Duan, Zichen Wang, Simon Koplev et al. "Enrichr: a comprehensive gene set enrichment analysis web server 2016 update." Nucleic acids research 44, no. W1 (2016): W90-W97.

[2] Lachmann, Alexander, Brian M. Schilder, Megan L. Wojciechowicz, Denis Torre, Maxim V. Kuleshov, Alexandra B. Keenan, and Avi Ma’ayan. "Geneshot: search engine for ranking genes from arbitrary text queries." Nucleic acids research 47, no. W1 (2019): W571-W577.

[3] Lachmann, Alexander, Denis Torre, Alexandra B. Keenan, Kathleen M. Jagodnik, Hoyjin J. Lee, Lily Wang, Moshe C. Silverstein, and Avi Ma’ayan. "Massive mining of publicly available RNA-seq data from human and mouse." Nature communications 9, no. 1 (2018): 1-10.

[4] Wang, Yuhang, Fillia S. Makedon, James C. Ford, and Justin Pearlman. "HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data." Bioinformatics 21, no. 8 (2005): 1530-1537.

¶