Multivariate count data analysis and omics data integration

Program

Morning

9h30-10h00 Welcome coffee
10h00-10h15 Opening presentation
10h15-11h Bastien Batardière, INRAE, AgroParisTech, MIA-Paris-Saclay, {pyPLNmodels}: A Python package to analyze multivariate high-dimensional count data

{pyPLNmodels} is a Python library for analyzing multivariate count data with the Poisson lognormal (PLN) model and its variants. Designed to handle large datasets, it extracts key information from complex data. It is particularly well suited to RNA-seq data analysis, where it normalizes counts and identifies relationships between genes.

Main features: The library offers advanced modeling of count data, estimation of key parameters such as regression coefficients and the covariance matrix, and data normalization for easier interpretation. It also supports the analysis of correlations and covariate weights, and it scales to high-dimensional data.

Implemented models: PLN is the base model for count data analysis. PlnPCA performs dimension reduction through a probabilistic PCA adapted to count data. ZIPln accounts for zero inflation to better model data containing many zeros.
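As an illustration of the base model, the sketch below simulates counts from a Poisson lognormal model using NumPy only. It does not use the {pyPLNmodels} API; the function name `simulate_pln`, the dimensions and the parameter values are assumptions chosen for the example.

```python
import numpy as np

def simulate_pln(X, B, Sigma, rng):
    """Draw counts from a Poisson lognormal model.

    Latent layer:   Z_i ~ N(X_i B, Sigma)
    Observed layer: Y_ij | Z_ij ~ Poisson(exp(Z_ij))
    """
    n, p = X.shape[0], B.shape[1]
    chol = np.linalg.cholesky(Sigma)              # Sigma = chol @ chol.T
    Z = X @ B + rng.standard_normal((n, p)) @ chol.T
    Y = rng.poisson(np.exp(Z))
    return Y, Z

rng = np.random.default_rng(0)
n, d, p = 200, 2, 5                               # samples, covariates, variables (hypothetical sizes)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
B = rng.normal(scale=0.3, size=(d, p))            # regression coefficients
B[0] += 1.0                                       # shift the intercept row so counts are not tiny
Sigma = 0.5 * np.eye(p) + 0.2                     # positive-definite latent covariance
Y, Z = simulate_pln(X, B, Sigma, rng)
```

Fitting a PLN model amounts to recovering `B` and `Sigma` from `Y` and `X` alone; the latent `Z` is never observed.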

Website: https://pypi.org/project/pyPLNmodels/ and the slides.

11h00-11h20 El Hacene DJAOUT, LPSM, Sorbonne Université, France, {VaRaPS}: A Python package for estimating SARS-CoV-2 variant proportions from pooled sequencing data

Wastewater-based epidemiology has proven to be a powerful tool for monitoring SARS-CoV-2 variants. We present {VaRaPS} (Variants Ratios from Pooled Sequencing), a Python package that optimizes core statistical algorithms to efficiently estimate SARS-CoV-2 lineage proportions from pooled wastewater sequencing data. {VaRaPS} includes a custom variant-identification algorithm designed specifically for pooled sequencing data, which exploits read-level mutation information and uses sparse matrix representations for efficient data handling.

{VaRaPS} considerably improves the computational efficiency of existing methods (Freyja, LCS and VirPool), reaching speed-ups of up to 3,570-fold for optimization time and 103-fold for data extraction, while maintaining or improving estimation accuracy. The package includes an error-rate parameter to account for sequencing errors, further refining statistical accuracy. Benchmarks on synthetic datasets show that VaRaPS performs better at estimating lineage prevalence, with improved accuracy and consistency across all implemented methods.
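To make the estimation problem concrete, here is a minimal sketch of a generic mixture-proportion estimator with a sequencing error rate, fitted by expectation-maximisation. It is not the {VaRaPS} implementation or API: the function names, the toy mutation profiles and the assumption that every read covers every site are hypothetical, and dense arrays are used where the package relies on sparse matrices.

```python
import numpy as np

def read_likelihoods(reads, profiles, error_rate):
    """P(read | variant) for 0/1 mutation calls, with independent per-site errors.

    reads:    (R, M) array, 1 if the mutation is observed on the read
    profiles: (K, M) array, 1 if the variant carries the mutation
    """
    match = reads[:, None, :] == profiles[None, :, :]
    per_site = np.where(match, 1.0 - error_rate, error_rate)
    return per_site.prod(axis=2)                  # shape (R, K)

def em_proportions(lik, n_iter=200):
    """EM for mixture weights when the component likelihoods are fixed."""
    pi = np.full(lik.shape[1], 1.0 / lik.shape[1])
    for _ in range(n_iter):
        weighted = lik * pi                       # E-step: joint weights
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        pi = resp.mean(axis=0)                    # M-step: average responsibility
    return pi

# Toy data: three hypothetical variants defined over six mutation sites.
rng = np.random.default_rng(1)
profiles = np.array([[1, 1, 0, 0, 0, 0],
                     [0, 0, 1, 1, 0, 0],
                     [0, 0, 0, 0, 1, 1]])
true_pi = np.array([0.6, 0.3, 0.1])
error_rate = 0.01
labels = rng.choice(3, size=2000, p=true_pi)
reads = profiles[labels]
flips = rng.random(reads.shape) < error_rate      # simulated sequencing errors
reads = np.where(flips, 1 - reads, reads)

estimated_pi = em_proportions(read_likelihoods(reads, profiles, error_rate))
```

Because the variant profiles are fixed, only the proportions are updated, which keeps each EM iteration to a matrix product and a column average.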

The package's modular architecture and user-oriented design ease its application in real-world settings. {VaRaPS} represents a significant advance in bioinformatics and biostatistics tools for viral lineage analysis, offering a robust and efficient solution for wastewater-based epidemiology in the post-pandemic era.

11h20-11h40 Maximilien Colange, Abdelkader Behdenna, Epigene Labs, France, Bridging the Gap Between R and Python in Bulk Transcriptomic Data Analysis with InMoose

We introduce {InMoose}, an open-source Python environment for omic data analysis. Due to its wide adoption, Python has become a de facto standard in fields increasingly important for bioinformatic pipelines, such as data science, machine learning, or AI. As a general-purpose language, Python is also recognized for its versatility and scalability. {InMoose} aims to bring state-of-the-art tools, historically written in R, to the Python ecosystem. Our intent is to provide a drop-in replacement for R tools, so our approach focuses on faithfulness to the outcomes of the original tools. The first development phase has focused on bulk transcriptomic data, with current capabilities encompassing data simulation, batch effect correction, differential analysis and meta-analysis. {InMoose} offers a Python implementation of several state-of-the-art tools originally written in R: {ComBat} and {ComBat-Seq} (batch effect correction), {edgeR}, {DESeq2}, {limma} (differential gene expression analysis), {splatter} (RNA-Seq data simulation).

To our knowledge, {InMoose} is the sole Python implementation of {ComBat-Seq}, {edgeR} and {limma}. {InMoose} also offers original features: a quality control report for cohorts built through the batch effect correction features; a differential gene expression meta-analysis module.

We compare {InMoose} with the original R implementations and with alternative Python implementations when available. Our experiments show that the results of {InMoose} are very similar, if not identical, to those of the original R tools. This positions {InMoose} as a key tool to bridge the R and Python ecosystems and to ensure reproducibility and comparability between R-based and Python-based bioinformatics pipelines.

Slides : here
Package : https://github.com/epigenelabs/inmoose
References :
Behdenna, A., Colange, M., Haziza, J. et al. pyComBat, a Python tool for batch effects correction in high-throughput molecular data using empirical Bayes methods. BMC Bioinformatics 24, 459 (2023). https://doi.org/10.1186/s12859-023-05578-5
Colange, M., Appé, G., Meunier, L. et al. Bulk Transcriptomic Analysis with InMoose, the Integrated Multi-Omic Open-Source Environment in Python. bioRxiv 2024.11.29.625982 (2024). https://doi.org/10.1101/2024.11.29.625982
Colange, M., Appé, G., Meunier, L. et al. Differential Expression Analysis with InMoose, the Integrated Multi-Omic Open-Source Environment in Python. bioRxiv 2024.11.14.623578 (2024). https://doi.org/10.1101/2024.11.14.623578

11h40-12h30 Work and feedback around methylation
11h40-12h00 Valentin Costes, Eliance, Microarray vs NGS

In cattle, DNA methylation could potentially be used to predict phenotypes. Several technologies are available to measure it, each with its own strengths and weaknesses. Among these, the team mainly uses RRBS sequencing and, more recently, microarray data. The talk will present the types of data generated and show some comparisons between sequencing-based and microarray-based data.

12h00-12h15 Luc Jouneau, VIM, BREED, INRAE, Jouy-en-Josas, Application of {edgeR} package to the analysis of RRBS methylation data

In 20xx, in a forum post, Gordon Smyth showed that {edgeR} can be used to analyze RRBS data. Since then, his team has published an article on the subject and has even integrated this case study into the package's documentation and functions (section 4.8). In practice, what are the strengths and drawbacks of applying {edgeR} to this type of data?

12h15-12h30 Anne Frambourg, VIM, BREED, INRAE, Jouy-en-Josas, Evaluating the results of a two-factor analysis with the {DSS} package

Application of the Dispersion Shrinkage for Sequencing data method (DSS, Park and Wu, 2016), based on a beta-binomial model, within the BoSexDim project, which aims among other things to identify early sexual dimorphisms in the bovine embryo and to study how these dimorphisms are modified in response to the embryonic environment (in vitro/in vivo). Several models were considered for the two-factor analysis with sex and environment. Because an interaction between the two factors was identified, the model with the concatenated variable sex:environment was retained, as it allows all differential analyses between sub-categories (Female – Vivo, Female – Vitro, Male – Vivo, Male – Vitro).
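The concatenated-variable design can be sketched as follows: each sample receives a single group label built from its two factors, and every pair of groups defines one differential analysis. This is plain Python, not the {DSS} R API, and the sample identifiers are hypothetical.

```python
from itertools import combinations

# Hypothetical sample sheet: one entry per embryo.
samples = [
    {"id": "s1", "sex": "Female", "environment": "Vivo"},
    {"id": "s2", "sex": "Female", "environment": "Vitro"},
    {"id": "s3", "sex": "Male", "environment": "Vivo"},
    {"id": "s4", "sex": "Male", "environment": "Vitro"},
    {"id": "s5", "sex": "Female", "environment": "Vivo"},
    {"id": "s6", "sex": "Male", "environment": "Vitro"},
]

def combined_group(sample):
    """Concatenate the two factors into a single grouping variable."""
    return f"{sample['sex']}_{sample['environment']}"

# Map each combined group to the samples it contains.
groups = {}
for sample in samples:
    groups.setdefault(combined_group(sample), []).append(sample["id"])

# Every pair of sub-categories is one two-group comparison.
contrasts = list(combinations(sorted(groups), 2))
for first, second in contrasts:
    print(f"{first} vs {second}")
```

With two levels per factor there are four combined groups and therefore six pairwise comparisons, including the cross comparisons that a purely additive two-factor model does not isolate.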

12h30 - 14h00 Lunch break

Afternoon

Afternoon presentations will be in English.

14h00 - 14h50 Claire Gormley, Professor, School of Mathematics and Statistics, University College Dublin, Integrated differential analysis of multi-omics data using a joint mixture model: idiffomix

Koyel Majumdar, Florence Jaffrézic, Andrea Rau, Isobel Claire Gormley, Thomas Brendan Murphy

Gene expression and DNA methylation are two interconnected biological processes and understanding their relationship is important in advancing understanding in diverse areas, including disease pathogenesis, environmental adaptation, developmental biology, and therapeutic responses. Differential analysis, including the identification of differentially methylated cytosine-guanine dinucleotide (CpG) sites (DMCs) and differentially expressed genes (DEGs) between two conditions, such as healthy and affected samples, can aid understanding of biological processes and disease progression. Typically, gene expression and DNA methylation data are analysed independently to identify DMCs and DEGs which are further analysed to explore relationships between them. Such approaches ignore the inherent dependencies and biological structure within these related data.

A joint mixture model is proposed that integrates information from the two data types at the modelling stage to capture their inherent dependency structure, enabling simultaneous identification of DMCs and DEGs. The model leverages a joint likelihood function that accounts for the nested structure in the data, with parameter estimation performed using an expectation-maximisation algorithm.

Performance of the proposed method, {idiffomix}, is assessed through a thorough simulation study and application to a publicly available breast cancer dataset. Several genes, identified as non-differentially expressed when the data types were modelled independently, had high likelihood of being differentially expressed when associated methylation data were integrated into the analysis. The idiffomix approach highlights the advantage of an integrated analysis via a joint mixture model over independent analyses of the two data types; genome-wide and cross-omics information is simultaneously utilised providing a more comprehensive view.

Slides : here
Preprint : https://doi.org/10.48550/arXiv.2412.17511
R package : https://cran.r-project.org/package=idiffomix

14h50 - 15h10 Ekaterina Tomilina, MaIAGE (Dynenvie) and GABI (Gibbs), INRAE, Jouy-en-Josas, Likelihood-based partial correlation Gaussian copula network inference for heterogeneous multi-omics data

Large-scale heterogeneous data integration for network inference is a key methodological challenge, especially in the context of multi-omic data analysis. The heterogeneous nature can be handled by assuming a Gaussian copula model, where the biological network can be assimilated to the partial correlations of a latent Gaussian vector. The estimation of the corresponding precision matrix is realized by combining the graphical Lasso (Friedman et al., 2007) with a semi-parametric pairwise estimation of the copula correlation matrix (Tomilina et al., 2024). We present a simulation study by comparing the performance of the method above against a moment-based approach (Zhang et al., 2021). Then, we show promising application results on an ICGC data set related to breast cancer in the US.
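The two-step idea (estimate the copula correlation matrix, then read the network off the partial correlations) can be sketched in a simplified setting. The example below is an assumption-laden stand-in for the method of the talk: it handles continuous margins only, uses the rank-based Kendall's tau estimator rather than the likelihood-based pairwise estimation, and replaces the graphical Lasso with a plain matrix inversion, so no sparsity is enforced. All function names are hypothetical.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau (tau-a), computed over all ordered pairs of samples."""
    n = len(x)
    sign_x = np.sign(x[:, None] - x[None, :])
    sign_y = np.sign(y[:, None] - y[None, :])
    return float((sign_x * sign_y).sum()) / (n * (n - 1))

def copula_correlation(data):
    """Latent Gaussian correlation from ranks: r = sin(pi * tau / 2)."""
    p = data.shape[1]
    corr = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            tau = kendall_tau(data[:, j], data[:, k])
            corr[j, k] = corr[k, j] = np.sin(np.pi * tau / 2.0)
    return corr

def partial_correlations(corr):
    """Partial correlations from the precision matrix (plain inverse, no Lasso)."""
    precision = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy chain network Z1 - Z2 - Z3: Z1 and Z3 are independent given Z2.
rng = np.random.default_rng(2)
n = 800
z2 = rng.standard_normal(n)
z1 = 0.8 * z2 + 0.6 * rng.standard_normal(n)
z3 = 0.8 * z2 + 0.6 * rng.standard_normal(n)
# Monotone transforms give non-Gaussian margins but leave the copula unchanged.
data = np.column_stack([np.exp(z1), z2 ** 3, z3])

pcor = partial_correlations(copula_correlation(data))
```

Because the estimator depends on ranks only, the transformed variables give the same result as the latent Gaussian ones; the entry linking the first and third variables should be close to zero, which is the missing edge of the chain.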

References :
Sparse inverse covariance estimation with the graphical lasso, Friedman, J. and Hastie T. and Tibshirani, R., 2007, Biostatistics vol. 9 no.3 pp 432-441.
A latent gaussian copula model for mixed data analysis in brain imaging genetics, Zhang, A. et al, 2021, IEEE Computer Society Press vol. 18 no.4 pp 1350-1360.
A semi-parametric Gaussian copula model for heterogeneous network inference: an application to multi-omics data, Tomilina, E. and Mazo, G. and Jaffrezic, F., 2024, https://hal.inrae.fr/hal-04847648

15h10 - 15h30 Yanis Asloudj, LaBRI, Univ. Bordeaux, {fEVE}, a modular framework for ensemble clustering in -omics data science, and more.

In the era of big data analytics and personalized medicine, clustering analyses have become a cornerstone of -omics bioinformatics. Briefly, clustering analyses attempt to classify several entities into fewer homogeneous groups, by leveraging a set of measurements taken from these entities. For instance, in bulk oncology transcriptomics, clustering analyses are frequently conducted to identify groups (or « clusters ») of patients afflicted with similar cancer subtypes, according to the gene expression measured in their tumors. Because of their importance in data science as a whole, numerous clustering algorithms have been developed since the last century. In the field of single-cell transcriptomics (or « scRNA-seq ») alone, more than 300 algorithms have already been developed. Because each of these algorithms exploits different aspects of the data, the clusters they predict can vary drastically.

To ensure that the results of a clustering analysis were robust to the algorithm used, ensemble clustering methods were developed. Succinctly, these methods applied multiple clustering algorithms on the same dataset, and they integrated the results of each algorithm so as to generate a single robust prediction of clusters.

In a previous work, we had developed our own scRNA-seq ensemble clustering algorithm, named {scEVE}. Unlike state-of-the-art algorithms, {scEVE} was able to address two main conceptual challenges in clustering : it was able (1) to quantify the uncertainty of its predictions, and (2) to generate clustering results with multiple resolutions. To do so, our algorithm first applied multiple scRNA-seq clustering methods on a dataset to predict a large pool of base clusters. From this pool, robust clusters (i.e. clusters predicted by multiple methods) were predicted, and their robustness was quantified by using basic properties from graph and set theories. Finally, these robust clusters were biologically characterized, and this procedure was repeated recursively on every characterized cluster – so long as the robustness of the analysis kept improving.
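To illustrate how set theory can quantify agreement between methods, the sketch below scores a candidate cluster by the fraction of methods that predicted a sufficiently similar base cluster, using the Jaccard index. This is a hypothetical scoring rule written for the example, not the robustness measure actually used by {scEVE}; the method names, cell identifiers and threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard index between two sets of cells."""
    return len(a & b) / len(a | b)

# Hypothetical base clusters predicted by three clustering methods.
base_clusters = {
    "method_A": [{"c1", "c2", "c3"}, {"c4", "c5", "c6"}],
    "method_B": [{"c1", "c2", "c3"}, {"c4", "c5"}, {"c6"}],
    "method_C": [{"c1", "c2"}, {"c3", "c4", "c5", "c6"}],
}

def support(candidate, base_clusters, threshold=0.7):
    """Fraction of methods with a base cluster whose Jaccard index with
    the candidate is at least `threshold` (an assumed cut-off)."""
    hits = 0
    for clusters in base_clusters.values():
        if any(jaccard(candidate, cluster) >= threshold for cluster in clusters):
            hits += 1
    return hits / len(base_clusters)

score_first = support({"c1", "c2", "c3"}, base_clusters)
score_single = support({"c6"}, base_clusters)
```

A candidate recovered by most methods gets a score near 1, while a cluster proposed by a single method gets a low score, which gives a simple handle on the uncertainty of each prediction.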

While some steps of our algorithm were specific to scRNA-seq measurements – namely, the prediction of base clusters, and the biological characterization of robust clusters – we believed that the main contributions of {scEVE} (i.e. the robustness quantification and the multi-resolution recursive clustering) were valuable for any type of clustering problem. Thus, to facilitate the utilization of these contributions, we have now developed the {fEVE} framework, which we have implemented in an R package available on Github.

Briefly, the {fEVE} framework is a modular ensemble clustering framework that allows developers to easily replace the single-cell specific functions we have set prior to (and after) the identification of robust clusters. By replacing these functions with ones developed for bulk transcriptomics data (for example), one can easily conduct a bulk RNA-seq clustering analysis with uncertainty values and multiple resolutions. Consequently, we expect that any data scientist interested in clustering will benefit from the modularity of our framework.

As of today, we have two research perspectives ourselves, that capitalize on this novel clustering framework. First, we plan on measuring the effects of individual framework components (e.g. the methods used to predict base clusters) on clustering results. Secondly, we would like to investigate the usefulness of our framework in answering more diverse biological problems (e.g. multi-omics analyses).

Slides : here
Package : https://github.com/yanisaspic/fEVE

15h30 - 16h00 Coffee break
16h00 - 16h30 Marie Chion, University of Cambridge, {ProteoBayes}: a Bayesian framework for differential proteomics analysis

This work introduces a Bayesian hierarchical model for mass spectrometry-based differential proteomics analysis, addressing missing values, peptide correlations and uncertainty quantification. Unlike existing approaches that rely on moderated variance estimates, {ProteoBayes} leverages conjugate priors to enable direct posterior sampling without costly MCMC methods. Applied to real proteomics data, it provides a more intuitive and comprehensive quantification of the uncertainty around peptide intensity differences.
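The benefit of conjugate priors can be shown with a textbook univariate case: with a normal-inverse-gamma prior on the mean and variance of Gaussian intensities, the posterior has a closed form and can be sampled directly, without MCMC. The sketch below is not the {ProteoBayes} API and may differ from the priors the package uses; the hyperparameter values and simulated intensities are assumptions.

```python
import numpy as np

def posterior_params(x, mu0, lambda0, alpha0, beta0):
    """Closed-form normal-inverse-gamma update for Gaussian data x."""
    n, xbar = len(x), float(np.mean(x))
    lambda_n = lambda0 + n
    mu_n = (lambda0 * mu0 + n * xbar) / lambda_n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * float(np.sum((x - xbar) ** 2))
              + lambda0 * n * (xbar - mu0) ** 2 / (2.0 * lambda_n))
    return mu_n, lambda_n, alpha_n, beta_n

def sample_posterior_mean(x, prior, size, rng):
    """Draw the mean directly: sigma^2 from an inverse gamma, then mu | sigma^2."""
    mu_n, lambda_n, alpha_n, beta_n = posterior_params(x, *prior)
    sigma2 = 1.0 / rng.gamma(shape=alpha_n, scale=1.0 / beta_n, size=size)
    return rng.normal(mu_n, np.sqrt(sigma2 / lambda_n))

rng = np.random.default_rng(3)
prior = (20.0, 0.1, 2.0, 1.0)                     # mu0, lambda0, alpha0, beta0 (assumed)
group_a = rng.normal(20.0, 0.5, size=5)           # simulated log-intensities, condition A
group_b = rng.normal(22.0, 0.5, size=5)           # simulated log-intensities, condition B

diff = (sample_posterior_mean(group_b, prior, 10_000, rng)
        - sample_posterior_mean(group_a, prior, 10_000, rng))
prob_positive = float(np.mean(diff > 0))          # posterior P(mean_B > mean_A)
```

The draws in `diff` describe the whole posterior distribution of the intensity difference, so credible intervals or probabilities of exceeding a threshold can be read off directly.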

Preprint : https://arxiv.org/abs/2307.08975
Package : https://mariechion.github.io/ProteoBayes/

16h30 - 17h00 Szymon Urbas, Maynooth University, Ireland, Bayesian probabilistic partial least squares (BPLS)

I will introduce a statistical model that emulates the desirable properties of partial least squares regression whilst enabling principled uncertainty quantification, and which allows for various model-based modifications. The model can be used in univariate and multivariate regression/prediction problems involving high-dimensional sets of covariates. The model is implemented in the R package {bplsr} available on CRAN.

Slides : here
Paper : https://doi.org/10.1214/24-AOAS1947
Package : https://cran.r-project.org/package=bplsr

17h00 - 17h15 Closing

Practical information

Date : Tuesday, March 11, from 9h30 to 17h15

Venue : Campus Agro Paris Saclay, Palaiseau, France

On-site access will be restricted to registered participants.