Improving genetic risk prediction across diverse population by disentangling ancestry representations

In this section, we first describe how we constructed the dataset and then detail our proposed method and its implementation for collaboratively training the AI model.

Dataset preparation

This study considers two datasets: the ADSP and the UKB. We apply a p < 1e-5 threshold to obtain candidate regions comprising 5014 GWAS variants from Jansen et al. (2019) [28] and Andrews et al. (2020) [29]. To ensure marker quality, we remove SNPs with a missing rate above 10%, which leaves 3892 variants. We then remove participants with a missing AD phenotype or missing ancestry information. For the ADSP, the AD phenotype is a dichotomous case/control label. For the UKB, we use the AD-proxy score defined in Jansen et al. (2019) [28], which combines self-reported parental AD status with the individual's own AD status. Since AD is an age-related disease, we remove control participants below age…
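The filtering steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the in-memory list representation of genotypes (with `None` marking a missing call), and the toy ancestry labels are all assumptions made for illustration. Only the p < 1e-5 threshold and the 10% missing-rate cutoff come from the text.

```python
from typing import List, Optional, Sequence

P_THRESHOLD = 1e-5       # GWAS p-value cutoff stated in the text
MAX_MISSING_RATE = 0.10  # SNPs with a higher missing rate are dropped


def select_variants(
    p_values: Sequence[float],
    genotypes: Sequence[Sequence[Optional[int]]],
) -> List[int]:
    """Return indices of variants that pass both filters.

    genotypes[i][j] is the genotype of participant i at variant j,
    or None when the call is missing (hypothetical encoding).
    """
    n_participants = len(genotypes)
    if n_participants == 0:
        return []
    keep = []
    for j, p in enumerate(p_values):
        if p >= P_THRESHOLD:
            continue  # not significant enough to be a candidate
        n_missing = sum(1 for row in genotypes if row[j] is None)
        if n_missing / n_participants > MAX_MISSING_RATE:
            continue  # more than 10% of calls are missing
        keep.append(j)
    return keep


def select_participants(
    phenotypes: Sequence[Optional[float]],
    ancestries: Sequence[Optional[str]],
) -> List[int]:
    """Return indices of participants with both phenotype and ancestry present."""
    return [
        i
        for i, (y, a) in enumerate(zip(phenotypes, ancestries))
        if y is not None and a is not None
    ]


# Toy data: 10 participants, 3 variants (values are made up).
rows = [[0, 1, 2] for _ in range(10)]
rows[0][0] = None                    # variant 0: 1 of 10 missing
rows[1][2] = None                    # variant 2: 2 of 10 missing
rows[2][2] = None
kept_variants = select_variants([1e-8, 1e-3, 1e-6], rows)
kept_people = select_participants([1, None, 0, 1], ["EUR", "AFR", None, "EAS"])
```

In this toy example, variant 1 fails the p-value threshold and variant 2 exceeds the missing-rate cutoff, so only variant 0 is retained. The sketch applies the variant filter before the participant filter, following the order in which the text lists the steps.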
