Full Text

London Journal of Research in Computer Science & Technology

2514-863X 2514-8648

JournalsPress

104323

Two-Step Screening Method for Cancer Gene Data Analysis -Multivariate Oncogenes Candidates, including Oncogenes and Tumor Suppressor Genes-

In 1995, the microarray made it possible to measure the amount of protein produced by animal genes. First-generation medical projects used these high-dimensional expression data to study cancer gene diagnosis and released their microarrays. Many researchers in engineering fields such as statistics, machine learning (ML), and AI took up the new research theme of high-dimensional gene data analysis. The main research themes are 1) feature selection (FS) methods to select oncogenes, that is, sets of genes (as opposed to the legacy single oncogene) that separate cancer patients from normal subjects, and 2) the choice of the best classifiers, including discriminant functions, with low error rates (ERs).
Since 1971, the author has studied discriminant theory useful for medical diagnosis, and solved four severe discriminant problems in 2015. The number of misclassifications (NM) and the ER calculated from NM are essential for evaluating discriminant results, yet most researchers ignored the defects of NM. Problem1 is that different discrimination methods yield different NMs for the same data. We therefore considered the minimum number of misclassifications (MNM), which is unique for the data, instead of NM, and thereby solved Problem1. MNM is the first statistic to define linearly separable data (LSD), namely data with MNM=0. All discriminant functions that use variance-covariance matrices under Fisher’s assumption (the two groups follow normal distributions with the same variance-covariance matrix and different means) have Problem2 (no LSD study) and Problem3 (the defects of the variance-covariance matrix). We also found two essential facts of the new discriminant theory (Theory1). Fact1 is the first elucidation of “the relationship between the linear discriminant function (LDF) discrimination coefficients and NMs.” Fact2 is “the monotonic decrease of MNM” (MNM does not increase when variables are added to the model). These two facts show that combinatorial theory is appropriate for the 2-group discrimination of cancer patients and normal subjects.
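To make Problem1 concrete, the following minimal Python sketch counts NM and ER for two different LDFs on the same toy data. The data, the coefficient values, and the function names (`ldf_score`, `count_nm`, `error_rate`) are hypothetical illustrations, not taken from the study. The sketch assumes labels coded as +1/-1 and counts cases lying exactly on the discriminant hyperplane as misclassified.

```python
def ldf_score(coef, intercept, x):
    """Value of the linear discriminant function f(x) = coef . x + intercept."""
    return sum(c * v for c, v in zip(coef, x)) + intercept


def count_nm(coef, intercept, X, y):
    """NM: number of cases with y_i * f(x_i) <= 0 (labels are +1 or -1).

    Assumption: a case exactly on the hyperplane (f = 0) counts as an error.
    """
    return sum(
        1 for x, label in zip(X, y)
        if label * ldf_score(coef, intercept, x) <= 0
    )


def error_rate(nm, n):
    """ER is simply NM divided by the number of cases."""
    return nm / n


# Hypothetical toy data: three "normal" cases (-1) and three "cancer" cases (+1).
X = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (4.0, 4.0), (5.0, 3.0), (3.0, 5.0)]
y = [-1, -1, -1, 1, 1, 1]

# Two hypothetical LDFs applied to the same data.
nm_a = count_nm((1.0, 1.0), -6.0, X, y)   # f(x) = x1 + x2 - 6
nm_b = count_nm((0.0, 1.0), -2.5, X, y)   # f(x) = x2 - 2.5

er_a = error_rate(nm_a, len(X))
er_b = error_rate(nm_b, len(X))
print(nm_a, er_a)
print(nm_b, er_b)
```

On this toy data the first LDF separates the two classes completely (NM=0, so the data are LSD and MNM=0), while the second misclassifies one case (NM=1, ER=1/6). The NM therefore depends on the discriminant function chosen, whereas MNM is a property of the data alone.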
We developed the Revised IP (Integer Programming) Optimal Linear Discriminant Function (Revised IP-OLDF, RIP), an OLDF that obtains MNM, and solved Problem2 and Problem3 through LSD research. Because discriminant theory is not inferential statistics (Problem4), we solved this problem with a k-fold cross-validation (CV) (Method1) that evaluates the ER and the discrimination coefficients. We used k=100 before 2019 and k=10 after 2019. In 2015, we completed Theory1. When RIP discriminated the six first-generation microarrays, measured until 2004, all six MNMs were 0; that is, all six microarrays are LSD. LSD is the crucial signal of multivariate oncogene candidates. We also found that only three LDFs, namely RIP, the hard-margin maximization SVM (H-SVM), and logistic regression using the maximum likelihood method, can correctly discriminate LSD.
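The k-fold CV idea (Method1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple perceptron stands in for the LDF because RIP requires an integer programming solver, and the perceptron is not RIP and does not minimize NM in general. The toy data and the names `train_perceptron`, `count_nm`, and `k_fold_cv` are hypothetical. The perceptron is used here only because it is known to reach zero training errors when the training data are LSD.

```python
import random


def score(w, b, x):
    """Value of the linear discriminant function w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(X, y, max_epochs=1000):
    """Stand-in LDF (NOT RIP): stops once no training case is misclassified."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, label in zip(X, y):
            if label * score(w, b, x) <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                errors += 1
        if errors == 0:
            break
    return w, b


def count_nm(w, b, X, y):
    """Number of misclassifications; cases on the hyperplane count as errors."""
    return sum(1 for x, label in zip(X, y) if label * score(w, b, x) <= 0)


def k_fold_cv(X, y, k=10, seed=0):
    """Return, per fold, the training ER, validation ER, and coefficients."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    results = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_va, y_va = [X[i] for i in held_out], [y[i] for i in held_out]
        w, b = train_perceptron(X_tr, y_tr)
        results.append({
            "train_er": count_nm(w, b, X_tr, y_tr) / len(train),
            "val_er": count_nm(w, b, X_va, y_va) / len(held_out),
            "coef": w,
            "intercept": b,
        })
    return results


# Hypothetical LSD toy data: two well-separated clusters of 20 cases each.
rng = random.Random(1)
X = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(20)]
X += [(rng.uniform(3, 4), rng.uniform(3, 4)) for _ in range(20)]
y = [-1] * 20 + [1] * 20

results = k_fold_cv(X, y, k=10)
mean_train_er = sum(r["train_er"] for r in results) / len(results)
mean_val_er = sum(r["val_er"] for r in results) / len(results)
print(mean_train_er, mean_val_er)
```

Because every training subset of LSD data is itself LSD, the training ER is 0 in every fold. The validation ERs and the spread of the per-fold coefficients are the quantities a k-fold CV of this kind is meant to examine.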