By Diganta Bose, BE, MS, Clinical Programmer
Introduction
The completion of the Human Genome Project (2003) and bioinformatics research have created an opportunity for a significant rise in a new breed of data in both research and clinical care. This new kind of data, generated by Pharmacogenomics (PGx) research, promises a more precise understanding, both quantitative and qualitative, of molecular pathways and underlying disease risks in populations.
The PGx team, a sub-team within the CDISC (Clinical Data Interchange Standards Consortium) SDS (Submission Data Standards) Team, has developed several domains designed to carry pharmacogenomics data. These CDISC PGx domains were developed in parallel with the work of the HL7 Clinical Genomics Work Group (CG), which was initiated jointly several years ago by CDISC and HL7 (Health Level Seven, an organization that sets information technology standards for healthcare). This creates new opportunities and challenges for SAS programmers working on CDISC-compliant data structures.
Pharmacogenomics - the science and the data
Pharmacogenomics (PGx) is a branch of pharmacology that explains the relationship between genetic variations and drug response in patients by correlating gene expression or SNPs (Single Nucleotide Polymorphisms) with a drug's efficacy and toxicity. Such studies can help to develop rational means of optimizing drug therapy with respect to the patient's genotype, to ensure maximum efficacy with minimal adverse effects. Such approaches promise the advent of "personalized medicine", in which drugs and drug combinations are optimized for each individual's unique genetic makeup.
Pharmacogenomics information is particularly helpful in cancer trials: pharmacogenomics tests are used to identify which patients will experience toxicity from commonly used cancer drugs and which patients will not respond to them. The tests most commonly include gene expression analysis using microarrays, which can be performed on specific tissue specimens collected from the patients. Other tests may include SNP or other genetic polymorphism analysis, genotyping and a few more. The data one would expect from such PGx tests are mainly biospecimen and genetic (DNA, RNA) sample records, along with their collection dates and times and clinical significance information.
Handling, manipulating and analyzing the data - A SAS Programmer’s challenge
The PGx Findings domain stores key results such as intensity values (both raw and normalized), p-value, fold-change, ratio, genetic change, amino-acid change, etc. Such data differ significantly from the data in the currently available findings domains (lab data, vital signs, pharmacokinetic data, etc.).
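As a rough illustration of how two of these results relate to each other, the sketch below derives a fold-change ratio and its log2 value from a pair of normalized intensities. It is written in Python only for brevity. The function name `fold_change` and the treated/control framing are assumptions made for this example and are not part of any CDISC specification.

```python
import math


def fold_change(treated, control):
    """Return (ratio, log2 ratio) for two normalized intensity values.

    Hypothetical helper for illustration only. Both inputs are assumed
    to be positive normalized intensities on a linear scale.
    """
    if treated <= 0 or control <= 0:
        raise ValueError("intensities must be positive")
    ratio = treated / control
    # The log2 scale makes up- and down-regulation symmetric around zero.
    return ratio, math.log2(ratio)


ratio, log2_ratio = fold_change(200.0, 50.0)
```

A ratio above 1 (log2 above 0) would indicate higher expression in the treated sample under these assumptions. How such values are actually stored in the findings domain depends on the study and the standard in force.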
The HL7 CG has developed a Genetic Variation model in conjunction with clinical care participants such as Partners Healthcare and Intermountain Healthcare, who are leading the adoption of PGx in healthcare. As part of the HL7 work, LOINC (Logical Observation Identifiers Names and Codes) was extended to include the most commonly used genetic variation tests. CDISC plans to create vocabulary for CDISC TESTCD and TEST, which will reside in the NCI EVS (National Cancer Institute's Enterprise Vocabulary Services) and be a counterpart to the LOINC codes. The NCI is currently working with the group that originally developed the Microarray and Gene Expression Data (MGED) standards to validate the Ontology for Biomedical Investigations (OBI) and load it into EVS. This ontology will be used for gene expression data by both CDISC and HL7.
The initial package contains the following domains: BS (Biospecimen), BE (Biospecimen Event), ES (Extracted Sample), PG (Pharmacogenomics) and PF (Pharmacogenomics Findings).
Genetic variation data can be anything from strings that look like random characters (generally A, T (or U), G and C for DNA/RNA sequence information such as a gene substring) to strings or numbers arranged in an ambiguous array whose meaning has to be decoded by a complex algorithm that the programmer must implement. Other findings variables contain derived or assigned values that are simple to understand: letters or text, numeric values, and discrete values (Yes/No, 0/1, Male/Female). The variables in the PGx domains are not necessarily that simple, and a programmer may need to devise rules and very specific programming to derive certain variables from the captured raw data in order to make them compliant with the CDISC data structure. Programming may involve developing macros to automate standardized algorithms across the domains, and using regular expressions and string functions to handle complex strings. Mapping codes will likely involve databases such as NCBI (National Center for Biotechnology Information) and GenBank, along with medical dictionaries such as MedDRA and WHODrug. These new kinds of data collected in the PGx domains bring new opportunities for more flexible SDTM (Study Data Tabulation Model) and ADaM (Analysis Data Model) programming for CDISC.
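The kind of regular-expression handling described above can be sketched briefly. The example is in Python for compactness; a SAS programmer would typically reach for the PRX functions (such as PRXPARSE and PRXMATCH) instead. The helper names, the three return labels, and the simplified substitution pattern (modeled loosely on notations like "c.76A>T") are all assumptions made for illustration and are not taken from the CDISC PGx domains.

```python
import re

# Hypothetical patterns for illustration; real raw data may need more cases
# (ambiguity codes, insertions, deletions, and so on).
DNA_RE = re.compile(r"[ACGT]+")
RNA_RE = re.compile(r"[ACGU]+")
# Simplified single-nucleotide substitution: position, reference, alternate.
SUBSTITUTION_RE = re.compile(r"c\.(\d+)([ACGT])>([ACGT])")


def classify_sequence(raw):
    """Label a raw string as 'DNA', 'RNA' or 'INVALID'.

    A string containing only A, C and G fits both alphabets and is
    labeled 'DNA' here, which is an arbitrary choice for this sketch.
    """
    seq = raw.strip().upper()
    if DNA_RE.fullmatch(seq):
        return "DNA"
    if RNA_RE.fullmatch(seq):
        return "RNA"
    return "INVALID"


def parse_substitution(raw):
    """Split a substitution string into its parts, or return None."""
    match = SUBSTITUTION_RE.fullmatch(raw.strip())
    if match is None:
        return None
    return {
        "position": int(match.group(1)),
        "ref": match.group(2),
        "alt": match.group(3),
    }
```

Keeping each rule in a small, named function mirrors the macro-based approach mentioned above: the same check can be reused across domains, and a value that fails to parse can be flagged for review rather than silently mapped.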
For details read the CDISC Pharmacogenomics news article available on the CDISC website at http://www.cdisc.org/pgx-review-article