GMS data validation using MySQL stored procedures (merged DB)
GMS Schema | https://cropforge.github.io/iciswiki/articles/t/d/m/TDM_Genealogy_Management_System_d3f7.html |
---|---|
GMS setup & checks | https://cropforge.github.io/iciswiki/articles/g/m/s/GMS_Setup_3b5b.html |
Step-by-step guide
- Download and save the SQL file: chkMergedGMS_v5.sql. OTher versions as follows:
- chkMergedGMS_v6.sql (Graham McLaren) - Small corrections to two of the scripts
- chkMergedGMS_v7.sql removed checks that create intermediate views ( with * ) and added location and method checks. To call the checks, please use this file chkMergedGMS_v7_calls.sql
- chkMergedGMS_v8.sql added two new checks for checking existence of name and atribut types in the UDFLDS table. To call the checks, please use this file chkMergedGMS_v8_calls.sql
- If using SQLYog, right-click on the database where GMS data of a crop has to be checked and upload the file via Import › Execute SQL script.
- Browse through where GMSChecks file has been saved.
- Click Execute.
- Do a CALL for each check.
Usage in MySQL query editor:
CALL merged_CHK01A();
CALL merged_CHK01B();
CALL merged_CHK01C();
CALL merged_CHK01D();
CALL merged_CHK01E();
CALL merged_CHK10L1();
CALL merged_CHK10L2();
CALL merged_CHK10L3();
CALL merged_CHK10L4();
CALL merged_CHK04D2_GPID1();
CALL merged_CHK04D2_GPID2();
CALL merged_CHK04D4_GPID1();
CALL merged_CHK04D4_GPID2();CALL merged_CHK01G1;
CALL merged_CHK01I2B();
CALL merged_CHK01I3();
CALL merged_CHK01N1();
CALL merged_CHK01N2();
CALL merged_CHK01L();
CALL merged_CHK01L2();
CALL merged_CHK01L3();
CALL merged_CHK01L4();CALL merged_CHK01M();
CALL merged_CHK01N();
CALL merged_CHK01U1();
CALL merged_CHK01U2();
Summary | Check # | Description | Required action | Suggested Update Query *intermediate views deleted in v7 |
---|---|---|---|---|
The method of genesis for the germplasm should have a corresponding value for number of progenitors (GERMPLSM.GNPGS) | merged_CHK01A | Check that all derivative germplasm have GNPGS = -1 | See table view merged_CHK01A. Set GNPGS = -1 | Update germplsm g, merged_CHK01A m set gnpgs = -1 where m.gid = g.gid; |
merged_CHK01B | Check that all generative germplasm have GNPGS >= 0 | See table view merged_CHK01B. Set GNPGS = 2 if with two parents, 3 if with three, etc | Updated Check: DROP VIEW IF EXISTS merged_CHK01b; CREATE VIEW merged_CHK01B AS SELECT germplsm.gid, germplsm.methn, methods.mtype, germplsm.gnpgs, germplsm.grplce , COUNT(p.gid) pno FROM germplsm INNER JOIN methods ON germplsm.methn = methods.mid LEFT JOIN progntrs p ON p.gid=germplsm.gid WHERE (((methods.mtype)="GEN") AND ((germplsm.gnpgs)<0)) GROUP BY germplsm.gid; To Correct: UPDATE germplsm g, `merged_chk01b` m SET g.gnpgs = m.pno + 2 WHERE g.gid = m.gid; | |
Circular reference: If germplasm A has germplasm B as one of its parents and if germplasm B has germplasm A as one of its parents, then we have a circular reference situation. This also checks for two and three-level circularity. | merged_CHK01C | Check that GPID1 <> GID | See table view merged_CHK01C. Change value of GPID1
| Updated Check: DROP VIEW IF EXISTS merged_CHK01C; CREATE VIEW merged_CHK01C AS SELECT g.gid, g.methn,g.gpid2, m.mtype, g2.gpid1 gpid1_of_parent2 FROM germplsm g LEFT JOIN methods m ON g.methn = m.mid LEFT JOIN germplsm g2 ON g.gpid2 = g2.gid WHERE g.gid=g.gpid1; Two Step Update Query:
|
merged_CHK01D | Check that GPID2 <> GID | See table view merged_CHK01D. Change value of GPID2 | ||
merged_CHK04D1 | Retrieves list of all non-replaced derivative germplasm | none | *(removed in version 7) | |
merged_CHK04D2_GPID1 | Check for one-step circularities on group (GPID1) of derivative germplasm | See table view merged_CHK04D2_GPID1. Change value of GPID1 | ||
merged_CHK04D2_GPID2 | Check for one-step circularities on source (GPID2) of derivative germplasm | See table view merged_CHK04D2_GPID2. Change value of GPID1 | ||
merged_CHK04D3_GPID1 | Retrieves group (GPID1) of non-replaced derivative germplasm in view CHK04D1 | none | *(removed in version 7) | |
merged_CHK04D3_GPID2 | Retrieves source (GPID2) of non-replaced derivative germplasm in view CHK04D1 | none | *(removed in version 7) | |
merged_CHK04D4_GPID1 | Check for two-step circularities on group (GPID1) of derivative germplasm | See table view merged_CHK04D4_GPID1. Traverse generation history and change value of GPID1 where appropriate | ||
merged_CHK04D4_GPID2 | Check for two-step circularities on source (GPID2) of derivative germplasm | See table view merged_CHK04D4_GPID2. Traverse generation history and change value of GPID2 where appropriate | ||
Query derivative germplasm and associated groups (GPID1) and sources (GPID2) | merged_CHK01E | Check that all derivative germplasm with known source (GPID2 <> 0) have known group (GPID1 <> 0) | See table view merged_CHK01E. Set value of GPID1. Retrieve GPID1 by traversing generation history of GPID2 | |
merged_CHK01G1 | Check that all group IDs are generative or have unknown source | See table view merged_CHK01G1. | UPDATE merged_chk01g1 INNER JOIN germplsm ON merged_chk01g1.gid = germplsm.gid SET germplsm.gpid1 = merged_chk01g1.grandparent1; | |
merged_CHK01I1 | Retrieve list of non-replaced derivative germplasm (NRDG) with non-zero sources (GPID2<>0) | none | *(removed in version 7) | |
merged_CHK01I2b | Check that all source germplasm (which are not also the group sources) are derivative | See table view merged_CHK01I2. | UPDATE (merged_chk01i2b INNER JOIN germplsm ON merged_chk01i2b.gid = germplsm.gid) INNER JOIN germplsm AS germplsm_1 ON merged_chk01i2b.gpid2 = germplsm_1.gid SET germplsm.gpid1 = germplsm_1.gid WHERE (((germplsm_1.gnpgs)>0)); | |
merged_CHK01I3 | Check that a non-replaced derivative germplasm has the same group ID as its derivative source | See table view merged_CHK01I3. | UPDATE germplsm INNER JOIN merged_chk01i3 ON germplsm.gid = merged_chk01i3.gid SET germplsm.gpid1 = merged_chk01i3.g2pid1; | |
Let us say germplasm A is replaced with germplasm B as shown in the GERMPLSM.GRPLCE database column. All germplasm that references germplasm A should be corrected to germplasm B. | merged_CHK10L1 | Create view CHK10L1 to get a list of all replaced or deleted germplasm | none | *(removed in version 7) |
merged_CHK10L2 | Check that no replaced germplasm is used as GERMPLSM.GPID1 | See table view merged_CHK10L2. Change value of GPID1 | UPDATE germplsm g, merged_CHK10L2 m SET g.gpid1 = m.change_gpid1_to WHERE g.gid= m.gid; | |
merged_CHK10L3 | Check that no replaced germplasm is used as GERMPLSM.GPID2 | See table view merged_CHK10L3. Change value of GPID2 | UPDATE germplsm g, merged_CHK10L3 m SET g.gpid2 = m.change_gpid2_to WHERE g.gid= m.gid; | |
merged_CHK10L4 | Check that no replaced germplasm is used as PROGNTRS.PID | See table view merged_CHK10L3. Change value of PROGNTRS.PID | UPDATE progntrs p, merged_CHK10L4 m SET p.pid = m.change_pid_to WHERE p.pid = m.gid; | |
Exactly one name (neither more nor less) of a non-deleted GID (germplsm.grplce <> germplsm.gid) must have NSTAT=1 (preferred English name) | merged_CHK01N1 | Germplasm with multiple preferred English names (NSTAT=1) | See table view merged_CHK01N1 | |
merged_CHK01N2 | Germplasm with no preferred English name (NSTAT=1 or NSTAT = 8) | See table view merged_CHK01N2 | ||
Check that all germplasm methods are in the Methods table | merged_CHK01M | Check that all methn values is in the methods table | See table merged_CHK01M | Add the method to the METHODS table or change the Methn value |
Check that all germplasm connections to Locations table | merged_CHK01L | Check that all glocn in the germplsm table is the location table | See table merged_CHK01L | Add the location to the LOCATIONS table or change the glocn value |
merged_CHK01L2 | Check that all listlocn in listnms table is in the location table | See table merged_CHK01L2 | ||
merged_CHK01L3 | Check for Duplicates abbreviations (labbr) in the locations table | See table merged_CHK01L3 | ResolveDuplicateLocationAbbr.sql will add "2" to the duplicate labbr Note: labbr field is only allowed to have 8 characters. Adding another character (in this case, 2) will result to an error. Check first if there are duplicates with 8 characters in length: SELECT labbr,LENGTH(labbr), COUNT(labbr) Usage:
You can also manually edit this: SELECT location.locid, lname, location.labbr FROM location INNER JOIN merged_chk01l3 ON location.`labbr` = merged_chk01l3.labbr2 | |
merged_CHK01L4 | Will add location IDs = 6000 and 6001 if it does not exist | select * from location where locid in (6000,60001) | ||
Check that all germplasm IDs has a corresponding Name | merged_CHK01N | Check that the non-replaced germplasms in the germplsm table has a matching GID in the Names table | See table merged_CHK01N | |
Check Name type and Attribute types are in UDFLDS | merged_CHK01U1 | Check id Name type in NAMES table are present in UDFLDS table | See table merged_CHK01U1 | |
merged_CHK01U2 | Check id Attribute type in ATRIBUTS table are present in UDFLDS table | See table merged_CHK01U2 |
More info:
CHECK CENTRAL DATABASE INTEGRITY
Check that all derivative germplasm have GNPGS = -1 [CHK01A].
Check that all generative germplasm have GNPGS >= 0 [CHK01B].
Check that GPID1 <> GID [CHK01C].
Check that GPID2 <> GID [CHK01D].
Check that all derivatives with known source (GPID2 <> 0) have known group (GPID1 <> 0) [CHK01E].
Check that there are no references to deleted or replaced germplasm in central [CHK10L2] for GPID1, [CHK10L3] for GPID2 and :CHK10L4 (progenitors).
(These queries use CHK10L1 to get a list of all replaced or deleted germplasm)
Check for one-step circularities [CHK04D2].
This first calls CHK04D1 to get a list of all non replaced derivative germplasm (NRDG), GIDx with source GPID2x.
Then it finds all sources of these (GIDy with GIDy=GPID2x) which have their sources GIPD2y=GIDx.
Check for two-step circularities [CHK04D4]
This calls CHK04D1 to find all NRDG, GIDx, and then finds the sources of those with query CHK04D3, GIDy,
and finally finds the sources of those, GIDz, which have GPID2z=GIDx
Check for three-step circularities [CHK04D6]
This calls CHK04D1 to find all NRDG, GIDx, then finds the sources of those with query CHK04D3, GIDy,
and then finds the sources of those with query CHK04D5, GIDz, and finally finds the sources of those, GIDw, which have GPID2w=GIDx
Check that all group IDs are generative or have unknown source (-1, 0, 0) [CHK01G1]
This finds non-generative group sources with non-zero GPID1 or GPID2. (This calls CHK04D1 first to find NRDG).
Check that all source germplasm which are not also the group sources are derivative [CHK01I2]
Does this by finding all NRDG with non-zero sources and then finding those sources which are not also group sources (GPID1) but are generative.
Check that NRDG has the same group ID as its derivative source [CHK01I3]
Find NRDG which have different groups (CHK01I1.GPID1) to their derivative sources (GERMPLSM.GPID1).
Check special name relationships, e.g., Group source for IR nnn- - - should have name IR nnn. CHK04F1 – CHK04F3
(somewhat IRIS specific but can be adapted to similar situations).