GMS data validation using MySQL stored procedures (merged DB)


Step-by-step guide

  1. Browse through where GMSChecks file has been saved.
  2. Click Execute. 
  3. Do a CALL for each check.

    Usage in MySQL query editor:

CALL merged_CHK01A();
CALL merged_CHK01B();
CALL merged_CHK01C();
CALL merged_CHK01D();
CALL merged_CHK01E();


CALL merged_CHK10L1();
CALL merged_CHK10L2();
CALL merged_CHK10L3();
CALL merged_CHK10L4();


CALL merged_CHK04D2_GPID1();
CALL merged_CHK04D2_GPID2();
CALL merged_CHK04D4_GPID1();
CALL merged_CHK04D4_GPID2();

CALL merged_CHK01G1;

CALL merged_CHK01I2B();

CALL merged_CHK01I3();


CALL merged_CHK01N1();
CALL merged_CHK01N2();


CALL merged_CHK01L();
CALL merged_CHK01L2();
CALL merged_CHK01L3();
CALL merged_CHK01L4();

CALL merged_CHK01M();
CALL merged_CHK01N();
CALL merged_CHK01U1();
CALL merged_CHK01U2();

SummaryCheck #DescriptionRequired action

Suggested  Update Query

*intermediate views deleted in v7

The method of genesis for the germplasm should have a corresponding value for number of progenitors (GERMPLSM.GNPGS)merged_CHK01A Check that all derivative germplasm have GNPGS = -1

See table view merged_CHK01A. Set GNPGS = -1


Update germplsm g, merged_CHK01A m  set gnpgs = -1 where m.gid = g.gid;
merged_CHK01BCheck that all generative germplasm have GNPGS >= 0

See table view merged_CHK01B. Set GNPGS = 2 if with two parents, 3 if with three, etc



Updated Check:

DROP VIEW IF EXISTS merged_CHK01b;

CREATE VIEW merged_CHK01B AS

SELECT germplsm.gid, germplsm.methn, methods.mtype, germplsm.gnpgs, germplsm.grplce , COUNT(p.gid) pno

FROM germplsm INNER JOIN methods ON germplsm.methn = methods.mid

LEFT JOIN progntrs p ON p.gid=germplsm.gid

WHERE (((methods.mtype)="GEN") AND ((germplsm.gnpgs)<0))

GROUP BY germplsm.gid;

To Correct:

UPDATE germplsm g, `merged_chk01b` m SET g.gnpgs = m.pno + 2 WHERE g.gid = m.gid;

Circular reference: If germplasm A has germplasm B as one of its parents and if germplasm B has germplasm A as one of its parents, then we have a circular reference situation. This also checks for two and three-level circularity.merged_CHK01CCheck that GPID1 <> GID 

See table view merged_CHK01C. Change value of GPID1

  1. Set gpid1 = 0 if gpid2 = 0;
  2. Set gpid1 = gpid1 of gpid2 if gpid2 <> 0
  3. set gpid1 = 0 if gpid1 of gpid2 = 0 or NULL gpid1

Updated Check:

DROP VIEW IF EXISTS merged_CHK01C;

CREATE VIEW merged_CHK01C AS

SELECT g.gid, g.methn,g.gpid2, m.mtype, g2.gpid1 gpid1_of_parent2 FROM germplsm g LEFT JOIN methods m ON g.methn = m.mid LEFT JOIN germplsm g2 ON g.gpid2 = g2.gid WHERE g.gid=g.gpid1;

Two Step Update Query:

  1. UPDATE germplsm g, `merged_chk01c` m SET g.gpid1 = 0 WHERE g.gid = m.gid AND parent2 = 0;
  2. UPDATE germplsm g, `merged_chk01c` m SET g.gpid1 = IF(((gpid2>0) AND (m.gpid1_of_parent2>0)),gpid1_of_parent2,parent2) WHERE g.gid = m.gid;



merged_CHK01DCheck that GPID2 <> GID See table view merged_CHK01D. Change value of GPID2
merged_CHK04D1Retrieves list of all non-replaced derivative germplasmnone*(removed in version 7)
merged_CHK04D2_GPID1Check for one-step circularities on group (GPID1) of derivative germplasmSee table view merged_CHK04D2_GPID1. Change value of GPID1
merged_CHK04D2_GPID2Check for one-step circularities on source (GPID2) of derivative germplasmSee table view merged_CHK04D2_GPID2. Change value of GPID1
merged_CHK04D3_GPID1Retrieves group (GPID1) of non-replaced derivative germplasm in view CHK04D1none*(removed in version 7)
merged_CHK04D3_GPID2Retrieves source (GPID2) of non-replaced derivative germplasm in view CHK04D1none*(removed in version 7)
merged_CHK04D4_GPID1Check for two-step circularities on group (GPID1) of derivative germplasmSee table view merged_CHK04D4_GPID1. Traverse generation history and change value of GPID1 where appropriate
merged_CHK04D4_GPID2Check for two-step circularities on source (GPID2) of derivative germplasmSee table view merged_CHK04D4_GPID2. Traverse generation history and change value of GPID2 where appropriate
Query derivative germplasm and associated groups (GPID1) and sources (GPID2)merged_CHK01ECheck that all derivative germplasm with known source (GPID2 <> 0) have known group (GPID1 <> 0)See table view merged_CHK01E. Set value of GPID1. Retrieve GPID1 by traversing generation history of GPID2
merged_CHK01G1Check that all group IDs are generative or have unknown sourceSee table view merged_CHK01G1. UPDATE merged_chk01g1 INNER JOIN germplsm ON merged_chk01g1.gid = germplsm.gid SET germplsm.gpid1 = merged_chk01g1.grandparent1;
merged_CHK01I1Retrieve list of non-replaced derivative germplasm (NRDG) with non-zero sources (GPID2<>0)none*(removed in version 7)
merged_CHK01I2b

Check that all source germplasm (which are not also the group sources) are derivative

See table view merged_CHK01I2.UPDATE (merged_chk01i2b INNER JOIN germplsm ON merged_chk01i2b.gid = germplsm.gid)
INNER JOIN germplsm AS germplsm_1 ON merged_chk01i2b.gpid2 = germplsm_1.gid SET germplsm.gpid1 = germplsm_1.gid
WHERE (((germplsm_1.gnpgs)>0));
merged_CHK01I3 Check that a non-replaced derivative germplasm has the same group ID as its derivative source See table view merged_CHK01I3.UPDATE germplsm INNER JOIN merged_chk01i3 ON germplsm.gid = merged_chk01i3.gid SET germplsm.gpid1 = merged_chk01i3.g2pid1;
Let us say germplasm A is replaced with germplasm B as shown in the GERMPLSM.GRPLCE database column. All germplasm that references germplasm A should be corrected to germplasm B.
merged_CHK10L1Create view CHK10L1 to get a list of all replaced or deleted germplasmnone*(removed in version 7)
merged_CHK10L2Check that no replaced germplasm is used as GERMPLSM.GPID1See table view merged_CHK10L2. Change value of GPID1

UPDATE germplsm g, merged_CHK10L2 m

SET g.gpid1 = m.change_gpid1_to

WHERE g.gid= m.gid;

merged_CHK10L3Check that no replaced germplasm is used as GERMPLSM.GPID2See table view merged_CHK10L3. Change value of GPID2

UPDATE germplsm g, merged_CHK10L3 m

SET g.gpid2 = m.change_gpid2_to

WHERE g.gid= m.gid;

merged_CHK10L4Check that no replaced germplasm is used as PROGNTRS.PIDSee table view merged_CHK10L3. Change value of PROGNTRS.PID

UPDATE progntrs p, merged_CHK10L4 m

SET p.pid = m.change_pid_to

WHERE p.pid = m.gid;

Exactly one name (neither more nor less) of a non-deleted GID (germplsm.grplce <> germplsm.gid) must have NSTAT=1 (preferred English name)




merged_CHK01N1Germplasm with multiple preferred English names  (NSTAT=1)See table view merged_CHK01N1
merged_CHK01N2Germplasm with no preferred English name (NSTAT=1 or NSTAT = 8)See table view merged_CHK01N2
Check that all germplasm methods are in the Methods tablemerged_CHK01MCheck that all methn values is in the methods tableSee table merged_CHK01MAdd the method to the METHODS table or change the Methn value
Check that all germplasm connections to Locations tablemerged_CHK01LCheck that all glocn in the germplsm table is the location tableSee table merged_CHK01LAdd the location to the LOCATIONS table or change the glocn value
merged_CHK01L2Check that all listlocn in listnms table is in the location tableSee table merged_CHK01L2
merged_CHK01L3Check for Duplicates abbreviations (labbr) in the locations tableSee table merged_CHK01L3

ResolveDuplicateLocationAbbr.sql will add "2" to the duplicate labbr

Note:  labbr field is only allowed to have 8 characters.  Adding another character (in this case, 2) will result to an error.  Check first if there are duplicates with 8 characters in length:

SELECT labbr,LENGTH(labbr), COUNT(labbr)
FROM location WHERE LENGTH(labbr)=8
GROUP BY labbr HAVING COUNT(labbr)>1;

For Multiple duplicates and to correct duplicates for all crops in a BMS instance, you may use: ResolveMultipleDuplicateLocationAbbrv.sql

Usage:

  1. Download file
  2. Execute the script in workbench database
  3. call procedure by executing : Call resolveLocationMultipleDuplicates();

You can also manually edit this:

SELECT location.locid, lname, location.labbr FROM location INNER JOIN merged_chk01l3 ON location.`labbr` = merged_chk01l3.labbr2
ORDER BY location.labbr;


merged_CHK01L4Will add location IDs = 6000 and 6001 if it does not existselect * from location where locid in (6000,60001)
Check that all germplasm IDs has a corresponding Namemerged_CHK01NCheck that the non-replaced germplasms in the germplsm table has a matching GID in the Names tableSee table merged_CHK01N
Check Name type and Attribute types are in UDFLDSmerged_CHK01U1

Check id Name type in NAMES table are present in UDFLDS table

See table merged_CHK01U1
merged_CHK01U2Check id Attribute type in ATRIBUTS table are present in UDFLDS tableSee table merged_CHK01U2


More info:

https://cropforge.github.io/iciswiki/articles/g/m/s/GMS_Setup_3b5b.html#CHECK_CENTRAL_DATABASE_INTEGRITY

CHECK CENTRAL DATABASE INTEGRITY

  • Check that all derivative germplasm have GNPGS = -1 [CHK01A].
  • Check that all generative germplasm have GNPGS >= 0 [CHK01B].
  • Check that GPID1 <> GID [CHK01C].
  • Check that GPID2 <> GID [CHK01D].
  • Check that all derivatives with known source (GPID2 <> 0) have known group (GPID1 <> 0) [CHK01E].
  • Check that there are no references to deleted or replaced germplasm in central [CHK10L2] for GPID1, [CHK10L3] for GPID2 and :CHK10L4 (progenitors).
         (These queries use CHK10L1 to get a list of all replaced or deleted germplasm)
  • Check for one-step circularities [CHK04D2].
          This first calls CHK04D1 to get a list of all non replaced derivative germplasm (NRDG), GIDx with source GPID2x. 
          Then it finds all sources of these (GIDy with GIDy=GPID2x) which have their sources GIPD2y=GIDx.
  • Check for two-step circularities [CHK04D4]
          This calls CHK04D1 to find all NRDG, GIDx, and then finds the sources of those with query CHK04D3, GIDy, 
          and finally finds the sources of those, GIDz, which have GPID2z=GIDx
  • Check for three-step circularities [CHK04D6]
          This calls CHK04D1 to find all NRDG, GIDx, then finds the sources of those with query CHK04D3, GIDy, 
          and then finds the sources of those with query CHK04D5, GIDz, and finally finds the sources of those, GIDw, which have GPID2w=GIDx
  • Check that all group IDs are generative or have unknown source (-1, 0, 0) [CHK01G1]
          This finds non-generative group sources with non-zero GPID1 or GPID2. (This calls CHK04D1 first to find NRDG).
  • Check that all source germplasm which are not also the group sources are derivative [CHK01I2]
          Does this by finding all NRDG with non-zero sources and then finding those sources which are not also group sources (GPID1) but are generative.
  • Check that NRDG has the same group ID as its derivative source [CHK01I3]
          Find NRDG which have different groups (CHK01I1.GPID1) to their derivative sources (GERMPLSM.GPID1).
  • Check special name relationships, e.g., Group source for IR nnn- - - should have name IR nnn. CHK04F1 – CHK04F3 
         (somewhat IRIS specific but can be adapted to similar situations).