Data exchange between BMS and EBS

The BMS development team is developing a tool for transferring breeding data between systems using the BrAPI (BrAPI.org) standard. This tool will support the two primary use cases requested by the EiB:

To support migrating data from the BMS to the EBS
To enable the exchange of data between a BMS instance and an EBS instance to support breeding networks.

There are multiple interoperability use cases we need to resolve in the context of BMS instances deployed in a regional network. The first is a federation of instances sharing studies (and therefore germplasm) for joint evaluation. For example:

Node A in the federation prepares a study and assigns Node B to capture the data
Node B of the federation browses the studies of Node A, and pulls the germplasm of the study and then the study design (observation units) of the assigned study
Node B takes the observations
Node B pushes the observations back to Node A
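
The four steps above can be sketched as a small in-memory simulation. This is only an illustration of the intended data flow, not a design for the BrAPP: the Node class, the function names and the "assignedTo" field are hypothetical, and no network calls are made. In a real implementation we would expect each step to map onto BrAPI v2 calls (for example reading studies, germplasm and observation units from Node A and posting observations back), but that mapping is an assumption to be confirmed by the development team.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for one BrAPI-enabled instance."""
    name: str
    germplasm: dict = field(default_factory=dict)      # germplasmDbId -> record
    studies: dict = field(default_factory=dict)        # studyDbId -> study record
    observations: list = field(default_factory=list)   # observation records


def prepare_study(owner, study_id, collector_name, germplasm_ids):
    """Step 1: the owner prepares a study and assigns a collecting node."""
    units = [
        {"observationUnitDbId": f"{study_id}-ou{i}", "germplasmDbId": gid}
        for i, gid in enumerate(germplasm_ids, start=1)
    ]
    owner.studies[study_id] = {"assignedTo": collector_name, "observationUnits": units}


def pull_study(owner, collector, study_id):
    """Step 2: the collector pulls the germplasm first, then the study design."""
    study = owner.studies[study_id]
    if study["assignedTo"] != collector.name:
        raise PermissionError(f"{study_id} is not assigned to {collector.name}")
    for unit in study["observationUnits"]:
        gid = unit["germplasmDbId"]
        collector.germplasm[gid] = copy.deepcopy(owner.germplasm[gid])
    collector.studies[study_id] = copy.deepcopy(study)


def take_observations(collector, study_id, variable, values):
    """Step 3: the collector records one value per observation unit."""
    units = collector.studies[study_id]["observationUnits"]
    for unit, value in zip(units, values):
        collector.observations.append({
            "studyDbId": study_id,
            "observationUnitDbId": unit["observationUnitDbId"],
            "observationVariableName": variable,
            "value": value,
        })


def push_observations(collector, owner, study_id):
    """Step 4: the collector pushes the study's observations back to the owner."""
    batch = [o for o in collector.observations if o["studyDbId"] == study_id]
    owner.observations.extend(copy.deepcopy(batch))
    return len(batch)
```

Note that the germplasm is copied before the study design in step 2, which reflects the point that the receiving node cannot create the study until it holds the germplasm to be planted.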

We also have the longstanding idea of exchanging and migrating data via BrAPI. This has been the main driver for the BrAPI project and, given that BrAPI v2 now supports POST and PUT calls, the exchange can go further than read-only, so we are in a good position to make progress on that front.

Considering these two objectives, it would be good for us to come up with a solution that relies exclusively on BrAPI to solve the federation problem, so that we can easily extrapolate it to other BrAPI databases instead of building something BMS-specific.

The idea of this initiative is then to solve the federation problem by creating a BrAPP that would sit as middleware between BrAPI instances (initially BMS instances). As a first approximation, we want to embed this tool within the BMS so that it is seamless to a BMS user, but we want to design it so that it can be extrapolated to a standalone, system-agnostic BrAPP for moving data from a given BrAPI-enabled source into a BrAPI-enabled target, which we can share with the BrAPI community for further extension (I’m calling it BrING™ for now).

There are many entities that could be shared, but I think it would be better to start with the germplasm because:

Germplasm introduction is a precondition for distributed evaluation (can’t create the studies if I don’t have the germplasm you want me to plant)
We’ve made great progress in moving the germplasm import and updates into APIs. We’ve got germplasm momentum.
Germplasm sharing has value by itself even if you’re not planning to do joint evaluation

Additionally, there are (at least) two ways we can start this project:

We start with the tool as a BMS app and then we work to extrapolate to a standalone tool
We start with the standalone tool and then we see how to integrate it into the BMS so that it is seamless to a BMS user (integrate with https://ibplatform.atlassian.net/browse/IBP-4305)

This would be a development team decision. In this AC I’ll describe the behavior as if the BrAPP is completely integrated with the BMS. For the standalone BrAPP mockups, see Technical Considerations.

With regard to migrating data from the BMS to the EBS, it is important to note that data migration projects fall into two broad categories: a copy or a merge. A copy involves moving a complete dataset into a new, empty database. As long as the original data is consistent and the target system is well documented, this is relatively straightforward, and the tools and expertise developed for one crop will be directly applicable to the next crop. Because copies are reasonably predictable, we can adequately plan and budget resources for them and have done so in this proposal.

A merge is a different animal. Implementing a quality merge requires that entities common to the two datasets be identified and connected. In many cases, this will require pattern matching with supporting logic for each type of connection. A good example is the ICARDA wheat program, where data will likely be housed in a shared instance with CIMMYT wheat. Loading ICARDA wheat pedigree data into an existing CIMMYT wheat database would be difficult, requiring a lot of effort in descending each ICARDA pedigree tree and searching for matches in the CIMMYT database, followed by setting the parent-child relationships. Similar challenges would be encountered with study data. For example, ICARDA may have participated in a large trial planned by CIMMYT. If the ICARDA trial were naively loaded into the combined database, there would be duplication of data, leading to skewed results. Ensuring data integrity will require extensive effort by people with deep expertise in data manipulation, but it will also be critically important to have people who are intimately familiar with the data. Resources would be needed not only from the IT staff at the host institution (i.e. CIMMYT for wheat and maize and IRRI for rice) but also from the breeding staff; their participation will be needed if a quality result is to be obtained. Given the complexity and the dependency on as yet unmade decisions, separate projects with additional funding would need to be defined for data migration involving a merge of databases.
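
To make the matching step of a merge more concrete, here is a minimal sketch, assuming that germplasm records can be matched on a normalized name. The record layout ("id", "name"), the normalization rule and the function names are all hypothetical, and a real merge would need far richer logic (synonyms, pedigree context, review by breeders who know the data). The sketch only shows why a merge needs a matching plan before any record is loaded, and why unmatched records must be set aside for review rather than inserted blindly.

```python
import re


def normalize(name):
    """Reduce a germplasm name to lowercase letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def plan_merge(incoming, existing):
    """Map incoming record ids to existing ids; list the ones with no match."""
    index = {normalize(rec["name"]): rec["id"] for rec in existing}
    matched, unmatched = {}, []
    for rec in incoming:
        target_id = index.get(normalize(rec["name"]))
        if target_id is None:
            unmatched.append(rec["id"])   # needs review or a new record
        else:
            matched[rec["id"]] = target_id  # reuse the existing entity
    return matched, unmatched
```

With a plan like this, parent-child relationships for incoming pedigrees could be set against the matched existing records instead of duplicates, which is the duplication problem described above.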

With the development of the One CGIAR vision, there is a strong emphasis on developing breeding networks involving One CGIAR centers, NARS, and Small to Medium Enterprises (SMEs). In these networks, the ideal is to foster a collaborative community with multiple breeding programs exchanging germplasm and sharing testing resources. These networks will have heterogeneous, distributed data management systems and practices. For the foreseeable future, some of the participating institutions will have their own crop databases and data management practices. For example, the BMS supports a number of crop networks in West Africa (e.g. cowpea, groundnut, millet and sorghum) and several USAID Innovation Laboratory projects engaged with a number of NARS partners in Sub-Saharan Africa (peanut, PIL; soybean, SIL; sorghum and millet, SMIL) that face this challenge. Providing reliable, simple mechanisms for exchanging data between the multiple BMS databases of a network is one of the functionality improvements most often requested by BMS users. Improving data exchange across BMS instances will be part of the broader effort of improving data exchange across data management systems, another major requirement, as outlined in the next paragraph.