Name Matching XML Parser

Available from: BaseSpace Clarity LIMS v2.1.0

Instrument result files in XML format can often be parsed into Clarity LIMS for QC purposes.

For example, a TapeStation instrument run produces an XML result file, which the user imports into the LIMS. The file includes information of interest for each sample, which should be parsed and stored to support capabilities such as QC threshold checking, searching, and visibility in the LIMS interface.

The XmlSampleNameParser tool allows for sample data to be parsed into UDFs on result files (measurement records) that map directly to the derived samples being measured.

The XmlSampleNameParser tool is installed as a standalone jar file as part of the NGS Extensions Package. Currently it contains one script, parseXmlBySampleName.

Provided the result file is in XML format, this script can be used to match data in the file to samples in the LIMS using the measurement record LIMSID.

Values are mapped to UDFs in the LIMS using a configuration file that contains XPath mappings for the result file. (External resources, such as w3schools, can be used to learn more about XPath, and many XML viewing tools will generate it automatically for elements of interest.)

The format for the data needed to make the association between the file contents and the sample in the LIMS is: LIMSID_NAME.

The name is optional and is included only for readability. It may come from the input sample on which the step is being run.
The LIMSID must come from the output result file, which is also where the parsed information will be stored in UDFs.
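
The LIMSID_NAME convention can be sketched in Python as follows. This is an illustration only, not the parser's own code: the function names build_label and split_label are hypothetical, the LIMSID and sample name are made up, and the sketch assumes that a LIMSID never contains an underscore, so the first underscore separates it from the optional name.

```python
from typing import Optional, Tuple


def build_label(limsid: str, name: Optional[str] = None) -> str:
    """Join a result file LIMSID and an optional readable name."""
    return f"{limsid}_{name}" if name else limsid


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split LIMSID_NAME at the first underscore (assumed separator)."""
    limsid, sep, name = label.strip().partition("_")
    if not limsid:
        raise ValueError(f"no LIMSID found in {label!r}")
    # The name is optional, so a bare LIMSID yields None for the name.
    return limsid, (name if sep and name else None)


label = build_label("92-1001", "Sample_A")  # made-up LIMSID and name
print(label)
print(split_label(label))
print(split_label("92-1002"))
```

Splitting at the first underscore rather than the last keeps names that themselves contain underscores intact.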

Typically, it is ideal to set up the instrument run with the sample and result file information, so that it will appear in the same format in the XML result file. To automate setup, you can use a tool such as the template driver file generator.

The LIMSID_NAME can be provided to the instrument as the sample name, or as a comment or other field on the sample. The only conditions are that:

The sample field that you want to use for the LIMSID_NAME must be passed into the result file (e.g., via a driver file).
The configuration file must be set up such that it can access this field from the correct location. (See Configuration File Format.)

Script Parameters and Usage

The parseXmlBySampleName script uses the following parameters, all of which are required:

Parameter                                  Description
-u {user}                                  LIMS username
-p {password}                              LIMS password
-i {URI}                                   LIMS process URI
-inputFile {result file}                   LIMSID of the XML file to be parsed.
-log {log file name}                       Log file name
-configFile {configuration file name}      Parsing configuration file

Example

This example shows the script run on a manually imported TapeStation XML file that has been attached to the TapeStation DNA QC process.

bash -c "/opt/gls/clarity/bin/java -jar /opt/gls/clarity/extensions/ngs-common/v5/EPP/XmlSampleNameParser.jar
script:parseXmlBySampleName
-i {processURI:v2:http}
-u {username}
-p {password}
-inputFile {compoundOutputFileLuid0}
-log {compoundOutputFileLuid1}.html
-configFile /opt/gls/clarity/extensions/conf/tapestation/defaultTapeStationDNAConfig.groovy"
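
Because all six parameters are required, it can be useful to check a command string before saving it in the step configuration. The Python sketch below is a convenience under stated assumptions, not part of the LIMS: the function name missing_parameters is hypothetical, the check only looks for each flag as a whitespace-delimited token, and the command shown is the inner command from the example above without the bash -c wrapper.

```python
import shlex

# The six required flags listed in the parameter table above.
REQUIRED_FLAGS = ("-u", "-p", "-i", "-inputFile", "-log", "-configFile")


def missing_parameters(command: str) -> list:
    """Return the required flags that do not appear in the command string."""
    tokens = shlex.split(command)
    return [flag for flag in REQUIRED_FLAGS if flag not in tokens]


command = (
    "/opt/gls/clarity/bin/java -jar "
    "/opt/gls/clarity/extensions/ngs-common/v5/EPP/XmlSampleNameParser.jar "
    "script:parseXmlBySampleName "
    "-i {processURI:v2:http} -u {username} -p {password} "
    "-inputFile {compoundOutputFileLuid0} "
    "-log {compoundOutputFileLuid1}.html "
    "-configFile /opt/gls/clarity/extensions/conf/tapestation/"
    "defaultTapeStationDNAConfig.groovy"
)

print(missing_parameters(command))
```

An empty list means every required flag is present. The sketch does not verify that each flag is followed by a value.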

Configuration

The process type for the steps on which information will be tracked must be configured with the following output generation:

1x fixed ResultFile output per input
2x fixed ResultFile outputs applied to all inputs
Shared output naming pattern example: {LIST:Instrument Result XML (required),XML Parsing Log}

This represents the minimum configuration. Additional shared output files may be added as required.

For each piece of information that will be parsed from the XML file and stored on the step outputs, configure desired UDFs on ResultFile and associate them with the per-input output result files for the process type.

Configuration File Format

The configuration file should be produced as a .groovy file and stored in the /opt/gls/clarity/customextensions directory. Its format allows for four types of entries:

baseSampleXPath
sampleNameXPath
process.run.UDF."UDF name"
process.output.UDF."UDF name"

The examples provided here use XPath for a TapeStation XML result file.

baseSampleXPath

Provide this one time.
This XPath indicates the list of samples and the associated sample information, relative to the root of the XML file. Specific sample information will be retrieved relative to this path.

sampleNameXPath

Provide this one time.

This XPath indicates where the LIMS sample association information (LIMSID_NAME) can be found, relative to the sample list indicated by baseSampleXPath. Often this will be stored as the sample name or in a comment field for the sample.

// **Sample matching information**
// These two entries are required to locate and identify individual samples' information in the XML file.
baseSampleXPath = "/File[1]/Samples[1]/Sample[Observations!='Ladder']"
sampleNameXPath = "${baseSampleXPath}/Comment[1]/text()"
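
To see what these two entries select, the Python sketch below applies an approximation of them to a small, made-up XML fragment. The fragment contains only the elements named in the XPath above and is not real TapeStation output, and the function name sample_nodes is hypothetical. Python's xml.etree.ElementTree supports only a subset of XPath (no absolute paths on an element, no text()), so the Ladder filter is written in plain Python rather than as a predicate.

```python
import xml.etree.ElementTree as ET

# Made-up XML that mimics only the elements named in the XPath above.
XML = """
<File>
  <Samples>
    <Sample>
      <Observations>Ladder</Observations>
      <Comment>Ladder</Comment>
    </Sample>
    <Sample>
      <Observations></Observations>
      <Comment>92-1001_SampleA</Comment>
    </Sample>
    <Sample>
      <Observations>Marker(s) not detected</Observations>
      <Comment>92-1002_SampleB</Comment>
    </Sample>
  </Samples>
</File>
"""

root = ET.fromstring(XML)  # root is the <File> element


def sample_nodes(file_root):
    """Approximate /File[1]/Samples[1]/Sample[Observations!='Ladder']."""
    samples = file_root.find("Samples")  # first <Samples> child
    if samples is None:
        return []
    # Keep a sample if it has an <Observations> child whose text is not "Ladder".
    return [
        s for s in samples.findall("Sample")
        if any((o.text or "") != "Ladder" for o in s.findall("Observations"))
    ]


# Equivalent of ${baseSampleXPath}/Comment[1]/text() for each kept sample.
labels = [s.findtext("Comment") for s in sample_nodes(root)]
print(labels)
```

The ladder lane is excluded, and each remaining label carries the LIMSID_NAME value used to match the sample in the LIMS.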

process.run.UDF."UDF name"

May be provided multiple times.
Indicates information that is tracked for the entire run, and not on individual samples.
Typically, this will be XPath relative to the root of the XML file, as shown.

The destination result file UDF name is specified as part of the entry name and must match the UDF name in the LIMS exactly.

// **Details that correspond to the whole run**
process.run.UDF."Conc. Units".xPath = "/File[1]/Assay[1]/Units[1]/ConcentrationUnit[1]/text()"
process.run.UDF."Molarity Units".xPath = "/File[1]/Assay[1]/Units[1]/MolarityUnit[1]/text()"
process.run.UDF."MW Units".xPath = "/File[1]/Assay[1]/Units[1]/MolecularWeightUnit[1]/text()"

In the example above, three values will be parsed into the LIMS from the XML file, represented on each individual measurement record (result file) output:

Conc. Units
Molarity Units
MW Units
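
The run-level behavior described above (one value read from the file, then recorded on every result file output) can be sketched in Python. Everything outside the three XPath strings is an assumption for illustration: the XML fragment and unit values are made up, the LIMSIDs are invented, and the helper to_relative_path is hypothetical. It exists only because xml.etree.ElementTree cannot evaluate absolute paths or text() directly.

```python
import xml.etree.ElementTree as ET

# Made-up fragment containing only the elements named in the XPath entries.
XML = """
<File>
  <Assay>
    <Units>
      <ConcentrationUnit>ng/ul</ConcentrationUnit>
      <MolarityUnit>nmol/l</MolarityUnit>
      <MolecularWeightUnit>bp</MolecularWeightUnit>
    </Units>
  </Assay>
</File>
"""

# UDF name -> XPath, copied from the configuration example above.
RUN_UDFS = {
    "Conc. Units": "/File[1]/Assay[1]/Units[1]/ConcentrationUnit[1]/text()",
    "Molarity Units": "/File[1]/Assay[1]/Units[1]/MolarityUnit[1]/text()",
    "MW Units": "/File[1]/Assay[1]/Units[1]/MolecularWeightUnit[1]/text()",
}


def to_relative_path(xpath, root_tag="File"):
    """Strip the root step and trailing text() so ElementTree can use the path."""
    prefix = "/%s[1]/" % root_tag
    suffix = "/text()"
    if not xpath.startswith(prefix):
        raise ValueError("expected a path starting with " + prefix)
    path = xpath[len(prefix):]
    if path.endswith(suffix):
        path = path[:-len(suffix)]
    return path


root = ET.fromstring(XML)
run_values = {
    udf: root.findtext(to_relative_path(xpath)) for udf, xpath in RUN_UDFS.items()
}

# The same run-level values are recorded on every per-input result file.
output_limsids = ["92-1001", "92-1002"]  # invented LIMSIDs
per_output = {limsid: dict(run_values) for limsid in output_limsids}
print(per_output)
```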

process.output.UDF."UDF name"

May be provided multiple times.
Indicates information that is tracked on individual samples.
Typically, this will be XPath relative to the sample XPath (baseSampleXPath), as shown.

The destination result file UDF name is specified as part of the entry name and must match the UDF name in the LIMS exactly.

// **Details that correspond to Samples**
process.output.UDF."Concentration".xPath = "${baseSampleXPath}/Concentration[1]/text()"
process.output.UDF."Region 1 Average Size - bp".xPath = "${baseSampleXPath}/Regions/Region[1]/AverageSize[1]/text()"
process.output.UDF."Region 1 Conc.".xPath = "${baseSampleXPath}/Regions/Region[1]/Concentration[1]/text()"
process.output.UDF."Peak 1 MW".xPath = "${baseSampleXPath}/Peaks/Peak[1]/Size[1]/text()"
process.output.UDF."Peak 1 Conc.".xPath = "${baseSampleXPath}/Peaks/Peak[1]/CalibratedQuantity[1]/text()"

In the example above, five values will be parsed into the LIMS from the XML file, represented on each individual measurement record (result file) output:

Concentration
Region 1 Average Size - bp
Region 1 Conc.
Peak 1 MW
Peak 1 Conc.
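
As a rough illustration of per-sample parsing, the Python sketch below reads the five values for each sample and keys them by the LIMSID taken from the Comment field. The XML fragment and its numbers are made up, the paths are the portions of the XPath entries that follow ${baseSampleXPath} with the trailing text() removed (xml.etree.ElementTree does not support it), and the ladder filter is omitted for brevity. It is a sketch of the idea, not the tool's implementation.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with one sample; values are illustrative only.
XML = """
<File>
  <Samples>
    <Sample>
      <Observations></Observations>
      <Comment>92-1001_SampleA</Comment>
      <Concentration>12.5</Concentration>
      <Regions>
        <Region><AverageSize>410</AverageSize><Concentration>9.8</Concentration></Region>
      </Regions>
      <Peaks>
        <Peak><Size>395</Size><CalibratedQuantity>4.2</CalibratedQuantity></Peak>
      </Peaks>
    </Sample>
  </Samples>
</File>
"""

# UDF name -> path relative to each sample node.
OUTPUT_UDFS = {
    "Concentration": "Concentration[1]",
    "Region 1 Average Size - bp": "Regions/Region[1]/AverageSize[1]",
    "Region 1 Conc.": "Regions/Region[1]/Concentration[1]",
    "Peak 1 MW": "Peaks/Peak[1]/Size[1]",
    "Peak 1 Conc.": "Peaks/Peak[1]/CalibratedQuantity[1]",
}

root = ET.fromstring(XML)
results = {}
for sample in root.findall("Samples/Sample"):
    # The LIMSID is the part of LIMSID_NAME before the first underscore (assumed).
    limsid = (sample.findtext("Comment") or "").partition("_")[0]
    results[limsid] = {
        udf: sample.findtext(path) for udf, path in OUTPUT_UDFS.items()
    }

print(results)
```

A value that is missing from the XML comes back as None in this sketch, which makes gaps easy to spot before comparing against the UDFs configured on the result file.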

Additional Information

Some other scripts you may find useful:
