NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Data Wrangler Controlled Vocabulary HOWTO

Adding Human Cell Types and Protocols

1) The lab drives the process for creating an approved protocol

  • Generates a protocol document
  • Sends to the appropriate Working Group for approval (Resources Working Group if human cell line)

2) In parallel, the lab generates a new entry in the wiki table

  • Creates an entry in the Tier 3 table on the Human Cell Types wiki page, or in the Registration table on the Mouse Cell Types wiki page.
  • Human: if the lab needs help finding the appropriate ontology entries, here are some hints on how to do this.
    • To find a human cell type in the BRENDA Tissue Ontology (BTO):
      • Go to the EBI Ontology Lookup page
      • Select BRENDA Tissue / enzyme source in the pull-down ontology list
      • Enter the cell type in the Term Name box and hit return. This should generate a list of hyperlinks to BTO terms.
      • Browse through the list of hyperlinks by (1) clicking on the hyperlink (which should take you back to the Ontology Lookup page) and (2) clicking the Browse button (which should take you to the BTO Ontology browser).
      • When you find an appropriate term, copy its URL from the BTO Ontology browser.
  • The data wrangler should not move to the next step until the lab has filled in every column of the cell type table to the best of its ability.

3) When the lab has filled in their entry and when the protocol is approved (for human), the wrangler 'scrapes' the new entry from the wiki table into cv.ra.

  • The simplest way to do this is with kent/src/hg/encode/cellTypeParser/cellTypeParser.py. Instructions below assume running the script on hgwdev.
    • Setup as follows:
      • Add /hive/groups/encode/dcc/bin to your path so that you use the provided python rather than the default python.
    • Run the script:
      • Run the command cellTypeParser.py > stanza.ra
        • use the -f (force) option if you want to process cell types that haven't been approved yet.
        • use the -n (noDownload) option if you want to skip downloading the protocol documents.
      • Edit stanza.ra to make the stanzas consistent with what's in cv.ra. Here are some things that often need to be fixed:
        • The lab enters a hyperlink to the vendor information, but doesn't enter the vendor information explicitly.
        • All .ra fields are expected to be on one single line! The description field in particular might have embedded carriage returns that will have to be removed.
        • The lab might specify the BTO ontology number but not put down a hyperlink.
        • Sometimes, the cells were obtained from another researcher instead of purchased from a commercial source. In such cases, it's strongly recommended that the orderUrl field be filled in with a hyperlink to this researcher's lab page: this protects us from having a name that's mistyped slightly, resulting in an unidentifiable source. Also in such cases, the vendorId field should contain the antibody name.
        • The default tag might not be unique. In these cases, the wrangler should add a short suffix to the tag to make it unique.
      • Insert each stanza from stanza.ra into cv.ra, maintaining proper alphabetical order.
      • Each approved protocol file is downloaded into the current directory by default. Rename the protocol files as needed so that they match the protocol filename given in cv.ra and, by extension, the naming of the other protocol files: <celltype>_<lab>_protocol.pdf (for example, MCF-10A_Struhl_protocol.pdf).
      • Check each protocol file into htdocsExtras/ENCODE/protocols/cell/<species>
  • The wrangler edits the wiki page to move the new cell types to the comment section at the bottom of the table.
    • The wrangler first enters a header comment indicating the date of the modification. See the wiki table for many examples of this.
    • The wrangler then moves each entry just registered from the 'active' portion of the table to the comment portion, underneath the new header comment.
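Two of the manual fixes listed in step 3 (embedded carriage returns in a field, and a default tag that is not unique) can be sketched as small helpers. This is only an illustration and is not part of cellTypeParser.py: the function names are hypothetical, and the suffix scheme (b, c, d, ...) is an assumption, since the text only asks for "a short suffix".

```python
import string


def flatten_value(value):
    """Collapse embedded carriage returns, newlines, and runs of spaces
    into single spaces so the field fits on one line."""
    return " ".join(value.split())


def unique_tag(tag, existing_tags):
    """Return tag unchanged if it is unused; otherwise append a short
    lowercase suffix. The b, c, d, ... scheme is an assumption."""
    if tag not in existing_tags:
        return tag
    for suffix in string.ascii_lowercase[1:]:
        candidate = tag + suffix
        if candidate not in existing_tags:
            return candidate
    raise ValueError("no free single-letter suffix for tag %r" % tag)


# Made-up values, for illustration only.
print(flatten_value("first line of a description\r\nsecond line"))
print(unique_tag("MCF10A", {"MCF10A", "HELAS3"}))
```

The wrangler would still review the result by hand; the helpers only remove the most mechanical part of the cleanup.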

4) The wrangler runs the commands to rebuild the table of approved cell lines

  • In ~/kent/src/hg/makeDb/trackDb/cv/alpha/, commit the changes to the CV: git add cv.ra, git commit -m "Adding <foo>", git pull, git push.
  • If updating antibody or cell type protocol documents, go to ~/htdocsExtras/ and do a git pull, then a make, and then a make alpha.
  • Go to ~/kent/src/hg/encode/encodeValidate/. Do a git pull, then make prod.
  • Go to ~/kent/src/hg/makeDb/trackDb/. Do a git pull, then a make DBS=hg19, then a make alpha.
  • Look for the new cell type on the Cell Types page of the test browser. Evaluate the new entry.
  • As a final step, move the cell type entry from the wiki table into the commented-out section below.
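The command sequence in step 4 can be written down as a dry-run plan, which is handy for checking the order before typing anything on hgwdev. This sketch only prints the commands and does not run them; the function name and the docs_changed flag are hypothetical, while the directories and commands are transcribed from the steps above.

```python
CV_DIR = "~/kent/src/hg/makeDb/trackDb/cv/alpha"
DOCS_DIR = "~/htdocsExtras"
VALIDATE_DIR = "~/kent/src/hg/encode/encodeValidate"
TRACKDB_DIR = "~/kent/src/hg/makeDb/trackDb"


def rebuild_plan(commit_message, docs_changed=False):
    """Return (directory, command) pairs in the order given in step 4."""
    plan = [
        (CV_DIR, "git add cv.ra"),
        (CV_DIR, 'git commit -m "%s"' % commit_message),
        (CV_DIR, "git pull"),
        (CV_DIR, "git push"),
    ]
    if docs_changed:
        # Only needed when antibody or cell type protocol documents changed.
        plan += [
            (DOCS_DIR, "git pull"),
            (DOCS_DIR, "make"),
            (DOCS_DIR, "make alpha"),
        ]
    plan += [
        (VALIDATE_DIR, "git pull"),
        (VALIDATE_DIR, "make prod"),
        (TRACKDB_DIR, "git pull"),
        (TRACKDB_DIR, "make DBS=hg19"),
        (TRACKDB_DIR, "make alpha"),
    ]
    return plan


for directory, command in rebuild_plan("Adding <foo>", docs_changed=True):
    print("cd %s && %s" % (directory, command))
```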

Adding Antibodies

Note: the current naming convention for antibodies is <targetname>_(<vendorId>), such as BCLAF1_(SC-101388). By following this convention, when a lab needs to start ordering a different antibody because the old one is no longer available (a frequent occurrence), we can document exactly what the new antibody is without generating a confusing list such as BCLAF1_2, BCLAF1_3, and so forth.
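As an illustration of the convention, the following sketch builds and parses names of the form <targetname>_(<vendorId>). The function names are hypothetical, and the regular expression assumes that the vendor ID itself contains no parentheses.

```python
import re

# <targetname>_(<vendorId>); assumes no parentheses inside the vendor ID.
ANTIBODY_NAME = re.compile(r"^(?P<target>.+)_\((?P<vendor_id>[^()]+)\)$")


def make_antibody_name(target, vendor_id):
    """Build a name such as BCLAF1_(SC-101388)."""
    return "%s_(%s)" % (target, vendor_id)


def parse_antibody_name(name):
    """Split a conventional name into (target, vendor_id)."""
    match = ANTIBODY_NAME.match(name)
    if match is None:
        raise ValueError("%r does not follow <targetname>_(<vendorId>)" % name)
    return match.group("target"), match.group("vendor_id")


print(make_antibody_name("BCLAF1", "SC-101388"))
print(parse_antibody_name("BCLAF1_(SC-101388)"))
```

A name such as BCLAF1_2 fails to parse, which is the kind of ambiguous numbering the convention is meant to avoid.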

  • The lab produces an antibody validation document, as described in http://encodewiki.ucsc.edu/EncodeDCC/index.php/Antibodies.
  • The lab enters the antibodies into the Antibody wiki table, uploads the antibody validation document, and contacts the NHGRI to get approval for that document.
  • After the validation document has been approved, the wrangler registers the antibody as follows:
    • Setup (assumes hgwdev):
      • The wrangler sets their path to use the standard python executable in /cluster/software/bin/ and adds the subdirectory kent/python/lib/ucscgenomics to their PYTHONPATH environment variable.
      • The wrangler verifies that they have access to the pdftk executable (required only if the lab has submitted multiple PDF files as one antibody validation "document"). Contact cluster-admin as needed.
    • Execution
      • The wrangler runs the script antibodyWikiParser.py, located in kent/src/hg/encode/antibodyWikiParser/, directing stdout into a separate file. This script parses the contents of the antibody wiki table into an ra stanza, prints the stanza to stdout, and downloads all approved validation files to the download directory (the local directory by default) with a filename that follows the expected naming convention.
        • use the -f (force) option if you want to process antibodies that haven't been approved yet.
        • use the -n (noDownload) option if you want to skip downloading the validation documents.
        • use the -h (help) option to check options and default values.
      • The wrangler verifies the contents of the antibody validation documents. If they are legible PDFs that contain the expected information for the indicated antibody, then the wrangler checks them into git under htdocsExtras/ENCODE/validation/antibodies.
      • The wrangler edits the new stanza, in the file that captured the output of antibodyWikiParser.py, to clear up any inconsistencies. This step is mandatory, because no two users ever fill in the wiki tables in exactly the same way and the new stanza WILL have some errors. Failure to do this step WILL result in disaster, and a bad code review. Here are some things that will often need editing:
        • the antibody validation types, which appear in parentheses on the validation line. These are listed in the antibody validation document, in the standardized form section.
        • the vendorId, vendorName, and orderUrl. Sometimes, labs will fill in an orderUrl and neglect to specify the vendor information.
        • sometimes, the lab obtains the antibody from another researcher rather than a commercial source. In these cases, it's highly recommended to put the researcher's lab URL in the orderUrl field, as a safeguard to make sure the correct person was specified.
      • The wrangler copies the new stanza into cv.ra, and updates git.
    • The wrangler performs makes, as described above.
    • The wrangler cleans up the antibody table by moving any newly-registered antibodies into a comment section below the visible rows of the table, as described above.
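One of the common stanza problems noted above, an orderUrl without explicit vendor information, can be caught with a quick check before the stanza goes into cv.ra. This is a minimal sketch and not part of antibodyWikiParser.py: it assumes each .ra line is a field name followed by whitespace and a value, the function names are hypothetical, and the example stanza (including its "term" line and URL) is made up.

```python
def parse_stanza(text):
    """Parse one .ra stanza into a dict, assuming 'field value' lines."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def vendor_problems(fields):
    """List vendor-related fields that are missing or empty."""
    problems = []
    for name in ("vendorName", "vendorId", "orderUrl"):
        if not fields.get(name):
            problems.append("missing or empty %s" % name)
    return problems


# Made-up stanza: the lab gave a hyperlink but no explicit vendor details.
example = """term BCLAF1_(SC-101388)
orderUrl http://example.com/antibody-page
"""
for problem in vendor_problems(parse_stanza(example)):
    print(problem)
```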

Maven Review Process

Run the cv validator

~/kent/python/programs/cvValidate/cvValidate -c <path to cv file> (defaults to ~kent/src/hg/makeDb/trackDb/alpha/cv.ra)

Prior to running this script, add /hive/groups/encode/dcc/python/lib to the PYTHONPATH environment variable.

Additional things to check for include:

  • all tags should be uppercase, minus a possible lowercase suffix (see the antibodies stanzas for examples).
  • descriptions should be clear. The stanza should generally be clear, and consistent with related stanzas.
  • there should not be any lines with a label and no contents. Apparently, such a line is outside the .ra file specifications.
  • all hyperlinks should be functional
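Two of these checks (tag case, and labels with no contents) are mechanical enough to script. The sketch below is independent of cvValidate and is only an illustration: the function name is hypothetical, and the tag pattern assumes that tags consist of uppercase letters and digits followed by an optional run of lowercase letters, which may be stricter than what cv.ra actually allows.

```python
import re

# Assumed tag shape: uppercase letters/digits, then an optional lowercase suffix.
TAG_PATTERN = re.compile(r"^[A-Z0-9]+[a-z]*$")


def review_stanza(text):
    """Return a list of human-readable problems found in one stanza."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        parts = line.split(None, 1)
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        label = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if not value:
            problems.append("line %d: label '%s' has no contents" % (number, label))
        elif label == "tag" and not TAG_PATTERN.match(value):
            problems.append(
                "line %d: tag '%s' is not uppercase with an optional "
                "lowercase suffix" % (number, value)
            )
    return problems


# Made-up stanza with one bad tag and one empty label.
for problem in review_stanza("tag HeLa\nvendorName\ndescription example text"):
    print(problem)
```

Checking that hyperlinks work and that descriptions are clear still needs a human reviewer.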


Check the mdbPrint - validate Readme in the cv directory (ask Morgan to do this). During code review, look for inconsistencies, double-check the names, and make sure the descriptions are understandable.


Maven CV Beta Management

After the git reports are generated on Sunday, the maven updates beta/cv.ra as follows:

  • Preliminaries: address any remaining provisional stanzas.
  • Do a git pull to get a fresh version of alpha/cv.ra
  • Do a diff between alpha/cv.ra and beta/cv.ra to help in documenting what is different between versions.
  • Copy alpha/cv.ra to beta/cv.ra
  • Update git with the new beta/cv.ra. In the update message, summarize the major differences since the previous version.
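To help write the summary for the update message, the diff between the two files can be reduced to a stanza-level list of what was added, removed, or changed. This is a minimal sketch under two assumptions: stanzas are separated by blank lines, and the first line of a stanza identifies it. The function names and the sample text are hypothetical.

```python
import re


def split_stanzas(text):
    """Map each stanza's first line to the stanza text (blank-line separated)."""
    stanzas = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.rstrip() for line in block.splitlines() if line.strip()]
        if lines:
            stanzas[lines[0]] = "\n".join(lines)
    return stanzas


def summarize_changes(alpha_text, beta_text):
    """Compare alpha against beta and group stanza keys by kind of change."""
    alpha = split_stanzas(alpha_text)
    beta = split_stanzas(beta_text)
    shared = set(alpha) & set(beta)
    return {
        "added": sorted(set(alpha) - set(beta)),
        "removed": sorted(set(beta) - set(alpha)),
        "changed": sorted(key for key in shared if alpha[key] != beta[key]),
    }


# Made-up file contents, for illustration only.
alpha_cv = "term A\nvendorName New Vendor\n\nterm B\nvendorName Other\n"
beta_cv = "term A\nvendorName Old Vendor\n"
for kind, keys in sorted(summarize_changes(alpha_cv, beta_cv).items()):
    print(kind, keys)
```

In practice the two strings would be read from alpha/cv.ra and beta/cv.ra before the copy step, and a plain diff remains useful for the line-level detail.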