NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Design for staging downloads for release

Plan Basics

This is the plan for managing ENCODE downloads so that it is clear which files need to be released when a composite track is updated with new data (subtracks). Currently, all download files for a track are in a single directory, with an index.html file listing the files with their metadata. The index file is generated by a Perl script that uses preamble.html, the *.gz files in the local directory, the associated table metadata from hgwdev's trackDb, and a fileDb.ra containing metadata for download-only files. An example of the current setup is:

   goldenPath/hg18/encodeDCC/wgEncodeYaleChipSeq/{index.html, preamble.html, fileDb.ra, *.gz}

The proposed setup is:

   goldenPath/hg18/encodeDCC/wgEncodeYaleChipSeq/{index.html, preamble.html, fileDb.ra, *.gz}                      # containing the "live" set seen in hgwdev downloads.
   goldenPath/hg18/encodeDCC/wgEncodeYaleChipSeq/{release1, release2}/{index.html, preamble.html, fileDb.ra, *.gz} # frozen in time: all files which should be in the release (not just new ones)

However, on the RR (hgDownloads) site, there will NOT be a release subdirectory structure. All files (current and historical) are kept in the track downloads root directory. The index.html on the RR will list only the current versions.


Historical perspective

  • The pipeline will create files in the track downloads dir (e.g. wgEncodeYaleChipSeq).
  • When the composite track is first staged for release:
  1. A "release1" subdirectory will be created.
  2. The index.html to be frozen will be copied to the release1 dir. Also copy preamble.html and fileDb.ra.
  3. The *.gz files should be hard linked into release1. They can then be seen both in the track downloads directory (necessary for genome-test to find them) and in the "frozen" release directory.
  • As time goes on, the pipeline will create new files in the track downloads root (aka "live") directory. These new files will NOT be hardlinked into the older release directory.
  • Additionally, some files will be "versioned". Versioned files should have distinct names from the already released versions (e.g. V2 appended to name). The result should be new files that are only in the live root directory. The older versions need to be removed from the live root directory, but will still be in the release1 directory.
  • When the composite track is staged for a second release, the 3 steps listed above are repeated for a release2 subdirectory.
  • In the release2 directory there should be files with 3 links (unchanged since release 1) and files with only 2 links (new for this release). In the release1 directory there may be files which have only one link (eliminated or replaced by release2). The number of links will change with additional releases.
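
The three staging steps above can be sketched in Python as follows. This is a minimal illustration and not the actual pipeline code: the function name `stage_release` and the constant `METADATA_FILES` are hypothetical, and the sketch assumes the release directory lives on the same filesystem as the track downloads directory (a requirement for hard links).

```python
import glob
import os
import shutil

# Hypothetical constant: files that are copied (not linked) so that each
# release keeps its own frozen version even after the live copy is rebuilt.
METADATA_FILES = ("index.html", "preamble.html", "fileDb.ra")


def stage_release(track_dir, release_number):
    """Freeze the current contents of track_dir into track_dir/release<N>."""
    # Step 1: create the release subdirectory.
    release_dir = os.path.join(track_dir, "release%d" % release_number)
    os.mkdir(release_dir)  # raises if this release was already staged

    # Step 2: copy the index and its supporting metadata files.
    for name in METADATA_FILES:
        src = os.path.join(track_dir, name)
        if os.path.exists(src):
            shutil.copy2(src, os.path.join(release_dir, name))

    # Step 3: hard link the data files, so no extra disk space is used.
    # The pattern matches only files directly in track_dir, not those in
    # earlier release subdirectories.
    for src in sorted(glob.glob(os.path.join(track_dir, "*.gz"))):
        os.link(src, os.path.join(release_dir, os.path.basename(src)))

    return release_dir
```

After staging, `os.stat(path).st_nlink` reports the link count of a file, which is one way to check the 3-link, 2-link and 1-link cases described in the last bullet.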

Advantages

  • Both genome-test and the RR will maintain their files at the track downloads root directory.
  • On genome-test, the index.html can be rebuilt over and over to show the current list of *.gz files in the root directory.
  • On the RR the index.html will have been pushed from hgwdev and should list only the current versions of the files. It will hide (from HTTP) any replaced files. However, those replaced (versioned) files should still be in the track downloads root directory on RR and may be reached via FTP.
  • For each release there will be a "frozen" copy of what the downloads dir looked like when the release occurs.
  • Hard links instead of copies are used to conserve space.
  • Generating lists of changed files should be easy by comparing the contents of two release directories.
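
As a rough sketch of the last point, two release directories can be compared by file name and inode number. The helper names `gz_inodes` and `compare_releases` are hypothetical, and the inode comparison assumes both directories are on the same filesystem, which the hard-link scheme already requires.

```python
import os


def gz_inodes(release_dir):
    """Map each *.gz file name in release_dir to its inode number."""
    return {
        name: os.stat(os.path.join(release_dir, name)).st_ino
        for name in os.listdir(release_dir)
        if name.endswith(".gz")
    }


def compare_releases(old_dir, new_dir):
    """Return (added, removed, replaced) lists of *.gz file names."""
    old, new = gz_inodes(old_dir), gz_inodes(new_dir)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # Same name but a different inode: the file was replaced with new
    # content rather than carried over as a hard link.
    replaced = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return added, removed, replaced
```

For example, `compare_releases("release1", "release2")` run from the track downloads directory would list versioned files (such as a name with V2 appended) under `added` and the versions they supersede under `removed`.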

Weaknesses

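  • Since the files are hardlinked, changing the contents of the live version in place will also change the contents of the "frozen" version in the release directories. Copying onto an existing file with "cp" overwrites the shared data, whereas "mv" replaces the directory entry and leaves the frozen copy untouched. Making sure the live directory is only populated by "mv" and not "cp" should minimize this problem.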
  • The contents of the downloads dir on hgwdev will differ from the RR. On hgwdev, this root dir should contain only the current files, while the RR dir will contain the latest released files and any older ones for which FTP access is maintained. The only plan currently for accessing old files (on the RR) is via FTP. If this is a large enough issue, it may be necessary to create a release directory structure on the RR too.
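
The hard-link hazard in the first weakness can be avoided programmatically with the same idea as "mv": write the new data under a temporary name and rename it over the live file. The sketch below is one possible way to do this, not part of the existing pipeline; the function name `install_into_live` is hypothetical.

```python
import os
import shutil
import tempfile


def install_into_live(src, live_path):
    """Place src at live_path without modifying any hard-linked frozen copy."""
    live_dir = os.path.dirname(os.path.abspath(live_path))
    # Create the temporary file in the live directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=live_dir)
    os.close(fd)
    try:
        # copy2 also carries over the source file's permissions.
        shutil.copy2(src, tmp_path)
        # The rename points the live name at a new inode. The old inode,
        # still linked from release directories, is never written to.
        os.replace(tmp_path, live_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Opening `live_path` for writing directly would instead overwrite the shared data, which is exactly the "cp" behaviour the bullet warns about.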