NOTE! This is a read-only copy of the ENCODE2 wiki.
Please go to the ENCODE3 wiki for current information.

Pipeline Architecture

From Encode2 Wiki

Setting up the ENCODE Submission Pipeline production environment

End-users in this discussion are the bioinformatics leads in the ENCODE labs who create and upload the submission tars.

Rails sandbox (production)

The ENCODE Submission Pipeline is a Ruby on Rails application.

http://encodesubmit.ucsc.edu/

It is a Ruby process running on hgwdev and listening on port 49000. The production domain is encodesubmit.ucsc.edu; Apache maps port 49000 to encodesubmit.ucsc.edu, which is a DNS alias for hgwdev. See hgwdev:/usr/local/apache/conf/httpd.conf for details.

Note that Rails instances in the port 3000 range are dev and testing instances (beta, galt, kate, aamp).
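Each instance's port can be read off its mongrel command line in ps output (see the ./status example below). A small helper for doing that mechanically (hypothetical; not part of the sandbox):

```shell
# Hypothetical helper: pull the port number out of a mongrel_rails
# command line as it appears in "ps aux" output.
port_of() {
    # prints the argument following the last "-p" on the command line
    echo "$1" | sed -n 's/.*-p \([0-9][0-9]*\).*/\1/p'
}

# Example, using the production command line shown later on this page:
port_of "/usr/bin/ruby /usr/bin/mongrel_rails start -e production -p 49000 -d"
```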

The Rails production sandbox is in /hive/groups/encode/dcc/pipeline/hgEncodeSubmit.

The sandbox configuration file is /hive/groups/encode/dcc/pipeline/hgEncodeSubmit/config/database.yml. To pick up changes to this file, you must stop and restart the Rails instance.

BEGIN by doing this:

    ssh qateam@hgwdev
    tcsh
    cd /hive/groups/encode/dcc/pipeline/hgEncodeSubmit


Other useful scripts live in the same directory. ./status shows you four groups of processes (rails, runner, runforever, and all qateam processes):

cat ./status

    #!/bin/tcsh
    ps aux | grep rails | grep -v grep
    echo ""
    ps aux | grep pipeline_runner | grep -v grep
    echo ""
    ps aux | grep runforever | grep -v grep
    echo ""
    ps aux | grep qateam


[qateam@hgwdev:/hive/groups/encode/dcc/pipeline/hgEncodeSubmit] ./status

    qateam   24164  [...] /usr/bin/ruby /usr/bin/mongrel_rails start -e production -p 49000 -d
    
    qateam    1616  [...] ruby ./pipeline_runner.rb prod 6149 1017 validate
    qateam   24161  [...] ruby ./pipeline_runner.rb prod
    
    qateam   24134  [...] /bin/tcsh ./runforever
    
    [...]

In this example you see the pipeline's foreground mongrel Rails process, the background pipeline_runner.rb with one child validator exec'd, and runforever, which automatically tries to restart the background runner if it crashes. Only the production instance uses runforever.
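The actual runforever script (a tcsh script) is not reproduced on this page; the sketch below illustrates the idea it implements, namely restarting the background runner whenever it exits. The restart cap and the `false` stand-in for pipeline_runner.rb are artifacts of the sketch so that it terminates; the real loop runs indefinitely.

```shell
# Hypothetical sketch of a runforever-style supervisor loop.
# "false" stands in for: ruby ./pipeline_runner.rb prod
# (here it "crashes" immediately, triggering a restart).
restarts=0
while [ $restarts -lt 3 ]; do     # the real script loops forever
    false                         # stand-in for the background runner
    restarts=$((restarts+1))      # runner exited; loop around and restart it
done
```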

This example shows only production processes, but you may also see processes for beta and other developer instances.

To stop the pipeline:

    ./stop    # wait about 5 seconds to give background process a chance to stop
    ./status  # verify that it did what you expected
    hgsql encpipeline_prod -e 'select * from queued_jobs'  
    # make sure no "quit" messages are there; often this queue is empty if you are restarting the system.
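The "no quit messages" check above is done by eyeballing the hgsql output. A hypothetical helper that automates the same check (it takes the query output as an argument, since hgsql is only available on hgwdev):

```shell
# Hypothetical check: succeed only if the queued_jobs output
# contains no "quit" messages. $1 is the output of:
#   hgsql encpipeline_prod -e 'select * from queued_jobs'
queue_is_clean() {
    ! echo "$1" | grep -q quit
}
```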

If an extra background pipeline_runner.rb is somehow still running even after you ran ./stop and waited, you may also need to run:

    ./quitepipebg  # wait 5 seconds
    hgsql encpipeline_prod -e 'select * from queued_jobs'  # just to verify
    # make sure no "quit" messages are there; often this queue is empty if you are restarting the system.
    

To start the pipeline:

    ./go      # wait a few seconds
    ./status  # verify that it did what you expected

A non-modifying cvs check for repository changes to existing files (unlike cvsup or cvs up, it does not change the sandbox):

    ./cvscheck
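The ./cvscheck script itself is not shown here; a non-modifying check like this presumably wraps something along the lines of `cvs -nq update`, whose output flags each file with a status letter. A hypothetical helper for reading that output:

```shell
# Hypothetical: classify one line of "cvs -nq update" output.
# "U file" means the repository has a newer version; "M file"
# means the sandbox copy is locally modified.
cvs_line_kind() {
    case "$1" in
        U\ *) echo "needs-update" ;;
        M\ *) echo "locally-modified" ;;
        *)    echo "other" ;;
    esac
}
```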

The pipeline is automatically restarted after a reboot. The startup script hgwdev:/etc/rc.d/rc.local contains:

/bin/su -l qateam -c /cluster/encodeftp/hgEncodeSubmit/go

Rails sandbox (beta)

The beta sandbox lives in /cluster/bin/build/scripts/hiding/hgEncodeSubmit.

BEGIN by doing this:

    ssh qateam@hgwdev
    tcsh
    cd /hive/groups/encode/dcc/pipeline/hgEncodeSubmit

Use ./status, ./stop, ./go, ./cvscheck as documented above.

Rails db

The Rails pipeline maintains its own InnoDB tables in hgwdev.encpipeline_prod.

Data dir

Files uploaded by end-users are placed in /cluster/data/encode/pipeline/encpipeline_prod/$projectid, where $projectid is a numeric project id.

/cluster/data/encode/pipeline/encpipeline_prod/ is a symlink to /cluster/encodeftp/encpipeline_prod
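The per-project layout above is mechanical: the upload directory is the encpipeline_prod root plus the numeric project id. A hypothetical helper (the project id 123 below is made up for illustration):

```shell
# Hypothetical helper mirroring the upload-directory layout
# described above: one directory per numeric project id.
project_dir() {
    echo "/cluster/data/encode/pipeline/encpipeline_prod/$1"
}

project_dir 123   # made-up project id, for illustration only
```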

These locations are about to change as the new hive storage system comes online; all the /cluster/store/ locations, including /cluster/data/encode, are going away. Look in /hive/archive/.

Scripts

The ruby front-end runs various scripts (loader and validator); the locations of these scripts are configured via the project_types table in the encpipeline_prod database.

The scripts are installed thus:

    hgwdev$ cd ~/kent/src/hg/encode
    hgwdev$ make install

This copies the scripts into /cluster/encodeftp/bin.

The scripts also use some Perl automation packages (e.g. Encode.pm and HgDb.pm) which are installed into /cluster/bin/scripts; if you change those, you should run:

    hgwdev$ cd ~/kent/src/hg/utils/automation
    hgwdev$ make alpha

The pipeline's loader script (doEncodeLoad.pl) loads data into the hgwdev.hg18 mysql database and creates links in /gbdb and /usr/local/apache/htdocs/goldenPath.

The scripts use configuration ra files located in the same tree as the upload directories (/cluster/data/encode/pipeline/encpipeline_prod/config).

FTP

End users may use FTP to upload large files to encodeftp.cse.ucsc.edu (default port 21). These files are then available to the loader script running on hgwdev via:

    /cluster/encodeftp/prod

Thus, /cluster/encodeftp/prod/mpw6/Yale/HeLa-S3_Pol2.tar is where FTP originally places this submission archive.
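Assuming the user/lab layout implied by the example path above (mpw6 is the FTP user, Yale the lab), the landing path of an upload can be derived mechanically. A hypothetical helper:

```shell
# Hypothetical sketch of where an FTP upload lands on hgwdev,
# assuming the /prod/<user>/<lab>/<archive> layout implied by
# the example path above.
ftp_upload_path() {
    # $1 = ftp user, $2 = lab, $3 = archive name
    echo "/cluster/encodeftp/prod/$1/$2/$3"
}

ftp_upload_path mpw6 Yale HeLa-S3_Pol2.tar
```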

After a user uploads files into their FTP space, the files appear in the FTP dropdown box on the upload dialog page once the browser is refreshed.

The pipeline system maintains the ftpasswd file in parallel, with the same users and passwords. This requires that the ftpasswd Perl script can be found.

ProFTPD is the FTP server we are using. The config file is /etc/proftpd.conf.