HUBzero Platform

Michael McLennan, Purdue ITaP Chief Hub Technology Architect

HUBzero is a scientific collaboration platform.  You can go to http://nanohub.org/ to see what it’s all about.  It has a lot of scientific apps.

Hubbub in Indy, April 5-6, 2011.  More info at hubzero.org

Purdue hosts some hubs for others.  As the hubs get more tools and users on them, the hubs have to move off of one machine.

TeraGrid is a free HPC resource for US researchers.

https://www.teragrid.org/web/eot/campus_champions is a place to go for information. Kim Dillman helps campus champions; she’s a champion of champions.

www.teragrid.org/web/user-support/resources

Cristina Beldica, Blue Waters

Blue Waters is part of NSF’s strategy for high end computing. Track 1 is for sustained performance approaching 10^15 operations per second, or large data sets.

Those on the Blue Waters project talked to industry and researchers to see what they were looking for in a petascale system.  As of this time, these systems are not on the floor yet, but Blue Waters will be online some time in ‘11.

Campus grid, Preston Smith

A campus grid doesn’t have to be on a college campus; it can be at a lab or corporation. Purdue uses Condor to schedule jobs on machines when they are idle, such as the ones found in labs.

Funding

Bioinformatics at NIH, Cheryl Kraft

The National Cancer Institute gets a lot of NIH funding ($5.2B). NIH will try to double their cancer research over the next few years.  Funding for autism research is also increasing.

Bioinformatics priorities (this is an unofficial list based on Cheryl’s seven years there):

  • privacy, data sharing
  • standard phenotypes
  • novel tool development (analysis, visualization, prediction)
  • reuse and secondary analysis of data to produce novel findings (i.e., producing and sharing data that others can use)
  • multi-scale modeling of disease process and human immune system

What is the committee looking for?

  • innovation
  • team science, cross disciplinary teams
  • broad applicability
  • well written proposal
  • clearly described concepts
  • specificity

Susanne Hambrusch, NSF Funding Opportunities

To figure out what the solicitations are really asking for, ask one of the people on the peer review committee.  Not all committee members will be equally able to help you.  Check the most current NSF budget request to Congress.  These budgets are not fulfilled exactly, but they give an idea of what NSF will have money for.

There are some programs NSF is funding:

  • Cyber-Enabled discovery and innovation (CDI), Cyber-Physical systems (CPS), CI Team (training, education, advancement and mentoring) will be open in ‘11.

Other ‘11 funding:

  • Cyberlearning
  • Science, engineering, and education for sustainability
  • Science and engineering beyond Moore’s Law

Christine King, Research Development Services

Her office finds funding opportunities and helps build research communities. They will review your proposal and organize any site visits.

Amanda Hamamaker, Pre Award Services

Her office helps put together proposals.

Questions:

Should proposals go to NSF or NIH?  It is difficult to know where to send things, but in general if a proposal is about disease it should go to NIH.  Computational proposals go to NSF.  Clinical studies go to NIH.  Ask a committee member if you are unsure.

Ralph Johnson in forestry uses many CI resources from Purdue.  His primary distribution method is Blackboard.  He embeds PowerPoint slides and videos into Blackboard, which can also give quizzes.  He also uses Adobe Connect, which has a better method for office hours than Blackboard.

Mixable encouraged students to interact with each other.  Since Blackboard’s discussion tools are not as good, he uses Adobe Connect.

Ralph doesn’t have the time to do the research for good CI teaching tools, so he uses ITaP recommendations.  ITaP recommends software and hardware (such as cameras and headsets) for given needs.  Even with ITaP’s help there is still some amount of troubleshooting that has to be done during the first week of class.

Signals

Signals is generated from data mining and includes grades and information stored in Blackboard.  Signals is a tool to show students how they are doing in the class.  Instructors can customize the criteria for a green, yellow, or red light.  Quiz, homework, and test scores can be added to Signals’ criteria.  Signals will then send out email, which can also be customized, with grades to the students.  The idea of Signals is to provide feedback to students before it is too late to drop or pass the class.
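To make the green/yellow/red idea concrete, here is a minimal sketch of how instructor-defined criteria could map scores to a light.  This is not the actual Signals algorithm; the weights, the 80/65 cutoffs, and the names `weighted_score` and `signal_light` are all made up for illustration.

```python
from typing import Dict

# Hypothetical instructor settings: how much each score category counts,
# and the cutoffs for each light. None of these values come from Signals.
WEIGHTS: Dict[str, float] = {"quizzes": 0.2, "homework": 0.3, "tests": 0.5}
GREEN_MIN = 80.0
YELLOW_MIN = 65.0


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average (0-100) over the categories present in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("no weighted categories to score")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def signal_light(score: float, green_min: float = GREEN_MIN,
                 yellow_min: float = YELLOW_MIN) -> str:
    """Map a 0-100 score to 'green', 'yellow', or 'red'."""
    if score >= green_min:
        return "green"
    if score >= yellow_min:
        return "yellow"
    return "red"


if __name__ == "__main__":
    student = {"quizzes": 70.0, "homework": 90.0, "tests": 60.0}
    score = weighted_score(student, WEIGHTS)
    print(f"{score:.1f} -> {signal_light(score)}")
```

Because the cutoffs are parameters, an instructor-specific policy is just a different pair of numbers passed to `signal_light`.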

iclicker, Mary Sadowski

The iclicker is a remote with six buttons: A-E and on/off. This device works really well for in class multiple choice quizzes.

The instructor opens the “vote” (there is a light on the iclicker that indicates when a button is clicked), and only then do the clicks count.  The last click is the one that counts.  The “vote” light turns green when a click has been recorded.  When a question is not open for voting and a button is clicked, the light turns red.
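The voting rules above can be sketched in a few lines.  This is only a toy model of the behavior described in the talk, not iclicker's software; the `VoteSession` class and its method names are hypothetical.

```python
from typing import Dict

VALID_BUTTONS = {"A", "B", "C", "D", "E"}


class VoteSession:
    """Toy model of one clicker question (names are hypothetical)."""

    def __init__(self) -> None:
        self.is_open = False
        self.answers: Dict[str, str] = {}  # clicker id -> last recorded button

    def open(self) -> None:
        self.is_open = True

    def close(self) -> None:
        self.is_open = False

    def click(self, clicker_id: str, button: str) -> str:
        """Return the light color the student would see after a click."""
        if button not in VALID_BUTTONS:
            raise ValueError(f"unknown button: {button!r}")
        if not self.is_open:
            return "red"  # question is not open, click is ignored
        self.answers[clicker_id] = button  # a later click overwrites an earlier one
        return "green"


if __name__ == "__main__":
    session = VoteSession()
    print(session.click("clicker-1", "A"))  # vote not open yet
    session.open()
    session.click("clicker-1", "B")
    session.click("clicker-1", "D")  # student changes their mind
    session.close()
    print(session.answers)
```

Storing answers in a dictionary keyed by clicker id is what makes "the last click counts" fall out naturally.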

Students have to register their clicker for each class they are in. Students can share a clicker as long as they do not have class at the same time.

NSF is still hammering out data retention policies.  How long should data be kept?  There is currently no perfect solution.  It will be an iterative process between PIs, librarians, and everyone else involved in the process.

Librarians want the data to be in the public’s eye so that eventually data can become a commodity.  One of the reasons for archiving data is that right now it is possible to generate more data than can easily be looked at over the course of one project.  The same data can be reused for multiple projects. Pure numerical data cannot be copyrighted in the USA.  In some countries someone can own data.  This is one of the issues to overcome when dealing with an international market.

Data management

James L. Mullins, Dean of Libraries, talked about how and why librarians are involved in data storage.  Library science defines appropriate structure (relationship within knowledge continuum) and creates retrieval points (cataloging/metadata).

Librarians love cataloguing data! They’ve been doing it for centuries with print.  The Purdue libraries have been working on data curation for years, even before the NSF announced their data archiving requirement.

Sorin Matei, Communication

They are looking at a dataset of Wikipedia edits from 2001-2008: 17 million articles and 280 million edits.  They are studying information dissemination, and are using some of Purdue’s clusters for number crunching.

Phillip San Miguel, Purdue Genomics Core Facility

Over 100 labs use the data generated from the Genomics Core Facility.  Over time, sequencing DNA has gotten faster and cheaper. At one time sequencing ran at 1 million sequences/day @ $1200/million sequences.  Now it is up to 6 billion sequences/day @ $0.50/million.  So much data is generated so cheaply that it is difficult to analyze all the data now created over the course of an experiment.
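A quick back-of-the-envelope check of those numbers, using only the figures quoted in the talk (the variable names are mine):

```python
MILLION = 1_000_000

# Figures quoted in the talk.
old_throughput = 1 * MILLION        # sequences per day, then
new_throughput = 6_000 * MILLION    # sequences per day, now (6 billion)
old_cost_per_million = 1200.00      # dollars per million sequences, then
new_cost_per_million = 0.50         # dollars per million sequences, now

throughput_gain = new_throughput / old_throughput
cost_drop = old_cost_per_million / new_cost_per_million

# Cost of running a machine flat out for one day.
old_daily_spend = old_throughput / MILLION * old_cost_per_million
new_daily_spend = new_throughput / MILLION * new_cost_per_million

print(f"throughput: {throughput_gain:,.0f}x more sequences per day")
print(f"unit cost:  {cost_drop:,.0f}x cheaper per million sequences")
print(f"daily cost at full throughput: ${old_daily_spend:,.0f} then, "
      f"${new_daily_spend:,.0f} now")
```

In other words, a full day of sequencing now costs about 2.5 times what it used to but yields 6,000 times the data, which is why analysis has become the bottleneck.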

As long as Moore’s Law holds, scientists will always be able to afford the computing power.  Around ‘05, Moore’s Law failed.  At some point, computing power will consume a large percentage of grant money and there won’t be enough money left for reagents.

Jin Xia, Statistics

To analyze data, people have to spend a lot of time programming instead of thinking about how to analyze their data.  With this in mind, the statistics students decided to write a program.  The goal: handle data storage, numerical computation, and visualization.  It must be easy to program.

The R programming language comes from S, which won the 1998 ACM Software System Award.  R is commonly used by researchers.  It is very easy to program and is highly extensible through user-submitted packages.  There were 2672 additional packages as of Dec 2010.  It is maintained by the highly active R Development Core Team.  One disadvantage is that it can’t handle very large data sets.  This is why industry uses SAS instead of R.

Hadoop stores large files by breaking them into small blocks of a certain size.  It creates multiple copies of data across different machines and is efficient and reliable.  It was inspired by Google’s file system.  It is used by Yahoo, Amazon, and IBM.
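The block-and-replica idea can be illustrated with a small sketch.  This is not Hadoop's actual code or its real placement policy: the round-robin placement, the tiny block size, and the function names are assumptions chosen to keep the example short.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut `data` into consecutive blocks of at most `block_size` bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    Real systems use smarter (e.g. rack-aware) policies; this is a stand-in.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    data = b"x" * 25                      # pretend this is a large file
    blocks = split_into_blocks(data, 10)  # tiny block size for illustration
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
    for block_id, holders in placement.items():
        print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several machines, losing one machine loses no data, and different machines can work on different blocks at the same time.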

This data is stored in (key, value) format.  The Map step performs user-defined operations on all data in parallel.
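Here is a small single-machine simulation of that (key, value) flow, counting words.  It does not use Hadoop or RHIPE; the function names are hypothetical, and the grouping and Reduce steps are the standard MapReduce companions to Map rather than something covered in the talk.  On a real cluster the Map calls would run in parallel on different machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """User-defined Map: emit (word, 1) for every word in one input record."""
    for word in value.split():
        yield word.lower(), 1


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """User-defined Reduce: combine all values emitted for one key."""
    return key, sum(values)


def run_job(records: Iterable[Tuple[int, str]]) -> Dict[str, int]:
    # Map: applied independently to each (key, value) record.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_words(key, value):
            grouped[out_key].append(out_value)  # group intermediate pairs by key
    # Reduce: one call per distinct intermediate key.
    return dict(reduce_counts(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    lines = [(1, "big data big clusters"), (2, "Big science")]
    print(run_job(lines))
```

The user only writes `map_words` and `reduce_counts`; everything in `run_job` is the part a framework would take care of.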

RHIPE is an integration of R and Hadoop.  It was developed by Purdue grad Saptarshi Guha in 2010. RHIPE provides data analysis completely in R.  Purdue has a RHIPE cluster which is being evaluated for serving the whole Purdue campus.

CSPAN Archiving

There is a group that is archiving CSPAN broadcasts. CSPAN only keeps its footage for so long before recycling the tape.  There are 3 networks, 7 days a week, 24 hrs/day for 23 years.  This project archives all of it, or tries to.  Some of the old content is lost forever.

They wrote their own software to manage video tape machines.  Now they record in MPEG-4.  They also record closed caption text.  Historic video servers recorded 96,360 hours/year.  They keep a high res format for archive and copies, and low res formats (RealVideo, Windows video) for the desktop.  At the time (’94) the technology didn’t exist to record/encode fast.  They are in the process of digitizing old VHS tapes, but it is a long process.

Cyber Center

Gabriela C. Weaver, chemistry professor.  She is working on CASPiE.  Part of CASPiE is getting undergrads into research.  How do you engage students in research that results in meaningful data, and is interesting to students?  Typical undergrad lab equipment is not good enough to get data for publishing.  Purdue buys one of each such instrument and equips them with an autosampler and remote access so other institutions can use them. Purdue currently has: a Raman spectrometer, an HPLC with diode array detector, a gas chromatograph, and a gas chromatograph/mass spectrometer.

The data go from the instrument to a Citrix server, then over the internet, and are eventually stored.

Only one person can be logged in to the instrument at a time, so as soon as data are collected, the data are dumped and the student is logged off the instrument.

There are two ways to use these instruments: batch mode and individual mode.  In batch mode, the instructor runs the data for the students.  In individual mode, students run their own data.

For training purposes, all options can be turned off.  This will protect the instrument and previously collected data.

David Salt, Horticulture and landscape architecture

Ionomics measures micronutrients required for life in a high throughput system.  Ionomics can also be used to detect toxins.

A recent project was a genome-wide analysis of ionomic gene function in yeast using 12k different strains.  It took about a year to gather the data.  There are many ways to mine this data. Without CI, using the data is impossible.

The evolution of this took about 10 years.  It started on Excel, then went to Access.  Then it went online with the E enterprise center, and has gotten more sophisticated since then.  Currently, the data and the metadata are stored in a relational database.

www.ionomicsHUB.org is where the yeast (and other) data lives.

NSF would not have been as keen to fund research if there was not CI in place.

Matt Potrawski, Purdue Center for Prediction of Reliability

The Center for Prediction of Reliability is DOE funded, $21.2M for 5 years. There are 35-40 faculty, staff, grad students, and post docs.  MEMS (the items they test) are now in laptops and cellphones.  There are stringent requirements that the chips must survive:

  • billions of cycles
  • dynamic impact conditions
  • high g, up to 30k g over milliseconds
  • -50 C to 800 C

All of these are good for simulations.  Many of the simulations deal with uncertainty in inputs and outputs.

CI at Purdue

  • Some of the HPC nodes at Purdue are purchased by researchers on campus.  There have been 16 repeat purchasers so far.
  • For every faculty member on the node, there are 6-7 total users, including students.  There has been about $1 million in savings because of bargaining power with vendors.
  • HUBzero platform is used for scientific collaboration.
  • A multi-institutional Condor pool uses computers when they are sitting idle on desks.

In the classroom there are many CI tools.  One of the tools is called a clicker.  It looks somewhat like a remote control.  There are many ways to use this, and instructors have to be trained on ways to use it, and what it can be used for.

Signals is a system to identify students at risk of failing.  An email goes out to the student.  The eventual goal is to identify behavior that means a student is at risk of failing and show them where to get help.

There are some tools that work on cell phones as well as computers. Mixable is the newest one.  One thing Mixable does is give students the option of seeing who is enrolled in the class so they can talk.  At the end of the semester, they can leave the class group.

DoubleTake uses video on phones.  Those videos are cumbersome to send through email, so DT provides a repository as well as a grading rubric for the submitted videos.

  • ImpactEarth models what would happen if an asteroid hit the earth.

What to do about CI’s life cycle?  Look for similar problems.  The ideas for products come from staff themselves.  They can’t support everything, or all they will do is software support.  They look for common problems.

A comment from the audience: Not all faculty use the same version of the clicker.  Students have to buy multiple versions of the hardware.  Adding to the confusion, the computer version of the clicker is cheaper than the physical version, but not all faculty know about the computer version.

Donna Cox got an MFA and then started programming in 1985, and was talking to scientists in 1981. Her interest is numerical modeling.  We as a culture are developing a new visual language over time.  She calls them visaphors, digital visual metaphors.  Facebook is teaching us how visualization works, while at the same time being a visualization itself.

There was a tornado simulation that showed a second tornado.  The visualization was useful for debugging the mathematics.  This is an example of how some data can only be understood through the CI process.  We can also share the data sets with others, and others bring a fresh perspective to problems.

She and her team helped make Hubble 3D IMAX.  There was so much data modeling based on Hubble data.  Amazing!!

It is much easier to collect and store data than to analyze it.