Software Carpentry
Fall 2005 Projects

Available

Jeremy Hoisak

I am developing a tool to help physicians evaluate PET and CT images and plan radiation treatment. Currently I am grabbing images and other data from a commercial treatment planning system (or TPS), importing it into MATLAB and applying a series of functions to visualize and manipulate the data. I then have to get this information back to the TPS.

What I would like to do is skip MATLAB entirely and develop a plug-in for the TPS. I have access to the planning system's development environment for the creation of plug-ins, and have been provided with templates for some sample GUIs to help us get up to speed quickly. We would have to provide the C++ code (based on my MATLAB functions) to tie it together. The TPS is running on a Solaris workstation.

The essentials of this project would be to translate my MATLAB functions into C++ and develop a GUI using the planning system's development utilities. I would benefit from having a CS student to hold my hand while diving into Solaris and C++, and the CS student would benefit from working within an industry-standard software development environment. And the tool would be used to support clinical research!

Lei Jiang

Dept. of Computer Science

Text categorization is the task of assigning a piece of text (e.g. a news article, technical report, research paper, or chapter from a book) to one or more predefined categories. Automatic text categorization normally involves a huge amount of effort to train the categorizers before they become useful. If a category can be well described by a set of keywords, then the problem of automatic document categorization reduces to the problem of keyword-based search.

In this project, we aim to automatically classify research-related text into intentional categories, which are categories that describe the research goals of a group of scientists. These categories are normally either too abstract to be unambiguously described as a set of keywords (or key phrases) or, if they are described that way, a search engine returns too many "relevant" documents. We advocate a hybrid approach that combines conceptual modeling and text analysis.

We plan to carry out a case study in a specific research area, ideally some area from the life sciences such as biology, genomics, medicine, biomedicine, or biochemistry. Elements of the case study:

  1. build a light-weight conceptual model for a scientist
  2. use text analysis techniques to extract structured knowledge from papers
  3. develop an algorithm to match the conceptual model with extracted knowledge and assign a relevance score accordingly

Sadath Malik

Dept. of Mechanical and Industrial Engineering

I am doing my Masters in Mechanical Engineering and, as part of my thesis, I am working with ADAMS, which is used for dynamic and kinematic simulations of mechanical systems. Although ADAMS is built using C++ and Fortran, Python is also used for certain things.

The project that I would like to do is as follows. Right now ADAMS is installed only on my lab PC (only 1 license), and the simulations related to my project generally take long hours to complete. If I want to make changes and run the simulations again, I always have to be at my lab PC and wait until the current simulation is over to start the next one. If it were possible to invoke ADAMS remotely through the Web from another PC (e.g. my home PC), it would be really helpful and I would not always have to be at my lab PC. I have found from the ADAMS discussion forum that this has been attempted by some developers. It may roughly involve the following:

  1. Install the Apache web server and Python on the machine that runs ADAMS.
  2. Write scripts to search for all ADAMS files in a directory, modify them, then simulate the new model and present the results in XML format on a web page.

I am not sure exactly how much work this involves, but if it is not feasible then we can alternatively work on things like improving the speed of the simulation, or reformatting the results using XML.

Luke McKinney

In my laser laboratory, we use a large number of camera/iris combinations placed along the beam to align the laser. We align the beam with the irises to make sure it passes along the optimum path. This means comparing the position of the beam with and without the iris closed as we adjust the parameters. We also use the cameras to gather information during experiments.

The software we use is SecuritySpy, a security program that is good at administering a large number of cameras but not as good for some of the things we would like to do. For example, our laser fires about once a second, and the cameras are in real time so we have to judge from a blinking image. Also, using ImageJ we can analyse the images (generate profiles, colour the black and white images according to intensity) in ways that are extremely useful but too time consuming to do while work is in progress.

I would like to develop a utility that can capture the most recent image (the SecuritySpy program can save image files) and hold it on screen until it is updated, with the option of applying certain filters/operations to it. Also the ability to mark locations/dimensions on the screen (movable objects, or some sort of whiteboard like write/erase layer) would be extremely useful.
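
As a starting point, the "capture the most recent image" step might look like the sketch below. It assumes that SecuritySpy saves its image files into a single folder and that file modification time identifies the newest frame; the suffix list and the function names are illustrative assumptions, not part of SecuritySpy or ImageJ.

```python
from pathlib import Path
from typing import Optional

# Assumed image formats; adjust to whatever the capture program actually saves.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def newest_image(folder: Path) -> Optional[Path]:
    """Return the most recently modified image file in folder, or None."""
    candidates = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def poll_once(folder: Path, last_shown: Optional[Path]) -> Optional[Path]:
    """Return the image to display: the newest one, or the previous one
    if the folder currently holds no images (so the screen never blanks)."""
    latest = newest_image(folder)
    return latest if latest is not None else last_shown
```

A display loop would call `poll_once` roughly once per laser shot and redraw only when the returned path changes, which holds the last frame on screen between shots.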

This utility has the potential for very wide application, not just in my own lab but in others --- fame, and the blessings of laser physicists who save hours of time could be yours!

Mehdi Raessi

Dept. of Mechanical and Industrial Engineering

For my PhD study, I am developing a multi-phase flow model. The code is written in C and is about 7000 lines long. Most of the schemes used in the code are explicit, except in the pressure equation, where a system of equations is solved iteratively using a conjugate gradient solver (cgitj).

I'd like to:

  1. increase the speed of the program (parallelization is not an option!), and
  2. get rid of round-off errors (I don't know whether this is possible), especially during calculation of the mesh sizes at the beginning of the program. Currently, if the mesh size in each numerical cell is supposed to be 0.1, I'll get 0.1, or 0.09999999999999993 or 0.1000000000000002 for different cells. This obviously affects the solution. For instance, while solving a hyperbolic equation in my code, these tiny errors, which are on the order of 1.e-16, accumulate and the total error exceeds machine zero.
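
One common source of the mesh-size discrepancy described above is building node positions by repeated addition. The sketch below (in Python rather than C, purely for illustration) contrasts that with computing each node directly from its integer index. The function names are hypothetical, and whether the actual code accumulates positions this way is an assumption.

```python
def nodes_by_accumulation(x0: float, length: float, n: int) -> list:
    """Build n+1 node positions by repeatedly adding dx.
    Each addition rounds, so the error grows along the mesh."""
    dx = length / n
    nodes = [x0]
    for _ in range(n):
        nodes.append(nodes[-1] + dx)
    return nodes


def nodes_by_index(x0: float, length: float, n: int) -> list:
    """Build n+1 node positions directly from the integer index.
    Each node is computed independently, so errors do not accumulate."""
    return [x0 + (length * i) / n for i in range(n + 1)]


if __name__ == "__main__":
    acc = nodes_by_accumulation(0.0, 1.0, 10)
    idx = nodes_by_index(0.0, 1.0, 10)
    print(acc[-1])  # 0.9999999999999999
    print(idx[-1])  # 1.0
```

This does not remove round-off altogether: 0.1 has no exact binary representation, so cell widths obtained by subtracting neighbouring nodes can still differ in the last digit. If every cell is meant to have the same size, storing one `dx` value and using it everywhere avoids those per-cell differences.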

Muhammad Saeed

Dept. of Mechanical and Industrial Engineering

In real-world scenarios, most healthcare systems don't talk to each other. For example, if one system needs patient information, or needs to find out the last medications ordered for a patient, it has to get this information from some other clinical application or HIS (healthcare information system). Several years ago an organization called Health Level Seven was formed with the objective of making it simple for these applications to talk to each other. This organization came up with a standard called HL7.

Our project will be to develop a library that parses the various segments of HL7 messages and stores them in a dictionary. Others can later build on this library to create an HL7 conformance tool that tests whether a message conforms to the standard.

A sample HL7 message looks like this:

MSH|^~\&|TEST|128|TESTING|128|200010311207||ADT^A01|200010311207240571|P|2.3
EVN|A01|200010310955
PID|||268794296|418190|TEST2^JACQUALINE^J^^||19650729|F|^^^^|U|2000 2000 ST^^OGDEN^UT^84444^||8015554516|||M|MTH|9867250|528111111|||||||||||N
PV1||A|N3^N3SS|3|||4777^TEST^JAMES^CHRISTIAN|208^SNOOPY^RICHARD^G||||||1|0|N|4777^TEST^JAMES^CHRISTIAN|I|9867250|REGENCE BLUE CROSS-FEDERAL||||||||||||||||||||||||200010310955||||||||208^SNOOPY^RICHARD^G
DG1||||TUBE CHANGE||||||||||||||
GT1||268794296|TEST2^JACQUALINE^J^^||2000 2000 ST^^OGDEN^UT^84401^|8016214516||19650729|F||1|528175251||||RET/HAFB/1999|^^^^^|||5|||||||||9994|M|||||||||||MTH||||||||||||||U
FT1||||20020515||CHARGE|1537653|LOOP RESECTOSCOPE BROWN 6/BX||1|33.33|33.33|SU|||||S|88.88|27825|27825|||4496|2100035
IN1|1|10401|104|REGENCE BLUE CROSS-FEDERAL|PO BOX 30270^^SLC^UT^84130^|||104|HILL AIR FORCE||||||1|TEST2^JACQUALINE^J^^|1|19650729|2000 2000 ST^^OGDEN^UT^84401^|||||||||||||||||R58000257||||||5|F|^^^^^|||||268794296
IN2|1|528111111|RET/HAFB/1999||||||||||||||||||||||||||||||||||||MTH||||M|||||5|||||||||||||||8015554516||||||||U
IN1|2|508401|5084|PHYSICIAN MUTUAL|PO BOX 2018^^OMAHA^NE^681032018^||||HILL AIR FORCE||||||1|TEST2^JACQUALINE^J^^|1|19650729|2000 2000 ST^^OGDEN^UT^84401^|||||||||||||||||0172614851||||||5|F|^^^^^|||||268794296
IN2|2|555111111|RET/HAFB/1999||||||||||||||||||||||||||||||||||||MTH||||M|||||5|||||||||||||||8015554516||||||||U
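
A minimal sketch of the parsing step, assuming one segment per line and handling only the field and component separators announced in the MSH segment (repetitions, escapes and sub-components are ignored). The function name `parse_hl7` and the shape of the returned dictionary are illustrative choices, not an existing library's API. The sample uses the first two segments of the message above.

```python
def parse_hl7(message: str) -> dict:
    """Parse an HL7 v2 message into {segment_id: [segment, ...]}.

    Each segment is a list of fields; a field containing the component
    separator becomes a list of components. A segment ID can repeat
    (e.g. IN1), so every ID maps to a list of segments.
    """
    lines = [ln.strip() for ln in message.replace("\r", "\n").split("\n")]
    lines = [ln for ln in lines if ln]
    if not lines or not lines[0].startswith("MSH"):
        raise ValueError("message must start with an MSH segment")

    # The MSH segment announces its own delimiters right after "MSH".
    field_sep = lines[0][3]
    component_sep = lines[0][4]

    result = {}
    for line in lines:
        fields = line.split(field_sep)
        segment_id, values = fields[0], fields[1:]
        parsed = []
        for position, value in enumerate(values):
            if segment_id == "MSH" and position == 0:
                parsed.append(value)  # encoding characters: keep raw
            elif component_sep in value:
                parsed.append(value.split(component_sep))
            else:
                parsed.append(value)
        result.setdefault(segment_id, []).append(parsed)
    return result


# First two segments of the sample message, joined with carriage returns.
SAMPLE = "\r".join([
    r"MSH|^~\&|TEST|128|TESTING|128|200010311207||ADT^A01|200010311207240571|P|2.3",
    "EVN|A01|200010310955",
])

if __name__ == "__main__":
    parsed = parse_hl7(SAMPLE)
    print(parsed["MSH"][0][7])  # ['ADT', 'A01']
    print(parsed["EVN"][0])     # ['A01', '200010310955']
```

Because the field separator itself counts as the first MSH field, list positions in the MSH segment are offset from HL7 field numbers; a conformance tool built on this would probably want a helper that maps standard field numbers to list positions.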

If you need further information on the HL7 standard, the starting point is www.hl7.org or www.cihi.ca, but do send me an email; I have a lot to share.

Yi Zhao

Dept. of Physics

The aim of this project is to optimize parameters for a quantum cryptography experiment. (Yes, really...) Basically, the program has to find the maximum value of a function R(m, n1, n2, Nm, Nn1, Nn2). These variables are independent, and we have to find the optimal combination. If we use exhaustive search, then even if each variable is resolved only to 1% (100 values per variable, six variables), we have to loop 1E12 times. Besides, if we intend to perform data simulation, we might need around 1E3 points, which means the above optimization process has to be performed 1E3 times. That's why we need to make the program efficient.

Previously we have used a simplified model. The previous optimization program (in MATLAB) is about 200 lines. The data simulation program, which uses the optimization program as a function, is about 100 lines.
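
To illustrate why a smarter search can help, the sketch below uses a simple compass (coordinate) search with a shrinking step in place of the exhaustive grid. The real function R is not given here, so `toy_rate` is a made-up stand-in with a known maximum, and all variables are assumed to be scaled to the range 0 to 1. Compass search only finds a local maximum, so whether it suits the real R depends on how that function behaves.

```python
from typing import Callable, List, Sequence, Tuple


def compass_search(
    f: Callable[[Sequence[float]], float],
    start: Sequence[float],
    lower: float = 0.0,
    upper: float = 1.0,
    step: float = 0.25,
    tol: float = 1e-3,
    max_iter: int = 10_000,
) -> Tuple[List[float], float, int]:
    """Maximize f by trying +/- step along each axis in turn.
    When no move improves f, halve the step; stop once step <= tol."""
    x = list(start)
    best = f(x)
    evaluations = 1
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] = min(upper, max(lower, x[i] + delta))
                value = f(trial)
                evaluations += 1
                if value > best:
                    x, best, improved = trial, value, True
        if not improved:
            step /= 2
    return x, best, evaluations


def toy_rate(x: Sequence[float]) -> float:
    """Stand-in objective with a single known maximum; NOT the real R."""
    target = (0.2, 0.7, 0.4, 0.9, 0.1, 0.6)
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


if __name__ == "__main__":
    print(100 ** 6 == 10 ** 12)  # True: size of the exhaustive 1% grid
    point, value, count = compass_search(toy_rate, [0.5] * 6)
    print(point, value, count)
```

On this toy function the search needs a few hundred evaluations rather than 1E12, but that saving relies on the function having a single peak; a function with several peaks would need multiple starting points or a different method.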

Kai Zhuang

Institute of Biomaterials and Biomedical Engineering

I use software called Gromacs for computational biology (molecular dynamics, to be specific). Its interface is Unix/Linux command-line based. The software is designed for single/multi-CPU execution with cluster support.

Currently, our lab has set up the system so that all jobs submitted to the cluster are managed by a single master node, i.e. the outside world connects to the master node, which connects to the individual nodes. Job submission is done manually by moving files onto the master node using FTP or SSH, then logging onto the master node and running a script that manages the job. The master node and all other nodes on the cluster have the same software installed (i.e., Gromacs).

We would like to set up a web front end where jobs can be submitted from anywhere, and from which we can monitor the system. It doesn't have to be a flashy web design; we are looking for efficiency and convenience only.

Taken

Anand Agarawala and Alex Kolliopoulos

Dept. of Computer Science

My research investigates human performance as it relates to fluid selection and command invocation of objects in a desktop interface. I would like to run user studies of people doing various combined selection and command invocation tasks and compare results in the dimensions of user preference, performance, error rate and effectiveness.

I have a series of design sketches for potential kinesthetic movements of hands with an input device performing several variations of a selection and command invocation task. I would like someone who can help me with the graphical programming and testing framework. Users will be run through several trials, with all usability statistics logged to a file to be processed later with the SAS statistics package.

Scott Briggs and Christian Lessig

Dept. of Civil Engineering and Dept. of Computer Science

I am currently studying the growth of biofilms on the interior of pipe walls using cellular automata (CA). To take advantage of today's technology I would like to write some CA code that is multi-threaded and/or 64-bit enabled.

Junwei Huang and Yuan Gan

Dept. of Physics and Dept. of Computer Science

I want to build a web-based platform to run some codes that were previously run from the Linux command line. Basically, it's like the GUI in Matlab, but with web pages as the interface, so that people can log in and view their results over the internet. I need a teammate who knows website construction well. Considering that Linux commands can implement very complex operations, the functions that can be carried out through this web-based platform depend on our time schedule.

There are two software packages: one is written in C and should be run on a Linux cluster; the other is for visualizing results. After a half-hour calculation with the first package, we use the second one to visualize the outputs. Graphs can be saved as PostScript files.

I don't have much knowledge about web servers etc., but I am familiar with the usage of the software packages. I am tired of repeatedly typing long commands when only one parameter has changed.

Deanna Langer and Abhishek Ranjan

Institute of Medical Science and Dept. of Computer Science

For my PhD I'm working with an MR radiologist, ultimately performing parametric analysis on sets of different types of MR images acquired from prostate cancer patients. The goal is to be able to differentiate cancerous from normal tissue better than through the use of one or two modalities alone.

I would like to develop an integrated environment in Matlab to pull in and analyse the images (which are in different folder arrangements, i.e. by slice or by volume depending on imaging type), perhaps even creating a GUI for the first image import step. The analysis step should be able to handle region-of-interest determination on the images and pass the regions on to the analysis, as well as pixel-to-pixel calculations. I have a few very simple models that could be implemented as a prototype for analysis plugins (to be expanded on as the project grows).

I'm reasonably comfortable in Matlab, but would benefit from collaborating with someone who has done larger-scale Matlab work or built Matlab GUIs, if that makes sense for this project.

Simona Mindy and Ken Miura

Dept. of Computer Science and Dept. of Physics

I'm currently working on a web-based system that will store various types of biographical media (images, text documents, videos etc.) for a group of users (probably a family). A description and a number of key words will be associated with each item in the database. Users can then search the database and view all the matching results.

I'd like to add some functionality to this system that would allow users to add pictures/videos to the database using an XML-compliant HTML page. That way, if a user has already created a family site, he/she can easily upload a lot of media with minimal effort.

Yarong Mu and Nilesh Bansal

Institute for Aerospace Studies and Dept. of Computer Science

I am working on a particle tracking code. We chose Fortran because of its speed, and also because almost all the similar existing codes that I can refer to or borrow from are written in Fortran. My code takes several files as input (including data blocks, whose format is sometimes quite flexible), simulates the particle's movement according to some physical laws, and then prepares a set of results for each case run. These results include both text and graphs (1D, 2D or 3D).

Obviously, Fortran is good at calculation, but it's quite clumsy at producing graphs and plots. For example, to produce a 2D plot from the raw data, Fortran uses thousands of lines while, say, MATLAB uses fewer than 100 lines. Therefore, I was wondering whether the whole code could be organized in such a way that different computer languages are used to construct different parts, which are then all managed by something; this something is responsible for the interfaces, as well as for calling the parts one by one, without any manual work (e.g., after running Fortran and getting the raw data file, we don't have to start MATLAB and run the plotting part by hand).

I see that it's quite difficult to achieve this with a shell script, but Python seems possibly easier here, and Python is always readily available. So, my current idea is: maybe we can have a small project which aims at using Python and/or shell scripts to organize codes in different languages. These codes communicate with each other by files (binary/text) only. (Of course, if we can actually mix these codes, it will be ideal, but personally I don't think it is realistic.) Hence in the future we can always choose the best language for a certain task.
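
A minimal sketch of such a Python driver, assuming each part is a separate command-line program that reads and writes files in a shared working directory. The stage names, file names and the `run_pipeline` function are hypothetical; the demo stages are small Python one-liners standing in for the Fortran solver and the plotting program, whose actual commands are not given here.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Sequence, Tuple

# One stage = (name, command to run, file the stage is expected to produce).
Stage = Tuple[str, Sequence[str], str]


def run_pipeline(stages: List[Stage], workdir: Path) -> None:
    """Run each stage in order inside workdir, stopping at the first failure.
    Stages communicate only through files in workdir."""
    for name, command, expected_output in stages:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(list(command), cwd=workdir, check=True)
        if not (workdir / expected_output).exists():
            raise RuntimeError(
                f"stage {name!r} did not produce {expected_output}"
            )


def demo_stages() -> List[Stage]:
    """Stand-in stages: 'solver' writes raw data, 'plotter' summarizes it.
    In practice the commands would invoke the compiled Fortran program
    and the plotting tool."""
    solver = "open('raw.txt', 'w').write('1 2 3')"
    plotter = (
        "data = open('raw.txt').read().split(); "
        "open('summary.txt', 'w').write(str(sum(map(int, data))))"
    )
    return [
        ("solver", [sys.executable, "-c", solver], "raw.txt"),
        ("plotter", [sys.executable, "-c", plotter], "summary.txt"),
    ]
```

Calling `run_pipeline(demo_stages(), some_directory)` runs both stand-in stages in order. Checking for the expected output file after each stage catches the case where a program exits normally but writes nothing.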

Daniel Neufeld and Jeremy Hussell

Dept. of Aerospace Engineering, Ryerson University, and Dept. of Computer Science

My research is developing optimization software for unmanned aerial vehicle conceptual design. Much of my work has been coding up a reasonably successful software package to do this in Java. However, I now realize that my code is inefficient.

I'm not proposing a rewrite of this whole thing, but maybe of one important component of it. I wrote a Genetic Algorithm optimization module for this software. It uses Darwin's concept of natural selection to find optimal solutions to multi-variable problems such as aircraft design, as in this case. There are many different types of genetic algorithms, but they all have common features. The one I wrote uses only one of many possible selection, crossover, and ranking schemes and can't easily be generalized to include these other methods. Building a modular GA with object-oriented programming would make a lot of sense. A genetic algorithm package that can inherit classes defining a number of available techniques and that can be generalized for use in any optimization problem would be of great use to me.
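
One possible shape for such a modular GA is sketched below, in Python rather than Java and with the selection, crossover and mutation schemes passed in as interchangeable functions instead of inherited classes. All names are illustrative, the operators shown are common textbook variants rather than the ones in the existing module, and a single elite individual is carried over each generation so that the best fitness never decreases.

```python
import random
from typing import Callable, List, Sequence

Genome = List[float]


def tournament_selection(population, fitnesses, rng, k=3):
    """Pick k random individuals and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]


def roulette_selection(population, fitnesses, rng):
    """Pick one individual with probability proportional to shifted fitness."""
    lowest = min(fitnesses)
    weights = [f - lowest + 1e-9 for f in fitnesses]
    return rng.choices(population, weights=weights, k=1)[0]


def one_point_crossover(a: Genome, b: Genome, rng) -> Genome:
    """Child takes the head of a and the tail of b."""
    if len(a) < 2:
        return list(a)
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def uniform_crossover(a: Genome, b: Genome, rng) -> Genome:
    """Child takes each gene from a or b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def gaussian_mutation(genome: Genome, rng, rate=0.1, sigma=0.1) -> Genome:
    """Perturb each gene with probability rate."""
    return [x + rng.gauss(0.0, sigma) if rng.random() < rate else x
            for x in genome]


class GeneticAlgorithm:
    """GA whose schemes are plug-in functions chosen at construction time."""

    def __init__(self, fitness: Callable[[Genome], float],
                 select=tournament_selection,
                 crossover=one_point_crossover,
                 mutate=gaussian_mutation,
                 seed: int = 0):
        self.fitness = fitness
        self.select = select
        self.crossover = crossover
        self.mutate = mutate
        self.rng = random.Random(seed)

    def run(self, population: Sequence[Genome], generations: int) -> Genome:
        pop = [list(g) for g in population]
        for _ in range(generations):
            fits = [self.fitness(g) for g in pop]
            elite = pop[max(range(len(pop)), key=lambda i: fits[i])]
            children = [elite]  # elitism: keep the current best unchanged
            while len(children) < len(pop):
                p1 = self.select(pop, fits, self.rng)
                p2 = self.select(pop, fits, self.rng)
                child = self.crossover(p1, p2, self.rng)
                children.append(self.mutate(child, self.rng))
            pop = children
        return max(pop, key=self.fitness)
```

Switching schemes is then a one-argument change, for example `GeneticAlgorithm(fitness, select=roulette_selection, crossover=uniform_crossover)`. The same structure maps onto Java interfaces or abstract classes if inheritance is preferred.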

Mike Rennie with Nilton Bila

Dept. of Zoology and Dept. of Computer Science

One of the things I encounter a lot is searching through pages and pages and pages of statistical output files for the data I need (i.e., regression coefficients, means, etc). Repeated copying and pasting or (god forbid) transcribing the information is time-consuming and error-prone, and it would be great if I could somehow automate this process. What I'd like to do is build a program in Python (probably using regular expressions) that will read a statistical output text file, pull out the numbers I want, and format them as a table in a .csv file or something similar that I can then open in a spreadsheet program.
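
A minimal sketch of that idea follows. The layout of the statistical output is an assumption (one labelled number per line, with labels such as "Intercept", "Slope" and "Mean"), so the regular expression would need adapting to the real files; the function names are illustrative.

```python
import csv
import io
import re
from typing import List, Tuple

# Assumed layout: a known label, an optional ':' or '=', then one number,
# e.g. "Intercept    1.25" or "Mean: 12.5". Other lines are ignored.
PATTERN = re.compile(
    r"^[ \t]*(Intercept|Slope|Mean)[ \t]*[:=]?[ \t]*"
    r"([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)[ \t]*$",
    re.MULTILINE,
)


def extract(text: str) -> List[Tuple[str, float]]:
    """Return (label, value) pairs in the order they appear in text."""
    return [(label, float(value)) for label, value in PATTERN.findall(text)]


def to_csv(rows: List[Tuple[str, float]]) -> str:
    """Format the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["statistic", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example of what an output file might contain.
SAMPLE_OUTPUT = """Regression of length on mass
Intercept    1.25
Slope       -0.50
N = 40
Mean: 12.5
"""

if __name__ == "__main__":
    print(to_csv(extract(SAMPLE_OUTPUT)))
```

Writing the result of `to_csv` to a file with a `.csv` extension gives something a spreadsheet program can open directly.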

Kiana Toufighi and Amit Chandel

Dept. of Botany and Dept. of Computer Science

I am conducting my thesis in the field of functional genomics. We often mine publicly available data to formulate hypotheses or to double-check the validity of biological results. One of the most useful datasets we use is the gene expression data set. A gene expression data set is basically an n x m matrix composed of n genes across m samples. The gene IDs are unique and represented as rows. Each row represents a unique gene along with expression values (how much of the gene is found in the cell) across all samples. So for m samples each gene has m expression values that can range from 0 to 10,000. Analysing such a data set can reveal a great deal about the mechanisms of the cells and the organisms. All my proposed projects involve mining and studying the expression data set.

UNIQID samp1_a samp1_b samp2_a samp2_b ......
gene1
gene2
gene3

Option A: I have written a program in Perl that takes an expression data set and, for all the genes in the Arabidopsis genome (about 28,000 genes), calculates the average expression across sample replicates. In the biological sciences, samples are always treated and collected in replicates (at least 2 or 3) to show the statistical validity of results. The problem with this script is that it's very inefficient, and for large data sets we run out of memory. I want to use Python and make it more efficient. Currently the program uses large dictionaries to store genes and their corresponding expression values. I want to use an object-oriented design to make the code more readable and more modular.
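
One way to avoid holding the whole data set in memory is to process the file one row at a time. The sketch below assumes a whitespace-delimited file whose header names replicates as `sample_replicate` (for example `samp1_a`, `samp1_b`), following the layout shown above; the function names are illustrative.

```python
from typing import Dict, Iterable, Iterator, List


def replicate_groups(header: List[str]) -> Dict[str, List[int]]:
    """Map each sample name to the column indexes of its replicates.
    'samp1_a' and 'samp1_b' both belong to sample 'samp1'."""
    groups: Dict[str, List[int]] = {}
    for index, name in enumerate(header[1:], start=1):
        sample = name.rsplit("_", 1)[0]
        groups.setdefault(sample, []).append(index)
    return groups


def average_replicates(lines: Iterable[str]) -> Iterator[list]:
    """Yield a header row, then one averaged row per gene.
    Only the current line is held in memory at any time."""
    lines = iter(lines)
    header = next(lines).split()
    groups = replicate_groups(header)
    yield [header[0]] + list(groups)
    for line in lines:
        cells = line.split()
        if not cells:
            continue  # skip blank lines
        row = [cells[0]]
        for columns in groups.values():
            values = [float(cells[i]) for i in columns]
            row.append(sum(values) / len(values))
        yield row


if __name__ == "__main__":
    demo = [
        "UNIQID samp1_a samp1_b samp2_a samp2_b",
        "gene1 10 20 5 7",
        "gene2 0 4 100 300",
    ]
    for row in average_replicates(demo):
        print(row)
```

Passing an open file object to `average_replicates` and writing each yielded row straight to the output keeps memory use roughly constant regardless of the number of genes.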

Option B: I have written a web-based tool in Perl CGI which can be found here.

You may test it by copying and pasting this list into the text box and clicking submit:

At1g01010
At1g01020
At1g01030
At1g01040
At1g01100
At1g01110
At1g01120
At1g01130
At1g01140

This program takes a set of genes from Arabidopsis thaliana and displays expression values. There are numerous options, such as calculating averages across replicates (in biology, experiments are always done in replicates). The user can also query different expression data sets released by various labs around the world. This is a very useful tool and we've had very good feedback, but the code is in ugly shape and impossible to maintain. I would love to rewrite this in object-oriented Python and make it more modular, maintainable and testable.

Option C: Applying a statistical model to an expression data set to extract significant samples. The idea is to use an expression data set containing 1878 samples and query it with a gene family composed of a number of genes (from 2 to 200). The aim is to generate a list of tissues and/or treatments (i.e. samples) where a pair or a small cluster of genes are most differentially or most highly expressed. This basically involves looking at the expression values of each gene across all samples as well as within a given sample to find out if the magnitude of expression is very high or very low. We need a machine learning or statistical algorithm to conduct this task. I think the challenge is the math behind it and not the programming.
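
As a very simple baseline for "very high or very low", one could standardize a gene's expression across samples and flag the samples whose z-score exceeds a threshold. This is only a stand-in for the statistical model the project calls for; the function name, the threshold of 2.0 and the sample names are assumptions.

```python
import statistics
from typing import Dict


def extreme_samples(values_by_sample: Dict[str, float],
                    threshold: float = 2.0) -> Dict[str, float]:
    """Return {sample: z-score} for samples where one gene's expression
    lies at least `threshold` standard deviations from its mean."""
    values = list(values_by_sample.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return {}  # the gene is expressed identically in every sample
    flagged = {}
    for sample, value in values_by_sample.items():
        z = (value - mean) / spread
        if abs(z) >= threshold:
            flagged[sample] = z
    return flagged


if __name__ == "__main__":
    # Made-up values: nine quiet samples and one strongly expressed one.
    gene = {f"s{i}": 100.0 for i in range(1, 10)}
    gene["s10"] = 10000.0
    print(extreme_samples(gene))
```

Z-scores are sensitive to the outliers they are meant to detect, so a more robust measure, or a model that treats the genes of a family jointly, is likely to be needed for real data.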