You are currently browsing the monthly archive for February, 2007.
Microsoft has introduced certification under the Windows Vista Logo program, in which a product is certified for high reliability, security and compatibility on Vista. Any product carrying the Vista logo can therefore be identified as compatible with Vista and as providing a high-quality experience.
According to the program:
The Certified for Windows Vista program will:
- Help the product owners deliver the best Windows Vista experience by building greater quality into the applications.
- Provide an opportunity to differentiate the products in the market segment.
- Support with resources that reduce the cost of developing high-quality software.
The logo is displayed on the product, which benefits customers by helping them tell the product apart from others.
Microsoft has released a list of 100 products that have already been certified with the logo.
All this is fine, but I am still not satisfied with the third-party plugins Microsoft allows on Vista. I have not used Vista myself, but those who have complain about being unable to use Yahoo Voice Conference, among other issues. I hope Microsoft comes up with solutions for such problems.
This is a call to all IM users throughout the world. The Storm worm, which once spread through e-mail, is now targeting IM clients.
The worm attacks the IM users by detecting when someone is chatting and sending out a message with a link to the first stage of malware on a site. If the user clicks the link, the first stage will execute.
“The botnet handlers will periodically inject new commands into this peer-to-peer network, and one of the first things they do is tell the infected machines to download several executables,” explained Jose Nazario, software and security engineer for Arbor Networks, based in Lexington, Mass.
“Instead of targeting the same infected boxes, the authors are choosing to DDoS each other’s base of operations,” Nazario said. “They have also targeted high-impact DDoS events against anti-spam and anti-criminal efforts, such as Spamhaus. These two malware networks are built specifically for spam, it seems, and so anti-spam efforts go a long way to hurting their spam delivery efforts.”
PC users can protect themselves from malware targeting IM systems in a number of ways. First, Nazario said, they should configure their IM clients to ignore messages from people not on their buddy lists. Secondly, they should practice the same kind of security there that they do for e-mail—treat unsolicited messages with suspicion.
“Sometimes asking someone to resend the link is enough to see that it was a machine sending the message and not a person,” he said. “All of this goes a long way towards defeating the simplest attack of all, the social engineering attack.”
So, IM users beware.
I was just flipping through TV channels looking for a good show when I stopped at Star Plus (a very famous Indian channel), seeing the Microsoft logo and a backdrop saying “Wow – Microsoft Vista”. Two very popular anchors were hosting the show, which was about the launch of Windows Vista and Office 2007.
This time Microsoft found a new way of launching a product, tying up with leading actors and actresses of the Indian film industry. The anchors talk about viruses and mention that Vista has a real-time virus protection feature, and then a girl comes on stage to perform to a very famous song. They go through various features, introducing each one in a different manner. They also conduct an interview with the stars, similar to the very popular Koffee with Karan show, in which the hosts ask a few questions that are unrelated to Vista but carry a message: buy original products. They also added some humour, saying that Vista can do everything except cook.
All in all it was marketing for Vista and Office 2007. They also gave out a prize for the “one in billion contest”: after a survey, a person named Kaushik received a desktop computer and a Vista package from Microsoft India.
It was a nice marketing strategy and an innovative way for Microsoft to launch the product. There was fun and humour, as well as the message and the features being introduced to the common man.
Bio-Computing is gaining a lot of importance these days. I had mentioned Convergence Technology on this blog, and Bio-Computing belongs to it as well. I had written a paper on Biological Databases and thought of sharing it with you all.
Introduction to Bio-Computing:
There is an interesting question to be asked: why are biologists and biochemists becoming increasingly interested in using computational approaches in their daily work?
Biocomputing, as the computational basis for, e.g., genetic diagnostics, has a growing influence on everybody’s life, but most people are not aware of it. It provides the theoretical background and practical tools for scientists to explore proteins and DNA. DNA and proteins are large molecules which consist of a chain of smaller residues called nucleotides or amino acids, respectively. They are nature’s building blocks, but these building blocks are not exactly used as ‘bricks’; rather, the function of the final molecule depends strongly on the order of these blocks. So it is possible to think of these residues as being numbered.
The 3D (three dimensional) structure of a protein depends on the individual sequence of these numbered residues. The order of amino acids of a given protein is derived from the corresponding DNA. This piece of DNA consists of an ordered sequence of nucleotides.
Over the last 20 years it has turned out that many proteins of different origin but similar function also have similar amino acid sequences. Thus, there are corresponding DNA sequences which are similar even though the protein under analysis occurs in different species such as mice and humans. So, we look for differences and similarities at the DNA level between a mouse and a human for many similar sequences.
Since the beginning of the 1990s, many laboratories have been analyzing the full genomes of several species such as bacteria, yeasts, mice, and humans. During these collaborative efforts enormous amounts of data are collected and stored in databases, most of which are publicly accessible. Besides gathering all these data, it is necessary to compare these nucleotide or amino acid sequences to find similarities and differences. Since it is not very convenient to compare sequences of several hundred nucleotides or amino acids by hand, several computational techniques were developed to approach this problem. In addition, these are less error-prone than a manual approach. Using computational techniques to analyse biological data is referred to as Biocomputing.
Biological Databases:
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.
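To make the idea of "many records, each with the same set of information" concrete, here is a minimal Python sketch of such a record and a simple query over a handful of them. The field names, the `find_by_organism` helper and the two sample entries are all made up for illustration; they are not taken from any real database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """One entry in a toy nucleotide sequence database (illustrative fields only)."""
    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: List[str] = field(default_factory=list)


def find_by_organism(records, organism):
    """Return the records whose source organism matches, ignoring case."""
    wanted = organism.lower()
    return [r for r in records if r.organism.lower() == wanted]


# Invented sample data, only here to exercise the query.
records = [
    SequenceRecord("A. Researcher", "ATGGCCAAG", "DNA", "Mus musculus"),
    SequenceRecord("B. Researcher", "ATGGCTAAG", "DNA", "Homo sapiens",
                   ["placeholder citation"]),
]

mouse = find_by_organism(records, "mus musculus")
print(len(mouse), mouse[0].sequence)
```

A real system would keep the records on disk and index them, but the shape of the problem is the same: uniform records plus a way to pull out only the ones that answer a given question.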
For researchers to benefit from the data stored in a database, two additional requirements must be met:
- Easy access to the information; and
- A method for extracting only that information needed to answer a specific biological question.

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both “public” repositories of gene data like GenBank or the Protein DataBank (the PDB), and private databases like those used by research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible via open standards like the Web is very important.
A Few Popular Databases:
GenBank:
GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. It has a flat file structure, that is an ASCII text file, readable by both humans and computers. In addition to sequence data, GenBank files contain information like accession numbers and gene names, phylogenetic classification and references to published literature.
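Because the flat file is plain text, a few lines of code are enough to read one. The sketch below parses a heavily abbreviated, invented record that only imitates the general layout (keyword lines, an ORIGIN block with the sequence, and a `//` terminator). The accession `XX000000` and the sequence are placeholders, and real GenBank entries carry many more fields than this toy parser handles.

```python
# An invented, shortened record in the general style of a flat file.
SAMPLE = """\
LOCUS       TOYSEQ                    24 bp    DNA     linear
DEFINITION  Made-up example record, not a real GenBank entry.
ACCESSION   XX000000
ORIGIN
        1 atggccaagc ttgacctgaa
       21 gtaa
//
"""


def parse_flat_record(text):
    """Parse one simplified flat-file record into a dict of fields."""
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if line.startswith("ORIGIN"):      # sequence lines follow
            in_sequence = True
            continue
        if in_sequence:
            # Keep only the letters, dropping position numbers and spaces.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
        else:
            key, _, value = line.partition(" ")
            if key:                        # skip indented continuation lines
                record[key.lower()] = value.strip()
    return record


record = parse_flat_record(SAMPLE)
print(record["accession"], len(record["sequence"]))
```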
EMBL:
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted from researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Database of Japan (DDBJ).
SwissProt:
This is a protein sequence database that provides a high level of integration with other databases and also has a very low level of redundancy (meaning that few identical sequences are present in the database).
The principal requirements on the public data services are:
Data quality – data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
Supporting data – database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
Deep annotation – deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
Timeliness – the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
Integration – each data object in the database should be cross-referenced to representation of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.
The Creation of Sequence Databases:
For researchers to benefit from all this information, however, two additional things were required:
- Ready access to the collected pool of sequence information
- A way to extract from this pool only those sequences of interest to a given researcher.
Simply collecting, by hand, all necessary sequence information of interest to a given project from published journal articles quickly became a formidable task. After collection, the organization and analysis of this data still remained. It could take weeks to months for a researcher to search sequences by hand in order to find related genes or proteins.
Computer technology has provided the obvious solution to this problem. Not only can computers be used to store and organize sequence information into databases, but they can also be used to analyze sequence data rapidly. The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence information being created. Theoretical scientists have derived new and sophisticated algorithms which allow sequences to be readily compared using probability theories. These comparisons become the basis for determining gene function, developing phylogenetic relationships and simulating protein models. The physical linking of a vast array of computers in the 1970’s provided a few biologists with ready access to the expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and expanded so that nearly everyone has access to this information and the tools necessary to analyze it. Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics.
Indexing and searching of databases:
Searching – BLAST:
In bioinformatics, the Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if human beings carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.
Here is an approach to searching genetic DNA sequences using an adaptation of the suffix tree data structure deployed on the general purpose persistent Java platform, PJama. The implementation technique is novel in that it allows us to build suffix trees on disk for arbitrarily large sequences, for instance for the longest human chromosome consisting of 263 million letters. It has been proposed to use such indexes as an alternative to the current practice of serial scanning. The following describes the tree creation algorithm, analyses the performance of the index, and discusses the interplay of the data structure with object store architectures.
Algorithm
To run, BLAST requires two sequences as input: a query sequence (also called the target sequence) and a sequence database. BLAST will find subsequences in the query that are similar to subsequences in the database. In typical usage, the query sequence is much smaller than the database, e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.
BLAST searches for high scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is slightly less accurate than Smith-Waterman but over 50 times faster. The speed and relatively good accuracy of BLAST are the key technical innovation of the BLAST programs and arguably why the tool is the most popular bioinformatics search tool.
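For reference, the exhaustive approach that BLAST approximates can be sketched in a few lines. This is a minimal Smith-Waterman local alignment score with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example, not BLAST's defaults. The nested loops make the cost proportional to the product of the two sequence lengths, which is why it does not scale to a multi-billion-letter database.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    cols = len(b) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("AGTTAC", "ACTTAG"))  # 7, from aligning AGTTA with ACTTA
```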
The BLAST algorithm can be conceptually divided into three stages.
- In the first stage, BLAST searches for exact matches of a small fixed length W between the query and sequences in the database. For example, given the sequences AGTTAC and ACTTAG and a word length W = 3, BLAST would identify the matching substring TTA that is common to both sequences. By default, W = 11 for nucleic seeds.
- In the second stage, BLAST tries to extend the match in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Insertions and deletions are not considered during this stage. For our example, the ungapped alignment between the sequences AGTTAC and ACTTAG centered around the common word TTA would be:
..AGTTAC..
  | |||
..ACTTAG..
If a high-scoring ungapped alignment is found, the database sequence is passed on to the third stage.
- In the third stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
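The first two stages can be sketched on the same toy sequences. The code below is only an illustration of the idea: `find_seeds` looks up exact words of length W through a small hash index, and `extend_ungapped` grows a seed in both directions without gaps. The scoring values and the simple "stop once the score has dropped by `drop`" rule are assumptions of mine, standing in for the scoring matrices and thresholds a real BLAST uses.

```python
def find_seeds(query, subject, w):
    """Stage 1: exact word matches of length w, as (query_pos, subject_pos) pairs."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    seeds = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            seeds.append((i, j))
    return seeds


def _extend(query, subject, qi, si, step, match, mismatch, drop):
    """Walk one way from (qi, si); return (best score gain, positions kept)."""
    score = best = 0
    length = kept = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, kept = score, length
        elif best - score >= drop:   # fallen too far below the best: give up
            break
        qi += step
        si += step
    return best, kept


def extend_ungapped(query, subject, qi, si, w, match=2, mismatch=-1, drop=3):
    """Stage 2: extend a seed in both directions, with no insertions or deletions."""
    right_gain, right_len = _extend(query, subject, qi + w, si + w, 1, match, mismatch, drop)
    left_gain, left_len = _extend(query, subject, qi - 1, si - 1, -1, match, mismatch, drop)
    score = w * match + left_gain + right_gain
    return (score,
            query[qi - left_len:qi + w + right_len],
            subject[si - left_len:si + w + right_len])


seeds = find_seeds("AGTTAC", "ACTTAG", 3)
print(seeds)                                        # [(2, 2)], the word TTA
print(extend_ungapped("AGTTAC", "ACTTAG", 2, 2, 3))  # (7, 'AGTTA', 'ACTTA')
```

With these scores the extension keeps the leading A (a mismatch followed by a match is a net gain) but drops the trailing C/G mismatch, since including it would only lower the score.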
An extremely fast but considerably less sensitive alternative to BLAST that compares nucleotide sequences to the genome is BLAT (Blast Like Alignment Tool). A version designed for comparing multiple large genomes or chromosomes is BLASTZ. Also there is another well-known software called PatternHunter which produces significantly better sensitivity results than BLAST at the same speed or very similar sensitivity results at a much faster speed.
Parallel BLAST
Parallel versions of BLAST are implemented using MPI and Pthreads and have been ported to various platforms including Windows, Linux, Solaris, OSX, and AIX. Popular approaches to parallelizing BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partitioning).
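Of these, database segmentation is the easiest to picture: split the database into parts, let each worker search its own part, then merge the hits. The sketch below shows only that structure. It uses Python threads and a plain exact-substring search so that it stays self-contained; it is not how any actual parallel BLAST is written, and the three sequences are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def search_segment(query, segment):
    """Return (name, position) for every exact occurrence of query in one segment."""
    hits = []
    for name, seq in segment:
        pos = seq.find(query)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(query, pos + 1)
    return hits


def segmented_search(query, database, workers=2):
    """Partition the database across workers, search each part, merge the results."""
    items = sorted(database.items())
    segments = [items[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda seg: search_segment(query, seg), segments)
    return sorted(hit for part in parts for hit in part)


# Invented toy database.
database = {"seq1": "ACTTAGTTA", "seq2": "GGGCCC", "seq3": "TTATTA"}
print(segmented_search("TTA", database))
```

The merge step is what makes this approach attractive: each segment is searched independently, so the workers never need to talk to each other until the end.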
Indexing:
Introduction
DNA sequences, which hold the code of life for every living organism, can be abstractly viewed as very long strings over a four-letter alphabet of A, C, G and T. Many projects to sequence the genome of some species are well advanced or concluded. The very large number of species (and their genetic variations) that are of interest to man suggests that many new sequences will be revealed as the improved sequencing techniques are deployed. Consequently we are at a technical threshold. Techniques that were capable of exploiting the smaller collections of genetic data, for example via serial search, may require radical revision, or at least complementary techniques. As the geneticists and medical researchers with whom we work seek to search multiple genomes to find model organisms for the gene functions they are studying, we have been investigating the utility of indexes. The fundamental lack of structure in genetic sequences makes it difficult to construct efficient and effective indexes.
The length of a DNA sequence can be measured in terms of the number of base pairs (bp). Because of their size, gigabase pairs (Gbp) is a more convenient unit. For example, mammalian genomes are typically 3 Gbp in length. The largest public database of DNA, which contains over 15 Gbp (June 2001), is an archive which holds indexes to fields associated with each DNA entry but does not index the DNA itself. In the industrial domain, Celera Genomics have sequenced several small organisms, the human genome, and four different mouse strains. Their sequences are accessed as flat files.
Searching DNA sequences is usually carried out by sequentially scanning the data using a filtering approach, and discarding areas of low string similarity. Typically, this approach uses a large infrastructure of parallel computers. Its viability depends on biologists being able to localise the searches to relatively small sequences, on skill in providing appropriate search parameters, and on batching techniques. Even under these circumstances it cannot always deliver fast and appropriate answers.
Suffix trees
Suffix trees are compressed digital tries (prefix trees). Given a string, we index all suffixes, e.g. for a string of length 10, all substrings starting at index 0 through 9 and finishing at index 9 will be indexed. The root of the tree is the entry point, and the starting index for each suffix is stored in a tree leaf. Each suffix can be uniquely traced from the root to the corresponding leaf. Concatenating all characters along the path from the root to a leaf will produce the text of the suffix.
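As a small illustration of indexing all suffixes, here is an uncompressed suffix trie built with nested Python dictionaries for the short string ACATCTTA. The `"$"` key used to store each suffix's starting index is my own convention for this sketch. Searching for a pattern means walking down from the root one character at a time and then collecting every start index found below that point.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie; '$' holds the suffix start index."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start
    return root


def find_occurrences(trie, pattern):
    """Return the sorted start positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every suffix stored below this node begins with the pattern.
    starts, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$":
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)


trie = build_suffix_trie("ACATCTTA")
print(find_occurrences(trie, "T"))   # [3, 5, 6]
print(find_occurrences(trie, "CT"))  # [4]
```

The lookup cost depends on the pattern length and the number of hits, not on the length of the indexed text, which is the whole attraction of this kind of index over serial scanning.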

An example digital trie representing ACATCTTA is shown in the figure above. The number of children per node varies but is limited by the alphabet size. This trie can be compressed to form a suffix tree, shown in the figure below.

To change a trie into a suffix tree, we conceptually merge each node which has only one child with that child, recursively, and annotate the nodes with the indices of the start and end positions of a substring indexed by that node. Commonly, a special terminator character is also added, to ensure a one-to-one relationship between suffixes and leaves (otherwise a suffix that is a proper prefix of another suffix would not be represented by a leaf, for instance node number 8 in Figure 2). The change from a trie to a suffix tree reduces the storage requirement from O(n^2) to O(n).
Most implementations of the suffix tree also use the notion of the suffix link. A suffix link exists for each internal node, and it points from the tree node indexing aw to the node indexing w, where aw and w are traced from the root and a is of length 1. Suffix links were introduced so that suffix trees could be built in O(n) time. However, in our understanding, they are also the cause of the so-called “memory bottleneck”. Suffix links, shown in Figure 3, traverse the tree horizontally, and together with the downward links of the tree graph, make for a graph with two distinct traversal patterns, both of which are used during construction.
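The effect of the merge can be checked by counting nodes before and after. In this sketch the trie is built naively, a terminator `$` is appended to the text, and `compress` merges every single-child node into its child. For readability the merged edges carry the substring itself; storing start and end indices instead, as described above, is what keeps each node a fixed size. This is a conceptual illustration only, not the disk-based construction algorithm discussed in the paper.

```python
def build_trie(text):
    """Naive suffix trie of nested dicts: one node per character of each suffix."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge each single-child node with its child, so edges carry substrings."""
    compact = {}
    for label, child in node.items():
        while len(child) == 1:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        compact[label] = compress(child)
    return compact


def count_nodes(node):
    """Count this node plus everything below it."""
    return 1 + sum(count_nodes(child) for child in node.values())


text = "ACATCTTA$"                 # '$' is the terminator character
trie = build_trie(text)
tree = compress(trie)
print(count_nodes(trie), count_nodes(tree))   # 41 nodes shrink to 13
print(sorted(tree))                           # top-level edges: ['$', 'A', 'C', 'T']
```

With the terminator in place there is exactly one leaf per suffix (9 here), and every internal node branches, so the compressed tree can never have more than about twice as many nodes as the text has characters.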

So, at least one of those traversal patterns must be effectively random access of the memory. At each level of the memory hierarchy this induces cache misses. For example, it makes reliance on virtual memory impractical.
As would be expected from this analysis, we have observed very long tree construction times when using disk with the O(n) suffix-link based algorithms. A first approach is to attempt to build the trees incrementally, checkpointing the tree after each portion has been attempted. Here, the suffix-link based algorithm exhibits another form of pathological behavior. The construction proceeds by splitting existing nodes, adding siblings to nodes and filling in suffix-link pointers. As a result of the dual-traversal structure, no matter how the tree is divided into portions, a large number of these updates apply to the tree already checkpointed. This has the cost of installation reads and logged writes, if the checkpointed structure is not to be jeopardised. In addition, the checkpointed portions of the tree are repeatedly faulted into main memory by the construction traversals. These effects combine to limit the size of tree that can be constructed and stored on disk using suffix-link based algorithms to approximately the size of the available main memory. For example, in Java, using 1.8 Gbytes of available main memory we could build transient trees for up to 26 Mbp sequences. Using the suffix-link based algorithm under PJama, checkpointing trees indexing more than 21 Mbp has not been possible (the reduction on using PJama is due to two effects:
(i) It increases the object header size,
(ii) It competes for space, e.g. to accommodate
The PJama platform
The first set of experimental trials of this new algorithm has been conducted using the PJama platform. PJama minimizes the software engineering cost of providing integrated software environments supporting a very wide range of bioinformatics tasks. PJama enabled easy transitions between different underlying tree representations, and immediate transparent store creation from Java without any intermediate steps. Both transient and persistent trees can be produced using the same compiled code, with only a command-line parameter for PJama indicating whether a persistent store is being used. Although tuned, purpose-built mechanisms may be appropriate for large-scale indexes, the cost of implementing and maintaining them would be an impediment to rapid experimentation. In addition, a great many index technologies are proposed and tested in this area of application, as well as in many others.
Hence, if we can make the general purpose persistence mechanism work for indexes, there could be considerable pay offs in reduced implementation times and more rapid deployment. The applications of the suffix trees will require much annotation and other data to make them useful to the biologists. This data, at least, does not have demanding processing and access performance requirements. Consequently, there are advantages to developing as much of the application code as possible in Java, for ease of multi-platform deployment.
Conclusion:
Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interaction, to the whole metabolism of organisms, to the evolution of species. This knowledge helps facilitate the fight against diseases, assists in the development of medications and helps in discovering basic relationships amongst species in the history of life. Biological knowledge is usually distributed amongst many different specialized databases. This makes it difficult to ensure the consistency of information, which sometimes leads to low data quality. Thus, with the latest search and indexing techniques, we can access the data stored in biological databases and assist researchers in comparing DNA sequences and finding matches.


