investment possibilities in the area of Big Biological Data.
Indeed, biological experiments produce plenty of data, from basic R&D up to clinical drug trials and all the way to the point of medical care. Here's a quick guide to just a few of the ways that software is transforming the life sciences.
Raw Biological Data
In R&D, data consist primarily of images and numerical measurements. The numbers are easier to store, manage and analyze.
To take just one example, when drug companies screen potential compounds, they run them through assays in high throughput, checking for key measures of activity, dosing and toxicity. The output fits into simple text files that can be opened in any spreadsheet program. Relational databases have made data access more convenient, and scale fine up into the millions of records.
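To make that concrete, here is a minimal sketch of loading screening output from a text file into a relational database and querying it. The column names, the compound IDs, the measurements and the activity and toxicity cutoffs are all invented for illustration; real assay exports vary by instrument and vendor.

```python
import csv
import io
import sqlite3

# Hypothetical screening output: every name and value below is made up.
ASSAY_TEXT = """compound_id,activity_pct,dose_um,toxicity_pct
CMP-001,82.5,1.0,4.0
CMP-002,15.0,1.0,2.5
CMP-003,91.0,0.5,38.0
CMP-004,77.0,2.0,6.5
"""


def load_assay_results(text):
    """Parse comma-separated assay text into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assay ("
        "compound_id TEXT PRIMARY KEY, "
        "activity_pct REAL, dose_um REAL, toxicity_pct REAL)"
    )
    rows = [
        (
            record["compound_id"],
            float(record["activity_pct"]),
            float(record["dose_um"]),
            float(record["toxicity_pct"]),
        )
        for record in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO assay VALUES (?, ?, ?, ?)", rows)
    return conn


def promising_compounds(conn, min_activity=70.0, max_toxicity=10.0):
    """Return IDs of active, low-toxicity compounds, most active first."""
    cursor = conn.execute(
        "SELECT compound_id FROM assay "
        "WHERE activity_pct >= ? AND toxicity_pct <= ? "
        "ORDER BY activity_pct DESC",
        (min_activity, max_toxicity),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    print(promising_compounds(load_assay_results(ASSAY_TEXT)))
```

The same query shape works whether the table holds four rows or a few million, which is the convenience that relational databases bring to this kind of data.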
High resolution images take up vastly more space. Gone are the days of drawing pictures with colored pencils while squinting through a microscope. Digital photography has changed the way laboratory scientists record what they see, saving time and money, and moving the results from filing cabinets to file systems. Several software vendors offer the equivalent of Flickr for biology and medicine.
I've written previously about genetic sequence data, and some of the challenges of dealing with it, partly caused by limitations in current measurement technologies. Suffice it to say that genetic data is Big and requires serious storage space.
In a medical setting, patient records consist of many relatively small data bits, in the form of visit dates and blood test measurements, as well as vastly bigger data files, like medical imaging from various forms of scanning. Electronic patient records are steadily moving all of the data into the computer age.
Once data can be linked directly to an individual, a host of medical regulations kick in to protect patient privacy and ensure the traceability of medical decisions. Regulations require extensive compliance checks that produce reams of electronic paperwork, yet another form of abundant biologically related data.
Organizing Biological Information
Scientists used to write down what they did in the laboratory in a paper notebook, then have fellow scientists co-sign as witnesses for future patent reasons. But over the past decade, scientists have increasingly adopted electronic lab notebooks (ELNs). In digital form, lab records are much easier to search -- by date, by experiment type, by key word, by equipment used and more.
To manage the sequence of experiments and trace all of the inputs, such as reagents and disposable test kits, scientists use laboratory information management systems (LIMS). Lab notebooks and LIMS are merging, so that the same software vendors, and even the same software products, often provide both.
During the course of data collection and entry, scientists need to perform some careful curation and classification. For example, scientists may recognize that an aberrant cell shape means the cells have run amok in cancer. Or they may see that your wisdom tooth is impacted and pushing your other teeth forward.
Image processing software has made tremendous progress in classifying images, reducing the need for people to do it. And even where artificial intelligence and machine learning algorithms can't match the skill of a doctor, the fact that images are digital means that hospitals can zap the files around the world, where skilled medical practitioners in other countries can classify the results of heart imaging, brain scans, x-rays of bones and more.
After collecting and classifying data, turning it into useable information, scientists and medical practitioners can begin to derive knowledge from it.
Producing Knowledge from Biology
In order to reach medical conclusions from Big Biological Data, doctors and scientists need to look at it within the framework of specific questions. They carefully design clinical trials, determining what clinical end point will prove that a drug works, and making sure they enroll enough patients to be confident the results aren't statistical flukes. When the results come in, they apply rigorous statistical tests.
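As a rough illustration of what "enough patients" means, here is a sketch of one textbook approximation for sizing a trial that compares two response rates. The response rates, significance level and power below are assumptions chosen for the example, and real trial designs involve many more considerations (dropouts, interim analyses, multiple end points) than this formula captures.

```python
import math
from statistics import NormalDist


def patients_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate patients needed in each arm of a two-arm trial.

    Uses the normal approximation for comparing two proportions with a
    two-sided test. This is a simplification, not a full trial design.
    """
    if p_control == p_treatment:
        raise ValueError("the two response rates must differ")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # guards against false positives
    z_beta = standard_normal.inv_cdf(power)           # guards against missed effects
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Assumed rates: 30% of patients respond on placebo, 45% on the drug.
    print(patients_per_arm(0.30, 0.45))
```

With those assumed rates, the approximation calls for about 160 patients per arm. Halving the expected improvement roughly quadruples the number of patients required, which is one reason trials of modestly effective drugs get so large.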
Decades ago, dedicated statisticians would help to construct valid studies with sufficient numbers of patients or samples to reach a conclusion. Laboratory scientists would perform experiments and hand-curate the results. Then they'd hand the data back to the statisticians for analysis on mainframe computers.
Today software packages make basic statistics widely accessible. Expert statisticians continue to provide valuable services, but software has enabled them to perform more sophisticated analyses much faster. For standard data analysis, even those without deep mathematical skills can go pretty far on their own.
The techniques of Big Data have also enabled researchers to look for patterns that would have been difficult or even impossible to find previously. For example, literature searches from online databases of scientific publications can uncover remote correlations between treatments and side effects, or disease and diet.
Some of these approaches allow hypothesis-free search. The algorithm can find patterns without even understanding the meaning of any of the technical biological words.
Deeply knowledgeable doctors and scientists still need to look at the results in order to extract knowledge from the patterns. And they do, because of professional integrity and fear of embarrassing themselves.
Remarkably, very little biological data currently exists in the public domain, a point that holds true of the sciences in general. The excellent book Reinventing Discovery describes some of the great benefits of open science, and how the scientific enterprise needs to change to encourage more openness.
For genetic sequence data, the National Institutes of Health and the European Molecular Biology Laboratory host free public databases. Scientific journals require all genetic sequence data to be posted prior to publication.
But the standards are evolving, and in many fields outside of genetics, very little data ever makes it into the public domain.
So long as biological data can't be tracked to individual people, the privacy issues have mostly simmered in the background. As whole human genome sequencing becomes more common, simply hiding the name of the subject will not be enough. The reason is that your genetic sequence identifies you more accurately than any photo on Facebook.
If we were truly to make biological data open, strict privacy controls would need to be in place. People would want control over what information is made public. Would you choose to share all of your medical data with the whole world? With all of your friends? With researchers under legal contract not to share? With your doctor only?
Other biological and medical data also remain private, living in individual research labs, drug development companies, hospitals and insurance companies. That's unlikely to change, because of the perceived and real value, as well as the fear of lawsuits.
Even without data truly becoming open, mergers and acquisitions are putting ever bigger biological data sets in the hands of a small number of companies. From the perspective of Big Data analytics, that's an exciting thing. From the perspective of patient privacy, it's worrisome.
Big Biological Data in the Future
In order to take advantage of the kinds of algorithms that Big Data has inspired, biological and medical data will need to move onto new types of storage and analysis servers, aggregated across more individual researchers and patients.
There's going to be big tension between consolidation and individual privacy and non-discrimination. Because of the legal and ethical ramifications, Big Biological Data will likely evolve more slowly than the current generation of Big Data.
To wrap up, there will be great ideas, companies and investments in Big Biological Data in the coming years. But don't expect billion dollar acquisitions of young startups. Consumer trends last for a few minutes up to a few years. The transformation of data for human health will yield benefits that last many lifetimes.