Life Sciences Data Management for Better Outcomes
Introduction
In the healthcare and life sciences (HCLS) industry, life sciences data management plays a crucial role in handling the massive influx of omics data. Organizations are integrating genomics, proteomics, and other datasets into applications for drug discovery, clinical development, molecular diagnostics, and population health. However, managing this data efficiently is a challenge, especially as laboratory instrumentation advances and generates exponentially larger datasets.
To ensure seamless data movement, life sciences data management must incorporate secure, reliable, and automated data transfer mechanisms. Moving large datasets from laboratory instruments to cloud-based analytics platforms requires optimized strategies that support high-speed transfers, maintain data integrity, and comply with industry regulations.
This blog explores best practices for life sciences data management, ensuring that organizations can securely store, process, and analyze data to drive meaningful scientific discoveries and improve healthcare outcomes.
Understanding Life Sciences Data Management in Laboratory Operations
Key Challenges in Life Sciences Data Management
HCLS organizations leverage cutting-edge lab techniques to study cellular interactions, aiming to discover new therapies and improve human health. For example, advanced microscopy techniques used to study brain function can generate up to 300 TB per sample in eight hours, while large-scale genomics technologies profiling 100,000 patients for cancer can produce 20 PB of raw data. The challenge lies in transferring such massive datasets to cloud computing resources without disrupting scientific workflows.
Understanding your lab’s infrastructure is essential for optimizing data transfers. Evaluation criteria fall into three main categories:
Lab Operations
Understanding how the lab runs on a typical day and on peak, high-throughput days is important. To estimate the volume of data generated, consider the following (a simple sizing sketch follows the list):
a. How many instruments of each type are there?
b. How many instruments are running at once?
c. How many runs are completed per day per instrument?
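As a back-of-the-envelope illustration of how these answers translate into a sizing estimate, the short sketch below multiplies instrument counts, runs per day, and output per run. The instrument mix and per-run volumes are hypothetical placeholders, not figures from this article.

```python
# Rough daily data-volume estimate from the lab-operations questions above.
# All instrument counts and per-run sizes are hypothetical examples.

def daily_output_tb(instruments: dict) -> float:
    """Sum TB/day across instrument types: count * runs per day * TB per run."""
    return sum(
        spec["count"] * spec["runs_per_day"] * spec["tb_per_run"]
        for spec in instruments.values()
    )

lab = {
    "sequencer":  {"count": 5, "runs_per_day": 0.5, "tb_per_run": 2.4},  # 48-hour runs
    "microscope": {"count": 1, "runs_per_day": 1.0, "tb_per_run": 10.0},
}

print(f"Estimated peak output: {daily_output_tb(lab):.1f} TB/day")
```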
Instrument Duty-Cycle
Each instrument follows a unique duty cycle, impacting data transfer automation and optimization. Consider:
a. How many samples per run and how much data does each run generate?
b. What is the length of the run and timing of the output?
c. What is the layout of the data? Is it a single, large file or multiple smaller files?
d. Does the instrument software write a file or provide a status that signals the start or completion of a run? Can these signals be monitored as events to trigger actions (see the monitoring sketch after this list)? Can third-party software be installed on the instrument workstation?
e. Is any local pre-processing of the run data needed before transferring to the cloud (e.g., de-multiplexing, motion correction)?
f. How will the data be processed and analyzed after it reaches the cloud?
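To make the event question above concrete, here is a minimal polling sketch that watches a run folder for a completion marker and then hands off to a transfer step. The marker filename and run directory path are assumptions for illustration; substitute whatever signal your instrument software actually writes.

```python
import time
from pathlib import Path

RUN_ROOT = Path("/mnt/instrument/runs")   # hypothetical NAS mount
MARKER = "CopyComplete.txt"               # hypothetical completion signal

def wait_for_completed_runs(poll_seconds: int = 60):
    """Yield each run directory once the instrument writes its completion marker."""
    seen = set()
    while True:
        for run_dir in RUN_ROOT.iterdir():
            if run_dir.is_dir() and run_dir not in seen and (run_dir / MARKER).exists():
                seen.add(run_dir)
                yield run_dir
        time.sleep(poll_seconds)

for completed in wait_for_completed_runs():
    print(f"Run complete, ready to transfer: {completed}")
    # e.g., notify the laboratory control plane to start a transfer for this run
```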
IT Infrastructure
The role of on-premises IT infrastructure is crucial in data transfers. Key considerations include (a quick transfer-time estimate follows the list):
a. What is the local infrastructure? Does the on-prem infrastructure remain in place? Is it needed for preprocessing?
b. Where is the data stored after it is generated? How much storage is needed to cache active data if the network connection to AWS is unavailable?
c. Is the local network shared or dedicated to the lab? What is the network topology of the building and lab?
d. Are there any network security devices that might increase latency?
e. What is the available network bandwidth for data transfers?
f. What is the connection to AWS? Is it over the internet or is there a direct connection (e.g., AWS VPN, AWS Direct Connect)?
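The bandwidth questions above feed directly into a simple feasibility check: how long will one run take to move at a given sustained rate? The utilization factor below is a hypothetical allowance for protocol overhead and shared links.

```python
def transfer_hours(dataset_tb: float, link_gbps: float, utilization: float = 0.7) -> float:
    """Approximate wall-clock hours to move dataset_tb over a link_gbps connection."""
    bits = dataset_tb * 8e12                      # TB -> bits (decimal units)
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600

# Example: a 2.4 TB sequencing run over 1 Gbps internet vs. 10 Gbps Direct Connect
print(f"1 Gbps:  {transfer_hours(2.4, 1):.1f} h")
print(f"10 Gbps: {transfer_hours(2.4, 10):.1f} h")
```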
Optimizing Life Sciences Data Management with AWS DataSync
Moving large data volumes to AWS is challenging due to network limitations and infrastructure constraints, and organizations employ various strategies to optimize their transfers. AWS DataSync is central to many of them.
DataSync provides the security, data integrity, transfer mechanics, and logging that are critical for tracking, validating, and auditing data transfers between different storage technologies.
Data can be transferred to or from on-premises multi-protocol network-attached storage to multiple AWS storage options, such as Amazon S3 and Amazon FSx for Lustre. DataSync uses compression to move the data to AWS as fast as possible. Although DataSync transfer tasks cannot be triggered directly by instrument events, tasks can be launched from a laboratory control plane built with AWS Lambda and other services.
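As a sketch of what such a control plane might look like, the hypothetical Lambda handler below starts a pre-created DataSync task when it receives a run-completion event, scoping the transfer to that run's directory with an include filter. The task ARN, event shape, and directory naming are assumptions, not part of any published solution.

```python
import os
import boto3

datasync = boto3.client("datasync")
TASK_ARN = os.environ["DATASYNC_TASK_ARN"]    # pre-created DataSync task (assumed)

def lambda_handler(event, context):
    """Start a DataSync task execution scoped to the run directory named in the event."""
    run_dir = event["run_directory"]           # e.g. "/runs/230115_A00123_0042" (hypothetical)
    response = datasync.start_task_execution(
        TaskArn=TASK_ARN,
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": run_dir}],
    )
    return {"taskExecutionArn": response["TaskExecutionArn"]}
```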
For many HCLS customers, building the laboratory control plane for DataSync is a critical component for automating lab data transfers. Automated DataSync tasks follow one of two patterns. The “push” pattern uses DataSync agents running on premises to push data into AWS. The “pull” pattern uses DataSync agents running on Amazon Elastic Compute Cloud (Amazon EC2), minimizing the need for on-premises data transfer infrastructure. Both patterns have been successfully deployed by HCLS customers to accelerate laboratory data transfers. AWS Prescriptive Guidance and solution architectures are available for genomics data (push) and for connected lab solutions. Both use DataSync for performant data transfers so that scientists can focus on high-value science instead of data movement.
The push pattern is the recommended deployment method to take full advantage of DataSync's inline compression and to minimize latency. The DataSync agent is a virtual machine deployed on premises, close to the data source, to minimize the effects of latency on NFS/SMB traffic. This is especially important where the laboratory instrumentation is sensitive to the latency of the source storage system. The agent provides optimized and resilient logic to push data over commodity internet, VPN, or Direct Connect to your destination storage in AWS.
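A minimal push-pattern setup, assuming an on-premises agent has already been deployed and activated, looks roughly like the following: register the NFS share and S3 bucket as DataSync locations, then create a task between them. The hostnames, ARNs, and paths are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Source: NFS export on the lab NAS, reached through the on-premises agent (placeholder ARNs).
src = datasync.create_location_nfs(
    ServerHostname="nas.lab.example.com",
    Subdirectory="/exports/sequencing",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"]},
)

# Destination: S3 bucket prefix, written via an IAM role that DataSync can assume.
dst = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-lab-raw-data",
    Subdirectory="/novaseq",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3Access"},
)

# Task with verification of the files transferred.
task = datasync.create_task(
    SourceLocationArn=src["LocationArn"],
    DestinationLocationArn=dst["LocationArn"],
    Name="lab-push-to-s3",
    Options={"VerifyMode": "ONLY_FILES_TRANSFERRED"},
)
print(task["TaskArn"])
```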

While the push pattern is the recommended deployment for most lab environments, the pull pattern is also a supported deployment. Use the pull pattern when growing on-premises infrastructure is not an option or when a dynamic fleet of on-demand DataSync agents is needed to match peak laboratory output. The pull pattern uses DataSync agents running on EC2 instances. To use the pull pattern, Direct Connect is necessary to ensure a low-latency connection to AWS. Essentially, Direct Connect extends your on-premises NFS/SMB storage to DataSync agents on EC2 instances, which pull data across to AWS. Agents are activated with private endpoints, and data is transferred through the DataSync service, which then writes it directly to the destination, such as Amazon S3.
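For the pull pattern, agents run on EC2 and are typically activated against a DataSync VPC endpoint so that traffic stays private. A hedged sketch of that activation step is shown below; the activation key is obtained from the agent instance out of band, and all IDs and ARNs are placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# The activation key is retrieved from the EC2-hosted agent before this call;
# the value below is a placeholder.
agent = datasync.create_agent(
    ActivationKey="AAAAA-BBBBB-CCCCC-DDDDD-EEEEE",
    AgentName="lab-pull-agent-1",
    VpcEndpointId="vpce-0123456789abcdef0",   # DataSync interface endpoint in your VPC
    SubnetArns=["arn:aws:ec2:us-east-1:111122223333:subnet/subnet-0123456789abcdef0"],
    SecurityGroupArns=["arn:aws:ec2:us-east-1:111122223333:security-group/sg-0123456789abcdef0"],
)
print(agent["AgentArn"])
```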

Validating Data Transfer Strategies for Life Sciences Data Management
The number of instruments, the rate of data generation, file types, and networking determine the number and configuration of DataSync agents required for optimal data transfers. As there are many factors to consider, we profiled the pull pattern to help you determine whether it is a fit for your laboratory.
We considered two types of customers when validating the pull pattern. One runs a small genomics sequencing lab with one instrument, and another runs a large lab with a fleet of five sequencing instruments.
Both customers use the Illumina NovaSeq 6000 for their next-generation sequencing applications. Each instrument generates 2.4 TB per run. A run requires 48 hours to complete and generates ~500K BCL files. The instrument software creates a flow cell directory on the local network-attached storage, signaling the start of a run, and issues a semaphore marking the completion of the run. Our customers perform primary analysis on premises, converting BCL output to Fastq files (BCL2Fastq). Transfers begin when Fastq files are generated. Both laboratories use a dedicated 10 Gbps Direct Connect link to AWS.
Using the best-practice architectures for scaling out DataSync transfers, we validated the pull pattern for large and small transfers using instrument data stored on network-attached storage (NAS) hosted in a data center with a 10 Gbps Direct Connect link to AWS. A 3.6 TB Fastq dataset approximated a single Illumina NovaSeq 6000 run, representing the smaller laboratory. An 18 TB Fastq dataset represented the output of the larger customer with five Illumina NovaSeq 6000 sequencers. We evaluated one to four agents, using one task per agent, to pull data from the NAS to an S3 bucket. The datasets were evenly split among the DataSync agents using DataSync filters.
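To reproduce the even split described above, each agent's task can be started with a disjoint include filter so the agents work on different run directories in parallel. The task ARNs and directory patterns below are illustrative placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# One pre-created task per agent (placeholder ARNs), each given a disjoint slice of the dataset.
splits = {
    "arn:aws:datasync:us-east-1:111122223333:task/task-agent1": "/flowcell_A*|/flowcell_B*",
    "arn:aws:datasync:us-east-1:111122223333:task/task-agent2": "/flowcell_C*|/flowcell_D*",
}

for task_arn, pattern in splits.items():
    execution = datasync.start_task_execution(
        TaskArn=task_arn,
        Includes=[{"FilterType": "SIMPLE_PATTERN", "Value": pattern}],
    )
    print(task_arn, "->", execution["TaskExecutionArn"])
```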
To understand the behavior of DataSync agents deployed on EC2 instances, new agents were deployed for each test iteration to ensure the availability of network burst credits. DataSync task output, Amazon EC2 metrics, and Direct Connect CloudWatch metrics were collected for analysis.
Two key points to remember when using a pull pattern:
1. Single-flow traffic: Amazon EC2 single-flow network bandwidth is limited to 5 Gbps when instances are not in the same cluster placement group.
2. Network burst performance: EC2 instances with an “up to” network bandwidth rating use a burst credit model to reach that bandwidth and have a lower underlying baseline bandwidth. When credits are exhausted, the instance reverts to its baseline network bandwidth.
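One way to see this burst-then-baseline behavior for yourself is to chart the agent instance's NetworkOut metric during a transfer, as in this rough sketch (the instance ID and time window are placeholders).

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="NetworkOut",        # bytes sent out of the instance per period
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # agent instance (placeholder)
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    gbps = point["Sum"] * 8 / 300 / 1e9   # bytes per 5-minute period -> Gbps
    print(point["Timestamp"].isoformat(), f"{gbps:.2f} Gbps")
```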

Small Dataset: 3.6 TB
We began testing with one task per agent. The single-task transfer hit the 5 Gbps flow limit while using the available network burst credits. We know this is the flow limit because the m5.2xlarge EC2 instances are rated for up to 10 Gbps, yet a single task and agent never consumed the maximum available bandwidth. At about one hour, the burst credits are exhausted and the transfer maintains 2.5 Gbps until complete. Two agents, each configured with one task, make more efficient use of the available Direct Connect bandwidth and sustain burst-level throughput for the duration of the transfer; with this dataset, neither EC2 instance runs out of burst credits. Adding a third and fourth task, each with its own agent, does not dramatically accelerate the transfer because the additional burst credits are not used.

Large Dataset: 18 TB
As with the small dataset, we started with a single task configured with one agent. The transfer hit the 5 Gbps flow limit, consuming all the network burst credits; once they were consumed, throughput returned to the 2.5 Gbps baseline. When the number of tasks and agents is increased to two, the transfer consumes the additional bandwidth available on the Direct Connect link, and, as in the single task/agent scenario, throughput returns to baseline once the network burst credits are consumed. With three tasks/agents, no individual task reaches the 5 Gbps flow limit; the network burst credits last longer and the transfer begins to saturate the available 10 Gbps Direct Connect link. Even so, the burst credits are consumed before the transfer completes, and throughput returns to the 2.5 Gbps baseline. At four tasks and agents, the network burst credits are not consumed except in opportunistic spikes: each of the four agents maintains approximately 2.5 Gbps for the duration, and the transfer uses the entire 10 Gbps Direct Connect link for the entire task duration. A similar pattern was observed with much larger datasets; DataSync continues transferring data while optimizing for the 10 Gbps Direct Connect link, maximizing throughput.

Other Factors
The performance achieved in the preceding validation reflects just one scenario and is not a performance guarantee. A data transfer solution must also satisfy business requirements such as sample turnaround time, security, and compliance.
Consider turnaround time (TAT). For example, a large clinical lab operating 30 instruments and generating over 50 TB of genomic data per day may require twelve 10 Gbps Direct Connect links spread across virtual private cloud subnets to meet a required 24-hour turnaround time for a batch of diagnostic tests, whereas a smaller research lab with two or three instruments generating 5 TB weekly may work fine with a site-to-site VPN tunnel over its existing 1 Gbps internet connection. For more information and architectures for DataSync transfers, refer to the DataSync documentation, and see these examples to learn how to configure Direct Connect routing for DataSync architectures.
DataSync securely transfers data between self-managed storage systems and AWS storage services, and between AWS storage services. How your data is encrypted in transit depends on the locations involved in the transfer. Details on data protection, data integrity, resilience, and infrastructure security as they relate to DataSync can be found in the documentation.
For auditing and logging, DataSync integrates with Amazon CloudWatch, AWS CloudTrail, and Amazon EventBridge; you can also monitor transfers manually, for example by querying task executions with the AWS CLI or SDKs.
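Alongside CloudWatch and EventBridge, task progress can be checked manually by polling the task execution from the SDK, roughly as sketched below (the execution ARN is a placeholder).

```python
import time
import boto3

datasync = boto3.client("datasync")
EXECUTION_ARN = "arn:aws:datasync:us-east-1:111122223333:task/task-abc/execution/exec-123"  # placeholder

while True:
    detail = datasync.describe_task_execution(TaskExecutionArn=EXECUTION_ARN)
    status = detail["Status"]
    print(f"{status}: {detail.get('BytesTransferred', 0) / 1e9:.1f} GB transferred")
    if status in ("SUCCESS", "ERROR"):
        break
    time.sleep(60)
```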
Conclusion
This blog explored best practices for optimizing data transfers in HCLS laboratories. Understanding lab operations, instrument duty cycles, and IT infrastructure enables organizations to implement efficient data movement strategies.
In healthcare and life sciences, data powers innovation. Efficiently transferring omics data from laboratories to the cloud is essential for accelerating research, improving diagnostics, and enhancing patient care. Adopting AWS DataSync and tailored data transfer solutions ensures organizations can scale with growing data volumes and drive meaningful scientific insights.
At Cloudastra, we help HCLS organizations streamline life sciences data management by providing cutting-edge solutions for secure, scalable, and efficient data transfers—empowering laboratories to focus on scientific discovery and better patient outcomes.
Would you like to read more educational content? Read our blogs at Cloudastra Technologies or contact us for business enquiries at Cloudastra Contact Us.