Overview
Datavant’s DBX tool provides direct access to a robust Redshift database that stores standardized, aggregated healthcare data across a diverse set of customers. While this capability supports rich analytics and insights, it also requires mindful query construction. Queries that pull entire, large tables or omit critical filters (such as `client_id` joins) can drastically degrade performance, impact the experience for all of your users, and may be terminated by our system.
This article covers best practices to ensure that your queries are both efficient and reliable, preventing slowdowns or disruptions in the data environment.
Why Care About Query Efficiency?
DBX connects to approximately 20 key standard tables, including but not limited to:
- Allergy
- Appointment
- Attribution
- Charge
- Demographic
- Diagnosis
- Encounter
- Immunization
- Lab Order
- Lab Result
- Medical / Surgical History
- Medication
- Patient Tokens
- Payer
- Procedure
- Provider
- Quality Data
- Social History
- Transaction
- Vitals
These tables contain large, aggregated datasets that serve multiple customers. Inefficient queries, such as `SELECT * FROM transaction` without filters, can inadvertently return millions of rows. This not only bogs down performance, but may also cause downstream queries for other users to slow to a crawl, or force the system to kill the offending query to maintain platform stability.
Common Pitfalls to Avoid
- Unfiltered Table Scans:
  - Example of a Problematic Query: sketched below this item.
  - Why It’s Problematic: This query attempts to pull every column and row from a potentially massive table. Even if you plan to apply filters later in your analytics tool, this initial step places a huge load on the database.
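  A minimal sketch of the unfiltered scan described above, using the `transaction` table named earlier in this article (the exact schema qualification in your environment may differ):

  ```sql
  -- Problematic: returns every column and every row of a very large shared table
  SELECT *
  FROM transaction;
  ```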
- Missing `client_id` or Critical Joins:
  - Example of a Problematic Query: sketched below this item.
  - Why It’s Problematic: Without joining on `client_id` (or another appropriate partitioning key), you risk pulling in unrelated data for multiple customers. This ballooning of result sets can overwhelm the database and degrade performance across the board.
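  A sketch of the kind of join this item warns about; the `encounter` and `diagnosis` tables come from the standard table list above, while the join column names are illustrative assumptions:

  ```sql
  -- Problematic: no client_id anywhere, so rows for every customer are combined
  SELECT e.encounter_id,        -- column names are assumed for illustration
         d.diagnosis_code
  FROM encounter e
  JOIN diagnosis d
    ON d.encounter_id = e.encounter_id;
  ```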
- Wide Table Joins Without Proper Filters:
  - Example of a Problematic Query: sketched below this item.
  - Why It’s Problematic: Joining multiple large tables without selective filtering (like date ranges, patient subsets, or `client_id`) can create a massive combined result set that’s expensive to process.
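  A sketch of a wide, unfiltered multi-table join; the table names come from the list above, and the join keys are illustrative assumptions:

  ```sql
  -- Problematic: three large tables joined with SELECT * and no WHERE clause
  SELECT *
  FROM encounter e
  JOIN diagnosis d
    ON d.encounter_id = e.encounter_id   -- join columns are assumed for illustration
  JOIN charge c
    ON c.encounter_id = e.encounter_id;
  ```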
Recommended Best Practices
- Use Targeted Filters:
  - Always include `client_id` in your `WHERE` clause or join conditions if applicable. This ensures you’re only querying the data relevant to your organization.
  - Apply date filters, patient segments, or encounter IDs to limit the dataset.
  - Example: sketched below this item.
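  A sketch of a targeted query; `YOUR_CLIENT_ID` is a placeholder and `encounter_date` is an assumed column name, so substitute the identifiers used in your environment:

  ```sql
  -- Targeted: one client, a bounded date range, and only the needed columns
  SELECT e.encounter_id,
         e.encounter_date
  FROM encounter e
  WHERE e.client_id = 'YOUR_CLIENT_ID'                            -- placeholder value
    AND e.encounter_date BETWEEN '2024-01-01' AND '2024-03-31';   -- illustrative column and range
  ```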
- Limit Columns:
  - Select only the columns you need rather than using `SELECT *`. Reducing the number of columns retrieved can cut down on unnecessary data transfers and speed up query execution.
  - Example: sketched below this item.
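  A sketch of selecting only the needed columns; the `diagnosis` table is listed above, and the specific column names are assumptions:

  ```sql
  -- Only the columns the analysis needs, instead of SELECT *
  SELECT d.patient_id,          -- column names are assumed for illustration
         d.diagnosis_code,
         d.diagnosis_date
  FROM diagnosis d
  WHERE d.client_id = 'YOUR_CLIENT_ID';   -- placeholder value
  ```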
- Use Aggregations Judiciously:
  - If you need high-level metrics, aggregate them at the database level rather than exporting raw data and aggregating locally. This can reduce the result set size significantly.
  - Example: sketched below this item.
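  A sketch of pushing the aggregation into Redshift; the date column is an assumption, and `DATE_TRUNC` rolls encounters up by month so only the monthly counts are returned:

  ```sql
  -- Aggregate in the database instead of exporting raw rows
  SELECT DATE_TRUNC('month', e.encounter_date) AS encounter_month,   -- assumed column
         COUNT(*) AS encounter_count
  FROM encounter e
  WHERE e.client_id = 'YOUR_CLIENT_ID'                               -- placeholder value
  GROUP BY 1
  ORDER BY 1;
  ```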
- Test and Refine Queries Before Scaling Up:
  - Start by testing your queries with a known smaller dataset or narrower filters (see the sketch below this item).
  - Gradually broaden your filters, monitoring execution time and performance.
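  One way to start small, assuming the same illustrative columns as above; the narrow date filter does most of the work, and `LIMIT` simply caps the rows returned while you validate the query shape:

  ```sql
  -- Narrow test run: one client, one week of data, capped output
  SELECT e.encounter_id,
         e.encounter_date
  FROM encounter e
  WHERE e.client_id = 'YOUR_CLIENT_ID'                               -- placeholder value
    AND e.encounter_date BETWEEN '2024-01-01' AND '2024-01-07'       -- illustrative week
  LIMIT 100;
  ```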
Impact by Customer Size
Customers with a handful of connections may not immediately feel the full impact of an unoptimized query. Smaller datasets can tolerate less efficient queries without major slowdowns. However, as your data volume grows, so does the need for careful query design. Larger customers with more extensive data connections stand to face severe performance issues if queries are not properly filtered and structured.
Remember: Even if you’re a smaller customer today, following these best practices ensures you’re prepared as your data grows and helps maintain a stable environment for all users.
Further Reading
For detailed information on each table and its fields, please refer to our Standard Data Model documentation:
Datavant Site Connect’s Standard Data Model
By clicking into each table link within the documentation, you’ll find detailed field definitions and data types. Understanding the structure of each table can guide you in writing more efficient queries that return only the data you need.
Need Help?
If you have questions about optimizing your queries or need guidance on data model best practices, please reach out to our support team at support@healthjump.com. We’re here to help ensure your queries run smoothly and efficiently.