COVID-19 may turn out to be the biggest data event of the decade — perhaps even the century.
When the pandemic hit in early 2020, getting accurate, real-time data became essential at every level of society: Government agencies enacted lockdown measures based on data, hospitals relied on it to forecast bed shortages and the general public used it when gauging the safety of everyday activities. Since then, government agencies, research labs and media organizations have worked tirelessly to provide this kind of accessible data.
UCSF’s data science and innovation team was at the forefront of these efforts in the Bay Area. The team is made up of seven members, each of whom has a background in data science, health care or a mix of both. Together, they use data science and data visualization to address the most pressing problems across the health system’s four campuses (Parnassus, Mission Bay, Mount Zion and BCH-Oakland).
At the onset of the pandemic, when COVID admissions were just starting to pick up, UCSF’s chief medical officers tracked hospitalization counts by simply writing them down on a whiteboard. Eventually, word of this low-tech data tracking got back to Associate Professor of Medicine Sara Murray, who leads the data science team, and they quickly mobilized to develop an automated solution.
The result was an online dashboard, which, over time, has turned into a suite of dashboards, each one packed with metrics and visualizations. The original dashboard, for instance, includes hospitalization and test positivity rates cut by every imaginable grouping — vaccination status, level of hospital care, patient demographics, and symptomatic versus asymptomatic cases — and tracked over time.
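To make the idea of "cutting" a rate by a grouping concrete, here is a minimal Python sketch. It is not the UCSF team's code: the records, field names and values are invented for illustration, and it shows only the arithmetic of a positivity rate (positive tests divided by total tests) computed per group.

```python
from collections import defaultdict

# Hypothetical test records; field names and values are illustrative only.
tests = [
    {"vaccinated": True, "symptomatic": True, "positive": True},
    {"vaccinated": True, "symptomatic": False, "positive": False},
    {"vaccinated": True, "symptomatic": False, "positive": False},
    {"vaccinated": True, "symptomatic": True, "positive": False},
    {"vaccinated": False, "symptomatic": True, "positive": True},
    {"vaccinated": False, "symptomatic": False, "positive": False},
]


def positivity_by(records, key):
    """Return {group: share of positive tests} for the grouping field `key`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        positives[group] += int(record["positive"])
    return {group: positives[group] / totals[group] for group in totals}


by_vaccination = positivity_by(tests, "vaccinated")
by_symptoms = positivity_by(tests, "symptomatic")
```

The same function handles any grouping field, which is roughly what lets a dashboard offer the same metric sliced several ways.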
But according to Rhiannon Croci, a clinical informatics specialist and one of the developers of the dashboards, the biggest challenge was not the front-end visualizations, but the back-end data engineering — extracting and restructuring the data into a format that could then be visualized.
The data that Croci worked with came from electronic health records, or EHR — digital versions of patients’ paper charts with information on medical histories, diagnoses, medications, treatments, laboratory test results, vital signs and billing information. These records are ubiquitous within health care: Every time a clinician measures a patient’s vitals (e.g. blood pressure, pulse, body temperature) or administers a treatment (like the monoclonal antibody treatment for COVID-19), each action is logged into the EHR. Whenever a new patient is admitted to a hospital, their medical history is pulled from the EHR. If a hospital administrator needs to bill an insurance provider, they verify the amount by consulting the EHR.
EHRs update in real time, with changes reflected in the system whenever clinicians enter them. Access to the data, however, is not real-time, at least not the kind of access that UCSF data scientists need to analyze and visualize it. This is because the EHR data is stored in a database that exists outside of the EHR system. When new information is entered into the EHR, it is stored internally until a database update is made each day at around 1 a.m.
For Croci’s past projects visualizing EHR data, the information always came from the database, so visualizations updated only once per day, in sync with the database update. According to Croci, daily data was often fine for past projects, like one that tracked opioid prescriptions issued. But with COVID-19 and the constant changes it brought to the hospital, daily data wouldn’t cut it. So Croci, along with fellow data scientist Joanne Yim and director of analytics strategy Sana Sweis, developed an entirely new method of bypassing the database and extracting data directly from the EHR. The result was data that updated hourly, a huge improvement from daily updates but still not perfect, said Croci.
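The gap between the two schedules is easy to put in numbers. The Python sketch below is only an illustration of that arithmetic, not a description of UCSF's pipeline. It assumes the daily update lands at exactly 1 a.m. and that an hourly extract runs on the hour; the entry time is a made-up example.

```python
from datetime import datetime, timedelta


def next_refresh(entered_at, refresh_hours):
    """Return the first refresh strictly after `entered_at`.

    `refresh_hours` lists the hours of the day (0-23) at which data refreshes.
    """
    midnight = entered_at.replace(hour=0, minute=0, second=0, microsecond=0)
    candidates = [
        midnight + timedelta(days=day, hours=hour)
        for day in (0, 1)
        for hour in refresh_hours
    ]
    return min(t for t in candidates if t > entered_at)


def delay_hours(entered_at, refresh_hours):
    """Hours a new EHR entry waits before it appears in a visualization."""
    wait = next_refresh(entered_at, refresh_hours) - entered_at
    return wait.total_seconds() / 3600


entry = datetime(2020, 4, 6, 2, 30)  # hypothetical entry made at 2:30 a.m.
daily_delay = delay_hours(entry, [1])         # one update a day, at 1 a.m.
hourly_delay = delay_hours(entry, range(24))  # assumed on-the-hour extract
```

Under these assumptions, an entry made at 2:30 a.m. waits 22.5 hours for the daily update but only half an hour for an hourly extract.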
“It’s literally a perfect example of building the plane while flying it or learning how to fly while flying it,” said Croci. “There was a lot of grasping at straws, utilizing completely novel methods to get this data out of EHR in near real-time.”
Croci’s challenges with data engineering resonated with my experiences as a data visualization developer. Readers often marvel at the interactivity or stylistic choices of a visualization — elements that make up what data scientists call the “front end” of a product. But readers are largely unaware of the patience and precision involved in the “back end” work, which involves the data extraction, as well as tasks like converting data into suitable formats and deciding on which metrics to even include in the dashboard in the first place.
For Croci and Yim, these back-end tasks were never-ending. As the medical community’s understanding of COVID-19 evolved, so did the data behind their dashboards: When different COVID lab tests cropped up in the early months of the pandemic, Croci and Yim incorporated each one into their test positivity numbers; every time the CDC authorized vaccines or boosters for new age groups, they adjusted their “fully vaccinated” data definition.
“All the variants, all the different types of testing, different types of vaccines … drugs and treatments. All those things were so dynamic, and we were literally seeing these changes on an hour-by-hour basis,” said Croci.
This meant working long nights and weekends. Croci recalls working nonstop one weekend as she and Yim tried to complete the dashboard. “(A)t the time, the pressure was … we need this done and we needed it yesterday. Any time that you’re taking to do this is already time that is too late,” said Croci when describing the pressure she felt to quickly build the dashboard. “(P)ressure in the sense of knowing that we were needed … to help deliver a product or solution that could help with decision making and help our hospital take care of patients and work more efficiently.”
In the midst of mounting pressure at work, Croci and Yim were also managing changes in their personal lives. Yim had just returned from maternity leave and was facing the challenges of caring for a baby while working at home. Croci had just started home renovations and didn’t have a working kitchen. “You had to walk through a construction zone to get to (the bathroom),” said Croci.

Rhiannon Croci, a UCSF clinical informatics specialist, helped develop an entirely new method of extracting data directly from electronic health records.
Constanza Hevia H./Special to The Chronicle
For the past two years, Croci and Yim have essentially only worked on COVID-related data. “We’ve been in our own little world. … After the first part of the pandemic, a lot of other teams moved on,” said Croci. “(A) lot of other people that we know in the organization that do data work kind of went back to normal stuff but (Yim) and I are like, yeah, we’re still working on COVID.”
What kept Croci motivated was the importance and impact of the work. During COVID-19 surges, UCSF leadership consulted the dashboard when making system-wide policies, such as canceling elective surgeries or upgrading personal protective equipment, said Dr. Bob Wachter, chair of medicine at UCSF.
Though Croci is excited to move on to other projects in a hopefully not-too-distant post-COVID world, she feels lucky to have been able to work on a project that touched so many people’s lives. While the team originally built the dashboards for UCSF leadership, they made them accessible to all UCSF staff and students. In recent months, visits to the dashboard made up almost three-quarters of all visits to dashboards that are available on the UCSF network. In total, the dashboard has had over 330,000 views since it launched in early April 2020.
“It’s wonderful what (the data science team has) done for COVID, and I think they’ve done it extraordinarily well. And it sort of says something larger about health care — that in health care, we need teams like that,” said Wachter. “And when COVID goes away, hopefully, we will have learned the value of people and teams like that and apply it to the rest of health care.”
Wachter, who has closely followed the digital transformation of medicine for the last eight years, hopes the new EHR data pipeline will allow for more analyses and visualizations, like those that he regularly consults in the COVID dashboard. Despite clinicians spending huge amounts of time inputting data into the EHR, hospitals get almost nothing useful back, aside from individual patient records, said Wachter.
For instance, suppose a physician wants to know how many people in their clinic have diabetes. A summary of diabetes treatments administered and how those treatments correlate with patients’ level of glucose control would be hugely beneficial, but it is unavailable through the EHR. Even analyses on the individual level, like charting a patient’s blood pressure over time with data points of when their diet or medication changed, should theoretically be possible within the EHR system but are not.
“You would think these are completely no-brainers, but it’s not. It’s not built into these systems,” said Wachter.
In a city teeming with data-savvy tech workers, it’s hard to imagine such a rich data set was left unanalyzed for so long. But the UCSF data science team has been showcasing the untapped potential of this data over the years.

Director Sara Murray and her data science team at UCSF developed a suite of online dashboards to track COVID hospitalization and test positivity rates over time.
Constanza Hevia H./Special to The Chronicle
One of the team’s first projects, conducted five years ago, was a spatio-temporal mapping of location data stored in the EHR. At the time, the hospital was working on lowering infection rates of C. diff, a bacterium that causes severe diarrhea and inflammation of the colon, often after a patient has taken antibiotics. Because C. diff can be transmitted through the environment, cleaning and isolation protocols are important in lowering transmission rates. So the data science team analyzed location data of C. diff patients, which already existed in the EHR (a patient’s location is logged whenever they relocate within the hospital). This analysis identified possible hot spots of C. diff infection, and the hospital designated better cleaning protocols for these locations.
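The article does not describe how the team's mapping worked, so the Python sketch below is only a simplified stand-in. It ranks locations by how many distinct patients were logged there and ignores the time dimension a real spatio-temporal analysis would use. The patient IDs, unit names and log layout are all hypothetical.

```python
from collections import defaultdict

# Hypothetical location log: one (patient_id, location) row per relocation.
location_log = [
    ("p1", "Unit A"), ("p1", "Unit B"),
    ("p2", "Unit B"), ("p3", "Unit B"),
    ("p3", "Unit C"), ("p1", "Unit B"),  # p1 returns to Unit B
]


def rank_locations(log):
    """Rank locations by the number of distinct patients logged there."""
    patients_at = defaultdict(set)
    for patient_id, location in log:
        patients_at[location].add(patient_id)
    counts = {loc: len(patients) for loc, patients in patients_at.items()}
    # Highest count first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_locations(location_log)
```

Counting distinct patients rather than raw log rows keeps a single patient who moves back and forth from inflating a location's score.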
According to Murray, director of the data science team, this project changed the conversation at UCSF about the importance of data science in health care. “This was the first use case that showed sophisticated analytics can lead to better outcomes within a hospital,” said Murray.
In a way, the team operates like any other data science team you might find at a tech company. But there are some notable differences: Health care data is incredibly messy and the stakes feel higher because analyses affect patient care.
The motivation for their work is also hyper-focused on implementation. Data scientists at other companies will sometimes adopt complex methodologies for the sake of doing something novel. But at UCSF, they stick to the simplest techniques, which are often the easiest to implement. Suppose they want to run an analysis that will be viewed by all UCSF clinicians. Incorporating a strategy like deep learning might be cool, but a regression is the sounder choice since clinicians, of whom there are thousands at UCSF, are more likely to understand it, said Murray.
“We don’t usually do work for research purposes only,” said data scientist Hossein Soleimani. “The only purpose for us to start the work is to actually change something on the health care side of UCSF.”

Data scientist Hossein Soleimani analyzed audit log data to automate the process of tracking medical staff’s working hours.
Constanza Hevia H./Special to The Chronicle
For example, one of Soleimani’s projects automated the manual recording of working hours that is required of medical staff. Soleimani did this by analyzing audit log data already stored in the EHR system (every time a clinician accesses the EHR, their activity is logged into the system). This feature was originally meant as a security measure, but Soleimani used it to develop an algorithm that assigned time estimates to each recorded task. He was then able to calculate residents’ total working hours, which, under national requirements, cannot exceed 80 hours per week.
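As a rough illustration of that idea, and not Soleimani's actual algorithm, the Python sketch below assigns an assumed number of minutes to each logged action, totals them per resident per calendar week, and flags totals above the 80-hour limit. The action names, minute estimates and log layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical time estimates, in minutes, for each type of logged action.
EST_MINUTES = {"open_chart": 5, "write_note": 15, "place_order": 3}
WEEKLY_LIMIT_HOURS = 80  # the national weekly cap cited in the article


def weekly_hours(audit_log):
    """Sum estimated hours per (user, ISO year, ISO week).

    `audit_log` rows are (user, ISO-8601 timestamp, action) tuples.
    Actions without an estimate contribute zero time.
    """
    totals = defaultdict(float)
    for user, timestamp, action in audit_log:
        year, week, _weekday = datetime.fromisoformat(timestamp).isocalendar()
        totals[(user, year, week)] += EST_MINUTES.get(action, 0) / 60
    return dict(totals)


def over_limit(totals, limit=WEEKLY_LIMIT_HOURS):
    """Return the (user, year, week) keys whose total exceeds `limit` hours."""
    return sorted(key for key, hours in totals.items() if hours > limit)


sample_log = [
    ("r1", "2022-01-03T08:00:00", "write_note"),
    ("r1", "2022-01-04T09:00:00", "open_chart"),
    ("r2", "2022-01-03T08:00:00", "place_order"),
]
totals = weekly_hours(sample_log)
flagged = over_limit(totals)
```

A fixed per-action estimate is the simplest possible choice here; a production approach would presumably need to account for gaps between logged actions and for work done away from the EHR.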
These projects are just scratching the surface, said Murray. EHR data is rich with other opportunities, and the data science team is poised to leverage the infrastructure it laid out in its COVID dashboards for future projects.
Nami Sumida is a San Francisco Chronicle data visualization developer. Twitter: @namisumida