Finding the Needle in the Data Stack: Advice from a Facebook Data Scientist
Alumna Delia Mocanu is a double husky and 2014 PhD recipient in Physics. During her time at Northeastern, she developed a passion for network science, working on data projects with incredible scale. Now at Facebook, she finds herself working on one of the largest data systems in history: News Feed.
She participated in a written engagement with the Northeastern COS on going into industry, why epidemiology works better in the dark, and the most important skill to succeed in data science.
Growing up, did you find that you were interested in physics? Or did that interest develop later in life?
I grew up in Romania and I did not like Physics much at the time because it was too formulaic. In a weird twist of events, as a freshman in college here in the US, I actually switched from Chemical Engineering to a Physics/Math double major.
At some point I realized that Physics made a lot of sense for me and I just enjoyed pure science more than engineering. I liked the rush of solving problems from scratch.
I jumped around a bit until I landed on what I wanted to do. I was originally curious about Astroparticle Physics and I wanted to know everything about how the universe worked. During my first year at NEU, it became clear that I was seeking something more fast-paced. [The irony is not lost on me; this was very antithetical to why I switched from Engineering to Physics four years earlier.]
At Northeastern, seeing the kind of work Barabasi and Vespignani were doing, I immediately recognized that this interdisciplinary field (Network Science/Complex Systems) was more aligned to my existing interests and personal values.
Is there anything that stands out about your graduate/PhD experience?
My advisor really emphasized the idea of ownership, and I liked that we were held to very high standards. Looking back at it now, I always felt like what I was putting my time into truly mattered. Prof. Vespignani was very good at instilling energy in the group.
Did your experience working in Professor Vespignani’s MOBS Lab and other research facilities shape your career decisions?
Absolutely. What I cherish about this PhD experience is that we felt very plugged in; we had well-funded projects that were designed to solve real problems, in real time.
We loved doing something that mattered. I was simultaneously learning something and solving a problem involving millions of people. Little did I know I was going to reach billions later.
You’re currently working at Facebook as a Data Scientist. What does your current role entail?
I work in News Feed, and I’ve been here since I started at Facebook. What makes this role especially stimulating is building solutions that work at scale. I still very much rely on the thought processes and models of the world that I adopted during my PhD. Right now, I couldn’t imagine a better place to apply these.
However, my favorite part of my job is actually identifying opportunities. When I find something worth investing in, I put all my energy into making it a reality. That last part is the most rewarding and it is really more about general problem solving than it is about any specific math/engineering skill.
Have you spoken to friends or colleagues about Professor Vespignani’s COVID-19 models and what they are trying to accomplish? Has it been a source of pride, frustration, or a little of both?
To my friends mostly, yes. I touched epidemiology models a bit during grad school. At the time I remember thinking ‘I hope this software makes a difference someday,’ but I never thought we’d see something like this. Healthy skepticism is good, but I’m quite surprised to see the amount of pushback against these predictions at large, so I would say this has caused a bit of frustration.
Frankly, epidemiology is best when you don’t know that it exists and when the predictions don’t come true. Otherwise, it is not too dissimilar from weather forecasting; every modeling exercise comes with error bars, but one can tell the difference between a major hurricane and a light summer rain. However, in epidemiology you ‘can’ actually turn the would-be hurricane into a light summer rain. Taking action invalidates the original prediction, that’s really the goal. Whereas if your predictions come true, you have failed; that’s the curse of this field.
Computational epidemiology has advanced so much in the past two decades that it’s quite challenging to establish a common language even with other highly technical folks.
What do you find most rewarding about data science?
It’s the most rewarding job you can possibly imagine. It’s always changing and you are constantly learning new things or building new tools so that you can iterate faster.
It’s not just about the act of doing the analysis, but more about where the data fits. A lot of data science is problem solving, which is what I liked about Physics in the first place. You don’t have a solution and no one has ever solved this problem before. There are no instructions and every single day feels like a journey. That dynamic aspect is very important to me.
What was the most important thing you learned at Northeastern?
Professor Vespignani wanted nothing short of perfection. He would sometimes ask you to iterate on the same chart a dozen times before it felt right. It’s about communicating this data in the best way possible. If I have to repeat the same steps several times before I get it right, then I do it, and I think it’s worth it. As a result, I do notice when others take shortcuts.
I can’t stress this enough: the analysis is not an end in itself.
Is there advice you would give to students who are interested in this field or the type of work you’re doing now?
Don’t be afraid of change; your interests will continue to evolve over time. Look at your PhD program as a time in your life to discover what you like doing and work with your advisor through that process, as they should guide you in making the most out of your career.
Your goal in academia is to publish papers and advance knowledge; you may not necessarily implement your ideas right away, and that’s OK. If you choose the industry path, your focus will be on the application itself. The optimal mathematical solution may need a 20-fold simplification so that you can enable the rest of the team to be part of it.
Anything else you’d like to add?
I do want to acknowledge the fact that Northeastern did an incredible job bringing professors from other universities and building great research programs, and not just in network science.
It’s incredible. I do think that I was very lucky to be part of Northeastern because that sort of environment so focused on research is very, very important. I really loved it.
People in the U.S. Started Moving Around More Before Stay-At-Home Measures Were Lifted
People in the U.S. are outside before they’re supposed to be. Wearing masks, meeting outside with small numbers of people, and keeping your distance can help minimize the risks inherent in leaving your house, according to public health officials. If people are traveling slightly further and seeing slightly more people, these safeguards could make a difference. Matteo Chinazzi and Stefan McCabe from the Network Science Institute weigh in on how this can affect the curve during the pandemic.
Originally published on News@Northeastern on May 26, 2020.
The Coronavirus Was in the U.S. in January. We Need to Understand How We Missed It.
COVID-19 was in the United States as early as January, and yet we had no idea. To most people, the virus was a distant worry, if that.
But SARS-CoV-2, the coronavirus that causes COVID-19, was already circulating in major U.S. cities, according to Alessandro Vespignani, Sternberg Family distinguished university professor, who directs Northeastern’s Network Science Institute. And if we want to keep our communities safe going forward, we need to understand how we missed a virus that was right under our noses. “We don’t want to fall into this trap in the future,” Vespignani says.
What researchers are learning now will help us make smart decisions when the number of infections has dropped off and we begin to lift physical-distancing measures.
This article was originally published on News@Northeastern on April 26, 2020.
Herd Immunity Won’t Come Anytime Soon for Covid-19
A vaccine for SARS-CoV-2, the virus that causes COVID-19, is still more than a year away, but some individuals, and governments, are hoping that life can return to normal once enough of us have had the disease.
But estimates that 70-80 percent of the population are going to be infected are way too high, Sam Scarpino says. “It’s going to be somewhere like 5 to 20 percent, and you’re going to have multiple waves of infections because you’re still going to have a large fraction of the population susceptible.” The difference between these numbers, Scarpino says, originates with some of the simplifications that epidemiological modelers make to estimate how a disease will spread.
This article was originally published on News@Northeastern on April 23, 2020.
Network Scientists Identify 40 New Drugs to Test Against Covid-19
Big news! Northeastern researchers have identified 40 new potential drugs that could treat COVID-19. Albert-László Barabási, Robert Gray Dodge Professor of Network Science and University Distinguished Professor of physics, believes the best drug candidates will probably be those that don’t target the proteins that SARS-CoV-2 initially attacks but work within the same subcellular neighborhood.
This story was originally published on News@Northeastern on April 2, 2020.
‘Social Distancing’ Is Only the First Step Toward Stopping the Covid-19 Pandemic
“We should start thinking how to repurpose industries and places and build labs to do testing. This is what we have to do. There is no other way,” says Alessandro Vespignani, director of the Network Science Institute at Northeastern.
Vespignani believes that wartime efforts will need to be in full effect in order to slow the spread of this virus. This means social distancing for a longer time to slow the disease, and using that time to increase capacity in hospitals and, in turn, capacity in testing. Vespignani says that the virus will likely resurge, and four weeks of social distancing followed by a return to normal is not going to cut it.
This article was originally published on News@Northeastern on March 24, 2020.
The Coronavirus Outbreak Is an International Public Health Emergency. Here’s What You Need to Know.
The World Health Organization on Thursday declared the current coronavirus outbreak a public health emergency of international concern. The virus, designated 2019-nCoV, has infected more than 9,800 people in China and killed more than 200, with roughly 140 cases appearing in at least 18 other countries on four continents.
“We are at a crossroads, in which two things are possible,” says Alessandro Vespignani, who directs the Laboratory for the Modeling of Biological and Socio-technical systems at Northeastern. “Either the screening, detection, and isolation in China will be able to contain the epidemic there, or it will be a global issue. And this will be decided in the next couple of weeks.”
The disease was first detected in Wuhan, China, in December 2019. Last week, Chinese authorities shut down all transportation in Wuhan, quarantining the city of 11 million people. Quarantines of other Chinese cities followed. Russia closed its border with China, and several other countries have suspended travel to the area. On Thursday, the U.S. issued its highest level warning, advising people not to travel to the country.
Vespignani says these efforts are likely slowing the spread of 2019-nCoV to other countries, as scientists rush to learn as much about the new disease as possible.
“It provides some time for the international community to better understand the virus,” says Vespignani, who is the Sternberg Family Distinguished University Professor of physics. “But this is not something that can be done indefinitely within China and internationally.”
Vespignani has been working with an international collaboration of researchers to try to predict the potential spread of the disease. Their prediction map is publicly available and constantly being updated with information coming from China and other countries.
Even in the best-case scenario, he says, researchers expect to see an estimated 50,000 cases in China before 2019-nCoV is contained.
“This is something that is going to stay for a long time, unfortunately,” Vespignani says. “It’s not a matter of today or tomorrow or next week, this is going to stay for a few months and, depending on what happens, even longer.”
But that doesn’t mean that people should be panicking, Vespignani says.
“Those numbers that we get from China are scary, because you talk about tens of thousands of cases, but we’re talking about a country with 1.3 billion people,” Vespignani says. “The incidence is still very small, even there.”
Determining the characteristics of the disease and how far it may spread requires good data. And while China has been openly sharing information with the international community (a shift from the way it handled the SARS epidemic in 2003), learning about a new disease takes time.
“There’s a number of things that we don’t yet understand well enough,” says Samuel Scarpino, an assistant professor in Northeastern’s Network Science Institute. “One of them is the case fatality rate. It’s a very hard number to estimate, especially early on. We know more now about the incubation period, and the amount of time that someone is infectious, but that’s only been in the last few days that we really started to get good, reliable estimates of that.”
Scarpino is working with a group of researchers at Boston Children’s Hospital, HealthMap, Tsinghua University, and other institutions to compile data about individual cases of 2019-nCoV in one place. Their data includes the age of infected individuals, when they began to show symptoms, their travel history, whether they recovered, and other details which are organized into a map.
“The information is coming from all over the place,” Scarpino says. “From news sources, from public health officials, from hospitals; basically we’re pulling from all publicly available information that we can access online.”
Visualizing that data can help researchers understand the spread of a disease and implement strategies to slow or stop it. It’s one aspect of a global effort to control the epidemic.
“One of the things that we’re seeing is a very robust, rapid public health response,” Scarpino says. “We basically didn’t even know what this thing was two weeks ago, and now we have something like 40 genomes that are online and cases that are being collected and analyzed all over the world.”
Scarpino and his colleagues are also trying to use the information in these various genomes to trace the origin of the disease back to its source.
“We’re sure this isn’t going to be the last novel infectious disease outbreak that we deal with, maybe not even this year,” Scarpino says. “Knowing where this thing came from, what things led up to the spread starting, is going to be really important for how we continue to monitor for future outbreaks.”
And what can we do in the meantime? Get a flu shot, Scarpino says. Influenza and pneumonia are the eighth leading cause of death in the U.S.
“That way, you don’t get sick and end up in the ER,” Scarpino says. “If we do have an influx of novel coronavirus cases…. there will probably be novel coronavirus cases there.”
This story was originally published on News@Northeastern on January 30, 2020.
It’s Not Just Your Genes That Are Killing You. Everything Else Is, Too.
In the age-old debate of nature versus nurture, the question is which aspects of our mental and physical traits are written into our genetic code, and which are a product of the environment around us.
When it comes to our health, we tend to focus on genetics, says Albert-László Barabási, Robert Gray Dodge Professor of Network Science and University Distinguished Professor of physics at Northeastern. But environmental factors drive as much as 70 to 80 percent of our risk for various non-communicable diseases, such as heart disease.
To understand how and why people get sick, researchers need to take a deep dive into the molecules around us.
“We are actually exposed to over 20,000 different molecules every time we eat, through the food’s composition,” Barabási says. “And there’s quite a number of other chemicals that we are exposed to through air, as well as simply by contact.”
In a paper to be published on Friday in Science, Barabási and colleagues at Columbia University, Utrecht University, and the University of Luxembourg lay out the case for increased study of all these environmental factors, which researchers call ‘the exposome.’
“Our genes are not our destiny, nor do they provide a complete picture of our risk for disease,” says Gary Miller, a senior author on the paper and professor of environmental health sciences at Columbia. “Our health is also shaped by what we eat and do, our experiences, and where we live and work.”
This includes the pollutants and other chemicals that make it into our bodies through our food, water, and air, as well as those that are products of microbes, inflammation, infections, and stress.
“The exposome concept is trying to capture everything—all the chemicals that we humans are exposed to on a daily basis—to understand which, how, in what quantities, and in what circumstances they have an effect on health,” Barabási says.
Some of these chemicals do very little. Others can alter cell behavior or interact with different molecules to set off a series of reactions, which could be helpful or harmful. But studying them individually won’t give us an accurate picture of how they affect our health.
“We have a very complex chemical world around us,” says Roel Vermeulen, an environmental epidemiologist at Utrecht University and the lead author on the paper. “We have to change the way that we have been looking at these problems, which has been one disease, one chemical at a time. We need to move to much more of a system approach.”
Network science offers a way to map how various molecules connect to our cells and their eventual biological impact, Barabási says. Understanding the interactions between chemicals and their cumulative effects on our health could eventually provide new ways to prevent diseases from developing.
“Being exposed to the exposome, which we are on a daily basis, is the equivalent of taking a huge number of pills,” Barabási says. “Some of these pills don’t enter the bloodstream, so they don’t affect our health. Others, however, do. Distinguishing from the many, many chemicals those that are harmful, and how they are harmful, is the key, because that’s where regulation needs to step in to eliminate those from the environment.”
Barabási’s lab at Northeastern is currently working on a project to catalogue and map all the chemicals in our food. Other researchers are tracking chemicals in the air using wearable technologies or using new analytical tools to evaluate water and soil samples.
“There is no single expertise that could solve this problem alone,” Barabási says. “There are so many routes through which we are exposed to chemicals. Many different communities will have to come together.”
This story was originally published on News@Northeastern on January 23, 2020.