On Data Science

Origin of the Species

Sandeep Nair
Jul 31, 2023

“…the power of a science seems quite generally to increase with the number of symbolic generalizations its practitioners have at their disposal.”
― Thomas S. Kuhn, The Structure of Scientific Revolutions

The winter of 2008 was hard. US GDP was down by 8% YoY¹, and employment was dropping on the order of hundreds of thousands¹ per month. The quants, dubbed the ‘Rocket Scientists of Wall Street’², who used mathematics and computer science to build models, were apparently to blame for the harsh winter.

The winter was less cold for me. For one, I lived in a coastal city in India, 13 degrees north of the equator. Two, I was employed. My title, “Senior Analyst, Analytics”, was quite a mouthful. It did not help that I (mis)pronounced it in a different way each time: uh-nah-li-tiks? anaal-e-tix? It was also hard to explain what I did: use mathematics and computer science to build models that help executives make decisions? That was suspiciously close to being a quant. Luckily, not many cared. My parents were happy that I made a good living³ working for an international firm that did something related to this new thing called the internet.

The recession we were living through (in 2008) came not long after the last one (in 2001), which was caused by the collapse of internet companies. The survivors, Amazon (founded in 1994), eBay (1995), and Google (1998), were joined by a new breed of companies that formed a digital society of sorts: LinkedIn (founded in 2003), Facebook (2004), YouTube (2005), and Twitter (2006).

Against the backdrop of the global financial collapse, some of us got front-row seats to witness a phase transition: an explosion of data!

Before the internet, even in a sophisticated commercial database, a unit of data (think of a row in a spreadsheet) was on the order of a transaction, like a phone call or a purchase. Now, every flick of a screen (a view of a webpage, or a click) was being tracked and stored.

To give an idea of the scale: by 2010, eBay was producing >150 billion new records/day⁴, and each of these records was rich with features about the webpage. In contrast, even if each of AT&T’s 95 million subscribers that year⁵ made 50 calls per day, it would amount to <5 billion boring records.
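The comparison is plain arithmetic, and a few lines of Python make it concrete. This is only a sketch using the figures quoted above; the variable names are mine, and the 150 billion figure is treated as a lower bound.

```python
# Figures quoted in the text; variable names are illustrative only.
att_subscribers = 95_000_000             # AT&T subscribers that year
calls_per_subscriber_per_day = 50        # a deliberately generous assumption
ebay_records_per_day = 150_000_000_000   # lower bound quoted for eBay

# Upper-bound estimate of AT&T's daily call records.
att_records_per_day = att_subscribers * calls_per_subscriber_per_day

# How many times larger eBay's daily volume is, at minimum.
ratio = ebay_records_per_day / att_records_per_day

print(f"AT&T call records per day: {att_records_per_day:,}")
print(f"eBay produced at least {ratio:.1f}x as many records per day")
```

Even with the generous call count, the telecom estimate comes to 4.75 billion records a day, more than thirty times smaller than eBay's volume.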

I worked at eBay on understanding the customer journey. The data was so “large” that 99% of it was thrown away. We wrangled the remaining 1% using computing tricks, yet still waited hours for an answer!

Wrangle: To round up, herd, or take charge of. Note: This image is an accurate representation of a data analyst (later called data scientist) in the 2000s.

The same year (2008), the analytics leads at LinkedIn and Facebook coined the term ‘Data Science’.

Technically, it was a misnomer. How can there be science without data? But this redundancy was perhaps kept to highlight the importance of managing astronomical amounts of data.

The term emerged in response to an urgent demand for new skills:

  1. The ability to store large amounts of data efficiently.
  2. The ability to retrieve data efficiently and wrangle it to answer specific questions like: ‘How many clicks does it take to make a purchase?’, or ‘How many cinnamon buns were sold in Cincinnati on Christmas day?’
  3. The ability to attribute user behaviors to micro-changes like: ‘The hue of a click button’, or ‘A tweak in the weights of a parameter in an algorithm’. Traditional businesses were solving these problems earlier, but the scale of action was much bigger — like a nation-wide TV advertisement, or a big pricing change. The ability to track micro-actions brought about an opportunity to measure the impact they made.
  4. The ability to optimize by leveraging data. The explosion of data brought about the opportunity (i) to use more levers that were unknown earlier, and (ii) to decide how far each lever should be pulled for each customer.

Skills #1 and #2 represented the ‘Data’; #3 and #4 the ‘Science’ respectively.
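To make skill #3 concrete, here is a minimal sketch of how the effect of a micro-change (say, the hue of a button) might be measured with a two-proportion z-test. This is one common textbook approach, not necessarily what any of the companies mentioned here used; the function name and all the numbers are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (hypothetical helper)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Made-up data: 10,000 visitors saw each button hue.
lift, z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000,
                                         conv_b=560, n_b=10_000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.3f}")
```

The point is the scale: with visitors randomly split between two variants, even a fraction-of-a-percent change in behavior can be tested, which was hard to do with a nation-wide TV advertisement.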

Existing academic disciplines lent their tools to this new field:

  • Computer Scientists were already working on Databases for storage (skill #1 above); Algorithms & Computational Complexity for retrieval (#2); and the (then) upcoming field of Machine Learning for optimization (#4).
  • Social scientists worked on Causal Inference (#3) in the policy space on questions like: ‘Does making education free increase GDP?’ The same tools could be used to measure the impact of design, content, and algorithmic changes on a website.
  • Operations Research (O.R.) was known as ‘The Science of Better’⁶. If social scientists worked on understanding how the world worked, O.R. practitioners worked on making it better by reducing wastage (#4). Their expertise in Stochastic Modeling & Simulation, or Revenue Management, could be directly applied to the internet.

While this new hybrid breed was being trained and hired, the immediate demand for labor was met by existing analysts, comprising:

  • Ex-management consultants, who enjoyed problem-solving but not the travel or the scouting for clients. They picked up basic computational skills, or delegated the computation to “technical folks”, and focussed on problem formulation and executive communication.
  • Ex-engineers, who wanted to move closer to business (i.e. where decisions were made) while leveraging their quantitative skills.
  • Experts in O.R., statistics, or computer science — as mentioned above.

A common spirit all these analysts shared was, as Richard Feynman⁷ would put it, the pleasure of finding things out!

The new title was even more of a mouthful than the last one. Two words and five syllables! Introductions at parties became awkward:

‘Oh! You are a scientist?’ ‘Well, not quite!’

Each interaction needed a sheepish, guilty clarification that we optimized ad revenue rather than split particles or decoded the genome. It’s not that optimizing ad revenue was any less noble a cause than the alternatives. It just did not sound sexy. But that’s exactly what the 2012 HBR article⁸, “Data Scientist: The Sexiest Job of the 21st Century”, claimed!

This article firmly secured a place for ‘Data Science’ in history:

Google Trends⁹ for the string “Data Science”; the y-axis is a relative scale.

Soon after, the field saw a surge from another source:

  • Natural scientists. They already knew how to tease out signal with either sparse or huge amounts of data. Now, they just had to pretend that customers were social atoms or cells or chemicals colliding in space.

PhDs found the title more aligned with their identity (maybe that was the catch?) and the remuneration even more so. In 2012, the same year the famous article came out, ‘Insight Data Science’¹⁰, a bootcamp bridging post-docs and data science, was founded.

Total compensation for Data Science by levels at Meta across the US, in the early 2020s. Source: levels.fyi¹¹

‘Data Science’ had arrived.

Like any species, ‘Data Science’ is constantly evolving.

In the mid-2000s, an old order was collapsing. A new order was rising. Developments in technology (data storage systems) and market (internet) fed off each other. These developments demanded a new set of skills (data wrangling, attribution, customization). A hybrid speciation¹² of existing disciplines (computer science, econometrics, operations research) occurred. A new species (data science) emerged. The new species fit the new environment better than its ancestors. It flourished.

The beautiful Heliconius butterflies are products of hybridization

The speciation¹³ will continue. With the demands of a changing environment, some will separate from other members and develop their own unique characteristics.

As of the 2020s, Data Science has become big, both as a brand and as a function. There are signs of elite overproduction¹⁴ resulting in rebranding¹⁵ and specialization¹⁶: Analytics Engineer, Research Scientist, Product Analyst, Machine Learning Engineer, BizOps Analyst, etc.

On the eve of the 789th AI hype cycle¹⁷ and an impending global boiling¹⁸, it’ll be interesting to see how this function evolves and what new functions emerge.

Notes and links:

  1. Charts on the 2008 recession by the Center on Budget and Policy Priorities [Link]
  2. Quants: The Rocket Scientists of Wall Street [Link]
  3. A decade after India’s economic liberalization, most 20-somethings with a decent education, as the foot soldiers of American capitalism in India, were making more money than their parents did on retirement from reputable posts at State owned institutions [Link]
  4. 2014 Computer Weekly article on eBay Data [Link]
  5. AT&T subscriber base [Link]
  6. Operations Research: The Science of Better [Link]
  7. The Meaning of it All, Richard Feynman [Link]
  8. “Data Scientist: The Sexiest Job of the 21st Century” published by HBR in October 2012 [Link]
  9. GoogleTrends for the string “Data Science” [Link]
  10. Insight Data Science [Link]
  11. Compensation of Data Scientists at Levels.FYI [Link]
  12. Hybrid Speciation [Link]
  13. Speciation [Link]
  14. A society that produces too many potential elite members relative to its ability to absorb them can cause social instability [Link]
  15. “What’s in a name?” — Lyft rebrands Data Analyst function as Data Scientist, and Data Scientist function as Research Scientist [Link]
  16. Airbnb’s Data Science role division [Link]
  17. AI Hype Cycle Is Distracting Companies [Link]
  18. UN chief: ‘Era of global boiling has arrived’ [Link]
