Data science is an interdisciplinary field concerned with extracting insights and knowledge from data using advanced techniques and tools. It applies statistical and computational methods to analyse and interpret large, complex datasets. Data science typically involves the following steps:
- Data collection: Gathering large amounts of data from various sources.
- Data cleaning and preprocessing: Cleaning and formatting data to make it ready for analysis.
- Data analysis: Applying statistical and computational techniques to identify patterns and insights in the data.
- Data visualisation: Creating visual representations of the data to communicate insights to others.
- Machine learning: Developing predictive models based on the data to make informed decisions.
Data science has applications in a wide range of fields, including business, healthcare, finance, marketing, and more. It is a rapidly growing field and is becoming increasingly important as organisations seek to leverage data to gain a competitive advantage.
Data collection and data preprocessing address fundamental tooling and engineering aspects of modern data science, while data analysis and machine learning involve the application of its analytical foundations. Data visualisation, however, addresses the fundamental communicative challenge of comprehending massive data sets. People usually comprehend information best when the magnitude of the data is relatively low, for example tens or perhaps hundreds of data points. But when you have data points in the tens or hundreds of thousands, it becomes difficult to understand the nuances in the data, see trends, or even generally “have a feel” for what the data means. Good data visualisation, grounded in sound statistical and design principles, can address this issue.
## Data Visualisation
Data visualisation is the graphical representation of data and information. It involves the use of charts, graphs, maps, and other visual elements to communicate complex data in a clear and understandable way. The primary goal of data visualisation is to help people quickly and easily understand patterns, trends, and relationships in large amounts of data.
Effective data visualisation requires choosing the appropriate type of chart or graph for the data being presented, selecting appropriate colours and visual elements, and designing the visual in a way that is visually appealing and easy to interpret. Data visualisation is an important tool for anyone who needs to communicate information in a way that is both accessible and memorable, such as business analysts, researchers, and data scientists. With the increasing availability of data and the importance of data-driven decision-making, data visualisation has become an increasingly important skill in many industries.
## Using D3.js to build custom visualisations
D3.js (short for Data-Driven Documents) [1] is a JavaScript library for creating dynamic and interactive data visualisations in web browsers. It allows developers to bind data to the Document Object Model (DOM) and apply transformations to the document based on that data.
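To make the data-binding idea concrete, here is a minimal sketch of D3’s data-join pattern. The container ID and the planet values are illustrative assumptions, not taken from the demo described later:

```js
import * as d3 from "d3";

// Illustrative data: planet names with radii in Earth radii.
const planets = [
  { name: "Kepler-452b", radius: 1.6 },
  { name: "Proxima Centauri b", radius: 1.1 },
  { name: "TRAPPIST-1e", radius: 0.9 },
];

// Bind the array to <li> elements inside an assumed <ul id="planet-list">.
// The join creates, updates, or removes elements so the DOM mirrors the data.
d3.select("#planet-list")
  .selectAll("li")
  .data(planets)
  .join("li")
  .text(d => `${d.name}: ${d.radius} Earth radii`);
```

If `planets` later changes, re-running the same join touches only the elements that need to change, which is the core of D3’s declarative approach.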
D3.js provides a wide range of tools and functionalities for working with data, including data-driven DOM manipulation, scalable vector graphics (SVG) for creating graphics and visualisations, and a comprehensive set of layout and formatting options for creating sophisticated data visualisations.
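As a sketch of how scales and SVG fit together, the snippet below maps a handful of illustrative values onto pixel widths and renders them as bars; the container ID and dimensions are assumptions made for the example:

```js
import * as d3 from "d3";

const values = [4, 8, 15, 16, 23, 42]; // illustrative data
const width = 420;
const barHeight = 20;

// A linear scale maps the data domain onto a pixel range.
const x = d3.scaleLinear()
  .domain([0, d3.max(values)])
  .range([0, width]);

// Draw one <rect> per value inside an assumed <svg id="chart"> element.
const svg = d3.select("#chart")
  .attr("width", width)
  .attr("height", barHeight * values.length);

svg.selectAll("rect")
  .data(values)
  .join("rect")
  .attr("y", (d, i) => i * barHeight)
  .attr("width", d => x(d))
  .attr("height", barHeight - 1)
  .attr("fill", "steelblue");
```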
D3.js is widely used in data visualisation and web development communities, and it has a large and active open-source community that continuously contributes to its development and improvement. With its powerful capabilities and flexible architecture, D3.js is a popular choice for building custom data visualisation tools and creating interactive dashboards for data analysis and communication.
In order to experiment with D3.js and demonstrate some of the visualisation techniques enabled by this software, I’ve built a demo website around the narrative of “Discovering the Next Earth”, based on data from the Open Exoplanet Catalogue [2]. This catalogue contains a substantial amount of data and is well suited to establishing techniques for organising, consuming, and comprehending large data sets.
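A hedged sketch of the data-loading step: `d3.csv` fetches and parses a remote CSV file over HTTP. The URL and column names below are placeholders; the real catalogue data lives under its own repository path:

```js
import * as d3 from "d3";

// Placeholder raw-GitHub URL; substitute the actual path of the catalogue
// file being loaded (the real repository layout differs).
const DATA_URL =
  "https://raw.githubusercontent.com/example-user/example-repo/main/exoplanets.csv";

// d3.csv fetches the file and parses each row into an object keyed by column
// name. The row accessor coerces the numeric columns before charting.
d3.csv(DATA_URL, row => ({
  name: row.name,
  mass: +row.mass,      // planetary mass
  radius: +row.radius,  // planetary radius
  period: +row.period,  // orbital period
})).then(planets => {
  console.log(`Loaded ${planets.length} planets`);
  // Hand the parsed rows to the chart-drawing code from here.
});
```

Because the file is fetched at runtime from a public URL, the page itself can be served from any static host, which is the property highlighted in the feature list below.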
Some key features of this project are:
- It loads data directly from its source hosted on GitHub [3], which means the site can be served by static hosting services and doesn’t need to be co-located with its data source.
- It uses a Parallel Coordinates (PC) chart to express several key characteristics across a large number of entities efficiently.
- Data selection is shared across visualisations, allowing the user to explore the data in the PC chart and progressively narrow it down to the bar graph (see the sketch after this list).
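One way to wire up that shared selection is with a dispatcher that the PC chart’s brush notifies and the bar graph listens to. This is a minimal sketch under assumed names: `updateBarGraph`, the `radius` field, and the brush wiring are illustrative, not the demo’s actual code.

```js
import * as d3 from "d3";

// Hypothetical redraw hook for the bar graph; in a real page this would
// rebuild the bars from whichever planets are currently selected.
function updateBarGraph(selectedPlanets) {
  console.log(`Bar graph now showing ${selectedPlanets.length} planets`);
}

// A shared dispatcher lets the PC chart broadcast selection changes
// without knowing which other visualisations are listening.
const dispatcher = d3.dispatch("selectionChanged");
dispatcher.on("selectionChanged.bars", updateBarGraph);

// Called from a d3.brushY handler on one PC axis: filter the data to the
// brushed pixel extent and notify listeners. `planets` and `radiusScale`
// are assumed to exist in the surrounding chart code.
function onBrush(event, planets, radiusScale) {
  if (!event.selection) {
    // Brush cleared: restore the full data set.
    dispatcher.call("selectionChanged", null, planets);
    return;
  }
  const [y0, y1] = event.selection;
  const selected = planets.filter(d => {
    const y = radiusScale(d.radius);
    return y0 <= y && y <= y1;
  });
  dispatcher.call("selectionChanged", null, selected);
}
```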
This demonstration covers a subset of the capabilities of D3.js, but provides a useful starting point for building powerful visualisations.