Free Software Directory talk:Big Data Team


Big data definitions

  • A 2018 definition states: "Big data is where parallel computing tools are needed to handle data." On what counts as parallel computing: "Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors, for accelerating specific tasks." - https://en.wikipedia.org/wiki/Parallel_computing
  • "Volume", "variety", "velocity", and various other "Vs" are added by some organizations to describe it, a revision challenged by some industry authorities.[28] The Vs of big data were often referred to as the "three Vs", "four Vs", and "five Vs". They represented the qualities of big data in volume, variety, velocity, veracity, and value.[4] Variability is often included as an additional quality of big data. - https://en.wikipedia.org/wiki/Big_data#Definition

"The three characteristics of big data are summarized as follows.

● Volume: This feature shows the huge amount of data that can range from terabytes to exabytes. According to a Cisco’s forecast, the data traffic is expected to reach 930 exabytes by 2020, a seven-fold growth from 2017 [26].

● Variety: It refers to the diversity and heterogeneity of big data. For example, big data in healthcare can be produced from healthcare users (i.e. doctors, patients), medical IoT devices, and healthcare organizations. Data can be formatted in text, images, videos with structured or un-structured dataset types [27].

● Velocity: It expresses the data generation rate that can be calculated in time or frequency domain. In fact, in industrial applications like healthcare, data generated from devices is always updated in real-time, which is of significant importance for time-sensitive applications such as health monitoring or diagnosis [28]." - [1]

"We divided these datasets into three categories: small-scale small resolution datasets (CIFAR-10/100, MNIST, and FashionMNIST), small-scale larger resolution (Flowers-102), and medium-scale (ImageNet-1k) datasets" - [2]

Medium scale: "ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total."

Small scale: "The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes." and "To this end we introduce a 103 class flower dataset. We compute four different features for the flowers"


Napkin calculations, assuming roughly 100 kB per image:

Small = Megabytes and below

Medium = Gigabytes

Big = Terabytes and beyond
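
The napkin calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: decimal units (1 GB = 10^9 bytes), cut-offs at 1 GB and 1 TB as one reading of the three categories above, and hypothetical helper names (estimate_bytes, size_category). The image counts come from the quotes above; the CIFAR-10 line uses the raw size of a 32x32 RGB image instead of 100 kB, because those images are far smaller than the napkin figure.

```python
# Napkin-math sketch. Thresholds and helper names are assumptions for
# illustration, not an agreed definition.
KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000


def estimate_bytes(num_images, bytes_per_image=100 * KB):
    """Rough dataset size: image count times an assumed per-image size."""
    return num_images * bytes_per_image


def size_category(total_bytes):
    """Map a byte count onto the Small / Medium / Big buckets."""
    if total_bytes < GB:
        return "Small"   # megabytes and below
    if total_bytes < TB:
        return "Medium"  # gigabytes
    return "Big"         # terabytes and beyond


datasets = {
    # 6000 examples x 10 classes, raw 32x32 pixels x 3 colour channels
    "CIFAR-10 (raw pixels)": estimate_bytes(6000 * 10, 32 * 32 * 3),
    # 3.2 million images at the assumed 100 kB each
    "ImageNet (current state)": estimate_bytes(3_200_000),
    # 80,000 synsets x 500 images, the low end of the stated goal
    "ImageNet (stated goal)": estimate_bytes(80_000 * 500),
}

for name, total in datasets.items():
    print(f"{name}: {total / GB:,.2f} GB -> {size_category(total)}")
```

Under these assumptions CIFAR-10 comes out at about 0.18 GB (Small), the current ImageNet at 320 GB (Medium), and the full ImageNet goal at about 4 TB (Big), which is consistent with the small-scale and medium-scale labels quoted above.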



Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the page “GNU Free Documentation License”.

The copyright and license notices on this page only apply to the text on this page. Any software or copyright-licenses or other similar notices described in this text has its own copyright notice and license, which can usually be found in the distribution or license text itself.