Notes

Chapter 1: What Is Data Science?

(1)

Quote taken from the call for participation sent out for the KDD workshop in 1989.

(2)

Some practitioners do distinguish between data mining and KDD by viewing data mining as a subfield of KDD or a particular approach to KDD.

(3)

For a recent review of this debate, see Battle of the Data Science Venn Diagrams (Taylor 2016).

(4)

For more on the Cancer Moonshot Initiative, see https://www.cancer.gov/research/key-initiatives.

(5)

For more on the All of Us program in the Precision Medicine Initiative, see https://allofus.nih.gov.

(6)

For more on the Police Data Initiative, see https://www.policedatainitiative.org.

(7)

For more on AlphaGo, see https://deepmind.com/research/alphago.

Chapter 2: What Are Data, and What Is a Data Set?

(1)

Although many data sets can be described as a flat n × m matrix, in some scenarios the data set is more complex: for example, if a data set describes the evolution of multiple attributes through time, then each time point in the data set will be represented by a two-dimensional flat n × m matrix, listing the state of the attributes at that point in time, but the overall data set will be three dimensional, where time is used to link the two-dimensional snapshots. In these contexts, the term tensor is sometimes used to generalize the matrix concept to higher dimensions.
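The three-dimensional structure described in this note can be sketched in a few lines of Python. This is only an illustration: the values, the sizes (two time points, three instances, two attributes), and the helper name shape are assumptions made for the example and do not come from the text.

```python
# Hypothetical data: 2 time points, each an n x m snapshot
# (n = 3 instances as rows, m = 2 attributes as columns).
snapshots = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # time point 0
    [[1.5, 2.5], [3.5, 4.5], [5.5, 6.5]],  # time point 1
]


def shape(nested):
    """Return the dimensions of a regularly nested, non-empty list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


tensor_shape = shape(snapshots)       # (time points, n, m)
snapshot_shape = shape(snapshots[0])  # (n, m): one flat two-dimensional matrix

# Time links the snapshots: follow attribute 0 of instance 1 through time.
history = [snapshot[1][0] for snapshot in snapshots]
```

Each element of snapshots is an ordinary n × m matrix; the outer list supplies the third (time) dimension.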

(2)

This example is inspired by an example in Han, Kamber, and Pei 2011.

Chapter 3: The Data Science Ecosystem

(1)

See Storm website, at http://storm.apache.org.

Chapter 4: Machine Learning Basics

(1)

This subheading, Correlations Are Not Causations, but Some Are Useful, is inspired by George E. P. Box’s (1979) observation, “Essentially, all models are wrong, but some are useful.”

(2)

For a numeric target, the most common measure of central tendency is the average; for nominal or ordinal data, it is the mode (the most frequently occurring value).
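As a small illustration of these two measures, the following sketch uses Python's standard statistics module. The target values are invented for the example.

```python
from statistics import mean, mode

# Hypothetical target values, invented for illustration.
numeric_target = [20.0, 25.0, 30.0, 25.0]
nominal_target = ["yes", "no", "yes", "yes", "no"]

# Numeric target: the average (arithmetic mean).
average_value = mean(numeric_target)

# Nominal or ordinal target: the mode (most frequently occurring value).
most_frequent_value = mode(nominal_target)
```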

(3)

We are using a more complex notation here involving ω₀ and ω₁ because a few paragraphs later we expand this function to include more than one input attribute, so the subscripted variables are useful notations when dealing with multiple inputs.

(4)

A note of caution: the numeric values reported here should be taken as illustrative only and not interpreted as definitive estimates of the relationship between BMI and likelihood of diabetes.

(5)

In general, neural networks work best when the inputs have similar ranges. If there are large differences in the ranges of input attributes, the attributes with the much larger values tend to dominate the processing of the network. To avoid this, it is best to normalize the input attributes so that they all have similar ranges.
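One common way to give attributes similar ranges is range (min-max) normalization. The sketch below is a minimal example under that assumption; the note does not say which normalization technique is intended, and the function name range_normalize and the sample values are hypothetical.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Rescale values linearly so the smallest maps to low and the largest to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant attribute carries no range information; map it to low.
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]


# Hypothetical attributes with very different ranges.
ages = [20, 30, 40, 60]
incomes = [20000, 50000, 80000, 120000]

normalized_ages = range_normalize(ages)
normalized_incomes = range_normalize(incomes)
```

After normalization both attributes lie in the interval [0, 1], so neither dominates simply because its raw values are larger.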

(6)

For the sake of simplicity, we have not included the weights on the connections in figures 14 and 15.

(7)

Technically, the backpropagation algorithm uses the chain rule from calculus to calculate the derivative of the error of the network with respect to each weight for each neuron in the network. For this discussion, however, we pass over the distinction between the error and the derivative of the error, for the sake of clarity in explaining the essential idea behind the backpropagation algorithm.

(8)

No agreed minimum number of hidden layers is required for a network to be considered “deep,” but some people would argue that even two layers are enough to be deep. Many deep networks have tens of layers, but some networks can have hundreds or even thousands of layers.

(9)

For an accessible introduction to RNNs and their use in natural-language processing, see Kelleher 2016.

(10)

Technically, the decrease in error estimates is known as the vanishing-gradient problem because the gradient over the error surface disappears as the algorithm works back through the network.
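A rough numeric sketch can show why the gradient shrinks as it is passed back through many layers. It assumes logistic (sigmoid) activation functions, whose derivative is at most 0.25, and connection weights of magnitude 1. Neither assumption comes from the text, and the function name is hypothetical.

```python
def backpropagated_scale(num_layers, local_derivative=0.25):
    """Upper bound on how much an error gradient is scaled after passing
    back through num_layers layers, assuming each layer multiplies it by
    local_derivative (0.25 is the maximum slope of the logistic function)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= local_derivative
    return scale


# The scaling factor for networks of increasing depth.
scales = {depth: backpropagated_scale(depth) for depth in (1, 5, 10)}
```

Under these assumptions, the gradient reaching the earliest layer of a ten-layer network is less than one millionth of the gradient at the output layer.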

(11)

The algorithm also terminates on two corner cases: a branch ends up with no instances after the data set is split up, or all the input attributes have already been used at nodes between the root node and the branch. In both cases, a terminating node is added and is labeled with the majority value of the target attribute at the parent node of the branch.
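The two corner cases in this note can be sketched as a single check. The function and parameter names below are hypothetical, and the sketch covers only these two termination conditions, not the full tree-building algorithm.

```python
from collections import Counter


def majority_value(target_values):
    """Return the most frequent target value in a non-empty list."""
    return Counter(target_values).most_common(1)[0][0]


def corner_case_leaf(branch_targets, unused_attributes, parent_targets):
    """Return the label for a terminating node if a corner case applies,
    otherwise None (meaning the branch can still be split).

    Corner case 1: the branch received no instances after the split.
    Corner case 2: every input attribute has already been used on the path.
    In both cases the label is the majority target value at the parent node.
    """
    if not branch_targets or not unused_attributes:
        return majority_value(parent_targets)
    return None


# Hypothetical target values at the parent node.
parent = ["spam", "spam", "ham"]
empty_branch_label = corner_case_leaf([], ["length"], parent)
no_attributes_label = corner_case_leaf(["ham", "spam"], [], parent)
still_splittable = corner_case_leaf(["ham", "spam"], ["length"], parent)
```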

(12)

For an introduction to entropy and its use in decision-tree algorithms, see Kelleher, Mac Namee, and D’Arcy 2015 on information-based learning.

(13)

See Burt 2017 for an introduction to the debate on the “right to explanation.”

Chapter 5: Standard Data Science Tasks

(1)

A customer-churn case study in Kelleher, Mac Namee, and D’Arcy 2015 provides a longer discussion of the design of attributes in propensity models.

Chapter 6: Privacy and Ethics

(1)

Behavioral targeting uses data from users’ online activities—sites visited, clicks made, time spent on a site, and so on—and predictive modeling to select the ads shown to the user.

(2)

The EU Privacy and Electronic Communications Directive (2002/58/EC).

(3)

For example, some expectant women explicitly tell retailers that they are pregnant by registering for promotional new-mother programs at the stores.

(4)

For more on PredPol, see http://www.predpol.com.

(5)

A Panopticon is an eighteenth-century design by Jeremy Bentham for institutional buildings, such as prisons and psychiatric hospitals. The defining characteristic of a Panopticon was that the staff could observe the inmates without the inmates’ knowledge. The underlying idea of this design was that the inmates were forced to act as though they were being watched at all times.

(6)

As distinct from digital footprint.

(7)

Civil Rights Act of 1964, Pub. L. 88-352, 78 Stat. 241, at https://www.gpo.gov/fdsys/pkg/STATUTE-78/pdf/STATUTE-78-Pg241.pdf.

(8)

Americans with Disabilities Act of 1990, Pub. L. 101-336, 104 Stat. 327, at https://www.gpo.gov/fdsys/pkg/STATUTE-104/pdf/STATUTE-104-Pg327.pdf.

(9)

The Fair Information Practice Principles are available at https://www.dhs.gov/publication/fair-information-practice-principles-fipps.

(10)

Senate of California, SB-568 Privacy: Internet: Minors, Business and Professions Code, Relating to the Internet, vol. division 8, chap. 22.1 (commencing with sec. 22580) (2013), at https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=201320140SB568.

Chapter 7: The Future Impact of Data Science and Principles of Success

(1)

For more on the SmartSantander project in Spain, see http://smartsantander.eu.

(2)

For more on TEPCO’s projects, see http://www.tepco.co.jp/en/press/corp-com/release/2015/1254972_6844.html.

(3)

Leo Tolstoy’s book Anna Karenina (1877) begins: “All happy families are alike; each unhappy family is unhappy in its own way.” Tolstoy’s idea is that to be happy, a family must be successful in a range of areas (love, finance, health, in-laws), but failure in any of these areas will result in unhappiness. So all happy families are the same because they are successful in all areas, but unhappy families can be unhappy for many different combinations of reasons.