Notes
Chapter 1: What Is Data Science?
(1)
Quote taken from the
call for participation sent out
for the KDD workshop in
1989.
(2)
Some practitioners do
distinguish between data mining and
KDD by viewing data mining as a
subfield of KDD or a particular
approach to
KDD.
(3)
For a recent review of
this debate, see Battle of the Data
Science Venn Diagrams
(Taylor 2016).
(4)
For more on the Cancer
Moonshot Initiative, see
https://www.cancer.gov/research/key-initiatives.
(5)
For more on the All of
Us program in the Precision Medicine
Initiative, see
https://allofus.nih.gov.
(6)
For more on the Police
Data Initiative, see
https://www.policedatainitiative.org.
(7)
For more on AlphaGo, see
https://deepmind.com/research/alphago.
Chapter 2: What Are Data, and What Is a Data Set?
(1)
Although many data sets can
be described as a flat n × m
matrix, in some scenarios the data set is
more complex: for example, if a data set
describes the evolution of multiple
attributes through time, then each time
point in the data set will be represented
by a two-dimensional flat n × m
matrix, listing the state of the
attributes at that point in time, but the
overall data set will be three
dimensional, where time is used to link
the two-dimensional snapshots. In these
contexts, the term tensor is
sometimes used to generalize the
matrix concept to higher
dimensions.
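
As a rough illustration of this structure, the sketch below builds a small three-dimensional data set in plain Python. The sizes (3 time points, 4 instances, 2 attributes), the placeholder values, and the helper names `snapshot` and `shape` are assumptions made for the example only.

```python
# Hypothetical sizes and values, chosen only for illustration.
n_instances, n_attributes, n_time_points = 4, 2, 3


def snapshot(t):
    """Return one flat n x m matrix: the attribute values at time point t."""
    return [[float(t) for _ in range(n_attributes)] for _ in range(n_instances)]


# The overall data set: one two-dimensional snapshot per time point.
data_set = [snapshot(t) for t in range(n_time_points)]


def shape(nested):
    """Dimensions of a regularly nested list, outermost first."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


print(shape(data_set))     # (3, 4, 2): time x instances x attributes
print(shape(data_set[1]))  # (4, 2): the snapshot at the second time point
```

Indexing the outer list by time point recovers an ordinary two-dimensional snapshot, which is the sense in which time links the matrices together.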
(2)
This example is inspired by
an example in Han, Kamber, and Pei
2011.
Chapter 3: A Data Science Ecosystem
(1)
See Storm website, at
http://storm.apache.org.
Chapter 4: Machine Learning Fundamentals
(1)
This subheading,
Correlations Are Not Causations, but
Some Are Useful, is inspired by
George E. P. Box’s (1979)
observation, “Essentially, all models
are wrong, but some are
useful.”
(2)
For a numeric target,
the average is the most common
measure of central tendency, and for
nominal or ordinal data the mode (the
most frequently occurring value) is
the most common measure of central
tendency.
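
A minimal sketch of the two measures, using Python's standard `statistics` module. The target values are hypothetical and serve only to show which measure applies to which kind of data.

```python
from statistics import mean, mode

# Hypothetical target values, for illustration only.
numeric_target = [21.5, 24.0, 30.5, 24.0]
nominal_target = ["yes", "no", "yes", "yes", "no"]

print(mean(numeric_target))  # 25.0
print(mode(nominal_target))  # yes
```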
(3)
We are using a more
complex notation here involving
ω0 and
ω1 because
a few paragraphs later we expand this
function to include more than one
input attribute, so the subscripted
variables are useful notations when
dealing with multiple
inputs.
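
To make the benefit of the subscripts concrete, the block below writes the general shape of such a function for one input and then for n inputs. The symbols x, x_i, and n are assumptions introduced for this illustration and are not the notation of the chapter itself.

```latex
% One input attribute x: an intercept \omega_0 and a single weight \omega_1.
\text{target} = \omega_0 + \omega_1 x

% n input attributes x_1, \dots, x_n: one subscripted weight per input.
\text{target} = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \dots + \omega_n x_n
              = \omega_0 + \sum_{i=1}^{n} \omega_i x_i
```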
(4)
A note of caution: the
numeric values reported here should
be taken as illustrative only and not
interpreted as definitive estimates
of the relationship between BMI and
likelihood of
diabetes.
(5)
In general, neural
networks work best when the inputs
have similar ranges. If there are
large differences in the ranges of
input attributes, the attributes with
the much larger values tend to
dominate the processing of the
network. To avoid this, it is best to
normalize the input attributes so
that they all have similar
ranges.
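
One common way to do this is range normalization, sketched below. The function name `range_normalize`, the target range of 0 to 1, and the attribute values are assumptions for the example; other normalization schemes would serve the same purpose.

```python
def range_normalize(values, low=0.0, high=1.0):
    """Linearly rescale values so that they span [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant attribute has no spread; map every value to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]


# Hypothetical attributes with very different ranges.
age = [20, 35, 50, 80]
salary = [20_000, 50_000, 80_000, 140_000]

print(range_normalize(age))     # [0.0, 0.25, 0.5, 1.0]
print(range_normalize(salary))  # [0.0, 0.25, 0.5, 1.0]
```

After normalization both attributes lie in the same range, so neither dominates simply because its raw values are larger.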
(6)
For the sake of
simplicity, we have not included the
weights on the connections in figures
14 and 15.
(7)
Technically, the
backpropagation algorithm uses the
chain rule from calculus to calculate
the derivative of the error of the
network with respect to each weight
for each neuron in the network, but
for this discussion we will pass over
this distinction between the error
and the derivative of the error for
the sake of clarity in explaining the
essential idea behind the
backpropagation
algorithm.
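
For readers who want to see the chain rule at work, here is a small sketch for the simplest possible case: a single neuron with one weight. The sigmoid activation, the squared-error measure, and all numeric values are assumptions made for the example; a real network repeats this calculation for every weight in every layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def error(w, x, target):
    """Squared error of a single sigmoid neuron with one weight w."""
    return 0.5 * (sigmoid(w * x) - target) ** 2


def error_gradient(w, x, target):
    """Derivative of the error with respect to w, via the chain rule."""
    a = sigmoid(w * x)        # neuron output
    dE_da = a - target        # error with respect to the output
    da_dz = a * (1.0 - a)     # output with respect to the weighted input
    dz_dw = x                 # weighted input with respect to the weight
    return dE_da * da_dz * dz_dw


# Hypothetical weight, input, and target.
w, x, target = 0.5, 2.0, 1.0
analytic = error_gradient(w, x, target)

# Check the chain-rule result against a central finite difference.
h = 1e-6
numeric = (error(w + h, x, target) - error(w - h, x, target)) / (2 * h)
print(abs(analytic - numeric) < 1e-6)
```

The derivative, rather than the raw error, is what tells the algorithm in which direction and by roughly how much to adjust the weight.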
(8)
No agreed minimum number
of hidden layers is required for a
network to be considered “deep,” but
some people would argue that even two
layers are enough to be deep. Many
deep networks have tens of layers,
but some networks can have hundreds
or even thousands of
layers.
(9)
For an accessible
introduction to RNNs and their
use in natural-language processing, see
Kelleher 2016.
(10)
Technically, the
decrease in error estimates is known
as the vanishing-gradient
problem because the
gradient over the error surface
disappears as the algorithm works
back through the
network.
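
The effect can be illustrated with a deliberately simplified sketch: a chain of layers with one sigmoid neuron each, where every weight is 1.0 and every weighted input is 0. These settings, and the function names, are assumptions chosen to keep the arithmetic transparent. Each layer multiplies the gradient by the derivative of the sigmoid, which is never larger than 0.25.

```python
import math


def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backpropagated_factor(n_layers, weight=1.0, z=0.0):
    """Scaling applied to a gradient passed back through n_layers
    identical one-neuron sigmoid layers (a simplifying assumption)."""
    return (weight * sigmoid_derivative(z)) ** n_layers


for n in (1, 5, 10):
    print(n, backpropagated_factor(n))
# 1 0.25
# 5 0.0009765625
# 10 9.5367431640625e-07
```

After ten layers the gradient is under one millionth of its original size, which is why the early layers of a deep network can learn very slowly.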
(11)
The algorithm also
terminates on two corner cases: a
branch ends up with no instances
after the data set is split up, or
all the input attributes have already
been used at nodes between the root
node and the branch. In both cases, a
terminating node is added and is
labeled with the majority value of
the target attribute at the parent
node of the
branch.
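
The two corner cases can be sketched as a small check that a tree-growing routine might run before splitting a branch. The function names and the way the data are passed in are hypothetical, not taken from any particular decision-tree implementation.

```python
from collections import Counter


def majority_value(target_values):
    """Most frequent value of the target attribute."""
    return Counter(target_values).most_common(1)[0][0]


def corner_case_leaf(branch_targets, remaining_attributes, parent_targets):
    """Return the label for a terminating node if a corner case applies,
    otherwise None (the branch can still be split further)."""
    no_instances = not branch_targets
    no_attributes_left = not remaining_attributes
    if no_instances or no_attributes_left:
        # Both cases take the majority target value at the parent node.
        return majority_value(parent_targets)
    return None


# Hypothetical target values at the parent node.
parent = ["yes", "yes", "no"]

print(corner_case_leaf([], ["age"], parent))      # yes (empty branch)
print(corner_case_leaf(["no"], [], parent))       # yes (attributes used up)
print(corner_case_leaf(["no"], ["age"], parent))  # None (keep splitting)
```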
(12)
For an introduction to
entropy and its use in decision-tree
algorithms, see Kelleher, Mac Namee,
and D’Arcy 2015 on information-based
learning.
(13)
See Burt 2017 for an
introduction to the debate on the
“right to
explanation.”
Chapter 5: Standard Data Science Tasks
(1)
A customer-churn case
study in Kelleher, Mac Namee, and
D’Arcy 2015 provides a longer
discussion of the design of
attributes in propensity
models.
Chapter 6: Privacy and Ethics
(1)
Behavioral targeting
uses data from users’ online
activities—sites visited, clicks
made, time spent on a site, and so
on—and predictive modeling to select
the ads shown to the
user.
(2)
The EU Privacy and
Electronic Communications Directive
(2002/58/EC).
(3)
For example, some
expectant women explicitly tell
retailers that they are pregnant by
registering for promotional
new-mother programs at the
stores.
(4)
For more on PredPol, see
http://www.predpol.com.
(5)
A Panopticon is an
eighteenth-century design by Jeremy
Bentham for institutional buildings,
such as prisons and psychiatric
hospitals. The defining
characteristic of a Panopticon was
that the staff could observe the
inmates without the inmates’
knowledge. The underlying idea of
this design was that the inmates were
forced to act as though they were
being watched at all
times.
(6)
As distinct from digital
footprint.
(7)
Civil Rights Act of
1964, Pub. L. 88-352, 78 Stat. 241,
at
https://www.gpo.gov/fdsys/pkg/STATUTE-78/pdf/STATUTE-78-Pg241.pdf.
(8)
Americans with
Disabilities Act of 1990, Pub. L.
101-336, 104 Stat. 327, at
https://www.gpo.gov/fdsys/pkg/STATUTE-104/pdf/STATUTE-104-Pg327.pdf.
(9)
The Fair Information
Practice Principles are available at
https://www.dhs.gov/publication/fair-information-practice-principles-fipps.
(10)
Senate of California,
SB-568 Privacy: Internet: Minors,
Business and Professions Code,
Relating to the Internet,
division 8, chap. 22.1 (commencing
with sec. 22580) (2013), at
https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml?bill_id=201320140SB568.
Chapter 7: The Future Impact of Data Science and Principles of Success
(1)
For more on the
SmartSantander project in Spain, see
http://smartsantander.eu.
(2)
For more on TEPCO’s
projects, see
http://www.tepco.co.jp/en/press/corp-com/release/2015/1254972_6844.html.
(3)
Leo Tolstoy’s book
Anna
Karenina (1877)
begins: “All happy families are
alike; each unhappy family is unhappy
in its own way.” Tolstoy’s idea is
that to be happy, a family must be
successful in a range of areas (love,
finance, health, in-laws), but
failure in any of these areas will
result in unhappiness. So all happy
families are the same because they
are successful in all areas, but
unhappy families can be unhappy for
many different combinations of
reasons.