Mining of Massive Datasets
The popularity of the Web and Internet commerce provides many extremely large datasets from which information can be gleaned by data mining. This book focuses on practical algorithms that have been used to solve key problems in data mining and which can be used on even the largest datasets. It beginPrefaceThis book evolved from material developed over several years by Anand rajaraman and Jeff Ullman for a one-quarter course at Stanford. The courseCS345A, titled"Wcb Mining, was dcsigned as an advanccd graduatc courscalthough it has bccomc accessible and interesting to advanced undergraduatesWhen Jure Leskovec joined the St anford faculty, we reorganized the materiaIconsiderably. Ile introduced a new course C5224 W on network analysis anddded material to (s345A. which was renumbered Cs246. The three authorsalso introduced a large-scale data-Inlining project course, CS341. The book llOwcontains material taught in all three coursesWhat the book is aboutAt the highest level of descriptiOn, this book is about data lining. Howeverit focuses on data mining of very large amounts of data, that is, data so largeit does not fit in main memory. Because of the emphasis on size, many of ourcxamples arc about the Web or data derived from the Web. Further, the booktakes an algorithmic point of view: data mining is about applying algorithmsto data, rather than using data to train"a. machine-learning engine of somesort. The principal topics covered are1. Distributed file systems and map-reduce as a tool for creating parallelalgorithns that succeed Ol very large aMounts of data2. Similarity search, including the key techniques of min hashing and locality-sensitive has hin3. Data-strcam proccssing and specialized algorithms for dcaling with datathat arrives so fast it must be processed immediately or lost4. The technology of search engines, including Google's PageRank, link-spamdetection, and the hubs-and-authorities approach5. Frequent-itelnlset Inlining, including association rules, Inlarket-baskets, theA-Priori Algorithm and its improvements6. Algorithms for clustering very large high-dimensional datasetsPREFACE7. Two key problems for Web applications: managing advertising and rec-ommendation systems8. Algorithms for analyzing and mining the structure of very large graphsespecially social-network graphs9. Techniques for obtaining the important properties of a large dataset bydimensionality reduction, including singular-value decomposition and latent semantic indexing10. Machine-learning algorithms that can be applied to very large data, suchas perceptrons, support-vector Illachines, and gradient descentPrerequisitesTo appreciate fully the matcrial in this book, we recommend the followingprerequisites:1. An introduction to database systems, covering SQL and related programming systems2. A sophoimlore-level course in data structures, algorithMaid discretemath3. A sophomore-level course in software systems, software engineering, andprogramming languagesExercisesThe book contains extensive exercises, with some for almost every section. Weindicate harder exercises or parts of exercises with an exclamation point. Thehardest exercises have a double exclamation pointSupport on the WebGotohttp://www.mmds.orgforslides,homeworkassignmentsprojcctrcquirc-ments, and exams from courses related to this bookGradiance Automated homeworkThcrc arc automated cxcrciscs bascd on this book. using thc gradiancc root-questiontechnologyavailableatwww.gradiance.com/services.Studentsmayenter a public class by creating an account at that site and entering the classwith code 1EDD8A1D. Instructors may use the site by making an account therePREFACEand then emailing support at gradiance dot com with their login name, thename of their school, and a request to use the MMDs materialsAcknowledgementsCover art is by scott UllmanWe would like to thank foto Afrati. Arun marathe and rok Sosic for criticalreadings of a draft of this manuscriptErrors were also reported by Rajiv Abraham, Apoorv Agarwal, Aris Anag-nostopoulos, Atilla Soner Balkir, Arnaud Belletoile, Robin Bennett, Susan Biancani, Amitabh Chaudhary, Lcland Chen, Hua Fcng, Marcus Gcmcindcr, Anastasios Gounaris. Clark Grubb, Shrey Gupta, Waleed Hameid, Saman Harati-zadeh, Przemyslaw Horba.n, Jeff Hwang, Rafi Kamal, Lachlan Kang, Fd KnorrIlaewoon Kwak, Ellis Lau, Greg lee, David Z. Liu, Ethan Lozano, Yunan LuoMichael Mahoney, Justin Mever, bryant Moscon. Brad Penoff, John phillipsPhilips Kokoh Prasetyo, Qi Ge, Harizo Rajaola, Timon Ruball, Rich SeiterHitesh Shetty, Angad Singh, Sandeep Sripada, Dennis Sidharta, Krzysztof stencel, Mark Storus, Roshan Sumbaly, Zack Taylor, Tim Triche Jr.. Wang binWeng Zhen-Bin, Robert West, Oscar Wu. Xie Ke, Nicolas Zhao, and ZhouJingho, The remaining errors are ours, of courseARJ.D. UPalo alto. CAMarch, 2014PREFACEContents1 Data mining1.1 What is Data mining?1.1.1 Statistical Modeling1.1.2 Machinc Learning1.1.3 Computational Approaches to Modeling1.1. 4 Summarization1.1.5 Feature Extract1. 2 Statistical Limits on Data Mining1.2.1 Total Information Awarcncss1.2.2 Bonferronis Princip23445561.2.3 An Example of Bonferronis Principle1.2. 4 Exercises for Section 1.21.3 Things Useful to Kn1.3.1 Importance of Words in Documents71.2Ⅱ ash functions1.3.3Ind101.3.4 Secondary Storag111.3.5 Thc Basc of Natural Logarithms121.3.6 Power law131.3.7 Exercises for Section 1.1.4 Outline of the book151.5 Summary of Chapter 1171.6 Rcfcrcnccs for Chaptcr 12 Mapreduce and the New Software Stack212.1 Distributcd Filc Systcms2.1.1 Physical Organization of Compute Nodes2.1.2 Large-Scale File-System Organization232.2 MapReduce242.2. 1 The Map tasks2.2.2 Grouping by Kcy2.2.3 The Reduce Tasks2.2. 4 Combiners27CONTENTS2. 2.5 Details of MapReduce Execution5)2.2.6C2.2.7 Exercises for Section 2.22.3 Algorithms Using Mapreduce302.3.1 Matrix-Vcctor Multiplication by Mapreduce312.3.2 If the Vector v Cannot Fit in Main Memory2.3.3 Relational-Algebra operations2.3.4 Colllputing Selections by Mapreduce2.3.5 Cuinlputing Projections by Mapreduce362.3.6 Cnion, Intersection, and difference by Mapreduce362.3.7 Computing Natural Join by Mapreduce372.3.8 Grouping and Aggregation by MapReduce72.3.9 Matrix Multiplication2.3.10 Matrix Multiplication with One MapReduce Step..... 392.3. 11 Exercises for Section 2.3402.4 Extensions to Mapreduce2.4.1 Workflow Syster2.4. 2 Recursive Extensions to Mapreduce2.4.3 Pregel2.1.1 Exercises for Section 2.4.462.5 The Communication Cost Mode462.5.1 Communication-Cost for Task Nctworks472.5.2 Wall-Clock Time492.5.3 Multiway joins92.5.4 Exercises for Section 2.5,,,522.6 Complexity Theory for Mapreduce2.6.1 Reducer Size and replication Rate542.6.2 An Example: Similarity Joins2.6.3 A Graph Model for MapReduce problems572.6. 4 Mapping Schemas2.6.5 When Not All Inputs Are Present602.6.6 Lower Bounds on Replication Rate616.7 Casc Study: Matrix Multiplication2. 6. 8 Exercises for Section 2.6,,662.7 Summary of Chapter 22.8 References for Chapter 2B Finding similar items3.1 Applications of Near-Neighbor Search3.1.1 Jaccard Similarity of Set743.1.2 Similarity of Documents743.1.3 Collaborative Filtering as a Similar-Sets Problem753.1.4 Exercises for section 3.13.2 Shingling of Documents3.2.1k-ShiIngles77CONTENTS3. 2.2 Choosing the Shingle Size3.2.3 Lashing Shingles3.2. 4 Shingles built from Words3.2.5 Exercises for Section 3.2803.3 Similarity-Preserving Summaries of Sets803.3. 1 Matrix Representation of Sets813.3.2 Minhashing3.3.3 Minhashing and Jaccard Similarity3.3.4 Minhash Signatures833.3.5 Computing Minhash Signatures3.3.6 Exercises for Section 3.3863.4 Locality-Sensitive Hashing for Documents873.4.1 ISH for Minhash Signatures883.4.2 Analysis of the Banding Technique3.4.3 Combining the Techniques913.4.4 Exercises for Section 3.4913.5 Distance Measures3.5.1 Dcfinition of a Distance Mcasurc3.5.2F1Dist3.5.3 Jaccard Distance943.5.4 Cosine distance933.5.5 Edit dista3.5.6 Hamming Distance3.5.7 Excrciscs for Scction 3.5.973.6 The Theory of Localit y-Sensitive Functions993.6.1 Locality-Sensitive Functions993.6.2 Locality-Sensitive Fainiilies for Jaccard Distance1003.6. 3 AMplifying a Locality-Sensitive Anil1013.6.4 Exercises for Section 3.61033.7 LSH Familics for Other Distance Mcasurcs1043.7.1 LSH Families for Hamming Distance1043.7.2 Random llyperplanes and the Cosine Distance1053.7.3 Sketches1063.7.4 LSH FaIriilies for Euclidean distance1073.7.5 More LSH Families for Euclidean Spaces3.7.6 Excrciscs for Scction 3.71093.8 Applications of Locality-Sensitive Hashin1103.8. 1 Entity Resolution1103.8.2 An Entity-Resolution Example1113.8.3 Validating Record Matches1123.8.41 Matching Fingerprints1133.8.5 A LSH Family for Fingerprint Matching1143. 8. 6 Similar news articles,,,1153. 7 Exercises for Section 3.81173.9 Methods for High Degrees of Similarity118CONTENTS3.9.1 Finding ldentical Items1183.9.2 Representing Sets as Strings3.9.3 Length-Based Filtering1193.9.1 Prefix Indexing1193.9.5 Using Position Information.1213.9.6 Csing Position and Length in Indexes3.9.7 Exercises for Section 3g1253.10 Suinllary of Chapter 31263.11 References for Chapter 31284 Mining Data Streams1314.1 The Stream Data Model1314.1.1 A Data-Stream-Management System4.1.2 Examples of Strcam S4.1.3 Strcam Qucrics1344.1.4 Issues in Stream Processing134.2 Sampling Data in a Stream1364.2.1 A Motivating Example1364.2.2 Obtaining a Representative Sample1374.2. 3 Thc Gcncral Sampling Problcm37134.2.4 Varying the Sample Size1384.2.5 Exercises for section 4.24.3 Filtering streams1394.3.1 A Motivating e1394.3.2 The Bloom Filter4.3.3 Analysis of Bloom Filtering1404.3. 4 Exercises for section 4.31414.4 Counting Distinct Elements in a Stream1424.4.1 The Coult-Distinct Probler1424.4.2 The Flajolet-Martin Algorithm1434.4.3 Combining estimates1444.4.4 Space requirements1444.4.5 Exercises for Section 4.41454.5 Estimating Moments144.5. 1 Definition of moments1454.5.2 The Alon- Matias-Szegedy Algorithm for SecondMoments.1464.5.3 Why the Alon-Matias-Szegedy Algorithm Works1474.5.4 Iligher-Order Moments1484.5.5 Dealing With Infinite Streams4.5.6 Exercises for SectioN 4.51494.6 Counting ones in a window.1504.6.1 The cost of exact counts1514.6.2 The Da.tar-Gionis-Indyk-Motwani Algorit hm4.6.3 Storage Requirements for the dGIM algorithm153
下载地址
用户评论