DATA 525 Data Engineering and Mining

(a course using both practical software development and configuration)
Software/Tools: MySQL, Oracle, Perl, PHP, SQL, W3Schools, MySQL Workbench, Web (MySQL), Emacs, Linux
Data Mining & Retrieval: Machine Learning, Data Mining, scikit-learn, Kardi Teknomo, ANN, Information Retrieval I, Information Retrieval II, Search Engines, Text Analysis, PageRank
Google APIs: Firebase, TensorFlow, G4G TF, G4G Firebase, W3S TF
General Information: Discord, EE/CS Wiki, EITS, UND help, Stack Overflow


Syllabus: Fall 2024   Credit hours: 3
Class times: 01:25pm – 02:15pm, MoWeFr Classroom: Harrington Hall 218
Class # (on-campus: 525-01): 21780 Class # (on-line: 525-02): 21781

Instructor: Wen-Chen Hu   (my teaching philosophy) Office: Upson II 366K
Zoom: https://und.zoom.us/j/2489867333 Email: wenchen@cs.und.edu
Office hours: 02:30pm – 04:30pm, MoWeFr

Prerequisites:
  • DATA 511 Computing for Data Science I,
  • DATA 512 Computing for Data Science II, and
  • DATA 513 Mathematics for Data Science, or
  • Permission of the School of Electrical Engineering and Computer Science

Synchronous class delivery: Lectures are delivered synchronously via Zoom (https://und.zoom.us/j/2489867333), and the recordings are posted on Blackboard afterwards, so students can watch them at any time.

Lecture notes: No textbook will be used. Instead, award-winning, interactive, informative, and practical lecture notes (based on books, papers, online documents, and user manuals) and detailed, precise class instructions will be provided. Together, the lecture notes and instructions amount to a small book: they supply much more information than regular notes do and make the subjects much easier to study. Students should have no problem learning the subjects or taking the exams after studying them and doing the programming exercises.



Grading:


Announcements:



Tentative Schedule:


Week  Class                Topic                                                  Due    Where
0                          0. Computer Career and Data Research & Technologies
                             0.1 A computer career
                             0.2 Data research
                             0.3 Data technologies
1     08/28, 08/30         1. Introduction to DATA 525
                             1.1 Course introduction
                             1.2 Data life cycle
                             1.3 Topics covered
2     09/04, 09/06         2. Programming Exercise I
                             2.1 Specifications
                             2.2 Web page download
                             2.3 Code sample
      09/04                Last day to add a course or drop without record (100% refund)
                           Last day to add audit or change to/from audit
                           Last day to receive a refund on a dropped class
                           Drops after the last day to add will appear on a transcript.
      09/02                Holiday, Labor Day (Monday), no classes
3     09/09, 09/11, 09/13  3. Essential Technologies for Exercise Construction
                             3.1 Essential software and tools
                             3.2 Using Linux
                             3.3 Writing HTML scripts
4     09/16, 09/18, 09/20  4. PHP (HyperText Preprocessor)
                             4.1 LAMP
                             4.2 PHP
                             4.3 MySQL
5     09/23, 09/25, 09/27  5. Web Search Services
                             5.1 The World Wide Web
                             5.2 Web page information
                             5.3 Web search methods
6     09/30, 10/02, 10/04  6. Information Retrieval (IR)
                             6.1 Various IR methods
                             6.2 Automatic indexing methods
                             6.3 Data classification and clustering               EX I
7     10/07, 10/11         7. The PageRank Algorithm
                             7.1 Background
                             7.2 The PageRank algorithm
                             7.3 Computing PageRank scores
      10/09                Exam I (for both on-campus and on-line students; 6:30pm – 8:30pm, Wednesday)
8     10/14, 10/16, 10/18  8. Firebase Database
                             8.1 Programming Exercise II
                             8.2 Introduction to Firebase
                             8.3 Using Firebase
9     10/21, 10/23, 10/25  9. TensorFlow
                             9.1 TFJS operations
                             9.2 TFJS models
                             9.3 TFJS visor
10    10/28, 10/30, 11/01  10. A TensorFlow.js Example
                             10.1 Example introduction
                             10.2 Example model
                             10.3 Example training
11    11/04, 11/06, 11/08  11. JavaScript
                             11.1 JavaScript syntax
                             11.2 JavaScript instructions
                             11.3 JavaScript examples
12    11/13, 11/15         12. Decision Trees
                             12.1 Background
                             12.2 Measuring impurity
                             12.3 Information gain
      11/15                Last day to change to or from S/U grading
                           Last day to change to or from audit grading
                           Last day to drop a full-term course or withdraw from school
      11/11                Holiday, Veterans Day (Monday), no classes
13    11/18, 11/22         13. k-Nearest Neighbors (kNN) Algorithm
                             13.1 Background
                             13.2 kNN for prediction and smoothing
                             13.3 Strengths and weaknesses
      11/20                Exam II (for both on-campus and on-line students; 6:30pm – 8:30pm, Wednesday)
14    11/25                14. Artificial Neural Networks (ANNs)
                             14.1 Artificial intelligence
                             14.2 Backpropagation
                             14.3 Genann: a minimal ANN
      11/27, 11/28, 11/29  Holidays, Thanksgiving Break (WeThFr), no classes
15    12/02, 12/04, 12/06  15. Data Processing and Mining
                             15.1 Data science
                             15.2 Data warehouse
                             15.3 Data fusion
16    12/09, 12/11         16. Data Mining Concepts
                             16.1 Introduction to data mining
                             16.2 Data mining steps
                             16.3 Data mining techniques                          EX II
17    12/18                Final exam (for both on-campus and on-line students; 06:30pm – 08:30pm, Wednesday)
18    12/24                Grades posted before noon, Tuesday


According to U.S. News, the best tech jobs of 2024 are as follows:
  1. Software developer (median salary: $127,260)
  2. IT manager (not developer; median salary: $164,070)
  3. Information security analyst (not developer; median salary: $112,000)
  4. Data scientist (median salary: $103,500)
  5. Web developer (median salary: $78,580)
  6. Computer systems analyst (not developer; median salary: $102,240)
  7. Computer network architect (not developer; median salary: $126,900)
  8. Database administrator (including developing; median salary: $99,890)
  9. Computer support specialist (not developer; median salary: $57,890)
  10. Computer systems administrator (not developer; median salary: $90,520)
  11. Computer programmer (median salary: $97,800)


Computer science is different from many other disciplines (like electrical engineering). It is more like a professional school (such as a culinary school), emphasizing practical work over subject studies, because many IT companies want new recruits to start contributing immediately. There are three kinds of computing personnel:
  • Developers:

    • Positions (plenty): Developers of front-end and back-end web pages, mobile apps, and all kinds of software
    • Skills (more stable): Programming languages (such as C++ and Java), web programming, mobile app development, data processing and mining including databases, and data structures & algorithms

  • Practitioners:

    • Positions (not many): Experienced personnel like data scientists, database or system administrators, security analysts, and network architects (more applications & configuration and less development)
    • Skills (based on the needs of companies): Databases, data warehousing, data lakes, Hadoop, MapReduce, Linux, SPSS, SAS, Cognos, MATLAB, Tableau, etc.

  • Researchers:

    • Industrial positions (few and based on the needs of corporations): High quality personnel required for the advanced areas like artificial intelligence, security, computer vision, autonomous driving, and speech recognition
    • Academic positions/trends (few and changing with government policies): ❓ ⇐ artificial intelligence ⇐ big data ⇐ high-performance computing ⇐ security ⇐ (mobile) networks

Unless you have an impressive resume or a strong connection, practicing tens or hundreds of the questions posted on LeetCode is a must for securing a job at corporations like Google and Facebook. Otherwise, your chance of answering the questions correctly is low because of their high difficulty and tight time constraints. In addition, you need to create a LinkedIn page to show your achievements, and you may consider uploading your projects to GitHub to showcase them.



Remark I: Terminology and definitions will be discussed only minimally in this course. Instead, effective methods and practical work will be emphasized and enforced.

Remark II: Unlike disciplines such as databases or the World Wide Web, data engineering and mining (DEM) is one of the disciplines (like image processing or artificial intelligence) without a coherent set of methods or algorithms. DEM uses many methods (such as artificial neural networks or relevance feedback), and each method is usually not closely related to the others (like decision trees or sequential pattern mining).

Remark III: A wide variety of methods have been used by DEM, and the current methods are rather complicated. To show what data engineering and mining (DEM) is within one semester, this course has to investigate a small number of fundamental topics instead of many advanced ones. Students can then use this training to adapt the appropriate methods to the problems they encounter in the future.

Remark IV: Take the following steps to conduct research:

  1. Identify a problem.
  2. Study related literature and methods.
  3. Create/adapt a method to solve/suit the problem.
  4. Figure out how to improve the method.
  5. Complete the implementation.
  6. Perform the testing to ensure the system is correct.
  7. Evaluate the system including comparisons.
  8. Publish the results.


Instructor’s qualifications: The instructor’s current research includes mobile computing and information retrieval. He has applied various information retrieval methods (such as artificial neural networks, finite-state machines, and association-rule and sequential-pattern mining) to mobile applications and web searches. The instructor has more than 100 research publications and has advised more than 50 graduate students. Most of the research topics are related to (mobile) data engineering and mining.


University of North Dakota Course Description (DATA 525) —
This course studies theoretical and applied issues related to data engineering and mining. Data engineering is to identify, investigate, and analyze the underlying principles in the design and effective use of information systems; and data mining is to discover patterns in large data sets and transform the patterns into a comprehensible structure for further applications. The following topics are covered: data collection, data preparation, data indexing and storage, data processing and analysis, data classification and clustering, knowledge discovery, information retrieval, data visualization, data sharing, data applications, and some other special topics.
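
As a rough illustration of the data classification topic named above, the sketch below computes entropy and information gain, the impurity measures that the tentative schedule lists under decision trees. It is not taken from the course materials; the function names, the toy rows, and the attribute names (`has_keyword`, `length`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: is a web page relevant to a focused topic?
rows = [
    {"has_keyword": "yes", "length": "long"},
    {"has_keyword": "yes", "length": "short"},
    {"has_keyword": "no", "length": "long"},
    {"has_keyword": "no", "length": "short"},
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

gain_keyword = information_gain(rows, labels, "has_keyword")
gain_length = information_gain(rows, labels, "length")
print(gain_keyword, gain_length)
```

In this toy data set the gain for `has_keyword` is larger than the gain for `length`, so a decision tree built on information gain would split on `has_keyword` first.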

Data Science from Wikipedia
Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, information science, and computer science, in particular from the subdomains of machine learning, classification, cluster analysis, data mining, databases, and visualization.

Data Engineering from IEEE Computer Society Data Engineering Bulletin
The role of data in the design, development, management and utilization of information systems:

  • Databases and the World Wide Web,
  • Management of semistructured data, metadata and XML,
  • Heterogeneous, distributed, parallel and mobile databases,
  • Data warehousing and OLAP,
  • Data, text and web mining,
  • Optimization of query processing and database architectures,
  • Indexing, access methods and data structures,
  • Temporal, spatial, scientific, statistical, biological databases, and
  • Security and integrity control.

Data Engineering from Data & Knowledge Engineering
Data engineering is to identify, investigate and analyze the underlying principles in the design and effective use of database systems:

  • Representation and manipulation of data,
  • Architectures of database systems,
  • Construction of databases,
  • Applications, case studies, and management issues, and
  • Tools for specifying and developing databases using tools based on linguistics or human machine interface principles.

Data Management from Wikipedia
Data management comprises all the disciplines related to managing data as a valuable resource:

  • Data governance,
  • Data architecture, analysis and design,
  • Database management,
  • Data security management,
  • Data quality management,
  • Reference and master data management,
  • Data warehousing and business intelligence management,
  • Data, text and web mining,
  • Optimization of query processing and database architectures,
  • Indexing, access methods and data structures,
  • Temporal, spatial, scientific, statistical, biological databases, and
  • Security and integrity control.

Each student is required to build the following two systems:
  • a focused web search engine based on a data life cycle and
  • a data mining system using Firebase and TensorFlow.
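
To give a feel for the first system, here is a minimal sketch of the indexing and retrieval steps of a small search engine, using an inverted index with AND-style matching. It is only an assumption about one possible design, not the course's specification: the function names (`tokenize`, `build_index`, `search`), the example.org URLs, and the page texts are all hypothetical, and the course exercises themselves may use other tools.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


# Hypothetical downloaded pages (URL -> page text).
pages = {
    "https://example.org/mining": "Data mining discovers patterns in large data sets.",
    "https://example.org/engineering": "Data engineering studies the design of information systems.",
    "https://example.org/retrieval": "Information retrieval finds relevant documents.",
}

index = build_index(pages)
print(search(index, "data mining"))
print(search(index, "data"))
```

A real focused search engine would add the earlier steps of the data life cycle (page download and cleaning) and a ranking step such as PageRank; this sketch covers only indexing and exact-term lookup.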




An Internet-Enabled and Mobile Database Course Sequence —
This is part of an Internet/mobile-enabled database course sequence offered by me:
CSCI 260 .NET and World Wide Web Programming

CSCI 457 Electronic and Mobile Commerce Systems

DATA 520 Databases

CSCI 513 Advanced Database Systems

CSCI 515 Data Engineering and Mining

The following platforms, software, and tools used in these courses can greatly help students land a decent job:
  • CSCI 260 (.NET and World Wide Web Programming) to build database-driven websites by using

    • Microsoft Access database,
    • Microsoft ASP.NET,
    • Microsoft C# or Visual Basic,
    • Microsoft .NET, and
    • Microsoft Visual Studio.

  • CSCI 457 (Electronic and Mobile Commerce Systems) to build electronic and mobile commerce systems by using

    • Android programming,
    • Android-server-database connection,
    • (L) Linux operating system,
    • (A) Apache web server,
    • (M) MySQL database, and
    • (P) PHP.

  • DATA 520 (Databases) to build Internet/mobile-enabled database systems by using

    • Android programming,
    • Android-server-database connection,
    • JDBC (Java Database Connectivity),
    • Oracle database, and
    • Relational database design and SQL.

  • CSCI 513 (Advanced Database Systems) to build Internet-enabled and embedded database systems by using

    • Android programming,
    • Android SQLite embedded database,
    • JDBC (Java Database Connectivity),
    • Object-relational SQL and PL/SQL, and
    • Oracle (an object-relational database).

  • CSCI 515 (Data Engineering and Mining) to build Internet-enabled data-mining systems to discover knowledge from a large set of data by using

    • Data mining and knowledge discovery,
    • Internet-enabled Firebase database,
    • Information retrieval, and
    • Internet-enabled TensorFlow.