UCL5 MieMie

What it does:

Scrapes and maps data, and trains classifiers, to produce an overview of the extent of activity already taking place at an organisation.

Requirements
Research
System Design
Implementation
Testing
Evaluation

Project Background & Client Introduction

Our project arose from a growing need within UCL's research administration: finding information, such as researcher names and other key terms, based on a topic of research. This search was taking longer each year because of the increasing number of articles, papers and other publications on each topic. This put us in contact with a range of researchers, from professors and PhD research students to sustainable development researchers. They all needed a way to speed up this process and to obtain more accurate data automatically, rather than repeating the work manually each year.

You may find the clients' details below:

Neel Desai - neel.desai.13@ucl.ac.uk
Marilyn Aviles - marilyn.aviles@ucl.ac.uk
Prof Ann Blandford - ann.blandford@ucl.ac.uk
Dr. Simon Knowles - s.knowles@ucl.ac.uk

Project Goals

Our main goal is to give all researchers a way to find and contact other researchers who are working, or have worked, in the same field of study. A further goal is to ensure that the RPS database can be scraped to find all papers linked to a given search term, which can in turn be used to find all related researchers.

Requirement Gathering

We gathered requirements in the form of a MoSCoW list, built up through a series of group and individual meetings with each of our clients, in order to capture all the features and methods needed and the priority of each. We went over these features multiple times with our clients, adding a few and splitting some requirements into separate parts.

Personas

Use Cases

Alison has trouble identifying researchers to collaborate with across the IHE. She would like to use this tool as a quick and efficient way of searching for researchers across different engineering fields. She wants to establish connections across UCL and monitor the progress of her colleagues.

Jonathan is a PhD researcher who would use this data tool to quickly find and sort through the research topics that have been, or are currently being, pursued at his university. He aims to gain insight into the extent to which UCL is involved in promoting the UN's 2015 Sustainable Development Goals in both teaching and research.

Functional MoSCoW List

ID	Requirement Description	Priority
1	Scrape UCL research publications from Scopus by leveraging the Scopus API to gather the following data: {title, abstract, DOI, subject areas, index keywords, author keywords, elsevier link, …}.	Must
2	Scrape UCL modules from the UCL module catalogue by leveraging the UCL API to acquire the following data: {description, title, ID, module lead, credit value, …}.	Must
3	Produce an extensive set of keywords (CSV file) for UN SDGs (United Nations Sustainable Development Goals) and IHE (Institute of Healthcare Engineering) topics.	Must
4	Use NLP to preprocess text for UCL module fields: {description, name} and Scopus research publication fields: {title, abstract, index keywords, author keywords}.	Must
5	Train a semi-supervised NLP model to map UCL module catalogue descriptions to UN SDGs (United Nations Sustainable Development Goals).	Must
6	Train a semi-supervised NLP model to map Scopus research publications to IHE (Institute of Healthcare Engineering) research specialities and subject areas.	Must
7	Provide the most up-to-date data on Scopus research publications and UCL course modules.	Must
8	Django web application implemented and fully deployed.	Must
9	Django keyword search functionality for Scopus research publications.	Must
10	Train a supervised machine learning model, such as an SVM (Support Vector Machine), aimed at reducing the number of false positives from the NLP model.	Should
11	Use the NLP model, trained on SDG-specific keywords, to make SDG mapping predictions for UCL research publications from Scopus.	Should
12	Validate the NLP model using a string matching algorithm to count SDG-specific keyword occurrences and compare the probability distribution with that produced by the NLP model, trained on the same set of SDG-specific keywords (represents a similarity index in the range [0,1]).	Should
13	Machine Learning visualisation using TSNE Clustering (via dimensionality reduction) and Intertopic Distance Map (via multidimensional scaling) for SDG & IHE topic mappings, deployed on Django.	Should
14	Visualisation of SDG results using Tableau (accessed through database credentials). Users should be able to view SDG sizes according to the number of students per module/department/faculty across UCL.	Should
15	Walkthrough guide describing how to use the final product & system maintenance (for rerunning the scraping and retraining the NLP model on an annual basis).	Should
16	Django keyword search functionality for UCL modules, from the module catalogue (bonus feature).	Could
17	Django logic for visualising validation using similarity index, mapping from values in the range [0,1] to a red-green colour gradient [red, green].	Could
18	Django button for exporting Scopus keyword search results to a CSV file format (bonus feature for exporting UCL module keyword searches).	Could
19	Django option for sorting rows of the NLP model results table, based on the validation similarity index (red-green colour gradient).	Could
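Requirements 12 and 17 above describe a validation step: count SDG keyword occurrences by string matching, turn the counts into a probability distribution, compare it with the NLP model's distribution to obtain a similarity index in [0,1], and map that index onto a red-green gradient. A minimal Python sketch of that flow is given below. The keyword lists, the function names and the choice of metric (1 minus the total variation distance) are assumptions made for illustration; the project's actual metric and keyword CSV are not shown here.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the project's real SDG keyword CSV is not shown here.
SDG_KEYWORDS = {
    "SDG 3": ["health", "disease"],
    "SDG 7": ["renewable energy", "solar"],
    "SDG 13": ["climate change", "emissions"],
}


def keyword_distribution(text, keywords_by_sdg):
    """Count every keyword occurrence per SDG and normalise to probabilities."""
    lowered = text.lower()
    counts = Counter()
    for sdg, keywords in keywords_by_sdg.items():
        for keyword in keywords:
            # Word boundaries stop "solar" from matching inside "solarium".
            pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
            counts[sdg] += len(re.findall(pattern, lowered))
    total = sum(counts.values())
    if total == 0:
        # No keyword found at all: return an all-zero vector.
        return {sdg: 0.0 for sdg in keywords_by_sdg}
    return {sdg: counts[sdg] / total for sdg in keywords_by_sdg}


def similarity_index(p, q):
    """Assumed metric: 1 minus the total variation distance, which lies in [0, 1]."""
    labels = set(p) | set(q)
    distance = 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)
    return 1.0 - distance


def similarity_to_hex(similarity):
    """Linearly interpolate from red (0.0) to green (1.0) as a CSS hex colour."""
    clamped = min(1.0, max(0.0, similarity))
    red = round(255 * (1.0 - clamped))
    green = round(255 * clamped)
    return f"#{red:02x}{green:02x}00"
```

Passing a module description through keyword_distribution, then comparing the result with the model's predicted distribution via similarity_index, yields the value that similarity_to_hex turns into a table cell colour. Sorting rows by that same value would give the ordering described in requirement 19.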

Non-Functional MoSCoW List

ID	Requirement Description	Priority
1	Will have a responsive Django web interface	Must
2	Web interface will be reliable and publicly available at all times	Must
3	Product will be scalable for a constantly increasing number of modules and publications	Must
4	Avoids any possible legal or licensing conflicts	Must
5	Data integrity for publications and modules	Must
6	Interface will be intuitive and easily usable	Should
7	There will be a home button on a navigation bar for ease of browsing	Should
8	Main thematic colour across the site will be #007FFF	Should
9	Source code will be highly readable to any external user	Could
10	Informative, multi-page visually pleasing user interface	Would

Model	Accuracy
SVM	83%
Logistic Regression	77%

ID	Bug Description	Priority	State
1	Scraping error reported as an invalid DOI. The error was not raised by an attempt to pull a research publication missing from Scopus records, but by the inability to control request rate and request count, given that the API sets weekly quotas. The code could therefore appear to perform well while actually calling a wrongly formatted DOI string or using a key that had exceeded its allocated quota.	High	Solved
2	In the LOADERS component, although the data loaded from MongoDB appears to be in dictionary format, each key and value still has the cursor's data type, which caused numerous bugs with extra fields being added, specifically "_id". What made it trickier was that conventional JSON conversion appears to work but does not solve the issue; it merely postpones the point of failure. It was eventually solved using the bson library function json_util.dumps(data) for cursor type conversion.	Medium	Solved
3	Currently, the production of the validation set, which relies on keyword occurrence analysis, only performs unique string matching. This affects the probability assigned to a given module or publication with regard to its relevance to a particular SDG, so the validation similarity index may change. This would be solved by recording the number of keyword occurrences and relating it to the total count for a given SDG.	Medium	Not Solved
4	Originally, data was pushed to MongoDB by identifying the key and values and performing an update(key, value, upsert=True) operation. However, if a data field changed for the same object, MongoDB recorded it as a separate instance of another object, creating redundant duplicates that kept track of previous values. It was solved by making the key specific to a particular data field, for example the EID or the DOI, so that the fields of a specified object are updated rather than duplicated.	Low	Solved
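The fix described for bug 4 can be sketched as follows: match on a single stable identifier and write the remaining fields with $set, so a changed field overwrites the stored record instead of producing a second document. This is an illustrative sketch, not the project's actual code. The function name upsert_publication and the default key field "DOI" are assumptions; the collection argument is expected to be a pymongo Collection, whose update_one(filter, update, upsert=...) method is the only call used.

```python
def upsert_publication(collection, publication, key_field="DOI"):
    """Insert or update one publication, matching only on its stable identifier.

    `collection` is expected to be a pymongo Collection (or any object that
    exposes the same `update_one` method). `key_field` is an assumed name.
    """
    if key_field not in publication:
        raise KeyError(f"publication has no {key_field!r} field")
    # Match on the identifier alone, so a changed title or abstract updates
    # the existing document rather than failing to match and being inserted.
    query = {key_field: publication[key_field]}
    # "_id" is immutable in MongoDB, so it is left out of the $set payload.
    fields = {k: v for k, v in publication.items() if k != "_id"}
    return collection.update_one(query, {"$set": fields}, upsert=True)
```

With a real database, the collection would typically come from something like MongoClient()["some_db"]["publications"], where both names are placeholders. Keying on the EID instead only requires passing key_field="EID".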

Work Packages	Varun	Kareem	Albert
Website Report	33%	33%	34%
Project Blog	0%	10%	90%
Video Editing	33%	33%	34%
Research	25%	50%	25%
Natural Language Processing (NLP)	0%	95%	5%
Machine Learning	0%	100%	0%
NLP Model Validation	0%	90%	10%
Data Visualisation	0%	95%	5%
Data Mining (Scraping)	0%	0%	100%
Django UI Development	0%	5%	95%
Database Management	5%	10%	85%
Deployment	90%	0%	10%
Communication	40%	30%	30%
Requirement Gathering	40%	30%	30%
Overall contribution	30%	35%	35%
Main Roles	Database Deployer, Researcher, Team Liaison	Back End Developer, NLP / Machine Learning Developer, Researcher	Project Website Manager, Back End Developer, Front End Developer

Library / Technology	Source	License
beautifulsoup4 v4.9.3	https://pypi.org/project/beautifulsoup4/	MIT
bokeh v2.3.0	https://pypi.org/project/bokeh/	Freely Distributable, OSI Approved (BSD-3-Clause)
certifi v2020.12.5	https://pypi.org/project/certifi/	MPL-2.0
chardet v4.0.0	https://pypi.org/project/chardet/	GNU or LGPL
click v7.1.2	https://pypi.org/project/click/	BSD (BSD-3-Clause)
cycler v0.10.0	https://pypi.org/project/Cycler/	BSD
dnspython v2.1.0	https://pypi.org/project/dnspython/	ISCL
funcy v1.15	https://pypi.org/project/funcy/	BSD
future v0.18.2	https://pypi.org/project/future/	OSI Approved, MIT
gensim v3.8.3	https://pypi.org/project/gensim/	GNU LGPLv2+
idna v2.10	https://pypi.org/project/idna/	BSD (BSD-3-Clause)
jinja2 v2.11.3	https://pypi.org/project/Jinja2/	BSD (BSD-3-Clause)
joblib v1.0.1	https://pypi.org/project/joblib/	BSD
kiwisolver v1.3.1	https://pypi.org/project/kiwisolver/	BSD
lxml v4.6.2	https://pypi.org/project/lxml/	BSD
markupSafe v1.1.1	https://pypi.org/project/MarkupSafe/	BSD (BSD-3-Clause)
matplotlib v3.3.4	https://pypi.org/project/matplotlib/	PSF
nltk v3.5	https://pypi.org/project/nltk/	Apache Software License v2
numexpr v2.7.3	https://pypi.org/project/numexpr/	MIT
numpy v1.20.1	https://pypi.org/project/numpy/	BSD
packaging v20.9	https://pypi.org/project/packaging/	Apache Software License, BSD (BSD-2-Clause)
pandas v1.2.3	https://pypi.org/project/pandas/	BSD
pbr v5.5.1	https://pypi.org/project/pbr/	Apache Software License
pillow v8.1.2	https://pypi.org/project/Pillow/	HPND
psycopg2 v2.8.6	https://pypi.org/project/psycopg2/	GNU or LGPL
pybliometrics v2.9.1	https://pypi.org/project/pybliometrics/	MIT
pyLDAvis v3.2.2	https://pypi.org/project/pyLDAvis/	MIT
pymongo v3.11.3	https://pypi.org/project/pymongo/	Apache Software License v2
pyodbc v4.0.30	https://pypi.org/project/pyodbc/	MIT
pyparsing v2.4.7	https://pypi.org/project/pyparsing/	MIT
python-dateutil v2.8.1	https://pypi.org/project/python-dateutil/	Apache Software License, BSD (Dual License)
pytz v2021.1	https://pypi.org/project/pytz/	MIT
pyYAML v5.4.1	https://pypi.org/project/PyYAML/	MIT
regex v2021.3.17	https://pypi.org/project/regex/	Apache Software License
requests v2.25.1	https://pypi.org/project/requests/	Apache Software License v2
scikit-learn v0.24.1	https://pypi.org/project/scikit-learn/	OSI Approved (New BSD)
scipy v1.6.1	https://pypi.org/project/scipy/	BSD
simplejson v3.17.2	https://pypi.org/project/simplejson/	AFL, MIT
six v1.15.0	https://pypi.org/project/six/	MIT
smart-open v4.2.0	https://pypi.org/project/smart-open/	MIT
soupsieve v2.2.1	https://pypi.org/project/soupsieve/	MIT
threadpoolctl v2.1.0	https://pypi.org/project/threadpoolctl/	BSD
tornado v6.1	https://pypi.org/project/tornado/	Apache Software License v2
tqdm v4.59.0	https://pypi.org/project/tqdm/	MIT, MPLv2.0
typing-extensions v3.7.4.3	https://pypi.org/project/typing-extensions/	PSF
urllib3 v1.26.4	https://pypi.org/project/urllib3/	MIT
Python v3.7	https://www.python.org/downloads/release/python-379/	GNU GPL-compatible
Microsoft Azure v1.5.45.0	https://www.microsoft.com/en-us/licensing/product-licensing/azure	Windows Services, CAL (Closed Source)
Django v3.1.7	https://www.djangoproject.com/trademarks/	BSD, OSI-approved (BSD-3-clause)
PostgreSQL v9.6	https://www.postgresql.org/about/licence/	The PostgreSQL License, OSI-approved
Microsoft SQL Server v15.0.2000.5	https://www.microsoft.com/en-us/licensing/product-licensing/sql-server	Windows Services, CAL (Closed Source)
MySQL v8.0.23	https://www.mysql.com/about/legal/licensing/oem/	GPLv2, commercial proprietary License (dual license)
MongoDB v4.4.4	https://www.mongodb.com/community/licensing	SSPLv1, GNU AGPLv3, Apache Software License v2
GitHub v2.18.0	https://docs.github.com/en/github/site-policy/github-terms-of-service	FOSS, SSI
Tableau v2020.4.2	https://mkt.tableau.com/legal/tableau_eula.pdf	TSL
Visual Studio Code v1.54.3	https://code.visualstudio.com/license	MIT, Proprietary License
Azure Data Studio v1.27.0	https://docs.microsoft.com/en-us/sql/azure-data-studio/faq?view=sql-server-ver15	MIT
Nano v5.6.1	https://www.nano-editor.org	GNU GPLv3

Data Mining & Scraping Engine

Problem Statement

Our Solution

Achievement & Impact

Meet the Team!

Varun W.

Kareem K.

Albert M.

Related Projects Review

Technology Review

NLP and Topic Modelling

Database

Deployment Automation:

References

System Design

Django Framework Structure

Programming Language and Integrated Development Environment

Web Scraping

Processing Algorithms

NLP – Natural Language Processing

NLTK – Natural Language Toolkit

Django Search Engine Web Application

References

Testing Strategy

Unit Testing

NLP Testing: Model Validation

User acceptance testing: Feedback From Testers & Project Partners

Why This Test, Which Tool, How The Test Was Conducted, Results & Analysis

Summary of Achievements

Optional Feature

Critical Evaluation of the Project

Future work

Our prototypes

Legal Statement

Credits

Contact Us