Researchers and academics at UCL need an in-depth breakdown of other researchers working in the same area. Today this requires a manual search and verification process that can take months to complete, and speeding it up would make research much more effective. The problem grows every year: the number of researchers keeps increasing, and the process must be repeated annually.
Our solution automates this process for all areas of research. Scraping tools collect data from UCL-related websites; the data is then cleaned and processed into visualisations, and researchers can search for other academics using various filters. The process would be rerun automatically each year to include new publications and researchers, so that the time needed stays short.
We have created an easy-to-use web app interface that lets any user find modules or publications through a keyword search, so they can easily find fellow researchers and other essential information about their field of study. We have also implemented a module-to-SDG mapping feature that matches any module to relevant SDGs with high and consistent accuracy, accompanied by useful visualisations of the data. There is also a proof-of-concept tool that maps IHE research expertise areas to provided key topic terms, a feature that can easily be expanded to automatically include researchers and their respective fields. We believe that this tool makes it very easy for researchers to find both publications and fellow researchers, decreasing preparation time and increasing efficiency.
Scrape, map and generate classifiers with the intention of generating an overview of the extent of activity already taking place at an organization
Our project came about due to a growing need within UCL, centred on research administration: finding names and other key terms related to a given research topic. This search was taking longer each year because of the increasing number of articles, papers and other publications on each topic. This connected us with a range of researchers, including professors, PhD research students and sustainable development researchers. They all needed a way to speed up this process and to obtain more accurate data automatically, rather than repeating the search manually each year.
You may find the client's details below:
Neel Desai - neel.desai.13@ucl.ac.uk
Marilyn Aviles - marilyn.aviles@ucl.ac.uk
Prof Ann Blandford - ann.blandford@ucl.ac.uk
Dr. Simon Knowles - s.knowles@ucl.ac.uk
Our main project goal is to give all researchers a way to find and contact other researchers who are working, or have worked, in the same field of study. A further goal is to ensure that the RPS database can be scraped to find all papers linked to a given search term, which can then be used to find all related researchers.
We gathered our requirements in the form of a MoSCoW list, created through a series of group and individual meetings with each of our clients, in order to capture all the features and methods needed as well as the priority of each. We went over these features multiple times with our clients, adding a few and splitting other requirements into separate parts.
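The idea of going from a search term to matching papers, and from those papers to the researchers behind them, can be sketched as follows. This is only a minimal illustration under stated assumptions: the `Publication` record, the `find_researchers` helper and the sample data are all hypothetical, and the sketch does not reflect the real RPS or Scopus schema or the project's actual search code.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """Hypothetical, simplified publication record (not a real RPS/Scopus schema)."""
    title: str
    abstract: str
    authors: list = field(default_factory=list)


def find_researchers(publications, keyword):
    """Map each author to the titles of their publications that mention the keyword."""
    needle = keyword.casefold()
    matches = {}
    for pub in publications:
        # Search title and abstract together, ignoring case.
        text = f"{pub.title} {pub.abstract}".casefold()
        if needle in text:
            for author in pub.authors:
                matches.setdefault(author, []).append(pub.title)
    return matches


# Made-up sample data, for illustration only.
SAMPLE = [
    Publication("Wearable sensors for gait analysis",
                "A study of low-cost sensors.",
                ["A. Example", "B. Sample"]),
    Publication("Urban water quality monitoring",
                "Sensors deployed across a city.",
                ["B. Sample"]),
    Publication("A history of typography",
                "No engineering content.",
                ["C. Placeholder"]),
]

if __name__ == "__main__":
    for author, titles in sorted(find_researchers(SAMPLE, "sensors").items()):
        print(author, "->", titles)
```

A plain substring match keeps the sketch short; a real search would more likely run against preprocessed text in a database.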
Alison has trouble identifying researchers to collaborate with across the IHE. She would like to use this tool as a quick and efficient way of searching for researchers across different engineering fields. She wants to establish connections across UCL and monitor the progress of her colleagues.
Jonathan is a PhD researcher who would use this data tool to quickly find and sort through the different research topics that have been, or are currently being, studied at his university. He aims to gain insight into the extent to which UCL is involved in promoting the UN's 2015 Sustainable Development Goals in both teaching and research.
ID | Requirement Description | Priority |
---|---|---|
1 | Scrape UCL research publications from Scopus by leveraging the Scopus API to gather the following data: {title, abstract, DOI, subject areas, index keywords, author keywords, elsevier link, …}. | Must |
2 | Scrape UCL modules from the UCL module catalogue by leveraging the UCL API to acquire the following data: {description, title, ID, module lead, credit value, …}. | Must |
3 | Produce an extensive set of keywords (CSV file) for UN SDGs (United Nations Sustainable Development Goals) and IHE (Institute of Healthcare Engineering) topics. | Must |
4 | Use NLP to preprocess text for UCL module fields: {description, name} and Scopus research publication fields: {title, abstract, index keywords, author keywords}. | Must |
5 | Train a semi-supervised NLP model to map UCL module catalogue descriptions to UN SDGs (United Nations Sustainable Development Goals). | Must |
6 | Train a semi-supervised NLP model to map Scopus research publications to IHE (Institute of Healthcare Engineering) research specialities and subject areas. | Must |
7 | Provide the most up-to-date data on Scopus research publications and UCL course modules. | Must |
8 | Django web application implemented and fully deployed. | Must |
9 | Django keyword search functionality for Scopus research publications. | Must |
10 | Train a supervised machine learning model, such as an SVM (Support Vector Machine), aimed at reducing the number of false positives from the NLP model. | Should |
11 | Use the NLP model, trained on SDG-specific keywords, to make SDG mapping predictions for UCL research publications from Scopus. | Should |
12 | Validate the NLP model using a string matching algorithm to count SDG-specific keyword occurrences and compare the probability distribution with that produced by the NLP model, trained on the same set of SDG-specific keywords (represents a similarity index in the range [0,1]). | Should |
13 | Machine Learning visualisation using TSNE Clustering (via dimensionality reduction) and Intertopic Distance Map (via multidimensional scaling) for SDG & IHE topic mappings, deployed on Django. | Should |
14 | Visualisation of SDG results using Tableau (accessed through database credentials). Should be able to view SDG sizes in accordance to the number of students per module/department/faculty across UCL. | Should |
15 | Walkthrough guide describing how to use the final product & system maintenance (for rerunning the scraping and retraining the NLP model on an annual basis). | Should |
16 | Django keyword search functionality for UCL modules, from the module catalogue (bonus feature). | Could |
17 | Django logic for visualising validation using similarity index, mapping from values in the range [0,1] to a red-green colour gradient [red, green]. | Could |
18 | Django button for exporting Scopus keyword search results to a CSV file format (bonus feature for exporting UCL module keyword searches). | Could |
19 | Django option for sorting rows of the NLP model results table, based on the validation similarity index (red-green colour gradient). | Could |
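Requirements 12 and 17 above describe a validation step: count SDG-specific keyword occurrences, compare the resulting distribution with the NLP model's output as a similarity index in [0,1], and map that index to a red-green gradient. The sketch below shows one way this could work. It rests on several assumptions: the requirements do not say how the similarity index is computed, so cosine similarity is used here as a stand-in; the function names, the `SDG_KEYWORDS` sample and the model output are hypothetical; and the colour mapping is a simple linear interpolation.

```python
import math
import re

# Hypothetical keyword lists; the real project uses a much larger CSV file.
SDG_KEYWORDS = {
    "SDG 6": ["water", "sanitation"],
    "SDG 7": ["energy"],
}


def keyword_distribution(text, keywords_by_sdg):
    """Count whole-word keyword occurrences per SDG and normalise to sum to 1."""
    lowered = text.lower()
    counts = {}
    for sdg, keywords in keywords_by_sdg.items():
        counts[sdg] = sum(
            len(re.findall(r"\b" + re.escape(kw.lower()) + r"\b", lowered))
            for kw in keywords
        )
    total = sum(counts.values())
    if total == 0:
        return {sdg: 0.0 for sdg in counts}
    return {sdg: n / total for sdg, n in counts.items()}


def similarity_index(p, q):
    """Cosine similarity of two non-negative distributions (assumed metric)."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    # Clamp to guard against floating-point overshoot above 1.
    return min(1.0, dot / (norm_p * norm_q))


def similarity_to_hex(score):
    """Linearly interpolate from red (score 0) to green (score 1)."""
    score = max(0.0, min(1.0, score))
    red = round(255 * (1 - score))
    green = round(255 * score)
    return f"#{red:02x}{green:02x}00"


if __name__ == "__main__":
    text = "Clean water and sanitation rely on energy. Water matters."
    counted = keyword_distribution(text, SDG_KEYWORDS)
    model_output = {"SDG 6": 0.75, "SDG 7": 0.25}  # hypothetical NLP prediction
    score = similarity_index(counted, model_output)
    print(counted, score, similarity_to_hex(score))
```

Because both distributions are non-negative, cosine similarity stays within [0,1], which matches the range stated in the requirement. Other measures could be substituted without changing the colour-mapping step.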
ID | Requirement Description | Priority |
---|---|---|
1 | Will have a responsive Django web interface | Must |
2 | Web interface will be reliable and publicly available at all times | Must |
3 | Product will be scalable for a constantly increasing number of modules and publications | Must |
4 | Avoids any possible legal or licensing conflicts | Must |
5 | Data integrity for publications and modules | Must |
6 | Interface will be intuitive and easily usable | Should |
7 | There will be a home button on a navigation bar for ease of browsing | Should |
8 | Main thematic colour across the site will be #007FFF | Should |
9 | Source code will be highly readable to any external user | Could |
10 | Informative, multi-page visually pleasing user interface | Would |
The software is an early proof of concept for development purposes and should not be used as-is in a live environment without further redevelopment and/or testing. No warranty is given and no real data or personally identifiable data should be stored. Usage and its liabilities are your own.
Legal Issues
We have worked hard to ensure that our project fully complies with all licensing, copyright and GDPR law. Although we did not collect any data that falls under GDPR, we do collect the names of publication authors and of the professors who teach modules across UCL; this information is already publicly available and does not require confidentiality. In line with UK copyright law, our use of the Scopus API respects the moral rights of authors, meaning that they are always credited for their own work. We also have a strict contractual relationship, defined through the conditions set within our MoSCoW list (see above). As the contractor team, we have been resolute in ensuring that we complete as many of our clients' minimum defined requirements as possible. The LDA algorithm we used comes from Gensim, an open-source library for topic modelling and natural language processing, while our GuidedLDA algorithm builds on the Latent Dirichlet Allocation (LDA) library and is publicly licensed for use by Vikash Singh.
Processes
Our database is deployed on Azure and accessed from a Django web app. This currently has no ongoing cost because the project is within the limits of the free Azure web app subscription. We also have a subscription with MongoDB; however, the free subscription did not provide enough storage. To keep this running for the long-term sustainability of the project, we pay an average monthly cost of £8.77.
Abbreviation Key
MIT = Massachusetts Institute of Technology
OSI = Open Source Initiative
BSD = Berkeley Software Distribution
BSD-3-Clause: The 3-Clause BSD License, also known as the New or Modified BSD License
BSD-2-Clause: The 2-Clause BSD License, also known as the Simplified or FreeBSD License
MPL = Mozilla Public License
GNU = GNU’s Not Unix
GPL = General Public License
LGPL = Lesser General Public License
ISC = Internet Systems Consortium
PSF = Python Software Foundation
HPND = Historical Permission Notice and Disclaimer
AFL = Academic Free License
CAL = Client Access License
SSPL = Server Side Public License
AGPL = Affero General Public License
FOSS = Free and Open-Source Software
SSI = Shared Source Initiative
TSL = Tableau Software License
Library / Technology | Source | License |
---|---|---|
beautifulsoup4 v4.9.3 | https://pypi.org/project/beautifulsoup4/ | MIT |
bokeh v2.3.0 | https://pypi.org/project/bokeh/ | Freely Distributable, OSI Approved (BSD-3-Clause) |
certifi v2020.12.5 | https://pypi.org/project/certifi/ | MPL-2.0 |
chardet v4.0.0 | https://pypi.org/project/chardet/ | GNU or LGPL |
click v7.1.2 | https://pypi.org/project/click/ | BSD (BSD-3-Clause) |
cycler v0.10.0 | https://pypi.org/project/Cycler/ | BSD |
dnspython v2.1.0 | https://pypi.org/project/dnspython/ | ISCL |
funcy v1.15 | https://pypi.org/project/funcy/ | BSD |
future v0.18.2 | https://pypi.org/project/future/ | OSI Approved, MIT |
gensim v3.8.3 | https://pypi.org/project/gensim/ | GNU LGPLv2+ |
idna v2.10 | https://pypi.org/project/idna/ | BSD (BSD-3-Clause) |
jinja2 v2.11.3 | https://pypi.org/project/Jinja2/ | BSD (BSD-3-Clause) |
joblib v1.0.1 | https://pypi.org/project/joblib/ | BSD |
kiwisolver v1.3.1 | https://pypi.org/project/kiwisolver/ | BSD |
lxml v4.6.2 | https://pypi.org/project/lxml/ | BSD |
markupSafe v1.1.1 | https://pypi.org/project/MarkupSafe/ | BSD (BSD-3-Clause) |
matplotlib v3.3.4 | https://pypi.org/project/matplotlib/ | PSF |
nltk v3.5 | https://pypi.org/project/nltk/ | Apache Software License v2 |
numexpr v2.7.3 | https://pypi.org/project/numexpr/ | MIT |
numpy v1.20.1 | https://pypi.org/project/numpy/ | BSD |
packaging v20.9 | https://pypi.org/project/packaging/ | Apache Software License, BSD (BSD-2-Clause) |
pandas v1.2.3 | https://pypi.org/project/pandas/ | BSD |
pbr v5.5.1 | https://pypi.org/project/pbr/ | Apache Software License |
pillow v8.1.2 | https://pypi.org/project/Pillow/ | HPND |
psycopg2 v2.8.6 | https://pypi.org/project/psycopg2/ | GNU or LGPL |
pybliometrics v2.9.1 | https://pypi.org/project/pybliometrics/ | MIT |
pyLDAvis v3.2.2 | https://pypi.org/project/pyLDAvis/ | MIT |
pymongo v3.11.3 | https://pypi.org/project/pymongo/ | Apache Software License v2 |
pyodbc v4.0.30 | https://pypi.org/project/pyodbc/ | MIT |
pyparsing v2.4.7 | https://pypi.org/project/pyparsing/ | MIT |
python-dateutil v2.8.1 | https://pypi.org/project/python-dateutil/ | Apache Software License, BSD (Dual License) |
pytz v2021.1 | https://pypi.org/project/pytz/ | MIT |
pyYAML v5.4.1 | https://pypi.org/project/PyYAML/ | MIT |
regex v2021.3.17 | https://pypi.org/project/regex/ | Apache Software License |
requests v2.25.1 | https://pypi.org/project/requests/ | Apache Software License v2 |
scikit-learn v0.24.1 | https://pypi.org/project/scikit-learn/ | OSI Approved (New BSD) |
scipy v1.6.1 | https://pypi.org/project/scipy/ | BSD |
simplejson v3.17.2 | https://pypi.org/project/simplejson/ | AFL, MIT |
six v1.15.0 | https://pypi.org/project/six/ | MIT |
smart-open v4.2.0 | https://pypi.org/project/smart-open/ | MIT |
soupsieve v2.2.1 | https://pypi.org/project/soupsieve/ | MIT |
threadpoolctl v2.1.0 | https://pypi.org/project/threadpoolctl/ | BSD |
tornado v6.1 | https://pypi.org/project/tornado/ | Apache Software License v2 |
tqdm v4.59.0 | https://pypi.org/project/tqdm/ | MIT, MPLv2.0 |
typing-extensions v3.7.4.3 | https://pypi.org/project/typing-extensions/ | PSF |
urllib3 v1.26.4 | https://pypi.org/project/urllib3/ | MIT |
Python v3.7 | https://www.python.org/downloads/release/python-379/ | GNU GPL-compatible |
Microsoft Azure v1.5.45.0 | https://www.microsoft.com/en-us/licensing/product-licensing/azure | Windows Services, CAL (Closed Source) |
Django v3.1.7 | https://www.djangoproject.com/trademarks/ | BSD, OSI-approved (BSD-3-clause) |
PostgreSQL v9.6 | https://www.postgresql.org/about/licence/ | The PostgreSQL License, OSI-approved |
Microsoft SQL Server v15.0.2000.5 | https://www.microsoft.com/en-us/licensing/product-licensing/sql-server | Windows Services, CAL (Closed Source) |
MySQL v8.0.23 | https://www.mysql.com/about/legal/licensing/oem/ | GPLv2, commercial proprietary License (dual license) |
MongoDB v4.4.4 | https://www.mongodb.com/community/licensing | SSPLv1, GNU AGPLv3, Apache Software License v2 |
GitHub v2.18.0 | https://docs.github.com/en/github/site-policy/github-terms-of-service | FOSS, SSI |
Tableau v2020.4.2 | https://mkt.tableau.com/legal/tableau_eula.pdf | TSL |
Visual Studio Code v1.54.3 | https://code.visualstudio.com/license | MIT, Proprietary License |
Azure Data Studio v1.27.0 | https://docs.microsoft.com/en-us/sql/azure-data-studio/faq?view=sql-server-ver15 | MIT |
Nano v5.6.1 | https://www.nano-editor.org | GNU GPLv3 |
System developed by:
Kareem Kermad (kareem.kermad.19@ucl.ac.uk)
Varun Wignarajah (varun.wignarajah.19@ucl.ac.uk)
Albert Mukhametov (albert.mukhametov.19@ucl.ac.uk)
Clients and organisations:
Neel Desai (neel.desai.13@ucl.ac.uk)
Marilyn Aviles (marilyn.aviles@ucl.ac.uk)
Prof. Ann Blandford (ann.blandford@ucl.ac.uk)
Dr. Simon Knowles (s.knowles@ucl.ac.uk)
Supervisors and Teaching Assistants:
Toron Najiba (najiba.toron.09@ucl.ac.uk)
University College London