Our development tracking blog

Track project progress, challenges and accomplishments

Week 1: First Client Meeting

  • Albert
  • 19 October 2020

The initial client meeting proved productive. Our discussions aimed to discover user requirements and form a coherent understanding of the basic functionality and qualitative characteristics of the product to be delivered. Unfortunately, detailed specifications could not be gathered due to the multiplicity of functional requests and the absence of the fourth client, so an additional meeting is to be set up with each client individually to gain deeper insight into their individual demands.

Week 2: Discovering User Requirements

  • Albert
  • 26 October 2020

Having met individually with one of our clients, Neel, we have a better grasp of the goals and requirements behind his specific request. Furthermore, the team has been planning and constructing the interviews to be conducted with users. So far, our group has constructed an abstract persona based on generalised data; it will be developed further as we interview more users. We also have client interviews scheduled with Ann on Wednesday (28.10.2020) and with Simon on Thursday (29.10.2020). Our goal is to reach a complete understanding of the requirements and a plan for taking a user-centred approach to design.

Week 3: Interviews

  • Albert
  • 02 November 2020

Last week our team met with Ann, our fourth client, in a meeting that proved very useful for our general understanding. A meeting with Simon Knowles and Allen Richard was vital for understanding how one would use keywords in the classification of ongoing research and teaching. They shared several suggestions for potential data sources, describing the advantages and disadvantages of each. On Monday 02/11/2020 my team and I held a meeting with Karolina, Marilyn, and Ann. We received some insight into Karolina's work, as well as a data file, which we aim to extend as part of the project. Our next steps include holding a team meeting to process the given data sources and evaluate which are fit for us to start with.

Week 4: Developing Sketches & Prototypes

  • Albert
  • 08 November 2020

Our work covered further development and refinement of the personas, along with the development of sketches. Towards the end of the week, we put significant effort into designing interactive prototypes based on, and improved from, the aforementioned sketches. The team evaluated particular design decisions against the user requirements, while also adding small touches of our own creative thought. However, we have yet to begin analysing the provided data spreadsheets and sources.

Week 5: Analysing Data Sources & Implementation

  • Albert
  • 17 November 2020

This week, the team met our new TA, who will track our progress and help us establish the project plan. We have now finalised our MoSCoW list and are awaiting a reply to our access request for RPS. In the meantime, we are researching deployment and integration environments, as well as planning out the structure for the upcoming implementation.

Week 6: Alternative Source & Research

  • Albert
  • 24 November 2020

We got in touch with Neel regarding RPS access; however, an alternative source was suggested - UCL Discovery, an open-access data repository. We have yet to contact the clients regarding this source. So far, the team has researched and evaluated various deployment and DevOps platforms (the evaluation can be accessed on the UCL5 MieMie website). Next steps include updating the website with further details and producing sample data to confirm with the clients (due by 25/11/2020).

Week 7: Developing Scraping Scripts

  • Albert
  • 1 December 2020

Team 16 has started the scraping procedure for the UCL Discovery repository. Unfortunately, we were denied access to UCL RPS due to our undergraduate student status. Although one of the clients (Neel) has approved the use of UCL Discovery, we have yet to meet with another client for the same purpose. In the meantime, we continue the development of our website and scraping scripts. The team has scheduled a meeting for Saturday, the due date for working Python scripts that handle single, non-paginated data.
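
As a rough illustration of the kind of single-page scrape these scripts target, here is a minimal Python sketch using requests and BeautifulSoup; the URL and CSS selectors are placeholders, not the actual UCL Discovery markup.

```python
# Minimal single-page scrape sketch. The URL and selectors below are
# hypothetical placeholders, not the real UCL Discovery page structure.
import requests
from bs4 import BeautifulSoup

URL = "https://discovery.ucl.ac.uk/some-listing-page"  # placeholder

response = requests.get(URL, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect title/link pairs from a (hypothetical) list of publication entries.
records = []
for entry in soup.select("div.publication-entry"):
    title = entry.select_one("h2").get_text(strip=True)
    link = entry.select_one("a")["href"]
    records.append({"title": title, "link": link})

print(f"Scraped {len(records)} records")
```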

Week 8: Developing Scrapes & NLP Model

  • Albert
  • 8 December 2020

Today my team and I met our clients, members of the Library.Open.Access team, and other researchers who have worked on SDG research analysis at UCL. Our team raised the need for RPS access but was eventually persuaded to use existing sources such as Web of Science and Scopus to perform string searches. I have scheduled a supplementary meeting for Thursday with one of the researchers (not one of our clients) to demonstrate how to perform the search given a static data report from RPS. In the meantime, other group members are working on developing or sourcing an NLP model to parse keywords out of a block of text. Additionally, we have kicked off the process of scraping the UCL Module Catalogue - a requirement from one of our clients.

Week 9: Developing NLP Model Using Scopus Data

  • Albert
  • 15 December 2020

Having met with Andrew Grey and Richard Allen - two researchers with experience in Scopus and keyword extraction for SDGs - our team gained immense insight and shifted our planned methodology. We have now started developing an NLP keyword extraction model and will begin training and validating it on the Scopus sample training set. We then plan to apply the same model to the UCL Module Catalogue data that we finished scraping earlier this week. We set out goals, spread out the workload, and agreed on particular deadlines and team meetings to be held throughout the upcoming winter break.

Week 10: NLP & Scopus

  • Albert
  • 22 December 2020

During the first few days of the winter break, the team and I shared our work for the upcoming elevator pitch. Additionally, Kareem continued working on the NLP model, and I continued testing the Scopus API using the Postman software. Varun has continued his work on populating the website contents. We plan to have some base data from Scopus within the next two weeks and to start model training as soon as the pre-processing is complete. The end goal for Scopus is to accumulate a dataset of 229,967 publications by UCL researchers, together with their details.

Week 11: Prototype Development & Data Pre-Processing

  • Albert
  • 29 December 2020

The NLP integration is still in progress: Kareem continues his work on pre-processing, ensuring clean data for higher result validity. Following that, we partitioned the data into training and validation sets using a 70:30 split. Now that our data is in the correct format, we fed it into a Support Vector Machine classification algorithm to train our model. Additionally, we have scraped sample data from Scopus; however, we are still resolving a few issues regarding missing data and the possibility of using different calls to gather various parts of the data to compose the whole picture.
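
As a sketch of the split-and-train step described above (using scikit-learn, with stand-in data from make_classification in place of our vectorised text features):

```python
# Sketch of a 70:30 split followed by SVM training with scikit-learn.
# The data here are synthetic stand-ins; the real pipeline feeds in
# vectorised abstract text from the pre-processing stage.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# 70:30 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = SVC(kernel="linear")  # kernel choice is an assumption
model.fit(X_train, y_train)

print("Validation accuracy:", model.score(X_val, y_val))
```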

Week 12: Scopus Scrape Infrastructure

  • Albert
  • 5 January 2021

During the second week of the holidays, we continued our work on the Scopus data scraper. I finally resolved the long-standing issue of data access through the API. Previously, a restriction had been in place, which we worked around by using the UCL VPN to access data points such as the abstract, author identifiers, and data on the authors' affiliations. This allowed the construction of a clean JSON-format file for each of the research publications. So far, as a demo version, I have put together ~1300 such files, letting us focus on feeding the abstract data into the SVM model. We plan to continue cleaning up that data and to get started on SDG -> UCL Module Catalogue classification.
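
For illustration, a minimal sketch of pulling records from the Scopus Search API and writing each one out as a JSON file; the API key, query, and output naming are placeholders, and full-record access is assumed to go through the UCL VPN or an institutional token.

```python
# Sketch: fetch Scopus search results and save each entry as a JSON file.
# The API key and query are placeholders; UCL VPN (or an institutional
# token) is assumed for full-record access.
import json
import requests

API_KEY = "YOUR_ELSEVIER_API_KEY"  # placeholder
url = "https://api.elsevier.com/content/search/scopus"
params = {"query": "AF-ID(60022148)", "count": 25}  # affiliation ID is illustrative
headers = {"X-ELS-APIKey": API_KEY, "Accept": "application/json"}

response = requests.get(url, params=params, headers=headers, timeout=30)
response.raise_for_status()

entries = response.json()["search-results"]["entry"]
for i, entry in enumerate(entries):
    with open(f"publication_{i}.json", "w", encoding="utf-8") as f:
        json.dump(entry, f, indent=2)
```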

Week 13: SDGs & Module Catalogue

  • Albert
  • 12 January 2021

This week, we met with one of our clients regarding his particular request for SDG mapping to the UCL Module Catalogue. The issue is that we are unable to run the training process without annotated data, and there are over 5500 modules to annotate. Hence, we contacted our TA, who approved an alternative approach - mapping the modules solely based on a statistical analysis of the keywords present. We therefore continued using TF-IDF, automated the process of extracting meaningful keywords from the module descriptions, and pushed the data in a formatted manner onto the Azure database table. Our next step is to continue that work and get started on the Bubble Chart generation, given that we already possess the Scopus research publications data.
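
A minimal sketch of the TF-IDF keyword extraction described above, using scikit-learn; the example descriptions are invented, and the real pipeline runs over the full module catalogue before writing to Azure.

```python
# Sketch: extract the top-scoring TF-IDF keywords from module
# descriptions. The two example descriptions are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "This module introduces machine learning, covering classification and regression.",
    "Students study sustainable energy systems and climate policy in urban settings.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(descriptions)
terms = vectorizer.get_feature_names_out()

for row in range(tfidf.shape[0]):
    scores = tfidf[row].toarray().ravel()
    top = scores.argsort()[::-1][:5]  # five highest-weighted terms
    print([terms[i] for i in top])
```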

Week 14: Bubble Chart Development

  • Albert
  • 19 January 2021

Our plan for this week was to get started on the Bubble Chart, hold our second prototype demonstration and, lastly, meet with the rest of our clients to discuss our progress and confirm it against our MoSCoW list. Bubble Chart generation is challenging, since the underlying data used to be compiled manually, and we aim to automate the topic generation and the matching of each topic to the author and index keywords gathered through the Scopus API. Our plan for the rest of the week is to continue our work on the chart generation and on the SDG -> module catalogue matching. On a further note, last Friday we successfully presented our project's Elevator Pitch.

Week 15: Alternative SDG Mapping

  • Albert
  • 26 January 2021

The team made progress towards mapping UCL modules from the module catalogue to SDGs (Sustainable Development Goals). After researching many different NLP and machine learning algorithms, we decided to use a semi-supervised LDA (Latent Dirichlet Allocation) algorithm called GuidedLDA. This version of the general LDA algorithm allows us to use the SDG keywords provided by our client as seed topics. The model then encourages topics to be built around these seed words, with some set confidence, whilst also uncovering more words related to a given topic. This reference point lets us classify known topics more easily, without the need for manual annotation, by allowing the topics to converge in the direction specified by the SDG keywords.
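
To illustrate the seeding mechanism, here is a minimal sketch following the guidedlda package's documented interface; the documents, seed words, and confidence value are illustrative placeholders, not our actual SDG keyword sets.

```python
# Sketch of seeded topic modelling with the guidedlda package.
# Documents and seed lists are illustrative, not the client's
# actual SDG keyword sets.
import guidedlda
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "poverty and income inequality in urban welfare programmes",
    "public health disease prevention and community wellbeing",
    "climate change emissions and global warming policy",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()  # document-term count matrix
word2id = vectorizer.vocabulary_

seed_topic_list = [
    ["poverty", "income", "welfare"],    # e.g. SDG 1
    ["health", "disease", "wellbeing"],  # e.g. SDG 3
    ["climate", "emissions", "warming"], # e.g. SDG 13
]

# Map each seed word to its intended topic id.
seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        if word in word2id:  # skip words absent from the vocabulary
            seed_topics[word2id[word]] = topic_id

model = guidedlda.GuidedLDA(n_topics=3, n_iter=100, random_state=7, refresh=50)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

# Inspect the top words per topic.
vocab = vectorizer.get_feature_names_out()
for t, dist in enumerate(model.topic_word_):
    top = np.argsort(dist)[::-1][:4]
    print(f"Topic {t}:", [vocab[i] for i in top])
```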

Week 16: GuidedLDA Implementation

  • Albert
  • 2 February 2021

This week our team had an additional client meeting regarding the Bubble Chart. We requested the source code for its implementation and deployment. Once received, we began reverse-engineering its structure and attempting to understand how the "approach" and "speciality" categories are assigned. We continued the work on the GuidedLDA algorithm, focusing on applying it to the scraped data and figuring out how to validate the results. A proposed solution is to use the simple string-matched results as a validation set; however, this has yet to be evaluated.

Week 17: Search Functionality & LDA

  • Albert
  • 9 February 2021

Our group has continued working on the LDA implementation. Additionally, during one of our client meetings, the client proposed a general-purpose search functionality for the data we have accumulated so far. Hence, we plan to implement a tight search (a plain matching algorithm without pre-processing) using robust string matching algorithms: Rabin-Karp (with a rolling-hash sliding window) and Boyer-Moore (using a bad-character table together with a good-suffix table). On another note, we ran into a few issues with the data for the Bubble Chart: we managed to deconstruct the sqlite3 database, but the data there appears to have been assigned without an obvious script performing the matching of a researcher with his/her speciality and approach.
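
For reference, a textbook sketch of the Rabin-Karp rolling-hash idea (a generic version, not our production search code):

```python
# Textbook Rabin-Karp with a rolling hash; a sketch, not the
# production search implementation.
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 101) -> list[int]:
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod

    matches = []
    for i in range(n - m + 1):
        # Only compare strings when the window hashes agree.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return matches

print(rabin_karp("sustainable development goals", "develop"))  # -> [12]
```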

Week 18: Django Search Engine Development & GuidedLDA Optimisation

  • Albert
  • 16 February 2021

Following our meeting with Neel and Simon to catch up on our progress, I immediately started the development of the Django search engine web application. It will interface with the back-end data and provide extensive search functionality and a detailed view for both modules and publications. In the meantime, our group has finally implemented the GuidedLDA algorithm for our data; however, the results are not perfect and need tuning. Hence, several solutions were proposed, including using a TF-IDF vectoriser, adding stop-words to prevent the detection of unnecessary keywords (such as "module" or "student") and, finally, using modules that have a direct string-matched mapping, to prevent redundant processing and reduce GuidedLDA skewing.
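
As a rough sketch of how such a Django search view might query both modules and publications (the model and field names are illustrative, not our actual schema):

```python
# Sketch of a Django search view over modules and publications.
# Model and field names are illustrative, not the project's schema.
from django.db.models import Q
from django.shortcuts import render

from .models import Module, Publication  # hypothetical models


def search(request):
    query = request.GET.get("q", "")
    modules = publications = []
    if query:
        modules = Module.objects.filter(
            Q(name__icontains=query) | Q(description__icontains=query)
        )
        publications = Publication.objects.filter(
            Q(title__icontains=query) | Q(abstract__icontains=query)
        )
    return render(request, "search/results.html",
                  {"modules": modules, "publications": publications})
```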

Week 19: Django Continuation & Changes to GuidedLDA-->LDA

  • Albert
  • 23 February 2021

At this moment, the Django web application is complete. Our team is meeting today to get a final stamp of approval from Neel, signifying the completion of his specific requirement for the project. On a different note, the team has progressed immensely with the GuidedLDA optimisation. However, it has become clear that plain LDA produces more valuable results and should be used instead. Hence, we are now able to assign each UCL module to a UN Sustainable Development Goal, as well as demonstrate potential relevance/overlap with other goals. The next step is further testing of the model. Moreover, Marilyn has requested similar model testing using keywords relating to Digital Health and Engineering. If the model proves insightful, we shall apply it to this area too.

Week 20: Model Validation & SDG Mapping Django Interface

  • Albert
  • 2 March 2021

This week our team focused on model tuning, and by the end of it we were happy to see great results. Specifically, the keywords assigned to each SDG were accurate and relevant. Moreover, the SDG mapping was consistently accurate both for modules with no relation to any SDG and for modules with SDG relevance. By the same token, although the model was trained on module data, it performed just as well when applied to unseen Scopus publication data. Furthermore, we held a meeting with Simon and Richard (a UCL researcher, not in the client group) to gather feedback on current progress. It was suggested we pursue model validation by comparing results to a simple string-matched classification. In order to visualise the mappings, we developed a second page in our Django web application, with similar search functionality and a checkbox for the preferred view option. Our next steps are to develop validation scripts, improve the Django interface, and settle on our deployment and cloud hosting solution.

Week 21: Model Validation Continued

  • Albert
  • 9 March 2021

The team has continued work on the Django interface. We have integrated better visualisation by running model validation against a simple string matching algorithm. Two 18-dimensional vectors are computed, and the cosine of the angle between them is used as the similarity index between the two predictions. Furthermore, we developed a function to translate the index, which lies in the range [0,1], to an RGB scale that varies between red (when the index falls close to 0) and green (when the similarity is close to 1). Moreover, we continued our struggle with deployment; towards the end of the week, we succeeded. The application was deployed to Microsoft Azure App Service and utilises Azure's PostgreSQL services. Additionally, the production version now uses a further cloud storage service, MongoDB, as it leverages JSON-format data storage and retrieval. This significantly reduced the Django application's weight and improved its synchronisation speed.
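
A small sketch of the similarity-to-colour step (the prediction vectors below are random stand-ins for the model and string-match outputs):

```python
# Sketch: cosine similarity between two 18-dimensional SDG prediction
# vectors, translated to a red-to-green RGB colour. The vectors are
# random stand-ins for the LDA and string-match predictions.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_to_rgb(index: float) -> tuple[int, int, int]:
    # index in [0, 1]: 0 -> pure red, 1 -> pure green.
    index = min(max(index, 0.0), 1.0)
    return (int(255 * (1 - index)), int(255 * index), 0)


model_vec = np.random.rand(18)   # stand-in for the model prediction
string_vec = np.random.rand(18)  # stand-in for the string-match prediction

index = cosine_similarity(model_vec, string_vec)
print(index, similarity_to_rgb(index))
```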

Week 22: IHE Proof of Concept

  • Albert
  • 16 March 2021

This week the team began utilising Marilyn's newly suggested approach to tackling the Bubble Chart. We drafted custom topics such as Cybersecurity, Bioinformatics, Software Engineering, etc. For each topic, we composed a large set of related keywords, which we then filtered further. The keywords were fed into the GuidedLDA algorithm with the older tuning parameters for test purposes, allowing the team to evaluate the value of this approach. The results were surprisingly good, so we developed a separate table within the Django web application to visualise the model results. Furthermore, Kareem developed the t-SNE clustering chart, which leverages dimensionality reduction to visualise the topic categorisation in two dimensions. Our plan now is to finalise the SVM and GuidedLDA models, fix Django bugs, begin coding Simon's SDG visualisations, and start on unit testing and the documentation write-up.
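
A minimal sketch of the t-SNE reduction behind such a chart (random stand-in vectors in place of the real document-topic distributions):

```python
# Sketch: reduce high-dimensional topic vectors to 2-D with t-SNE and
# plot them. The data are random stand-ins for the real document-topic
# distributions produced by the model.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
topic_vectors = rng.random((300, 17))  # stand-in document-topic matrix
labels = topic_vectors.argmax(axis=1)  # dominant topic per document

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(topic_vectors)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab20", s=10)
plt.title("t-SNE clustering of documents by dominant topic")
plt.show()
```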
