noemiernst/StackExchangeMathDataset

Stack Exchange Dataset of Mathematical Formulas

Processes Stack Exchange data dumps into a dataset of mathematical formulas.

Requirements

  • Python 3.7

Python Packages

Install the Python packages by running: pip install -r requirements.txt

Other tools

Dataset

Creating the Dataset

  • cd main
  • execute: python main.py --input ../input/ --dumps test_dumps --download yes --extract yes --output ../output/database.db
  • execute (only after main.py has been executed): python context.py --input ../input/ --dumps test_dumps --download yes --database ../output/database.db --context 10 --topn 3 --tablename FormulaContext
  • execute (only after main.py has been executed): python statistics.py --dumps test_dumps --database ../output/database.db --output ../output/
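
The three steps above could also be driven from a single script. The following is a minimal sketch and not part of the repository: the helper names build_commands and run_pipeline are hypothetical, and the code only assembles the command lines listed above so that main.py always runs before the scripts that depend on it. It assumes it is started from the main directory.

```python
import subprocess
import sys


def build_commands(dumps="test_dumps", input_dir="../input/",
                   database="../output/database.db", output_dir="../output/"):
    """Return the three documented commands in the order they must run.

    Hypothetical helper: it only mirrors the command lines from this README.
    """
    py = sys.executable
    return [
        [py, "main.py", "--input", input_dir, "--dumps", dumps,
         "--download", "yes", "--extract", "yes", "--output", database],
        [py, "context.py", "--input", input_dir, "--dumps", dumps,
         "--download", "yes", "--database", database,
         "--context", "10", "--topn", "3", "--tablename", "FormulaContext"],
        [py, "statistics.py", "--dumps", dumps,
         "--database", database, "--output", output_dir],
    ]


def run_pipeline(**kwargs):
    """Run each step in turn; check=True stops at the first failing step."""
    for command in build_commands(**kwargs):
        subprocess.run(command, check=True)
```

Calling run_pipeline() with no arguments reproduces the commands above; passing dumps="mathjax_dumps", for example, would switch the list of sites, assuming that file is in the expected format.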

Using the Dataset

The data from the database can be accessed via SQL commands, e.g.:

sqlite3
.open output/database.db
SELECT * FROM FormulasPosts WHERE FormulaId=5783548;
SELECT Site, Tag FROM Tags WHERE Count>1500;
SELECT AnswerText.Site, AnswerText.AnswerId, AnswerText.Body
   FROM AnswerText INNER JOIN AnswerMeta ON AnswerText.Site=AnswerMeta.Site
   AND AnswerText.AnswerId=AnswerMeta.AnswerId AND AnswerMeta.Score=100;
SELECT max(Score) FROM QuestionMeta;
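
The same kind of query can be issued from Python with the standard sqlite3 module. This is a minimal sketch: the table and column names (Tags, Site, Tag, Count) are taken from the queries above, while the column types, the helper names and the sample rows in demo_connection are assumptions made only so the example runs without the real database.

```python
import sqlite3


def frequent_tags(conn, min_count):
    """Return (Site, Tag) rows whose Count exceeds min_count."""
    cursor = conn.execute(
        "SELECT Site, Tag FROM Tags WHERE Count > ? ORDER BY Site, Tag",
        (min_count,),
    )
    return cursor.fetchall()


def demo_connection():
    """Build a tiny in-memory stand-in for the Tags table (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    # Assumed column types; the real schema may differ.
    conn.execute("CREATE TABLE Tags (Site TEXT, Tag TEXT, Count INTEGER)")
    conn.executemany(
        "INSERT INTO Tags VALUES (?, ?, ?)",
        [("mathematics", "calculus", 2000),
         ("mathematics", "puzzle", 12),
         ("physics", "optics", 1600)],
    )
    return conn
```

Against the real dataset, the connection would instead be opened with sqlite3.connect("../output/database.db"). Passing the threshold as a bound parameter (the ? placeholder) avoids quoting issues in the SQL text.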

main.py

Parameters of main.py

  • input: Input directory of the Stack Exchange dump *.7z files. This is where existing dumps are read from and where new dumps are downloaded to.
    • default= "../input/"
  • dumps: A text file containing a list of Stack Exchange dump site names to be processed.
    • format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps.
    • default="test_dumps" in main directory
  • download: Whether or not the program should download the dumps.
    • options: yes or no
    • default="yes"
  • extract: Whether or not to extract the *.7z dump files.
    • options: yes or no
    • default="yes"
  • output: Path of the output database file.
    • default='../output/database.db'
  • all: Force processing of all dumps, even if they have been processed before and already exist in the database.
    • options: yes or no
    • default="no"
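
As an illustration of how these options fit together, the following sketch declares them with argparse. It is not the parser used by main.py; build_parser is a hypothetical name, and only the option names, choices and defaults are taken from the list above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented main.py options."""
    parser = argparse.ArgumentParser(description="Sketch of the main.py options")
    parser.add_argument("--input", default="../input/",
                        help="directory of the *.7z dump files")
    parser.add_argument("--dumps", default="test_dumps",
                        help="text file listing the dump sites to process")
    parser.add_argument("--download", choices=["yes", "no"], default="yes")
    parser.add_argument("--extract", choices=["yes", "no"], default="yes")
    parser.add_argument("--output", default="../output/database.db",
                        help="path of the output database")
    parser.add_argument("--all", choices=["yes", "no"], default="no",
                        help="reprocess dumps that are already in the database")
    return parser
```

With no arguments, build_parser().parse_args([]) yields the documented defaults, so a plain python main.py downloads, extracts and processes the sites listed in test_dumps.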

Outputs of main.py

  • The dataset will be saved in a database (*.db) as specified by the input parameter 'output'.
  • Analysis/statistics will be saved in the file statistics.log in the same directory as the database.

context.py

main.py must have been run for the dumps before running context.py.

Parameters of context.py

  • input: Input directory of the Stack Exchange dump files and directories.
    • default= "../input/"
  • dumps: A text file containing a list of Stack Exchange dump site names to be processed.
    • format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running main.py.
    • default="test_dumps" in main directory
  • database: Database file used for both input and output.
    • default='../output/database.db'
  • context: The number of words around a formula to be regarded as possible context.
    • options: an integer
    • default="10"
  • topn: The number of context terms with the highest tf-idf scores to be retrieved as formula context.
    • default='3'
  • corpus: Whether the corpus for idf ratings should be calculated over all sites or individually for each site.
    • options: all or individual
    • default="all"
  • stopwords: Whether or not to use stopwords, and which ones to use.
    • options: none, english, or the name of a file containing a list of stopwords (e.g. stopwords). 'english' is the English stopword list of the NLTK toolkit.
    • default="stopwords"
  • tfidf: Whether or not to store the tf-idf ratings together with the top context words.
    • options: yes, no
    • default="yes"
  • tablename: Name of the table into which the topn context words of the formulas are written (the table will be overwritten if it exists).
    • default="FormulaContext"
  • all: Get all words as context. This will lead to ignoring the values of input parameters for context and topn.
    • options: yes or no
    • default="no"
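
To make the roles of context and topn concrete, the sketch below takes a window of words around a formula and ranks them by tf-idf. It is an illustration under stated assumptions, not the code in context.py: the function names are hypothetical, the window is assumed to extend context words to each side of the formula, the corpus is assumed to be a list of lowercased token lists, and a smoothed idf variant is used.

```python
import math
from collections import Counter


def context_window(tokens, formula_index, size):
    """Return up to `size` words on each side of the formula token."""
    start = max(0, formula_index - size)
    before = tokens[start:formula_index]
    after = tokens[formula_index + 1:formula_index + 1 + size]
    return before + after


def top_terms(window, corpus, topn, stopwords=frozenset()):
    """Rank window words by tf-idf against `corpus` (a list of token lists)."""
    words = [w.lower() for w in window if w.lower() not in stopwords]
    if not words:
        return []
    term_counts = Counter(words)
    scores = {}
    for term, count in term_counts.items():
        tf = count / len(words)
        doc_freq = sum(1 for doc in corpus if term in doc)
        # Smoothed idf so terms missing from the corpus do not divide by zero.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1
        scores[term] = tf * idf
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:topn]
```

For the sentence "the derivative of FORMULA is the slope" with the formula at index 3 and size 2, context_window returns the four neighbouring words. With "the", "of" and "is" as stopwords, top_terms then keeps only "derivative". Choosing corpus "all" or "individual" would correspond to passing the documents of all sites or of a single site as the corpus argument.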

Outputs of context.py

  • The context of the formulas of the specified sites will be saved in the database (*.db) given by the input parameter 'database', inside the table given by the input parameter 'tablename'.
  • Analysis/statistics will be saved in the file statistics.log in the same directory as the database.

parse_formulas.py

main.py must have been run for the dumps before running parse_formulas.py.

Because LaTeXML has a long runtime, parse_formulas.py only parses formulas that have not been parsed yet.

Parameters of parse_formulas.py

  • dumps: A text file containing a list of Stack Exchange dump site names to be processed.
    • format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running main.py.
    • default="test_dumps" in main directory
  • database: Database file used for both input and output.
    • default='../output/database.db'
  • mode: Parse formulas into ContentMathML (cmml), PresentationMathML (pmml), or both.
    • options: cmml, pmml, both
  • tree: Whether or not to calculate symbol layout trees (SLT) from pmml and operator trees (OPT) from cmml.
    • options: yes, no
    • default="yes"
  • threads: Number of threads to run in parallel. Each thread converts a single formula into MathML at a time.
    • options: integer
    • default="20"
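
The threads option describes a common pattern: a pool of workers, each converting one formula at a time. The following sketch shows that pattern with the standard library. It is not the implementation in parse_formulas.py; convert_all and fake_convert are hypothetical names, and fake_convert merely stands in for a real call to LaTeXML.

```python
from concurrent.futures import ThreadPoolExecutor


def convert_all(formulas, convert, threads=20):
    """Convert each (formula_id, latex) pair with `convert`, one task per formula.

    `convert` is a hypothetical callable that takes a LaTeX string and returns
    MathML; in practice it would wrap a call to LaTeXML.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {pool.submit(convert, latex): formula_id
                   for formula_id, latex in formulas}
        for future, formula_id in futures.items():
            try:
                results[formula_id] = future.result()
            except Exception:
                # A failed conversion is recorded as None instead of stopping the run.
                results[formula_id] = None
    return results


def fake_convert(latex):
    """Stand-in converter used for the demonstration only."""
    if not latex:
        raise ValueError("empty formula")
    return "<math>" + latex + "</math>"
```

Threads suit this workload when each conversion spends its time waiting on an external process, since the Python interpreter lock is released while waiting. Recording failures per formula keeps one malformed formula from aborting a long run.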

Outputs of parse_formulas.py

  • The ContentMathML and/or PresentationMathML of the formulas of the specified sites will be saved in the database (*.db) given by the input parameter 'database', inside the tables 'FormulasPostsMathML' and 'FormulasCommentsMathML'.
  • Analysis/statistics will be saved in the file statistics.log in the same directory as the database.

statistics.py

main.py must have been run for the dumps before running statistics.py.

Parameters of statistics.py

  • dumps: A text file containing a list of Stack Exchange dump site names to be processed.
    • format: file containing a list of dump sites. Viable options can be found in the file /main/mathjax_dumps. The sites must already have been processed into the database by running main.py.
    • default='test_dumps' in main directory
  • database: Database file for data input.
    • default='../output/database.db'
  • output: The output directory.
    • default='../output/'

Outputs of statistics.py

  • Directory '/diagrams/' in the output directory (as specified by the user) filled with diagrams of formula distributions for the sites specified by the user.
  • HTML files displaying the diagram and some statistical values for each of the sites.
