Difference between revisions of "Redesigning Patent Database"
ShelbyBice (talk | contribs) |
|||
Line 89: | Line 89: | ||
Here at the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database since the documentation is very sparse (can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data form USPTO bulk data. | Here at the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database since the documentation is very sparse (can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data form USPTO bulk data. | ||
− | How to run Perl Scripts to extract Patent Data | + | ===How to run Perl Scripts to extract Patent Data=== |
1) map the network drive to your computer (instructions can be found under Help for New Staff | 1) map the network drive to your computer (instructions can be found under Help for New Staff | ||
scroll down to the "Working with the infrastructure and click on the link to "How to Map the network | scroll down to the "Working with the infrastructure and click on the link to "How to Map the network | ||
drive) | drive) | ||
− | 2) | + | 2) Don't consult this handy link on how to install and run perl programs (https://www.thoughtco.com/how-to-install-and-run-perl-2641103) because perl is already installed for you to use. Just open a command line and type |
+ | perl scriptname.pl | ||
3) Now it's time to download the data so it can be parsed. | 3) Now it's time to download the data so it can be parsed. | ||
Revision as of 16:31, 11 April 2017
Redesigning Patent Database | |
---|---|
Project Information | |
Project Title | Redesigning Patent Database |
Owner | Shelby Bice |
Start Date | 201701 |
Deadline | 201705 |
Keywords | Database, Patent |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Documentation on the process and eventual designs for the new patent database. Not to be confused with "Patent Data Restructure", which deals with condensing and cleaning USPTO assignees data.
Deliverables by end of semester: Design and build new patent database with core tables from data, add tables from Patent Data Restructure and Small Inventors Project, write thorough documentation on schema of new patent database and instructions on what documentation should be written when it is altered
TODO on 3/28/2017
- Find all SQL and documentation related to merging Harvard and USPTO data
- Start documentation for the "core" tables of the databases that should not be deleted.
- Make ER Diagram for "core" tables in database
- Start documentation for how one should go about changing/altering the "core" tables
Future:
- Start documentation for how new tables should be added to database
Contents
Redesigning Patent Database
- Design a better representation for database
- Fix scripts if necessary
- Start moving data into new database by querying existing databases (using SQL)
- Use scripts to query new data
- Test database
- Remove extraneous information from database (copies, patents that we're not interested in, etc.)
Documentation I need to include:
- Schema of new database (with justification of design), would like to include a visual representation
- SQL commands that were used to fill database with explanation of what they do
- Clear instructions on where to find scripts in bulk drive and an explanation of what each script does
- Visual representation of example table entries that isn't just copied and pasted from a CSV file
Related Projects
- Patent Data Restructure
- Small Inventors Project - uses Fee Status and Citations
- Medical Centers and Grants - uses patent assignees, specifically their zipcodes and organizations
Documentation Relevant to Current Patent Database
Previous documentation on the patent database:
As of 3/21/2017 the most up-to-date database containing patent data is "patent" not "allpatent" or "allpatent_clone" and "patent" is the database that the the other patent data redesign project, Restructuring Patent Data (link above) is working with.
Patent Data - overview of what the data is and where it came from, probably starting point for changing documentation
Patent Database - overview of schema of database (specifically, the database patent, which includes data from Harvard dataverse (originally stored in patentdata) and USPTO (patent_2015)
USPTO Assignees Database - enhances assignee info in patent database, also being redesigned
Problems with Patent Database - lists issues with current schema
Previous ER Diagram - does not match up with schema described in Patent Database and contains outdated list of what we want to pull from XML files
Processing Patent Data - states that allpatent is the newest database and an amalgamation of patentdata or patent_2015
Description
The purpose of this project is to create a new, redesigned database to hold all of the patent information that the McNair Center has accumulated and document the process so that the design can be easily understood and replicated or edited as needed.
This database will include design patents, utility patents, and reissues.
Development
Design will be built upon a relational database model. I will be referencing this article on database design as I develop the design (http://en.tekstenuitleg.net/articles/software/database-design-tutorial/one-to-many.html), and I will be creating an ER diagram using ERDPlus or Creately.
Current Design and Scripts Documentation
The scripts for querying data for the patent database exist in McNair/software/scripts/patent. If the design for the schema of the database tables are altered too much, may have to write new scripts.
The following pages are relevant to how previous databases are built/how to build tables in the database:
Harvard Dataverse Data - explains how to make tables from Harvard Dataverse data, where to find scripts, etc.
USPTO Data - explains how to make tables from USPTO data, where to find scripts, etc, specifically for assignment data.
Patent Data Extraction - explains locations of XML files and lists (at the bottom) where the Perl scripts can be found
Patent Data Cleanup - explains changes that were made to clean up problems in the database allpatent as a result of merging the Harvard Dataverse data and the USPTO data
Patent Data Processing - SQL Steps - explains SQL needed to merge two existing databases, one that contained the Harvard Dataverse data and one that contained the USPTO data
Here at the instructions I'm developing for downloading, parsing, and hopefully adding new data to the database since the documentation is very sparse (can also be found under McNair/Projects/Redesigning Patent Database/Instructions on how to download patent data form USPTO bulk data.
How to run Perl Scripts to extract Patent Data
1) map the network drive to your computer (instructions can be found under Help for New Staff scroll down to the "Working with the infrastructure and click on the link to "How to Map the network drive) 2) Don't consult this handy link on how to install and run perl programs (https://www.thoughtco.com/how-to-install-and-run-perl-2641103) because perl is already installed for you to use. Just open a command line and type
perl scriptname.pl
3) Now it's time to download the data so it can be parsed.
For USPTO Assignment Data, there appears to be a script, under McNair/usptoAssignment, called USPTO_Assignee_Download, which appears to let a user pass it a url and then ir downloads all th zip files available at that URL. It then places the downloaded zip files in "E:/McNair/usptoAssigneeData/name", where "name" is the name of the file. If you want to check which files have already been processed, check "McNair/usptoAssigneeData/Finished" to see the finished zip files.
The equivalent for patent data is called "USPTO_Parser" and can be found under McNair/Software/Scripts/Patent. Instead of taking a url, as the USPTO_Assignee Download does, ir takes two arguments, year 1 and year2, which are supposed to represent the range of data that you wish to download (for example, 2015 to 2016). The perl script places the downloaded zip files into "E:/PatentData/name" where "name" is the name of the zip file. This location is not quite accurate - the files are actually stored in what appears to be McNair/PatentData. The folder "Processed" under McNair/PatentData appears to hold all the unzipped zip files that have been downloaded and processed already. So if you are curious if some files have already been processed, you could look there. They are organized by year.
Now to actually run the scripts:
- Insert how one would do this once I figure it out, I searched online to try to troubleshoot
why I could not run the scripts, but I couldn't figure it out 4/11/2017*
4) The next step would be to parse the actual files.
For the patent data files, based on the existing documentation, it looks like PatentParser, found in McNair/Software/Scripts/Patent, has to be run on each xml file that was downloaded and unzipped during the previous step. It then stores the parsed xml files all in a text file called "Results.txt" (which I assume will have to be deleted afterward). Thi script utilizes the Claim.pm, Inventor.pm, PatentApplication.pm, and Loader.pm modules. It no longer uses the AddressBook.pm module.
- If we have a perl module for getting the inventor, why do we not have an inventors table in the database?*
For the USPTO Assignment Data, the parsing file is called USPTO_Assignee_XML_parser. It takes the path to the files that need to be parsed (an example I think is ":E/PatentData/Processed/year" where "year" is the name of the folder you've placed the xml files to be parsed. It iterates through all the files in the "year" directory that you passed. This file directly loads the information into the database while it parses the file.
5) For patent data, I assume the next step would be to create a table from the text file - possibly using CreateTables (a PostgreSQL file).
Specifications of USPTO Data To Extract
Go to https://bulkdata.uspto.gov/ to bulk data from USPTO.
To see a description of what each file the USPTO bulk data contains, go to the bulk drive and navigate to McNair/Projects/Redesigning Patent Database/2017BulkDataProductDescriptions. This gives an overview, but does not explain how the XML files are structured. Those are the DTDs.
For assignment data, we pull from https://bulkdata.uspto.gov/data2/patent/assignment/. A description of all the elements in an USPTO assignment XML file is called a DTD, and the one for the assignment data can be found in the bulk drive under McNair/Project/Redesigning Patent Database/USPTO Assignment DTD
I'm currently looking at the DTsD for USPTO patent data (2005 and up DTDs have to be opened by Microsoft Visual Studio) to ascertain if there are any fields we're not currently pulling from the bulk data that we should. I am using the following link to figure out how to read a DTD: http://www.ldodds.com/delta/dtd_guide.html. There are only DTDs for 2005 and up, but there is a very long pdf that appears to detail the format of patent files pre - 2005. It has been saved under McNair/Projects/Redesigning Patent Database/
Pre - 2005
According to the documentation, the inventor's name and city are required to be listed, and the street, state, and country of the inventor may also be listed.
Paragraphs of abstract may also be listed under logical group "Abstract" in a field called "abstract":
2005 - present
The USPTO patent data for 2005 seems to include multiple paragraphs for the abstract under an element called "abstract". I've included the line from the DTD below:
<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>
<!ATTLIST abstract id ID #IMPLIED lang CDATA #IMPLIED status CDATA #IMPLIED >
An abstract is required to be included for all patents that are not design patents, and according to the DTD must have at least one paragraph element.
As far as inventors go, it looks like for 2005 - up patents, the inventor is simply the applicant. this is copied from the DTD:
<!ELEMENT applicant (addressbook+ , nationality , residence , us-rights* , designated-states? , designated-states-as-inventor?)>
<!ATTLIST applicant sequence CDATA #REQUIRED app-type (applicant | applicant-inventor ) #REQUIRED designation (all | all-except-us | us-only | as-indicated ) #REQUIRED >
If for some reason the inventor could not be the applicant, this element would contain the inventor:
<!ELEMENT inventor (addressbook+ , designated-states?)>
<!ATTLIST inventor sequence CDATA #REQUIRED designation (all | all-except-us | us-only | as-indicated ) #IMPLIED >
<!ELEMENT inventors (inventor | deceased-inventor)+>
It seems that we have already pulled patent abstracts before with the existing scripts. I found a huge word document with abstracts under McNair/Software/Scripts/Patent/Abstracts_2016.
Test Plan
Log:
2/16/2017 - Talked over project with Ed, began reading existing wiki pages related to patent data and databases
2/21/2017 - Brushed up on SQL, Entity - Relationship model of designing databases
- In the documentation, I want to briefly explain what the entity-relationship model is before including
the diagram so that readers have a little bit of background
- Found a tool for creating a visual representation called ERDPlus.com - create a standalone instead of an account, can download
Learning commands from Patent Data - SQL Steps
- copy command is PostgreSQL that copies a SQL table to a text file
- DELIMITER set what will separate columns in text file
- HEADER specifies that there will be a header in the text file with the names of the columns
- Definitely need to include more detail about what these do in the documentation
- insert into command inserts a new entry into the table
2/23/2017 - Read great database design article, dug through some more wiki articles, started reviewing Perl
- What client do we use to interact with the current patent database?
- Will need to determine all the fields that need to be included in the database before finishing the design and ER diagram, will need Ed's input
3/2/2017 - Started compiling a list of what fields to include and how they would be related.
- Created an Excel spreadsheet that records the each table, their current attributes in the existing patentdata table, what I think the attributes should be in the new table, their relationship to a patent (i.e. one-to-many, many-to-many, etc.), their primary key, questions I have relating to the table, future steps for cleaning up the data in the table (i.e., once all the data has been move to the new database, removing patents that are not US-based), and current problems that have been recorded with the existing table for that information (if an existing table exists)
- Once Excel spreadsheet is completed (and questions in the Questions column are answered or removed from the spreadsheet entirely) I will look into trying to embed it on my "Redesigning Patent Database" wiki page so that future users can sort of follow my thought process. I will also create separate wiki pages to explain each table once the new database is created
- instructions for adding a table to a wiki page https://www.mediawiki.org/wiki/Help:Tables
- Making spreadsheet led me to realize there is some data that is repeated (filedate in fee table when it is also located in patent table, and the fee table includes the patent
3/7/2017 - Continued working on spreadsheet, update project page with relevant links, developed following plan for documentation
- Most important thing for new documentation in my opinion is a clear flow, i.e., one central page on the database, which contains links to the following:
- Where data is (patent data) where it came from (Harvard Dataverse, etc), the scripts/SQL used to obtain the data, and where original xml files can be located
- Next, there should be an ER diagram that shows the general design, followed by short descriptions of each table is, what it's related to, and what it's use is
- Each table should contain a link to a separate wiki page on the table which will explain all the columns, where the data came from, provide a visual example of data in the table, list issues, and also contain a log of what has been done to the table over time
- there should also be instructions on the page for how, if someone is going to change the database (add a table, alter a column, etc.) which documentation they should update (the wikipage for the table, for example)
- Links to USPTO Assignee Database with an explanation of how it is joined with the Patent Database
- Lastly, everything should be dated. I know that each wiki-page lists when it was last edited, but the main page, next to table descriptions, should have the date the table was originally added. The ER diagram should be updated as needed and also dated so that readers know if it's the most current representation of the database.
3/9/2017 - Finished up first draft of spreadsheet, found ER diagram tools
- Old ER diagram (see Documentation Relevant to Current Databases) is difficult to read - it appears to show tables that don't exist and data that doesn't exist in the current schema
- psql allpatent does not work. psql allpatent_clone does for some reason
3/21/2017 - Working on determining "core" tablse that will always be in database and understanding work on assignees info
- http://postgresonline.com/special_feature.php?sf_name=postgresql83_psql_cheatsheet&outputformat=html
- helpful commands for interacting with psql
- Notes on determining "core" tables for database are saved under Projects/Redesigning Patent Database/Determining which tables should be core tables
3/23/2017 - Made spreadsheet with core tables and share it with other people working on patent data to get their approval
- This spreadsheet can be found under Projects/Redesigning Patent Database/Core Tables for New Patent Database
- Some tables that will later be deleted were included on the spreadsheet because their are currently being tables built to replace them
- May try to just move all the (twenty-something) "pto-" tables that have been created due to the "Restructuring Patent Data" project from "patent" to the new database
- Will work on understanding SQL for filling new database from this link next week http://mcnair.bakerinstitute.org/wiki/Patent_Data_Processing_-_SQL_Steps and http://mcnair.bakerinstitute.org/wiki/Patent_Data_Cleanup_-_June_2016
4/4/2017 - Found all the pages on extracting data and making tables and databases
- Do not need to pull Harvard Dataverse data again, it's saved in CSV files on the bulk drive
- Started looking through DTDs for USPTO patent and assignment data to determine if there is extra information that we should extract from USPTO data.