Difference between revisions of "Patent Data Issues"
(Created page with "===Citations Table=== The table has two columns, 'citingpatentnumber' and 'citedpatentnumber'. There are rows with 'citedpatentnumber' greater than 10000000. For instance:...") |
SONIAZHANG (talk | contribs) |
||
(24 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
===Citations Table=== | ===Citations Table=== | ||
+ | <section begin=citation /> | ||
The table has two columns, 'citingpatentnumber' and 'citedpatentnumber'. There are rows with 'citedpatentnumber' greater than 10000000. For instance: | The table has two columns, 'citingpatentnumber' and 'citedpatentnumber'. There are rows with 'citedpatentnumber' greater than 10000000. For instance: | ||
Line 84: | Line 85: | ||
The foreign cited patents will be moved to a separate table. | The foreign cited patents will be moved to a separate table. | ||
− | === | + | <section end=citation /> |
+ | |||
+ | ===Assigneeinfo Table=== | ||
+ | <section begin=assignee /> | ||
'Country' is missing. | 'Country' is missing. | ||
− | + | allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country = ''; | |
− | + | count | |
− | + | --------- | |
2361543 | 2361543 | ||
− | + | (1 row) | |
− | + | allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country='unknown'; | |
− | allpatent_clone=# SELECT COUNT(*) FROM | ||
count | count | ||
------- | ------- | ||
Line 100: | Line 103: | ||
(1 row) | (1 row) | ||
− | allpatent_clone=# SELECT COUNT(*) FROM | + | allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country IN ('unknown',' ') AND state=' '; |
count | count | ||
--------- | --------- | ||
Line 108: | Line 111: | ||
UPDATE: The source of the problem seems to be the Harvard Dataverse. | UPDATE: The source of the problem seems to be the Harvard Dataverse. | ||
+ | <section end=assignee /> | ||
===Assignees & assigneesUSU Tables=== | ===Assignees & assigneesUSU Tables=== | ||
− | + | <section begin=assignee2 /> | |
No information provided about the assignee. No entries for orgname or first and last names. | No information provided about the assignee. No entries for orgname or first and last names. | ||
Line 132: | Line 136: | ||
344793 | 344793 | ||
(1 row) | (1 row) | ||
+ | <section end=assignee2 /> | ||
+ | |||
+ | ===HistPatent Table=== | ||
+ | <section begin="histpatent" /> | ||
+ | The USPTO bulk data contains negative patent numbers. | ||
+ | |||
+ | applicationid | pubno | patent | nber | uspc | uspc_sub | applicationdate | prioritydate | pubdate | displaydate | disptype | exp_dt | exp_dt_max | pta | ||
+ | --------------+-------+----------+------+------+----------+-----------------+--------------+---------+-------------+----------+------------+------------+------ | ||
+ | 8466602 | | -2037052 | 23 | 347 | 259 | 1995-06-06 | 1995-06-06 | | 1997-06-19 | ABN | | | 0 | ||
+ | 8605804 | | -2962913 | 70 | 395 | 500 | 1996-02-23 | 1996-02-23 | | 1998-08-22 | ABN | | | 0 | ||
+ | | | 1332054 | 63 | 112 | 153 | | | | 1920-02-24 | ISS | 1937-02-23 | | 0 | ||
+ | |||
+ | Application numbers are 8 digits long (99/999999). The first two digits are a series code and the last six represent a serial code assigned by the USPTO. The histpatent table has over 3 million application numbers with 7 digits, since leading zeros were dropped. Leading zeros were also dropped for patent numbers. | ||
+ | |||
+ | patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid > 9999999; | ||
+ | count | ||
+ | --------- | ||
+ | 4048050 | ||
+ | (1 row) | ||
+ | |||
+ | patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid <9999999; | ||
+ | count | ||
+ | --------- | ||
+ | 3028846 | ||
+ | (1 row) | ||
+ | |||
+ | <section end="histpatent /> | ||
+ | |||
+ | ===LexJudge=== | ||
+ | <section begin="lexjudge" /> | ||
+ | There are duplicate entries for certain judges that may be dropped once data on patent litigation has been added. | ||
+ | |||
+ | name | court | count | ||
+ | -----------------------------+-------+------- | ||
+ | Malcolm Jones Howard | EDNC | 2 | ||
+ | Richard Leroy Williams | EDVa | 2 | ||
+ | Peter Jo Messitte | DMd | 2 | ||
+ | Andre Maurice Davis | DMd | 2 | ||
+ | Paula Xinis | DMd | 2 | ||
+ | Mary Hannah Lauck | EDVa | 2 | ||
+ | Julie E. Carnes | NDGa | 2 | ||
+ | Claude M. Hilton | EDVa | 2 | ||
+ | William D. Quarles Jr. | DMd | 2 | ||
+ | James C. Dever III | EDNC | 2 | ||
+ | Catherine C. Blake | DMd | 2 | ||
+ | Gerald Bruce Lee | EDVa | 2 | ||
+ | Leonie M. Brinkema | EDVa | 2 | ||
+ | <section end="lexjudge" /> | ||
+ | |||
+ | ===PatentId=== | ||
+ | <section begin="patentid" /> | ||
+ | The patentid table has many entries that do not consist of only integers. | ||
+ | |||
+ | SELECT COUNT(*) FROM patentid WHERE patentid ~ '[A-Z]'; | ||
+ | count | ||
+ | ------- | ||
+ | 775982 | ||
+ | |||
+ | SELECT COUNT(*) FROM patentid WHERE patentid ~ '[^A-Z]'; | ||
+ | count | ||
+ | -------- | ||
+ | 27294418 | ||
+ | <section end="patentid" /> | ||
+ | [[Category: Internal]] | ||
+ | [[Internal Classification: Data Tables| ]] |
Latest revision as of 16:53, 1 March 2017
Contents
Citations Table
The table has two columns, 'citingpatentnumber' and 'citedpatentnumber'. There are rows with 'citedpatentnumber' greater than 10000000. For instance:
citingpatentnumber | citedpatentnumber --------------------+------------------- 9226901 | 102005013726 9226905 | 101332187 9226905 | 2006528175 9226905 | 2011513304 9226905 | 1020090061010 9226905 | 1020110049808 9226905 | 2010126349 9226909 | 101340916 9226909 | 10128910 9226914 | 102318827 9226914 | 2014109862 9226915 | 10111049 9226918 | 2008005345 9226918 | 2008005345 9226918 | 2008077092 9226918 | 2008077092 9226918 | 2008070268 9226918 | 2008128126 9226921 | 2008129994 9226922 | 2012000595 9226922 | 2012058127 9226923 | 2010135524 9226930 | 102002040 9226930 | 19507522 9226930 | 50106981 9226930 | 53082783 9226930 | 57052334 9226930 | 10029979 9226930 | 10045750 9226931 | 2008203212 9226931 | 2010227111 9226931 | 2012068515 9226931 | 2013106565 9226932 | 103458880 9226933 | 2006257751 9226933 | 101366734 9226933 | 101396048 9226933 | 101926831 9226933 | 20202562 9226933 | 202005009120 9226933 | 61063618 9226937 | 11510473 9226937 | 2001506579 9226937 | 2001519791 9226937 | 2003119127 9226937 | 2003517831 9226937 | 2005099761 9226937 | 2006076681 9226937 | 2006078941 9226937 | 2006079021 9226937 | 2006094209 9226937 | 2006094210 9226937 | 2006094233 9226937 | 2006094235
allpatent_clone=# SELECT COUNT(*) FROM citations WHERE citedpatentnumber>10000000; count --------- 1411140 (1 row)
allpatent_clone=# SELECT COUNT(*) FROM citations WHERE citedpatentnumber IS NULL; count ---------- 23516667 (1 row)
allpatent_clone=# SELECT COUNT(*) FROM citations; count ---------- 97680838 (1 row)
Possible Solution
Jul 8, 2016: The blank citedpatentnumbers were created due to inconsistency between the original type and the type in the citation table (string to integer). The blank entries mostly correspond to publication number, non-U.S. patent number, and non-standardized patent number. The next step would be to recreate the table accounting for these issues. The U.S. publication number could be matched to the publications numbers in the histpatent table and be replaced by the corresponding patent numbers.
The foreign cited patents will be moved to a separate table.
Assigneeinfo Table
'Country' is missing.
allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country = ;
count
2361543
(1 row)
allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country='unknown'; count ------- 3918 (1 row)
allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country IN ('unknown',' ') AND state=' '; count --------- 1851353 (1 row)
UPDATE: The source of the problem seems to be the Harvard Dataverse.
Assignees & assigneesUSU Tables
No information provided about the assignee. No entries for orgname or first and last names.
allpatent_clone=# SELECT COUNT(patentnumber) FROM assignees WHERE (orgname= OR orgname IS NULL) AND (firstname= OR firstname IS NULL) AND (lastname= OR lastname IS NULL); count -------- 344794 (1 row)
AssigneesUSU table was made with the following code:
SELECT orgname, patentnumber, country, firstname, lastname, state INTO assigneesUSU -- assignees in US,unknown, and blank entries FROM assignees WHERE country IN ('US', , 'unknown') OR (state IS NOT NULL AND state!=) ORDER BY orgname;
allpatent_clone=# SELECT COUNT(patentnumber) FROM assigneesUSU WHERE (orgname= OR orgname IS NULL) AND (firstname= OR firstname IS NULL) AND (lastname= OR lastname IS NULL); count -------- 344793 (1 row)
HistPatent Table
The USPTO bulk data contains negative patent numbers.
applicationid | pubno | patent | nber | uspc | uspc_sub | applicationdate | prioritydate | pubdate | displaydate | disptype | exp_dt | exp_dt_max | pta --------------+-------+----------+------+------+----------+-----------------+--------------+---------+-------------+----------+------------+------------+------ 8466602 | | -2037052 | 23 | 347 | 259 | 1995-06-06 | 1995-06-06 | | 1997-06-19 | ABN | | | 0 8605804 | | -2962913 | 70 | 395 | 500 | 1996-02-23 | 1996-02-23 | | 1998-08-22 | ABN | | | 0 | | 1332054 | 63 | 112 | 153 | | | | 1920-02-24 | ISS | 1937-02-23 | | 0
Application numbers are 8 digits long (99/999999). The first two digits are a series code and the last six represent a serial code assigned by the USPTO. The histpatent table has over 3 million application numbers with 7 digits, since leading zeros were dropped. Leading zeros were also dropped for patent numbers.
patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid > 9999999; count --------- 4048050 (1 row)
patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid <9999999; count --------- 3028846 (1 row)
LexJudge
There are duplicate entries for certain judges that may be dropped once data on patent litigation has been added.
name | court | count -----------------------------+-------+------- Malcolm Jones Howard | EDNC | 2 Richard Leroy Williams | EDVa | 2 Peter Jo Messitte | DMd | 2 Andre Maurice Davis | DMd | 2 Paula Xinis | DMd | 2 Mary Hannah Lauck | EDVa | 2 Julie E. Carnes | NDGa | 2 Claude M. Hilton | EDVa | 2 William D. Quarles Jr. | DMd | 2 James C. Dever III | EDNC | 2 Catherine C. Blake | DMd | 2 Gerald Bruce Lee | EDVa | 2 Leonie M. Brinkema | EDVa | 2
PatentId
The patentid table has many entries that do not consist of only integers.
SELECT COUNT(*) FROM patentid WHERE patentid ~ '[A-Z]'; count ------- 775982
SELECT COUNT(*) FROM patentid WHERE patentid ~ '[^A-Z]'; count -------- 27294418