Difference between revisions of "Patent Data Issues"

From edegan.com
Jump to navigation Jump to search
 
(10 intermediate revisions by 3 users not shown)
Line 87: Line 87:
 
<section end=citation />
 
<section end=citation />
  
===Assignees Table===
+
===Assigneeinfo Table===
 
<section begin=assignee />
 
<section begin=assignee />
 
'Country' is missing.
 
'Country' is missing.
  
allpatent_clone=# SELECT COUNT(*) FROM assignees WHERE country=' ';
+
allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country = '';
  count   
+
count   
---------
+
---------
 
  2361543
 
  2361543
(1 row)
+
(1 row)
  
 
+
  allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country='unknown';
  allpatent_clone=# SELECT COUNT(*) FROM assignees WHERE country='unknown';
 
 
  count  
 
  count  
 
  -------
 
  -------
Line 104: Line 103:
 
  (1 row)
 
  (1 row)
  
  allpatent_clone=# SELECT COUNT(*) FROM assignees WHERE country IN ('unknown',' ') AND state=' ';
+
  allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country IN ('unknown',' ') AND state=' ';
 
   count   
 
   count   
 
  ---------
 
  ---------
Line 149: Line 148:
 
               |      | 1332054  |  63 | 112  | 153      |                |              |        | 1920-02-24  | ISS      | 1937-02-23 |            |    0
 
               |      | 1332054  |  63 | 112  | 153      |                |              |        | 1920-02-24  | ISS      | 1937-02-23 |            |    0
  
Application numbers are 8 digits long  (99/999999). The first two digits are a series code and the last six represent a serial code assigned by the USPTO. The histpatent table has over 3 million application numbers with 7 digits, since leading zeros were dropped.  
+
Application numbers are 8 digits long  (99/999999). The first two digits are a series code and the last six represent a serial code assigned by the USPTO. The histpatent table has over 3 million application numbers with 7 digits, since leading zeros were dropped. Leading zeros were also dropped for patent numbers.  
  
 
  patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid > 9999999;  
 
  patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid > 9999999;  
Line 164: Line 163:
  
 
<section end="histpatent />
 
<section end="histpatent />
 +
 +
===LexJudge===
 +
<section begin="lexjudge" />
 +
There are duplicate entries for certain judges that may be dropped once data on patent litigation has been added.
 +
 +
            name            | court | count
 +
-----------------------------+-------+-------
 +
Malcolm Jones Howard        | EDNC  |    2
 +
Richard Leroy Williams      | EDVa  |    2
 +
Peter Jo Messitte            | DMd  |    2
 +
Andre Maurice Davis          | DMd  |    2
 +
Paula Xinis                  | DMd  |    2
 +
Mary Hannah Lauck            | EDVa  |    2
 +
Julie E. Carnes              | NDGa  |    2
 +
Claude M. Hilton            | EDVa  |    2
 +
William D. Quarles Jr.      | DMd  |    2
 +
James C. Dever III          | EDNC  |    2
 +
Catherine C. Blake          | DMd  |    2
 +
Gerald Bruce Lee            | EDVa  |    2
 +
Leonie M. Brinkema          | EDVa  |    2
 +
<section end="lexjudge" />
 +
 +
===PatentId===
 +
<section begin="patentid" />
 +
The patentid table has many entries that do not consist of only integers.
 +
 +
SELECT COUNT(*) FROM patentid WHERE patentid ~ '[A-Z]';
 +
  count 
 +
-------
 +
775982
 +
 +
SELECT COUNT(*) FROM patentid WHERE patentid ~ '[^A-Z]';
 +
  count 
 +
--------
 +
27294418
 +
<section end="patentid" />
 +
[[Category: Internal]]
 +
[[Internal Classification: Data Tables| ]]

Latest revision as of 16:53, 1 March 2017

Citations Table

The table has two columns, 'citingpatentnumber' and 'citedpatentnumber'. There are rows with 'citedpatentnumber' greater than 10000000. For instance:

 citingpatentnumber | citedpatentnumber 
--------------------+-------------------
           9226901 |      102005013726
           9226905 |         101332187
           9226905 |        2006528175
           9226905 |        2011513304
           9226905 |     1020090061010
           9226905 |     1020110049808
           9226905 |        2010126349
           9226909 |         101340916
           9226909 |          10128910
           9226914 |         102318827
           9226914 |        2014109862
           9226915 |          10111049
           9226918 |        2008005345
           9226918 |        2008005345
           9226918 |        2008077092
           9226918 |        2008077092
           9226918 |        2008070268
           9226918 |        2008128126
           9226921 |        2008129994
           9226922 |        2012000595
           9226922 |        2012058127
           9226923 |        2010135524
           9226930 |         102002040
           9226930 |          19507522
           9226930 |          50106981
           9226930 |          53082783
           9226930 |          57052334
           9226930 |          10029979
           9226930 |          10045750
           9226931 |        2008203212
           9226931 |        2010227111
           9226931 |        2012068515
           9226931 |        2013106565
           9226932 |         103458880
           9226933 |        2006257751
           9226933 |         101366734
           9226933 |         101396048
           9226933 |         101926831
           9226933 |          20202562
           9226933 |      202005009120
           9226933 |          61063618
           9226937 |          11510473
           9226937 |        2001506579
           9226937 |        2001519791
           9226937 |        2003119127
           9226937 |        2003517831
           9226937 |        2005099761
           9226937 |        2006076681
           9226937 |        2006078941
           9226937 |        2006079021
           9226937 |        2006094209
           9226937 |        2006094210
           9226937 |        2006094233
           9226937 |        2006094235


allpatent_clone=# SELECT  COUNT(*) FROM citations WHERE citedpatentnumber>10000000;
  count  
---------
1411140
(1 row)


 allpatent_clone=# SELECT  COUNT(*) FROM citations WHERE citedpatentnumber IS NULL;
  count   
----------
 23516667
(1 row)
allpatent_clone=# SELECT  COUNT(*) FROM citations;
 count   
----------
 97680838
(1 row)

Possible Solution

Jul 8, 2016: The blank citedpatentnumbers were created due to inconsistency between the original type and the type in the citation table (string to integer). The blank entries mostly correspond to publication number, non-U.S. patent number, and non-standardized patent number. The next step would be to recreate the table accounting for these issues. The U.S. publication number could be matched to the publications numbers in the histpatent table and be replaced by the corresponding patent numbers.

The foreign cited patents will be moved to a separate table.

Assigneeinfo Table

'Country' is missing.

allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country = ;

count  

2361543

(1 row)

allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country='unknown';
count 
-------
 3918
(1 row)
allpatent_clone=# SELECT COUNT(*) FROM assigneeinfo WHERE country IN ('unknown',' ') AND state=' ';
 count  
---------
1851353
(1 row)


UPDATE: The source of the problem seems to be the Harvard Dataverse.

Assignees & assigneesUSU Tables

No information provided about the assignee. No entries for orgname or first and last names.

allpatent_clone=# SELECT COUNT(patentnumber) FROM assignees WHERE (orgname= OR orgname IS NULL) AND (firstname= OR firstname IS NULL) AND (lastname= OR lastname IS NULL); 
count  
--------
344794
(1 row)

AssigneesUSU table was made with the following code:

SELECT orgname, patentnumber, country, firstname, lastname, state
INTO assigneesUSU -- assignees in US,unknown, and blank entries
FROM assignees
WHERE country IN ('US', , 'unknown') OR (state IS NOT NULL AND state!=)
ORDER BY orgname; 
allpatent_clone=# SELECT COUNT(patentnumber) FROM assigneesUSU WHERE (orgname= OR orgname IS NULL) AND (firstname= OR firstname IS NULL) AND (lastname= OR lastname IS NULL);  
count  
--------
344793
(1 row)

HistPatent Table

The USPTO bulk data contains negative patent numbers.

applicationid | pubno |  patent  | nber | uspc | uspc_sub | applicationdate | prioritydate | pubdate | displaydate | disptype |   exp_dt   | exp_dt_max | pta  
--------------+-------+----------+------+------+----------+-----------------+--------------+---------+-------------+----------+------------+------------+------
      8466602 |       | -2037052 |   23 | 347  | 259      | 1995-06-06      | 1995-06-06   |         | 1997-06-19  | ABN      |            |            |    0
      8605804 |       | -2962913 |   70 | 395  | 500      | 1996-02-23      | 1996-02-23   |         | 1998-08-22  | ABN      |            |            |    0
              |       | 1332054  |   63 | 112  | 153      |                 |              |         | 1920-02-24  | ISS      | 1937-02-23 |            |    0

Application numbers are 8 digits long (99/999999). The first two digits are a series code and the last six represent a serial code assigned by the USPTO. The histpatent table has over 3 million application numbers with 7 digits, since leading zeros were dropped. Leading zeros were also dropped for patent numbers.

patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid > 9999999; 
 count  
---------
4048050
(1 row)
patent=# SELECT COUNT(*) FROM histpatent WHERE applicationid <9999999; 
 count  
---------
3028846
(1 row)

LexJudge

There are duplicate entries for certain judges that may be dropped once data on patent litigation has been added.

            name             | court | count 
-----------------------------+-------+-------
Malcolm Jones Howard         | EDNC  |     2
Richard Leroy Williams       | EDVa  |     2
Peter Jo Messitte            | DMd   |     2
Andre Maurice Davis          | DMd   |     2
Paula Xinis                  | DMd   |     2
Mary Hannah Lauck            | EDVa  |     2
Julie E. Carnes              | NDGa  |     2
Claude M. Hilton             | EDVa  |     2
William D. Quarles Jr.       | DMd   |     2
James C. Dever III           | EDNC  |     2
Catherine C. Blake           | DMd   |     2
Gerald Bruce Lee             | EDVa  |     2
Leonie M. Brinkema           | EDVa  |     2

PatentId

The patentid table has many entries that do not consist of only integers.

SELECT COUNT(*) FROM patentid WHERE patentid ~ '[A-Z]'; 
 count  
-------
775982
SELECT COUNT(*) FROM patentid WHERE patentid ~ '[^A-Z]';
 count   
--------
27294418