Latest revision as of 12:47, 21 September 2020

The objective of this project is come up with a fast, reliable algorithm that finds the smallest circle area that can drawn around N Cartesian points such that each circle contains at least M<N points. This algorithm will be used in an academic paper on Urban Start-up Agglomeration, where the points will represent venture capital backed firms within cities.

To-Do

Project
Enclosing Circle Algorithm
Project Information
Has title	Enclosing Circle Algorithm
Has owner	Christy Warden, Peter Jalbert
Has start date	January 2017
Has deadline date
Has keywords	Tool
Has project status	Subsume
Dependent(s):	Parallel Enclosing Circle Algorithm, Top Cities for VC Backed Companies
Subsumed by:	Enclosing Circle Algorithm (Rework)
Has sponsor	McNair Center
Has project output	Tool
	Copyright © 2019 edegan.com. All Rights Reserved.

1) Cleanup of geocoding

   a) Take from tab delimited not from whitespace
   b) Load in to database
   c) Use google maps API 
   d) Throw out center of world/center of city 
   e) Change script to take Company Year 
   f) 100 largest cities

To-Do 04/26 Update

   a) Thursday 04/27: Plot circles resulting from the new Algorithm and ensure that they are correct, specifically on previously bad outputs. 
   b) Friday 04/28: Look for ways to speed up algorithm (reinstate memoization). Run algorithm over weekend. 
   c) Monday: Hope that it ran over the weekend. Check through data. Debug where necessary. Say many prayers to computer gods.

Overview

This program takes in a set of points and the minimum number that should be included inside a unit, and returns circles of the smallest total area which encompass all of the data points. Function make_circle and all of its helper functions were taken from https://www.nayuki.io/res/smallest-enclosing-circle/smallestenclosingcircle.py.

Update 04/26/17: The most recent Enclosing Circle algorithm is EnclosingCircleRemake.py. I believe that in the past there was an issue with the way that I selected the next point to go into the circle. I was previously choosing the point closest to the arbitrary starting point, rather than the point closest to the center of the new circle, which was pretty illogical. EnclosingCircleRemake should fix this error, but it needs to be tested pretty thoroughly before I am ready to run it on the large data set.

Input: A sequence of pairs of floats or ints, e.g. [(0,5), (3.1,-2.7)]. Output: A triple of floats representing a circle. Returns the smallest circle that encloses all the given points. Runs in expected O(n) time, randomized.

Algorithm Description

Location

The original script is located in:

E:\McNair\Software\CodeBase\EnclosingCircle.py

Explanation

Inputs: A set of points, a minimum number of points to include in each circle.

1) Find a point on the outside of the set of points by choosing the rightmost point in the set.

2) If the quantity of points in the input is less than twice the minimum number of points to include in the circle, return a circle containing all of the points. This is because there is no way to split up the points without having a circle containing less than n points.

3) Sort the points by their distance from starting point (using distance formula). (Changed in enclosing circle remake! Choose the next point to be the one closest to the center of the previous core.)

4) Create a list of points called "Core." The core should contain starting point and the n - 1 points that are closest to it. This is the smallest circle that contains starting point.

5) Run the same algorithm on the rest of the points that are not contained in core and store the area of the core + the area of the result on the remaining points.

6) Add one point (the next closest one to starting point) to the core and then repeat step 5.

7) Repeat step 6 until there aren't enough points outside of the core to constitute a valid circle.

8) Choose the scheme that resulted in the smallest total area.

K-Means Based Algorithm

Location

The script is located in:

E:\McNair\Software\CodeBase\New Implement of Enclosing Circle (Constrained K Means, Smallest Circle)\enclosing_circle_new_implement.py

Explanation

The algorithm relies on an object oriented implementation of a "cluster".
Each "cluster" has the following associated with it:

  1. area of minimum circle enclosing points in the cluster
  2. points associated with the cluster
  3. minimum circle associated with the cluster (x-coord,y-coord,radius)
  4. "parent" cluster
  5. "children" cluster(s)

Inputs:

  1. n, the minimum number of points to include in a circle
  2. k, the amount of clusters to split the points into
  3. valid_nodes, an empty list to store the possible clusters to explore
  4. explored_nodes, an empty list to store the explored clusters
  5. cur_node, initialized to an object with all values set to none
  6. end_nodes, initialized to an empty list used to store the clusters which cannot be split further

Algorithm:

  1. Calculate the area of the minimum circle enclosing the total set of points
  2. Split initial points into "k" clusters of at least "n" points using Constrained-K-Means Algorithm
  3. construct minimum circles of "k" clusters and compute their total area
  4. if this area <= area of the area of the first circle
  5. Add all "k" clusters to "valid_nodes"
  6. for node in valid_nodes, add node to explored_nodes, change cur_node to node, and update respective areas, parents, children, circle, points
  7. run this algorithm again, change k = 2
  8. if the cur_node cannot be split into 2, or the cur_node cannot be split into 2 circles with smaller areas than its parent, add cur_node to end_nodes and resume algorithm with next valid_node not in 
     explored_node as the cur_node      
  9. if all the valid_nodes are in explored_nodes, return end_nodes
  10. repeat the above algorithm for all k from 2 to int(#number of total points/n) and take the end_nodes with the lowest total area

Runtime: The runtime of the minimum enclosing circle is O(n), and the runtime of constrained k-means depends on k itself. Here, the value of k is proportional to the total number of points.
We would expect the algorithm to slow as the number of points increase. For reference, a set of 80 points with n =3, took about 30 minutes.
Further improvements could implement an early-stopping method to converge to a local optima.

Visualization

The K-Means based algorithm returns the optimal solution, albeit slower

Brute Force

Location

The script is located in:

E:\McNair\Projects\Accelerators\Enclosing_Circle\enclosing_circle_brute_force.py

The Julia Script is located in:

E:\McNair\Software\CodeBase\Julia Code for enclosing circle

Explanation

1) Final all combinations of two points in the total data set.

2) For each of these combinations, draw a circle and figure out what other points are contained in a circle around those two points. If the circle contains at least n points, add it to a set of valid circles. There are Number of Points choose 2 possible circles.

3) Combine the valid circles in all possible ways and see if the resulting scheme contains all of the points. If it does, add it to a set of valid schemes. I believe that the number of ways to do this is the sum from i = 1 to i = number of valid circles Number of Circles choose i.

4) Iterate through the valid schemes and calculate the areas of the schemes.

5) Return the scheme with the minimum area.

Problems

In my first implementation of EnclosingCircle (EnclosingCircle.py), there is a mistake in my memoization scheme where I had been setting a variable equal to a "scheme[1].sort()", forgetting that the sort function operates in place and returns None. Thus, the memoization scheme was being used randomly and excessively such that not all possible circles were calculated and incorrect results were returned (hence the subsuming circle nonsense). Another mistake in the original algorithm was that when we were choosing the next point to add to a given circle based on how far it was from the first point I chose in the circle, rather than from the center of the circle. I corrected this by recalculating the circle every time I add a point and then sorting the rest of the points based on their distance from that center. Unfortunately, this code works very well and produces correct results where the original algorithm did not, but runs incredibly slowly. I do not think that it is feasible to use this algorithm on the volume of data that we need to analyze. I am currently running an example with 81 points (Houston 2012) split into groups of at least 4 and will report statistics on it soon.

Ideas for Fixing

One major holdup of the new algorithm (besides the fact that it calculates a ton more circles) is the recalculating of the center point every time I need to add a new point to the circle, which seems a bit absurd. I think it might be faster to adjust the center of the circle heuristically so between the first two, take the average x and y coordinate. For the next addition. Take a weighted average of the three points. Etc. I am not sure if this would work out mathematically but I can work it out and see. This might cut down on runtime, but we still need to calculate the distance from each individual point after that, which also takes a long time. This could possibly be resolved heuristically as well, since I don't really need to calculate all the distances since points that were really far from the circle before probably aren't going to be chosen anyway. I can look into implementing some of these algorithmic changes and testing them on Monday and Tuesday.

In the longer term, I would recommend trying to parallelize the algorithm, possibly with python threads or implementing it in java and using the constructs there. I do not know if I will have time to start this before the summer, but there are many processes that can be run in parallel for this algorithm so I think it is a good strategy.

Specifically, all calls to findgroups can be run in parallel so long as you wait to receive the results from all the threads before deciding on an answer. For example imagine a system with 5 points split into groups of at least 2. There would be one call to findgroups on all 5 points, then calls to findgroups on a 3-2 split and on a 2-3 split. Those two calls could run in parallel, as long as the result from the call on all 5 points waits to receive the results from both children threads before deciding which is the optimal solution. You would have to be careful to keep threads from writing to the same memory (specifically for the removals in pointset), but I think (depending on how many cores our system operates on) this could incur impressive speed up.

Update: Did not incur any speed up by implementing changes the the calculation for center or distance. Reading about python threads https://pymotw.com/2/threading/

Update: Implemented a thread version of the program and am running that in addition to the non-threaded version to see how the timing compares. EnclosingCircleRemake2.py. How the threading works:

1) Split the pointset into all its initial possible schemes (one split).

2) Divide these schemes amongst four threads.

3) Have each thread compute the best outcome of each of those schemes and put them into a global array.

4) Join all the threads.

5) Have the main function iterate through the schemes returned by the threads and choose the best one.

6) Create circles for the schemes and return them.

On preliminary examples, the code works correctly. It actually runs slower than the original algorithm on small examples but I suspect that is the overhead of creating the threads and that they will be useful to the large examples.

Use PostGIS

If the point of the project is to solve an application, the Not-Invented-Here approach is doomed to fail. Instead, I suggest an approach where

Load reverse geolocated into a PostGIS PostgreSQL table
Use ST_MinimumBoundingRadius which is an implementation of a linear (read: very fast) minimum bounding circle algorithm. See https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwboundingcircle.c for implementation details.
Run queries against DB

Runtime

On preliminary analysis, the brute force approach to the enclosing circle algorithm runs O( n!), where n is the number of pairs in the input sequence.

Applications

VC Data

The Enclosing Circle Algorithm will be applied to VC data acquired through the SDC Platinum database. The script makes use of the Python GeoPy GeoCoder to get latitude and longitude coordinates to be used by the Enclosing Circle Algorithm.

Geopy Geocoder User Agreements can be found here.

The relevant files are located in:

E:\McNair\Projects\Accelerators\Enclosing_Circle

The results may eventually be plotted to a graph using python as well. Here is documentation for a python library called basemap.

CURRENT STATUS: Bug fixes needed in EnclosingCircle.py. The program errors with a key error on line 187 in cases where n is not a multiple of the length of the dataset. I made some temporary fixes to the enclosing circle file located in the above directory, but I am not certain if it is a permanent fix.

Speeding up with C

With the large amount of VC data we have, the enclosing circle algorithm would take an extremely long amount of time to run (on the order of weeks/months). If we can compile the code into C, we can speed up runtime dramatically. I've listed some possible sources for running python code as C.

PyPy. Documentation here.

Cython

Cython. Documentation here. Basic tutorial for Cython is given here.

Currently, the RDP is missing a compiler to run Cython successfully. The error that appears is "unable to find vcvarsall.bat".

Getting a C++ Compiler

These are proposed fixes to solve the Cython error shown above. A possible C++ Compiler for Python can be downloaded directly from Windows here.

The C++ Compiler for Python has been downloaded and installed. Instructions for installation and uninstallation can be found here.

The Installation file is located in :

E:\McNair\Software\Utilities\VCForPython27.msi

Running Python vs. Cython

A trial test run on 82 coordinates resulted int the following time stamps:

EnclosingCircle in python: 16.069 seconds
Enclosing Circle in cython:  9.633 seconds

The test files can be found in the following:

EnclosingCircle in python: E:\McNair\Projects\Accelerators\EnclosingCircle\EnclosingCircle.py
EnclosingCircle in cython: E:\McNair\Projects\Accelerators\EnclosingCircle\EnclosingCircleC_Test.py

Usage

The basic tutorial for cython can be found phttp://docs.cython.org/en/latest/src/tutorial/cython_tutorial.html here].

Essentially, a setup.py file needs to be created with the following format:

try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

from Cython.Build import cythonize
setup(
    ext_modules = cythonize("filename.pyx")
)

Then, after changing to the proper directory, execute the following from the command line

python setup.py build_ext --inplace

This will wrap your python program in C, and produce a filename.pyd file.

To use this new python code wrapped in C, simply import the pyd file as if it were a python file:

import filename

Treat this file as any other module. It will work just as if it were in Python, except it exhibits a faster run time.

NOTE: We fixed Enclosing Circle and drastically improved its runtime, so we decided to go with its Python implementation.

Wrapping C for Python

Documentation for multiple ways in which this can be done can be found here.

A GitHub example can be found here.

An advanced and detailed tutorial can be found here.

A step-by-step Tutorials Point guide can be found here.

Results

A data set with city, state, company name, year, and geocoded coordinates can be found at:

E:\McNair\Projects\Accelerators\Code+Final_Data\ChristyCode\GeoCodedBusinesses.txt

NEXT STEP: We need to determine how many companies we want in each circle, and then we can begin running the enclosing circle algorithm on the city data.

The Top 50 cities with the maximum number of companies in any given year are: 'Santa Monica', 'Nashville', 'Santa Clara', 'Chicago', 'Philadelphia', 'Denver', 'Dallas', 'Burlington', 'San Francisco', 'San Mateo', 'Milpitas', 'Boulder', 'Bellevue', 'Herndon', 'Pittsburgh', 'Mountain View', 'San Diego', 'Fremont', 'Ann Arbor', 'Irvine', 'Brooklyn', 'Durham', 'Los Angeles', 'Atlanta', 'Alpharetta', 'Menlo Park', 'Rockville', 'San Jose', 'Lexington', 'Saint Louis', 'Sunnyvale', 'Palo Alto', 'Richardson', 'Redwood City', 'Austin', 'Waltham', 'Baltimore', 'Cupertino', 'Houston', 'Cambridge', 'Boston', 'Washington', 'Minneapolis', 'Pleasanton', 'New York', 'Cleveland', 'South San Francisco', 'Portland', 'Seattle'.

Data on the Top 50 VC Backed Companies can be found here.

These were determined by the decide_cities.py script located in:

E:\McNair\Projects\Accelerators\EnclosingCircle

The final data of cities at a given year and their minimized circles can be found at:

 E:\McNair\Projects\Accelerators\EnclosingCircle\final_vc_circles.txt

You can plot this data onto a map using the script located:

E:\McNair\Projects\Accelerators\EnclosingCircle\Draw_circles_test\draw_vc_circles.py

@@ Line 1: / Line 1: @@
-{{McNair Projects
+{{Project
-|Has Image=Red-circle.jpg
+|Has project output=Tool
+|Has sponsor=McNair Center
 |Has title=Enclosing Circle Algorithm
-|Has owner=Christy Warden
+|Has owner=Christy Warden, Peter Jalbert,
-|Has start date=201701
+|Has start date=January 2017
-|Has deadline=201704
+|Has deadline=April 2017
 |Has keywords=Tool
-|Is billed to=
+|Has project status=Subsume
-|Has notes=
+|Has Image=Red-circle.jpg
-|Has project status=Active
-|Is dependent on=
-|Depends upon it=
 }}
 The objective of this project is come up with a fast, reliable algorithm that finds the smallest circle area that can drawn around N Cartesian points such that each circle contains at least M<N points. This algorithm will be used in an academic paper on [[Urban Start-up Agglomeration]], where the points will represent venture capital backed firms within cities.
@@ Line 25: / Line 22: @@
-=To-Do 04/16 Update=
+=To-Do 04/26 Update=
      a) Thursday 04/27: Plot circles resulting from the new Algorithm and ensure that they are correct, specifically on previously bad outputs.
      b) Friday 04/28: Look for ways to speed up algorithm (reinstate memoization). Run algorithm over weekend.
@@ Line 68: / Line 65: @@
 ) Choose the scheme that resulted in the smallest total area.
+=K-Means Based Algorithm=
+==Location==
+The script is located in:
+ E:\McNair\Software\CodeBase\New Implement of Enclosing Circle (Constrained K Means, Smallest Circle)\enclosing_circle_new_implement.py
+==Explanation==
+The algorithm relies on an object oriented implementation of a "cluster". <br>
+Each "cluster" has the following associated with it: <br>
+. area of minimum circle enclosing points in the cluster
+. points associated with the cluster
+. minimum circle associated with the cluster (x-coord,y-coord,radius)
+. "parent" cluster
+. "children" cluster(s)
+'''Inputs:'''
+. n, the minimum number of points to include in a circle
+. k, the amount of clusters to split the points into
+. valid_nodes, an empty list to store the possible clusters to explore
+. explored_nodes, an empty list to store the explored clusters
+. cur_node, initialized to an object with all values set to none
+. end_nodes, initialized to an empty list used to store the clusters which cannot be split further
+'''Algorithm:'''
+. Calculate the area of the minimum circle enclosing the total set of points
+. Split initial points into "k" clusters of at least "n" points using Constrained-K-Means Algorithm
+. construct minimum circles of "k" clusters and compute their total area
+. if this area <= area of the area of the first circle
+. Add all "k" clusters to "valid_nodes"
+. for node in valid_nodes, add node to explored_nodes, change cur_node to node, and update respective areas, parents, children, circle, points
+. run this algorithm again, change k = 2
+. if the cur_node cannot be split into 2, or the cur_node cannot be split into 2 circles with smaller areas than its parent, add cur_node to end_nodes and resume algorithm with next valid_node not in
+      explored_node as the cur_node
+. if all the valid_nodes are in explored_nodes, return end_nodes
+. repeat the above algorithm for all k from 2 to int(#number of total points/n) and take the end_nodes with the lowest total area
+'''Runtime:'''
+The runtime of the minimum enclosing circle is O(n), and the runtime of constrained k-means depends on k itself. Here, the value of k is proportional to the total number of points. <br>
+We would expect the algorithm to slow as the number of points increase. For reference, a set of 80 points with n =3, took about 30 minutes. <br>
+Further improvements could implement an early-stopping method to converge to a local optima. <br>
+==Visualization ==
+The K-Means based algorithm returns the optimal solution, albeit slower <br>
+[[File:kmeans comparison.png]] [[File:kmeans random.png]] [[File:houston.png]]
 =Brute Force=
@@ Line 73: / Line 110: @@
 The script is located in:
   E:\McNair\Projects\Accelerators\Enclosing_Circle\enclosing_circle_brute_force.py
+The Julia Script is located in:
+ E:\McNair\Software\CodeBase\Julia Code for enclosing circle
 ==Explanation==
@@ Line 84: / Line 123: @@
 ) Return the scheme with the minimum area.
+==Problems==
+In my first implementation of EnclosingCircle (EnclosingCircle.py), there is a mistake in my memoization scheme where I had been setting a variable equal to
+a "scheme[1].sort()", forgetting that the sort function operates in place and returns None. Thus, the memoization scheme was being used randomly and excessively such that not all possible circles were calculated and incorrect results were returned (hence the subsuming circle nonsense). Another mistake in the original algorithm was that when we were choosing the next point to add to a given circle based on how far it was from the first point I chose in the circle, rather than from the center of the circle. I corrected this by recalculating the circle every time I add a point and then sorting the rest of the points based on their distance from that center. Unfortunately, this code works very well and produces correct results where the original algorithm did not, but runs incredibly slowly. I do not think that it is feasible to use this algorithm on the volume of data that we need to analyze. I am currently running an example with 81 points (Houston 2012) split into groups of at least 4 and will report statistics on it soon.
+==Ideas for Fixing==
+One major holdup of the new algorithm (besides the fact that it calculates a ton more circles) is the recalculating of the center point every time I need to add a new point to the circle, which seems a bit absurd. I think it might be faster to adjust the center of the circle heuristically so between the first two, take the average x and y coordinate. For the next addition. Take a weighted average of the three points. Etc. I am not sure if this would work out mathematically but I can work it out and see. This might cut down on runtime, but we still need to calculate the distance from each individual point after that, which also takes a long time. This could possibly be resolved heuristically as well, since I don't really need to calculate all the distances since points that were really far from the circle before probably aren't going to be chosen anyway. I can look into implementing some of these algorithmic changes and testing them on Monday and Tuesday.
+In the longer term, I would recommend trying to parallelize the algorithm, possibly with python threads or implementing it in java and using the constructs there. I do not know if I will have time to start this before the summer, but there are many processes that can be run in parallel for this algorithm so I think it is a good strategy.
+Specifically, all calls to findgroups can be run in parallel so long as you wait to receive the results from all the threads before deciding on an answer.
+For example imagine a system with 5 points split into groups of at least 2. There would be one call to findgroups on all 5 points, then calls to findgroups on a 3-2 split and on a 2-3 split. Those two calls could run in parallel, as long as the result from the call on all 5 points waits to receive the results from both children threads before deciding which is the optimal solution. You would have to be careful to keep threads from writing to the same memory (specifically for the removals in pointset), but I think (depending on how many cores our system operates on) this could incur impressive speed up.
+Update: Did not incur any speed up by implementing changes the the calculation for center or distance. Reading about python threads https://pymotw.com/2/threading/
+Update: Implemented a thread version of the program and am running that in addition to the non-threaded version to see how the timing compares. EnclosingCircleRemake2.py.
+How the threading works:
+) Split the pointset into all its initial possible schemes (one split).
+) Divide these schemes amongst four threads.
+) Have each thread compute the best outcome of each of those schemes and put them into a global array.
+) Join all the threads.
+) Have the main function iterate through the schemes returned by the threads and choose the best one.
+) Create circles for the schemes and return them.
+On preliminary examples, the code works correctly. It actually runs slower than the original algorithm on small examples but I suspect that is the overhead of creating the threads and that they will be useful to the large examples.
+=== Use PostGIS ===
+If the point of the project is to solve an application, the Not-Invented-Here approach is doomed to fail. Instead, I suggest an approach where
+# Load reverse geolocated into a PostGIS PostgreSQL table
+# Use [https://postgis.net/docs/ST_MinimumBoundingRadius.html ST_MinimumBoundingRadius] which is an implementation of a linear (read: very fast) minimum bounding circle algorithm. See https://github.com/postgis/postgis/blob/svn-trunk/liblwgeom/lwboundingcircle.c for implementation details.
+# Run queries against DB
 ==Runtime==

Difference between revisions of "Enclosing Circle Algorithm"

Latest revision as of 12:47, 21 September 2020

Contents

To-Do

To-Do 04/26 Update

Overview

Algorithm Description

Location

Explanation

K-Means Based Algorithm

Location

Explanation

Visualization

Brute Force

Location

Explanation

Problems

Ideas for Fixing

Use PostGIS

Runtime

Applications

VC Data

Speeding up with C

Cython

Getting a C++ Compiler

Running Python vs. Cython

Usage

Wrapping C for Python

Results

Navigation menu

Search